The main goal was to deliver near real-time statistics on clicks and displays so that advertisers could manage their budgets more effectively. This, in turn, improved the impact of the advertisements and increased overall turnover.
The customer is a large Internet service provider that operates 20+ popular Web sites. The company has more than ten years of experience in digital marketing, advertising, Web development, and hosting; it also produces Web security software and runs its own banner network. The customer needed a system that would enable showing more targeted advertisements.
Before Hadoop was implemented, the company had no way to get real-time statistics on displays. Data analysis was extremely slow and could be run only once every 24 hours. The company also had to purchase expensive hardware to scale the system as the number of displays grew.
We also had to select a product for storing data. The key idea was to find a cloud-based solution with a low entry threshold and a query language similar to standard SQL. That way, the DBA engineers already employed by the customer would be able to build data queries, and the customer would not incur any additional expenses on staff training.
We built a Hadoop cluster to ensure fast data analysis. The system loads data from Nginx servers at set intervals: the servers upload banner logs directly to HDFS via the WebDAV extension. To make WebDAV work with CDH3, we had to apply several patches to the standard WebDAV implementation.
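The upload flow can be sketched as a plain HTTP PUT against the WebDAV endpoint. This is a minimal illustration only: the gateway host name, port, and the directory layout under `/logs` are assumptions, not the project's actual configuration.

```python
import http.client
from datetime import datetime

# Hypothetical WebDAV gateway in front of HDFS; the real host name,
# port, and directory layout were project-specific.
WEBDAV_HOST = "hdfs-gateway.example.com"
WEBDAV_PORT = 8080

def hdfs_target_path(host: str, ts: datetime) -> str:
    """Build the HDFS path one Nginx host uploads its hourly log batch to
    (illustrative /logs/<host>/<YYYY/MM/DD/HH> layout)."""
    return "/logs/%s/%s/access.log" % (host, ts.strftime("%Y/%m/%d/%H"))

def upload_batch(host: str, ts: datetime, payload: bytes) -> int:
    """PUT one log batch over WebDAV; returns the HTTP status code."""
    conn = http.client.HTTPConnection(WEBDAV_HOST, WEBDAV_PORT)
    conn.request("PUT", hdfs_target_path(host, ts), body=payload)
    status = conn.getresponse().status
    conn.close()
    return status
```

A scheduled job on each Nginx server would call `upload_batch` once per interval, so new data lands in HDFS shortly after it is written.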
Apache Hive served as the data warehouse system for fast data querying and analysis. Data is partitioned by preset time intervals, so active and archived data can be stored in one table. This also simplifies system maintenance and reduces costs whenever results need to be recalculated for an elapsed period of time. For ETL, we used Pentaho Kettle, an ETL solution by Pentaho that connects to Hive through a standard JDBC connector.
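Time-based partitioning of this kind boils down to mapping every event timestamp onto a fixed bucket that becomes the Hive partition value. The sketch below assumes an hourly interval and a partition column named `dt`; the real schema, column names, and interval were project-specific.

```python
from datetime import datetime, timezone

# Illustrative HiveQL DDL -- the actual table schema and partition
# column name are assumptions for this example.
CREATE_TABLE = """
CREATE TABLE banner_events (banner_id BIGINT, event STRING)
PARTITIONED BY (dt STRING)
"""

def partition_value(epoch_seconds: int) -> str:
    """Map an event timestamp to its hourly partition value for dt."""
    ts = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return ts.strftime("%Y-%m-%d-%H")
```

Because each partition is a separate directory in HDFS, recalculating results for one elapsed hour only touches that hour's partition, and old partitions can be archived or dropped without rewriting the table.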
The NameNode backup was implemented over NFS to ensure high availability. We also developed several utility programs for Apache Hadoop that simplify the most frequent cluster administration tasks.
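In Hadoop 0.20-era deployments such as CDH3, NameNode metadata can be mirrored by listing both a local directory and an NFS mount in `dfs.name.dir`, so the filesystem image survives the loss of the NameNode machine. The paths below are illustrative, not the project's actual layout.

```xml
<!-- hdfs-site.xml: write NameNode metadata to local disk and to an
     NFS mount in parallel; both paths are hypothetical examples. -->
<property>
  <name>dfs.name.dir</name>
  <value>/data/dfs/name,/mnt/nfs/dfs/name</value>
</property>
```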
The resulting system allowed for the following:
Technologies used: Apache Hadoop, Apache Hive, Pentaho