Brief results of the collaboration:
The customer is a California-based provider of digital marketing solutions that runs promotional campaigns through advertisements embedded in YouTube and Facebook videos.
The customer experienced issues with the analytical module of its digital marketing platform. Generating BI reports either took around a day or ended in a timeout error, even though the company’s analysts needed up-to-date information every day.
Because the company planned to expand into a larger market segment, the platform also had to aggregate billions of video metadata records and merge new updates each day. This called for a distributed data processing solution to replace the existing DB2-based module.
In the course of the project, the team faced the following challenges:
Because the customer aimed to aggregate larger data volumes (billions of videos), the legacy DB2 database was no longer an option, so engineers at Altoros implemented a distributed solution based on Cloudera/Hadoop.
Apache Kafka was used to queue video metadata updates (titles, number of clicks, etc.) so that only the latest information for each video was sent for processing. The data model was also optimized to improve performance.
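The "only the latest information" idea can be illustrated with a minimal sketch. It does not use Kafka and is not the project's actual code: it only shows the keep-latest-per-key logic on an in-memory batch. The record fields (`video_id`, `timestamp`, `title`, `clicks`) and the function name `latest_per_video` are hypothetical, chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetadataUpdate:
    video_id: str    # hypothetical key identifying the video
    timestamp: int   # hypothetical event time, e.g. epoch seconds
    title: str
    clicks: int


def latest_per_video(updates):
    """Return one update per video_id: the one with the highest timestamp."""
    latest = {}
    for update in updates:
        current = latest.get(update.video_id)
        # Keep the newer record; on equal timestamps the later arrival wins.
        if current is None or update.timestamp >= current.timestamp:
            latest[update.video_id] = update
    return list(latest.values())


queued = [
    MetadataUpdate("v1", 100, "Old title", 10),
    MetadataUpdate("v2", 105, "Another video", 3),
    MetadataUpdate("v1", 110, "New title", 42),
]
for update in latest_per_video(queued):
    print(update.video_id, update.title, update.clicks)
```

Kafka's log compaction gives a comparable keep-latest-per-key guarantee at the topic level, although the write-up does not say whether that feature was used here.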
Experts at Altoros evaluated a variety of distributed frameworks and selected Apache Spark so that the system could analyze terabytes of data in parallel.
The team discovered that the BI reports were failing to respond because of special characters in the input CSV files. To solve the issue, our developers used the OpenCSV library to transform the data into a readable format before merging it into Hive.
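The kind of cleanup described above can be sketched as follows. This is an illustration only: it uses Python's standard `csv` module rather than OpenCSV (a Java library), and it assumes that the problematic "special characters" are control characters and line breaks embedded in quoted fields, and that the target is a tab-delimited text table. The source confirms none of these details, and the names `clean_field` and `sanitize_csv` are hypothetical.

```python
import csv
import io
import unicodedata


def clean_field(value):
    """Flatten line breaks and tabs to spaces and drop other control characters."""
    cleaned = []
    for ch in value:
        if ch in "\r\n\t":
            cleaned.append(" ")      # would otherwise break row/column boundaries
        elif unicodedata.category(ch) == "Cc":
            continue                 # remaining control characters are dropped
        else:
            cleaned.append(ch)
    return "".join(cleaned).strip()


def sanitize_csv(raw_text):
    """Parse quoted CSV and emit one tab-separated line per record."""
    # newline="" keeps embedded line breaks intact so csv.reader can
    # recognise them as part of a quoted field.
    reader = csv.reader(io.StringIO(raw_text, newline=""))
    lines = []
    for row in reader:
        # Cleaned fields contain no tabs or newlines, so a plain join is safe.
        lines.append("\t".join(clean_field(field) for field in row))
    return "\n".join(lines) + "\n"


raw = (
    "video_id,title,clicks\n"
    'v1,"Line one\nline two, with comma",42\n'
    'v2,"Bell\x07 char",7\n'
)
print(sanitize_csv(raw), end="")
```

Parsing with a real CSV reader instead of splitting on commas is the key design choice: quoted fields that contain commas or line breaks stay together as a single value before they are cleaned.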
Finally, Apache Oozie automated the process of setting up and running jobs inside the data processing layer. The Zabbix service helped to monitor cluster performance.
Working with Altoros, the customer enabled its platform to generate timely BI reports based on a larger number of videos. The company’s analysts now receive up-to-date information daily and use it to offer targeted ads.
The new distributed data processing module makes it possible to store and analyze 30 TB of compressed data, merging 1 TB of new information within a night. The time spent on executing queries within the data processing layer was also reduced several times over.
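To put the nightly merge figure in perspective, the short calculation below estimates the sustained throughput it implies. The 8-hour length of "a night" and the decimal definition of a terabyte (10^12 bytes) are assumptions made for illustration; the write-up gives neither.

```python
def required_throughput_mb_per_s(total_bytes, window_hours):
    """Average throughput (decimal MB/s) needed to process total_bytes in the window."""
    window_seconds = window_hours * 3600
    return total_bytes / window_seconds / 1_000_000


ONE_TB = 10**12        # assumption: decimal terabyte
NIGHT_HOURS = 8        # assumption: the source only says "within a night"

rate = required_throughput_mb_per_s(ONE_TB, NIGHT_HOURS)
print(f"{rate:.1f} MB/s sustained")  # about 34.7 MB/s under these assumptions
```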
Ubuntu (over AWS)
Cloudera (CDH 6)
Apache Spark, Hive, Cloudera Impala, Tableau, Zabbix, Apache Oozie, Apache Kafka
DB2, HDFS, SQL Server, SQLite