Hadoop Ecosystem: May 2013 in Review

by Andrei Paleyes, May 27, 2013

This month's main Hadoop-related news covers the Cloudera Development Kit release, Concurrent's partnership with MapR Technologies, the availability of Hive 0.11.0, and more.

Highlights

Here is what has been happening in the world of Hadoop in May 2013. The summary has been prepared by the R&D team at Altoros.

75% of enterprises use Hadoop for data storage and ETL.
Cloudera Development Kit: New SDK for Hadoop developers.
Hadoop overview in three blog posts.
HDInsight may be released this summer.
Review of Hadoop distributions and tools.
Concurrent partners with MapR Technologies.
Concurrent introduces a scoring engine for machine-learning applications.
Hadoop plays key part in McLaren’s F1 success.
Hive 0.11.0 is available.

To learn the details, proceed with our recap of the news below.

75% of enterprises use Hadoop for data storage and ETL

In his article, “Hadoop Adoption Accelerates, But Not For Data Analytics,” Matt Asay of 10gen, the company behind MongoDB, draws attention to a controversial issue. While Hadoop is widely considered to be a powerful tool for analytics and computation, in reality, more than 75% of enterprises use it for data storage and ETL (Extract, Transform, and Load) operations. However, that does not suggest Hadoop was improperly used before or is being misused now. It is simply an indicator of the current state of the market. Mr. Asay also cites Matt Aslett’s speech delivered at Hadoop Summit back in March. In Mr. Aslett’s opinion, the progression from storing data to transforming and ultimately analyzing it is a natural process. Thus, it is just a matter of time before enterprises collect enough data to start actually using the vast analytical capabilities of Hadoop.

Cloudera Development Kit: New SDK for Hadoop developers

Cloudera, a provider of Hadoop-based software and services, announced the Cloudera Development Kit (CDK). It is an open-source project aimed at developers who build applications using CDH, the company’s core Hadoop distribution. The project is essentially a collection of libraries, tools, examples, and documentation designed to simplify the most common tasks when working with the CDH platform. The first release (version 0.2.0) is the CDK data module that includes APIs for various operations with data storage in Hadoop. However, Cloudera promises that the framework will expand to include features for various Hadoop routines. It will answer most of developers’ demands, while staying well-defined, documented, and open.

Hadoop overview in three blog posts

Jonathan Gershater released a series of blog posts covering the basics of Hadoop and big data processing. His first post describes the issues related to analyzing huge amounts of data, explains how the MapReduce approach applies to them, and introduces Hadoop as one of the tools designed to solve this kind of problem. The second post digs into Hadoop’s structure and basic terminology, such as DataNode, Job, and HDFS. It also briefly lists related projects: Apache Pig, Apache Hive, HBase, Mahout, etc. The third article focuses on Hadoop’s core components and the interaction between them. It explains the advantages of HDFS over the NTFS file system and describes the essence of the MapReduce model, as well as how it is implemented inside Hadoop via corresponding jobs.
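
To make the MapReduce model mentioned above more concrete, here is a minimal sketch of the classic word-count example in plain Python. It does not use Hadoop or any of its APIs; the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and only mimic the three stages a MapReduce job goes through on a single machine.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group emitted values by key, which a MapReduce
    framework would do between the map and reduce stages."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three stages in sequence (illustration only, single process)."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["Hadoop stores data", "Hadoop processes data"]
    print(word_count(sample))
```

In a real cluster, the map and reduce steps would run in parallel on many nodes, with the input lines read from HDFS rather than from an in-memory list.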

HDInsight may be released this summer

HDInsight, a Windows Azure-based Hadoop platform developed by Microsoft, which has been in beta since March, is “almost ready for prime time.” The news was announced by Andrew Brust, CEO of Blue Badge Insights, a company that provides strategy and advisory services to Microsoft’s customers and partners. In his session at Visual Studio Live! Chicago, Mr. Brust shared information on the current status of the service. He said that, in general, HDInsight performs well, but it still has problems with queries run against large file stores. According to Mr. Brust, this is a common problem for all Hadoop distributions and “Hadoop is not yet ready for the enterprise.”

Review of Hadoop distributions and tools

If you work at an enterprise challenged by unstructured big data and have never worked with any Hadoop distribution before, you might want to check out Timothy Prickett Morgan’s overview. In his recent post, “Making Hadoop Elephants Drink From Silverlake,” the editor and author at IT Jungle gives a brief description of Amazon’s Elastic MapReduce, Microsoft’s HDInsight, Google’s BigQuery, BIME’s front end for BigQuery, and Splunk. Each solution is described from several perspectives: general architecture, available features, pricing, etc.

Concurrent partners with MapR Technologies

Concurrent, the company behind a popular enterprise big data application platform, announced a partnership with MapR Technologies, a leading Hadoop technology provider. According to the press release published on May 15, the deal aims to expand Apache Hadoop usage within enterprises by bringing the capabilities of MapR’s Hadoop distribution to Concurrent’s Cascading framework. By combining MapR’s dependability, data protection, and performance innovations with the power and broad platform support of Cascading, the two companies are bringing together a simple-to-use, enterprise-grade development framework and deployment platform for large-scale data analysis.

Concurrent introduces a scoring engine for machine-learning applications

On May 21, six days after announcing its partnership with MapR Technologies, Concurrent introduced its new project, Pattern, which runs on top of the Cascading framework. The product is an open-source, standards-based scoring engine that enables analysts and data scientists to quickly deploy machine-learning applications with Apache Hadoop. With Pattern, companies can run their existing machine-learning models on Hadoop using the Predictive Model Markup Language (PMML) or through a programming interface. PMML is a standard export format for R, MicroStrategy, SAS, and other systems. This means that data scientists and engineers familiar with these tools can now leverage the full power of Hadoop in research and development that involves machine learning.

Hadoop plays key part in McLaren’s F1 success

Stuart Birrell, CIO of Britain’s McLaren Group, shared how the company uses big data insights for designing its successful Formula 1 race cars, high-cost consumer vehicles, bicycles, and even medical equipment. Having analyzed tons of data in recent years, McLaren’s departments (McLaren Electronic Systems, McLaren Applied Technologies, and McLaren Racing) have learned to extract value from large amounts of data with Hadoop and related technologies. For instance, each racing car carries about 160 sensors that generate gigabytes of raw data during races. The data is later used in physical models and testing, producing new sets of data. Thus, the company is in a continuous data-driven loop, and its cars can be modified any day or even any hour. McLaren’s team believes that this evolution is the key to their success.

Hive 0.11.0 is available

The new version of Hive, a data warehouse system for Hadoop, was released on May 15. The key features are the following:

implementation of the Optimized Row Columnar (ORC) file format, which speeds up data access in Hive by storing metadata alongside the data
support for the decimal data type
new windowing functions: RANK, LEAD/LAG, ROW_NUMBER, FIRST_VALUE, LAST_VALUE, etc.
various optimizations for joins
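
For readers unfamiliar with the windowing functions listed above, the following plain-Python sketch imitates what ROW_NUMBER, RANK, LAG, and LEAD compute over a partition ordered in descending order. It is not Hive code and does not call any Hive API; the window function, its parameters, and the sample column names (dept, salary) are hypothetical and chosen only for illustration.

```python
from itertools import groupby
from operator import itemgetter


def window(rows, partition_key, order_key):
    """Annotate each row with row_number, rank, lag, and lead, computed
    per partition with rows ordered by order_key in descending order."""
    result = []
    # groupby needs its input sorted by the grouping key
    ordered = sorted(rows, key=itemgetter(partition_key))
    for _, group in groupby(ordered, key=itemgetter(partition_key)):
        part = sorted(group, key=itemgetter(order_key), reverse=True)
        rank = 0
        for i, row in enumerate(part):
            # RANK: tied rows share a rank; the next distinct value
            # continues from its position, leaving a gap
            if i == 0 or row[order_key] != part[i - 1][order_key]:
                rank = i + 1
            result.append({
                **row,
                "row_number": i + 1,  # ROW_NUMBER: position in partition
                "rank": rank,
                # LAG / LEAD: value from the previous / next row, if any
                "lag": part[i - 1][order_key] if i > 0 else None,
                "lead": part[i + 1][order_key] if i + 1 < len(part) else None,
            })
    return result


if __name__ == "__main__":
    sample = [
        {"dept": "a", "salary": 10},
        {"dept": "a", "salary": 30},
        {"dept": "a", "salary": 30},
        {"dept": "b", "salary": 5},
    ]
    for annotated in window(sample, "dept", "salary"):
        print(annotated)
```

In Hive itself, the same results would be expressed declaratively in a query with an OVER clause rather than computed row by row as in this sketch.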

This was a summary of the main Hadoop-related news for May 2013. Stay tuned for other updates on Hadoop and big data from our team.