ETL Explained: ”T” for ”Transformation”

by Katherine VasilegaOctober 19, 2010

One of the main functions of an Extract, Transform, and Load (ETL) tool is to transform data. The transformation step (“T”) is the most vital stage of building a structured data warehouse. With this post, I’d like to help you get a better understanding of the major transformation types in ETL. Here they are:

    • Format revision
    • Decoding of fields
    • Calculated and derived values
    • Splitting of single fields
    • Merging of information
    • Character set conversion
    • Unit of measurement conversion
    • Date/Time conversion
    • Summarization
    • Key restructuring
    • De-duplication

Let me shed some light on these 11 ETL transformation types.

Format revision. Fields can contain numeric and text data types. If they do, you need to standardize and change the data type to text to provide values that could be correctly perceived by the users. The length of fields can also be different and you can standardize it.

Decoding of fields.
In multiple source systems, the same data items are described by a variety of field values. Also, many legacy systems are notorious for using cryptic codes to represent business values. This ETL transformation type changes codes into values that make sense to the end-users.

Calculated and derived values. Sometimes you have to calculate the total cost and the profit margin before data can be stored in the data warehouse, which is an example of the calculated value. You may also want to store customer’s age separately—that would be an example of the derived value.

Splitting single fields. The first name, middle name, and last name, as well as some other values, were stored as a large text in a single field in the earlier legacy systems. You need to store individual components of names and addresses in separate fields in your data repository to improve the operating performance by indexing and analyzing individual components.

Merging of information. This type of data transformation in ETL does not literally mean the merging of several fields to create a single field. In this case, merging of information stands for establishing the connection between different fields, such as product price, description, package types, and viewing these fields as a single entity.

Character set conversion. It is a common task to convert character sets to an agreed standard character set. There are ISO character sets, Unicode character sets, DOS character sets, Apple character sets, Windows character sets, etc. When your source data is on different types of hardware and operating systems, you will have to deal with this type of conversions.

Unit of measurement conversion. Many companies today are located around the globe. If your company’s headquarters is in the USA, while the production offices are in Europe, you will have to convert lb. into kg, ml into km, and so on.

Date/Time conversion.
Date and time have to be represented in standard formats. For example, the American and the British date formats may be standardized to an international format. The date of October 20, 2010 is written as 10/20/2010 in the American format and as 20/10/2010 in the British format. This date may be standardized to be written as 20 OCT 2010.

Aggregation and summarization. Summarization is the addition of corresponding values in a single business dimension, e.g., adding up revenue values by day to calculate weekly totals. Aggregation refers to a summarization across different business dimensions. Commonly, both summarization and aggregation can be deployed during data transformation in the ETL process.

Key restructuring. When choosing keys for your database tables, you have to avoid the ones with built-in meanings. If you use the product code as the primary key, you are sure to face this issue. If the product is moved to another warehouse, the warehouse part of the product key will have to be changed. Restructuring in the ETL is the transformation of such keys into generic keys produced by the system itself.

De-duplication. In a customer database some customers may be represented by several records for various reasons: incorrect data values because of data entry errors, incomplete information, change of address, etc. It makes sense to keep a single record for one customer and link all the duplicates to this single record. This process in ETL is called de-duplication of the customer file.

As you can see, the amount of manipulation needed for the transformation process in ETL depends on data. Accurate data sources will require little transformation, while inaccurate and outdated sources may require various transformation techniques to meet your business and technical requirements.