Sources of Poor Data Quality
It is now clear that data quality is no longer the domain of the data warehouse alone; it is accepted as an enterprise responsibility. Yet despite having tools, experience and plenty of hard-won best practices, companies still seem to have problems with data quality.
The reason lies in the difficulty of understanding what quality data is and in estimating the cost of bad data. It isn’t always clear why or how to correct this problem because poor data quality presents itself in many different ways.
A better understanding of the underlying sources of quality issues can help you develop a plan of action that is both proactive and strategic.
I came across an article at Information-Management.com that described seven possible sources of poor data quality. Here are the seven data quality issues to take into account, along with the sources of their occurrence:
- Entry quality – Usually caused by a person entering data into a system. The problem may occur due to a typo or a willful decision, such as providing a dummy phone number or address. Identifying these outliers or missing data is easily accomplished with profiling tools or simple queries.
- Process quality – Such issues occur systematically as data is moved through an organization. They may result from a system crash, file loss or any other technical failure among integrated systems. They are often difficult to identify, especially if the data has undergone a number of transformations on the way to its destination. Process quality can usually be remedied easily once the source of the problem is identified. Proper checks and quality control at each touch point along the path are needed to ensure that problems are rooted out, though these checks are often absent in legacy processes.
- Identification quality – Resulting from a failure to recognize the relationship between two objects. For example, two similar products with different SKUs are incorrectly judged to be the same. Identification quality may have significant associated costs, such as mailing the same household more than once. Data quality processes can largely eliminate this problem by matching records, identifying duplicates and placing a confidence score on the similarity of records.
- Integration quality – Occurring when not all the known information about an object has been integrated, so the object is not represented accurately.
- Usage quality – Caused by incorrect use and interpretation of the information at the point of access.
- Aging quality – Occurring when information can no longer be trusted because time has passed and it has become outdated.
- Organizational quality – May occur when information has to be reconciled between two systems that reflect different ways the organization constructs and views the data.
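
To make two of these more concrete, here is a minimal Python sketch of the "simple queries" mentioned under entry quality and the record matching with a confidence score mentioned under identification quality. It is not taken from the article: the sample records, the field names, the list of dummy phone numbers and the 0.85 threshold are all assumptions chosen for illustration, and the name similarity comes from the standard library's `difflib.SequenceMatcher` rather than a dedicated matching tool.

```python
from difflib import SequenceMatcher

# Hypothetical customer records; field names are assumptions for illustration.
records = [
    {"id": 1, "name": "John Smith", "phone": "555-867-5309"},
    {"id": 2, "name": "Jon Smith", "phone": "555-867-5309"},
    {"id": 3, "name": "Alice Jones", "phone": "000-000-0000"},
    {"id": 4, "name": "Bob Brown", "phone": ""},
]

# Assumed placeholder values people type to get past a required field.
DUMMY_PHONES = {"000-000-0000", "111-111-1111", "123-456-7890"}


def profile_entry_quality(rows):
    """Return the ids of rows whose phone is missing or a known dummy value."""
    missing = [r["id"] for r in rows if not r["phone"].strip()]
    dummy = [r["id"] for r in rows if r["phone"] in DUMMY_PHONES]
    return {"missing": missing, "dummy": dummy}


def similarity(a, b):
    """Confidence score between 0 and 1 based on case-insensitive similarity."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def likely_duplicates(rows, threshold=0.85):
    """Compare every pair of rows and keep those whose names look alike."""
    pairs = []
    for i, left in enumerate(rows):
        for right in rows[i + 1:]:
            score = similarity(left["name"], right["name"])
            if score >= threshold:
                pairs.append((left["id"], right["id"], round(score, 2)))
    return pairs


if __name__ == "__main__":
    print(profile_entry_quality(records))
    print(likely_duplicates(records))
```

On this sample data the profile flags record 4 as missing a phone number and record 3 as using a dummy one, and records 1 and 2 are reported as a likely duplicate pair. Comparing every pair is fine for a handful of rows but grows quadratically, so a real data set would need some way of narrowing down the candidates first.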