Please use this identifier to cite or link to this item: http://biblioteca.unisced.edu.mz/handle/123456789/2671
Title: Quantitative Data Cleaning for Large Databases
Authors: Hellerstein, Joseph M.
Keywords: Database
Database System
Analysis
data structures
Issue Date: 15-Oct-2013
Publisher: UC Berkeley
Citation: 42pg
Abstract: Data collection has become a ubiquitous function of large organizations, not only for record keeping, but to support a variety of data analysis tasks that are critical to the organizational mission. Data analysis typically drives decision-making processes and efficiency optimizations, and in an increasing number of settings is the raison d'être of entire agencies or firms. Despite the importance of data collection and analysis, data quality remains a pervasive and thorny problem in almost every large organization. The presence of incorrect or inconsistent data can significantly distort the results of analyses, often negating the potential benefits of information-driven approaches. As a result, there has been a variety of research over the last decades on various aspects of data cleaning: computational procedures to automatically or semi-automatically identify (and, when possible, correct) errors in large data sets. In this report, we survey data cleaning methods that focus on errors in quantitative attributes of large databases, though we also provide references to data cleaning methods for other types of attributes. The discussion is targeted at computer practitioners who manage large databases of quantitative information, and designers developing data entry and auditing tools for end users. Because of our focus on quantitative data, we take a statistical view of data quality, with an emphasis on intuitive outlier detection and exploratory data analysis methods based in robust statistics. In addition, we stress algorithms and implementations that can be easily and efficiently implemented in very large databases, and which are easy to understand and visualize graphically. The discussion mixes statistical intuitions and methods, algorithmic building blocks, efficient relational database implementation strategies, and user interface considerations. Throughout the discussion, references are provided for deeper reading on all of these issues.
URI: http://biblioteca.unisced.edu.mz/handle/123456789/2671
Appears in Collections: Bancos de dados

Files in This Item:
File: Quantitative-Data-Cleaning-for-Large-Databases.pdf
Size: 861.3 kB
Format: Adobe PDF

