From my limited dabbling with data science using R, I realized that cleaning bad data is a very important part of preparing data for analysis.

Are there any best practices or processes for cleaning data before processing it? If so, are there any automated or semi-automated tools which implement some of these best practices?

Jay Godse

7 Answers

R contains standard functions for data manipulation, which can be used for data cleaning, in its base package (gsub, transform, etc.) as well as in various third-party packages such as stringr, reshape/reshape2, and plyr/dplyr. Examples and best practices for using these packages and their functions are described in the following paper: http://vita.had.co.nz/papers/tidy-data.pdf.
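
For illustration, here is a minimal sketch of the kind of cleaning these functions support, applied to a small made-up data frame (the column names and rules below are my own assumptions, not examples from the paper):

    # Minimal sketch: cleaning a small, made-up data frame with base R and dplyr/tidyr.
    # Column names and cleaning rules are illustrative assumptions.
    library(dplyr)
    library(tidyr)

    raw <- data.frame(
      name   = c("  Alice ", "BOB", "bob", NA),
      income = c("52,000", "61000", "61000", "-1"),
      stringsAsFactors = FALSE
    )

    clean <- raw %>%
      mutate(
        name   = tolower(trimws(name)),               # normalize case and whitespace
        income = as.numeric(gsub(",", "", income))    # strip thousands separators
      ) %>%
      mutate(income = ifelse(income < 0, NA, income)) %>%  # treat negative income as missing
      distinct() %>%                                  # drop exact duplicates
      drop_na(name)                                   # drop rows with no identifier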

Additionally, R offers some packages specifically focused on data cleaning and transformation, most notably editrules and deducorrect.

A comprehensive and coherent approach to data cleaning in R, including examples and the use of the editrules and deducorrect packages, as well as a description of a data cleaning workflow (framework) in R, is presented in the following paper, which I highly recommend: http://cran.r-project.org/doc/contrib/de_Jonge+van_der_Loo-Introduction_to_data_cleaning_with_R.pdf.
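
As a rough sketch of the rule-based checking described in that paper (the exact editrules function names below are from memory and should be treated as an assumption, not a verified API reference):

    # Rough sketch of rule-based checking with the editrules package.
    # Treat the exact function names and signatures as assumptions.
    library(editrules)

    dat <- data.frame(age = c(21, -3, 56), income = c(3000, 4500, -100))

    # Define domain rules ("edits") the data must satisfy
    E <- editset(c("age >= 0", "age <= 120", "income >= 0"))

    violatedEdits(E, dat)    # which records violate which rules
    localizeErrors(E, dat)   # suggest which fields to change in each record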

Aleksandr Blekh

From my point of view, this question is suitable for a two-step answer. The first part, let us call it soft preprocessing, could be taken as the usage of different data mining algorithms to preprocess data in such a way that it becomes suitable for further analyses. Notice that this could be the analysis itself, in case the goal is simple enough to be tackled in a single shot.

The second part, the hard preprocessing, actually comes prior to any other process, and may be taken as the usage of simple tools or scripts to clean up data, selecting the specific contents to be processed. For this problem, POSIX provides us with a wondrous set of magic tools, which can be used to compose concise and very powerful preprocessing scripts.

For example, for people who deal with data coming from social websites (Twitter, Facebook, ...), the data retrieval usually yields files with a very specific format, although not always nicely structured, as they may contain missing fields and so on. For these cases, a simple awk script could clean up the data, producing a valid input file for later processing. From the magic set, one may also point out grep, sed, cut, join, paste, sort, and a whole multitude of other tools.

In case the source file has too many nitty-gritties, it may also be necessary to produce a bundle of methods to clean up the data. In such cases, it is usually better to use scripting languages (other than shell ones), such as Python, Ruby, and Perl. This allows for building up APIs to select specific data in a very straightforward and reusable way. Such APIs are sometimes made public by their writers, such as IMDbPY, the Stack Exchange API, and many others.

So, answering the question: are there any best practices? It usually depends on your task. If you will always deal with the same data format, it's usually best to write an organized script to preprocess it; whereas, if you just need a simple and fast cleanup of some dataset, POSIX tools let you write concise shell scripts that will do the whole job much faster than, say, a Python script. Since the cleanup depends both on the dataset and on your purposes, it's hard to have everything already done. Yet there are lots of APIs that put you halfway through the problem.
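
As a concrete illustration of the "organized, reusable preprocessing script" idea (the answer suggests Python, Ruby, Perl, or shell tools for this; the sketch below uses R for consistency with the question, and the file layout and field names are invented):

    # Sketch of a reusable cleanup function for a hypothetical social-media export.
    # Field names and file name are invented; adapt to the actual format.
    clean_export <- function(path) {
      raw <- read.csv(path, stringsAsFactors = FALSE)

      # keep only the fields processed downstream
      keep <- c("user_id", "created_at", "text")
      raw <- raw[, intersect(keep, names(raw)), drop = FALSE]

      # drop records with missing mandatory fields
      out <- raw[complete.cases(raw), , drop = FALSE]

      # normalize whitespace in the free-text field, if present
      if ("text" %in% names(out)) {
        out$text <- gsub("\\s+", " ", trimws(out$text))
      }
      out
    }

    # cleaned <- clean_export("tweets_export.csv")   # hypothetical input file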

Rubens

One reason that data cleaning is rarely fully automated is that there is so much judgment required to define what "clean" means given your particular problem, methods, and goals.

It may be as simple as imputing values for any missing data, or it might be as complex as diagnosing data entry errors or data transformation errors from previous automated processes (e.g. coding, censoring, transforming). In these last two cases, the data looks good by outward appearance but is really erroneous. Such diagnosis often requires manual analysis and inspection, and also out-of-band information, such as details about the data sources and the methods used to produce the data.

Also, some data analysis methods work better when erroneous or missing data is left blank (or N/A) rather than imputed or given a default value. This is true when there are explicit representations of uncertainty and ignorance, such as Dempster-Shafer belief functions.

Finally, it's useful to have specific diagnostics and metrics for the cleaning process. Are missing or erroneous values randomly distributed, or are they concentrated in any way that might affect the outcome of the analysis? It's useful to test the effects of alternative cleaning strategies or algorithms to see whether they affect the final results.
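
A small sketch of such diagnostics (made-up data and variable names), checking whether missingness is concentrated in one group and whether two simple strategies change a summary statistic:

    # Sketch of simple cleaning diagnostics on made-up data:
    # (1) is missingness concentrated in one group?
    # (2) do alternative strategies (drop vs. mean-impute) change the result?
    set.seed(1)
    d <- data.frame(
      group = rep(c("A", "B"), each = 50),
      value = c(rnorm(50, mean = 10), rnorm(50, mean = 12))
    )
    d$value[d$group == "B" & d$value > 12] <- NA   # non-random missingness in group B

    tapply(is.na(d$value), d$group, mean)          # missingness rate per group

    sd(d$value, na.rm = TRUE)                      # spread estimated from complete cases
    imputed <- ifelse(is.na(d$value), mean(d$value, na.rm = TRUE), d$value)
    sd(imputed)                                    # mean imputation understates the spread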

Given these concerns, I'm very suspicious of any method or process that treats data cleaning in a superficial, cavalier, or fully automated fashion. There are many devils hiding in those details, and it pays to give them serious attention.

MrMeritology

About automatic cleaning: you really cannot clean data automatically, because the number of errors and the definition of an error often depend on the data. For example, your column "Income" might contain negative values, which are errors; you have to do something about those cases. On the other hand, a column "monthly savings" could reasonably contain negative values.

Such errors are highly domain dependent, so to find them you must have domain knowledge, something at which humans excel and automated processes do not.

Where you can and should automate is repeated projects, e.g. a report which has to be produced monthly. If you spot errors, you should put in place an automated process which can spot these kinds of errors in subsequent months, freeing up your time.
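
A minimal sketch of such an automated check in R (the column names and rules are invented for illustration; a real report would use its own domain rules):

    # Minimal sketch of an automated sanity check to run on each monthly report.
    # Column names and rules are invented for illustration.
    check_report <- function(df) {
      problems <- list(
        negative_income  = which(df$income < 0),
        missing_customer = which(is.na(df$customer_id)),
        duplicated_rows  = which(duplicated(df))
      )
      problems <- Filter(function(p) length(p) > 0, problems)
      if (length(problems) > 0) {
        warning("Possible data issues: ", paste(names(problems), collapse = ", "))
      }
      problems   # row indices of suspicious records, grouped by check
    }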

Christian Sauer

I think that there is no universal technique for "cleaning" data before doing actual research. On the other hand, I aim to do as much reproducible research as possible. By doing reproducible research, if you used cleaning techniques with bugs or with poor parameters/assumptions, it can be spotted by others.

There is a nice R package, knitr, which helps a lot with reproducible research.

Of course, not all research can be fully reproduced (for example, live Twitter data), but at least you can document the cleaning, formatting and preprocessing steps easily.
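
For instance, a minimal knitr / R Markdown document that records the cleaning steps next to the code that performs them might look like this (the file and column names are invented):

    ## Data cleaning

    We remove rows with a missing `id` and convert `amount` to numeric.

    ```{r clean}
    raw <- read.csv("survey_raw.csv", stringsAsFactors = FALSE)
    clean <- raw[!is.na(raw$id), ]
    clean$amount <- as.numeric(gsub(",", "", clean$amount))
    ```

Rendering the file with knitr (or rmarkdown::render) keeps the documentation and the cleaning code in sync.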

You can check my assessment prepared for the Reproducible Research course at Coursera.

Damian Melniczuk

OpenRefine (formerly Google Refine) is a great tool with which you can interactively perform data cleanup and transformations. Your final output can be exported to a variety of standard formats (JSON, CSV) or any custom format (e.g. wiki).

Exact duplicates can be handled in OpenRefine by sorting and using the "blank down" feature. It also has an approximate string matching (fuzzy) clustering feature, with which you can group similar-looking duplicates such as company names or addresses. This feature alone makes it awesome.

Thyag

There is an entire course devoted to this on Coursera. You might want to go over the techniques they mention; the important part is to know when to use what.

https://www.coursera.org/learn/data-cleaning

It's always better to do it manually rather than automating it, since each dataset has its own problems and issues, and not all steps are applicable to all datasets.

Ram