Clean Data

Introduction

In this step, you ensure your data is consistent in how it expresses entities and the relationships between them. Your data needs to be consistent and clean to be mapped and converted.

info

Data cleaning is often time-consuming, so LINCS recommends starting as soon as you can. You can follow the tips on this page and in our Data Cleaning Guide even before committing to the entire conversion process.

Resources Needed

Cleaning your data at this early stage makes the rest of the conversion steps easier and your converted data more accurate and more meaningful. Bulk cleaning tasks are typically fastest on your original data because:

  • Your team is already familiar with the format
  • Structured data lends itself to bulk edits: for example, you can change a whole column of a spreadsheet at once, and many common tools can help
  • Cleaning your original data also improves it for uses other than publishing with LINCS
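As an illustration of changing a whole column at once, here is a minimal Python sketch using only the standard library. The sample data, the column names (`name`, `birth_place`), and the helper functions are hypothetical and stand in for your own exported spreadsheet; this is not a LINCS-provided script.

```python
import csv
import io


def normalize_whitespace(value):
    """Trim both ends and collapse internal runs of whitespace to one space."""
    return " ".join(value.split())


def clean_column(rows, column, clean):
    """Return new rows with `clean` applied to every value in `column`.

    The input rows are left unmodified so the original data stays available.
    """
    return [{**row, column: clean(row[column])} for row in rows]


# Hypothetical sample standing in for a spreadsheet exported as CSV.
SAMPLE_CSV = (
    "name,birth_place\n"
    "  Margaret   Atwood ,Ottawa\n"
    "Alice Munro,  Wingham \n"
)

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# Apply the same fix to every value in each column in one pass.
cleaned = clean_column(rows, "name", normalize_whitespace)
cleaned = clean_column(cleaned, "birth_place", normalize_whitespace)

for row in cleaned:
    print(row)
```

The same pattern works for other column-wide fixes, such as standardizing date formats, by passing a different `clean` function.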

Still, you can continue cleaning during the Validate and Enhance step and, once the conversion is complete, edit your data directly in ResearchSpace.

info

You may find it easiest to clean your data directly in the data store it comes from and then follow the Export Data step. Alternatively, you can export your data into an easy-to-work-with format and clean that version.

This decision depends on how easy it is to edit the data in the data store and on whether you want the version in the original data store to keep matching the cleaned version.

Time Commitment

There is a trade-off between initial cleaning time and data quality. Each Research Team needs to decide when to stop cleaning and move on to the next steps. You may also split your full dataset into parts, cleaning and converting one part first and then repeating the process for each remaining part as you have time.

Depending on the size of the dataset and how clean it is to begin with, cleaning usually takes anywhere from a few hours to a few weeks of part-time research assistant work. Data cleaning is always done by the Research Team because you are the experts in your own data.
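If you decide to work through a dataset in parts, the split itself can be very simple. The sketch below assumes the records are already loaded into a Python list and divides them into fixed-size parts; the record values and the part size are hypothetical, and in practice you may prefer to split along a boundary that is meaningful for your data, such as by collection or date range.

```python
def split_into_parts(records, part_size):
    """Split a list of records into consecutive parts of at most `part_size`."""
    if part_size < 1:
        raise ValueError("part_size must be at least 1")
    return [
        records[start:start + part_size]
        for start in range(0, len(records), part_size)
    ]


# Hypothetical records standing in for rows of a dataset.
records = [f"record-{n}" for n in range(1, 8)]
parts = split_into_parts(records, 3)

for index, part in enumerate(parts, start=1):
    print(f"Part {index}: {len(part)} records")
```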

Research Team | Ontology Team | Conversion Team | Storage Team
Clean your Dataset
Send your Cleaned Dataset

Clean your Dataset

You will need to identify the correct tools to clean your dataset.

OpenRefine is LINCS’s preferred tool for cleaning structured data because it offers built-in functionality and good documentation for many of the cleaning tasks outlined in our Data Cleaning Guide. We also use spreadsheet editors such as Google Sheets and Microsoft Excel, as well as custom Python scripts.

If your data is in a relational database or a data store with an editing interface, discuss options with your database administrator, as editing methods may already be in place.
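To give a sense of what a custom Python script for this kind of cleaning might look like, the sketch below groups values that probably refer to the same entity but are written inconsistently. It is loosely modelled on the idea behind key-collision (fingerprint) clustering as found in tools like OpenRefine, but it is a simplified illustration, not OpenRefine's implementation or a LINCS script. The function names and sample names are hypothetical.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Build a comparison key that ignores case, accents, punctuation,
    extra whitespace, and word order."""
    # Decompose accented characters and drop the non-ASCII marks.
    ascii_value = (
        unicodedata.normalize("NFKD", value)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    lowered = ascii_value.strip().lower()
    no_punctuation = lowered.translate(str.maketrans("", "", string.punctuation))
    # Sort and de-duplicate tokens so "Last, First" matches "First Last".
    tokens = sorted(set(no_punctuation.split()))
    return " ".join(tokens)


def find_variant_groups(values):
    """Return groups of distinct spellings that share the same fingerprint."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]


# Hypothetical values from a "name" column.
names = [
    "Brontë, Charlotte",
    "Charlotte Bronte",
    "charlotte  bronte",
    "Emily Brontë",
]
variant_groups = find_variant_groups(names)

for group in variant_groups:
    print("Possible variants of one entity:", group)
```

A script like this only flags candidates. Deciding which spelling is correct, and whether the grouped values really are the same entity, remains a judgment for the Research Team.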

info

The Conversion Team can offer guidance specific to your data. First, though, see our Data Cleaning Guide, a good starting point that covers the data cleaning steps and tools we have used with previous research teams. For more information about data cleaning tools, see Data Cleaning.

Send your Cleaned Dataset

The output from this step should be the same as the output from the Export Data step, except with cleaning applied to the data. As in that step, send LINCS a copy of your cleaned data if you would like approval or if LINCS is helping to implement your next conversion steps.
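Before sending, you may want a quick check that the cleaned file still has the same shape as the exported one. The sketch below assumes both versions are CSV text with a header row, which may not match your export format; the function name and sample data are hypothetical. It reports columns that were dropped or added and the row counts of each version, so that any differences can be confirmed as intentional.

```python
import csv
import io


def compare_structure(exported_text, cleaned_text):
    """Compare headers and row counts of two CSV texts.

    Assumes both texts are non-empty and start with a header row.
    """
    exported = list(csv.reader(io.StringIO(exported_text)))
    cleaned = list(csv.reader(io.StringIO(cleaned_text)))
    exported_header, cleaned_header = exported[0], cleaned[0]
    return {
        "missing_columns": [c for c in exported_header if c not in cleaned_header],
        "added_columns": [c for c in cleaned_header if c not in exported_header],
        "exported_rows": len(exported) - 1,
        "cleaned_rows": len(cleaned) - 1,
    }


# Hypothetical export with a duplicate record that cleaning removed.
EXPORTED = "id,name,date\n1, Ada ,1815\n2,Ada,1815\n"
CLEANED = "id,name,date\n1,Ada,1815\n"

report = compare_structure(EXPORTED, CLEANED)
print(report)
```

A lower row count is not necessarily a problem, since removing duplicates is a common cleaning task, but an unexpectedly missing column usually is.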