Clean Data
Introduction
In this step, you ensure your data is consistent in how it expresses entities and the relationships between them. Your data needs to be consistent and clean to be mapped and transformed.
Data cleaning is often time-consuming, so LINCS recommends that you start as soon as you can. Go ahead and follow the tips on this page and in our Data Cleaning Guide, even before committing to the entire transformation process.
Resources Needed
Cleaning your data at this early stage makes the rest of the transformation steps easier and your transformed data more accurate and more meaningful. Typically, it is fastest to do bulk cleaning tasks on your original data because:
- Your team is already familiar with the format
- With structured data, for example, you can efficiently change a whole column of a spreadsheet at once, and there are many common tools to help
- Cleaning your original data also improves it for any other uses you have for it besides publishing with LINCS
Still, you can continue cleaning at the Validate and Enhance step and, once the transformation is complete, you can edit your data directly in ResearchSpace Review.
You may find it easiest to clean your data directly in the data store where it lives and then follow the Export Data step afterward. Alternatively, you may choose to export your data into a format that is easy to work with and clean that version of the data.
This decision will depend on how easy it is to edit the data in the data store, and whether you want the version of the data in the original data store to continue matching the cleaned version of the data.
Time Commitment
There is a trade-off between initial cleaning time and data quality. Each research team needs to determine when they want to stop cleaning and move on to the next steps. You may also choose to split your full dataset into parts, cleaning and transforming one part first and then repeating the process for each remaining part as you have time.
Depending on the size of the dataset and how clean it is to begin with, cleaning usually ranges from a few hours of work to a few weeks of part-time research assistant work. Data cleaning is always done by the research team because you are the experts in your own data.
| Task | Research Team | Ontology Team | Transformation Team | Storage Team |
|---|---|---|---|---|
| Clean your Dataset | ✓ | | | |
| Send your Cleaned Dataset | ✓ | | | |
Clean your Dataset
You will need to identify the correct tools to clean your dataset. The right tools depend on the type of data you have:
- Structured Data
- Semi-Structured Data
- TEI Data
- Natural Language Data
For structured data, OpenRefine is LINCS’s preferred tool because it offers built-in functionality and good documentation for many of the cleaning tasks outlined in our Data Cleaning Guide. We also use a mix of spreadsheet editors like Google Sheets and Microsoft Excel, as well as custom Python scripts.
If your data is in a relational database or datastore with an editing interface, discuss options with your database administrator as there may already be editing methods in place.
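As an illustration of the kind of custom Python script mentioned above for structured data, here is a minimal sketch that trims stray whitespace and maps variant spellings in one column to a single preferred form. It is not LINCS's own script: the column names, the sample rows, and the `PLACE_VARIANTS` mapping are hypothetical, and it uses only Python's standard `csv` module.

```python
import csv
import io
import re


def clean_cell(value):
    """Trim the ends and collapse runs of whitespace inside a cell."""
    if not isinstance(value, str):
        return value  # leave missing or malformed cells untouched
    return re.sub(r"\s+", " ", value).strip()


def clean_rows(rows, replacements):
    """Yield cleaned copies of each row.

    `replacements` maps a column name to a {variant: preferred} dictionary,
    so that every spelling of the same entity ends up identical.
    """
    for row in rows:
        cleaned = {key: clean_cell(value) for key, value in row.items()}
        for column, mapping in replacements.items():
            if column in cleaned:
                cleaned[column] = mapping.get(cleaned[column], cleaned[column])
        yield cleaned


# Hypothetical sample data; in practice this would be your exported CSV file.
SAMPLE = (
    "name,birthplace\n"
    " Margaret  Atwood ,Ottawa ON\n"
    "Alice Munro, Wingham\n"
)

# Hypothetical list of variants that should all be written the same way.
PLACE_VARIANTS = {"birthplace": {"Ottawa ON": "Ottawa"}}

cleaned = list(clean_rows(csv.DictReader(io.StringIO(SAMPLE)), PLACE_VARIANTS))
for row in cleaned:
    print(row)
```

Keeping the variant mapping in a separate dictionary means the research team can review and extend the list of corrections without touching the cleaning logic.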
For semi-structured data, LINCS often uses custom Python scripts with XML parsing libraries because each project has its own data structure and needs. LEAF-Writer may be useful if you have XML data.
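A script of this kind might look like the following sketch, which uses the standard library's `xml.etree.ElementTree` to collapse stray whitespace in the text of chosen elements. The element names (`record`, `name`, `place`) and the sample document are assumptions made up for the example; your own data structure will differ.

```python
import re
import xml.etree.ElementTree as ET


def normalize_text(root, tag):
    """Collapse whitespace in the text of every <tag> element under root.

    Returns the number of elements whose text was changed.
    """
    changed = 0
    for element in root.iter(tag):
        if element.text is None:
            continue
        cleaned = re.sub(r"\s+", " ", element.text).strip()
        if cleaned != element.text:
            element.text = cleaned
            changed += 1
    return changed


# Hypothetical sample; in practice you would load a file with ET.parse(path).
SAMPLE_XML = """<records>
  <record><name>  Emily   Carr </name><place>Victoria</place></record>
  <record><name>Emily Carr</name><place> Victoria
  </place></record>
</records>"""

root = ET.fromstring(SAMPLE_XML)
names_fixed = normalize_text(root, "name")
places_fixed = normalize_text(root, "place")
names = [element.text for element in root.iter("name")]
places = [element.text for element in root.iter("place")]
print(names_fixed, places_fixed, names, places)
```

Returning the count of changed elements gives a quick sanity check on how much of the dataset a cleaning pass actually touched.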
For TEI created by hand, it is best to associate your file with a customized schema so that, as you go, you can check that your TEI is valid and that your tag use matches what you intend for your project. If you are new to schema customization, or would be happier working with an existing schema, use LEAF-Writer to create your TEI: it will automatically validate your TEI against a schema.
More advanced users can write XSLT stylesheets to create a new, cleaner version of their TEI files.
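Alongside schema validation, a short script can give a quick overview of how tags are being used across a file. The sketch below is only an illustration and does not replace validation against a schema: it counts element names and lists `persName` elements that have no `ref` attribute, on the assumption that your project expects every person name to carry one. The sample is a simplified fragment, not a complete TEI document.

```python
from collections import Counter
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def local_name(element):
    """Return the tag name without its namespace prefix."""
    return element.tag.rsplit("}", 1)[-1]


def tag_inventory(root):
    """Count how many times each element name occurs."""
    return Counter(
        local_name(element)
        for element in root.iter()
        if isinstance(element.tag, str)
    )


def names_without_ref(root):
    """List the text of persName elements that lack a ref attribute."""
    return [
        "".join(element.itertext()).strip()
        for element in root.iter(f"{{{TEI_NS}}}persName")
        if "ref" not in element.attrib
    ]


# Simplified, hypothetical fragment (a real TEI file also has a teiHeader).
SAMPLE_TEI = f"""<TEI xmlns="{TEI_NS}">
  <text><body>
    <p><persName ref="#carr">Emily Carr</persName> wrote to
       <persName>Lawren Harris</persName>.</p>
  </body></text>
</TEI>"""

root = ET.fromstring(SAMPLE_TEI)
inventory = tag_inventory(root)
missing = names_without_ref(root)
print(inventory)
print(missing)
```

An unexpected entry in the inventory, or a long list of names without references, points to places where tag use has drifted from what the project intended.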
For natural language data, LINCS typically uses custom Python scripts, or manual changes in a text editor for smaller fixes.
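For plain text, a custom script often starts with basic normalization. The following sketch shows one possible set of rules, chosen for illustration rather than taken from LINCS's scripts: it normalizes Unicode to NFC, replaces non-breaking spaces, unifies line endings, collapses repeated spaces, and limits blank lines to one between paragraphs.

```python
import re
import unicodedata


def clean_text(text):
    """Apply basic, conservative normalization to a block of plain text."""
    # Use one consistent representation for accented characters.
    text = unicodedata.normalize("NFC", text)
    # Replace non-breaking spaces with ordinary spaces.
    text = text.replace("\u00a0", " ")
    # Use a single line-ending convention.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Collapse runs of spaces and tabs, and trim each line.
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.split("\n")]
    text = "\n".join(lines)
    # Keep at most one blank line between paragraphs.
    return re.sub(r"\n{3,}", "\n\n", text).strip()


# Hypothetical sample with a combining accent, a non-breaking space,
# Windows line endings, and extra spaces.
raw = "Cafe\u0301  Society\u00a0met  \r\n\r\n\r\n\r\nin   Montr\u00e9al. "
result = clean_text(raw)
print(repr(result))
```

Rules like these are deliberately conservative: they change how the text is encoded and spaced, not what it says, so they are safe to run before any manual review.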
The Transformation Team can offer you guidance specific to your data, but first see our Data Cleaning Guide, which is a good starting point and covers the data cleaning steps and tools we have used with previous research teams. For more information about data cleaning tools, see Data Cleaning.
Send your Cleaned Dataset
The output from this step should be the same as the output from the Export Data step, except with cleaning applied to the data. As in that step, send LINCS a copy of your cleaned data if you would like approval or if LINCS is helping to implement your next transformation steps.