Export Data
Introduction
Before transforming your data, you need to prepare a version of it that is easy to share and work with. The idea is to export all of the data that you want to transform from your project’s data store and save it in a format suited for your next transformation step. Because this is a custom step for each project, relying on your unique data store, LINCS can only guide you on what the outcome should look like.
Resources Needed
If your data needs to be extracted from a data store like a relational database, you will likely need support from your database administrator. Support from a team member with basic programming knowledge (e.g., undergraduate Python experience) is useful for restructuring your data.
This step could range from a one-hour task of downloading text files and moving them to a shared location to a multi-day task of exporting complex relational data into XML files that group the relevant information.
| | Research Team | Ontology Team | Transformation Team | Storage Team |
| --- | --- | --- | --- | --- |
| Identify your Source Data | ✓ | ✓ | | |
| Send a Representative Sample | ✓ | | | |
| Export your Full Dataset | ✓ | | | |
| Send your Full Dataset | ✓ | | | |
Identify your Source Data
The research team identifies which dataset or part of a dataset they hope to transform. You will need to work with your technical team to answer:
- Where is the data stored?
- How can you access it?
- Whose help do you need to access it? To export it? To restructure it?
- Can you make changes directly to it during the cleaning and entity matching steps?
- Will you transform all of your data? Or only some fields?
Many projects have their data available to view on a website. When we talk about exporting data, we mean that you need to find out where the data is actually stored: where is the website getting the data from? You will need to go straight to that source, or find out whether an API or a data download link can handle the export for you.
Choose your Source Data
LINCS emphasizes finding your source data because that will be the copy of your data that we use as the starting point for your transformation workflow. For many of the workflows, we develop code and mappings that rely on the structure of your source data remaining constant—even if the contents change through cleaning and entity matching as we go.
Think of your source data as the place where you will go to make changes if you find data errors during the transformation process. Would you go back to the database, edit there, re-export, then re-run the LINCS transformation code? This would mean you have improved the version of your data that you are likely to use for other purposes. Or would you rather export your data into an easy-to-use format like a spreadsheet, clean your data in the sheet, run the LINCS transformation code, and leave changes to your true source data for a future project?
An important consideration is whether you intend to continue transforming more data to add to LINCS after the initial transformation. If yes, it is better to spend time at the start making sure the export and cleaning steps are easily repeatable without needing to waste time duplicating manual work like creating a spreadsheet with special formatting. If not, and this is a one-time transformation, it is safe to prioritize speed.
If you plan to contribute your data to LINCS, your research team, the Ontology Team, and the Transformation Team will meet to discuss the structure and contents of the identified dataset. We will discuss the overarching research goals of the research team so we can create linked open data (LOD) that is useful and meaningful.
It is helpful for the research team to come to this meeting prepared with some research questions they are hoping to answer.
A common scenario for the structured data transformation workflow is a project whose relational database is the true initial source of its data. Such a project has a few options for what to count as its source data when working with LINCS:
- Treat that relational database as the source. While we are transforming data, if errors need to be fixed in the data, the research team makes changes directly to their database. Their database administrator then creates a copy or a data dump of that database, which includes all of the data and the schema that tells us how the data is organized. LINCS’s transformation code takes that data dump as the input or starting point.
- If there is an API that allows us to request data from their database, then they still treat the database as the source like in the previous option. But we do not need a database dump to be regenerated each time significant changes are made. Instead, the transformation code would consider the API as the starting point, where we can call on it from the transformation code to get up-to-date data.
- The research team does not have access, permission, or capacity to make changes directly to the source database. Instead, they create a one-time export of their database into a format they like working in such as a spreadsheet or XML document. Cleaning and entity matching happen directly in that new format and the transformation code uses that new format as the starting point.
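As a rough illustration of the first option, the sketch below produces a data dump that contains both the schema and the data. It assumes, purely for the example, that the data store is a SQLite database; the `person` table and its contents are hypothetical, and other database systems have their own dump tools that your database administrator would use instead.

```python
import sqlite3


def dump_database(conn: sqlite3.Connection) -> str:
    """Return the schema and data of a SQLite database as SQL text."""
    # iterdump() yields the SQL statements needed to rebuild the database,
    # including CREATE TABLE (schema) and INSERT (data) statements.
    return "\n".join(conn.iterdump())


# Hypothetical in-memory database standing in for a project's data store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO person (name) VALUES ('Ada Lovelace')")
conn.commit()

dump_sql = dump_database(conn)
print(dump_sql)
```

Because the dump is regenerated from the database each time, fixes made in the database carry through to the next run of the transformation code.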
Send a Representative Sample
When you start working with LINCS, send us a representative data sample so that we can help create a conceptual mapping.
A representative sample:
- Must include all of the fields you want to be present in your transformed data. For example, there is a spreadsheet column for every category of data.
- Must not include fields that should remain private to your institution or that will not be useful in LOD form. Examples include internal database IDs, institutional information about object purchasing, or personal information about still-living persons.
- Must include blank fields or placeholders for data you will add before transformation is complete. If adding the fields is not possible, communicate to LINCS the changes you intend to make.
- Should not include blank fields or placeholders for data you will not have time to add in the near future. Those fields are usually best left for a second round of transformation once you are comfortable with the process.
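The criteria above could be applied to a spreadsheet export along the following lines. This is a minimal sketch, assuming a CSV export; every column name (`internal_id`, `purchase_price`, `birth_place`) is hypothetical and should be replaced with the fields in your own data.

```python
import csv
import io

# Hypothetical column names; replace them with the fields in your own export.
PRIVATE_COLUMNS = {"internal_id", "purchase_price"}  # should stay private
PLANNED_COLUMNS = ["birth_place"]  # data you will add before transformation is complete


def build_sample(rows, private=PRIVATE_COLUMNS, planned=PLANNED_COLUMNS, size=2):
    """Return the first `size` rows without private columns, with blank planned columns."""
    sample = []
    for row in rows[:size]:
        kept = {key: value for key, value in row.items() if key not in private}
        for column in planned:
            kept.setdefault(column, "")  # blank placeholder for data still to come
        sample.append(kept)
    return sample


source_csv = """internal_id,name,birth_year,purchase_price
17,Ada Lovelace,1815,100
18,Mary Shelley,1797,250
19,Jane Austen,1775,75
"""
rows = list(csv.DictReader(io.StringIO(source_csv)))
sample = build_sample(rows)

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=list(sample[0].keys()), lineterminator="\n")
writer.writeheader()
writer.writerows(sample)
sample_csv = out.getvalue()
print(sample_csv)
```

Taking the first few rows is only a shortcut for the example. In practice you would hand-pick rows that together exercise every field you want transformed.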
LINCS will request a representative sample in our first meeting. Come prepared and kick-start the transformation process.
Export your Full Dataset
The expected output depends on the structure of your data and the transformation tools you plan to use. It is helpful to review the rest of the transformation workflow steps before completing this step, particularly the Implement Conceptual Mapping step.
- Structured Data
- Semi-Structured Data
- TEI Data
- Natural Language Data
For structured data: if you plan to use 3M for transformation, as LINCS typically does for this kind of data, export the data and transform it into one or more XML documents following the suggestions in Preparing Data for 3M.
Otherwise, you can:
- Set up a custom data export, in the programming language of your choice, that outputs your data as LINCS-compliant Resource Description Framework (RDF), as defined in your Develop Conceptual Mapping step.
- Find a tool online that suits your data and export your data into the format the tool expects as input.
See the Implement Conceptual Mapping page for help deciding if 3M or a custom solution is right for your data.
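To give a sense of what exporting relational data into XML that groups the relevant information can look like, here is a minimal sketch. It assumes a SQLite database with hypothetical `person` and `work` tables, and the element names are invented for the example. It does not reproduce the specific structure recommended in Preparing Data for 3M, so check that page before settling on a layout.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical schema: each person has zero or more works.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE work (id INTEGER PRIMARY KEY, person_id INTEGER, title TEXT);
    INSERT INTO person VALUES (1, 'Mary Shelley');
    INSERT INTO work VALUES (1, 1, 'Frankenstein');
    INSERT INTO work VALUES (2, 1, 'The Last Man');
""")


def export_people(conn):
    """Build one XML tree that nests each person's works inside the person element."""
    root = ET.Element("people")
    people = conn.execute("SELECT id, name FROM person ORDER BY id").fetchall()
    for person_id, name in people:
        person = ET.SubElement(root, "person")
        ET.SubElement(person, "name").text = name
        works = conn.execute(
            "SELECT title FROM work WHERE person_id = ? ORDER BY id", (person_id,)
        )
        for (title,) in works:
            # Rows from the related table are grouped under the person they belong to.
            ET.SubElement(person, "work").text = title
    return root


root = export_people(conn)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

The internal database IDs are used only to join the tables and are left out of the XML, in line with the advice on private fields earlier on this page.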
For semi-structured data, it is typically easiest to export the data into individual XML files that can be worked on one at a time. However, if you have short documents with highly connected content, you may prefer to combine them into a single XML file.
Be sure to give each file a unique and meaningful name, such as a unique document identifier.
Each file will have a .xml extension and, while it is fine for the XML to contain tags unique to your project, it needs to be valid XML. You can check this using an online XML validator.
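If you have many files, a local check may be quicker than pasting each one into an online tool. The sketch below uses Python's standard library to report files that fail to parse. The file names and contents are hypothetical, and note that this only tests whether the XML is well-formed; it does not validate against a schema.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def find_malformed_xml(folder):
    """Return {file name: error message} for every .xml file that fails to parse."""
    problems = {}
    for path in sorted(Path(folder).glob("*.xml")):
        try:
            ET.parse(str(path))
        except ET.ParseError as error:
            problems[path.name] = str(error)
    return problems


# Demonstration with two hypothetical files in a temporary folder.
with tempfile.TemporaryDirectory() as folder:
    Path(folder, "doc-001.xml").write_text(
        "<letter><sender>A</sender></letter>", encoding="utf-8"
    )
    # The <sender> tag is never closed, so this file is not well-formed.
    Path(folder, "doc-002.xml").write_text(
        "<letter><sender>B</letter>", encoding="utf-8"
    )
    problems = find_malformed_xml(folder)

print(problems)
```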
For TEI data, export the data so that you end up with an individual file for each TEI document. Be sure to give each file a unique and meaningful name, such as a unique document identifier.
Each file will have a .xml extension.
For natural language data, export the data as cleanly as possible into TXT files. Be sure to give each file a unique and meaningful name, such as a unique document identifier. If you extracted text that was embedded within documents, you will likely want to keep track of where each excerpt came from, for example by saving the offset of the text in the original document.
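One possible way to record where an excerpt came from is to store its character offsets alongside the text. The sketch below is an assumption about how this might be done, not a LINCS requirement; the record layout and the `letter-042` identifier are hypothetical.

```python
import json


def excerpt_with_offsets(document_id, full_text, excerpt):
    """Locate `excerpt` in `full_text` and return it with its character offsets."""
    start = full_text.find(excerpt)  # first occurrence only
    if start == -1:
        raise ValueError(f"Excerpt not found in {document_id}")
    return {
        "document_id": document_id,
        "start": start,
        "end": start + len(excerpt),  # exclusive end offset
        "text": excerpt,
    }


full_text = "Dear Ada, I arrived in London on Tuesday. Yours, Mary"
record = excerpt_with_offsets("letter-042", full_text, "I arrived in London on Tuesday.")
print(json.dumps(record, indent=2))
```

Slicing the original document with `full_text[start:end]` returns the excerpt, which makes it easy to confirm later that the offsets still match the source.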
Images or PDFs that contain text need to be processed (e.g., through OCR) by the research team so that we have well-formatted, clean, typically English, plain text documents. Similarly, formats like Microsoft Word documents should be saved as plain text, as the formatting embedded in such file types is not useful here.
Depending on the transformation tools you choose, you may find it best to save long texts in a single file, or to split them into smaller files by chapters, sections, paragraphs, or sentences. Read through the rest of the transformation workflow steps to get an idea of what will best suit your data and remember you can experiment with multiple options using a sample of your data.
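If you want to experiment with splitting, a sketch like the following breaks a text into paragraphs and generates a unique file name for each piece. It assumes paragraphs are separated by blank lines, and the naming pattern and the `novel-07` identifier are hypothetical.

```python
import re


def split_paragraphs(text):
    """Split text into paragraphs separated by one or more blank lines."""
    parts = re.split(r"\n\s*\n", text.strip())
    return [part.strip() for part in parts if part.strip()]


def chunk_file_names(document_id, count):
    """Return one unique file name per chunk, based on the document identifier."""
    return [f"{document_id}_p{index:03d}.txt" for index in range(1, count + 1)]


text = "First paragraph.\n\nSecond paragraph,\nstill second.\n\n\nThird."
paragraphs = split_paragraphs(text)
names = chunk_file_names("novel-07", len(paragraphs))

for name, paragraph in zip(names, paragraphs):
    print(name, "->", repr(paragraph))
```

Keeping the document identifier in each file name preserves the link between a chunk and the text it came from.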
Send your Full Dataset
As we get further into the transformation process, if LINCS is completing any of the transformation steps with you, we will need a copy of your full dataset before any more work can continue.
If the data is publicly available on a website or through an API, then please share links and documentation. This documentation helps us understand the meaning you intend the data to convey.