Workflow Paths
Introduction
LINCS has developed a series of transformation workflows to cover the most common starting points for creating Linked Open Data (LOD).
The general information about contributing data to LINCS, along with the initial steps of expressing interest and completing the data intake interview, applies to all workflows.
The Four Transformation Workflows
Browse through the following four tabs for an overview of each workflow and to understand how to categorize your data. The rest of the pages in this transformation workflow documentation cover each individual transformation step in order. Each step contains these same four tabs so that you can tailor the instructions to your data.
- Structured Data
- Semi-Structured Data
- TEI Data
- Natural Language Data
Structured Data can take the form of spreadsheets (e.g., CSV, TSV, XLS, XLSX), relational databases, JSON files, RDF files, and XML files.
We count data as structured if:
- The entities are all tagged individually (e.g., one entity per spreadsheet cell or XML element)
And the entities are connected, either:
- In a hierarchical way (e.g., nested XML elements)
- With relationships between entities expressed following some clearly-defined schema and data structure (e.g., spreadsheet headings relating columns of entities together)
Data Example
Here are data samples from two projects published with LINCS that began as structured data.
The Canadian Centre for Ethnomusicology data started as several spreadsheets with a row for each artifact.
ID | Title | placeMade | placeMadeID | material | materialID |
---|---|---|---|---|---|
CCEA-L1995.63 | Bamboo Flute | Edmonton | https://sws.geonames.org/5946768 | bamboo | http://www.wikidata.org/entity/Q27891820 |
CCEA1995.21 | Pair of Taiko Drums | Shinano | https://sws.geonames.org/1852136 | hide | http://www.wikidata.org/entity/Q3291230 |
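Because each cell holds exactly one entity and the headings relate the columns to one another, rows like these can be read almost directly as subject, relation, object statements. The following is a minimal sketch of that idea, not the LINCS transformation itself: the `rows_to_statements` helper is hypothetical, the sample is re-expressed as CSV for convenience, and the column headings stand in for the properties that a real conceptual mapping would assign.

```python
import csv
import io

# The two sample rows above, re-expressed as CSV for illustration.
SAMPLE = """ID,Title,placeMade,placeMadeID,material,materialID
CCEA-L1995.63,Bamboo Flute,Edmonton,https://sws.geonames.org/5946768,bamboo,http://www.wikidata.org/entity/Q27891820
CCEA1995.21,Pair of Taiko Drums,Shinano,https://sws.geonames.org/1852136,hide,http://www.wikidata.org/entity/Q3291230
"""


def rows_to_statements(text):
    """Yield (subject, relation, object) tuples, one per non-empty, non-ID cell.

    Hypothetical helper: the relation is simply the column heading.
    """
    for row in csv.DictReader(io.StringIO(text)):
        subject = row["ID"]
        for heading, value in row.items():
            if heading != "ID" and value:
                yield (subject, heading, value)


statements = list(rows_to_statements(SAMPLE))
for statement in statements:
    print(statement)
```

Each artifact yields one statement per filled-in column, which is why this kind of data is straightforward to map.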
The University of Saskatchewan Art Collection data began as an XML file with a parent element for each art object.
<?xml version="1.0" ?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description>
<ObjectIdentifier>1910.001.001</ObjectIdentifier>
<AcquistionDate>1910</AcquistionDate>
<ObjectTitle>Portrait of Thomas Copland</ObjectTitle>
<ArtistName url="http://www.wikidata.org/entity/Q100921439">Victor Albert Long</ArtistName>
<Medium url="http://vocab.getty.edu/aat/300015050">oil paint</Medium>
<Category url="http://vocab.getty.edu/aat/300033618">painting</Category>
</rdf:Description>
<rdf:Description>
<ObjectIdentifier>2018.026.001</ObjectIdentifier>
<AcquistionDate>2018</AcquistionDate>
<ObjectTitle>Grace</ObjectTitle>
<ArtistName url="http://www.wikidata.org/entity/Q19609740">Lori Blondeau</ArtistName>
<Medium url="http://vocab.getty.edu/aat/300265621">inkjet print</Medium>
<Category url="http://vocab.getty.edu/aat/300046300">Photograph</Category>
</rdf:Description>
</rdf:RDF>
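The same property holds for the XML sample: every entity sits in its own element, and the parent element groups the entities that belong to one art object. As a rough sketch (the `describe_objects` helper is hypothetical, and only the first record is reproduced), Python's standard `xml.etree.ElementTree` module can read each record into a dictionary of labels and identifiers:

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# First record from the sample above; element names are kept exactly as in the source.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description>
<ObjectIdentifier>1910.001.001</ObjectIdentifier>
<AcquistionDate>1910</AcquistionDate>
<ObjectTitle>Portrait of Thomas Copland</ObjectTitle>
<ArtistName url="http://www.wikidata.org/entity/Q100921439">Victor Albert Long</ArtistName>
<Medium url="http://vocab.getty.edu/aat/300015050">oil paint</Medium>
<Category url="http://vocab.getty.edu/aat/300033618">painting</Category>
</rdf:Description>
</rdf:RDF>"""


def describe_objects(xml_text):
    """Return one dict per art object, keyed by child element name.

    Each value is a (text, url) pair; url is None when the element
    carries no `url` attribute.
    """
    root = ET.fromstring(xml_text)
    objects = []
    for description in root.findall(f"{{{RDF_NS}}}Description"):
        record = {child.tag: (child.text, child.get("url")) for child in description}
        objects.append(record)
    return objects


art_objects = describe_objects(SAMPLE)
print(art_objects[0]["ArtistName"])
```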
Workflow Overview
This workflow is our most customizable and curatable because the entities and relationships are clearly defined in the source data. We typically create a custom conceptual mapping for each dataset, reusing past mappings where possible, and transform the data using the 3M mapping tool.
Semi-Structured Data typically takes the form of XML documents that have some structure, but not enough to extract entities and relationships without manual work. For example, an XML document that contains natural language text that has been highly annotated with XML tags may be considered semi-structured data. These tags may identify some entities and some relationships between entities that could be turned into LOD using a combination of custom scripts, additional manual annotation, and vetting. In this case, the workflow becomes a cross between the structured and natural language data workflows.
Data Example
Here is a simplified excerpt from the Orlando Project data, which started its transformation process as hand-annotated XML documents.
<DATE>By March 1643</DATE>, early in this year of fierce <TOPIC>Civil War</TOPIC> fighting <NAME>Dorothy Osborne</NAME>’s mother moved with her children from <PLACE>Chicksands</PLACE> to the fortified port of <PLACE>St Malo</PLACE>.
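To see why data like this counts as semi-structured, consider what a script can recover from the tags alone. The sketch below is illustrative only: it wraps the excerpt in a hypothetical `<P>` root element so that it parses as XML, and the `tagged_entities` helper is not part of any LINCS tool.

```python
import xml.etree.ElementTree as ET

# The excerpt above, wrapped in a hypothetical <P> root so it parses as XML.
SAMPLE = (
    "<P><DATE>By March 1643</DATE>, early in this year of fierce "
    "<TOPIC>Civil War</TOPIC> fighting <NAME>Dorothy Osborne</NAME>’s "
    "mother moved with her children from <PLACE>Chicksands</PLACE> to the "
    "fortified port of <PLACE>St Malo</PLACE>.</P>"
)


def tagged_entities(xml_text):
    """Return (tag, text) pairs for every annotated span, in document order."""
    root = ET.fromstring(xml_text)
    return [(element.tag, element.text) for element in root]


entities = tagged_entities(SAMPLE)
for tag, text in entities:
    print(tag, "->", text)
```

The tags yield a date, a topic, a name, and two places, but nothing in the markup states who moved or which place was the origin and which the destination. That information lives in the surrounding sentence, which is where custom scripts, additional annotation, and vetting come in.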
Workflow Overview
This workflow is still being created. Check back as we release additional details and access to the tools discussed.
This workflow requires custom work by the research teams. It uses LEAF-Writer, with LINCS-API behind the scenes, to help you identify entities and relationships in your XML documents in a semi-automated way.
TEI data follows the guidelines of the Text Encoding Initiative (TEI).
Data Example
Here is a sample of TEI data.
<person xml:id="h-nr254">
<idno type="Wikidata">http://www.wikidata.org/entity/Q28911659</idno>
<persName>
<name>Braderman, Joan</name>
</persName>
<persName type="preferred">
<forename>Joan</forename>
<surname>Braderman</surname>
</persName>
<floruit>
<date>1977</date>
</floruit>
<occupation cert="high">Film maker</occupation>
<affiliation cert="high">Heresies Collective</affiliation>
</person>
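Because TEI elements have agreed meanings, a fragment like this can be read programmatically with little guesswork. The following is a minimal sketch using Python's standard library, not the LINCS XTriples transformation. The `summarize_person` helper is hypothetical, and the fragment is parsed without a namespace exactly as shown above, whereas complete TEI files normally declare the TEI namespace on their root element.

```python
import xml.etree.ElementTree as ET

# The xml: prefix is predefined in XML, so xml:id resolves to this qualified name.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

SAMPLE = """<person xml:id="h-nr254">
<idno type="Wikidata">http://www.wikidata.org/entity/Q28911659</idno>
<persName>
<name>Braderman, Joan</name>
</persName>
<persName type="preferred">
<forename>Joan</forename>
<surname>Braderman</surname>
</persName>
<floruit>
<date>1977</date>
</floruit>
<occupation cert="high">Film maker</occupation>
<affiliation cert="high">Heresies Collective</affiliation>
</person>"""


def summarize_person(xml_text):
    """Collect the identifiers and descriptive fields of one <person> element."""
    person = ET.fromstring(xml_text)
    preferred = person.find("persName[@type='preferred']")
    return {
        "id": person.get(XML_ID),
        "wikidata": person.findtext("idno[@type='Wikidata']"),
        "forename": preferred.findtext("forename"),
        "surname": preferred.findtext("surname"),
        "floruit": person.findtext("floruit/date"),
        "occupation": person.findtext("occupation"),
        "affiliation": person.findtext("affiliation"),
    }


person_summary = summarize_person(SAMPLE)
for field, value in person_summary.items():
    print(f"{field}: {value}")
```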
Workflow Overview
This workflow includes a simple web-based user interface, LINCS XTriples, where you can upload your TEI document and select a transformation template. The tool is designed to produce CIDOC CRM triples from TEI files that conform to the templates in LEAF-Writer. Then you will use the last steps of the structured data workflow to enhance, validate, and ingest your data into LINCS.
Workflow Limitations
- This workflow is intended to extract a small set of possible relationships from TEI files in an automated way. If you are working with files that do not conform to the LINCS templates in LEAF-Writer, please modify and use the XSLTs linked from the LINCS XTriples documentation to transform your TEI.
Natural Language Data is in a free-text format. For LINCS, this looks like a document of full sentences written in modern English, ideally following common grammatical rules.
Natural language data includes documents that are entirely plain text, such as a written biography saved as a TXT file, as well as documents in other formats that have plain text embedded in them. For example, if an XML document has certain tags that always contain full sentences of text, then we can extract that text and use the natural language workflow.
Data Example
Here is a simplified excerpt from the Orlando Project data, where we pulled out natural language text from XML documents.
By March 1643, early in this year of fierce Civil War fighting, Dorothy Osborne’s mother moved with her children from Chicksands to the fortified port of St Malo.
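Pulling natural language text out of an annotated XML document can be as simple as discarding the tags and keeping the text between them. Here is a minimal sketch of that step, assuming the annotated sentence sits inside a hypothetical `<P>` element; the `plain_text` helper is illustrative and is not the script used for the Orlando Project data.

```python
import xml.etree.ElementTree as ET

# An annotated sentence inside a hypothetical <P> element.
SAMPLE = (
    "<P><DATE>By March 1643</DATE>, early in this year of fierce "
    "<TOPIC>Civil War</TOPIC> fighting <NAME>Dorothy Osborne</NAME>’s "
    "mother moved with her children from <PLACE>Chicksands</PLACE> to the "
    "fortified port of <PLACE>St Malo</PLACE>.</P>"
)


def plain_text(xml_text):
    """Return the text content of an XML fragment with all tags removed."""
    root = ET.fromstring(xml_text)
    # itertext() walks the element and its descendants in document order,
    # yielding both tagged text and the untagged text in between.
    return "".join(root.itertext())


sentence = plain_text(SAMPLE)
print(sentence)
```

The result is an ordinary sentence that can be passed to the natural language workflow, at the cost of discarding the hand-made annotations.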
Workflow Overview
This workflow is still being created. Check back as we release additional details and access to the tools discussed.
This workflow uses LINCS APIs to perform automated named entity recognition (NER) and relation extraction (RE) on natural language text files. These APIs are able to output triples that have relationships from a set list of options. After you verify the output, you can use a second API that will transform your triples into CIDOC CRM for final validation and ingestion into LINCS.
This level of automation is meant to be a faster, though less precise, transformation method than the structured transformation workflow or a manual treatment of natural language texts. If your research team has the time, you can add more manual curation, using the tools as a starting point. NERVE is a great option for this type of data when you want a balance between automation and manual editing.
Workflow Limitations
- This workflow is designed for English text, and we are working towards French support. Some individual tools may support other languages; if so, that tool’s documentation will say so.
- This workflow is best suited to documents containing factual statements about real-world people, places, and written works, such as biographies or other non-fiction descriptive text.
- Text extracted from sources like social media or poor-quality Optical Character Recognition (OCR) is unlikely to result in high-quality LOD unless the research team can do significant data cleaning before transformation.