
Corpora

Corpora is a web-based application with a robust database system for Digital Humanities (DH) projects. You can use Corpora to perform Optical Character Recognition (OCR) on uploaded documents, ascribe Uniform Resource Identifiers (URIs) and Corpora content types to entities, build network visualizations, and more.

Corpora and LINCS

In collaboration with LINCS, Corpora is being used to ascribe URIs to named entities in the Advanced Research Consortium (ARC) catalogue and convert its data into triples so that it can be ingested into the LINCS triplestore.
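As a rough illustration of what "ascribing URIs and converting data into triples" means, the sketch below turns one bibliographic record into N-Triples lines. It is not Corpora's or LINCS's actual conversion code: the record fields, the `example.org` URIs, the `to_ntriples` helper, and the choice of Dublin Core Terms predicates are all assumptions made for this example, and the LINCS triplestore may use a different ontology.

```python
# Hypothetical record: the field names and example.org URIs are
# illustrative only and do not reflect Corpora's real schema.
record = {
    "uri": "http://example.org/work/1",
    "title": "An Example Title",
    "author_uri": "http://example.org/person/1",
}

# Dublin Core Terms namespace, chosen here only for illustration.
DCTERMS = "http://purl.org/dc/terms/"


def escape_literal(text):
    """Escape backslashes, quotes, and newlines for an N-Triples literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )


def to_ntriples(rec):
    """Return a list of N-Triples lines describing one record."""
    subject = f"<{rec['uri']}>"
    return [
        # The title is a plain string literal.
        f'{subject} <{DCTERMS}title> "{escape_literal(rec["title"])}" .',
        # The author is an entity that already has a URI ascribed to it.
        f"{subject} <{DCTERMS}creator> <{rec['author_uri']}> .",
    ]


if __name__ == "__main__":
    for line in to_ntriples(record):
        print(line)
```

Each output line is one subject-predicate-object statement. Because the author is identified by a URI rather than a name string, records from different datasets that point to the same URI become linked once they are loaded into a triplestore.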

Corpora lets users identify and assign URIs to entities in ARC, and will soon incorporate functionality from LINCS’s Natural Language Processing (NLP) tools such as NERVE. Corpora is also associated with the Rich Prospect Browser (RPB), an in-development visualization tool for Linked Data (LD) that allows users to browse between and within linked databases. Once complete, the RPB will be integrated into Corpora in place of the current network visualization tool.

At present, Corpora is tailored to bibliographic data from traditional DH projects that focus on individual artifacts and entities, and it is particularly suited to converting these types of datasets.

Corpora can be used online, or it can be downloaded and run locally, with data saved on your own machine. Although Corpora backs up datasets uploaded to the online version, it is not committed to long-term data storage.

Prerequisites

Users of Corpora:

  • Need to bring their own dataset
  • Need to create a user account
    • A GitLab or GitHub account can also be used to import a repository directly to Corpora.
  • Need a basic understanding of Python and JSON to access full functionality

Corpora supports the following inputs and outputs:

  • Input: PDF, JPEG, MARC, XML, and more
  • Output: JSON

Resources

To learn more about Corpora, see the following resources:

Information about the team that developed Corpora is available on the Tool Credits page.