This repository is for CSVs of DS data at various stages of extraction, transformation, and enrichment. It also includes RDF and JSON data extracted from our linked database for display and search through a user interface.
New files should be named `DATE-name.csv`, where `DATE` is the date the file was created in YYYYMMDD format and `name` is a description, like `jhu`, `combined`, `upenn`, etc. When more than one descriptor applies, descriptors are separated by a dash, as in `20220920-language-combined-enriched`. In addition, when `element` appears in the instructions, it stands for the metadata element, field, or type of data, as in `languages.csv`, `named-subjects-unreconciled.csv`, or `20220705-places-combined-enriched.csv`.
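The dated naming pattern above can be checked mechanically. The following is a minimal sketch, not part of this repository: the helper name `parse_data_filename` and the assumption that descriptors are lowercase letters and digits are illustrative only. It covers dated files, not undated ones such as `languages.csv`.

```ruby
require "date"

# Hypothetical pattern for DATE-name.csv: eight digits, then one or more
# dash-separated descriptors (assumed to be lowercase letters and digits).
FILENAME_PATTERN = /\A(?<date>\d{8})-(?<descriptors>[a-z0-9]+(?:-[a-z0-9]+)*)\.csv\z/

# Returns the parsed date and descriptors, or nil if the name does not fit.
def parse_data_filename(filename)
  match = FILENAME_PATTERN.match(filename)
  return nil unless match

  # Date.strptime raises an ArgumentError for impossible dates.
  date = Date.strptime(match[:date], "%Y%m%d")
  { date: date, descriptors: match[:descriptors].split("-") }
rescue ArgumentError
  nil
end

parsed = parse_data_filename("20220705-places-combined-enriched.csv")
puts parsed[:descriptors].join(", ") if parsed
```

A name with a dashed date, such as `2022-07-05-places.csv`, is rejected by this sketch because it expects the compact YYYYMMDD form.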
- Navigate to the `member-data` directory.
- Click on the `Add File` button and click `Upload files` from the context menu.
- Drag and drop or `choose your files` to be uploaded.
- `Commit changes` directly to the main branch.
- Navigate to the `ds-data/terms/batch` directory.
- Click on the `Add File` button and click `Upload files` from the context menu (the file to be uploaded should be `DATE-element-combined-enriched.csv`).
- Drag and drop or `choose your files` to be uploaded.
- `Commit changes` directly to the main branch.
Updating reconciled "term" values to be used in Wikibase import and unreconciled data for documentation:
- Navigate to the `ds-data/terms/reconciled` directory.
- Click on the `Add File` button and click `Upload files` from the context menu (the file to be uploaded should be `element.csv`).
- Drag and drop or `choose your files` to be uploaded.
- `Commit changes` directly to the main branch.
- Navigate to the `ds-data/terms/unreconciled` directory.
- Click on the `Add File` button and click `Upload files` from the context menu (the file to be uploaded should be `element-unreconciled.csv`).
- Drag and drop or `choose your files` to be uploaded.
- `Commit changes` directly to the main branch.
- Navigate to the `ds-data/terms/reconciled/archived` directory.
- Click on the `Add File` button and click `Upload files` from the context menu (the file to be uploaded should be `DATE-element.csv`).
- Drag and drop or `choose your files` to be uploaded.
- `Commit changes` directly to the main branch.
- Navigate to the `ds-data/terms/unreconciled/archived` directory.
- Click on the `Add File` button and click `Upload files` from the context menu (the file to be uploaded should be `DATE-element-unreconciled.csv`).
- Drag and drop or `choose your files` to be uploaded.
- `Commit changes` directly to the main branch.
.
├── README.md
├── Workflow-README-template.md
├── config.yml
├── member-data
│   ├── burke
│   │   ├── 2022-02-23-combined-burke.csv
│   │   ├── 2022-02-23-combined-enriched-burke.csv
│   │   └── 2022-04-04-combined-burke.csv
│   ├── ccny
│   │   ├── 2022-02-23-combined-ccny.csv
│   │   ├── 2022-02-23-combined-enriched-ccny.csv
│   │   └── 2022-04-04-combined-ccny.csv
│   ├── columbia
etc.
├── split_data.rb
├── test
│   ├── missing_inst_name.csv
│   ├── missing_qid.csv
│   └── unknown_qid.csv
└── workflow
    ├── 2022-02-23-combined-README.md
    ├── 2022-02-23-combined-enriched.csv
    ├── 2022-02-23-combined.csv
    ├── 2022-04-04-combined-README.md
    ├── 2022-04-04-combined-enriched.csv
    └── 2022-04-04-combined.csv
New files should be added to the `workflow` directory and named using the date
and a signifier describing the data; e.g., `2022-06-13-combined.csv`. Subsequent
and related files should use the same date and name:
`2022-06-13-combined-enriched.csv`, `2022-06-13-combined-README.md`, and so
forth.
- Add a CSV of extracted data named according to the pattern described above; e.g., `2022-06-13-combined.csv`.
- Copy the `Workflow-README-template.md` to the folder following the name pattern; e.g., `2022-06-13-combined-README.md`.
- Edit the README for the current set of data.
- Once the data has been cleaned and reconciled, add an `enriched` version of the CSV; e.g., `2022-06-13-combined-enriched.csv`.
- Add notes about enrichment to the README.
- When the data is imported, add an `imported` file to the `workflow` folder; e.g., `2022-06-13-combined-imported.csv`.
- Add notes about the import to the README.
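The file names a batch accumulates over these steps all derive from one date and one signifier. A minimal sketch of that derivation follows; the helper name `workflow_filenames` and its stage keys are hypothetical and exist only for illustration.

```ruby
require "date"

# Hypothetical helper: lists the files a batch is expected to accumulate in
# the workflow folder, following the date-plus-signifier pattern above.
def workflow_filenames(date, signifier = "combined")
  base = "#{date.strftime('%Y-%m-%d')}-#{signifier}"
  {
    extracted: "#{base}.csv",
    readme: "#{base}-README.md",
    enriched: "#{base}-enriched.csv",
    imported: "#{base}-imported.csv"
  }
end

names = workflow_filenames(Date.new(2022, 6, 13))
names.each { |stage, name| puts "#{stage}: #{name}" }
```

Generating the names from a single base keeps the related files sorted together in a directory listing.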
.
|-- import
|   |-- batch-20220223
|   |   |-- README.md
|   |   |-- base.csv
|   |   |-- clean.csv
|   |   `-- imported.csv
|   `-- batch-20220505
|       |-- README.md
|       |-- base.csv
|       |-- clean.csv
|       `-- imported.csv
`-- terms
    |-- genres.csv
    |-- names.csv
    |-- places.csv
    |-- subjects-named.csv
    `-- subjects.csv
The workflow is as follows.
- A new set of raw data (CSV, MARC XML) is received.
A. Term reconciliation
- Term extraction: All terms (names, places, genres, etc.) are extracted from the new data. The lists will contain all terms from the new data, but previously reconciled terms will be accompanied by their URIs/identifiers.
- Reconciliation: New terms (those not previously reconciled) are reconciled and added to the appropriate CSVs in the `terms` folder (`places.csv`, `names.csv`, etc.).
B. Import CSV preparation
- Extraction of import CSV: Using new raw data and updated `terms` CSVs, a new `base.csv` is generated and added to the folder `import/batch-<DATE>`.
- Cleaning and reconciliation: The `base.csv` is processed in OpenRefine for reconciliation of language and material columns; any other needed cleaning is performed, and the result is added as `clean.csv` to `import/batch-<DATE>`.
C. Data import
- Data import and CSV update: The file `import/batch-<DATE>/clean.csv` is imported into DS, and the output CSV with DS IDs is added to `import/batch-<DATE>` as `imported.csv`.
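The term extraction step in part A pairs each term from the new data with its URI when the term was reconciled before. The sketch below illustrates that lookup under stated assumptions: the `term` and `uri` column names, both helper names, and the example.org URI are made up for illustration and may not match the real `terms` CSVs.

```ruby
require "csv"

# Assumption: a terms CSV has "term" and "uri" columns; real headers may differ.
def load_reconciled(csv_text)
  CSV.parse(csv_text, headers: true).each_with_object({}) do |row, lookup|
    lookup[row["term"]] = row["uri"]
  end
end

# Returns every distinct term from the new data, sorted, paired with its
# URI when previously reconciled (nil otherwise).
def extract_terms(new_terms, reconciled)
  new_terms.map(&:strip).reject(&:empty?).uniq.sort.map do |term|
    { term: term, uri: reconciled[term] }
  end
end

# Placeholder data for illustration only.
reconciled = load_reconciled("term,uri\nParis,https://example.org/places/paris\n")
extract_terms(["Paris", "Lyon", "Paris "], reconciled).each do |entry|
  puts "#{entry[:term]}: #{entry[:uri] || 'needs reconciliation'}"
end
```

Terms with a nil URI are the ones that would move on to the reconciliation step.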
CSVs in the `workflow` directory should be split into institution-specific
directories. The `split_data.rb` script splits the CSV on the QID in the
`holding_institution` column and puts each file in the folder defined in the
`config.yml` file.
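The splitting idea can be sketched as follows. This is not the actual `split_data.rb`; the function name `split_by_institution` is hypothetical, and the QID-to-folder mapping is passed in as a plain hash rather than read from `config.yml`. The sketch returns CSV text per folder and leaves writing files to the caller.

```ruby
require "csv"

# Assumption: `directories` maps each QID to its folder name, mirroring what
# config.yml is described as holding.
def split_by_institution(csv_text, directories)
  table = CSV.parse(csv_text, headers: true)
  groups = table.group_by { |row| row["holding_institution"] }

  groups.each_with_object({}) do |(qid, rows), out|
    # Fail loudly on a QID that has no configured folder.
    directory = directories.fetch(qid) do
      raise ArgumentError, "unknown QID: #{qid.inspect}"
    end
    out[directory] = CSV.generate do |csv|
      csv << table.headers
      rows.each { |row| csv << row.fields }
    end
  end
end

sample = "holding_institution,title\nQ814779,A\nQ995265,B\nQ814779,C\n"
mapping = { "Q814779" => "beinecke", "Q995265" => "brynmawr" }
split_by_institution(sample, mapping).each do |directory, text|
  puts "#{directory}: #{text.lines.size - 1} row(s)"
end
```

Raising on an unknown or missing QID, rather than skipping the row, keeps records from silently dropping out of the split.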
The configuration file contains the QID, name and a single-word folder for each institution. New repositories should be added to the configuration. The format of the entries is like so:
---
- :qid: Q814779
  :name: Beinecke Rare Book & Manuscript Library
  :directory: beinecke
- :qid: Q995265
  :name: Bryn Mawr College
  :directory: brynmawr
- :qid: Q63969940
  :name: Burke Library at Union Theological Seminary
  :directory: burke

`split_data.rb` validates the config file and the CSV.
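A config check along these lines might look like the sketch below. It is an illustration only: the rules shown (all three keys present, a QID of the form `Q` plus digits, a single-word directory) are assumptions drawn from the description above, not the checks `split_data.rb` actually performs, and `config_errors` is a hypothetical name.

```ruby
require "yaml"

REQUIRED_KEYS = %i[qid name directory].freeze

# Returns a list of problems found in the config text; empty when it passes.
def config_errors(yaml_text)
  # The keys are Ruby symbols (:qid), so Symbol must be permitted explicitly.
  entries = YAML.safe_load(yaml_text, permitted_classes: [Symbol])
  return ["config must be a list of entries"] unless entries.is_a?(Array)

  errors = []
  entries.each_with_index do |entry, index|
    unless entry.is_a?(Hash)
      errors << "entry #{index}: not a mapping"
      next
    end
    missing = REQUIRED_KEYS.select { |key| entry[key].to_s.strip.empty? }
    errors << "entry #{index}: missing #{missing.join(', ')}" unless missing.empty?
    if entry[:qid] && !entry[:qid].to_s.match?(/\AQ\d+\z/)
      errors << "entry #{index}: malformed QID #{entry[:qid].inspect}"
    end
    if entry[:directory] && !entry[:directory].to_s.match?(/\A\w+\z/)
      errors << "entry #{index}: directory must be a single word"
    end
  end
  errors
end

config = <<~YAML
  ---
  - :qid: Q814779
    :name: Beinecke Rare Book & Manuscript Library
    :directory: beinecke
YAML
puts config_errors(config).empty? ? "config OK" : config_errors(config)
```

Collecting all problems before reporting lets a maintainer fix a new entry in one pass.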