DigitalScriptorium/ds-open-refine

Digital Scriptorium Data Reconciliation Process through OpenRefine

Digital Scriptorium OpenRefine documentation and JSON recipes for data reconciliation

Notes on editing file name variables (use all lowercase letters where applicable):

  • DATE = the date the file/dataset was generated/created/extracted in YYYYMMDD format
  • DATATYPE = the type of encoding standard or technical format of the metadata source, such as marcxml or mets or csv
  • INSTITUTION = the code for the name of the institutional source for the data, such as penn or kansas or csl
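
The naming pattern above can be sketched as a small helper. This is only an illustration: the function name build_filename, the sample date, and the validation rule are assumptions and are not part of this repository.

```python
import re
from datetime import date

# Hypothetical helper (not part of ds-open-refine): assembles a file name
# following the DATE-<kind>-DATATYPE-INSTITUTION[-suffix].csv pattern.
def build_filename(kind: str, created: date, datatype: str,
                   institution: str, suffix: str = "") -> str:
    parts = [created.strftime("%Y%m%d"), kind, datatype, institution]
    if suffix:
        parts.append(suffix)
    # The notes above ask for all lowercase letters where applicable.
    parts = [p.strip().lower() for p in parts]
    for p in parts:
        # Assumed rule: only letters, digits, and hyphens in each part.
        if not re.fullmatch(r"[a-z0-9-]+", p):
            raise ValueError(f"unexpected file name part: {p!r}")
    return "-".join(parts) + ".csv"

# Illustrative values only; the date is made up for the example.
print(build_filename("names", date(2023, 1, 15), "marcxml", "penn"))
print(build_filename("names", date(2023, 1, 15), "marcxml", "penn", "enriched"))
```

The optional suffix covers the enriched, reconciled, and unreconciled variants used in the workflows below.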

Reconciling names to Wikidata

  1. Load DATE-names-DATATYPE-INSTITUTION.csv into OpenRefine; rename DATE-names-DATATYPE-INSTITUTION-enriched.csv
  2. Add workflow columns: JSON (in the left panel, open the Undo/Redo tab, click Apply, and paste the JSON code)
  3. Copy name_as_recorded column and reconcile new recon-human column against human type (Q5): JSON
  4. Apply list of previously reconciled or known human names, in three parts applied in order: 0. JSON, 1. JSON, 2. JSON
  5. Manually reconcile and update known human names: edit JSON
  6. Add human-label, instance-of-human, and human-qid columns; rename reconciliation column to recon-organization to reconcile against organization type (Q43229): JSON
  7. Apply list of previously reconciled or known organization names: JSON
  8. Manually reconcile and update known organization names: edit JSON
  9. Add organization-label, instance-of-organization, and organization-qid columns; consolidate authorized_label, instance_of, and structured_value columns; finalize workflow: JSON
  10. Close all facets before exporting
  11. Export three versions from OpenRefine as CSV files: 1) the full document (retain the file name); 2) faceted by structured_value blank (null/empty) = true, renamed DATE-names-DATATYPE-INSTITUTION-unreconciled.csv; 3) faceted by structured_value blank (null/empty) = false, renamed DATE-names-DATATYPE-INSTITUTION-reconciled.csv

JSON recipe files for names:

  • json/name/010-name-workflow.json
  • json/name/030-name-recon-human.json
  • json/name/040-name-known-human.json
  • json/name/041-name-known-human.json
  • json/name/042-name-known-human.json
  • json/name/050-name-recon-org.json
  • json/name/060-name-known-org.json
  • json/name/090-name-finalize.json
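
For readers unfamiliar with what gets pasted into the Undo/Redo Apply dialog, the sketch below builds a two-operation history resembling step 3 (copy name_as_recorded, then reconcile the copy against Q5). It is not the content of json/name/030-name-recon-human.json. The field layout is modeled on histories that OpenRefine typically exports and may differ between versions, and the service URL, the column insert index, and the function name are assumptions.

```python
import json

# Assumed endpoint for the Wikidata reconciliation service; confirm it
# against the services configured in your own OpenRefine installation.
WIKIDATA_SERVICE = "https://wikidata.reconci.link/en/api"

# Hypothetical helper: returns a list of operations in the general shape
# of an OpenRefine operation history (an assumption, not the repo recipe).
def copy_and_reconcile_ops(source: str, target: str,
                           type_id: str, type_name: str) -> list:
    engine = {"facets": [], "mode": "row-based"}
    copy_column = {
        "op": "core/column-addition",
        "engineConfig": engine,
        "baseColumnName": source,
        "expression": "grel:value",   # copy each cell unchanged
        "onError": "set-to-blank",
        "newColumnName": target,
        "columnInsertIndex": 1,       # assumed position of the new column
        "description": f"Create column {target} based on column {source}",
    }
    reconcile = {
        "op": "core/recon",
        "engineConfig": engine,
        "columnName": target,
        "config": {
            "mode": "standard-service",
            "service": WIKIDATA_SERVICE,
            "identifierSpace": "http://www.wikidata.org/entity/",
            "schemaSpace": "http://www.wikidata.org/prop/direct/",
            "type": {"id": type_id, "name": type_name},
            "autoMatch": True,
            "columnDetails": [],
            "limit": 0,
        },
        "description": f"Reconcile cells in column {target} to type {type_id}",
    }
    return [copy_column, reconcile]

ops = copy_and_reconcile_ops("name_as_recorded", "recon-human", "Q5", "human")
print(json.dumps(ops, indent=2))
```

The printed text is the kind of JSON array the Apply dialog accepts; the recipes in this repository remain the authoritative versions for each step.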

Reconciling genres

  1. Load DATE-genres-DATATYPE-INSTITUTION.csv into OpenRefine; rename DATE-genres-DATATYPE-INSTITUTION-enriched.csv
  2. Add workflow columns: JSON

to AAT

  1. Copy filtered genre_as_recorded column and reconcile new recon-genre column against AAT vocabulary: JSON
  2. Apply list of previously reconciled or known AAT terms: JSON
  3. Manually reconcile and update known AAT genre terms: edit JSON
  4. Add aat-label and genre-aat columns: JSON

to FAST

  1. Copy filtered genre_as_recorded column and reconcile new recon-genre column against FAST terms: JSON
  2. Apply list of previously reconciled or known FAST terms: JSON
  3. Manually reconcile and update known FAST genre terms: edit JSON
  4. Add fast-label and genre-fast columns: JSON

TBD instructions for other genres as needed

all genre terms: finalize

  1. Finalize workflow; consolidate authorized_label and structured_value columns: JSON
  2. Export three versions from OpenRefine as CSV files: 1) the full document (retain the file name); 2) faceted by structured_value blank (null/empty) = true, renamed DATE-genres-DATATYPE-INSTITUTION-unreconciled.csv; 3) faceted by structured_value blank (null/empty) = false, renamed DATE-genres-DATATYPE-INSTITUTION-reconciled.csv

JSON recipe files for genres:

  • json/genre/010-genre-workflow.json
  • json/genre/aat/030-genre-aat-recon.json
  • json/genre/aat/040-genre-aat-known.json
  • json/genre/aat/050-genre-aat.json
  • json/genre/fast/030-genre-fast-recon.json
  • json/genre/fast/040-genre-fast-known.json
  • json/genre/fast/050-genre-fast.json
  • json/genre/090-genre-finalize.json
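
The three-way export is done in OpenRefine itself, but the same split can be reproduced outside it, for example to double-check an export. The sketch below is one way to do that with the Python standard library. The function name is hypothetical, and treating whitespace-only cells as blank is an assumption that may not match OpenRefine's blank facet exactly.

```python
import csv
from pathlib import Path

SUFFIX = "-enriched.csv"

# Hypothetical helper: splits an exported "-enriched.csv" file into
# "-reconciled.csv" and "-unreconciled.csv" files in the same folder,
# based on whether the structured_value column is blank.
def split_by_structured_value(enriched_path, column="structured_value"):
    enriched = Path(enriched_path)
    if not enriched.name.endswith(SUFFIX):
        raise ValueError(f"expected a file name ending in {SUFFIX}")
    base = enriched.name[: -len(SUFFIX)]
    reconciled_path = enriched.with_name(base + "-reconciled.csv")
    unreconciled_path = enriched.with_name(base + "-unreconciled.csv")
    counts = {"reconciled": 0, "unreconciled": 0}

    with enriched.open(newline="", encoding="utf-8") as src, \
         reconciled_path.open("w", newline="", encoding="utf-8") as rec, \
         unreconciled_path.open("w", newline="", encoding="utf-8") as unrec:
        reader = csv.DictReader(src)
        if not reader.fieldnames or column not in reader.fieldnames:
            raise ValueError(f"column {column!r} not found")
        writers = {
            "reconciled": csv.DictWriter(
                rec, fieldnames=reader.fieldnames, extrasaction="ignore"),
            "unreconciled": csv.DictWriter(
                unrec, fieldnames=reader.fieldnames, extrasaction="ignore"),
        }
        for writer in writers.values():
            writer.writeheader()
        for row in reader:
            # Assumption: empty or whitespace-only means "blank".
            value = (row[column] or "").strip()
            key = "reconciled" if value else "unreconciled"
            writers[key].writerow(row)
            counts[key] += 1
    return counts
```

The returned counts should add up to the number of rows in the full document, which gives a quick sanity check on the export.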

Reconciling subjects to FAST

Reconciling named subjects

  1. Load DATE-named-subjects-combined.csv into OpenRefine; rename DATE-named-subjects-combined-enriched.csv
  2. Add workflow columns: JSON
  3. Copy subject_as_recorded column and reconcile new recon-subject column against FAST terms: JSON
  4. Apply list of previously reconciled or known FAST terms: JSON
  5. Manually reconcile and update known FAST terms: edit JSON
  6. Add named-subject-label-1 and named-subject-fast-1 columns, reconcile next recon-subject column: JSON
  7. Apply list of previously reconciled or known FAST terms: JSON
  8. Manually reconcile and update known FAST terms: edit JSON
  9. Add named-subject-label-2 and named-subject-fast-2 columns; consolidate authorized_label and structured_value columns; finalize workflow: JSON
  10. Export three versions from OpenRefine as CSV files: 1) full document, 2) facet by structured_value blank (null/empty) = true, 3) facet by structured_value blank (null/empty) = false

Reconciling subjects (topical, etc.)

  1. Load DATE-subjects-combined.csv into OpenRefine; rename DATE-subjects-combined-enriched.csv
  2. Add workflow columns: JSON
  3. Copy subject_as_recorded column and reconcile new recon-subject column against FAST terms: JSON
  4. Apply list of previously reconciled or known FAST terms: JSON
  5. Manually reconcile and update known FAST terms: edit JSON
  6. Add subject-label-1 and subject-fast-1 columns, reconcile next recon-subject column: JSON
  7. Apply list of previously reconciled or known FAST terms: JSON
  8. Manually reconcile and update known FAST terms: edit JSON
  9. Add subject-label-2 and subject-fast-2 columns, reconcile next recon-subject column: JSON
  10. Apply list of previously reconciled or known FAST terms: JSON
  11. Manually reconcile and update known FAST terms: edit JSON
  12. Add subject-label-3 and subject-fast-3 columns; consolidate authorized_label and structured_value columns; finalize workflow: JSON
  13. Export three versions from OpenRefine as CSV files: 1) the full document (retain the file name); 2) faceted by structured_value blank (null/empty) = true, renamed DATE-subjects-DATATYPE-INSTITUTION-unreconciled.csv; 3) faceted by structured_value blank (null/empty) = false, renamed DATE-subjects-DATATYPE-INSTITUTION-reconciled.csv

JSON recipe files for subjects:

  • json/subject/010-subject-workflow.json
  • json/subject/named/030-named-subject-recon-1.json
  • json/subject/named/040-named-subject-known.json
  • json/subject/named/060-named-subject-recon-2.json
  • json/subject/named/090-named-subject-finalize.json
  • json/subject/topic/030-subject-recon-1.json
  • json/subject/topic/040-subject-known.json
  • json/subject/topic/060-subject-recon-2.json
  • json/subject/topic/090-subject-recon-3.json
  • json/subject/topic/120-subject-finalize.json
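
The consolidation step in the subject workflows merges the numbered columns (for example subject-fast-1 through subject-fast-3) into a single structured_value column. The actual logic lives in the finalize recipes and is not reproduced here. As a rough sketch only, assuming the non-empty values are joined in order with a separator, it could look like the following. The function name, the "|" separator, and the sample values are all assumptions.

```python
# Hypothetical helper: joins the non-empty values of numbered columns
# such as subject-fast-1..subject-fast-3 into a single string.
# The "|" separator is an assumption, not taken from the finalize recipe.
def consolidate(row: dict, prefix: str, count: int, separator: str = "|") -> str:
    values = []
    for i in range(1, count + 1):
        value = (row.get(f"{prefix}-{i}") or "").strip()
        if value:
            values.append(value)
    return separator.join(values)

# Placeholder values, not real FAST identifiers or labels.
row = {
    "subject-label-1": "label-a", "subject-fast-1": "fast-a",
    "subject-label-2": "",        "subject-fast-2": "",
    "subject-label-3": "label-c", "subject-fast-3": "fast-c",
}
authorized_label = consolidate(row, "subject-label", 3)
structured_value = consolidate(row, "subject-fast", 3)
print(authorized_label, structured_value)
```

A row whose numbered columns are all empty yields an empty structured_value, which is what the blank facet in the export step then picks up as unreconciled.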

Languages

Language reconciliation instructions

Materials

Material reconciliation instructions

Places

Place reconciliation instructions
