Commit 4ee4a5d

Update README.md
1 parent 4e50a49 commit 4ee4a5d

1 file changed

Lines changed: 22 additions & 3 deletions

‎README.md‎

@@ -1,8 +1,25 @@
11
# Digital Scriptorium Data Reconciliation Process through OpenRefine
22

3-
Digital Scriptorium OpenRefine documentation and JSON recipes for data reconciliation and management
3+
Digital Scriptorium (DS) OpenRefine documentation and JSON recipes for data reconciliation and management
44

5-
## General instructions
5+
## Data reconciliation and semantic enrichment
6+
7+
DS aggregates, harmonizes, enriches, transforms, and republishes institutional metadata about pre- and early modern manuscript objects in North American collections to build and grow a national union catalog. The result of this work is the DS Catalog, which renders previously siloed institutional data into linked open data (LOD) stored in a free and open Wikibase database. Member institutions help build this national union catalog by contributing metadata records describing manuscripts in their collections. DS enhances the data through semantic enrichment, linking relatively unstructured or semi-structured "as recorded" data in manuscript records to LOD vocabularies and authorities. The enrichment process helps to standardize and align heterogeneous metadata supplied by disparate institutions so that it can be searched in a single interface, adding value to the data in two ways. First, semantic enrichment powers faceted browsing and searching as well as the ability to query linked data using SPARQL, which allows users to search across previously siloed collections that originally used different data encoding standards. Second, enrichment and transformation increase the linkability of data in manuscript records by connecting them to other linked data, thus making institutional data more valuable and reusable.
8+
9+
When a member institution contributes structured data, the data is extracted for two purposes: data reconciliation (recon-extraction) and generation of a spreadsheet used to import records into the DS Wikibase (import generation). Recon-extraction begins the semantic enrichment process by using DS-designed software to isolate the unique data values that occur in the structured institutional data for the metadata elements that DS staff will enrich prior to upload to the DS Catalog. These data values, based on the DS data model, are:
10+
- genres
11+
- languages
12+
- materials
13+
- names
14+
- places
15+
- subjects
16+
- titles
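
As a rough illustration of the recon-extraction step described above, the following Python sketch collects the distinct "as recorded" values for each metadata element. It is not the DS-designed software: the record layout (a dict mapping element names to lists of strings) and the function name `extract_unique_values` are assumptions made only for this example.

```python
from collections import defaultdict

# Metadata elements named in the list above.
ELEMENTS = ["genres", "languages", "materials", "names", "places", "subjects", "titles"]


def extract_unique_values(records):
    """Return the sorted distinct values found for each metadata element.

    `records` is assumed to be a list of dicts such as
    {"languages": ["Latin"], "places": ["Paris"]}; this layout is hypothetical.
    """
    unique = defaultdict(set)
    for record in records:
        for element in ELEMENTS:
            for value in record.get(element, []):
                value = value.strip()  # ignore surrounding whitespace
                if value:
                    unique[element].add(value)
    # Every element gets an entry, even when no values were found.
    return {element: sorted(unique[element]) for element in ELEMENTS}
```

Each resulting list would correspond to the rows of one per-element spreadsheet.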
17+
18+
During recon-extraction, data values in institutional ("as recorded") data are compared against existing data dictionaries (one per metadata element), which include all data values previously encountered. Recon-extraction automatically matches existing reconciliations with authority records in the Wikibase. It produces a spreadsheet for each metadata element, created from the extracted data and showing both reconciled and unreconciled values from a dataset. For unreconciled data, DS staff undertake data reconciliation to match unreconciled values with their equivalents in designated LOD authorities and vocabularies, such as Wikidata, OCLC Faceted Application of Subject Terminology (FAST), and the Getty Vocabularies. Once newly encountered values have been reconciled, the data dictionaries are updated with the new reconciliations. On future processing of the data, these reconciliations are then captured automatically, allowing data to be enriched automatically prior to upload. Enriched data is generated and stored temporarily in an import spreadsheet, which is pushed to the DS Wikibase to create DS records.
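
The dictionary lookup described above can be sketched in Python as follows. This is a minimal illustration, not the actual DS tooling: the dictionary is assumed to be a plain mapping from an "as recorded" value to an authority identifier, and the function names and placeholder identifiers are hypothetical.

```python
def split_by_dictionary(values, dictionary):
    """Separate values into already-reconciled and still-unreconciled.

    `dictionary` is assumed to map an "as recorded" value to an
    authority identifier (hypothetical layout).
    """
    reconciled = {}
    unreconciled = []
    for value in values:
        if value in dictionary:
            reconciled[value] = dictionary[value]
        else:
            unreconciled.append(value)
    return reconciled, unreconciled


def update_dictionary(dictionary, new_reconciliations):
    """Return a copy of the dictionary extended with new reconciliations."""
    updated = dict(dictionary)
    updated.update(new_reconciliations)
    return updated


# Placeholder identifiers, not real authority IDs.
places = {"Paris": "AUTH-0001"}
done, todo = split_by_dictionary(["Paris", "Bologna"], places)
# After staff reconcile "Bologna" manually, the dictionary is extended
# so that the next run matches it automatically.
places = update_dictionary(places, {"Bologna": "AUTH-0002"})
```

Returning a copy from `update_dictionary` rather than mutating in place is a design choice for the sketch; it keeps the previous dictionary available for comparison.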
19+
20+
## General instructions for processing data
21+
22+
The following instructions cover the management, handling, and processing of recon-extraction spreadsheets: file naming conventions, the manual steps of the reconciliation process, and JSON recipes for automating reconciliation tasks using OpenRefine.
623

724
### File naming conventions
825

@@ -25,10 +42,12 @@ Examples of correctly formatted file names:
2542
- `20230901-places-hrc-csv-fragments-batch-1-enriched.csv`
2643

2744
### Manual reconciliation
45+
2846
TBD.
2947

3048
### Automated reconciliation instructions by metadata element / authority type
31-
Introductory text?
49+
50+
Introductory text
3251

3352
#### Genres
3453
