|
2 | 2 |
|
3 | 3 | Repository for DS CSV data. |
4 | 4 |
|
5 | | -New files should be called: `DATE-name.csv`, where `DATE` is the date the file was created in `YYYY-MM-DD` format and `name` is a one-word description, like `jhu`, `combined`, `upenn`, etc. |
| 5 | +New files should be called: `DATE-name.csv`, where `DATE` is the date the file was created in `YYYY-MM-DD` format and `name` is a description, like `jhu`, `combined`, `upenn`, etc. |
6 | 6 |
|
7 | 7 | Directory structure: |
8 | 8 |
|
9 | 9 | ```text |
10 | | -/ # root folder |
11 | | -- README.md |
12 | | -- Workflow-README-template.md |
13 | | -- workflow |
14 | | - - 2022-02-23-README.md |
15 | | - - 2022-02-23-combined.csv |
16 | | - - 2022-02-23-combined-enriched.csv |
17 | | - - 2022-02-23-combined-imported.csv |
18 | | -- member-data |
19 | | - - beinecke |
20 | | - - harvard |
21 | | - - nyu |
22 | | - - 2021-07-06-nyu.csv |
23 | | - - 2021-07-06-nyu-enriched.csv |
24 | | - - 2021-07-06-nyu-imported.csv |
25 | | - - 2022-06-06-nyu.csv |
26 | | - - 2022-06-06-nyu-enriched.csv |
27 | | - - 2022-06-06-nyu-imported.csv |
| 10 | +. |
| 11 | +├── README.md |
| 12 | +├── Workflow-README-template.md |
| 13 | +├── config.yml |
| 14 | +├── member-data |
| 15 | +│ ├── burke |
| 16 | +│ │ ├── 2022-02-23-combined-burke.csv |
| 17 | +│ │ ├── 2022-02-23-combined-enriched-burke.csv |
| 18 | +│ │ └── 2022-04-04-combined-burke.csv |
| 19 | +│ ├── ccny |
| 20 | +│ │ ├── 2022-02-23-combined-ccny.csv |
| 21 | +│ │ ├── 2022-02-23-combined-enriched-ccny.csv |
| 22 | +│ │ └── 2022-04-04-combined-ccny.csv |
| 23 | +│ ├── columbia |
| 24 | +etc. |
| 25 | +├── split_data.rb |
| 26 | +├── test |
| 27 | +│ ├── missing_inst_name.csv |
| 28 | +│ ├── missing_qid.csv |
| 29 | +│ └── unknown_qid.csv |
| 30 | +└── workflow |
| 31 | + ├── 2022-02-23-combined-README.md |
| 32 | + ├── 2022-02-23-combined-enriched.csv |
| 33 | + ├── 2022-02-23-combined.csv |
| 34 | + ├── 2022-04-04-combined-README.md |
| 35 | + ├── 2022-04-04-combined-enriched.csv |
| 36 | + └── 2022-04-04-combined.csv |
| 37 | +
|
28 | 38 | ``` |
29 | | - |
30 | | - - [ ] TODO: Add script for splitting and placing member-data files in place |
31 | 39 |
|
| 40 | +## Adding data |
| 41 | + |
| 42 | +New files should be added to the `workflow` directory and named using the date |
| 43 | +and a signifier describing the data; e.g., `2022-06-13-combined.csv`. Subsequent |
| 44 | +and related files should use the same date and name: |
| 45 | +`2022-06-13-combined-enriched.csv`, `2022-06-13-combined-README.md`, and so |
| 46 | +forth. |
| 47 | + |
| 48 | +1. Add a CSV of extracted data named according to the pattern described above; |
| 49 | + e.g., `2022-06-13-combined.csv`. |
| 50 | +2. Copy the `Workflow-README-template.md` to the folder following the name |
| 51 | + pattern; e.g., `2022-06-13-combined-README.md`. |
| 52 | +3. Edit the README for the current set of data. |
| 53 | +4. Once the data has been cleaned and reconciled, add an `enriched` version of |
| 54 | + the CSV; e.g., `2022-06-13-combined-enriched.csv`. |
| 55 | +5. Add notes about enrichment to the README. |
| 56 | +6. When the data is imported, add an `imported` file to the `workflow` folder; |
| 57 | + e.g., `2022-06-13-combined-imported.csv`. |
| 58 | +7. Add notes about the import to the README. |
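
The naming pattern in the steps above can be sketched in Ruby, the language of `split_data.rb`. This is only an illustration: the helper name `workflow_filenames` and its hash keys are assumptions and are not part of the repository.

```ruby
require 'date'

# Hypothetical helper (not in the repository): builds the file names used
# in the `workflow` directory for one data set, following the pattern
# DATE-name[-stage].ext described above.
def workflow_filenames(date, signifier)
  prefix = "#{date.strftime('%Y-%m-%d')}-#{signifier}"
  {
    extracted: "#{prefix}.csv",          # step 1
    readme: "#{prefix}-README.md",       # steps 2 and 3
    enriched: "#{prefix}-enriched.csv",  # step 4
    imported: "#{prefix}-imported.csv"   # step 6
  }
end

names = workflow_filenames(Date.new(2022, 6, 13), 'combined')
names.each_value { |name| puts File.join('workflow', name) }
```

Deriving every name from one date and one signifier keeps the related files next to each other when the directory is sorted.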
32 | 59 |
|
33 | 60 | ## Proposed alternate workflow |
34 | 61 |
|
@@ -84,3 +111,31 @@ C. Data import |
84 | 111 | - Data import and CSV update: The file `import/batch-<DATE>/clean-recon.csv` is |
85 | 112 | imported into DS, and the output CSV with DS IDs is added to |
86 | 113 | `import/batch-<DATE>` as `imported.csv`. |
| 114 | + |
| 115 | +## Splitting files |
| 116 | + |
| 117 | +CSVs in the workflow directory should be split into institution-specific |
| 118 | +directories. The `split_data.rb` script splits each CSV on the QID in the |
| 119 | +`holding_institution` column and puts each resulting file in the folder defined |
| 120 | +for that institution in the `config.yml` file. |
| 121 | + |
| 122 | +### `config.yml` |
| 123 | + |
| 124 | +The configuration file contains the QID, name and a single-word folder for each |
| 125 | +institution. New repositories should be added to the configuration. The format |
| 126 | +of the entries is like so: |
| 127 | + |
| 128 | +```yaml |
| 129 | +--- |
| 130 | +- :qid: Q814779 |
| 131 | + :name: Beinecke Rare Book & Manuscript Library |
| 132 | + :directory: beinecke |
| 133 | +- :qid: Q995265 |
| 134 | + :name: Bryn Mawr College |
| 135 | + :directory: brynmawr |
| 136 | +- :qid: Q63969940 |
| 137 | + :name: Burke Library at Union Theological Seminary |
| 138 | + :directory: burke |
| 139 | +``` |
| 140 | +
|
| 141 | +`split_data.rb` validates the config file and the CSV. |
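
The exact checks that `split_data.rb` performs are not listed here. The sketch below shows the kind of validation that the config format and the files in `test` (`missing_qid.csv`, `unknown_qid.csv`) suggest. The method names `config_errors` and `qid_errors` and the error messages are assumptions. Because the config uses Ruby symbol keys, `Symbol` has to be permitted when loading it with `YAML.safe_load`.

```ruby
require 'yaml'

REQUIRED_KEYS = %i[qid name directory].freeze

# Hypothetical check: lists problems found in the parsed config entries.
def config_errors(entries)
  errors = []
  entries.each_with_index do |entry, index|
    REQUIRED_KEYS.each do |key|
      errors << "entry #{index}: missing :#{key}" if entry[key].to_s.strip.empty?
    end
    # The folder is described as a single word, so reject whitespace.
    errors << "entry #{index}: :directory must be one word" if entry[:directory].to_s.match?(/\s/)
  end
  errors
end

# Hypothetical check: flags `holding_institution` values that are blank
# or not present in the config. Rows are numbered from 1.
def qid_errors(qids, entries)
  known = entries.map { |entry| entry[:qid] }
  qids.each_with_index.map do |qid, index|
    if qid.to_s.strip.empty?
      "row #{index + 1}: missing QID"
    elsif !known.include?(qid)
      "row #{index + 1}: unknown QID #{qid}"
    end
  end.compact
end

config_text = <<~CONFIG
  ---
  - :qid: Q814779
    :name: Beinecke Rare Book & Manuscript Library
    :directory: beinecke
  - :qid: Q995265
    :name: Bryn Mawr College
    :directory: brynmawr
CONFIG

entries = YAML.safe_load(config_text, permitted_classes: [Symbol])
puts config_errors(entries)
puts qid_errors(['Q814779', '', 'Q1'], entries)
```

Collecting all problems into a list, instead of stopping at the first one, lets a single run report everything that needs fixing in a CSV.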