
Commit 91e801f

Merge branch 'main' into feature/1-directory-for-reconciliations
2 parents 886be1f + b9fe07b commit 91e801f

4 files changed

Lines changed: 101 additions & 23 deletions

README.md

Lines changed: 76 additions & 21 deletions
@@ -2,33 +2,60 @@

 Repository for CSVs of DS data.

-New files should be called: `DATE-name.csv`, where `DATE` is the date the file was created in `YYYY-MM-DD` format and `name` is a one-word description, like `jhu`, `combined`, `upenn`, etc.
+New files should be called: `DATE-name.csv`, where `DATE` is the date the file was created in `YYYY-MM-DD` format and `name` is a description, like `jhu`, `combined`, `upenn`, etc.

 Directory structure:

 ```text
-/ # root folder
-  - README.md
-  - Workflow-README-template.md
-  - workflow
-    - 2022-02-23-README.md
-    - 2022-02-23-combined.csv
-    - 2022-02-23-combined-enriched.csv
-    - 2022-02-23-combined-imported.csv
-  - member-data
-    - beinecke
-    - harvard
-    - nyu
-      - 2021-07-06-nyu.csv
-      - 2021-07-06-nyu-enriched.csv
-      - 2021-07-06-nyu-imported.csv
-      - 2022-06-06-nyu.csv
-      - 2022-06-06-nyu-enriched.csv
-      - 2022-06-06-nyu-imported.csv
+.
+├── README.md
+├── Workflow-README-template.md
+├── config.yml
+├── member-data
+│   ├── burke
+│   │   ├── 2022-02-23-combined-burke.csv
+│   │   ├── 2022-02-23-combined-enriched-burke.csv
+│   │   └── 2022-04-04-combined-burke.csv
+│   ├── ccny
+│   │   ├── 2022-02-23-combined-ccny.csv
+│   │   ├── 2022-02-23-combined-enriched-ccny.csv
+│   │   └── 2022-04-04-combined-ccny.csv
+│   ├── columbia
+etc.
+├── split_data.rb
+├── test
+│   ├── missing_inst_name.csv
+│   ├── missing_qid.csv
+│   └── unknown_qid.csv
+└── workflow
+    ├── 2022-02-23-combined-README.md
+    ├── 2022-02-23-combined-enriched.csv
+    ├── 2022-02-23-combined.csv
+    ├── 2022-04-04-combined-README.md
+    ├── 2022-04-04-combined-enriched.csv
+    └── 2022-04-04-combined.csv
+
 ```
-
-- [ ] TODO: Add script for splitting and placing member-data files in place

+## Adding data
+
+New files should be added to the `workflow` directory and named using the date
+and a signifier describing the data; e.g., `2022-06-13-combined.csv`. Subsequent
+and related files should use the same date and name:
+`2022-06-13-combined-enriched.csv`, `2022-06-13-combined-README.md`, and so
+forth.
+
+1. Add a CSV of extracted data named according to the pattern described above;
+   e.g., `2022-06-13-combined.csv`.
+2. Copy the `Workflow-README-template.md` to the folder following the name
+   pattern; e.g., `2022-06-13-combined-README.md`.
+3. Edit the README for the current set of data.
+4. Once the data has been cleaned and reconciled, add an `enriched` version of
+   the CSV; e.g., `2022-06-13-combined-enriched.csv`.
+5. Add notes about enrichment to the README.
+6. When the data is imported, add an `imported` file to the `workflow` folder;
+   e.g., `2022-06-13-combined-imported.csv`.
+7. Add notes about the import to the README.
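
The naming pattern in the steps above can be sketched in Ruby. This is only an illustration: `workflow_names` is a hypothetical helper that does not exist in the repository, and it assumes the date is supplied as a `YYYY-MM-DD` string.

```ruby
require 'date'

# Hypothetical helper (not part of this commit): build the file names the
# README describes for one batch of data in the `workflow` directory.
def workflow_names(date, signifier)
  # Date.iso8601 raises an ArgumentError (Date::Error on newer Rubies) for
  # strings that are not valid ISO 8601 dates.
  stamp = Date.iso8601(date).strftime('%Y-%m-%d')
  base  = "#{stamp}-#{signifier}"
  {
    extracted: "#{base}.csv",
    readme:    "#{base}-README.md",
    enriched:  "#{base}-enriched.csv",
    imported:  "#{base}-imported.csv"
  }
end

# Print the four file names for one batch.
puts workflow_names('2022-06-13', 'combined').values
```

Deriving all four names from one date and signifier keeps the related files next to each other when the `workflow` directory is sorted by name.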

 ## Proposed alternate workflow

@@ -84,3 +111,31 @@ C. Data import
 - Data import and CSV update: The file `import/batch-<DATE>/clean-recon.csv` is
   imported into DS, and the output CSV with DS IDs is added to
   `import/batch-<DATE>` as `imported.csv`.
+
+## Splitting files
+
+CSVs in the workflow directory should be split into institution-specific
+directories. The `split_data.rb` script splits the CSV on the QID in the
+`holding_institution` column and puts each file in the folder defined in the
+`config.yml` file.
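
A minimal sketch of that split, assuming the output files are named after the source file plus the institution folder, as in the directory listing (e.g., `2022-02-23-combined-burke.csv`). It is not the actual `split_data.rb`: `split_by_institution`, the `DIRECTORIES` hash (a stand-in for `config.yml`), and the `shelfmark` column are hypothetical, and nothing is written to disk.

```ruby
require 'csv'

# Stand-in for the QID-to-folder mapping that config.yml provides.
DIRECTORIES = { 'Q814779' => 'beinecke', 'Q63969940' => 'burke' }.freeze

# Hypothetical helper: group the rows of a CSV string by institution QID and
# return a hash of output path => CSV text for that institution.
def split_by_institution(csv_text, source_name, qid_col: 'holding_institution')
  table = CSV.parse(csv_text, headers: true)
  stem  = File.basename(source_name, '.csv')
  table.group_by { |row| row[qid_col] }.each_with_object({}) do |(qid, rows), out|
    dir  = DIRECTORIES.fetch(qid) # KeyError for a QID missing from the mapping
    path = File.join('member-data', dir, "#{stem}-#{dir}.csv")
    out[path] = CSV.generate do |csv|
      csv << table.headers            # repeat the header row in every file
      rows.each { |row| csv << row.fields }
    end
  end
end

sample = <<~CSV
  holding_institution,shelfmark
  Q814779,MS 1
  Q63969940,MS 2
  Q814779,MS 3
CSV

split_by_institution(sample, '2022-02-23-combined.csv').each do |path, text|
  puts path
  puts text
end
```

Returning the paths and contents instead of writing files makes the grouping easy to check before anything lands in `member-data`.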
+
+### `config.yml`
+
+The configuration file contains the QID, name and a single-word folder for each
+institution. New repositories should be added to the configuration. The format
+of the entries is like so:
+
+```yaml
+---
+- :qid: Q814779
+  :name: Beinecke Rare Book & Manuscript Library
+  :directory: beinecke
+- :qid: Q995265
+  :name: Bryn Mawr College
+  :directory: brynmawr
+- :qid: Q63969940
+  :name: Burke Library at Union Theological Seminary
+  :directory: burke
+```
+
+`split_data.rb` validates the config file and the CSV.
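
The checks named in the script's help text (no duplicates in the config, required columns present, every row has a QID, every QID is in the config) might look roughly like the sketch below. It is an assumption-laden illustration, not the script's own code: `validation_errors` is a hypothetical function, the 'as recorded' column name `holding_institution_as_recorded` is a guess, and treating "duplicates" as repeated QIDs or directories is also an assumption.

```ruby
require 'csv'
require 'yaml'

# The entries use Symbol keys (:qid, :name, :directory), so Symbol must be
# permitted when loading with safe_load.
CONFIG = YAML.safe_load(<<~YAML, permitted_classes: [Symbol])
  ---
  - :qid: Q814779
    :name: Beinecke Rare Book & Manuscript Library
    :directory: beinecke
  - :qid: Q995265
    :name: Bryn Mawr College
    :directory: brynmawr
YAML

# Hypothetical validator: returns a list of error messages, empty when valid.
def validation_errors(config, table, qid_col: 'holding_institution',
                      name_col: 'holding_institution_as_recorded')
  errors = []
  # 1. No duplicate QIDs or directories in the config.
  %i[qid directory].each do |key|
    values = config.map { |entry| entry[key] }
    dups   = values.select { |v| values.count(v) > 1 }.uniq
    errors << "duplicate #{key} in config: #{dups.join(', ')}" unless dups.empty?
  end
  # 2. The CSV has both required columns; row checks need them, so stop here.
  missing = [qid_col, name_col] - table.headers
  return errors << "missing columns: #{missing.join(', ')}" unless missing.empty?

  # 3 and 4. Every row has a QID, and every QID appears in the config.
  known = config.map { |entry| entry[:qid] }
  table.each.with_index(2) do |row, line| # the header row counts as line 1
    qid = row[qid_col].to_s.strip
    if qid.empty?
      errors << "line #{line}: no institution QID"
    elsif !known.include?(qid)
      errors << "line #{line}: unknown QID #{qid}"
    end
  end
  errors
end

table = CSV.parse(<<~CSV, headers: true)
  holding_institution,holding_institution_as_recorded
  Q814779,Beinecke Library
  ,Unknown owner
  Q42,Some other library
CSV
puts validation_errors(CONFIG, table)
```

Collecting every problem before reporting, rather than stopping at the first one, lets a whole CSV be corrected in one pass.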

‎split_data.rb‎

Lines changed: 25 additions & 2 deletions
@@ -7,7 +7,7 @@
 require 'logger'

 LOGGER = Logger.new STDOUT
-LOGGER.level = (Logger::DEBUG || ENV['DS_LOGLEVEL'])
+LOGGER.level = (ENV['DS_LOGLEVEL'] || Logger::DEBUG)

 OUT_DIR = File.expand_path '../member-data', __FILE__
 QID_DEFAULT = 'holding_institution'
@@ -115,7 +115,13 @@ def validate_config config
 end

 ARGV.options do |opts|
-  opts.banner = "Usage: #{File.basename __FILE__} [OPTIONS] CSV_TO_SPLIT"
+  opts.banner = <<~EOF
+    Usage: #{File.basename __FILE__} [OPTIONS] CSV_TO_SPLIT
+
+    Split CSV_TO_SPLIT by institution QIDs and put into institution folders in
+    'member-data'.
+
+  EOF

   q_msg = %Q{Institution QID column; default: #{QID_DEFAULT} }
   opts.on '-q', '--qid-column COLUMN', q_msg do |qid|
@@ -136,6 +142,23 @@ def validate_config config
     options[:verbose] = verbose
   end

+  opts.on('-h', '--help', 'Prints this help') do
+    puts opts
+    puts <<~EOF
+
+      Institution folders are defined in 'config.yml'.
+
+      Validation confirms that:
+
+      1. The `config.yml` file has no duplicates
+      2. The CSV has institution QID and 'as recorded' columns
+      3. All rows in the CSV have institution QIDs
+      4. All the QIDs in the CSV are in `config.yml`
+
+    EOF
+    exit
+  end
+
   opts.parse!
 end
 qid_col = options[:qid]
File renamed without changes.
File renamed without changes.
