DigitalScriptorium
diff --git a/README.md b/README.md
Lines changed: 76 additions & 21 deletions
diff --git a/split_data.rb b/split_data.rb
Lines changed: 25 additions & 2 deletions
diff --git a/workflow/2022-02-23-README.md b/workflow/2022-02-23-combined-README.md
workflow/2022-02-23-README.md renamed to workflow/2022-02-23-combined-README.md
diff --git a/workflow/2022-04-04-README.md b/workflow/2022-04-04-combined-README.md
workflow/2022-04-04-README.md renamed to workflow/2022-04-04-combined-README.md
@@ -2,33 +2,60 @@
 
 Repository for DS CSV data.
 
-New files should be called: `DATE-name.csv`, where `DATE` is the date the file was created in `YYYY-MM-DD` format and `name` is a one-word description, like `jhu`, `combined`, `upenn`, etc.
+New files should be named `DATE-name.csv`, where `DATE` is the date the file was created in `YYYY-MM-DD` format and `name` is a description, like `jhu`, `combined`, `upenn`, etc.
 
 Directory structure:
 
 ```text
-/ # root folder
-- README.md
-- Workflow-README-template.md
-- workflow
-    - 2022-02-23-README.md
-    - 2022-02-23-combined.csv
-    - 2022-02-23-combined-enriched.csv
-    - 2022-02-23-combined-imported.csv
-- member-data
-    - beinecke
-    - harvard
-    - nyu
-        - 2021-07-06-nyu.csv
-        - 2021-07-06-nyu-enriched.csv
-        - 2021-07-06-nyu-imported.csv
-        - 2022-06-06-nyu.csv
-        - 2022-06-06-nyu-enriched.csv
-        - 2022-06-06-nyu-imported.csv
+.
+├── README.md
+├── Workflow-README-template.md
+├── config.yml
+├── member-data
+│     ├── burke
+│     │     ├── 2022-02-23-combined-burke.csv
+│     │     ├── 2022-02-23-combined-enriched-burke.csv
+│     │     └── 2022-04-04-combined-burke.csv
+│     ├── ccny
+│     │     ├── 2022-02-23-combined-ccny.csv
+│     │     ├── 2022-02-23-combined-enriched-ccny.csv
+│     │     └── 2022-04-04-combined-ccny.csv
+│     ├── columbia
+etc.
+├── split_data.rb
+├── test
+│     ├── missing_inst_name.csv
+│     ├── missing_qid.csv
+│     └── unknown_qid.csv
+└── workflow
+    ├── 2022-02-23-combined-README.md
+    ├── 2022-02-23-combined-enriched.csv
+    ├── 2022-02-23-combined.csv
+    ├── 2022-04-04-combined-README.md
+    ├── 2022-04-04-combined-enriched.csv
+    └── 2022-04-04-combined.csv
+
   ```
-  
-  - [ ] TODO: Add script for splitting and placing member-data files in place
 
+## Adding data
+
+New files should be added to the `workflow` directory and named using the date
+and a signifier describing the data; e.g., `2022-06-13-combined.csv`. Subsequent
+and related files should use the same date and name:
+`2022-06-13-combined-enriched.csv`, `2022-06-13-combined-README.md`, and so
+forth.
+
+1. Add a CSV of extracted data named according to the pattern described above;
+   e.g., `2022-06-13-combined.csv`.
+2. Copy the `Workflow-README-template.md` to the folder following the name
+   pattern; e.g., `2022-06-13-combined-README.md`.
+3. Edit the README for the current set of data.
+4. Once the data has been cleaned and reconciled, add an `enriched` version of
+   the CSV; e.g., `2022-06-13-combined-enriched.csv`.
+5. Add notes about enrichment to the README.
+6. When the data is imported, add an `imported` file to the `workflow` folder;
+   e.g., `2022-06-13-combined-imported.csv`.
+7. Add notes about the import to the README.
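
Review note: the naming pattern in the steps above can be sketched as a small helper. This is only an illustration; `workflow_filenames` is a hypothetical name and is not part of this repository or of `split_data.rb`.

```ruby
require 'date'

# Hypothetical helper (not in the repository): builds the file names used in
# the 'Adding data' steps from a batch date and a descriptive name.
def workflow_filenames(date, name)
  prefix = "#{date.strftime('%Y-%m-%d')}-#{name}"
  {
    extracted: "#{prefix}.csv",          # step 1
    readme:    "#{prefix}-README.md",    # step 2
    enriched:  "#{prefix}-enriched.csv", # step 4
    imported:  "#{prefix}-imported.csv"  # step 6
  }
end

names = workflow_filenames(Date.new(2022, 6, 13), 'combined')
names.each_value { |file_name| puts file_name }
```

Keeping the date and name in one prefix means every file in a batch sorts together in the `workflow` folder.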
 
 ## Proposed alternate workflow
 
@@ -84,3 +111,31 @@ C. Data import
 - Data import and CSV update: The file `import/batch-<DATE>/clean-recon.csv` is
   imported into DS, and the output CSV with DS IDs is added to
   `import/batch-<DATE>` as `imported.csv`.
+
+## Splitting files
+
+CSVs in the `workflow` directory should be split into institution-specific
+directories. The `split_data.rb` script splits the CSV on the QID in the
+`holding_institution` column and puts each resulting file in the folder
+defined for that institution in the `config.yml` file.
+
+### `config.yml`
+
+The configuration file contains the QID, name and a single-word folder for each
+institution. New repositories should be added to the configuration. The format
+of the entries is like so:
+
+```yaml
+---
+- :qid: Q814779
+  :name: Beinecke Rare Book & Manuscript Library
+  :directory: beinecke
+- :qid: Q995265
+  :name: Bryn Mawr College
+  :directory: brynmawr
+- :qid: Q63969940
+  :name: Burke Library at Union Theological Seminary
+  :directory: burke
+```
+
+`split_data.rb` validates the config file and the CSV; run
+`split_data.rb --help` for the list of checks.
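
Review note: for readers who want to see the idea behind the split, here is a minimal sketch, assuming the config format shown above. It is not the code in `split_data.rb`; `split_by_institution`, the sample rows, and the `shelfmark` column are made up for illustration, and the sketch groups rows in memory instead of writing files.

```ruby
require 'csv'
require 'yaml'

# Two entries copied from the README example.
CONFIG_YAML = <<~YAML
  ---
  - :qid: Q814779
    :name: Beinecke Rare Book & Manuscript Library
    :directory: beinecke
  - :qid: Q63969940
    :name: Burke Library at Union Theological Seminary
    :directory: burke
YAML

# Invented sample data; 'shelfmark' is a hypothetical column.
SAMPLE_CSV = <<~CSV
  holding_institution,shelfmark
  Q814779,MS 1
  Q63969940,MS 2
  Q814779,MS 3
CSV

# Hypothetical helper: groups rows by QID and keys each group by the
# directory configured for that QID. An unknown QID raises KeyError.
def split_by_institution(csv_text, config, qid_col = 'holding_institution')
  dirs = config.to_h { |entry| [entry[:qid], entry[:directory]] }
  rows = CSV.parse(csv_text, headers: true)
  rows.group_by { |row| row[qid_col] }
      .to_h { |qid, group| [dirs.fetch(qid), group.map(&:to_h)] }
end

# The config uses symbol keys, so Symbol must be permitted when loading.
config = YAML.safe_load(CONFIG_YAML, permitted_classes: [Symbol])
split  = split_by_institution(SAMPLE_CSV, config)
split.each { |dir, rows| puts "#{dir}: #{rows.size} row(s)" }
```

Using `fetch` for the directory lookup makes a QID that is missing from the config fail loudly instead of producing a `nil` folder name.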
@@ -7,7 +7,7 @@
 require 'logger'
 
 LOGGER = Logger.new STDOUT
-LOGGER.level = (Logger::DEBUG || ENV['DS_LOGLEVEL'])
+LOGGER.level = (ENV['DS_LOGLEVEL'] || Logger::DEBUG)
 
 OUT_DIR              = File.expand_path '../member-data', __FILE__
 QID_DEFAULT         = 'holding_institution'
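
Review note: the operand order matters because `Logger::DEBUG` is the integer `0`, which is truthy in Ruby, so the old expression never reached `ENV['DS_LOGLEVEL']`. A small sketch of the difference; `log_level` is a hypothetical helper that takes an environment-like hash so it can be exercised without touching the real `ENV`.

```ruby
require 'logger'

# Hypothetical helper: the environment value wins, DEBUG is the fallback.
def log_level(env)
  env['DS_LOGLEVEL'] || Logger::DEBUG
end

# Old operand order: 0 is truthy, so the right-hand side is never used.
old_style = Logger::DEBUG || 'warn'

logger = Logger.new($stdout)
logger.level = log_level({ 'DS_LOGLEVEL' => 'warn' }) # level names are accepted as strings
puts "from environment: #{logger.level}"
puts "fallback:         #{log_level({})}"
puts "old expression:   #{old_style}"
```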
@@ -115,7 +115,13 @@ def validate_config config
 end
 
 ARGV.options do |opts|
-  opts.banner = "Usage: #{File.basename __FILE__} [OPTIONS] CSV_TO_SPLIT"
+  opts.banner = <<~EOF
+Usage: #{File.basename __FILE__} [OPTIONS] CSV_TO_SPLIT
+
+Split CSV_TO_SPLIT by institution QIDs and put into institution folders in
+'member-data'.
+
+EOF
 
   q_msg = %Q{Institution QID column; default: #{QID_DEFAULT}  }
   opts.on '-q', '--qid-column COLUMN', q_msg do |qid|
@@ -136,6 +142,23 @@ def validate_config config
     options[:verbose] = verbose
   end
 
+  opts.on('-h', '--help', 'Prints this help') do
+    puts opts
+    puts <<~EOF
+
+Institution folders are defined in 'config.yml'.
+
+Validation confirms that:
+
+1. The 'config.yml' file has no duplicates
+2. The CSV has institution QID and 'as recorded' columns
+3. All rows in the CSV have institution QIDs
+4. All the QIDs in the CSV are in 'config.yml'
+
+EOF
+    exit
+  end
+
   opts.parse!
 end
 qid_col = options[:qid]
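
Review note: the four checks listed in the help text could look roughly like the sketch below. This is not the script's `validate_config` or its CSV validation, which are not shown in this diff. `validation_errors` is a hypothetical helper, "duplicates" is interpreted here as duplicate QIDs, and the 'as recorded' column name is a guess.

```ruby
require 'csv'

# 'holding_institution' is the script's default QID column. The 'as recorded'
# column name below is an assumption and may differ from the real header.
QID_COL  = 'holding_institution'
NAME_COL = 'holding_institution_as_recorded'

# Hypothetical helper: returns a list of error messages; empty means valid.
def validation_errors(config, csv_text)
  errors = []

  # 1. No duplicates in the config (interpreted as duplicate QIDs).
  qids  = config.map { |entry| entry[:qid] }
  dupes = qids.select { |qid| qids.count(qid) > 1 }.uniq
  errors << "duplicate QIDs in config: #{dupes.join(', ')}" unless dupes.empty?

  # 2. The CSV has both required columns; stop here if it does not.
  table   = CSV.parse(csv_text, headers: true)
  missing = [QID_COL, NAME_COL] - table.headers
  return errors << "missing columns: #{missing.join(', ')}" unless missing.empty?

  # 3. Every row has a QID.
  csv_qids = table.map { |row| row[QID_COL].to_s.strip }
  errors << 'rows without a QID' if csv_qids.any?(&:empty?)

  # 4. Every QID in the CSV is in the config.
  unknown = csv_qids.reject(&:empty?).uniq - qids
  errors << "QIDs not in config: #{unknown.join(', ')}" unless unknown.empty?
  errors
end

config = [{ qid: 'Q814779', name: 'Beinecke Rare Book & Manuscript Library', directory: 'beinecke' }]
good   = "#{QID_COL},#{NAME_COL}\nQ814779,Beinecke\n"
bad    = "#{QID_COL},#{NAME_COL}\n,Beinecke\nQ1,Elsewhere\n"

good_errors = validation_errors(config, good)
bad_errors  = validation_errors(config, bad)
puts good_errors.inspect
puts bad_errors.inspect
```

Collecting messages instead of raising on the first failure lets one run report every problem in a CSV.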