CNH Workshop June2015
This page describes the setup and operation of the FilteredPush and Kurator projects' FP-Akka software for quality control, in support of a workshop given at the Consortium of Northeastern Herbaria meeting on 2015 June 26. Current and more comprehensive information is available on the User Documentation page.
Support is also available through the public discussion list: https://lists.illinois.edu/lists/info/kurator
Introductory Example
See also: the iDigBio webinar of 28 May 2015 (2-4 pm EDT) demonstrating an FP-Akka workflow for data quality control (listen to the recording of the webinar).
See also: User Documentation.
Example result: QC report spreadsheet: Output_demoset.xls. Look around in this file after reading the brief text in its introductory sheet. We'll do so quickly at the beginning of the workshop.
Example input: occurrence.txt file from a DwC Archive: occurrence_demoset.txt. Running the software against this file produces the human-centric data cleaning outcomes in the spreadsheet above. The workshop will show you how, following the instructions below.
What the FP-Akka software does
Given flat DarwinCore input (as a csv file, a tab file, a DarwinCore archive .zip file with an occurrence core, or an unzipped DarwinCore archive directory with an occurrence core), FP-Akka runs a series of data quality control steps on the scientific names, georeferences, collector/collecting event date, and basis of record values found in the input. It then writes out a machine readable (json) file containing the input data with quality control assertions including proposed changes to the data, and a set of provenance information describing the basis for the quality control assertions.
Preparation
The usage described below runs workflow applications packaged as Java Archives (JARs), which requires a Java Runtime Environment (JRE) installed on your computer.
Microsoft Windows users can find more help on using the command line environment here and Mac users here. Linux users: We suppose you'll know what to do :-)
Kurator-Akka requires Java version 1.7 or higher. To determine the version of Java installed on your computer, use the -version option to the java command. For example:
$ java -version
java version "1.7.0_67"
Java(TM) SE Runtime Environment (build 1.7.0_67-b01)
Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode)
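If you would rather check the version requirement programmatically, the following is a minimal Python sketch, not part of FP-Akka. The function names `parse_java_version` and `meets_minimum` are hypothetical helpers introduced here for illustration, and the sketch assumes the version string appears in the quoted form shown in the example output above.

```python
import re

def parse_java_version(version_output):
    """Return (major, minor) from `java -version` output, or None if not found."""
    # Matches both old-style "1.7.0_67" and newer-style "17.0.1" version strings.
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    minor = int(match.group(2) or 0)
    return (major, minor)

def meets_minimum(version_output, minimum=(1, 7)):
    """True if the reported Java version is at least `minimum` (default 1.7)."""
    version = parse_java_version(version_output)
    return version is not None and version >= minimum

# Sample text copied from the example output above.
sample = '''java version "1.7.0_67"
Java(TM) SE Runtime Environment (build 1.7.0_67-b01)
Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode)'''

print(parse_java_version(sample))
print(meets_minimum(sample))
```

Note that java -version writes its report to standard error rather than standard output, so a script that captures the text from a running java command needs to read the error stream.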
On Windows, you will probably need to find where Java is installed. Look for C:\Program Files\Java\jre{version}\bin\java (or a jdk{version} directory, as in the example below), then run it from the command line using the full path in quotes:
"C:\Program Files\Java\jdk1.8.0_45\bin\java" -version
Jar with workflow: Download from: FP-Akka on SourceForge (This will take you to a page on SourceForge; your download will start after a few seconds.)
Jar with spreadsheet post processor: Download from: fp-postprocess on SourceForge (This will take you to a page on SourceForge; your download will start after a few seconds.) After downloading, unzip this file to extract postprocess.jar.
The instructions assume you can open a command window and that you can arrange to move the downloaded files to the folder that window sees[1]. Typically the default download directory is not where your command window opens.
Mechanics of Running the software
Option 1: From a command prompt
Open a shell/command prompt and move to the directory where you downloaded these files.
cd Downloads
Then try running the workflow. The quotes are needed in a Windows command prompt (unless you have set the path for Java; see WorkFlowCommandLineWindowSetup):
Windows:
"C:\Program Files\Java\jdk1.8.0_45\bin\java" -jar "FP-Akka-1.4.4-workflowstarter.jar" -w dwca -i occurrence_demoset.txt -o output.json -a COL
OSX/Linux:
java -jar FP-Akka-1.4.4-workflowstarter.jar -w dwca -i occurrence_demoset.txt -o output.json -a COL
C:\Users\mole\Downloads\webinar>"c:\Program Files\Java\jdk1.8.0_45\bin\java" -jar "FP-Akka-1.4.4-workflowstarter.jar" -w dwca -i occurrence_demoset.txt -o output.json -a COL
Selected Workflow: dwca
NewScientificNameValidator authority: COL
NewScientificNameValidator taxonomicMode: false
NewScientificNameValidator insertGUID: true
NewScientificNameValidator will make pull requests to RepointableActorRef
Read initial 13 records.
Read a total of 13 records.
Stopped Reader, processing remaining records.
Stopped ScinRefValidator
Stopped BasisOfRecordValidator
Stopped DateValidator
Stopped GeoRefValidator
Stopped MongoSummaryWriter
Wrote out 13 records
7785
C:\Users\mole\Downloads\webinar>
Then run the postprocessor to create a spreadsheet (the Windows form is shown; on OSX/Linux, plain java replaces the quoted path):
"c:\Program Files\Java\jdk1.8.0_45\bin\java" -jar "postprocess.jar" -i output.json -o output.xls
Then open the spreadsheet.
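If you expect to repeat these two steps for several input files, they can be scripted. The following is a minimal Python sketch, not part of the FP-Akka distribution. The function names `workflow_command`, `postprocess_command`, and `run_pipeline` are hypothetical. The jar names and flags are the ones used in the commands above, and the sketch assumes both jars sit in the current directory and that java is on your path (otherwise pass the full path as the `java` argument).

```python
import subprocess

def workflow_command(input_file, output_file, authority="COL", workflow="dwca",
                     taxonomic_mode=False, java="java",
                     jar="FP-Akka-1.4.4-workflowstarter.jar"):
    """Build the argument list for the FP-Akka workflow step."""
    cmd = [java, "-jar", jar, "-w", workflow, "-i", input_file,
           "-o", output_file, "-a", authority]
    if taxonomic_mode:
        cmd.append("-t")  # look up names in current use
    return cmd

def postprocess_command(json_file, spreadsheet, java="java", jar="postprocess.jar"):
    """Build the argument list for the spreadsheet postprocessor step."""
    return [java, "-jar", jar, "-i", json_file, "-o", spreadsheet]

def run_pipeline(input_file, json_file="output.json", spreadsheet="output.xls",
                 java="java"):
    """Run the workflow, then the postprocessor; stop if either step fails."""
    subprocess.run(workflow_command(input_file, json_file, java=java), check=True)
    subprocess.run(postprocess_command(json_file, spreadsheet, java=java), check=True)

# Show the command that would be run for the demo data set.
print(" ".join(workflow_command("occurrence_demoset.txt", "output.json")))
```

Passing the arguments as a list avoids the quoting problems that arise on Windows when the Java path contains spaces. Calling `run_pipeline("occurrence_demoset.txt")` would carry out both steps.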
Option 2: By Editing properties files
Double click on FP-Akka-1.4.4-workflowstarter.jar to run it. (You may need to right click and choose Properties > General > Unblock before the jar file will run.)
Open analysis.properties in Notepad, edit it, and save.
analysis.input=occurrence_demoset.txt
analysis.taxonomicMode=false
analysis.output=occurrence_qc.json
analysis.authority=COL
analysis.sciNameValidatorOnly=false
analysis.workflow=DwCa
Delete occurrence_qc.json if it exists (the workflow won't run if the output file exists).
Double click on FP-Akka-1.4.4-workflowstarter.jar to run the workflow.
Open postprocess.properties in Notepad, edit it, and save.
postprocess.inputFile=occurrence_qc.json
postprocess.outputFile=occurrence.xls
postprocess.actionableItemsOnly=false
Double click on postprocess.jar to run the postprocessor. This will create a file occurrence.xls (overwriting any existing file of that name). Open this file to see the results.
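Because the workflow will not run when its output file already exists, it can help to check the properties file before double clicking. The following is a minimal Python sketch, not part of FP-Akka. The helpers `parse_properties` and `remove_stale_output` are hypothetical, and the parser handles only the simple key=value lines shown above, not the full Java properties syntax.

```python
import os

def parse_properties(text):
    """Parse simple key=value lines; blank lines and # or ! comments are skipped."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props

def remove_stale_output(props, key="analysis.output"):
    """Delete the configured output file if present; return True if one was removed."""
    path = props.get(key)
    if path and os.path.isfile(path):
        os.remove(path)
        return True
    return False

# The analysis.properties values from the example above.
example = """analysis.input=occurrence_demoset.txt
analysis.taxonomicMode=false
analysis.output=occurrence_qc.json
analysis.authority=COL
analysis.sciNameValidatorOnly=false
analysis.workflow=DwCa
"""
props = parse_properties(example)
print(props["analysis.output"])
```

In practice you would read the text from analysis.properties with `open(...).read()` and then call `remove_stale_output(props)` before running the workflow.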
How to tell the software what to do
Command Line and Properties file Options
FP-Akka-1.4.4-workflowstarter.jar
Workflow to run: -w dwca or csv (not case sensitive).
Taxonomic authority to consult first: -a IPNI or IF or COL or WoRMS or GBIF or GLOBALNAMES (not case sensitive). IPNI has good coverage for vascular plants and IF for fungi; COL has some reasonable sources for non-vascular plants.
Input file (csv): -i {input filename}
Intermediate output file (json): -o {output filename}, usually occurrence_qc.json
Look up names in current use (taxonomic mode): -t
postprocess.jar
Input file (json): -i {input filename}, the output json file from FP-Akka.
Output spreadsheet (xls): -o {output filename}
Limit output to actionable items (exclude results that don't have QC problems): -a
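As an illustration of the case-insensitive option values listed above for the workflow starter, here is a minimal Python sketch of a pre-flight check. It is not part of FP-Akka: the name `check_options` is hypothetical, and the accepted values are simply those listed on this page.

```python
# Accepted values as listed on this page; matching is case-insensitive.
WORKFLOWS = {"dwca", "csv"}
AUTHORITIES = {"IPNI", "IF", "COL", "WORMS", "GBIF", "GLOBALNAMES"}

def check_options(workflow, authority):
    """Return a list of human-readable problems; an empty list means both values look fine."""
    problems = []
    if workflow.lower() not in WORKFLOWS:
        problems.append("unknown workflow: " + workflow)
    if authority.upper() not in AUTHORITIES:
        problems.append("unknown authority: " + authority)
    return problems

print(check_options("DwCa", "col"))
print(check_options("xml", "ITIS"))
```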
More Example Datasets
- 9 fungal records: occurrence_fungi.txt
- 9 georeferenced records: occurrence_geo.txt
- 9 molluscan records: occurrence_mol.txt
- 9 plant records: occurrence_plants.txt
Or: Download a data set from a Symbiota Portal
Note: we are seeing some issues in postprocessing with character set encodings on some data from some portals (including the CNH portal) on Windows machines.
- Run a query on the portal (e.g. search on NEVP/CNH portal), and click the download icon in the upper right of the search results.
- On the Specimen Download page, select:
  - Structure: Darwin Core
  - Data Extensions: None (leave both unchecked)
  - File Format: Tab Delimited
  - Character Set: UTF-8
  - Compression: None (leave Compressed Zip File unchecked)
- Click Download Data, and save the file to a known location under a known name.
- Give FP-Akka the location of the .tab file as input:
-i {filename.tab}
or
- Run a query on the portal (e.g. search on NEVP/CNH portal), and click the download icon in the upper right of the search results.
- On the Specimen Download page, select:
  - Structure: Darwin Core
  - Data Extensions: None (leave both unchecked)
  - File Format: Tab Delimited
  - Character Set: UTF-8
  - Compression: Check Compressed Zip File
- Click Download Data, and save the file to a known location under a known name.
- Give FP-Akka the location of the .zip DarwinCore archive file as input:
-i {filename.zip}
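Given the note above about character set encodings, it may be worth confirming that an uncompressed download really is UTF-8 before running the workflow. The following is a minimal Python sketch, not part of FP-Akka. The function name `first_invalid_utf8_line` is hypothetical, and it is only an assumption that the postprocessing issues come from bytes that are not valid UTF-8. The demonstration writes two small temporary files rather than using real portal data.

```python
import os
import tempfile

def first_invalid_utf8_line(path):
    """Return the 1-based number of the first line that is not valid UTF-8, or None."""
    with open(path, "rb") as handle:
        for number, raw in enumerate(handle, start=1):
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError:
                return number
    return None

# Demonstration with two tiny tab-delimited files: one UTF-8, one Latin-1.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.tab")
    bad = os.path.join(tmp, "bad.tab")
    text = "id\tlocality\n1\tQuébec\n"
    with open(good, "wb") as handle:
        handle.write(text.encode("utf-8"))
    with open(bad, "wb") as handle:
        handle.write(text.encode("latin-1"))
    good_result = first_invalid_utf8_line(good)
    bad_result = first_invalid_utf8_line(bad)

print(good_result, bad_result)
```

A result of None means every line decoded cleanly; a line number points at the first record to inspect.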
Or: Download a data set from the iDigBio Portal
Visit: https://www.idigbio.org/portal/search
- Enter search terms, and run a query on the portal.
- Enter your email to download, and click download.
- Download your data when ready, either from the page if a small dataset, or from the email if a large dataset.
- Unzip the downloaded archive, and run FP-Akka on the occurrence.txt file.
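The unzip step can also be done from a script. The following is a minimal Python sketch using the standard zipfile module. The function name `extract_occurrence` is hypothetical, and the sketch assumes the archive holds occurrence.txt at its top level. The demonstration builds a tiny stand-in archive rather than using a real iDigBio download.

```python
import os
import tempfile
import zipfile

def extract_occurrence(archive_path, target_dir, member="occurrence.txt"):
    """Extract the occurrence file from a downloaded zip and return its path."""
    with zipfile.ZipFile(archive_path) as archive:
        if member not in archive.namelist():
            raise FileNotFoundError(member + " not found in " + archive_path)
        return archive.extract(member, path=target_dir)

# Demonstration with a stand-in archive holding a two-line occurrence file.
with tempfile.TemporaryDirectory() as tmp:
    archive_path = os.path.join(tmp, "download.zip")
    with zipfile.ZipFile(archive_path, "w") as archive:
        archive.writestr("occurrence.txt", "id\tscientificName\n1\tQuercus alba\n")
    extracted = extract_occurrence(archive_path, tmp)
    extracted_name = os.path.basename(extracted)
    with open(extracted, encoding="utf-8") as handle:
        header = handle.readline().rstrip("\n")

print(extracted_name, header)
```

The returned path is what you would pass to FP-Akka with -i.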
For More Details
See the User Documentation.
References
- [1] Java experts: it suffices to download the files into your classpath.