iDigBioWebinar May2015


This page contains a description of the setup and operation of the Kurator project's FP-Akka software for quality control in support of an iDigBio webinar given on 2015 May 28. Current and more comprehensive information is available on User Documentation.

Support is also available through the Public discussion list.

Introductory Example

Resources for the iDigBio webinar demonstrating an FP-Akka workflow for data quality control, held 2-4 pm EDT on 28 May 2015 (listen to recording of webinar). Slides for the introductory talk on data quality control.

Example result: QC report spreadsheet: Output_demoset.xls. Look around in this after reading the brief text in its introductory sheet. We'll do so quickly at the beginning of the webinar.

Example input: occurrence.txt file from a DwC Archive: occurrence_demoset.txt. Running the software against this gives rise to the above human-centric data cleaning outcomes. The webinar will show you how, following the instructions below.


The instructions below run workflow applications packaged as Java Archives (JARs), so a Java Runtime Environment (JRE) must be installed on your computer to run the contents of the JARs.

Microsoft Windows users can find more help on using the command line environment here and Mac users here. Linux users: We suppose you'll know what to do :-)

Kurator-Akka requires Java version 1.7 or higher. To determine the version of Java installed on your computer, use the -version option to the java command. For example:

$ java -version
java version "1.7.0_67"
Java(TM) SE Runtime Environment (build 1.7.0_67-b01)
Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode)
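
The same check can be scripted. The following is a minimal sketch, assuming a bash shell (Mac, Linux, or a bash environment on Windows). The function names extract_java_version, java_version_ok, and check_installed_java are hypothetical helpers written for this page, not part of FP-Akka. It relies on the fact that java -version prints its report to standard error.

```shell
#!/usr/bin/env bash

# Pull the quoted version string out of the first line of `java -version`
# output, e.g.  java version "1.7.0_67"  ->  1.7.0_67
extract_java_version() {
  sed -n '1s/.*version "\([^"]*\)".*/\1/p'
}

# Succeed (exit status 0) when the given version string is 1.7 or higher.
# Handles old-style strings (1.7.0_67) and newer ones (11.0.2, 17).
java_version_ok() {
  local version="$1" major rest minor
  major="${version%%[!0-9]*}"        # leading digits, e.g. 1 or 11
  [ -n "$major" ] || return 1        # no digits at all: cannot be checked
  [ "$major" -gt 1 ] && return 0     # 9, 11, 17, ... are all newer than 1.7
  [ "$major" -eq 1 ] || return 1     # anything below 1 is rejected
  rest="${version#"$major"}"         # e.g. .7.0_67
  rest="${rest#.}"                   # e.g. 7.0_67
  minor="${rest%%[!0-9]*}"           # e.g. 7
  [ -n "$minor" ] && [ "$minor" -ge 7 ]
}

# Check whichever `java` is on the PATH. `java -version` writes to stderr,
# hence the 2>&1 redirect.
check_installed_java() {
  local version
  version="$(java -version 2>&1 | extract_java_version)"
  if java_version_ok "$version"; then
    echo "Java $version is new enough"
  else
    echo "Java 1.7 or higher not found (detected: ${version:-none})" >&2
    return 1
  fi
}
```

After loading these functions into a shell, running check_installed_java reports whether the java command on your path meets the 1.7 requirement.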

On Windows, you will probably need to find where Java is installed. Look in C:\Program Files\Java\jre{version}\bin\java, and run it from the command line with:

"C:\Program Files\Java\jdk1.8.0_45\bin\java" -version

Jar with workflow: Download from: FP-Akka on SourceForge. (This will take you to a page on SourceForge; your download will start after a few seconds.)

Jar with spreadsheet post processor: Download from: fp-postprocess on SourceForge. (This will take you to a page on SourceForge; your download will start after a few seconds.) After downloading, unzip this file to extract postprocess.jar and

The instructions assume you can open a command window and that you can arrange to move the downloaded files to the folder that window sees[1]. Typically the default download directory is not where your command window opens.

Mechanics of Running the software

Option 1: From a command prompt

Open a shell/command prompt and move to the directory where you downloaded these files.

cd Downloads

Then try running the workflow. The quotes are needed in a Windows command prompt (unless you set the path for java; see WorkFlowCommandLineWindowSetup):


"C:\Program Files\Java\jdk1.8.0_45\bin\java" -jar "FP-Akka-1.4.0-workflowstarter.jar" -w dwca -i occurrence_demoset.txt -o output.json -a COL


If java is on your path, the shorter form is enough:

java -jar FP-Akka-1.4.0-workflowstarter.jar -w dwca -i occurrence_demoset.txt -o output.json -a COL

The workflow reports its progress as it runs, for example:

C:\Users\mole\Downloads\webinar>"c:\Program Files\Java\jdk1.8.0_45\bin\java" -jar "FP-Akka-1.4.0-workflowstarter.jar" -w dwca -i occurrence_demoset.txt -o output.json -a COL

Selected Workflow: dwca
NewScientificNameValidator authority: COL
NewScientificNameValidator taxonomicMode: false
NewScientificNameValidator insertGUID: true
NewScientificNameValidator will make pull requests to RepointableActorRef
Read initial 13 records.
Read a total of 13 records.
Stopped Reader, processing remaining records.
Stopped ScinRefValidator
Stopped BasisOfRecordValidator
Stopped DateValidator
Stopped GeoRefValidator
Stopped MongoSummaryWriter
Wrote out 13 records


Then run the postprocessor to create a spreadsheet:

"c:\Program Files\Java\jdk1.8.0_45\bin\java" -jar "postprocess.jar" -i output.json -o output.xls

Then open the spreadsheet.
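
The two steps above can be chained in a small wrapper. This is only a sketch, assuming a bash shell and that both jar files sit in the current directory. The function name run_qc and the JAVA variable (an optional override for the path to the java executable) are hypothetical, introduced here for illustration. The command-line options are the ones used in the commands above. Note that the sketch deletes any existing JSON output file first, because the workflow won't run if the output file exists.

```shell
#!/usr/bin/env bash

# Run the FP-Akka workflow, then the postprocessor, in one step.
# Arguments: input file, then optionally authority, JSON output, XLS output.
# JAVA is a hypothetical override for the path to the java executable.
run_qc() {
  local input="$1"
  local authority="${2:-COL}"
  local json="${3:-output.json}"
  local xls="${4:-output.xls}"
  local java_cmd="${JAVA:-java}"

  # The workflow will not run if its output file already exists,
  # so remove any previous intermediate file.
  rm -f "$json"

  # Step 1: run the workflow; stop here if it fails.
  "$java_cmd" -jar FP-Akka-1.4.0-workflowstarter.jar \
      -w dwca -i "$input" -o "$json" -a "$authority" || return 1

  # Step 2: turn the JSON output into a spreadsheet.
  "$java_cmd" -jar postprocess.jar -i "$json" -o "$xls"
}
```

With the function loaded, run_qc occurrence_demoset.txt would run both steps with the COL authority and write output.xls. If java is not on your path, set JAVA to its full location before calling the function.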

Option 2: By Editing properties files

Double click on FP-Akka-1.4.0-workflowstarter.jar to run it. (You may need to right click and choose Properties -> General -> Unblock before you can run the jar file.)

Open in Notepad, edit it, and save.


Delete occurrence_qc.json if it exists (the workflow won't run if the output file exists).

Double click on FP-Akka-1.4.0-workflowstarter.jar to run the workflow.

Open in Notepad, edit it, and save.


Double click on postprocess.jar to run the postprocessor. This will create a file occurrence.xls (overwriting any existing file of that name). Open this file to see the results.

How to tell the software what to do

Command Line and Properties file Options


Workflow to run: -w dwca or csv (not case sensitive).

Taxonomic authority to consult first: -a IPNI or IF or COL or WoRMS or GBIF or GLOBALNAMES (not case sensitive).

Input file (csv): -i {input filename}

Intermediate output file (json): -o {output filename}, usually occurrence_qc.json

Look up names in current use (taxonomic mode): -t


Input file (json): -i {input filename}, the output json file from FP-Akka.

Output spreadsheet (xls): -o {output filename}

Limit output to actionable items (exclude results that don't have QC problems): -a
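
When wrapping the workflow in a script, it can help to check the -w and -a values before starting Java. The following is a minimal sketch, assuming a bash shell. The function names to_upper, valid_workflow, and valid_authority are hypothetical, and the accepted values are simply copied from the list above. How FP-Akka itself responds to an unrecognised value is not covered here.

```shell
#!/usr/bin/env bash

# Upper-case a value so that comparisons are not case sensitive.
to_upper() {
  printf '%s' "$1" | tr '[:lower:]' '[:upper:]'
}

# Succeed if the value is one of the workflow names listed for -w.
valid_workflow() {
  case "$(to_upper "$1")" in
    DWCA|CSV) return 0 ;;
    *) return 1 ;;
  esac
}

# Succeed if the value is one of the authorities listed for -a.
valid_authority() {
  case "$(to_upper "$1")" in
    IPNI|IF|COL|WORMS|GBIF|GLOBALNAMES) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, valid_authority WoRMS and valid_workflow dwca both succeed, while valid_workflow xml fails, so a script can stop with a helpful message before launching the jar.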

More Example Datasets

Or: Download a data set from a Symbiota Portal

Note: we are seeing some postprocessing failures related to character set encodings on some data from some portals on Windows machines.

  1. Run a query on the portal (e.g. search on SCAN or NEVP), and click the download icon in the upper right of the search results.
  2. On the Specimen Download page, select:
    1. Structure: Darwin Core
    2. Data Extensions: None (leave both unchecked)
    3. File Format: Tab Delimited
    4. Character Set: UTF-8
    5. Compression: None (leave Compressed Zip File unchecked)
  3. Click Download Data.
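
Before running the workflow on a downloaded file, you may want to confirm that it matches the settings chosen above (UTF-8, tab delimited). The following is a rough sketch, assuming a bash shell with the standard iconv, head, and grep utilities. The function names is_utf8, has_tab_header, and check_download are hypothetical. It only rules out two simple problems and is not a full validation of a Darwin Core file.

```shell
#!/usr/bin/env bash

# Succeed if the file is valid UTF-8. iconv exits with a non-zero status
# when it meets a byte sequence that is not valid in the source encoding.
is_utf8() {
  iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Succeed if the first line (the column header row) contains a tab.
has_tab_header() {
  head -n 1 "$1" | grep -q "$(printf '\t')"
}

# Run both checks and report the first problem found.
check_download() {
  local file="$1"
  is_utf8 "$file" || { echo "$file is not valid UTF-8" >&2; return 1; }
  has_tab_header "$file" || { echo "$file does not look tab delimited" >&2; return 1; }
  echo "$file looks like UTF-8, tab-delimited text"
}
```

Calling check_download with the name of your downloaded file prints a one-line verdict and returns a non-zero status if either check fails.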

Or: Download a data set from the iDigBio Portal


  1. Enter search terms, and run a query on the portal.
  2. Enter your email to download, and click download.
  3. Download your data when ready, either from the page if a small dataset, or from the email if a large dataset.
  4. Unzip the downloaded archive, and run FP-Akka on the occurrence.txt file.

For More Details

See the User Documentation.


  1. Java experts: it suffices to download into your classpath