iDigBioWebinar May2015

This page describes how to set up and run the Kurator project's FP-Akka software for quality control. It was written to support an iDigBio webinar given on 28 May 2015. More current and comprehensive information is available in the User Documentation.

Support is also available through the Public discussion list: https://lists.illinois.edu/lists/info/kurator

Introductory Example

Resources for the iDigBio webinar demonstrating an FP-Akka workflow for data quality control, held 2-4 pm EDT on 28 May 2015 (listen to recording of webinar). Slides for the introductory talk on data quality control.


Example result: QC report spreadsheet: Output_demoset.xls. Look around in this spreadsheet after reading the brief text in its introductory sheet. We'll do so quickly at the beginning of the webinar.

Example input: occurrence.txt file from a DwC Archive: occurrence_demoset.txt. Running the software against this file produces the human-centric data cleaning results in the spreadsheet above. The webinar will show you how, following the instructions below.

Preparation

The use described below executes workflow applications published in Java Archives (JARs) and requires a Java Runtime Environment (JRE) installed on your computer to run the contents of the JAR.

Microsoft Windows users can find more help on using the command line environment here and Mac users here. Linux users: We suppose you'll know what to do :-)

Kurator-Akka requires Java version 1.7 or higher. To determine the version of Java installed on your computer, use the -version option of the java command. For example:

$ java -version
java version "1.7.0_67"
Java(TM) SE Runtime Environment (build 1.7.0_67-b01)
Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode)
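
The same check can be scripted. The following is a minimal sketch for OSX/Linux shells and is not part of FP-Akka; the function names java_major, installed_java_version, and java_ok are made up for illustration. It assumes the first line of the java -version output carries the version in double quotes, as in the example above.

```shell
# Extract the major Java version from a version string such as "1.7.0_67".
# Legacy strings start with "1." (1.7 -> 7, 1.8 -> 8); newer ones do not (11.0.2 -> 11).
java_major() {
  v=$1
  case "$v" in
    1.*) v=${v#1.} ;;
  esac
  # Keep only the leading run of digits.
  echo "${v%%[!0-9]*}"
}

# Read the quoted version string from `java -version` (which prints to stderr).
installed_java_version() {
  java -version 2>&1 | sed -n '1s/.*"\(.*\)".*/\1/p'
}

# Succeed if the given version string meets the Java 1.7 minimum.
java_ok() {
  [ "$(java_major "$1")" -ge 7 ]
}
```

With Java on the path, `java_ok "$(installed_java_version)" && echo "Java is new enough"` reports whether the minimum is met.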

On Windows, you will probably need to find where Java is installed. Look in C:\Program Files\Java\jre{version}\bin\java, then run it from the command line using the full path, for example:

"C:\Program Files\Java\jdk1.8.0_45\bin\java" -version

Jar with workflow: Download from FP-Akka on SourceForge. (This will take you to a page on SourceForge; your download will start after a few seconds.)

Jar with spreadsheet post processor: Download from fp-postprocess on SourceForge. (This will take you to a page on SourceForge; your download will start after a few seconds.) After downloading, unzip this file to extract postprocess.jar and postprocess.properties.

The instructions assume you can open a command window and that you can arrange to move the downloaded files to the folder that window sees[1]. Typically the default download directory is not where your command window opens. Microsoft Windows users can find more help here and Mac users here. Linux users: We suppose you'll know what to do :-)

Mechanics of Running the software

Option 1: From a command prompt

Open a shell/command prompt and move to the directory where you downloaded these files.

cd Downloads

Then try running the workflow. The quotes are needed in a Windows command prompt (unless you set the path for Java; see WorkFlowCommandLineWindowSetup):

Windows:

"C:\Program Files\Java\jdk1.8.0_45\bin\java" -jar "FP-Akka-1.4.0-workflowstarter.jar" -w dwca -i occurrence_demoset.txt -o output.json -a COL

OSX/Linux:

java -jar FP-Akka-1.4.0-workflowstarter.jar -w dwca -i occurrence_demoset.txt -o output.json -a COL

Example session on Windows:

C:\Users\mole\Downloads\webinar>"c:\Program Files\Java\jdk1.8.0_45\bin\java" -jar "FP-Akka-1.4.0-workflowstarter.jar" -w dwca -i occurrence_demoset.txt -o output.json -a COL

Selected Workflow: dwca
NewScientificNameValidator authority: COL
NewScientificNameValidator taxonomicMode: false
NewScientificNameValidator insertGUID: true
NewScientificNameValidator will make pull requests to RepointableActorRef
Read initial 13 records.
Read a total of 13 records.
Stopped Reader, processing remaining records.
Stopped ScinRefValidator
Stopped BasisOfRecordValidator
Stopped DateValidator
Stopped GeoRefValidator
Stopped MongoSummaryWriter
Wrote out 13 records
7785

C:\Users\mole\Downloads\webinar>

Then run the postprocessor to create a spreadsheet:

"c:\Program Files\Java\jdk1.8.0_45\bin\java" -jar "postprocess.jar" -i output.json -o output.xls

Then open the spreadsheet.
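
On OSX/Linux the two steps can be chained in a small shell function. This is a sketch only: the function name run_qc and the _qc.json/_qc.xls output naming are assumptions made for illustration, and it assumes both jars sit in the current directory with java on the path. It removes a leftover JSON file first, because Option 2 below notes that the workflow won't run if the output file exists.

```shell
# Run the dwca workflow, then the postprocessor, on one occurrence file.
# Usage: run_qc INPUT_FILE [AUTHORITY]   (AUTHORITY defaults to COL)
run_qc() {
  input=$1
  authority=${2:-COL}
  base=${input%.txt}          # strip a trailing ".txt" if present
  json="${base}_qc.json"      # hypothetical name for the intermediate file
  xls="${base}_qc.xls"        # hypothetical name for the spreadsheet

  # Remove a leftover intermediate file from an earlier run.
  rm -f "$json"

  # Only postprocess if the workflow command exits successfully.
  java -jar FP-Akka-1.4.0-workflowstarter.jar -w dwca -i "$input" -o "$json" -a "$authority" &&
    java -jar postprocess.jar -i "$json" -o "$xls"
}
```

For the demo data, `run_qc occurrence_demoset.txt` would leave occurrence_demoset_qc.xls next to the input file.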

Option 2: By Editing properties files

Double click on FP-Akka-1.4.0-workflowstarter.jar to run it. (You may need to right click the jar file and choose Properties -> General -> Unblock before it will run.)

Open analysis.properties in Notepad, edit it, and save.

analysis.input=occurrence_demoset.txt
analysis.taxonomicMode=false
analysis.output=occurrence_qc.json
analysis.authority=COL
analysis.sciNameValidatorOnly=false
analysis.workflow=DwCa

Delete occurrence_qc.json if it exists (the workflow won't run if the output file exists).

Double click on FP-Akka-1.4.0-workflowstarter.jar to run the workflow.

Open postprocess.properties in Notepad, edit it, and save.

postprocess.inputFile=occurrence_qc.json
postprocess.outputFile=occurrence.xls
postprocess.actionableItemsOnly=false

Double click on postprocess.jar to run the postprocessor. This will create a file occurrence.xls (and will overwrite an existing file of that name). Open this file to see the results.
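
On OSX/Linux the two properties files can be written from a script instead of an editor. A minimal sketch using only the keys shown above; the function names write_analysis_properties and write_postprocess_properties are hypothetical, and it assumes the jars pick up these files from the current directory.

```shell
# Write analysis.properties for the workflow jar.
# Usage: write_analysis_properties INPUT OUTPUT_JSON [AUTHORITY]
write_analysis_properties() {
  cat > analysis.properties <<EOF
analysis.input=$1
analysis.taxonomicMode=false
analysis.output=$2
analysis.authority=${3:-COL}
analysis.sciNameValidatorOnly=false
analysis.workflow=DwCa
EOF
}

# Write postprocess.properties for the postprocessor jar.
# Usage: write_postprocess_properties INPUT_JSON OUTPUT_XLS
write_postprocess_properties() {
  cat > postprocess.properties <<EOF
postprocess.inputFile=$1
postprocess.outputFile=$2
postprocess.actionableItemsOnly=false
EOF
}
```

Calling `write_analysis_properties occurrence_demoset.txt occurrence_qc.json` followed by `write_postprocess_properties occurrence_qc.json occurrence.xls` reproduces the two files shown above.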

How to tell the software what to do

Command Line and Properties file Options

FP-Akka-1.4.0-workflowstarter.jar

Workflow to run: -w dwca or csv (not case sensitive).

Taxonomic authority to consult first: -a IPNI or IF or COL or WoRMS or GBIF or GLOBALNAMES (not case sensitive).

Input file (csv): -i {input filename}

Intermediate output file (json): -o {output filename}, usually occurrence_qc.json

Look up names in current use (taxonomic mode): -t

postprocess.jar

Input file (json): -i {input filename}, the output json file from FP-Akka.

Output spreadsheet (xls): -o {output filename}

Limit output to actionable items (exclude results that don't have QC problems): -a
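
Option values can be checked before the workflow is launched. The sketch below, for OSX/Linux shells, accepts only the workflow and authority names listed above, ignoring case. The function names valid_workflow, valid_authority, and fp_akka are hypothetical, and how the jar itself reacts to an unknown value is not covered here.

```shell
# Succeed if $1 is one of the listed workflows (case-insensitive).
valid_workflow() {
  case "$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')" in
    dwca|csv) return 0 ;;
    *) return 1 ;;
  esac
}

# Succeed if $1 is one of the listed taxonomic authorities (case-insensitive).
valid_authority() {
  case "$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')" in
    IPNI|IF|COL|WORMS|GBIF|GLOBALNAMES) return 0 ;;
    *) return 1 ;;
  esac
}

# Launch the workflow only if both values pass the checks.
# Usage: fp_akka WORKFLOW AUTHORITY INPUT OUTPUT_JSON [extra options such as -t]
fp_akka() {
  [ "$#" -ge 4 ] || { echo "usage: fp_akka WORKFLOW AUTHORITY INPUT OUTPUT_JSON [options]" >&2; return 2; }
  valid_workflow "$1" || { echo "unknown workflow: $1" >&2; return 2; }
  valid_authority "$2" || { echo "unknown authority: $2" >&2; return 2; }
  w=$1 a=$2 i=$3 o=$4
  shift 4
  # Any remaining arguments (for example -t) are passed through unchanged.
  java -jar FP-Akka-1.4.0-workflowstarter.jar -w "$w" -a "$a" -i "$i" -o "$o" "$@"
}
```

For example, `fp_akka dwca WoRMS occurrence.txt occurrence_qc.json -t` runs the dwca workflow in taxonomic mode with WoRMS consulted first.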

More Example Datasets

Or: Download a data set from a Symbiota Portal

Note: On Windows machines we are seeing some postprocessing failures related to character set encodings, with some data from some portals.

  1. Run a query on the portal (e.g. search on SCAN or NEVP), and click the download icon in the upper right of the search results.
  2. On the Specimen Download page, select:
    1. Structure: Darwin Core
    2. Data Extensions: None (leave both unchecked)
    3. File Format: Tab Delimited
    4. Character Set: UTF-8
    5. Compression: None (leave Compressed Zip File unchecked)
  3. Click Download Data.
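
Given the character set note above, it may be worth confirming that a download really is valid UTF-8 before running the workflow. A sketch for OSX/Linux, assuming the iconv utility is installed; the function name is_utf8 is hypothetical, and passing this check does not guarantee that postprocessing will succeed.

```shell
# Succeed if the file decodes cleanly as UTF-8.
# iconv exits non-zero when it meets a byte sequence that is invalid in the source encoding.
is_utf8() {
  iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}
```

For example, `is_utf8 occurrence.txt || echo "occurrence.txt is not valid UTF-8"`.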

Or: Download a data set from the iDigBio Portal

Visit: https://www.idigbio.org/portal/search

  1. Enter search terms, and run a query on the portal.
  2. Enter your email to download, and click download.
  3. Download your data when ready, either from the page if a small dataset, or from the email if a large dataset.
  4. Unzip the downloaded archive, and run FP-Akka on the occurrence.txt file.
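
Step 4 can be sketched for OSX/Linux as follows, assuming the unzip utility is installed, that the archive holds occurrence.txt at its top level, and that the FP-Akka jar is in the current directory. The function name qc_idigbio_archive, the default folder name, and the occurrence_qc.json output location are choices made for this illustration.

```shell
# Extract occurrence.txt from a downloaded archive and run the dwca workflow on it.
# Usage: qc_idigbio_archive ARCHIVE.zip [WORK_DIR]
qc_idigbio_archive() {
  zip=$1
  dir=${2:-idigbio_download}   # hypothetical default folder for the extracted file
  mkdir -p "$dir" &&
    unzip -o "$zip" occurrence.txt -d "$dir" &&
    java -jar FP-Akka-1.4.0-workflowstarter.jar -w dwca \
      -i "$dir/occurrence.txt" -o "$dir/occurrence_qc.json" -a COL
}
```

The resulting occurrence_qc.json can then be passed to postprocess.jar as described earlier on this page.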

For More Details

See the User_Documentation.

References

  1. Java experts: it suffices to download into your classpath