FP-Akka User Documentation

From Kurator

Support for software developed by the Kurator project is also available through the Public discussion list: https://lists.illinois.edu/lists/info/kurator.


Using the 1.6.0 Release of FP-Akka

See also: CNH_Workshop_June2015 and the iDigBio Webinar

Overview

FP-Akka dwca workflow

The purpose of the data curation software described here is to take a set of natural science collections data and to produce a data quality report on it. The software examines your data records for internal consistency, checks them against external services (such as GeoLocate), identifies potential problems, and in some cases proposes corrections that you may apply to your data.

In general the process begins with a flat set of occurrence records, e.g. as extracted from a DarwinCore archive of your data. A small number of steps are then followed in order:

  • Run a piece of software (a workflow) that reads your data set, analyses it, compares the data to several authoritative services, and writes out a machine readable data quality report about that data set.
  • Then run a piece of software that converts this machine readable report into a human readable Quality Control report that summarizes issues with the scientific names, georeferences, and collecting dates found in the data set. This output is a human readable XLS spreadsheet.
  • In the spreadsheet you can find proposed corrections to your data, records that have been marked as problematic, and records that do not meet the requirements for validation (e.g. records that lack a collecting event date can't have that date validated).

Two data quality workflows are provided, one that examines the values of the scientific name, georeference, and date collected in a dataset, and another that checks lists of scientific names against authoritative sources.

In what follows we explain how to load and execute tools that accomplish the above in command line interfaces known to work on recent versions of Microsoft Windows, Macintosh, and Linux platforms. We do so for two related workflows for which we document configurations and execution options suitable for several different use cases.

Two workflows are supported by FP-Akka, one intended to check occurrence records, the other to check a local list of scientific names. (There is also a third workflow, used internally by FilteredPush, that checks occurrence records, but loads data from and writes results to MongoDB rather than files).


Workflow 1 (All-in-1: SciName-Georef-Date (DwCa))

This is a composition of a csv loader, a scientific name validator, a georeference validator, a date validator, a basis of record validator, and a json output writer. It is configurable to run the scientific name validator against any of several nomenclatural or taxonomic authorities in a nomenclatural (just check the name) or taxonomic (find the name in current use) mode.[1]

Following the workflow execution, the JSON output is post-processed into a spreadsheet in Excel .xls format.

Workflow 2 (Just-1: SciName Validator for Taxon Authority File)

This is a composition of a csv loader, the same scientific name validator as used in Workflow 1, and a csv output writer.

Workflow 2 is intended for checking your own taxonomic authority files against taxonomic and nomenclatural authorities in order to augment your files with GUIDs from those authorities, and to find potential problems in your authority files for human review.

Notes

Notes on recent changes

Main changes in FP-Akka version 1.6.0

  • Added actor to validate information found in dwc:Event terms including cross checking eventDate against year, month, day, startDayOfYear, endDayOfYear and verbatimEventDate. Actor is able to interpret a wide range of valid ISO date/date range/date time values that may be found in eventDate, and compare with other Event term values. Actor is able to parse a wide range of verbatimEventDate values into eventDate values to either compare with an existing event date or populate an empty eventDate. Added this eventDate actor to the dwca workflow. Underlying code includes methods for assessing date precision to assert measures or validations in FFDQ, though these are not yet exposed in the workflows.
  • KURATOR-125 Clarify comment inserted when atomic fields for scientific name are blank.
  • MCZ-159 Kurator-173 Changes and other tests based on review of results from MCZ mollusk records. Refactoring beginning towards making assertions in FFDQ framework.
  • Kurator-127 Avoid changing 'auct non X' authorship to 'X'.
  • Kurator-167, MCZ-159 Kurator-173 Fix handling of trinomials when using WoRMS fuzzy match.

Main changes in FP-Akka version 1.5.2

  • Using WoRMS or GBIF backbone taxonomy as a source, DwCa and CSV workflows will fill in higher taxonomic ranks, when values are not present in the input for the TDWG DarwinCore higher taxonomy terms kingdom, phylum, class, order, and family.

Main changes in FP-Akka version 1.5.1

  • Improved handling of marine and coastal georeference validation.
  • Able to validate georeferences when a countryCode but no country is provided.
  • Corrections to errors in georeference validation that resulted in some false negatives for out of range coordinates.
  • Improved logic and more curation comment output in georeference validation.

Main changes in FP-Akka version 1.4.6

  • Improvements to description of results in QC report.

Main changes in FP-Akka version 1.4.5

  • Improvements in reading data from DarwinCore archives (now using org.gbif dwca-io 1.23).
  • Improved handling of cases where scientificName is not populated in input, but data are provided in atomic fields.
  • Finished support for Workflow 2, check list of scientific names.

Main changes between FP-Akka versions 1.4.3 and 1.4.4

  • Support for reading data from DarwinCore archive files.
  • Experimental support for reading data from the iDigBio API
  • Improved handling of eventDate and year/month/day date fields, including handling of date ranges.
  • Homonym detection in COL and GBIF data sources
  • Fixed error in output of BasisOfRecord validation.
  • Handles a wider range of headers (case variations, including/excluding prefix) in CSV file input.

Main changes between FP-Akka versions 1.4.0 and 1.4.3:

  • Fixed issues with heap overflow when processing large data sets.
  • Added more metadata to results on georeference validation process.
  • Georeference validation compares coordinate in data with all geolocate results, not just best.
  • Improved handling of error conditions in date validation.

Notes on limitations

  • Large data sets can be processed, but rendering of quality control reports in human readable spreadsheets is limited by the number of rows that can be effectively rendered and worked with in a spreadsheet (on the order of say 60,000 records). For QC review of large data sets, it may be more effective to split the review into smaller targeted (e.g. by collection) parts.
  • The input must be either a DarwinCore archive with an occurrence core, or a csv file with a header line that uses DarwinCore terms (e.g. scientificName, scientificNameAuthorship, eventDate, recordedBy) as the column headers. For a DarwinCore archive, the column names can be interpreted from the archive metadata file, so the header requirement applies only to csv input.
  • For Workflow 2 (-w CSV -s), the input file must be a csv file with a header line, and the header line must be exactly "dbpk","scientificName","scientificNameAuthorship".
  • The output will probably not have the same order of rows as the input, and row order in the output may change from one run to the next. This is because FP-Akka uses Akka to run multiple copies of each actor in parallel, and the time required to process a row may depend on multiple calls to remote network services.
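
For large data sets, the splitting suggested above can be done with standard command line tools; the sketch below splits an occurrence file into fixed-size chunks, repeating the header line on each chunk. The chunk size, file names, and the inline two-column sample are illustrative, not part of FP-Akka itself.

```shell
# Sketch: split a large occurrence file into fixed-size chunks for separate
# QC runs, repeating the header line on each chunk. The inline two-column
# sample stands in for a real occurrence.txt; the chunk size is illustrative.
printf 'id\tscientificName\n1\ta\n2\tb\n3\tc\n4\td\n5\te\n' > occurrence.txt
header=$(head -1 occurrence.txt)
tail -n +2 occurrence.txt | split -l 2 - chunk_
for f in chunk_*; do
  # prepend the header to each chunk and give it a .txt extension
  { printf '%s\n' "$header"; cat "$f"; } > "$f.txt" && rm "$f"
done
ls chunk_*.txt
```

Each chunk can then be run through the workflow separately and the resulting spreadsheets reviewed one at a time.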

Preparation

The usage described below executes workflow applications published as Java Archives (JARs) and requires a Java Runtime Environment (JRE) installed on your computer to run the contents of the JARs.

Kurator-Akka requires Java version 1.7 or higher. To determine the version of java installed on your computer use the -version option to the java command. For example,

$ java -version
java version "1.7.0_67"
Java(TM) SE Runtime Environment (build 1.7.0_67-b01)
Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode)

Download

  1. Jar with workflows: Download from: FP-Akka (dwca) on SourceForge (This will take you to a page on SourceForge; your download will start after a few seconds.)
  2. Jar with spreadsheet post processor: Download from: fp-postprocess on SourceForge (This will take you to a page on SourceForge; your download will start after a few seconds.) After downloading, unzip this file to extract postprocess.jar and postprocess.properties.

The instructions assume you can open a command window and that you can arrange to move the downloaded files to the folder that window sees[2]. Typically the default download directory is not where your command window opens. Microsoft Windows users can find more help here and Mac users here. Linux users: We suppose you'll know what to do :-)

Input Data

The workflow engine in the FP-Akka jar is focused on DarwinCore data (either occurrences or taxa). An input file needs to be a delimited text file with a header line that lists DarwinCore terms. For occurrence data, the input can be a (tab delimited) occurrence.txt occurrence core extracted from a DarwinCore archive, a csv file that you have obtained from another source (such as a download from Symbiota), a DarwinCore archive file with an occurrence core, or a directory into which a DarwinCore archive with an occurrence core has been unzipped.

Occurrences

For occurrence data input, you can start with either a DarwinCore archive file (your own, someone else's, or the sample we link to below), or you can try out a workflow with one of the small extracted occurrence.txt files that we link to below. These will run quickly, having only a few records in them. If you choose the latter, you can skip directly to the section Or: Small example data sets.

If you mean to run the Akka validation workflows on sets of Taxa instead of occurrence data, you can skip directly to the section Taxa.

Starting with a DarwinCore archive file

Assume you have a DarwinCore Archive file, with an occurrence core, for example DarwinCore Archive file: Subset of MCZ records.

Extract Occurrence.txt from the archive:

Unzip your DarwinCore archive to extract the occurrence.txt file.

unzip {dwcarchive}.zip

For the example file linked above:

$ unzip dwca-mcz_subset_for_scan.zip

Your occurrence.txt file must have a header line. This header line must contain the DarwinCore names for the columns [1] (e.g. eventDate, recordedBy, scientificName, scientificNameAuthorship, decimalLatitude, decimalLongitude, locality, country, stateProvince, county, basisOfRecord), and can contain an id column. If columns expected by a particular quality control actor are missing or not named following these expectations, then that actor may declare all rows to have a problem.
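
One quick way to check the header before running the workflow is to list the column names one per line and compare them by eye against the DarwinCore term list. This is a sketch; the inline one-record sample stands in for your real occurrence.txt.

```shell
# Sketch: list the column headers of occurrence.txt one per line so each can
# be checked against the DarwinCore terms. The sample file written here is
# illustrative; point head at your own occurrence.txt instead.
printf 'id\tscientificName\teventDate\trecordedBy\n1\tQuercus alba\t1885-06-12\tA. Gray\n' > occurrence.txt
head -1 occurrence.txt | tr '\t' '\n'
```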

Or: Small example data sets

If you choose not to begin with a DarwinCore Archive file, you can download and use any of the following small test data sets, for which the occurrence data has already been extracted.

Or: Download a data set from a Symbiota Portal
  1. Run a query on the portal (e.g. search on SCAN or NEVP), and click the download icon in the upper right of the search results.
  2. On the Specimen Download page, select:
    1. Structure: Darwin Core
    2. Data Extensions: None (leave both unchecked)
    3. File Format: Tab Delimited
    4. Character Set: UTF-8
    5. Compression: None (leave Compressed Zip File unchecked)
  3. Click Download Data.
Or: Download a data set from the iDigBio Portal

Visit: https://www.idigbio.org/portal/search

  1. Enter search terms, and run a query on the portal.
  2. Enter your email to download, and click download.
  3. Download your data when ready, either from the page if a small dataset, or from the email if a large dataset.
  4. Unzip the downloaded archive, and run FP-Akka on the occurrence.txt file.

Taxa

If you mean to run the validator workflows against a list of Taxa, rather than Occurrences, you need a csv file comprising id, scientificName, and scientificNameAuthorship columns, such as the example below.

Example (a header line with the exact headings dbpk, scientificName, and scientificNameAuthorship is required):

dbpk,scientificName,scientificNameAuthorship
11,"Avicennia nitida","Jacquin"
109,"Trilocularia pedicellata","Guillaumin"
111,"Balanops montana","C. T. White"
126,"Balanophora zollingeri","Fawcett"
127,"Balanophora wrightii","Makino"

The column dbpk is intended to carry your local database primary key for a table that holds scientific names. It will be carried through into the output in order to allow you to match up assertions in the output with rows in your local taxon authority table. If you are checking taxon names in another context, you can leave dbpk blank, but the column must be present in the input.
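
Because the header line must match exactly, it can be worth checking it before starting a long run. This is a sketch; the taxa.csv written here is a sample created inline, so substitute your own file.

```shell
# Sketch: confirm the header line matches the exact layout Workflow 2
# requires. The taxa.csv written here is an inline sample; check your own.
cat > taxa.csv <<'EOF'
dbpk,scientificName,scientificNameAuthorship
11,"Avicennia nitida","Jacquin"
EOF
if [ "$(head -1 taxa.csv)" = 'dbpk,scientificName,scientificNameAuthorship' ]; then
  echo "header OK"
else
  echo "header mismatch" >&2
fi
```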

Running the workflow

java -jar workflowfile.jar options


Options:

-i: path_to_input_file The input file may be a .csv, .tab, .zip, or a directory containing meta.xml and occurrence.txt.

-o: path_to_output_file

-a: authority to use, one of IPNI, COL, WoRMS, IndexFungorum, GBIF

-t: use taxonomic mode (look up names in current use for provided names) (Default is nomenclatural mode (check names for nomenclatural correctness only))

-s: scientific name validator only

-w: workflow to use, one of CSV, Mongo, DwCa (use DwCa)

-h: For -s -w csv, with -a gbif or -a worms, will include higher taxa in the csv output.

Example:

$ java -jar FP-Akka-1.6.0-workflowstarter.jar -w DwCa -i occurrence.txt -o occurrence_qc.json -a COL

Workflow 1

Case 1: For Data Curators
$ java -jar FP-Akka-1.6.0-workflowstarter.jar -w DwCa -i occurrence.txt -o occurrence_qc.json 

This invocation uses the scientific name validator in nomenclatural mode. No authority for names has been specified, so GBIF's checklist bank will be used.

You can subsequently postprocess the results to focus on problems (-a in the postprocessor).

If you know that your dataset consists primarily of taxa that are likely to be found in a particular authority (e.g. WoRMS, IndexFungorum, IPNI), you can specify that authority:

$ java -jar FP-Akka-1.6.0-workflowstarter.jar -w DwCa -i occurrence.txt -o occurrence_qc.json -a WoRMS
Case 2: For Researchers
$ java -jar FP-Akka-1.6.0-workflowstarter.jar -w DwCa -i occurrence.txt -o occurrence_qc.json -t

This invocation uses the switch to run the scientific name validator in taxonomic mode (-t). No authority was specified, so GBIF's checklist bank will be used by default.

You will probably want to postprocess to focus on good data (not using -a in the postprocessor).

If you know that the data consist predominantly of taxa that are likely to be present in one of the supported authorities, for example IndexFungorum, you can specify that authority.

$ java -jar FP-Akka-1.6.0-workflowstarter.jar -w DwCa -i occurrence.txt -o occurrence_qc.json -t  -a IF

Workflow 2

To check just a list of scientific names, producing CSV output.

Requires: -w CSV -s

Options: -a IPNI, or IF, or WoRMS, or COL, or GBIF.

-i {input csv file}

-o {output csv file}

$ java -jar FP-Akka-1.6.0-workflowstarter.jar -w CSV -s -i taxa.csv -o taxa_out.csv -a IPNI

The -s switch runs only the single scientific name validator actor.

Specify a .csv file with -o, rather than a .json file as in Workflow 1.

The scientific name validator runs in nomenclatural mode (the default).

Example input:

dbpk,scientificName,scientificNameAuthorship
11,"Avicennia nitida","Jacquin"
109,"Trilocularia pedicellata","Guillaumin"
111,"Balanops montana","C. T. White"
126,"Balanophora zollingeri","Fawcett"
127,"Balanophora wrightii","Makino"
128,"Balanophora tobiracola","Makino"

No postprocessing needed.

To include higher taxa (according to WoRMS or the GBIF backbone taxonomy) in this output, include -h:

$ java -jar FP-Akka-1.6.0-workflowstarter.jar -w CSV -s -h -i taxa.csv -o taxa_out.csv -a WoRMS

Getting Help

Specify a workflow with -w and no other options to obtain help on available command line options:

$ java -jar FP-Akka-1.6.0-workflowstarter.jar -w dwca -h
Available Workflows: DwCa, CSV, MONGO (specify workflow to run with -w).
Selected Workflow: dwca
"-h" is not a valid option
 -a VAL : Authority to check scientific names against (IPNI, IF, WoRMS, COL, GBIF, GlobalNames), default GBIF.
 -i VAL : Input occurrence.txt (tab delimited occurrence core from a DwC archive) file.
 -o VAL : output JSON file
 -t     : Run scientific name validator in taxonomic mode (look up name in current use).

$ java -jar FP-Akka-1.6.0-workflowstarter.jar -w csv
Available Workflows: DwCa, CSV, MONGO (specify workflow to run with -w).
Selected Workflow: csv
 -a VAL : Authority to check scientific names against (IPNI, IF, WoRMS, COL, GBIF, GlobalNames), default GBIF.
 -h     : If checking only scientific names (-s), include higher taxa for each name in the output, if available from the selected source.
 -i VAL : Input CSV file
 -l N   : Limit on the number of records to read before stopping.
 -o VAL : output file (.json unless -s is specified, in which case .csv)
 -s     : Only check scientific names with SciNameValidator (outputs will be .csv, not .json)
 -t     : SciNameValidator taxonomicMode Mode (look up name in current use).

$ java -jar FP-Akka-1.6.0-workflowstarter.jar -w mongo -h
Available Workflows: DwCa, CSV, MONGO (specify workflow to run with -w).
Selected Workflow: mongo
Option "-h" takes an operand
 -a VAL  : Authority to check scientific names against (IPNI, IF, WoRMS, COL, GBIF, GlobalNames), default GBIF.
 -ci VAL : Input Collection in mongo to query for records to process.
 -co VAL : Output Collection in mongo into which to write results.
 -d VAL  : db
 -h VAL  : MongoDB Host
 -q VAL  : Query on Mongo collection to select records to process, e.g. {institutionCode:\"NMSU\"} 
 -t      : Run scientific name validator in taxonomic mode (look up name in current use).

Post processing

The JSON output from Workflow 1 can be converted to a human readable XLS spreadsheet with the postprocessor. You can work with the CSV output of Workflow 2 directly, though there are also postprocessing actions you may wish to take on it with other command line tools.

Svn checkout from : http://svn.code.sf.net/p/filteredpush/svn/trunk/FP-Tools/fp-postprocess (see README.txt for build instructions)

or download binary from: http://sourceforge.net/projects/filteredpush/files/fp-postprocess/releases/

java -jar postprocess.jar {options}

Options

  • -o output filename
  • -a actionable items only

For input from mongodb

  • -hostname mongodb hostname
  • -port mongodb port
  • -db mongodb database
  • -collection mongodb collection
  • -username mongodb username
  • -password mongodb password

For input from json

  • -i input filename

Workflow 1

Case 1: for data curators (focus on nomenclatural validity and problems).

java -jar postprocess.jar -a -i {json file produced by workflow}  -o datacuratorout.xls

Case 2: for researchers (focus on valid data)

java -jar postprocess.jar -i {json file produced by workflow}  -o researcherout.xls

For the example given above:

$ java -jar postprocess.jar -a -i occurrence_qc.json  -o datacuratorout.xls
Postprocessing 18 results
Saving spreadsheet: datacuratorout.xls
Done.

Workflow 2

If you use the -w CSV -s options, the output consists of a csv file. You can load this as a spreadsheet directly and work with it from there, or you can use command line tools such as grep and awk to select units of work from the dataset and to transform the output into sql update statements that update or enhance your local scientific names authority list.

Example output is:

"dbpk","scientificName","authorship","guid","status","sciNameWas","sciNameAuthorshipWas","provenance"
"109","Trilocularia pedicellata","Guillaumin","urn:lsid:ipni.org:names:103425-1","Valid","","","| can't construct sciName from atomic fields | The scientific name and authorship are correct.  Exact Match | Found name in IPNI. |  Authorship: Exact Match Similarity: 1.0"
"128","Balanophora tobiracola","Makino","urn:lsid:ipni.org:names:103300-1","Valid","","","| can't construct sciName from atomic fields | The scientific name and authorship are correct.  Exact Match | Found name in IPNI. |  Authorship: Exact Match Similarity: 1.0"
"126","Balanophora zollingeri","Fawcett","urn:lsid:ipni.org:names:103309-1","Valid","","","| can't construct sciName from atomic fields | The scientific name and authorship are probably correct, but with a different abbreviation for the author (Fawcett vs. Fawc.).  Same Author, but abbreviated differently. | Found name in IPNI. | The scientific name and authorship are probably correct, but with a different abbreviation for the author.   |  Authorship: null Similarity: 0.5714285714285714"
"111","Balanops montana","C. T. White","urn:lsid:ipni.org:names:103420-1","Valid","","","| can't construct sciName from atomic fields | More than one (2) match in IPNI on scientific name, potential homonym. | Possible match: Balanops montana C.T.White  | The scientific name and authorship are correct. | The scientific name and authorship are correct.  Exact Match | Found name in IPNI. |  Authorship: Exact Match Similarity: 1.0"
"127","Balanophora wrightii","Makino","urn:lsid:ipni.org:names:103307-1","Valid","","","| can't construct sciName from atomic fields | The scientific name and authorship are correct.  Exact Match | Found name in IPNI. |  Authorship: Exact Match Similarity: 1.0"
"11","Avicennia nitida","Jacquin","","!Validated","","","| can't construct sciName from atomic fields | More than one (2) match in IPNI on scientific name, potential homonym. | Possible match: Avicennia nitida Jacq.  | Possible match: Avicennia nitida Jacq.  | Didn't find name in IPNI. | Failed to search for the ScientificName in the same lexical group as Avicennia nitida and from source 11 by accessing GNI service at: http://gni.globalnames.org/name_resolver.xml | No match found in IPNI with failover to GNI. |  | The provided name: Avicennia+nitida is valid after checking misspelling | More than one (2) match in IPNI on scientific name, potential homonym. | Possible match: Avicennia nitida Jacq.  | Possible match: Avicennia nitida Jacq.  | Didn't find name in IPNI. | Failed to search for the ScientificName in the same lexical group as Avicennia nitida and from source 11 by accessing GNI service at: http://gni.globalnames.org/name_resolver.xml | No match found in IPNI with failover to GNI. | Fail to access GNI service | null | found synonyms but can't parse accepted name | The original SciName and Authorship cannot be curated"

One expected use is extraction of a set of problem cases to send to a domain specialist (e.g. a collection manager) for review. On linux/unix systems, grep is useful for this:

grep "\!Validated" output.csv  > outputproblems.csv

Another expected use is direct conversion of the output into sql commands to update your database (e.g. to add GUIDs from a nomenclatural authority to scientific names). Using grep to extract target lines, and then applying a regular expression to convert them to sql update statements, is useful for this. Here's an awk one liner (using the three-argument form of match(), a GNU awk extension) to convert the output into sql statements that add LSIDs to a taxonomy table, quoting the LSID so the statements are valid sql:

awk '/"Valid"/ {match($0, /^"([0-9]+)".*(urn:lsid[:a-zA-Z0-9.-]+)/, a); print "update taxonomy set guid = '\''" substr($0, a[2, "start"], a[2, "length"]) "'\'' where taxonomyid = " substr($0, a[1, "start"], a[1, "length"]) ";"}' examplenamesoutput.csv > examplenamesoutput.sql

Where the converted output would be:

update taxonomy set guid = 'urn:lsid:ipni.org:names:103425-1' where taxonomyid = 109;
update taxonomy set guid = 'urn:lsid:ipni.org:names:103300-1' where taxonomyid = 128;
update taxonomy set guid = 'urn:lsid:ipni.org:names:103309-1' where taxonomyid = 126;
update taxonomy set guid = 'urn:lsid:ipni.org:names:103420-1' where taxonomyid = 111;
update taxonomy set guid = 'urn:lsid:ipni.org:names:103307-1' where taxonomyid = 127;

Actors (what the workflows do)

FP-Akka dwca workflow

Data Loading

Load Flat DarwinCore Occurrence records from:

  1. MongoDb
  2. occurrence.txt file (tab or comma separated)
  3. DarwinCore Archive with Occurrence core
  4. Directory into which a DarwinCore archive (with an Occurrence core) has been unpacked

The FP-Akka CSV reader actor uses the Apache Commons CSV library, first trying to parse the input as tab delimited, then as comma delimited.


Load Taxon names from:

  1. csv file of id, scientific name, and scientific name authorship. (-w CSV -s)
  2. flat darwin core occurrence.txt from DwC Archive (???)

Scientific Name Validator

Function: Check scientificName and scientificNameAuthorship against nomenclatural or taxonomic authorities, fill in missing higher taxonomy. Options:

  1. For Data Curator uses: Test only for nomenclatural accuracy of scientific name string, do not assert a more recent synonym for a name.
  2. For research uses: Assert the name in current use for each name in the dataset.
  3. Optionally, provide a higher classification for each taxon name, that is, fill in any of Kingdom, Phylum, Class, Order, or Family that were not present in the input, if the selected authority is WoRMS or GBIF.


Available Nomenclatural or Taxonomic Services:

  1. IPNI (Nomenclatural only)
  2. Index Fungorum
  3. WoRMS
  4. COL (direct from COL).
  5. GBIF API: GBIF backbone taxonomy (Default).
  6. GBIF API (select a different name list).

Available Failover Services

  1. GNI (used to look for alternative authorships)
  2. GBIF backbone taxonomy

TODO: Read options and choose behaviors.

User selects primary service as an option (-a {servicename}, case insensitive).

User selects nomenclatural or taxonomic mode as an option (specify -t for taxonomic, nomenclatural mode runs by default).


Assumptions:

Limitations:

IPNI only supports nomenclatural mode. WoRMS offers a fuzzy match service. Services differ in coverage and ability to match input data.

Input Fields: dwc:scientificName, dwc:scientificNameAuthorship

Actions (For each mode):

Output Fields (For each mode):

  • dwc:scientificName
  • dwc:scientificNameAuthorship
  • dwc:taxonID - Filled in if one was provided by the authority consulted.
  • dwc:kingdom, phylum, class, order, family - Filled in, if provided by the authority consulted, when running the CSV workflow with the -s and -h options.

Georeference Validator

Function: Take a georeference and a textual locality, test whether the georeference is out of bounds, and assert a correction if a transposition of the georeference is near a GeoLocate result for the textual locality.

Available Services

  1. Geolocate

Assumptions:

Limitations: The GeoLocate service may not be able to parse every locality string, given the variety of locality descriptions; when it cannot, there is no reference coordinate to check against.

Input Fields: dwc:decimalLatitude, dwc:decimalLongitude, dwc:country, dwc:stateProvince, dwc:county, dwc:locality

Actions:

  1. Check whether the latitude and longitude are out of bounds; if they are, try transpositions of the coordinates.
  2. Obtain a reference coordinate from GeoLocate (from country, stateProvince, county, and locality) and check whether the original coordinate is close enough to it (currently defined as within 200 km).
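
The distance test in step 2 can be sketched with the haversine formula. The 200 km threshold is the one stated above; the coordinates and the exact form of the comparison are illustrative, not FP-Akka's internal code.

```shell
# Sketch of the proximity test: haversine distance between the record's
# coordinate and a GeoLocate reference coordinate, compared against the
# 200 km threshold. The coordinates are illustrative.
awk 'BEGIN {
  pi = 3.141592653589793; r = 6371          # mean Earth radius, km
  lat1 = 42.37; lon1 = -71.11               # coordinate in the record
  lat2 = 42.00; lon2 = -71.50               # GeoLocate reference coordinate
  dlat = (lat2 - lat1) * pi / 180
  dlon = (lon2 - lon1) * pi / 180
  a = sin(dlat/2)^2 + cos(lat1*pi/180) * cos(lat2*pi/180) * sin(dlon/2)^2
  d = 2 * r * atan2(sqrt(a), sqrt(1 - a))   # great-circle distance, km
  printf "%.0f km: %s\n", d, (d <= 200 ? "close enough" : "too far")
}'
```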

Output Fields: both dwc:decimalLatitude and dwc:decimalLongitude are updated if the result is "Curated" or "Filled_In"; no changes are made if the result is "Unable_to_Curate" or "Unable_Determine_Validity"

Date Validator

Function: Test that the collecting event date is a valid date, test whether it matches the start and end dates, day of year, etc., and check whether the collecting event date lies within the lifespan of the collector named in dwc:recordedBy

Available Services

  1. HUH List of Botanists http://kiki.huh.harvard.edu/databases/rdfgen.php?query=agent&name=A.%20Gray
  2. FP Entomologists SOLR index http://symbiota4.acis.ufl.edu/scan/portal/agents/
  3. Your SOLR index (not yet implemented)

Assumptions:

Limitations:

HUH List of botanists has fairly wide coverage of people who have collected and published on plants and fungi. HUH list of botanists requires an exact match on a known variant of a collector's name.

FP Entomologists SOLR index is a very limited list compiled from several entomological sources. FP Entomologists SOLR index supports fuzzy matching on names.

Input Fields: dwc:recordedBy, dwc:eventDate, dwc:year, dwc:month, dwc:day, dwc:startDayOfYear, dwc:modified

Actions:

  1. Check whether eventDate is valid or ambiguous, and correct it where a correction can be determined
  2. Compare eventDate to atomic date fields
  3. Check whether eventDate lies within the lifespan of the collector
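
Step 3 relies on the fact that ISO 8601 dates compare correctly as plain strings, so the check can be sketched as a simple comparison. The collector's dates below are illustrative of what an agent service returns (they are Asa Gray's); the comparison form is a sketch, not FP-Akka's internal code.

```shell
# Sketch of the lifespan check: ISO 8601 dates sort lexically, so an
# eventDate can be compared directly against the collector's birth and
# death dates. The dates are illustrative (Asa Gray, 1810-1888).
event="1885-06-12"
birth="1810-11-18"
death="1888-01-30"
if [[ "$event" > "$birth" && "$event" < "$death" ]]; then
  echo "eventDate within collector lifespan"
else
  echo "eventDate outside collector lifespan"
fi
```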

Output Fields: dwc:eventDate is updated if the result is "Curated" or "Filled_In"; no changes are made if the result is "Unable_to_Curate" or "Unable_Determine_Validity"

Output

Occurrences

The workflow produces a JSON document containing the corrected input data and a report from each actor on each data record, along with intermediate provenance JSON.

The MongoSummaryWriter actor can be configured (via different constructors) to output to a JSON file in the file system.

This will be post processed into a human readable spreadsheet (described above).

Taxa

Taxon Name Validation Only (Workflow 2, -w CSV -s): a csv file of higher taxa, corrected dwc:scientificName, corrected dwc:scientificNameAuthorship, the guid in the nomenclator, and a brief provenance trace.

You can work with this output directly without further postprocessing.

Example output (the column order is fixed, which makes it easy to use a regular expression to convert rows to a set of sql statements of the form update taxonomy set scientificname={}, author={}, guid={} where dbpk = {}):

"dbpk","scientificName","authorship","guid","status","sciNameWas","sciNameAuthorshipWas","provenance"
"109","Trilocularia pedicellata","Guillaumin","urn:lsid:ipni.org:names:103425-1","Valid","","","| can't construct sciName from atomic fields | The scientific name and authorship are correct.  Exact Match | Found name in IPNI. |  Authorship: Exact Match Similarity: 1.0"
"128","Balanophora tobiracola","Makino","urn:lsid:ipni.org:names:103300-1","Valid","","","| can't construct sciName from atomic fields | The scientific name and authorship are correct.  Exact Match | Found name in IPNI. |  Authorship: Exact Match Similarity: 1.0"
"126","Balanophora zollingeri","Fawcett","urn:lsid:ipni.org:names:103309-1","Valid","","","| can't construct sciName from atomic fields | The scientific name and authorship are probably correct, but with a different abbreviation for the author (Fawcett vs. Fawc.).  Same Author, but abbreviated differently. | Found name in IPNI. | The scientific name and authorship are probably correct, but with a different abbreviation for the author.   |  Authorship: null Similarity: 0.5714285714285714"
"111","Balanops montana","C. T. White","urn:lsid:ipni.org:names:103420-1","Valid","","","| can't construct sciName from atomic fields | More than one (2) match in IPNI on scientific name, potential homonym. | Possible match: Balanops montana C.T.White  | The scientific name and authorship are correct. | The scientific name and authorship are correct.  Exact Match | Found name in IPNI. |  Authorship: Exact Match Similarity: 1.0"
"127","Balanophora wrightii","Makino","urn:lsid:ipni.org:names:103307-1","Valid","","","| can't construct sciName from atomic fields | The scientific name and authorship are correct.  Exact Match | Found name in IPNI. |  Authorship: Exact Match Similarity: 1.0"
"11","Avicennia nitida","Jacquin","","!Validated","","","| can't construct sciName from atomic fields | More than one (2) match in IPNI on scientific name, potential homonym. | Possible match: Avicennia nitida Jacq.  | Possible match: Avicennia nitida Jacq.  | Didn't find name in IPNI. | Failed to search for the ScientificName in the same lexical group as Avicennia nitida and from source 11 by accessing GNI service at: http://gni.globalnames.org/name_resolver.xml | No match found in IPNI with failover to GNI. |  | The provided name: Avicennia+nitida is valid after checking misspelling | More than one (2) match in IPNI on scientific name, potential homonym. | Possible match: Avicennia nitida Jacq.  | Possible match: Avicennia nitida Jacq.  | Didn't find name in IPNI. | Failed to search for the ScientificName in the same lexical group as Avicennia nitida and from source 11 by accessing GNI service at: http://gni.globalnames.org/name_resolver.xml | No match found in IPNI with failover to GNI. | Fail to access GNI service | null | found synonyms but can't parse accepted name | The original SciName and Authorship cannot be curated"

Future Directions

Configurable Kurator Workflows.

See: https://opensource.ncsa.illinois.edu/confluence/display/KURATOR/Software+Releases

References

  1. What is the difference between nomenclature and taxonomy? http://iczn.org/content/what-difference-between-nomenclature-and-taxonomy
  2. Java experts: it suffices to download into your classpath