Using Kurator


Introduction

The Kurator tools can be used at multiple places in the biodiversity data life cycle to:

  • Check data sets for internal consistency and validity
  • Check data sets against external authority resources
  • Identify potential problems or errors in the data and produce data quality reports
Kurator tools can be used in multiple places in biodiversity data workflows.



Tools

The Kurator project provides a range of tools for different sorts of biodiversity data users:

  • Kurator-web: the tool most users will want to use. The rest of this page describes where and how you may want to use Kurator-web. See the Kurator-Web User Documentation for specifics on how to use the web application.
  • Other Tools


Workflow Inputs

All current workflows accept TSV/CSV files or a Darwin Core Archive URL as input. Generally, the input files are expected to have a header line containing Darwin Core terms.

However, since the Darwinizer actor is included as the first step in all of the workflows, input data with headers that are not standard Darwin Core (e.g. Arctos data) can still be used. Running the Darwinizer workflow on the input first will help determine which fields can be converted to standard Darwin Core before running the other workflows.

The default mappings currently used in workflows that include the Darwinizer can be found in the following vocabulary file darwin_cloud.txt. See https://github.com/kurator-org/kurator-validation/wiki/Vocabulary-File-Structure for more details about the vocabulary file format.

Available Workflows

The Kurator tools are implemented as a suite of workflows that can be run via the kurator-web web application available at http://kurator.acis.ufl.edu/kurator-web/. See the about page in kurator-web for more information about the workflows available to run via the web application. If using the web application to run workflows, see the Workflows section of the kurator-web user documentation.

Listed below is a summary of the workflows available as well as information about their inputs and outputs.


Date Validator

Validates event date fields and fills in missing dates from atomic event date fields. Expects a CSV/TSV file or a URL to a Darwin Core Archive as input.
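As an illustration of the kind of amendment involved in filling a missing date, the sketch below (not the workflow's own code) builds dwc:eventDate from the atomic dwc:year, dwc:month, and dwc:day fields when it is empty; the record layout and helper name are assumptions made for the example.

  # Sketch only: fill dwc:eventDate from atomic date fields when it is empty.
  # The record layout and helper name are illustrative, not the workflow's code.
  def fill_event_date(record):
      if record.get("eventDate"):          # already populated, leave it alone
          return record
      year, month, day = record.get("year"), record.get("month"), record.get("day")
      if year and month and day:
          record["eventDate"] = "%04d-%02d-%02d" % (int(year), int(month), int(day))
      elif year and month:
          record["eventDate"] = "%04d-%02d" % (int(year), int(month))
      elif year:
          record["eventDate"] = "%04d" % int(year)
      return record

  record = {"eventDate": "", "year": "1880", "month": "5", "day": "8"}
  print(fill_event_date(record)["eventDate"])   # 1880-05-08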

This workflow runs a series of validation, measurement, and amendment tests related to the date fields in the input and produces an Excel spreadsheet report recording the assertions made by each test.

The tests that this workflow runs are listed below by assertion type.

Measures

  • EVENT_DATE_DURATION_SECONDS - Measures the duration of an event date in seconds (see the sketch after this list).
  • EVENT_DATE_COMPLETENESS - For values of dwc:eventDate, checks that the value is not empty.
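A rough sketch of how a duration measure of this kind can be computed for a simple ISO 8601 date or date range follows; real dwc:eventDate values are much more varied, and the parsing here is an assumption for illustration only.

  from datetime import datetime, timedelta

  # Sketch only: approximate duration in seconds of a simple dwc:eventDate.
  # Handles plain dates ("1880-05-08") and ranges ("1880-05-08/1880-05-10");
  # real eventDate values can be far messier than this.
  def event_date_duration_seconds(event_date):
      start_text, _, end_text = event_date.partition("/")
      start = datetime.strptime(start_text, "%Y-%m-%d")
      end = datetime.strptime(end_text, "%Y-%m-%d") if end_text else start
      # A single day spans 86400 seconds, so include the final day of the range.
      return int((end + timedelta(days=1) - start).total_seconds())

  print(event_date_duration_seconds("1880-05-08"))             # 86400
  print(event_date_duration_seconds("1880-05-08/1880-05-10"))  # 259200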

Validations

Georeference Validator

Performs validation of the georeference fields and fills in or transposes missing or inconsistent coordinates.
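One simple check of the kind this workflow performs is detecting transposed latitude and longitude values. The sketch below illustrates the general idea under an assumed rule of thumb; the field handling and threshold logic are not the workflow's actual tests.

  # Sketch only: detect a likely latitude/longitude transposition.
  # The rule of thumb used here is an assumption for illustration;
  # the workflow's own georeference tests are more thorough.
  def maybe_transpose(lat, lon):
      lat, lon = float(lat), float(lon)
      if abs(lat) > 90 and abs(lon) <= 90:
          # Latitude is out of range but longitude would fit: values were likely swapped.
          return lon, lat
      return lat, lon

  print(maybe_transpose("-122.3", "47.6"))   # (47.6, -122.3)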


Darwinizer

Takes either a csv or tsv file as input and creates a new file with as many field names standardized to Darwin Core as possible.


This workflow outputs a single file containing the data from the original file with the new standardized column headers. The format (CSV or TSV) of the output file can be specified, as can an option to include or exclude the namespace in the column headers of the output file (e.g. with namespace: "dwc:collectionCode"; without: "collectionCode").
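The sketch below shows the general idea of this kind of header standardization: non-standard column names are looked up in a vocabulary of known synonyms and replaced with Darwin Core terms, optionally prefixed with the dwc: namespace. The synonym mapping and file names are made up for the example; the workflow itself is driven by darwin_cloud.txt.

  import csv

  # Sketch only: rename column headers to Darwin Core terms using a small
  # hypothetical synonym mapping (the real workflow uses darwin_cloud.txt).
  SYNONYMS = {"coll_code": "collectionCode", "cat_num": "catalogNumber"}

  def darwinize_headers(in_path, out_path, include_namespace=False):
      with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
          reader, writer = csv.reader(src), csv.writer(dst)
          headers = next(reader)
          standardized = [SYNONYMS.get(h.strip().lower(), h) for h in headers]
          if include_namespace:
              standardized = ["dwc:" + h for h in standardized]
          writer.writerow(standardized)
          writer.writerows(reader)

  # darwinize_headers("my_collection.csv", "darwinized.csv", include_namespace=True)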


See https://github.com/kurator-org/kurator-validation/wiki/CSV-File-Darwinizer for more info.


Field Value Counter

Creates a report of counts of distinct values for fields in the input data. This workflow allows you to either upload a new file, select a previously uploaded CSV or TSV file, or provide a URL to a valid Darwin Core Archive to be used as input. Additionally, a "Field list" parameter allows the user to specify the list of fields to generate reports for as a pipe-delimited list (e.g. country|stateProvince|...).
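A minimal sketch of what this kind of counting looks like, assuming a pipe-delimited field list and CSV input, is shown below; the file and field names are placeholders, not the workflow's own code.

  import csv
  from collections import Counter

  # Sketch only: count distinct values for a pipe-delimited list of fields,
  # roughly what the count_[field].txt reports contain. File and field names
  # are placeholders for the example.
  def count_field_values(path, field_list="country|stateProvince"):
      fields = field_list.split("|")
      counters = {f: Counter() for f in fields}
      with open(path, newline="") as src:
          for row in csv.DictReader(src):
              for f in fields:
                  counters[f][row.get(f, "")] += 1
      return counters

  # for field, counts in count_field_values("occurrences.csv").items():
  #     print(field, counts.most_common(5))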


When the workflow is run a count report is generated for each field specified:

  1. Count report - count_[field].txt, containing the counts of distinct values for the field


See https://github.com/kurator-org/kurator-validation/wiki/CSV-File-Field-Value-Counter and https://github.com/kurator-org/kurator-validation/wiki/Darwin-Core-Archive-Field-Value-Counter for more details.


Controlled Field Assessor

Creates a report of counts of distinct values and provides recommended values for each of the fields present in the input data. This workflow allows you to either upload a new file, select a previously uploaded CSV or TSV file or provide a URL to a valid Darwin Core Archive to be used as input.


When the workflow is run two report types are generated for each field:

  1. Count report - count_[field].txt, containing the counts of distinct values for the field
  2. Recommendation report - recommended_[field].csv, containing the recommendations to standardize values in that field.


In addition to the reports, a single intermediary artifact used as input to later steps in the workflow provides some additional information about the vocabulary file used:

  1. Vocab lookup report - vocab_[field].txt, containing the lookup file used for standardizing that field.


See https://github.com/kurator-org/kurator-validation/wiki/CSV-File-Controlled-Field-Assessor and https://github.com/kurator-org/kurator-validation/wiki/Darwin-Core-Archive-Controlled-Field-Assessor for more info.


Geography Assessor

Creates a file containing the recommendations to standardize distinct combinations of higher geography fields. Operates on the higher geography fields present in the input data (dwc:continent, dwc:country, dwc:countryCode, dwc:stateProvince, dwc:county, dwc:municipality, dwc:waterBody, dwc:islandGroup and dwc:island). This workflow allows you to either upload a new file, select a previously uploaded CSV or TSV file or provide a URL to a valid Darwin Core Archive to be used as input.


Examples of the recommendations that this workflow may include in the report are a standardized country code (filled in if empty), a standardized country name and state province based on lookup of values in the original data, and a county or other locality parts parsed out of the locality string.
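As a rough sketch of one such recommendation, the snippet below fills an empty dwc:countryCode from a lookup of the country name; the tiny lookup table is a stand-in for country.txt, and the recommendation format is an assumption for illustration.

  # Sketch only: recommend a countryCode when it is empty, based on a lookup
  # of the country name. The tiny table below stands in for country.txt.
  COUNTRY_CODES = {"united states": "US", "mexico": "MX", "canada": "CA"}

  def recommend_country_code(record):
      if not record.get("countryCode") and record.get("country"):
          code = COUNTRY_CODES.get(record["country"].strip().lower())
          if code:
              return {"field": "countryCode", "recommended": code}
      return None

  print(recommend_country_code({"country": "Mexico", "countryCode": ""}))
  # {'field': 'countryCode', 'recommended': 'MX'}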


When the workflow is run the report types listed below are generated as outputs:

  1. Recommendation report - recommended_geography.csv contains the recommendations to standardize distinct combinations of higher geography based on the lookup.
  2. Count reports - count_country.csv and count_geography.csv, containing the distinct values of the country field and the distinct combinations of higher geography, respectively, along with the number of times each appeared in the extracted core file.
  3. Unknown country report - new_country.csv contains the country values not found in the country lookup file.
  4. Unknown geography report - new_geography.csv contains the distinct combinations of higher geography not found in the geography lookup file.


In addition to the reports, intermediary artifacts used as input to later steps in the workflow provide some additional information about the vocabulary files used. The default lookup file used for country can be found in country.txt, and the higher geography lookup file used is dwc_geography.txt:

  1. Vocab lookup reports - lookup_country.txt and lookup_geography.txt, downloaded copies of the country and higher geography lookup files used by the workflow when making recommendations.


See https://github.com/kurator-org/kurator-validation/wiki/Darwin-Core-Archive-Geography-Assessor for more info.


Geography Cleaner

Creates a new occurrences file with standardized geography and original geography saved in new fields. Operates on the higher geography fields present in the input data (dwc:continent, dwc:country, dwc:countryCode, dwc:stateProvince, dwc:county, dwc:municipality, dwc:waterBody, dwc:islandGroup and dwc:island). This workflow allows you to either upload a new file, select a previously uploaded CSV or TSV file or provide a URL to a valid Darwin Core Archive to be used as input. The name of the output file containing the changes made to the data can be supplied as a parameter to the workflow.


Examples of the recommendations that this workflow may include in the report are a standardized country code (filled in if empty), a standardized country name and state province based on lookup of values in the original data, and a county or other locality parts parsed out of the locality string. This workflow is similar to the Geography Assessor; however, in addition to the recommendation report, an output file is provided with the recommended changes applied to the original data.


When the workflow is run the report types listed below are generated as outputs:

  1. Recommendation report - recommended_geography.csv contains the recommendations to standardize distinct combinations of higher geography based on the lookup.
  2. Count reports - count_country.csv and count_geography.csv, containing the distinct values of the country field and the distinct combinations of higher geography, respectively, along with the number of times each appeared in the extracted core file.
  3. Unknown country report - new_country.csv contains the country values not found in the country lookup file.
  4. Unknown geography report - new_geography.csv contains the distinct combinations of higher geography not found in the geography lookup file.


The workflow also produces the following output file:

  1. Extracted occurrences - dwca_extracted_occurrences_geography_standardized.txt, or the value specified as the "output file" parameter to the workflow. This artifact is a copy of the input file with the higher geography fields replaced by standard values from lookup_geography and with the original higher geography values copied to new fields with '_orig' appended to the field names (see the sketch below).
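The sketch below illustrates the '_orig' convention described above: a standardized value replaces the original field, and the original value is preserved in a new field. The record and standard value are made up for the example.

  # Sketch only: replace a higher geography field with its standardized value
  # while keeping the original in a new field named with '_orig' appended.
  def standardize_field(record, field, standard_value):
      record[field + "_orig"] = record.get(field, "")
      record[field] = standard_value
      return record

  record = {"country": "USA", "stateProvince": "Mich."}
  standardize_field(record, "country", "United States")
  print(record)
  # {'country': 'United States', 'stateProvince': 'Mich.', 'country_orig': 'USA'}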


In addition to the reports, intermediary artifacts used as input to later steps in the workflow provide some additional information about the vocabulary files used. The default lookup file used for country can be found in country.txt, and the higher geography lookup file used is dwc_geography.txt:

  1. Vocab lookup reports - lookup_country.txt and lookup_geography.txt, downloaded copies of the country and higher geography lookup files used by the workflow when making recommendations.


See https://github.com/kurator-org/kurator-validation/wiki/CSV-File-Geography-Cleaner and https://github.com/kurator-org/kurator-validation/wiki/Darwin-Core-Archive-Geography-Cleaner for more info.


Property Parser

Parses "dynamicProperties" field in a DarwinCore-Archive spreadsheet and creates separate fields for each value.

See https://github.com/kurator-org/kurator-validation/wiki/CSV-File-Property-Parser for more info.


File Aggregator

Takes two csv or tsv files as input and creates an output file by combining the headers of the two files, then adding rows from each to the new structure. The name of the output file can be specified as a parameter to the workflow.
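A minimal sketch of this kind of aggregation, assuming two CSV inputs and placeholder file names, is shown below; the workflow's own handling of formats and options is described in the linked wiki page.

  import csv

  # Sketch only: combine two CSV files by taking the union of their headers
  # and writing the rows of both under the combined header. File names are
  # placeholders for the example.
  def aggregate(path_a, path_b, out_path):
      with open(path_a, newline="") as a, open(path_b, newline="") as b:
          reader_a, reader_b = csv.DictReader(a), csv.DictReader(b)
          headers = list(dict.fromkeys(list(reader_a.fieldnames or []) +
                                       list(reader_b.fieldnames or [])))
          rows = list(reader_a) + list(reader_b)
      with open(out_path, "w", newline="") as out:
          writer = csv.DictWriter(out, fieldnames=headers, restval="")
          writer.writeheader()
          writer.writerows(rows)

  # aggregate("collection_a.csv", "collection_b.csv", "combined.csv")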

See https://github.com/kurator-org/kurator-validation/wiki/CSV-File-Aggregator for more info.


Vocabulary Maker

Creates a vocabulary file from the input CSV or TSV file with fields for the original value, the standard value, and vetted. Adds the distinct values of the combination of fields in the field list to the file and leaves standard and vetted empty. The vocabulary file that this actor produces can be used as a parameter to the other workflows. An example of the default file used in the configuration of all workflows can be found in darwin_cloud.txt. More details about the vocabulary file format can be found at https://github.com/kurator-org/kurator-validation/wiki/Vocabulary-File-Structure.
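A rough sketch of the shape of this output follows: distinct values of a chosen field are written out with empty standard and vetted columns, ready to be filled in by a curator. The column names follow the description above, but the exact format is defined by the vocabulary file structure linked below, and this snippet is only an illustration.

  import csv

  # Sketch only: write the distinct values of one field to a vocabulary file
  # with empty 'standard' and 'vetted' columns for later curation. Column
  # names follow the description above; see the Vocabulary-File-Structure
  # wiki page for the actual format.
  def make_vocabulary(in_path, out_path, field="country"):
      with open(in_path, newline="") as src:
          values = sorted({row.get(field, "") for row in csv.DictReader(src)})
      with open(out_path, "w", newline="") as out:
          writer = csv.writer(out)
          writer.writerow([field, "standard", "vetted"])
          for value in values:
              writer.writerow([value, "", ""])

  # make_vocabulary("occurrences.csv", "vocab_country.csv")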

See https://github.com/kurator-org/kurator-validation/wiki/Vocabulary-Maker for more info.


Working with a Symbiota Portal

Look at Your Collection Profile. Are you using Symbiota as the Database of Record for your collection?

  • Yes: Management: Live Data managed directly within data portal
    • You can use the Kurator tools to quality control data provided from your collection to the world from Symbiota.
  • No: Management: Data snapshot of local collection database
    • You can use the Kurator tools to prepare data for aggregation by Symbiota (e.g. prepare a data dump for mapping in IPT to provide a Darwin Core archive).
    • You can use the Kurator tools to quality control the data that you are providing to Symbiota (e.g. test the output of your IPT instance for data quality).
    • You can use the Kurator tools to quality control data provided from your collection to the world from Symbiota.

Preparing a data file for ingest into IPT (or Symbiota)

Dump data from your database of record into a flat file (e.g. csv file) with a header line.

The CSV-File-Darwinizer workflow takes the column headers and attempts to identify which Darwin Core terms they may match up to. It produces a copy of your input file with the column headers replaced by standard Darwin Core field names. When you load your data file into IPT, this will simplify the process of mapping your data.

Quality control tests on a data file from your database

Export a flat text file containing a dump of data from your database, and you can check several aspects of the quality of that data.

  • Unique values in fields that should be using controlled vocabularies.
    • CSV File Controlled Field Assessor (includes CSV File Darwinizer)
    • CSV File Field Value Counter (includes CSV File Darwinizer)
  • Values for Dates, Geography, and Georeferences
    • CSV File Date Validator (includes CSV File Darwinizer)
    • CSV File Geography Cleaner (includes CSV File Darwinizer)
    • CSV File Georeference Validator (includes CSV File Darwinizer)

Obtaining all the data from my collection from Symbiota

  1. Find your collid (visit the collections search page and click on the name of your collection; this takes you to your collection profile page. Copy the collid from the URL bar in your browser, e.g. 13 from /portal/collections/misc/collprofiles.php?collid=13)
  2. Use your collid as the db= parameter in a search, e.g. 13 in: http://invertebase.org/portal/collections/list.php?db=13;&page=1
  3. Click on the "Download Specimen Data" icon in the upper right of the Occurrence Records tab.
  4. Select the following parameters on the Download Specimen Records form:

To obtain a flat occurrence core

  • Structure: Darwin Core
  • File Format: Comma Delimited (CSV)
  • Character Set: UTF-8 (unicode)
  • Compression: Compressed ZIP file not selected

Now, you can run any of these workflows:

  • Unique values in fields that should be using controlled vocabularies.
    • CSV File Controlled Field Assessor (includes CSV File Darwinizer)
    • CSV File Field Value Counter (includes CSV File Darwinizer)
  • Values for Dates, Geography, and Georeferences
    • CSV File Date Validator (includes CSV File Darwinizer)
    • CSV File Geography Cleaner (includes CSV File Darwinizer)
    • CSV File Georeference Validator (includes CSV File Darwinizer)

To obtain a Darwin Core Archive (not yet implemented)

  • Structure: Darwin Core
  • File Format: Tab Delimited
  • Character Set: UTF-8 (unicode)
  • Compression: Compressed ZIP file selected


Using the workflow results

If using the Kurator web application to run workflows, see the Workflow Runs section of the user documentation page for how to obtain the results after a workflow has run.


The result artifacts produced by the current set of workflows can take two forms: one is a collection of CSV/TSV report files, and the other is an Excel file containing a color-coded data quality report of the tests run, or DQ Report. Additionally, some workflows are provided as utilities, and their output types depend on the functionality provided by the workflow.


XLSX DQ Report Workflows

The workflows that support the DQ Report style of result are:


CSV/TSV Count/Recommendation Report Workflows

The workflows that support the count/recommendation style of reports are:


Utility Workflows

The following are utility workflows: