Outcome Stats Example Workflow

Workflow

The workflow demonstrates the use of Kurator-Akka and contains four Python actors that run in series. A Python actor in Kurator-Akka takes a dictionary as its input parameter and returns a dictionary as its output. Each actor's output dictionary is supplied as the argument to the next actor.

   CreateWorkspace -> OutcomeStats -> PythonToOpenPyxl -> WrapUp
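The dictionary-in, dictionary-out chaining described above can be illustrated in plain Python. This is only a minimal sketch and not how Kurator-Akka itself dispatches actors: `run_in_series` and the two stand-in actors are hypothetical names, and the stand-ins skip the file I/O that the real actors perform.

```python
def run_in_series(actors, params):
    """Call each actor in order, feeding its output dict to the next one."""
    current = dict(params)
    for actor in actors:
        current = actor(current)
    return current


# Hypothetical stand-ins for the real actors (no file I/O here).
def create_workspace(inputdict):
    outputdict = dict(inputdict)
    outputdict.setdefault('workspace', 'workspace_dir')
    return outputdict


def outcome_stats(inputdict):
    outputdict = dict(inputdict)
    outputdict['artifacts'] = {'stats_json': 'stats.json'}
    return outputdict


result = run_in_series([create_workspace, outcome_stats],
                       {'configfile': './stats.ini'})
print(result)
```

Each actor copies its input before adding keys, so values set upstream (such as the workspace) stay visible to every downstream actor.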

Actors

  1. CreateWorkspace - inline helper actor that creates a workspace directory and includes its path in the output dictionary.
  2. OutcomeStats - actor #1. Takes a JSON input file (occurrence_qc.json) and a config file (stats.ini), parses the config, computes the outcome stats, and writes the result to a JSON file. Includes the config values (outcomes, validators, etc.) in the output dictionary.
  3. PythonToOpenPyxl - actor #2. Takes the parameters output by the previous actor and constructs a spreadsheet using OpenPyxl. The user-defined output file is supplied in the input dictionary from the arguments to the workflow engine.
  4. WrapUp - inline helper actor that records the final state of the output dictionary in the output logs (artifacts produced, success, message).

Workflow Engine

Kurator-Akka provides the following contract between actor authors and clients of the workflow engine (i.e. Kurator-Web):

  • Workspace directory (%WORKSPACE_DIR%) - created by the "CreateWorkspace" actor or supplied as a workflow parameter. The first actor constructs the workspace and includes it in the dictionary. Subsequent actors are responsible for forwarding this value to downstream actors as a parameter in the output dictionary (for example, outputdict['workspace'] = inputdict['workspace'])
  • Publishing artifacts - To publish an artifact, an actor must include a nested 'artifacts' dictionary in its output. The keys of the artifacts dictionary are artifact names/labels (for example, 'stats_xlsx' or 'stats_json'). The values are filenames relative to %WORKSPACE_DIR%. After the workflow engine finishes execution, the web app will have access to the entries in this dictionary. Subsequent actors are responsible for forwarding the cumulative contents of this dictionary to downstream actors as a parameter (for example, outputdict['artifacts'] = inputdict['artifacts'])
  • Output log - The workflow engine provides clients written in Java with an output log expressed as a String or OutputStream. This is constructed by the workflow engine from Python stdout (print statements and use of Python logging)
  • Error log - The workflow engine provides clients written in Java with an error log expressed as a String or OutputStream. This is constructed by the workflow engine from Python stderr (exceptions thrown and use of Python logging)
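The forwarding rules for the workspace and the artifacts dictionary can be sketched as a small helper. The name `forward_contract` is hypothetical and is not part of Kurator-Akka; it only shows one way an actor might carry the contract values into its output.

```python
def forward_contract(inputdict, new_artifacts=None):
    """Build an output dict that forwards 'workspace' and cumulative 'artifacts'."""
    outputdict = {'workspace': inputdict['workspace']}
    # Copy so that the upstream actor's dictionary is not mutated.
    artifacts = dict(inputdict.get('artifacts', {}))
    # Values are filenames relative to the workspace directory.
    artifacts.update(new_artifacts or {})
    outputdict['artifacts'] = artifacts
    return outputdict


upstream = {'workspace': 'ws', 'artifacts': {'stats_json': 'stats.json'}}
forwarded = forward_contract(upstream, {'stats_xlsx': 'stats.xlsx'})
print(forwarded)
```

Merging into a copy keeps earlier entries intact, which is what allows the web app to see every artifact once the workflow finishes.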

OutcomeStats Actor

Input Dictionary

Sample input dictionary supplied to the OutcomeStats actor

   {
       workspace: %WORKSPACE_DIR%				(path to workspace directory created by first actor in workflow. at this point this directory should contain the inputfile)
       inputfile: %WORKSPACE_DIR%/occurrence_qc.json	(full path to the input file in the workspace directory)
       outputfile: stats.json				(name of the output file relative to workspace directory. this is the file that the actor will create, write to and supply as input to the next actor)
       configfile: ./stats.ini				(config file path relative to script execution directory)
   }

Implementation Pseudocode

Actor implementation pseudocode

   ocstats.getStats(optdict):
       load stats.ini config from optdict['configfile'] and json.load(...) occurrence_qc.json from optdict['inputfile'] 
       json dump validators, outcomes, and the numpy stats.tolist() to file located at optdict['workspace'] + '/' + optdict['outputfile']
       create the outputdict and copy values for optdict['workspace'], optdict['outputfile']. Include any additional parameters to be used as inputs to next actor
       add an artifacts dictionary to outputdict['artifacts'] that contains key/value pairs (ARTIFACT_NAME: ARTIFACT_FILE) for each artifact you wish the workflow engine to publish
       return outputdict
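The pseudocode above might translate into Python roughly as follows. This is a sketch under stated assumptions, not the actual ocstats implementation: the layout of stats.ini (comma-separated 'outcomes' and 'validators' keys in a [DEFAULT] section), the `compute_stats` parameter, and the zero-matrix placeholder are all hypothetical, and the real actor computes its statistics with numpy.

```python
import configparser
import json
import os
import tempfile


def get_stats(optdict, compute_stats=None):
    """Sketch of the OutcomeStats actor: dictionary in, dictionary out."""
    # Assumed config layout: comma-separated keys in [DEFAULT].
    config = configparser.ConfigParser()
    config.read(optdict['configfile'])
    defaults = config['DEFAULT']
    outcomes = [s.strip() for s in defaults['outcomes'].split(',')]
    validators = [s.strip() for s in defaults['validators'].split(',')]

    with open(optdict['inputfile']) as f:
        records = json.load(f)

    # Placeholder: a zero matrix unless a real computation is passed in.
    if compute_stats is None:
        def compute_stats(recs, vals, outs):
            return [[0] * len(outs) for _ in vals]
    stats = compute_stats(records, validators, outcomes)

    outpath = os.path.join(optdict['workspace'], optdict['outputfile'])
    with open(outpath, 'w') as f:
        json.dump({'validators': validators, 'outcomes': outcomes,
                   'stats': stats}, f)

    # Forward earlier artifacts and publish the new file (relative name).
    artifacts = dict(optdict.get('artifacts', {}))
    artifacts['stats_json'] = optdict['outputfile']
    return {
        'workspace': optdict['workspace'],
        'outputfile': outpath,  # full path, as in the sample output dictionary
        'artifacts': artifacts,
        'outcomes': outcomes,
        'validators': validators,
    }


# Demo in a temporary workspace; file contents are made-up sample values.
workspace = tempfile.mkdtemp()
inputfile = os.path.join(workspace, 'occurrence_qc.json')
configfile = os.path.join(workspace, 'stats.ini')
with open(inputfile, 'w') as f:
    json.dump([{'id': 1}], f)
with open(configfile, 'w') as f:
    f.write('[DEFAULT]\noutcomes = CORRECT, CURATED\nvalidators = DateValidator\n')

result = get_stats({'workspace': workspace, 'inputfile': inputfile,
                    'outputfile': 'stats.json', 'configfile': configfile})
print(result['artifacts'])
```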
   

Output Dictionary

   {
       workspace: %WORKSPACE_DIR%                      (path to workspace directory created by first actor in workflow. at this point this directory should contain the inputfile and the outputfile)
       outputfile: %WORKSPACE_DIR%/stats.json
       artifacts: {
           'stats_json': 'stats.json'                  (the stats.json file relative to the workspace directory. the webapp will know where to find this file based on this entry)
       }
       
       'outcomes': ...
       'validators': ...
       'outcomesFolded': ...
       'colOrigin': ...
       'rowOrigin': ...
       'outcomeFills': ...
   }

PythonToOpenPyxl Actor

Input Dictionary

Same as the output dictionary from the previous actor, except that the workflow engine maps the value of outputfile to inputfile and takes this actor's outputfile from the parameters to the workflow.

   {
       workspace: %WORKSPACE_DIR%                      (path to workspace directory created by first actor in workflow. at this point this directory should contain the inputfile and the outputfile)
       inputfile: %WORKSPACE_DIR%/stats.json
       outputfile: %OUTPUT_FILE_PARAM%                 (name of the output file from user defined parameter to the workflow engine. 'stats.xlsx' for example)
       artifacts: {
           'stats_json': 'stats.json'                  (the stats.json file relative to the workspace directory. the webapp will know where to find this file based on this entry)
       }
       
       'outcomes': ...
       'validators': ...
       'outcomesFolded': ...
       'colOrigin': ...
       'rowOrigin': ...
       'outcomeFills': ...
   }
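The remapping between the two actors can be pictured as below. In practice the workflow engine performs this step itself; `remap_for_next_actor` is a hypothetical name used only to make the transformation explicit.

```python
def remap_for_next_actor(prev_output, outputfile_param):
    """Turn one actor's output dict into the next actor's input dict."""
    next_input = dict(prev_output)
    # The file the previous actor wrote becomes this actor's input file.
    next_input['inputfile'] = next_input.pop('outputfile')
    # The new output file name comes from a user-defined workflow parameter.
    next_input['outputfile'] = outputfile_param
    return next_input


prev_output = {
    'workspace': 'ws',
    'outputfile': 'ws/stats.json',
    'artifacts': {'stats_json': 'stats.json'},
}
next_input = remap_for_next_actor(prev_output, 'stats.xlsx')
print(next_input)
```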

Implementation Pseudocode

   PythonToOpenPyxl.statsToXlsx(optdict):
       create the worksheet and call setColumnStyles(ws, optdict). use openpyxl to produce content from the parameters in optdict
       write workbook to optdict['workspace'] + '/' + optdict['outputfile']
       create the outputdict and include optdict['workspace'], optdict['outputfile'], optdict['artifacts'] and any additional parameters that the WrapUp actor should include in the logs (for example, a message indicating success or failure)
       add an artifact to outputdict['artifacts'] for the new output file produced by this actor, keeping the entries forwarded from earlier actors, for example: outputdict['artifacts']['stats_xlsx'] = optdict['outputfile']
       return outputdict
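The dictionary handling in this pseudocode can be sketched as follows. To keep the example runnable without OpenPyxl installed, the workbook-writing step is passed in as a callable; `write_workbook` and `fake_writer` are hypothetical names, and in the real actor that step would build the worksheet with OpenPyxl and save it to the given path.

```python
import os


def stats_to_xlsx(optdict, write_workbook):
    """Sketch of the PythonToOpenPyxl actor's dictionary handling."""
    path = os.path.join(optdict['workspace'], optdict['outputfile'])
    try:
        # Assumption: write_workbook(path, optdict) creates the spreadsheet.
        write_workbook(path, optdict)
        success, message = True, 'wrote ' + optdict['outputfile']
    except Exception as exc:
        success, message = False, str(exc)

    # Keep artifacts from earlier actors; add ours only if the write worked.
    artifacts = dict(optdict.get('artifacts', {}))
    if success:
        artifacts['stats_xlsx'] = optdict['outputfile']

    return {
        'workspace': optdict['workspace'],
        'outputfile': path,
        'artifacts': artifacts,
        'success': success,
        'message': message,
    }


written = []


def fake_writer(path, optdict):
    # Stand-in for the OpenPyxl step: just records the target path.
    written.append(path)


xlsx_result = stats_to_xlsx(
    {'workspace': 'ws', 'outputfile': 'stats.xlsx',
     'artifacts': {'stats_json': 'stats.json'}},
    fake_writer,
)
print(xlsx_result['artifacts'])
```

Reporting failure through 'success' and 'message' rather than raising lets the WrapUp actor log the final state in either case.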

Output Dictionary

   {
       workspace: %WORKSPACE_DIR%
       outputfile: %WORKSPACE_DIR%/stats.xlsx
       artifacts: {
           'stats_json': 'stats.json'          (the stats.json file relative to the workspace directory. the webapp will know where to find this file based on this entry)
           'stats_xlsx': 'stats.xlsx'          (the stats.xlsx file relative to the workspace directory. the webapp will know where to find this file based on this entry)
       }
       success: True | False
       message: 'your message here'
   }
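Because artifact values are relative to the workspace directory, a client can locate each published file by joining the two. The sketch below is illustrative only: `resolve_artifacts` is a hypothetical helper, and Kurator-Web, being written in Java, would do the equivalent on its own side.

```python
import os


def resolve_artifacts(outputdict):
    """Map each artifact name to its full path inside the workspace."""
    workspace = outputdict['workspace']
    return {
        name: os.path.join(workspace, relative_name)
        for name, relative_name in outputdict.get('artifacts', {}).items()
    }


final_state = {
    'workspace': 'ws',
    'artifacts': {'stats_json': 'stats.json', 'stats_xlsx': 'stats.xlsx'},
    'success': True,
    'message': 'your message here',
}
paths = resolve_artifacts(final_state)
print(paths)
```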