ToolFlow: Difference between revisions

From Meta, a Wikimedia project coordination wiki

Revision as of 09:13, 21 September 2023

ToolFlow is a tool to retrieve the output of other tools, aggregate, filter, and modify these results, and create novel outputs automatically.

Concepts

Workflow
A flowchart-like group of nodes that are connected to each other, with the output of one node being the input of another one. Once logged in, you can create new workflows, or fork existing ones. You can edit and run only your own workflows.
Node
A single step in a workflow. This can be an adapter to another tool (e.g. PetScan), or an operation on one (e.g. filter) or multiple (e.g. join) nodes. Every node creates a single output file. Output files for adapter nodes are just the output of the respective tool, represented in a standardized format.
Output mapping
Adapter nodes need to map the output of the respective tools to the standardized internal format. ToolFlow will suggest mapping where possible, but for some tools, manual mapping may be required.
Inputs
The edges in the workflow are shown as inputs to each node. Adapter nodes do not have incoming edges, and therefore no inputs from other nodes; their input comes from the respective tools.
File
Each data file produced by nodes is a JSONL file. The first line is a JSON object containing the file header, specifically the column definitions. Each subsequent line is a JSON array representing one row, whose values correspond to the columns defined in the header. Files that are used by subsequent nodes are marked as temporary, and are automatically cleaned up at a pre-determined time. Files that are not used again are marked as permanent, and are not cleaned up unless the run is repeated manually.
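
The file layout described above can be sketched in Python as follows. This is only an illustration and not ToolFlow's actual schema: the header key columns, the per-column fields name and kind, and the helper names write_file and read_file are all assumptions made for the example.

```python
import io
import json

def write_file(stream, columns, rows):
    """Write a header object on the first line, then one JSON array per row."""
    # Assumed header shape: {"columns": [{"name": ..., "kind": ...}, ...]}
    stream.write(json.dumps({"columns": columns}) + "\n")
    for row in rows:
        stream.write(json.dumps(row) + "\n")

def read_file(stream):
    """Return (column names, rows as dicts keyed by column name)."""
    header = json.loads(stream.readline())
    names = [col["name"] for col in header["columns"]]
    # Every remaining non-empty line is a JSON array with one value per column.
    rows = [dict(zip(names, json.loads(line))) for line in stream if line.strip()]
    return names, rows

# Hypothetical two-column file, written to and read back from memory.
columns = [{"name": "wiki", "kind": "string"}, {"name": "title", "kind": "string"}]
buffer = io.StringIO()
write_file(buffer, columns, [["enwiki", "Berlin"], ["dewiki", "Hamburg"]])
buffer.seek(0)
names, rows = read_file(buffer)
```

Storing the column definitions once in the header keeps each row line compact, at the cost of rows only being interpretable together with the first line.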

Node types

Tools

AListBuildingTool
Not quite sure what a-list-building-tool does, but it outputs wiki page/Wikidata item pairings.
PagePile
Generates a page list for a single wiki from a PagePile.
PetScan
Generates a PetScan page list (metadata support is yet to be implemented).

QuarryQueryLatest
The latest run for a specific Quarry page.

Sparql
The results of a SPARQL query.
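
As an illustration of what an output mapping for an adapter node involves, the sketch below converts a response in the standard SPARQL JSON results format (head.vars and results.bindings) into a header plus row arrays. It is not ToolFlow's actual adapter code: the function name map_sparql_results and the header shape with a columns key are assumptions made for the example.

```python
def map_sparql_results(result):
    """Convert SPARQL JSON results into (header, rows)."""
    variables = result["head"]["vars"]
    # Assumed header shape: one column definition per SPARQL variable.
    header = {"columns": [{"name": name} for name in variables]}
    rows = []
    for binding in result["results"]["bindings"]:
        # Unbound variables are absent from a binding; store them as None.
        rows.append([binding[name]["value"] if name in binding else None
                     for name in variables])
    return header, rows

# Hypothetical response with one fully bound and one partially bound row.
sample = {
    "head": {"vars": ["item", "itemLabel"]},
    "results": {"bindings": [
        {"item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q64"},
         "itemLabel": {"type": "literal", "value": "Berlin"}},
        {"item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q1055"}},
    ]},
}
header, rows = map_sparql_results(sample)
```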

Operations

Inner join on key
Takes two or more nodes and joins their rows into one, given a key column name. Rows whose key value does not appear in every input file are removed. Similar to SQL INNER JOIN.
Join (merge by unique key)
Concatenates the output of two or more nodes, provided they all have the same header. To avoid duplicate rows, a column name is used as a key; only the first row for each key value is passed into the output. Similar to sort -u.
Filter
Filters output of a single node by a condition or regular expression on a key column. Can either keep or remove matching rows.
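
The three operations above can be sketched in Python on rows represented as dicts. This is a minimal illustration of the described semantics and not the toolflow_rs implementation: the function names, the keep parameter, and the simplification that each key value occurs at most once per input in the inner join are all assumptions.

```python
import re

def inner_join(files, key):
    """Merge rows sharing a key value; drop keys missing from any input."""
    # Index each input by key (assumes at most one row per key per input).
    indexed = [{row[key]: row for row in rows} for rows in files]
    shared = set(indexed[0]).intersection(*indexed[1:])
    joined = []
    for value in indexed[0]:  # keep the order of the first input
        if value in shared:
            merged = {}
            for index in indexed:
                merged.update(index[value])
            joined.append(merged)
    return joined

def merge_unique(files, key):
    """Concatenate inputs, keeping only the first row for each key value."""
    seen = set()
    merged = []
    for rows in files:
        for row in rows:
            if row[key] not in seen:
                seen.add(row[key])
                merged.append(row)
    return merged

def filter_rows(rows, key, pattern, keep=True):
    """Keep (or remove, if keep is False) rows whose key column matches."""
    regex = re.compile(pattern)
    return [row for row in rows
            if bool(regex.search(str(row[key]))) == keep]

# Hypothetical sample data.
pages = [{"title": "Berlin", "wiki": "dewiki"},
         {"title": "Hamburg", "wiki": "dewiki"}]
items = [{"title": "Berlin", "item": "Q64"},
         {"title": "Paris", "item": "Q90"}]
more_pages = [{"title": "Berlin", "wiki": "enwiki"},
              {"title": "Paris", "wiki": "frwiki"}]

joined = inner_join([pages, items], "title")
merged = merge_unique([pages, more_pages], "title")
kept = filter_rows(pages, "title", r"^B")
removed = filter_rows(pages, "title", r"^B", keep=False)
```

Here joined contains only the Berlin row, because Hamburg and Paris each appear in just one input, while merged keeps the dewiki Berlin row because it was seen first.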

Code

The web UI and API are at https://github.com/magnusmanske/toolflow (HTML/JS/CSS and PHP, respectively), and the background service that actually does the processing is at https://github.com/magnusmanske/toolflow_rs/. Feel free to submit issues and suggest new tools or improvements in the respective issue trackers.