Usage

Basic Usage

The script run-pipeline.sh allows to run the pipeline on a raw text document to produce a fully annotated NAF document:

./scripts/run-pipeline.sh < input.txt > output.naf

The script additionally produces a log file pipeline.log in the directory from which it is called. Internally, the script calls the python module wrapper with standard arguments from the repository’s directory. The script is set to accept the python wrapper arguments as command-line arguments.

Advanced usage

This section describes arguments to the python pipeline wrapper.

Configuration file

The pipeline wrapper uses a configuration file (provided in the repository under ./cfg/pipeline.yml) to define the pipeline components, their dependencies, and the name of their execution script. A different configuration file may be specified through the -c option.

Execution scripts

The pipeline wrapper relies on individual shell scripts for the execution of its components. Scripts for the components of the Dutch NewsReader pipeline are located by default under ./scripts/bin/. The wrapper allows to define a different location through the -d option. The arguments of some components can be set from the execution script through the -s option. This currently concerns the following elements:

  • vua-alpino wrapper around Alpino: time-out -t
  • model data for the opinion miner: data -d

The argument of -s is parsed to extract the component IDs and the relevant arguments. This is used to modify the component script calls produced by the configuration file.

Log file

By default, a log file is written to pipeline.log, in the directory from which run-pipeline.sh is called. A different file path can be specified through the -l option.

Filtering options

By default, the wrapper executes all the components listed in the configuration file. The pipeline can however be customized by filtering input or output layers and components:

  • excluded components (-e): excludes components when building the pipeline
  • input layers (-i): filters out components for upstream layers; the pipeline is set to start producing the specified layers
  • goal layers (-o): the wrapper will execute all components up to and including those that output these layers, and filter out downstream components;

Layers and components are documented in The Dutch NewsReader pipeline.

Summary

The pipeline wrapper arguments are summarized in the following table:

option description format
-c configuration file path absolute path
-d component scripts directory absolute path
-l log file path absolute path
-i input layers comma-separated naf layers string
-o goal layers comma-separated naf layers string
-e exclude components comma-separated components string
-s component arguments semicolumn-separated triplets component-id:option:value

Examples

Specifying custom paths to files

Suppose that you adopted the following structure for your project, and are working from /home/jdoe, with the alternative location custom/bin for the components shell scripts, and an alternative configuration file custom/pipeline.yml:

/home/jdoe/
|___vu-rm-pip3
|___custom
|   |___bin
|   |___pipeline.yml
|
|___data
    |___test.txt

To call the pipeline on data/test.txt from /home/jdoe and output a log file test.log under data/, run:

./vu-rm-pip3/scripts/run-pipeline.sh -c /home/jdoe/custom/pipeline.yml \
                                     -d /home/jdoe/custom/bin/ \
                                     -l /home/jdoe/data/test.log \
                                     < data/test.txt > data/test.naf

Filtering the pipeline

The following command allows to build a pipeline excluding the ixa-pipe-ned component; the execution graph is then filtered to keep components that produce the deps layer, and downstream components ending with the production of the entities layer.

./vu-rm-pip3/scripts/run-pipeline.sh -e ixa-pipe-ned \
                                     -i deps \
                                     -o entities \
                                     < data/test.tok.naf > data/test.out

Note that the -i option assumes that upstream NAF layers are present in the input file (the pipeline does not test this).

Specifying component arguments

The following call sets the Alpino time out to 0.2 min per sentence, and the opinion miner’s model to ‘hotel’:

./vu-rm-pip3/scripts/run-pipeline.sh -s 'vua-alpino:-t:0.2;opinion-miner:-d:hotel' \
                                     < data/test.txt > data/test.out

RDF extraction

The NAF pipeline output can be converted to RDF using the EventCoreference component. The NAF file should contain the following layers: text, terms, entities, srl, coreferences, timeExpressions. Call the script ./scripts/bin/naf2sem-grasp.sh to extract an RDF file from a pipeline output file:

./vu-rm-pip3/scripts/bin/naf2sem-grasp.sh < data/test.out > data/test.rdf