Commandline Manual

Subcommand Overview

usage: [-h]
                 {enrich,stats,split,filter,extract,eval,text,project} ...

This is the main module for the INTENT package.

positional arguments:
                        Valid subcommands
    enrich              Enrich igt data.
    stats               Get corpus statistics for a set of XIGT files.
    split               Command to split input file(s) into train/dev/test
    filter              Command to filter input file(s) for instances
    extract             Command to extract data from enriched XIGT-XML files
    eval                Command to eval INTENT functions against a gold-
                        standard XIGT-XML.
    text                Command to convert a text document into XIGT-XML.
    project             Command that will (re)project pos/ps/ds using the
                        specified pos source and alignment type.

A quick summary of the different subcommands is as follows:

  • enrich
    • The enrich command will ingest XIGT-XML instances, and attempt to perform various enrichment tasks on them. Instances that cannot be enriched will be left as-is.
  • stats
    • The stats command takes one or more XIGT-XML files and gathers some simple corpus statistics from them.
  • split
    • The split command takes one or more XIGT-XML files and produces train/dev/test XIGT-XML files, based upon the proportions for the split provided.
  • filter
    • The filter command takes one or more XIGT-XML files and produces an XIGT-XML file in which only instances matching the specified filters are kept.
  • extract
    • The extract command takes one or more XIGT-XML files and can extract CFG rules, dependency parsers, aligned word/sentence pairs, a POS tagger model, or a gloss-line POS classifier model.
  • eval
    • The eval command takes a XIGT-XML file with gold-standard POS, DS, and/or word alignments, and evaluates the automatic INTENT methods against them.
  • text
    • The text command will ingest raw text IGT representations and convert them to XIGT-XML files.
  • project
    • The project command will take a XIGT-XML file and project POS tag data and DS/PS structures from the translation line given a specified alignment method.

Configuration Files

Configuration files can simplify running INTENT in batch mode. (Example)

To load a configuration file, prefix its path with an @ symbol:

$ ./ @./example/enrich-config.conf

The configuration file contains the arguments you would otherwise pass on the command line, including the subcommand.
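
The help text in this manual looks like output from Python's argparse, which supports @-prefixed argument files through its fromfile_prefix_chars option. The sketch below assumes that mechanism; the source does not confirm how INTENT implements it. The parser is a hypothetical stand-in that borrows a few option names from the enrich help, and the file contents are illustrative. With argparse's default behavior, each line of the file is read as exactly one argument.

```python
import argparse
import os
import tempfile


def build_parser():
    # Hypothetical stand-in parser; this is NOT INTENT's actual parser.
    parser = argparse.ArgumentParser(fromfile_prefix_chars="@")
    sub = parser.add_subparsers(dest="subcommand")
    enrich = sub.add_parser("enrich")
    enrich.add_argument("--align", default="")
    enrich.add_argument("--pos", default="")
    enrich.add_argument("in_file")
    enrich.add_argument("out_file")
    return parser


def parse_config(path):
    # "@path" tells argparse to replace this argument with the file's lines.
    return build_parser().parse_args(["@" + path])


def demo():
    # One argument per line, subcommand first (illustrative contents).
    lines = ["enrich", "--align=heur,giza", "--pos=class",
             "input.xml", "output.xml"]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "enrich-config.conf")
        with open(path, "w") as handle:
            handle.write("\n".join(lines) + "\n")
        return parse_config(path)
```

Under this assumption, blank lines in the file would be read as empty arguments, so a configuration file should not contain any.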


usage: enrich [-h] [-v] [--align ALIGNMENT_LIST]
                        [--giza-symmetric {None,intersection,union,grow_diag_final,grow_diag}]
                        [--pos POS_LIST] [--parse PARSE_LIST]
                        [--max-parse-length MAX_PARSE_LENGTH]
                        [--class CLASS_PATH]
                        [--proj-aln {giza,gizaheur,heur,heurpos,manual,any}]
                        IN_FILE OUT_FILE

Ingest a XIGT document and add information, such as alignment, or POS tags.

positional arguments:
  IN_FILE               Input XIGT file.
  OUT_FILE              Path to output XIGT file.

optional arguments:
  -h, --help            show this help message and exit
  -v, --verbose         Set the verbosity level. (default: 0)
  --align ALIGNMENT_LIST
                        Comma-separated list of alignments to add. ['giza',
                        'gizaheur', 'heur', 'heurpos'] (default: [])
  --giza-symmetric {None,intersection,union,grow_diag_final,grow_diag}
                        Symmetricization heuristic to apply to statistical
                        alignment (default: None)
  --pos POS_LIST        Comma-separated list of POS tags to add (no spaces):
                        ['class', 'proj', 'trans'] (default: [])
  --parse PARSE_LIST    List of parses to create. ['trans', 'proj'] (default:
  --max-parse-length MAX_PARSE_LENGTH
                        What is the maximum length to attempt parsing on?
                        (default: 25)
  --class CLASS_PATH
  --proj-aln {giza,gizaheur,heur,heurpos,manual,any}
                        Alignment to use when performing projection. Can use
                        "any" for any available alignment. (default: any)

Explanation of arguments

IN_FILE and OUT_FILE are the paths to the input document and output document respectively.


--align

Align the translation and gloss lines using the specified methods:

  • heur: Align the gloss and translation lines using heuristics, such as stemming.
  • giza: Use mgiza++ to statistically align translation and gloss lines.

NOTE: At least one alignment method must be specified in order to perform projection.


Produce both heuristic and giza alignments:

--align heur,giza

Produce only giza alignments:

--align giza
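
Statistical aligners are commonly run in both directions, and the --giza-symmetric option chooses how the two results are combined. The sketch below shows only the two simplest choices, intersection and union, in general terms. It is not INTENT or mgiza++ code; the function name and the index-pair representation are assumptions made for illustration, and the grow_diag variants are left out.

```python
def symmetrize(gloss_to_trans, trans_to_gloss, heuristic):
    """Combine two directional alignments into one set of index pairs.

    gloss_to_trans holds (gloss_idx, trans_idx) pairs;
    trans_to_gloss holds (trans_idx, gloss_idx) pairs.
    """
    forward = set(gloss_to_trans)
    # Flip the reverse direction so both sets share the same orientation.
    backward = {(g, t) for (t, g) in trans_to_gloss}

    if heuristic == "intersection":
        # Keep only the links both directions agree on (fewer, safer links).
        return forward & backward
    if heuristic == "union":
        # Keep every link found in either direction (more, noisier links).
        return forward | backward
    raise ValueError("unsupported heuristic in this sketch: %r" % heuristic)
```

Intersection tends to favor precision and union tends to favor recall, which is the usual trade-off behind choosing between them.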


--pos

Choose what kind of POS tags will be generated for the language line.

  • class: Produce language-line POS tags via the classifier method.
  • proj: Produce language-line POS tags via the projection method.

NOTE: class will typically provide better performance, but if manual alignments are present in the data, proj may perform better.


--parse

Choose what kinds of parses to provide.

  • trans: Only parse the translation line.
  • proj: Parse the translation line and project it to the language line.

NOTE: proj requires at least one method to be specified by --align.


usage: odin [-h] [-v] [--format {txt,xigt}] [--limit LIMIT]
                   LNG OUT_FILE

positional arguments:
  LNG                  ISO 639-3 code for a language
  OUT_FILE             Output path for the output file.

optional arguments:
  -h, --help           show this help message and exit
  -v, --verbose        Set the verbosity level.
  --format {txt,xigt}  Format to output odin data in.
  --limit LIMIT        Limit number of instances written.
  --randomize          Randomly select the instances


usage: stats [-h] [-v] FILE [FILE ...]

positional arguments:
  FILE           Files from which to gather statistics.

optional arguments:
  -h, --help     show this help message and exit
  -v, --verbose  Set the verbosity level.


usage: split [-h] [--train TRAIN] [--dev DEV] [--test TEST] [-v]
                    [-o PREFIX] [-f]
                    FILE [FILE ...]

positional arguments:
  FILE           XIGT files to gather together in order to generate the
                 train/dev/test split

optional arguments:
  -h, --help     show this help message and exit
  --train TRAIN  The proportion of the data to set aside for training.
  --dev DEV      The proportion of data to set aside for development.
  --test TEST    The proportion of data to set aside for testing.
  -v, --verbose  Set the verbosity level.
  -o PREFIX      Destination prefix for the output.
  -f             Force overwrite of existing files.
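
As a rough illustration of what a proportional train/dev/test split involves, here is a minimal sketch. It is not INTENT's implementation: the function name, the default proportions, and the choice to keep the input order (no shuffling) are all assumptions, and plain list items stand in for XIGT instances.

```python
def split_instances(instances, train=0.8, dev=0.1, test=0.1):
    """Split a sequence into (train, dev, test) lists by proportion."""
    proportions = (train, dev, test)
    if any(p < 0 for p in proportions):
        raise ValueError("proportions must be non-negative")
    if sum(proportions) > 1 + 1e-9:
        raise ValueError("proportions must sum to at most 1")

    items = list(instances)
    n = len(items)
    # Cut points come from cumulative proportions, so rounding in one
    # portion cannot make two portions overlap.
    cut_train = min(n, round(n * train))
    cut_dev = min(n, round(n * (train + dev)))
    cut_test = min(n, round(n * (train + dev + test)))
    return (items[:cut_train],
            items[cut_train:cut_dev],
            items[cut_dev:cut_test])
```

For example, ten instances with proportions 0.8, 0.1, and 0.1 yield portions of eight, one, and one instance. If the proportions sum to less than 1, the trailing instances are simply left out of all three portions.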


usage: filter [-h] [-v] [--require-lang] [--require-gloss]
                        [--require-trans] [--require-gloss-pos]
                        [--max-instances MAX_INSTANCES] [--require-aln]
                        IN_FILE [IN_FILE ...] OUT_FILE

positional arguments:
  IN_FILE               XIGT files to filter.
  OUT_FILE              Output file (Combines from inputs)

optional arguments:
  -h, --help            show this help message and exit
  -v, --verbose         Set the verbosity level.
  --require-lang        Require instances to have language line
  --require-gloss       Require instances to have gloss line
  --require-trans       Require instances to have trans line
  --require-gloss-pos   Require instance to have gloss pos tags
                        Filter out ungrammatical instances
  --max-instances MAX_INSTANCES
                        Limit the number of output instances
  --require-aln         Require instances to have 1-to-1 gloss/lang alignment.
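
The --require-* and --max-instances options combine naturally: an instance is kept only if it has every required tier, and output stops once the limit is reached. A minimal sketch of that logic follows. It assumes each instance is a plain dict with a "tiers" set, a hypothetical stand-in for real XIGT instances, and it does not use the XIGT library or reproduce INTENT's own code.

```python
def filter_instances(instances, require=(), max_instances=None):
    """Keep instances that contain every required tier, up to a limit."""
    kept = []
    for inst in instances:
        # Stop early once the output limit has been reached.
        if max_instances is not None and len(kept) >= max_instances:
            break
        # An instance passes only if all required tiers are present.
        if all(tier in inst["tiers"] for tier in require):
            kept.append(inst)
    return kept


# Hypothetical sample data: the second instance has no gloss line.
SAMPLE = [
    {"id": "i1", "tiers": {"lang", "gloss", "trans"}},
    {"id": "i2", "tiers": {"lang", "trans"}},
    {"id": "i3", "tiers": {"lang", "gloss", "trans"}},
]
```

Calling filter_instances(SAMPLE, require=("gloss",)) keeps i1 and i3; adding max_instances=1 keeps only i1.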