Posted to commits@joshua.apache.org by mj...@apache.org on 2016/04/05 14:40:15 UTC
[26/50] incubator-joshua-site git commit: minor updates
minor updates
Project: http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/commit/4b3bdd31
Tree: http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/tree/4b3bdd31
Diff: http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/diff/4b3bdd31
Branch: refs/heads/master
Commit: 4b3bdd31023372b769d9f3bd60f01365503a9e0b
Parents: 601d9f8
Author: Matt Post <po...@cs.jhu.edu>
Authored: Mon Jun 22 22:17:24 2015 -0400
Committer: Matt Post <po...@cs.jhu.edu>
Committed: Mon Jun 22 22:17:24 2015 -0400
----------------------------------------------------------------------
6.0/pipeline.md | 58 ++++++++++++++++++++++++++++++++++++++++------------
1 file changed, 45 insertions(+), 13 deletions(-)
----------------------------------------------------------------------
http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/4b3bdd31/6.0/pipeline.md
----------------------------------------------------------------------
diff --git a/6.0/pipeline.md b/6.0/pipeline.md
index f35f618..35d408d 100644
--- a/6.0/pipeline.md
+++ b/6.0/pipeline.md
@@ -35,7 +35,7 @@ The Joshua pipeline script is designed in the spirit of Moses' `train-model.pl`,
the user to define arbitrary execution dependency graphs. However, it is significantly simpler to
use, allowing many systems to be built with a single command (that may run for days or weeks).
-## Installation
+## Dependencies
The pipeline has no *required* external dependencies. However, it has support for a number of
external packages, some of which are included with Joshua.
@@ -67,7 +67,11 @@ external packages, some of which are included with Joshua.
standalone installation and use it to extract your grammar. This behavior will be triggered if
`$HADOOP` is undefined.
-- [SRILM](http://www.speech.sri.com/projects/srilm/) (not included)
+- [Moses](http://statmt.org/moses/) (not included). Moses is needed
+ if you wish to use its 'kbmira' tuner (--tuner kbmira), or if you
+ wish to build phrase-based models.
+
+- [SRILM](http://www.speech.sri.com/projects/srilm/) (not included; not needed; not recommended)
By default, the pipeline uses the included [KenLM](https://kheafield.com/code/kenlm/) for
building (and also querying) language models. Joshua also includes a Java program from the
@@ -83,9 +87,9 @@ external packages, some of which are included with Joshua.
having been supplanted by [KenLM](http://kheafield.com/code/kenlm/) (the default) and
BerkeleyLM).
-- [Moses](http://statmt.org/moses/) (not included)
-
-Make sure that the environment variable `$JOSHUA` is defined, and you should be all set.
+After installing any dependencies, follow the brief instructions on
+the [installation page](install.html), and then you are ready to build
+models.
## A basic pipeline run
@@ -124,6 +128,8 @@ Running the pipeline requires two main steps: data preparation and invocation.
1. Run the pipeline. The following is the minimal invocation to run the complete pipeline:
$JOSHUA/bin/pipeline.pl \
+ --rundir . \
+ --type hiero \
--corpus input/train \
--tune input/tune \
--test input/devtest \
@@ -158,7 +164,8 @@ producing BLEU scores at the end. As it runs, you will see output that looks li
took 0 seconds (0s)
...
-And in the current directory, you will see the following files (among other intermediate files
+And in the current directory, you will see the following files (among
+other files, including intermediate files
generated by the individual sub-steps).
data/
@@ -179,6 +186,8 @@ generated by the individual sub-steps).
alignments/
0/
[giza/berkeley aligner output files]
+ 1/
+ ...
training.align
thrax-hiero.conf
thrax.log
@@ -193,12 +202,21 @@ generated by the individual sub-steps).
mert.log
joshua.config.final
final-bleu
+ test/
+ model/
+ [model files]
+ output
+ final-bleu
These files will be described in more detail in subsequent sections of this tutorial.
Another useful flag is the `--rundir DIR` flag, which chdir()s to the specified directory before
running the pipeline. By default the rundir is the current directory. Changing it can be useful
-for organizing related pipeline runs. Relative paths specified to other flags (e.g., to `--corpus`
+for organizing related pipeline runs. In fact, we highly recommend
+that you organize your runs using consecutive integers, also taking a
+minute to pass a short note with the `--readme` flag, which allows you
+to quickly generate reports on [groups of related experiments](#managing).
+Relative paths specified to other flags (e.g., to `--corpus`
or `--lmfile`) are relative to the directory the pipeline was called *from*, not the rundir itself
(unless they happen to be the same, of course).
@@ -217,7 +235,7 @@ of traditional pipeline tasks:
These steps are discussed below, after a few intervening sections about high-level details of the
pipeline.
-## Managing groups of experiments
+## <a id="managing" /> Managing groups of experiments
The real utility of the pipeline comes when you use it to manage groups of experiments. Typically,
there is a held-out test set, and we want to vary a number of training parameters to determine what
@@ -225,7 +243,7 @@ effect this has on BLEU scores or some other metric. Joshua comes with a script
`$JOSHUA/scripts/training/summarize.pl` that collects information from a group of runs and reports
them to you. This script works so long as you organize your runs as follows:
-1. Your runs should be grouped together in a root directory, which I'll call `$RUNDIR`.
+1. Your runs should be grouped together in a root directory, which I'll call `$EXPDIR`.
2. For comparison purposes, the runs should all be evaluated on the same test set.
@@ -241,6 +259,10 @@ the summarize script:
[other files]
2/
README.txt
+ test/
+ final-bleu
+ final-times
+ [other files]
...
You can get such directories using the `--rundir N` flag to the pipeline.
@@ -259,9 +281,11 @@ More details are below.
## Grammar options
-Joshua can extract three types of grammars: Hiero grammars, GHKM, and SAMT grammars. As described
-on the [file formats page](file-formats.html), all of them are encoded into the same file format,
-but they differ in terms of the richness of their nonterminal sets.
+Hierarchical Joshua can extract three types of grammars: Hiero
+grammars, GHKM, and SAMT grammars. As described on the
+[file formats page](file-formats.html), all of them are encoded into
+the same file format, but they differ in terms of the richness of
+their nonterminal sets.
Hiero grammars make use of a single nonterminal, and are extracted by computing phrases from
word-based alignments and then subtracting out phrase differences. More detail can be found in
@@ -280,6 +304,12 @@ By default, the Joshua pipeline extract a Hiero grammar, but this can be altered
but you can also use Moses' extractor with `--ghkm-extractor moses`. Galley's extractor only outputs
two features, so the scores tend to be significantly lower than those of Moses'.
+Joshua (new in version 6) also includes an unlexicalized phrase-based
+decoder. Building a phrase-based model requires you to have Moses
+installed, since its `train-model.perl` script is used to extract the
+phrase table. You can enable this by defining the `$MOSES` environment
+variable and then specifying `--type phrase`.
+
## Other high-level options
The following command-line arguments control run-time behavior of multiple steps:
@@ -294,7 +324,9 @@ The following command-line arguments control run-time behavior of multiple steps
This enables parallel operation over a cluster using the qsub command. This feature is not
well-documented at this point, but you will likely want to edit the file
`$JOSHUA/scripts/training/parallelize/LocalConfig.pm` to set up your qsub environment, and may also
- want to pass specific qsub commands via the `--qsub-args "ARGS"` command.
+ want to pass specific qsub commands via the `--qsub-args "ARGS"`
+ command. We suggest you stick to the standard Joshua model that
+ tries to use as many cores as are available with the `--threads N` option.
## Restarting failed runs