Posted to commits@joshua.apache.org by mj...@apache.org on 2016/04/05 14:44:11 UTC
[24/50] incubator-joshua-site git commit: minor updates

minor updates


Project: http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/commit/4b3bdd31
Tree: http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/tree/4b3bdd31
Diff: http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/diff/4b3bdd31

Branch: refs/heads/asf-site
Commit: 4b3bdd31023372b769d9f3bd60f01365503a9e0b
Parents: 601d9f8
Author: Matt Post <po...@cs.jhu.edu>
Authored: Mon Jun 22 22:17:24 2015 -0400
Committer: Matt Post <po...@cs.jhu.edu>
Committed: Mon Jun 22 22:17:24 2015 -0400

----------------------------------------------------------------------
 6.0/pipeline.md | 58 ++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 45 insertions(+), 13 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/4b3bdd31/6.0/pipeline.md
----------------------------------------------------------------------
diff --git a/6.0/pipeline.md b/6.0/pipeline.md
index f35f618..35d408d 100644
--- a/6.0/pipeline.md
+++ b/6.0/pipeline.md
@@ -35,7 +35,7 @@ The Joshua pipeline script is designed in the spirit of Moses' `train-model.pl`,
 the user to define arbitrary execution dependency graphs. However, it is significantly simpler to
 use, allowing many systems to be built with a single command (that may run for days or weeks).
 
-## Installation
+## Dependencies
 
 The pipeline has no *required* external dependencies.  However, it has support for a number of
 external packages, some of which are included with Joshua.
@@ -67,7 +67,11 @@ external packages, some of which are included with Joshua.
    standalone installation and use it to extract your grammar.  This behavior will be triggered if
    `$HADOOP` is undefined.
    
--  [SRILM](http://www.speech.sri.com/projects/srilm/) (not included)
+-  [Moses](http://statmt.org/moses/) (not included). Moses is needed
+   if you wish to use its 'kbmira' tuner (--tuner kbmira), or if you
+   wish to build phrase-based models.
+   
+-  [SRILM](http://www.speech.sri.com/projects/srilm/) (not included; not needed; not recommended)
 
    By default, the pipeline uses the included [KenLM](https://kheafield.com/code/kenlm/) for
    building (and also querying) language models. Joshua also includes a Java program from the
@@ -83,9 +87,9 @@ external packages, some of which are included with Joshua.
    having been supplanted by [KenLM](http://kheafield.com/code/kenlm/) (the default) and
    BerkeleyLM).
 
--  [Moses](http://statmt.org/moses/) (not included)
-
-Make sure that the environment variable `$JOSHUA` is defined, and you should be all set.
+After installing any dependencies, follow the brief instructions on
+the [installation page](install.html), and then you are ready to build
+models. 
 
 ## A basic pipeline run
 
@@ -124,6 +128,8 @@ Running the pipeline requires two main steps: data preparation and invocation.
 1. Run the pipeline.  The following is the minimal invocation to run the complete pipeline:
 
        $JOSHUA/bin/pipeline.pl  \
+         --rundir .             \
+         --type hiero           \
          --corpus input/train   \
          --tune input/tune      \
          --test input/devtest   \
@@ -158,7 +164,8 @@ producing BLEU scores at the end.  As it runs, you will see output that looks li
       took 0 seconds (0s)
     ...
    
-And in the current directory, you will see the following files (among other intermediate files
+And in the current directory, you will see the following files (among
+other files, including intermediate files
 generated by the individual sub-steps).
    
     data/
@@ -179,6 +186,8 @@ generated by the individual sub-steps).
     alignments/
         0/
             [giza/berkeley aligner output files]
+        1/
+        ...
         training.align
     thrax-hiero.conf
     thrax.log
@@ -193,12 +202,21 @@ generated by the individual sub-steps).
          mert.log
          joshua.config.final
          final-bleu
+    test/
+         model/
+               [model files]
+         output
+         final-bleu
 
 These files will be described in more detail in subsequent sections of this tutorial.
 
 Another useful flag is the `--rundir DIR` flag, which chdir()s to the specified directory before
 running the pipeline.  By default the rundir is the current directory.  Changing it can be useful
-for organizing related pipeline runs.  Relative paths specified to other flags (e.g., to `--corpus`
+for organizing related pipeline runs.  In fact, we highly recommend
+that you organize your runs using consecutive integers, also taking a
+minute to pass a short note with the `--readme` flag, which allows you
+to quickly generate reports on [groups of related experiments](#managing).
+Relative paths specified to other flags (e.g., to `--corpus`
 or `--lmfile`) are relative to the directory the pipeline was called *from*, not the rundir itself
 (unless they happen to be the same, of course).
 
@@ -217,7 +235,7 @@ of traditional pipeline tasks:
 These steps are discussed below, after a few intervening sections about high-level details of the
 pipeline.
 
-## Managing groups of experiments
+## <a id="managing" /> Managing groups of experiments
 
 The real utility of the pipeline comes when you use it to manage groups of experiments. Typically,
 there is a held-out test set, and we want to vary a number of training parameters to determine what
@@ -225,7 +243,7 @@ effect this has on BLEU scores or some other metric. Joshua comes with a script
 `$JOSHUA/scripts/training/summarize.pl` that collects information from a group of runs and reports
 them to you. This script works so long as you organize your runs as follows:
 
-1. Your runs should be grouped together in a root directory, which I'll call `$RUNDIR`.
+1. Your runs should be grouped together in a root directory, which I'll call `$EXPDIR`.
 
 2. For comparison purposes, the runs should all be evaluated on the same test set.
 
@@ -241,6 +259,10 @@ the summarize script:
                [other files]
            2/
                README.txt
+               test/
+                   final-bleu
+                   final-times
+               [other files]
                ...
                
 You can get such directories using the `--rundir N` flag to the pipeline. 
@@ -259,9 +281,11 @@ More details are below.
 
 ## Grammar options
 
-Joshua can extract three types of grammars: Hiero grammars, GHKM, and SAMT grammars.  As described
-on the [file formats page](file-formats.html), all of them are encoded into the same file format,
-but they differ in terms of the richness of their nonterminal sets.
+Hierarchical Joshua can extract three types of grammars: Hiero
+grammars, GHKM, and SAMT grammars.  As described on the
+[file formats page](file-formats.html), all of them are encoded into
+the same file format, but they differ in terms of the richness of
+their nonterminal sets.
 
 Hiero grammars make use of a single nonterminal, and are extracted by computing phrases from
 word-based alignments and then subtracting out phrase differences.  More detail can be found in
@@ -280,6 +304,12 @@ By default, the Joshua pipeline extract a Hiero grammar, but this can be altered
 but you can also use Moses' extractor with `--ghkm-extractor moses`. Galley's extractor only outputs
 two features, so the scores tend to be significantly lower than those of Moses'.
 
+Joshua (new in version 6) also includes an unlexicalized phrase-based
+decoder. Building a phrase-based model requires you to have Moses
+installed, since its `train-model.perl` script is used to extract the
+phrase table. You can enable this by defining the `$MOSES` environment
+variable and then specifying `--type phrase`.
+
 ## Other high-level options
 
 The following command-line arguments control run-time behavior of multiple steps:
@@ -294,7 +324,9 @@ The following command-line arguments control run-time behavior of multiple steps
   This enables parallel operation over a cluster using the qsub command.  This feature is not
   well-documented at this point, but you will likely want to edit the file
   `$JOSHUA/scripts/training/parallelize/LocalConfig.pm` to set up your qsub environment, and may also
-  want to pass specific qsub commands via the `--qsub-args "ARGS"` command.
+  want to pass specific qsub commands via the `--qsub-args "ARGS"`
+  command. We suggest you stick to the standard Joshua model that
+  tries to use as many cores as are available with the `--threads N` option.
 
 ## Restarting failed runs