Posted to commits@joshua.apache.org by mj...@apache.org on 2016/04/09 05:10:46 UTC
[37/44] incubator-joshua-site git commit: First attempt
http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/53cc3005/5.0/pipeline.html
----------------------------------------------------------------------
diff --git a/5.0/pipeline.html b/5.0/pipeline.html
new file mode 100644
index 0000000..bb56bee
--- /dev/null
+++ b/5.0/pipeline.html
@@ -0,0 +1,919 @@
+<!DOCTYPE html>
+<html lang="en">
+ <head>
+ <meta charset="utf-8">
+ <title>Joshua Documentation | The Joshua Pipeline</title>
+ <meta name="viewport" content="width=device-width, initial-scale=1.0">
+ <meta name="description" content="">
+ <meta name="author" content="">
+
+ <!-- Le styles -->
+ <link href="/bootstrap/css/bootstrap.css" rel="stylesheet">
+ <style>
+ body {
+ padding-top: 60px; /* 60px to make the container go all the way to the bottom of the topbar */
+ }
+ #download {
+ background-color: green;
+ font-size: 14pt;
+ font-weight: bold;
+ text-align: center;
+ color: white;
+ border-radius: 5px;
+ padding: 4px;
+ }
+
+ #download a:link {
+ color: white;
+ }
+
+ #download a:hover {
+ color: lightgrey;
+ }
+
+ #download a:visited {
+ color: white;
+ }
+
+ a.pdf {
+ font-variant: small-caps;
+ /* font-weight: bold; */
+ font-size: 10pt;
+ color: white;
+ background: brown;
+ padding: 2px;
+ }
+
+ a.bibtex {
+ font-variant: small-caps;
+ /* font-weight: bold; */
+ font-size: 10pt;
+ color: white;
+ background: orange;
+ padding: 2px;
+ }
+
+ img.sponsor {
+ height: 120px;
+ margin: 5px;
+ }
+ </style>
+ <link href="bootstrap/css/bootstrap-responsive.css" rel="stylesheet">
+
+ <!-- HTML5 shim, for IE6-8 support of HTML5 elements -->
+ <!--[if lt IE 9]>
+ <script src="bootstrap/js/html5shiv.js"></script>
+ <![endif]-->
+
+ <!-- Fav and touch icons -->
+ <link rel="apple-touch-icon-precomposed" sizes="144x144" href="bootstrap/ico/apple-touch-icon-144-precomposed.png">
+ <link rel="apple-touch-icon-precomposed" sizes="114x114" href="bootstrap/ico/apple-touch-icon-114-precomposed.png">
+ <link rel="apple-touch-icon-precomposed" sizes="72x72" href="bootstrap/ico/apple-touch-icon-72-precomposed.png">
+ <link rel="apple-touch-icon-precomposed" href="bootstrap/ico/apple-touch-icon-57-precomposed.png">
+ <link rel="shortcut icon" href="bootstrap/ico/favicon.png">
+ </head>
+
+ <body>
+
+ <div class="navbar navbar-inverse navbar-fixed-top">
+ <div class="navbar-inner">
+ <div class="container">
+ <button type="button" class="btn btn-navbar" data-toggle="collapse" data-target=".nav-collapse">
+ <span class="icon-bar"></span>
+ <span class="icon-bar"></span>
+ <span class="icon-bar"></span>
+ </button>
+ <a class="brand" href="/">Joshua</a>
+ <div class="nav-collapse collapse">
+ <ul class="nav">
+ <li><a href="index.html">Documentation</a></li>
+ <li><a href="pipeline.html">Pipeline</a></li>
+ <li><a href="tutorial.html">Tutorial</a></li>
+ <li><a href="decoder.html">Decoder</a></li>
+ <li><a href="thrax.html">Thrax</a></li>
+ <li><a href="file-formats.html">File formats</a></li>
+ <!-- <li><a href="advanced.html">Advanced</a></li> -->
+ <li><a href="faq.html">FAQ</a></li>
+ </ul>
+ </div><!--/.nav-collapse -->
+ </div>
+ </div>
+ </div>
+
+ <div class="container">
+
+ <div class="row">
+ <div class="span2">
+ <img src="/images/joshua-logo-small.png"
+ alt="Joshua logo (picture of a Joshua tree)" />
+ </div>
+ <div class="span10">
+ <h1>Joshua Documentation</h1>
+ <h2>The Joshua Pipeline</h2>
+ <span id="download">
+ <a href="http://cs.jhu.edu/~post/files/joshua-v5.0.tgz">Download</a>
+ </span>
+ (version 5.0, released 16 August 2013)
+ </div>
+ </div>
+
+ <hr />
+
+ <div class="row">
+ <div class="span8">
+
+ <p>This page describes the Joshua pipeline script, which manages the complexity of training and
+evaluating machine translation systems. The pipeline eases the pain of two related tasks in
+statistical machine translation (SMT) research:</p>
+
+<ul>
+ <li>
+ <p>Training SMT systems involves a complicated process of interacting steps that are
+time-consuming and prone to failure.</p>
+ </li>
+ <li>
+    <p>Developing and testing new techniques requires varying parameters at different points in the
+pipeline. Ideally, earlier results (which are often expensive to compute) should be reused rather than recomputed.</p>
+ </li>
+</ul>
+
+<p>To facilitate these tasks, the pipeline script:</p>
+
+<ul>
+ <li>
+ <p>Runs the complete SMT pipeline, from corpus normalization and tokenization, through alignment,
+model building, tuning, test-set decoding, and evaluation.</p>
+ </li>
+ <li>
+    <p>Caches the results of intermediate steps (using robust SHA-1 checksums on dependencies), so the
+pipeline can be debugged or shared across similar runs while avoiding the time spent
+recomputing expensive steps.</p>
+ </li>
+ <li>
+ <p>Allows you to jump into and out of the pipeline at a set of predefined places (e.g., the alignment
+stage), so long as you provide the missing dependencies.</p>
+ </li>
+</ul>
+
+<p>The Joshua pipeline script is designed in the spirit of Moses’ <code class="highlighter-rouge">train-model.pl</code>, and shares many of
+its features. It is not as extensive, however, as Moses’
+<a href="http://www.statmt.org/moses/?n=FactoredTraining.EMS">Experiment Management System</a>, which allows
+the user to define arbitrary execution dependency graphs.</p>
+
+<h2 id="installation">Installation</h2>
+
+<p>The pipeline has no <em>required</em> external dependencies. However, it has support for a number of
+external packages, some of which are included with Joshua.</p>
+
+<ul>
+ <li>
+ <p><a href="http://code.google.com/p/giza-pp/">GIZA++</a> (included)</p>
+
+  <p>GIZA++ is the default aligner. It is included with Joshua, and should compile successfully when
+you type <code class="highlighter-rouge">ant</code> from the Joshua root directory. It is not strictly required, since you can instead use the
+(included) Berkeley aligner (<code class="highlighter-rouge">--aligner berkeley</code>). We have recently also added support
+for the <a href="http://code.google.com/p/jacana-xy/wiki/JacanaXY">Jacana-XY aligner</a> (<code class="highlighter-rouge">--aligner
+jacana</code>).</p>
+ </li>
+ <li>
+ <p><a href="http://hadoop.apache.org/">Hadoop</a> (included)</p>
+
+ <p>The pipeline uses the <a href="thrax.html">Thrax grammar extractor</a>, which is built on Hadoop. If you
+have a Hadoop installation, simply ensure that the <code class="highlighter-rouge">$HADOOP</code> environment variable is defined, and
+the pipeline will use it automatically at the grammar extraction step. If you are going to
+attempt to extract very large grammars, it is best to have a good-sized Hadoop installation.</p>
+
+ <p>(If you do not have a Hadoop installation, you might consider setting one up. Hadoop can be
+installed in a
+<a href="http://hadoop.apache.org/common/docs/r0.20.2/quickstart.html#PseudoDistributed">“pseudo-distributed”</a>
+mode that allows it to use just a few machines or a number of processors on a single machine.
+The main issue is to ensure that there are a lot of independent physical disks, since in our
+experience Hadoop starts to exhibit lots of hard-to-trace problems if there is too much demand on
+the disks.)</p>
+
+  <p>If you don’t have a Hadoop installation, there is still no need to worry. The pipeline will roll out a
+standalone installation and use it to extract your grammar. This behavior will be triggered if
+<code class="highlighter-rouge">$HADOOP</code> is undefined.</p>
+ </li>
+ <li>
+ <p><a href="http://www.speech.sri.com/projects/srilm/">SRILM</a> (not included)</p>
+
+ <p>By default, the pipeline uses a Java program from the
+<a href="http://code.google.com/p/berkeleylm/">Berkeley LM</a> package that constructs a
+Kneser-Ney-smoothed language model in ARPA format from the target side of your training data. If
+you wish to use SRILM instead, you need to do the following:</p>
+
+ <ol>
+ <li>Install SRILM and set the <code class="highlighter-rouge">$SRILM</code> environment variable to point to its installed location.</li>
+ <li>Add the <code class="highlighter-rouge">--lm-gen srilm</code> flag to your pipeline invocation.</li>
+ </ol>
+
+ <p>More information on this is available in the <a href="#lm">LM building section of the pipeline</a>. SRILM
+is not used for representing language models during decoding (and in fact is not supported,
+having been supplanted by <a href="http://kheafield.com/code/kenlm/">KenLM</a> (the default) and
+BerkeleyLM).</p>
+ </li>
+ <li>
+ <p><a href="http://statmt.org/moses/">Moses</a> (not included)</p>
+ </li>
+</ul>
+
+<p>Make sure that the environment variable <code class="highlighter-rouge">$JOSHUA</code> is defined, and you should be all set.</p>
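+
+<p>For example, you might add lines like the following to your shell startup file (the paths here
+are hypothetical placeholders; substitute the actual install locations on your machine):</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code># Required: the root of your Joshua installation
+export JOSHUA=/path/to/joshua
+
+# Optional: points the pipeline at an existing Hadoop installation
+export HADOOP=/path/to/hadoop
+
+# Optional: enables SRILM for language model construction (with --lm-gen srilm)
+export SRILM=/path/to/srilm
+</code></pre>
+</div>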
+
+<h2 id="a-basic-pipeline-run">A basic pipeline run</h2>
+
+<p>The pipeline takes a set of inputs (training, tuning, and test data), and creates a set of
+intermediate files in the <em>run directory</em>. By default, the run directory is the current directory,
+but it can be changed with the <code class="highlighter-rouge">--rundir</code> parameter.</p>
+
+<p>For this quick start, we will be working with the example that can be found in
+<code class="highlighter-rouge">$JOSHUA/examples/pipeline</code>. This example contains 1,000 sentences of Urdu-English data (the full
+dataset is available as part of the
+<a href="/indian-parallel-corpora/">Indian languages parallel corpora</a>), along with
+100-sentence tuning and test sets with four references each.</p>
+
+<p>Running the pipeline requires two main steps: data preparation and invocation.</p>
+
+<ol>
+ <li>
+ <p>Prepare your data. The pipeline script needs to be told where to find the raw training, tuning,
+and test data. A good convention is to place these files in an input/ subdirectory of your run’s
+working directory (NOTE: do not use <code class="highlighter-rouge">data/</code>, since a directory of that name is created and used
+by the pipeline itself for storing processed files). The expected format (for each of training,
+tuning, and test) is a pair of files that share a common path prefix and are distinguished by
+their extension, e.g.,</p>
+
+ <div class="highlighter-rouge"><pre class="highlight"><code>input/
+ train.SOURCE
+ train.TARGET
+ tune.SOURCE
+ tune.TARGET
+ test.SOURCE
+ test.TARGET
+</code></pre>
+ </div>
+
+ <p>These files should be parallel at the sentence level (with one sentence per line), should be in
+UTF-8, and should be untokenized (tokenization occurs in the pipeline). SOURCE and TARGET denote
+variables that should be replaced with the actual source and target language abbreviations (e.g.,
+“ur” and “en”).</p>
+ </li>
+ <li>
+ <p>Run the pipeline. The following is the minimal invocation to run the complete pipeline:</p>
+
+ <div class="highlighter-rouge"><pre class="highlight"><code>$JOSHUA/bin/pipeline.pl \
+ --corpus input/train \
+ --tune input/tune \
+ --test input/devtest \
+ --source SOURCE \
+ --target TARGET
+</code></pre>
+ </div>
+
+  <p>The <code class="highlighter-rouge">--corpus</code>, <code class="highlighter-rouge">--tune</code>, and <code class="highlighter-rouge">--test</code> flags define file prefixes that are concatenated with the
+language extensions given by <code class="highlighter-rouge">--target</code> and <code class="highlighter-rouge">--source</code> (with a “.” in between). Note the
+correspondences with the files defined in the first step above. The prefixes can be either
+absolute or relative pathnames. This particular invocation assumes that a subdirectory <code class="highlighter-rouge">input/</code>
+exists in the current directory, that you are translating from a language identified by the “ur”
+extension to a language identified by the “en” extension, that the training data can be found at
+<code class="highlighter-rouge">input/train.en</code> and <code class="highlighter-rouge">input/train.ur</code>, and so on.</p>
+ </li>
+</ol>
+
+<p><em>Don’t</em> run the pipeline directly from <code class="highlighter-rouge">$JOSHUA</code>. We recommend creating a run directory somewhere
+  else to contain all of your experiments. The advantage of this (apart from
+  not clobbering part of the Joshua install) is that Joshua provides support scripts for summarizing
+  the results of a series of experiments, and these only work if you organize your runs as described
+  in the section on managing groups of experiments below.</p>
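+
+<p>For example, you might create a dedicated directory tree for a group of experiments (the path is
+just an illustration) and launch the pipeline from inside it:</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code>mkdir -p ~/expts/ur-en/1
+cd ~/expts/ur-en/1
+$JOSHUA/bin/pipeline.pl --rundir . ...
+</code></pre>
+</div>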
+
+<p>Assuming no problems arise, this command will run the complete pipeline in about 20 minutes,
+producing BLEU scores at the end. As it runs, you will see output that looks like the following:</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code>[train-copy-en] rebuilding...
+ dep=/Users/post/code/joshua/test/pipeline/input/train.en
+ dep=data/train/train.en.gz [NOT FOUND]
+ cmd=cat /Users/post/code/joshua/test/pipeline/input/train.en | gzip -9n > data/train/train.en.gz
+ took 0 seconds (0s)
+[train-copy-ur] rebuilding...
+ dep=/Users/post/code/joshua/test/pipeline/input/train.ur
+ dep=data/train/train.ur.gz [NOT FOUND]
+ cmd=cat /Users/post/code/joshua/test/pipeline/input/train.ur | gzip -9n > data/train/train.ur.gz
+ took 0 seconds (0s)
+...
+</code></pre>
+</div>
+
+<p>And in the current directory, you will see the following files (among other intermediate files
+generated by the individual sub-steps).</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code>data/
+ train/
+ corpus.ur
+ corpus.en
+ thrax-input-file
+ tune/
+ tune.tok.lc.ur
+ tune.tok.lc.en
+ grammar.filtered.gz
+ grammar.glue
+ test/
+ test.tok.lc.ur
+ test.tok.lc.en
+ grammar.filtered.gz
+ grammar.glue
+alignments/
+ 0/
+ [giza/berkeley aligner output files]
+ training.align
+thrax-hiero.conf
+thrax.log
+grammar.gz
+lm.gz
+tune/
+ 1/
+ decoder_command
+ joshua.config
+ params.txt
+ joshua.log
+ mert.log
+ joshua.config.ZMERT.final
+ final-bleu
+</code></pre>
+</div>
+
+<p>These files will be described in more detail in subsequent sections of this tutorial.</p>
+
+<p>Another useful flag is the <code class="highlighter-rouge">--rundir DIR</code> flag, which chdir()s to the specified directory before
+running the pipeline. By default the rundir is the current directory. Changing it can be useful
+for organizing related pipeline runs. Relative paths specified to other flags (e.g., to <code class="highlighter-rouge">--corpus</code>
+or <code class="highlighter-rouge">--lmfile</code>) are relative to the directory the pipeline was called <em>from</em>, not the rundir itself
+(unless they happen to be the same, of course).</p>
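+
+<p>Because of this, it is safest to pass absolute paths when using <code class="highlighter-rouge">--rundir</code>. A hypothetical
+invocation (the paths are illustrative) might look like:</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code>$JOSHUA/bin/pipeline.pl \
+    --rundir /home/you/runs/2 \
+    --corpus /home/you/input/train \
+    --tune /home/you/input/tune \
+    --test /home/you/input/devtest \
+    --source ur \
+    --target en
+</code></pre>
+</div>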
+
+<p>The complete pipeline comprises many tens of small steps, which can be grouped together into a set
+of traditional pipeline tasks:</p>
+
+<ol>
+ <li><a href="#prep">Data preparation</a></li>
+ <li><a href="#alignment">Alignment</a></li>
+ <li><a href="#parsing">Parsing</a> (syntax-based grammars only)</li>
+ <li><a href="#tm">Grammar extraction</a></li>
+ <li><a href="#lm">Language model building</a></li>
+ <li><a href="#tuning">Tuning</a></li>
+ <li><a href="#testing">Testing</a></li>
+ <li><a href="#analysis">Analysis</a></li>
+</ol>
+
+<p>These steps are discussed below, after a few intervening sections about high-level details of the
+pipeline.</p>
+
+<h2 id="managing-groups-of-experiments">Managing groups of experiments</h2>
+
+<p>The real utility of the pipeline comes when you use it to manage groups of experiments. Typically,
+there is a held-out test set, and we want to vary a number of training parameters to determine what
+effect this has on BLEU scores or some other metric. Joshua comes with a script
+<code class="highlighter-rouge">$JOSHUA/scripts/training/summarize.pl</code> that collects information from a group of runs and reports
+it to you. This script works so long as you organize your runs as follows:</p>
+
+<ol>
+ <li>
+ <p>Your runs should be grouped together in a root directory, which I’ll call <code class="highlighter-rouge">$RUNDIR</code>.</p>
+ </li>
+ <li>
+ <p>For comparison purposes, the runs should all be evaluated on the same test set.</p>
+ </li>
+ <li>
+ <p>Each run in the run group should be in its own numbered directory, shown with the files used by
+the summarize script:</p>
+
+ <div class="highlighter-rouge"><pre class="highlight"><code>$RUNDIR/
+ 1/
+ README.txt
+ test/
+ final-bleu
+ final-times
+ [other files]
+ 2/
+ README.txt
+ ...
+</code></pre>
+ </div>
+ </li>
+</ol>
+
+<p>You can get such directories using the <code class="highlighter-rouge">--rundir N</code> flag to the pipeline. </p>
+
+<p>Run directories can build off each other. For example, <code class="highlighter-rouge">1/</code> might contain a complete baseline
+run. If you want to change just the tuner, you don’t need to rerun the aligner and model builder;
+you can reuse those results by supplying the second run with the information it needs that was
+computed in run 1:</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code>$JOSHUA/bin/pipeline.pl \
+ --first-step tune \
+ --grammar 1/grammar.gz \
+ ...
+</code></pre>
+</div>
+
+<p>More details are below.</p>
+
+<h2 id="grammar-options">Grammar options</h2>
+
+<p>Joshua can extract three types of grammars: Hiero grammars, GHKM, and SAMT grammars. As described
+on the <a href="file-formats.html">file formats page</a>, all of them are encoded into the same file format,
+but they differ in terms of the richness of their nonterminal sets.</p>
+
+<p>Hiero grammars make use of a single nonterminal, and are extracted by computing phrases from
+word-based alignments and then subtracting out phrase differences. More detail can be found in
+<a href="http://www.mitpressjournals.org/doi/abs/10.1162/coli.2007.33.2.201">Chiang (2007) [PDF]</a>.
+<a href="http://www.isi.edu/%7Emarcu/papers/cr_ghkm_naacl04.pdf">GHKM</a> (new with 5.0) and
+<a href="http://www.cs.cmu.edu/~zollmann/samt/">SAMT</a> grammars make use of a source- or target-side parse
+tree on the training data, differing in the way they extract rules using these trees: GHKM extracts
+synchronous tree substitution grammar rules rooted in a subset of the tree constituents, whereas
+SAMT projects constituent labels down onto phrases. SAMT grammars are usually many times larger and
+are much slower to decode with, but sometimes increase BLEU score. Both grammar formats are
+extracted with the <a href="thrax.html">Thrax software</a>.</p>
+
+<p>By default, the Joshua pipeline extracts a Hiero grammar, but this can be altered with the <code class="highlighter-rouge">--type
+(ghkm|samt)</code> flag. For GHKM grammars, the default is to use
+<a href="http://www-nlp.stanford.edu/~mgalley/software/stanford-ghkm-latest.tar.gz">Michel Galley’s extractor</a>,
+but you can also use Moses’ extractor with <code class="highlighter-rouge">--ghkm-extractor moses</code>. Galley’s extractor only outputs
+two features, so the scores tend to be significantly lower than those of Moses’ extractor.</p>
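+
+<p>For example, here is a sketch of an invocation that requests a GHKM grammar built with Moses’
+extractor, reusing the quick-start file layout from above:</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code>$JOSHUA/bin/pipeline.pl \
+    --corpus input/train \
+    --tune input/tune \
+    --test input/devtest \
+    --source ur \
+    --target en \
+    --type ghkm \
+    --ghkm-extractor moses
+</code></pre>
+</div>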
+
+<h2 id="other-high-level-options">Other high-level options</h2>
+
+<p>The following command-line arguments control run-time behavior of multiple steps:</p>
+
+<ul>
+ <li>
+ <p><code class="highlighter-rouge">--threads N</code> (1)</p>
+
+  <p>This enables multithreaded operation for a number of steps: alignment (with GIZA, max two
+threads), parsing, and decoding (any number of threads).</p>
+ </li>
+ <li>
+ <p><code class="highlighter-rouge">--jobs N</code> (1)</p>
+
+  <p>This enables parallel operation over a cluster using the qsub command. This feature is not
+well-documented at this point, but you will likely want to edit the file
+<code class="highlighter-rouge">$JOSHUA/scripts/training/parallelize/LocalConfig.pm</code> to set up your qsub environment, and may also
+want to pass specific qsub arguments via the <code class="highlighter-rouge">--qsub-args "ARGS"</code> flag.</p>
+ </li>
+</ul>
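+
+<p>As a sketch, these flags might be used as follows; the queue name passed to qsub is a
+hypothetical placeholder:</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code># multithreaded operation on a single machine
+$JOSHUA/bin/pipeline.pl --threads 8 ...
+
+# parallel operation over a cluster via qsub
+$JOSHUA/bin/pipeline.pl --jobs 20 --qsub-args "-q your.q" ...
+</code></pre>
+</div>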
+
+<h2 id="restarting-failed-runs">Restarting failed runs</h2>
+
+<p>If the pipeline dies, you can restart it with the same command you used the first time. If you
+rerun the pipeline with the exact same invocation as the previous run (or an overlapping
+configuration – one that causes the same set of behaviors), you will see slightly different
+output compared to what we saw above:</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code>[train-copy-en] cached, skipping...
+[train-copy-ur] cached, skipping...
+...
+</code></pre>
+</div>
+
+<p>This indicates that the caching module has discovered that the step was already computed and thus
+did not need to be rerun. This feature is quite useful for restarting pipeline runs that have
+crashed due to bugs, memory limitations, hardware failures, and the myriad other problems that
+plague MT researchers across the world.</p>
+
+<p>Often, a command will die because it was parameterized incorrectly. For example, perhaps the
+decoder ran out of memory. The caching facility allows you to adjust the offending parameter (e.g.,
+<code class="highlighter-rouge">--joshua-mem</code>) and rerun the script without repeating the earlier steps. Of course, if you change
+a parameter that a step depends on, that step will be rerun, which in turn might trigger further downstream reruns.</p>
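+
+<p>For example, if the decoder ran out of memory, you might rerun your original invocation with a
+larger memory allotment (the value shown is just an illustration); cached steps are skipped, and
+only the steps that depend on the changed parameter are recomputed:</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code>$JOSHUA/bin/pipeline.pl \
+    --corpus input/train \
+    --tune input/tune \
+    --test input/devtest \
+    --source ur \
+    --target en \
+    --joshua-mem 8g
+</code></pre>
+</div>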
+
+<h2 id="a-idsteps--skipping-steps-quitting-early"><a id="steps"></a> Skipping steps, quitting early</h2>
+
+<p>You will also find it useful to start the pipeline somewhere other than data preparation (for
+example, if you have already-processed data and an alignment, and want to begin with building a
+grammar) or to end it prematurely (if, say, you don’t have a test set and just want to tune a
+model). This can be accomplished with the <code class="highlighter-rouge">--first-step</code> and <code class="highlighter-rouge">--last-step</code> flags, which take as
+argument a case-insensitive version of the following steps:</p>
+
+<ul>
+ <li>
+ <p><em>FIRST</em>: Data preparation. Everything begins with data preparation. This is the default first
+ step, so there is no need to be explicit about it.</p>
+ </li>
+ <li>
+ <p><em>ALIGN</em>: Alignment. You might want to start here if you want to skip data preprocessing.</p>
+ </li>
+ <li>
+ <p><em>PARSE</em>: Parsing. This is only relevant for building SAMT grammars (<code class="highlighter-rouge">--type samt</code>), in which case
+ the target side (<code class="highlighter-rouge">--target</code>) of the training data (<code class="highlighter-rouge">--corpus</code>) is parsed before building a
+ grammar.</p>
+ </li>
+ <li>
+ <p><em>THRAX</em>: Grammar extraction <a href="thrax.html">with Thrax</a>. If you jump to this step, you’ll need to
+ provide an aligned corpus (<code class="highlighter-rouge">--alignment</code>) along with your parallel data. </p>
+ </li>
+ <li>
+ <p><em>TUNE</em>: Tuning. The exact tuning method is determined with <code class="highlighter-rouge">--tuner {mert,mira,pro}</code>. With this
+ option, you need to specify a grammar (<code class="highlighter-rouge">--grammar</code>) or separate tune (<code class="highlighter-rouge">--tune-grammar</code>) and test
+ (<code class="highlighter-rouge">--test-grammar</code>) grammars. A full grammar (<code class="highlighter-rouge">--grammar</code>) will be filtered against the relevant
+ tuning or test set unless you specify <code class="highlighter-rouge">--no-filter-tm</code>. If you want a language model built from
+ the target side of your training data, you’ll also need to pass in the training corpus
+ (<code class="highlighter-rouge">--corpus</code>). You can also specify an arbitrary number of additional language models with one or
+ more <code class="highlighter-rouge">--lmfile</code> flags.</p>
+ </li>
+ <li>
+ <p><em>TEST</em>: Testing. If you have a tuned model file, you can test new corpora by passing in a test
+ corpus with references (<code class="highlighter-rouge">--test</code>). You’ll need to provide a run name (<code class="highlighter-rouge">--name</code>) to store the
+ results of this run, which will be placed under <code class="highlighter-rouge">test/NAME</code>. You’ll also need to provide a
+ Joshua configuration file (<code class="highlighter-rouge">--joshua-config</code>), one or more language models (<code class="highlighter-rouge">--lmfile</code>), and a
+   grammar (<code class="highlighter-rouge">--grammar</code>); this will be filtered to the test data unless you specify
+   <code class="highlighter-rouge">--no-filter-tm</code> or unless you directly provide a filtered test grammar (<code class="highlighter-rouge">--test-grammar</code>).</p>
+ </li>
+ <li>
+ <p><em>LAST</em>: The last step. This is the default target of <code class="highlighter-rouge">--last-step</code>.</p>
+ </li>
+</ul>
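+
+<p>As a concrete sketch, jumping in at grammar extraction with an existing alignment and stopping
+after tuning might look like this (the alignment path is illustrative):</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code>$JOSHUA/bin/pipeline.pl \
+    --first-step thrax \
+    --last-step tune \
+    --no-prepare \
+    --alignment 1/alignments/training.align \
+    --corpus input/train \
+    --tune input/tune \
+    --source ur \
+    --target en
+</code></pre>
+</div>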
+
+<p>We now discuss these steps in more detail.</p>
+
+<h3 id="a-idprep--1-data-preparation"><a id="prep"></a> 1. DATA PREPARATION</h3>
+
+<p>Data preparation involves applying the following steps to each of the training data (<code class="highlighter-rouge">--corpus</code>), tuning data
+(<code class="highlighter-rouge">--tune</code>), and testing data (<code class="highlighter-rouge">--test</code>). Each of these values is an absolute or relative path
+prefix. To each of these prefixes, a “.” is appended, followed by each of SOURCE (<code class="highlighter-rouge">--source</code>) and
+TARGET (<code class="highlighter-rouge">--target</code>), which are file extensions identifying the languages. The SOURCE and TARGET
+files must have the same number of lines. </p>
+
+<p>For tuning and test data, multiple references are handled automatically. A single reference will
+have the format TUNE.TARGET, while multiple references will have the format TUNE.TARGET.NUM, where
+NUM starts at 0 and increments for as many references as there are.</p>
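+
+<p>For example, a tuning set with one source side and four references would be laid out as:</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code>input/
+    tune.ur
+    tune.en.0
+    tune.en.1
+    tune.en.2
+    tune.en.3
+</code></pre>
+</div>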
+
+<p>The following processing steps are applied to each file.</p>
+
+<ol>
+ <li>
+ <p><strong>Copying</strong> the files into <code class="highlighter-rouge">$RUNDIR/data/TYPE</code>, where TYPE is one of “train”, “tune”, or “test”.
+Multiple <code class="highlighter-rouge">--corpora</code> files are concatenated in the order they are specified. Multiple <code class="highlighter-rouge">--tune</code>
+and <code class="highlighter-rouge">--test</code> flags are not currently allowed.</p>
+ </li>
+ <li>
+ <p><strong>Normalizing</strong> punctuation and text (e.g., removing extra spaces, converting special
+quotations). There are a few language-specific options that depend on the file extension
+matching the <a href="http://en.wikipedia.org/wiki/List_of_ISO_639-1_codes">two-letter ISO 639-1</a>
+designation.</p>
+ </li>
+ <li>
+ <p><strong>Tokenizing</strong> the data (e.g., separating out punctuation, converting brackets). Again, there
+are language-specific tokenizations for a few languages (English, German, and Greek).</p>
+ </li>
+ <li>
+ <p>(Training only) <strong>Removing</strong> all parallel sentences with more than <code class="highlighter-rouge">--maxlen</code> tokens on either
+side. By default, MAXLEN is 50. To turn this off, specify <code class="highlighter-rouge">--maxlen 0</code>.</p>
+ </li>
+ <li>
+ <p><strong>Lowercasing</strong>.</p>
+ </li>
+</ol>
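+
+<p>For instance, to keep all training sentence pairs regardless of length, you would disable the
+length filter (only the relevant flag is shown):</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code>$JOSHUA/bin/pipeline.pl \
+    --corpus input/train \
+    --maxlen 0 \
+    ...
+</code></pre>
+</div>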
+
+<p>This creates a series of intermediate files which are saved for posterity but compressed. For
+example, you might see</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code>data/
+ train/
+ train.en.gz
+ train.tok.en.gz
+ train.tok.50.en.gz
+ train.tok.50.lc.en
+ corpus.en -> train.tok.50.lc.en
+</code></pre>
+</div>
+
+<p>The file “corpus.LANG” is a symbolic link to the last file in the chain. </p>
+
+<h2 id="alignment-a-idalignment-">2. ALIGNMENT <a id="alignment"></a></h2>
+
+<p>Alignments are between the parallel corpora at <code class="highlighter-rouge">$RUNDIR/data/train/corpus.{SOURCE,TARGET}</code>. To
+prevent the alignment tables from getting too big, the parallel corpora are grouped into files of no
+more than ALIGNER_CHUNK_SIZE blocks (controlled with a parameter below). The last block is folded
+into the penultimate block if it is too small. These chunked files are all created in a
+subdirectory of <code class="highlighter-rouge">$RUNDIR/data/train/splits</code>, named <code class="highlighter-rouge">corpus.LANG.0</code>, <code class="highlighter-rouge">corpus.LANG.1</code>, and so on.</p>
+
+<p>The pipeline parameters affecting alignment are:</p>
+
+<ul>
+ <li>
+ <p><code class="highlighter-rouge">--aligner ALIGNER</code> {giza (default), berkeley, jacana}</p>
+
+ <p>Which aligner to use. The default is <a href="http://code.google.com/p/giza-pp/">GIZA++</a>, but
+<a href="http://code.google.com/p/berkeleyaligner/">the Berkeley aligner</a> can be used instead. When
+using the Berkeley aligner, you’ll want to pay attention to how much memory you allocate to it
+with <code class="highlighter-rouge">--aligner-mem</code> (the default is 10g).</p>
+ </li>
+ <li>
+ <p><code class="highlighter-rouge">--aligner-chunk-size SIZE</code> (1,000,000)</p>
+
+ <p>The number of sentence pairs to compute alignments over. The training data is split into blocks
+of this size, aligned separately, and then concatenated.</p>
+ </li>
+ <li>
+ <p><code class="highlighter-rouge">--alignment FILE</code></p>
+
+ <p>If you have an already-computed alignment, you can pass that to the script using this flag.
+Note that, in this case, you will want to skip data preparation and alignment using
+<code class="highlighter-rouge">--first-step thrax</code> (the first step after alignment) and also to specify <code class="highlighter-rouge">--no-prepare</code> so
+as not to retokenize the data and mess with your alignments.</p>
+
+ <p>The alignment file format is the standard format where 0-indexed many-many alignment pairs for a
+sentence are provided on a line, source language first, e.g.,</p>
+
+ <p>0-0 0-1 1-2 1-7 …</p>
+
+ <p>This value is required if you start at the grammar extraction step.</p>
+ </li>
+</ul>
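+
+<p>Putting these options together, a hypothetical run that uses the Berkeley aligner with extra
+memory and smaller alignment chunks might be invoked as:</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code>$JOSHUA/bin/pipeline.pl \
+    --aligner berkeley \
+    --aligner-mem 20g \
+    --aligner-chunk-size 500000 \
+    ...
+</code></pre>
+</div>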
+
+<p>When alignment is complete, the alignment file can be found at <code class="highlighter-rouge">$RUNDIR/alignments/training.align</code>.
+It is parallel to the training corpora. There are many files in the <code class="highlighter-rouge">alignments/</code> subdirectory that
+contain the output of intermediate steps.</p>
+
+<h3 id="a-idparsing--3-parsing"><a id="parsing"></a> 3. PARSING</h3>
+
+<p>To build SAMT and GHKM grammars (<code class="highlighter-rouge">--type samt</code> and <code class="highlighter-rouge">--type ghkm</code>), the target side of the
+training data must be parsed. The pipeline assumes your target side will be English, and will parse
+it for you using <a href="http://code.google.com/p/berkeleyparser/">the Berkeley parser</a>, which is included.
+If it is not the case that English is your target-side language, the target side of your training
+data (found at CORPUS.TARGET) must already be parsed in PTB format. The pipeline will notice that
+it is parsed and will not reparse it.</p>
+
+<p>Parsing is affected by both the <code class="highlighter-rouge">--threads N</code> and <code class="highlighter-rouge">--jobs N</code> options. The former runs the parser in
+multithreaded mode, while the latter distributes the runs across a cluster (and requires some
+configuration, not yet documented). The options are mutually exclusive.</p>
+
+<p>Once the parsing is complete, there will be two parsed files:</p>
+
+<ul>
+ <li><code class="highlighter-rouge">$RUNDIR/data/train/corpus.en.parsed</code>: this is the mixed-case file that was parsed.</li>
+ <li><code class="highlighter-rouge">$RUNDIR/data/train/corpus.parsed.en</code>: this is a leaf-lowercased version of the above file used for
+grammar extraction.</li>
+</ul>
+
+<h2 id="thrax-grammar-extraction-a-idtm-">4. THRAX (grammar extraction) <a id="tm"></a></h2>
+
+<p>The grammar extraction step takes three pieces of data: (1) the source-language training corpus, (2)
+the target-language training corpus (parsed, if an SAMT or GHKM grammar is being extracted), and (3) the
+alignment file. From these, it computes a synchronous context-free grammar. If you already have a
+grammar and wish to skip this step, you can do so passing the grammar with the <code class="highlighter-rouge">--grammar
+/path/to/grammar</code> flag.</p>
+
+<p>The main variable in grammar extraction is Hadoop. If you have a Hadoop installation, simply ensure
+that the environment variable <code class="highlighter-rouge">$HADOOP</code> is defined, and Thrax will seamlessly use it. If you <em>do
+not</em> have a Hadoop installation, the pipeline will unroll one for you, running Hadoop in
+standalone mode (this mode is triggered when <code class="highlighter-rouge">$HADOOP</code> is undefined). Theoretically, any grammar
+extractable on a full Hadoop cluster should be extractable in standalone mode, if you are patient
+enough; in practice, you probably are not patient enough, and will be limited to smaller
+datasets. You may also run into problems with disk space; Hadoop uses a lot (use <code class="highlighter-rouge">--tmp
+/path/to/tmp</code> to specify an alternate place for temporary data; we suggest you use a local disk
+partition with tens or hundreds of gigabytes free, and not an NFS partition). Setting up your own
+Hadoop cluster is not too difficult a chore; in particular, you may find it helpful to install a
+<a href="http://hadoop.apache.org/common/docs/r0.20.2/quickstart.html">pseudo-distributed version of Hadoop</a>.
+In our experience, this works fine, but you should note the following caveats:</p>
+
+<ul>
+ <li>It is of crucial importance that you have enough physical disks. We have found that having too
+few, or too slow of disks, results in a whole host of seemingly unrelated issues that are hard to
+resolve, such as timeouts. </li>
+ <li>NFS filesystems can cause lots of problems. You should really try to install physical disks that
+are dedicated to Hadoop scratch space.</li>
+</ul>
+
+<p>Here are some flags relevant to Hadoop and grammar extraction with Thrax:</p>
+
+<ul>
+ <li>
+ <p><code class="highlighter-rouge">--hadoop /path/to/hadoop</code></p>
+
+ <p>This sets the location of Hadoop (overriding the environment variable <code class="highlighter-rouge">$HADOOP</code>)</p>
+ </li>
+ <li>
+ <p><code class="highlighter-rouge">--hadoop-mem MEM</code> (2g)</p>
+
+ <p>This alters the amount of memory available to Hadoop mappers (passed via the
+<code class="highlighter-rouge">mapred.child.java.opts</code> options).</p>
+ </li>
+ <li>
+ <p><code class="highlighter-rouge">--thrax-conf FILE</code></p>
+
+ <p>Use the provided Thrax configuration file instead of the (grammar-specific) default. The Thrax
+ templates are located at <code class="highlighter-rouge">$JOSHUA/scripts/training/templates/thrax-TYPE.conf</code>, where TYPE is one
+ of “hiero” or “samt”.</p>
+ </li>
+</ul>
+
+<p>When the grammar is extracted, it is compressed and placed at <code class="highlighter-rouge">$RUNDIR/grammar.gz</code>.</p>
+
+<h2 id="a-idlm--5-language-model"><a id="lm"></a> 5. Language model</h2>
+
+<p>Before tuning can take place, a language model is needed. A language model is always built from the
+target side of the training corpus unless <code class="highlighter-rouge">--no-corpus-lm</code> is specified. In addition, you can
+provide other language models (any number of them) with the <code class="highlighter-rouge">--lmfile FILE</code> argument. Other
+arguments are as follows.</p>
+
+<ul>
+ <li>
+ <p><code class="highlighter-rouge">--lm</code> {kenlm (default), berkeleylm}</p>
+
+ <p>This determines the language model code that will be used when decoding. These implementations
+are described in their respective papers (PDFs:
+<a href="http://kheafield.com/professional/avenue/kenlm.pdf">KenLM</a>,
+<a href="http://nlp.cs.berkeley.edu/pubs/Pauls-Klein_2011_LM_paper.pdf">BerkeleyLM</a>). KenLM is written in
+C++ and requires a pass through the JNI, but is recommended because it supports left-state minimization.</p>
+ </li>
+ <li>
+ <p><code class="highlighter-rouge">--lmfile FILE</code></p>
+
+ <p>Specifies a pre-built language model to use when decoding. This language model can be in ARPA
+format, or in the binary format of whichever implementation you selected with <code class="highlighter-rouge">--lm</code>
+(KenLM or BerkeleyLM).</p>
+ </li>
+ <li>
+ <p><code class="highlighter-rouge">--lm-gen</code> {kenlm (default), srilm, berkeleylm}, <code class="highlighter-rouge">--buildlm-mem MEM</code>, <code class="highlighter-rouge">--witten-bell</code></p>
+
+ <p>At the tuning step, an LM is built from the target side of the training data (unless
+<code class="highlighter-rouge">--no-corpus-lm</code> is specified). This controls which code is used to build it. The default is
+KenLM’s <a href="http://kheafield.com/code/kenlm/estimation/">lmplz</a>, which is strongly recommended.</p>
+
+ <p>If SRILM is used, it is called with the following arguments:</p>
+
+ <div class="highlighter-rouge"><pre class="highlight"><code> $SRILM/bin/i686-m64/ngram-count -interpolate SMOOTHING -order 5 -text TRAINING-DATA -unk -lm lm.gz
+</code></pre>
+ </div>
+
+ <p>Where SMOOTHING is <code class="highlighter-rouge">-kndiscount</code>, or <code class="highlighter-rouge">-wbdiscount</code> if <code class="highlighter-rouge">--witten-bell</code> is passed to the pipeline.</p>
+
+ <p><a href="http://code.google.com/p/berkeleylm/source/browse/trunk/src/edu/berkeley/nlp/lm/io/MakeKneserNeyArpaFromText.java">BerkeleyLM java class</a>
+is also available. It computes a Kneser-Ney LM with a constant discounting (0.75) and no count
+thresholding. The flag <code class="highlighter-rouge">--buildlm-mem</code> can be used to control how much memory is allocated to the
+Java process. The default is “2g”, but you will want to increase it for larger language models.</p>
+
+ <p>A language model built from the target side of the training data is placed at <code class="highlighter-rouge">$RUNDIR/lm.gz</code>. </p>
+ </li>
+</ul>
+
+<h2 id="interlude-decoder-arguments">Interlude: decoder arguments</h2>
+
+<p>Running the decoder is done in both the tuning stage and the testing stage. A critical point is
+that you have to give the decoder enough memory to run. Joshua can be very memory-intensive, in
+particular when decoding with large grammars and large language models. The default amount of
+memory is 3100m, which is likely not enough (especially if you are decoding with an SAMT grammar). You
+can alter the amount of memory for Joshua using the <code class="highlighter-rouge">--joshua-mem MEM</code> argument, where MEM is a Java
+memory specification (passed to its <code class="highlighter-rouge">-Xmx</code> flag).</p>
+
+<h2 id="a-idtuning--6-tuning"><a id="tuning"></a> 6. TUNING</h2>
+
+<p>Two optimizers are provided with Joshua: MERT and PRO (<code class="highlighter-rouge">--tuner {mert,pro}</code>). If Moses is
+installed, you can also use Cherry & Foster’s k-best batch MIRA (<code class="highlighter-rouge">--tuner mira</code>, recommended).
+Tuning is run until convergence in the <code class="highlighter-rouge">$RUNDIR/tune/N</code> directory, where N is the tuning instance.
+By default, tuning is run just once, but the pipeline supports running the optimizer an arbitrary
+number of times, due to <a href="http://www.youtube.com/watch?v=BOa3XDkgf0Y">recent work</a> pointing out the
+variance of tuning procedures in machine translation (MERT in particular). This can be activated
+with <code class="highlighter-rouge">--optimizer-runs N</code>; each run gets its own numbered directory.</p>
+
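With multiple optimizer runs in hand, the spread of the resulting scores is worth a look; a throwaway awk one-liner suffices (the four scores below are invented for illustration):

```shell
# Mean and standard deviation of BLEU scores from several optimizer
# runs (invented example values; substitute your own scores).
printf '22.1\n23.4\n22.8\n23.0\n' \
  | awk '{ s += $1; ss += $1 * $1; n++ }
         END { m = s / n; printf "mean %.2f stddev %.2f\n", m, sqrt(ss / n - m * m) }'
```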
+<p>When tuning is finished, each final configuration file can be found at</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code>$RUNDIR/tune/N/joshua.config.final
+</code></pre>
+</div>
+
+<p>where N varies from 1..<code class="highlighter-rouge">--optimizer-runs</code>.</p>
+
+<h2 id="a-idtesting--7-testing"><a id="testing"></a> 7. Testing</h2>
+
+<p>For each of the tuner runs, Joshua takes the tuner output file and decodes the test set. If you
+like, you can also apply minimum Bayes-risk decoding to the decoder output with <code class="highlighter-rouge">--mbr</code>. This
+usually yields about 0.3 to 0.5 BLEU points, but is time-consuming.</p>
+
+<p>After decoding the test set with each set of tuned weights, Joshua computes the mean BLEU score,
+writes it to <code class="highlighter-rouge">$RUNDIR/test/final-bleu</code>, and cats it. It also writes a file
+<code class="highlighter-rouge">$RUNDIR/test/final-times</code> containing a summary of runtime information. That’s the end of the pipeline!</p>
+
+<p>Joshua also supports decoding further test sets. This is enabled by rerunning the pipeline with a
+number of arguments:</p>
+
+<ul>
+ <li>
+ <p><code class="highlighter-rouge">--first-step TEST</code></p>
+
+ <p>This tells the pipeline to start at the test step.</p>
+ </li>
+ <li>
+ <p><code class="highlighter-rouge">--name NAME</code></p>
+
+ <p>A name is needed to distinguish this test set from the previous ones. Output for this test run
+will be stored at <code class="highlighter-rouge">$RUNDIR/test/NAME</code>.</p>
+ </li>
+ <li>
+ <p><code class="highlighter-rouge">--joshua-config CONFIG</code></p>
+
+ <p>A tuned parameter file is required. This file will be the output of some prior tuning run.
+Necessary pathnames and so on will be adjusted.</p>
+ </li>
+</ul>
+
+<h2 id="a-idanalysis-8-analysis"><a id="analysis"></a> 8. ANALYSIS</h2>
+
+<p>If you have used the suggested layout, with a number of related runs all contained in a common
+directory with sequential numbers, you can use the script <code class="highlighter-rouge">$JOSHUA/scripts/training/summarize.pl</code> to
+display a summary of the mean BLEU scores from all runs, along with the text you placed in the run
+README file (using the pipeline’s <code class="highlighter-rouge">--readme TEXT</code> flag).</p>
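As a rough sketch of what that summary pulls together, here is the expected run layout mocked up with invented file contents (the real script does more):

```shell
# Mock run group: numbered run directories, each with a README and a
# test/final-bleu file (contents invented for the demo).
cd "$(mktemp -d)"
for n in 1 2 3; do
    mkdir -p "$n/test"
    echo "run $n: baseline with tweak $n" > "$n/README.txt"
done
echo "22.15" > 1/test/final-bleu
echo "23.40" > 2/test/final-bleu
echo "23.02" > 3/test/final-bleu

# Collect run number, score, and README text, one run per line.
for n in 1 2 3; do
    printf '%s\t%s\t%s\n' "$n" "$(cat $n/test/final-bleu)" "$(cat $n/README.txt)"
done
```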
+
+<h2 id="common-use-cases-and-pitfalls">COMMON USE CASES AND PITFALLS</h2>
+
+<ul>
+ <li>
+ <p>If the pipeline dies at the “thrax-run” stage with an error like the following:</p>
+
+ <div class="highlighter-rouge"><pre class="highlight"><code>JOB FAILED (return code 1)
+hadoop/bin/hadoop: line 47:
+/some/path/to/a/directory/hadoop/bin/hadoop-config.sh: No such file or directory
+Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FsShell
+Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.FsShell
+</code></pre>
+ </div>
+
+ <p>This occurs if the <code class="highlighter-rouge">$HADOOP</code> environment variable is set but does not point to a working
+Hadoop installation. To fix it, make sure to unset the variable:</p>
+
+ <div class="highlighter-rouge"><pre class="highlight"><code># in bash
+unset HADOOP
+</code></pre>
+ </div>
+
+ <p>and then rerun the pipeline with the same invocation.</p>
+ </li>
+ <li>
+ <p>Memory usage is a major consideration in decoding with Joshua and hierarchical grammars. In
+particular, SAMT grammars often require a large amount of memory. Many steps have been taken to
+reduce memory usage, including beam settings and test-set- and sentence-level filtering of
+grammars. However, memory usage can still be in the tens of gigabytes.</p>
+
+ <p>To accommodate this kind of variation, the pipeline script allows you to specify both (a) the
+amount of memory used by the Joshua decoder instance and (b) the amount of memory required of
+nodes obtained by the qsub command. These are accomplished with the <code class="highlighter-rouge">--joshua-mem</code> MEM and
+<code class="highlighter-rouge">--qsub-args</code> ARGS commands. For example,</p>
+
+ <div class="highlighter-rouge"><pre class="highlight"><code>pipeline.pl --joshua-mem 32g --qsub-args "-l pvmem=32g -q himem.q" ...
+</code></pre>
+ </div>
+
+ <p>Also, should Thrax fail, it might be due to a memory restriction. By default, Thrax requests 2 GB
+from the Hadoop server. If more memory is needed, set the memory requirement with the
+<code class="highlighter-rouge">--hadoop-mem</code> in the same way as the <code class="highlighter-rouge">--joshua-mem</code> option is used.</p>
+ </li>
+ <li>
+ <p>Other pitfalls and advice will be added as they are discovered.</p>
+ </li>
+</ul>
+
+<h2 id="feedback">FEEDBACK</h2>
+
+<p>Please email joshua_support@googlegroups.com with problems or suggestions.</p>
+
+
+
+ </div>
+ </div>
+ </div> <!-- /container -->
+
+ <!-- Le javascript
+ ================================================== -->
+ <!-- Placed at the end of the document so the pages load faster -->
+ <script src="bootstrap/js/jquery.js"></script>
+ <script src="bootstrap/js/bootstrap-transition.js"></script>
+ <script src="bootstrap/js/bootstrap-alert.js"></script>
+ <script src="bootstrap/js/bootstrap-modal.js"></script>
+ <script src="bootstrap/js/bootstrap-dropdown.js"></script>
+ <script src="bootstrap/js/bootstrap-scrollspy.js"></script>
+ <script src="bootstrap/js/bootstrap-tab.js"></script>
+ <script src="bootstrap/js/bootstrap-tooltip.js"></script>
+ <script src="bootstrap/js/bootstrap-popover.js"></script>
+ <script src="bootstrap/js/bootstrap-button.js"></script>
+ <script src="bootstrap/js/bootstrap-collapse.js"></script>
+ <script src="bootstrap/js/bootstrap-carousel.js"></script>
+ <script src="bootstrap/js/bootstrap-typeahead.js"></script>
+
+ <!-- Start of StatCounter Code for Default Guide -->
+ <script type="text/javascript">
+ var sc_project=8264132;
+ var sc_invisible=1;
+ var sc_security="4b97fe2d";
+ </script>
+ <script type="text/javascript" src="http://www.statcounter.com/counter/counter.js"></script>
+ <noscript>
+ <div class="statcounter">
+ <a title="hit counter joomla"
+ href="http://statcounter.com/joomla/"
+ target="_blank">
+ <img class="statcounter"
+ src="http://c.statcounter.com/8264132/0/4b97fe2d/1/"
+ alt="hit counter joomla" />
+ </a>
+ </div>
+ </noscript>
+ <!-- End of StatCounter Code for Default Guide -->
+
+ </body>
+</html>
http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/53cc3005/5.0/pipeline.md
----------------------------------------------------------------------
diff --git a/5.0/pipeline.md b/5.0/pipeline.md
deleted file mode 100644
index fbe052d..0000000
--- a/5.0/pipeline.md
+++ /dev/null
@@ -1,640 +0,0 @@
----
-layout: default
-category: links
-title: The Joshua Pipeline
----
-
-This page describes the Joshua pipeline script, which manages the complexity of training and
-evaluating machine translation systems. The pipeline eases the pain of two related tasks in
-statistical machine translation (SMT) research:
-
-- Training SMT systems involves a complicated process of interacting steps that are
- time-consuming and prone to failure.
-
-- Developing and testing new techniques requires varying parameters at different points in the
- pipeline. Earlier results (which are often expensive) need not be recomputed.
-
-To facilitate these tasks, the pipeline script:
-
-- Runs the complete SMT pipeline, from corpus normalization and tokenization, through alignment,
- model building, tuning, test-set decoding, and evaluation.
-
-- Caches the results of intermediate steps (using robust SHA-1 checksums on dependencies), so the
- pipeline can be debugged or shared across similar runs while doing away with time spent
- recomputing expensive steps.
-
-- Allows you to jump into and out of the pipeline at a set of predefined places (e.g., the alignment
- stage), so long as you provide the missing dependencies.
-
-The Joshua pipeline script is designed in the spirit of Moses' `train-model.pl`, and shares many of
-its features. It is not as extensive, however, as Moses'
-[Experiment Management System](http://www.statmt.org/moses/?n=FactoredTraining.EMS), which allows
-the user to define arbitrary execution dependency graphs.
-
-## Installation
-
-The pipeline has no *required* external dependencies. However, it has support for a number of
-external packages, some of which are included with Joshua.
-
-- [GIZA++](http://code.google.com/p/giza-pp/) (included)
-
- GIZA++ is the default aligner. It is included with Joshua, and should compile successfully when
- you typed `ant` from the Joshua root directory. It is not required because you can use the
- (included) Berkeley aligner (`--aligner berkeley`). We have recently also provided support
- for the [Jacana-XY aligner](http://code.google.com/p/jacana-xy/wiki/JacanaXY) (`--aligner
- jacana`).
-
-- [Hadoop](http://hadoop.apache.org/) (included)
-
- The pipeline uses the [Thrax grammar extractor](thrax.html), which is built on Hadoop. If you
- have a Hadoop installation, simply ensure that the `$HADOOP` environment variable is defined, and
- the pipeline will use it automatically at the grammar extraction step. If you are going to
- attempt to extract very large grammars, it is best to have a good-sized Hadoop installation.
-
- (If you do not have a Hadoop installation, you might consider setting one up. Hadoop can be
- installed in a
- ["pseudo-distributed"](http://hadoop.apache.org/common/docs/r0.20.2/quickstart.html#PseudoDistributed)
- mode that allows it to use just a few machines or a number of processors on a single machine.
- The main issue is to ensure that there are a lot of independent physical disks, since in our
- experience Hadoop starts to exhibit lots of hard-to-trace problems if there is too much demand on
- the disks.)
-
- If you don't have a Hadoop installation, there are still no worries. The pipeline will unroll a
- standalone installation and use it to extract your grammar. This behavior will be triggered if
- `$HADOOP` is undefined.
-
-- [SRILM](http://www.speech.sri.com/projects/srilm/) (not included)
-
- By default, the pipeline uses a Java program from the
- [Berkeley LM](http://code.google.com/p/berkeleylm/) package that constructs a
- Kneser-Ney-smoothed language model in ARPA format from the target side of your training data. If
- you wish to use SRILM instead, you need to do the following:
-
- 1. Install SRILM and set the `$SRILM` environment variable to point to its installed location.
- 1. Add the `--lm-gen srilm` flag to your pipeline invocation.
-
- More information on this is available in the [LM building section of the pipeline](#lm). SRILM
- is not used for representing language models during decoding (and in fact is not supported,
- having been supplanted by [KenLM](http://kheafield.com/code/kenlm/) (the default) and
- BerkeleyLM).
-
-- [Moses](http://statmt.org/moses/) (not included)
-
-Make sure that the environment variable `$JOSHUA` is defined, and you should be all set.
-
-## A basic pipeline run
-
-The pipeline takes a set of inputs (training, tuning, and test data), and creates a set of
-intermediate files in the *run directory*. By default, the run directory is the current directory,
-but it can be changed with the `--rundir` parameter.
-
-For this quick start, we will be working with the example that can be found in
-`$JOSHUA/examples/pipeline`. This example contains 1,000 sentences of Urdu-English data (the full
-dataset is available as part of the
-[Indian languages parallel corpora](/indian-parallel-corpora/)), along with
-100-sentence tuning and test sets with four references each.
-
-Running the pipeline requires two main steps: data preparation and invocation.
-
-1. Prepare your data. The pipeline script needs to be told where to find the raw training, tuning,
- and test data. A good convention is to place these files in an input/ subdirectory of your run's
- working directory (NOTE: do not use `data/`, since a directory of that name is created and used
- by the pipeline itself for storing processed files). The expected format (for each of training,
- tuning, and test) is a pair of files that share a common path prefix and are distinguished by
- their extension, e.g.,
-
- input/
- train.SOURCE
- train.TARGET
- tune.SOURCE
- tune.TARGET
- test.SOURCE
- test.TARGET
-
- These files should be parallel at the sentence level (with one sentence per line), should be in
- UTF-8, and should be untokenized (tokenization occurs in the pipeline). SOURCE and TARGET denote
- variables that should be replaced with the actual source and target language abbreviations (e.g.,
- "ur" and "en").
-
-1. Run the pipeline. The following is the minimal invocation to run the complete pipeline:
-
- $JOSHUA/bin/pipeline.pl \
- --corpus input/train \
- --tune input/tune \
- --test input/devtest \
- --source SOURCE \
- --target TARGET
-
- The `--corpus`, `--tune`, and `--test` flags define file prefixes that are concatenated with the
- language extensions given by `--target` and `--source` (with a "." in between). Note the
- correspondences with the files defined in the first step above. The prefixes can be either
- absolute or relative pathnames. This particular invocation assumes that a subdirectory `input/`
- exists in the current directory, that you are translating from a language identified by the "ur"
- extension to a language identified by the "en" extension, that the training data can be found at
- `input/train.en` and `input/train.ur`, and so on.
-
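The concatenation is nothing more than prefix + "." + extension, e.g. (values made up):

```shell
# How file-prefix flags combine with language extensions (sketch):
prefix=input/train
source_ext=ur
target_ext=en
echo "${prefix}.${source_ext}"   # prints "input/train.ur"
echo "${prefix}.${target_ext}"   # prints "input/train.en"
```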
-*Don't* run the pipeline directly from `$JOSHUA`. We recommend creating a run directory somewhere
- else to contain all of your experiments. The advantage to this (apart from not clobbering part of
- the Joshua install) is that Joshua provides support scripts for visualizing the results of a
- series of experiments, which only work if your runs are grouped together in a common directory.
-
-Assuming no problems arise, this command will run the complete pipeline in about 20 minutes,
-producing BLEU scores at the end. As it runs, you will see output that looks like the following:
-
- [train-copy-en] rebuilding...
- dep=/Users/post/code/joshua/test/pipeline/input/train.en
- dep=data/train/train.en.gz [NOT FOUND]
- cmd=cat /Users/post/code/joshua/test/pipeline/input/train.en | gzip -9n > data/train/train.en.gz
- took 0 seconds (0s)
- [train-copy-ur] rebuilding...
- dep=/Users/post/code/joshua/test/pipeline/input/train.ur
- dep=data/train/train.ur.gz [NOT FOUND]
- cmd=cat /Users/post/code/joshua/test/pipeline/input/train.ur | gzip -9n > data/train/train.ur.gz
- took 0 seconds (0s)
- ...
-
-And in the current directory, you will see the following files (among other intermediate files
-generated by the individual sub-steps).
-
- data/
- train/
- corpus.ur
- corpus.en
- thrax-input-file
- tune/
- tune.tok.lc.ur
- tune.tok.lc.en
- grammar.filtered.gz
- grammar.glue
- test/
- test.tok.lc.ur
- test.tok.lc.en
- grammar.filtered.gz
- grammar.glue
- alignments/
- 0/
- [giza/berkeley aligner output files]
- training.align
- thrax-hiero.conf
- thrax.log
- grammar.gz
- lm.gz
- tune/
- 1/
- decoder_command
- joshua.config
- params.txt
- joshua.log
- mert.log
- joshua.config.ZMERT.final
- final-bleu
-
-These files will be described in more detail in subsequent sections of this tutorial.
-
-Another useful flag is the `--rundir DIR` flag, which chdir()s to the specified directory before
-running the pipeline. By default the rundir is the current directory. Changing it can be useful
-for organizing related pipeline runs. Relative paths specified to other flags (e.g., to `--corpus`
-or `--lmfile`) are relative to the directory the pipeline was called *from*, not the rundir itself
-(unless they happen to be the same, of course).
-
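A quick illustration of the caveat; this is ordinary relative-path resolution, nothing Joshua-specific:

```shell
# A relative path like input/train.en resolves from the invocation
# directory, not from the rundir:
cd "$(mktemp -d)"
mkdir rundir input
touch input/train.en
[ -f input/train.en ] && echo "resolves from the invocation directory"
( cd rundir && [ ! -f input/train.en ] && echo "but not from inside the rundir" )
```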
-The complete pipeline comprises many tens of small steps, which can be grouped together into a set
-of traditional pipeline tasks:
-
-1. [Data preparation](#prep)
-1. [Alignment](#alignment)
-1. [Parsing](#parsing) (syntax-based grammars only)
-1. [Grammar extraction](#tm)
-1. [Language model building](#lm)
-1. [Tuning](#tuning)
-1. [Testing](#testing)
-1. [Analysis](#analysis)
-
-These steps are discussed below, after a few intervening sections about high-level details of the
-pipeline.
-
-## Managing groups of experiments
-
-The real utility of the pipeline comes when you use it to manage groups of experiments. Typically,
-there is a held-out test set, and we want to vary a number of training parameters to determine what
-effect this has on BLEU scores or some other metric. Joshua comes with a script
-`$JOSHUA/scripts/training/summarize.pl` that collects information from a group of runs and reports
-them to you. This script works so long as you organize your runs as follows:
-
-1. Your runs should be grouped together in a root directory, which I'll call `$RUNDIR`.
-
-2. For comparison purposes, the runs should all be evaluated on the same test set.
-
-3. Each run in the run group should be in its own numbered directory, shown with the files used by
-the summarize script:
-
- $RUNDIR/
- 1/
- README.txt
- test/
- final-bleu
- final-times
- [other files]
- 2/
- README.txt
- ...
-
-You can get such directories using the `--rundir N` flag to the pipeline.
-
-Run directories can build off each other. For example, `1/` might contain a complete baseline
-run. If you wanted to just change the tuner, you don't need to rerun the aligner and model builder,
-so you can reuse the results by supplying the second run with the information it needs that was
-computed in step 1:
-
- $JOSHUA/bin/pipeline.pl \
- --first-step tune \
- --grammar 1/grammar.gz \
- ...
-
-More details are below.
-
-## Grammar options
-
-Joshua can extract three types of grammars: Hiero grammars, GHKM, and SAMT grammars. As described
-on the [file formats page](file-formats.html), all of them are encoded into the same file format,
-but they differ in terms of the richness of their nonterminal sets.
-
-Hiero grammars make use of a single nonterminal, and are extracted by computing phrases from
-word-based alignments and then subtracting out phrase differences. More detail can be found in
-[Chiang (2007) [PDF]](http://www.mitpressjournals.org/doi/abs/10.1162/coli.2007.33.2.201).
-[GHKM](http://www.isi.edu/%7Emarcu/papers/cr_ghkm_naacl04.pdf) (new with 5.0) and
-[SAMT](http://www.cs.cmu.edu/~zollmann/samt/) grammars make use of a source- or target-side parse
-tree on the training data, differing in the way they extract rules using these trees: GHKM extracts
-synchronous tree substitution grammar rules rooted in a subset of the tree constituents, whereas
-SAMT projects constituent labels down onto phrases. SAMT grammars are usually many times larger and
-are much slower to decode with, but sometimes increase BLEU score. Both grammar formats are
-extracted with the [Thrax software](thrax.html).
-
-By default, the Joshua pipeline extracts a Hiero grammar, but this can be altered with the `--type
-(ghkm|samt)` flag. For GHKM grammars, the default is to use
-[Michel Galley's extractor](http://www-nlp.stanford.edu/~mgalley/software/stanford-ghkm-latest.tar.gz),
-but you can also use Moses' extractor with `--ghkm-extractor moses`. Galley's extractor only outputs
-two features, so the scores tend to be significantly lower than those obtained with Moses' extractor.
-
-## Other high-level options
-
-The following command-line arguments control run-time behavior of multiple steps:
-
-- `--threads N` (1)
-
- This enables multithreaded operation for a number of steps: alignment (with GIZA, max two
- threads), parsing, and decoding (any number of threads).
-
-- `--jobs N` (1)
-
- This enables parallel operation over a cluster using the qsub command. This feature is not
- well-documented at this point, but you will likely want to edit the file
- `$JOSHUA/scripts/training/parallelize/LocalConfig.pm` to set up your qsub environment, and may also
- want to pass specific qsub arguments via the `--qsub-args "ARGS"` flag.
-
-## Restarting failed runs
-
-If the pipeline dies, you can restart it with the same command you used the first time. If you
-rerun the pipeline with the exact same invocation as the previous run (or an overlapping
-configuration -- one that causes the same set of behaviors), you will see slightly different
-output compared to what we saw above:
-
- [train-copy-en] cached, skipping...
- [train-copy-ur] cached, skipping...
- ...
-
-This indicates that the caching module has discovered that the step was already computed and thus
-did not need to be rerun. This feature is quite useful for restarting pipeline runs that have
-crashed due to bugs, memory limitations, hardware failures, and the myriad other problems that
-plague MT researchers across the world.
-
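The caching behavior can be sketched in a few lines of shell. This is a toy illustration only: the pipeline's real cache hashes commands and all dependencies with SHA-1, while the sketch tracks a single input file.

```shell
# Toy step cache: redo the "expensive" step only when the input's
# SHA-1 changes (simplified stand-in for the pipeline's cache).
cd "$(mktemp -d)"
echo "some training data" > train.en

run_step() {
    sum=$(sha1sum train.en | cut -d' ' -f1)
    if [ -f step.stamp ] && [ "$(cat step.stamp)" = "$sum" ]; then
        echo "[step] cached, skipping..."
    else
        echo "[step] rebuilding..."
        gzip -c train.en > train.en.gz    # stand-in for the expensive work
        echo "$sum" > step.stamp
    fi
}

run_step    # prints "[step] rebuilding..."
run_step    # prints "[step] cached, skipping..."
```

Editing train.en (or deleting step.stamp) would trigger a rebuild on the next call, which mirrors how changing a parameter a step depends on triggers a rerun.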
-Often, a command will die because it was parameterized incorrectly. For example, perhaps the
-decoder ran out of memory. The caching makes it easy to adjust the offending parameter (e.g., `--joshua-mem`) and rerun
-the script. Of course, if you change one of the parameters a step depends on, it will trigger a
-rerun, which in turn might trigger further downstream reruns.
-
-## <a id="steps" /> Skipping steps, quitting early
-
-You will also find it useful to start the pipeline somewhere other than data preparation (for
-example, if you have already-processed data and an alignment, and want to begin with building a
-grammar) or to end it prematurely (if, say, you don't have a test set and just want to tune a
-model). This can be accomplished with the `--first-step` and `--last-step` flags, which take as
-argument a case-insensitive version of the following steps:
-
-- *FIRST*: Data preparation. Everything begins with data preparation. This is the default first
- step, so there is no need to be explicit about it.
-
-- *ALIGN*: Alignment. You might want to start here if you want to skip data preprocessing.
-
-- *PARSE*: Parsing. This is only relevant for building SAMT grammars (`--type samt`), in which case
- the target side (`--target`) of the training data (`--corpus`) is parsed before building a
- grammar.
-
-- *THRAX*: Grammar extraction [with Thrax](thrax.html). If you jump to this step, you'll need to
- provide an aligned corpus (`--alignment`) along with your parallel data.
-
-- *TUNE*: Tuning. The exact tuning method is determined with `--tuner {mert,mira,pro}`. With this
- option, you need to specify a grammar (`--grammar`) or separate tune (`--tune-grammar`) and test
- (`--test-grammar`) grammars. A full grammar (`--grammar`) will be filtered against the relevant
- tuning or test set unless you specify `--no-filter-tm`. If you want a language model built from
- the target side of your training data, you'll also need to pass in the training corpus
- (`--corpus`). You can also specify an arbitrary number of additional language models with one or
- more `--lmfile` flags.
-
-- *TEST*: Testing. If you have a tuned model file, you can test new corpora by passing in a test
- corpus with references (`--test`). You'll need to provide a run name (`--name`) to store the
- results of this run, which will be placed under `test/NAME`. You'll also need to provide a
- Joshua configuration file (`--joshua-config`), one or more language models (`--lmfile`), and a
- grammar (`--grammar`); this will be filtered to the test data unless you specify
- `--no-filter-tm`) or unless you directly provide a filtered test grammar (`--test-grammar`).
-
-- *LAST*: The last step. This is the default target of `--last-step`.
-
-We now discuss these steps in more detail.
-
-### <a id="prep" /> 1. DATA PREPARATION
-
-Data preparation involves doing the following to each of the training data (`--corpus`), tuning data
-(`--tune`), and testing data (`--test`). Each of these values is an absolute or relative path
-prefix. To each of these prefixes, a "." is appended, followed by each of SOURCE (`--source`) and
-TARGET (`--target`), which are file extensions identifying the languages. The SOURCE and TARGET
-files must have the same number of lines.
-
-For tuning and test data, multiple references are handled automatically. A single reference will
-have the format TUNE.TARGET, while multiple references will have the format TUNE.TARGET.NUM, where
-NUM starts at 0 and increments for as many references as there are.
-
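For example, a tuning set with four English references would be laid out like this (sketch):

```shell
# TUNE.TARGET.NUM naming for four references:
cd "$(mktemp -d)"
for num in 0 1 2 3; do
    printf 'reference %s for each tuning sentence\n' "$num" > "tune.en.$num"
done
ls tune.en.*
```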
-The following processing steps are applied to each file.
-
-1. **Copying** the files into `$RUNDIR/data/TYPE`, where TYPE is one of "train", "tune", or "test".
- Multiple `--corpora` files are concatenated in the order they are specified. Multiple `--tune`
- and `--test` flags are not currently allowed.
-
-1. **Normalizing** punctuation and text (e.g., removing extra spaces, converting special
- quotations). There are a few language-specific options that depend on the file extension
- matching the [two-letter ISO 639-1](http://en.wikipedia.org/wiki/List_of_ISO_639-1_codes)
- designation.
-
-1. **Tokenizing** the data (e.g., separating out punctuation, converting brackets). Again, there
- are language-specific tokenizations for a few languages (English, German, and Greek).
-
-1. (Training only) **Removing** all parallel sentences with more than `--maxlen` tokens on either
- side. By default, MAXLEN is 50. To turn this off, specify `--maxlen 0`.
-
-1. **Lowercasing**.
-
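The length-filtering step above can be pictured with a short sketch (illustrative only; the real pipeline operates on files, not in-memory lists):

```python
def filter_maxlen(pairs, maxlen=50):
    """Drop sentence pairs in which either side has more than maxlen
    tokens; maxlen=0 disables the filter (mirroring --maxlen 0)."""
    if maxlen == 0:
        return list(pairs)
    return [(src, tgt) for src, tgt in pairs
            if len(src.split()) <= maxlen and len(tgt.split()) <= maxlen]
```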
-This creates a series of intermediate files which are saved for posterity but compressed. For
-example, you might see
-
- data/
- train/
- train.en.gz
- train.tok.en.gz
- train.tok.50.en.gz
- train.tok.50.lc.en
- corpus.en -> train.tok.50.lc.en
-
-The file "corpus.LANG" is a symbolic link to the last file in the chain.
-
-## 2. ALIGNMENT <a id="alignment" />
-
-Alignments are between the parallel corpora at `$RUNDIR/data/train/corpus.{SOURCE,TARGET}`. To
-prevent the alignment tables from getting too big, the parallel corpora are grouped into chunks of no
-more than ALIGNER\_CHUNK\_SIZE sentence pairs (controlled with a parameter below). The last chunk is
-folded into the penultimate chunk if it is too small. These chunked files are all created in a
-subdirectory of `$RUNDIR/data/train/splits`, named `corpus.LANG.0`, `corpus.LANG.1`, and so on.
-
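The chunking behavior can be sketched as follows; note that the threshold for "too small" here is illustrative, since the pipeline does not document its exact value:

```python
def chunk_corpus(pairs, chunk_size):
    """Split a list of sentence pairs into chunks of at most chunk_size,
    folding a short final chunk into the penultimate one."""
    chunks = [pairs[i:i + chunk_size] for i in range(0, len(pairs), chunk_size)]
    # "Too small" threshold is illustrative, not the pipeline's actual value.
    if len(chunks) > 1 and len(chunks[-1]) < chunk_size // 2:
        chunks[-2].extend(chunks.pop())
    return chunks
```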
-The pipeline parameters affecting alignment are:
-
-- `--aligner ALIGNER` {giza (default), berkeley, jacana}
-
- Which aligner to use. The default is [GIZA++](http://code.google.com/p/giza-pp/), but
- [the Berkeley aligner](http://code.google.com/p/berkeleyaligner/) can be used instead. When
- using the Berkeley aligner, you'll want to pay attention to how much memory you allocate to it
- with `--aligner-mem` (the default is 10g).
-
-- `--aligner-chunk-size SIZE` (1,000,000)
-
- The number of sentence pairs to compute alignments over. The training data is split into blocks
- of this size, aligned separately, and then concatenated.
-
-- `--alignment FILE`
-
- If you have an already-computed alignment, you can pass that to the script using this flag.
- Note that, in this case, you will want to skip data preparation and alignment using
- `--first-step thrax` (the first step after alignment) and also to specify `--no-prepare` so
- as not to retokenize the data and mess with your alignments.
-
- The alignment file format is the standard format where 0-indexed many-to-many alignment pairs for a
- sentence are provided on a line, source language first, e.g.,
-
- 0-0 0-1 1-2 1-7 ...
-
- This value is required if you start at the grammar extraction step.
-
-When alignment is complete, the alignment file can be found at `$RUNDIR/alignments/training.align`.
-It is parallel to the training corpora. There are many files in the `alignments/` subdirectory that
-contain the output of intermediate steps.
-
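A reader for this alignment format is nearly a one-liner. This sketch (the function name is ours) parses one line into 0-indexed (source, target) index tuples:

```python
def parse_alignment(line):
    """Parse a line like "0-0 0-1 1-2 1-7" into a list of 0-indexed
    (source, target) index pairs."""
    return [tuple(int(i) for i in pair.split("-")) for pair in line.split()]
```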
-## <a id="parsing" /> 3. PARSING
-
-To build SAMT and GHKM grammars (`--type samt` and `--type ghkm`), the target side of the
-training data must be parsed. The pipeline assumes your target side will be English, and will parse
-it for you using [the Berkeley parser](http://code.google.com/p/berkeleyparser/), which is included.
-If English is not your target-side language, the target side of your training
-data (found at CORPUS.TARGET) must already be parsed in PTB format. The pipeline will notice that
-it is parsed and will not reparse it.
-
-Parsing is affected by both the `--threads N` and `--jobs N` options. The former runs the parser in
-multithreaded mode, while the latter distributes the runs across a cluster (and requires some
-configuration, not yet documented). The options are mutually exclusive.
-
-Once the parsing is complete, there will be two parsed files:
-
-- `$RUNDIR/data/train/corpus.en.parsed`: this is the mixed-case file that was parsed.
-- `$RUNDIR/data/train/corpus.parsed.en`: this is a leaf-lowercased version of the above file used for
- grammar extraction.
-
-## 4. THRAX (grammar extraction) <a id="tm" />
-
-The grammar extraction step takes three pieces of data: (1) the source-language training corpus, (2)
-the target-language training corpus (parsed, if an SAMT or GHKM grammar is being extracted), and (3)
-alignment file. From these, it computes a synchronous context-free grammar. If you already have a
-grammar and wish to skip this step, you can do so by passing the grammar with the `--grammar
-/path/to/grammar` flag.
-
-The main variable in grammar extraction is Hadoop. If you have a Hadoop installation, simply ensure
-that the environment variable `$HADOOP` is defined, and Thrax will seamlessly use it. If you *do
-not* have a Hadoop installation, the pipeline will roll one out for you, running Hadoop in
-standalone mode (this mode is triggered when `$HADOOP` is undefined). Theoretically, any grammar
-extractable on a full Hadoop cluster should be extractable in standalone mode, if you are patient
-enough; in practice, you probably are not patient enough, and will be limited to smaller
-datasets. You may also run into problems with disk space; Hadoop uses a lot (use `--tmp
-/path/to/tmp` to specify an alternate place for temporary data; we suggest you use a local disk
-partition with tens or hundreds of gigabytes free, and not an NFS partition). Setting up your own
-Hadoop cluster is not too difficult a chore; in particular, you may find it helpful to install a
-[pseudo-distributed version of Hadoop](http://hadoop.apache.org/common/docs/r0.20.2/quickstart.html).
-In our experience, this works fine, but you should note the following caveats:
-
-- It is of crucial importance that you have enough physical disks. We have found that having too
- few, or too slow of disks, results in a whole host of seemingly unrelated issues that are hard to
- resolve, such as timeouts.
-- NFS filesystems can cause lots of problems. You should really try to install physical disks that
- are dedicated to Hadoop scratch space.
-
-Here are some flags relevant to Hadoop and grammar extraction with Thrax:
-
-- `--hadoop /path/to/hadoop`
-
- This sets the location of Hadoop (overriding the environment variable `$HADOOP`).
-
-- `--hadoop-mem MEM` (2g)
-
- This alters the amount of memory available to Hadoop mappers (passed via the
- `mapred.child.java.opts` option).
-
-- `--thrax-conf FILE`
-
- Use the provided Thrax configuration file instead of the (grammar-specific) default. The Thrax
- templates are located at `$JOSHUA/scripts/training/templates/thrax-TYPE.conf`, where TYPE is one
- of "hiero" or "samt".
-
-When the grammar is extracted, it is compressed and placed at `$RUNDIR/grammar.gz`.
-
-## <a id="lm" /> 5. LANGUAGE MODEL
-
-Before tuning can take place, a language model is needed. A language model is always built from the
-target side of the training corpus unless `--no-corpus-lm` is specified. In addition, you can
-provide other language models (any number of them) with the `--lmfile FILE` argument. Other
-arguments are as follows.
-
-- `--lm` {kenlm (default), berkeleylm}
-
- This determines the language model code that will be used when decoding. These implementations
- are described in their respective papers (PDFs:
- [KenLM](http://kheafield.com/professional/avenue/kenlm.pdf),
- [BerkeleyLM](http://nlp.cs.berkeley.edu/pubs/Pauls-Klein_2011_LM_paper.pdf)). KenLM is written in
- C++ and requires a pass through the JNI, but is recommended because it supports left-state minimization.
-
-- `--lmfile FILE`
-
- Specifies a pre-built language model to use when decoding. This language model can be in ARPA
- format, or in the binary KenLM or BerkeleyLM format matching the implementation selected with `--lm`.
-
-- `--lm-gen` {kenlm (default), srilm, berkeleylm}, `--buildlm-mem MEM`, `--witten-bell`
-
- At the tuning step, an LM is built from the target side of the training data (unless
- `--no-corpus-lm` is specified). This controls which code is used to build it. The default is
- KenLM's [lmplz](http://kheafield.com/code/kenlm/estimation/), which is strongly recommended.
-
- If SRILM is used, it is called with the following arguments:
-
- $SRILM/bin/i686-m64/ngram-count -interpolate SMOOTHING -order 5 -text TRAINING-DATA -unk -lm lm.gz
-
- Where SMOOTHING is `-kndiscount`, or `-wbdiscount` if `--witten-bell` is passed to the pipeline.
-
- A [BerkeleyLM Java class](http://code.google.com/p/berkeleylm/source/browse/trunk/src/edu/berkeley/nlp/lm/io/MakeKneserNeyArpaFromText.java)
- is also available. It computes a Kneser-Ney LM with constant discounting (0.75) and no count
- thresholding. The flag `--buildlm-mem` can be used to control how much memory is allocated to the
- Java process. The default is "2g", but you will want to increase it for larger language models.
-
- A language model built from the target side of the training data is placed at `$RUNDIR/lm.gz`.
-
-## Interlude: decoder arguments
-
-Running the decoder is done in both the tuning stage and the testing stage. A critical point is
-that you have to give the decoder enough memory to run. Joshua can be very memory-intensive, in
-particular when decoding with large grammars and large language models. The default amount of
-memory is 3100m, which is likely not enough (especially if you are decoding with an SAMT grammar). You
-can alter the amount of memory for Joshua using the `--joshua-mem MEM` argument, where MEM is a Java
-memory specification (passed to its `-Xmx` flag).
-
-## <a id="tuning" /> 6. TUNING
-
-Two optimizers are provided with Joshua: MERT and PRO (`--tuner {mert,pro}`). If Moses is
-installed, you can also use Cherry & Foster's k-best batch MIRA (`--tuner mira`, recommended).
-Tuning is run until convergence in the `$RUNDIR/tune/N` directory, where N is the tuning instance.
-By default, tuning is run just once, but the pipeline supports running the optimizer an arbitrary
-number of times (`--optimizer-runs N`), motivated by [recent work](http://www.youtube.com/watch?v=BOa3XDkgf0Y)
-pointing out the variance of tuning procedures in machine translation, in particular MERT.
-
-When tuning is finished, each final configuration file can be found at
-
- $RUNDIR/tune/N/joshua.config.final
-
-where N varies from 1..`--optimizer-runs`.
-
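For scripting across multiple optimizer runs, these locations can be enumerated directly (a sketch; the helper function is ours, not part of the pipeline):

```python
def final_configs(rundir, optimizer_runs):
    """Paths of the tuned Joshua config files, one per optimizer run
    (N ranges over 1..optimizer_runs)."""
    return ["%s/tune/%d/joshua.config.final" % (rundir, n)
            for n in range(1, optimizer_runs + 1)]
```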
-## <a id="testing" /> 7. TESTING
-
-For each of the tuner runs, Joshua takes the tuner output file and decodes the test set. If you
-like, you can also apply minimum Bayes-risk decoding to the decoder output with `--mbr`. This
-usually yields about 0.3 to 0.5 BLEU points, but is time-consuming.
-
-After decoding the test set with each set of tuned weights, Joshua computes the mean BLEU score,
-writes it to `$RUNDIR/test/final-bleu`, and prints it to the terminal. It also writes a file
-`$RUNDIR/test/final-times` containing a summary of runtime information. That's the end of the pipeline!
-
-Joshua also supports decoding further test sets. This is enabled by rerunning the pipeline with a
-number of arguments:
-
-- `--first-step TEST`
-
- This tells the pipeline to start at the test step.
-
-- `--name NAME`
-
- A name is needed to distinguish this test set from the previous ones. Output for this test run
- will be stored at `$RUNDIR/test/NAME`.
-
-- `--joshua-config CONFIG`
-
- A tuned parameter file is required. This file will be the output of some prior tuning run.
- Necessary pathnames and so on will be adjusted.
-
-## <a id="analysis" /> 8. ANALYSIS
-
-If you have used the suggested layout, with a number of related runs all contained in a common
-directory with sequential numbers, you can use the script `$JOSHUA/scripts/training/summarize.pl` to
-display a summary of the mean BLEU scores from all runs, along with the text you placed in the run
-README file (using the pipeline's `--readme TEXT` flag).
-
-## COMMON USE CASES AND PITFALLS
-
-- If the pipeline dies at the "thrax-run" stage with an error like the following:
-
- JOB FAILED (return code 1)
- hadoop/bin/hadoop: line 47:
- /some/path/to/a/directory/hadoop/bin/hadoop-config.sh: No such file or directory
- Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FsShell
- Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.FsShell
-
- This occurs if the `$HADOOP` environment variable is set but does not point to a working
- Hadoop installation. To fix it, make sure to unset the variable:
-
- # in bash
- unset HADOOP
-
- and then rerun the pipeline with the same invocation.
-
-- Memory usage is a major consideration in decoding with Joshua and hierarchical grammars. In
- particular, SAMT grammars often require a large amount of memory. Many steps have been taken to
- reduce memory usage, including beam settings and test-set- and sentence-level filtering of
- grammars. However, memory usage can still be in the tens of gigabytes.
-
- To accommodate this kind of variation, the pipeline script allows you to specify both (a) the
- amount of memory used by the Joshua decoder instance and (b) the amount of memory required of
- nodes obtained by the qsub command. These are accomplished with the `--joshua-mem MEM` and
- `--qsub-args ARGS` options. For example,
-
- pipeline.pl --joshua-mem 32g --qsub-args "-l pvmem=32g -q himem.q" ...
-
- Also, should Thrax fail, it might be due to a memory restriction. By default, Thrax requests 2 GB
- from the Hadoop server. If more memory is needed, set the memory requirement with the
- `--hadoop-mem` option, in the same way as the `--joshua-mem` option is used.
-
-- Other pitfalls and advice will be added as they are discovered.
-
-## FEEDBACK
-
-Please email joshua_support@googlegroups.com with problems or suggestions.
-
http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/53cc3005/5.0/server.html
----------------------------------------------------------------------
diff --git a/5.0/server.html b/5.0/server.html
new file mode 100644
index 0000000..661a8bc
--- /dev/null
+++ b/5.0/server.html
@@ -0,0 +1,196 @@
+<!DOCTYPE html>
+<html lang="en">
+ <head>
+ <meta charset="utf-8">
+ <title>Joshua Documentation | Server mode</title>
+ <meta name="viewport" content="width=device-width, initial-scale=1.0">
+ <meta name="description" content="">
+ <meta name="author" content="">
+
+ <!-- Le styles -->
+ <link href="/bootstrap/css/bootstrap.css" rel="stylesheet">
+ <style>
+ body {
+ padding-top: 60px; /* 60px to make the container go all the way to the bottom of the topbar */
+ }
+ #download {
+ background-color: green;
+ font-size: 14pt;
+ font-weight: bold;
+ text-align: center;
+ color: white;
+ border-radius: 5px;
+ padding: 4px;
+ }
+
+ #download a:link {
+ color: white;
+ }
+
+ #download a:hover {
+ color: lightgrey;
+ }
+
+ #download a:visited {
+ color: white;
+ }
+
+ a.pdf {
+ font-variant: small-caps;
+ /* font-weight: bold; */
+ font-size: 10pt;
+ color: white;
+ background: brown;
+ padding: 2px;
+ }
+
+ a.bibtex {
+ font-variant: small-caps;
+ /* font-weight: bold; */
+ font-size: 10pt;
+ color: white;
+ background: orange;
+ padding: 2px;
+ }
+
+ img.sponsor {
+ height: 120px;
+ margin: 5px;
+ }
+ </style>
+ <link href="bootstrap/css/bootstrap-responsive.css" rel="stylesheet">
+
+ <!-- HTML5 shim, for IE6-8 support of HTML5 elements -->
+ <!--[if lt IE 9]>
+ <script src="bootstrap/js/html5shiv.js"></script>
+ <![endif]-->
+
+ <!-- Fav and touch icons -->
+ <link rel="apple-touch-icon-precomposed" sizes="144x144" href="bootstrap/ico/apple-touch-icon-144-precomposed.png">
+ <link rel="apple-touch-icon-precomposed" sizes="114x114" href="bootstrap/ico/apple-touch-icon-114-precomposed.png">
+ <link rel="apple-touch-icon-precomposed" sizes="72x72" href="bootstrap/ico/apple-touch-icon-72-precomposed.png">
+ <link rel="apple-touch-icon-precomposed" href="bootstrap/ico/apple-touch-icon-57-precomposed.png">
+ <link rel="shortcut icon" href="bootstrap/ico/favicon.png">
+ </head>
+
+ <body>
+
+ <div class="navbar navbar-inverse navbar-fixed-top">
+ <div class="navbar-inner">
+ <div class="container">
+ <button type="button" class="btn btn-navbar" data-toggle="collapse" data-target=".nav-collapse">
+ <span class="icon-bar"></span>
+ <span class="icon-bar"></span>
+ <span class="icon-bar"></span>
+ </button>
+ <a class="brand" href="/">Joshua</a>
+ <div class="nav-collapse collapse">
+ <ul class="nav">
+ <li><a href="index.html">Documentation</a></li>
+ <li><a href="pipeline.html">Pipeline</a></li>
+ <li><a href="tutorial.html">Tutorial</a></li>
+ <li><a href="decoder.html">Decoder</a></li>
+ <li><a href="thrax.html">Thrax</a></li>
+ <li><a href="file-formats.html">File formats</a></li>
+ <!-- <li><a href="advanced.html">Advanced</a></li> -->
+ <li><a href="faq.html">FAQ</a></li>
+ </ul>
+ </div><!--/.nav-collapse -->
+ </div>
+ </div>
+ </div>
+
+ <div class="container">
+
+ <div class="row">
+ <div class="span2">
+ <img src="/images/joshua-logo-small.png"
+ alt="Joshua logo (picture of a Joshua tree)" />
+ </div>
+ <div class="span10">
+ <h1>Joshua Documentation</h1>
+ <h2>Server mode</h2>
+ <span id="download">
+ <a href="http://cs.jhu.edu/~post/files/joshua-v5.0.tgz">Download</a>
+ </span>
+ (version 5.0, released 16 August 2013)
+ </div>
+ </div>
+
+ <hr />
+
+ <div class="row">
+ <div class="span8">
+
+ <p>The Joshua decoder can be run as a TCP/IP server instead of a POSIX-style command-line tool. Clients can concurrently connect to a socket and receive a set of newline-separated outputs for a set of newline-separated inputs.</p>
+
+<p>Threading takes place both within and across requests. Threads from the decoder pool are assigned in a round-robin manner across requests, preventing starvation.</p>
+
+<h1 id="invoking-the-server">Invoking the server</h1>
+
+<p>A running server is configured at invocation time. To start in server mode, run <code class="highlighter-rouge">joshua-decoder</code> with the option <code class="highlighter-rouge">-server-port [PORT]</code>. Additionally, the server can be configured in the same ways as when using the command-line functionality.</p>
+
+<p>E.g.,</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code>$JOSHUA/bin/joshua-decoder -server-port 10101 -mark-oovs false -output-format "%s" -threads 10
+</code></pre>
+</div>
+
+<h2 id="using-the-server">Using the server</h2>
+
+<p>To test that the server is working, you can send it a set of inputs from the command line.</p>
+
+<p>The server, as configured in the example above, will then respond to requests on port 10101. You can test it out with the <code class="highlighter-rouge">nc</code> utility:</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code>wget -qO - http://cs.jhu.edu/~post/files/pg1023.txt | head -132 | tail -11 | nc localhost 10101
+</code></pre>
+</div>
+
+<p>Since no model was loaded, the server will simply echo back the text that was sent to it.</p>
+
+<p>The <code class="highlighter-rouge">-server-port</code> option can also be used when creating a <a href="bundle.html">bundled configuration</a> that will be run in server mode.</p>
+
+
+ </div>
+ </div>
+ </div> <!-- /container -->
+
+ <!-- Le javascript
+ ================================================== -->
+ <!-- Placed at the end of the document so the pages load faster -->
+ <script src="bootstrap/js/jquery.js"></script>
+ <script src="bootstrap/js/bootstrap-transition.js"></script>
+ <script src="bootstrap/js/bootstrap-alert.js"></script>
+ <script src="bootstrap/js/bootstrap-modal.js"></script>
+ <script src="bootstrap/js/bootstrap-dropdown.js"></script>
+ <script src="bootstrap/js/bootstrap-scrollspy.js"></script>
+ <script src="bootstrap/js/bootstrap-tab.js"></script>
+ <script src="bootstrap/js/bootstrap-tooltip.js"></script>
+ <script src="bootstrap/js/bootstrap-popover.js"></script>
+ <script src="bootstrap/js/bootstrap-button.js"></script>
+ <script src="bootstrap/js/bootstrap-collapse.js"></script>
+ <script src="bootstrap/js/bootstrap-carousel.js"></script>
+ <script src="bootstrap/js/bootstrap-typeahead.js"></script>
+
+ <!-- Start of StatCounter Code for Default Guide -->
+ <script type="text/javascript">
+ var sc_project=8264132;
+ var sc_invisible=1;
+ var sc_security="4b97fe2d";
+ </script>
+ <script type="text/javascript" src="http://www.statcounter.com/counter/counter.js"></script>
+ <noscript>
+ <div class="statcounter">
+ <a title="hit counter joomla"
+ href="http://statcounter.com/joomla/"
+ target="_blank">
+ <img class="statcounter"
+ src="http://c.statcounter.com/8264132/0/4b97fe2d/1/"
+ alt="hit counter joomla" />
+ </a>
+ </div>
+ </noscript>
+ <!-- End of StatCounter Code for Default Guide -->
+
+ </body>
+</html>