+          <div class="blog-post">
+            <p>If you just want to use Joshua to translate data, the quickest way is
+to download a <a href="/language-packs/">pre-built model</a>. </p>
+<p>If no language pack is available, or if you have your own parallel
+data that you want to train the translation engine on, then you have
+to build your own model. This takes a bit more knowledge and effort,
+but is made easier with Joshua’s <a href="pipeline.html">pipeline script</a>,
+which runs all the steps of preparing data, aligning it, and
+extracting and tuning component models. </p>
+<p>Detailed information about running the pipeline can be found in
+<a href="/6.0/pipeline.html">the pipeline documentation</a>, but as a quick
+start, you can build a simple Bengali–English model by following
+these instructions.</p>
+<p><em>NOTE: We suggest you build models outside the <code class="highlighter-rouge">$JOSHUA</code> directory</em>.</p>
+<p>First, download the dataset:</p>
+<div class="highlighter-rouge"><pre class="highlight"><code>mkdir -p ~/models/bn-en/
+cd ~/models/bn-en
+wget -q
+tar xzf indian-parallel-corpora-1.0.tar.gz
+ln -s indian-parallel-corpora-1.0 input
+<p>Then, train and test a model:</p>
+<div class="highlighter-rouge"><pre class="highlight"><code>$JOSHUA/bin/ --source bn --target en \
+    --type hiero \
+    --no-prepare --aligner berkeley \
+    --corpus input/bn-en/tok/ \
+    --tune input/bn-en/tok/ \
+    --test input/bn-en/tok/
+<p>This will align the data with the Berkeley aligner, build a Hiero
+model, tune with MERT, decode the test sets, and report results that
+should correspond with what you find on
+<a href="/indian-parallel-corpora/">the Indian Parallel Corpora page</a>. For
+more details, including information on the many options available with
+the pipeline script, please see <a href="pipeline.html">its documentation page</a>.</p>
+<p>Finally, you can export the full model as a language pack:</p>
+<div class="highlighter-rouge"><pre class="highlight"><code>./ \
+  tune/ \
+  language-pack-bn-en \
+  --pack-tm grammar.gz
+<p>(or possibly <code class="highlighter-rouge">tune/1/</code> if you’re using an older version of
+the pipeline).</p>
+<p>This will create a <a href="bundle.html">runnable model</a> in
+<code class="highlighter-rouge">language-pack-bn-en</code>. See the <code class="highlighter-rouge">README</code> file in that directory for
+information on how to run the decoder.</p>
+          <div class="blog-post">
+            <p>The Joshua decoder can be run as a TCP/IP server instead of a POSIX-style command-line tool. Clients can concurrently connect to a socket and receive a set of newline-separated outputs for a set of newline-separated inputs.</p>
+<p>Threading takes place both within and across requests.  Threads from the decoder pool are assigned in round-robin manner across requests, preventing starvation.</p>
+<h1 id="invoking-the-server">Invoking the server</h1>
+<p>A running server is configured at invocation time. To start in server mode, run <code class="highlighter-rouge">joshua-decoder</code> with the option <code class="highlighter-rouge">-server-port [PORT]</code>. The server also accepts the same configuration options as the command-line tool.</p>
+<div class="highlighter-rouge"><pre class="highlight"><code>$JOSHUA/bin/joshua-decoder -server-port 10101 -mark-oovs false -output-format "%s" -threads 10
+<h2 id="using-the-server">Using the server</h2>
+<p>To test that the server is working, a set of inputs can be sent to the server from the command line. </p>
+<p>The server, as configured in the example above, will then respond to requests on port 10101.  You can test it out with the <code class="highlighter-rouge">nc</code> utility:</p>
+<div class="highlighter-rouge"><pre class="highlight"><code>wget -qO - | head -132 | tail -11 | nc localhost 10101
+<p>Since no model was loaded, this will just return the text to you as sent to the server.</p>
+<p>The <code class="highlighter-rouge">-server-port</code> option can also be used when creating a <a href="bundle.html">bundled configuration</a> that will be run in server mode.</p>
+          <div class="blog-post">
+            <p>One day, this will hold Thrax documentation, including how to use Thrax, how to do grammar
+filtering, and details on the configuration file options.  It will also include details about our
+experience setting up and maintaining Hadoop cluster installations, knowledge wrought of hard-fought
+sweat and tears.</p>
+<p>In the meantime, please bother <a href="">Jonny Weese</a> if there is something you
+need to do that you don’t understand.  You might also be able to dig up some information <a href="">on the old
+Thrax page</a>.</p>
+            <h1 id="build-a-translation-model">Build a translation model</h1>
+<p>Extracting a grammar from a large amount of data is a multi-step process. The first requirement is parallel data. The Europarl, Call Home, and Fisher corpora all contain parallel translations of Spanish and English sentences.</p>
+<p>We will copy (or symlink) the parallel source text files in a subdirectory called <code class="highlighter-rouge">input/</code>.</p>
+<p>Then, we concatenate all the training files on each side. The pipeline script normally does tokenization and normalization, but in this instance we have a custom tokenizer we need to apply to the source side, so we have to do it manually and then skip that step using the <code class="highlighter-rouge"></code> option <code class="highlighter-rouge">--first-step alignment</code>.</p>
+  <li>
+    <p>to tokenize the English data, do</p>
+    <div class="highlighter-rouge"><pre class="highlight"><code>cat callhome.en europarl.en fisher.en &gt; all.en | $JOSHUA/scripts/training/ en | $JOSHUA/scripts/training/penn-treebank-tokenizer.perl | $JOSHUA/scripts/lowercase.perl &gt;
+</code></pre></div>
+  </li>
+<p>The same can be done for the Spanish side of the input data:</p>
+<div class="highlighter-rouge"><pre class="highlight"><code>cat &gt; | $JOSHUA/scripts/training/ es | $JOSHUA/scripts/training/penn-treebank-tokenizer.perl | $JOSHUA/scripts/lowercase.perl &gt;
+<p>By the way, an alternative tokenizer is a Twitter tokenizer found in the <a href="">Jerboa</a> project.</p>
+<p>The final step in the training data preparation is to remove all examples in which either of the language sides is a blank line.</p>
+<div class="highlighter-rouge"><pre class="highlight"><code>paste | grep -Pv "^\t|\t$" \
+  | ./
+<p>contents of <code class="highlighter-rouge"></code> by Matt Post:</p>
+<div class="highlighter-rouge"><pre class="highlight"><code><span class="c1">#!/usr/bin/perl</span>
+<span class="c1"># splits on tab, printing respective chunks to the list of files given</span>
+<span class="c1"># as script arguments</span>
+<span class="k">use</span> <span class="nv">FileHandle</span><span class="p">;</span>
+<span class="k">my</span> <span class="nv">@fh</span><span class="p">;</span>
+<span class="vg">$|</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>   <span class="c1"># don't buffer output</span>
+<span class="k">if</span> <span class="p">(</span><span class="nv">@ARGV</span> <span class="o">&lt;</span> <span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
+  <span class="k">print</span> <span class="s">"Usage: &lt; tabbed-file\n"</span><span class="p">;</span>
+  <span class="nb">exit</span><span class="p">;</span>
+<span class="p">}</span>
+<span class="k">my</span> <span class="nv">@fh</span> <span class="o">=</span> <span class="nb">map</span> <span class="p">{</span> <span class="nv">get_filehandle</span><span class="p">(</span><span class="nv">$_</span><span class="p">)</span> <span class="p">}</span> <span class="nv">@ARGV</span><span class="p">;</span>
+<span class="nv">@ARGV</span> <span class="o">=</span> <span class="p">();</span>
+<span class="k">while</span> <span class="p">(</span><span class="k">my</span> <span class="nv">$line</span> <span class="o">=</span> <span class="o">&lt;&gt;</span><span class="p">)</span> <span class="p">{</span>
+  <span class="nb">chomp</span><span class="p">(</span><span class="nv">$line</span><span class="p">);</span>
+  <span class="k">my</span> <span class="p">(</span><span class="nv">@fields</span><span class="p">)</span> <span class="o">=</span> <span class="nb">split</span><span class="p">(</span><span class="sr">/\t/</span><span class="p">,</span><span class="nv">$line</span><span class="p">,</span><span class="nb">scalar</span> <span class="nv">@fh</span><span class="p">);</span>
+  <span class="nb">map</span> <span class="p">{</span> <span class="k">print</span> <span class="p">{</span><span class="nv">$fh</span><span class="p">[</span><span class="nv">$_</span><span class="p">]}</span> <span class="s">"$fields[$_]\n"</span> <span class="p">}</span> <span class="p">(</span><span class="mi">0</span><span class="o">..</span><span class="nv">$#fields</span><span class="p">);</span>
+<span class="p">}</span>
+<span class="k">sub </span><span class="nf">get_filehandle</span> <span class="p">{</span>
+    <span class="k">my</span> <span class="nv">$file</span> <span class="o">=</span> <span class="nb">shift</span><span class="p">;</span>
+    <span class="k">if</span> <span class="p">(</span><span class="nv">$file</span> <span class="ow">eq</span> <span class="s">"-"</span><span class="p">)</span> <span class="p">{</span>
+        <span class="k">return</span> <span class="o">*</span><span class="bp">STDOUT</span><span class="p">;</span>
+    <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
+        <span class="nb">local</span> <span class="o">*</span><span class="nv">FH</span><span class="p">;</span>
+        <span class="nb">open</span> <span class="nv">FH</span><span class="p">,</span> <span class="s">"&gt;$file"</span> <span class="ow">or</span> <span class="nb">die</span> <span class="s">"can't open '$file' for writing"</span><span class="p">;</span>
+        <span class="k">return</span> <span class="o">*</span><span class="nv">FH</span><span class="p">;</span>
+    <span class="p">}</span>
+<span class="p">}</span>
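+<p>The filter-and-split step above can also be sketched with <code class="highlighter-rouge">paste</code> and <code class="highlighter-rouge">awk</code> alone. This is only an illustration: the function name <code class="highlighter-rouge">drop_blank_pairs</code> and its argument order are hypothetical, and it assumes that neither side of the corpus contains tab characters.</p>

```shell
# Hypothetical helper (not part of Joshua): keep only the sentence pairs
# in which both sides are non-empty, writing each side to its own file.
# Usage: drop_blank_pairs SRC TGT OUT_SRC OUT_TGT
drop_blank_pairs() {
  local src=$1 tgt=$2 out_src=$3 out_tgt=$4

  # Make sure both output files exist even if no pair survives.
  : > "$out_src"
  : > "$out_tgt"

  # paste joins line i of SRC and TGT with a tab; awk drops any row
  # whose first or second field is empty and splits the rest again.
  paste "$src" "$tgt" | awk -F'\t' -v s="$out_src" -v t="$out_tgt" '
    $1 != "" && $2 != "" { print $1 > s; print $2 > t }'
}
```

+<p>Like the <code class="highlighter-rouge">grep</code> pattern it mirrors, this treats only truly empty lines as blank; a line holding nothing but spaces is kept.</p>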
+<p>Now we can run the pipeline to extract the grammar. Run the following script:</p>
+<div class="highlighter-rouge"><pre class="highlight"><code><span class="c">#!/bin/bash</span>
+<span class="c"># this creates a grammar</span>
+<span class="c"># NEED:</span>
+<span class="c"># pair</span>
+<span class="c"># type</span>
+<span class="nb">set</span> -u
+<span class="nv">pair</span><span class="o">=</span>es-en
+<span class="nb">type</span><span class="o">=</span>hiero
+<span class="c">#. ~/.bashrc</span>
+<span class="c">#basedir=$(pwd)</span>
+<span class="nv">dir</span><span class="o">=</span>grammar-<span class="nv">$pair</span>-<span class="nv">$type</span>
+<span class="o">[[</span> ! -d <span class="nv">$dir</span> <span class="o">]]</span> <span class="o">&amp;&amp;</span> mkdir -p <span class="nv">$dir</span>
+<span class="nb">cd</span> <span class="nv">$dir</span>
+<span class="nb">source</span><span class="o">=</span><span class="k">$(</span><span class="nb">echo</span> <span class="nv">$pair</span> | cut -d- -f 1<span class="k">)</span>
+<span class="nv">target</span><span class="o">=</span><span class="k">$(</span><span class="nb">echo</span> <span class="nv">$pair</span> | cut -d- -f 2<span class="k">)</span>
+<span class="nv">$JOSHUA</span>/scripts/training/ <span class="se">\</span>
+  --source <span class="nv">$source</span> <span class="se">\</span>
+  --target <span class="nv">$target</span> <span class="se">\</span>
+  --corpus /home/hltcoe/lorland/expts/scale12/model1/input/ <span class="se">\</span>
+  --type <span class="nv">$type</span> <span class="se">\</span>
+  --joshua-mem 100g <span class="se">\</span>
+  --no-prepare <span class="se">\</span>
+  --first-step align <span class="se">\</span>
+  --last-step thrax <span class="se">\</span>
+  --hadoop <span class="nv">$HADOOP</span> <span class="se">\</span>
+  --threads 8 <span class="se">\</span>
+            <p>This document will walk you through using the pipeline in a variety of scenarios. Once you’ve gained a
+sense for how the pipeline works, you can consult the <a href="pipeline.html">pipeline page</a> for a number of
+other options available in the pipeline.</p>
+<h2 id="download-and-setup">Download and Setup</h2>
+<p>Download and install Joshua as described on the <a href="index.html">quick start page</a>, installing it under
+<code class="highlighter-rouge">~/code/</code>. Once you’ve done that, make sure you have the following environment variables set:</p>
+<div class="highlighter-rouge"><pre class="highlight"><code>export JOSHUA=$HOME/code/joshua-v6.0.5
+export JAVA_HOME=/usr/java/default
+<p>If you have a Hadoop installation, make sure you’ve set <code class="highlighter-rouge">$HADOOP</code> to point to it. For example, if the <code class="highlighter-rouge">hadoop</code> command is in <code class="highlighter-rouge">/usr/bin</code>,
+you should type</p>
+<div class="highlighter-rouge"><pre class="highlight"><code>export HADOOP=/usr
+<p>Joshua will find the binary and use it to submit jobs to your Hadoop cluster. If you don’t have one, just
+make sure that <code class="highlighter-rouge">$HADOOP</code> is unset, and Joshua will roll one out for you and run it in
+<a href="">standalone mode</a>. </p>
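+<p>Before starting a long run, it can help to confirm which of these two cases applies. The sketch below is a hypothetical helper, not something shipped with Joshua. It assumes that the binary lives at <code class="highlighter-rouge">$HADOOP/bin/hadoop</code>, as in the <code class="highlighter-rouge">/usr</code> example above, and it treats an empty <code class="highlighter-rouge">$HADOOP</code> the same as an unset one.</p>

```shell
# Hypothetical helper (not part of Joshua): report how $HADOOP is set up.
# Assumption: the hadoop binary is expected at $HADOOP/bin/hadoop.
hadoop_mode() {
  if [ -z "${HADOOP:-}" ]; then
    # Unset (or empty, by assumption): Joshua falls back to standalone mode.
    echo "standalone"
  elif [ -x "$HADOOP/bin/hadoop" ]; then
    # An executable was found where the pipeline would look for it.
    echo "cluster"
  else
    # $HADOOP is set but does not point at a usable installation.
    echo "misconfigured"
    return 1
  fi
}
```

+<p>Calling <code class="highlighter-rouge">hadoop_mode</code> prints one of the three words, so it can be used in a guard at the top of an experiment script.</p>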
+<h2 id="a-basic-pipeline-run">A basic pipeline run</h2>
+<p>For today’s experiments, we’ll be building a Spanish–English system using data included in the
+<a href="/data/fisher-callhome-corpus/">Fisher and CALLHOME translation corpus</a>. This
+data was collected by translating transcribed speech from previous LDC releases.</p>
+<p>Download the data and install it somewhere:</p>
+<div class="highlighter-rouge"><pre class="highlight"><code>cd ~/data
+wget --no-check -O
+<p>Then define the environment variable <code class="highlighter-rouge">$FISHER</code> to point to it:</p>
+<div class="highlighter-rouge"><pre class="highlight"><code>cd ~/data/fisher-callhome-corpus-master
+export FISHER=$(pwd)
+<h3 id="preparing-the-data">Preparing the data</h3>
+<p>Inside the tarball is the Fisher and CALLHOME Spanish–English data, which includes Kaldi-provided
+ASR output and English translations on the Fisher and CALLHOME  dataset transcriptions. Because of
+licensing restrictions, we cannot distribute the Spanish transcripts, but if you have an LDC site
+license, a script is provided to build them. You can type:</p>
+<div class="highlighter-rouge"><pre class="highlight"><code>./bin/ /export/common/data/corpora/LDC/LDC2010T04
+<p>The first argument is the path to your LDC data release. This will create the files in <code class="highlighter-rouge">corpus/ldc</code>.</p>
+<p>In <code class="highlighter-rouge">$FISHER/corpus</code>, there is a set of parallel directories for LDC transcripts (<code class="highlighter-rouge">ldc</code>), ASR output
+(<code class="highlighter-rouge">asr</code>), oracle ASR output (<code class="highlighter-rouge">oracle</code>), and ASR lattice output (<code class="highlighter-rouge">plf</code>). The files look like this:</p>
+<div class="highlighter-rouge"><pre class="highlight"><code>$ ls corpus/ldc
+callhome_devtest.en  fisher_dev2.en.2  fisher_dev.en.2   fisher_test.en.2
+callhome_evltest.en  fisher_dev2.en.3  fisher_dev.en.3   fisher_test.en.3
+fisher_dev2.en.0     fisher_dev.en.0   fisher_test.en.0  fisher_train.en
+fisher_dev2.en.1     fisher_dev.en.1   fisher_test.en.1
+<p>If you don’t have the LDC transcripts, you can use the data in <code class="highlighter-rouge">corpus/asr</code> instead. We will now use
+this data to build our own Spanish–English model using Joshua’s pipeline.</p>
+<h3 id="run-the-pipeline">Run the pipeline</h3>
+<p>Create an experiments directory to contain your first experiment. <em>Note: it’s important that
+this <strong>not</strong> be inside your <code class="highlighter-rouge">$JOSHUA</code> directory</em>.</p>
+<div class="highlighter-rouge"><pre class="highlight"><code>mkdir -p ~/expts/joshua
+cd ~/expts/joshua
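+<p>Because the experiments directory must not sit inside <code class="highlighter-rouge">$JOSHUA</code>, a small guard can catch the mistake early. The function below is a hypothetical sketch rather than part of Joshua; it compares resolved physical paths and assumes both directories already exist.</p>

```shell
# Hypothetical helper (not part of Joshua): succeed (exit 0) only if the
# given directory is NOT $JOSHUA itself or anything underneath it.
# Returns 1 if it is inside $JOSHUA, 2 if a path cannot be resolved.
outside_joshua() {
  local dir joshua
  # Resolve symlinks so the two paths are comparable.
  dir=$(cd "${1:-.}" 2>/dev/null && pwd -P) || return 2
  joshua=$(cd "${JOSHUA:?JOSHUA is not set}" 2>/dev/null && pwd -P) || return 2

  # Appending a slash makes $JOSHUA itself match the pattern too.
  case "$dir/" in
    "$joshua"/*) return 1 ;;
    *)           return 0 ;;
  esac
}
```

+<p>For example, <code class="highlighter-rouge">outside_joshua ~/expts/joshua || echo "pick another directory"</code> prints the warning only when the directory is unsuitable.</p>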
+<p>We will now create the baseline run, using a particular directory structure for experiments that
+will allow us to take advantage of scripts provided with Joshua for displaying the results of many
+related experiments. Because this can take quite some time to run, we are going to shrink the model
+considerably by restricting the training data: Joshua will only use sentences in the training sets with ten or fewer words on either
+side (Spanish or English):</p>
+<div class="highlighter-rouge"><pre class="highlight"><code>cd ~/expts/joshua
+$JOSHUA/bin/           \
+  --rundir 1                      \
+  --readme "Baseline Hiero run"   \
+  --source es                     \
+  --target en                     \
+  --type hiero                    \
+  --corpus $FISHER/corpus/ldc/fisher_train \
+  --tune $FISHER/corpus/ldc/fisher_dev \
+  --test $FISHER/corpus/ldc/fisher_dev2 \
+  --maxlen 10 \
+  --lm-order 3
+<p>This will start the pipeline building a Spanish–English translation system constructed from the
+training data, tuned against <code class="highlighter-rouge">fisher_dev</code>, and tested against <code class="highlighter-rouge">fisher_dev2</code>. It will use the
+default values for most of the pipeline: <a href="">GIZA++</a> for alignment,
+KenLM’s <code class="highlighter-rouge">lmplz</code> for building the language model, Z-MERT for tuning, KenLM with left-state
+minimization for representing LM state in the decoder, and so on. We change the order of the n-gram
+model to 3 (from its default of 5) because there is not enough data to build a 5-gram LM.</p>
+<p>A few notes:</p>
+  <li>
+    <p>This will likely take many hours to run, especially if you don’t have a Hadoop cluster.</p>
+  </li>
+  <li>
+    <p>If you are running on Mac OS X, KenLM’s <code class="highlighter-rouge">lmplz</code> will not build due to the absence of static
+libraries. In that case, you should add the flag <code class="highlighter-rouge">--lm-gen srilm</code> (recommended, if SRILM is
+installed) or <code class="highlighter-rouge">--lm-gen berkeleylm</code>.</p>
+  </li>
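+<p>To get a feel for how much training data a length limit such as <code class="highlighter-rouge">--maxlen 10</code> leaves, the following sketch counts the sentence pairs that would survive. It is an approximation under stated assumptions: the function name <code class="highlighter-rouge">count_short_pairs</code> is hypothetical, words are taken to be whitespace-separated tokens, and the pipeline’s own counting may differ in detail.</p>

```shell
# Hypothetical helper (not part of Joshua): count the sentence pairs in
# which BOTH sides have between 1 and MAXLEN whitespace-separated tokens.
# Usage: count_short_pairs SRC TGT [MAXLEN]   (MAXLEN defaults to 10)
count_short_pairs() {
  local src=$1 tgt=$2 maxlen=${3:-10}

  # paste joins line i of each file with a tab; awk then counts tokens
  # on each side. Assumes neither side contains tab characters.
  paste "$src" "$tgt" | awk -F'\t' -v max="$maxlen" '
    {
      ns = split($1, s, " ")   # token count, source side
      nt = split($2, t, " ")   # token count, target side
      if (ns >= 1 && ns <= max && nt >= 1 && nt <= max) kept++
    }
    END { print kept + 0 }'
}
```

+<p>Running it on the tokenized training files with different limits shows how quickly the corpus shrinks as the limit is lowered.</p>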
+<h3 id="variations">Variations</h3>
+<p>Once that is finished, you will have a baseline model. From there, you might wish to try variations
+of the baseline model. Here are some examples of what you could vary:</p>
+  <li>
+    <p>Build an SAMT model (<code class="highlighter-rouge">--type samt</code>), GHKM model (<code class="highlighter-rouge">--type ghkm</code>), or phrasal ITG model (<code class="highlighter-rouge">--type phrasal</code>) </p>
+  </li>
+  <li>
+    <p>Use the Berkeley aligner instead of GIZA++ (<code class="highlighter-rouge">--aligner berkeley</code>)</p>
+  </li>
+  <li>
+    <p>Build the language model with BerkeleyLM (<code class="highlighter-rouge">--lm-gen berkeleylm</code>) instead of KenLM (the default)</p>
+  </li>
+  <li>
+    <p>Change the order of the LM from the default of 5 (<code class="highlighter-rouge">--lm-order 4</code>)</p>
+  </li>
+  <li>
+    <p>Tune with MIRA instead of MERT (<code class="highlighter-rouge">--tuner mira</code>). This requires that Moses is installed.</p>
+  </li>
+  <li>
+    <p>Decode with a wider beam (<code class="highlighter-rouge">--joshua-args '-pop-limit 200'</code>) (the default is 100)</p>
+  </li>
+  <li>
+    <p>Add the provided BN-EN dictionary to the training data (add another <code class="highlighter-rouge">--corpus</code> line, e.g., <code class="highlighter-rouge">--corpus $FISHER/bn-en/</code>)</p>
+  </li>
+<p>To do this, we will create new runs that partially reuse the results of previous runs. This is
+possible by doing three things: (1) incrementing the run directory and providing an updated README
+note; (2) telling the pipeline which of its many steps to begin at; and (3)
+providing the needed dependencies.</p>
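+<p>Since runs live in numbered subdirectories (1, 2, 3, and so on), picking the next number can be automated. The helper below is a hypothetical convenience, not a Joshua script; it assumes run directories are named with plain decimal integers without leading zeros.</p>

```shell
# Hypothetical helper (not part of Joshua): print the next unused run
# number, given a directory whose numbered subdirectories are past runs.
# Usage: next_rundir [EXPERIMENT_DIR]   (defaults to the current directory)
next_rundir() {
  local base=${1:-.} max=0 d n
  for d in "$base"/*/; do
    n=$(basename "$d")
    # Skip anything that is not purely digits (including an unmatched glob).
    case $n in
      ''|*[!0-9]*) continue ;;
    esac
    if [ "$n" -gt "$max" ]; then
      max=$n
    fi
  done
  echo $((max + 1))
}
```

+<p>The result can be passed straight to the pipeline, for example as <code class="highlighter-rouge">--rundir $(next_rundir ~/expts/joshua)</code>.</p>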
+<h1 id="a-second-run">A second run</h1>
+<p>Let’s begin by changing the tuner, to see what effect that has. To do so, we change the run
+directory, tell the pipeline to start at the tuning step, and provide the needed dependencies:</p>
+<div class="highlighter-rouge"><pre class="highlight"><code>$JOSHUA/bin/           \
+  --rundir 2                      \
+  --readme "Tuning with MIRA"     \
+  --source bn                     \
+  --target en                     \
+  --corpus $FISHER/bn-en/tok/ \
+  --tune $FISHER/bn-en/tok/        \
+  --test $FISHER/bn-en/tok/    \
+  --first-step tune \
+  --tuner mira \
+  --grammar 1/grammar.gz \
+  --no-corpus-lm \
+  --lmfile 1/lm.gz
+<p>Here, we have essentially the same invocation, but we have told the pipeline to use a different
+ tuner (MIRA), to start with tuning, and have provided it with the language model file and grammar it needs
+ to execute the tuning step. </p>
+<p>Note that we have also told it not to build a language model. This is necessary because the
+ pipeline always builds an LM on the target side of the training data, if provided, but we are
+ supplying the language model that was already built. We could equivalently have removed the
+ <code class="highlighter-rouge">--corpus</code> line.</p>
+<h2 id="changing-the-model-type">Changing the model type</h2>
+<p>Let’s compare the Hiero model we’ve already built to an SAMT model. We have to re-extract the
+grammar, but can reuse the alignments and the language model:</p>
+<div class="highlighter-rouge"><pre class="highlight"><code>$JOSHUA/bin/           \
+  --rundir 3                      \
+  --readme "Baseline SAMT model"  \
+  --source bn                     \
+  --target en                     \
+  --corpus $FISHER/bn-en/tok/ \
+  --tune $FISHER/bn-en/tok/        \
+  --test $FISHER/bn-en/tok/    \
+  --alignment 1/alignments/training.align   \
+  --first-step parse \
+  --no-corpus-lm \
+  --lmfile 1/lm.gz
+<p>See <a href="pipeline.html#steps">the pipeline script page</a> for a list of all the steps.</p>
+<h2 id="analyzing-the-results">Analyzing the results</h2>
+<p>We now have three runs, in subdirectories 1, 2, and 3. We can display summary results from them
+using the <code class="highlighter-rouge">$JOSHUA/scripts/training/</code> script.</p>
+            <p>Joshua 6.0 introduces a number of new features and improvements.</p>
+  <li>A new phrase-based decoder that is as fast as Moses</li>
+  <li>Significantly faster hierarchical decoding</li>
+  <li>Support for class-based language modeling</li>
+  <li>Reflection-based loading of feature functions for super-easy
+development of new features</li>
