Posted to commits@joshua.apache.org by le...@apache.org on 2016/04/05 07:13:05 UTC

[17/18] incubator-joshua-site git commit: Initial import of joshua-decoder.github.com site to Apache

http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/4.0/step-by-step-instructions.html
----------------------------------------------------------------------
diff --git a/4.0/step-by-step-instructions.html b/4.0/step-by-step-instructions.html
new file mode 100644
index 0000000..ee004c2
--- /dev/null
+++ b/4.0/step-by-step-instructions.html
@@ -0,0 +1,908 @@
+---
+layout: default
+category: links
+title: Installing and running the Joshua Decoder
+---
+
+<!-- begin header -->
+<h2><a href="http://cs.jhu.edu/~ccb/">by Chris Callison-Burch</a> <br/>(Released: January 17, 2012)</h2>
+
+<div class="warning">
+  <p>
+    Note: this walkthrough describes an older version of Joshua that is different in some ways from
+    the current version.  Some of these differences are: SRILM support has been removed, the
+    Berkeley aligner is now included in <code>$JOSHUA/lib</code> (and therefore doesn't need to be
+    installed separately), and there is no need for you to download the developer version of Joshua
+    if you only want to use the software and not contribute to it.  Please refer to
+    the <a href="pipeline.html">pipeline documentation for the 4.0 release</a>. The pipeline
+    automates these steps.
+  </p>
+</div>
+
+<p>This web page gives instructions on how to install and use the Joshua decoder.  Joshua is an open-source decoder for parsing-based machine translation.  Joshua uses the synchronous context free grammar (SCFG) formalism in its approach to statistical machine translation, and the software implements the algorithms that underlie the approach.</p>
+
+<a name="steps" />
+<p>These instructions will tell you how to:
+<ol>
+<li> <a href="#step1">Install the software</a></li>
+<li> <a href="#step2">Prepare your data</a></li>
+<li> <a href="#step3">Create word alignments</a> </li>
+<li> <a href="#step4">Train a language model</a> </li>
+<li> <a href="#step5">Extract a translation grammar</a> </li>
+<li> <a href="#step6">Run minimum error rate training</a> </li>
+<li> <a href="#step7">Decode a test set</a></li>
+<li> <a href="#step8">Recase the translations</a></li>
+<li> <a href="#step9">Score the translations</a></li>
+</ol>
+</p>
+
+<p>If you use Joshua in your work, please cite this paper:</p>
+<p>Jonathan Weese, Juri Ganitkevitch, Chris Callison-Burch, Matt Post and Adam Lopez, 2011. <a href="publications/joshua-3.0.pdf">Joshua 3.0: Syntax-based Machine Translation with the Thrax Grammar Extractor</a>. In Proceedings of the Workshop on Statistical Machine Translation (WMT11). <a href="publications/joshua-3.0.pdf">[pdf]</a> <a href ="joshua.bib">[bib]</a>
+</p>
+
+<p>
+These instructions apply to <a href = "https://github.com/joshua-decoder/joshua/tags">Release 3.1 of Joshua</a>, which is described in our WMT11 paper. You can also get the latest version of the Joshua software from the repository with the command:
+</p>
+<pre>
+git clone https://github.com/joshua-decoder/joshua.git
+</pre>
+
+<a name="step1">
+<h1>Step 1: Install the software</h1>
+
+<h3>Prerequisites</h3>
+
+<p>The Joshua decoder is written in Java.  You'll need to install a few software development tools before you install it:</p>
+<ul>
+<li> <a href="http://git-scm.com/download">git</a> - git is the version control system that we use for managing the Joshua codebase. </li>
+<li> <a href="http://ant.apache.org/bindownload.cgi">Apache Ant</a> - ant is a tool for compiling Java code which has  similar functionality to make. </li>
+</ul>
+<p>
+Before installing these, you can check whether they're already on your system by typing <code>which git</code> and <code>which ant</code>.
+</p>
+
+
+<p>In addition to these software development tools, you will also need to download: </p>
+<ul>
+<li> <a href = "http://www.speech.sri.com/projects/srilm/download.html">The SRI language modeling toolkit</a> - srilm is a widely used toolkit for building n-gram language models, which are an important component in the translation process.</li>
+<li> <a href = "http://code.google.com/p/berkeleyaligner/">The Berkeley Aligner</a> - this software is used to align words across sentence pairs in a bilingual parallel corpus.  Word alignment takes place before extracting an SCFG.
+</ul>
+
+<p>
+After you have downloaded the srilm tar file, type the following commands to install it:
+</p>
+<pre>
+mkdir srilm 
+mv srilm.tgz srilm/ 
+cd srilm/ 
+tar xfz srilm.tgz 
+make 
+</pre>
+<p>If the build fails, please follow the instructions in SRILM's INSTALL file.  For instance, if SRILM's Makefile does not detect that you're running 64-bit Linux, you might have to run "make MACHINE_TYPE=i686-m64 World".</p>
+<p>After you successfully compile SRILM, Joshua will need to know what directory it is in.  You can type <code>pwd</code>
+to get the absolute path to the <code>srilm/</code> directory that you created.  Once you've figured out the path, set an <code>SRILM</code> environment variable by typing:</p>
+<pre>
+export SRILM="<b>/path/to/srilm</b>"
+</pre>
+<p>Replace "<b>/path/to/srilm</b>" with your actual path.  You'll also need to set a <code>JAVA_HOME</code> environment variable.  For Mac OS X this is usually done by typing:</p>
+<pre>
+export JAVA_HOME="<b>/Library/Java/Home</b>"
+</pre>
+<p>These variables will need to be set every time you use Joshua, so it's useful to add them to your .bashrc, .bash_profile or .profile file.</p>
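+
+<p>For example, you might add lines like the following to your <code>.bashrc</code> (the paths below are placeholders; substitute your own):</p>
+<pre>
+# example .bashrc additions -- adjust the paths for your machine
+export SRILM="$HOME/tools/srilm"
+export JAVA_HOME="/Library/Java/Home"
+</pre>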
+
+
+<h3>Download and Install Joshua</h3>
+
+<p>First, download the <a href = "https://github.com/joshua-decoder/joshua/tarball/v3.1.1">Joshua release 3.1.1 tar file</a>.  Next, type the following commands to untar the file and compile the Java classes: </p>
+
+<pre>
+tar xfz joshua-decoder-joshua-v3.1.1-0-g1a0e6b6.tar.gz
+cd joshua-decoder-joshua-735224e
+ant
+</pre>
+
+<p>Running <code>ant</code> will compile the Java classes and link in srilm.  If everything works properly, you should see the message <b>BUILD SUCCESSFUL</b>.  If you get a BUILD FAILED message, it may be because you have not properly set the paths to SRILM and JAVA_HOME, or because srilm was not compiled properly, as described above.</p>
+
+<p>For the examples in this document, you will need to set a <code>JOSHUA</code> environment variable:
+<pre>
+export JOSHUA="<b>/path/to/joshua</b>"
+</pre>
+
+
+<h3>Run the example model</h3>
+
+<p>
+To make sure that the decoder is installed properly, we'll translate 5 sentences using a small translation model that loads quickly.  The sentences that we will translate are contained in <code>example/example.test.in</code>
+</p>
+
+<pre>
+科学家 为 攸关 初期 失智症 的 染色体 完成 定序<br>
+( 法新社 巴黎 二日 电 ) 国际 间 的 一 群 科学家 表示 , 他们 已 为 人类 第十四 对 染色体 完成 定序 , 这 对 染色体 与 许多 疾病 有关 , 包括 三十几 岁 者 可能 罹患 的 初期 阿耳滋海默氏症 。<br>
+这 是 到 目前 为止 完成 定序 的 第四 对 染色体 , 它 由 八千七百多万 对 去氧 核糖核酸 ( dna ) 组成 。<br>
+英国 自然 科学 周刊 发表 的 这 项 研究 显示 , 第十四 对 染色体 排序 由 一千零五十 个 基因 和 基因 片段 构成 。<br>
+基因 科学家 的 目标 是 , 提供 诊断 工具 以 发现 致病 的 缺陷 基因 , 终而 提供 可 阻止 这些 基因 产生 障碍 的 疗法 。
+</pre>
+
+<p>
+The small translation grammar contains 15,939 rules.  You can count the rules by running <code>gunzip -c example/example.hiero.tm.gz | wc -l</code>, or view the first few translation rules with <code>gunzip -c example/example.hiero.tm.gz | head</code>.
+</p>
+
+<pre>
+[X] ||| [X,1] 科学家 [X,2] ||| [X,1] scientists to [X,2] ||| 2.17609119 0.333095818 1.53173875
+[X] ||| [X,1] 科学家 [X,2] ||| [X,2] of the [X,1] scientists ||| 2.47712135 0.333095818 2.17681264
+[X] ||| [X,1] 科学家 [X,2] ||| [X,2] of [X,1] scientists ||| 2.47712135 0.333095818 1.13837981
+[X] ||| [X,1] 科学家 [X,2] ||| [X,2] [X,1] scientists ||| 2.47712135 0.333095818 0.218843221
+[X] ||| [X,1] 科学家 [X,2] ||| [X,1] scientists [X,2] ||| 1.01472330 0.333095818 0.218843221
+[X] ||| [X,1] 科学家 [X,2] ||| [X,2] of scientists of [X,1] ||| 2.47712135 0.333095818 2.05791640
+[X] ||| [X,1] 科学家 [X,2] ||| scientists [X,1] for [X,2] ||| 2.47712135 0.333095818 2.05956721
+[X] ||| [X,1] 科学家 [X,2] ||| [X,1] scientist [X,2] ||| 1.63202321 0.303409695 0.977472364
+[X] ||| [X,1] 科学家 [X,2] ||| [X,1] scientists , [X,2] ||| 2.47712135 0.333095818 1.68990576
+[X] ||| [X,1] 科学家 [X,2] ||| scientists [X,2] [X,1] ||| 2.47712135 0.333095818 0.218843221
+</pre>
+
+<p>The different parts of the rules are separated by the <code>|||</code> delimiter.  The first part of the rule is the left-hand side non-terminal.  The second part is the source (foreign) side of the rule, and the third part is the target (English) side.  The three numbers listed after each translation rule are negative log probabilities that signify, in order: 
+<ul>
+<li>  prob(e|f) - the probability of the English phrase given the foreign phrase </li> 
+<li> lexprob(e|f) - the lexical translation probabilities of the English words given the foreign words </li>
+<li> lexprob(f|e) - the lexical translation probabilities of the foreign words given the English words</li>
+</ul>
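+
+<p>As a quick illustration (a sketch, not part of Joshua itself), you can split a rule into its fields with <code>awk</code>, using the literal <code>|||</code> delimiter as the field separator:</p>
+
+<pre>
+gunzip -c example/example.hiero.tm.gz | head -3 \
+	| awk -F' [|][|][|] ' '{print "source: " $2 "\ntarget: " $3 "\nscores: " $4 "\n"}'
+</pre>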
+
+<p>You can use the grammar to translate the test set by running </p>
+
+<pre>java -Xmx1g -cp $JOSHUA/bin \
+	-Djava.library.path=$JOSHUA/lib \
+	-Dfile.encoding=utf8 joshua.decoder.JoshuaDecoder \
+	example/example.config.srilm \
+	example/example.test.in \
+	example/example.nbest.srilm.out
+</pre>
+
+
+<p>
+For those of you who aren't very familiar with Java, the arguments are the following:
+<ul>
+<li><code>-Xmx1g</code> -- this tells Java to use 1 GB of memory.  </li>
+<li><code>-cp $JOSHUA/bin</code> -- this specifies the directory that contains the Java class files.</li>
+<li><code>-Djava.library.path=$JOSHUA/lib</code> -- this specifies the directory that contains the libraries that link in C++ code </li>
+<li><code>-Dfile.encoding=utf8</code> -- this tells Java to use Unicode as the default file encoding.</li>
+<li><code>joshua.decoder.JoshuaDecoder </code> -- This is the class that is run.  If you want to look at the source code for this class, you can find it in <code>src/joshua/decoder/JoshuaDecoder.java</code></li>
+<li><code>example/example.config.srilm </code> -- This is the configuration file used by Joshua.</li>
+<li><code>example/example.test.in</code> -- This is the input file containing the sentences to translate.</li>
+<li><code>example/example.nbest.srilm.out</code> -- This is the output file that the n-best translations will be written to.</li>
+</ul>
+
+
+
+<p>You can inspect the output file by typing <code>head example/example.nbest.srilm.out</code></p>
+
+<pre>
+0 ||| scientists to vital early 失智症 the chromosome completed has ||| -127.759 -6.353 -11.577 -5.325 -3.909 ||| -135.267
+0 ||| scientists for vital early 失智症 the chromosome completed has ||| -128.239 -6.419 -11.179 -5.390 -3.909 ||| -135.556
+0 ||| scientists to related early 失智症 the chromosome completed has ||| -126.942 -6.450 -12.716 -5.764 -3.909 ||| -135.670
+0 ||| scientists to vital early 失智症 the chromosomes completed has ||| -128.354 -6.353 -11.396 -5.305 -3.909 ||| -135.714
+0 ||| scientists to death early 失智症 the chromosome completed has ||| -127.879 -6.575 -11.845 -5.287 -3.909 ||| -135.803
+0 ||| scientists as vital early 失智症 the chromosome completed has ||| -128.537 -6.000 -11.384 -5.828 -3.909 ||| -135.820
+0 ||| scientists for related early 失智症 the chromosome completed has ||| -127.422 -6.516 -12.319 -5.829 -3.909 ||| -135.959
+0 ||| scientists for vital early 失智症 the chromosomes completed has ||| -128.834 -6.419 -10.998 -5.370 -3.909 ||| -136.003
+0 ||| scientists to vital early 失智症 completed the chromosome has ||| -127.423 -7.364 -11.577 -5.325 -3.909 ||| -136.009
+0 ||| scientists to vital early 失智症 of chromosomes completed has ||| -127.427 -7.136 -11.612 -5.816 -3.909 ||| -136.086
+</pre>
+
+<p>This file contains the n-best translations, under the model.  The first 10 lines that you see above are the 10 best translations of the first sentence.  Each line contains 4 fields. The first field is the index of the sentence (index 0 for the first sentence), the second field is the translation, the third field contains each of the individual feature function scores for the translation (language model, rule translation probability, lexical translation probability, reverse lexical translation probability, and word penalty), and the final field is the overall score. 
+</p>
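+
+<p>Since the fields are separated by the same <code>|||</code> delimiter, you can, for example, pull out just the overall scores and the translations for a quick look (a small sketch, not a Joshua tool):</p>
+
+<pre>
+awk -F' [|][|][|] ' '{print $4 "\t" $2}' example/example.nbest.srilm.out | head
+</pre>
+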
+<p>
+To get the 1-best translations for each sentence in the test set without all of the extra information, you can run the following command:
+</p>
+
+<pre>
+java -Xmx1g -cp $JOSHUA/bin \
+	-Dfile.encoding=utf8 joshua.util.ExtractTopCand \
+	example/example.nbest.srilm.out \
+	example/example.nbest.srilm.out.1best
+</pre>
+
+<p>You can then look at the 1-best output file by typing <code>cat example/example.nbest.srilm.out.1best</code>:</p>
+
+<pre>
+scientists to vital early 失智症 the chromosome completed has
+( , paris 2 ) international a group of scientists said that they completed to human to chromosome 14 has , the chromosome with many diseases , including more years , may with the early 阿耳滋海默氏症 .
+this is to now completed has in the fourth chromosome , which 八千七百多万 to carry when ( dna ) .
+the weekly british science the study showed that the chromosome 14 are by 一千零五十 genes and gene fragments .
+the goal of gene scientists is to provide diagnostic tools to found of the flawed genes , are still provide a to stop these genes treatments .
+</pre>
+
+<p>If your translations are identical to the ones above then Joshua is installed correctly.  With this small model, there are many untranslated words, and the quality of the translations is very low.  In the next steps, we'll show you how to train a model for a new language pair, using a larger training corpus that will result in higher quality translations.</p>
+
+<a name="step2" />
+<h1>Step 2: Prepare your data</h1>
+<p>To create a new statistical translation model with Joshua, you will need several data sets:
+<ul>
+<li> A large sentence-aligned bilingual parallel corpus.  We refer to this set as the <b>training data</b>, since it will be used to train the translation model. The question of how much data is necessary always arises.  The short answer is more is better.  Our parallel corpora typically contain tens of millions of words, and we use as much as 250 million words.</li>
+<li> A larger monolingual corpus.  We need data in the target language to train the language model.  You could simply use the target side of the parallel corpus, but it is better to assemble large amounts of monolingual text, since it will help improve the fluency of your translations.</li>
+<li> A small sentence-aligned bilingual corpus to use as a <b>development set</b> (somewhere around 1000 sentence pairs ought to be sufficient).  This data should be disjoint from your training data.  It will be used to optimize the parameters of your model in minimum error rate training (MERT).  It may be useful to have multiple reference translations for your dev set, although this is not strictly necessary. </li>
+<li> A small sentence-aligned bilingual corpus to use as a <b>test set</b> to evaluate the translation quality of your system and any modifications that you make to it. The test set should be disjoint from the dev and training sets. Again, it may be useful to have multiple reference translations if you are evaluating using the Bleu metric.</li>
+</ul>
+</p>
+<p>
+There are several sources for training data.  A good source of free parallel corpora for European languages is the Europarl corpus that is distributed as part of the <a href="http://statmt.org/wmt12/translation-task.html">Workshop on Statistical Machine Translation</a>.  If you sign up to participate in the annual <a href="http://www.itl.nist.gov/iad/mig//tests/mt/">NIST Open Machine Translation Evaluation</a> you can get access to large Arabic-English and Chinese-English parallel corpora, and a small Urdu-English parallel corpus.
+</p>
+
+<p>Once you've gathered your data, you will need to perform several preprocessing steps: sentence alignment, tokenization, normalization, and subsampling. </p>
+
+<h3>Sentence alignment</h3>
+<p>In this exercise, we'll start with an existing sentence-aligned parallel corpus.  Download this tarball, which contains a Spanish-English parallel corpus, along with a dev and a test set: <a href="http://cs.jhu.edu/~ccb/joshua/data.tar.gz">data.tar.gz</a> </p>
+
+<p> The data tarball contains two training directories: <code>training/</code>, which includes a subset of the corpus, and <code>full-training/</code>, which includes the full corpus.  I strongly recommend starting with the smaller set and building an end-to-end system with it, since many steps take a very long time on the full data set.  You should debug on the smaller set to avoid wasting time.</p>
+
+
+<p>
+If you start with your own data set, you will need to sentence align it.  We recommend Bob Moore's <a href="http://research.microsoft.com/en-us/downloads/aafd5dcf-4dcc-49b2-8a22-f7055113e656/">bilingual sentence aligner</a>.
+</p>
+
+<h3>Tokenization</h3>
+<p>Joshua uses whitespace to delineate words.  For many languages, tokenization can be as simple as separating punctuation off as its own token.  For languages like Chinese, which don't put spaces around words, tokenization can be trickier.  </p>
+
+<p>For this example we'll use the simple tokenizer that is released as part of the WMT.  It's located in the tarball under the scripts directory.  To use it type the following commands:</p>
+
+<pre>
+tar xfz data.tar.gz
+
+cd data/
+
+gunzip -c es-en/full-training/europarl-v4.es-en.es.gz \
+	| perl scripts/tokenizer.perl -l es \
+	> es-en/full-training/training.es.tok
+
+gunzip -c es-en/full-training/europarl-v4.es-en.en.gz \
+	| perl scripts/tokenizer.perl -l en \
+	> es-en/full-training/training.en.tok 
+</pre>
+
+<h3>Normalization</h3>
+
+<p>After tokenization, we recommend that you normalize your data by lowercasing it.  The system treats words with variant capitalization as distinct, which can lead to worse probability estimates for their translation, since the counts are fragmented. For other languages you might want to normalize the text in other ways.</p>
+
+<p>You can lowercase your tokenized data with the following script:</p>
+<pre>
+cat es-en/full-training/training.en.tok \
+	| perl scripts/lowercase.perl \
+	> es-en/full-training/training.en.tok.lc 
+
+cat es-en/full-training/training.es.tok \
+	| perl scripts/lowercase.perl \
+	> es-en/full-training/training.es.tok.lc
+</pre>
+
+
+<p>The untokenized file looks like this (<code>gunzip -c es-en/full-training/europarl-v4.es-en.en.gz | head -3</code>):</p>
+
+<p>
+Resumption of the session<br>
+I declare resumed the session of the European Parliament adjourned on Friday 17 December 1999, and I would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period.<br>
+Although, as you will have seen, the dreaded 'millennium bug' failed to materialise, still the people in a number of countries suffered a series of natural disasters that truly were dreadful.
+</p>
+<p>After tokenization and lowercasing, the file looks like  this (<code>head -3 es-en/full-training/training.en.tok.lc</code>):</p>
+<p>
+resumption of the session<br>
+i declare resumed the session of the european parliament adjourned on friday 17 december 1999 , and i would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period .<br>
+although , as you will have seen , the dreaded ' millennium bug ' failed to materialise , still the people in a number of countries suffered a series of natural disasters that truly were dreadful .
+</p>
+
+<p>You must preprocess your dev and test sets in the same way you preprocess your training data.  Run the following commands on the data that you downloaded:</p>
+<p>
+<pre>
+cat es-en/dev/news-dev2009.es \
+	| perl scripts/tokenizer.perl -l es \
+	| perl scripts/lowercase.perl \
+	> es-en/dev/news-dev2009.es.tok.lc
+
+cat es-en/dev/news-dev2009.en \
+	| perl scripts/tokenizer.perl -l en \
+	| perl scripts/lowercase.perl \
+	> es-en/dev/news-dev2009.en.tok.lc
+
+cat es-en/test/newstest2009.es \
+	| perl scripts/tokenizer.perl -l es \
+	| perl scripts/lowercase.perl \
+	> es-en/test/newstest2009.es.tok.lc
+
+cat es-en/test/newstest2009.en \
+	| perl scripts/tokenizer.perl -l en \
+	| perl scripts/lowercase.perl \
+	> es-en/test/newstest2009.en.tok.lc
+</pre>
+
+<h3>Subsampling (optional)</h3> 
+
+<p>
+Sometimes the amount of training data is so large that it makes creating word alignments extremely time-consuming and memory-intensive.  We therefore provide a facility for subsampling the training corpus to select sentences that are relevant for a test set. 
+</p>
+<p>
+<pre>
+mkdir es-en/full-training/subsampled
+echo "training" > es-en/full-training/subsampled/manifest
+cat es-en/dev/news-dev2009.es.tok.lc es-en/test/newstest2009.es.tok.lc > es-en/full-training/subsampled/test-data
+
+java -Xmx1000m -Dfile.encoding=utf8 -cp "$JOSHUA/bin:$JOSHUA/lib/commons-cli-2.0-SNAPSHOT.jar" \
+	joshua.subsample.Subsampler \
+	-e en.tok.lc \
+	-f es.tok.lc \
+	-epath  es-en/full-training/ \
+	-fpath  es-en/full-training/ \
+	-output es-en/full-training/subsampled/subsample \
+	-ratio 1.04 \
+	-test es-en/full-training/subsampled/test-data \
+	-training es-en/full-training/subsampled/manifest
+</pre>
+</p>
+<p>You can see how much the subsampling step reduces the training data by typing <code> wc -lw es-en/full-training/training.??.tok.lc es-en/full-training/subsampled/subsample.??.tok.lc</code>:
+</p>
+<pre>
+ 1411589 39411018 training/training.en.tok.lc
+ 1411589 41042110 training/training.es.tok.lc
+  671429 16721564 training/subsampled/subsample.en.tok.lc
+  671429 17670846 training/subsampled/subsample.es.tok.lc
+</pre>
+
+
+<a name="step3" />
+<h1>Step 3: Create word alignments</h1>
+
+<p>
+Before extracting a translation grammar, we first need to create word alignments for our parallel corpus.  In this example, we show you how to use the Berkeley aligner.  You may also use Giza++ to create the alignments, although that program is a little unwieldy to install.
+</p>
+
+<p>
+To run the Berkeley aligner you first need to set up a configuration file,   which defines the models that are used to align the data, how the program runs, and which files are to be aligned.  Here is an example configuration file (you should create your own version of this file and save it as <code>training/word-align.conf</code>):
+</p>
+
+<pre>
+## word-align.conf
+## ----------------------
+## This is an example training script for the Berkeley
+## word aligner.  In this configuration it uses two HMM
+## alignment models trained jointly and then decoded 
+## using the competitive thresholding heuristic.
+
+##########################################
+# Training: Defines the training regimen 
+##########################################
+
+forwardModels	MODEL1 HMM
+reverseModels	MODEL1 HMM
+mode	JOINT JOINT
+iters	5 5
+
+###############################################
+# Execution: Controls output and program flow 
+###############################################
+
+execDir	alignments
+create
+saveParams	true
+numThreads	1
+msPerLine	10000
+alignTraining
+
+#################
+# Language/Data 
+#################
+
+<b>foreignSuffix	es.tok.lc</b>
+<b>englishSuffix	en.tok.lc</b>
+
+# Choose the training sources, which can either be directories or files that list files/directories
+<b>trainSources	subsampled/</b>
+sentences	MAX
+
+#################
+# 1-best output 
+#################
+
+competitiveThresholding
+</pre>
+
+<p>To run the Berkeley aligner, first set an environment variable saying where the aligner's jar file is located (this environment variable is just used for convenience in this document, and is not necessary for running the aligner in general):
+</p>
+<pre>
+export BERKELEYALIGNER="<b>/path/to/berkeleyaligner/dir</b>"
+</pre>
+
+<p>
+You'll need to create an empty directory called <code>example/test</code>.  This is because the Berkeley aligner generally expects to test against a set of manually word-aligned data: 
+</p>
+<pre>
+cd es-en/full-training/
+mkdir -p example/test
+</pre>
+
+<p>
+After you've created the <code>word-align.conf</code> file, you can run the aligner with this command: 
+</p>
+<pre>
+nohup java -d64 -Xmx10g -jar $BERKELEYALIGNER/berkeleyaligner.jar ++word-align.conf &
+</pre>
+
+<p>
+If the program finishes right away, then it probably terminated with an error.  You can read the <code>nohup.out</code> file to see what went wrong.  Common problems include a missing <code>example/test</code> directory, or a file not found exception.  When you re-run the program, you will need to manually remove the <code>alignments/</code> directory. 
+</p>
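+
+<p>A typical recovery after a failed run might look like this (a sketch; the file names match those used above):</p>
+
+<pre>
+tail nohup.out          # inspect the error message
+rm -rf alignments/      # clear out the partial output directory
+nohup java -d64 -Xmx10g -jar $BERKELEYALIGNER/berkeleyaligner.jar ++word-align.conf &
+</pre>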
+
+<p>When you are aligning tens of millions of words worth of data, the word alignment process will take several hours to complete. While it is running, you can skip ahead and complete step 4, but not step 5.</p>
+
+<!-- ccb - todo - show the output here, and the different subdirectories -->
+
+<p>
+After you get comfortable using the aligner and after you've run through the whole Joshua training sequence, you can try experimenting with the amount of training data, the number of training iterations, and different alignment models (the Berkeley aligner supports Model 1, a Hidden Markov Model, and a syntactic HMM).  
+</p>
+
+
+<a name="step4" />
+<h1>Step 4: Train a language model</h1>
+
+<p>Most translation models also make use of an n-gram language model as a way of assigning higher probability to hypothesis translations that look like fluent examples of the target language.  Joshua provides support for n-gram language models, either through a built-in data structure, or through external calls to the SRI language modeling toolkit (srilm).  To use large language models, we recommend srilm.  </p>
+
+<p>If you successfully installed srilm in <a href="#step1">Step 1</a>, then you should be able to train a language model with the following command:</p>
+
+<pre>
+mkdir -p model/lm
+
+$SRILM/bin/macosx64/ngram-count \
+	-order 3 \
+	-unk \
+	-kndiscount1 -kndiscount2 -kndiscount3 \
+	-text training/training.en.tok.lc \
+	-lm model/lm/europarl.en.trigram.lm
+</pre>
+
+<p>(Note: the above assumes that you are on a 64-bit machine running Mac OS X. If that's not the case, your path to ngram-count will be slightly different.)</p>
+
+<p>This will train a trigram language model on the English side of the parallel corpus.  We use the <code>.tok.lc</code> file because it is important to have the input to the LM training be tokenized and normalized in the same way as the input data for word alignment and translation grammar extraction.</p>
+
+<p>
+The <code>-order 3</code> flag tells srilm to produce a trigram language model.  You can set this to a higher value, and srilm will happily output 4-gram, 5-gram or even higher order language models.  Joshua supports arbitrary order n-gram language models, but as the order increases the amount of memory that they require rapidly increases, and the amount of evidence used to estimate the probabilities decreases, so there are diminishing returns for increasing n.  In practice, people rarely use n-gram models much beyond order 5.
+</p>
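+
+<p>For example, a 5-gram model could be trained with a command like the following (a sketch; note that one <code>-kndiscount<i>N</i></code> flag is added per order, and that you would also need to change <code>order=3</code> in the Joshua configuration file to match):</p>
+
+<pre>
+$SRILM/bin/macosx64/ngram-count \
+	-order 5 \
+	-unk \
+	-kndiscount1 -kndiscount2 -kndiscount3 -kndiscount4 -kndiscount5 \
+	-text training/training.en.tok.lc \
+	-lm model/lm/europarl.en.5gram.lm
+</pre>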
+
+<p>The <code>-kndiscount<i>N</i></code> flags tell SRILM to use modified Kneser-Ney discounting as its smoothing scheme.  Other smoothing schemes that are implemented in SRILM include Good-Turing and Witten-Bell. </p>
+
+<p>Given that the English side of the parallel corpus is a relatively small amount of data in terms of language modeling, it only takes a few minutes to output the LM.  The uncompressed LM is 144 megabytes (<code>du -h europarl.en.trigram.lm</code>). 
+</p>
+
+<a name="step5" />
+<h1>Step 5: Extract a translation grammar</h1>
+
+
+<p>We'll use the word alignments to create a translation grammar similar to the Chinese one shown in <a href="#step1">Step 1</a>.  The translation grammar is created by looking for where the foreign language phrases from the test set occur in the training set, and then using the word alignments to figure out which English phrases they are aligned with. </p>
+
+<h3>Create a suffix array index</h3>
+<p>
+To find the foreign phrases in the test set, we first create an easily searchable index, called a suffix array, for the training data.
+</p>
+
+<pre>
+java -Xmx500m -cp $JOSHUA/bin/ \
+	joshua.corpus.suffix_array.Compile \
+	training/subsampled/subsample.es.tok.lc \
+	training/subsampled/subsample.en.tok.lc  \
+	training/subsampled/training.en.tok.lc-es.tok.lc.align \
+	model
+</pre>
+
+<p>This compiles the index that Joshua will use for its rule extraction, and puts it into a directory named <code>model</code>.  
+</p>
+<!-- ccb - todo - add this back in when the GridViewer is fixed.
+Joshua has some tools that let you manipulate the data in this directory.  For example, you can visualize the word alignments with this command:
+</p>
+<pre>
+java -cp $JOSHUA/bin joshua.ui.alignment.GridViewer model 1
+</pre>
+//!-->
+
+<h3>Extract grammar rules for the dev set</h3>
+<p>The following command will extract a translation grammar from the  suffix array index of your word-aligned parallel corpus, where the grammar rules apply to the foreign phrases in the dev set <code>dev/news-dev2009.es.tok.lc</code>: 
+</p>
+<pre>
+mkdir mert
+
+java -Dfile.encoding=UTF8 -Xmx1g -cp $JOSHUA/bin \
+        joshua.prefix_tree.ExtractRules \
+        ./model \
+        mert/news-dev2009.es.tok.lc.grammar.raw \
+        dev/news-dev2009.es.tok.lc &  
+</pre>
+<p>
+Next, sort the grammar rules and remove the redundancies with the following Unix command:
+</p>
+<pre>
+sort -u mert/news-dev2009.es.tok.lc.grammar.raw \
+	-o mert/news-dev2009.es.tok.lc.grammar
+</pre>
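+
+<p>As a quick sanity check, you can count how many unique rules were extracted for the dev set:</p>
+
+<pre>
+wc -l mert/news-dev2009.es.tok.lc.grammar
+</pre>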
+
+
+<p>You will also need to create a small "glue grammar" in a file called <code>model/hiero.glue</code> that contains the following rules, which allow hiero-style grammars to reach the goal state:</p>
+
+<pre>
+[S] ||| [X,1] ||| [X,1] ||| 0 0 0
+[S] ||| [S,1] [X,2] ||| [S,1] [X,2] ||| 0.434294482 0 0
+</pre>
+
+
+<!-- ccb todo - show the Spanish grammar here -->
+
+
+<a name="step6" />
+<h1>Step 6: Run minimum error rate training</h1>
+<p>
+After we've extracted the grammar for the dev set we can run minimum error rate training (MERT).  MERT is a method for setting the weights of the different feature functions in the translation model so as to maximize the translation quality on the dev set.  Translation quality is calculated according to an automatic metric, such as Bleu.  Our implementation of MERT allows you to easily implement some other metric, and optimize your parameters to that.  There's even a YouTube tutorial to show you how. </p>
+
+<p>To run MERT you will first need to create a few files:
+<ul>
+<li> A MERT configuration file </li>
+<li> A separate file with the list of the feature functions used in your model, along with their possible ranges</li>
+<li> An executable file containing the command to use to run the decoder</li>
+<li> A Joshua configuration file</li>
+</ul>
+</p>
+
+<p>Create a MERT configuration file.  In this example we name the file <code>mert/mert.config</code>. Its contents are: </p>
+
+<pre>
+### MERT parameters
+# target sentences file name (in this case, file name prefix)
+-r	dev/news-dev2009.en.tok.lc
+-rps	1			# references per sentence
+-p	mert/params.txt		# parameter file
+-m	BLEU 4 closest		# evaluation metric and its options
+-maxIt	10			# maximum MERT iterations
+-ipi	20			# number of intermediate initial points per iteration
+-cmd	mert/decoder_command    # file containing commands to run decoder
+-decOut	mert/news-dev2009.output.nbest     # file produced by decoder
+-dcfg	mert/joshua.config      # decoder config file
+-N	300                     # size of N-best list
+-v	1                       # verbosity level (0-2; higher value => more verbose)
+-seed   12341234                # random number generator seed
+</pre>
+
+<p>You can see a list of the other parameters available in our MERT implementation by running this command:</p>
+
+<pre>java -cp $JOSHUA/bin joshua.zmert.ZMERT -h </pre>
+
+<p>Next, create a file called <code>mert/params.txt</code> that specifies what feature functions you are using in your model.  In our baseline model, this file should contain the following information:</p>
+<pre>
+lm			|||	1.000000		Opt	0.1	+Inf	+0.5	+1.5
+phrasemodel pt 0	|||	1.066893		Opt	-Inf	+Inf	-1	+1
+phrasemodel pt 1	|||	0.752247		Opt	-Inf	+Inf	-1	+1
+phrasemodel pt 2	|||	0.589793		Opt	-Inf	+Inf	-1	+1
+wordpenalty		|||	-2.844814		Opt	-Inf	+Inf	-5	0
+normalization = absval 1 lm
+</pre>
+<p>Next, create a file called <code>mert/decoder_command</code> that contains the following command:</p>
+<pre>
+java -Xmx1g -cp $JOSHUA/bin/ -Djava.library.path=$JOSHUA/lib -Dfile.encoding=utf8 \
+	joshua.decoder.JoshuaDecoder \
+	mert/joshua.config \
+	dev/news-dev2009.es.tok.lc \
+	mert/news-dev2009.output.nbest 
+</pre>
+
+<p>Next, create a configuration file for joshua at <code>mert/joshua.config</code> that contains the following:</p>
+<pre>
+<b>lm_file=model/lm/europarl.en.trigram.lm</b>
+
+<b>tm_file=mert/news-dev2009.es.tok.lc.grammar</b>
+tm_format=hiero
+
+glue_file=model/hiero.glue
+glue_format=hiero
+
+#lm config
+use_srilm=true
+lm_ceiling_cost=100
+use_left_equivalent_state=false
+use_right_equivalent_state=false
+order=3
+
+
+#tm config
+span_limit=10
+phrase_owner=pt
+mono_owner=mono
+begin_mono_owner=begin_mono
+default_non_terminal=X
+goalSymbol=S
+
+#pruning config
+fuzz1=0.1
+fuzz2=0.1
+max_n_items=30
+relative_threshold=10.0
+max_n_rules=50
+rule_relative_threshold=10.0
+
+#nbest config
+use_unique_nbest=true
+use_tree_nbest=false
+add_combined_cost=true
+top_n=300
+
+
+#remote lm server config, we should first prepare remote_symbol_tbl before starting any jobs
+use_remote_lm_server=false
+remote_symbol_tbl=./voc.remote.sym
+num_remote_lm_servers=4
+f_remote_server_list=./remote.lm.server.list
+remote_lm_server_port=9000
+
+
+#parallel decoder: it cannot be used together with remote lm
+num_parallel_decoders=1
+parallel_files_prefix=/tmp/
+
+
+###### model weights
+#lm order weight
+lm 1.0
+
+#phrasemodel owner column(0-indexed) weight
+phrasemodel pt 0 1.4037585111897322
+phrasemodel pt 1 0.38379188013385945
+phrasemodel pt 2 0.47752204361625605
+
+#arityphrasepenalty owner start_arity end_arity weight
+#arityphrasepenalty pt 0 0 1.0
+#arityphrasepenalty pt 1 2 -1.0
+
+#phrasemodel mono 0 0.5
+
+#wordpenalty weight
+wordpenalty -2.721711092619053
+</pre>
+
+<p>Finally, run the command to start MERT:</p>
+<pre>
+nohup java -cp $JOSHUA/bin \
+	joshua.zmert.ZMERT \
+	-maxMem 1500 mert/mert.config &
+</pre>
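+
+<p>Because the command above is run with <code>nohup</code>, MERT's progress messages are written to <code>nohup.out</code>; you can follow them with:</p>
+
+<pre>
+tail -f nohup.out
+</pre>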
+
+<p>While MERT is running, you can skip ahead to the first part of the next step and extract the grammar for the test set.</p>
+
+<a name="step7" />
+<h1>Step 7: Decode a test set</h1>
+
+<p>When MERT finishes, it will output a file <code>mert/joshua.config.ZMERT.final</code> that contains the new weights for the different feature functions. You can copy this config file and use it to decode the test set.  </p>
+
+
+<h3>Extract grammar rules for the test set</h3>
+<p>Before decoding the test set, you'll need to extract a translation grammar for the foreign phrases in the test set <code>test/newstest2009.es.tok.lc</code>: 
+</p>
+<pre>
+java -Dfile.encoding=UTF8 -Xmx1g -cp $JOSHUA/bin \
+        joshua.prefix_tree.ExtractRules \
+        ./model \
+        test/newstest2009.es.tok.lc.grammar.raw \
+        test/newstest2009.es.tok.lc &  
+</pre>
+<p>
+Next, sort the grammar rules and remove the redundancies with the following Unix command:
+</p>
+<pre>
+sort -u test/newstest2009.es.tok.lc.grammar.raw \
+	-o test/newstest2009.es.tok.lc.grammar
+</pre>
+
+<p>Once the grammar extraction has completed, you can edit the <code>joshua.config</code> file for the test set.</p>
+
+<pre>
+cp mert/joshua.config.ZMERT.final test/joshua.config
+</pre>
+
+<p>You'll need to edit the config file to replace <b><code>tm_file=mert/news-dev2009.es.tok.lc.grammar</code></b> with <b><code>tm_file=test/newstest2009.es.tok.lc.grammar</code></b>.  After you have done that, you can decode the test set with the following command:</p>
+
+<pre>
+java -Xmx1g -cp $JOSHUA/bin/ -Djava.library.path=$JOSHUA/lib -Dfile.encoding=utf8 \
+	joshua.decoder.JoshuaDecoder \
+	test/joshua.config \
+	test/newstest2009.es.tok.lc \
+	test/newstest2009.output.nbest
+</pre>
+
+<p>After the decoder has finished, you can extract the 1-best translations from the n-best list using the following command:</p>
+
+<pre>
+java -cp $JOSHUA/bin -Dfile.encoding=utf8 \
+	joshua.util.ExtractTopCand \
+	test/newstest2009.output.nbest \
+	test/newstest2009.output.1best 
+</pre>
+
+<!-- ccb - todo - show the output -->
+
+<a name="step8" />
+<h1>Step 8: Recase and detokenize</h1>
+
+<p>You'll notice that your output is all lowercased and has the punctuation split off.  In order to make the output more readable to human beings (remember us?), it'd be good to fix these problems and restore proper capitalization and spacing.  These steps are called recasing and detokenization, respectively. We can do recasing using SRILM, and detokenization with a perl script. </p>
+
+<p>To build a recasing model first train a language model on true cased English text:</p>
+<pre>
+$SRILM/bin/macosx64/ngram-count \
+	-unk \
+	-order 5 \
+	-kndiscount1 -kndiscount2 -kndiscount3 -kndiscount4 -kndiscount5 \
+	-text training/training.en.tok \
+	-lm model/lm/training.TrueCase.5gram.lm
+</pre>
+<p>Next, you'll need to create a list of all of the alternative ways that each word can be capitalized.  This will be stored in a map file that lists a lowercased word as the key and associates it with all of the variant capitalizations of that word.  Here's an example perl script to create the map:</p>
+
+<pre>
+#!/usr/bin/perl
+#
+# truecase-map.perl
+# -----------------
+# This script outputs alternate capitalizations
+
+%map = ();
+while($line = <>) {
+    @words = split(/\s+/, $line);
+    foreach $word (@words) {
+	$key = lc($word);
+	$map{$key}{$word} = 1;
+    }
+}
+
+foreach $key (sort keys %map) {
+    @words = keys %{$map{$key}};
+    if(scalar(@words) > 1 || !($words[0] eq $key)) {
+	print $key;
+	foreach $word (sort @words) {
+	    print " $word";
+	}
+	print "\n";
+    }
+}
+</pre>
+
+<pre>
+cat training/training.en.tok | perl truecase-map.perl > model/lm/true-case.map
+</pre>
+
+<p>Finally, recase the lowercased 1-best translation by running the SRILM <code>disambig</code> program, which takes the map of alternative capitalizations, creates a confusion network, and uses the truecased LM to find the best path through it:</p>
+<pre>
+$SRILM/bin/macosx64/disambig \
+	-lm model/lm/training.TrueCase.5gram.lm \
+	-keep-unk \
+	-order 5 \
+	-map model/lm/true-case.map \
+	-text test/newstest2009.output.1best \
+	| perl strip-sent-tags.perl \
+	> test/newstest2009.output.1best.recased
+</pre>
+
+<p>Where <code>strip-sent-tags.perl</code> is:</p>
+
+
+<pre>
+while($line = <>) {
+    $line =~ s/^\s*&lt;s&gt;\s*//g;
+    $line =~ s/\s*&lt;\/s&gt;\s*$//g;
+    print $line . "\n";
+}
+</pre>
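+
+<p>The data tarball only includes a tokenizer, so for detokenization you will need a separate script, for example the <code>detokenizer.perl</code> distributed with the WMT/Moses scripts (not included here).  A sketch of its use, assuming you have downloaded it:</p>
+
+<pre>
+cat test/newstest2009.output.1best.recased \
+	| perl detokenizer.perl -l en \
+	> test/newstest2009.output.1best.recased.detok
+</pre>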
+
+
+<!-- ccb - todo - show the recased output -->
+
+
+<a name="step9" />
+<h1>Step 9: Score the translations</h1>
+
+<p>
+The quality of machine translation is commonly measured using the BLEU metric, which automatically compares a system's output against reference human translations. You can score your output using the JoshuaEval class, Joshua's built-in scorer:
+</p>
+
+<pre>
+java -cp $JOSHUA/bin -Djava.library.path=$JOSHUA/lib -Xmx1000m -Xms1000m \
+	-Djava.util.logging.config.file=logging.properties \
+	joshua.util.JoshuaEval \
+	-cand test/newstest2009.output.1best \
+	-ref test/newstest2009.en.tok.lc \
+	-m BLEU 4 closest 
+</pre>
+
+<!-- ccb - todo - update these numbers with the actual numbers
+The output will be something like:
+BLEU_precision(1) = 797 / 2169 = 0.3675
+BLEU_precision(2) = 346 / 2119 = 0.1633
+BLEU_precision(3) = 174 / 2069 = 0.0841
+BLEU_precision(4) = 86 / 2019 = 0.0426
+BLEU_precision = 0.1211
+
+Length of candidate corpus = 2169
+Effective length of reference corpus = 1360
+BLEU_BP = 1.0000
+
+BLEU = 0.1211
+Your Bleu score in this case would be 0.1211.
+-->
+
+
+<!-- 	  <div id="news"> -->
+<!-- <div id="contentcenter"><h2>Made Possible By</h2></div> -->
+<!-- 		<p><a href="http://hltcoe.jhu.edu/"> -->
+<!-- <img src="images/sponsors/hltcoe-logo1.jpg" width="100" border="0"  /><br /> -->
+<!-- The Human Language Technology Center of Excellence (HLTCOE) </a></p>  -->
+
+<!-- 		<p><a href="http://www.darpa.mil/Our_Work/I2O/Programs/Global_Autonomous_Language_Exploitation_(GALE).aspx"> -->
+<!-- <img src="images/sponsors/darpa-logo.jpg" width="100" border="0"  /><br /> -->
+<!-- Global Autonomous Language Exploitation (GALE) </a></p>  -->
+
+<!-- 		<p><a href="http://nsf.gov/awardsearch/showAward.do?AwardNumber=0713448"> -->
+<!-- <img src="images/sponsors/NSF-logo.jpg" width="100" border="0"  /><br /> -->
+<!-- Multi-level modeling of language and translation </a></p>  -->
+
+<!-- <br> -->
+<!-- <p> -->
+<!-- <div xmlns:cc="http://creativecommons.org/ns#" about="http://www.flickr.com/photos/marcusfrieze/422897640">Logo photo by <a rel="cc:attributionURL" href="http://www.flickr.com/photos/marcusfrieze/">Marcus Frieze</a> used under a <a rel="license" href="http://creativecommons.org/licenses/by-nc/2.0/">Creative Commons License</a>.</div> -->
+<!-- </p> -->
+
+<!-- </div> -->

http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/4.0/thrax.md
----------------------------------------------------------------------
diff --git a/4.0/thrax.md b/4.0/thrax.md
new file mode 100644
index 0000000..6b276b0
--- /dev/null
+++ b/4.0/thrax.md
@@ -0,0 +1,14 @@
+---
+layout: default4
+category: advanced
+title: Grammar extraction with Thrax
+---
+
+One day, this will hold Thrax documentation, including how to use Thrax, how to do grammar
+filtering, and details on the configuration file options.  It will also include details about our
+experience setting up and maintaining Hadoop cluster installations, knowledge wrought of hard-fought
+sweat and tears.
+
+In the meantime, please bother [Jonny Weese](http://cs.jhu.edu/~jonny/) if there is something you
+need to do that you don't understand.  You might also be able to dig up some information [on the old
+Thrax page](http://cs.jhu.edu/~jonny/thrax/).

http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/4.0/tms.md
----------------------------------------------------------------------
diff --git a/4.0/tms.md b/4.0/tms.md
new file mode 100644
index 0000000..a86a311
--- /dev/null
+++ b/4.0/tms.md
@@ -0,0 +1,106 @@
+---
+layout: default4
+category: advanced
+title: Building Translation Models
+---
+
+# Build a translation model
+
+Extracting a grammar from a large amount of data is a multi-step process. The first requirement is parallel data. The Europarl, Call Home, and Fisher corpora all contain parallel translations of Spanish and English sentences.
+
+We will copy (or symlink) the parallel source text files into a subdirectory called `input/`.
+
+Then, we concatenate all the training files on each side. The pipeline script normally does tokenization and normalization, but in this instance we have a custom tokenizer we need to apply to the source side, so we have to do it manually and then skip that step using the `pipeline.pl` option `--first-step align`.
+
+* to tokenize the English data, do
+
+    cat callhome.en europarl.en fisher.en > all.en
+    cat all.en | $JOSHUA/scripts/training/normalize-punctuation.pl en | $JOSHUA/scripts/training/penn-treebank-tokenizer.perl | $JOSHUA/scripts/lowercase.perl > all.norm.tok.lc.en
+
+The same can be done for the Spanish side of the input data:
+
+    cat callhome.es europarl.es fisher.es > all.es
+    cat all.es | $JOSHUA/scripts/training/normalize-punctuation.pl es | $JOSHUA/scripts/training/penn-treebank-tokenizer.perl | $JOSHUA/scripts/lowercase.perl > all.norm.tok.lc.es
+
+By the way, an alternative tokenizer is a Twitter tokenizer found in the [Jerboa](http://github.com/vandurme/jerboa) project.
+
+The final step in the training data preparation is to remove all examples in which either of the language sides is a blank line.
+
+    paste all.norm.tok.lc.es all.norm.tok.lc.en | grep -Pv "^\t|\t$" \
+      | ./splittabs.pl all.norm.tok.lc.noblanks.es all.norm.tok.lc.noblanks.en
+
+contents of `splittabs.pl` by Matt Post:
+
+    #!/usr/bin/perl
+
+    # splits on tab, printing respective chunks to the list of files given
+    # as script arguments
+
+    use FileHandle;
+
+    $| = 1;   # don't buffer output
+
+    if (@ARGV == 0) {
+      print "Usage: splittabs.pl file1 file2 ... < tabbed-file\n";
+      exit;
+    }
+
+    my @fh = map { get_filehandle($_) } @ARGV;
+    @ARGV = ();
+
+    while (my $line = <>) {
+      chomp($line);
+      my (@fields) = split(/\t/,$line,scalar @fh);
+
+      map { print {$fh[$_]} "$fields[$_]\n" } (0..$#fields);
+    }
+
+    sub get_filehandle {
+        my $file = shift;
+
+        if ($file eq "-") {
+            return *STDOUT;
+        } else {
+            local *FH;
+            open FH, ">$file" or die "can't open '$file' for writing";
+            return *FH;
+        }
+    }
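+
+Before running the pipeline, it is worth confirming that the two filtered sides are still parallel (that is, have the same number of lines):
+
+    wc -l all.norm.tok.lc.noblanks.es all.norm.tok.lc.noblanks.en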
+
+Now we can run the pipeline to extract the grammar. Run the following script:
+
+    #!/bin/bash
+
+    # this creates a grammar
+
+    # NEED:
+    # pair
+    # type
+
+    set -u
+
+    pair=es-en
+    type=hiero
+
+    #. ~/.bashrc
+
+    #basedir=$(pwd)
+
+    dir=grammar-$pair-$type
+
+    [[ ! -d $dir ]] && mkdir -p $dir
+    cd $dir
+
+    source=$(echo $pair | cut -d- -f 1)
+    target=$(echo $pair | cut -d- -f 2)
+
+    $JOSHUA/scripts/training/pipeline.pl \
+      --source $source \
+      --target $target \
+      --corpus /home/hltcoe/lorland/expts/scale12/model1/input/all.norm.tok.lc.noblanks \
+      --type $type \
+      --joshua-mem 100g \
+      --no-prepare \
+      --first-step align \
+      --last-step thrax \
+      --hadoop $HADOOP \
+      --threads 8

http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/4.0/zmert.md
----------------------------------------------------------------------
diff --git a/4.0/zmert.md b/4.0/zmert.md
new file mode 100644
index 0000000..538a2ac
--- /dev/null
+++ b/4.0/zmert.md
@@ -0,0 +1,83 @@
+---
+layout: default4
+category: advanced
+title: Z-MERT
+---
+
+This document describes how to manually run the ZMERT module.  ZMERT is Joshua's minimum error-rate
+training module, written by Omar F. Zaidan.  It is easily adapted to drop in different decoders, and
+was also written so as to work with different objective functions (other than BLEU).
+
+((Section (1) in `$JOSHUA/examples/ZMERT/README_ZMERT.txt` is an expanded version of this section))
+
+Z-MERT can be used by launching the driver program (`ZMERT.java`), which expects a config file as
+its main argument.  This config file can be used to specify any subset of Z-MERT's 20-some
+parameters.  For a full list of those parameters, and their default values, run ZMERT with a single
+-h argument as follows:
+
+    java -cp $JOSHUA/bin joshua.zmert.ZMERT -h
+
+So what does a Z-MERT config file look like?
+
+Examine the file `examples/ZMERT/ZMERT_config_ex2.txt`.  You will find that it
+specifies the following "main" MERT parameters:
+
+    (*) -dir dirPrefix:         working directory
+    (*) -s sourceFile:          source sentences (foreign sentences) of the MERT dataset
+    (*) -r refFile:             target sentences (reference translations) of the MERT dataset
+    (*) -rps refsPerSen:        number of reference translations per sentence
+    (*) -p paramsFile:          file containing parameter names, initial values, and ranges
+    (*) -maxIt maxMERTIts:      maximum number of MERT iterations
+    (*) -ipi initsPerIt:        number of intermediate initial points per iteration
+    (*) -cmd commandFile:       name of file containing commands to run the decoder
+    (*) -decOut decoderOutFile: name of the output file produced by the decoder
+    (*) -dcfg decConfigFile:    name of decoder config file
+    (*) -N N:                   size of N-best list (per sentence) generated in each MERT iteration
+    (*) -v verbosity:           output verbosity level (0-2; higher value => more verbose)
+    (*) -seed seed:             seed used to initialize the random number generator
+
+(Note that the `-s` parameter is only used if Z-MERT is running Joshua as an
+ internal decoder.  If Joshua is run as an external decoder, as is the case in
+ this README, then this parameter is ignored.)
+
+To test Z-MERT on the 100-sentence test set of example2, provide this config
+file to Z-MERT as follows:
+
+    java -cp bin joshua.zmert.ZMERT -maxMem 500 examples/ZMERT/ZMERT_config_ex2.txt > examples/ZMERT/ZMERT_example/ZMERT.out
+
+This will run Z-MERT for a couple of iterations on the data from the example2
+folder.  (Notice that we have made copies of the source and reference files
+from example2 and renamed them as src.txt and ref.* in the MERT_example folder,
+just to have all the files needed by Z-MERT in one place.)  Once the Z-MERT run
+is complete, you should be able to inspect the log file to see what kinds of
+things it did.  If everything goes well, the run should take a few minutes, of
+which more than 95% is time spent by Z-MERT waiting on Joshua to finish
+decoding the sentences (once per iteration).
+
+The output file you get should be equivalent to `ZMERT.out.verbosity1`.  If you
+rerun the experiment with the verbosity (-v) argument set to 2 instead of 1,
+the output file you get should be equivalent to `ZMERT.out.verbosity2`, which has
+more interesting details about what Z-MERT does.
+
+Notice the additional `-maxMem` argument.  It tells Z-MERT that it should not
+continue to use up memory while the decoder is running (during which time Z-MERT
+would be idle).  The 500 tells Z-MERT that it can only use a maximum of 500 MB.
+For more details on this issue, see section (4) in Z-MERT's README.
+
+A quick note about Z-MERT's interaction with the decoder.  If you examine the
+file `decoder_command_ex2.txt`, which is provided as the commandFile (`-cmd`)
+argument in Z-MERT's config file, you'll find it contains the command one would
+use to run the decoder.  Z-MERT launches the commandFile as an external
+process, and assumes that it will launch the decoder to produce translations.
+(Make sure that commandFile is executable.)  After launching this external
+process, Z-MERT waits for it to finish, then uses the resulting output file for
+parameter tuning (in addition to the output files from previous iterations).
+The command file here only has a single command, but your command file could
+have multiple lines.  Just make sure the command file itself is executable.
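+
+For illustration only, a minimal commandFile might look something like the following sketch (the
+config, input, and output file names here are placeholders rather than the actual example2 files;
+remember to `chmod +x` the file):
+
+    #!/bin/bash
+    # Z-MERT launches this script, waits for it to finish, and then reads the
+    # n-best output file named by its -decOut argument.
+    java -Xmx1g -cp $JOSHUA/bin -Dfile.encoding=utf8 \
+        joshua.decoder.JoshuaDecoder \
+        joshua.config \
+        src.txt \
+        nbest.out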
+
+Notice that the Z-MERT arguments `decConfigFile` and `decoderOutFile` (`-dcfg` and
+`-decOut`) must match the two Joshua arguments in the commandFile's (`-cmd`) single
+command.  Also, the Z-MERT argument for N must match the value for `top_n` in
+Joshua's config file, indicated by the Z-MERT argument `decConfigFile` (`-dcfg`).
+
+For more details on Z-MERT, refer to `$JOSHUA/examples/ZMERT/README_ZMERT.txt`

http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/5.0/advanced.md
----------------------------------------------------------------------
diff --git a/5.0/advanced.md b/5.0/advanced.md
new file mode 100644
index 0000000..174041e
--- /dev/null
+++ b/5.0/advanced.md
@@ -0,0 +1,7 @@
+---
+layout: default
+category: links
+title: Advanced features
+---
+
+

http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/5.0/bundle.md
----------------------------------------------------------------------
diff --git a/5.0/bundle.md b/5.0/bundle.md
new file mode 100644
index 0000000..c3874ab
--- /dev/null
+++ b/5.0/bundle.md
@@ -0,0 +1,24 @@
+---
+layout: default
+category: links
+title: Bundling a configuration
+---
+
+A *bundled configuration* is a minimal set of configuration, resource, and script files. A script, `$JOSHUA/scripts/support/run-bundler.py`, can be used to package up the run bundle. The resulting bundle can easily be transferred and shared.
+
+**Example invocation:**
+
+    ./run-bundler.py \
+      --force \
+      /path/to/rundir/runs/5/test/1/joshua.config \
+      /path/to/rundir/runs/5 \
+      bundled-configurations \
+        "-top-n 1 \
+        -output-format %S \
+        -mark-oovs false \
+        -server-port 5674 \
+        -tm/pt "thrax pt 20 /path/to/rundir/runs/5/test/1/grammar.gz"
+
+A new directory `./bundled-configurations` will be created, and all the bundled files will be copied or created in it.  To use the configuration with Joshua, run the executable file `./bundled-configurations/bundle-runner.sh`.
+
+Note that the additional options between the pair of quotation marks are passed as arguments to the `$JOSHUA/scripts/copy-config.pl` script. That script has some special parameters, especially the `-tm/..` option.

http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/5.0/decoder.md
----------------------------------------------------------------------
diff --git a/5.0/decoder.md b/5.0/decoder.md
new file mode 100644
index 0000000..b78cead
--- /dev/null
+++ b/5.0/decoder.md
@@ -0,0 +1,374 @@
+---
+layout: default
+category: links
+title: Decoder configuration parameters
+---
+
+Joshua configuration parameters affect the runtime behavior of the decoder itself.  This page
+describes the complete list of these parameters and describes how to invoke the decoder manually.
+
+To run the decoder, a convenience script is provided that loads the necessary Java libraries.
+Assuming you have set the environment variable `$JOSHUA` to point to the root of your installation,
+its syntax is:
+
+    $JOSHUA/bin/decoder [-m memory-amount] [-c config-file other-joshua-options ...]
+
+The `-m` argument, if present, must come first, and the memory specification is in Java format
+(e.g., 400m, 4g, 50g).  Most notably, the suffixes "m" and "g" are used for "megabytes" and
+"gigabytes", and there cannot be a space between the number and the unit.  The value of this
+argument is passed to Java itself in the invocation of the decoder, and the remaining options are
+passed to Joshua.  The `-c` parameter has special import because it specifies the location of the
+configuration file.
+
+The Joshua decoder works by reading from STDIN and printing translations to STDOUT as they are
+received, according to a number of [output options](#output).  If no run-time parameters are
+specified (e.g., no translation model), sentences are simply pushed through untranslated.  Blank
+lines are similarly pushed through as blank lines, so as to maintain parallelism with the input.
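+
+For example, a simple invocation might look like this (the file names here are placeholders):
+
+    cat input.txt | $JOSHUA/bin/decoder -m 4g -c /path/to/joshua.config > output.txt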
+
+Parameters can be provided to Joshua via a configuration file and from the command
+line.  Command-line arguments override values found in the configuration file.  The format for
+configuration file parameters is
+
+    parameter = value
+
+Command-line options are specified in the following format
+
+    -parameter value
+
+Values are one of four types (which we list here mostly to call attention to the boolean format):
+
+- STRING, an arbitrary string (no spaces)
+- FLOAT, a floating-point value
+- INT, an integer
+- BOOLEAN, a boolean value.  For booleans, `true` evaluates to true, and all other values evaluate
+  to false.  For command-line options, the value may be omitted, in which case it evaluates to
+  true.  For example, the following are equivalent:
+
+      $JOSHUA/bin/decoder -mark-oovs true
+      $JOSHUA/bin/decoder -mark-oovs
+
+## Joshua configuration file
+
+In addition to the decoder parameters described below, the configuration file contains the model
+feature weights.  These weights are distinguished from runtime parameters in that they are delimited
+by a space instead of an equals sign. They take the following
+format, and by convention are placed at the end of the configuration file:
+
+    lm_0 4.23
+    tm_pt_0 -0.2
+    OOVPenalty -100
+   
+Joshua can make use of thousands of features, which are described in further detail in the
+[feature file](features.html).
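+
+As an illustrative sketch (all paths and weight values below are placeholders, and the `tm` and
+`lm` lines are described in later sections), a minimal configuration file might look like this:
+
+    tm = thrax pt 20 /path/to/grammar.gz
+    tm = thrax glue -1 /path/to/glue-grammar.gz
+    lm = kenlm 5 false false 100 /path/to/lm.kenlm
+    mark-oovs = false
+    top-n = 1
+
+    lm_0 1.0
+    tm_pt_0 0.5
+    OOVPenalty -100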
+
+## Joshua decoder parameters
+
+This section contains a list of the Joshua run-time parameters.  An important note about the
+parameters is that they are collapsed to canonical form, in which dashes (-) and underscores (_) are
+removed and case is converted to lowercase.  For example, the following parameter forms are
+equivalent (either in the configuration file or from the command line):
+
+    {top-n, topN, top_n, TOP_N, t-o-p-N}
+    {poplimit, pop-limit, pop_limit, popLimit, PoPlImIt}
+
+This basically defines equivalence classes of parameters, and relieves you of the task of having to
+remember the exact format of each parameter.
+
+In what follows, we group the configuration parameters in the following groups:
+
+- [General options](#general)
+- [Pruning](#pruning)
+- [Translation model options](#tm)
+- [Language model options](#lm)
+- [Output options](#output)
+- [Alternate modes of operation](#modes)
+
+<a id="general" />
+
+### General decoder options
+
+- `c`, `config` --- *NULL*
+
+   Specifies the configuration file from which Joshua options are loaded.  This feature is unique in
+   that it must be specified from the command line (obviously).
+
+- `amortize` --- *true*
+
+  When true, specifies that sorting of the rule lists at each trie node in the grammar should be
+  delayed until the trie node is accessed. When false, all such nodes are sorted before decoding
+  even begins. Setting to true results in slower per-sentence decoding, but allows the decoder to
+  begin translating almost immediately (especially with large grammars).
+
+- `server-port` --- *0*
+
+  If set to a nonzero value, Joshua will start a multithreaded TCP/IP server on the specified
+  port. Clients can connect to it directly through programming APIs or command-line tools like
+  `telnet` or `nc`.
+  
+      $ $JOSHUA/bin/decoder -m 30g -c /path/to/config/file -server-port 8723
+      ...
+      $ cat input.txt | nc localhost 8723 > results.txt
+
+- `maxlen` --- *200*
+
+  Input sentences longer than this are truncated.
+
+- `feature-function`
+
+  Enables a particular feature function. See the [feature function page](features.html) for more information.
+
+- `oracle-file` --- *NULL*
+
+  The location of a set of oracle reference translations, parallel to the input.  When present,
+  after producing the hypergraph by decoding the input sentence, the oracle is used to rescore the
+  translation forest with a BLEU approximation in order to extract the oracle-translation from the
+  forest.  This is useful for obtaining an (approximation to an) upper bound on your translation
+  model under particular search settings.
+
+- `default-nonterminal` --- *"X"*
+
+   This is the nonterminal symbol assigned to out-of-vocabulary (OOV) items. Joshua assigns this
+   label to every word of the input, in fact, so that even known words can be translated as OOVs, if
+   the model prefers them. Usually, a very low weight on the `OOVPenalty` feature discourages their
+   use unless necessary.
+
+- `goal-symbol` --- *"GOAL"*
+
+   This is the symbol whose presence in the chart over the whole input span denotes a successful
+   parse (translation).  It should match the LHS nonterminal in your glue grammar.  Internally,
+   Joshua represents nonterminals enclosed in square brackets (e.g., "[GOAL]"), which you can
+   optionally supply in the configuration file.
+
+- `true-oovs-only` --- *false*
+
+  By default, Joshua creates an OOV entry for every word in the source sentence, regardless of
+  whether it is found in the grammar.  This allows every word to be pushed through untranslated
+  (although potentially incurring a high cost based on the `OOVPenalty` feature).  If this option is
+  set, then only true OOVs are entered into the chart as OOVs. To determine "true" OOVs, Joshua
+  examines the first level of the grammar trie for each word of the input (this isn't a perfect
+  heuristic, since a word could be present only in deeper levels of the trie).
+
+- `threads`, `num-parallel-decoders` --- *1*
+
+  This determines how many simultaneous decoding threads to launch.  
+
+  Outputs are assembled in order, and Joshua has to hold on to each complete target hypergraph until
+  it is ready to be processed for output, so a high thread count can use a lot of memory if a long
+  sentence causes many completed sentences to queue up behind it.  We have run Joshua with as many
+  as 64 threads without any problems of this kind, but it's useful to keep in the back of your mind.
+  
+- `weights-file` --- NULL
+
+  Weights are appended to the end of the Joshua configuration file, by convention. If you prefer to
+  put them in a separate file, you can do so, and point to the file with this parameter.
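+
+  As an illustrative sketch (the path and values are placeholders), such a file would contain just
+  the weight lines, in the same format used at the end of the configuration file, and would be
+  pointed to with `weights-file = /path/to/joshua.weights`:
+
+      lm_0 1.2
+      tm_pt_0 0.5
+      OOVPenalty -100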
+
+### Pruning options <a id="pruning" />
+
+- `pop-limit` --- *100*
+
+  The number of cube-pruning hypotheses that are popped from the candidates list for each span of
+  the input.  Higher values result in a larger portion of the search space being explored at the
+  cost of an increased search time. For exhaustive search, set `pop-limit` to 0.
+
+- `filter-grammar` --- false
+
+  Set to true, this enables dynamic sentence-level filtering. For each sentence, each grammar is
+  filtered at runtime down to rules that can be applied to the sentence under consideration. This
+  takes some time (which we haven't thoroughly quantified), but can result in the removal of many
+  rules that are only partially applicable to the sentence.
+
+- `constrain-parse` --- *false*
+- `use_pos_labels` --- *false*
+
+  *These features are not documented.*
+
+### Translation model options <a id="tm" />
+
+Joshua supports any number of translation models. Conventionally, two are supplied: the main grammar
+containing translation rules, and the glue grammar for patching things together. Internally, Joshua
+doesn't distinguish between the roles of these grammars; they are treated differently only in that
+they typically have different span limits (the maximum input width they can be applied to).
+
+Grammars are instantiated with config file lines of the following form:
+
+    tm = TYPE OWNER SPAN_LIMIT FILE
+
+* `TYPE` is the grammar type, which must be set to "thrax". 
+* `OWNER` is the grammar's owner, which defines the set of [feature weights](features.html) that
+  apply to the weights found in each line of the grammar (using different owners allows each grammar
+  to have different sets and numbers of weights, while sharing owners allows weights to be shared
+  across grammars).
+* `SPAN_LIMIT` is the maximum span of the input that rules from this grammar can be applied to. A
+  span limit of 0 means "no limit", while a span limit of -1 means that rules from this grammar must
+  be anchored to the left side of the sentence (index 0).
+* `FILE` is the path to the file containing the grammar. If the file is a directory, it is assumed
+  to be [packed](packed.html). Only one packed grammar can currently be used at a time.
+
+For reference, the following two translation model lines are used by the [pipeline](pipeline.html):
+
+    tm = thrax pt 20 /path/to/packed/grammar
+    tm = thrax glue -1 /path/to/glue/grammar
+
+### Language model options <a id="lm" />
+
+Joshua supports any number of language models.  To add a language
+model, add a line of the following format to the configuration file:
+
+    lm = TYPE ORDER LEFT_STATE RIGHT_STATE CEILING_COST FILE
+
+where the six fields correspond to the following values:
+
+* `TYPE`: one of "kenlm", "berkeleylm", or "none"
+* `ORDER`: the order of the language model
+* `LEFT_STATE`: whether to use left-state minimization; currently only supported by KenLM
+* `RIGHT_STATE`: whether to use right equivalent state (currently unsupported)
+* `CEILING_COST`: the LM-specific ceiling cost of all n-grams (currently ignored)
+* `FILE`: the path to the language model file.  All language model types support the standard ARPA
+   format.  Additionally, if the LM type is "kenlm", this file can be compiled into KenLM's compiled
+   format (using the program at `$JOSHUA/bin/build_binary`); if the LM type is "berkeleylm", it
+   can be compiled by following the directions in
+   `$JOSHUA/src/joshua/decoder/ff/lm/berkeley_lm/README`. The [pipeline](pipeline.html) will
+   automatically compile either type.
+
+For each language model, you need to specify a feature weight in the following format:
+
+    lm_0 WEIGHT
+    lm_1 WEIGHT
+    ...
+
+where the indices correspond to the order of the language model declaration lines.
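+
+For example, a hypothetical pair of declarations for a 5-gram KenLM model (the path and weight
+value are placeholders) would be:
+
+    lm = kenlm 5 false false 100 /path/to/lm.kenlm
+
+    lm_0 1.0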
+
+### Output options <a id="output" />
+
+- `output-format` *New in 5.0*
+
+  Joshua prints a lot of information to STDERR (making this more granular is on the TODO
+  list). Output to STDOUT is reserved for decoder translations, and is controlled by this
+  parameter, which takes a format string built from the following tokens:
+
+   - `%i`: the sentence number (0-indexed)
+
+   - `%e`: the source sentence
+
+   - `%s`: the translated sentence
+
+   - `%S`: the translated sentence, with some basic capitalization and denormalization, e.g.,
+
+         $ echo "¿ who you lookin' at , mr. ?" | $JOSHUA/bin/decoder -output-format "%S" -mark-oovs false 2> /dev/null 
+         ¿Who you lookin' at, Mr.? 
+
+   - `%t`: the synchronous derivation
+
+   - `%f`: the list of feature values (as name=value pairs)
+
+   - `%c`: the model cost
+
+   - `%w`: the weight vector (unimplemented)
+
+   - `%a`: the alignments between source and target words (unimplemented)
+
+  The default value is
+
+      output-format = %i ||| %s ||| %f ||| %c
+      
+  i.e.,
+
+      input ID ||| translation ||| model scores ||| score
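+
+  For instance, a hypothetical invocation that prints only the sentence number and the synchronous
+  derivation (suppressing the verbose STDERR output) might look like:
+
+      cat input.txt | $JOSHUA/bin/decoder -c joshua.config -output-format "%i ||| %t" 2> /dev/null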
+
+- `top-n` --- *300*
+
+  The number of translation hypotheses to output, sorted in decreasing order of model score
+
+- `use-unique-nbest` --- *true*
+
+  When constructing the n-best list for a sentence, skip hypotheses whose string has already been
+  output.
+
+- `escape-trees` --- *false*
+
+- `include-align-index` --- *false*
+
+  Output the source word indices that each target word aligns to.
+
+- `mark-oovs` --- *false*
+
+  If `true`, this causes the text "_OOV" to be appended to each untranslated word in the output.
+
+- `visualize-hypergraph` --- *false*
+
+  If set to true, a visualization of the hypergraph will be displayed, though you will have to
+  explicitly include the relevant jar files.  See the example usage in
+  `$JOSHUA/examples/tree_visualizer/`, which contains a demonstration of a source sentence,
+  translation, and synchronous derivation.
+
+- `dump-hypergraph` --- ""
+
+  This feature directs that the hypergraph should be written to disk for each input sentence. If
+  set, the value should contain the string "%d", which is replaced with the sentence number. For
+  example,
+  
+      cat input.txt | $JOSHUA/bin/decoder -dump-hypergraph hgs/%d.txt
+
+  Note that the output directory must exist.
+
+  TODO: revive the
+  [discussion on a common hypergraph format](http://aclweb.org/aclwiki/index.php?title=Hypergraph_Format)
+  on the ACL Wiki and support that format.
+
+### Lattice decoding
+
+In addition to regular sentences, Joshua can decode weighted lattices encoded in
+[the PLF format](http://www.statmt.org/moses/?n=Moses.WordLattices), except that path costs should
+be listed as <b>log probabilities</b> instead of probabilities.  Lattice decoding was originally
+added by Lane Schwartz and [Chris Dyer](http://www.cs.cmu.edu/~cdyer/).
+
+Joshua will automatically detect whether the input sentence is a regular sentence (the usual case)
+or a lattice.  If a lattice, a feature will be activated that accumulates the cost of different
+paths through the lattice.  In this case, you need to ensure that a weight for this feature is
+present in [your model file](decoder.html). The [pipeline](pipeline.html) will handle this
+automatically, or if you are doing this manually, you can add the line
+
+    SourcePath COST
+    
+to your Joshua configuration file.    
+
+Lattices must be listed one per line.
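+
+As an illustrative sketch (words and scores are made up), a lattice whose second word has two
+alternatives might look like the following, where each edge is roughly a
+('word', log probability, distance-to-the-next-node) triple:
+
+    ((('el',0.0,1),),(('chico',-0.693,1),('niño',-0.693,1),),)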
+
+### Alternate modes of operation <a id="modes" />
+
+In addition to decoding input sentences in the standard way, Joshua supports both *constrained
+decoding* and *synchronous parsing*. In both settings, both the source and target sides are provided
+as input, and the decoder finds a derivation between them.
+
+#### Constrained decoding
+
+To enable constrained decoding, simply append the desired target string as part of the input, in
+the following format:
+
+    source sentence ||| target sentence
+
+Joshua will translate the source sentence constrained to the target sentence. There are a few
+caveats:
+
+   * Left-state minimization cannot be enabled for the language model
+
+   * A heuristic is used to constrain the derivation (the LM state must match against the
+     input). This is not a perfect heuristic, and sometimes results in analyses that are not
+     perfectly constrained to the input, but have extra words.
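+
+For example, with a hypothetical configuration file `joshua.config`, a single sentence pair could
+be decoded under this constraint with:
+
+    echo "el chico ||| the boy" | $JOSHUA/bin/decoder -c joshua.config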
+
+#### Synchronous parsing
+
+Joshua supports synchronous parsing as a two-step sequence of monolingual parses, as described in
+Dyer (NAACL 2010) ([PDF](http://www.aclweb.org/anthology/N10-1033.pdf)). To enable this:
+
+   - Set the configuration parameter `parse = true`.
+
+   - Remove all language models from the configuration file
+
+   - Provide input in the following format:
+
+          source sentence ||| target sentence
+
+You may also wish to display the synchronous parse tree (`-output-format %t`) and the alignment
+(`-show-align-index`).
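+
+Putting these together, a hypothetical invocation (with `parse.config` standing in for a
+configuration file that contains no language models) might look like:
+
+    echo "el chico ||| the boy" | $JOSHUA/bin/decoder -c parse.config -parse true -output-format "%t"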
+

http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/5.0/faq.md
----------------------------------------------------------------------
diff --git a/5.0/faq.md b/5.0/faq.md
new file mode 100644
index 0000000..2ac67ba
--- /dev/null
+++ b/5.0/faq.md
@@ -0,0 +1,7 @@
+---
+layout: default
+category: help
+title: Common problems
+---
+
+Solutions to common problems will be posted here as we become aware of them.

http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/5.0/features.md
----------------------------------------------------------------------
diff --git a/5.0/features.md b/5.0/features.md
new file mode 100644
index 0000000..7613954
--- /dev/null
+++ b/5.0/features.md
@@ -0,0 +1,6 @@
+---
+layout: default
+title: Features
+---
+
+Joshua 5.0 uses a sparse feature representation to encode features internally.

http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/5.0/file-formats.md
----------------------------------------------------------------------
diff --git a/5.0/file-formats.md b/5.0/file-formats.md
new file mode 100644
index 0000000..a53d661
--- /dev/null
+++ b/5.0/file-formats.md
@@ -0,0 +1,72 @@
+---
+layout: default
+category: advanced
+title: Joshua file formats
+---
+This page describes the formats of Joshua configuration and support files.
+
+## Translation models (grammars)
+
+Joshua supports two grammar file formats: a text-based version (also used by Hiero, shared by
+cdec, and supported by hierarchical Moses), and an efficient
+[packed representation](packing.html) developed by [Juri Ganitkevitch](http://cs.jhu.edu/~juri).
+
+Grammar rules follow this format:
+
+    [LHS] ||| SOURCE-SIDE ||| TARGET-SIDE ||| FEATURES
+    
+The source and target sides contain a mixture of terminals and nonterminals. The nonterminals are
+linked across sides by indices. There is no limit to the number of paired nonterminals in the rule
+or on the nonterminal labels (Joshua supports decoding with SAMT and GHKM grammars).
+
+    [X] ||| el chico [X,1] ||| the boy [X,1] ||| -3.14 0 2 17
+    [S] ||| el chico [VP,1] ||| the boy [VP,1] ||| -3.14 0 2 17
+    [VP] ||| [NP,1] [IN,2] [VB,3] ||| [VB,3] [IN,2] [NP,1] ||| 0.0019026637 0.81322956
+
+The feature values can have optional labels, e.g.:
+
+    [X] ||| el chico [X,1] ||| the boy [X,1] ||| lexprob=-3.14 lexicalized=1 numwords=2 count=17
+    
+One file common to decoding is the glue grammar, which for Hiero grammars is defined as follows:
+
+    [GOAL] ||| <s> ||| <s> ||| 0
+    [GOAL] ||| [GOAL,1] [X,2] ||| [GOAL,1] [X,2] ||| -1
+    [GOAL] ||| [GOAL,1] </s> ||| [GOAL,1] </s> ||| 0
+
+Joshua's [pipeline](pipeline.html) supports extraction of Hiero and SAMT grammars via
+[Thrax](thrax.html) or GHKM grammars using [Michel Galley](http://www-nlp.stanford.edu/~mgalley/)'s
+GHKM extractor (included) or Moses' GHKM extractor (if Moses is installed).
+
+## Language Model
+
+Joshua has two language model implementations: [KenLM](http://kheafield.com/code/kenlm/) and
+[BerkeleyLM](http://berkeleylm.googlecode.com).  All language model implementations support the
+standard ARPA format output by [SRILM](http://www.speech.sri.com/projects/srilm/).  In addition,
+KenLM and BerkeleyLM support compiled formats that can be loaded more quickly and efficiently. KenLM
+is written in C++ and is supported via a JNI bridge, while BerkeleyLM is written in Java. KenLM is
+the default because of its support for left-state minimization.
+
+### Compiling for KenLM
+
+To compile an ARPA language model for KenLM, use the provided `build_binary` command, located deep within
+the Joshua source code:
+
+    $JOSHUA/bin/build_binary lm.arpa lm.kenlm
+    
+This script takes the `lm.arpa` file and produces the compiled version in `lm.kenlm`.
+
+### Compiling for BerkeleyLM
+
+To compile a language model for BerkeleyLM, type the following (replacing `MEM` with a Java memory specification such as `4g`):
+
+    java -cp $JOSHUA/lib/berkeleylm.jar -server -mxMEM edu.berkeley.nlp.lm.io.MakeLmBinaryFromArpa lm.arpa lm.berkeleylm
+
+The `lm.berkeleylm` file can then be listed directly in the [Joshua configuration file](decoder.html).
+
+## Joshua configuration file
+
+The [decoder page](decoder.html) documents decoder command-line and config file options.
+
+## Thrax configuration
+
+See [the thrax page](thrax.html) for more information about the Thrax configuration file.

http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/5.0/index.md
----------------------------------------------------------------------
diff --git a/5.0/index.md b/5.0/index.md
new file mode 100644
index 0000000..7a1d016
--- /dev/null
+++ b/5.0/index.md
@@ -0,0 +1,77 @@
+---
+layout: default
+title: Getting Started
+---
+
+This page contains end-user oriented documentation for the 5.0 release of
+[the Joshua decoder](http://joshua-decoder.org/).
+
+## Download and Setup
+
+1. Download Joshua by clicking the big green button above, or from the command line:
+
+       wget -q http://cs.jhu.edu/~post/files/joshua-v5.0.tgz
+
+2. Next, unpack it, set environment variables, and compile everything:
+
+       tar xzf joshua-v5.0.tgz
+       cd joshua-v5.0
+
+       # for bash
+       export JAVA_HOME=/path/to/java
+       export JOSHUA=$(pwd)
+       echo "export JOSHUA=$JOSHUA" >> ~/.bashrc
+
+       # for tcsh
+       setenv JAVA_HOME /path/to/java
+       setenv JOSHUA `pwd`
+       echo "setenv JOSHUA $JOSHUA" >> ~/.profile
+       
+       ant
+
+   (If you don't know what to set `$JAVA_HOME` to, try `/usr/java/default`)
+
+3. If you have a Hadoop installation, make sure that the environment variable `$HADOOP` is set and
+points to it. If you don't, Joshua will roll one out for you in standalone mode.
+
+4. If you want to use Cherry & Foster's
+[batch MIRA tuner](http://aclweb.org/anthology-new/N/N12/N12-1047v2.pdf) (recommended), you need to
+[install Moses](http://www.statmt.org/moses/?n=Development.GetStarted) and define the `$MOSES`
+environment variable to point to the root of the Moses installation.
+
+## Quick start
+
+Our <a href="pipeline.html">pipeline script</a> is the quickest way to get started. For example, to
+train and test a complete model translating from Bengali to English:
+
+First, download the Indian languages data:
+   
+    wget --no-check -O indian-languages.tgz https://github.com/joshua-decoder/indian-parallel-corpora/tarball/master
+    tar xf indian-languages.tgz
+    ln -s joshua-decoder-indian-parallel-corpora-b71d31a input
+
+Then, train and test a model:
+
+    $JOSHUA/bin/pipeline.pl --source bn --target en \
+        --no-prepare --aligner berkeley \
+        --corpus input/bn-en/tok/training.bn-en \
+        --tune input/bn-en/tok/dev.bn-en \
+        --test input/bn-en/tok/devtest.bn-en
+
+This will align the data with the Berkeley aligner, build a Hiero model, tune with MERT, decode the
+test sets, and report results that should correspond to what you find on <a
+href="/indian-parallel-corpora/">the Indian Parallel Corpora page</a>. For
+more details, including information on the many options available with the pipeline script, please
+see <a href="pipeline.html">its documentation page</a>.
+
+## More information
+
+For more detail on the decoder itself, including its command-line options, see
+[the Joshua decoder page](decoder.html).  You can also learn more about other steps of
+[the Joshua MT pipeline](pipeline.html), including [grammar extraction](thrax.html) with Thrax and
+Joshua's [efficient grammar representation](packing.html).
+
+If you have problems or issues, you might find some help [on our answers page](faq.html) or
+[in the mailing list archives](https://groups.google.com/forum/?fromgroups#!forum/joshua_support).
+
+You can also create a [bundled configuration](bundle.html), a minimal set of configuration, resource, and script files that can easily be transferred and shared.

http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/5.0/jacana.md
----------------------------------------------------------------------
diff --git a/5.0/jacana.md b/5.0/jacana.md
new file mode 100644
index 0000000..613a862
--- /dev/null
+++ b/5.0/jacana.md
@@ -0,0 +1,139 @@
+---
+layout: default
+title: Alignment with Jacana
+---
+
+## Introduction
+
+jacana-xy is a token-based word aligner for machine translation, adapted from the original
+English-English word aligner jacana-align described in the following paper:
+
+    A Lightweight and High Performance Monolingual Word Aligner. Xuchen Yao, Benjamin Van Durme,
+    Chris Callison-Burch and Peter Clark. Proceedings of ACL 2013, short papers.
+
+It currently supports only aligning from French to English, with a very limited feature set developed
+during the one-week hack at the [Eighth MT Marathon 2013](http://statmt.org/mtm13). Please feel free to check
+out the code, read to the bottom of this page, and
+[send the author an email](http://www.cs.jhu.edu/~xuchen/) if you want to add more language pairs to
+it.
+
+## Build
+
+jacana-xy is written in a mixture of Java and Scala. If you build with ant, you have to set the
+environment variables `JAVA_HOME` and `SCALA_HOME`. In my system, I have:
+
+    export JAVA_HOME=/usr/lib/jvm/java-6-sun-1.6.0.26
+    export SCALA_HOME=/home/xuchen/Downloads/scala-2.10.2
+
+Then type:
+
+    ant
+
+`build/lib/jacana-xy.jar` will be built for you.
+
+If you build from Eclipse, first install scala-ide, then import the whole jacana folder as a Scala project. Eclipse should find the `.project` file and set up the project automatically for you.
+
+## Demo
+
+`scripts-align/runDemoServer.sh` starts the web demo. Direct your browser to http://localhost:8080/ and you should be able to align some sentences.
+
+Note: to make jacana-xy know where to look for resource files, pass the property `JACANA_HOME` to Java when you run it:
+
+    java -DJACANA_HOME=/path/to/jacana -cp jacana-xy.jar ......
+
+## Browser
+
+You can also browse one or two alignment files (*.json) with Firefox by opening src/web/AlignmentBrowser.html.
+
+Note 1: due to strict security settings for accessing local files, Chrome/IE won't work.
+
+Note 2: the input *.json files have to be in the same folder as AlignmentBrowser.html.
+
+## Align
+
+`scripts-align/alignFile.sh` aligns tab-separated sentence files and writes the output to a .json file that's accepted by the browser:
+
+    java -DJACANA_HOME=../ -jar ../build/lib/jacana-xy.jar -src fr -tgt en -m fr-en.model -a s.txt -o s.json
+
+The same script also takes GIZA++-style input files (one file containing the source sentences, and the other file the target sentences) and outputs one .align file with dashed alignment indices (e.g. "1-2 0-4"):
+
+    java -DJACANA_HOME=../ -jar ../build/lib/jacana-xy.jar -m fr-en.model -src fr -tgt en -a s1.txt -b s2.txt -o s.align
+
+## Training
+
+    java -DJACANA_HOME=../ -jar ../build/lib/jacana-xy.jar -r train.json -d dev.json -t test.json -m /tmp/align.model
+
+The aligner then trains on train.json and reports F1 values on dev.json every 10 iterations; when the stopping criterion has been reached, it tests on test.json.
+
+Every 10 iterations, a model file is saved to (in this example) /tmp/align.model.iter_XX.F1_XX.X. Normally what I do is select the one with the best F1 on dev.json, then run a final test on test.json:
+
+    java -DJACANA_HOME=../ -jar ../build/lib/jacana-xy.jar -t test.json -m /tmp/align.model.iter_XX.F1_XX.X
+
+In this case, since the training data is missing, the aligner assumes it's a test job, reads the model file from the -m option, and tests on test.json.
+
+All the .json files are in a format like the following (also accepted by the browser for display):
+
+    [
+        {
+            "id": "0008",
+            "name": "Hansards.french-english.0008",
+            "possibleAlign": "0-0 0-1 0-2",
+            "source": "bravo !",
+            "sureAlign": "1-3",
+            "target": "hear , hear !"
+        },
+        {
+            "id": "0009",
+            "name": "Hansards.french-english.0009",
+            "possibleAlign": "1-1 6-5 7-5 6-6 7-6 13-10 13-11",
+            "source": "monsieur le Orateur , ma question se adresse à le ministre chargé de les transports .",
+            "sureAlign": "0-0 2-1 3-2 4-3 5-4 8-7 9-8 10-9 12-10 14-11 15-12",
+            "target": "Mr. Speaker , my question is directed to the Minister of Transport ."
+        }
+    ]
+
+Note that `possibleAlign` is not used.
+
+The stopping criterion is to run up to 300 iterations or until the objective difference between two iterations is less than 0.001, whichever happens first. These values are currently hard-coded; if you need more flexibility on this, send me an email!
+
+## Support More Languages
+
+To add support for more languages, you need:
+
+* labelled word alignments (in the download there's already French-English under alignment-data/fr-en; I also have Chinese-English and Arabic-English; let me know if you have more). Usually 100 labelled sentence pairs would be enough.
+* some feature functions implemented for this language pair
+
+To add more features, you need to implement the following interface:
+
+    edu.jhu.jacana.align.feature.AlignFeature
+
+and override the following function:
+
+    addPhraseBasedFeature
+
+For instance, a simple feature for the French-English alignment task that checks whether the two words are translations of each other in Wiktionary implements the function as:
+
+    def addPhraseBasedFeature(pair: AlignPair, ins: AlignFeatureVector, i: Int, srcSpan: Int, j: Int, tgtSpan: Int,
+        currState: Int, featureAlphabet: Alphabet) {
+      if (j == -1) {
+        // no feature fires for NULL alignments (j == -1)
+      } else {
+        val srcTokens = pair.srcTokens.slice(i, i+srcSpan).mkString(" ")
+        val tgtTokens = pair.tgtTokens.slice(j, j+tgtSpan).mkString(" ")
+
+        // fire an indicator feature when the source/target phrases are listed as translations in Wiktionary
+        if (WiktionaryMultilingual.exists(srcTokens, tgtTokens)) {
+          ins.addFeature("InWiktionary", NONE_STATE, currState, 1.0, srcSpan, featureAlphabet)
+        }
+      }
+    }
+This is a more general function that also deals with phrase alignment, but it is suggested that you implement it just for token alignment, as the phrase alignment part is currently very slow to train (60x slower than token alignment).
+
+Some other language-independent and English-only features are implemented under the package edu.jhu.jacana.align.feature, for instance:
+
+* StringSimilarityAlignFeature: various string similarity measures
+* PositionalAlignFeature: features based on relative sentence positions
+* DistortionAlignFeature: Markovian (state transition) features
+
+When you add features for more languages, just create a new package like the one for French-English:
+
+    edu.jhu.jacana.align.feature.fr_en
+
+and start coding!
+