Posted to commits@joshua.apache.org by le...@apache.org on 2016/04/05 07:13:02 UTC

[14/18] incubator-joshua-site git commit: Initial import of joshua-decoder.github.com site to Apache

http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/6.0/zmert.md
----------------------------------------------------------------------
diff --git a/6.0/zmert.md b/6.0/zmert.md
new file mode 100644
index 0000000..022d0dc
--- /dev/null
+++ b/6.0/zmert.md
@@ -0,0 +1,83 @@
+---
+layout: default6
+category: advanced
+title: Z-MERT
+---
+
+This document describes how to manually run the Z-MERT module.  Z-MERT is Joshua's minimum error-rate
+training module, written by Omar F. Zaidan.  It is easily adapted to drop in different decoders, and
+was also written so as to work with objective functions other than BLEU.
+
+((Section (1) in `$JOSHUA/examples/ZMERT/README_ZMERT.txt` is an expanded version of this section))
+
+Z-MERT can be used by launching the driver program (`ZMERT.java`), which expects a config file as
+its main argument.  This config file can be used to specify any subset of Z-MERT's 20-some
+parameters.  For a full list of those parameters and their default values, run Z-MERT with a single
+`-h` argument as follows:
+
+    java -cp $JOSHUA/bin joshua.zmert.ZMERT -h
+
+So what does a Z-MERT config file look like?
+
+Examine the file `examples/ZMERT/ZMERT_config_ex2.txt`.  You will find that it
+specifies the following "main" MERT parameters:
+
+    (*) -dir dirPrefix:         working directory
+    (*) -s sourceFile:          source sentences (foreign sentences) of the MERT dataset
+    (*) -r refFile:             target sentences (reference translations) of the MERT dataset
+    (*) -rps refsPerSen:        number of reference translations per sentence
+    (*) -p paramsFile:          file containing parameter names, initial values, and ranges
+    (*) -maxIt maxMERTIts:      maximum number of MERT iterations
+    (*) -ipi initsPerIt:        number of intermediate initial points per iteration
+    (*) -cmd commandFile:       name of file containing commands to run the decoder
+    (*) -decOut decoderOutFile: name of the output file produced by the decoder
+    (*) -dcfg decConfigFile:    name of decoder config file
+    (*) -N N:                   size of N-best list (per sentence) generated in each MERT iteration
+    (*) -v verbosity:           output verbosity level (0-2; higher value => more verbose)
+    (*) -seed seed:             seed used to initialize the random number generator
+
+(Note that the `-s` parameter is only used if Z-MERT is running Joshua as an
+ internal decoder.  If Joshua is run as an external decoder, as is the case in
+ this README, then this parameter is ignored.)
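+Putting these parameters together, a minimal Z-MERT config file might look
+like the following sketch (the paths and values are illustrative, not the
+exact contents of `ZMERT_config_ex2.txt`):
+
+    -dir    ZMERT_example
+    -s      src.txt
+    -r      ref
+    -rps    4
+    -p      params.txt
+    -maxIt  20
+    -ipi    20
+    -cmd    decoder_command.txt
+    -decOut nbest.out
+    -dcfg   joshua.config
+    -N      300
+    -v      1
+    -seed   12341234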
+
+To test Z-MERT on the 100-sentence test set of example2, provide this config
+file to Z-MERT as follows:
+
+    java -cp bin joshua.zmert.ZMERT -maxMem 500 examples/ZMERT/ZMERT_config_ex2.txt > examples/ZMERT/ZMERT_example/ZMERT.out
+
+This will run Z-MERT for a couple of iterations on the data from the example2
+folder.  (Notice that we have made copies of the source and reference files
+from example2 and renamed them src.txt and ref.* in the ZMERT_example folder,
+just to have all the files needed by Z-MERT in one place.)  Once the Z-MERT run
+is complete, you should be able to inspect the log file to see what it did.
+If everything goes well, the run should take a few minutes, more than 95% of
+which Z-MERT spends waiting for Joshua to finish decoding the sentences (once
+per iteration).
+
+The output file you get should be equivalent to `ZMERT.out.verbosity1`.  If you
+rerun the experiment with the verbosity (-v) argument set to 2 instead of 1,
+the output file you get should be equivalent to `ZMERT.out.verbosity2`, which has
+more interesting details about what Z-MERT does.
+
+Notice the additional `-maxMem` argument.  It tells Z-MERT not to keep holding
+on to memory while the decoder is running (during which time Z-MERT would be
+idle).  The value 500 tells Z-MERT that it may use at most 500 MB.  For more
+details on this issue, see section (4) in Z-MERT's README.
+
+A quick note about Z-MERT's interaction with the decoder.  If you examine the
+file `decoder_command_ex2.txt`, which is provided as the commandFile (`-cmd`)
+argument in Z-MERT's config file, you'll find it contains the command one would
+use to run the decoder.  Z-MERT launches the commandFile as an external
+process, and assumes that it will launch the decoder to produce translations.
+(Make sure that commandFile is executable.)  After launching this external
+process, Z-MERT waits for it to finish, then uses the resulting output file for
+parameter tuning (in addition to the output files from previous iterations).
+The command file here only has a single command, but your command file could
+have multiple lines.  Just make sure the command file itself is executable.
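+As a concrete sketch, here is how such a command file might be created and
+made executable (the decoder invocation, class name, and file names here are
+illustrative, not the exact contents of `decoder_command_ex2.txt`):

```shell
# Write a hypothetical commandFile for Z-MERT (all paths are placeholders)
cat > decoder_command.txt <<'EOF'
#!/bin/bash
# The command must produce the n-best file that Z-MERT expects
# as its -decOut argument (here, nbest.out).
java -cp $JOSHUA/bin joshua.decoder.JoshuaDecoder joshua.config < src.txt > nbest.out
EOF

# Z-MERT launches this file as an external process, so it must be executable
chmod +x decoder_command.txt
```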
+
+Notice that the Z-MERT arguments `decConfigFile` and `decoderOutFile` (`-dcfg` and
+`-decOut`) must match the two Joshua arguments in the commandFile's (`-cmd`) single
+command.  Also, the Z-MERT argument for N must match the value for `top_n` in
+Joshua's config file, indicated by the Z-MERT argument `decConfigFile` (`-dcfg`).
+
+For more details on Z-MERT, refer to `$JOSHUA/examples/ZMERT/README_ZMERT.txt`.

http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/6/advanced.md
----------------------------------------------------------------------
diff --git a/6/advanced.md b/6/advanced.md
new file mode 100644
index 0000000..4997e73
--- /dev/null
+++ b/6/advanced.md
@@ -0,0 +1,7 @@
+---
+layout: default6
+category: links
+title: Advanced features
+---
+
+

http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/6/bundle.md
----------------------------------------------------------------------
diff --git a/6/bundle.md b/6/bundle.md
new file mode 100644
index 0000000..f433172
--- /dev/null
+++ b/6/bundle.md
@@ -0,0 +1,100 @@
+---
+layout: default6
+category: links
+title: Building a language pack
+---
+
+*The information in this page applies to Joshua 6.0.3 and greater*.
+
+Joshua distributes [language packs](/language-packs), which are models
+that have been trained and tuned for particular language pairs. You
+can easily create your own language pack after you have trained and
+tuned a model using the provided
+`$JOSHUA/scripts/support/run-bundler.py` script, which gathers files
+from a pipeline training directory and bundles them together for easy
+distribution and release.
+
+The script takes just two mandatory arguments in the following order:
+
+1.  The path to the Joshua configuration file to base the bundle
+    on. This file should contain the tuned weights from the tuning run, so
+    you can use either the final tuned file from the tuning run
+    (`tune/joshua.config.final`) or from the test run
+    (`test/model/joshua.config`).
+1.  The directory to place the language pack in. If this directory
+    already exists, the script will die, unless you also pass `--force`.
+
+In addition, there are a number of other arguments that may be important.
+
+- `--root /path/to/root`. If the file paths in the Joshua config file are
+   not absolute, you need to provide the root they are relative to. If you
+   specify a tuned pipeline file (such as `tune/joshua.config.final` above),
+   the paths should all be absolute. If you instead provide a config file
+   from a previous bundle (e.g., `test/model/joshua.config`), the bundle
+   directory above is the relative root.
+
+- The config file options that are used in the pipeline are likely not
+  the ones you want if you release a model. For example, the tuning
+  configuration file contains options that tell Joshua to output 300
+  translation candidates for each sentence (`-top-n 300`) and to
+  include lots of detail about each translation (`-output-format '%i
+  ||| %s ||| %f ||| %c'`).  Because of this, you will want to tell the
+  run bundler to change many of the config file options to be more
+  geared towards human-readable output. The default copy-config
+  options are `-top-n 0 -output-format %S -mark-oovs false`, which
+  accomplish exactly this (human readability).
+  
+- A very important issue has to do with the translation model (the
+  "TM", also sometimes called the grammar or phrase table). The
+  translation model can be very large, so that it takes a long time to
+  load and to [pack](packing.html). To reduce this time during model
+  training, the translation model is filtered against the tuning and
+  testing data in the pipeline, and these filtered models will be what
+  is listed in the source config files. However, when exporting a
+  model for use as a language pack, you need to export the full model
+  instead of the filtered one so as to maximize your coverage on new
+  test data. The `--tm` parameter is used to accomplish this; it takes
+  an argument specifying the path to the full model. If you would
+  additionally like the large model to be [packed](packing.html) (this
+  is recommended; it reformats the TM so that it can be quickly loaded
+  at run time), you can use `--pack-tm` instead. You can only pack one
+  TM (but typically there is only one TM anyway). Multiple `--tm`
+  parameters can be passed; they will replace TMs found in the config
+  file in the order they are found.
+
+Here is an example invocation for packing a hierarchical model using
+the final tuned Joshua config file:
+
+    ./run-bundler.py \
+      --force --verbose \
+      /path/to/rundir/tune/joshua.config.final \
+      language-pack-YYYY-MM-DD \
+      --root /path/to/rundir \
+      --pack-tm /path/to/rundir/grammar.gz \
+      --copy-config-options \
+        '-top-n 0 -output-format %S -mark-oovs false' \
+      --server-port 5674
+
+The copy config options tell the decoder to present just the
+single-best (`-top-n 0`) translated output string that has been
+heuristically capitalized (`-output-format %S`), to not append `_OOV`
+to OOVs (`-mark-oovs false`), and to use the translation model
+`/path/to/rundir/grammar.gz` as the main translation model, packing it
+before placing it in the bundle. Note that these arguments to
+`--copy-config` are the default, so you could leave this off entirely.
+See [this page](decoder.html) for a longer list of decoder options.
+
+The following command is a slight variation used for phrase-based models; it
+instead takes the test-set Joshua config (the result is the same):
+
+    ./run-bundler.py \
+      --force --verbose \
+      /path/to/rundir/test/model/joshua.config \
+      --root /path/to/rundir/test/model \
+      language-pack-YYYY-MM-DD \
+      --pack-tm /path/to/rundir/model/phrase-table.gz \
+      --server-port 5674
+
+In both cases, a new directory `language-pack-YYYY-MM-DD` will be
+created along with a README and a number of support files.
+

http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/6/decoder.md
----------------------------------------------------------------------
diff --git a/6/decoder.md b/6/decoder.md
new file mode 100644
index 0000000..e8dc8c9
--- /dev/null
+++ b/6/decoder.md
@@ -0,0 +1,385 @@
+---
+layout: default6
+category: links
+title: Decoder configuration parameters
+---
+
+Joshua configuration parameters affect the runtime behavior of the decoder itself.  This page
+describes the complete list of these parameters and describes how to invoke the decoder manually.
+
+To run the decoder, a convenience script is provided that loads the necessary Java libraries.
+Assuming you have set the environment variable `$JOSHUA` to point to the root of your installation,
+its syntax is:
+
+    $JOSHUA/bin/decoder [-m memory-amount] [-c config-file other-joshua-options ...]
+
+The `-m` argument, if present, must come first, and the memory specification is in Java format
+(e.g., 400m, 4g, 50g).  Most notably, the suffixes "m" and "g" are used for "megabytes" and
+"gigabytes", and there cannot be a space between the number and the unit.  The value of this
+argument is passed to Java itself in the invocation of the decoder, and the remaining options are
+passed to Joshua.  The `-c` parameter has special import because it specifies the location of the
+configuration file.
+
+The Joshua decoder works by reading from STDIN and printing translations to STDOUT as they are
+received, according to a number of [output options](#output).  If no run-time parameters are
+specified (e.g., no translation model), sentences are simply pushed through untranslated.  Blank
+lines are similarly pushed through as blank lines, so as to maintain parallelism with the input.
+
+Parameters can be provided to Joshua via a configuration file and from the command
+line.  Command-line arguments override values found in the configuration file.  The format for
+configuration file parameters is
+
+    parameter = value
+
+Command-line options are specified in the following format
+
+    -parameter value
+
+Values are one of four types (which we list here mostly to call attention to the boolean format):
+
+- STRING, an arbitrary string (no spaces)
+- FLOAT, a floating-point value
+- INT, an integer
+- BOOLEAN, a boolean value.  For booleans, `true` evaluates to true, and all other values evaluate
+  to false.  For command-line options, the value may be omitted, in which case it evaluates to
+  true.  For example, the following are equivalent:
+
+      $JOSHUA/bin/decoder -mark-oovs true
+      $JOSHUA/bin/decoder -mark-oovs
+
+## Joshua configuration file
+
+In addition to the decoder parameters described below, the configuration file contains the model
+feature weights.  These weights are distinguished from runtime parameters in that they are delimited
+by a space instead of an equals sign. They take the following
+format, and by convention are placed at the end of the configuration file:
+
+    lm_0 4.23
+    tm_pt_0 -0.2
+    OOVPenalty -100
+   
+Joshua can make use of thousands of features, which are described in further detail in the
+[feature file](features.html).
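+For concreteness, here is a sketch of a small configuration file that combines
+run-time parameters, feature functions, and weights (all paths, feature names,
+and values here are illustrative):
+
+    tm = thrax pt 20 /path/to/packed/grammar
+    tm = thrax glue -1 /path/to/glue/grammar
+
+    feature-function = LanguageModel -lm_file /path/to/lm.gz -lm_order 5 -lm_type kenlm
+
+    top-n = 1
+    output-format = %S
+
+    lm_0 4.23
+    tm_pt_0 -0.2
+    tm_glue_0 0.1
+    OOVPenalty -100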
+
+## Joshua decoder parameters
+
+This section contains a list of the Joshua run-time parameters.  An important note about the
+parameters is that they are collapsed to canonical form, in which dashes (-) and underscores (_) are
+removed and case is converted to lowercase.  For example, the following parameter forms are
+equivalent (either in the configuration file or from the command line):
+
+    {top-n, topN, top_n, TOP_N, t-o-p-N}
+    {poplimit, pop-limit, pop_limit, popLimit, PoPlImIt}
+
+This basically defines equivalence classes of parameters, and relieves you of the task of having to
+remember the exact format of each parameter.
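+This collapsing rule can be sketched as a small shell function (an
+illustration of the rule described above, not Joshua's actual implementation):

```shell
# Canonicalize a parameter name: drop dashes and underscores, lowercase
canonicalize() {
  echo "$1" | tr -d '_-' | tr '[:upper:]' '[:lower:]'
}

canonicalize "top-n"     # topn
canonicalize "TOP_N"     # topn
canonicalize "PoPlImIt"  # poplimit
```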
+
+In what follows, we group the configuration parameters in the following groups:
+
+- [General options](#general)
+- [Pruning](#pruning)
+- [Translation model options](#tm)
+- [Language model options](#lm)
+- [Output options](#output)
+- [Alternate modes of operation](#modes)
+
+<a id="general" />
+
+### General decoder options
+
+- `c`, `config` --- *NULL*
+
+   Specifies the configuration file from which Joshua options are loaded.  This feature is unique in
+   that it must be specified from the command line (obviously).
+
+- `amortize` --- *true*
+
+  When true, specifies that sorting of the rule lists at each trie node in the grammar should be
+  delayed until the trie node is accessed. When false, all such nodes are sorted before decoding
+  even begins. Setting to true results in slower per-sentence decoding, but allows the decoder to
+  begin translating almost immediately (especially with large grammars).
+
+- `server-port` --- *0*
+
+  If set to a nonzero value, Joshua will start a multithreaded TCP/IP server on the specified
+  port. Clients can connect to it directly through programming APIs or command-line tools like
+  `telnet` or `nc`.
+  
+      $ $JOSHUA/bin/decoder -m 30g -c /path/to/config/file -server-port 8723
+      ...
+      $ cat input.txt | nc localhost 8723 > results.txt
+
+- `maxlen` --- *200*
+
+  Input sentences longer than this are truncated.
+
+- `feature-function`
+
+  Enables a particular feature function. See the [feature function page](features.html) for more information.
+
+- `oracle-file` --- *NULL*
+
+  The location of a set of oracle reference translations, parallel to the input.  When present,
+  after producing the hypergraph by decoding the input sentence, the oracle is used to rescore the
+  translation forest with a BLEU approximation in order to extract the oracle-translation from the
+  forest.  This is useful for obtaining an (approximation to an) upper bound on your translation
+  model under particular search settings.
+
+- `default-nonterminal` --- *"X"*
+
+   This is the nonterminal symbol assigned to out-of-vocabulary (OOV) items. Joshua assigns this
+   label to every word of the input, in fact, so that even known words can be translated as OOVs, if
+   the model prefers them. Usually, a very low weight on the `OOVPenalty` feature discourages their
+   use unless necessary.
+
+- `goal-symbol` --- *"GOAL"*
+
+   This is the symbol whose presence in the chart over the whole input span denotes a successful
+   parse (translation).  It should match the LHS nonterminal in your glue grammar.  Internally,
+   Joshua represents nonterminals enclosed in square brackets (e.g., "[GOAL]"), which you can
+   optionally supply in the configuration file.
+
+- `true-oovs-only` --- *false*
+
+  By default, Joshua creates an OOV entry for every word in the source sentence, regardless of
+  whether it is found in the grammar.  This allows every word to be pushed through untranslated
+  (although potentially incurring a high cost based on the `OOVPenalty` feature).  If this option is
+  set, then only true OOVs are entered into the chart as OOVs. To determine "true" OOVs, Joshua
+  examines the first level of the grammar trie for each word of the input (this isn't a perfect
+  heuristic, since a word could be present only in deeper levels of the trie).
+
+- `threads`, `num-parallel-decoders` --- *1*
+
+  This determines how many simultaneous decoding threads to launch.  
+
+  Outputs are assembled in order and Joshua has to hold on to the complete target hypergraph until
+  it is ready to be processed for output, so too many simultaneous threads could result in lots of
+  memory usage if a long sentence results in many sentences being queued up.  We have run Joshua
+  with as many as 64 threads without any problems of this kind, but it's useful to keep in the back
+  of your mind.
+  
+- `weights-file` --- NULL
+
+  Weights are appended to the end of the Joshua configuration file, by convention. If you prefer to
+  put them in a separate file, you can do so, and point to the file with this parameter.
+
+### Pruning options <a id="pruning" />
+
+- `pop-limit` --- *100*
+
+  The number of cube-pruning hypotheses that are popped from the candidates list for each span of
+  the input.  Higher values result in a larger portion of the search space being explored at the
+  cost of an increased search time. For exhaustive search, set `pop-limit` to 0.
+
+- `filter-grammar` --- false
+
+  Set to true, this enables dynamic sentence-level filtering. For each sentence, each grammar is
+  filtered at runtime down to rules that can be applied to the sentence under consideration. This
+  takes some time (which we haven't thoroughly quantified), but can result in the removal of many
+  rules that are only partially applicable to the sentence.
+
+- `constrain-parse` --- *false*
+- `use_pos_labels` --- *false*
+
+  *These features are not documented.*
+
+### Translation model options <a id="tm" />
+
+Joshua supports any number of translation models. Conventionally, two are supplied: the main grammar
+containing translation rules, and the glue grammar for patching things together. Internally, Joshua
+doesn't distinguish between the roles of these grammars; they are treated differently only in that
+they typically have different span limits (the maximum input width they can be applied to).
+
+Grammars are instantiated with config file lines of the following form:
+
+    tm = TYPE OWNER SPAN_LIMIT FILE
+
+* `TYPE` is the grammar type, which must be set to "thrax". 
+* `OWNER` is the grammar's owner, which defines the set of [feature weights](features.html) that
+  apply to the weights found in each line of the grammar (using different owners allows each grammar
+  to have different sets and numbers of weights, while sharing owners allows weights to be shared
+  across grammars).
+* `SPAN_LIMIT` is the maximum span of the input that rules from this grammar can be applied to. A
+  span limit of 0 means "no limit", while a span limit of -1 means that rules from this grammar must
+  be anchored to the left side of the sentence (index 0).
+* `FILE` is the path to the file containing the grammar. If the file is a directory, it is assumed
+  to be [packed](packing.html). Only one packed grammar can currently be used at a time.
+
+For reference, the following two translation model lines are used by the [pipeline](pipeline.html):
+
+    tm = thrax pt 20 /path/to/packed/grammar
+    tm = thrax glue -1 /path/to/glue/grammar
+
+### Language model options <a id="lm" />
+
+Joshua supports any number of language models. With Joshua 6.0, these
+are just regular feature functions:
+
+    feature-function = LanguageModel -lm_file /path/to/lm/file -lm_order N -lm_type TYPE
+    feature-function = StateMinimizingLanguageModel -lm_file /path/to/lm/file -lm_order N -lm_type TYPE
+
+`LanguageModel` is a generic language model, supporting types 'kenlm'
+(the default) and 'berkeleylm'. `StateMinimizingLanguageModel`
+implements LM state minimization to reduce the size of context n-grams
+where appropriate
+([Li and Khudanpur, 2008](http://www.aclweb.org/anthology/W08-0402.pdf);
+[Heafield et al., 2013](https://aclweb.org/anthology/N/N13/N13-1116.pdf)). This
+is currently only supported by KenLM, so the `-lm_type` option is not
+available here.
+
+The other key/value pairs are defined as follows:
+
+* `lm_type`: one of "kenlm" or "berkeleylm"
+* `lm_order`: the order of the language model
+* `lm_file`: the path to the language model file.  All language model
+   types support the standard ARPA format.  Additionally, if the LM
+   type is "kenlm", this file can be compiled into KenLM's compiled
+   format (using the program at `$JOSHUA/bin/build_binary`); if the
+   LM type is "berkeleylm", it can be compiled by following the
+   directions in
+   `$JOSHUA/src/joshua/decoder/ff/lm/berkeley_lm/README`. The
+   [pipeline](pipeline.html) will automatically compile either type.
+
+For each language model, you need to specify a feature weight in the following format:
+
+    lm_0 WEIGHT
+    lm_1 WEIGHT
+    ...
+
+where the indices correspond to the order of the language model declaration lines.
+
+### Output options <a id="output" />
+
+- `output-format` *New in 5.0*
+
+  Joshua prints a lot of information to STDERR (making this more granular is on the TODO
+  list). Output to STDOUT is reserved for decoder translations, and is controlled by this
+  parameter, whose value can include the following format directives:
+
+   - `%i`: the sentence number (0-indexed)
+
+   - `%e`: the source sentence
+
+   - `%s`: the translated sentence
+
+   - `%S`: the translated sentence, with some basic capitalization and denormalization, e.g.,
+
+         $ echo "¿ who you lookin' at , mr. ?" | $JOSHUA/bin/decoder -output-format "%S" -mark-oovs false 2> /dev/null 
+         ¿Who you lookin' at, Mr.? 
+
+   - `%t`: the target-side tree projection, all printed on one line (PTB style)
+   
+   - `%d`: the synchronous derivation, with each rule printed indented on its own line
+
+   - `%f`: the list of feature values (as name=value pairs)
+
+   - `%c`: the model cost
+
+   - `%w`: the weight vector (unimplemented)
+
+   - `%a`: the alignments between source and target words (currently broken for hierarchical mode)
+
+  The default value is
+
+      output-format = %i ||| %s ||| %f ||| %c
+      
+  i.e.,
+
+      input ID ||| translation ||| model scores ||| score
+
+- `top-n` --- *300*
+
+  The number of translation hypotheses to output, sorted in decreasing order of model score.
+
+- `use-unique-nbest` --- *true*
+
+  When constructing the n-best list for a sentence, skip hypotheses whose string has already been
+  output.
+
+- `escape-trees` --- *false*
+
+- `include-align-index` --- *false*
+
+  Output the source words indices that each target word aligns to.
+
+- `mark-oovs` --- *false*
+
+  If `true`, this causes the text "_OOV" to be appended to each untranslated word in the output.
+
+- `visualize-hypergraph` --- *false*
+
+  If set to true, a visualization of the hypergraph will be displayed, though you will have to
+  explicitly include the relevant jar files.  See the example usage in
+  `$JOSHUA/examples/tree_visualizer/`, which contains a demonstration of a source sentence,
+  translation, and synchronous derivation.
+
+- `dump-hypergraph` --- ""
+
+  This feature directs that the hypergraph should be written to disk for each input sentence. If
+  set, the value should contain the string "%d", which is replaced with the sentence number. For
+  example,
+  
+      cat input.txt | $JOSHUA/bin/decoder -dump-hypergraph hgs/%d.txt
+
+  Note that the output directory must exist.
+
+  TODO: revive the
+  [discussion on a common hypergraph format](http://aclweb.org/aclwiki/index.php?title=Hypergraph_Format)
+  on the ACL Wiki and support that format.
+
+### Lattice decoding
+
+In addition to regular sentences, Joshua can decode weighted lattices encoded in
+[the PLF format](http://www.statmt.org/moses/?n=Moses.WordLattices), except that path costs should
+be listed as <b>log probabilities</b> instead of probabilities.  Lattice decoding was originally
+added by Lane Schwartz and [Chris Dyer](http://www.cs.cmu.edu/~cdyer/).
+
+Joshua will automatically detect whether the input sentence is a regular sentence (the usual case)
+or a lattice.  If a lattice, a feature will be activated that accumulates the cost of different
+paths through the lattice.  In this case, you need to ensure that a weight for this feature is
+present in [your model file](decoder.html). The [pipeline](pipeline.html) will handle this
+automatically, or if you are doing this manually, you can add the line
+
+    SourcePath COST
+    
+to your Joshua configuration file.    
+
+Lattices must be listed one per line.
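+As a sketch, a one-line lattice in which the first word is either "the" or
+"a" might look like this (each edge is a (label, cost, span) triple, and the
+costs here are illustrative log probabilities):
+
+    ((('the', -0.69, 1), ('a', -0.69, 1),), (('house', 0.0, 1),),)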
+
+### Alternate modes of operation <a id="modes" />
+
+In addition to decoding input sentences in the standard way, Joshua supports both *constrained
+decoding* and *synchronous parsing*. In both settings, both the source and target sides are provided
+as input, and the decoder finds a derivation between them.
+
+#### Constrained decoding
+
+To enable constrained decoding, simply append the desired target string as part of the input, in
+the following format:
+
+    source sentence ||| target sentence
+
+Joshua will translate the source sentence constrained to the target sentence. There are a few
+caveats:
+
+   * Left-state minimization cannot be enabled for the language model
+
+   * A heuristic is used to constrain the derivation (the LM state must match against the
+     input). This is not a perfect heuristic, and sometimes results in analyses that are not
+     perfectly constrained to the input, but have extra words.
+
+#### Synchronous parsing
+
+Joshua supports synchronous parsing as a two-step sequence of monolingual parses, as described in
+Dyer (NAACL 2010) ([PDF](http://www.aclweb.org/anthology/N10-1033.pdf)). To enable this:
+
+   - Set the configuration parameter `parse = true`.
+
+   - Remove all language models from the configuration file
+
+   - Provide input in the following format:
+
+          source sentence ||| target sentence
+
+You may also wish to display the synchronous parse tree (`-output-format %t`) and the alignment
+(`-include-align-index`).
+

http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/6/faq.md
----------------------------------------------------------------------
diff --git a/6/faq.md b/6/faq.md
new file mode 100644
index 0000000..cc06b11
--- /dev/null
+++ b/6/faq.md
@@ -0,0 +1,161 @@
+---
+layout: default6
+category: help
+title: Frequently Asked Questions
+---
+
+Solutions to common problems will be posted here as we become aware of
+them.  If you need help with something, please check
+[our support group](https://groups.google.com/forum/#!forum/joshua_support)
+for a solution, or
+[post a new question](https://groups.google.com/forum/#!newtopic/joshua_support).
+
+### I get a message stating: "no ken in java.library.path"
+
+This occurs when [KenLM](https://kheafield.com/code/kenlm/) failed to
+build. This can occur for a number of reasons:
+   
+- [Boost](http://www.boost.org/) isn't installed. Boost is
+  available through most package management tools, so try that
+  first. You can also build it from source.
+
+- Boost is installed, but not in your path. The easiest solution is
+  to add the boost library directory to your `$LD_LIBRARY_PATH`
+  environment variable. You can also edit the file
+  `$JOSHUA/src/joshua/decoder/ff/lm/kenlm/Makefile` and define
+  `BOOST_ROOT` to point to your boost location. Then rebuild KenLM
+  with the command
+  
+      ant -f $JOSHUA/build.xml kenlm
+
+- You have run into boost's weird naming of multi-threaded
+  libraries. For some reason, boost libraries sometimes have a
+  `-mt` extension applied when they are built with multi-threaded
+  support. This will cause the linker to fail, since it is looking
+  for, e.g., `-lboost_system` instead of `-lboost_system-mt`. Edit
+  the same Makefile as above and uncomment the `BOOST_MT = -mt`
+  line, then try to compile again with
+  
+      ant -f $JOSHUA/build.xml kenlm
+
+You may find the following reference URLs to be useful.
+
+    https://groups.google.com/forum/#!topic/joshua_support/SiGO41tkpsw
+    http://stackoverflow.com/questions/12583080/c-library-in-using-boost-library
+
+
+### How do I make Joshua produce better results?
+
+One way is to add a larger language model. Build one on Gigaword, news
+crawl data, etc. `lmplz` makes it easy to build, and compiling it with
+`build_binary` makes it efficient to represent. To include it in
+Joshua, there are two ways:
+
+- *Pipeline*. By default, Joshua's pipeline builds a language
+   model on the target side of your parallel training data. But
+   Joshua can decode with any number of additional language models
+   as well. So you can build a language model separately,
+   presumably on much more data (since you won't be constrained
+   only to one side of parallel data, which is much more scarce
+   than monolingual data). Once you've built extra language models
+   and compiled them with KenLM's `build_binary` script, you can
+   tell the pipeline to use them with any number of `--lmfile
+   /path/to/lm/file` flags.
+
+- *Joshua* (directly). [This file](file-formats.html) documents the
+  Joshua configuration file format.
+
+### I have already run the pipeline once. How do I run it again, skipping the early stages and just retuning the model?
+
+You would need to do this if, for example, you added a language
+model, or changed some other parameter (e.g., an improvement to the
+decoder). To do this, follow the following steps:
+
+- Re-run the pipeline giving it a new `--rundir N+1` (where `N` is the last
+  run, and `N+1` is a new, non-existent directory). 
+- Give it all the other flags that you gave before, such as the
+  tuning data, testing data, source and target flags, etc. You
+  don't have to give it the training data.
+- Tell it to start at the tuning step with `--first-step TUNE`
+- Tell it where all of your language model files are with `--lmfile
+  /path/to/lm` lines. You also have to tell it where the main
+  language model is, which is usually `--lmfile N/lm.kenlm` (paths
+  are relative to the directory above the run directory).
+- Tell it where the main grammar is, e.g., `--grammar
+  N/grammar.gz`. If the tuning and test data hasn't changed, you
+  can also point it to the filtered and packed versions to save a
+  little time using `--tune-grammar N/data/tune/grammar.packed` and
+  `--test-grammar N/data/test/grammar.packed`, where `N` here again
+  is the previous run (or some other run; it can be anywhere).
+
+Here's an example. Let's say you ran a full pipeline as run 1, and
+now added a new language model and want to see how it affects the
+decoder. Your first run might have been invoked like this:
+
+    $JOSHUA/scripts/training/pipeline.pl \
+      --rundir 1 \
+      --readme "Baseline French--English Europarl hiero system" \
+      --corpus /path/to/europarl \
+      --tune /path/to/europarl/tune \
+      --test /path/to/europarl/test \
+      --source fr \
+      --target en \
+      --threads 8 \
+      --joshua-mem 30g \
+      --tuner mira \
+      --type hiero \
+      --aligner berkeley
+
+Your new run will look like this:
+
+    $JOSHUA/scripts/training/pipeline.pl \
+      --rundir 2 \
+      --readme "Adding in a huge language model" \
+      --tune /path/to/europarl/tune \
+      --test /path/to/europarl/test \
+      --source fr \
+      --target en \
+      --threads 8 \
+      --joshua-mem 30g \
+      --tuner mira \
+      --type hiero \
+      --aligner berkeley \
+      --first-step TUNE \
+      --lmfile 1/lm.kenlm \
+      --lmfile /path/to/huge/new/lm \
+      --tune-grammar 1/data/tune/grammar.packed \
+      --test-grammar 1/data/test/grammar.packed
+
+Notice the changes: we removed the `--corpus` (though it would have
+been fine to have left it, it would have just been skipped),
+specified the first step, changed the run directory and README
+comments, and pointed to the grammars and *both* language model files.
+
+### How can I enable specific feature functions?
+
+Let's say you created a new feature function, `OracleFeature`, and
+you want to enable it. You can do this in two ways. Through the
+pipeline, simply pass it the argument `--joshua-args "list of
+joshua args"`. These will then be passed to the decoder when it is
+invoked. You can then enable your feature function using
+something like
+
+    $JOSHUA/scripts/training/pipeline.pl --joshua-args '-feature-function OracleFeature'
+
+If you call the decoder directly, you can just put that line in
+the configuration file, e.g.,
+
+    feature-function = OracleFeature
+    
+or you can pass it directly to Joshua on the command line using
+the standard notation, e.g.,
+
+    $JOSHUA/bin/joshua-decoder -feature-function OracleFeature
+    
+These could be stacked, e.g.,
+    
+    $JOSHUA/bin/joshua-decoder -feature-function OracleFeature \
+        -feature-function MagicFeature \
+        -feature-function MTSolverFeature \
+        ...

http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/6/features.md
----------------------------------------------------------------------
diff --git a/6/features.md b/6/features.md
new file mode 100644
index 0000000..f9406a9
--- /dev/null
+++ b/6/features.md
@@ -0,0 +1,6 @@
+---
+layout: default6
+title: Features
+---
+
+Joshua 5.0 uses a sparse feature representation to encode features internally.

http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/6/file-formats.md
----------------------------------------------------------------------
diff --git a/6/file-formats.md b/6/file-formats.md
new file mode 100644
index 0000000..dbebe55
--- /dev/null
+++ b/6/file-formats.md
@@ -0,0 +1,72 @@
+---
+layout: default6
+category: advanced
+title: Joshua file formats
+---
+This page describes the formats of Joshua configuration and support files.
+
+## Translation models (grammars)
+
+Joshua supports two grammar file formats: a text-based version (also used by Hiero, shared by
+[cdec](http://www.cdec-decoder.org/), and supported by [hierarchical Moses](http://statmt.org/moses/)), and an efficient
+[packed representation](packing.html) developed by [Juri Ganitkevich](http://cs.jhu.edu/~juri).
+
+Grammar rules follow this format:
+
+    [LHS] ||| SOURCE-SIDE ||| TARGET-SIDE ||| FEATURES
+    
+The source and target sides contain a mixture of terminals and nonterminals. The nonterminals are
+linked across sides by indices. There is no limit to the number of paired nonterminals in the rule
+or on the nonterminal labels (Joshua supports decoding with SAMT and GHKM grammars).
+
+    [X] ||| el chico [X,1] ||| the boy [X,1] ||| -3.14 0 2 17
+    [S] ||| el chico [VP,1] ||| the boy [VP,1] ||| -3.14 0 2 17
+    [VP] ||| [NP,1] [IN,2] [VB,3] ||| [VB,3] [IN,2] [NP,1] ||| 0.0019026637 0.81322956
+
+The feature values can have optional labels, e.g.:
+
+    [X] ||| el chico [X,1] ||| the boy [X,1] ||| lexprob=-3.14 lexicalized=1 numwords=2 count=17
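+
+As an aside, this format is easy to process programmatically. Here is a minimal sketch (a hypothetical helper, not part of the Joshua distribution) of parsing a rule line, handling both labeled and unlabeled feature values:

```python
# Hypothetical helper (not part of Joshua): split a grammar rule into its
# four |||-delimited fields and parse the feature values.
def parse_rule(line):
    lhs, source, target, features = [f.strip() for f in line.split("|||")]
    feats = {}
    for i, tok in enumerate(features.split()):
        if "=" in tok:               # labeled value, e.g. lexprob=-3.14
            name, value = tok.split("=", 1)
            feats[name] = float(value)
        else:                        # unlabeled values are identified by position
            feats[str(i)] = float(tok)
    return lhs, source.split(), target.split(), feats

lhs, src, tgt, feats = parse_rule(
    "[X] ||| el chico [X,1] ||| the boy [X,1] ||| lexprob=-3.14 count=17")
# lhs == "[X]"; feats == {"lexprob": -3.14, "count": 17.0}
```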
+    
+One grammar common to decoding is the glue grammar, which for Hiero grammars is defined as follows:
+
+    [GOAL] ||| <s> ||| <s> ||| 0
+    [GOAL] ||| [GOAL,1] [X,2] ||| [GOAL,1] [X,2] ||| -1
+    [GOAL] ||| [GOAL,1] </s> ||| [GOAL,1] </s> ||| 0
+
+Joshua's [pipeline](pipeline.html) supports extraction of Hiero and SAMT grammars via
+[Thrax](thrax.html) or GHKM grammars using [Michel Galley](http://www-nlp.stanford.edu/~mgalley/)'s
+GHKM extractor (included) or Moses' GHKM extractor (if Moses is installed).
+
+## Language Model
+
+Joshua has two language model implementations: [KenLM](http://kheafield.com/code/kenlm/) and
+[BerkeleyLM](http://berkeleylm.googlecode.com).  All language model implementations support the
+standard ARPA format output by [SRILM](http://www.speech.sri.com/projects/srilm/).  In addition,
+KenLM and BerkeleyLM support compiled formats that can be loaded more quickly and efficiently. KenLM
+is written in C++ and is supported via a JNI bridge, while BerkeleyLM is written in Java. KenLM is
+the default because of its support for left-state minimization.
+
+### Compiling for KenLM
+
+To compile an ARPA language model for KenLM, use the (provided) `build_binary` command, located deep within
+the Joshua source code:
+
+    $JOSHUA/bin/build_binary lm.arpa lm.kenlm
+    
+This script takes the `lm.arpa` file and produces the compiled version in `lm.kenlm`.
+
+### Compiling for BerkeleyLM
+
+To compile a grammar for BerkeleyLM, type:
+
+    java -cp $JOSHUA/lib/berkeleylm.jar -server -mxMEM edu.berkeley.nlp.lm.io.MakeLmBinaryFromArpa lm.arpa lm.berkeleylm
+
+The `lm.berkeleylm` file can then be listed directly in the [Joshua configuration file](decoder.html).
+
+## Joshua configuration file
+
+The [decoder page](decoder.html) documents decoder command-line and config file options.
+
+## Thrax configuration
+
+See [the thrax page](thrax.html) for more information about the Thrax configuration file.

http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/6/index.md
----------------------------------------------------------------------
diff --git a/6/index.md b/6/index.md
new file mode 100644
index 0000000..898464a
--- /dev/null
+++ b/6/index.md
@@ -0,0 +1,24 @@
+---
+layout: default6
+title: Joshua documentation
+---
+
+This page contains end-user oriented documentation for the 6.0 release of
+[the Joshua decoder](http://joshua-decoder.org/).
+
+To navigate the documentation, use the links on the navigation bar to
+the left. For more detail on the decoder itself, including its command-line options, see
+[the Joshua decoder page](decoder.html).  You can also learn more about other steps of
+[the Joshua MT pipeline](pipeline.html), including [grammar extraction](thrax.html) with Thrax and
+Joshua's [efficient grammar representation](packing.html).
+
+A [bundled configuration](bundle.html), which is a minimal set of configuration, resource, and script files, can be created and easily transferred and shared.
+
+## Development
+
+For developer support, please consult [the javadoc documentation](http://cs.jhu.edu/~post/joshua-docs) and the [Joshua developers mailing list](https://groups.google.com/forum/?fromgroups#!forum/joshua_developers).
+
+## Support
+
+If you have problems or issues, you might find some help [on our answers page](faq.html) or
+[in the mailing list archives](https://groups.google.com/forum/?fromgroups#!forum/joshua_support).

http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/6/install.md
----------------------------------------------------------------------
diff --git a/6/install.md b/6/install.md
new file mode 100644
index 0000000..87e0079
--- /dev/null
+++ b/6/install.md
@@ -0,0 +1,88 @@
+---
+layout: default6
+title: Installation
+---
+
+### Download and install
+
+To use Joshua as a standalone decoder (with [language packs](/language-packs/)), you only need to download and install the runtime version of the decoder. 
+If you also wish to build translation models from your own data, you will want to install the full version.
+See the instructions below.
+
+1.  Set up some basic environment variables. 
+    You need to define `$JAVA_HOME`
+
+        export JAVA_HOME=/path/to/java
+
+        # JAVA_HOME is not very standardized. Here are some places to look:
+        # OS X:  export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.7.0_71.jdk/Contents/Home
+        # Linux: export JAVA_HOME=/usr/java/default
+
+1.  If you are installing the full version of Joshua, you also need to define `$HADOOP` to point to your Hadoop installation.
+    (Joshua looks for the Hadoop executable in `$HADOOP/bin/hadoop`.)
+
+        export HADOOP=/usr
+
+    If you don't have a Hadoop installation, [Joshua's pipeline](pipeline.html) can install a standalone version for you.
+    
+1.  To install just the runtime version of Joshua, type
+
+        wget -q http://cs.jhu.edu/~post/files/joshua-runtime-{{ site.data.joshua.release_version }}.tgz
+
+    Then build everything
+
+        tar xzf joshua-runtime-{{ site.data.joshua.release_version }}.tgz
+        cd joshua-runtime-{{ site.data.joshua.release_version }}
+
+        # Add this to your init files
+        export JOSHUA=$(pwd)
+       
+        # build everything
+        ant
+
+1.  To instead install the full version, type
+
+        wget -q http://cs.jhu.edu/~post/files/joshua-{{ site.data.joshua.release_version }}.tgz
+
+        tar xzf joshua-{{ site.data.joshua.release_version }}.tgz
+        cd joshua-{{ site.data.joshua.release_version }}
+
+        # Add this to your init files
+        export JOSHUA=$(pwd)
+       
+        # build everything
+        ant
+
+### Building new models
+
+If you wish to build models for new language pairs from existing data (such as the [WMT data](http://statmt.org/wmt14/)), you need to install some additional dependencies.
+
+1. For learning hierarchical models, Joshua includes a tool called [Thrax](thrax.html), which
+is built on Hadoop. If you have a Hadoop installation, make sure that the environment variable
+`$HADOOP` is set and points to it. If you don't, Joshua will roll one out for you in standalone
+mode. Hadoop is only needed if you plan to build new models with Joshua.
+
+1. You will need to install Moses if either of the following applies to you:
+
+    - You wish to build [phrase-based models](phrase.html) (Joshua 6 includes a phrase-based
+      decoder, but not the tools for building such a model)
+
+    - You are building your own models (phrase- or syntax-based) and wish to use Cherry & Foster's
+[batch MIRA tuner](http://aclweb.org/anthology-new/N/N12/N12-1047v2.pdf) instead of the included
+MERT implementation, [Z-MERT](zmert.html). 
+
+    Follow [the instructions for installing Moses
+here](http://www.statmt.org/moses/?n=Development.GetStarted), and then define the `$MOSES`
+environment variable to point to the root of the Moses installation.
+
+## More information
+
+For more detail on the decoder itself, including its command-line options, see
+[the Joshua decoder page](decoder.html).  You can also learn more about other steps of
+[the Joshua MT pipeline](pipeline.html), including [grammar extraction](thrax.html) with Thrax and
+Joshua's [efficient grammar representation](packing.html).
+
+If you have problems or issues, you might find some help [on our answers page](faq.html) or
+[in the mailing list archives](https://groups.google.com/forum/?fromgroups#!forum/joshua_support).
+
+A [bundled configuration](bundle.html), which is a minimal set of configuration, resource, and script files, can be created and easily transferred and shared.

http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/6/jacana.md
----------------------------------------------------------------------
diff --git a/6/jacana.md b/6/jacana.md
new file mode 100644
index 0000000..71c1753
--- /dev/null
+++ b/6/jacana.md
@@ -0,0 +1,139 @@
+---
+layout: default6
+title: Alignment with Jacana
+---
+
+## Introduction
+
+jacana-xy is a token-based word aligner for machine translation, adapted from the original
+English-English word aligner jacana-align described in the following paper:
+
+    A Lightweight and High Performance Monolingual Word Aligner. Xuchen Yao, Benjamin Van Durme,
+    Chris Callison-Burch and Peter Clark. Proceedings of ACL 2013, short papers.
+
+It currently supports only aligning from French to English with a very limited feature set,
+developed during the one-week hack at the [Eighth MT Marathon 2013](http://statmt.org/mtm13). Please feel free to check
+out the code, read to the bottom of this page, and
+[send the author an email](http://www.cs.jhu.edu/~xuchen/) if you want to add more language pairs to
+it.
+
+## Build
+
+jacana-xy is written in a mixture of Java and Scala. If you build with ant, you have to set the
+environment variables `JAVA_HOME` and `SCALA_HOME`. In my system, I have:
+
+    export JAVA_HOME=/usr/lib/jvm/java-6-sun-1.6.0.26
+    export SCALA_HOME=/home/xuchen/Downloads/scala-2.10.2
+
+Then type:
+
+    ant
+
+`build/lib/jacana-xy.jar` will be built for you.
+
+If you build from Eclipse, first install scala-ide, then import the whole jacana folder as a Scala project. Eclipse should find the .project file and set up the project automatically for you.
+
+## Demo
+
+`scripts-align/runDemoServer.sh` brings up the web demo. Direct your browser to http://localhost:8080/ and you should be able to align some sentences.
+
+Note: To let jacana-xy know where to look for resource files, pass the property `JACANA_HOME` to Java when you run it:
+
+    java -DJACANA_HOME=/path/to/jacana -cp jacana-xy.jar ...
+
+## Browser
+
+You can also browse one or two alignment files (`*.json`) by opening `src/web/AlignmentBrowser.html` in Firefox.
+
+Note 1: Due to strict security settings for accessing local files, Chrome/IE won't work.
+
+Note 2: The input `*.json` files have to be in the same folder as `AlignmentBrowser.html`.
+
+## Align
+
+`scripts-align/alignFile.sh` aligns tab-separated sentence files and writes the output to a .json file that's accepted by the browser:
+
+    java -DJACANA_HOME=../ -jar ../build/lib/jacana-xy.jar -src fr -tgt en -m fr-en.model -a s.txt -o s.json
+
+`scripts-align/alignFile.sh` also takes GIZA++-style input files (one file containing the source sentences, and the other file the target sentences) and outputs one .align file with dashed alignment indices (e.g. "1-2 0-4"):
+
+    java -DJACANA_HOME=../ -jar ../build/lib/jacana-xy.jar -m fr-en.model -src fr -tgt en -a s1.txt -b s2.txt -o s.align
+
+## Training
+
+    java -DJACANA_HOME=../ -jar ../build/lib/jacana-xy.jar -r train.json -d dev.json -t test.json -m /tmp/align.model
+
+The aligner then trains on train.json, reporting F1 values on dev.json every 10 iterations; when the stopping criterion is reached, it tests on test.json.
+
+Every 10 iterations, a model file is saved to (in this example) /tmp/align.model.iter_XX.F1_XX.X. Normally what I do is select the one with the best F1 on dev.json, then run a final test on test.json:
+
+    java -DJACANA_HOME=../ -jar ../build/lib/jacana-xy.jar -t test.json -m /tmp/align.model.iter_XX.F1_XX.X
+
+In this case, since the training data is missing, the aligner assumes it's a test job, reads the model file from the -m option, and tests on test.json.
+
+All the json files are in a format like the following (also accepted by the browser for display):
+
+    [
+        {
+            "id": "0008",
+            "name": "Hansards.french-english.0008",
+            "possibleAlign": "0-0 0-1 0-2",
+            "source": "bravo !",
+            "sureAlign": "1-3",
+            "target": "hear , hear !"
+        },
+        {
+            "id": "0009",
+            "name": "Hansards.french-english.0009",
+            "possibleAlign": "1-1 6-5 7-5 6-6 7-6 13-10 13-11",
+            "source": "monsieur le Orateur , ma question se adresse à le ministre chargé de les transports .",
+            "sureAlign": "0-0 2-1 3-2 4-3 5-4 8-7 9-8 10-9 12-10 14-11 15-12",
+            "target": "Mr. Speaker , my question is directed to the Minister of Transport ."
+        }
+    ]
+
+Here `possibleAlign` is not used.
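+
+As a sketch of how these files can be consumed (assuming Python and the format above; this helper is not part of jacana), the dashed alignment strings convert directly into index pairs:

```python
import json

# Turn a jacana-xy "sureAlign" string like "0-0 2-1" into
# (source_index, target_index) pairs.
def sure_pairs(entry):
    return [tuple(int(k) for k in link.split("-"))
            for link in entry["sureAlign"].split()]

entry = json.loads("""{"id": "0008", "source": "bravo !",
                       "sureAlign": "1-3", "target": "hear , hear !"}""")
for i, j in sure_pairs(entry):
    # source token "!" (index 1) aligns to target token "!" (index 3)
    src, tgt = entry["source"].split()[i], entry["target"].split()[j]
```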
+
+The stopping criterion is to run up to 300 iterations, or until the objective difference between two iterations is less than 0.001, whichever happens first. Currently these values are hard-coded; if you need more flexibility on this, send me an email!
+
+## Support More Languages
+
+To add support for more languages, you need:
+
+- labelled word alignment data (in the download there's already French-English under alignment-data/fr-en; I also have Chinese-English and Arabic-English; let me know if you have more). Usually 100 labelled sentence pairs would be enough.
+- feature functions implemented for this language pair.
+
+To add more features, you need to implement the following interface:
+
+    edu.jhu.jacana.align.feature.AlignFeature
+
+and override the following function:
+
+    addPhraseBasedFeature
+
+For instance, a simple feature that checks whether the two words are translations in Wiktionary for the French-English alignment task has the function implemented as:
+
+    def addPhraseBasedFeature(pair: AlignPair, ins: AlignFeatureVector, i: Int, srcSpan: Int, j: Int, tgtSpan: Int,
+          currState: Int, featureAlphabet: Alphabet) {
+      if (j == -1) {
+      } else {
+        val srcTokens = pair.srcTokens.slice(i, i+srcSpan).mkString(" ")
+        val tgtTokens = pair.tgtTokens.slice(j, j+tgtSpan).mkString(" ")
+
+        if (WiktionaryMultilingual.exists(srcTokens, tgtTokens)) {
+          ins.addFeature("InWiktionary", NONE_STATE, currState, 1.0, srcSpan, featureAlphabet)
+        }
+      }
+    }
+
+This is a more general function that also deals with phrase alignment, but it is suggested that you implement it just for token alignment, as the phrase alignment part is currently very slow to train (60x slower than token alignment).
+
+Some other language-independent and English-only features are implemented under the package `edu.jhu.jacana.align.feature`, for instance:
+
+- `StringSimilarityAlignFeature`: various string similarity measures
+- `PositionalAlignFeature`: features based on relative sentence positions
+- `DistortionAlignFeature`: Markovian (state transition) features
+
+When you add features for more languages, just create a new package like the one for French-English:
+
+    edu.jhu.jacana.align.feature.fr_en
+
+and start coding!
+

http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/6/large-lms.md
----------------------------------------------------------------------
diff --git a/6/large-lms.md b/6/large-lms.md
new file mode 100644
index 0000000..a6792dd
--- /dev/null
+++ b/6/large-lms.md
@@ -0,0 +1,192 @@
+---
+layout: default6
+title: Building large LMs with SRILM
+category: advanced
+---
+
+The following is a tutorial for building a large language model from the
+English Gigaword Fifth Edition corpus
+[LDC2011T07](http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2011T07)
+using SRILM. English text is provided from seven different sources.
+
+### Step 0: Clean up the corpus
+
+The Gigaword corpus has to be stripped of all SGML tags and tokenized.
+Instructions for performing those steps are not included in this
+documentation. A description of this process can be found in a paper
+called ["Annotated
+Gigaword"](https://akbcwekex2012.files.wordpress.com/2012/05/28_paper.pdf).
+
+The Joshua package ships with a script that converts all alphabetical
+characters to their lowercase equivalent. The script is located at
+`$JOSHUA/scripts/lowercase.perl`.
+
+Make a directory structure as follows:
+
+    gigaword/
+    ├── corpus/
+    │   ├── afp_eng/
+    │   │   ├── afp_eng_199405.lc.gz
+    │   │   ├── afp_eng_199406.lc.gz
+    │   │   ├── ...
+    │   │   └── counts/
+    │   ├── apw_eng/
+    │   │   ├── apw_eng_199411.lc.gz
+    │   │   ├── apw_eng_199412.lc.gz
+    │   │   ├── ...
+    │   │   └── counts/
+    │   ├── cna_eng/
+    │   │   ├── ...
+    │   │   └── counts/
+    │   ├── ltw_eng/
+    │   │   ├── ...
+    │   │   └── counts/
+    │   ├── nyt_eng/
+    │   │   ├── ...
+    │   │   └── counts/
+    │   ├── wpb_eng/
+    │   │   ├── ...
+    │   │   └── counts/
+    │   └── xin_eng/
+    │       ├── ...
+    │       └── counts/
+    └── lm/
+        ├── afp_eng/
+        ├── apw_eng/
+        ├── cna_eng/
+        ├── ltw_eng/
+        ├── nyt_eng/
+        ├── wpb_eng/
+        └── xin_eng/
+
+
+The next step will be to build smaller LMs and then interpolate them into one
+file.
+
+### Step 1: Count ngrams
+
+Run the following script once from each source directory under the `corpus/`
+directory (edit it to specify the path to the `ngram-count` binary as well as
+the number of processors):
+
+    #!/bin/sh
+
+    NGRAM_COUNT=$SRILM_SRC/bin/i686-m64/ngram-count
+    args=""
+
+    for source in *.gz; do
+       args=$args"-sort -order 5 -text $source -write counts/$source-counts.gz "
+    done
+
+    echo $args | xargs --max-procs=4 -n 7 $NGRAM_COUNT
+
+(The `-n 7` groups the seven tokens emitted per source file into a single
+`ngram-count` invocation, and `--max-procs=4` runs four invocations in
+parallel.) Then move each `counts/` directory to the corresponding directory under
+`lm/`. Now that each ngram has been counted, we can make a language
+model for each of the seven sources.
+
+### Step 2: Make individual language models
+
+SRILM includes a script, called `make-big-lm`, for building large language
+models under resource-limited environments. The manual for this script can be
+read online
+[here](http://www-speech.sri.com/projects/srilm/manpages/training-scripts.1.html).
+Since the Gigaword corpus is so large, it is convenient to use `make-big-lm`
+even in environments with many parallel processors and a lot of memory.
+
+Initiate the following script from each of the source directories under the
+`lm/` directory (edit it to specify the path to the `make-big-lm` script as
+well as the pruning threshold):
+
+    #!/bin/bash
+    set -x
+
+    CMD=$SRILM_SRC/bin/make-big-lm
+    PRUNE_THRESHOLD=1e-8
+
+    $CMD \
+      -name gigalm `for k in counts/*.gz; do echo " \
+      -read $k "; done` \
+      -lm lm.gz \
+      -max-per-file 100000000 \
+      -order 5 \
+      -kndiscount \
+      -interpolate \
+      -unk \
+      -prune $PRUNE_THRESHOLD
+
+The language model attributes chosen are the following:
+
+* N-grams up to order 5
+* Kneser-Ney smoothing
+* N-gram probability estimates at the specified order *n* are interpolated with
+  lower-order estimates
+* include the unknown-word token as a regular word
+* pruning N-grams based on the specified threshold
+
+Next, we will mix the models together into a single file.
+
+### Step 3: Mix models together
+
+Using development text, interpolation weights can be determined that give the highest
+weight to the source language models with the lowest perplexity on the
+specified development set.
+
+#### Step 3-1: Determine interpolation weights
+
+Initiate the following script from the `lm/` directory (edit it to specify the
+path to the `ngram` binary as well as the path to the development text file):
+
+    #!/bin/bash
+    set -x
+
+    NGRAM=$SRILM_SRC/bin/i686-m64/ngram
+    DEV_TEXT=~mpost/expts/wmt12/runs/es-en/data/tune/tune.tok.lc.es
+
+    dirs=( afp_eng apw_eng cna_eng ltw_eng nyt_eng wpb_eng xin_eng )
+
+    for d in ${dirs[@]} ; do
+      $NGRAM -debug 2 -order 5 -unk -lm $d/lm.gz -ppl $DEV_TEXT > $d/lm.ppl ;
+    done
+
+    compute-best-mix */lm.ppl > best-mix.ppl
+
+Take a look at the contents of `best-mix.ppl`. It will contain a sequence of
+values in parentheses. These are the interpolation weights of the source
+language models in the order specified. Copy and paste the values within the
+parentheses into the script below.
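+
+If you prefer to extract the weights programmatically rather than copy them by hand, here is a small sketch (it assumes only the parenthesized list of floats described above; the exact surrounding text varies between SRILM versions):

```python
import re

# Pull the parenthesized interpolation weights out of compute-best-mix output.
def read_lambdas(text):
    values = re.search(r"\(([^)]*)\)", text).group(1)
    return [float(w) for w in values.split()]

lambdas = read_lambdas("best lambda (0.25 0.35 0.40)")
# The weights form a probability distribution over the component models,
# so they should sum to (approximately) one.
```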
+
+#### Step 3-2: Combine the models
+
+Initiate the following script from the `lm/` directory (edit it to specify the
+path to the `ngram` binary as well as the interpolation weights):
+
+    #!/bin/bash
+    set -x
+
+    NGRAM=$SRILM_SRC/bin/i686-m64/ngram
+    DIRS=(   afp_eng    apw_eng     cna_eng  ltw_eng   nyt_eng  wpb_eng  xin_eng )
+    LAMBDAS=(0.00631272 0.000647602 0.251555 0.0134726 0.348953 0.371566 0.00749238)
+
+    $NGRAM -order 5 -unk \
+      -lm      ${DIRS[0]}/lm.gz     -lambda  ${LAMBDAS[0]} \
+      -mix-lm  ${DIRS[1]}/lm.gz \
+      -mix-lm2 ${DIRS[2]}/lm.gz -mix-lambda2 ${LAMBDAS[2]} \
+      -mix-lm3 ${DIRS[3]}/lm.gz -mix-lambda3 ${LAMBDAS[3]} \
+      -mix-lm4 ${DIRS[4]}/lm.gz -mix-lambda4 ${LAMBDAS[4]} \
+      -mix-lm5 ${DIRS[5]}/lm.gz -mix-lambda5 ${LAMBDAS[5]} \
+      -mix-lm6 ${DIRS[6]}/lm.gz -mix-lambda6 ${LAMBDAS[6]} \
+      -write-lm mixed_lm.gz
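+
+Conceptually, what this mixing step computes is a simple linear interpolation: the probability each component model assigns to an n-gram is weighted by its lambda and the results are summed. A toy illustration (the probability values below are invented for illustration only):

```python
# Toy illustration of linear interpolation:
#   P_mix(w | h) = sum_i lambda_i * P_i(w | h)
# The probability values here are made up for illustration.
def mix(probs, lambdas):
    assert abs(sum(lambdas) - 1.0) < 1e-6   # weights must sum to one
    return sum(l * p for l, p in zip(lambdas, probs))

p = mix([0.02, 0.05, 0.01], [0.5, 0.3, 0.2])
# 0.5*0.02 + 0.3*0.05 + 0.2*0.01 = 0.027
```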
+
+The resulting file, `mixed_lm.gz`, is a language model based on all the text in
+the Gigaword corpus, with some probabilities biased toward the development text
+specified in step 3-1. It is in the ARPA format. The optional next step converts
+it into KenLM format.
+
+#### Step 3-3: Convert to KenLM
+
+The KenLM format has some speed advantages over the ARPA format. Issuing the
+following command will write a new language model file, `mixed_lm.kenlm`, that
+is the `mixed_lm.gz` language model transformed into the KenLM format.
+
+    $JOSHUA/src/joshua/decoder/ff/lm/kenlm/build_binary mixed_lm.gz mixed_lm.kenlm
+

http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/6/packing.md
----------------------------------------------------------------------
diff --git a/6/packing.md b/6/packing.md
new file mode 100644
index 0000000..8d84004
--- /dev/null
+++ b/6/packing.md
@@ -0,0 +1,74 @@
+---
+layout: default6
+category: advanced
+title: Grammar Packing
+---
+
+Grammar packing refers to the process of taking a textual grammar
+output by [Thrax](thrax.html) (or Moses, for phrase-based models) and
+efficiently encoding it so that it can be loaded
+[very quickly](https://aclweb.org/anthology/W/W12/W12-3134.pdf) ---
+packing the grammar results in significantly faster load times for
+very large grammars.  Packing is done automatically by the
+[Joshua pipeline](pipeline.html), but you can also run the packer
+manually.
+
+The script can be found at
+`$JOSHUA/scripts/support/grammar-packer.pl`. See that script for
+example usage. You can then add it to a Joshua config file, simply
+replacing a `tm` path to the compressed text-file format with a path
+to the packed grammar directory (Joshua will automatically detect that
+it is packed, since a packed grammar is a directory).
+
+Packing the grammar requires first sorting it by the rules' source sides,
+which can take quite a bit of temporary space.
+
+*CAVEAT*: You may run into problems packing very very large Hiero
+ grammars. Email the support list if you do.
+
+### Examples
+
+A Hiero grammar, using the compressed text file version:
+
+    tm = hiero -owner pt -maxspan 20 -path grammar.filtered.gz
+
+Pack it:
+
+    $JOSHUA/scripts/support/grammar-packer.pl grammar.filtered.gz grammar.packed
+
+Pack a really big grammar:
+
+    $JOSHUA/scripts/support/grammar-packer.pl -m 30g grammar.filtered.gz grammar.packed
+
+Be a little more verbose:
+
+    $JOSHUA/scripts/support/grammar-packer.pl -v -m 30g grammar.filtered.gz grammar.packed
+
+You have a different temp file location:
+
+    $JOSHUA/scripts/support/grammar-packer.pl -T /local grammar.filtered.gz grammar.packed
+
+Update the config file line:
+
+    tm = hiero -owner pt -maxspan 20 -path grammar.packed
+
+### Using multiple packed grammars (Joshua 6.0.5)
+
+Packed grammars serialize their vocabularies, which previously prevented the use of multiple
+packed grammars during decoding. As of Joshua 6.0.5, it is possible to use multiple packed grammars during decoding if they share the same serialized vocabulary.
+This is achieved by packing these grammars jointly using a revised packing CLI.
+
+To pack multiple grammars:
+
+    $JOSHUA/scripts/support/grammar-packer.pl grammar1.filtered.gz grammar2.filtered.gz [...] grammar1.packed grammar2.packed [...]
+
+This will produce two packed grammars with the same vocabulary. To use them in the decoder, put this in your `joshua.config`:
+
+    tm = hiero -owner pt -maxspan 20 -path grammar1.packed
+    tm = hiero -owner pt2 -maxspan 20 -path grammar2.packed
+
+Note the different owners.
+If you are trying to load multiple packed grammars that do not have the same
+vocabulary, the decoder will throw a RuntimeException at loading time:
+
+    Exception in thread "main" java.lang.RuntimeException: Trying to load multiple packed grammars with different vocabularies! Have you packed them jointly?