Posted to commits@joshua.apache.org by mj...@apache.org on 2016/04/09 05:10:47 UTC

[38/44] incubator-joshua-site git commit: First attempt

http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/53cc3005/5.0/jacana.html
----------------------------------------------------------------------
diff --git a/5.0/jacana.html b/5.0/jacana.html
new file mode 100644
index 0000000..8e2a5e9
--- /dev/null
+++ b/5.0/jacana.html
@@ -0,0 +1,309 @@
+<!DOCTYPE html>
+<html lang="en">
+  <head>
+    <meta charset="utf-8">
+    <title>Joshua Documentation | Alignment with Jacana</title>
+    <meta name="viewport" content="width=device-width, initial-scale=1.0">
+    <meta name="description" content="">
+    <meta name="author" content="">
+
+    <!-- Le styles -->
+    <link href="/bootstrap/css/bootstrap.css" rel="stylesheet">
+    <style>
+      body {
+        padding-top: 60px; /* 60px to make the container go all the way to the bottom of the topbar */
+      }
+      #download {
+          background-color: green;
+          font-size: 14pt;
+          font-weight: bold;
+          text-align: center;
+          color: white;
+          border-radius: 5px;
+          padding: 4px;
+      }
+
+      #download a:link {
+          color: white;
+      }
+
+      #download a:hover {
+          color: lightgrey;
+      }
+
+      #download a:visited {
+          color: white;
+      }
+
+      a.pdf {
+          font-variant: small-caps;
+          /* font-weight: bold; */
+          font-size: 10pt;
+          color: white;
+          background: brown;
+          padding: 2px;
+      }
+
+      a.bibtex {
+          font-variant: small-caps;
+          /* font-weight: bold; */
+          font-size: 10pt;
+          color: white;
+          background: orange;
+          padding: 2px;
+      }
+
+      img.sponsor {
+        height: 120px;
+        margin: 5px;
+      }
+    </style>
+    <link href="bootstrap/css/bootstrap-responsive.css" rel="stylesheet">
+
+    <!-- HTML5 shim, for IE6-8 support of HTML5 elements -->
+    <!--[if lt IE 9]>
+      <script src="bootstrap/js/html5shiv.js"></script>
+    <![endif]-->
+
+    <!-- Fav and touch icons -->
+    <link rel="apple-touch-icon-precomposed" sizes="144x144" href="bootstrap/ico/apple-touch-icon-144-precomposed.png">
+    <link rel="apple-touch-icon-precomposed" sizes="114x114" href="bootstrap/ico/apple-touch-icon-114-precomposed.png">
+      <link rel="apple-touch-icon-precomposed" sizes="72x72" href="bootstrap/ico/apple-touch-icon-72-precomposed.png">
+                    <link rel="apple-touch-icon-precomposed" href="bootstrap/ico/apple-touch-icon-57-precomposed.png">
+                                   <link rel="shortcut icon" href="bootstrap/ico/favicon.png">
+  </head>
+
+  <body>
+
+    <div class="navbar navbar-inverse navbar-fixed-top">
+      <div class="navbar-inner">
+        <div class="container">
+          <button type="button" class="btn btn-navbar" data-toggle="collapse" data-target=".nav-collapse">
+            <span class="icon-bar"></span>
+            <span class="icon-bar"></span>
+            <span class="icon-bar"></span>
+          </button>
+          <a class="brand" href="/">Joshua</a>
+          <div class="nav-collapse collapse">
+            <ul class="nav">
+              <li><a href="index.html">Documentation</a></li>
+              <li><a href="pipeline.html">Pipeline</a></li>
+              <li><a href="tutorial.html">Tutorial</a></li>
+              <li><a href="decoder.html">Decoder</a></li>
+              <li><a href="thrax.html">Thrax</a></li>
+              <li><a href="file-formats.html">File formats</a></li>
+              <!-- <li><a href="advanced.html">Advanced</a></li> -->
+              <li><a href="faq.html">FAQ</a></li>
+            </ul>
+          </div><!--/.nav-collapse -->
+        </div>
+      </div>
+    </div>
+
+    <div class="container">
+
+      <div class="row">
+        <div class="span2">
+          <img src="/images/joshua-logo-small.png" 
+               alt="Joshua logo (picture of a Joshua tree)" />
+        </div>
+        <div class="span10">
+          <h1>Joshua Documentation</h1>
+          <h2>Alignment with Jacana</h2>
+          <span id="download">
+            <a href="http://cs.jhu.edu/~post/files/joshua-v5.0.tgz">Download</a>
+          </span>
+          &nbsp; (version 5.0, released 16 August 2013)
+        </div>
+      </div>
+      
+      <hr />
+
+      <div class="row">
+        <div class="span8">
+
+          <h2 id="introduction">Introduction</h2>
+
+<p>jacana-xy is a token-based word aligner for machine translation, adapted from the original
+English-English word aligner jacana-align described in the following paper:</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code>A Lightweight and High Performance Monolingual Word Aligner. Xuchen Yao, Benjamin Van Durme,
+Chris Callison-Burch and Peter Clark. Proceedings of ACL 2013, short papers.
+</code></pre>
+</div>
+
+<p>It currently supports only aligning from French to English with a very limited feature set, the
+result of a one-week hack at the <a href="http://statmt.org/mtm13">Eighth MT Marathon 2013</a>. Please feel free to check
+out the code, read to the bottom of this page, and
+<a href="http://www.cs.jhu.edu/~xuchen/">send the author an email</a> if you want to add more language pairs to
+it.</p>
+
+<h2 id="build">Build</h2>
+
+<p>jacana-xy is written in a mixture of Java and Scala. If you build with ant, you have to set the
+environment variables <code class="highlighter-rouge">JAVA_HOME</code> and <code class="highlighter-rouge">SCALA_HOME</code>. On my system, I have:</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code>export JAVA_HOME=/usr/lib/jvm/java-6-sun-1.6.0.26
+export SCALA_HOME=/home/xuchen/Downloads/scala-2.10.2
+</code></pre>
+</div>
+
+<p>Then type:</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code>ant
+</code></pre>
+</div>
+
+<p><code class="highlighter-rouge">build/lib/jacana-xy.jar</code> will be built for you.</p>
+
+<p>If you build in Eclipse, first install the Scala IDE, then import the whole jacana folder as a Scala project. Eclipse should find the .project file and set up the project automatically for you.</p>
+
+<h2 id="demo">Demo</h2>
+
+<p><code class="highlighter-rouge">scripts-align/runDemoServer.sh</code> starts the web demo. Direct your browser to http://localhost:8080/ and you should be able to align some sentences.</p>
+
+<p>Note: to make jacana-xy know where to look for resource files, pass the property <code class="highlighter-rouge">JACANA_HOME</code> to Java when you run it:</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code>java -DJACANA_HOME=/path/to/jacana -cp jacana-xy.jar ......
+</code></pre>
+</div>
+
+<h2 id="browser">Browser</h2>
+
+<p>You can also browse one or two alignment files (*.json) by opening src/web/AlignmentBrowser.html in Firefox:</p>
+
+<p>Note 1: due to strict security settings for accessing local files, Chrome/IE won’t work.</p>
+
+<p>Note 2: the input *.json files have to be in the same folder as AlignmentBrowser.html.</p>
+
+<h2 id="align">Align</h2>
+
+<p><code class="highlighter-rouge">scripts-align/alignFile.sh</code> aligns tab-separated sentence files and writes the output to a .json file that’s accepted by the browser:</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code>java -DJACANA_HOME=../ -jar ../build/lib/jacana-xy.jar -src fr -tgt en -m fr-en.model -a s.txt -o s.json
+</code></pre>
+</div>
+
+<p><code class="highlighter-rouge">scripts-align/alignFile.sh</code> also takes GIZA++-style input files (one file containing the source sentences, and the other file the target sentences) and writes one .align file with dashed alignment indices (e.g. “1-2 0-4”):</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code>java -DJACANA_HOME=../ -jar ../build/lib/jacana-xy.jar -m fr-en.model -src fr -tgt en -a s1.txt -b s2.txt -o s.align
+</code></pre>
+</div>
+
+<h2 id="training">Training</h2>
+
+<div class="highlighter-rouge"><pre class="highlight"><code>java -DJACANA_HOME=../ -jar ../build/lib/jacana-xy.jar -r train.json -d dev.json -t test.json -m /tmp/align.model
+</code></pre>
+</div>
+
+<p>The aligner will then train on train.json and report F1 values on dev.json every 10 iterations; when the stopping criterion has been reached, it will test on test.json.</p>
+
+<p>Every 10 iterations, a model file is saved to (in this example) /tmp/align.model.iter_XX.F1_XX.X. Normally I select the one with the best F1 on dev.json, then run a final test on test.json:</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code>java -DJACANA_HOME=../ -jar ../build/lib/jacana-xy.jar -t test.json -m /tmp/align.model.iter_XX.F1_XX.X
+</code></pre>
+</div>
+
+<p>In this case, since the training data is missing, the aligner assumes it’s a test job, reads the model file from the -m option, and tests on test.json.</p>
+
+<p>All the json files are in a format like the following (also accepted by the browser for display):</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code>[
+    {
+        "id": "0008",
+        "name": "Hansards.french-english.0008",
+        "possibleAlign": "0-0 0-1 0-2",
+        "source": "bravo !",
+        "sureAlign": "1-3",
+        "target": "hear , hear !"
+    },
+    {
+        "id": "0009",
+        "name": "Hansards.french-english.0009",
+        "possibleAlign": "1-1 6-5 7-5 6-6 7-6 13-10 13-11",
+        "source": "monsieur le Orateur , ma question se adresse à le ministre chargé de les transports .",
+        "sureAlign": "0-0 2-1 3-2 4-3 5-4 8-7 9-8 10-9 12-10 14-11 15-12",
+        "target": "Mr. Speaker , my question is directed to the Minister of Transport ."
+    }
+]
+</code></pre>
+</div>
+
+<p>Here <code class="highlighter-rouge">possibleAlign</code> is not used.</p>
+
+<p>The stopping criterion is to run up to 300 iterations, or to stop when the objective difference between two iterations is less than 0.001, whichever happens first. Currently these values are hard-coded. If you need more flexibility here, send me an email!</p>
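+
+<p>In code, the hard-coded rule amounts to something like the following (a minimal Scala sketch with a hypothetical <code class="highlighter-rouge">trainOneIteration</code> helper; jacana’s actual implementation may differ):</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code>// Sketch of the stopping rule described above, not jacana's actual code.
+val maxIterations = 300
+val tolerance = 0.001
+
+var iter = 0
+var prevObjective = Double.MaxValue
+var converged = false
+while (iter &lt; maxIterations &amp;&amp; !converged) {
+  val objective = trainOneIteration()  // hypothetical: one pass over train.json
+  converged = math.abs(prevObjective - objective) &lt; tolerance
+  prevObjective = objective
+  iter += 1
+}
+</code></pre>
+</div>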
+
+<h2 id="support-more-languages">Support More Languages</h2>
+
+<p>To add support for more languages, you need to:</p>
+
+<ul>
+  <li>obtain labelled word alignments (in the download there’s already French-English under alignment-data/fr-en; I also have Chinese-English and Arabic-English; let me know if you have more). Usually 100 labelled sentence pairs would be enough.</li>
+  <li>implement some feature functions for this language pair.</li>
+</ul>
+
+<p>To add more features, you need to implement the following interface:</p>
+
+<p><code class="highlighter-rouge">edu.jhu.jacana.align.feature.AlignFeature</code></p>
+
+<p>and override the following function:</p>
+
+<p><code class="highlighter-rouge">addPhraseBasedFeature</code></p>
+
+<p>For instance, a simple feature for the French-English alignment task that checks whether the two words are translations of each other in Wiktionary is implemented as:</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code>def addPhraseBasedFeature(pair: AlignPair, ins: AlignFeatureVector, i: Int, srcSpan: Int, j: Int, tgtSpan: Int,
+    currState: Int, featureAlphabet: Alphabet) {
+  if (j == -1) {
+    // no feature for NULL alignments
+  } else {
+    val srcTokens = pair.srcTokens.slice(i, i+srcSpan).mkString(" ")
+    val tgtTokens = pair.tgtTokens.slice(j, j+tgtSpan).mkString(" ")
+
+    if (WiktionaryMultilingual.exists(srcTokens, tgtTokens)) {
+      ins.addFeature("InWiktionary", NONE_STATE, currState, 1.0, srcSpan, featureAlphabet)
+    }
+  }
+}
+</code></pre>
+</div>
+
+<p>This is a more general function that also deals with phrase alignment, but it is suggested to implement it just for token alignment, as the phrase alignment part is currently very slow to train (60x slower than token alignment).</p>
+
+<p>Some other language-independent and English-only features are implemented under the package <code class="highlighter-rouge">edu.jhu.jacana.align.feature</code>, for instance:</p>
+
+<ul>
+  <li><code class="highlighter-rouge">StringSimilarityAlignFeature</code>: various string similarity measures</li>
+  <li><code class="highlighter-rouge">PositionalAlignFeature</code>: features based on relative sentence positions</li>
+  <li><code class="highlighter-rouge">DistortionAlignFeature</code>: Markovian (state transition) features</li>
+</ul>
+
+<p>When you add features for more languages, just create a new package like the one for French-English:</p>
+
+<p><code class="highlighter-rouge">edu.jhu.jacana.align.feature.fr_en</code></p>
+
+<p>and start coding!</p>
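+
+<p>As a concrete starting point, a new feature class might look like the following. This is a hypothetical sketch: the de_en package, the object name, and the import paths are made up for illustration; only the method signature follows the example above.</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code>package edu.jhu.jacana.align.feature.de_en
+
+// Hypothetical imports; the actual class locations follow the jacana source tree.
+import edu.jhu.jacana.align.{Alphabet, AlignPair, AlignFeatureVector}
+import edu.jhu.jacana.align.feature.AlignFeature
+
+// A trivial illustrative feature: fires when the source and target
+// phrases are string-identical (useful for names and numbers).
+object IdentityAlignFeature extends AlignFeature {
+  override def addPhraseBasedFeature(pair: AlignPair, ins: AlignFeatureVector, i: Int, srcSpan: Int,
+      j: Int, tgtSpan: Int, currState: Int, featureAlphabet: Alphabet) {
+    if (j != -1) {
+      val src = pair.srcTokens.slice(i, i + srcSpan).mkString(" ")
+      val tgt = pair.tgtTokens.slice(j, j + tgtSpan).mkString(" ")
+      if (src.equalsIgnoreCase(tgt)) {
+        // NONE_STATE as used in the Wiktionary example above
+        ins.addFeature("Identity", NONE_STATE, currState, 1.0, srcSpan, featureAlphabet)
+      }
+    }
+  }
+}
+</code></pre>
+</div>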
+
+
+
+        </div>
+      </div>
+    </div> <!-- /container -->
+
+    <!-- Le javascript
+    ================================================== -->
+    <!-- Placed at the end of the document so the pages load faster -->
+    <script src="bootstrap/js/jquery.js"></script>
+    <script src="bootstrap/js/bootstrap-transition.js"></script>
+    <script src="bootstrap/js/bootstrap-alert.js"></script>
+    <script src="bootstrap/js/bootstrap-modal.js"></script>
+    <script src="bootstrap/js/bootstrap-dropdown.js"></script>
+    <script src="bootstrap/js/bootstrap-scrollspy.js"></script>
+    <script src="bootstrap/js/bootstrap-tab.js"></script>
+    <script src="bootstrap/js/bootstrap-tooltip.js"></script>
+    <script src="bootstrap/js/bootstrap-popover.js"></script>
+    <script src="bootstrap/js/bootstrap-button.js"></script>
+    <script src="bootstrap/js/bootstrap-collapse.js"></script>
+    <script src="bootstrap/js/bootstrap-carousel.js"></script>
+    <script src="bootstrap/js/bootstrap-typeahead.js"></script>
+
+    <!-- Start of StatCounter Code for Default Guide -->
+    <script type="text/javascript">
+      var sc_project=8264132; 
+      var sc_invisible=1; 
+      var sc_security="4b97fe2d"; 
+    </script>
+    <script type="text/javascript" src="http://www.statcounter.com/counter/counter.js"></script>
+    <noscript>
+      <div class="statcounter">
+        <a title="hit counter joomla" 
+           href="http://statcounter.com/joomla/"
+           target="_blank">
+          <img class="statcounter"
+               src="http://c.statcounter.com/8264132/0/4b97fe2d/1/"
+               alt="hit counter joomla" />
+        </a>
+      </div>
+    </noscript>
+    <!-- End of StatCounter Code for Default Guide -->
+
+  </body>
+</html>

http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/53cc3005/5.0/jacana.md
----------------------------------------------------------------------
diff --git a/5.0/jacana.md b/5.0/jacana.md
deleted file mode 100644
index 613a862..0000000
--- a/5.0/jacana.md
+++ /dev/null
@@ -1,139 +0,0 @@
----
-layout: default
-title: Alignment with Jacana
----
-
-## Introduction
-
-jacana-xy is a token-based word aligner for machine translation, adapted from the original
-English-English word aligner jacana-align described in the following paper:
-
-    A Lightweight and High Performance Monolingual Word Aligner. Xuchen Yao, Benjamin Van Durme,
-    Chris Callison-Burch and Peter Clark. Proceedings of ACL 2013, short papers.
-
-It currently supports only aligning from French to English with a very limited feature set, the
-result of a one-week hack at the [Eighth MT Marathon 2013](http://statmt.org/mtm13). Please feel free to check
-out the code, read to the bottom of this page, and
-[send the author an email](http://www.cs.jhu.edu/~xuchen/) if you want to add more language pairs to
-it.
-
-## Build
-
-jacana-xy is written in a mixture of Java and Scala. If you build with ant, you have to set the
-environment variables `JAVA_HOME` and `SCALA_HOME`. On my system, I have:
-
-    export JAVA_HOME=/usr/lib/jvm/java-6-sun-1.6.0.26
-    export SCALA_HOME=/home/xuchen/Downloads/scala-2.10.2
-
-Then type:
-
-    ant
-
-`build/lib/jacana-xy.jar` will be built for you.
-
-If you build in Eclipse, first install the Scala IDE, then import the whole jacana folder as a Scala project. Eclipse should find the .project file and set up the project automatically for you.
-
-## Demo
-
-`scripts-align/runDemoServer.sh` starts the web demo. Direct your browser to http://localhost:8080/ and you should be able to align some sentences.
-
-Note: to make jacana-xy know where to look for resource files, pass the property `JACANA_HOME` to Java when you run it:
-
-    java -DJACANA_HOME=/path/to/jacana -cp jacana-xy.jar ......
-
-## Browser
-
-You can also browse one or two alignment files (*.json) by opening src/web/AlignmentBrowser.html in Firefox:
-
-Note 1: due to strict security settings for accessing local files, Chrome/IE won't work.
-
-Note 2: the input *.json files have to be in the same folder as AlignmentBrowser.html.
-
-## Align
-
-`scripts-align/alignFile.sh` aligns tab-separated sentence files and writes the output to a .json file that's accepted by the browser:
-
-    java -DJACANA_HOME=../ -jar ../build/lib/jacana-xy.jar -src fr -tgt en -m fr-en.model -a s.txt -o s.json
-
-`scripts-align/alignFile.sh` also takes GIZA++-style input files (one file containing the source sentences, and the other file the target sentences) and writes one .align file with dashed alignment indices (e.g. "1-2 0-4"):
-
-    java -DJACANA_HOME=../ -jar ../build/lib/jacana-xy.jar -m fr-en.model -src fr -tgt en -a s1.txt -b s2.txt -o s.align
-
-## Training
-
-    java -DJACANA_HOME=../ -jar ../build/lib/jacana-xy.jar -r train.json -d dev.json -t test.json -m /tmp/align.model
-
-The aligner will then train on train.json and report F1 values on dev.json every 10 iterations; when the stopping criterion has been reached, it will test on test.json.
-
-Every 10 iterations, a model file is saved to (in this example) /tmp/align.model.iter_XX.F1_XX.X. Normally I select the one with the best F1 on dev.json, then run a final test on test.json:
-
-    java -DJACANA_HOME=../ -jar ../build/lib/jacana-xy.jar -t test.json -m /tmp/align.model.iter_XX.F1_XX.X
-
-In this case, since the training data is missing, the aligner assumes it's a test job, reads the model file from the -m option, and tests on test.json.
-
-All the json files are in a format like the following (also accepted by the browser for display):
-
-    [
-        {
-            "id": "0008",
-            "name": "Hansards.french-english.0008",
-            "possibleAlign": "0-0 0-1 0-2",
-            "source": "bravo !",
-            "sureAlign": "1-3",
-            "target": "hear , hear !"
-        },
-        {
-            "id": "0009",
-            "name": "Hansards.french-english.0009",
-            "possibleAlign": "1-1 6-5 7-5 6-6 7-6 13-10 13-11",
-            "source": "monsieur le Orateur , ma question se adresse à le ministre chargé de les transports .",
-            "sureAlign": "0-0 2-1 3-2 4-3 5-4 8-7 9-8 10-9 12-10 14-11 15-12",
-            "target": "Mr. Speaker , my question is directed to the Minister of Transport ."
-        }
-    ]
-
-Here `possibleAlign` is not used.
-
-The stopping criterion is to run up to 300 iterations, or to stop when the objective difference between two iterations is less than 0.001, whichever happens first. Currently these values are hard-coded. If you need more flexibility here, send me an email!
-
-## Support More Languages
-
-To add support for more languages, you need to:
-
-* obtain labelled word alignments (in the download there's already French-English under alignment-data/fr-en; I also have Chinese-English and Arabic-English; let me know if you have more). Usually 100 labelled sentence pairs would be enough.
-* implement some feature functions for this language pair.
-
-To add more features, you need to implement the following interface:
-
-    edu.jhu.jacana.align.feature.AlignFeature
-
-and override the following function:
-
-    addPhraseBasedFeature
-
-For instance, a simple feature for the French-English alignment task that checks whether the two words are translations of each other in Wiktionary is implemented as:
-
-    def addPhraseBasedFeature(pair: AlignPair, ins: AlignFeatureVector, i: Int, srcSpan: Int, j: Int, tgtSpan: Int,
-        currState: Int, featureAlphabet: Alphabet) {
-      if (j == -1) {
-        // no feature for NULL alignments
-      } else {
-        val srcTokens = pair.srcTokens.slice(i, i+srcSpan).mkString(" ")
-        val tgtTokens = pair.tgtTokens.slice(j, j+tgtSpan).mkString(" ")
-
-        if (WiktionaryMultilingual.exists(srcTokens, tgtTokens)) {
-          ins.addFeature("InWiktionary", NONE_STATE, currState, 1.0, srcSpan, featureAlphabet)
-        }
-      }
-    }
-
-This is a more general function that also deals with phrase alignment, but it is suggested to implement it just for token alignment, as the phrase alignment part is currently very slow to train (60x slower than token alignment).
-
-Some other language-independent and English-only features are implemented under the package `edu.jhu.jacana.align.feature`, for instance:
-
-* `StringSimilarityAlignFeature`: various string similarity measures
-* `PositionalAlignFeature`: features based on relative sentence positions
-* `DistortionAlignFeature`: Markovian (state transition) features
-
-When you add features for more languages, just create a new package like the one for French-English:
-
-    edu.jhu.jacana.align.feature.fr_en
-
-and start coding!
-

http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/53cc3005/5.0/large-lms.html
----------------------------------------------------------------------
diff --git a/5.0/large-lms.html b/5.0/large-lms.html
new file mode 100644
index 0000000..34b7dba
--- /dev/null
+++ b/5.0/large-lms.html
@@ -0,0 +1,368 @@
+<!DOCTYPE html>
+<html lang="en">
+  <head>
+    <meta charset="utf-8">
+    <title>Joshua Documentation | Building large LMs with SRILM</title>
+    <meta name="viewport" content="width=device-width, initial-scale=1.0">
+    <meta name="description" content="">
+    <meta name="author" content="">
+
+    <!-- Le styles -->
+    <link href="/bootstrap/css/bootstrap.css" rel="stylesheet">
+    <style>
+      body {
+        padding-top: 60px; /* 60px to make the container go all the way to the bottom of the topbar */
+      }
+      #download {
+          background-color: green;
+          font-size: 14pt;
+          font-weight: bold;
+          text-align: center;
+          color: white;
+          border-radius: 5px;
+          padding: 4px;
+      }
+
+      #download a:link {
+          color: white;
+      }
+
+      #download a:hover {
+          color: lightgrey;
+      }
+
+      #download a:visited {
+          color: white;
+      }
+
+      a.pdf {
+          font-variant: small-caps;
+          /* font-weight: bold; */
+          font-size: 10pt;
+          color: white;
+          background: brown;
+          padding: 2px;
+      }
+
+      a.bibtex {
+          font-variant: small-caps;
+          /* font-weight: bold; */
+          font-size: 10pt;
+          color: white;
+          background: orange;
+          padding: 2px;
+      }
+
+      img.sponsor {
+        height: 120px;
+        margin: 5px;
+      }
+    </style>
+    <link href="bootstrap/css/bootstrap-responsive.css" rel="stylesheet">
+
+    <!-- HTML5 shim, for IE6-8 support of HTML5 elements -->
+    <!--[if lt IE 9]>
+      <script src="bootstrap/js/html5shiv.js"></script>
+    <![endif]-->
+
+    <!-- Fav and touch icons -->
+    <link rel="apple-touch-icon-precomposed" sizes="144x144" href="bootstrap/ico/apple-touch-icon-144-precomposed.png">
+    <link rel="apple-touch-icon-precomposed" sizes="114x114" href="bootstrap/ico/apple-touch-icon-114-precomposed.png">
+      <link rel="apple-touch-icon-precomposed" sizes="72x72" href="bootstrap/ico/apple-touch-icon-72-precomposed.png">
+                    <link rel="apple-touch-icon-precomposed" href="bootstrap/ico/apple-touch-icon-57-precomposed.png">
+                                   <link rel="shortcut icon" href="bootstrap/ico/favicon.png">
+  </head>
+
+  <body>
+
+    <div class="navbar navbar-inverse navbar-fixed-top">
+      <div class="navbar-inner">
+        <div class="container">
+          <button type="button" class="btn btn-navbar" data-toggle="collapse" data-target=".nav-collapse">
+            <span class="icon-bar"></span>
+            <span class="icon-bar"></span>
+            <span class="icon-bar"></span>
+          </button>
+          <a class="brand" href="/">Joshua</a>
+          <div class="nav-collapse collapse">
+            <ul class="nav">
+              <li><a href="index.html">Documentation</a></li>
+              <li><a href="pipeline.html">Pipeline</a></li>
+              <li><a href="tutorial.html">Tutorial</a></li>
+              <li><a href="decoder.html">Decoder</a></li>
+              <li><a href="thrax.html">Thrax</a></li>
+              <li><a href="file-formats.html">File formats</a></li>
+              <!-- <li><a href="advanced.html">Advanced</a></li> -->
+              <li><a href="faq.html">FAQ</a></li>
+            </ul>
+          </div><!--/.nav-collapse -->
+        </div>
+      </div>
+    </div>
+
+    <div class="container">
+
+      <div class="row">
+        <div class="span2">
+          <img src="/images/joshua-logo-small.png" 
+               alt="Joshua logo (picture of a Joshua tree)" />
+        </div>
+        <div class="span10">
+          <h1>Joshua Documentation</h1>
+          <h2>Building large LMs with SRILM</h2>
+          <span id="download">
+            <a href="http://cs.jhu.edu/~post/files/joshua-v5.0.tgz">Download</a>
+          </span>
+          &nbsp; (version 5.0, released 16 August 2013)
+        </div>
+      </div>
+      
+      <hr />
+
+      <div class="row">
+        <div class="span8">
+
+          <p>The following is a tutorial for building a large language model from the
+English Gigaword Fifth Edition corpus
+<a href="http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2011T07">LDC2011T07</a>
+using SRILM. English text is provided from seven different sources.</p>
+
+<h3 id="step-0-clean-up-the-corpus">Step 0: Clean up the corpus</h3>
+
+<p>The Gigaword corpus has to be stripped of all SGML tags and tokenized.
+Instructions for performing those steps are not included in this
+documentation. A description of this process can be found in a paper
+called <a href="https://akbcwekex2012.files.wordpress.com/2012/05/28_paper.pdf">“Annotated
+Gigaword”</a>.</p>
+
+<p>The Joshua package ships with a script that converts all alphabetical
+characters to their lowercase equivalent. The script is located at
+<code class="highlighter-rouge">$JOSHUA/scripts/lowercase.perl</code>.</p>
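+
+<p>For example, assuming a tokenized, SGML-stripped file named <code class="highlighter-rouge">afp_eng_199405.gz</code>, one might produce the lowercased version used below with something like:</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code>zcat afp_eng_199405.gz | $JOSHUA/scripts/lowercase.perl | gzip &gt; afp_eng_199405.lc.gz
+</code></pre>
+</div>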
+
+<p>Make a directory structure as follows:</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code>gigaword/
+├── corpus/
+│   ├── afp_eng/
+│   │   ├── afp_eng_199405.lc.gz
+│   │   ├── afp_eng_199406.lc.gz
+│   │   ├── ...
+│   │   └── counts/
+│   ├── apw_eng/
+│   │   ├── apw_eng_199411.lc.gz
+│   │   ├── apw_eng_199412.lc.gz
+│   │   ├── ...
+│   │   └── counts/
+│   ├── cna_eng/
+│   │   ├── ...
+│   │   └── counts/
+│   ├── ltw_eng/
+│   │   ├── ...
+│   │   └── counts/
+│   ├── nyt_eng/
+│   │   ├── ...
+│   │   └── counts/
+│   ├── wpb_eng/
+│   │   ├── ...
+│   │   └── counts/
+│   └── xin_eng/
+│       ├── ...
+│       └── counts/
+└── lm/
+    ├── afp_eng/
+    ├── apw_eng/
+    ├── cna_eng/
+    ├── ltw_eng/
+    ├── nyt_eng/
+    ├── wpb_eng/
+    └── xin_eng/
+</code></pre>
+</div>
+
+<p>The next step will be to build smaller LMs and then interpolate them into one
+file.</p>
+
+<h3 id="step-1-count-ngrams">Step 1: Count ngrams</h3>
+
+<p>Run the following script once from each source directory under the <code class="highlighter-rouge">corpus/</code>
+directory (edit it to specify the path to the <code class="highlighter-rouge">ngram-count</code> binary as well as
+the number of processors):</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code><span class="c">#!/bin/sh</span>
+
+<span class="nv">NGRAM_COUNT</span><span class="o">=</span><span class="nv">$SRILM_SRC</span>/bin/i686-m64/ngram-count
+<span class="nv">args</span><span class="o">=</span><span class="s2">""</span>
+
+<span class="k">for </span><span class="nb">source </span><span class="k">in</span> <span class="k">*</span>.gz; <span class="k">do
+   </span><span class="nv">args</span><span class="o">=</span><span class="nv">$args</span><span class="s2">"-sort -order 5 -text </span><span class="nv">$source</span><span class="s2"> -write counts/</span><span class="nv">$source</span><span class="s2">-counts.gz "</span>
+<span class="k">done
+
+</span><span class="nb">echo</span> <span class="nv">$args</span> | xargs --max-procs<span class="o">=</span>4 -n 7 <span class="nv">$NGRAM_COUNT</span>
+</code></pre>
+</div>
+
+<p>Then move each <code class="highlighter-rouge">counts/</code> directory to the corresponding directory under
+<code class="highlighter-rouge">lm/</code>. Now that each ngram has been counted, we can make a language
+model for each of the seven sources.</p>
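+
+<p>One way to do the move, run from the <code class="highlighter-rouge">gigaword/</code> directory (a sketch, assuming the layout above):</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code>for d in afp_eng apw_eng cna_eng ltw_eng nyt_eng wpb_eng xin_eng; do
+  mv corpus/$d/counts lm/$d/
+done
+</code></pre>
+</div>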
+
+<h3 id="step-2-make-individual-language-models">Step 2: Make individual language models</h3>
+
+<p>SRILM includes a script, called <code class="highlighter-rouge">make-big-lm</code>, for building large language
+models under resource-limited environments. The manual for this script can be
+read online
+<a href="http://www-speech.sri.com/projects/srilm/manpages/training-scripts.1.html">here</a>.
+Since the Gigaword corpus is so large, it is convenient to use <code class="highlighter-rouge">make-big-lm</code>
+even in environments with many parallel processors and a lot of memory.</p>
+
+<p>Initiate the following script from each of the source directories under the
+<code class="highlighter-rouge">lm/</code> directory (edit it to specify the path to the <code class="highlighter-rouge">make-big-lm</code> script as
+well as the pruning threshold):</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code><span class="c">#!/bin/bash</span>
+<span class="nb">set</span> -x
+
+<span class="nv">CMD</span><span class="o">=</span><span class="nv">$SRILM_SRC</span>/bin/make-big-lm
+<span class="nv">PRUNE_THRESHOLD</span><span class="o">=</span>1e-8
+
+<span class="nv">$CMD</span> <span class="se">\</span>
+  -name gigalm <span class="sb">`</span><span class="k">for </span>k <span class="k">in </span>counts/<span class="k">*</span>.gz; <span class="k">do </span><span class="nb">echo</span> <span class="s2">" </span><span class="se">\</span><span class="s2">
+  -read </span><span class="nv">$k</span><span class="s2"> "</span>; <span class="k">done</span><span class="sb">`</span> <span class="se">\</span>
+  -lm lm.gz <span class="se">\</span>
+  -max-per-file 100000000 <span class="se">\</span>
+  -order 5 <span class="se">\</span>
+  -kndiscount <span class="se">\</span>
+  -interpolate <span class="se">\</span>
+  -unk <span class="se">\</span>
+  -prune <span class="nv">$PRUNE_THRESHOLD</span>
+</code></pre>
+</div>
+
+<p>The language model attributes chosen are the following:</p>
+
+<ul>
+  <li>N-grams up to order 5</li>
+  <li>Kneser-Ney smoothing</li>
+  <li>N-gram probability estimates at the specified order <em>n</em> are interpolated with
+lower-order estimates</li>
+  <li>include the unknown-word token as a regular word</li>
+  <li>pruning N-grams based on the specified threshold</li>
+</ul>
+
+<p>Next, we will mix the models together into a single file.</p>
+
+<h3 id="step-3-mix-models-together">Step 3: Mix models together</h3>
+
+<p>Using development text, interpolation weights can be determined that give the highest
+weight to the source language models that have the lowest perplexity on the
+specified development set.</p>
+
+<h4 id="step-3-1-determine-interpolation-weights">Step 3-1: Determine interpolation weights</h4>
+
+<p>Initiate the following script from the <code class="highlighter-rouge">lm/</code> directory (edit it to specify the
+path to the <code class="highlighter-rouge">ngram</code> binary as well as the path to the development text file):</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code><span class="c">#!/bin/bash</span>
+<span class="nb">set</span> -x
+
+<span class="nv">NGRAM</span><span class="o">=</span><span class="nv">$SRILM_SRC</span>/bin/i686-m64/ngram
+<span class="nv">DEV_TEXT</span><span class="o">=</span>~mpost/expts/wmt12/runs/es-en/data/tune/tune.tok.lc.es
+
+<span class="nb">dirs</span><span class="o">=(</span> afp_eng apw_eng cna_eng ltw_eng nyt_eng wpb_eng xin_eng <span class="o">)</span>
+
+<span class="k">for </span>d <span class="k">in</span> <span class="k">${</span><span class="nv">dirs</span><span class="p">[@]</span><span class="k">}</span> ; <span class="k">do</span>
+  <span class="nv">$NGRAM</span> -debug 2 -order 5 -unk -lm <span class="nv">$d</span>/lm.gz -ppl <span class="nv">$DEV_TEXT</span> &gt; <span class="nv">$d</span>/lm.ppl ;
+<span class="k">done
+
+</span>compute-best-mix <span class="k">*</span>/lm.ppl &gt; best-mix.ppl
+</code></pre>
+</div>
+
+<p>Take a look at the contents of <code class="highlighter-rouge">best-mix.ppl</code>. It will contain a sequence of
+values in parentheses. These are the interpolation weights of the source
+language models in the order specified. Copy and paste the values within the
+parentheses into the script below.</p>
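+
+<p>The relevant line of <code class="highlighter-rouge">best-mix.ppl</code> looks something like this (illustrative output, shown here with the weights used in the next step):</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code>best lambda (0.00631272 0.000647602 0.251555 0.0134726 0.348953 0.371566 0.00749238)
+</code></pre>
+</div>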
+
+<h4 id="step-3-2-combine-the-models">Step 3-2: Combine the models</h4>
+
+<p>Initiate the following script from the <code class="highlighter-rouge">lm/</code> directory (edit it to specify the
+path to the <code class="highlighter-rouge">ngram</code> binary as well as the interpolation weights):</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code><span class="c">#!/bin/bash</span>
+<span class="nb">set</span> -x
+
+<span class="nv">NGRAM</span><span class="o">=</span><span class="nv">$SRILM_SRC</span>/bin/i686-m64/ngram
+<span class="nv">DIRS</span><span class="o">=(</span>   afp_eng    apw_eng     cna_eng  ltw_eng   nyt_eng  wpb_eng  xin_eng <span class="o">)</span>
+<span class="nv">LAMBDAS</span><span class="o">=(</span>0.00631272 0.000647602 0.251555 0.0134726 0.348953 0.371566 0.00749238<span class="o">)</span>
+
+<span class="nv">$NGRAM</span> -order 5 -unk <span class="se">\</span>
+  -lm      <span class="k">${</span><span class="nv">DIRS</span><span class="p">[0]</span><span class="k">}</span>/lm.gz     -lambda  <span class="k">${</span><span class="nv">LAMBDAS</span><span class="p">[0]</span><span class="k">}</span> <span class="se">\</span>
+  -mix-lm  <span class="k">${</span><span class="nv">DIRS</span><span class="p">[1]</span><span class="k">}</span>/lm.gz <span class="se">\</span>
+  -mix-lm2 <span class="k">${</span><span class="nv">DIRS</span><span class="p">[2]</span><span class="k">}</span>/lm.gz -mix-lambda2 <span class="k">${</span><span class="nv">LAMBDAS</span><span class="p">[2]</span><span class="k">}</span> <span class="se">\</span>
+  -mix-lm3 <span class="k">${</span><span class="nv">DIRS</span><span class="p">[3]</span><span class="k">}</span>/lm.gz -mix-lambda3 <span class="k">${</span><span class="nv">LAMBDAS</span><span class="p">[3]</span><span class="k">}</span> <span class="se">\</span>
+  -mix-lm4 <span class="k">${</span><span class="nv">DIRS</span><span class="p">[4]</span><span class="k">}</span>/lm.gz -mix-lambda4 <span class="k">${</span><span class="nv">LAMBDAS</span><span class="p">[4]</span><span class="k">}</span> <span class="se">\</span>
+  -mix-lm5 <span class="k">${</span><span class="nv">DIRS</span><span class="p">[5]</span><span class="k">}</span>/lm.gz -mix-lambda5 <span class="k">${</span><span class="nv">LAMBDAS</span><span class="p">[5]</span><span class="k">}</span> <span class="se">\</span>
+  -mix-lm6 <span class="k">${</span><span class="nv">DIRS</span><span class="p">[6]</span><span class="k">}</span>/lm.gz -mix-lambda6 <span class="k">${</span><span class="nv">LAMBDAS</span><span class="p">[6]</span><span class="k">}</span> <span class="se">\</span>
+  -write-lm mixed_lm.gz
+</code></pre>
+</div>
+
+<p>The resulting file, <code class="highlighter-rouge">mixed_lm.gz</code>, is a language model based on all the text in
+the Gigaword corpus, with some probabilities biased toward the development text
+specified in step 3-1. It is in the ARPA format. The optional next step converts
+it into KenLM format.</p>
+
+<h4 id="step-3-3-convert-to-kenlm">Step 3-3: Convert to KenLM</h4>
+
+<p>The KenLM format has some speed advantages over the ARPA format. Issuing the
+following command will write a new language model file <code class="highlighter-rouge">mixed_lm.kenlm</code> that
+is the <code class="highlighter-rouge">mixed_lm.gz</code> language model transformed into the KenLM format.</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code>$JOSHUA/src/joshua/decoder/ff/lm/kenlm/build_binary mixed_lm.gz mixed_lm.kenlm
+</code></pre>
+</div>
+
+
+
+        </div>
+      </div>
+    </div> <!-- /container -->
+
+    <!-- Le javascript
+    ================================================== -->
+    <!-- Placed at the end of the document so the pages load faster -->
+    <script src="bootstrap/js/jquery.js"></script>
+    <script src="bootstrap/js/bootstrap-transition.js"></script>
+    <script src="bootstrap/js/bootstrap-alert.js"></script>
+    <script src="bootstrap/js/bootstrap-modal.js"></script>
+    <script src="bootstrap/js/bootstrap-dropdown.js"></script>
+    <script src="bootstrap/js/bootstrap-scrollspy.js"></script>
+    <script src="bootstrap/js/bootstrap-tab.js"></script>
+    <script src="bootstrap/js/bootstrap-tooltip.js"></script>
+    <script src="bootstrap/js/bootstrap-popover.js"></script>
+    <script src="bootstrap/js/bootstrap-button.js"></script>
+    <script src="bootstrap/js/bootstrap-collapse.js"></script>
+    <script src="bootstrap/js/bootstrap-carousel.js"></script>
+    <script src="bootstrap/js/bootstrap-typeahead.js"></script>
+
+    <!-- Start of StatCounter Code for Default Guide -->
+    <script type="text/javascript">
+      var sc_project=8264132; 
+      var sc_invisible=1; 
+      var sc_security="4b97fe2d"; 
+    </script>
+    <script type="text/javascript" src="http://www.statcounter.com/counter/counter.js"></script>
+    <noscript>
+      <div class="statcounter">
+        <a title="hit counter joomla" 
+           href="http://statcounter.com/joomla/"
+           target="_blank">
+          <img class="statcounter"
+               src="http://c.statcounter.com/8264132/0/4b97fe2d/1/"
+               alt="hit counter joomla" />
+        </a>
+      </div>
+    </noscript>
+    <!-- End of StatCounter Code for Default Guide -->
+
+  </body>
+</html>

http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/53cc3005/5.0/large-lms.md
----------------------------------------------------------------------
diff --git a/5.0/large-lms.md b/5.0/large-lms.md
deleted file mode 100644
index 28ba0b9..0000000
--- a/5.0/large-lms.md
+++ /dev/null
@@ -1,192 +0,0 @@
----
-layout: default
-title: Building large LMs with SRILM
-category: advanced
----
-
-The following is a tutorial for building a large language model from the
-English Gigaword Fifth Edition corpus
-[LDC2011T07](http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2011T07)
-using SRILM. English text is provided from seven different sources.
-
-### Step 0: Clean up the corpus
-
-The Gigaword corpus has to be stripped of all SGML tags and tokenized.
-Instructions for performing those steps are not included in this
-documentation. A description of this process can be found in a paper
-called ["Annotated
-Gigaword"](https://akbcwekex2012.files.wordpress.com/2012/05/28_paper.pdf).
-
-The Joshua package ships with a script that converts all alphabetical
-characters to their lowercase equivalent. The script is located at
-`$JOSHUA/scripts/lowercase.perl`.
-
-Make a directory structure as follows:
-
-    gigaword/
-    ├── corpus/
-    │   ├── afp_eng/
-    │   │   ├── afp_eng_199405.lc.gz
-    │   │   ├── afp_eng_199406.lc.gz
-    │   │   ├── ...
-    │   │   └── counts/
-    │   ├── apw_eng/
-    │   │   ├── apw_eng_199411.lc.gz
-    │   │   ├── apw_eng_199412.lc.gz
-    │   │   ├── ...
-    │   │   └── counts/
-    │   ├── cna_eng/
-    │   │   ├── ...
-    │   │   └── counts/
-    │   ├── ltw_eng/
-    │   │   ├── ...
-    │   │   └── counts/
-    │   ├── nyt_eng/
-    │   │   ├── ...
-    │   │   └── counts/
-    │   ├── wpb_eng/
-    │   │   ├── ...
-    │   │   └── counts/
-    │   └── xin_eng/
-    │       ├── ...
-    │       └── counts/
-    └── lm/
-        ├── afp_eng/
-        ├── apw_eng/
-        ├── cna_eng/
-        ├── ltw_eng/
-        ├── nyt_eng/
-        ├── wpb_eng/
-        └── xin_eng/
-
-
-The next step will be to build smaller LMs and then interpolate them into one
-file.
-
-### Step 1: Count ngrams
-
-Run the following script once from each source directory under the `corpus/`
-directory (edit it to specify the path to the `ngram-count` binary as well as
-the number of processors):
-
-    #!/bin/sh
-
-    NGRAM_COUNT=$SRILM_SRC/bin/i686-m64/ngram-count
-    args=""
-
-    for source in *.gz; do
-       args=$args"-sort -order 5 -text $source -write counts/$source-counts.gz "
-    done
-
-    echo $args | xargs --max-procs=4 -n 7 $NGRAM_COUNT
-
-Then move each `counts/` directory to the corresponding directory under
-`lm/`. Now that each ngram has been counted, we can make a language
-model for each of the seven sources.
-
-### Step 2: Make individual language models
-
-SRILM includes a script, called `make-big-lm`, for building large language
-models under resource-limited environments. The manual for this script can be
-read online
-[here](http://www-speech.sri.com/projects/srilm/manpages/training-scripts.1.html).
-Since the Gigaword corpus is so large, it is convenient to use `make-big-lm`
-even in environments with many parallel processors and a lot of memory.
-
-Initiate the following script from each of the source directories under the
-`lm/` directory (edit it to specify the path to the `make-big-lm` script as
-well as the pruning threshold):
-
-    #!/bin/bash
-    set -x
-
-    CMD=$SRILM_SRC/bin/make-big-lm
-    PRUNE_THRESHOLD=1e-8
-
-    $CMD \
-      -name gigalm `for k in counts/*.gz; do echo " \
-      -read $k "; done` \
-      -lm lm.gz \
-      -max-per-file 100000000 \
-      -order 5 \
-      -kndiscount \
-      -interpolate \
-      -unk \
-      -prune $PRUNE_THRESHOLD
-
-The language model attributes chosen are the following:
-
-* N-grams up to order 5
-* Kneser-Ney smoothing
-* N-gram probability estimates at the specified order *n* are interpolated with
-  lower-order estimates
-* include the unknown-word token as a regular word
-* pruning N-grams based on the specified threshold
-
-Next, we will mix the models together into a single file.
-
-### Step 3: Mix models together
-
-Using development text, interpolation weights can be determined that give the highest
-weight to the source language models that have the lowest perplexity on the
-specified development set.
-
-#### Step 3-1: Determine interpolation weights
-
-Initiate the following script from the `lm/` directory (edit it to specify the
-path to the `ngram` binary as well as the path to the development text file):
-
-    #!/bin/bash
-    set -x
-
-    NGRAM=$SRILM_SRC/bin/i686-m64/ngram
-    DEV_TEXT=~mpost/expts/wmt12/runs/es-en/data/tune/tune.tok.lc.es
-
-    dirs=( afp_eng apw_eng cna_eng ltw_eng nyt_eng wpb_eng xin_eng )
-
-    for d in ${dirs[@]} ; do
-      $NGRAM -debug 2 -order 5 -unk -lm $d/lm.gz -ppl $DEV_TEXT > $d/lm.ppl ;
-    done
-
-    compute-best-mix */lm.ppl > best-mix.ppl
-
-Take a look at the contents of `best-mix.ppl`. It will contain a sequence of
-values in parentheses. These are the interpolation weights of the source
-language models in the order specified. Copy and paste the values within the
-parentheses into the script below.
-
-#### Step 3-2: Combine the models
-
-Initiate the following script from the `lm/` directory (edit it to specify the
-path to the `ngram` binary as well as the interpolation weights):
-
-    #!/bin/bash
-    set -x
-
-    NGRAM=$SRILM_SRC/bin/i686-m64/ngram
-    DIRS=(   afp_eng    apw_eng     cna_eng  ltw_eng   nyt_eng  wpb_eng  xin_eng )
-    LAMBDAS=(0.00631272 0.000647602 0.251555 0.0134726 0.348953 0.371566 0.00749238)
-
-    $NGRAM -order 5 -unk \
-      -lm      ${DIRS[0]}/lm.gz     -lambda  ${LAMBDAS[0]} \
-      -mix-lm  ${DIRS[1]}/lm.gz \
-      -mix-lm2 ${DIRS[2]}/lm.gz -mix-lambda2 ${LAMBDAS[2]} \
-      -mix-lm3 ${DIRS[3]}/lm.gz -mix-lambda3 ${LAMBDAS[3]} \
-      -mix-lm4 ${DIRS[4]}/lm.gz -mix-lambda4 ${LAMBDAS[4]} \
-      -mix-lm5 ${DIRS[5]}/lm.gz -mix-lambda5 ${LAMBDAS[5]} \
-      -mix-lm6 ${DIRS[6]}/lm.gz -mix-lambda6 ${LAMBDAS[6]} \
-      -write-lm mixed_lm.gz
-
-The resulting file, `mixed_lm.gz`, is a language model based on all the text in
-the Gigaword corpus, with some probabilities biased toward the development text
-specified in step 3-1. It is in the ARPA format. The optional next step converts
-it into KenLM format.
-
-#### Step 3-3: Convert to KenLM
-
-The KenLM format has some speed advantages over the ARPA format. Issuing the
-following command will write a new language model file `mixed_lm.kenlm` that
-is the `mixed_lm.gz` language model transformed into the KenLM format.
-
-    $JOSHUA/src/joshua/decoder/ff/lm/kenlm/build_binary mixed_lm.gz mixed_lm.kenlm
-

http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/53cc3005/5.0/packing.html
----------------------------------------------------------------------
diff --git a/5.0/packing.html b/5.0/packing.html
new file mode 100644
index 0000000..349ab4b
--- /dev/null
+++ b/5.0/packing.html
@@ -0,0 +1,270 @@
+<!DOCTYPE html>
+<html lang="en">
+  <head>
+    <meta charset="utf-8">
+    <title>Joshua Documentation | Grammar Packing</title>
+    <meta name="viewport" content="width=device-width, initial-scale=1.0">
+    <meta name="description" content="">
+    <meta name="author" content="">
+
+    <!-- Le styles -->
+    <link href="/bootstrap/css/bootstrap.css" rel="stylesheet">
+    <style>
+      body {
+        padding-top: 60px; /* 60px to make the container go all the way to the bottom of the topbar */
+      }
+      #download {
+          background-color: green;
+          font-size: 14pt;
+          font-weight: bold;
+          text-align: center;
+          color: white;
+          border-radius: 5px;
+          padding: 4px;
+      }
+
+      #download a:link {
+          color: white;
+      }
+
+      #download a:hover {
+          color: lightgrey;
+      }
+
+      #download a:visited {
+          color: white;
+      }
+
+      a.pdf {
+          font-variant: small-caps;
+          /* font-weight: bold; */
+          font-size: 10pt;
+          color: white;
+          background: brown;
+          padding: 2px;
+      }
+
+      a.bibtex {
+          font-variant: small-caps;
+          /* font-weight: bold; */
+          font-size: 10pt;
+          color: white;
+          background: orange;
+          padding: 2px;
+      }
+
+      img.sponsor {
+        height: 120px;
+        margin: 5px;
+      }
+    </style>
+    <link href="bootstrap/css/bootstrap-responsive.css" rel="stylesheet">
+
+    <!-- HTML5 shim, for IE6-8 support of HTML5 elements -->
+    <!--[if lt IE 9]>
+      <script src="bootstrap/js/html5shiv.js"></script>
+    <![endif]-->
+
+    <!-- Fav and touch icons -->
+    <link rel="apple-touch-icon-precomposed" sizes="144x144" href="bootstrap/ico/apple-touch-icon-144-precomposed.png">
+    <link rel="apple-touch-icon-precomposed" sizes="114x114" href="bootstrap/ico/apple-touch-icon-114-precomposed.png">
+      <link rel="apple-touch-icon-precomposed" sizes="72x72" href="bootstrap/ico/apple-touch-icon-72-precomposed.png">
+                    <link rel="apple-touch-icon-precomposed" href="bootstrap/ico/apple-touch-icon-57-precomposed.png">
+                                   <link rel="shortcut icon" href="bootstrap/ico/favicon.png">
+  </head>
+
+  <body>
+
+    <div class="navbar navbar-inverse navbar-fixed-top">
+      <div class="navbar-inner">
+        <div class="container">
+          <button type="button" class="btn btn-navbar" data-toggle="collapse" data-target=".nav-collapse">
+            <span class="icon-bar"></span>
+            <span class="icon-bar"></span>
+            <span class="icon-bar"></span>
+          </button>
+          <a class="brand" href="/">Joshua</a>
+          <div class="nav-collapse collapse">
+            <ul class="nav">
+              <li><a href="index.html">Documentation</a></li>
+              <li><a href="pipeline.html">Pipeline</a></li>
+              <li><a href="tutorial.html">Tutorial</a></li>
+              <li><a href="decoder.html">Decoder</a></li>
+              <li><a href="thrax.html">Thrax</a></li>
+              <li><a href="file-formats.html">File formats</a></li>
+              <!-- <li><a href="advanced.html">Advanced</a></li> -->
+              <li><a href="faq.html">FAQ</a></li>
+            </ul>
+          </div><!--/.nav-collapse -->
+        </div>
+      </div>
+    </div>
+
+    <div class="container">
+
+      <div class="row">
+        <div class="span2">
+          <img src="/images/joshua-logo-small.png" 
+               alt="Joshua logo (picture of a Joshua tree)" />
+        </div>
+        <div class="span10">
+          <h1>Joshua Documentation</h1>
+          <h2>Grammar Packing</h2>
+          <span id="download">
+            <a href="http://cs.jhu.edu/~post/files/joshua-v5.0.tgz">Download</a>
+          </span>
+          &nbsp; (version 5.0, released 16 August 2013)
+        </div>
+      </div>
+      
+      <hr />
+
+      <div class="row">
+        <div class="span8">
+
+          <p>Grammar packing refers to the process of taking a textual grammar output by <a href="thrax.html">Thrax</a> and
+efficiently encoding it for use by Joshua.  Packing the grammar results in significantly faster load
+times for very large grammars.</p>
+
+<p>Soon, the <a href="pipeline.html">Joshua pipeline script</a> will add support for grammar packing
+automatically, and we will provide a script that automates these steps for you.</p>
+
+<ol>
+  <li>
+    <p>Make sure the grammar is labeled.  A labeled grammar is one that has feature names attached to
+each of the feature values in each row of the grammar file.  Here is a line from an unlabeled
+grammar:</p>
+
+    <div class="highlighter-rouge"><pre class="highlight"><code> [X] ||| [X,1] অন্যান্য [X,2] ||| [X,1] other [X,2] ||| 0 0 1 0 0 1.02184
+</code></pre>
+    </div>
+
+    <p>and here is one from a labeled grammar (note that the labels are not very useful):</p>
+
+    <div class="highlighter-rouge"><pre class="highlight"><code> [X] ||| [X,1] অন্যান্য [X,2] ||| [X,1] other [X,2] ||| f1=0 f2=0 f3=1 f4=0 f5=0 f6=1.02184
+</code></pre>
+    </div>
+
+    <p>If your grammar is not labeled, you can use the script <code class="highlighter-rouge">$JOSHUA/scripts/label_grammar.py</code>:</p>
+
+    <div class="highlighter-rouge"><pre class="highlight"><code> zcat grammar.gz | $JOSHUA/scripts/label_grammar.py | gzip &gt; grammar-labeled.gz
+</code></pre>
+    </div>
+
+    <p>A side-effect of this step is to produce a file ‘dense_map’ in the current directory,
+containing the mapping between feature names and feature columns.  This file is needed in later
+steps.</p>
+  </li>
+  <li>
+    <p>The packer needs a sorted grammar.  It is sufficient to sort by the first word:</p>
+
+    <div class="highlighter-rouge"><pre class="highlight"><code> zcat grammar-labeled.gz | sort -k3,3 | gzip &gt; grammar-sorted.gz
+</code></pre>
+    </div>
+
+    <p>(The reason we need a sorted grammar is that the packer stores the grammar in a trie.  The
+pieces can’t be more than 2 GB due to Java limitations, so we need to ensure that rules are
+grouped by the first arc in the trie to avoid redundancy across tries and to simplify the
+lookup.)</p>
+  </li>
+  <li>
+    <p>In order to pack the grammar, we need two pieces of information: (1) a packer configuration file,
+and (2) a dense map file.</p>
+
+    <ol>
+      <li>
+        <p>Write a packer config file.  This file specifies items such as the chunk size (for the packed
+pieces) and the quantization classes and types for each feature name.  Examples can be found
+at</p>
+
+        <div class="highlighter-rouge"><pre class="highlight"><code>  $JOSHUA/test/packed/packer.config
+  $JOSHUA/test/bn-en/packed/packer.quantized
+  $JOSHUA/test/bn-en/packed/packer.uncompressed
+</code></pre>
+        </div>
+
+        <p>The quantizer lines in the packer config file have the following format:</p>
+
+        <div class="highlighter-rouge"><pre class="highlight"><code>  quantizer TYPE FEATURES
+</code></pre>
+        </div>
+
+        <p>where <code class="highlighter-rouge">TYPE</code> is one of <code class="highlighter-rouge">boolean</code>, <code class="highlighter-rouge">float</code>, <code class="highlighter-rouge">byte</code>, or <code class="highlighter-rouge">8bit</code>, and <code class="highlighter-rouge">FEATURES</code> is a
+ space-delimited list of feature names that have that quantization type.</p>
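+
+        <p>For example, using the f1–f6 labels from step 1, plausible (illustrative) quantizer lines
+would be:</p>
+
+        <div class="highlighter-rouge"><pre class="highlight"><code>  quantizer boolean f1 f2 f4 f5
+  quantizer float f3 f6
+</code></pre>
+        </div>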
+      </li>
+      <li>
+        <p>Write a dense_map file.  If you labeled an unlabeled grammar, this was produced for you as a
+side product of the <code class="highlighter-rouge">label_grammar.py</code> script you called in Step 1.  Otherwise, you need to
+create a file that lists the mapping between feature names and (0-indexed) columns in the
+grammar, one per line, in the following format:</p>
+
+        <div class="highlighter-rouge"><pre class="highlight"><code>  feature-index feature-name
+</code></pre>
+        </div>
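+
+        <p>For the six-feature grammar from step 1, the file would look like this (illustrative):</p>
+
+        <div class="highlighter-rouge"><pre class="highlight"><code>  0 f1
+  1 f2
+  2 f3
+  3 f4
+  4 f5
+  5 f6
+</code></pre>
+        </div>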
+      </li>
+    </ol>
+  </li>
+  <li>
+    <p>To pack the grammar, type the following command:</p>
+
+    <div class="highlighter-rouge"><pre class="highlight"><code> java -cp $JOSHUA/bin joshua.tools.GrammarPacker -c PACKER_CONFIG_FILE -p OUTPUT_DIR -g GRAMMAR_FILE
+</code></pre>
+    </div>
+
+    <p>This will read in your packer configuration file and your grammar, and produce a packed grammar
+in the output directory.</p>
+  </li>
+  <li>
+    <p>To use the packed grammar, just point to the packed directory in your Joshua configuration file.</p>
+
+    <div class="highlighter-rouge"><pre class="highlight"><code> tm-file = packed-grammar/
+ tm-format = packed
+</code></pre>
+    </div>
+  </li>
+</ol>
+
+
+        </div>
+      </div>
+    </div> <!-- /container -->
+
+    <!-- Le javascript
+    ================================================== -->
+    <!-- Placed at the end of the document so the pages load faster -->
+    <script src="bootstrap/js/jquery.js"></script>
+    <script src="bootstrap/js/bootstrap-transition.js"></script>
+    <script src="bootstrap/js/bootstrap-alert.js"></script>
+    <script src="bootstrap/js/bootstrap-modal.js"></script>
+    <script src="bootstrap/js/bootstrap-dropdown.js"></script>
+    <script src="bootstrap/js/bootstrap-scrollspy.js"></script>
+    <script src="bootstrap/js/bootstrap-tab.js"></script>
+    <script src="bootstrap/js/bootstrap-tooltip.js"></script>
+    <script src="bootstrap/js/bootstrap-popover.js"></script>
+    <script src="bootstrap/js/bootstrap-button.js"></script>
+    <script src="bootstrap/js/bootstrap-collapse.js"></script>
+    <script src="bootstrap/js/bootstrap-carousel.js"></script>
+    <script src="bootstrap/js/bootstrap-typeahead.js"></script>
+
+    <!-- Start of StatCounter Code for Default Guide -->
+    <script type="text/javascript">
+      var sc_project=8264132; 
+      var sc_invisible=1; 
+      var sc_security="4b97fe2d"; 
+    </script>
+    <script type="text/javascript" src="http://www.statcounter.com/counter/counter.js"></script>
+    <noscript>
+      <div class="statcounter">
+        <a title="hit counter joomla" 
+           href="http://statcounter.com/joomla/"
+           target="_blank">
+          <img class="statcounter"
+               src="http://c.statcounter.com/8264132/0/4b97fe2d/1/"
+               alt="hit counter joomla" />
+        </a>
+      </div>
+    </noscript>
+    <!-- End of StatCounter Code for Default Guide -->
+
+  </body>
+</html>

http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/53cc3005/5.0/packing.md
----------------------------------------------------------------------
diff --git a/5.0/packing.md b/5.0/packing.md
deleted file mode 100644
index 2f39ba7..0000000
--- a/5.0/packing.md
+++ /dev/null
@@ -1,76 +0,0 @@
----
-layout: default
-category: advanced
-title: Grammar Packing
----
-
-Grammar packing refers to the process of taking a textual grammar output by [Thrax](thrax.html) and
-efficiently encoding it for use by Joshua.  Packing the grammar results in significantly faster load
-times for very large grammars.
-
-Soon, the [Joshua pipeline script](pipeline.html) will add support for grammar packing
-automatically, and we will provide a script that automates these steps for you.
-
-1. Make sure the grammar is labeled.  A labeled grammar is one that has feature names attached to
-each of the feature values in each row of the grammar file.  Here is a line from an unlabeled
-grammar:
-
-        [X] ||| [X,1] অন্যান্য [X,2] ||| [X,1] other [X,2] ||| 0 0 1 0 0 1.02184
-
-   and here is one from a labeled grammar (note that the labels are not very useful):
-
-        [X] ||| [X,1] অন্যান্য [X,2] ||| [X,1] other [X,2] ||| f1=0 f2=0 f3=1 f4=0 f5=0 f6=1.02184
-
-   If your grammar is not labeled, you can use the script `$JOSHUA/scripts/label_grammar.py`:
-   
-        zcat grammar.gz | $JOSHUA/scripts/label_grammar.py | gzip > grammar-labeled.gz
-
-   A side-effect of this step is to produce a file 'dense_map' in the current directory,
-   containing the mapping between feature names and feature columns.  This file is needed in later
-   steps.
-
-1. The packer needs a sorted grammar.  It is sufficient to sort by the first word:
-
-        zcat grammar-labeled.gz | sort -k3,3 | gzip > grammar-sorted.gz
-      
-   (The reason we need a sorted grammar is that the packer stores the grammar in a trie.  The
-   pieces can't be more than 2 GB due to Java limitations, so we need to ensure that rules are
-   grouped by the first arc in the trie to avoid redundancy across tries and to simplify the
-   lookup.)
-    
-1. In order to pack the grammar, we need two pieces of information: (1) a packer configuration file,
-   and (2) a dense map file.
-
-   1. Write a packer config file.  This file specifies items such as the chunk size (for the packed
-      pieces) and the quantization classes and types for each feature name.  Examples can be found
-      at
-   
-            $JOSHUA/test/packed/packer.config
-            $JOSHUA/test/bn-en/packed/packer.quantized
-            $JOSHUA/test/bn-en/packed/packer.uncompressed
-       
-      The quantizer lines in the packer config file have the following format:
-   
-            quantizer TYPE FEATURES
-       
-       where `TYPE` is one of `boolean`, `float`, `byte`, or `8bit`, and `FEATURES` is a
-       space-delimited list of feature names that have that quantization type.
-   
-   1. Write a dense_map file.  If you labeled an unlabeled grammar, this was produced for you as a
-      side product of the `label_grammar.py` script you called in Step 1.  Otherwise, you need to
-      create a file that lists the mapping between feature names and (0-indexed) columns in the
-      grammar, one per line, in the following format:
-   
-            feature-index feature-name
-    
-1. To pack the grammar, type the following command:
-
-        java -cp $JOSHUA/bin joshua.tools.GrammarPacker -c PACKER_CONFIG_FILE -p OUTPUT_DIR -g GRAMMAR_FILE
-
-    This will read in your packer configuration file and your grammar, and produce a packed grammar
-    in the output directory.
-
-1. To use the packed grammar, just point to the packed directory in your Joshua configuration file.
-
-        tm-file = packed-grammar/
-        tm-format = packed