Posted to commits@joshua.apache.org by mj...@apache.org on 2016/04/05 14:44:00 UTC
[13/50] incubator-joshua-site git commit: Updated tutorial
Updated tutorial
Project: http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/commit/5b80a147
Tree: http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/tree/5b80a147
Diff: http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/diff/5b80a147
Branch: refs/heads/asf-site
Commit: 5b80a14748fab4b9ad6ee331c90e9d3926b3ae7c
Parents: dc45f50
Author: Matt Post <po...@cs.jhu.edu>
Authored: Wed Jun 10 00:10:29 2015 -0400
Committer: Matt Post <po...@cs.jhu.edu>
Committed: Wed Jun 10 00:10:29 2015 -0400
----------------------------------------------------------------------
6.0/tutorial.md | 96 +++++++++++++++++++++++++--------------------
_layouts/default6.html | 7 +---
2 files changed, 55 insertions(+), 48 deletions(-)
----------------------------------------------------------------------
http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/5b80a147/6.0/tutorial.md
----------------------------------------------------------------------
diff --git a/6.0/tutorial.md b/6.0/tutorial.md
index 9a43e93..d167cdc 100644
--- a/6.0/tutorial.md
+++ b/6.0/tutorial.md
@@ -13,75 +13,87 @@ other options available in the pipeline.
Download and install Joshua as described on the [quick start page](index.html), installing it under
`~/code/`. Once you've done that, you should make sure you have the following environment variable set:
- export JOSHUA=$HOME/code/joshua-v5.0
+ export JOSHUA=$HOME/code/joshua-v{{ site.data.joshua.release_version }}
export JAVA_HOME=/usr/java/default
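+Once both variables are exported, a quick sanity check is to confirm they point at real
+directories. This is only an illustrative snippet, not part of Joshua:
+
+    # Verify the required environment variables point at real directories.
+    for var in JOSHUA JAVA_HOME; do
+        val=$(eval "echo \$$var")    # portable indirect expansion
+        if [ -d "$val" ]; then
+            echo "$var=$val (ok)"
+        else
+            echo "warning: $var=$val is not a directory" >&2
+        fi
+    done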
-If you have a Hadoop installation, make sure you've set `$HADOOP` to point to it (if not, Joshua
-will roll out a standalone cluster for you). If you'd like to use kbmira for tuning, you should also
-install Moses, and define the environment variable `$MOSES` to point to the root of its installation.
+If you have a Hadoop installation, make sure you've set `$HADOOP` to point to it. For example, if the `hadoop` command is in `/usr/bin`,
+you should type
+
+ export HADOOP=/usr
+
+Joshua will find the binary and use it to submit jobs to your Hadoop cluster. If you don't have one, just
+make sure that `$HADOOP` is unset, and Joshua will roll one out for you and run it in
+[standalone mode](https://hadoop.apache.org/docs/r1.2.1/single_node_setup.html).
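+The lookup amounts to checking for an executable under `$HADOOP`; as a rough sketch of
+the logic (illustrative only, not Joshua's actual code):
+
+    # Resolve a Hadoop binary from $HADOOP, else fall back to standalone mode.
+    if [ -n "$HADOOP" ] && [ -x "$HADOOP/bin/hadoop" ]; then
+        echo "submitting jobs via $HADOOP/bin/hadoop"
+    else
+        echo "\$HADOOP unset or invalid; Hadoop will run in standalone mode"
+    fi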
## A basic pipeline run
-For today's experiments, we'll be building a Bengali--English system using data included in the
-[Indian Languages Parallel Corpora](/indian-parallel-corpora/). This data was collected by taking
-the 100 most-popular Bengali Wikipedia pages and translating them into English using Amazon's
-[Mechanical Turk](http://www.mturk.com/). As a warning, many of these pages contain material that is
-not typically found in machine translation tutorials.
+For today's experiments, we'll be building a Spanish--English system using data included in the
+[Fisher and CALLHOME translation corpus](/data/fisher-callhome-corpus/). This
+data was collected by translating transcribed speech from previous LDC releases.
Download the data and install it somewhere:
cd ~/data
- wget -q --no-check -O indian-parallel-corpora.zip https://github.com/joshua-decoder/indian-parallel-corpora/archive/master.zip
- unzip indian-parallel-corpora.zip
+ wget --no-check-certificate -O fisher-spanish-corpus.zip https://github.com/joshua-decoder/fisher-callhome-corpus/archive/master.zip
+ unzip fisher-spanish-corpus.zip
-Then define the environment variable `$INDIAN` to point to it:
+Then define the environment variable `$FISHER` to point to it:
- cd ~/data/indian-parallel-corpora-master
- export INDIAN=$(pwd)
+ cd ~/data/fisher-spanish-corpus-master
+ export FISHER=$(pwd)
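+Before moving on, it's worth confirming the variable points at the unpacked corpus. A
+minimal check, assuming the top-level `corpus/` subdirectory (adjust if your unzip
+produced a different layout):
+
+    # Quick check that $FISHER points at the corpus root (illustrative).
+    if [ -d "$FISHER/corpus" ]; then
+        echo "corpus root: $FISHER"
+    else
+        echo "warning: \$FISHER/corpus not found; check the unzip directory name" >&2
+    fi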
### Preparing the data
-Inside this tarball is a directory for each language pair. Within each language directory is another
-directory named `tok/`, which contains pre-tokenized and normalized versions of the data. This was
-done because the normalization scripts provided with Joshua is written in scripting languages that
-often have problems properly handling UTF-8 character sets. We will be using these tokenized
-versions, and preventing the pipeline from retokenizing using the `--no-prepare` flag.
+Inside the tarball is the Fisher and CALLHOME Spanish--English data, which includes Kaldi-provided
+ASR output and English translations of the Fisher and CALLHOME dataset transcriptions. Because of
+licensing restrictions, we cannot distribute the Spanish transcripts, but if you have an LDC site
+license, a script is provided to build them. You can type:
+
+ ./bin/build_fisher.sh /export/common/data/corpora/LDC/LDC2010T04
-In `$INDIAN/bn-en/tok`, you should see the following files:
+where the first argument is the path to your LDC data release. This will create the files in `corpus/ldc`.
- $ ls $INDIAN/bn-en/tok
- dev.bn-en.bn devtest.bn-en.bn dict.bn-en.bn test.bn-en.en.2
- dev.bn-en.en.0 devtest.bn-en.en.0 dict.bn-en.en test.bn-en.en.3
- dev.bn-en.en.1 devtest.bn-en.en.1 test.bn-en.bn training.bn-en.bn
- dev.bn-en.en.2 devtest.bn-en.en.2 test.bn-en.en.0 training.bn-en.en
- dev.bn-en.en.3 devtest.bn-en.en.3 test.bn-en.en.1
+In `$FISHER/corpus`, there is a set of parallel directories for LDC transcripts (`ldc`), ASR output
+(`asr`), oracle ASR output (`oracle`), and ASR lattice output (`plf`). The files look like this:
-We will now use this data to test the complete pipeline with a single command.
+ $ ls corpus/ldc
+ callhome_devtest.en fisher_dev2.en.2 fisher_dev.en.2 fisher_test.en.2
+ callhome_evltest.en fisher_dev2.en.3 fisher_dev.en.3 fisher_test.en.3
+ callhome_train.en fisher_dev2.es fisher_dev.es fisher_test.es
+ fisher_dev2.en.0 fisher_dev.en.0 fisher_test.en.0 fisher_train.en
+ fisher_dev2.en.1 fisher_dev.en.1 fisher_test.en.1 fisher_train.es
+
+If you don't have the LDC transcripts, you can use the data in `corpus/asr` instead. We will now use
+this data to build our own Spanish--English model using Joshua's pipeline.
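+Parallel data is only usable if the two sides line up sentence-for-sentence. A quick
+check (file names taken from the listing above; run from `$FISHER`) is to compare line
+counts:
+
+    # Check that the two sides of the training bitext are parallel
+    # (equal line counts). Run from the corpus root, e.g. $FISHER.
+    src=corpus/ldc/fisher_train.es
+    tgt=corpus/ldc/fisher_train.en
+    src_lines=$(wc -l < "$src")
+    tgt_lines=$(wc -l < "$tgt")
+    if [ "$src_lines" -eq "$tgt_lines" ]; then
+        echo "parallel: $src_lines sentence pairs"
+    else
+        echo "line-count mismatch: $src=$src_lines $tgt=$tgt_lines" >&2
+    fi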
### Run the pipeline
-Create an experiments directory for containing your first experiment:
+Create an experiments directory to contain your first experiment. *Note: it's important that
+this **not** be inside your `$JOSHUA` directory*.
mkdir ~/expts/joshua
cd ~/expts/joshua
We will now create the baseline run, using a particular directory structure for experiments that
will allow us to take advantage of scripts provided with Joshua for displaying the results of many
-related experiments.
+related experiments. Because this can take quite some time to run, we are going to add a crippling
+restriction: Joshua will only use sentences in the training sets with ten or fewer words on either
+side (Spanish or English):
cd ~/expts/joshua
$JOSHUA/bin/pipeline.pl \
--rundir 1 \
--readme "Baseline Hiero run" \
- --source bn \
+ --source es \
--target en \
- --corpus $INDIAN/bn-en/tok/training.bn-en \
- --corpus $INDIAN/bn-en/tok/dict.bn-en \
- --tune $INDIAN/bn-en/tok/dev.bn-en \
- --test $INDIAN/bn-en/tok/devtest.bn-en \
+ --type hiero \
+ --corpus $FISHER/corpus/ldc/fisher_train \
+ --tune $FISHER/corpus/ldc/fisher_dev \
+ --test $FISHER/corpus/ldc/fisher_dev2 \
+ --maxlen 10 \
--lm-order 3
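+The `--maxlen 10` flag drops any training pair where either side exceeds ten tokens. As
+a rough sketch of that effect on a pre-tokenized bitext (illustrative only, not the
+pipeline's internal filtering code):
+
+    # Keep pair i only if both the source and target lines have <= 10 tokens.
+    paste corpus.es corpus.en | awk -F'\t' '
+        { n_src = split($1, a, " "); n_tgt = split($2, b, " ") }
+        n_src <= 10 && n_tgt <= 10 { print }
+    ' > filtered.es-en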
-This will start the pipeline building a Bengali--English translation system constructed from the
+This will start the pipeline building a Spanish--English translation system constructed from the
training data, tuned against `fisher_dev`, and tested against `fisher_dev2`. It will use the
default values for most of the pipeline: [GIZA++](https://code.google.com/p/giza-pp/) for alignment,
KenLM's `lmplz` for building the language model, Z-MERT for tuning, KenLM with left-state
@@ -113,7 +125,7 @@ of the baseline model. Here are some examples of what you could vary:
- Decode with a wider beam (`--joshua-args '-pop-limit 200'`) (the default is 100)
-- Add the provided BN-EN dictionary to the training data (add another `--corpus` line, e.g., `--corpus $INDIAN/bn-en/dict.bn-en`)
+- Add more parallel training data (add another `--corpus` line pointing at an additional corpus)
To do this, we will create new runs that partially reuse the results of previous runs. This is
possible by doing two things: (1) incrementing the run directory and providing an updated README
@@ -130,9 +142,9 @@ directory, tell the pipeline to start at the tuning step, and provide the needed
--readme "Tuning with MIRA" \
--source es \
--target en \
- --corpus $INDIAN/bn-en/tok/training.bn-en \
- --tune $INDIAN/bn-en/tok/dev.bn-en \
- --test $INDIAN/bn-en/tok/devtest.bn-en \
+ --corpus $FISHER/corpus/ldc/fisher_train \
+ --tune $FISHER/corpus/ldc/fisher_dev \
+ --test $FISHER/corpus/ldc/fisher_dev2 \
--first-step tune \
--tuner mira \
--grammar 1/grammar.gz \
@@ -158,9 +170,9 @@ grammar, but can reuse the alignments and the language model:
--readme "Baseline SAMT model" \
--source es \
--target en \
- --corpus $INDIAN/bn-en/tok/training.bn-en \
- --tune $INDIAN/bn-en/tok/dev.bn-en \
- --test $INDIAN/bn-en/tok/devtest.bn-en \
+ --corpus $FISHER/corpus/ldc/fisher_train \
+ --tune $FISHER/corpus/ldc/fisher_dev \
+ --test $FISHER/corpus/ldc/fisher_dev2 \
--alignment 1/alignments/training.align \
--first-step parse \
--no-corpus-lm \
http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/5b80a147/_layouts/default6.html
----------------------------------------------------------------------
diff --git a/_layouts/default6.html b/_layouts/default6.html
index 63a8adf..3d19a7b 100644
--- a/_layouts/default6.html
+++ b/_layouts/default6.html
@@ -34,11 +34,6 @@
<div class="container">
- <!-- <div class="blog-header"> -->
- <!-- <h1 class="blog-title">Joshua</h1> -->
- <!-- <\!-- <p class="lead blog-description">The Joshua machine translation system</p> -\-> -->
- <!-- </div> -->
-
<div class="row">
<div class="col-sm-2">
@@ -65,7 +60,6 @@
<ol class="list-unstyled">
<li><a href="/6.0/install.html">Installation</a></li>
<li><a href="/6.0/quick-start.html">Quick Start</a></li>
- <li><a href="/6.0/faq.html">FAQ</a></li>
</ol>
</div>
<hr>
@@ -73,6 +67,7 @@
<h4>Building new models</h4>
<ol class="list-unstyled">
<li><a href="/6.0/pipeline.html">Pipeline</a></li>
+ <li><a href="/6.0/tutorial.html">Tutorial</a></li>
<li><a href="/6.0/faq.html">FAQ</a></li>
</ol>
</div>