Posted to commits@joshua.apache.org by mj...@apache.org on 2016/04/05 14:44:00 UTC
[13/50] incubator-joshua-site git commit: Updated tutorial
Updated tutorial
Project: http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/commit/5b80a147
Tree: http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/tree/5b80a147
Diff: http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/diff/5b80a147
Branch: refs/heads/asf-site
Commit: 5b80a14748fab4b9ad6ee331c90e9d3926b3ae7c
Parents: dc45f50
Author: Matt Post <po...@cs.jhu.edu>
Authored: Wed Jun 10 00:10:29 2015 -0400
Committer: Matt Post <po...@cs.jhu.edu>
Committed: Wed Jun 10 00:10:29 2015 -0400
----------------------------------------------------------------------
6.0/tutorial.md | 96 +++++++++++++++++++++++++--------------------
_layouts/default6.html | 7 +---
2 files changed, 55 insertions(+), 48 deletions(-)
----------------------------------------------------------------------
http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/5b80a147/6.0/tutorial.md
----------------------------------------------------------------------
diff --git a/6.0/tutorial.md b/6.0/tutorial.md
index 9a43e93..d167cdc 100644
--- a/6.0/tutorial.md
+++ b/6.0/tutorial.md
@@ -13,75 +13,87 @@ other options available in the pipeline.
Download and install Joshua as described on the [quick start page](index.html), installing it under
`~/code/`. Once you've done that, you should make sure you have the following environment variable set:
- export JOSHUA=$HOME/code/joshua-v5.0
+ export JOSHUA=$HOME/code/joshua-v{{ site.data.joshua.release_version }}
export JAVA_HOME=/usr/java/default
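+Once both variables are exported, a quick sanity check is to confirm they point at real
+directories. This is only an illustrative snippet, not part of Joshua:
+
+    # Verify the required environment variables point at real directories.
+    for var in JOSHUA JAVA_HOME; do
+        val=$(eval "echo \$$var")    # portable indirect expansion
+        if [ -d "$val" ]; then
+            echo "$var=$val (ok)"
+        else
+            echo "warning: $var=$val is not a directory" >&2
+        fi
+    done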
-If you have a Hadoop installation, make sure you've set `$HADOOP` to point to it (if not, Joshua
-will roll out a standalone cluster for you). If you'd like to use kbmira for tuning, you should also
-install Moses, and define the environment variable `$MOSES` to point to the root of its installation.
+If you have a Hadoop installation, make sure you've set `$HADOOP` to point to it. For example, if the `hadoop` command is in `/usr/bin`,
+you should type
+
+ export HADOOP=/usr
+
+Joshua will find the binary and use it to submit jobs to your Hadoop cluster. If you don't have one, just
+make sure that `$HADOOP` is unset, and Joshua will roll one out for you and run it in
+[standalone mode](https://hadoop.apache.org/docs/r1.2.1/single_node_setup.html).
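+The lookup amounts to checking for an executable under `$HADOOP`; as a rough sketch of
+the logic (illustrative only, not Joshua's actual code):
+
+    # Resolve a Hadoop binary from $HADOOP, else fall back to standalone mode.
+    if [ -n "$HADOOP" ] && [ -x "$HADOOP/bin/hadoop" ]; then
+        echo "submitting jobs via $HADOOP/bin/hadoop"
+    else
+        echo "\$HADOOP unset or invalid; Hadoop will run in standalone mode"
+    fi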
## A basic pipeline run
-For today's experiments, we'll be building a Bengali--English system using data included in the
-[Indian Languages Parallel Corpora](/indian-parallel-corpora/). This data was collected by taking
-the 100 most-popular Bengali Wikipedia pages and translating them into English using Amazon's
-[Mechanical Turk](http://www.mturk.com/). As a warning, many of these pages contain material that is
-not typically found in machine translation tutorials.
+For today's experiments, we'll be building a Spanish--English system using data included in the
+[Fisher and CALLHOME translation corpus](/data/fisher-callhome-corpus/). This
+data was collected by translating transcribed speech from previous LDC releases.
Download the data and install it somewhere:
cd ~/data
- wget -q --no-check -O indian-parallel-corpora.zip https://github.com/joshua-decoder/indian-parallel-corpora/archive/master.zip
- unzip indian-parallel-corpora.zip
+ wget --no-check-certificate -O fisher-spanish-corpus.zip https://github.com/joshua-decoder/fisher-callhome-corpus/archive/master.zip
+ unzip fisher-spanish-corpus.zip
-Then define the environment variable `$INDIAN` to point to it:
+Then define the environment variable `$FISHER` to point to it:
- cd ~/data/indian-parallel-corpora-master
- export INDIAN=$(pwd)
+ cd ~/data/fisher-spanish-corpus-master
+ export FISHER=$(pwd)
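+Before moving on, it's worth confirming the variable points at the unpacked corpus. A
+minimal check, assuming the top-level `corpus/` subdirectory (adjust if your unzip
+produced a different layout):
+
+    # Quick check that $FISHER points at the corpus root (illustrative).
+    if [ -d "$FISHER/corpus" ]; then
+        echo "corpus root: $FISHER"
+    else
+        echo "warning: \$FISHER/corpus not found; check the unzip directory name" >&2
+    fi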
### Preparing the data
-Inside this tarball is a directory for each language pair. Within each language directory is another
-directory named `tok/`, which contains pre-tokenized and normalized versions of the data. This was
-done because the normalization scripts provided with Joshua is written in scripting languages that
-often have problems properly handling UTF-8 character sets. We will be using these tokenized
-versions, and preventing the pipeline from retokenizing using the `--no-prepare` flag.
+Inside the tarball is the Fisher and CALLHOME Spanish--English data, which includes Kaldi-provided
+ASR output and English translations of the Fisher and CALLHOME dataset transcriptions. Because of
+licensing restrictions, we cannot distribute the Spanish transcripts, but if you have an LDC site
+license, a script is provided to build them. You can type:
+
+ ./bin/build_fisher.sh /export/common/data/corpora/LDC/LDC2010T04
-In `$INDIAN/bn-en/tok`, you should see the following files:
+where the first argument is the path to your LDC data release. This will create the files in `corpus/ldc`.
- $ ls $INDIAN/bn-en/tok
- dev.bn-en.bn devtest.bn-en.bn dict.bn-en.bn test.bn-en.en.2
- dev.bn-en.en.0 devtest.bn-en.en.0 dict.bn-en.en test.bn-en.en.3
- dev.bn-en.en.1 devtest.bn-en.en.1 test.bn-en.bn training.bn-en.bn
- dev.bn-en.en.2 devtest.bn-en.en.2 test.bn-en.en.0 training.bn-en.en
- dev.bn-en.en.3 devtest.bn-en.en.3 test.bn-en.en.1
+In `$FISHER/corpus`, there is a set of parallel directories for LDC transcripts (`ldc`), ASR output
+(`asr`), oracle ASR output (`oracle`), and ASR lattice output (`plf`). The files look like this:
-We will now use this data to test the complete pipeline with a single command.
+ $ ls corpus/ldc
+ callhome_devtest.en fisher_dev2.en.2 fisher_dev.en.2 fisher_test.en.2
+ callhome_evltest.en fisher_dev2.en.3 fisher_dev.en.3 fisher_test.en.3
+ callhome_train.en fisher_dev2.es fisher_dev.es fisher_test.es
+ fisher_dev2.en.0 fisher_dev.en.0 fisher_test.en.0 fisher_train.en
+ fisher_dev2.en.1 fisher_dev.en.1 fisher_test.en.1 fisher_train.es
+
+If you don't have the LDC transcripts, you can use the data in `corpus/asr` instead. We will now use
+this data to build our own Spanish--English model using Joshua's pipeline.
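+Parallel data is only usable if the two sides line up sentence-for-sentence. A quick
+check (file names taken from the listing above; run from `$FISHER`) is to compare line
+counts:
+
+    # Check that the two sides of the training bitext are parallel
+    # (equal line counts). Run from the corpus root, e.g. $FISHER.
+    src=corpus/ldc/fisher_train.es
+    tgt=corpus/ldc/fisher_train.en
+    src_lines=$(wc -l < "$src")
+    tgt_lines=$(wc -l < "$tgt")
+    if [ "$src_lines" -eq "$tgt_lines" ]; then
+        echo "parallel: $src_lines sentence pairs"
+    else
+        echo "line-count mismatch: $src=$src_lines $tgt=$tgt_lines" >&2
+    fi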
### Run the pipeline
-Create an experiments directory for containing your first experiment:
+Create an experiments directory to contain your first experiment. *Note: it's important that
+this **not** be inside your `$JOSHUA` directory*.
mkdir ~/expts/joshua
cd ~/expts/joshua
We will now create the baseline run, using a particular directory structure for experiments that
will allow us to take advantage of scripts provided with Joshua for displaying the results of many
-related experiments.
+related experiments. Because this can take quite some time to run, we are going to add a crippling
+restriction: Joshua will only use sentences in the training sets with ten or fewer words on either
+side (Spanish or English):
cd ~/expts/joshua
$JOSHUA/bin/pipeline.pl \
--rundir 1 \
--readme "Baseline Hiero run" \
- --source bn \
+ --source es \
--target en \
- --corpus $INDIAN/bn-en/tok/training.bn-en \
- --corpus $INDIAN/bn-en/tok/dict.bn-en \
- --tune $INDIAN/bn-en/tok/dev.bn-en \
- --test $INDIAN/bn-en/tok/devtest.bn-en \
+ --type hiero \
+ --corpus $FISHER/corpus/ldc/fisher_train \
+ --tune $FISHER/corpus/ldc/fisher_dev \
+ --test $FISHER/corpus/ldc/fisher_dev2 \
+ --maxlen 10 \
--lm-order 3
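+The `--maxlen 10` flag drops any training pair where either side exceeds ten tokens. As
+a rough sketch of that effect on a pre-tokenized bitext (illustrative only, not the
+pipeline's internal filtering code):
+
+    # Keep pair i only if both the source and target lines have <= 10 tokens.
+    paste corpus.es corpus.en | awk -F'\t' '
+        { n_src = split($1, a, " "); n_tgt = split($2, b, " ") }
+        n_src <= 10 && n_tgt <= 10 { print }
+    ' > filtered.es-en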
-This will start the pipeline building a Bengali--English translation system constructed from the
+This will start the pipeline building a Spanish--English translation system constructed from the
training data, tuned against `fisher_dev`, and tested against `fisher_dev2`. It will use the
default values for most of the pipeline: [GIZA++](https://code.google.com/p/giza-pp/) for alignment,
KenLM's `lmplz` for building the language model, Z-MERT for tuning, KenLM with left-state
@@ -113,7 +125,7 @@ of the baseline model. Here are some examples of what you could vary:
- Decode with a wider beam (`--joshua-args '-pop-limit 200'`) (the default is 100)
-- Add the provided BN-EN dictionary to the training data (add another `--corpus` line, e.g., `--corpus $INDIAN/bn-en/dict.bn-en`)
+- Add more parallel training data (add another `--corpus` line pointing at an additional corpus)
To do this, we will create new runs that partially reuse the results of previous runs. This is
possible by doing two things: (1) incrementing the run directory and providing an updated README
@@ -130,9 +142,9 @@ directory, tell the pipeline to start at the tuning step, and provide the needed
--readme "Tuning with MIRA" \
--source es \
--target en \
- --corpus $INDIAN/bn-en/tok/training.bn-en \
- --tune $INDIAN/bn-en/tok/dev.bn-en \
- --test $INDIAN/bn-en/tok/devtest.bn-en \
+ --corpus $FISHER/corpus/ldc/fisher_train \
+ --tune $FISHER/corpus/ldc/fisher_dev \
+ --test $FISHER/corpus/ldc/fisher_dev2 \
--first-step tune \
--tuner mira \
--grammar 1/grammar.gz \
@@ -158,9 +170,9 @@ grammar, but can reuse the alignments and the language model:
--readme "Baseline SAMT model" \
--source es \
--target en \
- --corpus $INDIAN/bn-en/tok/training.bn-en \
- --tune $INDIAN/bn-en/tok/dev.bn-en \
- --test $INDIAN/bn-en/tok/devtest.bn-en \
+ --corpus $FISHER/corpus/ldc/fisher_train \
+ --tune $FISHER/corpus/ldc/fisher_dev \
+ --test $FISHER/corpus/ldc/fisher_dev2 \
--alignment 1/alignments/training.align \
--first-step parse \
--no-corpus-lm \
http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/5b80a147/_layouts/default6.html
----------------------------------------------------------------------
diff --git a/_layouts/default6.html b/_layouts/default6.html
index 63a8adf..3d19a7b 100644
--- a/_layouts/default6.html
+++ b/_layouts/default6.html
@@ -34,11 +34,6 @@
<div class="container">
- <!-- <div class="blog-header"> -->
- <!-- <h1 class="blog-title">Joshua</h1> -->
- <!-- <\!-- <p class="lead blog-description">The Joshua machine translation system</p> -\-> -->
- <!-- </div> -->
-
<div class="row">
<div class="col-sm-2">
@@ -65,7 +60,6 @@
<ol class="list-unstyled">
<li><a href="/6.0/install.html">Installation</a></li>
<li><a href="/6.0/quick-start.html">Quick Start</a></li>
- <li><a href="/6.0/faq.html">FAQ</a></li>
</ol>
</div>
<hr>
@@ -73,6 +67,7 @@
<h4>Building new models</h4>
<ol class="list-unstyled">
<li><a href="/6.0/pipeline.html">Pipeline</a></li>
+ <li><a href="/6.0/tutorial.html">Tutorial</a></li>
<li><a href="/6.0/faq.html">FAQ</a></li>
</ol>
</div>