Posted to user@hive.apache.org by Andrew Hitchcock <ad...@gmail.com> on 2010/12/24 01:48:29 UTC

mapreduce and google books n-grams

Hi all,

I'm excited to announce that Amazon Elastic MapReduce is now hosting
the Google Books n-gram dataset in Amazon S3. The data has been
converted to SequenceFile format to make it easy to process using
Hadoop. I spent some time this week playing with the data using Hive
and put together an article which demonstrates how easy it is to get
interesting results:

http://aws.amazon.com/articles/5249664154115844

I've included details about the public dataset at the bottom of this
e-mail. The original data came from here:

http://ngrams.googlelabs.com/datasets

I'm looking forward to seeing what the community does with this data.
Andrew



== What are n-grams? ==
N-grams are fixed-size tuples of items. In this case the items are
words extracted from the Google Books corpus. The n specifies the
number of elements in the tuple, so a 5-gram contains five
consecutive words.

The n-grams in this dataset were produced by sliding a fixed-size
window over the text of the books and outputting a record for each
window position. For example, the following sentence:

 The yellow dog played fetch.

would produce the following 2-grams:

 ["The", "yellow"]
 ["yellow", 'dog"]
 ["dog", "played"]
 ["played", "fetch"]
 ["fetch", "."]

or the following 3-grams:

 ["The", "yellow", "dog"]
 ["yellow", "dog", "played"]
 ["dog", "played", "fetch"]
 ["played", "fetch", "."]

You can aggregate equivalent n-grams to find the total number of
occurrences of that n-gram. This dataset contains counts of n-grams by
year along three axes: total occurrences, number of pages on which
they occur, and number of books in which they appear.
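
If you want to reproduce the extraction and counting steps on text of
your own, the following rough Java sketch shows the idea. The class
and method names (NgramSketch, ngrams, join) are placeholders of
mine, the tokenization is deliberately naive, and this is of course
not the code that was used to build the actual dataset.

 import java.util.ArrayList;
 import java.util.Arrays;
 import java.util.HashMap;
 import java.util.List;
 import java.util.Map;

 public class NgramSketch {

     // Slide a window of size n over the tokens and emit one
     // n-gram per window position.
     static List<List<String>> ngrams(List<String> tokens, int n) {
         List<List<String>> result = new ArrayList<List<String>>();
         for (int i = 0; i + n <= tokens.size(); i++) {
             result.add(tokens.subList(i, i + n));
         }
         return result;
     }

     // Space-separated representation, matching the n-gram field
     // in the dataset.
     static String join(List<String> gram) {
         StringBuilder sb = new StringBuilder();
         for (int i = 0; i < gram.size(); i++) {
             if (i > 0) sb.append(' ');
             sb.append(gram.get(i));
         }
         return sb.toString();
     }

     public static void main(String[] args) {
         // Toy tokenization; the real corpus treats punctuation as
         // separate tokens.
         List<String> tokens = Arrays.asList(
                 "The", "yellow", "dog", "played", "fetch", ".");

         // Aggregate equivalent n-grams into occurrence counts,
         // as described above.
         Map<String, Integer> counts = new HashMap<String, Integer>();
         for (List<String> gram : ngrams(tokens, 2)) {
             String key = join(gram);
             Integer c = counts.get(key);
             counts.put(key, c == null ? 1 : c + 1);
         }
         System.out.println(counts);
     }
 }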

== Dataset format ==
There are a number of different datasets available. Each dataset is a
single n-gram type (1-gram, 2-gram, etc.) for a given input corpus
(such as English or Russian text).

We store each dataset as a single object in Amazon S3. The file is in
SequenceFile format with block-level LZO compression. The sequence
file key is the row number of the dataset, stored as a LongWritable,
and the value is the raw data, stored as Text.

The value is a tab separated string containing the following fields:

 n-gram - The actual n-gram.
 year - The year for this aggregation.
 occurrences - The number of times this n-gram appeared in this year.
 pages - The number of pages this n-gram appeared on in this year.
 books - The number of books this n-gram appeared in during this year.

The n-gram field is a space-separated representation of the tuple,
for example:

 analysis is often described as   1991   1    1    1
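
If you want to inspect the records outside of Hive, something along
the following lines should work with the plain Hadoop APIs. This is
only a sketch: the class name DumpNgrams is mine, the input path is
taken from the command line (the S3 locations are listed in the next
section), and because the files use block-level LZO compression you
need an LZO codec such as hadoop-lzo on the classpath; Elastic
MapReduce provides one, but a stock Apache Hadoop install does not.

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.io.LongWritable;
 import org.apache.hadoop.io.SequenceFile;
 import org.apache.hadoop.io.Text;

 public class DumpNgrams {
     public static void main(String[] args) throws Exception {
         Configuration conf = new Configuration();
         Path path = new Path(args[0]);   // one of the dataset locations
         FileSystem fs = path.getFileSystem(conf);

         // Key is the row number, value is the tab-separated record
         // described above.
         SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
         LongWritable key = new LongWritable();
         Text value = new Text();
         try {
             while (reader.next(key, value)) {
                 // n-gram \t year \t occurrences \t pages \t books
                 String[] fields = value.toString().split("\t");
                 System.out.println(fields[0] + " (" + fields[1] + "): "
                         + fields[2] + " occurrences");
             }
         } finally {
             reader.close();
         }
     }
 }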

== Available Datasets ==
Not every dataset has been released yet, but those that were complete
at the time of writing are available. Here are the names of the
available corpora and their abbreviations:

 English - eng-all
 English One Million - eng-1M
 American English - eng-us-all
 British English - eng-gb-all
 English Fiction - eng-fiction-all
 Chinese (simplified) - chi-sim-all
 French - fre-all
 German - ger-all
 Russian - rus-all
 Spanish - spa-all

Within each corpus there are up to five datasets, one for each n-gram
length from one to five. They can be found in Amazon S3 at the
following location:

 s3://datasets.elasticmapreduce/ngrams/books/20090715/<corpus>/<n>gram/data

For example, you can find the American English 1-grams at the
following location:

 s3://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram/data

 NOTE: These datasets are hosted in the us-east-1 region. If you
process these from other regions you will be charged data transfer
fees.
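
To consume one of these datasets from your own MapReduce job rather
than from Hive, you can point a SequenceFileInputFormat at the
location. The following is only a sketch: the class names NgramJob
and NgramMapper are mine, the mapper simply re-emits each n-gram with
its occurrence count, and outside of Elastic MapReduce you may need
the s3n:// scheme plus AWS credentials in place of the plain s3://
URI.

 import java.io.IOException;

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.io.LongWritable;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapreduce.Job;
 import org.apache.hadoop.mapreduce.Mapper;
 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
 import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
 import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

 public class NgramJob {

     // Emits (n-gram, occurrences) for every record; a real job would
     // filter or aggregate here instead.
     public static class NgramMapper
             extends Mapper<LongWritable, Text, Text, LongWritable> {
         @Override
         protected void map(LongWritable row, Text record, Context context)
                 throws IOException, InterruptedException {
             String[] fields = record.toString().split("\t");
             context.write(new Text(fields[0]),
                     new LongWritable(Long.parseLong(fields[2])));
         }
     }

     public static void main(String[] args) throws Exception {
         Configuration conf = new Configuration();
         Job job = new Job(conf, "ngram-sketch");
         job.setJarByClass(NgramJob.class);

         // The datasets are sequence files, so use SequenceFileInputFormat
         // and read them directly from the public S3 location.
         job.setInputFormatClass(SequenceFileInputFormat.class);
         FileInputFormat.addInputPath(job, new Path(
             "s3://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram/data"));

         job.setMapperClass(NgramMapper.class);
         job.setNumReduceTasks(0);   // map-only for this sketch
         job.setOutputKeyClass(Text.class);
         job.setOutputValueClass(LongWritable.class);

         job.setOutputFormatClass(TextOutputFormat.class);
         FileOutputFormat.setOutputPath(job, new Path(args[0]));

         System.exit(job.waitForCompletion(true) ? 0 : 1);
     }
 }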

== Dataset statistics ==
This table contains information about all available datasets.

Data                 Rows             Compressed Size
English
 1 gram              472,764,897      4.8 GB
English One Million
 1 gram              261,823,186      2.6 GB
American English
 1 gram              291,639,822      3.0 GB
 2 gram              3,923,370,881    38.3 GB
British English
 1 gram              188,660,459      1.9 GB
 2 gram              2,000,106,933    19.1 GB
 3 gram              5,186,054,851    46.8 GB
 4 gram              5,325,077,699    46.6 GB
 5 gram              3,044,234,000    26.4 GB
English Fiction
 1 gram              191,545,012      2.0 GB
 2 gram              2,516,249,717    24.3 GB
Chinese
 1 gram              7,741,178        0.1 GB
 2 gram              209,624,705      2.2 GB
 3 gram              701,822,863      7.2 GB
 4 gram              672,801,944      6.8 GB
 5 gram              325,089,783      3.4 GB
French
 1 gram              157,551,172      1.6 GB
 2 gram              1,501,278,596    14.3 GB
 3 gram              4,124,079,420    37.3 GB
 4 gram              4,659,423,581    41.2 GB
 5 gram              3,251,347,768    28.8 GB
German
 1 gram              243,571,225      2.5 GB
 2 gram              1,939,436,935    18.3 GB
 3 gram              3,417,271,319    30.9 GB
 4 gram              2,488,516,783    21.9 GB
 5 gram              1,015,287,248    8.9 GB
Russian
 1 gram              238,494,121      2.5 GB
 2 gram              2,030,955,601    20.2 GB
 3 gram              2,707,065,011    25.8 GB
 4 gram              1,716,983,092    16.1 GB
 5 gram              800,258,450      7.6 GB
Spanish
 1 gram              164,009,433      1.7 GB
 2 gram              1,580,350,088    15.2 GB
 5 gram              2,013,934,820    18.1 GB