Posted to user@hive.apache.org by Andrew Hitchcock <ad...@gmail.com> on 2010/12/24 01:48:29 UTC
mapreduce and google books n-grams
Hi all,
I'm excited to announce that Amazon Elastic MapReduce is now hosting
the Google Books n-gram dataset in Amazon S3. The data has been
converted to SequenceFile format to make it easy to process using
Hadoop. I spent some time this week playing with the data using Hive
and put together an article which demonstrates how easy it is to get
interesting results:
http://aws.amazon.com/articles/5249664154115844
I've included details about the public dataset at the bottom of this
e-mail. The original data came from here:
http://ngrams.googlelabs.com/datasets
I'm looking forward to seeing what the community does with this data.
Andrew
== What are n-grams? ==
N-grams are fixed-size tuples of items. In this case the items are
words extracted from the Google Books corpus. The n specifies the
number of elements in the tuple, so a 5-gram contains five words or
characters.
The n-grams in this dataset were produced by passing a sliding window
over the text of books and outputting a record at each position of the
window. For example, the sentence:
The yellow dog played fetch.
would produce the following 2-grams:
["The", "yellow"]
["yellow", "dog"]
["dog", "played"]
["played", "fetch"]
["fetch", "."]
Or the following 3-grams:
["The", "yellow", "dog"]
["yellow", "dog", "played"]
["dog", "played", "fetch"]
["played", "fetch", "."]
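The sliding-window construction above can be sketched in a few lines of
Python (the function name is illustrative, not part of any dataset
tooling):

```python
def ngrams(tokens, n):
    """Slide a window of size n over a token list, one tuple per position."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["The", "yellow", "dog", "played", "fetch", "."]
print(ngrams(tokens, 2))  # the five 2-grams listed above
print(ngrams(tokens, 3))  # the four 3-grams listed above
```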
You can aggregate equivalent n-grams to find the total number of
occurrences of that n-gram. This dataset contains counts of n-grams by
year along three axes: total occurrences, number of pages on which
they occur, and number of books in which they appear.
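Aggregating equivalent n-grams is just counting tuples; a minimal
sketch using Python's standard-library Counter (the sample text is
mine):

```python
from collections import Counter

# Count 2-gram occurrences in a toy "corpus" of one sentence.
text = "the dog saw the dog".split()
bigrams = [tuple(text[i:i + 2]) for i in range(len(text) - 1)]
counts = Counter(bigrams)
print(counts[("the", "dog")])  # 2: "the dog" occurs twice
```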
== Dataset format ==
There are a number of different datasets available. Each dataset is a
single n-gram type (1-gram, 2-gram, etc.) for a given input corpus
(such as English or Russian text).
Each dataset is stored as a single object in Amazon S3. The file is in
SequenceFile format with block-level LZO compression. The SequenceFile
key is the row number of the dataset stored as a LongWritable, and the
value is the raw data stored as Text.
The value is a tab separated string containing the following fields:
n-gram - The actual n-gram.
year - The year for this aggregation.
occurrences - The number of times this n-gram appeared in this year.
pages - The number of pages this n-gram appeared on in this year.
books - The number of books this n-gram appeared in during this year.
The n-gram field is a space-separated representation of the tuple. For
example, a row for the 5-gram "analysis is often described as" in 1991
looks like this (fields separated by tabs):
analysis is often described as	1991	1	1	1
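A record in this layout can be split back into its fields with a short
helper (the function and field names here are mine, not part of the
dataset):

```python
def parse_row(value):
    """Split one tab-separated record into its five fields.

    The n-gram itself is space separated; the remaining fields are integers.
    """
    ngram, year, occurrences, pages, books = value.split("\t")
    return {
        "ngram": ngram.split(" "),
        "year": int(year),
        "occurrences": int(occurrences),
        "pages": int(pages),
        "books": int(books),
    }

row = parse_row("analysis is often described as\t1991\t1\t1\t1")
print(row["ngram"])  # ['analysis', 'is', 'often', 'described', 'as']
```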
== Available Datasets ==
The entire dataset hasn't been released yet, but those portions that
were complete as of the time of writing are available. Here are the
names of the available corpora and their abbreviations.
English - eng-all
English One Million - eng-1M
American English - eng-us-all
British English - eng-gb-all
English Fiction - eng-fiction-all
Chinese (simplified) - chi-sim-all
French - fre-all
German - ger-all
Russian - rus-all
Spanish - spa-all
Within each corpus there are up to five datasets, representing the
n-grams from length one to five. These can be found in Amazon S3 at
the following location:
s3://datasets.elasticmapreduce/ngrams/books/20090715/<corpus>/<n>gram/data
For example, you can find the American English 1-grams at the
following location:
s3://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram/data
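The location pattern can be captured in a small helper (illustrative
only; the bucket layout is exactly as given above):

```python
def ngram_dataset_path(corpus, n):
    """Build the S3 path for a given corpus abbreviation and n-gram length."""
    return ("s3://datasets.elasticmapreduce/ngrams/books/20090715/"
            "%s/%dgram/data" % (corpus, n))

print(ngram_dataset_path("eng-us-all", 1))
# s3://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram/data
```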
NOTE: These datasets are hosted in the us-east-1 region. If you
process these from other regions you will be charged data transfer
fees.
== Dataset statistics ==
This table contains information about all available datasets.
Data                   Rows            Compressed Size
English
  1 gram               472,764,897     4.8 GB
English One Million
  1 gram               261,823,186     2.6 GB
American English
  1 gram               291,639,822     3.0 GB
  2 gram               3,923,370,881   38.3 GB
British English
  1 gram               188,660,459     1.9 GB
  2 gram               2,000,106,933   19.1 GB
  3 gram               5,186,054,851   46.8 GB
  4 gram               5,325,077,699   46.6 GB
  5 gram               3,044,234,000   26.4 GB
English Fiction
  1 gram               191,545,012     2.0 GB
  2 gram               2,516,249,717   24.3 GB
Chinese
  1 gram               7,741,178       0.1 GB
  2 gram               209,624,705     2.2 GB
  3 gram               701,822,863     7.2 GB
  4 gram               672,801,944     6.8 GB
  5 gram               325,089,783     3.4 GB
French
  1 gram               157,551,172     1.6 GB
  2 gram               1,501,278,596   14.3 GB
  3 gram               4,124,079,420   37.3 GB
  4 gram               4,659,423,581   41.2 GB
  5 gram               3,251,347,768   28.8 GB
German
  1 gram               243,571,225     2.5 GB
  2 gram               1,939,436,935   18.3 GB
  3 gram               3,417,271,319   30.9 GB
  4 gram               2,488,516,783   21.9 GB
  5 gram               1,015,287,248   8.9 GB
Russian
  1 gram               238,494,121     2.5 GB
  2 gram               2,030,955,601   20.2 GB
  3 gram               2,707,065,011   25.8 GB
  4 gram               1,716,983,092   16.1 GB
  5 gram               800,258,450     7.6 GB
Spanish
  1 gram               164,009,433     1.7 GB
  2 gram               1,580,350,088   15.2 GB
  5 gram               2,013,934,820   18.1 GB