Posted to common-user@hadoop.apache.org by John Howland <jo...@gmail.com> on 2008/09/08 01:22:06 UTC

Getting started questions

I've been reading up on Hadoop for a while now and I'm excited that I'm
finally getting my feet wet with the examples + my own variations. If anyone
could answer any of the following questions, I'd greatly appreciate it.

1. I'm processing document collections, with the number of documents ranging
from 10,000 - 10,000,000. What is the best way to store this data for
effective processing?

 - The bodies of the documents usually range from 1K-100KB in size, but some
outliers can be as big as 4-5GB.
 - I will also need to store some metadata for each document which I figure
could be stored as JSON or XML.
 - I'll typically filter on the metadata and then do standard operations
on the bodies, like word frequency and searching.

Is there a canned FileInputFormat that makes sense? Should I roll my own?
How can I access the bodies as streams so I don't have to read them into RAM
all at once? Am I right in thinking that I should treat each document as a
record and map across them, or do I need to be more creative in what I'm
mapping across?

2. Some of the tasks I want to run are pure map operations (no reduction),
where I'm calculating new metadata fields on each document. To end up with a
good result set, I'll need to copy the entire input record + new fields into
another set of output files. Is there a better way? I haven't wanted to go
down the HBase road because it can't handle very large values (for the
bodies) and it seems to make the most sense to keep the document bodies
together with the metadata, to allow for the greatest locality of reference
on the datanodes.

3. I'm sure this is not a new idea, but I haven't seen anything regarding
it... I'll need to run several MR jobs as a pipeline... is there any way for
the map tasks in a subsequent stage to begin processing data from previous
stage's reduce task before that reducer has fully finished?

Whatever insight folks could lend me would be a big help in crossing the
chasm from the Word Count and associated examples to something more "real".
A whole heap of thanks in advance,

John

Re: Getting started questions

Posted by John Howland <jo...@gmail.com>.
Dennis,

Thanks for the detailed response. I need to play with the SequenceFile
format a bit -- I found the documentation for it on the wiki. I think
I could build on top of the format to handle storage of very large
documents. The vast majority of documents will fit into RAM and in a
standard HDFS block (64MB, maybe up it to 128MB). For very large
documents, I can split them into consecutive records in the
SequenceFile. I can overload the key to be a combination of a "real"
key and a record number... Shouldn't be too hard to extend
SequenceFile to do this.
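
Roughly what I have in mind for the composite key, as a sketch only
(the class and field names are made up):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

// Composite key: the document's "real" id plus the index of the chunk
// within that document, so a huge body can span consecutive records.
public class DocChunkKey implements WritableComparable<DocChunkKey> {
  private Text docId = new Text();
  private int chunk;

  public DocChunkKey() {}            // needed by the Writable machinery

  public DocChunkKey(String id, int chunk) {
    this.docId.set(id);
    this.chunk = chunk;
  }

  public void write(DataOutput out) throws IOException {
    docId.write(out);
    out.writeInt(chunk);
  }

  public void readFields(DataInput in) throws IOException {
    docId.readFields(in);
    chunk = in.readInt();
  }

  // Order by document id first, then chunk number, so a split
  // document's records stay consecutive in a sorted file.
  public int compareTo(DocChunkKey other) {
    int cmp = docId.compareTo(other.docId);
    if (cmp != 0) return cmp;
    return chunk < other.chunk ? -1 : (chunk == other.chunk ? 0 : 1);
  }

  public int hashCode() {
    return docId.hashCode() * 31 + chunk;
  }

  public boolean equals(Object o) {
    if (!(o instanceof DocChunkKey)) return false;
    DocChunkKey k = (DocChunkKey) o;
    return chunk == k.chunk && docId.equals(k.docId);
  }
}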

Much obliged,

John

Re: Getting started questions

Posted by Sagar Naik <sn...@attributor.com>.
Dennis Kubes wrote:
>
>
> John Howland wrote:
>> I've been reading up on Hadoop for a while now and I'm excited that I'm
>> finally getting my feet wet with the examples + my own variations. If 
>> anyone
>> could answer any of the following questions, I'd greatly appreciate it.
>>
>> 1. I'm processing document collections, with the number of documents 
>> ranging
>> from 10,000 - 10,000,000. What is the best way to store this data for
>> effective processing?
>
> AFAIK hadoop doesn't do well with, although it can handle, a large 
> number of small files.  So it would be better to read in the documents 
> and store them in SequenceFile or MapFile format.  This would be 
> similar to the way the Fetcher works in Nutch.  10M documents in a 
> sequence/map file on DFS is comparatively small and can be handled 
> efficiently.
>
>>
>>  - The bodies of the documents usually range from 1K-100KB in size, 
>> but some
>> outliers can be as big as 4-5GB.
>
> I would say store your document objects as Text objects, not sure if 
> Text has a max size.  I think it does but not sure what that is.  If 
> it does you can always store as a BytesWritable which is just an array 
> of bytes.  But you are going to have memory issues reading in and 
> writing out that large of a record.
>>  - I will also need to store some metadata for each document which I 
>> figure
>> could be stored as JSON or XML.
>>  - I'll typically filter on the metadata and then doing standard 
>> operations
>> on the bodies, like word frequency and searching.
>
> It is possible to create an OutputFormat that writes out multiple 
> files.  You could also use a MapWritable as the value to store the 
> document and associated metadata.
>
>>
>> Is there a canned FileInputFormat that makes sense? Should I roll my 
>> own?
>> How can I access the bodies as streams so I don't have to read them 
>> into RAM
>
> A writable is read into RAM so even treating it like a stream doesn't 
> get around that.
>
> One thing you might want to consider is to  tar up say X documents at 
> a time and store that as a file in DFS.  You would have many of these 
> files.  Then have an index that has the offsets of the files and their 
> keys (document ids).  That index can be passed as input into a MR job 
> that can then go to DFS and stream out the file as you need it.  The 
> job will be slower because you are doing it this way but it is a 
> solution to handling such large documents as streams.
>
>> all at once? Am I right in thinking that I should treat each document 
>> as a
>> record and map across them, or do I need to be more creative in what I'm
>> mapping across?
>>
>> 2. Some of the tasks I want to run are pure map operations (no 
>> reduction),
>> where I'm calculating new metadata fields on each document. To end up 
>> with a
>> good result set, I'll need to copy the entire input record + new 
>> fields into
>> another set of output files. Is there a better way? I haven't wanted 
>> to go
>> down the HBase road because it can't handle very large values (for the
>> bodies) and it seems to make the most sense to keep the document bodies
>> together with the metadata, to allow for the greatest locality of 
>> reference
>> on the datanodes.
>
> If you don't specify a reducer, the IdentityReducer is run which 
> simply passes through output.
    One can set the number of reducers to zero and the reduce phase will 
not take place.
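
For example (a rough sketch with the old mapred API; the identity
mapper is just a stand-in for whatever map you actually write, and the
SequenceFile is assumed to hold Text keys and values):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class MapOnlyJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(MapOnlyJob.class);
    conf.setJobName("map-only");

    conf.setInputFormat(SequenceFileInputFormat.class);
    conf.setOutputFormat(SequenceFileOutputFormat.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);

    // IdentityMapper is a placeholder; plug in the real per-document map.
    conf.setMapperClass(IdentityMapper.class);
    conf.setNumReduceTasks(0);   // no sort, shuffle, or reduce phase at all

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}
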
>
>>
>> 3. I'm sure this is not a new idea, but I haven't seen anything 
>> regarding
>> it... I'll need to run several MR jobs as a pipeline... is there any 
>> way for
>> the map tasks in a subsequent stage to begin processing data from 
>> previous
>> stage's reduce task before that reducer has fully finished?
>
> Yup, just use FileOutputFormat.getOutputPath(previousJobConf);
>
> Dennis
>>
>> Whatever insight folks could lend me would be a big help in crossing the
>> chasm from the Word Count and associated examples to something more 
>> "real".
>> A whole heap of thanks in advance,
>>
>> John
>>


Re: Getting started questions

Posted by Dennis Kubes <ku...@apache.org>.

John Howland wrote:
> I've been reading up on Hadoop for a while now and I'm excited that I'm
> finally getting my feet wet with the examples + my own variations. If anyone
> could answer any of the following questions, I'd greatly appreciate it.
> 
> 1. I'm processing document collections, with the number of documents ranging
> from 10,000 - 10,000,000. What is the best way to store this data for
> effective processing?

AFAIK Hadoop doesn't do well with a large number of small files, 
although it can handle them.  So it would be better to read in the 
documents and store them in SequenceFile or MapFile format.  This would 
be similar to the way the Fetcher works in Nutch.  10M documents in a 
sequence/map file on DFS is comparatively small and can be handled 
efficiently.

> 
>  - The bodies of the documents usually range from 1K-100KB in size, but some
> outliers can be as big as 4-5GB.

I would say store your documents as Text objects; I'm not sure if Text 
has a max size.  I think it does, but I'm not sure what it is.  If it 
does, you can always store them as a BytesWritable, which is just an 
array of bytes.  But you are going to have memory issues reading in and 
writing out records that large.
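
Something along these lines would do the initial load (just a sketch;
the local directory walk and the file-name-as-key scheme are made up):

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Packs a local directory of documents into one SequenceFile on DFS:
// key = file name (document id), value = raw bytes of the body.
public class DocsToSequenceFile {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path(args[1]), Text.class, BytesWritable.class);
    try {
      for (File f : new File(args[0]).listFiles()) {
        writer.append(new Text(f.getName()), new BytesWritable(readAll(f)));
      }
    } finally {
      writer.close();
    }
  }

  // Read a whole local file into memory (fine for the common small docs).
  private static byte[] readAll(File f) throws IOException {
    byte[] buf = new byte[(int) f.length()];
    FileInputStream in = new FileInputStream(f);
    try {
      int off = 0;
      while (off < buf.length) {
        int n = in.read(buf, off, buf.length - off);
        if (n < 0) throw new IOException("unexpected EOF in " + f);
        off += n;
      }
    } finally {
      in.close();
    }
    return buf;
  }
}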

>  - I will also need to store some metadata for each document which I figure
> could be stored as JSON or XML.
>  - I'll typically filter on the metadata and then doing standard operations
> on the bodies, like word frequency and searching.

It is possible to create an OutputFormat that writes out multiple 
files.  You could also use a MapWritable as the value to store the 
document and associated metadata.
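
For example, something like this for the value (the metadata field
names are made up):

import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;

public class DocValue {
  // Bundle the body and a few metadata fields into a single MapWritable
  // that can be the value in the SequenceFile.
  public static MapWritable wrap(String id, String body, String lang) {
    MapWritable doc = new MapWritable();
    doc.put(new Text("id"), new Text(id));
    doc.put(new Text("body"), new Text(body));
    doc.put(new Text("lang"), new Text(lang));
    return doc;
  }

  // In a mapper you pull fields back out the same way.
  public static Text body(MapWritable doc) {
    return (Text) doc.get(new Text("body"));
  }
}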

> 
> Is there a canned FileInputFormat that makes sense? Should I roll my own?
> How can I access the bodies as streams so I don't have to read them into RAM

A Writable is read into RAM, so even treating it like a stream doesn't 
get around that.

One thing you might want to consider is to tar up, say, X documents at 
a time and store that as a file in DFS.  You would have many of these 
files.  Then have an index that has the offsets of the documents within 
those files and their keys (document ids).  That index can be passed as 
input into an MR job, which can then go to DFS and stream out the files 
as you need them.  The job will be slower this way, but it is a 
solution for handling such large documents as streams.
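
A rough sketch of the mapper side of that, assuming the index is a text
file with one tab-separated line per document of the form
"docId <TAB> archivePath <TAB> offset <TAB> length" (that layout is
made up for illustration):

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Each input record is one index line.  The mapper seeks into the big
// archive file on DFS and streams the body a chunk at a time, here just
// counting bytes as a stand-in for real processing.
public class StreamingDocMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {

  private FileSystem fs;

  public void configure(JobConf job) {
    try {
      fs = FileSystem.get(job);
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  public void map(LongWritable key, Text line,
                  OutputCollector<Text, LongWritable> output,
                  Reporter reporter) throws IOException {
    String[] fields = line.toString().split("\t");
    String docId = fields[0];
    Path archive = new Path(fields[1]);
    long offset = Long.parseLong(fields[2]);
    long length = Long.parseLong(fields[3]);

    FSDataInputStream in = fs.open(archive);
    try {
      in.seek(offset);
      byte[] buf = new byte[64 * 1024];
      long remaining = length, seen = 0;
      while (remaining > 0) {
        int n = in.read(buf, 0, (int) Math.min(buf.length, remaining));
        if (n < 0) break;
        // process the chunk here instead of just counting it
        seen += n;
        remaining -= n;
      }
      output.collect(new Text(docId), new LongWritable(seen));
    } finally {
      in.close();
    }
  }
}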

> all at once? Am I right in thinking that I should treat each document as a
> record and map across them, or do I need to be more creative in what I'm
> mapping across?
> 
> 2. Some of the tasks I want to run are pure map operations (no reduction),
> where I'm calculating new metadata fields on each document. To end up with a
> good result set, I'll need to copy the entire input record + new fields into
> another set of output files. Is there a better way? I haven't wanted to go
> down the HBase road because it can't handle very large values (for the
> bodies) and it seems to make the most sense to keep the document bodies
> together with the metadata, to allow for the greatest locality of reference
> on the datanodes.

If you don't specify a reducer, the IdentityReducer is run, which 
simply passes the map output through.

> 
> 3. I'm sure this is not a new idea, but I haven't seen anything regarding
> it... I'll need to run several MR jobs as a pipeline... is there any way for
> the map tasks in a subsequent stage to begin processing data from previous
> stage's reduce task before that reducer has fully finished?

Yup, just use FileOutputFormat.getOutputPath(previousJobConf);
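
That is, feed the first job's output path in as the second job's input.
A minimal sketch of the wiring (both jobs keep the default
mapper/reducer just to show the chaining; note JobClient.runJob blocks,
so in practice the second job's maps start after the first job has
completely finished):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class TwoStagePipeline {
  public static void main(String[] args) throws Exception {
    // Stage 1: mapper/reducer and formats would be configured here.
    JobConf first = new JobConf(TwoStagePipeline.class);
    first.setJobName("stage-1");
    FileInputFormat.setInputPaths(first, new Path(args[0]));
    FileOutputFormat.setOutputPath(first, new Path(args[1]));
    JobClient.runJob(first);   // blocks until stage 1 is done

    // Stage 2: its input is exactly stage 1's output path.
    JobConf second = new JobConf(TwoStagePipeline.class);
    second.setJobName("stage-2");
    FileInputFormat.setInputPaths(second,
        FileOutputFormat.getOutputPath(first));
    FileOutputFormat.setOutputPath(second, new Path(args[2]));
    JobClient.runJob(second);
  }
}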

Dennis
> 
> Whatever insight folks could lend me would be a big help in crossing the
> chasm from the Word Count and associated examples to something more "real".
> A whole heap of thanks in advance,
> 
> John
>