Posted to common-user@hadoop.apache.org by Colin Freas <co...@gmail.com> on 2008/03/24 21:35:49 UTC

MapReduce with related data from disparate files

I have a cluster of 5 machines up and accepting jobs, and I'm trying to work
out how to design my first MapReduce task for the data I have.

So, I wonder if anyone has any experience with the sort of problem I'm
trying to solve, and what the best ways are to use Hadoop and MapReduce
for it.

I have two sets of related comma-delimited files.  One is a set of unique
records with something like a primary key in one of the fields.  The other
is a set of records keyed to the first by that primary key.  It's basically
a request log, plus ancillary data about each request.  Something like:

file 1:
asdf, 1, 2, 5, 3 ...
qwer, 3, 6, 2, 7 ...
zxcv, 2, 3, 6, 4 ...

file 2:
asdf, 10, 3
asdf, 3, 2
asdf, 1, 3
zxcv, 3, 1

I basically need to flatten this mapping, and then perform some analysis on
the result.  I wrote a processing program that runs on a single machine to
create a map like this:

file 1-2:
asdf, 1, 2, 5, 3, ... 10, 3, 3, 2, 1, 3
qwer, 3, 6, 2, 7, ... , , , , ,
zxcv, 2, 3, 6, 4, ... , , 3, 1, ,

... where the "flattening" puts in blank values for missing ancillary data.
I then sample this map, taking some small number of entire records for
output, and extrapolate some statistics from the results.
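
For concreteness, the single-machine pass is roughly this (a simplified
sketch: it assumes the ancillary file fits in memory, and it just stands in
blank columns for keys with no ancillary rows, where the real program pads
to a fixed column count):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Flatten: args[0] is the request file, args[1] is the ancillary file.
// Load the ancillary file into memory keyed on the first field, then
// append each key's ancillary fields to its matching request record.
public class Flatten {
    public static void main(String[] args) throws IOException {
        Map<String, String> ancillary = new HashMap<String, String>();
        BufferedReader dataFile = new BufferedReader(new FileReader(args[1]));
        String line;
        while ((line = dataFile.readLine()) != null) {
            int comma = line.indexOf(',');
            if (comma < 0) continue;                    // skip malformed lines
            String key = line.substring(0, comma).trim();
            String rest = line.substring(comma + 1).trim();
            String seen = ancillary.get(key);
            // accumulate all ancillary rows for the same key onto one line
            ancillary.put(key, seen == null ? rest : seen + ", " + rest);
        }
        dataFile.close();

        BufferedReader requests = new BufferedReader(new FileReader(args[0]));
        while ((line = requests.readLine()) != null) {
            String key = line.substring(0, line.indexOf(',')).trim();
            String extra = ancillary.get(key);
            // keys with no ancillary rows get blank columns instead
            System.out.println(line + ", " + (extra == null ? ", ," : extra));
        }
        requests.close();
    }
}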

So, what I'd really like to do is figure out exactly what questions I need
to ask, and then do a full enumeration instead of sampling.

Is my best bet to create the conflated data file (the one I labeled "file 1-2"
above) in one task, then do the analysis in another?  Or is it better to do
the conflation and aggregation in one step, and then combine those?

I'm not sure how clear this is, but I believe it gets the gist across.

Any thoughts appreciated, any questions answered.



-Colin

Re: MapReduce with related data from disparate files

Posted by Colin Freas <co...@gmail.com>.
Thanks Ted, Nathan.  Great advice.

So I've been looking at the InputFormat, RecordReader, and InputSplit
interfaces and associated classes, trying to get my head around them.

In my situation, with two types of files, the names are distinct and
actually have timestamps built in.  I have files like:

requests.20080323.1240
request_data.20080323.1240

There's a business rule that says any data about the requests must be in the
like-named request_data file.

So, if I implemented my own InputFormat and/or RecordReader, is there some
way to get access to the file name that's providing the input, so that I can
direct the input to the appropriate InputFormat/RecordReader/Mapper?  Is
this what I want to do?

I feel like the different inputs from the two files should go to the same
map, which would just manipulate the values associated with the keys common
to both files.  I'm not really sure how to do this.  Are there any more
complex examples of Hadoop setups anywhere?  I've looked around, but I've
found mostly low-level tutorial material about getting a cluster up and
running, and not so much about subsequently bending it to my will.
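
For concreteness, here's the kind of thing I'm imagining, sketched against
the old org.apache.hadoop.mapred API.  I'm assuming the "map.input.file"
job property is the right hook for getting the file name; the "R"/"D" tags
are just something I made up to mark each record's source:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// One mapper for both file types: key on the first CSV field, and tag
// each value with the kind of file it came from.
public class JoinMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    private String tag;

    // configure() runs once per map task; the framework sets
    // "map.input.file" to the path of the file behind this task's split.
    public void configure(JobConf job) {
        String path = job.get("map.input.file", "");
        tag = path.contains("request_data") ? "D" : "R";
    }

    public void map(LongWritable offset, Text value,
            OutputCollector<Text, Text> out, Reporter reporter)
            throws IOException {
        String line = value.toString();
        int comma = line.indexOf(',');
        if (comma < 0) return;                         // skip malformed lines
        String key = line.substring(0, comma).trim();  // the primary key field
        out.collect(new Text(key),
                new Text(tag + "\t" + line.substring(comma + 1).trim()));
    }
}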



-Colin



RE: MapReduce with related data from disparate files

Posted by Nathan Wang <wa...@yahoo-inc.com>.
It's possible to do the whole thing in one round of map/reduce.
The only requirement is to be able to differentiate between the two
different types of input files, possibly using different file name
extensions.

One of my coworkers wrote a smart InputFormat class that creates a
different RecordReader for each file type, based on the input file's
extension.  

In each RecordReader, you create a special typed value object for that
input.  So, in your map method, you collect different value objects from
different RecordReaders.  In your reduce method, for each key, you do the
necessary processing on the collection based on the value object types.

The main point here is to keep track of the differences from the
beginning to the end, and process them accordingly.
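
The shape of it is something like this (a simplified sketch, not our actual
code: instead of a full typed value object, it just prepends a source tag
to each line, which your map and reduce methods can branch on):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Chooses per-file behavior by looking at the split's file name, then
// wraps a plain LineRecordReader so every value carries a source tag.
public class ByExtensionInputFormat extends FileInputFormat<LongWritable, Text> {

    public RecordReader<LongWritable, Text> getRecordReader(
            InputSplit split, JobConf job, Reporter reporter) throws IOException {
        FileSplit fileSplit = (FileSplit) split;
        String name = fileSplit.getPath().getName();
        String tag = name.startsWith("request_data") ? "D" : "R";
        return new TaggedLineReader(new LineRecordReader(job, fileSplit), tag);
    }

    static class TaggedLineReader implements RecordReader<LongWritable, Text> {
        private final RecordReader<LongWritable, Text> inner;
        private final String tag;

        TaggedLineReader(RecordReader<LongWritable, Text> inner, String tag) {
            this.inner = inner;
            this.tag = tag;
        }
        public boolean next(LongWritable key, Text value) throws IOException {
            if (!inner.next(key, value)) return false;
            value.set(tag + "\t" + value.toString());  // prepend the source tag
            return true;
        }
        public LongWritable createKey() { return inner.createKey(); }
        public Text createValue() { return inner.createValue(); }
        public long getPos() throws IOException { return inner.getPos(); }
        public float getProgress() throws IOException { return inner.getProgress(); }
        public void close() throws IOException { inner.close(); }
    }
}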

Nathan


Re: MapReduce with related data from disparate files

Posted by Ted Dunning <td...@veoh.com>.
Map-reduce excels at gluing together files like this.

The map phase selects the key and makes sure that you have some way of
telling what the source of the record is.

The reduce phase takes all of the records with the same key and glues them
together.  It can do your processing, but it is also common to require a
subsequent aggregation step.  One typical log processing scenario is joining
access logs to a user database and accumulating aggregate statistics grouped
by some characteristic(s) from the user database.  This would take an
additional aggregation.

Some helpful hints: store the name of the input file using the config
method of your mapper; that method will be called each time a new input
file is processed by the mapper.  Add the file name to the values seen by
the reducer so you don't have to guess which records are which.  And watch
out for reduce steps that don't have all of the kinds of data you want;
these can be keys where your logs lack coverage of an entity, or records
from your ancillary data that were never referenced in your logs.  Either
way, you have to handle the mismatch.
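
A sketch of the glue step, assuming the map phase has already tagged each
value with its source (say, "R" for the unique request record and "D" for
the ancillary data, separated from the payload by a tab):

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Glues together all records sharing a key: the single "R" record first,
// then every "D" record, with blanks when the "R" side never showed up.
public class JoinReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

    public void reduce(Text key, Iterator<Text> values,
            OutputCollector<Text, Text> out, Reporter reporter)
            throws IOException {
        String request = null;
        List<String> ancillary = new ArrayList<String>();
        while (values.hasNext()) {
            String v = values.next().toString();
            int tab = v.indexOf('\t');
            if (tab < 0) continue;                    // skip untagged values
            if ("R".equals(v.substring(0, tab))) {
                request = v.substring(tab + 1);       // the unique file-1 record
            } else {
                ancillary.add(v.substring(tab + 1));  // one file-2 record
            }
        }
        StringBuilder line = new StringBuilder(request == null ? "" : request);
        for (int i = 0; i < ancillary.size(); i++) {
            line.append(", ").append(ancillary.get(i));
        }
        out.collect(key, new Text(line.toString()));
    }
}

If a key never saw its "R" record, this emits blank leading columns, which
matches the blank-value flattening you described; the enumeration pass can
then run over this output as a second job.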

