Posted to common-user@hadoop.apache.org by pmg <pa...@gmail.com> on 2009/06/18 19:56:13 UTC

multiple file input

I am evaluating hadoop for a problem that does a Cartesian product of input
from one file of 600K records (FileA) with another set of files (FileB1, FileB2,
FileB3) containing 2 million lines in total.

Each line from FileA gets compared with every line from FileB1, FileB2, etc.
FileB1, FileB2, etc. are in a different input directory.

So....

Two input directories 

1. input1 directory with a single file of 600K records - FileA
2. input2 directory segmented into different files with 2 million records in
total - FileB1, FileB2, etc.

How can I have a map that reads a line from FileA in directory input1 and
compares the line with each line from input2?

What is the best way forward? I have seen plenty of examples that map each
record from a single input file and reduce into a single output.

thanks


Re: multiple file input

Posted by Gang Luo <lg...@yahoo.com.cn>.
To do the Cartesian product, every node has to see at least one table completely. So what I suggest is to make input2 the input to the mappers, and in each map task read the whole FileA from input1 manually using the HDFS API (since it is the smaller input) and build a hash table on FileA. For each line from input2, you match it against the lines in the hash table and join them. This is actually a map-side join, which needs only a map phase.
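
A minimal sketch of this map-side approach, against the org.apache.hadoop.mapreduce API (the class name and the configuration key "crossjoin.filea" are made up for illustration, not a tested implementation):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-only cross join: input2 is the job input; FileA is loaded once per task.
public class CrossJoinMapper
    extends Mapper<LongWritable, Text, Text, NullWritable> {

  private final List<String> fileALines = new ArrayList<String>();

  @Override
  protected void setup(Context context) throws IOException {
    // Read the smaller FileA from HDFS into memory ("crossjoin.filea" is a
    // made-up key; set it to the FileA path when configuring the job).
    FileSystem fs = FileSystem.get(context.getConfiguration());
    Path fileA = new Path(context.getConfiguration().get("crossjoin.filea"));
    BufferedReader reader =
        new BufferedReader(new InputStreamReader(fs.open(fileA)));
    try {
      String line;
      while ((line = reader.readLine()) != null) {
        fileALines.add(line);
      }
    } finally {
      reader.close();
    }
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Pair the current input2 line with every line of FileA; the actual
    // comparison/join logic would go here.
    for (String aLine : fileALines) {
      context.write(new Text(aLine + "\t" + value), NullWritable.get());
    }
  }
}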

 -Gang


----- Original Message ----
From: Ed Kohlwey <ek...@gmail.com>
To: common-user@hadoop.apache.org
Sent: 2009/12/8 (Tue) 10:14:51 AM
Subject: Re: multiple file input

One important thing to note is that, with cross products, you'll almost
always get better performance if you can fit both files on a single node's
disk rather than distributing the files.

On Tue, Dec 8, 2009 at 9:18 AM, laser08150815 <la...@laserxyz.de> wrote:

>
>
> pmg wrote:
> >
> > I am evaluating hadoop for a problem that do a Cartesian product of input
> > from one file of 600K (File A) with another set of file set (FileB1,
> > FileB2, FileB3) with 2 millions line in total.
> >
> > Each line from FileA gets compared with every line from FileB1, FileB2
> > etc. etc. FileB1, FileB2 etc. are in a different input directory
> >
> > So....
> >
> > Two input directories
> >
> > 1. input1 directory with a single file of 600K records - FileA
> > 2. input2 directory segmented into different files with 2Million records
> -
> > FileB1, FileB2 etc.
> >
> > How can I have a map that reads a line from a FileA in directory input1
> > and compares the line with each line from input2?
> >
> > What is the best way forward? I have seen plenty of examples that maps
> > each record from single input file and reduces into an output forward.
> >
> > thanks
> >
>
>
> I had a similar problem and solved it by writing a custom InputFormat (see
> attachment). You should improve the methods ACrossBInputSplit.getLength,
> ACrossBRecordReader.getPos and ACrossBRecordReader.getProgress.
>
>




Hadoop Pipes with distributed cache

Posted by Upendra Dadi <ud...@gmu.edu>.
Hi,
  I am facing some problems using a distributed cache archive with a Pipes
job. In my configuration file I have the following two properties:

<property>
  <name>mapred.create.symlink</name>
  <value>yes</value>
</property>

<property>
  <name>mapred.cache.archives</name>
  <value>hdfs://localhost:9000/user/upendra/archive/pipeArchive.zip#pipeSym</value>
</property>

The zip archive contains two folders, lib and share. The lib folder contains
some shared libraries. In my Pipes C++ code, I have added the following
statements to use the shared libraries:

#include <dlfcn.h>

int main(int argc, char *argv[]) {
  // "pipeSym" is the symlink created for the distributed cache archive.
  void *handle = dlopen("pipeSym/lib/libmant.so.1", RTLD_LAZY);
  ...

The shared library is not getting loaded during execution; it spits out an
error that the shared library is not found. What is the problem with the above
steps? Can anyone please shed some light on what might be causing the
problem? Thanks.

Upendra


Re: multiple file input

Posted by Ed Kohlwey <ek...@gmail.com>.
One important thing to note is that, with cross products, you'll almost
always get better performance if you can fit both files on a single node's
disk rather than distributing the files.

On Tue, Dec 8, 2009 at 9:18 AM, laser08150815 <la...@laserxyz.de> wrote:

>
>
> pmg wrote:
> >
> > I am evaluating hadoop for a problem that do a Cartesian product of input
> > from one file of 600K (File A) with another set of file set (FileB1,
> > FileB2, FileB3) with 2 millions line in total.
> >
> > Each line from FileA gets compared with every line from FileB1, FileB2
> > etc. etc. FileB1, FileB2 etc. are in a different input directory
> >
> > So....
> >
> > Two input directories
> >
> > 1. input1 directory with a single file of 600K records - FileA
> > 2. input2 directory segmented into different files with 2Million records
> -
> > FileB1, FileB2 etc.
> >
> > How can I have a map that reads a line from a FileA in directory input1
> > and compares the line with each line from input2?
> >
> > What is the best way forward? I have seen plenty of examples that maps
> > each record from single input file and reduces into an output forward.
> >
> > thanks
> >
>
>
> I had a similar problem and solved it by writing a custom InputFormat (see
> attachment). You should improve the methods ACrossBInputSplit.getLength,
> ACrossBRecordReader.getPos and ACrossBRecordReader.getProgress.
>
>

Re: multiple file input

Posted by laser08150815 <la...@laserxyz.de>.

pmg wrote:
> 
> I am evaluating hadoop for a problem that do a Cartesian product of input
> from one file of 600K (File A) with another set of file set (FileB1,
> FileB2, FileB3) with 2 millions line in total.
> 
> Each line from FileA gets compared with every line from FileB1, FileB2
> etc. etc. FileB1, FileB2 etc. are in a different input directory
> 
> So....
> 
> Two input directories 
> 
> 1. input1 directory with a single file of 600K records - FileA
> 2. input2 directory segmented into different files with 2Million records -
> FileB1, FileB2 etc.
> 
> How can I have a map that reads a line from a FileA in directory input1
> and compares the line with each line from input2? 
> 
> What is the best way forward? I have seen plenty of examples that maps
> each record from single input file and reduces into an output forward.
> 
> thanks
> 


I had a similar problem and solved it by writing a custom InputFormat (see
attachment). You should improve the methods ACrossBInputSplit.getLength,
ACrossBRecordReader.getPos and ACrossBRecordReader.getProgress.


Re: multiple file input

Posted by Erik Paulson <ep...@cs.wisc.edu>.
On Thu, Jun 18, 2009 at 01:36:14PM -0700, Owen O'Malley wrote:
> On Jun 18, 2009, at 10:56 AM, pmg wrote:
> 
> >Each line from FileA gets compared with every line from FileB1,  
> >FileB2 etc.
> >etc. FileB1, FileB2 etc. are in a different input directory
> 
> In the general case, I'd define an InputFormat that takes two  
> directories, computes the input splits for each directory and  
> generates a new list of InputSplits that is the cross-product of the  
> two lists. So instead of FileSplit, it would use a FileSplitPair that  
> gives the FileSplit for dir1 and the FileSplit for dir2 and the record  
> reader would return a TextPair with left and right records (ie.  
> lines). Clearly, you read the first line of split1 and cross it by  
> each line from split2, then move to the second line of split1 and  
> process each line from split2, etc.
> 
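
A minimal sketch of the FileSplitPair Owen describes (a hypothetical class; the
matching InputFormat and the TextPair record reader are omitted), against the
org.apache.hadoop.mapreduce API:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// A split carrying one FileSplit from each input directory.
public class FileSplitPair extends InputSplit implements Writable {

  private FileSplit left = new FileSplit();   // split from dir1
  private FileSplit right = new FileSplit();  // split from dir2

  public FileSplitPair() {}  // no-arg constructor, needed for deserialization

  public FileSplitPair(FileSplit left, FileSplit right) {
    this.left = left;
    this.right = right;
  }

  @Override
  public long getLength() throws IOException, InterruptedException {
    return left.getLength() + right.getLength();
  }

  @Override
  public String[] getLocations() throws IOException, InterruptedException {
    return left.getLocations();  // steer the task toward the left split's blocks
  }

  @Override
  public void write(DataOutput out) throws IOException {
    left.write(out);
    right.write(out);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    left.readFields(in);
    right.readFields(in);
  }

  // Cross-product of split lists computed separately for the two directories.
  public static List<InputSplit> cross(List<InputSplit> dir1Splits,
                                       List<InputSplit> dir2Splits) {
    List<InputSplit> pairs = new ArrayList<InputSplit>();
    for (InputSplit l : dir1Splits) {
      for (InputSplit r : dir2Splits) {
        pairs.add(new FileSplitPair((FileSplit) l, (FileSplit) r));
      }
    }
    return pairs;
  }
}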

Out of curiosity, how does Hadoop schedule tasks when a task needs
multiple inputs and the data for a task is on different nodes?  How does
it decide which node will be more "local" and should have the task
steered to it?

-Erik


Re: multiple file input

Posted by pmg <pa...@gmail.com>.
At this time I don't see any way to do this in map/reduce. I guess I have to
go back to plan A: writing a primitive Java tool, scaled horizontally using
Java executors and processed vertically across multiple machines.



pmg wrote:
> 
> First make smaller chunks of your big files (small enough that one chunk
> can
> be stored in memory). Hadoop's block size is set to 64MB by default. If
> this
> seems ok according to the RAM you have, then simply run Identity Mapper
> only
> job on for both Files A and B. The output will be smaller files with the
> names part-0001, part-0002 etc. For simplicity let us call chunks of File A
> as A1, A2, A3... and chunks of B as B1, B2, B3
> 
>>> I am planning to run this on amazon elastic map with large cpu so RAM I
>>> think would not be a problem.
> I can have smaller input files outside map/reduce so I guess we don't have
> to run this phase to get small file chunks as A1, A2, A3... and chunks of
> B as B1, B2, B3
> 
> Create a file (or write a program that will generate this file) that
> contains the cross product of these chunks-
> A1 B1
> A1 B2
> A1 B3
> ..
> A2 B1
> A2 B2
> A2 B3
> ..
> 
>>> Correct me If I am wrong. the actual FileA that gets divided into chunks
>>> A1,A2...has around 600K file records. FileB that gets divided into B1,
>>> B2....has around 2 million file record. So I guess we looking at file
>>> record size of cartesian product of 600K * 2Millions. We are looking at
>>> peta bytes of data. This would be a hard sell :)
> 
> 
> Tarandeep wrote:
>> 
>> hey I think I got your question wrong. My solution won't let you achieve
>> what you intended. your example made it clear.
>> 
>> Since it is a cross product, the contents of one of the files has to be
>> in
>> memory for iteration, but since size is big, so might not be possible, so
>> how about this solution and this will scale too-
>> 
>> First make smaller chunks of your big files (small enough that one chunk
>> can
>> be stored in memory). Hadoop's block size is set to 64MB by default. If
>> this
>> seems ok according to the RAM you have, then simply run Identity Mapper
>> only
>> job on for both Files A and B. The output will be smaller files with the
>> names part-0001, part-0002 etc. For simplicity let us call chunks of File
>> A
>> as A1, A2, A3... and chunks of B as B1, B2, B3
>> 
>> Create a file (or write a program that will generate this file) that
>> contains the cross product of these chunks-
>> A1 B1
>> A1 B2
>> A1 B3
>> ..
>> A2 B1
>> A2 B2
>> A2 B3
>> ..
>> 
>> Now run a Map only job (no reducer). Use NLineInputFormat and set N = 1.
>> give input to your job this file. NLineInputFormat will give each mapper
>> a
>> line from this file. So for example, lets say a mapper got the line A1
>> B3,
>> which means take cross product of the contents of chunk A1 and chunk B3.
>> 
>> you can read one of the chunk completely and store in memory as a list or
>> array. And then read second chunk and do the comparison.
>> 
>> Now, as you would have guessed, instead of creating chunks, you can
>> actually
>> calculate offsets in the files (after an interval of say 64MB) and can
>> achieve the same effect. HDFS allows seeking to an offset in a file so
>> that
>> will work too.
>> 
>> -Tarandeep
>> 
>> 
>> 
>> On Fri, Jun 19, 2009 at 4:33 PM, pmg <pa...@gmail.com> wrote:
>> 
>>>
>>> Thanks Tarandeep for prompt reply.
>>>
>>> Let me give you an example structure of FileA and FileB
>>>
>>> FileA
>>> -------
>>>
>>> 123 ABC 1
>>>
>>>
>>> FileB
>>> -----
>>> 123 ABC 2
>>> 456 BNF 3
>>>
>>> Both the files are tab delimited. Every record is not simply compared
>>> with
>>> each record in FileB. There's heuristic I am going to run for the
>>> comparison
>>> and score the results along with output. So my output file is like this
>>>
>>> Output
>>> --------
>>>
>>> 123 ABC 1 123 ABC 2 10
>>> 123 ABC 1 456 BNF 3 20
>>>
>>> first 3 columns in the output file are from FileA, next three columns
>>> are
>>> from FileB and the last column is their comparison score.
>>>
>>> So basically you are saying we can use two map/reduce jobs for FileA and
>>> other for FileB
>>>
>>> map (FileA) -> reduce (FileA)-> map (FileB) -> reduce (FileB)
>>>
>>> For the first file FileA I map them with <k,V> (I can't use bloom filter
>>> because comparison between each record from FileA is not a straight
>>> comparison with every record in FileB - They are compared using
>>> heuristic
>>> and scored them for their quantitative comparison and stored)
>>>
>>> In the FileA reduce I store it in the distributed cache. Once this is
>>> done
>>> map the FileB in the second map and in the FileB reduce read in the
>>> FileA
>>> from the distributed cache and do my heuristics for every <K,V) from
>>> FileB
>>> and store my result
>>>
>>> thanks
>>>
>>>
>>> Tarandeep wrote:
>>> >
>>> > oh my bad, I was not clear-
>>> >
>>> > For FileB, you will be running a second map reduce job. In mapper, you
>>> can
>>> > use the Bloom Filter, created in first map reduce job (if you wish to
>>> use)
>>> > to eliminate the lines whose keys dont match. Mapper will emit
>>> key,value
>>> > pair, where key is teh field on which you want to do comparison and
>>> value
>>> > is
>>> > the whole line.
>>> >
>>> > when the key,value pairs go to reducers, then you have lines from
>>> FileB
>>> > sorted on the field yon want to use for comparison. Now you can read
>>> > contents of FileA (note that if you ran first job with N reducers, you
>>> > will
>>> > have N paritions of FileA and you want to read only the partition
>>> meant
>>> > for
>>> > this reducer). Content of FileA is also sorted on the field, Now you
>>> can
>>> > easily compare the lines from two files.
>>> >
>>> > CloudBase- cloudbase.sourceforge.net has code for doing join this
>>> fashion.
>>> >
>>> > Let me know if you need more clarification.
>>> >
>>> > -Tarandeep
>>> >
>>> > On Fri, Jun 19, 2009 at 3:45 PM, pmg <pa...@gmail.com> wrote:
>>> >
>>> >>
>>> >> thanks tarandeep
>>> >>
>>> >> Correct if I am wrong that when I map FileA mapper created key,value
>>> pair
>>> >> and sends across to the reducer. If so then how can I compare when
>>> FileB
>>> >> is
>>> >> not even mapped yet.
>>> >>
>>> >>
>>> >> Tarandeep wrote:
>>> >> >
>>> >> > On Fri, Jun 19, 2009 at 2:41 PM, pmg <pa...@gmail.com>
>>> wrote:
>>> >> >
>>> >> >>
>>> >> >> For the sake of simplification I have simplified my input into two
>>> >> files
>>> >> >> 1.
>>> >> >> FileA 2. FileB
>>> >> >>
>>> >> >> As I said earlier I want to compare every record of FileA against
>>> >> every
>>> >> >> record in FileB I know this is n2 but this is the process. I wrote
>>> a
>>> >> >> simple
>>> >> >> InputFormat and RecordReader. It seems each file is read serially
>>> one
>>> >> >> after
>>> >> >> another. How can my record read have reference to both files at
>>> the
>>> >> same
>>> >> >> line so that I can create cross list of FileA and FileB for the
>>> >> mapper.
>>> >> >>
>>> >> >> Basically the way I see is to get mapper one record from FileA and
>>> all
>>> >> >> records from FileB so that mapper can compare n2 and forward them
>>> to
>>> >> >> reducer.
>>> >> >
>>> >> >
>>> >> > It will be hard (and inefficient) to do this in Mapper using some
>>> >> custom
>>> >> > intput format. What you can do is use Semi Join technique-
>>> >> >
>>> >> > Since File A is smaller, run a map reduce job that will output
>>> >> key,value
>>> >> > pair where key is the field or set of fields on which you want to
>>> do
>>> >> the
>>> >> > comparison and value is the whole line.
>>> >> >
>>> >> > The reducer is simply an Identity reducer which writes the files.
>>> So
>>> >> your
>>> >> > fileA has been partitioned on the field(s). you can also create
>>> bloom
>>> >> > filter
>>> >> > on this field and store it in Distributed Cache.
>>> >> >
>>> >> > Now read FileB, load Bloom filter into memory and see if the field
>>> from
>>> >> > line
>>> >> > of FileB is present in Bloom filter, if yes emit Key,Value pair
>>> else
>>> >> not.
>>> >> >
>>> >> > At reducers, you get the contents of FileB partitioned just like
>>> >> contents
>>> >> > of
>>> >> > fileA were partitioned and at a particular reducer you get lines
>>> sorted
>>> >> on
>>> >> > the field you want to do the comparison, At this point you read the
>>> >> > contents
>>> >> > of FileA that reached this reducer and since its contents were
>>> sorted
>>> >> as
>>> >> > well, you can quickly go over the two lists.
>>> >> >
>>> >> > -Tarandeep
>>> >> >
>>> >> >>
>>> >> >>
>>> >> >> thanks
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >> pmg wrote:
>>> >> >> >
>>> >> >> > Thanks owen. Are there any examples that I can look at?
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > owen.omalley wrote:
>>> >> >> >>
>>> >> >> >> On Jun 18, 2009, at 10:56 AM, pmg wrote:
>>> >> >> >>
>>> >> >> >>> Each line from FileA gets compared with every line from
>>> FileB1,
>>> >> >> >>> FileB2 etc.
>>> >> >> >>> etc. FileB1, FileB2 etc. are in a different input directory
>>> >> >> >>
>>> >> >> >> In the general case, I'd define an InputFormat that takes two
>>> >> >> >> directories, computes the input splits for each directory and
>>> >> >> >> generates a new list of InputSplits that is the cross-product
>>> of
>>> >> the
>>> >> >> >> two lists. So instead of FileSplit, it would use a
>>> FileSplitPair
>>> >> that
>>> >> >> >> gives the FileSplit for dir1 and the FileSplit for dir2 and the
>>> >> record
>>> >> >> >> reader would return a TextPair with left and right records (ie.
>>> >> >> >> lines). Clearly, you read the first line of split1 and cross it
>>> by
>>> >> >> >> each line from split2, then move to the second line of split1
>>> and
>>> >> >> >> process each line from split2, etc.
>>> >> >> >>
>>> >> >> >> You'll need to ensure that you don't overwhelm the system with
>>> >> either
>>> >> >> >> too many input splits (ie. maps). Also don't forget that N^2/M
>>> >> grows
>>> >> >> >> much faster with the size of the input (N) than the M machines
>>> can
>>> >> >> >> handle in a fixed amount of time.
>>> >> >> >>
>>> >> >> >>> Two input directories
>>> >> >> >>>
>>> >> >> >>> 1. input1 directory with a single file of 600K records - FileA
>>> >> >> >>> 2. input2 directory segmented into different files with
>>> 2Million
>>> >> >> >>> records -
>>> >> >> >>> FileB1, FileB2 etc.
>>> >> >> >>
>>> >> >> >> In this particular case, it would be right to load all of FileA
>>> >> into
>>> >> >> >> memory and process the chunks of FileB/part-*. Then it would be
>>> >> much
>>> >> >> >> faster than needing to re-read the file over and over again,
>>> but
>>> >> >> >> otherwise it would be the same.
>>> >> >> >>
>>> >> >> >> -- Owen
>>> >> >> >>
>>> >> >> >>
>>> >> >> >
>>> >> >> >
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >
>>> >> >
>>> >>
>>> >>
>>> >>
>>> >
>>> >
>>>
>>>
>>>
>> 
>> 
> 
> 



Re: multiple file input

Posted by pmg <pa...@gmail.com>.
First make smaller chunks of your big files (small enough that one chunk can
be stored in memory). Hadoop's block size is set to 64MB by default. If this
seems ok according to the RAM you have, then simply run an Identity-Mapper-only
job for both Files A and B. The output will be smaller files with the
names part-0001, part-0002, etc. For simplicity let us call the chunks of File A
A1, A2, A3... and the chunks of B B1, B2, B3.

>> I am planning to run this on Amazon Elastic MapReduce with large-CPU
>> instances, so RAM I think would not be a problem.
I can have smaller input files outside map/reduce, so I guess we don't have
to run this phase to get the small file chunks A1, A2, A3... and the chunks of
B B1, B2, B3.

Create a file (or write a program that will generate this file) that
contains the cross product of these chunks-
A1 B1
A1 B2
A1 B3
..
A2 B1
A2 B2
A2 B3
..

>> Correct me if I am wrong: the actual FileA that gets divided into chunks
>> A1, A2... has around 600K records. FileB that gets divided into B1,
>> B2... has around 2 million records. So I guess we are looking at a
>> Cartesian product of 600K * 2 million, i.e. about 1.2 * 10^12 output
>> records. We are looking at petabytes of data. This would be a hard sell :)


Tarandeep wrote:
> 
> hey I think I got your question wrong. My solution won't let you achieve
> what you intended. your example made it clear.
> 
> Since it is a cross product, the contents of one of the files has to be in
> memory for iteration, but since size is big, so might not be possible, so
> how about this solution and this will scale too-
> 
> First make smaller chunks of your big files (small enough that one chunk
> can
> be stored in memory). Hadoop's block size is set to 64MB by default. If
> this
> seems ok according to the RAM you have, then simply run Identity Mapper
> only
> job on for both Files A and B. The output will be smaller files with the
> names part-0001, part-0002 etc. For simplicity let us call chunks of File A
> as A1, A2, A3... and chunks of B as B1, B2, B3
> 
> Create a file (or write a program that will generate this file) that
> contains the cross product of these chunks-
> A1 B1
> A1 B2
> A1 B3
> ..
> A2 B1
> A2 B2
> A2 B3
> ..
> 
> Now run a Map only job (no reducer). Use NLineInputFormat and set N = 1.
> give input to your job this file. NLineInputFormat will give each mapper a
> line from this file. So for example, lets say a mapper got the line A1 B3,
> which means take cross product of the contents of chunk A1 and chunk B3.
> 
> you can read one of the chunk completely and store in memory as a list or
> array. And then read second chunk and do the comparison.
> 
> Now, as you would have guessed, instead of creating chunks, you can
> actually
> calculate offsets in the files (after an interval of say 64MB) and can
> achieve the same effect. HDFS allows seeking to an offset in a file so
> that
> will work too.
> 
> -Tarandeep
> 
> 
> 
> On Fri, Jun 19, 2009 at 4:33 PM, pmg <pa...@gmail.com> wrote:
> 
>>
>> Thanks Tarandeep for prompt reply.
>>
>> Let me give you an example structure of FileA and FileB
>>
>> FileA
>> -------
>>
>> 123 ABC 1
>>
>>
>> FileB
>> -----
>> 123 ABC 2
>> 456 BNF 3
>>
>> Both the files are tab delimited. Every record is not simply compared
>> with
>> each record in FileB. There's heuristic I am going to run for the
>> comparison
>> and score the results along with output. So my output file is like this
>>
>> Output
>> --------
>>
>> 123 ABC 1 123 ABC 2 10
>> 123 ABC 1 456 BNF 3 20
>>
>> first 3 columns in the output file are from FileA, next three columns are
>> from FileB and the last column is their comparison score.
>>
>> So basically you are saying we can use two map/reduce jobs for FileA and
>> other for FileB
>>
>> map (FileA) -> reduce (FileA)-> map (FileB) -> reduce (FileB)
>>
>> For the first file FileA I map them with <k,V> (I can't use bloom filter
>> because comparison between each record from FileA is not a straight
>> comparison with every record in FileB - They are compared using heuristic
>> and scored them for their quantitative comparison and stored)
>>
>> In the FileA reduce I store it in the distributed cache. Once this is
>> done
>> map the FileB in the second map and in the FileB reduce read in the FileA
>> from the distributed cache and do my heuristics for every <K,V) from
>> FileB
>> and store my result
>>
>> thanks
>>
>>
>> Tarandeep wrote:
>> >
>> > oh my bad, I was not clear-
>> >
>> > For FileB, you will be running a second map reduce job. In mapper, you
>> can
>> > use the Bloom Filter, created in first map reduce job (if you wish to
>> use)
>> > to eliminate the lines whose keys dont match. Mapper will emit
>> key,value
>> > pair, where key is teh field on which you want to do comparison and
>> value
>> > is
>> > the whole line.
>> >
>> > when the key,value pairs go to reducers, then you have lines from FileB
>> > sorted on the field yon want to use for comparison. Now you can read
>> > contents of FileA (note that if you ran first job with N reducers, you
>> > will
>> > have N paritions of FileA and you want to read only the partition meant
>> > for
>> > this reducer). Content of FileA is also sorted on the field, Now you
>> can
>> > easily compare the lines from two files.
>> >
>> > CloudBase- cloudbase.sourceforge.net has code for doing join this
>> fashion.
>> >
>> > Let me know if you need more clarification.
>> >
>> > -Tarandeep
>> >
>> > On Fri, Jun 19, 2009 at 3:45 PM, pmg <pa...@gmail.com> wrote:
>> >
>> >>
>> >> thanks tarandeep
>> >>
>> >> Correct if I am wrong that when I map FileA mapper created key,value
>> pair
>> >> and sends across to the reducer. If so then how can I compare when
>> FileB
>> >> is
>> >> not even mapped yet.
>> >>
>> >>
>> >> Tarandeep wrote:
>> >> >
>> >> > On Fri, Jun 19, 2009 at 2:41 PM, pmg <pa...@gmail.com> wrote:
>> >> >
>> >> >>
>> >> >> For the sake of simplification I have simplified my input into two
>> >> files
>> >> >> 1.
>> >> >> FileA 2. FileB
>> >> >>
>> >> >> As I said earlier I want to compare every record of FileA against
>> >> every
>> >> >> record in FileB I know this is n2 but this is the process. I wrote
>> a
>> >> >> simple
>> >> >> InputFormat and RecordReader. It seems each file is read serially
>> one
>> >> >> after
>> >> >> another. How can my record read have reference to both files at the
>> >> same
>> >> >> line so that I can create cross list of FileA and FileB for the
>> >> mapper.
>> >> >>
>> >> >> Basically the way I see is to get mapper one record from FileA and
>> all
>> >> >> records from FileB so that mapper can compare n2 and forward them
>> to
>> >> >> reducer.
>> >> >
>> >> >
>> >> > It will be hard (and inefficient) to do this in Mapper using some
>> >> custom
>> >> > intput format. What you can do is use Semi Join technique-
>> >> >
>> >> > Since File A is smaller, run a map reduce job that will output
>> >> key,value
>> >> > pair where key is the field or set of fields on which you want to do
>> >> the
>> >> > comparison and value is the whole line.
>> >> >
>> >> > The reducer is simply an Identity reducer which writes the files. So
>> >> your
>> >> > fileA has been partitioned on the field(s). you can also create
>> bloom
>> >> > filter
>> >> > on this field and store it in Distributed Cache.
>> >> >
>> >> > Now read FileB, load Bloom filter into memory and see if the field
>> from
>> >> > line
>> >> > of FileB is present in Bloom filter, if yes emit Key,Value pair else
>> >> not.
>> >> >
>> >> > At reducers, you get the contents of FileB partitioned just like
>> >> contents
>> >> > of
>> >> > fileA were partitioned and at a particular reducer you get lines
>> sorted
>> >> on
>> >> > the field you want to do the comparison, At this point you read the
>> >> > contents
>> >> > of FileA that reached this reducer and since its contents were
>> sorted
>> >> as
>> >> > well, you can quickly go over the two lists.
>> >> >
>> >> > -Tarandeep
>> >> >
>> >> >>
>> >> >>
>> >> >> thanks
>> >> >>
>> >> >>
>> >> >>
>> >> >> pmg wrote:
>> >> >> >
>> >> >> > Thanks owen. Are there any examples that I can look at?
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> > owen.omalley wrote:
>> >> >> >>
>> >> >> >> On Jun 18, 2009, at 10:56 AM, pmg wrote:
>> >> >> >>
>> >> >> >>> Each line from FileA gets compared with every line from FileB1,
>> >> >> >>> FileB2 etc.
>> >> >> >>> etc. FileB1, FileB2 etc. are in a different input directory
>> >> >> >>
>> >> >> >> In the general case, I'd define an InputFormat that takes two
>> >> >> >> directories, computes the input splits for each directory and
>> >> >> >> generates a new list of InputSplits that is the cross-product of
>> >> the
>> >> >> >> two lists. So instead of FileSplit, it would use a FileSplitPair
>> >> that
>> >> >> >> gives the FileSplit for dir1 and the FileSplit for dir2 and the
>> >> record
>> >> >> >> reader would return a TextPair with left and right records (ie.
>> >> >> >> lines). Clearly, you read the first line of split1 and cross it
>> by
>> >> >> >> each line from split2, then move to the second line of split1
>> and
>> >> >> >> process each line from split2, etc.
>> >> >> >>
>> >> >> >> You'll need to ensure that you don't overwhelm the system with
>> >> either
>> >> >> >> too many input splits (ie. maps). Also don't forget that N^2/M
>> >> grows
>> >> >> >> much faster with the size of the input (N) than the M machines
>> can
>> >> >> >> handle in a fixed amount of time.
>> >> >> >>
>> >> >> >>> Two input directories
>> >> >> >>>
>> >> >> >>> 1. input1 directory with a single file of 600K records - FileA
>> >> >> >>> 2. input2 directory segmented into different files with
>> 2Million
>> >> >> >>> records -
>> >> >> >>> FileB1, FileB2 etc.
>> >> >> >>
>> >> >> >> In this particular case, it would be right to load all of FileA
>> >> into
>> >> >> >> memory and process the chunks of FileB/part-*. Then it would be
>> >> much
>> >> >> >> faster than needing to re-read the file over and over again, but
>> >> >> >> otherwise it would be the same.
>> >> >> >>
>> >> >> >> -- Owen
>> >> >> >>
>> >> >> >>
>> >> >> >
>> >> >> >
>> >> >>
>> >> >>
>> >> >>
>> >> >
>> >> >
>> >>
>> >>
>> >>
>> >
>> >
>>
>>
>>
> 
> 



Re: multiple file input

Posted by Tarandeep Singh <ta...@gmail.com>.
Hey, I think I got your question wrong. My solution won't let you achieve
what you intended; your example made it clear.

Since it is a cross product, the contents of one of the files have to be in
memory for iteration, but since the size is big that might not be possible,
so how about this solution, which will scale too:

First make smaller chunks of your big files (small enough that one chunk can
be stored in memory). Hadoop's block size is set to 64MB by default. If this
seems ok according to the RAM you have, then simply run an Identity-Mapper-only
job for both Files A and B. The output will be smaller files with the names
part-0001, part-0002, etc. For simplicity let us call the chunks of File A
A1, A2, A3... and the chunks of B B1, B2, B3.

Create a file (or write a program that will generate this file) that
contains the cross product of these chunks-
A1 B1
A1 B2
A1 B3
..
A2 B1
A2 B2
A2 B3
..

Now run a map-only job (no reducer). Use NLineInputFormat, set N = 1, and
give this file as the input to your job. NLineInputFormat will give each mapper
a line from this file. So, for example, say a mapper got the line A1 B3,
which means: take the cross product of the contents of chunk A1 and chunk B3.

You can read one of the chunks completely and store it in memory as a list or
array, and then read the second chunk and do the comparison.

Now, as you would have guessed, instead of creating chunks you can actually
calculate offsets into the files (at intervals of, say, 64MB) and achieve the
same effect. HDFS allows seeking to an offset in a file, so that will work
too.
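
A minimal sketch of the driver for that map-only job, in the newer
Job.getInstance style (ChunkPairMapper and the paths are hypothetical; the
mapper would open the two chunks named on its input line via the HDFS API,
as described above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChunkPairDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "chunk-pair cross product");
    job.setJarByClass(ChunkPairDriver.class);
    job.setMapperClass(ChunkPairMapper.class);      // hypothetical mapper class
    job.setNumReduceTasks(0);                       // map-only job
    job.setInputFormatClass(NLineInputFormat.class);
    NLineInputFormat.setNumLinesPerSplit(job, 1);   // one "A1 B3" pair per mapper
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);
    NLineInputFormat.addInputPath(job, new Path("pairs.txt"));  // the pair file
    FileOutputFormat.setOutputPath(job, new Path("cross-out"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}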

-Tarandeep



On Fri, Jun 19, 2009 at 4:33 PM, pmg <pa...@gmail.com> wrote:

>
> Thanks Tarandeep for prompt reply.
>
> Let me give you an example structure of FileA and FileB
>
> FileA
> -------
>
> 123 ABC 1
>
>
> FileB
> -----
> 123 ABC 2
> 456 BNF 3
>
> Both the files are tab delimited. Every record is not simply compared with
> each record in FileB. There's heuristic I am going to run for the
> comparison
> and score the results along with output. So my output file is like this
>
> Output
> --------
>
> 123 ABC 1 123 ABC 2 10
> 123 ABC 1 456 BNF 3 20
>
> first 3 columns in the output file are from FileA, next three columns are
> from FileB and the last column is their comparison score.
>
> So basically you are saying we can use two map/reduce jobs for FileA and
> other for FileB
>
> map (FileA) -> reduce (FileA)-> map (FileB) -> reduce (FileB)
>
> For the first file FileA I map them with <k,V> (I can't use bloom filter
> because comparison between each record from FileA is not a straight
> comparison with every record in FileB - They are compared using heuristic
> and scored them for their quantitative comparison and stored)
>
> In the FileA reduce I store it in the distributed cache. Once this is done
> map the FileB in the second map and in the FileB reduce read in the FileA
> from the distributed cache and do my heuristics for every <K,V) from FileB
> and store my result
>
> thanks
>
>
> Tarandeep wrote:
> >
> > oh my bad, I was not clear-
> >
> > For FileB, you will be running a second map reduce job. In mapper, you
> can
> > use the Bloom Filter, created in first map reduce job (if you wish to
> use)
> > to eliminate the lines whose keys dont match. Mapper will emit key,value
> > pair, where key is teh field on which you want to do comparison and value
> > is
> > the whole line.
> >
> > when the key,value pairs go to reducers, then you have lines from FileB
> > sorted on the field yon want to use for comparison. Now you can read
> > contents of FileA (note that if you ran first job with N reducers, you
> > will
> > have N paritions of FileA and you want to read only the partition meant
> > for
> > this reducer). Content of FileA is also sorted on the field, Now you can
> > easily compare the lines from two files.
> >
> > CloudBase- cloudbase.sourceforge.net has code for doing join this
> fashion.
> >
> > Let me know if you need more clarification.
> >
> > -Tarandeep
> >
> > On Fri, Jun 19, 2009 at 3:45 PM, pmg <pa...@gmail.com> wrote:
> >
> >>
> >> thanks tarandeep
> >>
> >> Correct if I am wrong that when I map FileA mapper created key,value
> pair
> >> and sends across to the reducer. If so then how can I compare when FileB
> >> is
> >> not even mapped yet.
> >>
> >>
> >> Tarandeep wrote:
> >> >
> >> > On Fri, Jun 19, 2009 at 2:41 PM, pmg <pa...@gmail.com> wrote:
> >> >
> >> >>
> >> >> For the sake of simplification I have simplified my input into two
> >> files
> >> >> 1.
> >> >> FileA 2. FileB
> >> >>
> >> >> As I said earlier I want to compare every record of FileA against
> >> every
> >> >> record in FileB I know this is n2 but this is the process. I wrote a
> >> >> simple
> >> >> InputFormat and RecordReader. It seems each file is read serially one
> >> >> after
> >> >> another. How can my record read have reference to both files at the
> >> same
> >> >> line so that I can create cross list of FileA and FileB for the
> >> mapper.
> >> >>
> >> >> Basically the way I see is to get mapper one record from FileA and
> all
> >> >> records from FileB so that mapper can compare n2 and forward them to
> >> >> reducer.
> >> >
> >> >
> >> > It will be hard (and inefficient) to do this in Mapper using some
> >> custom
> >> > intput format. What you can do is use Semi Join technique-
> >> >
> >> > Since File A is smaller, run a map reduce job that will output
> >> key,value
> >> > pair where key is the field or set of fields on which you want to do
> >> the
> >> > comparison and value is the whole line.
> >> >
> >> > The reducer is simply an Identity reducer which writes the files. So
> >> your
> >> > fileA has been partitioned on the field(s). you can also create bloom
> >> > filter
> >> > on this field and store it in Distributed Cache.
> >> >
> >> > Now read FileB, load Bloom filter into memory and see if the field
> from
> >> > line
> >> > of FileB is present in Bloom filter, if yes emit Key,Value pair else
> >> not.
> >> >
> >> > At reducers, you get the contents of FileB partitioned just like
> >> contents
> >> > of
> >> > fileA were partitioned and at a particular reducer you get lines
> sorted
> >> on
> >> > the field you want to do the comparison, At this point you read the
> >> > contents
> >> > of FileA that reached this reducer and since its contents were sorted
> >> as
> >> > well, you can quickly go over the two lists.
> >> >
> >> > -Tarandeep
> >> >
> >> >>
> >> >>
> >> >> thanks
> >> >>
> >> >>
> >> >>
> >> >> pmg wrote:
> >> >> >
> >> >> > Thanks owen. Are there any examples that I can look at?
> >> >> >
> >> >> >
> >> >> >
> >> >> > owen.omalley wrote:
> >> >> >>
> >> >> >> On Jun 18, 2009, at 10:56 AM, pmg wrote:
> >> >> >>
> >> >> >>> Each line from FileA gets compared with every line from FileB1,
> >> >> >>> FileB2 etc.
> >> >> >>> etc. FileB1, FileB2 etc. are in a different input directory
> >> >> >>
> >> >> >> In the general case, I'd define an InputFormat that takes two
> >> >> >> directories, computes the input splits for each directory and
> >> >> >> generates a new list of InputSplits that is the cross-product of
> >> the
> >> >> >> two lists. So instead of FileSplit, it would use a FileSplitPair
> >> that
> >> >> >> gives the FileSplit for dir1 and the FileSplit for dir2 and the
> >> record
> >> >> >> reader would return a TextPair with left and right records (ie.
> >> >> >> lines). Clearly, you read the first line of split1 and cross it by
> >> >> >> each line from split2, then move to the second line of split1 and
> >> >> >> process each line from split2, etc.
> >> >> >>
> >> >> >> You'll need to ensure that you don't overwhelm the system with
> >> either
> >> >> >> too many input splits (ie. maps). Also don't forget that N^2/M
> >> grows
> >> >> >> much faster with the size of the input (N) than the M machines can
> >> >> >> handle in a fixed amount of time.
> >> >> >>
> >> >> >>> Two input directories
> >> >> >>>
> >> >> >>> 1. input1 directory with a single file of 600K records - FileA
> >> >> >>> 2. input2 directory segmented into different files with 2Million
> >> >> >>> records -
> >> >> >>> FileB1, FileB2 etc.
> >> >> >>
> >> >> >> In this particular case, it would be right to load all of FileA
> >> into
> >> >> >> memory and process the chunks of FileB/part-*. Then it would be
> >> much
> >> >> >> faster than needing to re-read the file over and over again, but
> >> >> >> otherwise it would be the same.
> >> >> >>
> >> >> >> -- Owen
> >> >> >>
> >> >> >>
> >> >> >
> >> >> >
> >> >>
> >> >>
> >> >>
> >> >
> >> >
> >>
> >>
> >>
> >
> >
>
>
>

Re: multiple file input

Posted by pmg <pa...@gmail.com>.
Thanks Tarandeep for prompt reply.

Let me give you an example structure of FileA and FileB

FileA
-------

123 ABC 1


FileB
-----
123 ABC 2
456 BNF 3

Both files are tab-delimited. Every record is not simply compared with
each record in FileB; there's a heuristic I am going to run for the comparison,
which scores the results in the output. So my output file is like this:

Output
--------

123 ABC 1 123 ABC 2 10 
123 ABC 1 456 BNF 3 20

The first 3 columns in the output file are from FileA, the next three columns
are from FileB, and the last column is their comparison score.

So basically you are saying we can use two map/reduce jobs, one for FileA and
the other for FileB:

map (FileA) -> reduce (FileA) -> map (FileB) -> reduce (FileB)

For the first file, FileA, I map the records to <K,V> pairs (I can't use a
Bloom filter because the comparison of each record from FileA with every
record in FileB is not a straight comparison - they are compared using a
heuristic, scored for their quantitative comparison, and stored).

In the FileA reduce I store the output in the distributed cache. Once this is
done, I map FileB in the second map, and in the FileB reduce I read FileA back
from the distributed cache, run my heuristic for every <K,V> from FileB,
and store my result.
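
A minimal sketch of that second job's reduce side (the symlinked cache file
name "filea-part" and the compare() heuristic are placeholders for the pieces
described above):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Second job: score every FileB line against every line of FileA.
public class FileBScoreReducer extends Reducer<Text, Text, Text, Text> {

  private final List<String> fileALines = new ArrayList<String>();

  @Override
  protected void setup(Context context) throws IOException {
    // "filea-part" is assumed to be the first job's FileA output,
    // symlinked into the working directory via the distributed cache.
    BufferedReader in = new BufferedReader(new FileReader("filea-part"));
    try {
      String line;
      while ((line = in.readLine()) != null) {
        fileALines.add(line);
      }
    } finally {
      in.close();
    }
  }

  @Override
  protected void reduce(Text key, Iterable<Text> bLines, Context context)
      throws IOException, InterruptedException {
    for (Text bLine : bLines) {
      for (String aLine : fileALines) {
        // compare() stands in for the scoring heuristic described above.
        int score = compare(aLine, bLine.toString());
        context.write(new Text(aLine + "\t" + bLine),
                      new Text(String.valueOf(score)));
      }
    }
  }

  private int compare(String a, String b) {
    return 0;  // placeholder heuristic
  }
}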

thanks


Tarandeep wrote:
> 
> oh my bad, I was not clear-
> 
> For FileB, you will be running a second map reduce job. In mapper, you can
> use the Bloom Filter, created in first map reduce job (if you wish to use)
> to eliminate the lines whose keys dont match. Mapper will emit key,value
> pair, where key is teh field on which you want to do comparison and value
> is
> the whole line.
> 
> when the key,value pairs go to reducers, then you have lines from FileB
> sorted on the field yon want to use for comparison. Now you can read
> contents of FileA (note that if you ran first job with N reducers, you
> will
> have N paritions of FileA and you want to read only the partition meant
> for
> this reducer). Content of FileA is also sorted on the field, Now you can
> easily compare the lines from two files.
> 
> CloudBase- cloudbase.sourceforge.net has code for doing join this fashion.
> 
> Let me know if you need more clarification.
> 
> -Tarandeep
> 
> On Fri, Jun 19, 2009 at 3:45 PM, pmg <pa...@gmail.com> wrote:
> 
>>
>> thanks tarandeep
>>
>> Correct if I am wrong that when I map FileA mapper created key,value pair
>> and sends across to the reducer. If so then how can I compare when FileB
>> is
>> not even mapped yet.
>>
>>
>> Tarandeep wrote:
>> >
>> > On Fri, Jun 19, 2009 at 2:41 PM, pmg <pa...@gmail.com> wrote:
>> >
>> >>
>> >> For the sake of simplification I have simplified my input into two
>> files
>> >> 1.
>> >> FileA 2. FileB
>> >>
>> >> As I said earlier I want to compare every record of FileA against
>> every
>> >> record in FileB I know this is n2 but this is the process. I wrote a
>> >> simple
>> >> InputFormat and RecordReader. It seems each file is read serially one
>> >> after
>> >> another. How can my record read have reference to both files at the
>> same
>> >> line so that I can create cross list of FileA and FileB for the
>> mapper.
>> >>
>> >> Basically the way I see is to get mapper one record from FileA and all
>> >> records from FileB so that mapper can compare n2 and forward them to
>> >> reducer.
>> >
>> >
>> > It will be hard (and inefficient) to do this in Mapper using some
>> custom
>> > intput format. What you can do is use Semi Join technique-
>> >
>> > Since File A is smaller, run a map reduce job that will output
>> key,value
>> > pair where key is the field or set of fields on which you want to do
>> the
>> > comparison and value is the whole line.
>> >
>> > The reducer is simply an Identity reducer which writes the files. So
>> your
>> > fileA has been partitioned on the field(s). you can also create bloom
>> > filter
>> > on this field and store it in Distributed Cache.
>> >
>> > Now read FileB, load Bloom filter into memory and see if the field from
>> > line
>> > of FileB is present in Bloom filter, if yes emit Key,Value pair else
>> not.
>> >
>> > At reducers, you get the contents of FileB partitioned just like
>> contents
>> > of
>> > fileA were partitioned and at a particular reducer you get lines sorted
>> on
>> > the field you want to do the comparison, At this point you read the
>> > contents
>> > of FileA that reached this reducer and since its contents were sorted
>> as
>> > well, you can quickly go over the two lists.
>> >
>> > -Tarandeep
>> >
>> >>
>> >>
>> >> thanks
>> >>
>> >>
>> >>
>> >> pmg wrote:
>> >> >
>> >> > Thanks owen. Are there any examples that I can look at?
>> >> >
>> >> >
>> >> >
>> >> > owen.omalley wrote:
>> >> >>
>> >> >> On Jun 18, 2009, at 10:56 AM, pmg wrote:
>> >> >>
>> >> >>> Each line from FileA gets compared with every line from FileB1,
>> >> >>> FileB2 etc.
>> >> >>> etc. FileB1, FileB2 etc. are in a different input directory
>> >> >>
>> >> >> In the general case, I'd define an InputFormat that takes two
>> >> >> directories, computes the input splits for each directory and
>> >> >> generates a new list of InputSplits that is the cross-product of
>> the
>> >> >> two lists. So instead of FileSplit, it would use a FileSplitPair
>> that
>> >> >> gives the FileSplit for dir1 and the FileSplit for dir2 and the
>> record
>> >> >> reader would return a TextPair with left and right records (ie.
>> >> >> lines). Clearly, you read the first line of split1 and cross it by
>> >> >> each line from split2, then move to the second line of split1 and
>> >> >> process each line from split2, etc.
>> >> >>
>> >> >> You'll need to ensure that you don't overwhelm the system with
>> either
>> >> >> too many input splits (ie. maps). Also don't forget that N^2/M
>> grows
>> >> >> much faster with the size of the input (N) than the M machines can
>> >> >> handle in a fixed amount of time.
>> >> >>
>> >> >>> Two input directories
>> >> >>>
>> >> >>> 1. input1 directory with a single file of 600K records - FileA
>> >> >>> 2. input2 directory segmented into different files with 2Million
>> >> >>> records -
>> >> >>> FileB1, FileB2 etc.
>> >> >>
>> >> >> In this particular case, it would be right to load all of FileA
>> into
>> >> >> memory and process the chunks of FileB/part-*. Then it would be
>> much
>> >> >> faster than needing to re-read the file over and over again, but
>> >> >> otherwise it would be the same.
>> >> >>
>> >> >> -- Owen
>> >> >>
>> >> >>
>> >> >
>> >> >
>> >>
>> >>
>> >>
>> >
>> >
>>
>>
>>
> 
> 



Re: multiple file input

Posted by Tarandeep Singh <ta...@gmail.com>.
Oh, my bad, I was not clear.

For FileB, you will be running a second map reduce job. In the mapper, you can
use the Bloom filter created in the first map reduce job (if you wish to use
it) to eliminate the lines whose keys don't match. The mapper will emit
key,value pairs, where the key is the field on which you want to do the
comparison and the value is the whole line.

When the key,value pairs go to the reducers, you have the lines from FileB
sorted on the field you want to use for comparison. Now you can read the
contents of FileA (note that if you ran the first job with N reducers, you
will have N partitions of FileA, and you want to read only the partition meant
for this reducer). The content of FileA is also sorted on the field, so now
you can easily compare the lines from the two files.
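
A minimal sketch of that FileB-side mapper, using Hadoop's
org.apache.hadoop.util.bloom.BloomFilter (the cache file name "filea.bloom"
and the first tab-separated field as the key are illustrative assumptions):

import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;

// Second-job mapper over FileB: drop lines whose key cannot be in FileA.
public class FileBFilterMapper
    extends Mapper<LongWritable, Text, Text, Text> {

  private final BloomFilter filter = new BloomFilter();

  @Override
  protected void setup(Context context) throws IOException {
    // "filea.bloom" is assumed to be a symlinked distributed-cache file
    // holding the filter serialized by the first (FileA) job.
    DataInputStream in = new DataInputStream(new FileInputStream("filea.bloom"));
    try {
      filter.readFields(in);
    } finally {
      in.close();
    }
  }

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    // Assumes the comparison field is the first tab-separated column.
    String field = line.toString().split("\t", 2)[0];
    if (filter.membershipTest(new Key(field.getBytes("UTF-8")))) {
      context.write(new Text(field), line);  // partitioned/sorted like FileA
    }
  }
}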

CloudBase (cloudbase.sourceforge.net) has code for doing joins in this fashion.

Let me know if you need more clarification.

-Tarandeep

On Fri, Jun 19, 2009 at 3:45 PM, pmg <pa...@gmail.com> wrote:

>
> thanks tarandeep
>
> Correct if I am wrong that when I map FileA mapper created key,value pair
> and sends across to the reducer. If so then how can I compare when FileB is
> not even mapped yet.
>
>
> Tarandeep wrote:
> >
> > On Fri, Jun 19, 2009 at 2:41 PM, pmg <pa...@gmail.com> wrote:
> >
> >>
> >> For the sake of simplification I have simplified my input into two files
> >> 1.
> >> FileA 2. FileB
> >>
> >> As I said earlier I want to compare every record of FileA against every
> >> record in FileB I know this is n2 but this is the process. I wrote a
> >> simple
> >> InputFormat and RecordReader. It seems each file is read serially one
> >> after
> >> another. How can my record read have reference to both files at the same
> >> line so that I can create cross list of FileA and FileB for the mapper.
> >>
> >> Basically the way I see is to get mapper one record from FileA and all
> >> records from FileB so that mapper can compare n2 and forward them to
> >> reducer.
> >
> >
> > It will be hard (and inefficient) to do this in Mapper using some custom
> > intput format. What you can do is use Semi Join technique-
> >
> > Since File A is smaller, run a map reduce job that will output key,value
> > pair where key is the field or set of fields on which you want to do the
> > comparison and value is the whole line.
> >
> > The reducer is simply an Identity reducer which writes the files. So your
> > fileA has been partitioned on the field(s). you can also create bloom
> > filter
> > on this field and store it in Distributed Cache.
> >
> > Now read FileB, load Bloom filter into memory and see if the field from
> > line
> > of FileB is present in Bloom filter, if yes emit Key,Value pair else not.
> >
> > At reducers, you get the contents of FileB partitioned just like contents
> > of
> > fileA were partitioned and at a particular reducer you get lines sorted
> on
> > the field you want to do the comparison, At this point you read the
> > contents
> > of FileA that reached this reducer and since its contents were sorted as
> > well, you can quickly go over the two lists.
> >
> > -Tarandeep
> >
> >>
> >>
> >> thanks
> >>
> >>
> >>
> >> pmg wrote:
> >> >
> >> > Thanks owen. Are there any examples that I can look at?
> >> >
> >> >
> >> >
> >> > owen.omalley wrote:
> >> >>
> >> >> On Jun 18, 2009, at 10:56 AM, pmg wrote:
> >> >>
> >> >>> Each line from FileA gets compared with every line from FileB1,
> >> >>> FileB2 etc.
> >> >>> etc. FileB1, FileB2 etc. are in a different input directory
> >> >>
> >> >> In the general case, I'd define an InputFormat that takes two
> >> >> directories, computes the input splits for each directory and
> >> >> generates a new list of InputSplits that is the cross-product of the
> >> >> two lists. So instead of FileSplit, it would use a FileSplitPair that
> >> >> gives the FileSplit for dir1 and the FileSplit for dir2 and the
> record
> >> >> reader would return a TextPair with left and right records (ie.
> >> >> lines). Clearly, you read the first line of split1 and cross it by
> >> >> each line from split2, then move to the second line of split1 and
> >> >> process each line from split2, etc.
> >> >>
> >> >> You'll need to ensure that you don't overwhelm the system with either
> >> >> too many input splits (ie. maps). Also don't forget that N^2/M grows
> >> >> much faster with the size of the input (N) than the M machines can
> >> >> handle in a fixed amount of time.
> >> >>
> >> >>> Two input directories
> >> >>>
> >> >>> 1. input1 directory with a single file of 600K records - FileA
> >> >>> 2. input2 directory segmented into different files with 2Million
> >> >>> records -
> >> >>> FileB1, FileB2 etc.
> >> >>
> >> >> In this particular case, it would be right to load all of FileA into
> >> >> memory and process the chunks of FileB/part-*. Then it would be much
> >> >> faster than needing to re-read the file over and over again, but
> >> >> otherwise it would be the same.
> >> >>
> >> >> -- Owen
> >> >>
> >> >>
> >> >
> >> >
> >>
> >>
> >>
> >
> >
>
>
>

Re: multiple file input

Posted by pmg <pa...@gmail.com>.
thanks tarandeep

Correct me if I am wrong: when I map FileA, the mapper creates key,value pairs
and sends them across to the reducer. If so, then how can I compare when FileB
is not even mapped yet?


Tarandeep wrote:
> 
> On Fri, Jun 19, 2009 at 2:41 PM, pmg <pa...@gmail.com> wrote:
> 
>>
>> For the sake of simplification I have simplified my input into two files
>> 1.
>> FileA 2. FileB
>>
>> As I said earlier I want to compare every record of FileA against every
>> record in FileB I know this is n2 but this is the process. I wrote a
>> simple
>> InputFormat and RecordReader. It seems each file is read serially one
>> after
>> another. How can my record read have reference to both files at the same
>> line so that I can create cross list of FileA and FileB for the mapper.
>>
>> Basically the way I see is to get mapper one record from FileA and all
>> records from FileB so that mapper can compare n2 and forward them to
>> reducer.
> 
> 
> It will be hard (and inefficient) to do this in Mapper using some custom
> intput format. What you can do is use Semi Join technique-
> 
> Since File A is smaller, run a map reduce job that will output key,value
> pair where key is the field or set of fields on which you want to do the
> comparison and value is the whole line.
> 
> The reducer is simply an Identity reducer which writes the files. So your
> fileA has been partitioned on the field(s). you can also create bloom
> filter
> on this field and store it in Distributed Cache.
> 
> Now read FileB, load Bloom filter into memory and see if the field from
> line
> of FileB is present in Bloom filter, if yes emit Key,Value pair else not.
> 
> At reducers, you get the contents of FileB partitioned just like contents
> of
> fileA were partitioned and at a particular reducer you get lines sorted on
> the field you want to do the comparison, At this point you read the
> contents
> of FileA that reached this reducer and since its contents were sorted as
> well, you can quickly go over the two lists.
> 
> -Tarandeep
> 
>>
>>
>> thanks
>>
>>
>>
>> pmg wrote:
>> >
>> > Thanks owen. Are there any examples that I can look at?
>> >
>> >
>> >
>> > owen.omalley wrote:
>> >>
>> >> On Jun 18, 2009, at 10:56 AM, pmg wrote:
>> >>
>> >>> Each line from FileA gets compared with every line from FileB1,
>> >>> FileB2 etc.
>> >>> etc. FileB1, FileB2 etc. are in a different input directory
>> >>
>> >> In the general case, I'd define an InputFormat that takes two
>> >> directories, computes the input splits for each directory and
>> >> generates a new list of InputSplits that is the cross-product of the
>> >> two lists. So instead of FileSplit, it would use a FileSplitPair that
>> >> gives the FileSplit for dir1 and the FileSplit for dir2 and the record
>> >> reader would return a TextPair with left and right records (ie.
>> >> lines). Clearly, you read the first line of split1 and cross it by
>> >> each line from split2, then move to the second line of split1 and
>> >> process each line from split2, etc.
>> >>
>> >> You'll need to ensure that you don't overwhelm the system with either
>> >> too many input splits (ie. maps). Also don't forget that N^2/M grows
>> >> much faster with the size of the input (N) than the M machines can
>> >> handle in a fixed amount of time.
>> >>
>> >>> Two input directories
>> >>>
>> >>> 1. input1 directory with a single file of 600K records - FileA
>> >>> 2. input2 directory segmented into different files with 2Million
>> >>> records -
>> >>> FileB1, FileB2 etc.
>> >>
>> >> In this particular case, it would be right to load all of FileA into
>> >> memory and process the chunks of FileB/part-*. Then it would be much
>> >> faster than needing to re-read the file over and over again, but
>> >> otherwise it would be the same.
>> >>
>> >> -- Owen
>> >>
>> >>
>> >
>> >
>>
>> --
>> View this message in context:
>> http://www.nabble.com/multiple-file-input-tp24095358p24119228.html
>> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>>
>>
> 
> 

-- 
View this message in context: http://www.nabble.com/multiple-file-input-tp24095358p24119864.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.


Re: multiple file input

Posted by Tarandeep Singh <ta...@gmail.com>.
On Fri, Jun 19, 2009 at 2:41 PM, pmg <pa...@gmail.com> wrote:

>
> For the sake of simplicity I have reduced my input to two files:
> 1. FileA 2. FileB
>
> As I said earlier, I want to compare every record of FileA against every
> record in FileB. I know this is N^2, but that is the process. I wrote a
> simple InputFormat and RecordReader. It seems each file is read serially,
> one after another. How can my RecordReader have a reference to both files
> at the same time so that I can build the cross list of FileA and FileB
> for the mapper?
>
> Basically, the way I see it is to give the mapper one record from FileA
> and all records from FileB so that the mapper can do the N^2 comparisons
> and forward the results to the reducer.


It will be hard (and inefficient) to do this in the mapper using some
custom input format. What you can do is use the semi-join technique:

Since FileA is smaller, run a MapReduce job that outputs a key/value pair
where the key is the field or set of fields on which you want to do the
comparison and the value is the whole line.

The reducer is simply an identity reducer that writes the files, so your
FileA has been partitioned on the field(s). You can also create a Bloom
filter on this field and store it in the DistributedCache.

Now read FileB, load the Bloom filter into memory, and check whether the
field from each line of FileB is present in the filter; if it is, emit the
key/value pair, otherwise drop the line.

At the reducers, you get the contents of FileB partitioned just as the
contents of FileA were, and at a particular reducer the lines arrive
sorted on the comparison field. At this point you read the portion of
FileA that reached this reducer and, since it was sorted as well, you can
quickly walk the two sorted lists together.
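
A minimal sketch of the FileB-side mapper (using the newer
org.apache.hadoop.mapreduce API) could look like the following. The cache
file name "fileA.bloom" and the choice of the first tab-separated column
as the join field are illustrative assumptions, not part of the recipe
above:

import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;

// Drops FileB lines whose join field cannot possibly match any FileA
// record, according to a Bloom filter built over FileA's join field.
public class FileBFilterMapper
    extends Mapper<LongWritable, Text, Text, Text> {

  private final BloomFilter filter = new BloomFilter();

  @Override
  protected void setup(Context context) throws IOException {
    // Assumption: "fileA.bloom" was shipped with the DistributedCache
    // and symlinked into the task's working directory.
    DataInputStream in =
        new DataInputStream(new FileInputStream("fileA.bloom"));
    try {
      filter.readFields(in);  // deserialize the filter built over FileA
    } finally {
      in.close();
    }
  }

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    // Assumption: the join field is the first tab-separated column.
    String joinField = line.toString().split("\t", 2)[0];
    if (filter.membershipTest(new Key(joinField.getBytes("UTF-8")))) {
      // Possible match: route it to the same reducer as FileA's lines.
      context.write(new Text(joinField), line);
    }
    // Otherwise drop the line: a Bloom filter can give false positives
    // but never false negatives, so the field is guaranteed absent.
  }
}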

-Tarandeep

>
>
> thanks
>
>
>
> pmg wrote:
> >
> > Thanks Owen. Are there any examples that I can look at?
> >
> >
> >
> > owen.omalley wrote:
> >>
> >> On Jun 18, 2009, at 10:56 AM, pmg wrote:
> >>
> >>> Each line from FileA gets compared with every line from FileB1,
> >>> FileB2 etc.
> >>> etc. FileB1, FileB2 etc. are in a different input directory
> >>
> >> In the general case, I'd define an InputFormat that takes two
> >> directories, computes the input splits for each directory, and
> >> generates a new list of InputSplits that is the cross-product of the
> >> two lists. So instead of FileSplit, it would use a FileSplitPair that
> >> gives the FileSplit for dir1 and the FileSplit for dir2, and the
> >> record reader would return a TextPair with left and right records
> >> (i.e., lines). Clearly, you read the first line of split1 and cross
> >> it with each line from split2, then move to the second line of split1
> >> and process each line from split2, etc.
> >>
> >> You'll need to ensure that you don't overwhelm the system with too
> >> many input splits (i.e., maps). Also don't forget that N^2/M grows
> >> much faster with the size of the input (N) than a fixed set of M
> >> machines can handle in a fixed amount of time.
> >>
> >>> Two input directories
> >>>
> >>> 1. input1 directory with a single file of 600K records - FileA
> >>> 2. input2 directory segmented into different files with 2Million
> >>> records -
> >>> FileB1, FileB2 etc.
> >>
> >> In this particular case, it would be right to load all of FileA into
> >> memory and process the chunks of FileB/part-*. Then it would be much
> >> faster than needing to re-read the file over and over again, but
> >> otherwise it would be the same.
> >>
> >> -- Owen
> >>
> >>
> >
> >
>
> --
> View this message in context:
> http://www.nabble.com/multiple-file-input-tp24095358p24119228.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>

Re: multiple file input

Posted by pmg <pa...@gmail.com>.
For the sake of simplicity I have reduced my input to two files:
1. FileA 2. FileB

As I said earlier, I want to compare every record of FileA against every
record in FileB. I know this is N^2, but that is the process. I wrote a
simple InputFormat and RecordReader. It seems each file is read serially,
one after another. How can my RecordReader have a reference to both files
at the same time so that I can build the cross list of FileA and FileB
for the mapper?

Basically, the way I see it is to give the mapper one record from FileA
and all records from FileB so that the mapper can do the N^2 comparisons
and forward the results to the reducer.

thanks



pmg wrote:
> 
> Thanks Owen. Are there any examples that I can look at?
> 
> 
> 
> owen.omalley wrote:
>> 
>> On Jun 18, 2009, at 10:56 AM, pmg wrote:
>> 
>>> Each line from FileA gets compared with every line from FileB1,  
>>> FileB2 etc.
>>> etc. FileB1, FileB2 etc. are in a different input directory
>> 
>> In the general case, I'd define an InputFormat that takes two
>> directories, computes the input splits for each directory, and
>> generates a new list of InputSplits that is the cross-product of the
>> two lists. So instead of FileSplit, it would use a FileSplitPair that
>> gives the FileSplit for dir1 and the FileSplit for dir2, and the
>> record reader would return a TextPair with left and right records
>> (i.e., lines). Clearly, you read the first line of split1 and cross
>> it with each line from split2, then move to the second line of split1
>> and process each line from split2, etc.
>>
>> You'll need to ensure that you don't overwhelm the system with too
>> many input splits (i.e., maps). Also don't forget that N^2/M grows
>> much faster with the size of the input (N) than a fixed set of M
>> machines can handle in a fixed amount of time.
>> 
>>> Two input directories
>>>
>>> 1. input1 directory with a single file of 600K records - FileA
>>> 2. input2 directory segmented into different files with 2Million  
>>> records -
>>> FileB1, FileB2 etc.
>> 
>> In this particular case, it would be right to load all of FileA into  
>> memory and process the chunks of FileB/part-*. Then it would be much  
>> faster than needing to re-read the file over and over again, but  
>> otherwise it would be the same.
>> 
>> -- Owen
>> 
>> 
> 
> 

-- 
View this message in context: http://www.nabble.com/multiple-file-input-tp24095358p24119228.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.


Re: multiple file input

Posted by pmg <pa...@gmail.com>.
Thanks Owen. Are there any examples that I can look at?



owen.omalley wrote:
> 
> On Jun 18, 2009, at 10:56 AM, pmg wrote:
> 
>> Each line from FileA gets compared with every line from FileB1,  
>> FileB2 etc.
>> etc. FileB1, FileB2 etc. are in a different input directory
> 
> In the general case, I'd define an InputFormat that takes two
> directories, computes the input splits for each directory, and
> generates a new list of InputSplits that is the cross-product of the
> two lists. So instead of FileSplit, it would use a FileSplitPair that
> gives the FileSplit for dir1 and the FileSplit for dir2, and the
> record reader would return a TextPair with left and right records
> (i.e., lines). Clearly, you read the first line of split1 and cross
> it with each line from split2, then move to the second line of split1
> and process each line from split2, etc.
>
> You'll need to ensure that you don't overwhelm the system with too
> many input splits (i.e., maps). Also don't forget that N^2/M grows
> much faster with the size of the input (N) than a fixed set of M
> machines can handle in a fixed amount of time.
> 
>> Two input directories
>>
>> 1. input1 directory with a single file of 600K records - FileA
>> 2. input2 directory segmented into different files with 2Million  
>> records -
>> FileB1, FileB2 etc.
> 
> In this particular case, it would be right to load all of FileA into  
> memory and process the chunks of FileB/part-*. Then it would be much  
> faster than needing to re-read the file over and over again, but  
> otherwise it would be the same.
> 
> -- Owen
> 
> 

-- 
View this message in context: http://www.nabble.com/multiple-file-input-tp24095358p24105398.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.


Re: multiple file input

Posted by Owen O'Malley <om...@apache.org>.
On Jun 18, 2009, at 10:56 AM, pmg wrote:

> Each line from FileA gets compared with every line from FileB1,  
> FileB2 etc.
> etc. FileB1, FileB2 etc. are in a different input directory

In the general case, I'd define an InputFormat that takes two
directories, computes the input splits for each directory, and
generates a new list of InputSplits that is the cross-product of the
two lists. So instead of FileSplit, it would use a FileSplitPair that
gives the FileSplit for dir1 and the FileSplit for dir2, and the
record reader would return a TextPair with left and right records
(i.e., lines). Clearly, you read the first line of split1 and cross
it with each line from split2, then move to the second line of split1
and process each line from split2, etc.

You'll need to ensure that you don't overwhelm the system with too
many input splits (i.e., maps). Also don't forget that N^2/M grows
much faster with the size of the input (N) than a fixed set of M
machines can handle in a fixed amount of time.
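
A rough skeleton of that idea, against the new org.apache.hadoop.mapreduce
API: the configuration keys "cross.left.dir" and "cross.right.dir" are
invented for the example, and the Writable serialization of the pair split
plus the paired TextPair record reader are deliberately omitted:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Pairs every split of dir1 with every split of dir2, so each map task
// sees one (left, right) chunk combination.
public class CrossProductInputFormat extends TextInputFormat {

  // One split of dir1 paired with one split of dir2. A real version must
  // also implement Writable so the framework can ship it to the task.
  public static class FileSplitPair extends InputSplit {
    final FileSplit left;
    final FileSplit right;

    FileSplitPair(FileSplit left, FileSplit right) {
      this.left = left;
      this.right = right;
    }

    @Override
    public long getLength() throws IOException, InterruptedException {
      return left.getLength() + right.getLength();
    }

    @Override
    public String[] getLocations() throws IOException, InterruptedException {
      return right.getLocations();  // chase locality for one side only
    }
  }

  @Override
  public List<InputSplit> getSplits(JobContext job) throws IOException {
    List<InputSplit> pairs = new ArrayList<InputSplit>();
    for (InputSplit a : splitsFor(job, "cross.left.dir")) {
      for (InputSplit b : splitsFor(job, "cross.right.dir")) {
        pairs.add(new FileSplitPair((FileSplit) a, (FileSplit) b));
      }
    }
    return pairs;  // beware: |left splits| x |right splits| map tasks
  }

  // Delegate to plain TextInputFormat for one directory at a time.
  private List<InputSplit> splitsFor(JobContext job, String dirKey)
      throws IOException {
    Job scratch = new Job(new Configuration(job.getConfiguration()));
    FileInputFormat.setInputPaths(scratch,
        new Path(job.getConfiguration().get(dirKey)));
    return new TextInputFormat().getSplits(scratch);
  }

  @Override
  public RecordReader<LongWritable, Text> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    // The paired reader (yielding left and right lines by replaying the
    // right split once per left line) is omitted from this sketch.
    throw new UnsupportedOperationException("record reader not sketched");
  }
}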

> Two input directories
>
> 1. input1 directory with a single file of 600K records - FileA
> 2. input2 directory segmented into different files with 2Million  
> records -
> FileB1, FileB2 etc.

In this particular case, it would be right to load all of FileA into  
memory and process the chunks of FileB/part-*. Then it would be much  
faster than needing to re-read the file over and over again, but  
otherwise it would be the same.
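
A minimal sketch of this special case, assuming FileA is distributed to
every node via the DistributedCache as "fileA.txt" and compare() stands in
for whatever per-pair logic the application needs:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// FileA (the 600K-record side) is loaded whole in setup(); the job input
// is FileB/part-*, so each map() call sees one FileB line and compares it
// against every cached FileA line. No reduce phase is required.
public class CrossCompareMapper
    extends Mapper<LongWritable, Text, Text, Text> {

  private final List<String> fileA = new ArrayList<String>();

  @Override
  protected void setup(Context context) throws IOException {
    // Assumption: "fileA.txt" is symlinked into the task's working
    // directory via the DistributedCache.
    BufferedReader in = new BufferedReader(new FileReader("fileA.txt"));
    try {
      String line;
      while ((line = in.readLine()) != null) {
        fileA.add(line);  // ~600K lines comfortably fit in task memory
      }
    } finally {
      in.close();
    }
  }

  @Override
  protected void map(LongWritable offset, Text bLine, Context context)
      throws IOException, InterruptedException {
    // Each FileB line meets every FileA line exactly once across the job.
    for (String aLine : fileA) {
      if (compare(aLine, bLine.toString())) {
        context.write(new Text(aLine), bLine);
      }
    }
  }

  // Placeholder for the application-specific record comparison.
  private boolean compare(String a, String b) {
    return a.equals(b);  // stand-in; the real logic lives here
  }
}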

-- Owen