Posted to common-user@hadoop.apache.org by pmg <pa...@gmail.com> on 2009/06/18 19:56:13 UTC

multiple file input

I am evaluating hadoop for a problem that does a Cartesian product of input
from one file of 600K records (FileA) with another set of files (FileB1, FileB2,
FileB3) containing 2 million lines in total.

Each line from FileA gets compared with every line from FileB1, FileB2, etc.
FileB1, FileB2, etc. are in a different input directory.

So....

Two input directories 

1. input1 directory with a single file of 600K records - FileA
2. input2 directory segmented into different files with 2 million records in
total - FileB1, FileB2, etc.

How can I have a map that reads a line from FileA in directory input1 and
compares the line with each line from input2?

What is the best way forward? I have seen plenty of examples that map each
record from a single input file and reduce into a single output.

thanks


Re: multiple file input

Posted by Gang Luo <lg...@yahoo.com.cn>.
To do the Cartesian product, every node has to see at least one table completely. So what I suggest is to make input2 the input to the mappers, and in each map task read the whole FileA from input1 manually using the HDFS API (since it is the smaller input) and build a hash table on FileA. For each line from input2, you match it against the lines in the hash table and join them. This is actually a map-side join, which needs only a map phase.
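
A minimal sketch of this map-side approach, against the org.apache.hadoop.mapreduce API (the class name and the configuration key "crossjoin.filea" are made up for illustration, not a tested implementation):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-only cross join: input2 is the job input; FileA is loaded once per task.
public class CrossJoinMapper
    extends Mapper<LongWritable, Text, Text, NullWritable> {

  private final List<String> fileALines = new ArrayList<String>();

  @Override
  protected void setup(Context context) throws IOException {
    // Read the smaller FileA from HDFS into memory ("crossjoin.filea" is a
    // made-up key; set it to the FileA path when configuring the job).
    FileSystem fs = FileSystem.get(context.getConfiguration());
    Path fileA = new Path(context.getConfiguration().get("crossjoin.filea"));
    BufferedReader reader =
        new BufferedReader(new InputStreamReader(fs.open(fileA)));
    try {
      String line;
      while ((line = reader.readLine()) != null) {
        fileALines.add(line);
      }
    } finally {
      reader.close();
    }
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Pair the current input2 line with every line of FileA; the actual
    // comparison/join logic would go here.
    for (String aLine : fileALines) {
      context.write(new Text(aLine + "\t" + value), NullWritable.get());
    }
  }
}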

 -Gang


----- Original Message ----
From: Ed Kohlwey <ek...@gmail.com>
To: common-user@hadoop.apache.org
Sent: 2009/12/8 (Tue) 10:14:51 AM
Subject: Re: multiple file input

One important thing to note is that, with cross products, you'll almost
always get better performance if you can fit both files on a single node's
disk rather than distributing the files.

On Tue, Dec 8, 2009 at 9:18 AM, laser08150815 <la...@laserxyz.de> wrote:

>
>
> pmg wrote:
> >
> > I am evaluating hadoop for a problem that do a Cartesian product of input
> > from one file of 600K (File A) with another set of file set (FileB1,
> > FileB2, FileB3) with 2 millions line in total.
> >
> > Each line from FileA gets compared with every line from FileB1, FileB2
> > etc. etc. FileB1, FileB2 etc. are in a different input directory
> >
> > So....
> >
> > Two input directories
> >
> > 1. input1 directory with a single file of 600K records - FileA
> > 2. input2 directory segmented into different files with 2Million records
> -
> > FileB1, FileB2 etc.
> >
> > How can I have a map that reads a line from a FileA in directory input1
> > and compares the line with each line from input2?
> >
> > What is the best way forward? I have seen plenty of examples that maps
> > each record from single input file and reduces into an output forward.
> >
> > thanks
> >
>
>
> I had a similar problem and solved it by writing a custom InputFormat (see
> attachment). You should improve the methods ACrossBInputSplit.getLength,
> ACrossBRecordReader.getPos and ACrossBRecordReader.getProgress.
>
>




Hadoop Pipes with distributed cache

Posted by Upendra Dadi <ud...@gmu.edu>.
Hi,
  I am facing some problems using a distributed cache archive with a Pipes
job. In my configuration file I have the following two properties:

<property>
  <name>mapred.create.symlink</name>
  <value>yes</value>
</property>

<property>
  <name>mapred.cache.archives</name>
  <value>hdfs://localhost:9000/user/upendra/archive/pipeArchive.zip#pipeSym</value>
</property>

The zip archive contains two folders, lib and share. The lib folder contains
some shared libraries. In my Pipes C++ code, I have added the following
statements to use the shared libraries:

#include <dlfcn.h>

int main(int argc, char *argv[]) {
  // "pipeSym" is the symlink created for the distributed cache archive.
  void *handle = dlopen("pipeSym/lib/libmant.so.1", RTLD_LAZY);
  ...

The shared library is not getting loaded during execution; it spits out an
error that the shared library is not found. What is the problem with the above
steps? Can anyone please shed some light on what might be causing the
problem? Thanks.

Upendra


Re: multiple file input

Posted by Ed Kohlwey <ek...@gmail.com>.
One important thing to note is that, with cross products, you'll almost
always get better performance if you can fit both files on a single node's
disk rather than distributing the files.

On Tue, Dec 8, 2009 at 9:18 AM, laser08150815 <la...@laserxyz.de> wrote:

>
>
> pmg wrote:
> >
> > I am evaluating hadoop for a problem that do a Cartesian product of input
> > from one file of 600K (File A) with another set of file set (FileB1,
> > FileB2, FileB3) with 2 millions line in total.
> >
> > Each line from FileA gets compared with every line from FileB1, FileB2
> > etc. etc. FileB1, FileB2 etc. are in a different input directory
> >
> > So....
> >
> > Two input directories
> >
> > 1. input1 directory with a single file of 600K records - FileA
> > 2. input2 directory segmented into different files with 2Million records
> -
> > FileB1, FileB2 etc.
> >
> > How can I have a map that reads a line from a FileA in directory input1
> > and compares the line with each line from input2?
> >
> > What is the best way forward? I have seen plenty of examples that maps
> > each record from single input file and reduces into an output forward.
> >
> > thanks
> >
>
>
> I had a similar problem and solved it by writing a custom InputFormat (see
> attachment). You should improve the methods ACrossBInputSplit.getLength,
> ACrossBRecordReader.getPos and ACrossBRecordReader.getProgress.
>
>

Re: multiple file input

Posted by laser08150815 <la...@laserxyz.de>.

pmg wrote:
> 
> I am evaluating hadoop for a problem that do a Cartesian product of input
> from one file of 600K (File A) with another set of file set (FileB1,
> FileB2, FileB3) with 2 millions line in total.
> 
> Each line from FileA gets compared with every line from FileB1, FileB2
> etc. etc. FileB1, FileB2 etc. are in a different input directory
> 
> So....
> 
> Two input directories 
> 
> 1. input1 directory with a single file of 600K records - FileA
> 2. input2 directory segmented into different files with 2Million records -
> FileB1, FileB2 etc.
> 
> How can I have a map that reads a line from a FileA in directory input1
> and compares the line with each line from input2? 
> 
> What is the best way forward? I have seen plenty of examples that maps
> each record from single input file and reduces into an output forward.
> 
> thanks
> 


I had a similar problem and solved it by writing a custom InputFormat (see
attachment). You should improve the methods ACrossBInputSplit.getLength,
ACrossBRecordReader.getPos and ACrossBRecordReader.getProgress.


Re: multiple file input

Posted by Erik Paulson <ep...@cs.wisc.edu>.
On Thu, Jun 18, 2009 at 01:36:14PM -0700, Owen O'Malley wrote:
> On Jun 18, 2009, at 10:56 AM, pmg wrote:
> 
> >Each line from FileA gets compared with every line from FileB1,  
> >FileB2 etc.
> >etc. FileB1, FileB2 etc. are in a different input directory
> 
> In the general case, I'd define an InputFormat that takes two  
> directories, computes the input splits for each directory and  
> generates a new list of InputSplits that is the cross-product of the  
> two lists. So instead of FileSplit, it would use a FileSplitPair that  
> gives the FileSplit for dir1 and the FileSplit for dir2 and the record  
> reader would return a TextPair with left and right records (ie.  
> lines). Clearly, you read the first line of split1 and cross it by  
> each line from split2, then move to the second line of split1 and  
> process each line from split2, etc.
> 
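
A minimal sketch of the FileSplitPair Owen describes (a hypothetical class; the
matching InputFormat and the TextPair record reader are omitted), against the
org.apache.hadoop.mapreduce API:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// A split carrying one FileSplit from each input directory.
public class FileSplitPair extends InputSplit implements Writable {

  private FileSplit left = new FileSplit();   // split from dir1
  private FileSplit right = new FileSplit();  // split from dir2

  public FileSplitPair() {}  // no-arg constructor, needed for deserialization

  public FileSplitPair(FileSplit left, FileSplit right) {
    this.left = left;
    this.right = right;
  }

  @Override
  public long getLength() throws IOException, InterruptedException {
    return left.getLength() + right.getLength();
  }

  @Override
  public String[] getLocations() throws IOException, InterruptedException {
    return left.getLocations();  // steer the task toward the left split's blocks
  }

  @Override
  public void write(DataOutput out) throws IOException {
    left.write(out);
    right.write(out);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    left.readFields(in);
    right.readFields(in);
  }

  // Cross-product of split lists computed separately for the two directories.
  public static List<InputSplit> cross(List<InputSplit> dir1Splits,
                                       List<InputSplit> dir2Splits) {
    List<InputSplit> pairs = new ArrayList<InputSplit>();
    for (InputSplit l : dir1Splits) {
      for (InputSplit r : dir2Splits) {
        pairs.add(new FileSplitPair((FileSplit) l, (FileSplit) r));
      }
    }
    return pairs;
  }
}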

Out of curiosity, how does Hadoop schedule tasks when a task needs
multiple inputs and the data for a task is on different nodes?  How does
it decide which node will be more "local" and should have the task
steered to it?

-Erik


Re: multiple file input

Posted by pmg <pa...@gmail.com>.
At this time I don't see any way to do this in map/reduce. I guess I have to
go back to plan A: writing a primitive Java tool, scaled horizontally using
Java executors and processed vertically across multiple machines.



pmg wrote:
> 
> First make smaller chunks of your big files (small enough that one chunk
> can
> be stored in memory). Hadoop's block size is set to 64MB by default. If
> this
> seems ok according to the RAM you have, then simply run Identity Mapper
> only
> job on for both Files A and B. The output will be smaller files with the
> names part-0001, part-0002 etc. For simplicity let us call chunks of File A
> as A1, A2, A3... and chunks of B as B1, B2, B3
> 
>>> I am planning to run this on amazon elastic map with large cpu so RAM I
>>> think would not be a problem.
> I can have smaller input files outside map/reduce so I guess we don't have
> to run this phase to get small file chunks as A1, A2, A3... and chunks of
> B as B1, B2, B3
> 
> Create a file (or write a program that will generate this file) that
> contains the cross product of these chunks-
> A1 B1
> A1 B2
> A1 B3
> ..
> A2 B1
> A2 B2
> A2 B3
> ..
> 
>>> Correct me If I am wrong. the actual FileA that gets divided into chunks
>>> A1,A2...has around 600K file records. FileB that gets divided into B1,
>>> B2....has around 2 million file record. So I guess we looking at file
>>> record size of cartesian product of 600K * 2Millions. We are looking at
>>> peta bytes of data. This would be a hard sell :)
> 
> 
> Tarandeep wrote:
>> 
>> hey I think I got your question wrong. My solution won't let you achieve
>> what you intended. your example made it clear.
>> 
>> Since it is a cross product, the contents of one of the files has to be
>> in
>> memory for iteration, but since size is big, so might not be possible, so
>> how about this solution and this will scale too-
>> 
>> First make smaller chunks of your big files (small enough that one chunk
>> can
>> be stored in memory). Hadoop's block size is set to 64MB by default. If
>> this
>> seems ok according to the RAM you have, then simply run Identity Mapper
>> only
>> job on for both Files A and B. The output will be smaller files with the
>> names part-0001, part-0002 etc. For simplicity let us call chunks of File
>> A
>> as A1, A2, A3... and chunks of B as B1, B2, B3
>> 
>> Create a file (or write a program that will generate this file) that
>> contains the cross product of these chunks-
>> A1 B1
>> A1 B2
>> A1 B3
>> ..
>> A2 B1
>> A2 B2
>> A2 B3
>> ..
>> 
>> Now run a Map only job (no reducer). Use NLineInputFormat and set N = 1.
>> give input to your job this file. NLineInputFormat will give each mapper
>> a
>> line from this file. So for example, lets say a mapper got the line A1
>> B3,
>> which means take cross product of the contents of chunk A1 and chunk B3.
>> 
>> you can read one of the chunk completely and store in memory as a list or
>> array. And then read second chunk and do the comparison.
>> 
>> Now, as you would have guessed, instead of creating chunks, you can
>> actually
>> calculate offsets in the files (after an interval of say 64MB) and can
>> achieve the same effect. HDFS allows seeking to an offset in a file so
>> that
>> will work too.
>> 
>> -Tarandeep
>> 
>> 
>> 
>> On Fri, Jun 19, 2009 at 4:33 PM, pmg <pa...@gmail.com> wrote:
>> 
>>>
>>> Thanks Tarandeep for prompt reply.
>>>
>>> Let me give you an example structure of FileA and FileB
>>>
>>> FileA
>>> -------
>>>
>>> 123 ABC 1
>>>
>>>
>>> FileB
>>> -----
>>> 123 ABC 2
>>> 456 BNF 3
>>>
>>> Both the files are tab delimited. Every record is not simply compared
>>> with
>>> each record in FileB. There's heuristic I am going to run for the
>>> comparison
>>> and score the results along with output. So my output file is like this
>>>
>>> Output
>>> --------
>>>
>>> 123 ABC 1 123 ABC 2 10
>>> 123 ABC 1 456 BNF 3 20
>>>
>>> first 3 columns in the output file are from FileA, next three columns
>>> are
>>> from FileB and the last column is their comparison score.
>>>
>>> So basically you are saying we can use two map/reduce jobs for FileA and
>>> other for FileB
>>>
>>> map (FileA) -> reduce (FileA)-> map (FileB) -> reduce (FileB)
>>>
>>> For the first file FileA I map them with <k,V> (I can't use bloom filter
>>> because comparison between each record from FileA is not a straight
>>> comparison with every record in FileB - They are compared using
>>> heuristic
>>> and scored them for their quantitative comparison and stored)
>>>
>>> In the FileA reduce I store it in the distributed cache. Once this is
>>> done
>>> map the FileB in the second map and in the FileB reduce read in the
>>> FileA
>>> from the distributed cache and do my heuristics for every <K,V) from
>>> FileB
>>> and store my result
>>>
>>> thanks
>>>
>>>
>>> Tarandeep wrote:
>>> >
>>> > oh my bad, I was not clear-
>>> >
>>> > For FileB, you will be running a second map reduce job. In mapper, you
>>> can
>>> > use the Bloom Filter, created in first map reduce job (if you wish to
>>> use)
>>> > to eliminate the lines whose keys dont match. Mapper will emit
>>> key,value
>>> > pair, where key is teh field on which you want to do comparison and
>>> value
>>> > is
>>> > the whole line.
>>> >
>>> > when the key,value pairs go to reducers, then you have lines from
>>> FileB
>>> > sorted on the field yon want to use for comparison. Now you can read
>>> > contents of FileA (note that if you ran first job with N reducers, you
>>> > will
>>> > have N paritions of FileA and you want to read only the partition
>>> meant
>>> > for
>>> > this reducer). Content of FileA is also sorted on the field, Now you
>>> can
>>> > easily compare the lines from two files.
>>> >
>>> > CloudBase- cloudbase.sourceforge.net has code for doing join this
>>> fashion.
>>> >
>>> > Let me know if you need more clarification.
>>> >
>>> > -Tarandeep
>>> >
>>> > On Fri, Jun 19, 2009 at 3:45 PM, pmg <pa...@gmail.com> wrote:
>>> >
>>> >>
>>> >> thanks tarandeep
>>> >>
>>> >> Correct if I am wrong that when I map FileA mapper created key,value
>>> pair
>>> >> and sends across to the reducer. If so then how can I compare when
>>> FileB
>>> >> is
>>> >> not even mapped yet.
>>> >>
>>> >>
>>> >> Tarandeep wrote:
>>> >> >
>>> >> > On Fri, Jun 19, 2009 at 2:41 PM, pmg <pa...@gmail.com>
>>> wrote:
>>> >> >
>>> >> >>
>>> >> >> For the sake of simplification I have simplified my input into two
>>> >> files
>>> >> >> 1.
>>> >> >> FileA 2. FileB
>>> >> >>
>>> >> >> As I said earlier I want to compare every record of FileA against
>>> >> every
>>> >> >> record in FileB I know this is n2 but this is the process. I wrote
>>> a
>>> >> >> simple
>>> >> >> InputFormat and RecordReader. It seems each file is read serially
>>> one
>>> >> >> after
>>> >> >> another. How can my record read have reference to both files at
>>> the
>>> >> same
>>> >> >> line so that I can create cross list of FileA and FileB for the
>>> >> mapper.
>>> >> >>
>>> >> >> Basically the way I see is to get mapper one record from FileA and
>>> all
>>> >> >> records from FileB so that mapper can compare n2 and forward them
>>> to
>>> >> >> reducer.
>>> >> >
>>> >> >
>>> >> > It will be hard (and inefficient) to do this in Mapper using some
>>> >> custom
>>> >> > intput format. What you can do is use Semi Join technique-
>>> >> >
>>> >> > Since File A is smaller, run a map reduce job that will output
>>> >> key,value
>>> >> > pair where key is the field or set of fields on which you want to
>>> do
>>> >> the
>>> >> > comparison and value is the whole line.
>>> >> >
>>> >> > The reducer is simply an Identity reducer which writes the files.
>>> So
>>> >> your
>>> >> > fileA has been partitioned on the field(s). you can also create
>>> bloom
>>> >> > filter
>>> >> > on this field and store it in Distributed Cache.
>>> >> >
>>> >> > Now read FileB, load Bloom filter into memory and see if the field
>>> from
>>> >> > line
>>> >> > of FileB is present in Bloom filter, if yes emit Key,Value pair
>>> else
>>> >> not.
>>> >> >
>>> >> > At reducers, you get the contents of FileB partitioned just like
>>> >> contents
>>> >> > of
>>> >> > fileA were partitioned and at a particular reducer you get lines
>>> sorted
>>> >> on
>>> >> > the field you want to do the comparison, At this point you read the
>>> >> > contents
>>> >> > of FileA that reached this reducer and since its contents were
>>> sorted
>>> >> as
>>> >> > well, you can quickly go over the two lists.
>>> >> >
>>> >> > -Tarandeep
>>> >> >
>>> >> >>
>>> >> >>
>>> >> >> thanks
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >> pmg wrote:
>>> >> >> >
>>> >> >> > Thanks owen. Are there any examples that I can look at?
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > owen.omalley wrote:
>>> >> >> >>
>>> >> >> >> On Jun 18, 2009, at 10:56 AM, pmg wrote:
>>> >> >> >>
>>> >> >> >>> Each line from FileA gets compared with every line from
>>> FileB1,
>>> >> >> >>> FileB2 etc.
>>> >> >> >>> etc. FileB1, FileB2 etc. are in a different input directory
>>> >> >> >>
>>> >> >> >> In the general case, I'd define an InputFormat that takes two
>>> >> >> >> directories, computes the input splits for each directory and
>>> >> >> >> generates a new list of InputSplits that is the cross-product
>>> of
>>> >> the
>>> >> >> >> two lists. So instead of FileSplit, it would use a
>>> FileSplitPair
>>> >> that
>>> >> >> >> gives the FileSplit for dir1 and the FileSplit for dir2 and the
>>> >> record
>>> >> >> >> reader would return a TextPair with left and right records (ie.
>>> >> >> >> lines). Clearly, you read the first line of split1 and cross it
>>> by
>>> >> >> >> each line from split2, then move to the second line of split1
>>> and
>>> >> >> >> process each line from split2, etc.
>>> >> >> >>
>>> >> >> >> You'll need to ensure that you don't overwhelm the system with
>>> >> either
>>> >> >> >> too many input splits (ie. maps). Also don't forget that N^2/M
>>> >> grows
>>> >> >> >> much faster with the size of the input (N) than the M machines
>>> can
>>> >> >> >> handle in a fixed amount of time.
>>> >> >> >>
>>> >> >> >>> Two input directories
>>> >> >> >>>
>>> >> >> >>> 1. input1 directory with a single file of 600K records - FileA
>>> >> >> >>> 2. input2 directory segmented into different files with
>>> 2Million
>>> >> >> >>> records -
>>> >> >> >>> FileB1, FileB2 etc.
>>> >> >> >>
>>> >> >> >> In this particular case, it would be right to load all of FileA
>>> >> into
>>> >> >> >> memory and process the chunks of FileB/part-*. Then it would be
>>> >> much
>>> >> >> >> faster than needing to re-read the file over and over again,
>>> but
>>> >> >> >> otherwise it would be the same.
>>> >> >> >>
>>> >> >> >> -- Owen
>>> >> >> >>
>>> >> >> >>
>>> >> >> >
>>> >> >> >
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >
>>> >> >
>>> >>
>>> >>
>>> >>
>>> >
>>> >
>>>
>>>
>>>
>> 
>> 
> 
> 



Re: multiple file input

Posted by pmg <pa...@gmail.com>.
First make smaller chunks of your big files (small enough that one chunk can
be stored in memory). Hadoop's block size is set to 64MB by default. If this
seems ok according to the RAM you have, then simply run an Identity-Mapper-only
job for both Files A and B. The output will be smaller files with the
names part-0001, part-0002, etc. For simplicity let us call the chunks of File A
A1, A2, A3... and the chunks of B B1, B2, B3.

>> I am planning to run this on Amazon Elastic MapReduce with large-CPU
>> instances, so RAM I think would not be a problem.
I can have smaller input files outside map/reduce, so I guess we don't have
to run this phase to get the small file chunks A1, A2, A3... and the chunks of
B B1, B2, B3.

Create a file (or write a program that will generate this file) that
contains the cross product of these chunks-
A1 B1
A1 B2
A1 B3
..
A2 B1
A2 B2
A2 B3
..

>> Correct me if I am wrong: the actual FileA that gets divided into chunks
>> A1, A2... has around 600K records. FileB that gets divided into B1,
>> B2... has around 2 million records. So I guess we are looking at a
>> Cartesian product of 600K * 2 million, i.e. about 1.2 * 10^12 output
>> records. We are looking at petabytes of data. This would be a hard sell :)


Tarandeep wrote:
> 
> hey I think I got your question wrong. My solution won't let you achieve
> what you intended. your example made it clear.
> 
> Since it is a cross product, the contents of one of the files has to be in
> memory for iteration, but since size is big, so might not be possible, so
> how about this solution and this will scale too-
> 
> First make smaller chunks of your big files (small enough that one chunk
> can
> be stored in memory). Hadoop's block size is set to 64MB by default. If
> this
> seems ok according to the RAM you have, then simply run Identity Mapper
> only
> job on for both Files A and B. The output will be smaller files with the
> names part-0001, part-0002 etc. For simplicity let us call chunks of File A
> as A1, A2, A3... and chunks of B as B1, B2, B3
> 
> Create a file (or write a program that will generate this file) that
> contains the cross product of these chunks-
> A1 B1
> A1 B2
> A1 B3
> ..
> A2 B1
> A2 B2
> A2 B3
> ..
> 
> Now run a Map only job (no reducer). Use NLineInputFormat and set N = 1.
> give input to your job this file. NLineInputFormat will give each mapper a
> line from this file. So for example, lets say a mapper got the line A1 B3,
> which means take cross product of the contents of chunk A1 and chunk B3.
> 
> you can read one of the chunk completely and store in memory as a list or
> array. And then read second chunk and do the comparison.
> 
> Now, as you would have guessed, instead of creating chunks, you can
> actually
> calculate offsets in the files (after an interval of say 64MB) and can
> achieve the same effect. HDFS allows seeking to an offset in a file so
> that
> will work too.
> 
> -Tarandeep
> 
> 
> 
> On Fri, Jun 19, 2009 at 4:33 PM, pmg <pa...@gmail.com> wrote:
> 
>>
>> Thanks Tarandeep for prompt reply.
>>
>> Let me give you an example structure of FileA and FileB
>>
>> FileA
>> -------
>>
>> 123 ABC 1
>>
>>
>> FileB
>> -----
>> 123 ABC 2
>> 456 BNF 3
>>
>> Both the files are tab delimited. Every record is not simply compared
>> with
>> each record in FileB. There's heuristic I am going to run for the
>> comparison
>> and score the results along with output. So my output file is like this
>>
>> Output
>> --------
>>
>> 123 ABC 1 123 ABC 2 10
>> 123 ABC 1 456 BNF 3 20
>>
>> first 3 columns in the output file are from FileA, next three columns are
>> from FileB and the last column is their comparison score.
>>
>> So basically you are saying we can use two map/reduce jobs for FileA and
>> other for FileB
>>
>> map (FileA) -> reduce (FileA)-> map (FileB) -> reduce (FileB)
>>
>> For the first file FileA I map them with <k,V> (I can't use bloom filter
>> because comparison between each record from FileA is not a straight
>> comparison with every record in FileB - They are compared using heuristic
>> and scored them for their quantitative comparison and stored)
>>
>> In the FileA reduce I store it in the distributed cache. Once this is
>> done
>> map the FileB in the second map and in the FileB reduce read in the FileA
>> from the distributed cache and do my heuristics for every <K,V) from
>> FileB
>> and store my result
>>
>> thanks
>>
>>
>> Tarandeep wrote:
>> >
>> > oh my bad, I was not clear-
>> >
>> > For FileB, you will be running a second map reduce job. In mapper, you
>> can
>> > use the Bloom Filter, created in first map reduce job (if you wish to
>> use)
>> > to eliminate the lines whose keys dont match. Mapper will emit
>> key,value
>> > pair, where key is teh field on which you want to do comparison and
>> value
>> > is
>> > the whole line.
>> >
>> > when the key,value pairs go to reducers, then you have lines from FileB
>> > sorted on the field yon want to use for comparison. Now you can read
>> > contents of FileA (note that if you ran first job with N reducers, you
>> > will
>> > have N paritions of FileA and you want to read only the partition meant
>> > for
>> > this reducer). Content of FileA is also sorted on the field, Now you
>> can
>> > easily compare the lines from two files.
>> >
>> > CloudBase- cloudbase.sourceforge.net has code for doing join this
>> fashion.
>> >
>> > Let me know if you need more clarification.
>> >
>> > -Tarandeep
>> >
>> > On Fri, Jun 19, 2009 at 3:45 PM, pmg <pa...@gmail.com> wrote:
>> >
>> >>
>> >> thanks tarandeep
>> >>
>> >> Correct if I am wrong that when I map FileA mapper created key,value
>> pair
>> >> and sends across to the reducer. If so then how can I compare when
>> FileB
>> >> is
>> >> not even mapped yet.
>> >>
>> >>
>> >> Tarandeep wrote:
>> >> >
>> >> > On Fri, Jun 19, 2009 at 2:41 PM, pmg <pa...@gmail.com> wrote:
>> >> >
>> >> >>
>> >> >> For the sake of simplification I have simplified my input into two
>> >> files
>> >> >> 1.
>> >> >> FileA 2. FileB
>> >> >>
>> >> >> As I said earlier I want to compare every record of FileA against
>> >> every
>> >> >> record in FileB I know this is n2 but this is the process. I wrote
>> a
>> >> >> simple
>> >> >> InputFormat and RecordReader. It seems each file is read serially
>> one
>> >> >> after
>> >> >> another. How can my record read have reference to both files at the
>> >> same
>> >> >> line so that I can create cross list of FileA and FileB for the
>> >> mapper.
>> >> >>
>> >> >> Basically the way I see is to get mapper one record from FileA and
>> all
>> >> >> records from FileB so that mapper can compare n2 and forward them
>> to
>> >> >> reducer.
>> >> >
>> >> >
>> >> > It will be hard (and inefficient) to do this in Mapper using some
>> >> custom
>> >> > intput format. What you can do is use Semi Join technique-
>> >> >
>> >> > Since File A is smaller, run a map reduce job that will output
>> >> key,value
>> >> > pair where key is the field or set of fields on which you want to do
>> >> the
>> >> > comparison and value is the whole line.
>> >> >
>> >> > The reducer is simply an Identity reducer which writes the files. So
>> >> your
>> >> > fileA has been partitioned on the field(s). you can also create
>> bloom
>> >> > filter
>> >> > on this field and store it in Distributed Cache.
>> >> >
>> >> > Now read FileB, load Bloom filter into memory and see if the field
>> from
>> >> > line
>> >> > of FileB is present in Bloom filter, if yes emit Key,Value pair else
>> >> not.
>> >> >
>> >> > At reducers, you get the contents of FileB partitioned just like
>> >> contents
>> >> > of
>> >> > fileA were partitioned and at a particular reducer you get lines
>> sorted
>> >> on
>> >> > the field you want to do the comparison, At this point you read the
>> >> > contents
>> >> > of FileA that reached this reducer and since its contents were
>> sorted
>> >> as
>> >> > well, you can quickly go over the two lists.
>> >> >
>> >> > -Tarandeep
>> >> >
>> >> >>
>> >> >>
>> >> >> thanks
>> >> >>
>> >> >>
>> >> >>
>> >> >> pmg wrote:
>> >> >> >
>> >> >> > Thanks owen. Are there any examples that I can look at?
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> > owen.omalley wrote:
>> >> >> >>
>> >> >> >> On Jun 18, 2009, at 10:56 AM, pmg wrote:
>> >> >> >>
>> >> >> >>> Each line from FileA gets compared with every line from FileB1,
>> >> >> >>> FileB2 etc.
>> >> >> >>> etc. FileB1, FileB2 etc. are in a different input directory
>> >> >> >>
>> >> >> >> In the general case, I'd define an InputFormat that takes two
>> >> >> >> directories, computes the input splits for each directory and
>> >> >> >> generates a new list of InputSplits that is the cross-product of
>> >> the
>> >> >> >> two lists. So instead of FileSplit, it would use a FileSplitPair
>> >> that
>> >> >> >> gives the FileSplit for dir1 and the FileSplit for dir2 and the
>> >> record
>> >> >> >> reader would return a TextPair with left and right records (ie.
>> >> >> >> lines). Clearly, you read the first line of split1 and cross it
>> by
>> >> >> >> each line from split2, then move to the second line of split1
>> and
>> >> >> >> process each line from split2, etc.
>> >> >> >>
>> >> >> >> You'll need to ensure that you don't overwhelm the system with
>> >> either
>> >> >> >> too many input splits (ie. maps). Also don't forget that N^2/M
>> >> grows
>> >> >> >> much faster with the size of the input (N) than the M machines
>> can
>> >> >> >> handle in a fixed amount of time.
>> >> >> >>
>> >> >> >>> Two input directories
>> >> >> >>>
>> >> >> >>> 1. input1 directory with a single file of 600K records - FileA
>> >> >> >>> 2. input2 directory segmented into different files with
>> 2Million
>> >> >> >>> records -
>> >> >> >>> FileB1, FileB2 etc.
>> >> >> >>
>> >> >> >> In this particular case, it would be right to load all of FileA
>> >> into
>> >> >> >> memory and process the chunks of FileB/part-*. Then it would be
>> >> much
>> >> >> >> faster than needing to re-read the file over and over again, but
>> >> >> >> otherwise it would be the same.
>> >> >> >>
>> >> >> >> -- Owen
>> >> >> >>
>> >> >> >>
>> >> >> >
>> >> >> >
>> >> >>
>> >> >>
>> >> >>
>> >> >
>> >> >
>> >>
>> >>
>> >>
>> >
>> >
>>
>>
>>
> 
> 



Re: multiple file input

Posted by Tarandeep Singh <ta...@gmail.com>.
Hey, I think I got your question wrong. My solution won't let you achieve
what you intended; your example made it clear.

Since it is a cross product, the contents of one of the files have to be in
memory for iteration, but since the size is big that might not be possible,
so how about this solution, which will scale too:

First make smaller chunks of your big files (small enough that one chunk can
be stored in memory). Hadoop's block size is set to 64MB by default. If this
seems ok according to the RAM you have, then simply run an Identity-Mapper-only
job for both Files A and B. The output will be smaller files with the names
part-0001, part-0002, etc. For simplicity let us call the chunks of File A
A1, A2, A3... and the chunks of B B1, B2, B3.

Create a file (or write a program that will generate this file) that
contains the cross product of these chunks-
A1 B1
A1 B2
A1 B3
..
A2 B1
A2 B2
A2 B3
..

Now run a map-only job (no reducer). Use NLineInputFormat, set N = 1, and
give this file as the input to your job. NLineInputFormat will give each mapper
a line from this file. So, for example, say a mapper got the line A1 B3,
which means: take the cross product of the contents of chunk A1 and chunk B3.

You can read one of the chunks completely and store it in memory as a list or
array, and then read the second chunk and do the comparison.

Now, as you would have guessed, instead of creating chunks you can actually
calculate offsets into the files (at intervals of, say, 64MB) and achieve the
same effect. HDFS allows seeking to an offset in a file, so that will work
too.
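
A minimal sketch of the driver for that map-only job, in the newer
Job.getInstance style (ChunkPairMapper and the paths are hypothetical; the
mapper would open the two chunks named on its input line via the HDFS API,
as described above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChunkPairDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "chunk-pair cross product");
    job.setJarByClass(ChunkPairDriver.class);
    job.setMapperClass(ChunkPairMapper.class);      // hypothetical mapper class
    job.setNumReduceTasks(0);                       // map-only job
    job.setInputFormatClass(NLineInputFormat.class);
    NLineInputFormat.setNumLinesPerSplit(job, 1);   // one "A1 B3" pair per mapper
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);
    NLineInputFormat.addInputPath(job, new Path("pairs.txt"));  // the pair file
    FileOutputFormat.setOutputPath(job, new Path("cross-out"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}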

-Tarandeep



On Fri, Jun 19, 2009 at 4:33 PM, pmg <pa...@gmail.com> wrote:

>
> Thanks Tarandeep for prompt reply.
>
> Let me give you an example structure of FileA and FileB
>
> FileA
> -------
>
> 123 ABC 1
>
>
> FileB
> -----
> 123 ABC 2
> 456 BNF 3
>
> Both the files are tab delimited. Every record is not simply compared with
> each record in FileB. There's heuristic I am going to run for the
> comparison
> and score the results along with output. So my output file is like this
>
> Output
> --------
>
> 123 ABC 1 123 ABC 2 10
> 123 ABC 1 456 BNF 3 20
>
> first 3 columns in the output file are from FileA, next three columns are
> from FileB and the last column is their comparison score.
>
> So basically you are saying we can use two map/reduce jobs for FileA and
> other for FileB
>
> map (FileA) -> reduce (FileA)-> map (FileB) -> reduce (FileB)
>
> For the first file FileA I map them with <k,V> (I can't use bloom filter
> because comparison between each record from FileA is not a straight
> comparison with every record in FileB - They are compared using heuristic
> and scored them for their quantitative comparison and stored)
>
> In the FileA reduce I store it in the distributed cache. Once this is done
> map the FileB in the second map and in the FileB reduce read in the FileA
> from the distributed cache and do my heuristics for every <K,V) from FileB
> and store my result
>
> thanks
>
>
> Tarandeep wrote:
> >
> > oh my bad, I was not clear-
> >
> > For FileB, you will be running a second map reduce job. In mapper, you
> can
> > use the Bloom Filter, created in first map reduce job (if you wish to
> use)
> > to eliminate the lines whose keys dont match. Mapper will emit key,value
> > pair, where key is teh field on which you want to do comparison and value
> > is
> > the whole line.
> >
> > when the key,value pairs go to reducers, then you have lines from FileB
> > sorted on the field yon want to use for comparison. Now you can read
> > contents of FileA (note that if you ran first job with N reducers, you
> > will
> > have N paritions of FileA and you want to read only the partition meant
> > for
> > this reducer). Content of FileA is also sorted on the field, Now you can
> > easily compare the lines from two files.
> >
> > CloudBase- cloudbase.sourceforge.net has code for doing join this
> fashion.
> >
> > Let me know if you need more clarification.
> >
> > -Tarandeep
> >
> > On Fri, Jun 19, 2009 at 3:45 PM, pmg <pa...@gmail.com> wrote:
> >
> >>
> >> thanks tarandeep
> >>
> >> Correct if I am wrong that when I map FileA mapper created key,value
> pair
> >> and sends across to the reducer. If so then how can I compare when FileB
> >> is
> >> not even mapped yet.
> >>
> >>
> >> Tarandeep wrote:
> >> >
> >> > On Fri, Jun 19, 2009 at 2:41 PM, pmg <pa...@gmail.com> wrote:
> >> >
> >> >>
> >> >> For the sake of simplification I have simplified my input into two
> >> files
> >> >> 1.
> >> >> FileA 2. FileB
> >> >>
> >> >> As I said earlier I want to compare every record of FileA against
> >> every
> >> >> record in FileB I know this is n2 but this is the process. I wrote a
> >> >> simple
> >> >> InputFormat and RecordReader. It seems each file is read serially one
> >> >> after
> >> >> another. How can my record read have reference to both files at the
> >> same
> >> >> line so that I can create cross list of FileA and FileB for the
> >> mapper.
> >> >>
> >> >> Basically the way I see is to get mapper one record from FileA and
> all
> >> >> records from FileB so that mapper can compare n2 and forward them to
> >> >> reducer.
> >> >
> >> >
> >> > It will be hard (and inefficient) to do this in Mapper using some
> >> custom
> >> > intput format. What you can do is use Semi Join technique-
> >> >
> >> > Since File A is smaller, run a map reduce job that will output
> >> key,value
> >> > pair where key is the field or set of fields on which you want to do
> >> the
> >> > comparison and value is the whole line.
> >> >
> >> > The reducer is simply an Identity reducer which writes the files. So
> >> your
> >> > fileA has been partitioned on the field(s). you can also create bloom
> >> > filter
> >> > on this field and store it in Distributed Cache.
> >> >
> >> > Now read FileB, load Bloom filter into memory and see if the field
> from
> >> > line
> >> > of FileB is present in Bloom filter, if yes emit Key,Value pair else
> >> not.
> >> >
> >> > At reducers, you get the contents of FileB partitioned just like
> >> contents
> >> > of
> >> > fileA were partitioned and at a particular reducer you get lines
> sorted
> >> on
> >> > the field you want to do the comparison, At this point you read the
> >> > contents
> >> > of FileA that reached this reducer and since its contents were sorted
> >> as
> >> > well, you can quickly go over the two lists.
> >> >
> >> > -Tarandeep
> >> >
> >> >>
> >> >>
> >> >> thanks
> >> >>
> >> >>
> >> >>
> >> >> pmg wrote:
> >> >> >
> >> >> > Thanks owen. Are there any examples that I can look at?
> >> >> >
> >> >> >
> >> >> >
> >> >> > owen.omalley wrote:
> >> >> >>
> >> >> >> On Jun 18, 2009, at 10:56 AM, pmg wrote:
> >> >> >>
> >> >> >>> Each line from FileA gets compared with every line from FileB1,
> >> >> >>> FileB2 etc.
> >> >> >>> etc. FileB1, FileB2 etc. are in a different input directory
> >> >> >>
> >> >> >> In the general case, I'd define an InputFormat that takes two
> >> >> >> directories, computes the input splits for each directory and
> >> >> >> generates a new list of InputSplits that is the cross-product of
> >> the
> >> >> >> two lists. So instead of FileSplit, it would use a FileSplitPair
> >> that
> >> >> >> gives the FileSplit for dir1 and the FileSplit for dir2 and the
> >> record
> >> >> >> reader would return a TextPair with left and right records (ie.
> >> >> >> lines). Clearly, you read the first line of split1 and cross it by
> >> >> >> each line from split2, then move to the second line of split1 and
> >> >> >> process each line from split2, etc.
> >> >> >>
> >> >> >> You'll need to ensure that you don't overwhelm the system with
> >> either
> >> >> >> too many input splits (ie. maps). Also don't forget that N^2/M
> >> grows
> >> >> >> much faster with the size of the input (N) than the M machines can
> >> >> >> handle in a fixed amount of time.
> >> >> >>
> >> >> >>> Two input directories
> >> >> >>>
> >> >> >>> 1. input1 directory with a single file of 600K records - FileA
> >> >> >>> 2. input2 directory segmented into different files with 2Million
> >> >> >>> records -
> >> >> >>> FileB1, FileB2 etc.
> >> >> >>
> >> >> >> In this particular case, it would be right to load all of FileA
> >> into
> >> >> >> memory and process the chunks of FileB/part-*. Then it would be
> >> much
> >> >> >> faster than needing to re-read the file over and over again, but
> >> >> >> otherwise it would be the same.
> >> >> >>
> >> >> >> -- Owen
> >> >> >>
> >> >> >>
> >> >> >
> >> >> >
> >> >>
> >> >>
> >> >>
> >> >
> >> >
> >>
> >>
> >>
> >
> >
>
>
>

Re: multiple file input

Posted by pmg <pa...@gmail.com>.
Thanks Tarandeep for prompt reply.

Let me give you an example structure of FileA and FileB

FileA
-------

123 ABC 1


FileB
-----
123 ABC 2
456 BNF 3

Both files are tab-delimited. Every record is not simply compared with
each record in FileB; there's a heuristic I am going to run for the comparison,
which scores the results in the output. So my output file is like this:

Output
--------

123 ABC 1 123 ABC 2 10 
123 ABC 1 456 BNF 3 20

The first 3 columns in the output file are from FileA, the next three columns
are from FileB, and the last column is their comparison score.

So basically you are saying we can use two map/reduce jobs, one for FileA and
the other for FileB:

map (FileA) -> reduce (FileA) -> map (FileB) -> reduce (FileB)

For the first file, FileA, I map the records to <K,V> pairs (I can't use a
Bloom filter because the comparison of each record from FileA with every
record in FileB is not a straight comparison - they are compared using a
heuristic, scored for their quantitative comparison, and stored).

In the FileA reduce I store the output in the distributed cache. Once this is
done, I map FileB in the second map, and in the FileB reduce I read FileA back
from the distributed cache, run my heuristic for every <K,V> from FileB,
and store my result.
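
A minimal sketch of that second job's reduce side (the symlinked cache file
name "filea-part" and the compare() heuristic are placeholders for the pieces
described above):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Second job: score every FileB line against every line of FileA.
public class FileBScoreReducer extends Reducer<Text, Text, Text, Text> {

  private final List<String> fileALines = new ArrayList<String>();

  @Override
  protected void setup(Context context) throws IOException {
    // "filea-part" is assumed to be the first job's FileA output,
    // symlinked into the working directory via the distributed cache.
    BufferedReader in = new BufferedReader(new FileReader("filea-part"));
    try {
      String line;
      while ((line = in.readLine()) != null) {
        fileALines.add(line);
      }
    } finally {
      in.close();
    }
  }

  @Override
  protected void reduce(Text key, Iterable<Text> bLines, Context context)
      throws IOException, InterruptedException {
    for (Text bLine : bLines) {
      for (String aLine : fileALines) {
        // compare() stands in for the scoring heuristic described above.
        int score = compare(aLine, bLine.toString());
        context.write(new Text(aLine + "\t" + bLine),
                      new Text(String.valueOf(score)));
      }
    }
  }

  private int compare(String a, String b) {
    return 0;  // placeholder heuristic
  }
}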

thanks


Tarandeep wrote:
> 
> oh my bad, I was not clear-
> 
> For FileB, you will be running a second map reduce job. In mapper, you can
> use the Bloom Filter, created in first map reduce job (if you wish to use)
> to eliminate the lines whose keys dont match. Mapper will emit key,value
> pair, where key is teh field on which you want to do comparison and value
> is
> the whole line.
> 
> when the key,value pairs go to reducers, then you have lines from FileB
> sorted on the field yon want to use for comparison. Now you can read
> contents of FileA (note that if you ran first job with N reducers, you
> will
> have N paritions of FileA and you want to read only the partition meant
> for
> this reducer). Content of FileA is also sorted on the field, Now you can
> easily compare the lines from two files.
> 
> CloudBase- cloudbase.sourceforge.net has code for doing join this fashion.
> 
> Let me know if you need more clarification.
> 
> -Tarandeep
> 
> On Fri, Jun 19, 2009 at 3:45 PM, pmg <pa...@gmail.com> wrote:
> 
>>
>> thanks tarandeep
>>
>> Correct if I am wrong that when I map FileA mapper created key,value pair
>> and sends across to the reducer. If so then how can I compare when FileB
>> is
>> not even mapped yet.
>>
>>
>> Tarandeep wrote:
>> >
>> > On Fri, Jun 19, 2009 at 2:41 PM, pmg <pa...@gmail.com> wrote:
>> >
>> >>
>> >> For the sake of simplification I have simplified my input into two
>> files
>> >> 1.
>> >> FileA 2. FileB
>> >>
>> >> As I said earlier I want to compare every record of FileA against
>> every
>> >> record in FileB I know this is n2 but this is the process. I wrote a
>> >> simple
>> >> InputFormat and RecordReader. It seems each file is read serially one
>> >> after
>> >> another. How can my record read have reference to both files at the
>> same
>> >> line so that I can create cross list of FileA and FileB for the
>> mapper.
>> >>
>> >> Basically the way I see is to get mapper one record from FileA and all
>> >> records from FileB so that mapper can compare n2 and forward them to
>> >> reducer.
>> >
>> >
>> > It will be hard (and inefficient) to do this in Mapper using some
>> custom
>> > intput format. What you can do is use Semi Join technique-
>> >
>> > Since File A is smaller, run a map reduce job that will output
>> key,value
>> > pair where key is the field or set of fields on which you want to do
>> the
>> > comparison and value is the whole line.
>> >
>> > The reducer is simply an Identity reducer which writes the files. So
>> your
>> > fileA has been partitioned on the field(s). you can also create bloom
>> > filter
>> > on this field and store it in Distributed Cache.
>> >
>> > Now read FileB, load Bloom filter into memory and see if the field from
>> > line
>> > of FileB is present in Bloom filter, if yes emit Key,Value pair else
>> not.
>> >
>> > At reducers, you get the contents of FileB partitioned just like
>> contents
>> > of
>> > fileA were partitioned and at a particular reducer you get lines sorted
>> on
>> > the field you want to do the comparison, At this point you read the
>> > contents
>> > of FileA that reached this reducer and since its contents were sorted
>> as
>> > well, you can quickly go over the two lists.
>> >
>> > -Tarandeep
>> >
>> >>
>> >>
>> >> thanks
>> >>
>> >>
>> >>
>> >> pmg wrote:
>> >> >
>> >> > Thanks owen. Are there any examples that I can look at?
>> >> >
>> >> >
>> >> >
>> >> > owen.omalley wrote:
>> >> >>
>> >> >> On Jun 18, 2009, at 10:56 AM, pmg wrote:
>> >> >>
>> >> >>> Each line from FileA gets compared with every line from FileB1,
>> >> >>> FileB2 etc.
>> >> >>> etc. FileB1, FileB2 etc. are in a different input directory
>> >> >>
>> >> >> In the general case, I'd define an InputFormat that takes two
>> >> >> directories, computes the input splits for each directory and
>> >> >> generates a new list of InputSplits that is the cross-product of
>> the
>> >> >> two lists. So instead of FileSplit, it would use a FileSplitPair
>> that
>> >> >> gives the FileSplit for dir1 and the FileSplit for dir2 and the
>> record
>> >> >> reader would return a TextPair with left and right records (ie.
>> >> >> lines). Clearly, you read the first line of split1 and cross it by
>> >> >> each line from split2, then move to the second line of split1 and
>> >> >> process each line from split2, etc.
>> >> >>
>> >> >> You'll need to ensure that you don't overwhelm the system with
>> either
>> >> >> too many input splits (ie. maps). Also don't forget that N^2/M
>> grows
>> >> >> much faster with the size of the input (N) than the M machines can
>> >> >> handle in a fixed amount of time.
>> >> >>
>> >> >>> Two input directories
>> >> >>>
>> >> >>> 1. input1 directory with a single file of 600K records - FileA
>> >> >>> 2. input2 directory segmented into different files with 2Million
>> >> >>> records -
>> >> >>> FileB1, FileB2 etc.
>> >> >>
>> >> >> In this particular case, it would be right to load all of FileA
>> into
>> >> >> memory and process the chunks of FileB/part-*. Then it would be
>> much
>> >> >> faster than needing to re-read the file over and over again, but
>> >> >> otherwise it would be the same.
>> >> >>
>> >> >> -- Owen
>> >> >>
>> >> >>
>> >> >
>> >> >
>> >>
>> >>
>> >>
>> >
>> >
>>
>>
>>
> 
> 



Re: multiple file input

Posted by Tarandeep Singh <ta...@gmail.com>.
Oh, my bad, I was not clear.

For FileB, you will be running a second map reduce job. In the mapper, you can
use the Bloom filter created in the first map reduce job (if you wish to use
it) to eliminate the lines whose keys don't match. The mapper will emit
key,value pairs, where the key is the field on which you want to do the
comparison and the value is the whole line.

When the key,value pairs go to the reducers, you have the lines from FileB
sorted on the field you want to use for comparison. Now you can read the
contents of FileA (note that if you ran the first job with N reducers, you
will have N partitions of FileA, and you want to read only the partition meant
for this reducer). The content of FileA is also sorted on the field, so now
you can easily compare the lines from the two files.
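
A minimal sketch of that FileB-side mapper, using Hadoop's
org.apache.hadoop.util.bloom.BloomFilter (the cache file name "filea.bloom"
and the first tab-separated field as the key are illustrative assumptions):

import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;

// Second-job mapper over FileB: drop lines whose key cannot be in FileA.
public class FileBFilterMapper
    extends Mapper<LongWritable, Text, Text, Text> {

  private final BloomFilter filter = new BloomFilter();

  @Override
  protected void setup(Context context) throws IOException {
    // "filea.bloom" is assumed to be a symlinked distributed-cache file
    // holding the filter serialized by the first (FileA) job.
    DataInputStream in = new DataInputStream(new FileInputStream("filea.bloom"));
    try {
      filter.readFields(in);
    } finally {
      in.close();
    }
  }

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    // Assumes the comparison field is the first tab-separated column.
    String field = line.toString().split("\t", 2)[0];
    if (filter.membershipTest(new Key(field.getBytes("UTF-8")))) {
      context.write(new Text(field), line);  // partitioned/sorted like FileA
    }
  }
}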

CloudBase (cloudbase.sourceforge.net) has code for doing joins in this fashion.

Let me know if you need more clarification.

-Tarandeep

On Fri, Jun 19, 2009 at 3:45 PM, pmg <pa...@gmail.com> wrote:

>
> thanks tarandeep
>
> Correct if I am wrong that when I map FileA mapper created key,value pair
> and sends across to the reducer. If so then how can I compare when FileB is
> not even mapped yet.
>
>
> Tarandeep wrote:
> >
> > On Fri, Jun 19, 2009 at 2:41 PM, pmg <pa...@gmail.com> wrote:
> >
> >>
> >> For the sake of simplification I have simplified my input into two files
> >> 1.
> >> FileA 2. FileB
> >>
> >> As I said earlier I want to compare every record of FileA against every
> >> record in FileB I know this is n2 but this is the process. I wrote a
> >> simple
> >> InputFormat and RecordReader. It seems each file is read serially one
> >> after
> >> another. How can my record read have reference to both files at the same
> >> line so that I can create cross list of FileA and FileB for the mapper.
> >>
> >> Basically the way I see is to get mapper one record from FileA and all
> >> records from FileB so that mapper can compare n2 and forward them to
> >> reducer.
> >
> >
> > It will be hard (and inefficient) to do this in Mapper using some custom
> > intput format. What you can do is use Semi Join technique-
> >
> > Since File A is smaller, run a map reduce job that will output key,value
> > pair where key is the field or set of fields on which you want to do the
> > comparison and value is the whole line.
> >
> > The reducer is simply an Identity reducer which writes the files. So your
> > fileA has been partitioned on the field(s). you can also create bloom
> > filter
> > on this field and store it in Distributed Cache.
> >
> > Now read FileB, load Bloom filter into memory and see if the field from
> > line
> > of FileB is present in Bloom filter, if yes emit Key,Value pair else not.
> >
> > At reducers, you get the contents of FileB partitioned just like contents
> > of
> > fileA were partitioned and at a particular reducer you get lines sorted
> on
> > the field you want to do the comparison, At this point you read the
> > contents
> > of FileA that reached this reducer and since its contents were sorted as
> > well, you can quickly go over the two lists.
> >
> > -Tarandeep
> >
> >>
> >>
> >> thanks
> >>
> >>
> >>
> >> pmg wrote:
> >> >
> >> > Thanks owen. Are there any examples that I can look at?
> >> >
> >> >
> >> >
> >> > owen.omalley wrote:
> >> >>
> >> >> On Jun 18, 2009, at 10:56 AM, pmg wrote:
> >> >>
> >> >>> Each line from FileA gets compared with every line from FileB1,
> >> >>> FileB2 etc.
> >> >>> etc. FileB1, FileB2 etc. are in a different input directory
> >> >>
> >> >> In the general case, I'd define an InputFormat that takes two
> >> >> directories, computes the input splits for each directory and
> >> >> generates a new list of InputSplits that is the cross-product of the
> >> >> two lists. So instead of FileSplit, it would use a FileSplitPair that
> >> >> gives the FileSplit for dir1 and the FileSplit for dir2 and the
> record
> >> >> reader would return a TextPair with left and right records (ie.
> >> >> lines). Clearly, you read the first line of split1 and cross it by
> >> >> each line from split2, then move to the second line of split1 and
> >> >> process each line from split2, etc.
> >> >>
> >> >> You'll need to ensure that you don't overwhelm the system with either
> >> >> too many input splits (ie. maps). Also don't forget that N^2/M grows
> >> >> much faster with the size of the input (N) than the M machines can
> >> >> handle in a fixed amount of time.
> >> >>
> >> >>> Two input directories
> >> >>>
> >> >>> 1. input1 directory with a single file of 600K records - FileA
> >> >>> 2. input2 directory segmented into different files with 2Million
> >> >>> records -
> >> >>> FileB1, FileB2 etc.
> >> >>
> >> >> In this particular case, it would be right to load all of FileA into
> >> >> memory and process the chunks of FileB/part-*. Then it would be much
> >> >> faster than needing to re-read the file over and over again, but
> >> >> otherwise it would be the same.
> >> >>
> >> >> -- Owen
> >> >>
> >> >>
> >> >
> >> >
> >>
> >>
> >>
> >
> >
>
>
>

Re: multiple file input

Posted by pmg <pa...@gmail.com>.
thanks tarandeep

Correct me if I am wrong: when I map FileA, the mapper creates key,value pairs
and sends them across to the reducer. If so, then how can I compare when FileB
is not even mapped yet?


Tarandeep wrote:
> 
> On Fri, Jun 19, 2009 at 2:41 PM, pmg <pa...@gmail.com> wrote:
> 
>>
>> For the sake of simplification I have simplified my input into two files
>> 1.
>> FileA 2. FileB
>>
>> As I said earlier I want to compare every record of FileA against every
>> record in FileB I know this is n2 but this is the process. I wrote a
>> simple
>> InputFormat and RecordReader. It seems each file is read serially one
>> after
>> another. How can my record read have reference to both files at the same
>> line so that I can create cross list of FileA and FileB for the mapper.
>>
>> Basically the way I see is to get mapper one record from FileA and all
>> records from FileB so that mapper can compare n2 and forward them to
>> reducer.
> 
> 
> It will be hard (and inefficient) to do this in Mapper using some custom
> intput format. What you can do is use Semi Join technique-
> 
> Since File A is smaller, run a map reduce job that will output key,value
> pair where key is the field or set of fields on which you want to do the
> comparison and value is the whole line.
> 
> The reducer is simply an Identity reducer which writes the files. So your
> fileA has been partitioned on the field(s). you can also create bloom
> filter
> on this field and store it in Distributed Cache.
> 
> Now read FileB, load Bloom filter into memory and see if the field from
> line
> of FileB is present in Bloom filter, if yes emit Key,Value pair else not.
> 
> At reducers, you get the contents of FileB partitioned just like contents
> of
> fileA were partitioned and at a particular reducer you get lines sorted on
> the field you want to do the comparison, At this point you read the
> contents
> of FileA that reached this reducer and since its contents were sorted as
> well, you can quickly go over the two lists.
> 
> -Tarandeep
> 
>>
>>
>> thanks
>>
>>
>>
>> pmg wrote:
>> >
>> > Thanks owen. Are there any examples that I can look at?
>> >
>> >
>> >
>> > owen.omalley wrote:
>> >>
>> >> On Jun 18, 2009, at 10:56 AM, pmg wrote:
>> >>
>> >>> Each line from FileA gets compared with every line from FileB1,
>> >>> FileB2 etc.
>> >>> etc. FileB1, FileB2 etc. are in a different input directory
>> >>
>> >> In the general case, I'd define an InputFormat that takes two
>> >> directories, computes the input splits for each directory and
>> >> generates a new list of InputSplits that is the cross-product of the
>> >> two lists. So instead of FileSplit, it would use a FileSplitPair that
>> >> gives the FileSplit for dir1 and the FileSplit for dir2 and the record
>> >> reader would return a TextPair with left and right records (ie.
>> >> lines). Clearly, you read the first line of split1 and cross it by
>> >> each line from split2, then move to the second line of split1 and
>> >> process each line from split2, etc.
>> >>
>> >> You'll need to ensure that you don't overwhelm the system with either
>> >> too many input splits (ie. maps). Also don't forget that N^2/M grows
>> >> much faster with the size of the input (N) than the M machines can
>> >> handle in a fixed amount of time.
>> >>
>> >>> Two input directories
>> >>>
>> >>> 1. input1 directory with a single file of 600K records - FileA
>> >>> 2. input2 directory segmented into different files with 2Million
>> >>> records -
>> >>> FileB1, FileB2 etc.
>> >>
>> >> In this particular case, it would be right to load all of FileA into
>> >> memory and process the chunks of FileB/part-*. Then it would be much
>> >> faster than needing to re-read the file over and over again, but
>> >> otherwise it would be the same.
>> >>
>> >> -- Owen
>> >>
>> >>
>> >
>> >
>>
>> --
>> View this message in context:
>> http://www.nabble.com/multiple-file-input-tp24095358p24119228.html
>> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>>
>>
> 
> 

-- 
View this message in context: http://www.nabble.com/multiple-file-input-tp24095358p24119864.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.


Re: multiple file input

Posted by Tarandeep Singh <ta...@gmail.com>.
On Fri, Jun 19, 2009 at 2:41 PM, pmg <pa...@gmail.com> wrote:

>
> For the sake of simplicity I have reduced my input to two files:
> 1. FileA 2. FileB
>
> As I said earlier, I want to compare every record of FileA against every
> record in FileB. I know this is N^2, but that is the process. I wrote a
> simple InputFormat and RecordReader. It seems each file is read serially,
> one after another. How can my RecordReader have a reference to both files
> at the same time so that I can build the cross list of FileA and FileB
> for the mapper?
>
> Basically, the way I see it is to give the mapper one record from FileA
> and all records from FileB so that the mapper can do the N^2 comparisons
> and forward the results to the reducer.


It will be hard (and inefficient) to do this in the mapper using some
custom input format. What you can do is use the semi-join technique:

Since FileA is smaller, run a MapReduce job that outputs a key/value pair
where the key is the field or set of fields on which you want to do the
comparison and the value is the whole line.

The reducer is simply an identity reducer that writes the files, so your
FileA has been partitioned on the field(s). You can also create a Bloom
filter on this field and store it in the DistributedCache.

Now read FileB, load the Bloom filter into memory, and check whether the
field from each line of FileB is present in the filter; if it is, emit the
key/value pair, otherwise drop the line.

At the reducers, you get the contents of FileB partitioned just as the
contents of FileA were, and at a particular reducer the lines arrive
sorted on the comparison field. At this point you read the portion of
FileA that reached this reducer and, since it was sorted as well, you can
quickly walk the two sorted lists together.
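
A minimal sketch of the FileB-side mapper (using the newer
org.apache.hadoop.mapreduce API) could look like the following. The cache
file name "fileA.bloom" and the choice of the first tab-separated column
as the join field are illustrative assumptions, not part of the recipe
above:

import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;

// Drops FileB lines whose join field cannot possibly match any FileA
// record, according to a Bloom filter built over FileA's join field.
public class FileBFilterMapper
    extends Mapper<LongWritable, Text, Text, Text> {

  private final BloomFilter filter = new BloomFilter();

  @Override
  protected void setup(Context context) throws IOException {
    // Assumption: "fileA.bloom" was shipped with the DistributedCache
    // and symlinked into the task's working directory.
    DataInputStream in =
        new DataInputStream(new FileInputStream("fileA.bloom"));
    try {
      filter.readFields(in);  // deserialize the filter built over FileA
    } finally {
      in.close();
    }
  }

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    // Assumption: the join field is the first tab-separated column.
    String joinField = line.toString().split("\t", 2)[0];
    if (filter.membershipTest(new Key(joinField.getBytes("UTF-8")))) {
      // Possible match: route it to the same reducer as FileA's lines.
      context.write(new Text(joinField), line);
    }
    // Otherwise drop the line: a Bloom filter can give false positives
    // but never false negatives, so the field is guaranteed absent.
  }
}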

-Tarandeep

>
>
> thanks
>
>
>
> pmg wrote:
> >
> > Thanks Owen. Are there any examples that I can look at?
> >
> >
> >
> > owen.omalley wrote:
> >>
> >> On Jun 18, 2009, at 10:56 AM, pmg wrote:
> >>
> >>> Each line from FileA gets compared with every line from FileB1,
> >>> FileB2 etc.
> >>> etc. FileB1, FileB2 etc. are in a different input directory
> >>
> >> In the general case, I'd define an InputFormat that takes two
> >> directories, computes the input splits for each directory, and
> >> generates a new list of InputSplits that is the cross-product of the
> >> two lists. So instead of FileSplit, it would use a FileSplitPair that
> >> gives the FileSplit for dir1 and the FileSplit for dir2, and the
> >> record reader would return a TextPair with left and right records
> >> (i.e., lines). Clearly, you read the first line of split1 and cross
> >> it with each line from split2, then move to the second line of split1
> >> and process each line from split2, etc.
> >>
> >> You'll need to ensure that you don't overwhelm the system with too
> >> many input splits (i.e., maps). Also don't forget that N^2/M grows
> >> much faster with the size of the input (N) than a fixed set of M
> >> machines can handle in a fixed amount of time.
> >>
> >>> Two input directories
> >>>
> >>> 1. input1 directory with a single file of 600K records - FileA
> >>> 2. input2 directory segmented into different files with 2Million
> >>> records -
> >>> FileB1, FileB2 etc.
> >>
> >> In this particular case, it would be right to load all of FileA into
> >> memory and process the chunks of FileB/part-*. Then it would be much
> >> faster than needing to re-read the file over and over again, but
> >> otherwise it would be the same.
> >>
> >> -- Owen
> >>
> >>
> >
> >
>
> --
> View this message in context:
> http://www.nabble.com/multiple-file-input-tp24095358p24119228.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>

Re: multiple file input

Posted by pmg <pa...@gmail.com>.
For the sake of simplicity I have reduced my input to two files:
1. FileA 2. FileB

As I said earlier, I want to compare every record of FileA against every
record in FileB. I know this is N^2, but that is the process. I wrote a
simple InputFormat and RecordReader. It seems each file is read serially,
one after another. How can my RecordReader have a reference to both files
at the same time so that I can build the cross list of FileA and FileB
for the mapper?

Basically, the way I see it is to give the mapper one record from FileA
and all records from FileB so that the mapper can do the N^2 comparisons
and forward the results to the reducer.

thanks



pmg wrote:
> 
> Thanks Owen. Are there any examples that I can look at?
> 
> 
> 
> owen.omalley wrote:
>> 
>> On Jun 18, 2009, at 10:56 AM, pmg wrote:
>> 
>>> Each line from FileA gets compared with every line from FileB1,  
>>> FileB2 etc.
>>> etc. FileB1, FileB2 etc. are in a different input directory
>> 
>> In the general case, I'd define an InputFormat that takes two
>> directories, computes the input splits for each directory, and
>> generates a new list of InputSplits that is the cross-product of the
>> two lists. So instead of FileSplit, it would use a FileSplitPair that
>> gives the FileSplit for dir1 and the FileSplit for dir2, and the
>> record reader would return a TextPair with left and right records
>> (i.e., lines). Clearly, you read the first line of split1 and cross
>> it with each line from split2, then move to the second line of split1
>> and process each line from split2, etc.
>>
>> You'll need to ensure that you don't overwhelm the system with too
>> many input splits (i.e., maps). Also don't forget that N^2/M grows
>> much faster with the size of the input (N) than a fixed set of M
>> machines can handle in a fixed amount of time.
>> 
>>> Two input directories
>>>
>>> 1. input1 directory with a single file of 600K records - FileA
>>> 2. input2 directory segmented into different files with 2Million  
>>> records -
>>> FileB1, FileB2 etc.
>> 
>> In this particular case, it would be right to load all of FileA into  
>> memory and process the chunks of FileB/part-*. Then it would be much  
>> faster than needing to re-read the file over and over again, but  
>> otherwise it would be the same.
>> 
>> -- Owen
>> 
>> 
> 
> 

-- 
View this message in context: http://www.nabble.com/multiple-file-input-tp24095358p24119228.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.


Re: multiple file input

Posted by pmg <pa...@gmail.com>.
Thanks Owen. Are there any examples that I can look at?



owen.omalley wrote:
> 
> On Jun 18, 2009, at 10:56 AM, pmg wrote:
> 
>> Each line from FileA gets compared with every line from FileB1,  
>> FileB2 etc.
>> etc. FileB1, FileB2 etc. are in a different input directory
> 
> In the general case, I'd define an InputFormat that takes two
> directories, computes the input splits for each directory, and
> generates a new list of InputSplits that is the cross-product of the
> two lists. So instead of FileSplit, it would use a FileSplitPair that
> gives the FileSplit for dir1 and the FileSplit for dir2, and the
> record reader would return a TextPair with left and right records
> (i.e., lines). Clearly, you read the first line of split1 and cross
> it with each line from split2, then move to the second line of split1
> and process each line from split2, etc.
>
> You'll need to ensure that you don't overwhelm the system with too
> many input splits (i.e., maps). Also don't forget that N^2/M grows
> much faster with the size of the input (N) than a fixed set of M
> machines can handle in a fixed amount of time.
> 
>> Two input directories
>>
>> 1. input1 directory with a single file of 600K records - FileA
>> 2. input2 directory segmented into different files with 2Million  
>> records -
>> FileB1, FileB2 etc.
> 
> In this particular case, it would be right to load all of FileA into  
> memory and process the chunks of FileB/part-*. Then it would be much  
> faster than needing to re-read the file over and over again, but  
> otherwise it would be the same.
> 
> -- Owen
> 
> 

-- 
View this message in context: http://www.nabble.com/multiple-file-input-tp24095358p24105398.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.


Re: multiple file input

Posted by Owen O'Malley <om...@apache.org>.
On Jun 18, 2009, at 10:56 AM, pmg wrote:

> Each line from FileA gets compared with every line from FileB1,  
> FileB2 etc.
> etc. FileB1, FileB2 etc. are in a different input directory

In the general case, I'd define an InputFormat that takes two
directories, computes the input splits for each directory, and
generates a new list of InputSplits that is the cross-product of the
two lists. So instead of FileSplit, it would use a FileSplitPair that
gives the FileSplit for dir1 and the FileSplit for dir2, and the
record reader would return a TextPair with left and right records
(i.e., lines). Clearly, you read the first line of split1 and cross
it with each line from split2, then move to the second line of split1
and process each line from split2, etc.

You'll need to ensure that you don't overwhelm the system with too
many input splits (i.e., maps). Also don't forget that N^2/M grows
much faster with the size of the input (N) than a fixed set of M
machines can handle in a fixed amount of time.
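
A rough skeleton of that idea, against the new org.apache.hadoop.mapreduce
API: the configuration keys "cross.left.dir" and "cross.right.dir" are
invented for the example, and the Writable serialization of the pair split
plus the paired TextPair record reader are deliberately omitted:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Pairs every split of dir1 with every split of dir2, so each map task
// sees one (left, right) chunk combination.
public class CrossProductInputFormat extends TextInputFormat {

  // One split of dir1 paired with one split of dir2. A real version must
  // also implement Writable so the framework can ship it to the task.
  public static class FileSplitPair extends InputSplit {
    final FileSplit left;
    final FileSplit right;

    FileSplitPair(FileSplit left, FileSplit right) {
      this.left = left;
      this.right = right;
    }

    @Override
    public long getLength() throws IOException, InterruptedException {
      return left.getLength() + right.getLength();
    }

    @Override
    public String[] getLocations() throws IOException, InterruptedException {
      return right.getLocations();  // chase locality for one side only
    }
  }

  @Override
  public List<InputSplit> getSplits(JobContext job) throws IOException {
    List<InputSplit> pairs = new ArrayList<InputSplit>();
    for (InputSplit a : splitsFor(job, "cross.left.dir")) {
      for (InputSplit b : splitsFor(job, "cross.right.dir")) {
        pairs.add(new FileSplitPair((FileSplit) a, (FileSplit) b));
      }
    }
    return pairs;  // beware: |left splits| x |right splits| map tasks
  }

  // Delegate to plain TextInputFormat for one directory at a time.
  private List<InputSplit> splitsFor(JobContext job, String dirKey)
      throws IOException {
    Job scratch = new Job(new Configuration(job.getConfiguration()));
    FileInputFormat.setInputPaths(scratch,
        new Path(job.getConfiguration().get(dirKey)));
    return new TextInputFormat().getSplits(scratch);
  }

  @Override
  public RecordReader<LongWritable, Text> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    // The paired reader (yielding left and right lines by replaying the
    // right split once per left line) is omitted from this sketch.
    throw new UnsupportedOperationException("record reader not sketched");
  }
}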

> Two input directories
>
> 1. input1 directory with a single file of 600K records - FileA
> 2. input2 directory segmented into different files with 2Million  
> records -
> FileB1, FileB2 etc.

In this particular case, it would be right to load all of FileA into  
memory and process the chunks of FileB/part-*. Then it would be much  
faster than needing to re-read the file over and over again, but  
otherwise it would be the same.
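
A minimal sketch of this special case, assuming FileA is distributed to
every node via the DistributedCache as "fileA.txt" and compare() stands in
for whatever per-pair logic the application needs:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// FileA (the 600K-record side) is loaded whole in setup(); the job input
// is FileB/part-*, so each map() call sees one FileB line and compares it
// against every cached FileA line. No reduce phase is required.
public class CrossCompareMapper
    extends Mapper<LongWritable, Text, Text, Text> {

  private final List<String> fileA = new ArrayList<String>();

  @Override
  protected void setup(Context context) throws IOException {
    // Assumption: "fileA.txt" is symlinked into the task's working
    // directory via the DistributedCache.
    BufferedReader in = new BufferedReader(new FileReader("fileA.txt"));
    try {
      String line;
      while ((line = in.readLine()) != null) {
        fileA.add(line);  // ~600K lines comfortably fit in task memory
      }
    } finally {
      in.close();
    }
  }

  @Override
  protected void map(LongWritable offset, Text bLine, Context context)
      throws IOException, InterruptedException {
    // Each FileB line meets every FileA line exactly once across the job.
    for (String aLine : fileA) {
      if (compare(aLine, bLine.toString())) {
        context.write(new Text(aLine), bLine);
      }
    }
  }

  // Placeholder for the application-specific record comparison.
  private boolean compare(String a, String b) {
    return a.equals(b);  // stand-in; the real logic lives here
  }
}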

-- Owen