Posted to mapreduce-user@hadoop.apache.org by Rohit Kelkar <ro...@gmail.com> on 2011/01/19 12:35:24 UTC

cross product of two files using MapReduce - pls suggest

I have two files, A and D, containing (vectorId, vector) on each line.
|D| = 100,000 and |A| = 1000. Dimensionality of the vectors = 100

Now I want to execute the following

for eachItem in A:
    for eachElem in D:
        dot_product = eachItem * eachElem
        save(dot_product)


What I tried was to convert file D into a MapFile in (key = vectorId,
value = vector) format and set up a Hadoop job such that:
inputFile = A
inputFileFormat = NLineInputFormat

pseudo code for the map function:

map(key=vectorid, value=myVector):
    open(MapFile containing all vectors of D)
    for eachElem in MapFile:
        dot_product = myVector * eachElem
        context.write(dot_product)
    close(MapFile containing all vectors of D)
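
For concreteness, a rough Java sketch of that mapper follows; the MapFile
path, the class name, and the tab-separated "vectorId<TAB>v1 v2 ... v100"
line format are illustrative assumptions, not the exact code.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ScanMapFileMapper
    extends Mapper<LongWritable, Text, Text, DoubleWritable> {

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    // NLineInputFormat hands us one line of A per call: "vectorId<TAB>v1 v2 ..."
    String[] parts = line.toString().split("\t");
    String aId = parts[0];
    double[] aVec = parseVector(parts[1]);

    // Open the MapFile holding D and scan it end to end, once per record of A.
    // This is the open/iterate/close from the pseudocode above, and it is where
    // the per-record I/O cost comes from.
    Configuration conf = context.getConfiguration();
    FileSystem fs = FileSystem.get(conf);
    MapFile.Reader reader = new MapFile.Reader(fs, "/data/D.mapfile", conf);
    try {
      Text dId = new Text();
      Text dVec = new Text();
      while (reader.next(dId, dVec)) {
        double[] d = parseVector(dVec.toString());
        double dot = 0.0;
        for (int i = 0; i < aVec.length; i++) {
          dot += aVec[i] * d[i];
        }
        context.write(new Text(aId + "," + dId), new DoubleWritable(dot));
      }
    } finally {
      reader.close();
    }
  }

  private static double[] parseVector(String s) {
    String[] tok = s.trim().split("\\s+");
    double[] v = new double[tok.length];
    for (int i = 0; i < tok.length; i++) {
      v[i] = Double.parseDouble(tok[i]);
    }
    return v;
  }
}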


I was expecting that sequentially accessing the MapFile would be much
faster. When I took some stats on a single node with a smaller dataset
where |A| = 100 and |D| = 100,000 what I observed was that
total time taken to iterate over the MapFile = 738 secs
total time taken to compute the dot_product = 11 sec

My original intention to speed up the process using MapReduce is
defeated because of the I/O time involved in accessing each entry in
the MapFile. Are there any other avenues that I could explore?

Re: cross product of two files using MapReduce - pls suggest

Posted by Jason <ur...@gmail.com>.
I am afraid that by reading an HDFS file manually in your mapper, you are losing data locality.
You can try putting the smaller vectors into the distributed cache and preloading them all in memory in the mapper setup. This implies that they can fit in memory, and also that you can change your M/R job to run over the larger vector set as the input.
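
A minimal Java sketch of that approach, assuming the smaller set A is shipped
via the distributed cache and loaded once in setup(), with the job's input
switched to the large file D; class names, file names, and the tab-separated
line format are illustrative, not a definitive implementation.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CachedSideMapper
    extends Mapper<LongWritable, Text, Text, DoubleWritable> {

  // All of A held in memory: 1000 vectors x 100 doubles is tiny.
  private final List<String> aIds = new ArrayList<String>();
  private final List<double[]> aVecs = new ArrayList<double[]>();

  @Override
  protected void setup(Context context) throws IOException {
    // Assumes the driver called DistributedCache.addCacheFile(new URI(".../A.txt"), conf).
    Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
    BufferedReader in = new BufferedReader(new FileReader(cached[0].toString()));
    try {
      String line;
      while ((line = in.readLine()) != null) {
        String[] parts = line.split("\t");   // "vectorId<TAB>v1 v2 ... v100"
        aIds.add(parts[0]);
        aVecs.add(parseVector(parts[1]));
      }
    } finally {
      in.close();
    }
  }

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    // The job input is the big file D, so its splits are read with data locality.
    String[] parts = line.toString().split("\t");
    String dId = parts[0];
    double[] dVec = parseVector(parts[1]);
    for (int i = 0; i < aVecs.size(); i++) {
      double[] aVec = aVecs.get(i);
      double dot = 0.0;
      for (int j = 0; j < aVec.length; j++) {
        dot += aVec[j] * dVec[j];
      }
      context.write(new Text(aIds.get(i) + "," + dId), new DoubleWritable(dot));
    }
  }

  private static double[] parseVector(String s) {
    String[] tok = s.trim().split("\\s+");
    double[] v = new double[tok.length];
    for (int i = 0; i < tok.length; i++) {
      v[i] = Double.parseDouble(tok[i]);
    }
    return v;
  }
}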

Sent from my iPhone

On Jan 19, 2011, at 3:35 AM, Rohit Kelkar <ro...@gmail.com> wrote:

> I have two files, A and D, containing (vectorId, vector) on each line.
> |D| = 100,000 and |A| = 1000. Dimensionality of the vectors = 100
> 
> Now I want to execute the following
> 
> for eachItem in A:
>    for eachElem in D:
>        dot_product = eachItem * eachElem
>        save(dot_product)
> 
> 
> What I tried was to convert file D into a MapFile in (key = vectorId,
> value = vector) format and set up a Hadoop job such that:
> inputFile = A
> inputFileFormat = NLineInputFormat
> 
> pseudo code for the map function:
> 
> map(key=vectorid, value=myVector):
>    open(MapFile containing all vectors of D)
>    for eachElem in MapFile:
>        dot_product = myVector * eachElem
>        context.write(dot_product)
>    close(MapFile containing all vectors of D)
> 
> 
> I was expecting that sequentially accessing the MapFile would be much
> faster. When I took some stats on a single node with a smaller dataset
> where |A| = 100 and |D| = 100,000 what I observed was that
> total time taken to iterate over the MapFile = 738 secs
> total time taken to compute the dot_product = 11 sec
> 
> My original intention to speed up the process using MapReduce is
> defeated because of the I/O time involved in accessing each entry in
> the MapFile. Are there any other avenues that I could explore?

Re: cross product of two files using MapReduce - pls suggest

Posted by Ashutosh Chauhan <as...@gmail.com>.
Pig has a built-in CROSS operator.

 grunt> a = load 'file1';
 grunt> b = load 'file2';
 grunt> c = cross a,b;
 grunt> store c into 'file3';

 Ashutosh

> On Wed, Jan 19, 2011 at 03:35, Rohit Kelkar <ro...@gmail.com> wrote:
>> I have two files, A and D, containing (vectorId, vector) on each line.
>> |D| = 100,000 and |A| = 1000. Dimensionality of the vectors = 100
>>
>> Now I want to execute the following
>>
>> for eachItem in A:
>>    for eachElem in D:
>>        dot_product = eachItem * eachElem
>>        save(dot_product)
>>
>>
>> What I tried was to convert file D into a MapFile in (key = vectorId,
>> value = vector) format and set up a Hadoop job such that:
>> inputFile = A
>> inputFileFormat = NLineInputFormat
>>
>> pseudo code for the map function:
>>
>> map(key=vectorid, value=myVector):
>>    open(MapFile containing all vectors of D)
>>    for eachElem in MapFile:
>>        dot_product = myVector * eachElem
>>        context.write(dot_product)
>>    close(MapFile containing all vectors of D)
>>
>>
>> I was expecting that sequentially accessing the MapFile would be much
>> faster. When I took some stats on a single node with a smaller dataset
>> where |A| = 100 and |D| = 100,000 what I observed was that
>> total time taken to iterate over the MapFile = 738 secs
>> total time taken to compute the dot_product = 11 sec
>>
>> My original intention to speed up the process using MapReduce is
>> defeated because of the I/O time involved in accessing each entry in
>> the MapFile. Are there any other avenues that I could explore?
>>
>