Posted to mapreduce-user@hadoop.apache.org by Rohit Kelkar <ro...@gmail.com> on 2011/01/19 12:35:24 UTC
cross product of two files using MapReduce - pls suggest
I have two files, A and D, containing (vectorId, vector) on each line.
|D| = 100,000 and |A| = 1000. Dimensionality of the vectors = 100
Now I want to execute the following
for eachItem in A:
    for eachElem in D:
        dot_product = eachItem * eachElem
        save(dot_product)
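The brute-force loop above can be written as a small runnable sketch in plain Python (the names and the dict-of-lists representation are illustrative, not from the thread):

```python
# Brute-force cross product of two vector sets: for every vector in A,
# compute its dot product with every vector in D.

def dot(u, v):
    # Plain dot product of two equal-length vectors.
    return sum(x * y for x, y in zip(u, v))

def cross_dot_products(A, D):
    # A and D map vectorId -> vector; yields ((idA, idD), dot_product)
    # for every pair, i.e. |A| * |D| results in total.
    for id_a, vec_a in A.items():
        for id_d, vec_d in D.items():
            yield (id_a, id_d), dot(vec_a, vec_d)
```

With |A| = 1,000 and |D| = 100,000 this produces 100 million dot products, which is why the question is really about how to partition the work, not about the arithmetic itself.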
What I tried was to convert file D into a MapFile in (key = vectorId,
value = vector) format and set up a Hadoop job such that:
inputFile = A
inputFileFormat = NLineInputFormat
pseudo code for the map function:
map(key=vectorId, value=myVector):
    open(MapFile containing all vectors of D)
    for eachElem in MapFile:
        dot_product = myVector * eachElem
        context.write(dot_product)
    close(MapFile containing all vectors of D)
I was expecting that sequentially accessing the MapFile would be much
faster. When I took some stats on a single node with a smaller dataset,
where |A| = 100 and |D| = 100,000, I observed:
total time taken to iterate over the MapFile = 738 secs
total time taken to compute the dot products = 11 secs
My original intention of speeding up the process using MapReduce is
defeated by the I/O time involved in accessing each entry in the
MapFile. Are there any other avenues that I could explore?
Re: cross product of two files using MapReduce - pls suggest
Posted by Jason <ur...@gmail.com>.
I am afraid that by reading an HDFS file manually in your mapper, you are losing data locality.
You can try putting the smaller vector set into the distributed cache and preloading it all in memory in the mapper setup. This implies that it can fit in memory, and also that you can change your M/R job to run over the larger vector set as its input.
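The side-data approach Jason describes can be sketched in plain Python rather than the Hadoop API: the mapper's setup step becomes loading the small set A once, and each map() call then handles one record of the large set D. File layout, function names, and the tab/comma record format below are assumptions for illustration:

```python
# Side-data (map-side) join: preload the small vector set A in memory,
# as the distributed cache would make it available to each mapper,
# then stream the large set D as the job input.

def load_small_set(lines):
    # Setup step: parse "vectorId<TAB>v1,v2,..." lines into a dict, once.
    vectors = {}
    for line in lines:
        vec_id, raw = line.rstrip("\n").split("\t")
        vectors[vec_id] = [float(x) for x in raw.split(",")]
    return vectors

def map_record(d_id, d_vec, small_set):
    # Emulates one map() call: one record of D is paired with every
    # vector of A, emitting ((idA, idD), dot_product) pairs.
    return [((a_id, d_id), sum(x * y for x, y in zip(a_vec, d_vec)))
            for a_id, a_vec in small_set.items()]
```

This keeps the large set streaming through the framework with normal data locality, and the only per-mapper overhead is one in-memory copy of A (1,000 vectors of dimension 100, which is small).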
Re: cross product of two files using MapReduce - pls suggest
Posted by Ashutosh Chauhan <as...@gmail.com>.
Pig has a built-in CROSS operator.
grunt> a = load 'file1';
grunt> b = load 'file2';
grunt> c = cross a,b;
grunt> store c into 'file3';
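For reference, CROSS pairs every tuple of the first relation with every tuple of the second and concatenates them; a minimal Python emulation of that behavior (purely illustrative, you would still need a Pig UDF or a later step to compute the dot products):

```python
from itertools import product

def cross(a, b):
    # Emulates Pig's CROSS: every tuple of relation a concatenated
    # with every tuple of relation b, yielding len(a) * len(b) tuples.
    return [ta + tb for ta, tb in product(a, b)]
```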
Ashutosh