Posted to mapreduce-dev@hadoop.apache.org by Praveen Kumar K J V S <pr...@gmail.com> on 2012/03/29 22:05:54 UTC
Need help to map a problem to Mapreduce domain
Hi All,
I already posted this question to the MapReduce users mailing list, but
I did not get any response. I probably did not convey it clearly, so I am
rephrasing it and posting it to the dev list.
Kindly give your suggestions.
I have many files in HDFS, each containing a list of cities. For each city in
any document, I want to find any similar city that appears in any of the
documents. I have a utility method that gives the level of similarity between
two cities, returning a value between 0 and 1.
Is there a way of doing this in Hadoop? My specific doubt is that a
city might be similar to another city present in some other input split
that is processed by another mapper. Let's say the odd-numbered cities
(C1, C3, C5) are similar to each other.
Input Split 1 has the cities: C1, C2, C3, C4
Input Split 2 has the cities: C1, C2, C5, C6
Say my mapper 1 output is the following, since the odd-numbered cities (C1, C3, C5) are similar:
C1, C3
C2, C4
C3, C1
C4, C2
Similarly for mapper 2:
C1, C5
C2, C5
C5, C1
C6, C2
Since C1 appears in both splits, at my reducer I finally get C3 and C5
for key C1. But this does not happen for C3, since it appears in only one
split: the reducer gets only C1 for key C3, and C5 is missed.
Thanks,
Praveen
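[Editor's sketch] The gap described above can be reproduced with a small in-memory simulation. This is only an illustration in plain Python, not Hadoop code. The similarity function here is a hypothetical stand-in (cities whose numbers have the same parity score 1.0, all others 0.0), and the 0.5 threshold is likewise an assumption; the poster's real utility method is not shown in the thread.

```python
from collections import defaultdict
from itertools import permutations


def similarity(a, b):
    # Hypothetical stand-in for the poster's utility method:
    # same parity of the city number means "similar" (score 1.0).
    return 1.0 if int(a[1:]) % 2 == int(b[1:]) % 2 else 0.0


def map_split(cities, threshold=0.5):
    # Each simulated mapper sees only its own split, so it can only
    # emit (city, similar_city) pairs found inside that split.
    return [(a, b) for a, b in permutations(cities, 2)
            if similarity(a, b) >= threshold]


def reduce_all(pairs):
    # Simulated shuffle and reduce: group emitted values by key.
    grouped = defaultdict(set)
    for key, value in pairs:
        grouped[key].add(value)
    return grouped


splits = [["C1", "C2", "C3", "C4"], ["C1", "C2", "C5", "C6"]]
pairs = [pair for split in splits for pair in map_split(split)]
result = reduce_all(pairs)

for city in sorted(result):
    print(city, sorted(result[city]))
```

With these toy inputs, key C1 collects both C3 and C5 because C1 occurs in both splits, while key C3 collects only C1: C3 and C5 never meet inside a single mapper.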
Re: Need help to map a problem to Mapreduce domain
Posted by Praveen Kumar K J V S <pr...@gmail.com>.
Thanks, very kind of you.
Re: Need help to map a problem to Mapreduce domain
Posted by Vinayak Borkar <vb...@yahoo.com>.
Sorry about the broken link.
Here is one that works.
http://flamingo.ics.uci.edu/pub/sigmod10-vernica.pdf
Vinayak
Re: Need help to map a problem to Mapreduce domain
Posted by Praveen Kumar K J V S <pr...@gmail.com>.
Hi Vinayak,
Thanks. If I use a single reducer, I might run out of memory in that
reducer's JVM.
BTW, the URL is not accessible.
Thanks,
Praveen
Re: Need help to map a problem to Mapreduce domain
Posted by Vinayak Borkar <vb...@yahoo.com>.
Hi Praveen,
As your problem is stated, it requires in the worst case that all
cities appear at every reducer. The simplest way to do that is to have one
reducer, but this is a sequential solution and probably not what you
are looking for.
If you have more visibility into your similarity function, you can do
better. Look at http://asterix.ics.uci.edu/pub/sigmod10-vernica-long.pdf
for an approach to a similar problem, set similarity joins.
One other approach you could use (if the number of unique cities is
fairly small) is to first run a MapReduce job to compute the distinct
cities (duplicates eliminated). Then run a map-only job where each mapper
uses the distinct list of cities to perform the "similarity join" with
the data in its HDFS block.
Hope this helps.
Vinayak
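[Editor's sketch] The two-step approach suggested above might look roughly like this when simulated in plain Python. It is a sketch under assumptions, not Hadoop code: the similarity function is a hypothetical parity-based stand-in for the poster's utility method, the 0.5 threshold is invented for illustration, and the function names are made up. In a real job the distinct list would have to be made available to every mapper (for example through Hadoop's distributed cache), which is only practical when that list is small.

```python
from collections import defaultdict


def similarity(a, b):
    # Hypothetical stand-in for the poster's utility method.
    return 1.0 if int(a[1:]) % 2 == int(b[1:]) % 2 else 0.0


def distinct_cities(splits):
    # Step 1 (simulated MapReduce job): eliminate duplicates across
    # all splits and produce one global list of cities.
    return sorted({city for split in splits for city in split})


def map_only_join(split, all_cities, threshold=0.5):
    # Step 2 (simulated map-only job): each mapper compares the cities
    # in its own split against the full distinct list.
    return [(city, other)
            for city in split
            for other in all_cities
            if other != city and similarity(city, other) >= threshold]


splits = [["C1", "C2", "C3", "C4"], ["C1", "C2", "C5", "C6"]]
distinct = distinct_cities(splits)

matches = defaultdict(set)
for split in splits:
    for city, other in map_only_join(split, distinct):
        matches[city].add(other)

for city in sorted(matches):
    print(city, sorted(matches[city]))
```

Because every mapper sees the whole distinct list, C3 is now matched with C5 even though the two never share a split. A city that occurs in several splits (C1 and C2 here) is emitted by more than one mapper, so a real job would either tolerate the duplicate output or add a reduce step to merge it.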