Posted to mapreduce-user@hadoop.apache.org by praveenesh kumar <pr...@gmail.com> on 2013/02/02 06:05:44 UTC

how to find top N values using map-reduce ?

I am looking for a better solution for this.

One way to do this would be to find the top N values from each mapper and
then find the top N out of those in a single reducer. I am afraid this
won't work well if my N is larger than the number of values in
my input split (i.e. the mapper's input).

The other way is to sort everything in a single reducer and then take the top N from its output.

Is there a better approach to doing this?

Regards
Praveenesh
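
For illustration, a minimal sketch of the first approach described above: each mapper keeps a bounded heap of its local top N and a single reducer merges the per-mapper results. It assumes one long value per input line; the class names and the "topn.n" configuration key are made up for the example, and the job must be run with job.setNumReduceTasks(1).

import java.io.IOException;
import java.util.PriorityQueue;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// TopNMapper.java -- each mapper keeps only the N largest values seen in its split.
public class TopNMapper extends Mapper<LongWritable, Text, NullWritable, LongWritable> {
  private final PriorityQueue<Long> heap = new PriorityQueue<>(); // min-heap
  private int n;

  @Override
  protected void setup(Context context) {
    n = context.getConfiguration().getInt("topn.n", 10);
  }

  @Override
  protected void map(LongWritable offset, Text line, Context context) {
    heap.add(Long.parseLong(line.toString().trim()));
    if (heap.size() > n) {
      heap.poll(); // evict the smallest, keeping the local top N
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    for (Long v : heap) { // emit this split's top N to the single reducer
      context.write(NullWritable.get(), new LongWritable(v));
    }
  }
}

// TopNReducer.java -- run with exactly one reducer, which merges all local top-N lists.
public class TopNReducer extends Reducer<NullWritable, LongWritable, LongWritable, NullWritable> {
  @Override
  protected void reduce(NullWritable key, Iterable<LongWritable> values, Context context)
      throws IOException, InterruptedException {
    int n = context.getConfiguration().getInt("topn.n", 10);
    PriorityQueue<Long> heap = new PriorityQueue<>();
    for (LongWritable v : values) {
      heap.add(v.get());
      if (heap.size() > n) {
        heap.poll();
      }
    }
    for (Long v : heap) { // the global top N (unordered; sort afterwards if needed)
      context.write(new LongWritable(v), NullWritable.get());
    }
  }
}

This works as long as N values fit in each task's memory; as the thread notes below, once N approaches the number of records per split (for example a top-n% query), the per-mapper pre-filter stops helping and a global sort becomes the more natural fit.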

Re: how to find top N values using map-reduce ?

Posted by Russell Jurney <ru...@gmail.com>.
Maybe look at the Pig source to see how it does it?

Russell Jurney http://datasyndrome.com

On Feb 1, 2013, at 11:37 PM, praveenesh kumar <pr...@gmail.com> wrote:

> Thanks for that Russell. Unfortunately I can't use Pig. Need to write
> my own MR job. I was wondering how its usually done in the best way
> possible.
>
> Regards
> Praveenesh
>
> On Sat, Feb 2, 2013 at 1:00 PM, Russell Jurney <ru...@gmail.com> wrote:
>> Pig. Datafu. 7 lines of code.
>>
>> https://gist.github.com/4696443
>> https://github.com/linkedin/datafu
>>
>>
>> On Fri, Feb 1, 2013 at 11:17 PM, praveenesh kumar <pr...@gmail.com>wrote:
>>
>>> Actually what I am trying to find to top n% of the whole data.
>>> This n could be very large if my data is large.
>>>
>>> Assuming I have uniform rows of equal size and if the total data size
>>> is 10 GB, using the above mentioned approach, if I have to take top
>>> 10% of the whole data set, I need 10% of 10GB which could be rows
>>> worth of 1 GB (roughly) in my mappers.
>>> I think that would not be possible given my input splits are of
>>> 64/128/512 MB (based on my block size) or am I making wrong
>>> assumptions. I can increase the inputsplit size, but is there a better
>>> way to find top n%.
>>>
>>>
>>> My whole actual problem is to give ranks to some values and then find
>>> out the top 10 ranks.
>>>
>>> I think this context can give more idea about the problem ?
>>>
>>> Regards
>>> Praveenesh
>>>
>>> On Sat, Feb 2, 2013 at 11:53 AM, Eugene Kirpichov <ek...@gmail.com>
>>> wrote:
>>>> Hi,
>>>>
>>>> Can you tell more about:
>>>> * How big is N
>>>> * How big is the input dataset
>>>> * How many mappers you have
>>>> * Do input splits correlate with the sorting criterion for top N?
>>>>
>>>> Depending on the answers, very different strategies will be optimal.
>>>>
>>>>
>>>>
>>>> On Fri, Feb 1, 2013 at 9:05 PM, praveenesh kumar <praveenesh@gmail.com
>>>> wrote:
>>>>
>>>>> I am looking for a better solution for this.
>>>>>
>>>>> 1 way to do this would be to find top N values from each mappers and
>>>>> then find out the top N out of them in 1 reducer.  I am afraid that
>>>>> this won't work effectively if my N is larger than number of values in
>>>>> my inputsplit (or mapper input).
>>>>>
>>>>> Otherway is to just sort all of them in 1 reducer and then do the cat of
>>>>> top-N.
>>>>>
>>>>> Wondering if there is any better approach to do this ?
>>>>>
>>>>> Regards
>>>>> Praveenesh
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Eugene Kirpichov
>>>> http://www.linkedin.com/in/eugenekirpichov
>>>> http://jkff.info/software/timeplotters - my performance visualization
>>> tools
>>>
>>
>>
>>
>> --
>> Russell Jurney twitter.com/rjurney russell.jurney@gmail.com datasyndrome.com

Re: how to find top N values using map-reduce ?

Posted by praveenesh kumar <pr...@gmail.com>.
Thanks for that, Russell. Unfortunately I can't use Pig; I need to write
my own MR job. I was wondering how it's usually done in the best way
possible.

Regards
Praveenesh

On Sat, Feb 2, 2013 at 1:00 PM, Russell Jurney <ru...@gmail.com> wrote:
> Pig. Datafu. 7 lines of code.
>
> https://gist.github.com/4696443
> https://github.com/linkedin/datafu
>
>
> On Fri, Feb 1, 2013 at 11:17 PM, praveenesh kumar <pr...@gmail.com>wrote:
>
>> Actually what I am trying to find to top n% of the whole data.
>> This n could be very large if my data is large.
>>
>> Assuming I have uniform rows of equal size and if the total data size
>> is 10 GB, using the above mentioned approach, if I have to take top
>> 10% of the whole data set, I need 10% of 10GB which could be rows
>> worth of 1 GB (roughly) in my mappers.
>> I think that would not be possible given my input splits are of
>> 64/128/512 MB (based on my block size) or am I making wrong
>> assumptions. I can increase the inputsplit size, but is there a better
>> way to find top n%.
>>
>>
>> My whole actual problem is to give ranks to some values and then find
>> out the top 10 ranks.
>>
>> I think this context can give more idea about the problem ?
>>
>> Regards
>> Praveenesh
>>
>> On Sat, Feb 2, 2013 at 11:53 AM, Eugene Kirpichov <ek...@gmail.com>
>> wrote:
>> > Hi,
>> >
>> > Can you tell more about:
>> >  * How big is N
>> >  * How big is the input dataset
>> >  * How many mappers you have
>> >  * Do input splits correlate with the sorting criterion for top N?
>> >
>> > Depending on the answers, very different strategies will be optimal.
>> >
>> >
>> >
>> > On Fri, Feb 1, 2013 at 9:05 PM, praveenesh kumar <praveenesh@gmail.com
>> >wrote:
>> >
>> >> I am looking for a better solution for this.
>> >>
>> >> 1 way to do this would be to find top N values from each mappers and
>> >> then find out the top N out of them in 1 reducer.  I am afraid that
>> >> this won't work effectively if my N is larger than number of values in
>> >> my inputsplit (or mapper input).
>> >>
>> >> Otherway is to just sort all of them in 1 reducer and then do the cat of
>> >> top-N.
>> >>
>> >> Wondering if there is any better approach to do this ?
>> >>
>> >> Regards
>> >> Praveenesh
>> >>
>> >
>> >
>> >
>> > --
>> > Eugene Kirpichov
>> > http://www.linkedin.com/in/eugenekirpichov
>> > http://jkff.info/software/timeplotters - my performance visualization
>> tools
>>
>
>
>
> --
> Russell Jurney twitter.com/rjurney russell.jurney@gmail.com datasyndrome.com

Re: how to find top N values using map-reduce ?

Posted by Russell Jurney <ru...@gmail.com>.
Pig. Datafu. 7 lines of code.

https://gist.github.com/4696443
https://github.com/linkedin/datafu


On Fri, Feb 1, 2013 at 11:17 PM, praveenesh kumar <pr...@gmail.com>wrote:

> Actually what I am trying to find to top n% of the whole data.
> This n could be very large if my data is large.
>
> Assuming I have uniform rows of equal size and if the total data size
> is 10 GB, using the above mentioned approach, if I have to take top
> 10% of the whole data set, I need 10% of 10GB which could be rows
> worth of 1 GB (roughly) in my mappers.
> I think that would not be possible given my input splits are of
> 64/128/512 MB (based on my block size) or am I making wrong
> assumptions. I can increase the inputsplit size, but is there a better
> way to find top n%.
>
>
> My whole actual problem is to give ranks to some values and then find
> out the top 10 ranks.
>
> I think this context can give more idea about the problem ?
>
> Regards
> Praveenesh
>
> On Sat, Feb 2, 2013 at 11:53 AM, Eugene Kirpichov <ek...@gmail.com>
> wrote:
> > Hi,
> >
> > Can you tell more about:
> >  * How big is N
> >  * How big is the input dataset
> >  * How many mappers you have
> >  * Do input splits correlate with the sorting criterion for top N?
> >
> > Depending on the answers, very different strategies will be optimal.
> >
> >
> >
> > On Fri, Feb 1, 2013 at 9:05 PM, praveenesh kumar <praveenesh@gmail.com
> >wrote:
> >
> >> I am looking for a better solution for this.
> >>
> >> 1 way to do this would be to find top N values from each mappers and
> >> then find out the top N out of them in 1 reducer.  I am afraid that
> >> this won't work effectively if my N is larger than number of values in
> >> my inputsplit (or mapper input).
> >>
> >> Otherway is to just sort all of them in 1 reducer and then do the cat of
> >> top-N.
> >>
> >> Wondering if there is any better approach to do this ?
> >>
> >> Regards
> >> Praveenesh
> >>
> >
> >
> >
> > --
> > Eugene Kirpichov
> > http://www.linkedin.com/in/eugenekirpichov
> > http://jkff.info/software/timeplotters - my performance visualization
> tools
>



-- 
Russell Jurney twitter.com/rjurney russell.jurney@gmail.com datasyndrome.com

Re: how to find top N values using map-reduce ?

Posted by Niels Basjes <Ni...@basjes.nl>.
My suggestion is to use secondary sort with a single reducer. That way you
can easily extract the top N. If you want the top N%, you'll need an
additional phase to determine how many records that N% really is.

-- 
Kind regards,
Niels Basjes
(Sent from mobile)
On 2 Feb 2013 12:08, "praveenesh kumar" <pr...@gmail.com> wrote:

> My actual problem is to rank all values and then run logic 1 to top n%
> values and logic 2 to rest values.
> 1st  - Ranking ? (need major suggestions here)
> 2nd - Find top n% out of them.
> Then rest is  covered.
>
> Regards
> Praveenesh
>
> On Sat, Feb 2, 2013 at 1:42 PM, Lake Chang <la...@gmail.com> wrote:
> > there's one thing i want to clarify that you can use multi-reducers to
> sort
> > the data globally and then cat all the parts to get the top n records.
> The
> > data in all parts are globally in order.
> > Then you may find the problem is much easier.
> >
> > On 2013-02-02 at 3:18 PM, "praveenesh kumar" <pr...@gmail.com> wrote:
> >
> >> Actually what I am trying to find to top n% of the whole data.
> >> This n could be very large if my data is large.
> >>
> >> Assuming I have uniform rows of equal size and if the total data size
> >> is 10 GB, using the above mentioned approach, if I have to take top
> >> 10% of the whole data set, I need 10% of 10GB which could be rows
> >> worth of 1 GB (roughly) in my mappers.
> >> I think that would not be possible given my input splits are of
> >> 64/128/512 MB (based on my block size) or am I making wrong
> >> assumptions. I can increase the inputsplit size, but is there a better
> >> way to find top n%.
> >>
> >>
> >> My whole actual problem is to give ranks to some values and then find
> >> out the top 10 ranks.
> >>
> >> I think this context can give more idea about the problem ?
> >>
> >> Regards
> >> Praveenesh
> >>
> >> On Sat, Feb 2, 2013 at 11:53 AM, Eugene Kirpichov <ekirpichov@gmail.com
> >
> >> wrote:
> >> > Hi,
> >> >
> >> > Can you tell more about:
> >> >  * How big is N
> >> >  * How big is the input dataset
> >> >  * How many mappers you have
> >> >  * Do input splits correlate with the sorting criterion for top N?
> >> >
> >> > Depending on the answers, very different strategies will be optimal.
> >> >
> >> >
> >> >
> >> > On Fri, Feb 1, 2013 at 9:05 PM, praveenesh kumar
> >> > <pr...@gmail.com>wrote:
> >> >
> >> >> I am looking for a better solution for this.
> >> >>
> >> >> 1 way to do this would be to find top N values from each mappers and
> >> >> then find out the top N out of them in 1 reducer.  I am afraid that
> >> >> this won't work effectively if my N is larger than number of values
> in
> >> >> my inputsplit (or mapper input).
> >> >>
> >> >> Otherway is to just sort all of them in 1 reducer and then do the cat
> >> >> of
> >> >> top-N.
> >> >>
> >> >> Wondering if there is any better approach to do this ?
> >> >>
> >> >> Regards
> >> >> Praveenesh
> >> >>
> >> >
> >> >
> >> >
> >> > --
> >> > Eugene Kirpichov
> >> > http://www.linkedin.com/in/eugenekirpichov
> >> > http://jkff.info/software/timeplotters - my performance visualization
> >> > tools
>
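
As a rough sketch of the "additional phase" Niels describes, assuming the newer org.apache.hadoop.mapreduce API and that a first pass (such as the ranking job) has already run: the total record count can be read from the framework's built-in counters and turned into an absolute N. The method name and percent parameter are illustrative only.

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.TaskCounter;

// Inside the driver class that chains the jobs: after the first job completes,
// turn "top n%" into an absolute record count using the input-record counter.
public static long percentToTopN(Job completedJob, double percent) throws Exception {
  long totalRecords = completedJob.getCounters()
      .findCounter(TaskCounter.MAP_INPUT_RECORDS).getValue();
  // Round up so a small input still yields at least one record.
  return (long) Math.ceil(totalRecords * percent / 100.0);
}

The resulting number can then be handed to the second job through the configuration, for example conf.setLong("topn.n", percentToTopN(rankJob, 10.0)), which reduces the top-N% case to the plain top-N case.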

Re: how to find top N values using map-reduce ?

Posted by praveenesh kumar <pr...@gmail.com>.
My actual problem is to rank all values and then apply logic 1 to the top n%
of values and logic 2 to the rest.
1st - Ranking? (I need major suggestions here.)
2nd - Finding the top n% out of them.
The rest is then covered.

Regards
Praveenesh

On Sat, Feb 2, 2013 at 1:42 PM, Lake Chang <la...@gmail.com> wrote:
> there's one thing i want to clarify that you can use multi-reducers to sort
> the data globally and then cat all the parts to get the top n records. The
> data in all parts are globally in order.
> Then you may find the problem is much easier.
>
> On 2013-02-02 at 3:18 PM, "praveenesh kumar" <pr...@gmail.com> wrote:
>
>> Actually what I am trying to find to top n% of the whole data.
>> This n could be very large if my data is large.
>>
>> Assuming I have uniform rows of equal size and if the total data size
>> is 10 GB, using the above mentioned approach, if I have to take top
>> 10% of the whole data set, I need 10% of 10GB which could be rows
>> worth of 1 GB (roughly) in my mappers.
>> I think that would not be possible given my input splits are of
>> 64/128/512 MB (based on my block size) or am I making wrong
>> assumptions. I can increase the inputsplit size, but is there a better
>> way to find top n%.
>>
>>
>> My whole actual problem is to give ranks to some values and then find
>> out the top 10 ranks.
>>
>> I think this context can give more idea about the problem ?
>>
>> Regards
>> Praveenesh
>>
>> On Sat, Feb 2, 2013 at 11:53 AM, Eugene Kirpichov <ek...@gmail.com>
>> wrote:
>> > Hi,
>> >
>> > Can you tell more about:
>> >  * How big is N
>> >  * How big is the input dataset
>> >  * How many mappers you have
>> >  * Do input splits correlate with the sorting criterion for top N?
>> >
>> > Depending on the answers, very different strategies will be optimal.
>> >
>> >
>> >
>> > On Fri, Feb 1, 2013 at 9:05 PM, praveenesh kumar
>> > <pr...@gmail.com>wrote:
>> >
>> >> I am looking for a better solution for this.
>> >>
>> >> 1 way to do this would be to find top N values from each mappers and
>> >> then find out the top N out of them in 1 reducer.  I am afraid that
>> >> this won't work effectively if my N is larger than number of values in
>> >> my inputsplit (or mapper input).
>> >>
>> >> Otherway is to just sort all of them in 1 reducer and then do the cat
>> >> of
>> >> top-N.
>> >>
>> >> Wondering if there is any better approach to do this ?
>> >>
>> >> Regards
>> >> Praveenesh
>> >>
>> >
>> >
>> >
>> > --
>> > Eugene Kirpichov
>> > http://www.linkedin.com/in/eugenekirpichov
>> > http://jkff.info/software/timeplotters - my performance visualization
>> > tools

Re: how to find top N values using map-reduce ?

Posted by Lake Chang <la...@gmail.com>.
There's one thing I want to clarify: you can use multiple reducers to sort
the data globally and then cat all the parts to get the top n records. The
data across all the parts is then globally in order.
Then you may find the problem is much easier.
On 2013-02-02 at 3:18 PM, "praveenesh kumar" <pr...@gmail.com> wrote:

> Actually what I am trying to find to top n% of the whole data.
> This n could be very large if my data is large.
>
> Assuming I have uniform rows of equal size and if the total data size
> is 10 GB, using the above mentioned approach, if I have to take top
> 10% of the whole data set, I need 10% of 10GB which could be rows
> worth of 1 GB (roughly) in my mappers.
> I think that would not be possible given my input splits are of
> 64/128/512 MB (based on my block size) or am I making wrong
> assumptions. I can increase the inputsplit size, but is there a better
> way to find top n%.
>
>
> My whole actual problem is to give ranks to some values and then find
> out the top 10 ranks.
>
> I think this context can give more idea about the problem ?
>
> Regards
> Praveenesh
>
> On Sat, Feb 2, 2013 at 11:53 AM, Eugene Kirpichov <ek...@gmail.com>
> wrote:
> > Hi,
> >
> > Can you tell more about:
> >  * How big is N
> >  * How big is the input dataset
> >  * How many mappers you have
> >  * Do input splits correlate with the sorting criterion for top N?
> >
> > Depending on the answers, very different strategies will be optimal.
> >
> >
> >
> > On Fri, Feb 1, 2013 at 9:05 PM, praveenesh kumar <praveenesh@gmail.com
> >wrote:
> >
> >> I am looking for a better solution for this.
> >>
> >> 1 way to do this would be to find top N values from each mappers and
> >> then find out the top N out of them in 1 reducer.  I am afraid that
> >> this won't work effectively if my N is larger than number of values in
> >> my inputsplit (or mapper input).
> >>
> >> Otherway is to just sort all of them in 1 reducer and then do the cat of
> >> top-N.
> >>
> >> Wondering if there is any better approach to do this ?
> >>
> >> Regards
> >> Praveenesh
> >>
> >
> >
> >
> > --
> > Eugene Kirpichov
> > http://www.linkedin.com/in/eugenekirpichov
> > http://jkff.info/software/timeplotters - my performance visualization
> tools
>
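
To make Lake's suggestion concrete, a hedged driver sketch follows. It assumes the Hadoop 2.x mapreduce API and that an earlier job has written the values to be ranked as the keys of a SequenceFile; no mapper or reducer is set, so the identity map/reduce and the shuffle do the sorting, while TotalOrderPartitioner plus InputSampler make the part files globally ordered. The class name, paths, reducer count and sampling parameters are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

public class GlobalSortDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "global sort");
    job.setJarByClass(GlobalSortDriver.class);

    // Input: a SequenceFile keyed by the value to sort on (e.g. from a ranking job).
    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(NullWritable.class);
    job.setNumReduceTasks(20);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Sample the keys to choose reducer boundaries, then route key ranges to
    // reducers so that part-r-00000 .. part-r-00019 are ordered end to end.
    TotalOrderPartitioner.setPartitionFile(job.getConfiguration(),
        new Path(args[1] + "_partitions"));
    job.setPartitionerClass(TotalOrderPartitioner.class);
    InputSampler.writePartitionFile(job,
        new InputSampler.RandomSampler<LongWritable, NullWritable>(0.01, 10000));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

With the default ascending order, the largest keys land in the highest-numbered part file, so the top N records are the tail of the last part (or the head of the first part if a descending comparator is configured); that is the "cat the parts" step Lake mentions.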

Re: how to find top N values using map-reduce ?

Posted by praveenesh kumar <pr...@gmail.com>.
Actually, what I am trying to find is the top n% of the whole data.
This n could be very large if my data is large.

Assuming I have uniform rows of equal size and a total data size of
10 GB, then with the above-mentioned approach, if I have to take the top
10% of the whole data set, I need 10% of 10 GB, which is roughly 1 GB
worth of rows, in my mappers.
I think that would not be possible given my input splits are
64/128/512 MB (based on my block size), or am I making wrong
assumptions? I can increase the input split size, but is there a better
way to find the top n%?

My whole actual problem is to assign ranks to some values and then find
out the top 10 ranks.

I hope this context gives a better idea of the problem.

Regards
Praveenesh

On Sat, Feb 2, 2013 at 11:53 AM, Eugene Kirpichov <ek...@gmail.com> wrote:
> Hi,
>
> Can you tell more about:
>  * How big is N
>  * How big is the input dataset
>  * How many mappers you have
>  * Do input splits correlate with the sorting criterion for top N?
>
> Depending on the answers, very different strategies will be optimal.
>
>
>
> On Fri, Feb 1, 2013 at 9:05 PM, praveenesh kumar <pr...@gmail.com>wrote:
>
>> I am looking for a better solution for this.
>>
>> 1 way to do this would be to find top N values from each mappers and
>> then find out the top N out of them in 1 reducer.  I am afraid that
>> this won't work effectively if my N is larger than number of values in
>> my inputsplit (or mapper input).
>>
>> Otherway is to just sort all of them in 1 reducer and then do the cat of
>> top-N.
>>
>> Wondering if there is any better approach to do this ?
>>
>> Regards
>> Praveenesh
>>
>
>
>
> --
> Eugene Kirpichov
> http://www.linkedin.com/in/eugenekirpichov
> http://jkff.info/software/timeplotters - my performance visualization tools

Re: how to find top N values using map-reduce ?

Posted by Eugene Kirpichov <ek...@gmail.com>.
Hi,

Can you tell more about:
 * How big is N
 * How big is the input dataset
 * How many mappers you have
 * Do input splits correlate with the sorting criterion for top N?

Depending on the answers, very different strategies will be optimal.



On Fri, Feb 1, 2013 at 9:05 PM, praveenesh kumar <pr...@gmail.com>wrote:

> I am looking for a better solution for this.
>
> 1 way to do this would be to find top N values from each mappers and
> then find out the top N out of them in 1 reducer.  I am afraid that
> this won't work effectively if my N is larger than number of values in
> my inputsplit (or mapper input).
>
> Otherway is to just sort all of them in 1 reducer and then do the cat of
> top-N.
>
> Wondering if there is any better approach to do this ?
>
> Regards
> Praveenesh
>



-- 
Eugene Kirpichov
http://www.linkedin.com/in/eugenekirpichov
http://jkff.info/software/timeplotters - my performance visualization tools

Re: how to find top N values using map-reduce ?

Posted by Harsh J <ha...@cloudera.com>.
Note that "one reducer" isn't always the solution. If you know your
key-space boundaries, consider using a total order partition to scale
the job out and make use of the nodes in the cluster.

On Sat, Feb 2, 2013 at 10:35 AM, praveenesh kumar <pr...@gmail.com> wrote:
> I am looking for a better solution for this.
>
> 1 way to do this would be to find top N values from each mappers and
> then find out the top N out of them in 1 reducer.  I am afraid that
> this won't work effectively if my N is larger than number of values in
> my inputsplit (or mapper input).
>
> Otherway is to just sort all of them in 1 reducer and then do the cat of top-N.
>
> Wondering if there is any better approach to do this ?
>
> Regards
> Praveenesh



--
Harsh J
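
A small sketch of what "know your key-space boundaries" can look like in practice, as an alternative to sampling with InputSampler: a hand-written range partitioner. The class name and the [0, 1000) bounds are assumptions for the example; set it with job.setPartitionerClass(...) on a multi-reducer sort job and the part files come out globally ordered without a sampling pass.

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Partitioner;

// If the key space is known up front (here: keys assumed to lie in [0, 1000)),
// a hand-written range partitioner gives a total order across reducers:
// reducer 0 gets the lowest range, the last reducer the highest.
public class KnownRangePartitioner extends Partitioner<LongWritable, NullWritable> {
  private static final long MIN_KEY = 0L;      // assumed lower bound
  private static final long MAX_KEY = 1000L;   // assumed (exclusive) upper bound

  @Override
  public int getPartition(LongWritable key, NullWritable value, int numPartitions) {
    long clamped = Math.max(MIN_KEY, Math.min(key.get(), MAX_KEY - 1));
    long span = (MAX_KEY - MIN_KEY + numPartitions - 1) / numPartitions; // ceiling
    return (int) ((clamped - MIN_KEY) / span);
  }
}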