Posted to user@crunch.apache.org by Ashish <pa...@gmail.com> on 2012/11/14 12:36:05 UTC

Sorting output of WordCount example

Hi All,

I am a newbie to Crunch and have been playing with the examples in
standalone mode. I was trying to extend the WordCount example, but got stuck.

I want to extend the WordCount example to sort the output by word count,
highest first. I tried using PCollections.sort:

PTable<String, Long> counts = words.count();
words.sort(false);

This code had no effect. I am using the crunch-0.4 release (currently under
voting).

Is there a simple way to achieve this, or do I need to modify the code along
the lines of the SecondarySort example?

I already have a blog post based on WordCount and want to extend the same
example for sorting, so that it's easy to understand.

-- 
thanks
ashish

Blog: http://www.ashishpaliwal.com/blog
My Photo Galleries: http://www.pbase.com/ashishpaliwal

Re: Sorting output of WordCount example

Posted by Ashish <pa...@gmail.com>.
Thanks, Josh!

This helps. I am slowly getting the hang of things. The top() function
will do for the time being.
As of now, my full focus is on understanding how things work and completing
my blog series on Crunch.

Let's see if a patch comes out of this :)

thanks
ashish


Re: Sorting output of WordCount example

Posted by Josh Wills <jw...@cloudera.com>.
Hey Ashish,

The sort function operates on the keys, which are already sorted. For
getting the maximum values from a PTable<String, Long>, there is the
built-in top(N) function, where N is the number of entries that you
want returned, which will be faster than doing a full sort when you
only want the top several values. To do a full sort on the values, you
would need to swap the keys and the values and then call sort,
something like this:

PTable<String, Long> counts = ...;
PTable<Long, String> switched = counts.parallelDo(
    new MapFn<Pair<String, Long>, Pair<Long, String>>() {
      @Override
      public Pair<Long, String> map(Pair<String, Long> input) {
        return Pair.of(input.second(), input.first());
      }
    },
    Avros.tableOf(Avros.longs(), Avros.strings()));
switched.sort();

I'm not sure how common the full sort-on-value is relative to just
getting a sample of the values via top(), but I could certainly see
adding the key-value switching logic to org.apache.crunch.lib.PTables,
and would gladly accept a patch to do that.

Josh

-- 
Director of Data Science
Cloudera
Twitter: @josh_wills
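
For readers who want to try the two ideas from Josh's reply without a
Crunch pipeline, here is a rough sketch in plain Java collections. It is
only an illustration and not Crunch API code: the class name SortByCount
and the methods swapAndSort and top are hypothetical, and an in-memory Map
stands in for the PTable<String, Long>. The first method swaps each
(word, count) pair and sorts on the count; the second keeps only the N
largest counts in a bounded heap, which avoids sorting everything when
only the top few entries are needed.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

public class SortByCount {

    // Swap each (word, count) pair to (count, word), then sort on the new
    // key, largest count first. Ties are broken by word so the order is
    // deterministic.
    static List<Map.Entry<Long, String>> swapAndSort(Map<String, Long> counts) {
        List<Map.Entry<Long, String>> switched = new ArrayList<>();
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            switched.add(new SimpleEntry<>(e.getValue(), e.getKey()));
        }
        switched.sort((a, b) -> {
            int byCount = Long.compare(b.getKey(), a.getKey()); // descending
            return byCount != 0 ? byCount : a.getValue().compareTo(b.getValue());
        });
        return switched;
    }

    // Keep only the n entries with the largest counts, using a min-heap
    // that never grows beyond n + 1 elements.
    static List<Map.Entry<String, Long>> top(Map<String, Long> counts, int n) {
        PriorityQueue<Map.Entry<String, Long>> heap =
            new PriorityQueue<>((a, b) -> Long.compare(a.getValue(), b.getValue()));
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            heap.add(new SimpleEntry<>(e.getKey(), e.getValue()));
            if (heap.size() > n) {
                heap.poll(); // drop the smallest count seen so far
            }
        }
        List<Map.Entry<String, Long>> result = new ArrayList<>(heap);
        result.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample counts, standing in for the WordCount output.
        Map<String, Long> counts = new LinkedHashMap<>();
        counts.put("crunch", 2L);
        counts.put("the", 5L);
        counts.put("word", 3L);
        System.out.println(swapAndSort(counts));
        System.out.println(top(counts, 2));
    }
}
```

Note that both methods return a new list rather than changing their input,
so the result has to be captured in a variable. The same is likely to apply
to the sort calls in the thread, assuming Crunch collections are immutable
and sort returns a new collection.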