You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@pig.apache.org by Rob Wilkerson <rw...@lotame.com> on 2010/10/01 13:32:08 UTC

Grouping & Counting

Hey guys -

I have a script that loads a list of ~800,000 category hierarchies,
filters them a bit and streams them through a PHP script for some
quick procedural work. The file contains one column and a snippet
looks like this:

Arts
Arts/Animation
Arts/Animation/Anime
Arts/Animation/Anime/Characters
Arts/Animation/Anime/Clubs_and_Organizations
Arts/Animation/Anime/Collectibles
Arts/Animation/Anime/Collectibles/Cels
Arts/Animation/Anime/Collectibles/Models_and_Figures
Arts/Animation/Anime/Collectibles/Models_and_Figures/Action_Figures
Arts/Animation/Anime/Collectibles/Models_and_Figures/Action_Figures/Gundam
Arts/Animation/Anime/Collectibles/Models_and_Figures/Action_Figures/Zoids
Arts/Animation/Anime/Collectibles/Models_and_Figures/Models
Arts/Animation/Anime/Collectibles/Models_and_Figures/Models/Gundam
Arts/Animation/Anime/Collectibles/Shitajiki
Arts/Animation/Anime/Creators
Arts/Animation/Anime/Creators/Anno,_Hideaki
Arts/Animation/Anime/Creators/Ikuhara,_Kunihiko
Arts/Animation/Anime/Creators/Miyazaki,_Hayao
Arts/Animation/Anime/Creators/Studios
Arts/Animation/Anime/Creators/Studios/Studio_Ghibli
Arts/Animation/Anime/Creators/Studios/Studio_Ghibli/Titles
Arts/Animation/Anime/Distribution
Arts/Animation/Anime/Distribution/Companies

Now I need to take it one step further. I need to get a count of how
many items are in "Arts", how many are in "Arts/Animation", etc. I
know a grouping and count is involved, but I can't wrap my mind around
how to get there since the category path depth is entirely variable
and I need these numbers relative to the "whole" (i.e. I need to know
how many times Arts/Animation/Anime appears rather than how many times
Anime appears at any level).

Any guidance would be much appreciated.

Rob Wilkerson
 
The information transmitted in this  
email is intended only for the  
person(s) or entity to which it is  
addressed and may contain  
confidential and/or privileged  
material. Any review,  
retransmission, dissemination  
or other use of, or taking of any  
action in reliance upon, this  
information by persons or entities  
other than the intended recipient  
is prohibited. If you received this  
email in error, please contact the  
sender and permanently delete the  
email from any computer.  


Re: Grouping & Counting

Posted by Rob Wilkerson <rw...@lotame.com>.
A more concise way of stating this would be to say that:

For each line item, I need to count all of the line items (including
this on) that _start_ with the same value plus an optional "/".

For example, using pseudo-regex syntax:

- A count of all of the lines that start with "Arts$" or "Arts/"
- A count of all of the lines that start with "Arts/Animation$" or
"Arts/Animation/", etc.

I should have led with that, but I'm still feeling out the problem a little bit.

On Fri, Oct 1, 2010 at 7:32 AM, Rob Wilkerson <rw...@lotame.com> wrote:
> I have a script that loads a list of ~800,000 category hierarchies,
> filters them a bit and streams them through a PHP script for some
> quick procedural work. The file contains one column and a snippet
> looks like this:
>
> Arts
> Arts/Animation
> Arts/Animation/Anime
> Arts/Animation/Anime/Characters
> Arts/Animation/Anime/Clubs_and_Organizations
> Arts/Animation/Anime/Collectibles
> Arts/Animation/Anime/Collectibles/Cels
> Arts/Animation/Anime/Collectibles/Models_and_Figures
> Arts/Animation/Anime/Collectibles/Models_and_Figures/Action_Figures
> Arts/Animation/Anime/Collectibles/Models_and_Figures/Action_Figures/Gundam
> Arts/Animation/Anime/Collectibles/Models_and_Figures/Action_Figures/Zoids
> Arts/Animation/Anime/Collectibles/Models_and_Figures/Models
> Arts/Animation/Anime/Collectibles/Models_and_Figures/Models/Gundam
> Arts/Animation/Anime/Collectibles/Shitajiki
> Arts/Animation/Anime/Creators
> Arts/Animation/Anime/Creators/Anno,_Hideaki
> Arts/Animation/Anime/Creators/Ikuhara,_Kunihiko
> Arts/Animation/Anime/Creators/Miyazaki,_Hayao
> Arts/Animation/Anime/Creators/Studios
> Arts/Animation/Anime/Creators/Studios/Studio_Ghibli
> Arts/Animation/Anime/Creators/Studios/Studio_Ghibli/Titles
> Arts/Animation/Anime/Distribution
> Arts/Animation/Anime/Distribution/Companies
>
> Now I need to take it one step further. I need to get a count of how
> many items are in "Arts", how many are in "Arts/Animation", etc. I
> know a grouping and count is involved, but I can't wrap my mind around
> how to get there since the category path depth is entirely variable
> and I need these numbers relative to the "whole" (i.e. I need to know
> how many times Arts/Animation/Anime appears rather than how many times
> Anime appears at any level).
>
> Any guidance would be much appreciated.
 
The information transmitted in this  
email is intended only for the  
person(s) or entity to which it is  
addressed and may contain  
confidential and/or privileged  
material. Any review,  
retransmission, dissemination  
or other use of, or taking of any  
action in reliance upon, this  
information by persons or entities  
other than the intended recipient  
is prohibited. If you received this  
email in error, please contact the  
sender and permanently delete the  
email from any computer.  


RE: Grouping & Counting

Posted by Andrea Leistra <An...@concur.com>.
The '#' is being used as a delimiter here, to avoid the need to escape all the slashes.

-----Original Message-----
From: Rob Wilkerson [mailto:rwilkerson@lotame.com] 
Sent: Friday, October 01, 2010 8:00 AM
To: pig-user@hadoop.apache.org
Subject: Re: Grouping & Counting

On Fri, Oct 1, 2010 at 7:44 AM, David Vrensk <da...@icehouse.se> wrote:
> I would just preprocess the file with Perl or Ruby:
>
> perl -ne 'next unless m#/#; s#(.*)/(.*)#\1\t\2#; print;' infile > 
> outfile

What is the "#" representing? I have a semi-educated guess, but I can't find that particular symbol in any examples.

Also, as far as I can tell, this regex also misses the top level path because it has not children. For example, the "Arts" path. It catches "Arts/Anime" and below nicely, of course.

> Come to think of it, if your entire file is just 800k lines, I'd do 
> the entire thing with Perl.

I thought about that when PHP couldn't handle it, but my Perl skills are light and it was a chance to learn something entirely new.

Thanks for your help.

-- 
+rw
 
The information transmitted in this
email is intended only for the
person(s) or entity to which it is
addressed and may contain
confidential and/or privileged
material. Any review,
retransmission, dissemination
or other use of, or taking of any
action in reliance upon, this
information by persons or entities
other than the intended recipient
is prohibited. If you received this
email in error, please contact the
sender and permanently delete the
email from any computer.  


Re: Grouping & Counting

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
So, if you had to do this for a "really large" file, and wanted to do this
"the proper way" you would write the same thing as a Pig UDF that returns a
bag of children, and then flatten the bag to get a row per partial
hierarchy. Then group by said partial hierarchy, and count.

In pig 8 you will be able to write this udf in Python, if you like.

-D

On Fri, Oct 1, 2010 at 5:43 AM, Rob Wilkerson <rw...@lotame.com> wrote:

> On Fri, Oct 1, 2010 at 8:23 AM, David Vrensk <da...@icehouse.se> wrote:
> > Yup.  But the point is to count children, so if there is no child on the
> > row, there is nothing to count.
>
> I suppose the point is to count descendents + the base, but I should
> have been more clear.
>
> > BTW, you didn't say if you wanted to count children or descendants (i.e.
> > children and children's children).  From your follow-up, I gather it's
> about
> > descendants
>
> This looks very much like what I'm after. I'll verify against the
> data, of course, but thank you very much.
>
>
> --
> +rw
>
> The information transmitted in this
> email is intended only for the
> person(s) or entity to which it is
> addressed and may contain
> confidential and/or privileged
> material. Any review,
> retransmission, dissemination
> or other use of, or taking of any
> action in reliance upon, this
> information by persons or entities
> other than the intended recipient
> is prohibited. If you received this
> email in error, please contact the
> sender and permanently delete the
> email from any computer.
>
>

Re: Grouping & Counting

Posted by Rob Wilkerson <rw...@lotame.com>.
On Fri, Oct 1, 2010 at 8:23 AM, David Vrensk <da...@icehouse.se> wrote:
> Yup.  But the point is to count children, so if there is no child on the
> row, there is nothing to count.

I suppose the point is to count descendents + the base, but I should
have been more clear.

> BTW, you didn't say if you wanted to count children or descendants (i.e.
> children and children's children).  From your follow-up, I gather it's about
> descendants

This looks very much like what I'm after. I'll verify against the
data, of course, but thank you very much.


-- 
+rw
 
The information transmitted in this  
email is intended only for the  
person(s) or entity to which it is  
addressed and may contain  
confidential and/or privileged  
material. Any review,  
retransmission, dissemination  
or other use of, or taking of any  
action in reliance upon, this  
information by persons or entities  
other than the intended recipient  
is prohibited. If you received this  
email in error, please contact the  
sender and permanently delete the  
email from any computer.  


Re: Grouping & Counting

Posted by David Vrensk <da...@icehouse.se>.
On Fri, Oct 1, 2010 at 14:00, Rob Wilkerson <rw...@lotame.com> wrote:

> On Fri, Oct 1, 2010 at 7:44 AM, David Vrensk <da...@icehouse.se> wrote:
> > I would just preprocess the file with Perl or Ruby:
> >
> > perl -ne 'next unless m#/#; s#(.*)/(.*)#\1\t\2#; print;' infile > outfile
>
> What is the "#" representing? I have a semi-educated guess, but I
> can't find that particular symbol in any examples.
>

They are alternative regexp delimiters.  Usually, we write regexen delimited
by slashes, but I wanted to avoid that since the only important character in
the regexp is a slash.


> Also, as far as I can tell, this regex also misses the top level path
> because it has not children. For example, the "Arts" path. It catches
> "Arts/Anime" and below nicely, of course.
>

Yup.  But the point is to count children, so if there is no child on the
row, there is nothing to count.

BTW, you didn't say if you wanted to count children or descendants (i.e.
children and children's children).  From your follow-up, I gather it's about
descendants.  Try this:

--------8<----------------
#! /usr/bin/env ruby

counts = Hash.new(0)

while (line = ARGF.gets)
  line.chomp!
  segments = line.split '/'
  key = []
  segments.each do |s|
    key << s
    counts[key.join('/')] += 1
  end
end

counts.keys.sort.each do |key|
  puts("%5d %s" % [counts[key], key])
end
--------8<----------------

which with your data returns

   23 Arts
   22 Arts/Animation
   21 Arts/Animation/Anime
    1 Arts/Animation/Anime/Characters
    1 Arts/Animation/Anime/Clubs_and_Organizations
    9 Arts/Animation/Anime/Collectibles
    1 Arts/Animation/Anime/Collectibles/Cels
    6 Arts/Animation/Anime/Collectibles/Models_and_Figures
    3 Arts/Animation/Anime/Collectibles/Models_and_Figures/Action_Figures
    1
Arts/Animation/Anime/Collectibles/Models_and_Figures/Action_Figures/Gundam
    1
Arts/Animation/Anime/Collectibles/Models_and_Figures/Action_Figures/Zoids
    2 Arts/Animation/Anime/Collectibles/Models_and_Figures/Models
    1 Arts/Animation/Anime/Collectibles/Models_and_Figures/Models/Gundam
    1 Arts/Animation/Anime/Collectibles/Shitajiki
    7 Arts/Animation/Anime/Creators
    1 Arts/Animation/Anime/Creators/Anno,_Hideaki
    1 Arts/Animation/Anime/Creators/Ikuhara,_Kunihiko
    1 Arts/Animation/Anime/Creators/Miyazaki,_Hayao
    3 Arts/Animation/Anime/Creators/Studios
    2 Arts/Animation/Anime/Creators/Studios/Studio_Ghibli
    1 Arts/Animation/Anime/Creators/Studios/Studio_Ghibli/Titles
    2 Arts/Animation/Anime/Distribution
    1 Arts/Animation/Anime/Distribution/Companies


HTH,

/David
-- 
David Vrensk
Systems developer, ICE House AB
Mobile: +46 703 74 69 00

Re: Grouping & Counting

Posted by Rob Wilkerson <rw...@lotame.com>.
On Fri, Oct 1, 2010 at 7:44 AM, David Vrensk <da...@icehouse.se> wrote:
> I would just preprocess the file with Perl or Ruby:
>
> perl -ne 'next unless m#/#; s#(.*)/(.*)#\1\t\2#; print;' infile > outfile

What is the "#" representing? I have a semi-educated guess, but I
can't find that particular symbol in any examples.

Also, as far as I can tell, this regex also misses the top level path
because it has not children. For example, the "Arts" path. It catches
"Arts/Anime" and below nicely, of course.

> Come to think of it, if your entire file is just 800k lines, I'd do the
> entire thing with Perl.

I thought about that when PHP couldn't handle it, but my Perl skills
are light and it was a chance to learn something entirely new.

Thanks for your help.

-- 
+rw
 
The information transmitted in this  
email is intended only for the  
person(s) or entity to which it is  
addressed and may contain  
confidential and/or privileged  
material. Any review,  
retransmission, dissemination  
or other use of, or taking of any  
action in reliance upon, this  
information by persons or entities  
other than the intended recipient  
is prohibited. If you received this  
email in error, please contact the  
sender and permanently delete the  
email from any computer.  


Re: Grouping & Counting

Posted by David Vrensk <da...@icehouse.se>.
I would just preprocess the file with Perl or Ruby:

perl -ne 'next unless m#/#; s#(.*)/(.*)#\1\t\2#; print;' infile > outfile

That would give you

Arts/Animation/Anime<TAB>Clubs_and_Organizations

i.e. two columns for every line (lines without slashes will be skipped).

Come to think of it, if your entire file is just 800k lines, I'd do the
entire thing with Perl.

HTH,

/David

On Fri, Oct 1, 2010 at 13:32, Rob Wilkerson <rw...@lotame.com> wrote:

> Hey guys -
>
> I have a script that loads a list of ~800,000 category hierarchies,
> filters them a bit and streams them through a PHP script for some
> quick procedural work. The file contains one column and a snippet
> looks like this:
>
> Arts
> Arts/Animation
> Arts/Animation/Anime
> Arts/Animation/Anime/Characters
> Arts/Animation/Anime/Clubs_and_Organizations
> Arts/Animation/Anime/Collectibles
> Arts/Animation/Anime/Collectibles/Cels
> Arts/Animation/Anime/Collectibles/Models_and_Figures
> Arts/Animation/Anime/Collectibles/Models_and_Figures/Action_Figures
> Arts/Animation/Anime/Collectibles/Models_and_Figures/Action_Figures/Gundam
> Arts/Animation/Anime/Collectibles/Models_and_Figures/Action_Figures/Zoids
> Arts/Animation/Anime/Collectibles/Models_and_Figures/Models
> Arts/Animation/Anime/Collectibles/Models_and_Figures/Models/Gundam
> Arts/Animation/Anime/Collectibles/Shitajiki
> Arts/Animation/Anime/Creators
> Arts/Animation/Anime/Creators/Anno,_Hideaki
> Arts/Animation/Anime/Creators/Ikuhara,_Kunihiko
> Arts/Animation/Anime/Creators/Miyazaki,_Hayao
> Arts/Animation/Anime/Creators/Studios
> Arts/Animation/Anime/Creators/Studios/Studio_Ghibli
> Arts/Animation/Anime/Creators/Studios/Studio_Ghibli/Titles
> Arts/Animation/Anime/Distribution
> Arts/Animation/Anime/Distribution/Companies
>
> Now I need to take it one step further. I need to get a count of how
> many items are in "Arts", how many are in "Arts/Animation", etc. I
> know a grouping and count is involved, but I can't wrap my mind around
> how to get there since the category path depth is entirely variable
> and I need these numbers relative to the "whole" (i.e. I need to know
> how many times Arts/Animation/Anime appears rather than how many times
> Anime appears at any level).
>
> Any guidance would be much appreciated.
>
> Rob Wilkerson
>
> The information transmitted in this
> email is intended only for the
> person(s) or entity to which it is
> addressed and may contain
> confidential and/or privileged
> material. Any review,
> retransmission, dissemination
> or other use of, or taking of any
> action in reliance upon, this
> information by persons or entities
> other than the intended recipient
> is prohibited. If you received this
> email in error, please contact the
> sender and permanently delete the
> email from any computer.
>
>


-- 
David Vrensk
Systems developer, ICE House AB
Mobile: +46 703 74 69 00