Posted to user@pig.apache.org by Alan Gates <ga...@hortonworks.com> on 2012/09/04 17:51:30 UTC

Re: Count of all the rows

Expressions in Pig can contain tuples and bags.  A tuple is a single record.  It has a defined number of fields, and those fields are in a defined order.  The fields can be given names and types.  Thus it is reasonable to speak of the 3rd field or the field named "user".  That field may be of any data type supported by Pig.  Constant tuples are denoted by ().
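
As a rough illustration of referring to a field by position or by name, assuming a hypothetical three-field input (the file name 'foo' and the schema below are made up for this sketch):

```pig
-- Hypothetical schema: every tuple has exactly three fields, in this order.
A = load 'foo' as (id:int, age:int, user:chararray);
-- By position: positions start at $0, so $2 is the 3rd field.
B = foreach A generate $2;
-- By name: "user" refers to that same 3rd field.
C = foreach A generate user;
```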

A bag is an unordered collection of tuples.  So it is meaningless to speak of the "3rd tuple of a bag".  We do sometimes get sloppy and discuss the schema of a bag, but what we really mean is a schema that applies to all tuples inside the bag.  Constant bags are denoted by {}.
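
One way to see a "bag schema" is to describe the result of a group.  This is only a sketch, and the file name and field names are assumptions:

```pig
-- Hypothetical input with two named, typed fields.
A = load 'foo' as (name:chararray, age:int);
B = group A by name;
-- Prints B's schema: a field called "group" plus a bag field called A.
-- The schema shown for that bag is really the schema of each tuple inside it.
describe B;
```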

Given these definitions it seems like the things we assign to on the left-hand side of Pig Latin statements could be thought of as bags, since they are (usually unordered) collections of tuples.  However, there is a distinction.  You cannot (usually) use these in expressions such as COUNT().  And bags cannot be assigned to, nor used in places where you would expect a relation name.  Thus we distinguish these by calling them relations.  So in the script:
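A = load 'foo' as (name:chararray);  -- schema assumed here so that "name" resolves
B = group A by name;
C = foreach B generate group, COUNT(A);  -- in B the grouping key is called "group"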

A is playing two roles.  In the first and second lines it is a relation.  In the third line it is a bag named after the relation it came from.
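
To make the two roles a bit more concrete, here is a sketch with an assumed two-field schema (the age field is hypothetical, not part of the script above):

```pig
A = load 'foo' as (name:chararray, age:int);
-- A as a relation: it appears where a relational operator expects a relation name.
B = group A by name;
-- A as a bag: inside the foreach it is a field of B, so it can be passed to
-- functions such as COUNT, or projected (A.age is the bag of ages in the group).
C = foreach B generate group, COUNT(A), MAX(A.age);
```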

All of this gets a little fuzzier when you consider nested foreach operators, but I've ignored that for now.  Hope this helps.
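
For what it's worth, a small sketch of the nested foreach case, again with a made-up schema.  Inside the braces a bag is assigned to and handed to a relational operator, which is where the line between bag and relation blurs:

```pig
A = load 'foo' as (name:chararray, url:chararray);
B = group A by name;
C = foreach B {
    urls = A.url;          -- urls is a bag, yet it is assigned to like a relation
    uniq = distinct urls;  -- a relational operator applied to that bag
    generate group, COUNT(uniq);
};
```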

Alan.

On Aug 30, 2012, at 9:57 AM, Mohit Anchlia wrote:

> I looked at the definition of Relation, which says:
> 
> 
> A relation is a bag (more specifically, an outer bag).
> If a relation is a bag, then what's the difference between a Bag and a Relation?
> I am getting a bit confused by the definitions. In the example below, what would
> be a Relation, a Tuple, or a Bag?
> 
> (1,2,3,4)
> 
> Is 1,2,3,4 without the "(" a tuple? Then what is a Relation or a Bag?
> 
> On Wed, Aug 29, 2012 at 4:51 PM, Jonathan Coveney <jc...@gmail.com> wrote:
> 
>> COUNT is a UDF that takes in a Bag and outputs a Long.
>> 
>> Relations are not Bags, so that's one way of thinking about it. But of
>> course, we could have coerced the syntax to make it work.
>> 
>> I like to think of it as such:
>> 
>> A foreach is a transformation on the rows of a relation. Thus, applying
>> COUNT directly to a relation doesn't make any sense, since you're doing an
>> aggregate transformation. This is why grouping is necessary. You're putting
>> all of the rows of the relation into one row (with the catch-all key
>> "all"), so that you can run a function on them.
>> 
>> Don't know if that helps.
>> 
>> 2012/8/29 Mohit Anchlia <mo...@gmail.com>
>> 
>>> Thanks! Why is grouping necessary? Is it to send it to the reducer?
>>> 
>>> On Wed, Aug 29, 2012 at 4:03 PM, Alan Gates <ga...@hortonworks.com> wrote:
>>> 
>>>> A = load 'foo';
>>>> B = group A all;
>>>> C = foreach B generate COUNT(A);
>>>> 
>>>> Alan.
>>>> On Aug 29, 2012, at 3:51 PM, Mohit Anchlia wrote:
>>>> 
>>>>> How do I get count of all the rows? All the examples of COUNT use
>>>>> group by.
>>>> 
>>>> 
>>> 
>>