Posted to user@pig.apache.org by Jacob Perkins <ja...@gmail.com> on 2011/04/24 19:41:28 UTC

SAMPLE after a GROUP BY

So I'm running into something strange. Consider the following code:

tfidf_all = LOAD '$TFIDF' AS (doc_id:chararray, token:chararray,
weight:double);
grouped = GROUP tfidf_all BY doc_id;
vectors = FOREACH grouped GENERATE group AS doc_id, tfidf_all.(token,
weight) AS vector;
DUMP vectors;

This, of course, runs just fine. tfidf_all contains 1,428,280 records.
The reduce output records should be exactly the number of documents,
which turns out to be 18,863 in this case. All well and good.
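
For reference, DESCRIBE on the vectors relation should report a schema
along these lines (a sketch based on the field names in the script
above; exact formatting varies by Pig version):

vectors: {doc_id: chararray, vector: {(token: chararray, weight: double)}}

That is, one record per document, each carrying a bag of (token, weight)
tuples.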

The strangeness comes when I add a SAMPLE command:

sampled = SAMPLE vectors 0.0012;
DUMP sampled;

Running this results in 1,513 reduce output records. So, am I insane, or
shouldn't the reduce output records be much, much closer to 22 or 23
records (i.e., 0.0012 * 18,863 ≈ 22.6)?

--jacob
@thedatachef


Re: SAMPLE after a GROUP BY

Posted by Jacob Perkins <ja...@gmail.com>.
JIRA filed, see:

https://issues.apache.org/jira/browse/PIG-2014

--jacob
@thedatachef

On Mon, 2011-04-25 at 09:02 -0700, Alan Gates wrote:
> You are not insane.  Pig rewrites SAMPLE into a FILTER and then pushes
> that filter in front of the GROUP.  It shouldn't push that filter, since
> the sampling UDF is non-deterministic.  If you add "-t PushUpFilter" to
> your command line when invoking Pig, this won't happen.  Could you file
> a JIRA for this so we can keep track of it?
> 
> Alan.



Re: SAMPLE after a GROUP BY

Posted by Alan Gates <ga...@yahoo-inc.com>.
You are not insane.  Pig rewrites SAMPLE into a FILTER and then pushes
that filter in front of the GROUP.  It shouldn't push that filter, since
the sampling UDF is non-deterministic.  If you add "-t PushUpFilter" to
your command line when invoking Pig, this won't happen.  Could you file
a JIRA for this so we can keep track of it?
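
Conceptually, the rewrite and the bad push look something like the
following. The RANDOM()-based filter is a sketch of what SAMPLE
desugars to, not the literal plan Pig generates:

-- what was written:
sampled = SAMPLE vectors 0.0012;

-- what Pig effectively runs after the rewrite:
sampled = FILTER vectors BY RANDOM() <= 0.0012;

-- what PushUpFilter then (wrongly) turns it into: the 0.0012 rate
-- is applied to the 1,428,280 input rows (~1,714 sampled rows,
-- landing in ~1,500 distinct groups) instead of the 18,863
-- grouped records:
filtered = FILTER tfidf_all BY RANDOM() <= 0.0012;
grouped  = GROUP filtered BY doc_id;

To disable just that optimizer rule when launching the script (script
name illustrative):

pig -t PushUpFilter tfidf.pig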

Alan.
