Posted to dev@pig.apache.org by "Dmitriy V. Ryaboy (JIRA)" <ji...@apache.org> on 2011/05/10 08:26:03 UTC

[jira] [Updated] (PIG-2014) SAMPLE shouldn't be pushed up

     [ https://issues.apache.org/jira/browse/PIG-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy V. Ryaboy updated PIG-2014:
-----------------------------------

    Attachment: PIG-2014.patch

Implemented the suggested approach to fixing this. Please review.

> SAMPLE shouldn't be pushed up
> -----------------------------
>
>                 Key: PIG-2014
>                 URL: https://issues.apache.org/jira/browse/PIG-2014
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Jacob Perkins
>            Assignee: Dmitriy V. Ryaboy
>         Attachments: PIG-2014.patch
>
>
> Consider the following code:
> {code:none}
> tfidf_all = LOAD '$TFIDF' AS (doc_id:chararray, token:chararray, weight:double);
> grouped   = GROUP tfidf_all BY doc_id;
> vectors   = FOREACH grouped GENERATE group AS doc_id, tfidf_all.(token, weight) AS vector;
> DUMP vectors;
> {code}
> This, of course, runs just fine. In a real example, tfidf_all contains 1,428,280 records. The number of reduce output records should equal the number of documents, which turns out to be 18,863 in this case. All well and good.
> The strangeness comes when you add a SAMPLE command:
> {code:none}
> sampled = SAMPLE vectors 0.0012;
> DUMP sampled;
> {code}
> Running this results in 1,513 reduce output records. The number of reduce output records should be much, much closer to 22 or 23 (e.g. 0.0012 * 18,863 is about 22.6).
> Evidently, Pig rewrites SAMPLE into a filter and then pushes that filter in front of the GROUP, so individual records are sampled before grouping instead of whole groups after it. It shouldn't push that filter,
> since the UDF is non-deterministic.
> Quick fix: if you add "-t PushUpFilter" to your command line when invoking pig, that optimization is turned off and this won't happen.
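
The effect described in the report can be reproduced outside Pig with a small simulation. The sketch below is Python rather than Pig Latin, and it assumes that SAMPLE behaves like an independent random filter applied to each row, which is how the report describes the rewrite. The function names and the synthetic data (2,000 documents with 75 tokens each, sampling rate 0.01) are made up for illustration and are not taken from the issue or from Pig's code.

```python
import random
from collections import defaultdict


def make_records(n_docs, tokens_per_doc):
    """Build synthetic (doc_id, token, weight) rows; purely illustrative data."""
    return [
        (f"doc{d}", f"tok{t}", 1.0)
        for d in range(n_docs)
        for t in range(tokens_per_doc)
    ]


def group_then_sample(records, p, rng):
    """Intended order: group by doc_id, then keep each whole group with probability p."""
    groups = defaultdict(list)
    for doc_id, token, weight in records:
        groups[doc_id].append((token, weight))
    return {doc_id: vec for doc_id, vec in groups.items() if rng.random() <= p}


def sample_then_group(records, p, rng):
    """Pushed-up order: keep each row with probability p, then group the survivors."""
    groups = defaultdict(list)
    for doc_id, token, weight in records:
        if rng.random() <= p:
            groups[doc_id].append((token, weight))
    return dict(groups)


if __name__ == "__main__":
    rows = make_records(n_docs=2000, tokens_per_doc=75)
    intended = group_then_sample(rows, 0.01, random.Random(0))
    pushed = sample_then_group(rows, 0.01, random.Random(0))
    print("groups when sampling after GROUP: ", len(intended))
    print("groups when sampling before GROUP:", len(pushed))
```

Under these assumptions, sampling after the group keeps roughly p times the number of documents, each with its complete vector, while sampling before the group keeps a few rows from a large fraction of the documents. That matches the shape of the reported numbers: far more output groups than expected, each built from an incomplete set of tokens.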
