Posted to dev@pig.apache.org by "Dmitriy V. Ryaboy (JIRA)" <ji...@apache.org> on 2011/06/22 05:34:47 UTC
[jira] [Created] (PIG-2137) SAMPLE should not be pushed above
DISTINCT
SAMPLE should not be pushed above DISTINCT
------------------------------------------
Key: PIG-2137
URL: https://issues.apache.org/jira/browse/PIG-2137
Project: Pig
Issue Type: Bug
Affects Versions: 0.8.1, 0.8.0, 0.9.0, 0.10
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
Priority: Critical
I have an input file that contains 50,000 distinct integers. Each integer is repeated twice, for a total of 100,000 lines.
Script 1, using GROUP BY to get distinct entries in the data, works:
{code}
grunt> f = load 'tmp/dupnumbers.txt';
grunt> d = foreach (group f by $0) generate group;
grunt> s = sample d 0.01;
grunt> n = foreach (group s all) generate COUNT(s);
grunt> dump n;
(493)
{code}
Script 2, using DISTINCT for the same purpose, allows sampling to be done before DISTINCT:
{code}
grunt> f = load 'tmp/dupnumbers.txt';
grunt> d = distinct f;
grunt> s = sample d 0.01;
grunt> n = foreach (group s all) generate COUNT(s);
(980)
{code}
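The gap between the two counts can be reproduced outside Pig. The following is a minimal simulation sketch, not Pig code. It assumes SAMPLE keeps each row independently with probability 0.01, and the helper names `sample` and `distinct` are made up for illustration. Under that assumption, sampling after de-duplication should keep about 50,000 x 0.01 = 500 rows, while sampling the 100,000 raw rows first and de-duplicating afterwards should keep about 50,000 x (1 - 0.99^2) = 995 rows, which is in line with the 493 and 980 reported above.

```python
import random


def sample(rows, p, rng):
    """Keep each row independently with probability p (assumed SAMPLE behavior)."""
    return [r for r in rows if rng.random() < p]


def distinct(rows):
    """Remove duplicate rows."""
    return list(set(rows))


rng = random.Random(2137)  # fixed seed so the run is repeatable

# 50,000 distinct integers, each repeated twice: 100,000 rows in total.
rows = [i for i in range(50_000) for _ in range(2)]

# Intended plan: de-duplicate first, then sample.
intended = sample(distinct(rows), 0.01, rng)

# Plan after the sampling step is moved ahead of DISTINCT.
pushed = distinct(sample(rows, 0.01, rng))

print(len(intended), len(pushed))
```

The second count comes out at roughly twice the first, because every integer gets two independent chances to survive the sampling step.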
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-2137) SAMPLE should not be pushed above
DISTINCT
Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PIG-2137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13053551#comment-13053551 ]
Thejas M Nair commented on PIG-2137:
------------------------------------
It helps to push the filter before the distinct operation (i.e., discard rows early), and the results will be correct if the UDF is deterministic.
I think the filter push-up should be disabled only for non-deterministic UDFs.
[jira] [Commented] (PIG-2137) SAMPLE should not be pushed above
DISTINCT
Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PIG-2137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13053572#comment-13053572 ]
Thejas M Nair commented on PIG-2137:
------------------------------------
Sorry, my mistake, I didn't go through the function properly. Changes look good. +1
[jira] [Commented] (PIG-2137) SAMPLE should not be pushed above
DISTINCT
Posted by "Dmitriy V. Ryaboy (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PIG-2137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054195#comment-13054195 ]
Dmitriy V. Ryaboy commented on PIG-2137:
----------------------------------------
For 0.8, I was going to backport PIG-2014 before this one. We are running both in production right now (on top of 0.8.1), and they are fine.
I did have trouble backporting the tests, though, as a bunch of the optimizer interfaces seem to have changed. I don't think 0.8 is as important, since it doesn't seem likely we'll release 0.8.2, what with 0.9.0 being almost out the door.
[jira] [Commented] (PIG-2137) SAMPLE should not be pushed above
DISTINCT
Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PIG-2137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054155#comment-13054155 ]
Thejas M Nair commented on PIG-2137:
------------------------------------
Dmitriy,
Unit tests and test-patch have passed. You can commit the patch.
But this patch can't be committed to 0.8, as the Nondeterministic annotation was added only in 0.9.
[jira] [Commented] (PIG-2137) SAMPLE should not be pushed above
DISTINCT
Posted by "Dmitriy V. Ryaboy (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PIG-2137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13053577#comment-13053577 ]
Dmitriy V. Ryaboy commented on PIG-2137:
----------------------------------------
Thanks
I'll commit PIG-2137.1 to 0.8, 0.9, and trunk.
[jira] [Commented] (PIG-2137) SAMPLE should not be pushed above
DISTINCT
Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PIG-2137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13053575#comment-13053575 ]
Thejas M Nair commented on PIG-2137:
------------------------------------
(Sorry again!) Actually, it does disable the optimization for all filters with distinct as predecessor. Patch PIG-2137.1.patch has the fix.
[jira] [Updated] (PIG-2137) SAMPLE should not be pushed above
DISTINCT
Posted by "Dmitriy V. Ryaboy (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PIG-2137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dmitriy V. Ryaboy updated PIG-2137:
-----------------------------------
Resolution: Fixed
Fix Version/s: 0.10, 0.9.0
Status: Resolved (was: Patch Available)
Committed to 0.9 and 0.10 (trunk)
[jira] [Updated] (PIG-2137) SAMPLE should not be pushed above
DISTINCT
Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PIG-2137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Thejas M Nair updated PIG-2137:
-------------------------------
Attachment: PIG-2137.1.patch
[jira] [Commented] (PIG-2137) SAMPLE should not be pushed above
DISTINCT
Posted by "Dmitriy V. Ryaboy (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PIG-2137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13053588#comment-13053588 ]
Dmitriy V. Ryaboy commented on PIG-2137:
----------------------------------------
I'll wait for the test-patch results.
[jira] [Commented] (PIG-2137) SAMPLE should not be pushed above
DISTINCT
Posted by "Dmitriy V. Ryaboy (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PIG-2137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13053569#comment-13053569 ]
Dmitriy V. Ryaboy commented on PIG-2137:
----------------------------------------
Thejas, I believe my fix still allows that: it just doesn't early-terminate the optimizer when it encounters DISTINCT, but proceeds to check whether the other conditions required for a successful filter push (such as the UDF being deterministic) apply.
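As a rough illustration of that change, here is a minimal sketch in Python. It is not Pig's PushUpFilter code: `FilterOp`, `other_conditions_hold`, and both `can_push_*` functions are hypothetical names, and the "before" behavior is inferred from the comment above rather than taken from the patch.

```python
from dataclasses import dataclass


@dataclass
class FilterOp:
    # True when the filter condition calls a nondeterministic function,
    # as a random sampling filter would (hypothetical flag).
    nondeterministic: bool


def other_conditions_hold(filter_op: FilterOp) -> bool:
    """Stand-in for the remaining checks; only determinism is modeled here."""
    return not filter_op.nondeterministic


def can_push_before_fix(filter_op: FilterOp, predecessor: str) -> bool:
    # Assumed old behavior: a DISTINCT predecessor approved the push
    # immediately, without inspecting the filter condition.
    if predecessor == "distinct":
        return True
    return other_conditions_hold(filter_op)


def can_push_after_fix(filter_op: FilterOp, predecessor: str) -> bool:
    # Assumed new behavior: no early exit for DISTINCT, so the
    # remaining checks always run.
    return other_conditions_hold(filter_op)


sampling_filter = FilterOp(nondeterministic=True)
plain_filter = FilterOp(nondeterministic=False)

print(can_push_before_fix(sampling_filter, "distinct"))
print(can_push_after_fix(sampling_filter, "distinct"))
print(can_push_after_fix(plain_filter, "distinct"))
```

In this sketch a deterministic filter can still move ahead of DISTINCT after the fix, which is the case Thejas wanted to keep, while a sampling filter stays where it is.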
[jira] [Updated] (PIG-2137) SAMPLE should not be pushed above
DISTINCT
Posted by "Dmitriy V. Ryaboy (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PIG-2137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dmitriy V. Ryaboy updated PIG-2137:
-----------------------------------
Attachment: PIG-2137.patch
Easy fix. Attached. Please review.
[jira] [Updated] (PIG-2137) SAMPLE should not be pushed above
DISTINCT
Posted by "Dmitriy V. Ryaboy (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PIG-2137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dmitriy V. Ryaboy updated PIG-2137:
-----------------------------------
Description:
I have an input file that contains 50,000 distinct integers. Each integer is repeated twice, for a total of 100,000 lines.
Script 1, using GROUP BY to get distinct entries in the data, works:
{code}
grunt> f = load 'tmp/dupnumbers.txt';
grunt> d = foreach (group f by $0) generate group;
grunt> s = sample d 0.01;
grunt> n = foreach (group s all) generate COUNT(s);
grunt> dump n;
(493)
{code}
Script 2, using DISTINCT for the same purpose, allows sampling to be done before DISTINCT:
{code}
grunt> f = load 'tmp/dupnumbers.txt';
grunt> d = distinct f;
grunt> s = sample d 0.01;
grunt> n = foreach (group s all) generate COUNT(s);
(980)
{code}
was:
I have an input file that contains 50,000 distinct integers. Each integer is repeated twice, for a total of 100,000 lines.
Script 1, using GROUP BY to get distinct entries in the data, works:
{code}
grunt> f = load 'tmp/dupnumbers.txt';
grunt> d = foreach (group f by $0) generate group;
grunt> s = sample d 0.01;
grunt> n = foreach (group s all) generate COUNT(s);
grunt> dump n;
(493)
{code}
Script 2, using DISTINCT for the same purpose, allows sampling to be done before DISTINCT:
grunt> f = load 'tmp/dupnumbers.txt';
grunt> d = distinct f;
grunt> s = sample d 0.01;
grunt> n = foreach (group s all) generate COUNT(s);
(980)
{code}
[jira] [Updated] (PIG-2137) SAMPLE should not be pushed above
DISTINCT
Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PIG-2137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Thejas M Nair updated PIG-2137:
-------------------------------
Attachment: PIG-2137.2.patch
PIG-2137.2.patch adds a test case for the situation where the filter should be pushed above distinct.
I haven't run unit tests and test-patch with my changes to the patch. I will start them today, and it will take a couple of hours. If you are able to run them and they finish before I get back with results, please feel free to commit.
[jira] [Updated] (PIG-2137) SAMPLE should not be pushed above
DISTINCT
Posted by "Dmitriy V. Ryaboy (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PIG-2137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dmitriy V. Ryaboy updated PIG-2137:
-----------------------------------
Status: Patch Available (was: Open)
[jira] [Commented] (PIG-2137) SAMPLE should not be pushed above
DISTINCT
Posted by "Dmitriy V. Ryaboy (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PIG-2137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13053023#comment-13053023 ]
Dmitriy V. Ryaboy commented on PIG-2137:
----------------------------------------
Turning off PushUpFilter fixes the issue. It seems like the fix to PIG-2014 was incomplete.