Posted to user@pig.apache.org by Brian Choi <su...@gmail.com> on 2012/09/12 00:37:21 UTC

Issues with SAMPLE in PIG v0.8.1

Hello Everyone,

          I am wondering if anyone has run into an issue that I am having
using SAMPLE in a Pig script to create a 0.001% subsample of the
original relation.

Assume the relation "A" contains a single column of data (int type) with
1,000,000 records:

Asamp = SAMPLE A 0.00001;
Asamp2 = SAMPLE A 0.0001;

Asamp and Asamp2 should produce subsampled relations with about 10 and 100
records, respectively. However, what I find is that Asamp and Asamp2 are
closer to 1,000 and 10,000 records, which looks like a 100-fold error in
sample size. Interestingly, in the limiting case of:

Asamp3 = SAMPLE A 0.99;

The actual subsample size is VERY close to the expected 99% of the
full relation size. Can anyone shed light on what I may be doing wrong, or
share their experience if they have also seen issues with using SAMPLE in
Pig? Thank you.
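
For concreteness, a minimal sketch of one way to count the sampled records
(the LOAD path and column name are just placeholders):

A = LOAD 'input_data' AS (val:int);    -- placeholder path; ~1,000,000 records
Asamp = SAMPLE A 0.00001;              -- expected ~10 records
AsampCount = FOREACH (GROUP Asamp ALL) GENERATE COUNT(Asamp);
DUMP AsampCount;                       -- observed: closer to ~1,000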

           Brian

Re: Issues with SAMPLE in PIG v0.8.1

Posted by Brian Choi <su...@gmail.com>.
Dmitriy,

       Thank you for the continued effort and for the information. I
think this does shed some light on what I was suspecting. Perhaps we
should upgrade to a later Pig version to circumvent this issue. Thanks
again.

         Brian



Re: Issues with SAMPLE in PIG v0.8.1

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Aaaah those extra operations could be it. I suspect you are affected by
this bug:

https://issues.apache.org/jira/browse/PIG-2014

This was fixed in Pig 0.9.
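
If upgrading is not an option right away, one thing that might be worth
trying (only a sketch, not a verified workaround) is applying the SAMPLE
directly to the loaded relation, before the other upstream operations:

raw = LOAD 'input_data' AS (user_id:long);    -- placeholder path and schema
rawSample = SAMPLE raw 0.001;                 -- sample first, before anything else
UIDs = FOREACH rawSample GENERATE user_id;
UIDsampleCount = FOREACH (GROUP UIDs ALL) GENERATE COUNT(UIDs);
DUMP UIDsampleCount;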

-Dmitriy.


Re: Issues with SAMPLE in PIG v0.8.1

Posted by Brian Choi <su...@gmail.com>.
Dmitriy,

       Yes, that is literally the script I ran, aside from the relation
names. I did, however, run some operations upstream of those statements, and
I wonder if there is some indirect dependency on how SAMPLE is affected by
upstream relations/filtering, etc. Thanks; I didn't expect anyone to solve
or reproduce this, I was more wondering if anyone had seen this in their
scripts before.

           Brian



Re: Issues with SAMPLE in PIG v0.8.1

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
I just ran this very script three times using Pig 0.8 (svn revision
1148107) on a set of 2.5 million rows and got (2509), (2552), and
(2473) as the output.

Don't know what to tell you... can't reproduce. Are you sure you are
running on the input you think you are running on?

Is this literally the script you ran?


Re: Issues with SAMPLE in PIG v0.8.1

Posted by Brian Choi <su...@gmail.com>.
The Pig script is simply as follows:

UIDs = FOREACH xRelation GENERATE $0 as user_id;
UIDsample = SAMPLE UIDs 0.001;
UIDsampleCount = FOREACH (GROUP UIDsample ALL) GENERATE COUNT($1);

where the number of UIDs is ~2.5MM user ids.
In this case UIDsampleCount comes out to ~250,000 records, but with a
0.001 sample fraction it should be ~2,500.
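
It might also help to count the unsampled relation in the same run, so the
observed fraction can be read off directly (just a sketch; divide the two
counts by hand):

UIDtotalCount = FOREACH (GROUP UIDs ALL) GENERATE COUNT(UIDs);
DUMP UIDtotalCount;    -- observed fraction = UIDsampleCount / UIDtotalCount
DUMP UIDsampleCount;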

The version I am using is pig-0.8.1.

Please let me know if there is any other information that you would like me
to provide.

        brian



Re: Issues with SAMPLE in PIG v0.8.1

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Brian, could you provide a complete script that reproduces the issue?
What version of pig are you on?

Thanks,
-D


Re: Issues with SAMPLE in PIG v0.8.1

Posted by Prasanth J <bu...@gmail.com>.
I have used the SAMPLE operator while implementing the CUBE operator, where I
choose the sample percentage at runtime so that it always emits around 100K
tuples. I tested it on inputs from 1M to 100M tuples and it works as expected
with the trunk version. I haven't tested earlier versions.
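
As a rough illustration of picking the percentage at runtime (the relation,
parameter names, and -param mechanism below are placeholders, not the actual
CUBE code), the caller can compute the fraction and pass it in, e.g.
pig -param input=data -param fraction=0.001 sample_100k.pig:

-- $fraction is computed by the caller as target_tuples / total_tuples,
-- e.g. 100000 / 100000000 = 0.001 for a 100M-tuple input
data = LOAD '$input' AS (user_id:long);
smpl = SAMPLE data $fraction;
cnt  = FOREACH (GROUP smpl ALL) GENERATE COUNT(smpl);
DUMP cnt;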

Thanks
-- Prasanth



Re: Issues with SAMPLE in PIG v0.8.1

Posted by Brian Choi <su...@gmail.com>.
Yes - I saw this issue with SAMPLE in multiple runs. The strangest thing
about it is that the output approaches the correct size as you approach a
sample fraction of 100% (or 0.99), but gets worse as you go to lower
sample fractions.

       Brian



Re: Issues with SAMPLE in PIG v0.8.1

Posted by Thejas Nair <th...@hortonworks.com>.
On 9/12/12 11:12 PM, Cheolsoo Park wrote:
> I am puzzled about this. If I am not mistaken, the SAMPLE operator is
> nothing but "Math.random() < x" where "x" is a double.
>
You are right. The SAMPLE operator translates into a filter operator with
the condition "Math.random() < x".


> In my test, SAMPLE A 0.00001 returns about 10 records with a million
> records when running in local mode. I am curious if something can go wrong
> when running it in MR mode.

I wouldn't expect different behavior in MR mode.

Brian,
Do you see this behavior across multiple runs?

-Thejas




Re: Issues with SAMPLE in PIG v0.8.1

Posted by Cheolsoo Park <ch...@cloudera.com>.
I am puzzled about this. If I am not mistaken, the SAMPLE operator is
nothing but "Math.random() < x" where "x" is a double.

In my test, SAMPLE A 0.00001 returns about 10 records with a million
records when running in local mode. I am curious if something can go wrong
when running it in MR mode.
