
Posted to user@pig.apache.org by wi...@thomsonreuters.com on 2014/05/27 19:03:57 UTC

How to sample an inner bag?

Hi Pig users,

Is there an easy/efficient way to sample an inner bag? For example, with input in a relation like

(id1,att1,{(a,0.01),(b,0.02),(x,0.999749968742)})
(id1,att2,{(a,0.03),(b,0.04),(x,0.998749217772)})
(id2,att1,{(b,0.05),(c,0.06),(x,0.996945334509)})

I’d like to sample 1/3 of the elements of each bag and get something like this (ignoring the non-determinism):
(id1,att1,{(x,0.999749968742)})
(id1,att2,{(b,0.04)})
(id2,att1,{(b,0.05)})

I have a circumlocution that seems to work, using flatten + group, but it looks ugly to me:

tfidf1 = load '$tfidf' as (id: chararray,
                          att: chararray,
                          pairs: {pair: (word: chararray, value: double)});

flat_tfidf = foreach tfidf1 generate id, att, FLATTEN(pairs);
sample_flat_tfidf = sample flat_tfidf 0.33;
tfidf2 = group sample_flat_tfidf by (id, att);

tfidf = foreach tfidf2 {
   pairs = foreach sample_flat_tfidf generate pairs::word, pairs::value;
   generate group.id, group.att, pairs;
};

Can someone suggest a better way to do this?  Many thanks!

William F Dowling
Senior Technologist

Thomson Reuters

RE: How to sample an inner bag?

Posted by wi...@thomsonreuters.com.

As far as I can tell, the Python UDF I proposed is working fine. Pig passes a bag to Python as a list of tuples. The implementation of random.sample does not iterate over the input list.

I suppose that if the bag were huge, this would not work, or would consume too much memory while the argument to the UDF is being prepared. But a solution that relied on sorting the inner bag would have the same problem. Anyway, the inner bags in my data are not huge, so the UDF is OK.
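
For the huge-bag case mentioned above, one option that was not proposed in this thread is reservoir sampling (Algorithm R), which picks n items in a single pass without holding the whole input. The sketch below is plain Python and only illustrates the technique; the name reservoir_sample is hypothetical. It would help only where the bag can be consumed as a stream. In the Python UDF case described above, Pig has already built the full list before the function is called.

```python
import random

def reservoir_sample(items, n, rng=random):
    """Return n items chosen uniformly from an iterable of unknown length."""
    reservoir = []
    for i, item in enumerate(items):
        if i < n:
            # Fill the reservoir with the first n items.
            reservoir.append(item)
        else:
            # Replace a random slot with probability n / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < n:
                reservoir[j] = item
    return reservoir

stream = iter([("a", 0.01), ("b", 0.02), ("x", 0.999749968742)])
print(reservoir_sample(stream, 1))
```

Memory use is proportional to n rather than to the size of the bag, and the input is read exactly once.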

Thanks again for your help,
Will

William F Dowling
Senior Technologist
Thomson Reuters


-----Original Message-----
From: Mehmet Tepedelenlioglu [mailto:mehmetsino@yahoo.com] 
Sent: Wednesday, May 28, 2014 4:27 PM
To: user@pig.apache.org user@pig.apache.org
Subject: Re: How to sample an inner bag?

I have no experience with Python UDFs (I use Java), but I doubt the example you supplied would work. First, I am not sure whether a bag is a subclass of sequence, which is, I believe, what you need to pass to the sample method. Second, at least in Java, if I remember correctly, you can iterate over the bag only once, and unless you know how the sample method works, I would caution against passing a bag to it. You could just read the input bag into a sequence and pass that, or you could iterate over it, accept elements with a certain probability, and spill them to an output bag.



Re: How to sample an inner bag?

Posted by Mehmet Tepedelenlioglu <me...@yahoo.com>.

I have no experience with Python UDFs (I use Java), but I doubt the example you supplied would work. First, I am not sure whether a bag is a subclass of sequence, which is, I believe, what you need to pass to the sample method. Second, at least in Java, if I remember correctly, you can iterate over the bag only once, and unless you know how the sample method works, I would caution against passing a bag to it. You could just read the input bag into a sequence and pass that, or you could iterate over it, accept elements with a certain probability, and spill them to an output bag.
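
The second option above (iterate once and accept each element with a certain probability) can be sketched in plain Python as follows. This is only an illustration of the idea, not tested inside Pig; the name bernoulli_sample and the parameter p are made up for the example.

```python
import random

def bernoulli_sample(bag, p, rng=random):
    """Keep each element of bag independently with probability p, in one pass."""
    out = []
    for item in bag:  # works for any iterable, including a one-shot iterator
        if rng.random() < p:
            out.append(item)
    return out

bag = iter([("a", 0.03), ("b", 0.04), ("x", 0.998749217772)])
print(bernoulli_sample(bag, 1.0 / 3))
```

Note that the size of the result varies from call to call and can be zero, so this gives roughly a fraction p of each bag rather than an exact count.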


On May 28, 2014, at 1:06 PM, <wi...@thomsonreuters.com> <wi...@thomsonreuters.com> wrote:

> Thanks Mehmet! I tried that and it seems to work on a small test case. I'm also experimenting now with your other suggestion, a UDF. 
> I will probably use something like this, which seems less tricky and does not rely on a sort:
> 
> #!/usr/bin/python
> import random
> @outputSchema('id_bag: {items: (item: chararray)}')
> def random_subset(bag, n):
>    # return bag if it has <= n elements or n=-1, else return n random elements from it
>    if n == -1 or len(bag) <= n:
>        return bag
>    else:
>        return random.sample(bag, n)
> 
> 
> Thanks again,
> 
> Will
> 
> 
> William F Dowling
> Senior Technologist
> Thomson Reuters
> 
> 

RE: How to sample an inner bag?

Posted by wi...@thomsonreuters.com.

Thanks Mehmet! I tried that and it seems to work on a small test case. I'm also experimenting now with your other suggestion, a UDF. 
I will probably use something like this, which seems less tricky and does not rely on a sort:

#!/usr/bin/python
import random
@outputSchema('id_bag: {items: (item: chararray)}')
def random_subset(bag, n):
    # return bag if it has <= n elements or n=-1, else return n random elements from it
    if n == -1 or len(bag) <= n:
        return bag
    else:
        return random.sample(bag, n)
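
For anyone who wants to try the function outside Pig, here is a small stand-alone sketch. It assumes, as reported elsewhere in this thread, that the bag arrives as a list of tuples. The outputSchema stub is only a stand-in so the file runs in a plain Python interpreter; inside Pig the real decorator is supplied by the runtime.

```python
import random

def outputSchema(schema):
    # Stand-in for the decorator Pig provides to Python UDFs; it does nothing here.
    def wrap(func):
        return func
    return wrap

@outputSchema('id_bag: {items: (item: chararray)}')
def random_subset(bag, n):
    # Return bag unchanged if it has <= n elements or n == -1,
    # else return n random elements from it.
    if n == -1 or len(bag) <= n:
        return bag
    return random.sample(bag, n)

bag = [("a", 0.01), ("b", 0.02), ("x", 0.999749968742)]
print(random_subset(bag, 1))   # a list holding one randomly chosen tuple
print(random_subset(bag, -1))  # the whole bag, unchanged
print(random_subset(bag, 5))   # also the whole bag, since it has only 3 elements
```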


Thanks again,

Will


William F Dowling
Senior Technologist
Thomson Reuters


-----Original Message-----
From: Mehmet Tepedelenlioglu [mailto:mehmetsino@yahoo.com] 
Sent: Tuesday, May 27, 2014 5:09 PM
To: user@pig.apache.org user@pig.apache.org
Subject: Re: How to sample an inner bag?

If you know exactly how many items you want from each inner bag, you can hack it like this:

x = foreach x {
    y = foreach x generate RANDOM() as rnd, *;
    y = order y by rnd;
    y = limit y $SAMPLE_NUM;
    y = foreach y generate $1 ..;
    generate group, y;
}

Basically, randomize the inner bag, sort it with respect to the random number, and limit it to the sample size you want. No reducers needed.
If the inner bags are huge, ordering will obviously be expensive. If you don’t like this, you might have to write your own UDF.

Mehmet


Re: How to sample an inner bag?

Posted by Pradeep Gollakota <pr...@gmail.com>.

@Mehmet... great hack! I like it :-P


On Tue, May 27, 2014 at 5:08 PM, Mehmet Tepedelenlioglu <
mehmetsino@yahoo.com> wrote:

> If you know how many items you want from each inner bag exactly, you can
> hack it like this:
>
> x = foreach x {
>     y = foreach x generate RANDOM() as rnd, *;
>     y = order y by rnd;
>     y = limit y $SAMPLE_NUM;
>     y = foreach y generate $1 ..;
>     generate group, y;
> }
>
> Basically randomize the inner bag, sort it wrt the random number and limit
> it to the sample size you want. No reducers needed.
> If the inner bags are huge, ordering will obviously be expensive. If you
> don’t like this, you might have to write your own udf.
>
> Mehmet
>

Re: How to sample an inner bag?

Posted by Mehmet Tepedelenlioglu <me...@yahoo.com>.

If you know exactly how many items you want from each inner bag, you can hack it like this:

x = foreach x {
    y = foreach x generate RANDOM() as rnd, *;
    y = order y by rnd;
    y = limit y $SAMPLE_NUM;
    y = foreach y generate $1 ..;
    generate group, y;
}

Basically, randomize the inner bag, sort it with respect to the random number, and limit it to the sample size you want. No reducers needed.
If the inner bags are huge, ordering will obviously be expensive. If you don’t like this, you might have to write your own UDF.
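
To make the logic of the hack easier to follow, here is the same idea as a plain-Python sketch: attach a random key, sort by it, keep the first n, then drop the key. It is an illustration only, not Pig code; the function name sample_by_random_key is made up for the example.

```python
import random

def sample_by_random_key(bag, n, rng=random):
    """Return up to n tuples from bag, chosen by sorting on a random key."""
    # Pair every tuple with its own random number (RANDOM() in the Pig version).
    keyed = [(rng.random(), item) for item in bag]
    # Order by the random key only, so the data values never affect the order.
    keyed.sort(key=lambda pair: pair[0])
    # Keep the first n entries (LIMIT) and drop the key again ($1 ..).
    return [item for _, item in keyed[:n]]

bag = [("a", 0.01), ("b", 0.02), ("x", 0.999749968742)]
picked = sample_by_random_key(bag, 1, random.Random(0))
print(picked)
```

The sort costs O(m log m) for a bag of m elements, which is the expense mentioned above for huge bags.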

Mehmet

On May 27, 2014, at 10:03 AM, <wi...@thomsonreuters.com> <wi...@thomsonreuters.com> wrote:

> Hi Pig users,
> 
> Is there an easy/efficient way to sample an inner bag? For example, with input in a relation like
> 
> (id1,att1,{(a,0.01),(b,0.02),(x,0.999749968742)})
> (id1,att2,{(a,0.03),(b,0.04),(x,0.998749217772)})
> (id2,att1,{(b,0.05),(c,0.06),(x,0.996945334509)})
> 
> I’d like to sample 1/3 the elements of the bags, and get something like (ignoring the non-determinism)
> (id1,att1,{(x,0.999749968742)})
> (id1,att2,{(b,0.04)})
> (id2,att1,{(b,0.05)})
> 
> I have a circumlocution that seems to work using flatten+ group but that looks ugly to me:
> 
> tfidf1 = load '$tfidf' as (id: chararray,
>                          att: chararray,
>                          pairs: {pair: (word: chararray, value: double)});
> 
> flat_tfidf = foreach tfidf1 generate id, att, FLATTEN(pairs);
> sample_flat_tfidf = sample flat_tfidf 0.33;
> tfidf2 = group sample_flat_tfidf by (id, att);
> 
> tfidf = foreach tfidf2 {
>   pairs = foreach sample_flat_tfidf generate pairs::word, pairs::value;
>   generate group.id, group.att, pairs;
> };
> 
> Can someone suggest a better way to do this?  Many thanks!
> 
> William F Dowling
> Senior Technologist
> 
> Thomson Reuters
> 
> 
>