You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@drill.apache.org by John Omernik <jo...@omernik.com> on 2016/11/28 16:27:25 UTC

Grabbing Random Sample of rows based on a column

Is there a way to grab a random return of data from Drill?

For example, let's say I have a table with 1 billion rows, and I want to
return 100,000 at random based on a sampling of a specific column... is
that possible?

Thanks

John

Re: Grabbing Random Sample of rows based on a column

Posted by Ted Dunning <te...@gmail.com>.
OK. Suppose that you have a column called random_key in the main table t1.
Suppose that you have a second table t2 with two columns, random_key and p.

This should do your job:

select * from t1 join t2 using random_key where random() < t2.p;

This will give you an extra column p (relative to t1). If we had
expressions like t1.* we could limit that.

Clearly t2 doesn't have to be pre-defined.

In fact, here is a query that should work to sample about 100 of different
ranges of b

with t1 as  (select round(b) as k, random() z, b from maprfs.ted.`xyz/*`) ,
     t2 as  (select k, count(k) cnt from t1 group by k),
     t3 as  (select k, case when cnt < 100 then 1.0 else 100.0/cnt end p
from t2),
     t4 as  (select t1.z as z, t1.b as b, t1.k as k, t3.p as p from t1 join
t3 on (t1.k = t3.k))
select * from t4 where z < p;

xyz has lots of data in it:

0: jdbc:drill:> with t1 as  (select round(b) as k, rand() z, b from
maprfs.ted.`xyz/*`)
. . . . . . . > select k, count(k) cnt from t1 group by k;
+-------+---------+
|   k   |   cnt   |
+-------+---------+
| 1.0   | 306696  |
| 2.0   | 26540   |
| -2.0  | 188484  |
| -3.0  | 9852    |
| -4.0  | 8       |
| 3.0   | 156     |
| 0.0   | 841312  |
| -1.0  | 724104  |
+-------+---------+

Here is an example of how well this query does. Note that there is no way
to get more -4's since there are only 8 in the original data.

0: jdbc:drill:> with t1 as  (select round(b) as k, random() z, b from
maprfs.ted.`xyz/*`) ,
. . . . . . . >      t2 as  (select k, count(k) cnt from t1 group by k),
. . . . . . . >      t3 as  (select k, case when cnt < 10 then 1.0 else
100.0/cnt end p from t2),
. . . . . . . >      t4 as  (select t1.z as z, t1.b as b, t1.k as k, t3.p
as p from t1 join t3 on (t1.k = t3.k))
. . . . . . . > select k,count(1) from t4 where z <= p group by k order by k
. . . . . . . > ;
+-------+---------+
|   k   | EXPR$1  |
+-------+---------+
| -4.0  | 8       |
| -3.0  | 107     |
| -2.0  | 91      |
| -1.0  | 104     |
| 0.0   | 93      |
| 1.0   | 110     |
| 2.0   | 105     |
| 3.0   | 92      |
+-------+---------+
8 rows selected (2.169 seconds)


Unfortunately, there is a bug in the rand() function (this is a very old
bug). Use random() instead.

0: jdbc:drill:> select round(b) as k, rand() bad, random() z, b from
maprfs.ted.`xyz/*` limit 10;
+-------+----------------------+----------------------+----------------------+
|   k   |         bad          |          z           |          b
  |
+-------+----------------------+----------------------+----------------------+
| 1.0   | 0.46045930988550243  | 0.5743224359285806   | 0.9330797837593181
  |
| 0.0   | 0.46045930988550243  | 0.314046977888911    | 0.2428613780673916
  |
| 1.0   | 0.46045930988550243  | 0.21483988823160682  | 0.6581640175620538
  |
| 1.0   | 0.46045930988550243  | 0.2660185130331417   | 0.5177771442646559
  |
| 1.0   | 0.46045930988550243  | 0.31355327439484393  | 1.1932869828247252
  |
| 1.0   | 0.46045930988550243  | 0.8150798519837753   | 0.5030685771327987
  |
| 1.0   | 0.46045930988550243  | 0.08149861547426307  | 0.9183712166274609
  |
| 1.0   | 0.46045930988550243  | 0.5245412734834898   | 0.777984343330063
 |
| -1.0  | 0.46045930988550243  | 0.9605005092215819   | -1.141443398317655
  |
| -2.0  | 0.46045930988550243  | 0.4892755331157771   | -1.8316618040095816
 |
+-------+----------------------+----------------------+----------------------+
10 rows selected (0.434 seconds)



On Mon, Nov 28, 2016 at 1:35 PM, John Omernik <jo...@omernik.com> wrote:

> So I may have data that is 20 columns wide, what I am looking for is based
> on a single column, pick a random sampling from that column, but return the
> the whole row... I guess it doesn't matter much on the column, random is
> random, I just want the whole row, not just just a a single column.  Two
> passes wouldn't be horrible, I was just trying to see how others approach
> this problem.
>
> On Mon, Nov 28, 2016 at 2:58 PM, Ted Dunning <te...@gmail.com>
> wrote:
>
> > What does "grab random form that column"?
> >
> > Does it mean use that row to determine the probability of picking the
> row?
> >
> > How bad is it to make two passes through the data?
> >
> >
> >
> > On Mon, Nov 28, 2016 at 11:19 AM, John Omernik <jo...@omernik.com> wrote:
> >
> > > This is less about a random sampling of one column, more about based
> on a
> > > column, grab random from that column, but return the whole row...
> > >
> > > On Mon, Nov 28, 2016 at 11:45 AM, Ted Dunning <te...@gmail.com>
> > > wrote:
> > >
> > > >
> > > > The answer is probably yes. (Get it?)
> > > >
> > > > If you just want a random sample of one column, try random() < p as a
> > > > qualifier in the where clause.
> > > >
> > > > If you want samples where the likelihood varies with the value of a
> > > > column, the answer is slightly more elaborate.  For instance, suppose
> > you
> > > > want about a thousand samples from each city in the data. This means
> > that
> > > > you should have p=1 for all cities where there are less than a
> thousand
> > > > samples at all and p=1000/n where n is the number of samples for the
> > > > current city. So what you want is a two pass query that counts the
> > cities
> > > > and then uses these counts to get probabilities. I am not up for
> typing
> > > > that on a phone, but it should be straightforward.
> > > >
> > > > This same task can be done in a single pass by using what is called
> > > > reservoir sampling. You can use two levels of reservoir sampling
> with a
> > > > counter to bias the results but that will require a user defined
> > > aggregator
> > > > that can work on two levels and I don't think that is possible/easy
> yet
> > > > with drill.
> > > >
> > > > Sent from my iPhone
> > > >
> > > > > On Nov 28, 2016, at 8:27, John Omernik <jo...@omernik.com> wrote:
> > > > >
> > > > > Is there a way to grab a random return of data from Drill?
> > > > >
> > > > > For example, let's say I have a table with 1 billion rows, and I
> want
> > > to
> > > > > return 100,000 at random based on a sampling of a specific
> column...
> > is
> > > > > that possible?
> > > > >
> > > > > Thanks
> > > > >
> > > > > John
> > > >
> > >
> >
>

Re: Grabbing Random Sample of rows based on a column

Posted by John Omernik <jo...@omernik.com>.
So I may have data that is 20 columns wide, what I am looking for is based
on a single column, pick a random sampling from that column, but return the
the whole row... I guess it doesn't matter much on the column, random is
random, I just want the whole row, not just just a a single column.  Two
passes wouldn't be horrible, I was just trying to see how others approach
this problem.

On Mon, Nov 28, 2016 at 2:58 PM, Ted Dunning <te...@gmail.com> wrote:

> What does "grab random form that column"?
>
> Does it mean use that row to determine the probability of picking the row?
>
> How bad is it to make two passes through the data?
>
>
>
> On Mon, Nov 28, 2016 at 11:19 AM, John Omernik <jo...@omernik.com> wrote:
>
> > This is less about a random sampling of one column, more about based on a
> > column, grab random from that column, but return the whole row...
> >
> > On Mon, Nov 28, 2016 at 11:45 AM, Ted Dunning <te...@gmail.com>
> > wrote:
> >
> > >
> > > The answer is probably yes. (Get it?)
> > >
> > > If you just want a random sample of one column, try random() < p as a
> > > qualifier in the where clause.
> > >
> > > If you want samples where the likelihood varies with the value of a
> > > column, the answer is slightly more elaborate.  For instance, suppose
> you
> > > want about a thousand samples from each city in the data. This means
> that
> > > you should have p=1 for all cities where there are less than a thousand
> > > samples at all and p=1000/n where n is the number of samples for the
> > > current city. So what you want is a two pass query that counts the
> cities
> > > and then uses these counts to get probabilities. I am not up for typing
> > > that on a phone, but it should be straightforward.
> > >
> > > This same task can be done in a single pass by using what is called
> > > reservoir sampling. You can use two levels of reservoir sampling with a
> > > counter to bias the results but that will require a user defined
> > aggregator
> > > that can work on two levels and I don't think that is possible/easy yet
> > > with drill.
> > >
> > > Sent from my iPhone
> > >
> > > > On Nov 28, 2016, at 8:27, John Omernik <jo...@omernik.com> wrote:
> > > >
> > > > Is there a way to grab a random return of data from Drill?
> > > >
> > > > For example, let's say I have a table with 1 billion rows, and I want
> > to
> > > > return 100,000 at random based on a sampling of a specific column...
> is
> > > > that possible?
> > > >
> > > > Thanks
> > > >
> > > > John
> > >
> >
>

Re: Grabbing Random Sample of rows based on a column

Posted by Ted Dunning <te...@gmail.com>.
What does "grab random form that column"?

Does it mean use that row to determine the probability of picking the row?

How bad is it to make two passes through the data?



On Mon, Nov 28, 2016 at 11:19 AM, John Omernik <jo...@omernik.com> wrote:

> This is less about a random sampling of one column, more about based on a
> column, grab random from that column, but return the whole row...
>
> On Mon, Nov 28, 2016 at 11:45 AM, Ted Dunning <te...@gmail.com>
> wrote:
>
> >
> > The answer is probably yes. (Get it?)
> >
> > If you just want a random sample of one column, try random() < p as a
> > qualifier in the where clause.
> >
> > If you want samples where the likelihood varies with the value of a
> > column, the answer is slightly more elaborate.  For instance, suppose you
> > want about a thousand samples from each city in the data. This means that
> > you should have p=1 for all cities where there are less than a thousand
> > samples at all and p=1000/n where n is the number of samples for the
> > current city. So what you want is a two pass query that counts the cities
> > and then uses these counts to get probabilities. I am not up for typing
> > that on a phone, but it should be straightforward.
> >
> > This same task can be done in a single pass by using what is called
> > reservoir sampling. You can use two levels of reservoir sampling with a
> > counter to bias the results but that will require a user defined
> aggregator
> > that can work on two levels and I don't think that is possible/easy yet
> > with drill.
> >
> > Sent from my iPhone
> >
> > > On Nov 28, 2016, at 8:27, John Omernik <jo...@omernik.com> wrote:
> > >
> > > Is there a way to grab a random return of data from Drill?
> > >
> > > For example, let's say I have a table with 1 billion rows, and I want
> to
> > > return 100,000 at random based on a sampling of a specific column... is
> > > that possible?
> > >
> > > Thanks
> > >
> > > John
> >
>

Re: Grabbing Random Sample of rows based on a column

Posted by John Omernik <jo...@omernik.com>.
This is less about a random sampling of one column, more about based on a
column, grab random from that column, but return the whole row...

On Mon, Nov 28, 2016 at 11:45 AM, Ted Dunning <te...@gmail.com> wrote:

>
> The answer is probably yes. (Get it?)
>
> If you just want a random sample of one column, try random() < p as a
> qualifier in the where clause.
>
> If you want samples where the likelihood varies with the value of a
> column, the answer is slightly more elaborate.  For instance, suppose you
> want about a thousand samples from each city in the data. This means that
> you should have p=1 for all cities where there are less than a thousand
> samples at all and p=1000/n where n is the number of samples for the
> current city. So what you want is a two pass query that counts the cities
> and then uses these counts to get probabilities. I am not up for typing
> that on a phone, but it should be straightforward.
>
> This same task can be done in a single pass by using what is called
> reservoir sampling. You can use two levels of reservoir sampling with a
> counter to bias the results but that will require a user defined aggregator
> that can work on two levels and I don't think that is possible/easy yet
> with drill.
>
> Sent from my iPhone
>
> > On Nov 28, 2016, at 8:27, John Omernik <jo...@omernik.com> wrote:
> >
> > Is there a way to grab a random return of data from Drill?
> >
> > For example, let's say I have a table with 1 billion rows, and I want to
> > return 100,000 at random based on a sampling of a specific column... is
> > that possible?
> >
> > Thanks
> >
> > John
>

Re: Grabbing Random Sample of rows based on a column

Posted by Ted Dunning <te...@gmail.com>.
The answer is probably yes. (Get it?)

If you just want a random sample of one column, try random() < p as a qualifier in the where clause. 

If you want samples where the likelihood varies with the value of a column, the answer is slightly more elaborate.  For instance, suppose you  want about a thousand samples from each city in the data. This means that you should have p=1 for all cities where there are less than a thousand samples at all and p=1000/n where n is the number of samples for the current city. So what you want is a two pass query that counts the cities and then uses these counts to get probabilities. I am not up for typing that on a phone, but it should be straightforward. 

This same task can be done in a single pass by using what is called reservoir sampling. You can use two levels of reservoir sampling with a counter to bias the results but that will require a user defined aggregator that can work on two levels and I don't think that is possible/easy yet with drill. 

Sent from my iPhone

> On Nov 28, 2016, at 8:27, John Omernik <jo...@omernik.com> wrote:
> 
> Is there a way to grab a random return of data from Drill?
> 
> For example, let's say I have a table with 1 billion rows, and I want to
> return 100,000 at random based on a sampling of a specific column... is
> that possible?
> 
> Thanks
> 
> John