You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@ignite.apache.org by Stéphane Thibaud <sn...@gmail.com> on 2019/05/19 11:59:54 UTC

Recommended way of random sampling

Hello Ignite users,

I am considering to sample randomly on large amounts of data, but I was
wondering what would be the most efficient way for this. Right now, I think
I might need cluster-based randomness using a MOD function as described
here: https://www.alandix.com/academic/topics/random/sampling-SQL.html

I currently have a UUID column (uuid4), which I think can be used for it,
but I might need some bit manipulation to get the non-random parts out of
the UUID.
Do you think this is indeed the most straightforward way to do it?


Kind regards,

Stéphane Thibaud

Re: Recommended way of random sampling

Posted by Stéphane Thibaud <sn...@gmail.com>.

Excuse me, I just sent my response, but I see that you actually suggested a
new column... in that case this would work, but I think it's a bit
unfortunate to have to store random numbers.


Kind regards,

Stéphane Thibaud

2019年5月20日(月) 19:22 Stéphane Thibaud <sn...@gmail.com>:

> Hello Ilya,
>
> Thank you for that suggestion. On a traditional database I know that
> approach does not scale well, since a random number is first assigned to
> all rows (it scales linearly with the number of rows if I am not mistaken).
> Do you think this would be different for Ignite?
>
>
> Kind regards,
>
> Stéphane Thibaud
>
> 2019年5月20日(月) 15:53 Ilya Kasnacheev <il...@gmail.com>:
>
>> Hello!
>>
>> You can have a random indexed field in your table and do queries like
>> SELECT * FROM table WHERE rand_field < RAND() LIMIT 1; to sample random
>> item.
>>
>> Regards,
>> --
>> Ilya Kasnacheev
>>
>>
>> пн, 20 мая 2019 г. в 04:50, Stéphane Thibaud <sn...@gmail.com>:
>>
>>> As a small addition: it would really help if Ignite had a hashing
>>> function for this, but I only see AES encryption.
>>>
>>>
>>> Kind regards,
>>>
>>> Stéphane Thibaud
>>>
>>> 2019年5月19日(日) 20:59 Stéphane Thibaud <sn...@gmail.com>:
>>>
>>>> Hello Ignite users,
>>>>
>>>> I am considering to sample randomly on large amounts of data, but I was
>>>> wondering what would be the most efficient way for this. Right now, I think
>>>> I might need cluster-based randomness using a MOD function as described
>>>> here: https://www.alandix.com/academic/topics/random/sampling-SQL.html
>>>>
>>>> I currently have a UUID column (uuid4), which I think can be used for
>>>> it, but I might need some bit manipulation to get the non-random parts out
>>>> of the UUID.
>>>> Do you think this is indeed the most straightforward way to do it?
>>>>
>>>>
>>>> Kind regards,
>>>>
>>>> Stéphane Thibaud
>>>>
>>>

Re: Recommended way of random sampling

Posted by Stéphane Thibaud <sn...@gmail.com>.

Hello Ilya,

Yes, you are right. I sent my first response too quickly. With an extra
column it will work.
I will go with the approach you suggest. :-)


Kind regards,

Stéphane Thibaud

2019年5月20日(月) 20:26 Ilya Kasnacheev <il...@gmail.com>:

> Hello!
>
> I'm not sure why you think it will not scale. If this field is indexed
> then taking a random sample is basically one b-tree walk away.
>
> I guess you will have to store random numbers, if you rely on non-random
> field it might introduce bias.
>
> Regards,
> --
> Ilya Kasnacheev
>
>
> пн, 20 мая 2019 г. в 13:28, Stéphane Thibaud <sn...@gmail.com>:
>
>> Hello Ilya,
>>
>> Thank you for that suggestion. On a traditional database I know that
>> approach does not scale well, since a random number is first assigned to
>> all rows (it scales linearly with the number of rows if I am not mistaken).
>> Do you think this would be different for Ignite?
>>
>>
>> Kind regards,
>>
>> Stéphane Thibaud
>>
>> 2019年5月20日(月) 15:53 Ilya Kasnacheev <il...@gmail.com>:
>>
>>> Hello!
>>>
>>> You can have a random indexed field in your table and do queries like
>>> SELECT * FROM table WHERE rand_field < RAND() LIMIT 1; to sample random
>>> item.
>>>
>>> Regards,
>>> --
>>> Ilya Kasnacheev
>>>
>>>
>>> пн, 20 мая 2019 г. в 04:50, Stéphane Thibaud <sn...@gmail.com>:
>>>
>>>> As a small addition: it would really help if Ignite had a hashing
>>>> function for this, but I only see AES encryption.
>>>>
>>>>
>>>> Kind regards,
>>>>
>>>> Stéphane Thibaud
>>>>
>>>> 2019年5月19日(日) 20:59 Stéphane Thibaud <sn...@gmail.com>:
>>>>
>>>>> Hello Ignite users,
>>>>>
>>>>> I am considering to sample randomly on large amounts of data, but I
>>>>> was wondering what would be the most efficient way for this. Right now, I
>>>>> think I might need cluster-based randomness using a MOD function as
>>>>> described here:
>>>>> https://www.alandix.com/academic/topics/random/sampling-SQL.html
>>>>>
>>>>> I currently have a UUID column (uuid4), which I think can be used for
>>>>> it, but I might need some bit manipulation to get the non-random parts out
>>>>> of the UUID.
>>>>> Do you think this is indeed the most straightforward way to do it?
>>>>>
>>>>>
>>>>> Kind regards,
>>>>>
>>>>> Stéphane Thibaud
>>>>>
>>>>

Re: Recommended way of random sampling

Posted by Ilya Kasnacheev <il...@gmail.com>.

Hello!

I'm not sure why you think it will not scale. If this field is indexed then
taking a random sample is basically one b-tree walk away.

I guess you will have to store random numbers, if you rely on non-random
field it might introduce bias.

Regards,
-- 
Ilya Kasnacheev


пн, 20 мая 2019 г. в 13:28, Stéphane Thibaud <sn...@gmail.com>:

> Hello Ilya,
>
> Thank you for that suggestion. On a traditional database I know that
> approach does not scale well, since a random number is first assigned to
> all rows (it scales linearly with the number of rows if I am not mistaken).
> Do you think this would be different for Ignite?
>
>
> Kind regards,
>
> Stéphane Thibaud
>
> 2019年5月20日(月) 15:53 Ilya Kasnacheev <il...@gmail.com>:
>
>> Hello!
>>
>> You can have a random indexed field in your table and do queries like
>> SELECT * FROM table WHERE rand_field < RAND() LIMIT 1; to sample random
>> item.
>>
>> Regards,
>> --
>> Ilya Kasnacheev
>>
>>
>> пн, 20 мая 2019 г. в 04:50, Stéphane Thibaud <sn...@gmail.com>:
>>
>>> As a small addition: it would really help if Ignite had a hashing
>>> function for this, but I only see AES encryption.
>>>
>>>
>>> Kind regards,
>>>
>>> Stéphane Thibaud
>>>
>>> 2019年5月19日(日) 20:59 Stéphane Thibaud <sn...@gmail.com>:
>>>
>>>> Hello Ignite users,
>>>>
>>>> I am considering to sample randomly on large amounts of data, but I was
>>>> wondering what would be the most efficient way for this. Right now, I think
>>>> I might need cluster-based randomness using a MOD function as described
>>>> here: https://www.alandix.com/academic/topics/random/sampling-SQL.html
>>>>
>>>> I currently have a UUID column (uuid4), which I think can be used for
>>>> it, but I might need some bit manipulation to get the non-random parts out
>>>> of the UUID.
>>>> Do you think this is indeed the most straightforward way to do it?
>>>>
>>>>
>>>> Kind regards,
>>>>
>>>> Stéphane Thibaud
>>>>
>>>

Re: Recommended way of random sampling

Posted by Stéphane Thibaud <sn...@gmail.com>.

Hello Ilya,

Thank you for that suggestion. On a traditional database I know that
approach does not scale well, since a random number is first assigned to
all rows (it scales linearly with the number of rows if I am not mistaken).
Do you think this would be different for Ignite?


Kind regards,

Stéphane Thibaud

2019年5月20日(月) 15:53 Ilya Kasnacheev <il...@gmail.com>:

> Hello!
>
> You can have a random indexed field in your table and do queries like
> SELECT * FROM table WHERE rand_field < RAND() LIMIT 1; to sample random
> item.
>
> Regards,
> --
> Ilya Kasnacheev
>
>
> пн, 20 мая 2019 г. в 04:50, Stéphane Thibaud <sn...@gmail.com>:
>
>> As a small addition: it would really help if Ignite had a hashing
>> function for this, but I only see AES encryption.
>>
>>
>> Kind regards,
>>
>> Stéphane Thibaud
>>
>> 2019年5月19日(日) 20:59 Stéphane Thibaud <sn...@gmail.com>:
>>
>>> Hello Ignite users,
>>>
>>> I am considering to sample randomly on large amounts of data, but I was
>>> wondering what would be the most efficient way for this. Right now, I think
>>> I might need cluster-based randomness using a MOD function as described
>>> here: https://www.alandix.com/academic/topics/random/sampling-SQL.html
>>>
>>> I currently have a UUID column (uuid4), which I think can be used for
>>> it, but I might need some bit manipulation to get the non-random parts out
>>> of the UUID.
>>> Do you think this is indeed the most straightforward way to do it?
>>>
>>>
>>> Kind regards,
>>>
>>> Stéphane Thibaud
>>>
>>

Re: Recommended way of random sampling

Posted by Ilya Kasnacheev <il...@gmail.com>.

Hello!

You can have a random indexed field in your table and do queries like
SELECT * FROM table WHERE rand_field < RAND() LIMIT 1; to sample random
item.

Regards,
-- 
Ilya Kasnacheev


пн, 20 мая 2019 г. в 04:50, Stéphane Thibaud <sn...@gmail.com>:

> As a small addition: it would really help if Ignite had a hashing function
> for this, but I only see AES encryption.
>
>
> Kind regards,
>
> Stéphane Thibaud
>
> 2019年5月19日(日) 20:59 Stéphane Thibaud <sn...@gmail.com>:
>
>> Hello Ignite users,
>>
>> I am considering to sample randomly on large amounts of data, but I was
>> wondering what would be the most efficient way for this. Right now, I think
>> I might need cluster-based randomness using a MOD function as described
>> here: https://www.alandix.com/academic/topics/random/sampling-SQL.html
>>
>> I currently have a UUID column (uuid4), which I think can be used for it,
>> but I might need some bit manipulation to get the non-random parts out of
>> the UUID.
>> Do you think this is indeed the most straightforward way to do it?
>>
>>
>> Kind regards,
>>
>> Stéphane Thibaud
>>
>

Re: Recommended way of random sampling

Posted by Stéphane Thibaud <sn...@gmail.com>.

As a small addition: it would really help if Ignite had a hashing function
for this, but I only see AES encryption.


Kind regards,

Stéphane Thibaud

2019年5月19日(日) 20:59 Stéphane Thibaud <sn...@gmail.com>:

> Hello Ignite users,
>
> I am considering to sample randomly on large amounts of data, but I was
> wondering what would be the most efficient way for this. Right now, I think
> I might need cluster-based randomness using a MOD function as described
> here: https://www.alandix.com/academic/topics/random/sampling-SQL.html
>
> I currently have a UUID column (uuid4), which I think can be used for it,
> but I might need some bit manipulation to get the non-random parts out of
> the UUID.
> Do you think this is indeed the most straightforward way to do it?
>
>
> Kind regards,
>
> Stéphane Thibaud
>