Posted to user@hbase.apache.org by "Liu, Ming (Ming)" <mi...@esgyn.cn> on 2018/04/12 16:16:07 UTC

how to get random rows from a big hbase table faster

Hi, all,

We have an HBase table with 1 billion rows, and we want to randomly get 1M rows from it. We are now trying RandomRowFilter, but it is still very slow. If I understand correctly, on the server side RandomRowFilter still needs to read all 1 billion rows and then randomly returns 1% of them. But reading 1 billion rows is very slow. Is this true?

So is there a better way to randomly get 1% of the rows from a given table? Any ideas would be much appreciated.
We don't know the distribution of the 1 billion rows in advance.

Thanks,
Ming

Re: how to get random rows from a big hbase table faster

Posted by "Proust (Feng Guizhou) [FDS Payment]" <pf...@coupang.com>.
The problem is essentially one of sampling. The short answer would be to use Spark's RDD.sample.


If RDD.sample is still too slow for your requirement, then reservoir sampling (https://en.wikipedia.org/wiki/Reservoir_sampling) may be the direction to investigate, but I am not sure there is an existing implementation yet.
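
For reference, here is a minimal sketch of the basic reservoir sampling idea (often called Algorithm R) in plain Python. It is not an HBase or Spark implementation; the function name and the use of a generic iterable as the "stream of rows" are assumptions for illustration only. Note that reservoir sampling still visits every item once, so by itself it would not avoid the full read; what it gives is a uniform sample of k items in fixed memory without knowing the row count in advance.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int,
                     rng: Optional[random.Random] = None) -> List[T]:
    """Return up to k items chosen uniformly at random from a stream.

    Hypothetical helper: 'stream' stands in for rows coming back from a scan.
    """
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir
```

For example, `reservoir_sample(rows, 1_000_000)` would keep at most one million rows in memory no matter how many rows `rows` yields.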


Re: how to get random rows from a big hbase table faster

Posted by Vladimir Rodionov <vl...@gmail.com>.
1% of 1B is 10M. 10M random reads are doable if:

a. the cluster is sufficiently large
b. it is equipped with SSDs
c. you run multiple clients in parallel to retrieve these rows

You need to know the min/max row keys of the table in advance.
Then generate a random start row, open a scanner with that start row, and
just read the first KV.
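
A rough sketch of the "generate a random start row" step, in plain Python with no HBase client involved. It assumes row keys are raw bytes of roughly fixed width and treats them as big-endian integers; the function name and the padding rule are assumptions, not anything from HBase itself. The returned key would then be used as the scanner's start row.

```python
import random
from typing import Optional


def random_row_key(min_key: bytes, max_key: bytes,
                   rng: Optional[random.Random] = None) -> bytes:
    """Pick a random byte string between min_key and max_key (inclusive).

    Assumption: keys are compared as unsigned bytes, left to right.
    Shorter keys are right-padded with 0x00, which is only an approximation
    when key lengths differ.
    """
    rng = rng or random.Random()
    width = max(len(min_key), len(max_key))
    # Same-width big-endian integers sort the same way as the byte strings.
    lo = int.from_bytes(min_key.ljust(width, b"\x00"), "big")
    hi = int.from_bytes(max_key.ljust(width, b"\x00"), "big")
    if lo > hi:
        raise ValueError("min_key must not sort after max_key")
    return rng.randint(lo, hi).to_bytes(width, "big")
```

One caveat: a random key usually does not exist in the table, so the scanner returns the next existing row. Rows that follow a large gap in the key space are therefore picked more often, so with an unknown key distribution the sample is not strictly uniform.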


Or, say, split the min/max row range into N consecutive sub-ranges (N is up to
you) and open N scanners with RandomRowFilter.

Again, you have to run N clients (or threads) to do this in parallel.
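
The split-and-parallelize idea could look roughly like this in plain Python. The names `split_key_range`, `scan_in_parallel` and the `scan_range` callback are hypothetical; `scan_range` stands in for whatever code opens a scanner with RandomRowFilter on one sub-range, which is not shown here. Keys are again assumed to be raw bytes treated as big-endian integers.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple

KeyRange = Tuple[bytes, bytes]


def split_key_range(min_key: bytes, max_key: bytes, n: int) -> List[KeyRange]:
    """Split [min_key, max_key) into n consecutive (start, stop) sub-ranges."""
    if n < 1:
        raise ValueError("n must be at least 1")
    width = max(len(min_key), len(max_key))
    lo = int.from_bytes(min_key.ljust(width, b"\x00"), "big")
    hi = int.from_bytes(max_key.ljust(width, b"\x00"), "big")
    if lo >= hi:
        raise ValueError("min_key must sort before max_key")
    # n + 1 evenly spaced boundaries; neighbours share a boundary key.
    bounds = [lo + (hi - lo) * i // n for i in range(n + 1)]
    keys = [b.to_bytes(width, "big") for b in bounds]
    return list(zip(keys[:-1], keys[1:]))


def scan_in_parallel(ranges: List[KeyRange],
                     scan_range: Callable[[bytes, bytes], list]) -> list:
    """Run scan_range(start, stop) on every sub-range, one thread per range."""
    with ThreadPoolExecutor(max_workers=len(ranges)) as pool:
        parts = pool.map(lambda r: scan_range(*r), ranges)
        # Results come back in the same order as the sub-ranges.
        return [row for part in parts for row in part]
```

The stop key of each sub-range is treated as exclusive here, so the scan over the last sub-range would need to include the max row explicitly. Equal-width key ranges do not mean equal row counts, so some workers may finish much earlier than others.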

-Vlad



