Posted to user@hbase.apache.org by Yi Jiang <yi...@ubisoft.com> on 2016/05/12 01:05:38 UTC

HBase scanning for a couple of terabytes of data

Hi, Guys
Recently we have been debating whether to use HBase as the destination for our data pipeline job.
Basically, we want to save our logs into HBase. Our pipeline generates 2-4 terabytes of data every day, but our IT department thinks it is not a good idea to scan that much data in HBase, as it will cause performance and memory issues. They have asked us to keep only about 15 minutes' worth of data in HBase for real-time analysis.
For now, I am using a Hive external table backed by HBase, but what I am wondering is: for the MapReduce job, what kind of input format does it use to scan the data from HBase? Is it TableInputFormatBase? And how many mappers will Hive use to scan HBase? Is it efficient or not? Will it cause performance issues if we have a couple of terabytes or more?
I am also trying to index some columns that we might use to query, but I am not sure it is a good idea to keep so much historical data in HBase for querying.
Thank you
Jacky
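
For illustration only, the 15-minute real-time read pattern described above might look roughly like the sketch below using the HBase Java client; the table name "logs" and the per-row handling are assumptions, not details from the original message.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;

public class RecentLogsScan {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("logs"))) { // "logs" is a hypothetical table name
      long now = System.currentTimeMillis();
      Scan scan = new Scan();
      scan.setTimeRange(now - 15 * 60 * 1000L, now); // only cells written in the last 15 minutes
      scan.setCaching(500);                          // fetch rows in larger batches per RPC
      scan.setCacheBlocks(false);                    // avoid filling the block cache with scan data
      try (ResultScanner scanner = table.getScanner(scan)) {
        for (Result row : scanner) {
          System.out.println(row);                   // placeholder for the real-time analysis step
        }
      }
    }
  }
}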


Re: HBase scanning for a couple of terabytes of data

Posted by Ted Yu <yu...@gmail.com>.
TableInputFormatBase is abstract.

Most likely you would use TableInputFormat for the scan.

See the javadoc of getSplits():

    Calculates the splits that will serve as input for the map tasks. The
    number of splits matches the number of regions in a table.


FYI
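
For reference, here is a minimal sketch (the table name "logs" and the do-nothing mapper are assumptions, not part of the thread) of how a scan is typically wired into a MapReduce job with TableMapReduceUtil; it configures TableInputFormat under the hood, so the job gets one map task per region of the table:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;

public class LogScanJob {
  // One map task is created per region; each map() call receives one row of the table.
  static class LogMapper extends TableMapper<ImmutableBytesWritable, NullWritable> {
    @Override
    protected void map(ImmutableBytesWritable rowKey, Result row, Context context)
        throws IOException, InterruptedException {
      context.write(rowKey, NullWritable.get()); // placeholder: just re-emit the row key
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "scan-logs");
    job.setJarByClass(LogScanJob.class);

    Scan scan = new Scan();
    scan.setCaching(500);        // larger batches per RPC, which matters for full scans
    scan.setCacheBlocks(false);  // do not pollute the block cache with one-off scan data

    // Wires up TableInputFormat; the number of splits equals the number
    // of regions in the "logs" table.
    TableMapReduceUtil.initTableMapperJob(
        "logs", scan, LogMapper.class,
        ImmutableBytesWritable.class, NullWritable.class, job);
    job.setNumReduceTasks(0); // map-only job for this sketch
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Since the number of map tasks tracks the number of regions, a table ingesting 2-4 TB per day will spawn correspondingly more mappers as it grows, rather than a fixed number.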
