Posted to user@hive.apache.org by Yi Jiang <yi...@ubisoft.com> on 2016/05/12 01:06:44 UTC

Performance for Hive external to HBase with several terabytes or more of data

Hi, Guys
Recently we have been debating using HBase as the destination for our data pipeline jobs.
Basically, we want to save our logs into HBase, and our pipeline can generate 2-4 terabytes of data every day. But our IT department thinks it is not a good idea to scan that much data in HBase, since it will cause performance and memory issues, and they have asked us to keep only about 15 minutes' worth of data in HBase for real-time analysis.
For now, I am using a Hive external table over HBase, but I am wondering: for a MapReduce job, what kind of mapper does Hive use to scan the data from HBase? Is it TableInputFormatBase? And how many mappers will Hive use to scan HBase? Is it efficient? Will it cause performance issues once we have a couple of terabytes or more of data?
I am also trying to index some of the columns we might query on, but I am not sure it is a good idea to keep so much historical data in HBase for querying.
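For context, the external-table setup described above looks roughly like this in HiveQL (the table, column family, and column names here are made up for illustration):

```sql
-- Sketch of a Hive external table over an existing HBase table 'logs'
-- (hypothetical names). Hive reads it through the HBase storage handler,
-- whose input format builds on TableInputFormatBase; by default a scan
-- typically gets one mapper per HBase region, so a full-table query
-- touches every region server.
CREATE EXTERNAL TABLE hbase_logs (
  rowkey  STRING,
  ts      BIGINT,
  message STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,d:ts,d:message')
TBLPROPERTIES ('hbase.table.name' = 'logs');
```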
Thank you
Jacky


Re: Performance for Hive external to HBase with several terabytes or more of data

Posted by Yi Jiang <yi...@ubisoft.com>.
Hi Sathi,
Thank you for the answer, but we will be loading the data from HBase into Hive and letting MapReduce process it. I am not sure whether that is efficient for several terabytes of data.
Thanks
Jacky


On May 11, 2016, at 11:03 PM, Sathi Chowdhury <sa...@lithium.com> wrote:

Hi Yang,
Did you think of bulk loading option?

http://blog.cloudera.com/blog/2013/09/how-to-use-hbase-bulk-loading-and-why/
This may be a way to go.
Thanks
Sathi




Re: Performance for Hive external to HBase with several terabytes or more of data

Posted by Sathi Chowdhury <sa...@lithium.com>.
Hi Yang,
Did you think of bulk loading option?

http://blog.cloudera.com/blog/2013/09/how-to-use-hbase-bulk-loading-and-why/
This may be a way to go.
Thanks
Sathi




Re: Performance for Hive external to HBase with several terabytes or more of data

Posted by Yi Jiang <yi...@ubisoft.com>.
Hi Jörn,
Thank you for replying. We are currently exporting data from HBase to Hive, as I mentioned in the previous message. I am working at a big company; I personally like Tez, but it is not even on our roadmap.
Thank you


On May 12, 2016, at 1:52 AM, Jörn Franke <jo...@gmail.com> wrote:

Why don't you export the data from HBase to Hive, e.g. in ORC format? You should not use MapReduce with Hive, but Tez. Also use a recent Hive version (at least 1.2). You can then run your queries there. For large-scale log processing in real time, one alternative, depending on your needs, could be Solr on Hadoop.



Re: Performance for Hive external to HBase with several terabytes or more of data

Posted by Jörn Franke <jo...@gmail.com>.
Why don't you export the data from HBase to Hive, e.g. in ORC format? You should not use MapReduce with Hive, but Tez. Also use a recent Hive version (at least 1.2). You can then run your queries there. For large-scale log processing in real time, one alternative, depending on your needs, could be Solr on Hadoop.
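A minimal HiveQL sketch of that export, assuming a Hive external table named hbase_logs is already mapped over the HBase table (the table names and schema are made up for illustration):

```sql
-- Run Hive on Tez instead of MapReduce (requires Tez to be installed
-- and configured for the cluster).
SET hive.execution.engine=tez;

-- Materialize the HBase-backed data into a native ORC table once, then
-- point analytical queries at the ORC copy instead of scanning HBase.
CREATE TABLE logs_orc
STORED AS ORC
AS SELECT * FROM hbase_logs;
```

Queries against logs_orc then read columnar ORC files from HDFS and avoid the HBase region servers entirely.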
