Posted to user@spark.apache.org by venkatesh b <ve...@gmail.com> on 2015/08/06 14:54:22 UTC

Is it worth storing in ORC for a one-time read, and can Hive be replaced with HBase?

Hi, I have two things I would like to know.
FIRST:
In our project we use Hive. We get new data daily, and we need to process
this new data only once and send the processed result to an RDBMS. The
processing mainly consists of complex queries with joins, WHERE conditions,
and grouping functions, and around 50 intermediate tables are generated
along the way. Until now we have stored everything in text format. We
recently came across the ORC file format. Since each table is only queried
once, is it worth storing the data as ORC?
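
(For illustration, a rough sketch of what storing one of our intermediate
tables as ORC instead of text could look like; the table and column names
below are made up.)

  -- Hypothetical intermediate table as we create it today, stored as text:
  CREATE TABLE stage_orders_txt (order_id INT, amount DECIMAL(10,2))
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE;

  -- The same intermediate table written as ORC instead:
  CREATE TABLE stage_orders_orc STORED AS ORC
    AS SELECT order_id, amount FROM stage_orders_txt;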

SECOND:
I have read that HBase is faster. Can I replace Hive with HBase to make the
daily processing faster? Currently it takes about 15 hours per day with
Hive.


Please inform me if any other information is needed.

Thanks & regards
Venkatesh

Re: Is it worth storing in ORC for a one-time read, and can Hive be replaced with HBase?

Posted by venkatesh b <ve...@gmail.com>.
I'm really sorry, I posted this to the Spark mailing list by mistake.

Jörn Franke, thanks for your reply.
I have many joins and many complex queries, and they are all full table
scans, so I think HBase will not work for me.


Re: Is it worth storing in ORC for a one-time read, and can Hive be replaced with HBase?

Posted by Jörn Franke <jo...@gmail.com>.
Additionally, it is of key importance to use the right data types for the
columns: use INT for ids, and INT, DECIMAL, FLOAT, or DOUBLE for numeric
values. A bad data model that uses VARCHAR or STRING where it is not
appropriate is a significant bottleneck.
Furthermore, include the partition columns in the join conditions (not in
the WHERE clause); otherwise you do a full table scan that ignores the
partitions.
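
(A minimal sketch of both points, with made-up table, column, and partition
names; 'customers' is a hypothetical dimension table.)

  -- Ids and amounts typed as INT/DECIMAL rather than STRING:
  CREATE TABLE orders (order_id INT, customer_id INT, amount DECIMAL(10,2))
    PARTITIONED BY (load_date STRING)
    STORED AS ORC;

  -- Put the partition column in the join condition so only the relevant
  -- partition is scanned instead of the full table:
  SELECT o.order_id, c.name
  FROM orders o
  JOIN customers c
    ON o.customer_id = c.customer_id
   AND o.load_date = '2015-08-06';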


Re: Is it worth storing in ORC for a one-time read, and can Hive be replaced with HBase?

Posted by Jörn Franke <jo...@gmail.com>.
Yes, you should use ORC; it is much faster and more compact. Additionally,
you can apply compression (Snappy) to increase performance. Your data
processing pipeline does not seem to be very optimized. You should use the
newest Hive version and enable storage indexes and bloom filters on
appropriate columns. Ideally, you should insert the data sorted
appropriately. Partitioning and setting the execution engine to Tez are
also beneficial.
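
(A minimal sketch of these settings in HiveQL, just to illustrate; the
table, columns, and partition value below are made up, and the exact
options depend on your Hive version.)

  -- Hypothetical ORC table with Snappy compression and a bloom filter
  -- on a column that is frequently joined or filtered on:
  CREATE TABLE sales_orc (customer_id INT, amount DECIMAL(10,2))
    PARTITIONED BY (load_date STRING)
    STORED AS ORC
    TBLPROPERTIES ('orc.compress' = 'SNAPPY',
                   'orc.bloom.filter.columns' = 'customer_id');

  -- Run on Tez and let Hive use the ORC min/max indexes to skip rows:
  SET hive.execution.engine=tez;
  SET hive.optimize.index.filter=true;

  -- Load the daily data sorted on the join/filter column so stripes can
  -- be skipped more effectively:
  INSERT INTO TABLE sales_orc PARTITION (load_date = '2015-08-06')
  SELECT customer_id, amount
  FROM sales_staging
  SORT BY customer_id;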

HBase with Phoenix should currently only be used if you have few joins,
not very complex queries, and not many full table scans.
