Posted to user@hive.apache.org by Frank Luo <jl...@merkleinc.com> on 2016/01/29 01:46:52 UTC

bloom filter used in 0.14?

All,

I have a huge table from which I periodically want to select rows matching some particular value. For example, suppose I have a table for the entire world population, and I know that the person with id “1234” is a criminal, so I want to pull his information out of the table.

Without any optimization, I have to use thousands of mappers to find just one id, which is not ideal. I tried enabling a bloom filter on the column I want to search on, but a simple query shows that the amount of data read is the same as without a bloom filter. So I am questioning whether it is enabled on the version I am on, which is 0.14. Does anyone know? If a bloom filter is not the way to go, does anyone have suggestions?

Here is the hql:

create table test
(
  col1   STRING,
   col2   STRING
) STORED AS ORC
tblproperties ("orc.bloom.filter.columns"="col1");

select * from test where col1 = '1234';

Thx

Frank

This email and any attachments transmitted with it are intended for use by the intended recipient(s) only. If you have received this email in error, please notify the sender immediately and then delete it. If you are not the intended recipient, you must not keep, use, disclose, copy or distribute this email without the author’s prior permission. We take precautions to minimize the risk of transmitting software viruses, but we advise you to perform your own virus checks on any attachment to this message. We cannot accept liability for any loss or damage caused by software viruses. The information contained in this communication may be confidential and may be subject to the attorney-client privilege.

RE: bloom filter used in 0.14?

Posted by Mich Talebzadeh <mi...@peridale.co.uk>.
There are a number of other things you can do with Hive to improve your simple query.

 

First, it is best to upgrade to Hive 1.2.1 or above, which will require minimal effort; just ensure that you upgrade your metastore as well.

 

Second, your first column is unique (?) as it is an ID, so in all probability it is a monotonically increasing number. You can therefore deploy bucketing to spread col1 uniformly, since its selectivity is 1.

 

I just wrote a simple shell routine to create 100K rows in an ORC table, divided into 256 buckets. I am running Hive on the Spark engine.

 

CREATE TABLE txtest (
     ID string
   , DESCRIPTION string
)
CLUSTERED BY (ID) INTO 256 BUCKETS
STORED AS ORC
TBLPROPERTIES (
  "orc.create.index"="true",
  "orc.compress"="SNAPPY",
  "orc.stripe.size"="16777216",
  "orc.row.index.stride"="10000"
);

 

INFO  : Loading data to table test.txtest from hdfs://rhes564:9000/user/hive/warehouse/test.db/txtest/.hive-staging_hive_2016-01-29_16-10-04_765_4205127232246216560-13/-ext-10000

INFO  : Table test.txtest stats: [numFiles=256, numRows=100000, totalSize=5653369, rawDataSize=22200000]
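
The shell routine that generated and loaded the 100K rows is not shown in the thread; purely as a rough sketch (the staging table txtest_stage is a hypothetical stand-in for it), the bucketed table could be populated along these lines:

-- Hypothetical staging table standing in for the unshown shell routine's output.
CREATE TABLE txtest_stage (ID string, DESCRIPTION string)
STORED AS TEXTFILE;

-- On Hive 0.x/1.x, make sure the bucket definition is actually honoured on insert.
SET hive.enforce.bucketing = true;

-- Rewrite the staged rows into the 256-bucket ORC table.
INSERT OVERWRITE TABLE txtest
SELECT ID, DESCRIPTION
FROM txtest_stage;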

 

OK, let us just do a simple point query as you did:

 

0: jdbc:hive2://rhes564:10010/default> use test;
No rows affected (0.02 seconds)
0: jdbc:hive2://rhes564:10010/default> select * from txtest where id = '1234';
+------------+-----------------------------------------------------+--+
| txtest.id  |                 txtest.description                  |
+------------+-----------------------------------------------------+--+
| 1234       | R_MaH9oKm_OfLbe8LumdSXzBvDFJFrUIkvUoRm_iDPdeegPEWf  |
+------------+-----------------------------------------------------+--+
1 row selected (0.887 seconds)

 

HTH

 

Dr Mich Talebzadeh

 

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", ISBN 978-0-9563693-0-7. 

co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one out shortly

 

http://talebzadehmich.wordpress.com

 

NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only, if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Peridale Technology Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free, therefore neither Peridale Technology Ltd, its subsidiaries nor their employees accept any responsibility.

 

From: Frank Luo [mailto:jluo@merkleinc.com] 
Sent: 29 January 2016 00:47
To: user@hive.apache.org
Subject: bloom filter used in 0.14?

 

All, 

 

I have a huge table from which I periodically want to select rows matching some particular value. For example, suppose I have a table for the entire world population, and I know that the person with id “1234” is a criminal, so I want to pull his information out of the table.

 

Without any optimization, I have to use thousands of mappers to find just one id, which is not ideal. I tried enabling a bloom filter on the column I want to search on, but a simple query shows that the amount of data read is the same as without a bloom filter. So I am questioning whether it is enabled on the version I am on, which is 0.14. Does anyone know? If a bloom filter is not the way to go, does anyone have suggestions?

 

Here is the hql:

 

create table test
(
  col1   STRING,
   col2   STRING
) STORED AS ORC
tblproperties ("orc.bloom.filter.columns"="col1");

 

select * from test where col1 = '1234';

 

Thx

 

Frank



Re: bloom filter used in 0.14?

Posted by Prasanth Jayachandran <pj...@hortonworks.com>.
Hi

To get optimal performance from bloom filters, make sure the records are sorted on col1. Sorting on the column of interest will efficiently prune stripes and row groups. If the records that you are searching for are spread across row groups (10K rows by default) or stripes (64MB by default), then the ORC reader will have to read all or most of the row groups and stripes. Sorting helps cluster the records together, making pruning more effective.
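
As a rough sketch only (the table name test_sorted is an assumption, not something from this thread), the data could be rewritten sorted on col1 along these lines:

-- Hypothetical rewrite: physically sort the rows on col1 so that equal or
-- nearby keys end up in the same row groups and stripes.
CREATE TABLE test_sorted (
  col1 STRING,
  col2 STRING
) STORED AS ORC
TBLPROPERTIES ("orc.bloom.filter.columns"="col1");  -- bloom filters take effect on Hive 1.2.0+

INSERT OVERWRITE TABLE test_sorted
SELECT col1, col2
FROM test
DISTRIBUTE BY col1 SORT BY col1;  -- per-reducer sort; a global ORDER BY is not required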

Thanks
Prasanth

On Jan 28, 2016, at 6:46 PM, Frank Luo <jl...@merkleinc.com> wrote:

All,

I have a huge table from which I periodically want to select rows matching some particular value. For example, suppose I have a table for the entire world population, and I know that the person with id “1234” is a criminal, so I want to pull his information out of the table.

Without any optimization, I have to use thousands of mappers to find just one id, which is not ideal. I tried enabling a bloom filter on the column I want to search on, but a simple query shows that the amount of data read is the same as without a bloom filter. So I am questioning whether it is enabled on the version I am on, which is 0.14. Does anyone know? If a bloom filter is not the way to go, does anyone have suggestions?

Here is the hql:

create table test
(
  col1   STRING,
   col2   STRING
) STORED AS ORC
tblproperties ("orc.bloom.filter.columns"="col1");

select * from test where col1 = '1234';

Thx

Frank



RE: bloom filter used in 0.14?

Posted by Frank Luo <jl...@merkleinc.com>.
Thank you all for this discussion. Very helpful.

-----Original Message-----
From: Gopal Vijayaraghavan [mailto:gopal@hortonworks.com] On Behalf Of Gopal Vijayaraghavan
Sent: Thursday, January 28, 2016 7:43 PM
To: user@hive.apache.org
Subject: Re: bloom filter used in 0.14?

> So I am questioning whether it is enabled on the version I am on,
>which is 0.14. Does anyone know?

https://issues.apache.org/jira/browse/HIVE-9188 - fix-version (1.2.0)


The version you are using does not have bloom filter support.

It should be ignoring the parameter and not generating any bloom filter streams when writing.

hive --orcfiledump (in later versions) will print the BLOOM_FILTER as a column next to the row index streams.

> Without any optimization, I have to use thousands of mappers to find
>just one id.


Everything else you are doing is appropriate; however, be aware that the bloom filter index (& row index) is consulted only *after* a mapper starts up.

So it might still spin up a mapper, but it might exit immediately, which plays well into Tez container reuse for very busy clusters - in fact, it might be faster in a busy cluster than a completely idle one.

The sorted[1] min-max indicators suggested by Prasanth, however, are actually rolled up to the split level & can be used to prune splits before they are scheduled.

Cheers,
Gopal
[1] - only CLUSTER BY needed, not ORDER BY



Re: bloom filter used in 0.14?

Posted by Gopal Vijayaraghavan <go...@apache.org>.
> So I am questioning whether it is enabled on the version I am on, which
>is 0.14. Does anyone know?

https://issues.apache.org/jira/browse/HIVE-9188 - fix-version (1.2.0)


The version you are using does not have bloom filter support.

It should be ignoring the parameter and not generating any bloom filter streams when writing.

hive --orcfiledump (in later versions) will print the BLOOM_FILTER as a
column next to the row index streams.
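
For example (the path below is only a placeholder for one of the table's ORC files in HDFS):

hive --orcfiledump /user/hive/warehouse/test.db/test/000000_0

Later releases also accept a --rowindex <col_ids> option to dump the row-group indexes for the listed columns.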

> Without any optimization, I have to use thousands of mappers to find
>just one id.


Everything else you are doing is appropriate; however, be aware that the
bloom filter index (& row index) is consulted only *after* a mapper starts
up.

So it might still spin up a mapper, but it might exit immediately, which
plays well into Tez container reuse for very busy clusters - in fact, it
might be faster in a busy cluster than a completely idle one.

The sorted[1] min-max indicators suggested by Prasanth, however, are
actually rolled up to the split level & can be used to prune splits before
they are scheduled.

Cheers,
Gopal
[1] - only CLUSTER BY needed, not ORDER BY
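
Pulling the thread's suggestions together, here is a rough sketch only (the table name test_clustered and the 256-bucket count are assumptions echoing Mich's example, not something prescribed above) of how the original table might be laid out on Hive 1.2.0+ so that the bloom filter, sorting and bucketing can all help the point lookup:

-- Hypothetical layout combining the advice in this thread: bucketed and sorted
-- on col1, stored as ORC with a bloom filter on col1 (Hive 1.2.0+).
CREATE TABLE test_clustered (
  col1 STRING,
  col2 STRING
)
CLUSTERED BY (col1) SORTED BY (col1) INTO 256 BUCKETS
STORED AS ORC
TBLPROPERTIES ("orc.bloom.filter.columns"="col1");

-- Honour the bucket/sort spec during the rewrite (Hive 0.x/1.x settings).
SET hive.enforce.bucketing = true;
SET hive.enforce.sorting = true;

INSERT OVERWRITE TABLE test_clustered
SELECT col1, col2
FROM test;

-- The point lookup now only needs to touch the stripes and row groups that
-- can contain the key, rather than scanning the whole table.
SELECT * FROM test_clustered WHERE col1 = '1234';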