Posted to dev@parquet.apache.org by Jim Apple <jb...@cloudera.com> on 2017/08/15 03:19:38 UTC

Missing sync-up this week; Bloom filter status

I'll be missing the Google Meet sync-up this week, so I wanted to
share briefly where I think we are on PARQUET-41:

I think we have agreement on the form that the filters will take,
including the hash functions, but I believe we still don't have a
benchmark against dictionary filtering. Last I heard, that was blocked
by https://issues.apache.org/jira/browse/PARQUET-1061, though that
ticket is now marked Resolved.

Re: Missing sync-up this week; Bloom filter status

Posted by 俊杰陈 <cj...@gmail.com>.
We have run a 100 GB scale benchmark comparing the Bloom filter against the
dictionary filter. Please see the data here:
https://docs.google.com/spreadsheets/d/1OIB920l9U_aCGXVeVIUUDL2chjxss7CrSLy6lUHfPaE/edit?usp=sharing

During the benchmark we found that the dictionary filter works only in very
limited cases, because:
1.  When cardinality is large, dictionary encoding falls back to plain encoding.
2.  When cardinality is small, the values probably exist in most row groups,
so only a few row groups can be filtered.
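
The two cases above can be illustrated with a small toy model. This is only a
sketch, not Parquet code: ColumnChunk, write_chunk, can_skip, count_skipped and
MAX_DICT_ENTRIES are hypothetical names, and the entry-count threshold is an
assumed stand-in for whatever limit a real writer applies before it falls back
to plain encoding.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: a toy limit on distinct values; it stands in for the real
# writer's fallback condition, which this sketch does not model.
MAX_DICT_ENTRIES = 4


@dataclass
class ColumnChunk:
    values: list
    dictionary: Optional[set]  # None means the chunk fell back to plain encoding


def write_chunk(values, max_dict_entries=MAX_DICT_ENTRIES):
    """Build a chunk, dropping the dictionary when there are too many distinct values."""
    values = list(values)
    distinct = set(values)
    if len(distinct) > max_dict_entries:
        return ColumnChunk(values, None)
    return ColumnChunk(values, distinct)


def can_skip(chunk, wanted):
    """True only when a dictionary exists and proves `wanted` is absent."""
    return chunk.dictionary is not None and wanted not in chunk.dictionary


def count_skipped(chunks, wanted):
    """Number of chunks (one per row group) an equality predicate can skip."""
    return sum(can_skip(c, wanted) for c in chunks)


# Case 1: large cardinality, so every chunk falls back to plain encoding.
high = [write_chunk(range(i * 100, i * 100 + 100)) for i in range(3)]

# Case 2: small cardinality, so the same few values appear in every row group.
low = [write_chunk(["a", "b", "c"] * 10) for _ in range(3)]

skipped_high = count_skipped(high, 5)    # no dictionary to consult
skipped_low = count_skipped(low, "a")    # "a" is present in every row group
skipped_miss = count_skipped(low, "z")   # only an absent value lets chunks be skipped
```

In this model neither case skips any row group for a value that is actually
stored, which is the gap a Bloom filter is meant to cover: it can still rule
out a row group when no dictionary is available.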


Regarding enabling dictionary filtering (PARQUET-1061): I found that it is not
a bug on the Parquet side. Hive uses a deprecated API for its predicate
pushdown (PPD), which causes the problem. I filed HIVE-17261 and submitted a
patch.





-- 
Thanks & Best Regards