
Posted to dev@crunch.apache.org by "Ryan Brush (JIRA)" <ji...@apache.org> on 2014/01/31 16:58:09 UTC

[jira] [Created] (CRUNCH-336) Optimized filters and joins via Parquet RecordFilters

Ryan Brush created CRUNCH-336:
---------------------------------

Summary: Optimized filters and joins via Parquet RecordFilters
Key: CRUNCH-336
URL: https://issues.apache.org/jira/browse/CRUNCH-336
Project: Crunch
Issue Type: Improvement
Reporter: Ryan Brush

Logging this to track some ideas from an offline discussion with [~jwills] and [~mkwhitacre]. There's an opportunity to significantly speed up a couple of access patterns:

1. Process only the subset of records in a Parquet file that is selected by the value of a single column.
2. Perform a bloom filter join between two datasets, where the join key is a Parquet column in the larger dataset.

Optimizing item 1 simply involves using a RecordFilter to narrow down the data loaded from the AvroParquetInputFormat.

Optimizing item 2 is more involved. In a nutshell, we discussed doing a bloom filter join, but using the bloom filter to implement the Parquet RecordFilter on the join column. In cases where we join on a column and select only a small subset of the larger dataset, this would skip the IO and deserialization cost for all records that don't match the join.
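
To make the idea concrete, here is a minimal, self-contained sketch in plain Java. It does not use the Parquet or Crunch APIs: BloomFilter, scan, and filteredJoin are hypothetical names, and two in-memory arrays stand in for a columnar file. The point it tries to show is only that the key column is tested against the bloom filter first, and the rest of the record is assembled only when the key might match. An exact check afterwards drops any bloom filter false positives.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

public class BloomColumnFilterSketch {

    /** Hypothetical minimal bloom filter over strings (not a library class). */
    static final class BloomFilter {
        private final BitSet bits;
        private final int numBits;
        private final int numHashes;

        BloomFilter(int numBits, int numHashes) {
            this.bits = new BitSet(numBits);
            this.numBits = numBits;
            this.numHashes = numHashes;
        }

        void add(String key) {
            for (int i = 0; i < numHashes; i++) {
                bits.set(index(key, i));
            }
        }

        /** May return true for keys never added, but never false for added keys. */
        boolean mightContain(String key) {
            for (int i = 0; i < numHashes; i++) {
                if (!bits.get(index(key, i))) {
                    return false;
                }
            }
            return true;
        }

        // Double hashing: derive the i-th bit position from two base hashes.
        private int index(String key, int i) {
            byte[] data = key.getBytes(StandardCharsets.UTF_8);
            long h1 = Arrays.hashCode(data);
            long h2 = fnv1a(data);
            return (int) Math.floorMod(h1 + i * h2, (long) numBits);
        }

        private static int fnv1a(byte[] data) {
            int h = 0x811c9dc5;
            for (byte b : data) {
                h ^= (b & 0xff);
                h *= 16777619;
            }
            return h;
        }
    }

    /**
     * Stand-in for a columnar scan: the key column is checked first, and the
     * payload column is read only for rows whose key passes the filter.
     * assembled[0] counts how many rows were actually materialized.
     */
    static List<String[]> scan(String[] keyColumn, String[] payloadColumn,
                               Predicate<String> keyFilter, int[] assembled) {
        List<String[]> out = new ArrayList<>();
        for (int i = 0; i < keyColumn.length; i++) {
            if (!keyFilter.test(keyColumn[i])) {
                continue; // payload for this row is never touched
            }
            assembled[0]++;
            out.add(new String[] {keyColumn[i], payloadColumn[i]});
        }
        return out;
    }

    /** Builds the bloom filter from the small side, scans, then removes false positives. */
    static List<String[]> filteredJoin(Set<String> smallKeys, String[] keyColumn,
                                       String[] payloadColumn, int[] assembled) {
        BloomFilter bloom = new BloomFilter(1 << 16, 3);
        for (String k : smallKeys) {
            bloom.add(k);
        }
        List<String[]> joined = new ArrayList<>();
        for (String[] row : scan(keyColumn, payloadColumn, bloom::mightContain, assembled)) {
            if (smallKeys.contains(row[0])) {
                joined.add(row);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        int n = 1000;
        String[] keys = new String[n];
        String[] payload = new String[n];
        for (int i = 0; i < n; i++) {
            keys[i] = "user-" + i;
            payload[i] = "payload-" + i;
        }
        Set<String> small = new HashSet<>(Arrays.asList("user-7", "user-512"));
        int[] assembled = {0};
        List<String[]> joined = filteredJoin(small, keys, payload, assembled);
        System.out.println("joined rows: " + joined.size());
        System.out.println("rows assembled: " + assembled[0] + " of " + n);
    }
}
```

In a real pipeline the filter would have to be serialized and handed to the input format alongside the join configuration, which is the part that is not obviously clean.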

It's not obvious to me how we'd achieve this cleanly, since it involves multiple pieces (configuring inputs in conjunction with a specific join strategy). In many cases the bloom filter join alone will achieve sufficient performance, but I'm logging this potential optimization for reference.
