Posted to dev@flink.apache.org by "Li, Chengxiang" <ch...@intel.com> on 2015/06/18 11:37:19 UTC

Use bloom filter to improve hybrid hash join performance

Hi, Flink developers

I read the Flink hybrid hash join documentation and implementation, very nice job. For the case where the small table does not fit entirely into memory, I think we may be able to improve performance further. Currently in hybrid hash join, when the small table does not fit into memory, part of the small-table data is spilled to disk, and the counterpart partitions of the big table are spilled to disk in the probe phase as well. You can find a detailed description here: http://flink.apache.org/news/2015/03/13/peeking-into-Apache-Flinks-Engine-Room.html . If we build a Bloom filter while spilling the small table to disk during the build phase, and use it to filter out the big-table records that would otherwise be spilled to disk, this could greatly reduce the size of the spilled big-table files and save the disk I/O cost of writing and later reading them. I have created FLINK-2240 <https://issues.apache.org/jira/browse/FLINK-2240> about this; I would like to contribute this optimization if someone can assign the JIRA to me. But before that, I would like to hear your comments on it.

Thanks
Chengxiang
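
The proposed probe-side filtering could be sketched roughly as follows. All class and method names here are hypothetical illustrations, not Flink's actual API. The key property is that a Bloom filter admits false positives but never false negatives, so no probe record that could match a build-side key is ever dropped; only records that definitely cannot match skip the spill:

```java
import java.util.BitSet;

// Sketch (not Flink code): while a build-side partition is spilled, its
// keys are summarized in a Bloom filter; in the probe phase, records
// hashing to that spilled partition are written to disk only if the
// filter says a build-side match is possible.
public class ProbeSideFilter {
    private static final int NUM_HASHES = 3;

    private final BitSet bits;
    private final int numBits;

    public ProbeSideFilter(int numBits) {
        this.numBits = numBits;
        this.bits = new BitSet(numBits);
    }

    // Called while the build partition is being spilled: record each key.
    public void addBuildKey(Object key) {
        int h = key.hashCode();
        for (int i = 0; i < NUM_HASHES; i++) {
            bits.set(position(h, i));
        }
    }

    // Called in the probe phase for records of a spilled partition:
    // only keys that might match on the build side need to be spilled.
    public boolean probeRecordNeedsSpill(Object key) {
        int h = key.hashCode();
        for (int i = 0; i < NUM_HASHES; i++) {
            if (!bits.get(position(h, i))) {
                return false; // definitely no build-side match: skip the disk IO
            }
        }
        return true; // possible match: spill for the disk-based join pass
    }

    // Derive the i-th bit position from one 32-bit hash via double
    // hashing (h1 + i * h2), avoiding i independent hash functions.
    private int position(int hash, int i) {
        int h2 = (hash >>> 16) | (hash << 16);
        return Math.floorMod(hash + i * h2, numBits);
    }
}
```

The saving comes from every probe record that returns false here: it is neither written to nor later read from disk.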

Re: Use bloom filter to improve hybrid hash join performance

Posted by Stephan Ewen <se...@apache.org>.
Hi!

That is a very nice idea and a well-proven optimization to the hybrid hash
join. It would be great if you could contribute that.

The memory allocated for the hash buckets (holding hash codes and pointers)
is currently wasted for those buckets whose partition is spilled. Putting
the bloom filter in there would be a good fit, since we would not need any
extra memory, but would just use the otherwise bare bucket memory in a
useful way.
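
Reusing that fixed-size bucket memory could look roughly like the sketch below: a Bloom filter laid out directly over a pre-allocated byte segment, so the filter needs no allocation beyond the memory the spilled partition's buckets already own. All names are hypothetical, not Flink's actual internals, and the k bit positions are derived from one hash via double hashing:

```java
// Sketch (not Flink code): a minimal Bloom filter over a pre-allocated,
// fixed-size byte segment, standing in for the bucket memory of a
// spilled partition. The filter writes only into the given segment.
public class SegmentBloomFilter {
    private final byte[] segment; // pre-allocated memory, reused as bit array
    private final int numBits;
    private final int numHashes;

    public SegmentBloomFilter(byte[] segment, int numHashes) {
        this.segment = segment;
        this.numBits = segment.length * 8;
        this.numHashes = numHashes;
    }

    public void add(int hash) {
        for (int i = 0; i < numHashes; i++) {
            int bit = position(hash, i);
            segment[bit >>> 3] |= (byte) (1 << (bit & 7));
        }
    }

    // May return true for a key that was never added (false positive),
    // but never returns false for a key that was added.
    public boolean mightContain(int hash) {
        for (int i = 0; i < numHashes; i++) {
            int bit = position(hash, i);
            if ((segment[bit >>> 3] & (1 << (bit & 7))) == 0) {
                return false;
            }
        }
        return true;
    }

    // Double hashing: the i-th probe position is (h1 + i * h2) mod numBits.
    private int position(int hash, int i) {
        int h2 = (hash >>> 16) | (hash << 16);
        return Math.floorMod(hash + i * h2, numBits);
    }
}
```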

The JIRA seems already assigned.

Let me know if you have more questions!

Greetings,
Stephan




On Thu, Jun 18, 2015 at 2:37 AM, Li, Chengxiang <ch...@intel.com>
wrote:

> Hi, Flink developers
>
> I read the Flink hybrid hash join documentation and implementation, very
> nice job. For the case where the small table does not fit entirely into
> memory, I think we may be able to improve performance further. Currently
> in hybrid hash join, when the small table does not fit into memory, part
> of the small-table data is spilled to disk, and the counterpart
> partitions of the big table are spilled to disk in the probe phase as
> well. You can find a detailed description here:
> http://flink.apache.org/news/2015/03/13/peeking-into-Apache-Flinks-Engine-Room.html
> . If we build a Bloom filter while spilling the small table to disk
> during the build phase, and use it to filter out the big-table records
> that would otherwise be spilled to disk, this could greatly reduce the
> size of the spilled big-table files and save the disk I/O cost of writing
> and later reading them. I have created FLINK-2240 <
> https://issues.apache.org/jira/browse/FLINK-2240> about this; I would
> like to contribute this optimization if someone can assign the JIRA to
> me. But before that, I would like to hear your comments on it.
>
> Thanks
> Chengxiang
>