You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@flink.apache.org by Ovidiu-Cristian MARCU <ov...@inria.fr> on 2016/04/18 21:47:54 UTC

Hash tables - joins, cogroup, deltaIteration

Hi,

Can you please confirm if there is any update regarding the hash tables use cases, as in [1] it is specified that Hash tables are used in Joins and for the Solution set in iterations (pending work to use them for grouping/aggregations)?

I am interested in the pending work progress and also if you consider to add an implementation where Joins and Solution Set in delta iterations (and CoGroup) can rely on a hybrid implementation where the engine can use also disk if not enough memory available when working with these hash tables.
 
[1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=53741525 <https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=53741525> Memory Management (Batch API)

Thanks

Best,
Ovidiu

Re: Hash tables - joins, cogroup, deltaIteration

Posted by Fabian Hueske <fh...@gmail.com>.

Hi Ovidiu,

Hash tables are currently used for joins (inner & outer) and the solution
set of delta iterations.
There is a pending PR that implements a hash table for partial aggregations
(combiner) [1] which should be added soon.

Joins (inner & outer) are already implemented as Hybrid Hash joins that go
to disk if memory is insufficient.
The hash table used for solution sets does not support spilling. In case of
not enough memory, writing and reading data to / from disk in *each*
iteration would cause a significant performance hit, that possibly
outweighs the benefits of delta iterations. The recommended solution for
this is to add more machines / memory to the cluster.
I am not aware of anybody working on addressing the issue of non-spillable
hash tables of solution sets.

Best,
Fabian

[1] https://github.com/apache/flink/pull/1517

2016-04-18 21:47 GMT+02:00 Ovidiu-Cristian MARCU <
ovidiu-cristian.marcu@inria.fr>:

> Hi,
>
> Can you please confirm if there is any update regarding the hash tables
> use cases, as in [1] it is specified that *Hash tables are used in Joins
> and for the Solution set in iterations (pending work to use them for
> grouping/aggregations)?*
>
> I am interested in the pending work progress and also if you consider to
> add an implementation where Joins and Solution Set in delta iterations (and
> CoGroup) can rely on a hybrid implementation where the engine can use also
> disk if not enough memory available when working with these hash tables.
>
> [1]
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=53741525 Memory
> Management (Batch API)
>
> Thanks
>
> Best,
> Ovidiu
>