Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2023/01/12 08:28:03 UTC
[GitHub] [iceberg] xwmr-max commented on pull request #6440: Flink: Support Look-up Function

xwmr-max commented on PR #6440:
URL: https://github.com/apache/iceberg/pull/6440#issuecomment-1379970862

   > lookup function is for lookup join in Flink [1]. I have the same question as @zinking . normally lookup functions fit better for point query storage systems (like JDBC).
   > 
   > let's discuss the two scenarios separately
   > 
   > * small Iceberg table that can fit into memory comfortably using caching. In this case, cache should always be enabled. I don't see a reason where cache should be disabled. Also if a taskmanager has 8 slots, does lookup function cache 1 or 8 copies of reference data set?
   > * large Iceberg table. would FLIP-204 [2] help?
   > 
   > [1] https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sql/queries/joins/#lookup-join
   > [2] https://cwiki.apache.org/confluence/display/FLINK/FLIP-204%3A+Introduce+Hash+Lookup+Join
   
   Hi stevenzwu. Thank you for your review. Let's discuss the two scenarios you raised separately.
   
   - As you said, a small Iceberg table can easily be loaded into memory through the cache, and queries against it are very fast, so in that case the cache could always be enabled. However, there are a few things to consider. First, our solution provides _lookup-join-cache-size_ and _lookup-join-cache-ttl_ to control the cache size and the expiration time respectively, so that the cache size can be set according to actual conditions and the queried data can be kept up to date. Second, this scheme improves query efficiency by storing data with the same primary key in the cache. If the cache does not contain data for a given primary key, the latest data is loaded from the table. In addition, if a taskmanager has 8 slots, the lookup function needs to cache a copy of the data set. The lookup function is just a basic capability that can be built on in the future to improve performance, for example with secondary indexes. At present Iceberg does not support this basic function, which can satisfy the requirements of many scenarios.
   - FLIP-204 [2] only raises the cache hit ratio: the user can use a hint to enable partitioned lookup join, which forces the input of the lookup join to be hash-shuffled by the lookup keys. This can indeed relieve pressure on the cache, but it does not work well for Iceberg tables with larger data volumes. With the basic lookup function in place, though, we can apply it in the future.
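
To make the cache behaviour described above concrete, here is a minimal sketch in Python. It is not the code from this PR. The class `LookupCache`, the `loader` callback, and the least-recently-used eviction policy are all assumptions made for illustration; only the two ideas it models (a size limit like _lookup-join-cache-size_ and an expiration time like _lookup-join-cache-ttl_, with a reload from the table on a miss) come from the discussion.

```python
import time
from collections import OrderedDict
from typing import Callable, Hashable, List, Tuple


class LookupCache:
    """Hypothetical size- and TTL-bounded cache keyed by the lookup (primary) key."""

    def __init__(
        self,
        max_size: int,          # plays the role of lookup-join-cache-size
        ttl_seconds: float,     # plays the role of lookup-join-cache-ttl
        loader: Callable[[Hashable], List[tuple]],  # stand-in for a table read
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max_size = max_size
        self._ttl = ttl_seconds
        self._loader = loader
        self._clock = clock
        # key -> (expiry timestamp, rows); insertion order tracks recency
        self._entries: "OrderedDict[Hashable, Tuple[float, List[tuple]]]" = OrderedDict()
        self.loads = 0  # number of times the loader was called

    def get(self, key: Hashable) -> List[tuple]:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            # Hit: the entry exists and has not expired yet.
            self._entries.move_to_end(key)
            return entry[1]
        # Miss or expired entry: load the latest rows from the table.
        rows = self._loader(key)
        self.loads += 1
        self._entries[key] = (now + self._ttl, rows)
        self._entries.move_to_end(key)
        # Assumed policy: evict least recently used entries beyond the limit.
        while len(self._entries) > self._max_size:
            self._entries.popitem(last=False)
        return rows


if __name__ == "__main__":
    # A dict stands in for the Iceberg table: primary key -> matching rows.
    table = {1: [(1, "alice")], 2: [(2, "bob")]}
    cache = LookupCache(max_size=1000, ttl_seconds=60.0,
                        loader=lambda k: table.get(k, []))
    print(cache.get(1))   # first call loads from the table
    print(cache.get(1))   # second call is served from the cache
    print(cache.loads)
```

Each parallel instance of the lookup function would hold its own `LookupCache` in this sketch, which is why hash-shuffling the join input by lookup key, as FLIP-204 proposes, raises the hit ratio: every instance then only sees a subset of the keys.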
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

