Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2022/06/04 16:30:11 UTC

[GitHub] [beam] damccorm opened a new issue, #20310: Keeping keys in a state for a very long time (keys expiry unknown)

damccorm opened a new issue, #20310:
URL: https://github.com/apache/beam/issues/20310

   I have a use case that I think would be a good addition to the pipeline patterns:
   
    
   Beam (Java SDK) reads two kinds of records from a data stream such as Kafka:
    
   1. Records of type A, containing a key and the corresponding metadata.
   2. Records of type B, containing the same key but no metadata.
    
   Beam then needs to fill in the metadata for type B records by looking it up under the keys received in type A records.
    
   The idea is to save the metadata (that is, the state) for keys received in type A records, and then look it up when type B records arrive.
   Beam's state API (the "@StateId" construct) can be used here. The problem is that we don't know when keys should expire. I don't think keeping the state in a global window is a good idea, as there could be many keys (possibly millions over time) to hold in state.
    
   One possible solution, suggested by Reza Ardeshir Rokni (rarokni@gmail.com):
    
   We can maintain the state in a large fixed window (1 day or so), so that garbage collection happens at the window boundary. After the window expires, save the metadata values in an external DB such as BigQuery. If a record with the same key arrives in a new window looking for this metadata, fetch the metadata for that key from the external DB and save it in the new window's state again.
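    
   The control flow of that suggestion can be sketched in plain Java. This is only a simulation of the idea, not Beam code: the class `WindowedMetadataCache`, its methods, and the `externalStore` map (standing in for an external DB such as BigQuery) are hypothetical names introduced for illustration, and the sketch assumes records arrive in timestamp order. In a real pipeline the per-key state would live in a stateful DoFn and the runner would handle window expiry.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Plain-Java simulation of window-bounded state with an external fallback (hypothetical names). */
class WindowedMetadataCache {
    private final long windowMillis;
    // Stand-in for an external DB such as BigQuery (assumption for this sketch).
    private final Map<String, String> externalStore;
    // State held only for the current fixed window.
    private final Map<String, String> windowState = new HashMap<>();
    private long currentWindowStart = Long.MIN_VALUE;
    // Counts fallbacks to the external store, so the behavior can be observed.
    int externalLookups = 0;

    WindowedMetadataCache(long windowMillis, Map<String, String> externalStore) {
        this.windowMillis = windowMillis;
        this.externalStore = externalStore;
    }

    // When a record falls into a new fixed window, persist the old window's
    // state to the external store and clear it (the "GC at window bound" step).
    private void rollWindowIfNeeded(long timestampMillis) {
        long start = Math.floorDiv(timestampMillis, windowMillis) * windowMillis;
        if (start != currentWindowStart) {
            externalStore.putAll(windowState);
            windowState.clear();
            currentWindowStart = start;
        }
    }

    /** Type A record: remember the metadata for this key in the current window. */
    void onTypeA(String key, String metadata, long timestampMillis) {
        rollWindowIfNeeded(timestampMillis);
        windowState.put(key, metadata);
    }

    /** Type B record: look up metadata in window state, else fall back to the external store. */
    Optional<String> onTypeB(String key, long timestampMillis) {
        rollWindowIfNeeded(timestampMillis);
        String metadata = windowState.get(key);
        if (metadata == null) {
            externalLookups++;
            metadata = externalStore.get(key);
            if (metadata != null) {
                // Re-populate the new window's state so later lookups stay local.
                windowState.put(key, metadata);
            }
        }
        return Optional.ofNullable(metadata);
    }
}
```

   With a one-day window, a type B record that arrives the day after its type A record triggers a single external lookup, and further type B records for that key in the same window are served from window state. The trade-off is one external read per key per window in exchange for bounded in-pipeline state.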
    
   Imported from Jira [BEAM-10019](https://issues.apache.org/jira/browse/BEAM-10019). Original Jira may contain additional context.
   Reported by: mohilkhare.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org