You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2022/06/04 16:22:42 UTC

[GitHub] [beam] damccorm opened a new issue, #20247: Data loaded using ReadAllFromText is not projected properly as side input

damccorm opened a new issue, #20247:
URL: https://github.com/apache/beam/issues/20247

   *Context:*
   
   Data Enrichment Pattern with 2 Sources. 
   
   Source 1:  Google Cloud PubSub delivering JSON messages
   
   Source 2:  Google Cloud Storage Files (acts as a Side Input) 
   
   *Steps:* 
   
   1. Load the data from GCS (Google Cloud Storage) using ReadAllFromText based on file path inside a PCollection and convert each record as a tuple with 2 fields
   
   2. Project the Tuple loaded in step 1 as a side input using Pvalue.ASDict to the main input that is being loaded from PubSub.
   
   3. Expectation is that the side inputs should be available but for some reason AsDict is not containing all the data that was loaded from GCS
   
    
   
   *Possible Issues:*
   
         Below are few possible issues that can be ruled out as I already validated them.
    - Window Mismatches -  Main Input window is within the scope of Side Input window.
    - Delay in updating Side Input State -  Pipeline has just 1 VM and Side Input has only 8 json messages and total size of side input is around 40 KB
   
    
   
   *Troubleshooting:*
    - Working fine in a DirectRunner   
    - Validated that ReadAllFromText transform is loading all the data properly from multiple files with subsequent transform building KV as well. However when the output of KV transform is used as a sideinput using ASDict() for some reason certain elements are skipped and not available inside look up.
   
   *Code* 
   
   [Example Code](https://github.com/GoogleCloudPlatform/professional-services/blob/master/examples/dataflow-python-examples/streaming-examples/slowlychanging-sideinput/sideinput_refresh/main.py#L234) for reference
   
    
   
    
   
    
   
   Imported from Jira [BEAM-10148](https://issues.apache.org/jira/browse/BEAM-10148). Original Jira may contain additional context.
   Reported by: prathapreddy22.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org