You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@beam.apache.org by "Jeff Webb (Jira)" <ji...@apache.org> on 2021/09/14 23:08:00 UTC

[jira] [Updated] (BEAM-10148) Data loaded using ReadAllFromText is not projected properly as side input

     [ https://issues.apache.org/jira/browse/BEAM-10148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Webb updated BEAM-10148:
-----------------------------
    Status: Open  (was: Triage Needed)

> Data loaded using ReadAllFromText is not projected properly as side input
> -------------------------------------------------------------------------
>
>                 Key: BEAM-10148
>                 URL: https://issues.apache.org/jira/browse/BEAM-10148
>             Project: Beam
>          Issue Type: Bug
>          Components: io-py-files
>    Affects Versions: 2.19.0
>         Environment: Runner: Dataflow Runner
> Beam SDK:  Python 2.19
>            Reporter: Prathap Kumar Parvathareddy
>            Priority: P3
>
> *Context:*
> Data Enrichment Pattern with 2 Sources. 
> Source 1:  Google Cloud PubSub delivering JSON messages
> Source 2:  Google Cloud Storage Files (acts as a Side Input) 
> *Steps:* 
> 1. Load the data from GCS (Google Cloud Storage) using ReadAllFromText based on file path inside a PCollection and convert each record as a tuple with 2 fields
> 2. Project the Tuple loaded in step 1 as a side input using Pvalue.ASDict to the main input that is being loaded from PubSub.
> 3. Expectation is that the side inputs should be available but for some reason AsDict is not containing all the data that was loaded from GCS
>  
> *Possible Issues:*
>       Below are few possible issues that can be ruled out as I already validated them.
>  # Window Mismatches -  Main Input window is within the scope of Side Input window.
>  # Delay in updating Side Input State -  Pipeline has just 1 VM and Side Input has only 8 json messages and total size of side input is around 40 KB
>  
> *Troubleshooting:*
>  # Working fine in a DirectRunner   
>  # Validated that ReadAllFromText transform is loading all the data properly from multiple files with subsequent transform building KV as well. However when the output of KV transform is used as a sideinput using ASDict() for some reason certain elements are skipped and not available inside look up.
> *Code* 
> [Example Code|https://github.com/GoogleCloudPlatform/professional-services/blob/master/examples/dataflow-python-examples/streaming-examples/slowlychanging-sideinput/sideinput_refresh/main.py#L234] for reference
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)