You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@beam.apache.org by "Jeff Webb (Jira)" <ji...@apache.org> on 2021/09/14 23:08:00 UTC
[jira] [Updated] (BEAM-10148) Data loaded using ReadAllFromText is
not projected properly as side input
[ https://issues.apache.org/jira/browse/BEAM-10148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jeff Webb updated BEAM-10148:
-----------------------------
Status: Open (was: Triage Needed)
> Data loaded using ReadAllFromText is not projected properly as side input
> -------------------------------------------------------------------------
>
> Key: BEAM-10148
> URL: https://issues.apache.org/jira/browse/BEAM-10148
> Project: Beam
> Issue Type: Bug
> Components: io-py-files
> Affects Versions: 2.19.0
> Environment: Runner: Dataflow Runner
> Beam SDK: Python 2.19
> Reporter: Prathap Kumar Parvathareddy
> Priority: P3
>
> *Context:*
> Data Enrichment Pattern with 2 Sources.
> Source 1: Google Cloud PubSub delivering JSON messages
> Source 2: Google Cloud Storage Files (acts as a Side Input)
> *Steps:*
> 1. Load the data from GCS (Google Cloud Storage) using ReadAllFromText based on file path inside a PCollection and convert each record as a tuple with 2 fields
> 2. Project the Tuple loaded in step 1 as a side input using Pvalue.ASDict to the main input that is being loaded from PubSub.
> 3. Expectation is that the side inputs should be available but for some reason AsDict is not containing all the data that was loaded from GCS
>
> *Possible Issues:*
> Below are few possible issues that can be ruled out as I already validated them.
> # Window Mismatches - Main Input window is within the scope of Side Input window.
> # Delay in updating Side Input State - Pipeline has just 1 VM and Side Input has only 8 json messages and total size of side input is around 40 KB
>
> *Troubleshooting:*
> # Working fine in a DirectRunner
> # Validated that ReadAllFromText transform is loading all the data properly from multiple files with subsequent transform building KV as well. However when the output of KV transform is used as a sideinput using ASDict() for some reason certain elements are skipped and not available inside look up.
> *Code*
> [Example Code|https://github.com/GoogleCloudPlatform/professional-services/blob/master/examples/dataflow-python-examples/streaming-examples/slowlychanging-sideinput/sideinput_refresh/main.py#L234] for reference
>
>
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)