Posted to commits@beam.apache.org by "Mike Lambert (JIRA)" <ji...@apache.org> on 2017/03/23 06:22:41 UTC
[jira] [Comment Edited] (BEAM-1787) Python DirectRunner silently blocks reading full query from Google Datastore

    [ https://issues.apache.org/jira/browse/BEAM-1787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15937806#comment-15937806 ] 

Mike Lambert edited comment on BEAM-1787 at 3/23/17 6:22 AM:
-------------------------------------------------------------

Nope, there is just a print statement at step 3. Using the datastore_wordcount example:
{noformat}
  lines = p | 'read from datastore' >> ReadFromDatastore(
      project, query, user_options.namespace)

  # Count the occurrences of each word.
  counts = (lines
            | 'split' >> (beam.ParDo(WordExtractingDoFn())
                          .with_output_types(unicode))
            | 'pair_with_one' >> beam.Map(lambda x: (x, 1))
            | 'group' >> beam.GroupByKey()
            | 'count' >> beam.Map(lambda (word, ones): (word, sum(ones))))
{noformat}

If I put logging statements inside WordExtractingDoFn, they are not printed until the very end of the script execution, and are printed all at once.
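
To illustrate why the log lines would arrive in one burst, here is a minimal sketch in plain Python. It is not Beam code and makes no claim about how DirectRunner is actually implemented; `read_split_queries`, `extract_words`, and both `run_*` functions are hypothetical stand-ins. It only contrasts finishing the whole read before the next step with handing each entity on as soon as it is read.

```python
# Hypothetical stand-ins: none of these names come from Beam.

def read_split_queries(num_entities, events):
    """Pretend to fetch entities from Datastore, one at a time."""
    for i in range(num_entities):
        events.append(('read', i))
        yield 'entity-%d' % i


def extract_words(entity, events):
    """Stand-in for WordExtractingDoFn.process; records a 'log' event."""
    events.append(('process', entity))
    return entity.split('-')


def run_stage_at_a_time(num_entities):
    """Materialize the entire read before the next step sees anything."""
    events = []
    entities = list(read_split_queries(num_entities, events))  # blocks here
    for entity in entities:
        extract_words(entity, events)
    return events


def run_pipelined(num_entities):
    """Pass each entity to the next step as soon as it has been read."""
    events = []
    for entity in read_split_queries(num_entities, events):
        extract_words(entity, events)
    return events
```

With three entities, `run_stage_at_a_time(3)` records all three `read` events before the first `process` event, which matches the pattern I see. `run_pipelined(3)` alternates `read` and `process`, which is what I had expected.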



> Python DirectRunner silently blocks reading full query from Google Datastore
> ----------------------------------------------------------------------------
>
>                 Key: BEAM-1787
>                 URL: https://issues.apache.org/jira/browse/BEAM-1787
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-py
>            Reporter: Mike Lambert
>            Assignee: Ahmet Altay
>            Priority: Minor
>              Labels: datastore, python
>
> When I run a query (even with many splits) against the production datastore (such as in the datastore_wordcount demo), it operates as follows:
> 1. split the query into a bunch of split queries
> 2. run each split query, collecting the results
> 3. then pass the results to the following stage / ParDo
> However, with DirectRunner, step 2 runs to completion before step 3 starts. So a large dataset must be fully downloaded before any of the following stages run.
> While that may make sense, and local parallelism/pipelining might be impossible, there are no output or status messages. Debugging why my code appeared to hang before processing results took forever: I had to dig through the code and add logging throughout the Beam code to figure out what was going on.
> See https://github.com/GoogleCloudPlatform/DataflowPythonSDK/issues/36 for more details
> This happens with github head 0.7.0-dev (there was no "version" tag for this above).


