Posted to issues@beam.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2019/12/24 04:31:00 UTC

[jira] [Work logged] (BEAM-9029) Two bugs in Python SDK S3 filesystem support

     [ https://issues.apache.org/jira/browse/BEAM-9029?focusedWorklogId=362879&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-362879 ]

ASF GitHub Bot logged work on BEAM-9029:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 24/Dec/19 04:30
            Start Date: 24/Dec/19 04:30
    Worklog Time Spent: 10m 
      Work Description: icemoon1987 commented on pull request #10459: [BEAM-9029]Fix two bugs in Python SDK S3 filesystem support
URL: https://github.com/apache/beam/pull/10459
 
 
   Trying to fix the bugs on JIRA: https://issues.apache.org/jira/browse/BEAM-9029
   
   1. Ignore exception when trying to list a nonexistent S3 path;
   2. Fix parsing issue when deleting the temporary output directory.
   
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Issue Time Tracking
-------------------

            Worklog Id:     (was: 362879)
    Remaining Estimate: 23h 50m  (was: 24h)
            Time Spent: 10m

> Two bugs in Python SDK S3 filesystem support
> --------------------------------------------
>
>                 Key: BEAM-9029
>                 URL: https://issues.apache.org/jira/browse/BEAM-9029
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-py-core
>            Reporter: Wenhai Pan
>            Priority: Major
>              Labels: pull-request-available
>   Original Estimate: 24h
>          Time Spent: 10m
>  Remaining Estimate: 23h 50m
>
> Hi :)
> There seem to be 2 bugs in the S3 filesystem support.
>  
> I tried to use S3 storage for a simple wordcount demo with DirectRunner.
> The demo script:
> {code:python}
> import re
>
> import apache_beam as beam
> from apache_beam.io import ReadFromText, WriteToText
> from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions
>
>
> def main():
>     # Run the pipeline locally with the DirectRunner.
>     options = PipelineOptions().view_as(StandardOptions)
>     options.runner = 'DirectRunner'
>     pipeline = beam.Pipeline(options=options)
>     (
>         pipeline
>         | ReadFromText("s3://mx-machine-learning/panwenhai/beam_test/test_data")
>         | "extract_words" >> beam.FlatMap(lambda x: re.findall(r" [A-Za-z\']+", x))
>         | beam.combiners.Count.PerElement()
>         | beam.MapTuple(lambda word, count: "%s: %s" % (word, count))
>         | WriteToText("s3://mx-machine-learning/panwenhai/beam_test/output")
>     )
>     # Block until the pipeline has finished.
>     result = pipeline.run()
>     result.wait_until_finish()
> {code}
>  
> Error message 1:
> {noformat}
> apache_beam.io.filesystem.BeamIOError: Match operation failed with exceptions {'s3://mx-machine-learning/panwenhai/beam_test/output-*-of-00001': BeamIOError("List operation failed with exceptions {'s3://mx-machine-learning/panwenhai/beam_test/output-': S3ClientError('Tried to list nonexistent S3 path: s3://mx-machine-learning/panwenhai/beam_test/output-', 404)}")} [while running 'WriteToText/Write/WriteImpl/PreFinalize'] with exceptions None{noformat}
>  
> After digging into the code, it seems the Boto3 client's list function raises an exception when asked to list a nonexistent S3 path (beam/sdks/python/apache_beam/io/aws/clients/s3/boto3_client.py line 111), and the S3IO class does not handle this exception in its list_prefix function (beam/sdks/python/apache_beam/io/aws/s3io.py line 121).
> When the runner tries to list and delete existing output files and none exist, it lists a nonexistent S3 path, which triggers the exception.
> A missing output path should not be an error here. I think we can safely ignore this exception in the S3IO list_prefix function.
>  
> Error message 2:
> {noformat}
> File "/Users/wenhai.pan/venvs/tfx/lib/python3.7/site-packages/apache_beam-2.19.0.dev0-py3.7.egg/apache_beam/io/aws/s3filesystem.py", line 272, in delete
> exceptions = {path: error for (path, error) in results
> File "/Users/wenhai.pan/venvs/tfx/lib/python3.7/site-packages/apache_beam-2.19.0.dev0-py3.7.egg/apache_beam/io/aws/s3filesystem.py", line 272, in <dictcomp>
> exceptions = {path: error for (path, error) in results
> ValueError: too many values to unpack (expected 2) [while running 'WriteToText/Write/WriteImpl/FinalizeWrite']{noformat}
>  
> This exception is triggered when the runner tries to delete the temporary output directory. It is caused by unpacking (path, error) directly from "results", which is a dict (beam/sdks/python/apache_beam/io/aws/s3filesystem.py line 272): iterating over a dict yields only its keys, so each path string is itself unpacked and the unpacking fails. I think we should use results.items() here.
> I have submitted a patch for these 2 bugs. Thank you.

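The first fix proposed in the description above (treating a 404 from the list call as an empty listing) could look roughly like the sketch below. This is not the submitted patch and does not use the real Beam or Boto3 classes: `S3ClientError`, `FakeClient`, and this standalone `list_prefix` are hypothetical stand-ins that only borrow names from the error message so the idea can be run in isolation.

```python
class S3ClientError(Exception):
    """Hypothetical stand-in for the client error named in the traceback."""

    def __init__(self, message, code):
        super().__init__(message)
        self.code = code


class FakeClient:
    """Hypothetical in-memory client that raises 404 when no key matches."""

    def __init__(self, objects):
        self.objects = objects  # maps full path -> size in bytes

    def list(self, prefix):
        matches = {p: s for p, s in self.objects.items() if p.startswith(prefix)}
        if not matches:
            raise S3ClientError(
                'Tried to list nonexistent S3 path: %s' % prefix, 404)
        return matches


def list_prefix(client, prefix):
    """Return {path: size} under prefix, treating a 404 as an empty listing."""
    try:
        return client.list(prefix)
    except S3ClientError as e:
        if e.code == 404:
            # Nothing exists under this prefix yet; that is not an error
            # for a caller that only wants to clean up old output.
            return {}
        raise  # any other client error is still propagated
```

Only the 404 case is swallowed here; whether other error codes should also be tolerated is a design choice the description does not settle.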

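The second error can be reproduced without Beam at all. The sketch below uses a hypothetical `results` dict (path mapped to an exception, or `None` on success) to show why unpacking directly from a dict fails and how `results.items()` avoids it. The `if error is not None` filter is an assumption, since the traceback truncates the original comprehension.

```python
# Hypothetical outcome of a batch delete: path -> exception, or None on success.
results = {
    's3://bucket/tmp/part-0': None,
    's3://bucket/tmp/part-1': IOError('access denied'),
}


def unpack_keys(results):
    # Iterating a dict yields its keys only, so Python tries to unpack each
    # path string into (path, error) and raises ValueError.
    return {path: error for (path, error) in results}


def collect_failures(results):
    # items() yields (key, value) pairs, which unpack cleanly.
    return {
        path: error
        for (path, error) in results.items()
        if error is not None  # assumed filter: keep only failed deletions
    }
```

Calling `unpack_keys(results)` raises `ValueError`, while `collect_failures(results)` returns a dict containing only the `part-1` entry.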

--
This message was sent by Atlassian Jira
(v8.3.4#803005)