Posted to issues@beam.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2019/11/04 20:02:00 UTC

[jira] [Work logged] (BEAM-2572) Implement an S3 filesystem for Python SDK

     [ https://issues.apache.org/jira/browse/BEAM-2572?focusedWorklogId=338336&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-338336 ]

ASF GitHub Bot logged work on BEAM-2572:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 04/Nov/19 20:01
            Start Date: 04/Nov/19 20:01
    Worklog Time Spent: 10m 
      Work Description: MattMorgis commented on issue #9955: [BEAM-2572] Python SDK S3 Filesystem
URL: https://github.com/apache/beam/pull/9955#issuecomment-549522747
 
 
   Hi,
   
   We are running into trouble getting the unit tests to pass in the CI environment, and I think we could use some help from a core team member.
   
   We added a new set of extra dependencies for this new S3 filesystem, following the same pattern that the GCP extras use: https://github.com/apache/beam/pull/9955/files#diff-e9d0ab71f74dc10309a29b697ee99330R239
   
   This allows the user to install with `pip install beam[gcp]`, or in our case `pip install beam[aws]`.
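   
   Roughly, the extras wiring looks like this (a minimal sketch; the package names and version pins are illustrative, not the exact list from the PR):
   
   ```python
   # Sketch of the extras wiring in setup.py; package names and pins
   # are illustrative, not the exact list from the PR.
   from setuptools import setup
   
   setup(
       name='apache-beam',
       # ... other setup() arguments elided ...
       extras_require={
           'gcp': ['google-cloud-storage'],  # stand-in for the existing GCP list
           'aws': ['boto3>=1.9'],            # new AWS extras: `pip install apache-beam[aws]`
       },
   )
   ```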
   
   Our unit tests are completely mocked out and do not require any of the AWS extra packages; however, the mock sits behind a flag so you can bypass it and talk to a real S3 bucket over the wire. Because of this, the extra dependencies *do* need to be installed when running these new unit tests.
   
   Again following GCP's lead, we skip the unit tests if the extra dependencies are not installed, just as the GCP tests do: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/gcsio_test.py#L240
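   
   The skip looks roughly like this (a sketch mirroring the gcsio_test.py pattern; the module path and class name are illustrative):
   
   ```python
   # Sketch of the import-or-skip pattern; apache_beam.io.aws.s3io is an
   # illustrative module path, not necessarily the one in the PR.
   import unittest
   
   try:
     from apache_beam.io.aws import s3io
   except ImportError:
     s3io = None
   
   
   @unittest.skipIf(s3io is None, 'AWS dependencies are not installed')
   class S3IOTest(unittest.TestCase):
   
     def test_import_available(self):
       # Runs only when the aws extras (e.g. boto3) are installed.
       self.assertIsNotNone(s3io)
   
   
   if __name__ == '__main__':
     unittest.main()
   ```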
   
   Our question: _**How do we configure CI to install the AWS deps so it can run these tests?**_
   
   I have poked around a bit and found one setting in `tox.ini` that appears to install both the test and gcp deps (https://github.com/apache/beam/blob/master/sdks/python/tox.ini#L200). Additionally, at the root level of the project (https://github.com/apache/beam/blob/master/buildSrc/src/main/groovy/org/apache/beam/gradle/BeamModulePlugin.groovy#L1799), I found an `installGcpTest` Gradle task that also seems to install both. That task only appears to be referenced inside `test-suites/dataflow`, not `direct` or `portable`.
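   
   On the tox side, we imagine something along these lines would do it (a hypothetical sketch, not the actual Beam config; the environment name and test command are placeholders):
   
   ```ini
   # Hypothetical tox environment; `extras` tells tox which setup.py extras
   # to install into the test virtualenv before running the commands.
   [testenv:py37-aws]
   extras = test,aws
   commands =
       pytest apache_beam/io/aws
   ```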
   
   Any guidance here would be greatly appreciated! 
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Issue Time Tracking
-------------------

    Worklog Id:     (was: 338336)
    Time Spent: 20m  (was: 10m)

> Implement an S3 filesystem for Python SDK
> -----------------------------------------
>
>                 Key: BEAM-2572
>                 URL: https://issues.apache.org/jira/browse/BEAM-2572
>             Project: Beam
>          Issue Type: Task
>          Components: sdk-py-core
>            Reporter: Dmitry Demeshchuk
>            Priority: Minor
>              Labels: GSoC2019, gsoc, gsoc2019, mentor, outreachy19dec
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> There are two paths worth exploring, to my understanding:
> 1. Sticking to the HDFS-based approach (like it's done in Java).
> 2. Using boto/boto3 for accessing S3 through its common API endpoints.
> I personally prefer the second approach, for a few reasons:
> 1. In real life, HDFS and S3 have different consistency guarantees, therefore their behaviors may contradict each other in some edge cases (say, we write something to S3, but it's not immediately accessible for reading from another end).
> 2. There are other AWS-based sources and sinks we may want to create in the future: DynamoDB, Kinesis, SQS, etc.
> 3. boto3 already provides reasonably good logic for basic things like retrying.
> Whatever path we choose, there's another problem related to this: we currently cannot pass any global settings (say, pipeline options, or just an arbitrary kwarg) to a filesystem. Because of that, we'd have to set up the runner nodes to have AWS keys in the environment, which is not trivial to achieve and doesn't look too clean either (I'd rather see one single place for configuring the runner options).
> Also, it's worth mentioning that I already have a janky S3 filesystem implementation that only supports DirectRunner at the moment (because of the previous paragraph). I'm perfectly fine finishing it myself, with some guidance from the maintainers.
> Where should I move on from here, and whose input should I be looking for?
> Thanks!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)