Posted to dev@beam.apache.org by Pablo Estrada <pa...@google.com> on 2019/02/21 23:19:06 UTC
s3 filesystem for Python good for GSoC?
Hello all,
I was thinking that a filesystem with support for s3 would be great to have
in the Python SDK. If I am not wrong, it would simply involve implementing
the filesystem classes
<https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/filesystem.py>
with s3, right?
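
As a rough illustration of one small piece such a filesystem would need, here is a minimal sketch of splitting an s3:// path into a bucket and a key. The name parse_s3_path is hypothetical and is not part of the Beam SDK; the filesystem classes linked above define a larger interface that is not reproduced here.

```python
from typing import Tuple

S3_PREFIX = "s3://"


def parse_s3_path(path: str) -> Tuple[str, str]:
    """Split 's3://bucket/some/key' into ('bucket', 'some/key').

    Hypothetical helper for illustration only; not an apache_beam API.
    """
    if not path.startswith(S3_PREFIX):
        raise ValueError("Path must start with %s: %r" % (S3_PREFIX, path))
    remainder = path[len(S3_PREFIX):]
    # Everything before the first "/" is the bucket; the rest is the key.
    bucket, _, key = remainder.partition("/")
    if not bucket:
        raise ValueError("Path has no bucket name: %r" % path)
    return bucket, key
```

A path with no key part, such as "s3://bucket", yields an empty key, which a real implementation might treat as the bucket root.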
I am not familiar enough with s3, nor with filesystems, nor with AWS in
general - but I have some outstanding questions:
- Does this mean that we probably would need an extra [s3] target for
  installing apache_beam, like we do with [gcp]?
  - Not strictly necessary, but probably desirable...
- How do we handle KMS in GCS filesystem?
  - Would the filesystem encapsulation make KMS support in an s3
    filesystem difficult?
  - Or even more... is the KMS support in AWS very different than in GCP?
  - I'd love comments from anyone informed around this : )
- Is this project of an appropriate size for a GSoC student?
Thoughts?
Best
-P.
Re: s3 filesystem for Python good for GSoC?
Posted by Kenneth Knowles <ke...@apache.org>.
This is a good scope. And given there are multiple choices, an advanced
student can expand scope to do both.
Kenn
Re: s3 filesystem for Python good for GSoC?
Posted by Austin Bennett <wh...@gmail.com>.
Hi Pablo,
Agree on the usefulness.
Some thoughts embedded:
On Thu, Feb 21, 2019 at 3:19 PM Pablo Estrada <pa...@google.com> wrote:
> Hello all,
> I was thinking that a filesystem with support for s3 would be great to
> have in the Python SDK. If I am not wrong, it would simply involve
> implementing the filesystem classes
> <https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/filesystem.py> with
> s3, right?
>
If we are talking about extending filesystems to alternate clouds, would
Azure Blob storage also be sensible? (Not something I need, but I imagine
it could be valuable and pretty easy, as I (naively) think these
(gcs/azure-blob/s3) are largely interchangeable.)
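
One way to picture the "largely interchangeable" idea is a lookup from URL scheme to filesystem class. The sketch below uses stand-in classes only; StubFileSystem, the three stubs, and the "azfs" scheme name are assumptions made for illustration, and nothing here is taken from the Beam code base.

```python
from typing import Dict, Optional, Type


class StubFileSystem:
    """Stand-in base class; the real interface lives in filesystem.py."""
    scheme: str = ""


class GcsStub(StubFileSystem):
    scheme = "gs"


class S3Stub(StubFileSystem):
    scheme = "s3"


class AzureBlobStub(StubFileSystem):
    scheme = "azfs"  # assumed scheme name, for illustration only


def scheme_of(path: str) -> Optional[str]:
    """Return the URL scheme of a path, or None for a plain local path."""
    scheme, sep, _ = path.partition("://")
    return scheme if sep else None


def filesystem_class_for(path: str) -> Type[StubFileSystem]:
    """Pick the stub class whose scheme matches the path."""
    registry: Dict[str, Type[StubFileSystem]] = {
        cls.scheme: cls for cls in StubFileSystem.__subclasses__()
    }
    scheme = scheme_of(path)
    if scheme not in registry:
        raise ValueError("No filesystem registered for %r" % path)
    return registry[scheme]
```

With this shape, adding another cloud means adding one more subclass; the callers that pass paths around do not change.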
> I am not familiar enough with s3, nor with filesystems, nor with AWS in
> general - but I have some outstanding questions:
>
> - Does this mean that we probably would need an extra [s3] target for
>   installing apache_beam, like we do with [gcp]?
>   - Not strictly necessary, but probably desirable...
>
To think about: would there be an [aws] target to encompass more
potential/future AWS things, making it akin to [gcp]? Or, equivalently,
would we want a [gcs/gs] target to narrow down what gets loaded?
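
To make the extras question concrete, here is a minimal sketch of an optional-dependency guard. The "aws" extra name, the EXTRAS_REQUIRE table, and the import_optional helper are all hypothetical; the table only mirrors the shape setuptools expects for extras_require.

```python
import importlib
from types import ModuleType

# Hypothetical extras table, in the shape of setuptools' extras_require.
# No version pins are given because none were proposed in this thread.
EXTRAS_REQUIRE = {
    "aws": ["boto3"],
}


def import_optional(module_name: str, extra: str) -> ModuleType:
    """Import a module, or fail with a hint naming the extra to install."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            "%s is required for this filesystem; install it with "
            "'pip install apache_beam[%s]'" % (module_name, extra)
        ) from exc
```

A filesystem module could call import_optional("boto3", "aws") at the point of first use, so users who never touch s3 do not need boto3 installed.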
> - How do we handle KMS in GCS filesystem?
>   - Would the filesystem encapsulation make KMS support in an s3
>     filesystem difficult?
>   - Or even more... is the KMS support in AWS very different than in GCP?
>   - I'd love comments from anyone informed around this : )
>
I use KMS with AWS; the tricky part is custom managed keys. I haven't dug
in enough to see how similar or different the GCS implementation is (I
thought I only saw keys managed by GCP, so potentially easier, though AWS
does have that option).
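
For the AWS side, a minimal sketch of how a write path might request KMS encryption follows. ServerSideEncryption and SSEKMSKeyId are parameters of the boto3 S3 client's put_object call; the helper name kms_put_kwargs and the idea of threading a key id through the filesystem are assumptions made for illustration.

```python
from typing import Dict, Optional


def kms_put_kwargs(bucket: str, key: str, body: bytes,
                   kms_key_id: Optional[str] = None) -> Dict[str, object]:
    """Build keyword arguments for an S3 put_object call using SSE-KMS.

    Hypothetical helper. Without kms_key_id, S3 uses the AWS managed key;
    with it, S3 uses the key the caller names.
    """
    kwargs: Dict[str, object] = {
        "Bucket": bucket,
        "Key": key,
        "Body": body,
        "ServerSideEncryption": "aws:kms",
    }
    if kms_key_id is not None:
        kwargs["SSEKMSKeyId"] = kms_key_id
    return kwargs


# Usage, not executed here because it needs boto3 and AWS credentials:
#   import boto3
#   boto3.client("s3").put_object(**kms_put_kwargs("bucket", "key", b"data"))
```

Keeping the key id as an optional argument means the two cases mentioned above differ by one parameter rather than by separate code paths.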
>
> - Is this project of an appropriate size for a GSoC student?
>
Can't speak to appropriate size; was this listed as a project? Did we
have sufficiently vague proposals? I thought applications had been turned
in?
> Thoughts?
> Best
> -P.
>
Re: s3 filesystem for Python good for GSoC?
Posted by Suneel Marthi <sm...@apache.org>.
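Yup, something like this.

    import boto3

    # "bucket" and "key" are placeholder names.
    s3r = boto3.resource("s3")
    # Object.get() returns a dict whose "Body" is a readable stream.
    data = s3r.Object("bucket", "key").get()["Body"].read()
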
On Thu, Feb 21, 2019 at 9:50 PM Boyuan Zhang <bo...@google.com> wrote:
> I believe the Boto3 library should help, given the right credential
> configuration when creating a client:
> https://boto3.amazonaws.com/v1/documentation/api/latest/guide/quickstart.html#configuration
Re: s3 filesystem for Python good for GSoC?
Posted by Boyuan Zhang <bo...@google.com>.
I believe the Boto3 library should help, given the right credential
configuration when creating a client:
https://boto3.amazonaws.com/v1/documentation/api/latest/guide/quickstart.html#configuration
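
A minimal sketch of that client creation follows, assuming credentials come from a named profile or from boto3's default lookup (environment variables, shared credentials file, and so on). The helper names session_kwargs and make_s3_client are hypothetical; boto3.Session and its profile_name and region_name arguments are part of boto3.

```python
from typing import Dict, Optional


def session_kwargs(profile_name: Optional[str] = None,
                   region_name: Optional[str] = None) -> Dict[str, str]:
    """Keep only the settings the caller actually provided."""
    candidates = {"profile_name": profile_name, "region_name": region_name}
    return {k: v for k, v in candidates.items() if v is not None}


def make_s3_client(profile_name: Optional[str] = None,
                   region_name: Optional[str] = None):
    """Create an S3 client; with no arguments boto3 uses its defaults."""
    import boto3  # imported lazily so this module loads without boto3

    session = boto3.Session(**session_kwargs(profile_name, region_name))
    return session.client("s3")
```

Passing nothing leaves credential discovery entirely to boto3, which is probably the least surprising default for a pipeline running on a worker.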
Re: s3 filesystem for Python good for GSoC?
Posted by Suneel Marthi <su...@gmail.com>.
Couldn't you just use the Boto Python package for that?
I am writing one now to read from S3 via the Python API.