Posted to user@beam.apache.org by Dmitry Demeshchuk <dm...@postmates.com> on 2017/07/06 20:27:10 UTC

Docs/guidelines on writing filesystem sources and sinks

Hi folks,

I'm working on an S3 filesystem for the Python SDK, which already works
for the happy path of both reading and writing, but I feel like there are
quite a few edge cases that I'm likely missing.

So far, my approach has been: "look at the generic FileSystem
implementation, look at how gcsio.py and gcsfilesystem.py are written, try
to copy their approach as much as possible, at least for getting to the
proof of concept".

That said, I'd like to know a few things:

1. Are there any official or non-official guidelines or docs on writing
filesystems? Even Java-specific ones may be really useful.

2. Are there any existing generic test suites that every filesystem is
supposed to pass? Again, even if they exist only in Java world, I'd still
be down for trying to adopt them in Python SDK too.

3. Are there any established ideas on how to pass AWS credentials to Beam
to make the S3 filesystem actually work? I currently rely on the existing
environment variables, which boto just picks up, but it sounds like
setting them up in runners like Dataflow or Spark would be troublesome.
I've seen this discussion a couple of times on the list, but couldn't tell
whether it ever reached closure. My personal preference would be having
AWS settings passed in some global context (pipeline options, perhaps?),
but there may be exceptions to that (say, people wanting to use different
credentials for different AWS operations).
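
For context, this is roughly what "relying on the environment" looks like
right now (a minimal sketch, assuming boto's default credential chain):

import os
import boto

# No S3 credentials appear in the pipeline code; boto resolves them from
# the environment of whatever process happens to run this. Locally that's
# my shell; on a distributed runner the variables would somehow have to
# be present on the workers.
assert 'AWS_ACCESS_KEY_ID' in os.environ
assert 'AWS_SECRET_ACCESS_KEY' in os.environ

conn = boto.connect_s3()  # credentials picked up implicitly
bucket = conn.get_bucket('my-bucket')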

Thanks!

-- 
Best regards,
Dmitry Demeshchuk.

Re: Docs/guidelines on writing filesystem sources and sinks

Posted by Chamikara Jayalath <ch...@google.com>.
Currently we don't have official documentation or a testing guide for
adding new FileSystems. The best source here would be the existing
FileSystem implementations, as you mentioned.

I don't think parameters for initializing FileSystems should be passed
when creating a read transform. Can you try to get any config parameters
from the environment instead? Note that for distributed runners, you will
have to register environment variables on the workers in a runner-specific
way (for example, for the Dataflow runner, this could be through an
additional package that gets installed on the workers). I think
+Sourabh Bajaj <so...@google.com> was looking into providing a better
solution for this.
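
For example, the filesystem could resolve credentials lazily from the
worker's environment (a sketch only; this helper is hypothetical, not
existing Beam code):

import os

def _aws_credentials():
  # Hypothetical helper inside the S3 filesystem: read credentials from
  # whatever environment the worker process ends up with. How those
  # variables get onto the workers is runner-specific (for Dataflow,
  # possibly via an additional package installed on the workers, as
  # mentioned above).
  return {
      'aws_access_key_id': os.environ.get('AWS_ACCESS_KEY_ID'),
      'aws_secret_access_key': os.environ.get('AWS_SECRET_ACCESS_KEY'),
  }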

- Cham

On Thu, Jul 6, 2017 at 4:42 PM Dmitry Demeshchuk <dm...@postmates.com>
wrote:

> I also stumbled upon a problem that I can't really pass additional
> configuration to a filesystem, e.g.
>
> lines = pipeline | 'read' >> ReadFromText('s3://my-bucket/kinglear.txt',
> aws_config=AWSConfig())
>
> because the ReadFromText class relies on PTransform's constructor, which
> has a pre-defined set of arguments.
>
> This is probably becoming a cross-topic for the dev list (have I added it
> in the right way?)
>
> On Thu, Jul 6, 2017 at 1:27 PM, Dmitry Demeshchuk <dm...@postmates.com>
> wrote:
>
>> Hi folks,
>>
>> I'm working on an S3 filesystem for the Python SDK, which already works
>> in case of a happy path for both reading and writing, but I feel like there
>> are quite a few edge cases that I'm likely missing.
>>
>> So far, my approach has been: "look at the generic FileSystem
>> implementation, look at how gcsio.py and gcsfilesystem.py are written, try
>> to copy their approach as much as possible, at least for getting to the
>> proof of concept".
>>
>> That said, I'd like to know a few things:
>>
>> 1. Are there any official or non-official guidelines or docs on writing
>> filesystems? Even Java-specific ones may be really useful.
>>
>> 2. Are there any existing generic test suites that every filesystem is
>> supposed to pass? Again, even if they exist only in Java world, I'd still
>> be down for trying to adopt them in Python SDK too.
>>
>> 3. Are there any established ideas of how to pass AWS credentials to Beam
>> for making the S3 filesystem actually work? I currently rely on the
>> existing environment variables, which boto just picks up, but sounds like
>> setting them up in runners like Dataflow or Spark would be troublesome.
>> I've seen this discussion a couple times in the list, but couldn't tell if
>> any closure was found. My personal preference would be having AWS settings
>> passed in some global context (pipeline options, perhaps?), but there may
>> be exceptions to that (say, people want to use different credentials for
>> different AWS operations).
>>
>> Thanks!
>>
>> --
>> Best regards,
>> Dmitry Demeshchuk.
>>
>
>
>
> --
> Best regards,
> Dmitry Demeshchuk.
>

Re: Docs/guidelines on writing filesystem sources and sinks

Posted by Dmitry Demeshchuk <dm...@postmates.com>.
Hi Stephen,

Thanks for the detailed reply!

Some comments inline.

On Thu, Jul 6, 2017 at 5:21 PM, Stephen Sisk <si...@google.com> wrote:

> Hi Dmitry,
>
> I'm excited to hear that you'd like to do this work. If you haven't
> already, I'd first suggest that you open a JIRA issue to make sure other
> folks know you're working on this.
>

Will do tomorrow, thanks for the suggestion. The code is currently not a
part of Beam, but I'd be more than happy to push it upstream.


>
> I was involved in working on the recent java HDFS file system
> implementation, so I'll try and share what I know - I suspect knowledge
> about this is scattered around a bit, so hopefully others will chime in as
> well.
>
> > 1. Are there any official or non-official guidelines or docs on writing
> filesystems? Even Java-specific ones may be really useful.
> I don't know of any guides for writing IOs. I believe folks should be
> helpful here on the mailing list for specific questions, but there aren't
> that many that are experts in file system implementations. It's not
> expected to be a frequent task, so no one has tried to document it (it also
> means your contribution will have a wide impact!) If you wanted to write up
> your notes from the process, it'd likely be highly helpful to others.
>
> https://issues.apache.org/jira/browse/BEAM-2005 documents the work that
> we did to add the java Hadoop FileSystem implementation, so that might be a
> good guide - it has links to PRs, you can find out about design questions
> that came up there, etc.. The Hadoop FileSystem is relatively new, so
> reviewing its commit history may be very informative.
>

I'll check it out, thanks! The main reason I'm looking for more concrete
guidelines is that a lot of the internal filesystem-related mechanisms are
not obvious at all: for example, the fact that a temporary file is created
first and then moved elsewhere. Some of the functions in my implementation
are suboptimal or don't do anything, because they didn't seem immediately
useful, but given the complexity of the higher-level usage of FileSystem
subclasses, I'm likely making some mistakes right now.
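
For reference, this is roughly the surface I'm implementing - just a
skeleton based on the FileSystem base class and gcsfilesystem.py; the
exact method set and signatures should be checked against the base class
rather than taken from here:

from apache_beam.io.filesystem import FileSystem

class S3FileSystem(FileSystem):
  """Rough sketch of an S3-backed FileSystem for the Python SDK."""

  @classmethod
  def scheme(cls):
    return 's3'

  # Bulk operations: the file-based sinks write to temporary files first
  # and then copy/rename them into their final locations, so these appear
  # to operate on lists of paths.
  def copy(self, source_file_names, destination_file_names):
    raise NotImplementedError

  def rename(self, source_file_names, destination_file_names):
    raise NotImplementedError

  def delete(self, paths):
    raise NotImplementedError

  # Plus join(), split(), mkdirs(), create(), open(), exists(), size(),
  # etc. - see the base class and gcsfilesystem.py for the full set.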


>
> > 2. Are there any existing generic test suites that every filesystem is
> supposed to pass? Again, even if they exist only in Java world, I'd still
> be down for trying to adopt them in Python SDK too.
>
> I don't know of any. If you put together a test plan, we'd be happy to
> discuss it. The tests for the java Hadoop FileSystem represent the current
> thinking, but could likely be expanded on.
>

I can try thinking of something, but on second thought, different
filesystems have different characteristics and guarantees, so tests that
pass for HDFS may not necessarily pass for S3 (due to its eventual
consistency), and I'm sure Google Cloud Storage and the local filesystem
will have their own quirks too. My hope was that some kind of plan already
existed, but it looks like that's not the case, and now I can see why.

I'll try to reflect on this idea and see if I can pull together a doc with
at least some basic acceptance tests and ways to apply them to the
existing filesystems. I'll start a new thread if/when I end up doing that.


>
> > 3. Are there any established ideas of how to pass AWS credentials to
> Beam for making the S3 filesystem actually work?
>
> Looks like you already found the past discussions of this on the mailing
> list, that was what I would refer you to.
>
> > I also stumbled upon a problem that I can't really pass additional
> configuration to a filesystem,
> We had a similar problem with the hadoop configuration object - inside of
> the hadoop filesystem registrar, we read the pipeline options to see if
> there is configuration info there, as well as some default hadoop
> configuration file locations. See
> https://github.com/apache/beam/blob/master/sdks/java/io/hadoop-file-system/src/main/java/org/apache/beam/sdk/io/hdfs/HadoopFileSystemOptions.java#L45
>

Thanks, that's actually the ideal approach for me! I wasn't sure whether
pipeline options were accessible from inside transforms, but it looks like
they are. This makes a really good case for supporting the entire AWS
stack conveniently by providing some extra pipeline option, like
"aws_config" or something along those lines.


>
> The python folks will have to comment if that's the type of solution they
> want you to use though.
>
> I hope this helps!
>
> Stephen
>
>
> On Thu, Jul 6, 2017 at 4:42 PM Dmitry Demeshchuk <dm...@postmates.com>
> wrote:
>
>> I also stumbled upon a problem that I can't really pass additional
>> configuration to a filesystem, e.g.
>>
>> lines = pipeline | 'read' >> ReadFromText('s3://my-bucket/kinglear.txt',
>> aws_config=AWSConfig())
>>
>> because the ReadFromText class relies on PTransform's constructor, which
>> has a pre-defined set of arguments.
>>
>> This is probably becoming a cross-topic for the dev list (have I added it
>> in the right way?)
>>
>> On Thu, Jul 6, 2017 at 1:27 PM, Dmitry Demeshchuk <dm...@postmates.com>
>> wrote:
>>
>>> Hi folks,
>>>
>>> I'm working on an S3 filesystem for the Python SDK, which already works
>>> in case of a happy path for both reading and writing, but I feel like there
>>> are quite a few edge cases that I'm likely missing.
>>>
>>> So far, my approach has been: "look at the generic FileSystem
>>> implementation, look at how gcsio.py and gcsfilesystem.py are written, try
>>> to copy their approach as much as possible, at least for getting to the
>>> proof of concept".
>>>
>>> That said, I'd like to know a few things:
>>>
>>> 1. Are there any official or non-official guidelines or docs on writing
>>> filesystems? Even Java-specific ones may be really useful.
>>>
>>> 2. Are there any existing generic test suites that every filesystem is
>>> supposed to pass? Again, even if they exist only in Java world, I'd still
>>> be down for trying to adopt them in Python SDK too.
>>>
>>> 3. Are there any established ideas of how to pass AWS credentials to
>>> Beam for making the S3 filesystem actually work? I currently rely on the
>>> existing environment variables, which boto just picks up, but sounds like
>>> setting them up in runners like Dataflow or Spark would be troublesome.
>>> I've seen this discussion a couple times in the list, but couldn't tell if
>>> any closure was found. My personal preference would be having AWS settings
>>> passed in some global context (pipeline options, perhaps?), but there may
>>> be exceptions to that (say, people want to use different credentials for
>>> different AWS operations).
>>>
>>> Thanks!
>>>
>>> --
>>> Best regards,
>>> Dmitry Demeshchuk.
>>>
>>
>>
>>
>> --
>> Best regards,
>> Dmitry Demeshchuk.
>>
>


-- 
Best regards,
Dmitry Demeshchuk.

Re: Docs/guidelines on writing filesystem sources and sinks

Posted by Stephen Sisk <si...@google.com>.
Hi Dmitry,

I'm excited to hear that you'd like to do this work. If you haven't
already, I'd first suggest that you open a JIRA issue to make sure other
folks know you're working on this.

I was involved in working on the recent Java HDFS FileSystem
implementation, so I'll try to share what I know - I suspect knowledge
about this is scattered around a bit, so hopefully others will chime in as
well.

> 1. Are there any official or non-official guidelines or docs on writing
> filesystems? Even Java-specific ones may be really useful.

I don't know of any guides for writing IOs. Folks on the mailing list
should be helpful for specific questions, but there aren't many people who
are experts in filesystem implementations. It's not expected to be a
frequent task, so no one has tried to document it (it also means your
contribution will have a wide impact!). If you wanted to write up your
notes from the process, it would likely be very helpful to others.

https://issues.apache.org/jira/browse/BEAM-2005 documents the work that we
did to add the Java Hadoop FileSystem implementation, so that might be a
good guide - it has links to PRs, and you can find out about the design
questions that came up there, etc. The Hadoop FileSystem is relatively
new, so reviewing its commit history may be very informative.

> 2. Are there any existing generic test suites that every filesystem is
> supposed to pass? Again, even if they exist only in Java world, I'd still
> be down for trying to adopt them in Python SDK too.

I don't know of any. If you put together a test plan, we'd be happy to
discuss it. The tests for the Java Hadoop FileSystem represent the current
thinking, but could likely be expanded on.

> 3. Are there any established ideas of how to pass AWS credentials to Beam
> for making the S3 filesystem actually work?

Looks like you already found the past discussions of this on the mailing
list; that's what I would have referred you to.

> I also stumbled upon a problem that I can't really pass additional
> configuration to a filesystem,

We had a similar problem with the Hadoop configuration object - inside the
Hadoop filesystem registrar, we read the pipeline options to see if there
is configuration info there, as well as some default Hadoop configuration
file locations. See
https://github.com/apache/beam/blob/master/sdks/java/io/hadoop-file-system/src/main/java/org/apache/beam/sdk/io/hdfs/HadoopFileSystemOptions.java#L45
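
A rough Python analogue of that pattern (just to illustrate the idea; the
option names and fallback order here are assumptions, not an existing API)
might be:

import os

def resolve_aws_config(pipeline_options):
  # Mirror the HadoopFileSystemOptions approach: prefer values supplied as
  # pipeline options, then fall back to the environment (which is what
  # boto's default chain would consult anyway). Assumes options named
  # aws_access_key_id / aws_secret_access_key have been registered
  # somewhere; otherwise the lookups simply return None.
  opts = pipeline_options.get_all_options()
  return {
      'aws_access_key_id': opts.get('aws_access_key_id')
                           or os.environ.get('AWS_ACCESS_KEY_ID'),
      'aws_secret_access_key': opts.get('aws_secret_access_key')
                               or os.environ.get('AWS_SECRET_ACCESS_KEY'),
  }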

The Python folks will have to comment on whether that's the type of
solution they want you to use, though.

I hope this helps!

Stephen


On Thu, Jul 6, 2017 at 4:42 PM Dmitry Demeshchuk <dm...@postmates.com>
wrote:

> I also stumbled upon a problem that I can't really pass additional
> configuration to a filesystem, e.g.
>
> lines = pipeline | 'read' >> ReadFromText('s3://my-bucket/kinglear.txt',
> aws_config=AWSConfig())
>
> because the ReadFromText class relies on PTransform's constructor, which
> has a pre-defined set of arguments.
>
> This is probably becoming a cross-topic for the dev list (have I added it
> in the right way?)
>
> On Thu, Jul 6, 2017 at 1:27 PM, Dmitry Demeshchuk <dm...@postmates.com>
> wrote:
>
>> Hi folks,
>>
>> I'm working on an S3 filesystem for the Python SDK, which already works
>> in case of a happy path for both reading and writing, but I feel like there
>> are quite a few edge cases that I'm likely missing.
>>
>> So far, my approach has been: "look at the generic FileSystem
>> implementation, look at how gcsio.py and gcsfilesystem.py are written, try
>> to copy their approach as much as possible, at least for getting to the
>> proof of concept".
>>
>> That said, I'd like to know a few things:
>>
>> 1. Are there any official or non-official guidelines or docs on writing
>> filesystems? Even Java-specific ones may be really useful.
>>
>> 2. Are there any existing generic test suites that every filesystem is
>> supposed to pass? Again, even if they exist only in Java world, I'd still
>> be down for trying to adopt them in Python SDK too.
>>
>> 3. Are there any established ideas of how to pass AWS credentials to Beam
>> for making the S3 filesystem actually work? I currently rely on the
>> existing environment variables, which boto just picks up, but sounds like
>> setting them up in runners like Dataflow or Spark would be troublesome.
>> I've seen this discussion a couple times in the list, but couldn't tell if
>> any closure was found. My personal preference would be having AWS settings
>> passed in some global context (pipeline options, perhaps?), but there may
>> be exceptions to that (say, people want to use different credentials for
>> different AWS operations).
>>
>> Thanks!
>>
>> --
>> Best regards,
>> Dmitry Demeshchuk.
>>
>
>
>
> --
> Best regards,
> Dmitry Demeshchuk.
>

Re: Docs/guidelines on writing filesystem sources and sinks

Posted by Dmitry Demeshchuk <dm...@postmates.com>.
I also stumbled upon a problem: I can't really pass additional
configuration to a filesystem, e.g.

lines = pipeline | 'read' >> ReadFromText('s3://my-bucket/kinglear.txt',
                                          aws_config=AWSConfig())

because the ReadFromText class relies on PTransform's constructor, which
has a pre-defined set of arguments.

This is probably becoming a cross-post to the dev list (have I added it in
the right way?)

On Thu, Jul 6, 2017 at 1:27 PM, Dmitry Demeshchuk <dm...@postmates.com>
wrote:

> Hi folks,
>
> I'm working on an S3 filesystem for the Python SDK, which already works in
> case of a happy path for both reading and writing, but I feel like there
> are quite a few edge cases that I'm likely missing.
>
> So far, my approach has been: "look at the generic FileSystem
> implementation, look at how gcsio.py and gcsfilesystem.py are written, try
> to copy their approach as much as possible, at least for getting to the
> proof of concept".
>
> That said, I'd like to know a few things:
>
> 1. Are there any official or non-official guidelines or docs on writing
> filesystems? Even Java-specific ones may be really useful.
>
> 2. Are there any existing generic test suites that every filesystem is
> supposed to pass? Again, even if they exist only in Java world, I'd still
> be down for trying to adopt them in Python SDK too.
>
> 3. Are there any established ideas of how to pass AWS credentials to Beam
> for making the S3 filesystem actually work? I currently rely on the
> existing environment variables, which boto just picks up, but sounds like
> setting them up in runners like Dataflow or Spark would be troublesome.
> I've seen this discussion a couple times in the list, but couldn't tell if
> any closure was found. My personal preference would be having AWS settings
> passed in some global context (pipeline options, perhaps?), but there may
> be exceptions to that (say, people want to use different credentials for
> different AWS operations).
>
> Thanks!
>
> --
> Best regards,
> Dmitry Demeshchuk.
>



-- 
Best regards,
Dmitry Demeshchuk.
