Posted to dev@beam.apache.org by Shen Li <cs...@gmail.com> on 2017/04/11 13:56:30 UTC

HDFS and Google Cloud Storage

Hi,

Is there any reason why HDFS IO is implemented as a BoundedSource while
Google Cloud Storage is implemented as a scheme ("gs://") for TextIO? To
contribute a new IO connector, how can I determine whether it should be
implemented as a source transform or as a scheme for TextIO?

Thanks,

Shen

Re: HDFS and Google Cloud Storage

Posted by Stephen Sisk <si...@google.com.INVALID>.
This is a great question! I filed
https://issues.apache.org/jira/browse/BEAM-1929 to update the I/O docs to
make sure they answer this.

S


Re: HDFS and Google Cloud Storage

Posted by Shen Li <cs...@gmail.com>.
Thanks!

Shen


Re: HDFS and Google Cloud Storage

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Yes, FileSystem "plugins" will use a scheme. Other connectors will use
DoFn/Source transforms, as is already the case.

Regards
JB


-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

Re: HDFS and Google Cloud Storage

Posted by Shen Li <cs...@gmail.com>.
Hi JB,

Thanks a lot for your response. Does it mean that all file-based IO will be
added as schemes using IOChannelFactory (or FileSystem, its new name), and
that all others, e.g., HTTP, TCP, KV-store, DB, message-queue, should be
source/sink transforms?

Thanks,

Shen


Re: HDFS and Google Cloud Storage

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Hi Shen,

We are refactoring the file IO layer (IOChannelFactory). Thanks to this
refactoring, you will be able to use a scheme for hdfs (or s3, ...) with
different formats (avro, text, hadoop input format, ...).
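
To illustrate the idea of pairing any scheme with any format, here is a
minimal sketch in plain Java. It is not Beam's actual API: the names
SchemeRegistrySketch, StorageBackend, register, schemeOf and readText are
all hypothetical, and the backends are in-memory stand-ins. It only shows
the general pattern of choosing a storage backend from the scheme in the
path while the format-level reader stays the same.

```java
import java.net.URI;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; none of these names come from Beam. */
final class SchemeRegistrySketch {

    /** Storage backend: returns raw lines for a path and knows nothing about formats. */
    interface StorageBackend {
        List<String> readLines(String spec);
    }

    /** Backends keyed by URI scheme, e.g. "hdfs", "gs", "s3". */
    private static final Map<String, StorageBackend> REGISTRY = new HashMap<>();

    static void register(String scheme, StorageBackend backend) {
        REGISTRY.put(scheme, backend);
    }

    /** Extracts the scheme from a path spec; a spec without a scheme is treated as "file". */
    static String schemeOf(String spec) {
        String scheme = URI.create(spec).getScheme();
        return scheme == null ? "file" : scheme;
    }

    /** Format-level reader (think "text") that works with whichever backend is registered. */
    static List<String> readText(String spec) {
        String scheme = schemeOf(spec);
        StorageBackend backend = REGISTRY.get(scheme);
        if (backend == null) {
            throw new IllegalArgumentException("No backend registered for scheme: " + scheme);
        }
        return backend.readLines(spec);
    }

    public static void main(String[] args) {
        // In-memory stand-ins for real storage systems (assumption for illustration).
        register("hdfs", spec -> Collections.singletonList("hdfs backend read " + spec));
        register("gs", spec -> Collections.singletonList("gs backend read " + spec));

        // The same reader is used for both paths; only the scheme differs.
        System.out.println(readText("hdfs://namenode/data/input.txt"));
        System.out.println(readText("gs://bucket/data/input.txt"));
    }
}
```

With this shape, supporting a new storage system means registering one more
backend under its scheme, and every format reader picks it up without
changes. A connector that is not file-like (a database or a message queue,
for example) has no path to dispatch on, which is why it would remain its
own transform.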

This means that HdfsIO will be deprecated (and removed at some point). I'm
working on a couple of PRs to leverage the new file IO layer.

Regards
JB


-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com