Posted to dev@beam.apache.org by Jean-Baptiste Onofré <jb...@nanthrax.net> on 2018/03/23 07:07:26 UTC

Beam Summit - IO brainstorming

Hi all,

Sorry for the delay, but I had issues with my e-mail provider (I was not able
to send e-mails :( ).

Last week during Beam Summit, I had the chance to participate in the IO
brainstorming session.

Here are the meeting notes:

1. IOs set
We now have a decent number of IOs in Beam, and new ones are coming (ParquetIO,
RabbitMQIO). Users mentioned a new file format we could support: HDF5. It would
be a Python IO.
I will create the Jira about HDF5.
Other IOs are also in preparation, coming along with SDF (Splittable DoFn)
support.

2. IOs and SDKs
This point was related to the portability layer: how can I use a Java IO in
Python, or the opposite? Today, most of the IOs are tied to the Java SDK, which
is a bit frustrating for Python SDK users. Users are looking forward to the
portability layer; however, they also raised some questions about the Docker
requirements. I think we should prepare a clear answer on this point.

3. PCollection Headers
Users want more "dynamic" IOs, where the behavior of an IO could change
depending on the element being processed in the PCollection. I introduced what
we are using in Apache Camel: Message Headers. The Camel component endpoints
(the equivalent of Beam IOs) can use the headers: for instance, the camel-http
component can use a Camel.HTTP_URL header. We already discussed PCollection
headers/hints/annotations/metadata (whatever name we give it) and I still think
it would be a great feature for both IOs and even the runners.
I propose to create a Jira about that; I will be more than happy to work on
this one.
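
To make the header idea more concrete, here is a minimal Python sketch. It does
not use any Beam or Camel API: the WithHeaders envelope, the resolve_url helper,
the "HTTP_URL" header name and the default URL are all hypothetical names chosen
for illustration. It only shows how a sink could fall back to its static
configuration unless an element carries an overriding header.

```python
from dataclasses import dataclass, field
from typing import Any, Dict

# Hypothetical envelope: a payload plus free-form string headers.
# Nothing like this exists in Beam; the name is made up for this sketch.
@dataclass
class WithHeaders:
    payload: Any
    headers: Dict[str, str] = field(default_factory=dict)

# Assumed static configuration of an imaginary HTTP sink.
DEFAULT_URL = "http://localhost/default"

def resolve_url(element: WithHeaders, header_name: str = "HTTP_URL") -> str:
    """Return the per-element override if the header is set, else the default."""
    return element.headers.get(header_name, DEFAULT_URL)

elements = [
    WithHeaders("a"),                                            # no override
    WithHeaders("b", {"HTTP_URL": "http://localhost/special"}),  # override
]
targets = [resolve_url(e) for e in elements]
```

The payload stays untouched; only the routing decision looks at the headers,
which is the property that makes the approach attractive for IOs.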

4. Schema
As you might know, we are working on adding schema support to PCollection. This
feature can be leveraged by IOs. In particular, I think it would reduce the
"wrapping" done by IOs (like KafkaRecord, JmsRecord, ...) and make data
conversion easier.

5. Error Handling
Users need generic error handling in the IOs. Today, error handling is managed
by each IO. I introduced the error handler we are using in Apache Camel (sorry
again ;)) and especially the default error handler features like:
redelivery policy, recoverable/irrecoverable error handling, onWhen,
onException, whileTrue, ...
The error handler is not at the component level but at the routing engine
level. We could imagine something similar at the pipeline level.
Thoughts?
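
As a rough illustration of what a redelivery policy with
recoverable/irrecoverable errors could look like, here is a small Python sketch.
It is not Camel's error handler and not a Beam API: the exception classes, the
with_redelivery helper and its parameters are hypothetical and only meant to
show the retry logic.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

class RecoverableError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. a timeout)."""

class IrrecoverableError(Exception):
    """Hypothetical marker for failures that retrying cannot fix."""

def with_redelivery(action: Callable[[], T], max_redeliveries: int = 3,
                    delay_seconds: float = 0.0) -> T:
    """Run action, retrying recoverable failures up to max_redeliveries times."""
    attempt = 0
    while True:
        try:
            return action()
        except RecoverableError:
            if attempt >= max_redeliveries:
                raise  # redeliveries exhausted: surface the last failure
            attempt += 1
            time.sleep(delay_seconds)
        # IrrecoverableError (and any other exception) propagates immediately.

calls = {"count": 0}

def flaky_write() -> str:
    # Fails twice, then succeeds, to exercise the retry loop.
    calls["count"] += 1
    if calls["count"] < 3:
        raise RecoverableError("temporary failure")
    return "written"

result = with_redelivery(flaky_write)
```

Because the policy wraps the action instead of living inside it, the same
handler could in principle be shared by several IOs, which is the point of
moving it from the component level to the pipeline level.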

I hope I didn't forget anything ;)

To summarize:
- I will create new Jiras for HDF5 and other new IOs
- We have to work on documentation/explanation about the portability layer & IOs
- I will start a separate thread for the error handling discussion
- Nothing to do about schema: the work has already started.

Regards
JB
-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

Re: Beam Summit - IO brainstorming

Posted by Romain Manni-Bucau <rm...@gmail.com>.
+1000 for record metadata (Camel headers)

For Python, it could be interesting to "just" generate a Python-styled IO API
and use Jython under the hood, to let Python users code the way they know
while reusing the whole Beam ecosystem, including runners! The other way around
implies a lot of work for the community and also for ops, and makes users quite
unhappy to have a partial view of Beam, IMHO. Since Java has JSR 223, we
can plug whatever we want behind it, including the portable API if you want,
instead of putting it everywhere and polluting people who don't care much.




Re: Beam Summit - IO brainstorming

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Thanks for the update, Eila!

Much appreciated.

Regards
JB

On 03/23/2018 12:57 PM, OrielResearch Eila Arich-Landkof wrote:
> Hi All,
> 
> Cham and I have been trying to initiate HDF5 support with the HDF5 team. It
> seems that their forum might be able to provide the required support.
> I have created a ticket on their system (https://forum.hdfgroup.org/) and will
> follow up to make sure that this is not forgotten.
> Please let me know if you have any comments.
> 
> Best,
> Eila
> 
> 
> -- 
> Eila
> www.orielresearch.org <http://www.orielresearch.org>
> https://www.meetup.com/Deep-Learning-In-Production/


Re: Beam Summit - IO brainstorming

Posted by OrielResearch Eila Arich-Landkof <ei...@orielresearch.org>.
Hi All,

Cham and I have been trying to initiate HDF5 support with the HDF5
team. It seems that their forum might be able to provide the required
support.
I have created a ticket on their system (https://forum.hdfgroup.org/) and
will follow up to make sure that this is not forgotten.
Please let me know if you have any comments.

Best,
Eila





-- 
Eila
www.orielresearch.org
https://www.meetup.com/Deep-Learning-In-Production/

Re: Beam Summit - IO brainstorming

Posted by Chamikara Jayalath <ch...@google.com>.
Thanks, JB, for the detailed notes.

On Fri, Mar 23, 2018 at 2:43 PM Eugene Kirpichov <ki...@google.com>
wrote:

> Hi! Thanks for the notes.
>
> On Fri, Mar 23, 2018 at 3:07 AM Jean-Baptiste Onofré <jb...@nanthrax.net>
> wrote:
>
>> Hi all,
>>
>> Sorry for the delay, but I got issues with my e-mail provider (I was not
>> able to
>> send e-mails :( ).
>>
>> Last week during Beam Summit, I had the change to participate to the IO
>> brainstorming session.
>>
>> Here's the minute notes:
>>
>> 1. IOs set
>> We now have a decent number of IOs in Beam, and new are coming (ParquetIO,
>> RabbitMQIO). Users mentioned a new file format you could support: HDF5.
>> It would
>> be an Python IO.
>> I will create the Jira about HDF5.
>> Other IOs will also be in preparation, coming along with SDF support.
>>
>
As Eila mentioned, we are talking to the HDF5 group to determine if there's
somebody who's willing to write an HDF5 IO for the Python SDK. I'll be happy to
review it. Looks like Eila created
https://issues.apache.org/jira/browse/BEAM-3850 for this.


>
>> 2. IOs and SDKs
>> This point was related to the portability layer: how can I use a Java IO
>> in
>> Python or the opposite ? Today, most of the IOs are related to Java SDK,
>> and
>> it's a bit frustrating for Python SDK users. Users are looking forward
>> portability layer, however they also expressed some questions about Docker
>> requirements. I think we should prepare a clean answer to this point.
>>
>
> I'm pretty sure this is on the radar this quarter, but I don't remember
> whose radar.
>

I hope to look into some aspects of this in the next few months. I created
https://issues.apache.org/jira/browse/BEAM-3923 with more info.

>
>
>>
>> 3. PCollection Headers
>> Users want more "dynamic" IOs, maybe that a IO behavior could change
>> depending
>> of the element they are considering in the PCollection. I introduced what
>> we are
>> using in Apache Camel: Message Headers. The Camel components endpoints
>> (equivalent of Beam IOs) can use the headers: for instance the camel-http
>> component can use a Camel.HTTP_URL header. We already discussed about
>> PCollection headers/hints/annotation/metadata (whatever the name we give)
>> and I
>> still think it would be a great feature for both IOs and even the runners.
>> I'm proposing to create a Jira about that, I will be more than happy to
>> work on
>> this one.
>>
>
> Do you have a use case in mind that cannot be solved within the current
> approach to IOs? I think we have a pretty reasonable approach to "dynamic"
> IOs too, exemplified by FileIO.writeDynamic().
>
>
>>
>> 4. Schema
>> As you might know, we are working on adding schema support in
>> PCollection. This
>> feature can be leveraged by IOs. Especially, I think it would reduce the
>> "wrapping" made by IOs (like KafkaRecord, JmsRecord, ...) and easier data
>> convert.
>>
>> 5. Error Handling
>> Users would need a generic error handling in the IOs. Today the error
>> handling
>> is managed by each IOs. I introduced the error handler we are using in
>> Apache
>> Camel (sorry again ;)) and especially the default error handler features
>> like:
>> redelivery policy, recoverable/irrecoverable error handling, onWhen,
>> onException, whileTrue, ...
>> The error handler is not at component level but at routing engine level.
>> We
>> could imagine something similar at pipeline level.
>> Thoughts ?
>>
>
> Can you give some example use cases here too?
> I'm sure we can add some useful abstractions related to error handling,
> but picking the right level of abstraction for such an API will require
> very careful design. E.g. something like "a pipeline-global deadletter
> collection of records that failed processing" sounds useful in theory, but
> I think is impossible to define in a useful way compatible with the Beam
> model, and I think it has to be left to individual transforms.
>

Re: Beam Summit - IO brainstorming

Posted by Eugene Kirpichov <ki...@google.com>.
Hi! Thanks for the notes.

On Fri, Mar 23, 2018 at 3:07 AM Jean-Baptiste Onofré <jb...@nanthrax.net>
wrote:

> Hi all,
>
> Sorry for the delay, but I got issues with my e-mail provider (I was not
> able to
> send e-mails :( ).
>
> Last week during Beam Summit, I had the change to participate to the IO
> brainstorming session.
>
> Here's the minute notes:
>
> 1. IOs set
> We now have a decent number of IOs in Beam, and new are coming (ParquetIO,
> RabbitMQIO). Users mentioned a new file format you could support: HDF5. It
> would
> be an Python IO.
> I will create the Jira about HDF5.
> Other IOs will also be in preparation, coming along with SDF support.
>
> 2. IOs and SDKs
> This point was related to the portability layer: how can I use a Java IO in
> Python or the opposite ? Today, most of the IOs are related to Java SDK,
> and
> it's a bit frustrating for Python SDK users. Users are looking forward
> portability layer, however they also expressed some questions about Docker
> requirements. I think we should prepare a clean answer to this point.
>

I'm pretty sure this is on the radar this quarter, but I don't remember
whose radar.


>
> 3. PCollection Headers
> Users want more "dynamic" IOs, maybe that a IO behavior could change
> depending
> of the element they are considering in the PCollection. I introduced what
> we are
> using in Apache Camel: Message Headers. The Camel components endpoints
> (equivalent of Beam IOs) can use the headers: for instance the camel-http
> component can use a Camel.HTTP_URL header. We already discussed about
> PCollection headers/hints/annotation/metadata (whatever the name we give)
> and I
> still think it would be a great feature for both IOs and even the runners.
> I'm proposing to create a Jira about that, I will be more than happy to
> work on
> this one.
>

Do you have a use case in mind that cannot be solved within the current
approach to IOs? I think we have a pretty reasonable approach to "dynamic"
IOs too, exemplified by FileIO.writeDynamic().
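
For readers less familiar with the "dynamic destinations" idea, it can be
pictured outside Beam roughly as follows. This is only a plain-Python sketch of
the concept, not the FileIO.writeDynamic() API: group_by_destination, the record
layout and the "out/..." naming are assumptions made for the example.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List

def group_by_destination(
    records: Iterable[dict], destination_fn: Callable[[dict], str]
) -> Dict[str, List[dict]]:
    """Bucket records by the destination computed from each record itself."""
    buckets: Dict[str, List[dict]] = defaultdict(list)
    for record in records:
        buckets[destination_fn(record)].append(record)
    return dict(buckets)

records = [
    {"type": "order", "id": 1},
    {"type": "refund", "id": 2},
    {"type": "order", "id": 3},
]
# The destination is derived from the element's own data, not from metadata
# attached next to it.
by_dest = group_by_destination(records, lambda r: f"out/{r['type']}")
```

The difference from the header proposal is where the routing information
lives: here it is computed from the element by a user function, so no extra
metadata channel is needed.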


>
> 4. Schema
> As you might know, we are working on adding schema support in PCollection.
> This
> feature can be leveraged by IOs. Especially, I think it would reduce the
> "wrapping" made by IOs (like KafkaRecord, JmsRecord, ...) and easier data
> convert.
>
> 5. Error Handling
> Users would need a generic error handling in the IOs. Today the error
> handling
> is managed by each IOs. I introduced the error handler we are using in
> Apache
> Camel (sorry again ;)) and especially the default error handler features
> like:
> redelivery policy, recoverable/irrecoverable error handling, onWhen,
> onException, whileTrue, ...
> The error handler is not at component level but at routing engine level. We
> could imagine something similar at pipeline level.
> Thoughts ?
>

Can you give some example use cases here too?
I'm sure we can add some useful abstractions related to error handling, but
picking the right level of abstraction for such an API will require very
careful design. E.g. something like "a pipeline-global dead-letter
collection of records that failed processing" sounds useful in theory, but
I think it is impossible to define in a useful way compatible with the Beam
model, and I think it has to be left to individual transforms.
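
By "left to individual transforms" I mean something along these lines, shown
here as a plain-Python sketch rather than Beam code. The
apply_with_dead_letters helper is hypothetical; in a real pipeline the two
results would be two separate outputs of one transform.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

In = TypeVar("In")
Out = TypeVar("Out")

def apply_with_dead_letters(
    elements: Iterable[In], fn: Callable[[In], Out]
) -> Tuple[List[Out], List[Tuple[In, str]]]:
    """Apply fn to each element; collect failures instead of aborting."""
    succeeded: List[Out] = []
    dead_letters: List[Tuple[In, str]] = []
    for element in elements:
        try:
            succeeded.append(fn(element))
        except Exception as error:  # broad on purpose for this sketch
            # Keep the original element so it can be inspected or replayed.
            dead_letters.append((element, repr(error)))
    return succeeded, dead_letters

parsed, failed = apply_with_dead_letters(["1", "x", "3"], int)
```

Each transform knows what a failed element looks like and what type it has,
which is exactly the information a single pipeline-global collection would
lack.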

