Posted to commits@beam.apache.org by "Ryan Skraba (JIRA)" <ji...@apache.org> on 2017/10/05 11:53:00 UTC

[jira] [Comment Edited] (BEAM-2993) AvroIO.write without specifying a schema

    [ https://issues.apache.org/jira/browse/BEAM-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16192777#comment-16192777 ] 

Ryan Skraba edited comment on BEAM-2993 at 10/5/17 11:52 AM:
-------------------------------------------------------------

Very good points, and they anticipate some of our plans!  Fortunately, I'm pretty sure we can require that *if* someone chooses to use {{AvroIO.write()}} without specifying a schema, they *must* provide a homogeneous collection (all elements with the same schema)!

But looking ahead, we *are* moving towards heterogeneous collections (or at least heterogeneous-ish, with a limited number of possible schemas), and there are intelligent things we can do in intermediate transforms, such as reconciling them into a single, "known" schema.  I don't think it would be reasonable or desirable to ask AvroIO.write to implement any of that intermediate-transform logic.
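As a conceptual sketch of what such an intermediate reconciliation could look like (plain Java, not Beam or Avro code; `SchemaReconcile`, `reconcile`, and the field-map representation of a schema are all hypothetical stand-ins):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SchemaReconcile {
    // A schema is modeled here as an ordered map of field name -> type name.
    // Reconciling two schemas keeps every field from both; a real transform
    // would also have to resolve type conflicts and nullability.
    static Map<String, String> reconcile(Map<String, String> a, Map<String, String> b) {
        Map<String, String> merged = new LinkedHashMap<>(a);
        b.forEach(merged::putIfAbsent);
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> v1 = new LinkedHashMap<>();
        v1.put("id", "long");
        v1.put("name", "string");

        Map<String, String> v2 = new LinkedHashMap<>();
        v2.put("id", "long");
        v2.put("email", "string");

        // Merged "known" schema contains the fields of both versions.
        System.out.println(reconcile(v1, v2)); // {id=long, name=string, email=string}
    }
}
```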

That being said, SchemaRefAndRecord is probably what we would need to solve the heterogeneous-collection problem, but I don't consider it related to this issue.

For info: before Beam 2.0, we used the Hadoop input format sink with a lazy configuration applied when the first record was received, which actually worked very well -- but we're pretty motivated to move entirely to the BFS as soon as possible!
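The lazy-configuration pattern mentioned above can be sketched in plain Java (an illustration of the idea only, not the actual Hadoop or Beam sink code; `LazyWriter` and its methods are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;

// A writer that defers its configuration until the first record arrives,
// deriving the "schema" from that record rather than requiring it up front.
public class LazyWriter {
    private String schema;                        // configured lazily
    private final List<String> written = new ArrayList<>();

    public void write(String record) {
        if (schema == null) {
            // First record received: configure the sink from it.
            schema = "fields:" + record.split(",").length;
        }
        written.add(record);
    }

    public String schema() { return schema; }
    public List<String> written() { return written; }

    public static void main(String[] args) {
        LazyWriter w = new LazyWriter();
        w.write("1,alice");
        w.write("2,bob");
        System.out.println(w.schema()); // fields:2
    }
}
```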



> AvroIO.write without specifying a schema
> ----------------------------------------
>
>                 Key: BEAM-2993
>                 URL: https://issues.apache.org/jira/browse/BEAM-2993
>             Project: Beam
>          Issue Type: Improvement
>          Components: sdk-java-extensions
>            Reporter: Etienne Chauchot
>            Assignee: Etienne Chauchot
>
> Similarly to https://issues.apache.org/jira/browse/BEAM-2677, we should be able to write to avro files using {{AvroIO}} without specifying a schema at build time. Consider the following use case: a user has a {{PCollection<GenericRecord>}}  but the schema is only known while running the pipeline.  {{AvroIO.writeGenericRecords}} needs the schema, but the schema is already available in {{GenericRecord}}. We should be able to call {{AvroIO.writeGenericRecords()}} with no schema.
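The request can be sketched in plain Java (a stand-in `Rec` type is used in place of Avro's GenericRecord, which really does expose {{getSchema()}}; `WriteWithoutSchema` and `deriveSchema` are hypothetical names for illustration):

```java
import java.util.Arrays;
import java.util.List;

public class WriteWithoutSchema {
    // Stand-in for Avro's GenericRecord: every record carries its own schema.
    static class Rec {
        final String schema, value;
        Rec(String schema, String value) { this.schema = schema; this.value = value; }
        String getSchema() { return schema; }
    }

    // Derive the schema from the first element of a homogeneous collection,
    // as the issue proposes for a schema-less AvroIO.writeGenericRecords().
    static String deriveSchema(List<Rec> records) {
        if (records.isEmpty()) throw new IllegalArgumentException("empty collection");
        return records.get(0).getSchema();
    }

    public static void main(String[] args) {
        List<Rec> recs = Arrays.asList(new Rec("user-v1", "alice"), new Rec("user-v1", "bob"));
        System.out.println(deriveSchema(recs)); // user-v1
    }
}
```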



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)