
Posted to issues@flink.apache.org by "Stephan Ewen (JIRA)" <ji...@apache.org> on 2019/01/17 12:46:00 UTC

[jira] [Commented] (FLINK-11347) Optimize the ParquetAvroWriters factory

    [ https://issues.apache.org/jira/browse/FLINK-11347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16745008#comment-16745008 ] 

Stephan Ewen commented on FLINK-11347:
--------------------------------------

Forwarding my comment from the pull request:

The schema must be serializable, hence we convert it to a string and back.
The schema is in the closure of the factory, which itself is part of the user code that is shipped for distributed execution, hence the requirement to be serializable.

The parsing also happens just once when the writer is created, so my assumption is that the cost is acceptable.
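
To illustrate the constraint, here is a minimal, self-contained sketch. It uses neither Flink nor Avro: FakeSchema, DirectFactory, StringBackedFactory, shipToWorker and canShip are all hypothetical names, and FakeSchema only stands in for a schema class that is assumed not to implement java.io.Serializable. Plain Java serialization stands in for shipping user code to the cluster. It is not the actual ParquetAvroWriters code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

class SchemaClosureSketch {

    /** Hypothetical stand-in for a schema class that is NOT Serializable. */
    static final class FakeSchema {
        private final String definition;

        private FakeSchema(String definition) {
            this.definition = definition;
        }

        /** Stands in for parsing a schema from its textual form. */
        static FakeSchema parse(String text) {
            return new FakeSchema(text);
        }

        @Override
        public String toString() {
            return definition;
        }
    }

    /** Holds the schema object directly, so the factory cannot be serialized. */
    static final class DirectFactory implements Serializable {
        private static final long serialVersionUID = 1L;
        private final FakeSchema schema;

        DirectFactory(FakeSchema schema) {
            this.schema = schema;
        }
    }

    /** Holds only the schema's string form and re-parses it on demand. */
    static final class StringBackedFactory implements Serializable {
        private static final long serialVersionUID = 1L;
        private final String schemaString;

        StringBackedFactory(FakeSchema schema) {
            this.schemaString = schema.toString();
        }

        /** Intended to be called once, when a writer is created, not per record. */
        FakeSchema createSchema() {
            return FakeSchema.parse(schemaString);
        }
    }

    /** Simulates shipping an object to a worker via a Java serialization round trip. */
    static <T extends Serializable> T shipToWorker(T object) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(object);
            }
            try (ObjectInputStream in =
                    new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                @SuppressWarnings("unchecked")
                T copy = (T) in.readObject();
                return copy;
            }
        } catch (IOException e) {
            // Includes java.io.NotSerializableException for non-serializable fields.
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Returns true if the object survives the serialization round trip. */
    static boolean canShip(Serializable object) {
        try {
            shipToWorker(object);
            return true;
        } catch (UncheckedIOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        FakeSchema schema = FakeSchema.parse("{\"type\":\"string\"}");
        System.out.println("direct factory ships: " + canShip(new DirectFactory(schema)));
        System.out.println("string factory ships: " + canShip(new StringBackedFactory(schema)));
    }
}
```

In this sketch the string round trip is what makes the factory shippable, and the re-parse is paid only when createSchema() is called, not for every record.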

I would close this issue, because the solution proposed here (passing the schema to the writer directly) is not possible.
Please reopen the issue if you disagree and would like to pursue this further.

> Optimize the ParquetAvroWriters factory
> ---------------------------------------
>
>                 Key: FLINK-11347
>                 URL: https://issues.apache.org/jira/browse/FLINK-11347
>             Project: Flink
>          Issue Type: Improvement
>          Components: Formats
>    Affects Versions: 1.7.1
>            Reporter: Fokko Driesprong
>            Assignee: Fokko Driesprong
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.8.0
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> In the ParquetAvroWriters the schema is first serialized to a string, and then back to a Schema, which is quite expensive to do. Therefore it makes sense to pass the schema to the writer directly.


