
Posted to dev@parquet.apache.org by "Robert Gruener (JIRA)" <ji...@apache.org> on 2018/07/03 14:47:00 UTC

[jira] [Commented] (PARQUET-179) Retire _metadata generation

    [ https://issues.apache.org/jira/browse/PARQUET-179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16531503#comment-16531503 ] 

Robert Gruener commented on PARQUET-179:
----------------------------------------

I am curious whether the plan is still to remove the _metadata file (judging from the code, it appears so). We currently use it to look up row group information for a dataset, since reading the footers of all the individual Parquet files is extremely inefficient. We read data directly from Parquet into TensorFlow on a single node for deep learning use cases.

Will there be any similar method for doing this operation or will we be on our own to create an index for the dataset?
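
To make the "own index" option concrete, here is a rough sketch of the first step such an index would need: recording where each file's footer lives, so the footers can later be fetched in one pass. It relies only on the documented Parquet layout (the file ends with a 4-byte little-endian footer length followed by the magic bytes "PAR1") and assumes unencrypted footers. The function names and the ".parquet" suffix filter are my own assumptions, not anything Parquet provides, and the sketch does not decode the Thrift footer, so it does not yield row group information by itself.

```python
import os
import struct

PARQUET_MAGIC = b"PAR1"  # trailing magic of a Parquet file with a plaintext footer


def read_footer_location(path):
    """Return (footer_offset, footer_length) for one Parquet file.

    Hypothetical helper: reads only the last 8 bytes of the file.
    """
    size = os.path.getsize(path)
    # Smallest possible file: leading magic + 4-byte length + trailing magic.
    if size < 12:
        raise ValueError(f"{path}: too small to be a Parquet file")
    with open(path, "rb") as f:
        f.seek(size - 8)
        tail = f.read(8)
    if tail[4:] != PARQUET_MAGIC:
        raise ValueError(f"{path}: missing trailing PAR1 magic")
    footer_length = struct.unpack("<I", tail[:4])[0]
    footer_offset = size - 8 - footer_length
    if footer_offset < 4:
        raise ValueError(f"{path}: footer length exceeds file size")
    return footer_offset, footer_length


def build_footer_index(dataset_dir):
    """Map each data file (path relative to dataset_dir) to its footer location.

    Assumes data files end in ".parquet"; sidecar files such as _metadata
    and hidden files are skipped.
    """
    index = {}
    for root, _dirs, files in os.walk(dataset_dir):
        for name in sorted(files):
            if name.startswith(("_", ".")) or not name.endswith(".parquet"):
                continue
            path = os.path.join(root, name)
            index[os.path.relpath(path, dataset_dir)] = read_footer_location(path)
    return index
```

Even this still touches every file once, which is the cost the _metadata file lets us avoid today; the index would have to be persisted and kept in sync with the dataset by whoever writes it.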

> Retire _metadata generation 
> ----------------------------
>
>                 Key: PARQUET-179
>                 URL: https://issues.apache.org/jira/browse/PARQUET-179
>             Project: Parquet
>          Issue Type: Task
>            Reporter: elif dede
>            Assignee: elif dede
>            Priority: Major
>
> Since the _metadata file is no longer used, we can disable its generation to reduce memory usage during the commit phase.
> We are keeping the _common_metadata file since it is less memory-intensive.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)