Posted to dev@parquet.apache.org by "Shivam Dalmia (JIRA)" <ji...@apache.org> on 2017/08/28 10:39:00 UTC

[jira] [Commented] (PARQUET-1080) Empty Parquet Files created as a result of spark jobs fail when read

    [ https://issues.apache.org/jira/browse/PARQUET-1080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16143618#comment-16143618 ] 

Shivam Dalmia commented on PARQUET-1080:
----------------------------------------

Of course, this happens because an empty directory is created, which is an intermittent scenario that is often difficult to replicate.

So:
1. How and why are these empty Parquet files being created? Any leads would be helpful.
2. Is there any way to have Spark check whether the directory is empty or corrupt before attempting to infer its schema? (A sketch of one such pre-read check follows below.)
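Regarding (2), here is a minimal sketch of a pre-read guard, assuming an active SparkSession bound to {{spark}} (as in spark-shell) and a hypothetical output path: list the target directory through the Hadoop FileSystem API and only hand it to the Parquet reader when at least one part file is present.

{code:scala}
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical output path; substitute the real job output location.
val outputPath = new Path("/data/output.parquet")
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

// Treat the directory as readable only if it exists and contains at
// least one part file, so schema inference has footers to work on.
val hasParquetParts =
  fs.exists(outputPath) &&
  fs.listStatus(outputPath).exists(s => s.isFile && s.getPath.getName.endsWith(".parquet"))

val df =
  if (hasParquetParts) Some(spark.read.parquet(outputPath.toString))
  else None // skip the read instead of failing during schema inference
{code}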

> Empty Parquet Files created as a result of spark jobs fail when read
> --------------------------------------------------------------------
>
>                 Key: PARQUET-1080
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1080
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-format
>            Reporter: Shivam Dalmia
>            Priority: Minor
>
> I have intermittently faced an issue with certain Spark jobs that write Parquet files: the jobs apparently succeed, but the written .parquet directory in HDFS is empty (with no _SUCCESS or _metadata parts, even). Surprisingly, no errors are thrown from the Spark DataFrame writer.
> However, when attempting to read the written output, Spark throws the error:
> {{Unable to infer schema for Parquet. It must be specified manually}}
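As the error message itself suggests, supplying the schema explicitly sidesteps the failed inference. A minimal sketch, assuming the same SparkSession and a hypothetical two-column schema:

{code:scala}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical schema; substitute the real columns of the written data.
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = true),
  StructField("name", StringType, nullable = true)
))

// With an explicit schema, Spark has no need to inspect Parquet footers,
// so reading an empty directory should yield an empty DataFrame rather
// than the "Unable to infer schema for Parquet" error.
val df = spark.read.schema(schema).parquet("/data/output.parquet")
{code}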


