Posted to issues@orc.apache.org by "Nikola (Jira)" <ji...@apache.org> on 2020/05/10 16:09:00 UTC
[jira] [Updated] (ORC-633) Skip broken ORC files when reading
[ https://issues.apache.org/jira/browse/ORC-633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nikola updated ORC-633:
-----------------------
Description:
I am reading a directory of ORC files with Flink, and some of the files are broken.
Opening a broken file fails with an exception like this:
{code:java}
org.apache.orc.FileFormatException: Not a valid ORC file /user/orc/0.orc (maxFileLength= 9223372036854775807)
at org.apache.orc.impl.ReaderImpl.extractFileTail(ReaderImpl.java:546)
at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:370)
at org.apache.orc.OrcFile.createReader(OrcFile.java:342)
at org.apache.flink.orc.OrcRowInputFormat.open(OrcRowInputFormat.java:225)
at org.apache.flink.orc.OrcRowInputFormat.open(OrcRowInputFormat.java:63)
at org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:173)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:705)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530){code}
I have also enabled the "skip corrupt data" option in my configuration:
{code:java}
conf.setBoolean(OrcConf.SKIP_CORRUPT_DATA.getAttribute(), true);{code}
but it only handles a specific case and does not skip broken files.
Is it possible to skip every broken ORC file, whatever the cause, and read only the valid ones?
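As a possible stopgap until the reader can do this itself, the input paths could be pre-filtered before they are handed to the input format. The sketch below is only an illustration under stated assumptions: `OrcPreFilter`, `hasOrcMagic`, and `keepLikelyOrc` are hypothetical names, not part of ORC or Flink, and it reads local files through `java.nio` rather than through a Hadoop file system. It relies on the fact that an ORC file starts with the three ASCII bytes "ORC", so it only catches files that are empty, unreadable, or not ORC at all. A file that is truncated or damaged further in would still pass this check.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for illustration; not part of ORC or Flink. */
final class OrcPreFilter {
    // ORC files begin with the three ASCII bytes "ORC".
    private static final byte[] MAGIC = "ORC".getBytes(StandardCharsets.US_ASCII);

    private OrcPreFilter() {}

    /** Returns true if the file can be read and starts with the ORC magic bytes. */
    static boolean hasOrcMagic(Path file) {
        try (DataInputStream in = new DataInputStream(Files.newInputStream(file))) {
            byte[] head = new byte[MAGIC.length];
            in.readFully(head); // throws EOFException if the file is shorter than 3 bytes
            return Arrays.equals(head, MAGIC);
        } catch (IOException e) {
            // Assumption: anything that cannot be read is treated as broken and skipped.
            return false;
        }
    }

    /** Keeps only the candidates that pass the magic check, preserving their order. */
    static List<Path> keepLikelyOrc(List<Path> candidates) {
        List<Path> kept = new ArrayList<>();
        for (Path p : candidates) {
            if (hasOrcMagic(p)) {
                kept.add(p);
            }
        }
        return kept;
    }
}
```

A stricter variant of the same idea would try to open each file with `OrcFile.createReader` and skip it when that call throws, which is the call that fails in the stack trace above.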
was:
I am reading a path with ORC files using flink. However, some of them are broken.
I get exceptions like this:
org.apache.orc.FileFormatException: Not a valid ORC file /user/orc/0.orc (maxFileLength= 9223372036854775807)
at org.apache.orc.impl.ReaderImpl.extractFileTail(ReaderImpl.java:546)
at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:370)
at org.apache.orc.OrcFile.createReader(OrcFile.java:342)
at org.apache.flink.orc.OrcRowInputFormat.open(OrcRowInputFormat.java:225)
at org.apache.flink.orc.OrcRowInputFormat.open(OrcRowInputFormat.java:63)
at org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:173)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:705)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530)
> Skip broken ORC files when reading
> ----------------------------------
>
> Key: ORC-633
> URL: https://issues.apache.org/jira/browse/ORC-633
> Project: ORC
> Issue Type: Improvement
> Components: Reader
> Affects Versions: 1.6.3
> Reporter: Nikola
> Priority: Critical
>
> I am reading a directory of ORC files with Flink, and some of the files are broken.
> Opening a broken file fails with an exception like this:
> {code:java}
> org.apache.orc.FileFormatException: Not a valid ORC file /user/orc/0.orc (maxFileLength= 9223372036854775807)
> at org.apache.orc.impl.ReaderImpl.extractFileTail(ReaderImpl.java:546)
> at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:370)
> at org.apache.orc.OrcFile.createReader(OrcFile.java:342)
> at org.apache.flink.orc.OrcRowInputFormat.open(OrcRowInputFormat.java:225)
> at org.apache.flink.orc.OrcRowInputFormat.open(OrcRowInputFormat.java:63)
> at org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:173)
> at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:705)
> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530){code}
>
> I have also enabled the "skip corrupt data" option in my configuration:
> {code:java}
> conf.setBoolean(OrcConf.SKIP_CORRUPT_DATA.getAttribute(), true);{code}
>
> but it only handles a specific case and does not skip broken files.
> Is it possible to skip every broken ORC file, whatever the cause, and read only the valid ones?