Posted to dev@orc.apache.org by omalley <gi...@git.apache.org> on 2017/08/29 18:58:10 UTC

[GitHub] orc pull request #163: ORC-162. Handle 0 byte files as empty ORC files.

GitHub user omalley opened a pull request:

    https://github.com/apache/orc/pull/163

    ORC-162. Handle 0 byte files as empty ORC files.

    Treat 0 byte files as an empty ORC file with schema of struct<>.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/orc orc-162

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/orc/pull/163.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #163
    
----
commit 3d5561cc444fd04326aef12a00d172de7f1e3573
Author: Owen O'Malley <om...@apache.org>
Date:   2017-08-29T18:07:39Z

    ORC-162. Handle 0 byte files as empty ORC files.
    
    Signed-off-by: Owen O'Malley <om...@apache.org>

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] orc issue #163: ORC-162. Handle 0 byte files as empty ORC files.

Posted by prasanthj <gi...@git.apache.org>.

Github user prasanthj commented on the issue:

    https://github.com/apache/orc/pull/163
  
    Hive creates empty files only for MR, to support bucketed joins. Tez doesn't create empty bucket files anymore. Hive currently discards empty files during split generation. We can do something similar in ORC's version of OrcInputFormat (or add an EmptyFilePathPattern to ignore 0-length files or files <= MAGIC.length). Creating splits for empty files is useless anyway. As for calling the Reader directly with an empty file's path, we can treat it as an empty file with schema struct<>.
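
A minimal sketch of the split-side filtering described above, under stated assumptions: `ListedFile` and `filesNeedingSplits` are hypothetical stand-ins written for this illustration, not Hadoop or ORC classes, and `MAGIC_LENGTH` assumes the magic is the three-byte string "ORC".

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only: stand-in types, not the Hadoop or ORC API. */
final class SplitPlanningSketch {
    /** Assumption: the magic is the 3-byte string "ORC". */
    static final int MAGIC_LENGTH = "ORC".length();

    /** Hypothetical stand-in for a listed file: a path plus its byte length. */
    static final class ListedFile {
        final String path;
        final long length;

        ListedFile(String path, long length) {
            this.path = path;
            this.length = length;
        }
    }

    /** Keeps only files longer than minLength bytes; the others get no split. */
    static List<ListedFile> filesNeedingSplits(List<ListedFile> listed, long minLength) {
        List<ListedFile> keep = new ArrayList<>(listed.size());
        for (ListedFile file : listed) {
            if (file.length > minLength) {
                keep.add(file);
            }
        }
        return keep;
    }
}
```

Passing `0` as `minLength` drops only zero-byte files; passing `MAGIC_LENGTH` also drops files that are no longer than the magic itself.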



[GitHub] orc pull request #163: ORC-162. Handle 0 byte files as empty ORC files.

Posted by prasanthj <gi...@git.apache.org>.

Github user prasanthj commented on a diff in the pull request:

    https://github.com/apache/orc/pull/163#discussion_r136468354
  
    --- Diff: java/mapreduce/src/java/org/apache/orc/mapred/OrcInputFormat.java ---
    @@ -151,4 +153,26 @@ public static void setSearchArgument(Configuration conf,
         return new OrcMapredRecordReader<>(file, buildOptions(conf,
             file, split.getStart(), split.getLength()));
       }
    +
    +  /**
    +   * Filter out the 0 byte files, so that we don't generate splits for the
    +   * empty ORC files.
    +   * @param job the job configuration
    +   * @return a list of files that need to be read
    +   * @throws IOException
    +   */
    +  protected FileStatus[] listStatus(JobConf job) throws IOException {
    +    FileStatus[] result = super.listStatus(job);
    +    List<FileStatus> ok = new ArrayList<>(result.length);
    --- End diff --
    
    Makes sense. I just noticed the filter gets applied after listStatus anyway.



[GitHub] orc issue #163: ORC-162. Handle 0 byte files as empty ORC files.

Posted by omalley <gi...@git.apache.org>.

Github user omalley commented on the issue:

    https://github.com/apache/orc/pull/163
  
    I agree that a filename encoding would be a nice safeguard, but it doesn't work since Hive isn't using that convention. (Hive made the change in Hive's OrcInputFormat, so it didn't move over to the ORC project.)



[GitHub] orc issue #163: ORC-162. Handle 0 byte files as empty ORC files.

Posted by dain <gi...@git.apache.org>.

Github user dain commented on the issue:

    https://github.com/apache/orc/pull/163
  
    We were considering doing this internally, and then we ran into a production bug where files got truncated to zero bytes. Since empty files are illegal, we could find all of the affected partitions easily, but without that guarantee you would be stuck. A good workaround would be to require empty files to end with ".empty", but I'm not sure you can do that with M/R.
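
The naming safeguard suggested above could be sketched as follows. This is only an illustration of the idea: the ".empty" suffix is the convention floated in this comment, and `EmptyMarkerSketch`, `Kind`, and `classify` are hypothetical names, not part of ORC, Hive, or MapReduce.

```java
/** Illustration only: the ".empty" suffix is a proposed convention, not an ORC feature. */
final class EmptyMarkerSketch {
    enum Kind { DATA, INTENTIONALLY_EMPTY, SUSPECT_TRUNCATED }

    static final String EMPTY_SUFFIX = ".empty";

    /**
     * A zero-byte file is accepted as empty only when its name says so;
     * any other zero-byte file is flagged as a possible truncation.
     */
    static Kind classify(String fileName, long length) {
        if (length > 0) {
            return Kind.DATA;
        }
        return fileName.endsWith(EMPTY_SUFFIX)
            ? Kind.INTENTIONALLY_EMPTY
            : Kind.SUSPECT_TRUNCATED;
    }
}
```

With a check like this, a scan for `SUSPECT_TRUNCATED` would still find partitions hit by a truncation bug even after zero-byte files become readable.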



[GitHub] orc issue #163: ORC-162. Handle 0 byte files as empty ORC files.

Posted by omalley <gi...@git.apache.org>.

Github user omalley commented on the issue:

    https://github.com/apache/orc/pull/163
  
    @prasanthj Ok, I added the code that will cause the ORC input formats to not generate splits for the empty files.



[GitHub] orc issue #163: ORC-162. Handle 0 byte files as empty ORC files.

Posted by electrum <gi...@git.apache.org>.

Github user electrum commented on the issue:

    https://github.com/apache/orc/pull/163
  
    We had the same use case of making empty bucket creation more efficient. Encoding the fact that the file is intentionally empty in the name provides a good safeguard against storage system problems that can cause files to be truncated (unfortunately far too common).



[GitHub] orc issue #163: ORC-162. Handle 0 byte files as empty ORC files.

Posted by dain <gi...@git.apache.org>.

Github user dain commented on the issue:

    https://github.com/apache/orc/pull/163
  
    Also this is a backwards incompatible change, so we would, at the very least, need to do the trick where it is disabled by default in the writer until the reader is rolled out everywhere.



[GitHub] orc pull request #163: ORC-162. Handle 0 byte files as empty ORC files.

Posted by prasanthj <gi...@git.apache.org>.

Github user prasanthj commented on a diff in the pull request:

    https://github.com/apache/orc/pull/163#discussion_r136466436
  
    --- Diff: java/mapreduce/src/java/org/apache/orc/mapred/OrcInputFormat.java ---
    @@ -151,4 +153,26 @@ public static void setSearchArgument(Configuration conf,
         return new OrcMapredRecordReader<>(file, buildOptions(conf,
             file, split.getStart(), split.getLength()));
       }
    +
    +  /**
    +   * Filter out the 0 byte files, so that we don't generate splits for the
    +   * empty ORC files.
    +   * @param job the job configuration
    +   * @return a list of files that need to be read
    +   * @throws IOException
    +   */
    +  protected FileStatus[] listStatus(JobConf job) throws IOException {
    +    FileStatus[] result = super.listStatus(job);
    +    List<FileStatus> ok = new ArrayList<>(result.length);
    --- End diff --
    
    Instead of checking this after retrieving all of the FileStatus objects, it would be better to pass a PathFilter to listStatus() so that we only get the non-zero-length files. Fetching thousands of 0-length files and filtering them here seems wasteful.



[GitHub] orc issue #163: ORC-162. Handle 0 byte files as empty ORC files.

Posted by omalley <gi...@git.apache.org>.

Github user omalley commented on the issue:

    https://github.com/apache/orc/pull/163
  
    The problem is that Hive is doing this across the board. See HIVE-13040.
    
    Making the reader not throw is ok, if slightly incompatible. This patch doesn't change the writer to write such files.



[GitHub] orc pull request #163: ORC-162. Handle 0 byte files as empty ORC files.

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/orc/pull/163



[GitHub] orc issue #163: ORC-162. Handle 0 byte files as empty ORC files.

Posted by prasanthj <gi...@git.apache.org>.

Github user prasanthj commented on the issue:

    https://github.com/apache/orc/pull/163
  
    I agree that the reader should handle 0-length files gracefully, as this patch does, instead of throwing. In addition, we should avoid creating splits for 0-length files. Spinning up tasks to read 0-length files is wasteful.



[GitHub] orc pull request #163: ORC-162. Handle 0 byte files as empty ORC files.

Posted by omalley <gi...@git.apache.org>.

Github user omalley commented on a diff in the pull request:

    https://github.com/apache/orc/pull/163#discussion_r136466870
  
    --- Diff: java/mapreduce/src/java/org/apache/orc/mapred/OrcInputFormat.java ---
    @@ -151,4 +153,26 @@ public static void setSearchArgument(Configuration conf,
         return new OrcMapredRecordReader<>(file, buildOptions(conf,
             file, split.getStart(), split.getLength()));
       }
    +
    +  /**
    +   * Filter out the 0 byte files, so that we don't generate splits for the
    +   * empty ORC files.
    +   * @param job the job configuration
    +   * @return a list of files that need to be read
    +   * @throws IOException
    +   */
    +  protected FileStatus[] listStatus(JobConf job) throws IOException {
    +    FileStatus[] result = super.listStatus(job);
    +    List<FileStatus> ok = new ArrayList<>(result.length);
    --- End diff --
    
    That would involve doing a second getStatus. Filters are only given the name.
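
The cost argument above can be illustrated with a toy model. This is a sketch under stated assumptions, not Hadoop code: `NameOnlyFilterSketch` is a hypothetical in-memory file system that counts per-file status lookups, standing in for the extra call a name-only filter would need to learn a file's length.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustration only: a toy file system that counts per-file status lookups. */
final class NameOnlyFilterSketch {
    private final Map<String, Long> lengths = new LinkedHashMap<>();
    int statusLookups = 0;

    void add(String path, long length) {
        lengths.put(path, length);
    }

    /** A listing that already carries each file's length alongside its path. */
    List<Map.Entry<String, Long>> list() {
        return new ArrayList<>(lengths.entrySet());
    }

    /** A per-file lookup: the extra call a name-only filter has to make. */
    long lengthOf(String path) {
        statusLookups++;
        return lengths.get(path);
    }

    /** Filter after listing: uses the lengths already in hand, no extra lookups. */
    List<String> nonEmptyAfterListing() {
        List<String> ok = new ArrayList<>();
        for (Map.Entry<String, Long> entry : list()) {
            if (entry.getValue() > 0) {
                ok.add(entry.getKey());
            }
        }
        return ok;
    }

    /** Filter that sees only the name: one additional lookup per listed file. */
    List<String> nonEmptyViaNameFilter() {
        List<String> ok = new ArrayList<>();
        for (Map.Entry<String, Long> entry : list()) {
            if (lengthOf(entry.getKey()) > 0) {
                ok.add(entry.getKey());
            }
        }
        return ok;
    }
}
```

Both methods return the same files; the difference is that the name-only variant increments `statusLookups` once per listed file, while filtering after the listing leaves it at zero.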



[GitHub] orc issue #163: ORC-162. Handle 0 byte files as empty ORC files.

Posted by omalley <gi...@git.apache.org>.

Github user omalley commented on the issue:

    https://github.com/apache/orc/pull/163
  
    @prasanthj There are customers out there with millions of zero byte ORC files in their Hive warehouses. We need to have the reader not throw when they read them with Spark, etc. Rather than patch each context where Readers may be created, I'd rather fix the core Reader.
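
A rough sketch of what handling this in one place could look like, assuming a single entry point through which every reader is opened. It is not the patch itself: `ZeroByteReaderSketch`, `Summary`, and `FooterParser` are hypothetical names, and the local file system stands in for whatever storage the real reader uses.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Illustration only: short-circuits zero-byte files before any footer parsing. */
final class ZeroByteReaderSketch {
    /** Hypothetical summary of an opened file: its schema string and row count. */
    static final class Summary {
        final String schema;
        final long rows;

        Summary(String schema, long rows) {
            this.schema = schema;
            this.rows = rows;
        }
    }

    /** Hypothetical hook for the normal path that parses a non-empty file. */
    interface FooterParser {
        Summary parse(Path file) throws IOException;
    }

    /** Zero-byte files are reported as empty with schema struct<>; others are parsed. */
    static Summary open(Path file, FooterParser parser) throws IOException {
        if (Files.size(file) == 0) {
            return new Summary("struct<>", 0);
        }
        return parser.parse(file);
    }
}
```

Because the check sits in the shared open path, every caller (MapReduce, Spark, or a direct Reader user) gets the same behavior without its own special case.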

