Posted to issues@hive.apache.org by "Prasanth Jayachandran (JIRA)" <ji...@apache.org> on 2016/03/24 23:03:25 UTC

[jira] [Updated] (HIVE-13361) Orc concatenation should enforce the compression buffer size

     [ https://issues.apache.org/jira/browse/HIVE-13361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prasanth Jayachandran updated HIVE-13361:
-----------------------------------------
    Description: 
With HIVE-11807, buffer size estimation happens by default. This can have an undesired effect on file concatenation. Consider the following table with these files:

{code}
testtable
  -- 000000_0 (created before HIVE-11807, written with a 256KB buffer size)
  -- 000001_0 (created before HIVE-11807, written with a 256KB buffer size)
  -- 000002_0 (created after HIVE-11807, with an estimated buffer size of 128KB)
  -- 000003_0 (created after HIVE-11807, with an estimated buffer size of 128KB)
{code}
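
Each of these files records the compression buffer size it was written with in its footer, so the mix above can be confirmed directly. A minimal sketch of such a check, assuming the Hive 1.x/2.x ORC reader API and a placeholder warehouse path:

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.orc.OrcFile;
import org.apache.hadoop.hive.ql.io.orc.Reader;

public class DumpOrcBufferSizes {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Placeholder location; point this at the table's warehouse directory.
    Path tableDir = new Path("/user/hive/warehouse/testtable");
    FileSystem fs = tableDir.getFileSystem(conf);
    for (FileStatus stat : fs.listStatus(tableDir)) {
      Reader reader = OrcFile.createReader(stat.getPath(), OrcFile.readerOptions(conf));
      // Footer-recorded compression buffer size: 262144 for the
      // pre-HIVE-11807 files, 131072 for the newer estimated ones.
      System.out.println(stat.getPath().getName() + " -> " + reader.getCompressionSize());
    }
  }
}
{code}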

If we perform ALTER TABLE .. CONCATENATE on the above table with HIVE-11807 in place, then, depending on the split arrangement, 000000_0 and 000001_0 will be concatenated into a new merged file. This merged file, however, gets a 128KB buffer size (the estimated buffer size, not the requested one). Since the new ORC writer does not honor the requested buffer size, the merged file ends up with buffers smaller than the required 256KB, making it unreadable. The following exception is thrown when reading the table after concatenation:
{code}
2016-03-24T16:26:33,974 ERROR [a9e27a9a-37cb-411d-9708-6c58a4ce34f2 main]: CliDriver (SessionState.java:printError(1049)) - Failed with exception java.io.IOException:java.lang.IllegalArgumentException: Buffer size too small. size = 131072 needed = 153187
java.io.IOException: java.lang.IllegalArgumentException: Buffer size too small. size = 131072 needed = 153187
        at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:513)
        at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:420)
        at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:145)
        at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:1848)
        at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:256)
        at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:187)
        at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
        at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:782)
        at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:721)
        at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:648)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
{code}
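
What the title asks for, sketched under the assumption that the merge path can hand the writer an exact size: take the largest compression buffer size recorded across the input files and request at least that much for the merged file. The enforceBufferSize() option in the trailing comment is hypothetical; the current WriterOptions.bufferSize() request is exactly what the estimation from HIVE-11807 overrides.

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.orc.OrcFile;
import org.apache.hadoop.hive.ql.io.orc.Reader;

public class MergeBufferSize {
  /**
   * Largest compression buffer size recorded across the files being merged.
   * The merge writer must allocate at least this much per compression chunk,
   * otherwise the merged file fails to read as in the stack trace above.
   */
  public static int requiredBufferSize(Configuration conf, Path... inputs) throws Exception {
    int required = 0;
    for (Path input : inputs) {
      Reader reader = OrcFile.createReader(input, OrcFile.readerOptions(conf));
      required = Math.max(required, reader.getCompressionSize());
    }
    return required;
  }
}
// The merge writer would then be configured roughly as:
//   OrcFile.writerOptions(conf)
//       .compress(compressionKind)
//       .bufferSize(requiredBufferSize(conf, inputFiles))
//       .enforceBufferSize();  // hypothetical flag: take the value as given, skip re-estimation
{code}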


> Orc concatenation should enforce the compression buffer size
> ------------------------------------------------------------
>
>                 Key: HIVE-13361
>                 URL: https://issues.apache.org/jira/browse/HIVE-13361
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 1.3.0, 2.0.0, 2.1.0
>            Reporter: Prasanth Jayachandran
>            Assignee: Prasanth Jayachandran
>            Priority: Critical
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)