You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@orc.apache.org by "zratkai (via GitHub)" <gi...@apache.org> on 2023/03/07 17:07:41 UTC

[GitHub] [orc] zratkai opened a new pull request, #1431: ORC-1384 Fix ArrayIndexOutOfBoundsException when reading dictionary s…

zratkai opened a new pull request, #1431:
URL: https://github.com/apache/orc/pull/1431

   …tream bigger then dictionary
   
   ### What changes were proposed in this pull request? Avoid  ArrayIndexOutOfBoundsException when reading dictionary stream bigger then dictionary. Check the size of the dictionary and input and read only the min of those.
   
   ### Why are the changes needed?
   In Hive when reading with LLAP data is read in 4kB blocks which leads to ArrayIndexOutOfBoundsException when the dictionary is smaller.
   
   ### How was this patch tested?
   It is tested with HIVE's qtest, since here we do not have the necessary subclasses.
   
   This closes #1384
   
   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. File a JIRA issue first and use it as a prefix of your PR title, e.g., `ORC-001: Fix ABC`.
     2. Use your PR title to summarize what this PR proposes instead of describing the problem.
     3. Make PR title and description complete because these will be the permanent commit log.
     4. If possible, provide a concise and reproducible example to reproduce the issue for a faster review.
     5. If the PR is unfinished, use GitHub PR Draft feature.
   -->
   
   ### What changes were proposed in this pull request?
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If there is a discussion in the mailing list, please add the link.
   -->
   
   
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] zratkai commented on a diff in pull request #1431: ORC-1384 Fix ArrayIndexOutOfBoundsException when reading dictionary stream bigger then dictionary

Posted by "zratkai (via GitHub)" <gi...@apache.org>.
zratkai commented on code in PR #1431:
URL: https://github.com/apache/orc/pull/1431#discussion_r1129262725


##########
java/core/src/java/org/apache/orc/impl/TreeReaderFactory.java:
##########
@@ -2292,10 +2292,12 @@ private void readDictionaryStream(InStream in) throws IOException {
           int dictionaryBufferSize = dictionaryOffsets[dictionaryOffsets.length - 1];
           dictionaryBuffer = new byte[dictionaryBufferSize];
           int pos = 0;
-          int chunkSize = in.available();
-          byte[] chunkBytes = new byte[chunkSize];
+          //check if dictionary size is smaller than available stream size
+          // to avoid ArrayIndexOutOfBoundsException
+          int readSize = Math.min(in.available(), dictionaryBufferSize);
+          byte[] chunkBytes = new byte[readSize];
           while (pos < dictionaryBufferSize) {
-            int currentLength = in.read(chunkBytes, 0, chunkSize);
+            int currentLength = in.read(chunkBytes, 0, readSize);
             System.arraycopy(chunkBytes, 0, dictionaryBuffer, pos, currentLength);
             pos += currentLength;

Review Comment:
   @guiyanakuang ok, I understand it now. I combined my changes with your suggestions. Combining this two reduces the memory usage also. With that this cases are covered:
   
   Case 1: 
   in.available() = 4096
   dictionaryBuffer = 900
   readSize = 900
   It  creates a chunkBytes array only with 900 bytes (instead of 4096 bytes) and reads 900 bytes in one run.
   
   Case2:
   in.available() = 4k
   dictionaryBufferSize = 5k
   readSize = 4k
   It  creates a chunkBytes with 4kB and reads 4kB at first run, then it reads 1kB and copies the remaining 1kB to the dictionaryBuffer.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] guiyanakuang commented on a diff in pull request #1431: ORC-1384 Fix ArrayIndexOutOfBoundsException when reading dictionary stream bigger then dictionary

Posted by "guiyanakuang (via GitHub)" <gi...@apache.org>.
guiyanakuang commented on code in PR #1431:
URL: https://github.com/apache/orc/pull/1431#discussion_r1129240248


##########
java/core/src/java/org/apache/orc/impl/TreeReaderFactory.java:
##########
@@ -2292,10 +2292,12 @@ private void readDictionaryStream(InStream in) throws IOException {
           int dictionaryBufferSize = dictionaryOffsets[dictionaryOffsets.length - 1];
           dictionaryBuffer = new byte[dictionaryBufferSize];
           int pos = 0;
-          int chunkSize = in.available();
-          byte[] chunkBytes = new byte[chunkSize];
+          //check if dictionary size is smaller than available stream size
+          // to avoid ArrayIndexOutOfBoundsException
+          int readSize = Math.min(in.available(), dictionaryBufferSize);
+          byte[] chunkBytes = new byte[readSize];
           while (pos < dictionaryBufferSize) {
-            int currentLength = in.read(chunkBytes, 0, chunkSize);
+            int currentLength = in.read(chunkBytes, 0, readSize);
             System.arraycopy(chunkBytes, 0, dictionaryBuffer, pos, currentLength);
             pos += currentLength;

Review Comment:
   > Actually this won't be a problem, because if the readSize is bigger than the available input, it just returns the available here.
   
   The example I gave was specific to the LLAP environment. Could you test it out? I'm not worried about other cases, as available is always accurate.
   
   
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] deshanxiao commented on pull request #1431: ORC-1384 Fix ArrayIndexOutOfBoundsException when reading dictionary stream bigger then dictionary

Posted by "deshanxiao (via GitHub)" <gi...@apache.org>.
deshanxiao commented on PR #1431:
URL: https://github.com/apache/orc/pull/1431#issuecomment-1459548080

   Thank you for making the PR.
   Could you make the CI happy with `mvn checkstyle:check -Panalyze -Pbenchmark` ? @zratkai 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] guiyanakuang commented on a diff in pull request #1431: ORC-1384 Fix ArrayIndexOutOfBoundsException when reading dictionary stream bigger then dictionary

Posted by "guiyanakuang (via GitHub)" <gi...@apache.org>.
guiyanakuang commented on code in PR #1431:
URL: https://github.com/apache/orc/pull/1431#discussion_r1129240248


##########
java/core/src/java/org/apache/orc/impl/TreeReaderFactory.java:
##########
@@ -2292,10 +2292,12 @@ private void readDictionaryStream(InStream in) throws IOException {
           int dictionaryBufferSize = dictionaryOffsets[dictionaryOffsets.length - 1];
           dictionaryBuffer = new byte[dictionaryBufferSize];
           int pos = 0;
-          int chunkSize = in.available();
-          byte[] chunkBytes = new byte[chunkSize];
+          //check if dictionary size is smaller than available stream size
+          // to avoid ArrayIndexOutOfBoundsException
+          int readSize = Math.min(in.available(), dictionaryBufferSize);
+          byte[] chunkBytes = new byte[readSize];
           while (pos < dictionaryBufferSize) {
-            int currentLength = in.read(chunkBytes, 0, chunkSize);
+            int currentLength = in.read(chunkBytes, 0, readSize);
             System.arraycopy(chunkBytes, 0, dictionaryBuffer, pos, currentLength);
             pos += currentLength;

Review Comment:
   > Actually this won't be a problem, because if the readSize is bigger than the available input, it just returns the available here.
   
   The example I gave was specific to the LLAP environment. Could you test it out? I'm not worried about other cases, because "available" and "read" are always accurate.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] zratkai commented on a diff in pull request #1431: ORC-1384: Fix ArrayIndexOutOfBoundsException when reading dictionary stream bigger then dictionary

Posted by "zratkai (via GitHub)" <gi...@apache.org>.
zratkai commented on code in PR #1431:
URL: https://github.com/apache/orc/pull/1431#discussion_r1130783498


##########
java/core/src/java/org/apache/orc/impl/TreeReaderFactory.java:
##########
@@ -2292,10 +2292,12 @@ private void readDictionaryStream(InStream in) throws IOException {
           int dictionaryBufferSize = dictionaryOffsets[dictionaryOffsets.length - 1];
           dictionaryBuffer = new byte[dictionaryBufferSize];
           int pos = 0;
-          int chunkSize = in.available();
-          byte[] chunkBytes = new byte[chunkSize];
+          //check if dictionary size is smaller than available stream size
+          // to avoid ArrayIndexOutOfBoundsException
+          int readSize = Math.min(in.available(), dictionaryBufferSize);
+          byte[] chunkBytes = new byte[readSize];
           while (pos < dictionaryBufferSize) {
-            int currentLength = in.read(chunkBytes, 0, chunkSize);
+            int currentLength = in.read(chunkBytes, 0, readSize);
             System.arraycopy(chunkBytes, 0, dictionaryBuffer, pos, currentLength);
             pos += currentLength;

Review Comment:
   @guiyanakuang can we close this thread? Can you merge this PR?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] guiyanakuang commented on a diff in pull request #1431: ORC-1384 Fix ArrayIndexOutOfBoundsException when reading dictionary stream bigger then dictionary

Posted by "guiyanakuang (via GitHub)" <gi...@apache.org>.
guiyanakuang commented on code in PR #1431:
URL: https://github.com/apache/orc/pull/1431#discussion_r1129152573


##########
java/core/src/java/org/apache/orc/impl/TreeReaderFactory.java:
##########
@@ -2292,10 +2292,12 @@ private void readDictionaryStream(InStream in) throws IOException {
           int dictionaryBufferSize = dictionaryOffsets[dictionaryOffsets.length - 1];
           dictionaryBuffer = new byte[dictionaryBufferSize];
           int pos = 0;
-          int chunkSize = in.available();
-          byte[] chunkBytes = new byte[chunkSize];
+          //check if dictionary size is smaller than available stream size
+          // to avoid ArrayIndexOutOfBoundsException
+          int readSize = Math.min(in.available(), dictionaryBufferSize);
+          byte[] chunkBytes = new byte[readSize];
           while (pos < dictionaryBufferSize) {
-            int currentLength = in.read(chunkBytes, 0, chunkSize);
+            int currentLength = in.read(chunkBytes, 0, readSize);
             System.arraycopy(chunkBytes, 0, dictionaryBuffer, pos, currentLength);
             pos += currentLength;

Review Comment:
   Would this fix be better?
   ```suggestion
               currentLength = Math.min(currentLength, dictionaryBufferSize - pos);
               System.arraycopy(chunkBytes, 0, dictionaryBuffer, pos, currentLength);
               pos += currentLength;
   ```



##########
java/core/src/java/org/apache/orc/impl/TreeReaderFactory.java:
##########
@@ -2292,10 +2292,11 @@ private void readDictionaryStream(InStream in) throws IOException {
           int dictionaryBufferSize = dictionaryOffsets[dictionaryOffsets.length - 1];
           dictionaryBuffer = new byte[dictionaryBufferSize];
           int pos = 0;
-          int chunkSize = in.available();
-          byte[] chunkBytes = new byte[chunkSize];
+          //check if dictionary size is smaller than available stream size to avoid ArrayIndexOutOfBoundsException
+          int readSize = Math.min(in.available(), dictionaryBufferSize);
+          byte[] chunkBytes = new byte[readSize];

Review Comment:
   @zratkai Thank you for your detailed explanation. I understand your use case.
   
   I think `in.available()` will be recalculated every time a new block is read, right? So this fix seems not perfect.
   For example, at the beginning,
   `in.available()` = 4k
   `dictionaryBufferSize` = 5k
   `readSize` = 4k
   Entering the loop,
   the first time` currentLength` is read as 4k, and then copied.
   The second time `currentLength` may still be 4k, which will still cause an out of bounds error.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] dongjoon-hyun commented on pull request #1431: ORC-1384: Fix ArrayIndexOutOfBoundsException when reading dictionary stream bigger then dictionary

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #1431:
URL: https://github.com/apache/orc/pull/1431#issuecomment-1460676087

   Also cc @williamhyun 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] deshanxiao commented on a diff in pull request #1431: ORC-1384 Fix ArrayIndexOutOfBoundsException when reading dictionary stream bigger then dictionary

Posted by "deshanxiao (via GitHub)" <gi...@apache.org>.
deshanxiao commented on code in PR #1431:
URL: https://github.com/apache/orc/pull/1431#discussion_r1128991003


##########
java/core/src/java/org/apache/orc/impl/TreeReaderFactory.java:
##########
@@ -2292,10 +2292,11 @@ private void readDictionaryStream(InStream in) throws IOException {
           int dictionaryBufferSize = dictionaryOffsets[dictionaryOffsets.length - 1];
           dictionaryBuffer = new byte[dictionaryBufferSize];
           int pos = 0;
-          int chunkSize = in.available();
-          byte[] chunkBytes = new byte[chunkSize];
+          //check if dictionary size is smaller than available stream size to avoid ArrayIndexOutOfBoundsException
+          int readSize = Math.min(in.available(), dictionaryBufferSize);
+          byte[] chunkBytes = new byte[readSize];

Review Comment:
   I'm also curious, `available `currently returns the remaining value of the Bytebuffer, could you provide the specific scenario?
   @zratkai 



##########
java/core/src/java/org/apache/orc/impl/TreeReaderFactory.java:
##########
@@ -2292,10 +2292,11 @@ private void readDictionaryStream(InStream in) throws IOException {
           int dictionaryBufferSize = dictionaryOffsets[dictionaryOffsets.length - 1];
           dictionaryBuffer = new byte[dictionaryBufferSize];
           int pos = 0;
-          int chunkSize = in.available();
-          byte[] chunkBytes = new byte[chunkSize];
+          //check if dictionary size is smaller than available stream size to avoid ArrayIndexOutOfBoundsException
+          int readSize = Math.min(in.available(), dictionaryBufferSize);
+          byte[] chunkBytes = new byte[readSize];

Review Comment:
   I'm also curious, `available `currently returns the remaining value of the Bytebuffer, could you provide the specific scenario? @zratkai 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] zratkai commented on pull request #1431: ORC-1384 Fix ArrayIndexOutOfBoundsException when reading dictionary stream bigger then dictionary

Posted by "zratkai (via GitHub)" <gi...@apache.org>.
zratkai commented on PR #1431:
URL: https://github.com/apache/orc/pull/1431#issuecomment-1459924991

   @deshanxiao thanks for adding this info. Yes, exactly, this was the original issue I started with. To finish it this fix is needed. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] zratkai commented on pull request #1431: ORC-1384: Fix ArrayIndexOutOfBoundsException when reading dictionary stream bigger then dictionary

Posted by "zratkai (via GitHub)" <gi...@apache.org>.
zratkai commented on PR #1431:
URL: https://github.com/apache/orc/pull/1431#issuecomment-1461750260

   @dongjoon-hyun 
   
   > Thank you for making a PR, @zratkai .
   > 
   > * Do you happen to know what caused this regression?
   > * Do you think you can add a small unit test which you described in the commets?
   
   1. It is caused by LLAP in Hive, since it reads in 4kB blocks as I described above.
   2. I was thinking about that, but it is not easy for this modification, since it is in a private method, and we don't have any test even for the public method which calls this (org.apache.orc.impl.TreeReaderFactory.StringDictionaryTreeReader#nextVector) . So it would mean I need to create for that as well from scratch. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] dongjoon-hyun commented on pull request #1431: ORC-1384: Fix `ArrayIndexOutOfBoundsException` when reading dictionary stream bigger then dictionary

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #1431:
URL: https://github.com/apache/orc/pull/1431#issuecomment-1462790306

   Thank you, @zratkai and @guiyanakuang .
   
   Also, cc @williamhyun 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] zratkai commented on a diff in pull request #1431: ORC-1384 Fix ArrayIndexOutOfBoundsException when reading dictionary stream bigger then dictionary

Posted by "zratkai (via GitHub)" <gi...@apache.org>.
zratkai commented on code in PR #1431:
URL: https://github.com/apache/orc/pull/1431#discussion_r1129189106


##########
java/core/src/java/org/apache/orc/impl/TreeReaderFactory.java:
##########
@@ -2292,10 +2292,12 @@ private void readDictionaryStream(InStream in) throws IOException {
           int dictionaryBufferSize = dictionaryOffsets[dictionaryOffsets.length - 1];
           dictionaryBuffer = new byte[dictionaryBufferSize];
           int pos = 0;
-          int chunkSize = in.available();
-          byte[] chunkBytes = new byte[chunkSize];
+          //check if dictionary size is smaller than available stream size
+          // to avoid ArrayIndexOutOfBoundsException
+          int readSize = Math.min(in.available(), dictionaryBufferSize);
+          byte[] chunkBytes = new byte[readSize];
           while (pos < dictionaryBufferSize) {
-            int currentLength = in.read(chunkBytes, 0, chunkSize);
+            int currentLength = in.read(chunkBytes, 0, readSize);
             System.arraycopy(chunkBytes, 0, dictionaryBuffer, pos, currentLength);
             pos += currentLength;

Review Comment:
   @guiyanakuang Thank you for thinking about that. Actually this won't be a problem, because if the readSize is bigger than the available input, it just returns the available here:
   int currentLength = in.read(chunkBytes, 0, readSize); 
   For example:
   readSize = 1000
   available = 947
   then it reads 947 and returns 957 bytes and then System.arraycopy(chunkBytes, 0, dictionaryBuffer, pos, currentLength) copies only 947 bytes.
   
   I tested with the org.apache.orc.TestVectorOrcFile#testWithoutIndex where 
   dictionaryBufferSize = 30947
   readSize = 1000
   
   so it reads with 1000 bytes blocks 30 times , and at the end it reads 947 bytes. 
   
   The problem what I faces is when the available is bigger than the dictionary size.
   In my case:
   available = 4096
   dictionarySize =900
   
   Then it reads 4096 and tries to copy into a 900 byte dictionaryBuffer. 
   
   Your solution looks also which could work. I don't see which solution has more advantages.
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] guiyanakuang commented on a diff in pull request #1431: ORC-1384 Fix ArrayIndexOutOfBoundsException when reading dictionary stream bigger then dictionary

Posted by "guiyanakuang (via GitHub)" <gi...@apache.org>.
guiyanakuang commented on code in PR #1431:
URL: https://github.com/apache/orc/pull/1431#discussion_r1128934893


##########
java/core/src/java/org/apache/orc/impl/TreeReaderFactory.java:
##########
@@ -2292,10 +2292,11 @@ private void readDictionaryStream(InStream in) throws IOException {
           int dictionaryBufferSize = dictionaryOffsets[dictionaryOffsets.length - 1];
           dictionaryBuffer = new byte[dictionaryBufferSize];
           int pos = 0;
-          int chunkSize = in.available();
-          byte[] chunkBytes = new byte[chunkSize];
+          //check if dictionary size is smaller than available stream size to avoid ArrayIndexOutOfBoundsException
+          int readSize = Math.min(in.available(), dictionaryBufferSize);
+          byte[] chunkBytes = new byte[readSize];

Review Comment:
   The previous reading code was implemented on the premise that the total amount of data in the stream is the same as the expected amount to be read. I am curious as to why `in.available()` is greater than `dictionaryBufferSize` in your environment.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] dongjoon-hyun closed pull request #1431: ORC-1384: Fix `ArrayIndexOutOfBoundsException` when reading dictionary stream bigger then dictionary

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun closed pull request #1431: ORC-1384: Fix `ArrayIndexOutOfBoundsException` when reading dictionary stream bigger then dictionary
URL: https://github.com/apache/orc/pull/1431


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] dongjoon-hyun commented on pull request #1431: ORC-1384: Fix `ArrayIndexOutOfBoundsException` when reading dictionary stream bigger then dictionary

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #1431:
URL: https://github.com/apache/orc/pull/1431#issuecomment-1462840767

   BTW, welcome to the Apache ORC community, @zratkai .
   ORC-1384 is resolved and assigned to you.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] zratkai commented on a diff in pull request #1431: ORC-1384 Fix ArrayIndexOutOfBoundsException when reading dictionary stream bigger then dictionary

Posted by "zratkai (via GitHub)" <gi...@apache.org>.
zratkai commented on code in PR #1431:
URL: https://github.com/apache/orc/pull/1431#discussion_r1129103949


##########
java/core/src/java/org/apache/orc/impl/TreeReaderFactory.java:
##########
@@ -2292,10 +2292,11 @@ private void readDictionaryStream(InStream in) throws IOException {
           int dictionaryBufferSize = dictionaryOffsets[dictionaryOffsets.length - 1];
           dictionaryBuffer = new byte[dictionaryBufferSize];
           int pos = 0;
-          int chunkSize = in.available();
-          byte[] chunkBytes = new byte[chunkSize];
+          //check if dictionary size is smaller than available stream size to avoid ArrayIndexOutOfBoundsException
+          int readSize = Math.min(in.available(), dictionaryBufferSize);
+          byte[] chunkBytes = new byte[readSize];

Review Comment:
   In Hive the ORC's Reader classes (e.g. StringTreeReader)  have sublclasses and ORC file is read with LLAP. LLAP uses a default 4kB blocks to read, and this causes that the in.available() can be bigger than dictionary size. (If dictionary is smaller than 4kB.) This fix will not cause any issue if ORC file is read without LLAP, only with ORC classes, since in System.arraycopy() the maximum amount which can be copied can not be bigger than destination or source size. The previous reading mode created a bigger destination array in Hive than necessary. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] deshanxiao commented on pull request #1431: ORC-1384 Fix ArrayIndexOutOfBoundsException when reading dictionary stream bigger then dictionary

Posted by "deshanxiao (via GitHub)" <gi...@apache.org>.
deshanxiao commented on PR #1431:
URL: https://github.com/apache/orc/pull/1431#issuecomment-1459913443

   Actually, the issue blocked the upgrade of ORC in Hive:
   https://github.com/apache/hive/pull/3833


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org