You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@orc.apache.org by GitBox <gi...@apache.org> on 2022/06/23 11:30:04 UTC

[GitHub] [orc] abstractdog opened a new pull request, #1166: ORC-1205: Size of batches in some ConvertTreeReaders should be ensured before using

abstractdog opened a new pull request, #1166:
URL: https://github.com/apache/orc/pull/1166

   
   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. File a JIRA issue first and use it as a prefix of your PR title, e.g., `ORC-001: Fix ABC`.
     2. Use your PR title to summarize what this PR proposes instead of describing the problem.
     3. Make PR title and description complete because these will be the permanent commit log.
     4. If possible, provide a concise and reproducible example to reproduce the issue for a faster review.
     5. If the PR is unfinished, use GitHub PR Draft feature.
   -->
   
   ### What changes were proposed in this pull request?
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If there is a discussion in the mailing list, please add the link.
   -->
   
   
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] guiyanakuang commented on a diff in pull request #1166: ORC-1205: Size of batches in some ConvertTreeReaders should be ensured before using

Posted by GitBox <gi...@apache.org>.
guiyanakuang commented on code in PR #1166:
URL: https://github.com/apache/orc/pull/1166#discussion_r905946771


##########
java/core/src/java/org/apache/orc/impl/ConvertTreeReaderFactory.java:
##########
@@ -1764,6 +1812,9 @@ public void nextVector(ColumnVector previousVector,
                                                    " proleptic Gregorian dates.");
           }
         }
+      } else {
+        bytesColVector.ensureSize(batchSize, false);
+        dateColumnVector.ensureSize(batchSize, false);

Review Comment:
   `dateColumnVector` can be ignored. This is because it is the vector obtained by forcing `previousVector` .
   
   https://github.com/apache/orc/blob/3eac56761f48f3e315eea56dc63de9f2cbadcdbb/java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java#L1423
   
   `computeBatchSize` ensures that batchSize <= the size of the input vector.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] abstractdog commented on pull request #1166: ORC-1205: Size of batches in some ConvertTreeReaders should be ensured before using

Posted by GitBox <gi...@apache.org>.
abstractdog commented on PR #1166:
URL: https://github.com/apache/orc/pull/1166#issuecomment-1165379280

   I see, makes sense @guiyanakuang
   as I'm now interested in fixing the issue ASAP for ConvertTreeReaders, the best way is what you proposed in Jira, right?
   ```
       if (doubleColVector == null) {
           // Allocate column vector for file; cast column vector for reader.
           doubleColVector = new DoubleColumnVector(batchSize);
           longColVector = (LongColumnVector) previousVector;
       }
   +   else {
   +       doubleColVector.ensureSize(batchSize, false);
   +   }
   ```
   without touching the interface
   
   if so, let me refresh the PR soon with the fix


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] abstractdog commented on pull request #1166: ORC-1205: Size of batches in some ConvertTreeReaders should be ensured before using

Posted by GitBox <gi...@apache.org>.
abstractdog commented on PR #1166:
URL: https://github.com/apache/orc/pull/1166#issuecomment-1165412772

   thanks @guiyanakuang, just force pushed the fix 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] dongjoon-hyun commented on pull request #1166: ORC-1205: Size of batches in some ConvertTreeReaders should be ensured before using

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on PR #1166:
URL: https://github.com/apache/orc/pull/1166#issuecomment-1165036632

   BTW, @abstractdog . Which ORC version are you using now?
   - The latest one is https://orc.apache.org/news/2022/06/16/ORC-1.7.5/ 
   - 1.6.x will be EOL soon.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] abstractdog commented on a diff in pull request #1166: ORC-1205: Size of batches in some ConvertTreeReaders should be ensured before using

Posted by GitBox <gi...@apache.org>.
abstractdog commented on code in PR #1166:
URL: https://github.com/apache/orc/pull/1166#discussion_r905954918


##########
java/core/src/java/org/apache/orc/impl/ConvertTreeReaderFactory.java:
##########
@@ -1764,6 +1812,9 @@ public void nextVector(ColumnVector previousVector,
                                                    " proleptic Gregorian dates.");
           }
         }
+      } else {
+        bytesColVector.ensureSize(batchSize, false);
+        dateColumnVector.ensureSize(batchSize, false);

Review Comment:
   handled it, thanks!
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] guiyanakuang commented on a diff in pull request #1166: ORC-1205: Size of batches in some ConvertTreeReaders should be ensured before using

Posted by GitBox <gi...@apache.org>.
guiyanakuang commented on code in PR #1166:
URL: https://github.com/apache/orc/pull/1166#discussion_r906020399


##########
java/core/src/java/org/apache/orc/impl/ConvertTreeReaderFactory.java:
##########
@@ -1764,6 +1812,9 @@ public void nextVector(ColumnVector previousVector,
                                                    " proleptic Gregorian dates.");
           }
         }
+      } else {
+        bytesColVector.ensureSize(batchSize, false);
+        dateColumnVector.ensureSize(batchSize, false);

Review Comment:
   > `dateColumnVector` can be ignored. This is because it is the vector obtained by forcing `previousVector` .
   > 
   > https://github.com/apache/orc/blob/3eac56761f48f3e315eea56dc63de9f2cbadcdbb/java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java#L1423
   > 
   > `computeBatchSize` ensures that batchSize <= the size of the input vector.
   
   @abstractdog, in addition to this, LGTM



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] guiyanakuang commented on a diff in pull request #1166: ORC-1205: Size of batches in some ConvertTreeReaders should be ensured before using

Posted by GitBox <gi...@apache.org>.
guiyanakuang commented on code in PR #1166:
URL: https://github.com/apache/orc/pull/1166#discussion_r905923547


##########
java/core/src/java/org/apache/orc/impl/ConvertTreeReaderFactory.java:
##########
@@ -1764,6 +1812,9 @@ public void nextVector(ColumnVector previousVector,
                                                    " proleptic Gregorian dates.");
           }
         }
+      } else {
+        bytesColVector.ensureSize(batchSize, false);
+        dateColumnVector.ensureSize(batchSize, false);

Review Comment:
   `dateColumnVector.ensureSize(batchSize, false);` may throw a null pointer exception and we may need to check additionally.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] dongjoon-hyun closed pull request #1166: ORC-1205: `nextVector` should invoke `ensureSize` when reusing vectors

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun closed pull request #1166: ORC-1205: `nextVector` should invoke `ensureSize` when reusing vectors
URL: https://github.com/apache/orc/pull/1166


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] dongjoon-hyun commented on pull request #1166: ORC-1205: Size of batches in some ConvertTreeReaders should be ensured before using

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on PR #1166:
URL: https://github.com/apache/orc/pull/1166#issuecomment-1165872257

   @abstractdog 
   
   > hive is on ORC 1.6.9 so I'm motivated to backport the patch to branch-1.6 too if it can make its way to at least a single release, but if it becomes EOL, I don't mind, we're fixing this on 1.7.x too, hive should upgrade eventually anyway :)
   
   For the above your claim, there are more technical issues which we don't think next Apache ORC 1.6.15 could be used by Apache Hive official release.
   
   AFAIK,
   1. Although I upgraded Apache Hive to use 1.6.9 via [HIVE-25384](https://github.com/apache/hive/pull/2530). I'm not sure we can have the official Apache Hive 4.0.0 release in the future.
   2. In addition, this patch alone cannot unblock Apache Hive's Apache ORC adoption because it has been blocked by other technical issues since Apache ORC 1.6.10 ~ 1.6.14.3. 
   <img width="663" alt="Screen Shot 2022-06-24 at 12 16 15 PM" src="https://user-images.githubusercontent.com/9700541/175650596-6de0e24d-4ed4-467f-bfe2-41850ede3474.png">
   
   3. For the other releases, Apache Hive 3.1 depends on Apache ORC 1.5.8 and Apache Hive 2.3 depends on Apache ORC 1.3.4. So, they are irrelevant to Apache ORC 1.6.14.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] guiyanakuang commented on pull request #1166: ORC-1205: Size of batches in some ConvertTreeReaders should be ensured before using

Posted by GitBox <gi...@apache.org>.
guiyanakuang commented on PR #1166:
URL: https://github.com/apache/orc/pull/1166#issuecomment-1165343055

   What I mean is that this pr is more like a refactoring than a problem fix.
   
   The ensureSize interface is defined in the TypeReader, and if this pr is treated as a refactoring, our modification looks only half done. There are still many temporary reuse vector calls to ensureSize in Various TreeReaderImpl, rather than using the new interface to control size (e.g. line 2131).
   
   https://github.com/apache/orc/blob/3eac56761f48f3e315eea56dc63de9f2cbadcdbb/java/core/src/java/org/apache/orc/impl/TreeReaderFactory.java#L2126-L2134
   
   So I think it would be better to use a separate pr to abstract the code and use the public ensureSize interface (both ConvertTreeReaderImpl and TreeReaderImpl use this interface to control the size of the vectors).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] guiyanakuang commented on pull request #1166: ORC-1205: Size of batches in some ConvertTreeReaders should be ensured before using

Posted by GitBox <gi...@apache.org>.
guiyanakuang commented on PR #1166:
URL: https://github.com/apache/orc/pull/1166#issuecomment-1165383655

   > I see, makes sense @guiyanakuang as I'm now interested in fixing the issue ASAP for ConvertTreeReaders, the best way is what you proposed in Jira, right?
   > 
   > ```
   >     if (doubleColVector == null) {
   >         // Allocate column vector for file; cast column vector for reader.
   >         doubleColVector = new DoubleColumnVector(batchSize);
   >         longColVector = (LongColumnVector) previousVector;
   >     }
   > +   else {
   > +       doubleColVector.ensureSize(batchSize, false);
   > +   }
   > ```
   > 
   > without touching the interface
   > 
   > if so, let me refresh the PR soon with the fix
   
   Yes I do think so.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] guiyanakuang commented on pull request #1166: ORC-1205: Size of batches in some ConvertTreeReaders should be ensured before using

Posted by GitBox <gi...@apache.org>.
guiyanakuang commented on PR #1166:
URL: https://github.com/apache/orc/pull/1166#issuecomment-1165168687

   Thanks to @abstractdog  for the extensive test coverage. I verified that the patch works.
   
   The current reader inheritance relationship looks like this
   
   ```
   TypeReader +--> TreeReader +---> ConvertTreeReader +-> Various ConvertTreeReaderImpl
                       +
                       |
                       +--------------------------------> Various TreeReaderImpl
   ```
   The new ensureSize interface is called recursively to ensure the size of the vector, which is a good solution to the out-of-bounds problem in Various ConvertTreeReaderImpl. However, in the Various TreeReaderImpl code path we had the ensureSize method being called repeatedly. I think this is a good refactoring design and I wonder if we should simply fix the out-of-bounds issue in the ConvertTreeReader first. Then refactor the ensureSize entry in another pr to be consistent.
   
   Please let me know if I am not correct.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] abstractdog commented on pull request #1166: ORC-1205: Size of batches in some ConvertTreeReaders should be ensured before using

Posted by GitBox <gi...@apache.org>.
abstractdog commented on PR #1166:
URL: https://github.com/apache/orc/pull/1166#issuecomment-1165883176

   finally, this doesn't introduce any APIs, updating descriptions


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] abstractdog commented on a diff in pull request #1166: ORC-1205: Size of batches in some ConvertTreeReaders should be ensured before using

Posted by GitBox <gi...@apache.org>.
abstractdog commented on code in PR #1166:
URL: https://github.com/apache/orc/pull/1166#discussion_r906035132


##########
java/core/src/java/org/apache/orc/impl/ConvertTreeReaderFactory.java:
##########
@@ -1764,6 +1812,9 @@ public void nextVector(ColumnVector previousVector,
                                                    " proleptic Gregorian dates.");
           }
         }
+      } else {
+        bytesColVector.ensureSize(batchSize, false);
+        dateColumnVector.ensureSize(batchSize, false);

Review Comment:
   okay, removed
   thanks a lot for the quick reviews @guiyanakuang !



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] abstractdog commented on pull request #1166: ORC-1205: Size of batches in some ConvertTreeReaders should be ensured before using

Posted by GitBox <gi...@apache.org>.
abstractdog commented on PR #1166:
URL: https://github.com/apache/orc/pull/1166#issuecomment-1165296502

   thanks for the comments so far!
   
   @dongjoon-hyun: hive is on ORC 1.6.9, so I'm motivated to backport the patch to branch-1.6 too if it can make its way to at least a single release, but if it becomes EOL, I don't mind, we're fixing this on 1.7.x too, hive should upgrade eventually anyway :)
   
   @guiyanakuang: can you please elaborate on this:
   ```
   in the Various TreeReaderImpl code path we had the ensureSize method being called repeatedly
   ```
   it's not clear where exactly this happens, and why it is a problem, but otherwise, I'm fine with fixing this on ConvertTreeReaders only: so did you mean that the ensureSize method can stay exposed in TypeReader interface (as it is in the current patch), and let it be a no-op in TreeReaderFactory (so ConvertTreeReader will implement it instead of calling super.ensureSize)?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] abstractdog commented on pull request #1166: ORC-1205: Size of batches in some ConvertTreeReaders should be ensured before using

Posted by GitBox <gi...@apache.org>.
abstractdog commented on PR #1166:
URL: https://github.com/apache/orc/pull/1166#issuecomment-1165818845

   thanks for your comments and approving this PR @guiyanakuang 
   if patch looks good, I'm about to open some PR with the very same contents, I'll update jira
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] dongjoon-hyun commented on pull request #1166: ORC-1205: Size of batches in some ConvertTreeReaders should be ensured before using

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on PR #1166:
URL: https://github.com/apache/orc/pull/1166#issuecomment-1165879657

   In addition, 
   - Please revise the PR description. The current one is invalid.
   > This fix introduces a new API for TreeReaders: ensureSize
   
   - After revising the PR description, please copy and paste it into your new PRs. PR title and description become a commit title and message. So, the following is an invalid PR description in your backporting PRs in Apache ORC community.
   > same patch as main


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org