You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@orc.apache.org by GitBox <gi...@apache.org> on 2021/03/29 16:11:05 UTC

[GitHub] [orc] sv2000 opened a new pull request #674: ORC-777: Make the vectorized row batch size configurable in MR record…

sv2000 opened a new pull request #674:
URL: https://github.com/apache/orc/pull/674


   … readers and writers
   
   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. File a JIRA issue first and use it as a prefix of your PR title, e.g., `ORC-001: Fix ABC`.
     2. Use your PR title to summarize what this PR proposes instead of describing the problem.
     3. Make PR title and description complete because these will be the permanent commit log.
     4. If possible, provide a concise and reproducible example to reproduce the issue for a faster review.
     5. If the PR is unfinished, use GitHub PR Draft feature.
   -->
   Current behavior of OrcInput(Output)Format is to create an ORC reader/writer with the default row batch size of 1024. This may need to be overridden in certain scenarios to prevent OOMs during reads/writes.
   
   ### What changes were proposed in this pull request?
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If there is a discussion in the mailing list, please add the link.
   -->
   The changes in this PR expose a new ORC config "orc.row.batch.size", which can be set as part of the MR job configuration to configure the MR readers/writers. The configuration is retrieved inside the ORCInput(Output)Format classes, before the readers/writers are initialized. 
   
   
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   The change is needed to provide greater configurability of the ORC MR readers/writers.
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   -->
   The existing unit tests in TestOrcOutputFormat and TestMapreduceOrcOutputFormat cover the changes to the signature of the constructors in OrcMapreduceRecordWriter and OrcMapreduceRecordReader. The change to override the row batch size config was also tested by running a MR job that used the snapshot version of the orc-mapreduce library.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] dongjoon-hyun commented on a change in pull request #674: ORC-777: Make the vectorized row batch size configurable in MR record readers and writers

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #674:
URL: https://github.com/apache/orc/pull/674#discussion_r608034069



##########
File path: java/mapreduce/src/java/org/apache/orc/mapred/OrcMapredRecordReader.java
##########
@@ -69,14 +69,19 @@ public OrcMapredRecordReader(RecordReader reader,
   }
 
   protected OrcMapredRecordReader(Reader fileReader,
-                                  Reader.Options options) throws IOException {
+      Reader.Options options) throws IOException {

Review comment:
       Could you keep the original style here? This looks inconsistent with the other existing one at line 64 and with your new code at line 77.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] dongjoon-hyun commented on a change in pull request #674: ORC-777: Make the vectorized row batch size configurable in MR record…

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #674:
URL: https://github.com/apache/orc/pull/674#discussion_r603812715



##########
File path: java/mapreduce/src/java/org/apache/orc/mapreduce/OrcMapreduceRecordReader.java
##########
@@ -53,16 +53,20 @@ public OrcMapreduceRecordReader(RecordReader reader,
     rowInBatch = 0;
     this.row = (V) OrcStruct.createValue(schema);
   }
+  public OrcMapreduceRecordReader(Reader fileReader,
+      Reader.Options options) throws IOException {
+    this(fileReader, options, VectorizedRowBatch.DEFAULT_SIZE);

Review comment:
       Is there a reason why we use `DEFAULT_SIZE`, a fixed value?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] dongjoon-hyun commented on a change in pull request #674: ORC-777: Make the vectorized row batch size configurable in MR record readers and writers

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #674:
URL: https://github.com/apache/orc/pull/674#discussion_r608035981



##########
File path: java/mapreduce/src/java/org/apache/orc/mapreduce/OrcMapreduceRecordReader.java
##########
@@ -53,16 +53,20 @@ public OrcMapreduceRecordReader(RecordReader reader,
     rowInBatch = 0;
     this.row = (V) OrcStruct.createValue(schema);
   }
+  public OrcMapreduceRecordReader(Reader fileReader,
+      Reader.Options options) throws IOException {

Review comment:
       Shall we match the indentation with line 49?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] pgaref merged pull request #674: ORC-777: Make the vectorized row batch size configurable in MR record readers and writers

Posted by GitBox <gi...@apache.org>.
pgaref merged pull request #674:
URL: https://github.com/apache/orc/pull/674


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] sv2000 commented on pull request #674: ORC-777: Make the vectorized row batch size configurable in MR record…

Posted by GitBox <gi...@apache.org>.
sv2000 commented on pull request #674:
URL: https://github.com/apache/orc/pull/674#issuecomment-811491607


   @pgaref IIUC, what you are proposing is to have an overloaded constructor in OrcMapreduceRecordReader/Writer that accepts rowBatchSize as an argument (like we do in this PR), but leave it to the caller to override the getRecordWriter() method, which in turn calls the overloaded constructor. This way we don't need to introduce a separate config. LMK if I am missing anything. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] dongjoon-hyun commented on a change in pull request #674: ORC-777: Make the vectorized row batch size configurable in MR record readers and writers

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #674:
URL: https://github.com/apache/orc/pull/674#discussion_r608034440



##########
File path: java/mapreduce/src/java/org/apache/orc/mapreduce/OrcMapreduceRecordReader.java
##########
@@ -53,16 +53,20 @@ public OrcMapreduceRecordReader(RecordReader reader,
     rowInBatch = 0;
     this.row = (V) OrcStruct.createValue(schema);
   }

Review comment:
       Please add one empty line between constructors.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] dongjoon-hyun commented on pull request #674: ORC-777: Make the vectorized row batch size configurable in MR record readers and writers

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #674:
URL: https://github.com/apache/orc/pull/674#issuecomment-812910483


   If we reach an agreement, shall we move forward in that way, @sv2000 ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] sv2000 commented on pull request #674: ORC-777: Make the vectorized row batch size configurable in MR record…

Posted by GitBox <gi...@apache.org>.
sv2000 commented on pull request #674:
URL: https://github.com/apache/orc/pull/674#issuecomment-812057013


   > > @pgaref IIUC, what you are proposing is to have an overloaded constructor in OrcMapreduceRecordReader/Writer that accepts rowBatchSize as an argument (like we do in this PR), but leave it to the caller to override the getRecordWriter() method, which in turn calls the overloaded constructor. This way we don't need to introduce a separate config. LMK if I am missing anything.
   > 
   > Hey @sv2000 -- yes you are on point. Just want to avoid having an extra configuration here as it could confuse users.
   > If we can introduce a new constructor for Reader/Writer and release the change (on 1.6 or 1.7) it could sort out the issue.
   > Thoughts?
   
   @pgaref I am ok with the proposed change. One potential downside though would be that every user that needs to configure the row batch size will have to override the getRecordWriter() method. But may be worth the cost to pay to avoid the confusion of introducing a new config. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] pgaref commented on pull request #674: ORC-777: Make the vectorized row batch size configurable in MR record…

Posted by GitBox <gi...@apache.org>.
pgaref commented on pull request #674:
URL: https://github.com/apache/orc/pull/674#issuecomment-811772623


   > @pgaref IIUC, what you are proposing is to have an overloaded constructor in OrcMapreduceRecordReader/Writer that accepts rowBatchSize as an argument (like we do in this PR), but leave it to the caller to override the getRecordWriter() method, which in turn calls the overloaded constructor. This way we don't need to introduce a separate config. LMK if I am missing anything.
   
   Hey @sv2000  -- yes you are on point. Just want to avoid having an extra configuration here as it could confuse users. 
   If we can introduce a new constructor for Reader/Writer and release the change (on 1.6 or 1.7) it could sort out the issue.
   Thoughts?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] sv2000 commented on pull request #674: ORC-777: Make the vectorized row batch size configurable in MR record readers and writers

Posted by GitBox <gi...@apache.org>.
sv2000 commented on pull request #674:
URL: https://github.com/apache/orc/pull/674#issuecomment-815360658


   @pgaref Added unit tests to use the overloaded constructors and updated the PR description. LMK if the changes look ok.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] sv2000 commented on pull request #674: ORC-777: Make the vectorized row batch size configurable in MR record readers and writers

Posted by GitBox <gi...@apache.org>.
sv2000 commented on pull request #674:
URL: https://github.com/apache/orc/pull/674#issuecomment-813708275


   @pgaref @dongjoon-hyun Addressed comments on this PR. Can you help take a look? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] pgaref commented on pull request #674: ORC-777: Make the vectorized row batch size configurable in MR record readers and writers

Posted by GitBox <gi...@apache.org>.
pgaref commented on pull request #674:
URL: https://github.com/apache/orc/pull/674#issuecomment-814751376


   Hey @sv2000  thanks for making the changes -- thats exactly what I had in mind!
   Let update PRs description and some of the Mapreduce Tests to use the new constructors -- will merge after that.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] dongjoon-hyun commented on pull request #674: ORC-777: Make the vectorized row batch size configurable in MR record readers and writers

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #674:
URL: https://github.com/apache/orc/pull/674#issuecomment-814290896


   To @sv2000 . I left a few minor comments about coding styling.
   To @pgaref . Could you confirm that all of your concerns are addressed, please?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] sv2000 commented on pull request #674: ORC-777: Make the vectorized row batch size configurable in MR record readers and writers

Posted by GitBox <gi...@apache.org>.
sv2000 commented on pull request #674:
URL: https://github.com/apache/orc/pull/674#issuecomment-814362154


   Thanks @dongjoon-hyun. I addressed your comments. Can you help take a look?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] sv2000 commented on pull request #674: ORC-777: Make the vectorized row batch size configurable in MR record…

Posted by GitBox <gi...@apache.org>.
sv2000 commented on pull request #674:
URL: https://github.com/apache/orc/pull/674#issuecomment-810422059


   @dongjoon-hyun Thanks for your comment! With regard to "Is there a reason why we use DEFAULT_SIZE, a fixed value?", I use the default value to maintain backward compatibility when orc.row.batch.size is not configured.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] dongjoon-hyun commented on a change in pull request #674: ORC-777: Make the vectorized row batch size configurable in MR record readers and writers

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #674:
URL: https://github.com/apache/orc/pull/674#discussion_r608037472



##########
File path: java/mapreduce/src/java/org/apache/orc/mapreduce/OrcMapreduceRecordReader.java
##########
@@ -53,16 +53,20 @@ public OrcMapreduceRecordReader(RecordReader reader,
     rowInBatch = 0;
     this.row = (V) OrcStruct.createValue(schema);
   }
+  public OrcMapreduceRecordReader(Reader fileReader,
+      Reader.Options options) throws IOException {
+    this(fileReader, options, VectorizedRowBatch.DEFAULT_SIZE);
+  }
 
   public OrcMapreduceRecordReader(Reader fileReader,
-                                  Reader.Options options) throws IOException {
+                                  Reader.Options options, int rowBatchSize) throws IOException {

Review comment:
       Let's put `rowBatchSize` in a new line like the following.
   ```java
   Reader fileReader,
   Reader.Options options,
   int rowBatchSize) throws IOException {
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org