You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@orc.apache.org by GitBox <gi...@apache.org> on 2021/03/12 02:11:17 UTC

[GitHub] [orc] pavibhai opened a new pull request #652: ORC-758: Avoid seeking and decompressing of compressed stream

pavibhai opened a new pull request #652:
URL: https://github.com/apache/orc/pull/652


   ### What changes were proposed in this pull request?
   If the seek on a compressed stream is already satisfied by the current decompression then avoid seeking and decompressing the same block again.
   
   ### Why are the changes needed?
   Seeks performed within the same compression block avoid the costs of decompression.
   
   ### How was this patch tested?
   Modified an existing test to verify that the compressed buffer changes only when the seek is needed and not in other cases.
   Additionally all of the existing tests pass.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] pavibhai commented on a change in pull request #652: ORC-758: Avoid seeking and decompressing of compressed stream

Posted by GitBox <gi...@apache.org>.
pavibhai commented on a change in pull request #652:
URL: https://github.com/apache/orc/pull/652#discussion_r598955808



##########
File path: java/core/src/java/org/apache/orc/impl/InStream.java
##########
@@ -558,9 +558,14 @@ public void changeIv(Consumer<byte[]> modifier) {
     public void seek(PositionProvider index) throws IOException {
       boolean seeked = seek(index.getNext());
       long uncompressedBytes = index.getNext();
-      if (!seeked && uncompressed != null) {
-        // Only reposition uncompressed
-        uncompressed.position((int) uncompressedBytes);
+      if (!seeked) {
+        if (uncompressed != null) {
+          // Only reposition uncompressed
+          uncompressed.position((int) uncompressedBytes);
+        } else {

Review comment:
       Sure, dropped the empty block




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] pgaref commented on a change in pull request #652: ORC-758: Avoid seeking and decompressing of compressed stream

Posted by GitBox <gi...@apache.org>.
pgaref commented on a change in pull request #652:
URL: https://github.com/apache/orc/pull/652#discussion_r597907354



##########
File path: java/core/src/java/org/apache/orc/impl/InStream.java
##########
@@ -397,13 +397,14 @@ public String toString() {
     }
   }
 
-  private static class CompressedStream extends InStream {
+  static class CompressedStream extends InStream {

Review comment:
       Shall we make this public as Encrypted and Uncompressed above?

##########
File path: java/core/src/java/org/apache/orc/impl/InStream.java
##########
@@ -551,15 +556,20 @@ public void changeIv(Consumer<byte[]> modifier) {
 
     @Override
     public void seek(PositionProvider index) throws IOException {
-      seek(index.getNext());
+      boolean seeked = seek(index.getNext());
       long uncompressedBytes = index.getNext();
-      if (uncompressedBytes != 0) {
-        readHeader();
-        uncompressed.position(uncompressed.position() +
-                              (int) uncompressedBytes);
-      } else if (uncompressed != null) {
-        // mark the uncompressed buffer as done
-        uncompressed.position(uncompressed.limit());
+      if (!seeked && uncompressed != null) {
+        // Only reposition uncompressed
+        uncompressed.position((int) uncompressedBytes);
+      } else {
+        if (uncompressedBytes != 0) {

Review comment:
       add some explanation here as well?

##########
File path: java/core/src/test/org/apache/orc/impl/TestInStream.java
##########
@@ -434,8 +437,30 @@ public void testCompressed() throws Exception {
       int x = in.read();
       assertEquals(i & 0xff, x);
     }
+
+    // Forward seek test
+    int seeksPerformed = 0;
+    for(int i=0; i < 1024; ++i) {
+      long pos = positions[i].getNext();
+      positions[i].reset();
+      ByteBuffer c = ((InStream.CompressedStream) in).compressed;
+      if (pos == ((InStream.CompressedStream) in).currentCompressedStart) {
+        // Should have the same ByteBuffer if no seek has taken place
+        in.seek(positions[i]);
+        assertEquals(c, ((InStream.CompressedStream) in).compressed);
+      } else {
+        seeksPerformed += 1;
+        in.seek(positions[i]);
+        assertNotEquals(c, ((InStream.CompressedStream) in).compressed);
+      }
+      positions[i].reset();
+      assertEquals(i & 0xff, in.read());
+    }
+    assertEquals(1, seeksPerformed);

Review comment:
       Why are we assuming 1 seek here? do we have 2 chunks total?
   Shall we move this to its own test? 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] pavibhai commented on a change in pull request #652: ORC-758: Avoid seeking and decompressing of compressed stream

Posted by GitBox <gi...@apache.org>.
pavibhai commented on a change in pull request #652:
URL: https://github.com/apache/orc/pull/652#discussion_r598817629



##########
File path: java/core/src/java/org/apache/orc/impl/InStream.java
##########
@@ -551,15 +556,20 @@ public void changeIv(Consumer<byte[]> modifier) {
 
     @Override
     public void seek(PositionProvider index) throws IOException {
-      seek(index.getNext());
+      boolean seeked = seek(index.getNext());
       long uncompressedBytes = index.getNext();
-      if (uncompressedBytes != 0) {
-        readHeader();
-        uncompressed.position(uncompressed.position() +
-                              (int) uncompressedBytes);
-      } else if (uncompressed != null) {
-        // mark the uncompressed buffer as done
-        uncompressed.position(uncompressed.limit());
+      if (!seeked && uncompressed != null) {

Review comment:
       @omalley !seeked would mean that previously a readHeader has taken place which would result in the uncompressed being allocated, unless when someone seeks to -1. I will add an assert that this is unexpected, if you have any other suggestion please let me know.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] omalley commented on a change in pull request #652: ORC-758: Avoid seeking and decompressing of compressed stream

Posted by GitBox <gi...@apache.org>.
omalley commented on a change in pull request #652:
URL: https://github.com/apache/orc/pull/652#discussion_r598019072



##########
File path: java/core/src/java/org/apache/orc/impl/InStream.java
##########
@@ -397,13 +397,14 @@ public String toString() {
     }
   }
 
-  private static class CompressedStream extends InStream {
+  static class CompressedStream extends InStream {
     private final int bufferSize;
     private ByteBuffer uncompressed;
     private final CompressionCodec codec;
     protected ByteBuffer compressed;
     protected DiskRangeList currentRange;
     private boolean isUncompressedOriginal;
+    protected long currentCompressedStart;

Review comment:
       This should be initialized to -1 so that we don't have someone seek to 0 and get incorrect data.

##########
File path: java/core/src/test/org/apache/orc/impl/TestInStream.java
##########
@@ -19,6 +19,9 @@
 package org.apache.orc.impl;

Review comment:
       It would be nice to have a test case for blocks that used the original bytes (header.isOriginal).

##########
File path: java/core/src/java/org/apache/orc/impl/InStream.java
##########
@@ -551,15 +556,20 @@ public void changeIv(Consumer<byte[]> modifier) {
 
     @Override
     public void seek(PositionProvider index) throws IOException {
-      seek(index.getNext());
+      boolean seeked = seek(index.getNext());
       long uncompressedBytes = index.getNext();
-      if (uncompressedBytes != 0) {
-        readHeader();
-        uncompressed.position(uncompressed.position() +
-                              (int) uncompressedBytes);
-      } else if (uncompressed != null) {
-        // mark the uncompressed buffer as done
-        uncompressed.position(uncompressed.limit());
+      if (!seeked && uncompressed != null) {

Review comment:
       Can we get into the case of !seeked && uncompressed == null? I suspect not, but this code would probably be doing the wrong thing by ignoring the case.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] pavibhai commented on a change in pull request #652: ORC-758: Avoid seeking and decompressing of compressed stream

Posted by GitBox <gi...@apache.org>.
pavibhai commented on a change in pull request #652:
URL: https://github.com/apache/orc/pull/652#discussion_r597950429



##########
File path: java/core/src/java/org/apache/orc/impl/InStream.java
##########
@@ -551,15 +556,20 @@ public void changeIv(Consumer<byte[]> modifier) {
 
     @Override
     public void seek(PositionProvider index) throws IOException {
-      seek(index.getNext());
+      boolean seeked = seek(index.getNext());
       long uncompressedBytes = index.getNext();
-      if (uncompressedBytes != 0) {
-        readHeader();
-        uncompressed.position(uncompressed.position() +
-                              (int) uncompressedBytes);
-      } else if (uncompressed != null) {
-        // mark the uncompressed buffer as done
-        uncompressed.position(uncompressed.limit());
+      if (!seeked && uncompressed != null) {
+        // Only reposition uncompressed
+        uncompressed.position((int) uncompressedBytes);
+      } else {
+        if (uncompressedBytes != 0) {

Review comment:
       Added




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] pavibhai commented on a change in pull request #652: ORC-758: Avoid seeking and decompressing of compressed stream

Posted by GitBox <gi...@apache.org>.
pavibhai commented on a change in pull request #652:
URL: https://github.com/apache/orc/pull/652#discussion_r597950586



##########
File path: java/core/src/test/org/apache/orc/impl/TestInStream.java
##########
@@ -434,8 +437,30 @@ public void testCompressed() throws Exception {
       int x = in.read();
       assertEquals(i & 0xff, x);
     }
+
+    // Forward seek test
+    int seeksPerformed = 0;
+    for(int i=0; i < 1024; ++i) {
+      long pos = positions[i].getNext();
+      positions[i].reset();
+      ByteBuffer c = ((InStream.CompressedStream) in).compressed;
+      if (pos == ((InStream.CompressedStream) in).currentCompressedStart) {
+        // Should have the same ByteBuffer if no seek has taken place
+        in.seek(positions[i]);
+        assertEquals(c, ((InStream.CompressedStream) in).compressed);
+      } else {
+        seeksPerformed += 1;
+        in.seek(positions[i]);
+        assertNotEquals(c, ((InStream.CompressedStream) in).compressed);
+      }
+      positions[i].reset();
+      assertEquals(i & 0xff, in.read());
+    }
+    assertEquals(1, seeksPerformed);

Review comment:
       Makes sense let me add a separate test for this




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] pavibhai commented on a change in pull request #652: ORC-758: Avoid seeking and decompressing of compressed stream

Posted by GitBox <gi...@apache.org>.
pavibhai commented on a change in pull request #652:
URL: https://github.com/apache/orc/pull/652#discussion_r598817629



##########
File path: java/core/src/java/org/apache/orc/impl/InStream.java
##########
@@ -551,15 +556,20 @@ public void changeIv(Consumer<byte[]> modifier) {
 
     @Override
     public void seek(PositionProvider index) throws IOException {
-      seek(index.getNext());
+      boolean seeked = seek(index.getNext());
       long uncompressedBytes = index.getNext();
-      if (uncompressedBytes != 0) {
-        readHeader();
-        uncompressed.position(uncompressed.position() +
-                              (int) uncompressedBytes);
-      } else if (uncompressed != null) {
-        // mark the uncompressed buffer as done
-        uncompressed.position(uncompressed.limit());
+      if (!seeked && uncompressed != null) {

Review comment:
       @omalley !seeked would mean that previously a readHeader has taken place which would result in the uncompressed being allocated, unless when someone seeks to -1.
   
   I see that a read following a seek even in this case shall allocate uncompressed and do the required, so just adding documentation to describe the scenario.
   
   I am conflicted between failure and documentation, let me know what you think. For now it is documentation and I have added a test to ensure that in this case the read takes place from the beginning.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] pavibhai commented on a change in pull request #652: ORC-758: Avoid seeking and decompressing of compressed stream

Posted by GitBox <gi...@apache.org>.
pavibhai commented on a change in pull request #652:
URL: https://github.com/apache/orc/pull/652#discussion_r598805851



##########
File path: java/core/src/java/org/apache/orc/impl/InStream.java
##########
@@ -471,6 +472,7 @@ private int readHeaderByte() {
     }
 
     private void readHeader() throws IOException {
+      currentCompressedStart = compressed.position();

Review comment:
       This is a good catch @omalley fixed this and added a test to verify that the seek skip takes place even when the DiskRange does not perfectly align with the stream




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] pavibhai commented on a change in pull request #652: ORC-758: Avoid seeking and decompressing of compressed stream

Posted by GitBox <gi...@apache.org>.
pavibhai commented on a change in pull request #652:
URL: https://github.com/apache/orc/pull/652#discussion_r598802721



##########
File path: java/core/src/test/org/apache/orc/impl/TestInStream.java
##########
@@ -19,6 +19,9 @@
 package org.apache.orc.impl;

Review comment:
       Added the test case




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] omalley commented on a change in pull request #652: ORC-758: Avoid seeking and decompressing of compressed stream

Posted by GitBox <gi...@apache.org>.
omalley commented on a change in pull request #652:
URL: https://github.com/apache/orc/pull/652#discussion_r598022784



##########
File path: java/core/src/java/org/apache/orc/impl/InStream.java
##########
@@ -471,6 +472,7 @@ private int readHeaderByte() {
     }
 
     private void readHeader() throws IOException {
+      currentCompressedStart = compressed.position();

Review comment:
       I think this needs to be this.position.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] pavibhai commented on a change in pull request #652: ORC-758: Avoid seeking and decompressing of compressed stream

Posted by GitBox <gi...@apache.org>.
pavibhai commented on a change in pull request #652:
URL: https://github.com/apache/orc/pull/652#discussion_r597947405



##########
File path: java/core/src/java/org/apache/orc/impl/InStream.java
##########
@@ -397,13 +397,14 @@ public String toString() {
     }
   }
 
-  private static class CompressedStream extends InStream {
+  static class CompressedStream extends InStream {

Review comment:
       I agree will make public that is clearer and consistent. Only the EncryptedCompressedStream remains as private and not public after this change.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] pgaref commented on pull request #652: ORC-758: Avoid seeking and decompressing of compressed stream

Posted by GitBox <gi...@apache.org>.
pgaref commented on pull request #652:
URL: https://github.com/apache/orc/pull/652#issuecomment-803987057


   Hey @pavibhai can you please check the remaining comments? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] omalley closed pull request #652: ORC-758: Avoid seeking and decompressing of compressed stream

Posted by GitBox <gi...@apache.org>.
omalley closed pull request #652:
URL: https://github.com/apache/orc/pull/652


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] omalley commented on a change in pull request #652: ORC-758: Avoid seeking and decompressing of compressed stream

Posted by GitBox <gi...@apache.org>.
omalley commented on a change in pull request #652:
URL: https://github.com/apache/orc/pull/652#discussion_r598921532



##########
File path: java/core/src/java/org/apache/orc/impl/InStream.java
##########
@@ -558,9 +558,14 @@ public void changeIv(Consumer<byte[]> modifier) {
     public void seek(PositionProvider index) throws IOException {
       boolean seeked = seek(index.getNext());
       long uncompressedBytes = index.getNext();
-      if (!seeked && uncompressed != null) {
-        // Only reposition uncompressed
-        uncompressed.position((int) uncompressedBytes);
+      if (!seeked) {
+        if (uncompressed != null) {
+          // Only reposition uncompressed
+          uncompressed.position((int) uncompressedBytes);
+        } else {

Review comment:
       I'd drop the empty else block, but the comment is good.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] pavibhai commented on a change in pull request #652: ORC-758: Avoid seeking and decompressing of compressed stream

Posted by GitBox <gi...@apache.org>.
pavibhai commented on a change in pull request #652:
URL: https://github.com/apache/orc/pull/652#discussion_r598805052



##########
File path: java/core/src/java/org/apache/orc/impl/InStream.java
##########
@@ -397,13 +397,14 @@ public String toString() {
     }
   }
 
-  private static class CompressedStream extends InStream {
+  static class CompressedStream extends InStream {
     private final int bufferSize;
     private ByteBuffer uncompressed;
     private final CompressionCodec codec;
     protected ByteBuffer compressed;
     protected DiskRangeList currentRange;
     private boolean isUncompressedOriginal;
+    protected long currentCompressedStart;

Review comment:
       Fixed




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] pavibhai commented on a change in pull request #652: ORC-758: Avoid seeking and decompressing of compressed stream

Posted by GitBox <gi...@apache.org>.
pavibhai commented on a change in pull request #652:
URL: https://github.com/apache/orc/pull/652#discussion_r598817629



##########
File path: java/core/src/java/org/apache/orc/impl/InStream.java
##########
@@ -551,15 +556,20 @@ public void changeIv(Consumer<byte[]> modifier) {
 
     @Override
     public void seek(PositionProvider index) throws IOException {
-      seek(index.getNext());
+      boolean seeked = seek(index.getNext());
       long uncompressedBytes = index.getNext();
-      if (uncompressedBytes != 0) {
-        readHeader();
-        uncompressed.position(uncompressed.position() +
-                              (int) uncompressedBytes);
-      } else if (uncompressed != null) {
-        // mark the uncompressed buffer as done
-        uncompressed.position(uncompressed.limit());
+      if (!seeked && uncompressed != null) {

Review comment:
       @omalley !seeked would mean that previously a readHeader has taken place which would result in the uncompressed being allocated, unless when someone seeks to -1.
   
   I see that a read following a seek even in this case shall allocate uncompressed and do the required, so just adding documentation to describe the scenario.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org