You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@parquet.apache.org by "Fokko (via GitHub)" <gi...@apache.org> on 2023/04/29 17:57:34 UTC

[GitHub] [parquet-mr] Fokko opened a new pull request, #1084: Bring back support for Hadoop 2.7.3

Fokko opened a new pull request, #1084:
URL: https://github.com/apache/parquet-mr/pull/1084

   Make sure you have checked _all_ steps below.
   
   ### Jira
   
   - [ ] My PR addresses the following [Parquet Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references them in the PR title. For example, "PARQUET-1234: My Parquet PR"
     - https://issues.apache.org/jira/browse/PARQUET-XXX
     - In case you are adding a dependency, check if the license complies with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x).
   
   ### Tests
   
   - [ ] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason:
   
   ### Commits
   
   - [ ] My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)":
     1. Subject is separated from body by a blank line
     1. Subject is limited to 50 characters (not including Jira issue reference)
     1. Subject does not end with a period
     1. Subject uses the imperative mood ("add", not "adding")
     1. Body wraps at 72 characters
     1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [ ] In case of new functionality, my PR adds documentation that describes how to use it.
     - All the public functions and the classes in the PR contain Javadoc that explain what it does
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] Fokko commented on a diff in pull request #1084: PARQUET-2276: Bring back support for Hadoop 2.7.3

Posted by "Fokko (via GitHub)" <gi...@apache.org>.
Fokko commented on code in PR #1084:
URL: https://github.com/apache/parquet-mr/pull/1084#discussion_r1185348684


##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java:
##########
@@ -55,23 +95,31 @@ public static SeekableInputStream wrap(FSDataInputStream stream) {
 
   /**
    * Is the inner stream byte buffer readable?
-   * The test is "the stream is not FSDataInputStream
+   * The test is 'the stream is not FSDataInputStream
    * and implements ByteBufferReadable'
    *
    * That is: all streams which implement ByteBufferReadable
-   * other than FSDataInputStream successfuly support read(ByteBuffer).
-   * This is true for all filesytem clients the hadoop codebase.
+   * other than FSDataInputStream successfully support read(ByteBuffer).
+   * This is true for all filesystem clients the hadoop codebase.
    *
    * In hadoop 3.3.0+, the StreamCapabilities probe can be used to
    * check this: only those streams which provide the read(ByteBuffer)
    * semantics MAY return true for the probe "in:readbytebuffer";
    * FSDataInputStream will pass the probe down to the underlying stream.
    *
    * @param stream stream to probe
-   * @return true if it is safe to a H2SeekableInputStream to access the data
+   * @return true if it is safe to a H2SeekableInputStream to access
+   * the data, null when it cannot be determined because of missing hasCapabilities
    */
-  private static boolean isWrappedStreamByteBufferReadable(FSDataInputStream stream) {
-    if (stream.hasCapability("in:readbytebuffer")) {
+  private static Boolean isWrappedStreamByteBufferReadableHasCapabilities(FSDataInputStream stream) {

Review Comment:
   I agree, I've changed it to `isWrappedStreamByteBufferReadable`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] Fokko merged pull request #1084: PARQUET-2276: Bring back support for Hadoop 2.7.3

Posted by "Fokko (via GitHub)" <gi...@apache.org>.
Fokko merged PR #1084:
URL: https://github.com/apache/parquet-mr/pull/1084


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] rdblue commented on a diff in pull request #1084: PARQUET-2276: Bring back support for Hadoop 2.7.3

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on code in PR #1084:
URL: https://github.com/apache/parquet-mr/pull/1084#discussion_r1182862929


##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java:
##########
@@ -37,6 +41,39 @@ public class HadoopStreams {
 
   private static final Logger LOG = LoggerFactory.getLogger(HadoopStreams.class);
 
+  private static final Class<?> byteBufferReadableClass = getReadableClass();
+  static final Constructor<SeekableInputStream> h2SeekableConstructor = getH2SeekableConstructor();

Review Comment:
   Rather than using two static methods, you can use `DynConstructors` instead to make this one expression and reduce error handling code:
   
   ```java
   private static final DynConstructors.Ctor<SeekableInputStream> h2streamCtor =
       DynConstructors.Builder(SeekableInputStream.class)
       .impl("org.apache.parquet.hadoop.util.H2SeekableInputStream", FSDataInputStream.class)
       .orNull()
       .build()
   
   ...
   if (h2streamCtor != null) {
     return h2streamCtor.newInstance(stream);
   }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] Fokko commented on a diff in pull request #1084: PARQUET-2276: Bring back support for Hadoop 2.7.3

Posted by "Fokko (via GitHub)" <gi...@apache.org>.
Fokko commented on code in PR #1084:
URL: https://github.com/apache/parquet-mr/pull/1084#discussion_r1183066231


##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java:
##########
@@ -46,41 +83,91 @@ public class HadoopStreams {
    */
   public static SeekableInputStream wrap(FSDataInputStream stream) {
     Objects.requireNonNull(stream, "Cannot wrap a null input stream");
-    if (isWrappedStreamByteBufferReadable(stream)) {
-      return new H2SeekableInputStream(stream);
-    } else {
-      return new H1SeekableInputStream(stream);
+
+    // Try to check using hasCapabilities(str)
+    Boolean hasCapabilitiesResult = isWrappedStreamByteBufferReadableHasCapabilities(stream);
+
+    // If it is null, then fall back to the old method
+    if (hasCapabilitiesResult != null) {
+      if (hasCapabilitiesResult) {
+        return new H2SeekableInputStream(stream);
+      } else {
+        return new H1SeekableInputStream(stream);
+      }
+    }
+
+    return isWrappedStreamByteBufferReadableLegacy(stream);
+  }
+
+  /**
+   * Is the inner stream byte buffer readable?
+   * The test is 'the stream is not FSDataInputStream
+   * and implements ByteBufferReadable'
+   *
+   * This logic is only used for Hadoop <2.9.x, and <3.x.x
+   *
+   * @param stream stream to probe
+   * @return A H2SeekableInputStream to access, or H1SeekableInputStream if the stream is not seekable
+   */
+  private static SeekableInputStream isWrappedStreamByteBufferReadableLegacy(FSDataInputStream stream) {
+    InputStream wrapped = stream.getWrappedStream();
+    if (wrapped instanceof FSDataInputStream) {
+      LOG.debug("Checking on wrapped stream {} of {} whether is ByteBufferReadable", wrapped, stream);
+      return isWrappedStreamByteBufferReadableLegacy(((FSDataInputStream) wrapped));

Review Comment:
   This came from the issue from Presto: https://github.com/prestodb/presto/pull/17435



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] Fokko commented on a diff in pull request #1084: PARQUET-2276: Bring back support for Hadoop 2.7.3

Posted by "Fokko (via GitHub)" <gi...@apache.org>.
Fokko commented on code in PR #1084:
URL: https://github.com/apache/parquet-mr/pull/1084#discussion_r1183080135


##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java:
##########
@@ -46,41 +83,91 @@ public class HadoopStreams {
    */
   public static SeekableInputStream wrap(FSDataInputStream stream) {
     Objects.requireNonNull(stream, "Cannot wrap a null input stream");
-    if (isWrappedStreamByteBufferReadable(stream)) {
-      return new H2SeekableInputStream(stream);
-    } else {
-      return new H1SeekableInputStream(stream);
+
+    // Try to check using hasCapabilities(str)
+    Boolean hasCapabilitiesResult = isWrappedStreamByteBufferReadableHasCapabilities(stream);
+
+    // If it is null, then fall back to the old method
+    if (hasCapabilitiesResult != null) {
+      if (hasCapabilitiesResult) {
+        return new H2SeekableInputStream(stream);
+      } else {
+        return new H1SeekableInputStream(stream);
+      }
+    }
+
+    return isWrappedStreamByteBufferReadableLegacy(stream);
+  }
+
+  /**
+   * Is the inner stream byte buffer readable?
+   * The test is 'the stream is not FSDataInputStream
+   * and implements ByteBufferReadable'
+   *
+   * This logic is only used for Hadoop <2.9.x, and <3.x.x
+   *
+   * @param stream stream to probe
+   * @return A H2SeekableInputStream to access, or H1SeekableInputStream if the stream is not seekable
+   */
+  private static SeekableInputStream isWrappedStreamByteBufferReadableLegacy(FSDataInputStream stream) {
+    InputStream wrapped = stream.getWrappedStream();
+    if (wrapped instanceof FSDataInputStream) {
+      LOG.debug("Checking on wrapped stream {} of {} whether is ByteBufferReadable", wrapped, stream);
+      return isWrappedStreamByteBufferReadableLegacy(((FSDataInputStream) wrapped));
+    }
+    if (byteBufferReadableClass != null && h2SeekableConstructor != null &&
+      byteBufferReadableClass.isInstance(stream.getWrappedStream())) {
+      try {
+        return h2SeekableConstructor.newInstance(stream);
+      } catch (InstantiationException | IllegalAccessException e) {
+        LOG.warn("Could not instantiate H2SeekableInputStream, falling back to byte array reads", e);
+      } catch (InvocationTargetException e) {
+        throw new ParquetDecodingException(
+          "Could not instantiate H2SeekableInputStream", e.getTargetException());
+      }
     }
+    return new H1SeekableInputStream(stream);
   }
 
   /**
    * Is the inner stream byte buffer readable?
-   * The test is "the stream is not FSDataInputStream
+   * The test is 'the stream is not FSDataInputStream
    * and implements ByteBufferReadable'
    *
    * That is: all streams which implement ByteBufferReadable
-   * other than FSDataInputStream successfuly support read(ByteBuffer).
-   * This is true for all filesytem clients the hadoop codebase.
+   * other than FSDataInputStream successfully support read(ByteBuffer).
+   * This is true for all filesystem clients the hadoop codebase.
    *
    * In hadoop 3.3.0+, the StreamCapabilities probe can be used to
    * check this: only those streams which provide the read(ByteBuffer)
    * semantics MAY return true for the probe "in:readbytebuffer";
    * FSDataInputStream will pass the probe down to the underlying stream.
    *
    * @param stream stream to probe
-   * @return true if it is safe to a H2SeekableInputStream to access the data
+   * @return true if it is safe to a H2SeekableInputStream to access
+   *         the data, null when it cannot be determined
    */
-  private static boolean isWrappedStreamByteBufferReadable(FSDataInputStream stream) {
-    if (stream.hasCapability("in:readbytebuffer")) {
-      // stream is issuing the guarantee that it implements the
-      // API. Holds for all implementations in hadoop-*
-      // since Hadoop 3.3.0 (HDFS-14111).
-      return true;
+  private static Boolean isWrappedStreamByteBufferReadableHasCapabilities(FSDataInputStream stream) {
+    Method methodHasCapabilities;
+    try {
+      methodHasCapabilities = stream.getClass().getMethod("hasCapability", String.class);

Review Comment:
   Added it



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] rdblue commented on a diff in pull request #1084: PARQUET-2276: Bring back support for Hadoop 2.7.3

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on code in PR #1084:
URL: https://github.com/apache/parquet-mr/pull/1084#discussion_r1186279482


##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java:
##########
@@ -55,23 +95,31 @@ public static SeekableInputStream wrap(FSDataInputStream stream) {
 
   /**
    * Is the inner stream byte buffer readable?
-   * The test is "the stream is not FSDataInputStream
+   * The test is 'the stream is not FSDataInputStream
    * and implements ByteBufferReadable'
    *
    * That is: all streams which implement ByteBufferReadable
-   * other than FSDataInputStream successfuly support read(ByteBuffer).
-   * This is true for all filesytem clients the hadoop codebase.
+   * other than FSDataInputStream successfully support read(ByteBuffer).
+   * This is true for all filesystem clients the hadoop codebase.
    *
    * In hadoop 3.3.0+, the StreamCapabilities probe can be used to
    * check this: only those streams which provide the read(ByteBuffer)
    * semantics MAY return true for the probe "in:readbytebuffer";
    * FSDataInputStream will pass the probe down to the underlying stream.
    *
    * @param stream stream to probe
-   * @return true if it is safe to a H2SeekableInputStream to access the data
+   * @return true if it is safe to a H2SeekableInputStream to access
+   * the data, null when it cannot be determined because of missing hasCapabilities
    */
-  private static boolean isWrappedStreamByteBufferReadable(FSDataInputStream stream) {
-    if (stream.hasCapability("in:readbytebuffer")) {
+  private static Boolean isWrappedStreamByteBufferReadable(FSDataInputStream stream) {
+    if (hasCapabilitiesMethod.isNoop()) {
+      // When the method is not available, just return a null
+      return null;
+    }
+
+    Boolean hasCapabilities = hasCapabilitiesMethod.invoke(stream, "in:readbytebuffer");
+
+    if (hasCapabilities) {

Review Comment:
   This variable would be more clear if it were called `isByteBufferReadable` since that's the capability we are checking for.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] Fokko commented on pull request #1084: PARQUET-2276: Bring back support for Hadoop 2.7.3

Posted by "Fokko (via GitHub)" <gi...@apache.org>.
Fokko commented on PR #1084:
URL: https://github.com/apache/parquet-mr/pull/1084#issuecomment-1533328928

   FWIW, I also ran the Iceberg tests and it ran fine (except the bloom filter ones, more details [here](https://github.com/apache/iceberg/pull/7301#issuecomment-1528874719)).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] rdblue commented on a diff in pull request #1084: PARQUET-2276: Bring back support for Hadoop 2.7.3

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on code in PR #1084:
URL: https://github.com/apache/parquet-mr/pull/1084#discussion_r1183059127


##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java:
##########
@@ -46,41 +83,91 @@ public class HadoopStreams {
    */
   public static SeekableInputStream wrap(FSDataInputStream stream) {
     Objects.requireNonNull(stream, "Cannot wrap a null input stream");
-    if (isWrappedStreamByteBufferReadable(stream)) {
-      return new H2SeekableInputStream(stream);
-    } else {
-      return new H1SeekableInputStream(stream);
+
+    // Try to check using hasCapabilities(str)
+    Boolean hasCapabilitiesResult = isWrappedStreamByteBufferReadableHasCapabilities(stream);
+
+    // If it is null, then fall back to the old method
+    if (hasCapabilitiesResult != null) {
+      if (hasCapabilitiesResult) {
+        return new H2SeekableInputStream(stream);
+      } else {
+        return new H1SeekableInputStream(stream);
+      }
+    }
+
+    return isWrappedStreamByteBufferReadableLegacy(stream);
+  }
+
+  /**
+   * Is the inner stream byte buffer readable?
+   * The test is 'the stream is not FSDataInputStream
+   * and implements ByteBufferReadable'
+   *
+   * This logic is only used for Hadoop <2.9.x, and <3.x.x
+   *
+   * @param stream stream to probe
+   * @return A H2SeekableInputStream to access, or H1SeekableInputStream if the stream is not seekable
+   */
+  private static SeekableInputStream isWrappedStreamByteBufferReadableLegacy(FSDataInputStream stream) {
+    InputStream wrapped = stream.getWrappedStream();
+    if (wrapped instanceof FSDataInputStream) {
+      LOG.debug("Checking on wrapped stream {} of {} whether is ByteBufferReadable", wrapped, stream);
+      return isWrappedStreamByteBufferReadableLegacy(((FSDataInputStream) wrapped));

Review Comment:
   Why would a FSDataInputStream have another one inside?



##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java:
##########
@@ -46,41 +83,91 @@ public class HadoopStreams {
    */
   public static SeekableInputStream wrap(FSDataInputStream stream) {
     Objects.requireNonNull(stream, "Cannot wrap a null input stream");
-    if (isWrappedStreamByteBufferReadable(stream)) {
-      return new H2SeekableInputStream(stream);
-    } else {
-      return new H1SeekableInputStream(stream);
+
+    // Try to check using hasCapabilities(str)
+    Boolean hasCapabilitiesResult = isWrappedStreamByteBufferReadableHasCapabilities(stream);
+
+    // If it is null, then fall back to the old method
+    if (hasCapabilitiesResult != null) {
+      if (hasCapabilitiesResult) {
+        return new H2SeekableInputStream(stream);
+      } else {
+        return new H1SeekableInputStream(stream);
+      }
+    }
+
+    return isWrappedStreamByteBufferReadableLegacy(stream);
+  }
+
+  /**
+   * Is the inner stream byte buffer readable?
+   * The test is 'the stream is not FSDataInputStream
+   * and implements ByteBufferReadable'
+   *
+   * This logic is only used for Hadoop <2.9.x, and <3.x.x
+   *
+   * @param stream stream to probe
+   * @return A H2SeekableInputStream to access, or H1SeekableInputStream if the stream is not seekable
+   */
+  private static SeekableInputStream isWrappedStreamByteBufferReadableLegacy(FSDataInputStream stream) {
+    InputStream wrapped = stream.getWrappedStream();
+    if (wrapped instanceof FSDataInputStream) {
+      LOG.debug("Checking on wrapped stream {} of {} whether is ByteBufferReadable", wrapped, stream);
+      return isWrappedStreamByteBufferReadableLegacy(((FSDataInputStream) wrapped));
+    }
+    if (byteBufferReadableClass != null && h2SeekableConstructor != null &&
+      byteBufferReadableClass.isInstance(stream.getWrappedStream())) {
+      try {
+        return h2SeekableConstructor.newInstance(stream);
+      } catch (InstantiationException | IllegalAccessException e) {
+        LOG.warn("Could not instantiate H2SeekableInputStream, falling back to byte array reads", e);
+      } catch (InvocationTargetException e) {
+        throw new ParquetDecodingException(
+          "Could not instantiate H2SeekableInputStream", e.getTargetException());
+      }
     }
+    return new H1SeekableInputStream(stream);
   }
 
   /**
    * Is the inner stream byte buffer readable?
-   * The test is "the stream is not FSDataInputStream
+   * The test is 'the stream is not FSDataInputStream
    * and implements ByteBufferReadable'
    *
    * That is: all streams which implement ByteBufferReadable
-   * other than FSDataInputStream successfuly support read(ByteBuffer).
-   * This is true for all filesytem clients the hadoop codebase.
+   * other than FSDataInputStream successfully support read(ByteBuffer).
+   * This is true for all filesystem clients the hadoop codebase.
    *
    * In hadoop 3.3.0+, the StreamCapabilities probe can be used to
    * check this: only those streams which provide the read(ByteBuffer)
    * semantics MAY return true for the probe "in:readbytebuffer";
    * FSDataInputStream will pass the probe down to the underlying stream.
    *
    * @param stream stream to probe
-   * @return true if it is safe to a H2SeekableInputStream to access the data
+   * @return true if it is safe to a H2SeekableInputStream to access
+   *         the data, null when it cannot be determined
    */
-  private static boolean isWrappedStreamByteBufferReadable(FSDataInputStream stream) {
-    if (stream.hasCapability("in:readbytebuffer")) {
-      // stream is issuing the guarantee that it implements the
-      // API. Holds for all implementations in hadoop-*
-      // since Hadoop 3.3.0 (HDFS-14111).
-      return true;
+  private static Boolean isWrappedStreamByteBufferReadableHasCapabilities(FSDataInputStream stream) {
+    Method methodHasCapabilities;
+    try {
+      methodHasCapabilities = stream.getClass().getMethod("hasCapability", String.class);
+    } catch (Exception e) {
+      return null;
+    }
+    try {
+      if ((Boolean) methodHasCapabilities.invoke(stream, "in:readbytebuffer")) {
+        // stream is issuing the guarantee that it implements the
+        // API. Holds for all implementations in hadoop-*
+        // since Hadoop 3.3.0 (HDFS-14111).
+        return true;
+      }
+    } catch (IllegalAccessException | InvocationTargetException e) {
+      return null;

Review Comment:
   Why not return false?
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] rdblue commented on a diff in pull request #1084: PARQUET-2276: Bring back support for Hadoop 2.7.3

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on code in PR #1084:
URL: https://github.com/apache/parquet-mr/pull/1084#discussion_r1182873931


##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java:
##########
@@ -46,41 +83,91 @@ public class HadoopStreams {
    */
   public static SeekableInputStream wrap(FSDataInputStream stream) {
     Objects.requireNonNull(stream, "Cannot wrap a null input stream");
-    if (isWrappedStreamByteBufferReadable(stream)) {
-      return new H2SeekableInputStream(stream);
-    } else {
-      return new H1SeekableInputStream(stream);
+
+    // Try to check using hasCapabilities(str)
+    Boolean hasCapabilitiesResult = isWrappedStreamByteBufferReadableHasCapabilities(stream);
+
+    // If it is null, then fall back to the old method
+    if (hasCapabilitiesResult != null) {
+      if (hasCapabilitiesResult) {
+        return new H2SeekableInputStream(stream);
+      } else {
+        return new H1SeekableInputStream(stream);
+      }
+    }
+
+    return isWrappedStreamByteBufferReadableLegacy(stream);
+  }
+
+  /**
+   * Is the inner stream byte buffer readable?
+   * The test is 'the stream is not FSDataInputStream
+   * and implements ByteBufferReadable'
+   *
+   * This logic is only used for Hadoop <2.9.x, and <3.x.x
+   *
+   * @param stream stream to probe
+   * @return A H2SeekableInputStream to access, or H1SeekableInputStream if the stream is not seekable
+   */
+  private static SeekableInputStream isWrappedStreamByteBufferReadableLegacy(FSDataInputStream stream) {
+    InputStream wrapped = stream.getWrappedStream();
+    if (wrapped instanceof FSDataInputStream) {
+      LOG.debug("Checking on wrapped stream {} of {} whether is ByteBufferReadable", wrapped, stream);
+      return isWrappedStreamByteBufferReadableLegacy(((FSDataInputStream) wrapped));
+    }
+    if (byteBufferReadableClass != null && h2SeekableConstructor != null &&
+      byteBufferReadableClass.isInstance(stream.getWrappedStream())) {
+      try {
+        return h2SeekableConstructor.newInstance(stream);
+      } catch (InstantiationException | IllegalAccessException e) {
+        LOG.warn("Could not instantiate H2SeekableInputStream, falling back to byte array reads", e);
+      } catch (InvocationTargetException e) {
+        throw new ParquetDecodingException(
+          "Could not instantiate H2SeekableInputStream", e.getTargetException());
+      }
     }
+    return new H1SeekableInputStream(stream);
   }
 
   /**
    * Is the inner stream byte buffer readable?
-   * The test is "the stream is not FSDataInputStream
+   * The test is 'the stream is not FSDataInputStream
    * and implements ByteBufferReadable'
    *
    * That is: all streams which implement ByteBufferReadable
-   * other than FSDataInputStream successfuly support read(ByteBuffer).
-   * This is true for all filesytem clients the hadoop codebase.
+   * other than FSDataInputStream successfully support read(ByteBuffer).
+   * This is true for all filesystem clients the hadoop codebase.
    *
    * In hadoop 3.3.0+, the StreamCapabilities probe can be used to
    * check this: only those streams which provide the read(ByteBuffer)
    * semantics MAY return true for the probe "in:readbytebuffer";
    * FSDataInputStream will pass the probe down to the underlying stream.
    *
    * @param stream stream to probe
-   * @return true if it is safe to a H2SeekableInputStream to access the data
+   * @return true if it is safe to a H2SeekableInputStream to access
+   *         the data, null when it cannot be determined
    */
-  private static boolean isWrappedStreamByteBufferReadable(FSDataInputStream stream) {
-    if (stream.hasCapability("in:readbytebuffer")) {
-      // stream is issuing the guarantee that it implements the
-      // API. Holds for all implementations in hadoop-*
-      // since Hadoop 3.3.0 (HDFS-14111).
-      return true;
+  private static Boolean isWrappedStreamByteBufferReadableHasCapabilities(FSDataInputStream stream) {
+    Method methodHasCapabilities;
+    try {
+      methodHasCapabilities = stream.getClass().getMethod("hasCapability", String.class);

Review Comment:
   You can use DynMethods to get an unbound `hasCapability` method. That can be done statically, so all you need to do is check whether it is present and call it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] rdblue commented on a diff in pull request #1084: PARQUET-2276: Bring back support for Hadoop 2.7.3

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on code in PR #1084:
URL: https://github.com/apache/parquet-mr/pull/1084#discussion_r1182872918


##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java:
##########
@@ -37,6 +41,39 @@ public class HadoopStreams {
 
   private static final Logger LOG = LoggerFactory.getLogger(HadoopStreams.class);
 
+  private static final Class<?> byteBufferReadableClass = getReadableClass();
+  static final Constructor<SeekableInputStream> h2SeekableConstructor = getH2SeekableConstructor();

Review Comment:
   Looks like you'd need to add the `orNull` handling. I don't see it here: https://github.com/apache/parquet-mr/blob/master/parquet-common/src/main/java/org/apache/parquet/util/DynConstructors.java



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] Fokko commented on a diff in pull request #1084: PARQUET-2276: Bring back support for Hadoop 2.7.3

Posted by "Fokko (via GitHub)" <gi...@apache.org>.
Fokko commented on code in PR #1084:
URL: https://github.com/apache/parquet-mr/pull/1084#discussion_r1183079903


##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java:
##########
@@ -37,6 +41,39 @@ public class HadoopStreams {
 
   private static final Logger LOG = LoggerFactory.getLogger(HadoopStreams.class);
 
+  private static final Class<?> byteBufferReadableClass = getReadableClass();
+  static final Constructor<SeekableInputStream> h2SeekableConstructor = getH2SeekableConstructor();

Review Comment:
   Yes, I copied most from the old code to avoid refactoring. I think we can greatly simplify it because it was still taking Hadoop1 into account. We still have to check if the wrapped stream is ByteBufferReadable: https://github.com/apache/hadoop/blob/release-2.4.1/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FSDataInputStream.java#L142-L148
   
   The `hasCapabilities does the same but in a more elegant way.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] Fokko commented on pull request #1084: PARQUET-2276: Bring back support for Hadoop 2.7.3

Posted by "Fokko (via GitHub)" <gi...@apache.org>.
Fokko commented on PR #1084:
URL: https://github.com/apache/parquet-mr/pull/1084#issuecomment-1532200533

   > I think this should work after comparing it with older code, but it seems like there are some easy improvements to me.
   
   @rdblue I agree, I did some cleaning up. Let me know what you think.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] Fokko commented on pull request #1084: PARQUET-2276: Bring back support for Hadoop 2.7.3

Posted by "Fokko (via GitHub)" <gi...@apache.org>.
Fokko commented on PR #1084:
URL: https://github.com/apache/parquet-mr/pull/1084#issuecomment-1536815143

   Thanks @shangxinli, @wgtmac, @rdblue, and @steveloughran for the review.
   
   Steve, I'm aware of your stance and I also respect it. Unfortunately, a lot of companies are still using (internally heavy patched versions) of Hadoop, and to get traction in downstream projects we still have to maintain compatibility.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] Fokko commented on a diff in pull request #1084: PARQUET-2276: Bring back support for Hadoop 2.7.3

Posted by "Fokko (via GitHub)" <gi...@apache.org>.
Fokko commented on code in PR #1084:
URL: https://github.com/apache/parquet-mr/pull/1084#discussion_r1185348113


##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java:
##########
@@ -46,7 +54,39 @@ public class HadoopStreams {
    */
   public static SeekableInputStream wrap(FSDataInputStream stream) {
     Objects.requireNonNull(stream, "Cannot wrap a null input stream");
-    if (isWrappedStreamByteBufferReadable(stream)) {
+
+    // Try to check using hasCapabilities(str)
+    Boolean hasCapabilitiesResult = isWrappedStreamByteBufferReadableHasCapabilities(stream);
+
+    // If it is null, then fall back to the old method
+    if (hasCapabilitiesResult != null) {
+      if (hasCapabilitiesResult) {
+        return new H2SeekableInputStream(stream);
+      } else {
+        return new H1SeekableInputStream(stream);
+      }
+    }
+
+    return isWrappedStreamByteBufferReadableLegacy(stream);
+  }
+
+  /**
+   * Is the inner stream byte buffer readable?
+   * The test is 'the stream is not FSDataInputStream
+   * and implements ByteBufferReadable'
+   *
+   * This logic is only used for Hadoop <2.9.x, and <3.x.x
+   *
+   * @param stream stream to probe
+   * @return A H2SeekableInputStream to access, or H1SeekableInputStream if the stream is not seekable
+   */
+  private static SeekableInputStream isWrappedStreamByteBufferReadableLegacy(FSDataInputStream stream) {
+    InputStream wrapped = stream.getWrappedStream();
+    if (wrapped instanceof FSDataInputStream) {
+      LOG.debug("Checking on wrapped stream {} of {} whether is ByteBufferReadable", wrapped, stream);

Review Comment:
   Yes, I kept it in there.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] rdblue commented on a diff in pull request #1084: PARQUET-2276: Bring back support for Hadoop 2.7.3

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on code in PR #1084:
URL: https://github.com/apache/parquet-mr/pull/1084#discussion_r1186279000


##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java:
##########
@@ -55,23 +95,31 @@ public static SeekableInputStream wrap(FSDataInputStream stream) {
 
   /**
    * Is the inner stream byte buffer readable?
-   * The test is "the stream is not FSDataInputStream
+   * The test is 'the stream is not FSDataInputStream
    * and implements ByteBufferReadable'
    *
    * That is: all streams which implement ByteBufferReadable
-   * other than FSDataInputStream successfuly support read(ByteBuffer).
-   * This is true for all filesytem clients the hadoop codebase.
+   * other than FSDataInputStream successfully support read(ByteBuffer).
+   * This is true for all filesystem clients the hadoop codebase.
    *
    * In hadoop 3.3.0+, the StreamCapabilities probe can be used to
    * check this: only those streams which provide the read(ByteBuffer)
    * semantics MAY return true for the probe "in:readbytebuffer";
    * FSDataInputStream will pass the probe down to the underlying stream.
    *
    * @param stream stream to probe
-   * @return true if it is safe to a H2SeekableInputStream to access the data
+   * @return true if it is safe to a H2SeekableInputStream to access
+   * the data, null when it cannot be determined because of missing hasCapabilities
    */
-  private static boolean isWrappedStreamByteBufferReadable(FSDataInputStream stream) {
-    if (stream.hasCapability("in:readbytebuffer")) {
+  private static Boolean isWrappedStreamByteBufferReadable(FSDataInputStream stream) {
+    if (hasCapabilitiesMethod.isNoop()) {
+      // When the method is not available, just return a null
+      return null;
+    }
+
+    Boolean hasCapabilities = hasCapabilitiesMethod.invoke(stream, "in:readbytebuffer");

Review Comment:
   Is this a boxed boolean? If so, should we update the check to handle null?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] rdblue commented on a diff in pull request #1084: PARQUET-2276: Bring back support for Hadoop 2.7.3

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on code in PR #1084:
URL: https://github.com/apache/parquet-mr/pull/1084#discussion_r1186277590


##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java:
##########
@@ -46,7 +54,39 @@ public class HadoopStreams {
    */
   public static SeekableInputStream wrap(FSDataInputStream stream) {
     Objects.requireNonNull(stream, "Cannot wrap a null input stream");
-    if (isWrappedStreamByteBufferReadable(stream)) {
+
+    // Try to check using hasCapabilities(str)
+    Boolean hasCapabilitiesResult = isWrappedStreamByteBufferReadable(stream);
+
+    // If it is null, then fall back to the old method
+    if (hasCapabilitiesResult != null) {
+      if (hasCapabilitiesResult) {
+        return new H2SeekableInputStream(stream);
+      } else {
+        return new H1SeekableInputStream(stream);
+      }
+    }
+
+    return isWrappedStreamByteBufferReadableLegacy(stream);
+  }
+
+  /**
+   * Is the inner stream byte buffer readable?
+   * The test is 'the stream is not FSDataInputStream
+   * and implements ByteBufferReadable'
+   *
+   * This logic is only used for Hadoop <2.9.x, and <3.x.x
+   *
+   * @param stream stream to probe
+   * @return A H2SeekableInputStream to access, or H1SeekableInputStream if the stream is not seekable
+   */
+  private static SeekableInputStream isWrappedStreamByteBufferReadableLegacy(FSDataInputStream stream) {

Review Comment:
   Usually, `isSomething` methods return a boolean. This is unwrapping, so I'd prefer naming it `unwrapByteBufferReadableLegacy` or something to be more clear.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] rdblue commented on a diff in pull request #1084: PARQUET-2276: Bring back support for Hadoop 2.7.3

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on code in PR #1084:
URL: https://github.com/apache/parquet-mr/pull/1084#discussion_r1183057530


##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java:
##########
@@ -37,6 +41,39 @@ public class HadoopStreams {
 
   private static final Logger LOG = LoggerFactory.getLogger(HadoopStreams.class);
 
+  private static final Class<?> byteBufferReadableClass = getReadableClass();
+  static final Constructor<SeekableInputStream> h2SeekableConstructor = getH2SeekableConstructor();

Review Comment:
   Huh, I guess this was how it was before? Nevermind on the refactoring then.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] shangxinli commented on a diff in pull request #1084: PARQUET-2276: Bring back support for Hadoop 2.7.3

Posted by "shangxinli (via GitHub)" <gi...@apache.org>.
shangxinli commented on code in PR #1084:
URL: https://github.com/apache/parquet-mr/pull/1084#discussion_r1185179904


##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java:
##########
@@ -46,7 +54,39 @@ public class HadoopStreams {
    */
   public static SeekableInputStream wrap(FSDataInputStream stream) {
     Objects.requireNonNull(stream, "Cannot wrap a null input stream");
-    if (isWrappedStreamByteBufferReadable(stream)) {
+
+    // Try to check using hasCapabilities(str)
+    Boolean hasCapabilitiesResult = isWrappedStreamByteBufferReadableHasCapabilities(stream);
+
+    // If it is null, then fall back to the old method
+    if (hasCapabilitiesResult != null) {
+      if (hasCapabilitiesResult) {
+        return new H2SeekableInputStream(stream);
+      } else {
+        return new H1SeekableInputStream(stream);
+      }
+    }
+
+    return isWrappedStreamByteBufferReadableLegacy(stream);
+  }
+
+  /**
+   * Is the inner stream byte buffer readable?
+   * The test is 'the stream is not FSDataInputStream
+   * and implements ByteBufferReadable'
+   *
+   * This logic is only used for Hadoop <2.9.x, and <3.x.x
+   *
+   * @param stream stream to probe
+   * @return A H2SeekableInputStream to access, or H1SeekableInputStream if the stream is not seekable
+   */
+  private static SeekableInputStream isWrappedStreamByteBufferReadableLegacy(FSDataInputStream stream) {
+    InputStream wrapped = stream.getWrappedStream();
+    if (wrapped instanceof FSDataInputStream) {
+      LOG.debug("Checking on wrapped stream {} of {} whether is ByteBufferReadable", wrapped, stream);

Review Comment:
   It is good to add this debug log in case recursive methods cause an infinite loop, we can enable this logging and debug. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] Fokko commented on a diff in pull request #1084: PARQUET-2276: Bring back support for Hadoop 2.7.3

Posted by "Fokko (via GitHub)" <gi...@apache.org>.
Fokko commented on code in PR #1084:
URL: https://github.com/apache/parquet-mr/pull/1084#discussion_r1186521595


##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java:
##########
@@ -55,23 +95,31 @@ public static SeekableInputStream wrap(FSDataInputStream stream) {
 
   /**
    * Is the inner stream byte buffer readable?
-   * The test is "the stream is not FSDataInputStream
+   * The test is 'the stream is not FSDataInputStream
    * and implements ByteBufferReadable'
    *
    * That is: all streams which implement ByteBufferReadable
-   * other than FSDataInputStream successfuly support read(ByteBuffer).
-   * This is true for all filesytem clients the hadoop codebase.
+   * other than FSDataInputStream successfully support read(ByteBuffer).
+   * This is true for all filesystem clients the hadoop codebase.
    *
    * In hadoop 3.3.0+, the StreamCapabilities probe can be used to
    * check this: only those streams which provide the read(ByteBuffer)
    * semantics MAY return true for the probe "in:readbytebuffer";
    * FSDataInputStream will pass the probe down to the underlying stream.
    *
    * @param stream stream to probe
-   * @return true if it is safe to a H2SeekableInputStream to access the data
+   * @return true if it is safe to a H2SeekableInputStream to access
+   * the data, null when it cannot be determined because of missing hasCapabilities
    */
-  private static boolean isWrappedStreamByteBufferReadable(FSDataInputStream stream) {
-    if (stream.hasCapability("in:readbytebuffer")) {
+  private static Boolean isWrappedStreamByteBufferReadable(FSDataInputStream stream) {
+    if (hasCapabilitiesMethod.isNoop()) {
+      // When the method is not available, just return a null
+      return null;
+    }
+
+    Boolean hasCapabilities = hasCapabilitiesMethod.invoke(stream, "in:readbytebuffer");

Review Comment:
   I was assuming that it needed to be an object, but a primitive works as well, so changed that. Thanks for catching this!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] steveloughran commented on a diff in pull request #1084: PARQUET-2276: Bring back support for Hadoop 2.7.3

Posted by "steveloughran (via GitHub)" <gi...@apache.org>.
steveloughran commented on code in PR #1084:
URL: https://github.com/apache/parquet-mr/pull/1084#discussion_r1184944471


##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java:
##########
@@ -46,41 +83,91 @@ public class HadoopStreams {
    */
   public static SeekableInputStream wrap(FSDataInputStream stream) {
     Objects.requireNonNull(stream, "Cannot wrap a null input stream");
-    if (isWrappedStreamByteBufferReadable(stream)) {
-      return new H2SeekableInputStream(stream);
-    } else {
-      return new H1SeekableInputStream(stream);
+
+    // Try to check using hasCapabilities(str)
+    Boolean hasCapabilitiesResult = isWrappedStreamByteBufferReadableHasCapabilities(stream);
+
+    // If it is null, then fall back to the old method
+    if (hasCapabilitiesResult != null) {
+      if (hasCapabilitiesResult) {
+        return new H2SeekableInputStream(stream);
+      } else {
+        return new H1SeekableInputStream(stream);
+      }
+    }
+
+    return isWrappedStreamByteBufferReadableLegacy(stream);
+  }
+
+  /**
+   * Is the inner stream byte buffer readable?
+   * The test is 'the stream is not FSDataInputStream
+   * and implements ByteBufferReadable'
+   *
+   * This logic is only used for Hadoop <2.9.x, and <3.x.x
+   *
+   * @param stream stream to probe
+   * @return A H2SeekableInputStream to access, or H1SeekableInputStream if the stream is not seekable
+   */
+  private static SeekableInputStream isWrappedStreamByteBufferReadableLegacy(FSDataInputStream stream) {
+    InputStream wrapped = stream.getWrappedStream();
+    if (wrapped instanceof FSDataInputStream) {
+      LOG.debug("Checking on wrapped stream {} of {} whether is ByteBufferReadable", wrapped, stream);
+      return isWrappedStreamByteBufferReadableLegacy(((FSDataInputStream) wrapped));

Review Comment:
   you can if you try hard, it's just really unusual
   
   you can never wrap an instance by itself.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] steveloughran commented on pull request #1084: PARQUET-2276: Bring back support for Hadoop 2.7.3

Posted by "steveloughran (via GitHub)" <gi...@apache.org>.
steveloughran commented on PR #1084:
URL: https://github.com/apache/parquet-mr/pull/1084#issuecomment-1534680365

   I repeat my stance on this: to claim hadoop 2.7 runtime compatibility you should be building against java7. if you don't, well, make clear its a fairly qualified support "hadoop 2.7.3 on java8 only" and not worry about the bits of hadoop which break if they try to do that (kerberos, s3a, anything with joda-time, ...)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] shangxinli commented on a diff in pull request #1084: PARQUET-2276: Bring back support for Hadoop 2.7.3

Posted by "shangxinli (via GitHub)" <gi...@apache.org>.
shangxinli commented on code in PR #1084:
URL: https://github.com/apache/parquet-mr/pull/1084#discussion_r1185181599


##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java:
##########
@@ -55,23 +95,31 @@ public static SeekableInputStream wrap(FSDataInputStream stream) {
 
   /**
    * Is the inner stream byte buffer readable?
-   * The test is "the stream is not FSDataInputStream
+   * The test is 'the stream is not FSDataInputStream
    * and implements ByteBufferReadable'
    *
    * That is: all streams which implement ByteBufferReadable
-   * other than FSDataInputStream successfuly support read(ByteBuffer).
-   * This is true for all filesytem clients the hadoop codebase.
+   * other than FSDataInputStream successfully support read(ByteBuffer).
+   * This is true for all filesystem clients the hadoop codebase.
    *
    * In hadoop 3.3.0+, the StreamCapabilities probe can be used to
    * check this: only those streams which provide the read(ByteBuffer)
    * semantics MAY return true for the probe "in:readbytebuffer";
    * FSDataInputStream will pass the probe down to the underlying stream.
    *
    * @param stream stream to probe
-   * @return true if it is safe to a H2SeekableInputStream to access the data
+   * @return true if it is safe to a H2SeekableInputStream to access
+   * the data, null when it cannot be determined because of missing hasCapabilities
    */
-  private static boolean isWrappedStreamByteBufferReadable(FSDataInputStream stream) {
-    if (stream.hasCapability("in:readbytebuffer")) {
+  private static Boolean isWrappedStreamByteBufferReadableHasCapabilities(FSDataInputStream stream) {

Review Comment:
   This long method name is more meaningful but it seems too long. But this is a minor comment and doesn't need to be changed if no better name. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] rdblue commented on a diff in pull request #1084: PARQUET-2276: Bring back support for Hadoop 2.7.3

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on code in PR #1084:
URL: https://github.com/apache/parquet-mr/pull/1084#discussion_r1186278151


##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java:
##########
@@ -46,7 +54,39 @@ public class HadoopStreams {
    */
   public static SeekableInputStream wrap(FSDataInputStream stream) {
     Objects.requireNonNull(stream, "Cannot wrap a null input stream");
-    if (isWrappedStreamByteBufferReadable(stream)) {
+
+    // Try to check using hasCapabilities(str)
+    Boolean hasCapabilitiesResult = isWrappedStreamByteBufferReadable(stream);
+
+    // If it is null, then fall back to the old method
+    if (hasCapabilitiesResult != null) {
+      if (hasCapabilitiesResult) {
+        return new H2SeekableInputStream(stream);
+      } else {
+        return new H1SeekableInputStream(stream);
+      }
+    }
+
+    return isWrappedStreamByteBufferReadableLegacy(stream);
+  }
+
+  /**
+   * Is the inner stream byte buffer readable?
+   * The test is 'the stream is not FSDataInputStream
+   * and implements ByteBufferReadable'
+   *
+   * This logic is only used for Hadoop <2.9.x, and <3.x.x
+   *
+   * @param stream stream to probe
+   * @return A H2SeekableInputStream to access, or H1SeekableInputStream if the stream is not seekable
+   */
+  private static SeekableInputStream isWrappedStreamByteBufferReadableLegacy(FSDataInputStream stream) {
+    InputStream wrapped = stream.getWrappedStream();
+    if (wrapped instanceof FSDataInputStream) {
+      LOG.debug("Checking on wrapped stream {} of {} whether is ByteBufferReadable", wrapped, stream);
+      return isWrappedStreamByteBufferReadableLegacy(((FSDataInputStream) wrapped));
+    }
+    if (stream.getWrappedStream() instanceof ByteBufferReadable) {

Review Comment:
   I prefer using the same whitespace conventions as in Iceberg, although that's a bit more relaxed over here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org