You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@parquet.apache.org by "clairemcginty (via GitHub)" <gi...@apache.org> on 2023/04/21 19:36:28 UTC

[GitHub] [parquet-mr] clairemcginty opened a new pull request, #1078: PARQUET-2292: Default SpecificRecord model reflects from MODEL$ field

clairemcginty opened a new pull request, #1078:
URL: https://github.com/apache/parquet-mr/pull/1078

   Context:
   
   AvroWriteSupport/AvroReadSupport can improve the precision of their default `model` selection. Currently they default to new SpecificDataSupplier().get()[0]. This means that SpecificRecord classes that contain logical types will fail out-of-the-box unless a specific DATA_SUPPLIER is configured that contains logical type conversions.
   
   This PR attempts to derive a `model` by reflecting the `MODEL$` field from the SpecificRecordBase implementation, which in Avro 1.11 comes pre-loaded with all necessary logical type conversions for that Specific Avro class. (For Avro 1.8 compatibility, we have to add the conversions manually from the `conversions` field.)
   
   ### Jira
   
   - [x] My PR addresses the following [Parquet Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references them in the PR title. For example, "PARQUET-1234: My Parquet PR"
     - https://issues.apache.org/jira/browse/PARQUET-XXX
     - In case you are adding a dependency, check if the license complies with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x).
   
   ### Tests
   
   - [x] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason:
   
   ### Commits
   
   - [x] My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)":
     1. Subject is separated from body by a blank line
     1. Subject is limited to 50 characters (not including Jira issue reference)
     1. Subject does not end with a period
     1. Subject uses the imperative mood ("add", not "adding")
     1. Body wraps at 72 characters
     1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [ ] In case of new functionality, my PR adds documentation that describes how to use it.
     - All the public functions and the classes in the PR contain Javadoc that explain what it does
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] clairemcginty commented on a diff in pull request #1078: PARQUET-2292: Default SpecificRecord model reflects from MODEL$ field

Posted by "clairemcginty (via GitHub)" <gi...@apache.org>.
clairemcginty commented on code in PR #1078:
URL: https://github.com/apache/parquet-mr/pull/1078#discussion_r1182866610


##########
parquet-avro/src/main/java/org/apache/parquet/avro/AvroRecordConverter.java:
##########
@@ -169,6 +172,46 @@ public void add(Object value) {
     }
   }
 
+  /**
+   * Returns the specific data model for a given SpecificRecord schema by reflecting the underlying
+   * Avro class's `MODEL$` field, or Null if the class is not on the classpath or reflection fails.
+   */
+  static SpecificData getModelForSchema(Schema schema) {
+    final Class<?> clazz;
+
+    if (schema != null && (schema.getType() == Schema.Type.RECORD || schema.getType() == Schema.Type.UNION)) {
+      clazz = SpecificData.get().getClass(schema);
+    } else {
+      return null;
+    }
+
+    final SpecificData model;
+    try {
+      final Field modelField = clazz.getDeclaredField("MODEL$");
+      modelField.setAccessible(true);
+
+      model = (SpecificData) modelField.get(null);
+    } catch (Exception e) {

Review Comment:
   Added clearer logging & null-/error-handling.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] gszadovszky commented on pull request #1078: PARQUET-2292: Default SpecificRecord model reflects from MODEL$ field

Posted by "gszadovszky (via GitHub)" <gi...@apache.org>.
gszadovszky commented on PR #1078:
URL: https://github.com/apache/parquet-mr/pull/1078#issuecomment-1536206412

   @clairemcginty, I think, there is an option to reply from you own mail client in ponymail (the thread link above).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] clairemcginty commented on a diff in pull request #1078: PARQUET-2292: Default SpecificRecord model reflects from MODEL$ field

Posted by "clairemcginty (via GitHub)" <gi...@apache.org>.
clairemcginty commented on code in PR #1078:
URL: https://github.com/apache/parquet-mr/pull/1078#discussion_r1174106674


##########
parquet-avro/src/main/java/org/apache/parquet/avro/AvroRecordConverter.java:
##########
@@ -169,6 +172,46 @@ public void add(Object value) {
     }
   }
 
+  /**
+   * Returns the specific data model for a given SpecificRecord schema by reflecting the underlying
+   * Avro class's `MODEL$` field, or Null if the class is not on the classpath or reflection fails.
+   */
+  static SpecificData getModelForSchema(Schema schema) {

Review Comment:
   The reflection logic here is all based on the existing Avro behavior for its model cache: https://github.com/apache/avro/blob/release-1.11.1/lang/java/avro/src/main/java/org/apache/avro/specific/SpecificData.java#L76-L86
   
   btw, I'm not sure if `AvroRecordConverter` is the best place for this method -- maybe it would be better as part of `SpecificDataSupplier` or in either `AvroReadSupport` or `AvroWriteSupport` ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] clairemcginty commented on pull request #1078: PARQUET-2292: Default SpecificRecord model reflects from MODEL$ field

Posted by "clairemcginty (via GitHub)" <gi...@apache.org>.
clairemcginty commented on PR #1078:
URL: https://github.com/apache/parquet-mr/pull/1078#issuecomment-1534819205

   @wgtmac, sorry about that -- `mvn verify` was happy on my machine, but I wasn't running `install`. Should be good now!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] clairemcginty commented on a diff in pull request #1078: PARQUET-2292: Default SpecificRecord model reflects from MODEL$ field

Posted by "clairemcginty (via GitHub)" <gi...@apache.org>.
clairemcginty commented on code in PR #1078:
URL: https://github.com/apache/parquet-mr/pull/1078#discussion_r1175528495


##########
parquet-avro/src/main/java/org/apache/parquet/avro/AvroRecordConverter.java:
##########
@@ -169,6 +172,46 @@ public void add(Object value) {
     }
   }
 
+  /**
+   * Returns the specific data model for a given SpecificRecord schema by reflecting the underlying
+   * Avro class's `MODEL$` field, or Null if the class is not on the classpath or reflection fails.
+   */
+  static SpecificData getModelForSchema(Schema schema) {
+    final Class<?> clazz;
+
+    if (schema != null && (schema.getType() == Schema.Type.RECORD || schema.getType() == Schema.Type.UNION)) {
+      clazz = SpecificData.get().getClass(schema);
+    } else {
+      return null;
+    }
+
+    final SpecificData model;
+    try {
+      final Field modelField = clazz.getDeclaredField("MODEL$");
+      modelField.setAccessible(true);
+
+      model = (SpecificData) modelField.get(null);

Review Comment:
   sorry, can you expand on this a bit?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] clairemcginty commented on a diff in pull request #1078: PARQUET-2292: Default SpecificRecord model reflects from MODEL$ field

Posted by "clairemcginty (via GitHub)" <gi...@apache.org>.
clairemcginty commented on code in PR #1078:
URL: https://github.com/apache/parquet-mr/pull/1078#discussion_r1182865579


##########
parquet-avro/src/main/java/org/apache/parquet/avro/AvroRecordConverter.java:
##########
@@ -169,6 +172,46 @@ public void add(Object value) {
     }
   }
 
+  /**
+   * Returns the specific data model for a given SpecificRecord schema by reflecting the underlying
+   * Avro class's `MODEL$` field, or Null if the class is not on the classpath or reflection fails.
+   */
+  static SpecificData getModelForSchema(Schema schema) {
+    final Class<?> clazz;
+
+    if (schema != null && (schema.getType() == Schema.Type.RECORD || schema.getType() == Schema.Type.UNION)) {
+      clazz = SpecificData.get().getClass(schema);
+    } else {
+      return null;
+    }
+
+    final SpecificData model;
+    try {
+      final Field modelField = clazz.getDeclaredField("MODEL$");
+      modelField.setAccessible(true);
+
+      model = (SpecificData) modelField.get(null);
+    } catch (Exception e) {
+      return null;
+    }
+
+    try {
+      final String avroVersion = Schema.Parser.class.getPackage().getImplementationVersion();
+      // Avro 1.8 doesn't include conversions in the MODEL$ field
+      if (avroVersion.startsWith("1.8.")) {

Review Comment:
   ok, I've added a new test suite, `TestAvroRecordConverter`, that thoroughly tests `getModelForSchema` with a variety of schema variants and with Avro versions 1.{7,8,9,10,11}. To do this, I used Powermock to mock the static invocation used to get Avro runtime version, then I ad-hoc compiled Avro generated classes using compiler versions 1.{7,8,9,10,11}. Happy to take a different approach with testing if you prefer -- this just seemed like the most straightforward way to do it :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] wgtmac commented on pull request #1078: PARQUET-2292: Default SpecificRecord model reflects from MODEL$ field

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on PR #1078:
URL: https://github.com/apache/parquet-mr/pull/1078#issuecomment-1535595073

   The parquet community is discussing a new `1.13.1` release to address some issues introduced by `1.13.0`: https://lists.apache.org/thread/1mjvdcmwqjcblmfkfgpd9ob2yodx7tom
   
   As the release manager, what do you think if we include this PR to the `1.13.1` release? @Fokko 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] clairemcginty commented on a diff in pull request #1078: PARQUET-2292: Default SpecificRecord model reflects from MODEL$ field

Posted by "clairemcginty (via GitHub)" <gi...@apache.org>.
clairemcginty commented on code in PR #1078:
URL: https://github.com/apache/parquet-mr/pull/1078#discussion_r1182538420


##########
parquet-avro/src/main/java/org/apache/parquet/avro/AvroRecordConverter.java:
##########
@@ -169,6 +172,46 @@ public void add(Object value) {
     }
   }
 
+  /**
+   * Returns the specific data model for a given SpecificRecord schema by reflecting the underlying
+   * Avro class's `MODEL$` field, or Null if the class is not on the classpath or reflection fails.
+   */
+  static SpecificData getModelForSchema(Schema schema) {
+    final Class<?> clazz;
+
+    if (schema != null && (schema.getType() == Schema.Type.RECORD || schema.getType() == Schema.Type.UNION)) {
+      clazz = SpecificData.get().getClass(schema);
+    } else {
+      return null;
+    }
+
+    final SpecificData model;
+    try {
+      final Field modelField = clazz.getDeclaredField("MODEL$");
+      modelField.setAccessible(true);
+
+      model = (SpecificData) modelField.get(null);
+    } catch (Exception e) {

Review Comment:
   that sounds good! I had planned to add logging originally, but I saw that there were 0 log statements in parquet-avro (and the slf4j dependencies for the module are [scoped](https://github.com/apache/parquet-mr/blob/master/parquet-avro/pom.xml#L76-L80) to `test`). So I'd have to import slf4j into the compile scope as well.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] clairemcginty commented on a diff in pull request #1078: PARQUET-2292: Default SpecificRecord model reflects from MODEL$ field

Posted by "clairemcginty (via GitHub)" <gi...@apache.org>.
clairemcginty commented on code in PR #1078:
URL: https://github.com/apache/parquet-mr/pull/1078#discussion_r1181768008


##########
parquet-avro/src/main/java/org/apache/parquet/avro/AvroRecordConverter.java:
##########
@@ -169,6 +172,46 @@ public void add(Object value) {
     }
   }
 
+  /**
+   * Returns the specific data model for a given SpecificRecord schema by reflecting the underlying
+   * Avro class's `MODEL$` field, or Null if the class is not on the classpath or reflection fails.
+   */
+  static SpecificData getModelForSchema(Schema schema) {

Review Comment:
   hi @gszadovszky ! Is there anything I can do to improve this PR?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] RustedBones commented on a diff in pull request #1078: PARQUET-2292: Default SpecificRecord model reflects from MODEL$ field

Posted by "RustedBones (via GitHub)" <gi...@apache.org>.
RustedBones commented on code in PR #1078:
URL: https://github.com/apache/parquet-mr/pull/1078#discussion_r1175343501


##########
parquet-avro/src/main/java/org/apache/parquet/avro/AvroRecordConverter.java:
##########
@@ -169,6 +172,46 @@ public void add(Object value) {
     }
   }
 
+  /**
+   * Returns the specific data model for a given SpecificRecord schema by reflecting the underlying
+   * Avro class's `MODEL$` field, or Null if the class is not on the classpath or reflection fails.
+   */
+  static SpecificData getModelForSchema(Schema schema) {
+    final Class<?> clazz;
+
+    if (schema != null && (schema.getType() == Schema.Type.RECORD || schema.getType() == Schema.Type.UNION)) {
+      clazz = SpecificData.get().getClass(schema);
+    } else {
+      return null;
+    }
+
+    final SpecificData model;
+    try {
+      final Field modelField = clazz.getDeclaredField("MODEL$");
+      modelField.setAccessible(true);
+
+      model = (SpecificData) modelField.get(null);
+    } catch (Exception e) {
+      return null;
+    }
+
+    try {
+      final String avroVersion = Schema.Parser.class.getPackage().getImplementationVersion();

Review Comment:
   This won't work for avro pre 1.10 where the implementation version is not defined, and will return `null`
   Here is a [trick](https://github.com/sbt/sbt-avro/blob/ec5f8aa064d0314a97a876bfe1d2d1bdb26dcbb8/src/main/scala/com/github/sbt/avro/SbtAvro.scala#L28) to access the avro version for previous artifact



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] RustedBones commented on a diff in pull request #1078: PARQUET-2292: Default SpecificRecord model reflects from MODEL$ field

Posted by "RustedBones (via GitHub)" <gi...@apache.org>.
RustedBones commented on code in PR #1078:
URL: https://github.com/apache/parquet-mr/pull/1078#discussion_r1175343501


##########
parquet-avro/src/main/java/org/apache/parquet/avro/AvroRecordConverter.java:
##########
@@ -169,6 +172,46 @@ public void add(Object value) {
     }
   }
 
+  /**
+   * Returns the specific data model for a given SpecificRecord schema by reflecting the underlying
+   * Avro class's `MODEL$` field, or Null if the class is not on the classpath or reflection fails.
+   */
+  static SpecificData getModelForSchema(Schema schema) {
+    final Class<?> clazz;
+
+    if (schema != null && (schema.getType() == Schema.Type.RECORD || schema.getType() == Schema.Type.UNION)) {
+      clazz = SpecificData.get().getClass(schema);
+    } else {
+      return null;
+    }
+
+    final SpecificData model;
+    try {
+      final Field modelField = clazz.getDeclaredField("MODEL$");
+      modelField.setAccessible(true);
+
+      model = (SpecificData) modelField.get(null);
+    } catch (Exception e) {
+      return null;
+    }
+
+    try {
+      final String avroVersion = Schema.Parser.class.getPackage().getImplementationVersion();

Review Comment:
   This won't work for avro pre 1.10 where the implementation version is not defined, and will return `null`
   Here is a [trick](https://github.com/sbt/sbt-avro/blob/ec5f8aa064d0314a97a876bfe1d2d1bdb26dcbb8/src/main/scala/com/github/sbt/avro/SbtAvro.scala#L28) to access the avro version for previous artifact



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] RustedBones commented on a diff in pull request #1078: PARQUET-2292: Default SpecificRecord model reflects from MODEL$ field

Posted by "RustedBones (via GitHub)" <gi...@apache.org>.
RustedBones commented on code in PR #1078:
URL: https://github.com/apache/parquet-mr/pull/1078#discussion_r1175343501


##########
parquet-avro/src/main/java/org/apache/parquet/avro/AvroRecordConverter.java:
##########
@@ -169,6 +172,46 @@ public void add(Object value) {
     }
   }
 
+  /**
+   * Returns the specific data model for a given SpecificRecord schema by reflecting the underlying
+   * Avro class's `MODEL$` field, or Null if the class is not on the classpath or reflection fails.
+   */
+  static SpecificData getModelForSchema(Schema schema) {
+    final Class<?> clazz;
+
+    if (schema != null && (schema.getType() == Schema.Type.RECORD || schema.getType() == Schema.Type.UNION)) {
+      clazz = SpecificData.get().getClass(schema);
+    } else {
+      return null;
+    }
+
+    final SpecificData model;
+    try {
+      final Field modelField = clazz.getDeclaredField("MODEL$");
+      modelField.setAccessible(true);
+
+      model = (SpecificData) modelField.get(null);
+    } catch (Exception e) {
+      return null;
+    }
+
+    try {
+      final String avroVersion = Schema.Parser.class.getPackage().getImplementationVersion();

Review Comment:
   This won't work for avro pre 1.10 where the implementation is not defined, and will return `null`
   Here is a [trick](https://github.com/sbt/sbt-avro/blob/ec5f8aa064d0314a97a876bfe1d2d1bdb26dcbb8/src/main/scala/com/github/sbt/avro/SbtAvro.scala#L28) to access the avro version for previous artifact



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] clairemcginty commented on a diff in pull request #1078: PARQUET-2292: Default SpecificRecord model reflects from MODEL$ field

Posted by "clairemcginty (via GitHub)" <gi...@apache.org>.
clairemcginty commented on code in PR #1078:
URL: https://github.com/apache/parquet-mr/pull/1078#discussion_r1185224751


##########
parquet-avro/src/main/java/org/apache/parquet/avro/AvroRecordConverter.java:
##########
@@ -169,6 +172,46 @@ public void add(Object value) {
     }
   }
 
+  /**
+   * Returns the specific data model for a given SpecificRecord schema by reflecting the underlying
+   * Avro class's `MODEL$` field, or Null if the class is not on the classpath or reflection fails.
+   */
+  static SpecificData getModelForSchema(Schema schema) {
+    final Class<?> clazz;
+
+    if (schema != null && (schema.getType() == Schema.Type.RECORD || schema.getType() == Schema.Type.UNION)) {
+      clazz = SpecificData.get().getClass(schema);
+    } else {
+      return null;
+    }
+
+    final SpecificData model;
+    try {
+      final Field modelField = clazz.getDeclaredField("MODEL$");
+      modelField.setAccessible(true);
+
+      model = (SpecificData) modelField.get(null);
+    } catch (Exception e) {
+      return null;
+    }
+
+    try {
+      final String avroVersion = Schema.Parser.class.getPackage().getImplementationVersion();
+      // Avro 1.8 doesn't include conversions in the MODEL$ field
+      if (avroVersion.startsWith("1.8.")) {

Review Comment:
   Great! I added code wrapping `getModelForSchema` in a try/catch at each call site.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] gszadovszky commented on pull request #1078: PARQUET-2292: Default SpecificRecord model reflects from MODEL$ field

Posted by "gszadovszky (via GitHub)" <gi...@apache.org>.
gszadovszky commented on PR #1078:
URL: https://github.com/apache/parquet-mr/pull/1078#issuecomment-1535348289

   I wouldn't like to make you sad, @clairemcginty, but we just released `1.13.0` last month and the previous one was almost a year ago. Meanwhile, nothing says we cannot do releases for tiny features, it is more about the effort and who has the time to do it. I usually don't, unfortunately.
   Maybe if you ask nicely, @wgtmac would do another one... :wink:


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] clairemcginty commented on a diff in pull request #1078: PARQUET-2292: Default SpecificRecord model reflects from MODEL$ field

Posted by "clairemcginty (via GitHub)" <gi...@apache.org>.
clairemcginty commented on code in PR #1078:
URL: https://github.com/apache/parquet-mr/pull/1078#discussion_r1174108640


##########
parquet-avro/src/main/java/org/apache/parquet/avro/AvroRecordConverter.java:
##########
@@ -169,6 +172,46 @@ public void add(Object value) {
     }
   }
 
+  /**
+   * Returns the specific data model for a given SpecificRecord schema by reflecting the underlying
+   * Avro class's `MODEL$` field, or Null if the class is not on the classpath or reflection fails.
+   */
+  static SpecificData getModelForSchema(Schema schema) {
+    final Class<?> clazz;
+
+    if (schema != null && (schema.getType() == Schema.Type.RECORD || schema.getType() == Schema.Type.UNION)) {
+      clazz = SpecificData.get().getClass(schema);
+    } else {
+      return null;
+    }
+
+    final SpecificData model;
+    try {
+      final Field modelField = clazz.getDeclaredField("MODEL$");
+      modelField.setAccessible(true);
+
+      model = (SpecificData) modelField.get(null);
+    } catch (Exception e) {
+      return null;
+    }
+
+    try {
+      final String avroVersion = Schema.Parser.class.getPackage().getImplementationVersion();
+      // Avro 1.8 doesn't include conversions in the MODEL$ field
+      if (avroVersion.startsWith("1.8.")) {

Review Comment:
   For reference, Avro 1.11 compiles SpecificRecord implementations that look like:
   
   ```java
   @org.apache.avro.specific.AvroGenerated
   public class LogicalTypesTest extends org.apache.avro.specific.SpecificRecordBase implements org.apache.avro.specific.SpecificRecord {
     ...
     private static final SpecificData MODEL$ = new SpecificData();
     static {
       MODEL$.addLogicalTypeConversion(new org.apache.avro.data.TimeConversions.TimestampMillisConversion());
     }
   }
   ```
   
   Whereas for Avro 1.8, there is no static initialization, but there _is_ a `conversions` field:
   
   ```java
   @org.apache.avro.specific.AvroGenerated
   public class LogicalTypesTest extends org.apache.avro.specific.SpecificRecordBase implements org.apache.avro.specific.SpecificRecord {
     ...
     private static final SpecificData MODEL$ = new SpecificData();
     
     protected static final org.apache.avro.data.TimeConversions.TimestampConversion TIMESTAMP_CONVERSION = new org.apache.avro.data.TimeConversions.TimestampConversion();
   
     private static final org.apache.avro.Conversion<?>[] conversions =
         new org.apache.avro.Conversion<?>[] {
         TIMESTAMP_CONVERSION,
         null
     };
   }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] gszadovszky commented on pull request #1078: PARQUET-2292: Default SpecificRecord model reflects from MODEL$ field

Posted by "gszadovszky (via GitHub)" <gi...@apache.org>.
gszadovszky commented on PR #1078:
URL: https://github.com/apache/parquet-mr/pull/1078#issuecomment-1535844366

   @clairemcginty, could you reply on the thread that you would like to have this one included? It is better to have a broader audience for discussing the release content. (Note: We should usually not include features in patch releases to lower the potential risks. But it is up to the community to decide.)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] gszadovszky merged pull request #1078: PARQUET-2292: Default SpecificRecord model reflects from MODEL$ field

Posted by "gszadovszky (via GitHub)" <gi...@apache.org>.
gszadovszky merged PR #1078:
URL: https://github.com/apache/parquet-mr/pull/1078


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] clairemcginty commented on pull request #1078: PARQUET-2292: Default SpecificRecord model reflects from MODEL$ field

Posted by "clairemcginty (via GitHub)" <gi...@apache.org>.
clairemcginty commented on PR #1078:
URL: https://github.com/apache/parquet-mr/pull/1078#issuecomment-1536628552

   thanks @gszadovszky! I used the Ponymail link from the thread, but I fear it's just created a new email thread - apologies 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] wgtmac commented on pull request #1078: PARQUET-2292: Default SpecificRecord model reflects from MODEL$ field

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on PR #1078:
URL: https://github.com/apache/parquet-mr/pull/1078#issuecomment-1534100022

   @clairemcginty Could you make the CIs happy?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] gszadovszky commented on a diff in pull request #1078: PARQUET-2292: Default SpecificRecord model reflects from MODEL$ field

Posted by "gszadovszky (via GitHub)" <gi...@apache.org>.
gszadovszky commented on code in PR #1078:
URL: https://github.com/apache/parquet-mr/pull/1078#discussion_r1185124781


##########
parquet-avro/src/main/java/org/apache/parquet/avro/AvroRecordConverter.java:
##########
@@ -169,6 +172,46 @@ public void add(Object value) {
     }
   }
 
+  /**
+   * Returns the specific data model for a given SpecificRecord schema by reflecting the underlying
+   * Avro class's `MODEL$` field, or Null if the class is not on the classpath or reflection fails.
+   */
+  static SpecificData getModelForSchema(Schema schema) {
+    final Class<?> clazz;
+
+    if (schema != null && (schema.getType() == Schema.Type.RECORD || schema.getType() == Schema.Type.UNION)) {
+      clazz = SpecificData.get().getClass(schema);
+    } else {
+      return null;
+    }
+
+    final SpecificData model;
+    try {
+      final Field modelField = clazz.getDeclaredField("MODEL$");
+      modelField.setAccessible(true);
+
+      model = (SpecificData) modelField.get(null);
+    } catch (Exception e) {
+      return null;
+    }
+
+    try {
+      final String avroVersion = Schema.Parser.class.getPackage().getImplementationVersion();
+      // Avro 1.8 doesn't include conversions in the MODEL$ field
+      if (avroVersion.startsWith("1.8.")) {

Review Comment:
   > so I've tested 1.7 and 1.8; since 1.9 Avro has stably used the `MODEL$` field to hold all conversions, so I feel reasonably confident about relying on this. If that changes, we'll catch it in the new unit tests +1
   This sounds perfect to me. Thanks a lot for the additional work!
   
   > If you want, I can surround invocations of `getModelForSchema` in a try/catch (in `AvroReadSupport`/`AvroWriteSupport`), and just use the default SpecificDataSupplier if they throw anything. That way any unexpected behavior would just result in logical types not being used.
   Yes, I think this fallback mechanism sounds reasonable to me.
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] clairemcginty commented on pull request #1078: PARQUET-2292: Default SpecificRecord model reflects from MODEL$ field

Posted by "clairemcginty (via GitHub)" <gi...@apache.org>.
clairemcginty commented on PR #1078:
URL: https://github.com/apache/parquet-mr/pull/1078#issuecomment-1535331312

   thanks for the review @gszadovszky! very excited to start using this in Parquet :) 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] Fokko commented on pull request #1078: PARQUET-2292: Default SpecificRecord model reflects from MODEL$ field

Posted by "Fokko (via GitHub)" <gi...@apache.org>.
Fokko commented on PR #1078:
URL: https://github.com/apache/parquet-mr/pull/1078#issuecomment-1536851987

   Thanks for pining me here @wgtmac I just replied on the dev-list. I think we can include this one.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] clairemcginty commented on pull request #1078: PARQUET-2292: Default SpecificRecord model reflects from MODEL$ field

Posted by "clairemcginty (via GitHub)" <gi...@apache.org>.
clairemcginty commented on PR #1078:
URL: https://github.com/apache/parquet-mr/pull/1078#issuecomment-1536169659

   @gszadovszky I wasn't previously subscribed to the dev mailing list, so I just subscribed -- I think I need to wait until someone pings the thread for it to land in my inbox so I can reply :) 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] clairemcginty commented on pull request #1078: PARQUET-2292: Default SpecificRecord model reflects from MODEL$ field

Posted by "clairemcginty (via GitHub)" <gi...@apache.org>.
clairemcginty commented on PR #1078:
URL: https://github.com/apache/parquet-mr/pull/1078#issuecomment-1535392207

   Oh no, I'm sorry I missed the boat! 😭 Is there anything that I (or my organization) could do to help out with the release, @wgtmac? If it helps at all, having this fix released would drastically help us drive Parquet adoption, and lead to more OSS contributions down the line :) 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] clairemcginty commented on a diff in pull request #1078: PARQUET-2292: Default SpecificRecord model reflects from MODEL$ field

Posted by "clairemcginty (via GitHub)" <gi...@apache.org>.
clairemcginty commented on code in PR #1078:
URL: https://github.com/apache/parquet-mr/pull/1078#discussion_r1184991179


##########
parquet-avro/pom.xml:
##########
@@ -105,6 +110,30 @@
       <type>test-jar</type>
       <scope>test</scope>
     </dependency>
+    <dependency>
+      <groupId>org.mockito</groupId>
+      <artifactId>mockito-core</artifactId>
+      <version>2.23.0</version>

Review Comment:
   It should be, but it does require an artifact migration in a few modules (parquet-column and parquet-hadoop depend on mockito-all, which has moved to mockito-core in 2.x). I would be happy to do it, just not sure if it's in scope for this PR :) 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] wgtmac commented on a diff in pull request #1078: PARQUET-2292: Default SpecificRecord model reflects from MODEL$ field

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #1078:
URL: https://github.com/apache/parquet-mr/pull/1078#discussion_r1174720814


##########
parquet-avro/src/main/java/org/apache/parquet/avro/AvroWriteSupport.java:
##########
@@ -400,9 +400,18 @@ private Binary fromAvroString(Object value) {
     return Binary.fromCharSequence(value.toString());
   }
 
-  private static GenericData getDataModel(Configuration conf) {
+  private static GenericData getDataModel(Configuration conf, Schema schema) {
+    if (conf.get(AVRO_DATA_SUPPLIER) == null) {
+      final GenericData modelForSchema = AvroRecordConverter.getModelForSchema(schema);

Review Comment:
   What if `schema == null`?



##########
parquet-avro/src/main/java/org/apache/parquet/avro/AvroRecordConverter.java:
##########
@@ -169,6 +172,46 @@ public void add(Object value) {
     }
   }
 
+  /**
+   * Returns the specific data model for a given SpecificRecord schema by reflecting the underlying
+   * Avro class's `MODEL$` field, or Null if the class is not on the classpath or reflection fails.
+   */
+  static SpecificData getModelForSchema(Schema schema) {
+    final Class<?> clazz;
+
+    if (schema != null && (schema.getType() == Schema.Type.RECORD || schema.getType() == Schema.Type.UNION)) {
+      clazz = SpecificData.get().getClass(schema);
+    } else {
+      return null;
+    }
+
+    final SpecificData model;
+    try {
+      final Field modelField = clazz.getDeclaredField("MODEL$");
+      modelField.setAccessible(true);
+
+      model = (SpecificData) modelField.get(null);

Review Comment:
   Should it return if model is set successfully?



##########
parquet-avro/src/main/java/org/apache/parquet/avro/AvroReadSupport.java:
##########
@@ -149,12 +149,22 @@ private static <T> RecordMaterializer<T> newCompatMaterializer(
         parquetSchema, avroSchema, model);
   }
 
-  private GenericData getDataModel(Configuration conf) {
+  private GenericData getDataModel(Configuration conf, Schema schema) {
     if (model != null) {
       return model;
     }
+
+    if (conf.get(AVRO_DATA_SUPPLIER) == null) {
+      final GenericData modelForSchema = AvroRecordConverter.getModelForSchema(schema);
+
+      if (modelForSchema != null) {
+        return modelForSchema;
+      }
+    }
+
     Class<? extends AvroDataSupplier> suppClass = conf.getClass(
-        AVRO_DATA_SUPPLIER, SpecificDataSupplier.class, AvroDataSupplier.class);
-    return ReflectionUtils.newInstance(suppClass, conf).get();
+      AVRO_DATA_SUPPLIER, SpecificDataSupplier.class, AvroDataSupplier.class);
+
+      return ReflectionUtils.newInstance(suppClass, conf).get();

Review Comment:
   Could you please revert these format changes?



##########
parquet-avro/src/main/java/org/apache/parquet/avro/AvroRecordConverter.java:
##########
@@ -169,6 +172,46 @@ public void add(Object value) {
     }
   }
 
+  /**
+   * Returns the specific data model for a given SpecificRecord schema by reflecting the underlying
+   * Avro class's `MODEL$` field, or Null if the class is not on the classpath or reflection fails.
+   */
+  static SpecificData getModelForSchema(Schema schema) {

Review Comment:
   @gszadovszky Could you help review this? I am not that familiar with Avro.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] gszadovszky commented on a diff in pull request #1078: PARQUET-2292: Default SpecificRecord model reflects from MODEL$ field

Posted by "gszadovszky (via GitHub)" <gi...@apache.org>.
gszadovszky commented on code in PR #1078:
URL: https://github.com/apache/parquet-mr/pull/1078#discussion_r1182232703


##########
parquet-avro/src/main/java/org/apache/parquet/avro/AvroRecordConverter.java:
##########
@@ -169,6 +172,46 @@ public void add(Object value) {
     }
   }
 
+  /**
+   * Returns the specific data model for a given SpecificRecord schema by reflecting the underlying
+   * Avro class's `MODEL$` field, or Null if the class is not on the classpath or reflection fails.
+   */
+  static SpecificData getModelForSchema(Schema schema) {
+    final Class<?> clazz;
+
+    if (schema != null && (schema.getType() == Schema.Type.RECORD || schema.getType() == Schema.Type.UNION)) {
+      clazz = SpecificData.get().getClass(schema);
+    } else {
+      return null;
+    }
+
+    final SpecificData model;
+    try {
+      final Field modelField = clazz.getDeclaredField("MODEL$");
+      modelField.setAccessible(true);
+
+      model = (SpecificData) modelField.get(null);
+    } catch (Exception e) {
+      return null;
+    }
+
+    try {
+      final String avroVersion = Schema.Parser.class.getPackage().getImplementationVersion();
+      // Avro 1.8 doesn't include conversions in the MODEL$ field
+      if (avroVersion.startsWith("1.8.")) {
+        final Field conversionsField = clazz.getDeclaredField("conversions");
+        conversionsField.setAccessible(true);
+
+        final Conversion<?>[] conversions = (Conversion<?>[]) conversionsField.get(null);
+        Arrays.stream(conversions).filter(Objects::nonNull).forEach(model::addLogicalTypeConversion);
+      }
+    } catch (Exception e) {

Review Comment:
   Similar to the previous comment about logging and catching specific exceptions.



##########
parquet-avro/src/test/java/org/apache/parquet/avro/TestSpecificReadWrite.java:
##########
@@ -237,6 +242,43 @@ public void testAvroReadSchema() throws IOException {
     }
   }
 
+  @Test

Review Comment:
   We need to test both of `getModelForSchema` related to the avro version. If the version check gets more complicated, maybe more versions are to cover.



##########
parquet-avro/src/main/java/org/apache/parquet/avro/AvroRecordConverter.java:
##########
@@ -169,6 +172,46 @@ public void add(Object value) {
     }
   }
 
+  /**
+   * Returns the specific data model for a given SpecificRecord schema by reflecting the underlying
+   * Avro class's `MODEL$` field, or Null if the class is not on the classpath or reflection fails.
+   */
+  static SpecificData getModelForSchema(Schema schema) {

Review Comment:
   TBH I do not have a strong opinion on any. I am fine with the current one if it works.



##########
parquet-avro/src/main/java/org/apache/parquet/avro/AvroRecordConverter.java:
##########
@@ -169,6 +172,46 @@ public void add(Object value) {
     }
   }
 
+  /**
+   * Returns the specific data model for a given SpecificRecord schema by reflecting the underlying
+   * Avro class's `MODEL$` field, or Null if the class is not on the classpath or reflection fails.
+   */
+  static SpecificData getModelForSchema(Schema schema) {
+    final Class<?> clazz;
+
+    if (schema != null && (schema.getType() == Schema.Type.RECORD || schema.getType() == Schema.Type.UNION)) {
+      clazz = SpecificData.get().getClass(schema);
+    } else {
+      return null;
+    }
+
+    final SpecificData model;
+    try {
+      final Field modelField = clazz.getDeclaredField("MODEL$");
+      modelField.setAccessible(true);
+
+      model = (SpecificData) modelField.get(null);
+    } catch (Exception e) {
+      return null;
+    }
+
+    try {
+      final String avroVersion = Schema.Parser.class.getPackage().getImplementationVersion();
+      // Avro 1.8 doesn't include conversions in the MODEL$ field
+      if (avroVersion.startsWith("1.8.")) {

Review Comment:
   Since we are using reflections on private members there are no compatibility guarantees. We shall be very careful here. What about avro versions prior to 1.8? Also, what if it breaks in the future? Will the related unit test fail for a future Avro releases (in case of upgrading the Avro version in the pom)?



##########
parquet-avro/src/main/java/org/apache/parquet/avro/AvroRecordConverter.java:
##########
@@ -169,6 +172,46 @@ public void add(Object value) {
     }
   }
 
+  /**
+   * Returns the specific data model for a given SpecificRecord schema by reflecting the underlying
+   * Avro class's `MODEL$` field, or Null if the class is not on the classpath or reflection fails.
+   */
+  static SpecificData getModelForSchema(Schema schema) {
+    final Class<?> clazz;
+
+    if (schema != null && (schema.getType() == Schema.Type.RECORD || schema.getType() == Schema.Type.UNION)) {
+      clazz = SpecificData.get().getClass(schema);
+    } else {
+      return null;
+    }
+
+    final SpecificData model;
+    try {
+      final Field modelField = clazz.getDeclaredField("MODEL$");
+      modelField.setAccessible(true);
+
+      model = (SpecificData) modelField.get(null);
+    } catch (Exception e) {

Review Comment:
   I think, we should log this exception. I would be also nice to use specific exceptions instead of catching everything. WDYT?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] wgtmac commented on a diff in pull request #1078: PARQUET-2292: Default SpecificRecord model reflects from MODEL$ field

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #1078:
URL: https://github.com/apache/parquet-mr/pull/1078#discussion_r1184468696


##########
parquet-avro/pom.xml:
##########
@@ -105,6 +110,30 @@
       <type>test-jar</type>
       <scope>test</scope>
     </dependency>
+    <dependency>
+      <groupId>org.mockito</groupId>
+      <artifactId>mockito-core</artifactId>
+      <version>2.23.0</version>

Review Comment:
   I think this is fine. BTW, is it compatible to upgrade this in the root pom as well?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] clairemcginty commented on a diff in pull request #1078: PARQUET-2292: Default SpecificRecord model reflects from MODEL$ field

Posted by "clairemcginty (via GitHub)" <gi...@apache.org>.
clairemcginty commented on code in PR #1078:
URL: https://github.com/apache/parquet-mr/pull/1078#discussion_r1182544574


##########
parquet-avro/src/main/java/org/apache/parquet/avro/AvroRecordConverter.java:
##########
@@ -169,6 +172,46 @@ public void add(Object value) {
     }
   }
 
+  /**
+   * Returns the specific data model for a given SpecificRecord schema by reflecting the underlying
+   * Avro class's `MODEL$` field, or Null if the class is not on the classpath or reflection fails.
+   */
+  static SpecificData getModelForSchema(Schema schema) {
+    final Class<?> clazz;
+
+    if (schema != null && (schema.getType() == Schema.Type.RECORD || schema.getType() == Schema.Type.UNION)) {
+      clazz = SpecificData.get().getClass(schema);
+    } else {
+      return null;
+    }
+
+    final SpecificData model;
+    try {
+      final Field modelField = clazz.getDeclaredField("MODEL$");
+      modelField.setAccessible(true);
+
+      model = (SpecificData) modelField.get(null);
+    } catch (Exception e) {
+      return null;
+    }
+
+    try {
+      final String avroVersion = Schema.Parser.class.getPackage().getImplementationVersion();
+      // Avro 1.8 doesn't include conversions in the MODEL$ field
+      if (avroVersion.startsWith("1.8.")) {

Review Comment:
   yeah - it's unfortunate that this isn't solvable without reflection. I've manually tested this with earlier versions of Avro... let me see if I can reshape them into automated tests for `parquet-avro` 👀 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] clairemcginty commented on a diff in pull request #1078: PARQUET-2292: Default SpecificRecord model reflects from MODEL$ field

Posted by "clairemcginty (via GitHub)" <gi...@apache.org>.
clairemcginty commented on code in PR #1078:
URL: https://github.com/apache/parquet-mr/pull/1078#discussion_r1182871206


##########
parquet-avro/src/main/java/org/apache/parquet/avro/AvroRecordConverter.java:
##########
@@ -169,6 +172,46 @@ public void add(Object value) {
     }
   }
 
+  /**
+   * Returns the specific data model for a given SpecificRecord schema by reflecting the underlying
+   * Avro class's `MODEL$` field, or Null if the class is not on the classpath or reflection fails.
+   */
+  static SpecificData getModelForSchema(Schema schema) {
+    final Class<?> clazz;
+
+    if (schema != null && (schema.getType() == Schema.Type.RECORD || schema.getType() == Schema.Type.UNION)) {
+      clazz = SpecificData.get().getClass(schema);
+    } else {
+      return null;
+    }
+
+    final SpecificData model;
+    try {
+      final Field modelField = clazz.getDeclaredField("MODEL$");
+      modelField.setAccessible(true);
+
+      model = (SpecificData) modelField.get(null);
+    } catch (Exception e) {
+      return null;
+    }
+
+    try {
+      final String avroVersion = Schema.Parser.class.getPackage().getImplementationVersion();
+      // Avro 1.8 doesn't include conversions in the MODEL$ field
+      if (avroVersion.startsWith("1.8.")) {

Review Comment:
   > Since we are using reflections on private members there are no compatibility guarantees. We shall be very careful here. What about avro versions prior to 1.8? Also, what if it breaks in the future? Will the related unit test fail for a future Avro releases (in case of upgrading the Avro version in the pom)?
   
   so I've tested 1.7 and 1.8; since 1.9 Avro has stably used the `MODEL$` field to hold all conversions, so I feel reasonably confident about relying on this. If that changes, we'll catch it in the new unit tests 👍 
   
   If you want, I can surround invocations of `getModelForSchema` in a try/catch (in `AvroReadSupport`/`AvroWriteSupport`), and just use the default SpecificDataSupplier if they throw anything. That way any unexpected behavior would just result in logical types not being used.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] clairemcginty commented on a diff in pull request #1078: PARQUET-2292: Default SpecificRecord model reflects from MODEL$ field

Posted by "clairemcginty (via GitHub)" <gi...@apache.org>.
clairemcginty commented on code in PR #1078:
URL: https://github.com/apache/parquet-mr/pull/1078#discussion_r1182943149


##########
parquet-avro/pom.xml:
##########
@@ -105,6 +110,30 @@
       <type>test-jar</type>
       <scope>test</scope>
     </dependency>
+    <dependency>
+      <groupId>org.mockito</groupId>
+      <artifactId>mockito-core</artifactId>
+      <version>2.23.0</version>

Review Comment:
   the declared `mockito.version` in the root `pom.xml`, `1.10.19`, is incompatible with Powermock 2.0.x.... but I can't downgrade Powermock to 1.x without the test throwing some scary objenesis errors, so I think Powermock 2.x is the lowest we can go.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] gszadovszky commented on a diff in pull request #1078: PARQUET-2292: Default SpecificRecord model reflects from MODEL$ field

Posted by "gszadovszky (via GitHub)" <gi...@apache.org>.
gszadovszky commented on code in PR #1078:
URL: https://github.com/apache/parquet-mr/pull/1078#discussion_r1185124781


##########
parquet-avro/src/main/java/org/apache/parquet/avro/AvroRecordConverter.java:
##########
@@ -169,6 +172,46 @@ public void add(Object value) {
     }
   }
 
+  /**
+   * Returns the specific data model for a given SpecificRecord schema by reflecting the underlying
+   * Avro class's `MODEL$` field, or Null if the class is not on the classpath or reflection fails.
+   */
+  static SpecificData getModelForSchema(Schema schema) {
+    final Class<?> clazz;
+
+    if (schema != null && (schema.getType() == Schema.Type.RECORD || schema.getType() == Schema.Type.UNION)) {
+      clazz = SpecificData.get().getClass(schema);
+    } else {
+      return null;
+    }
+
+    final SpecificData model;
+    try {
+      final Field modelField = clazz.getDeclaredField("MODEL$");
+      modelField.setAccessible(true);
+
+      model = (SpecificData) modelField.get(null);
+    } catch (Exception e) {
+      return null;
+    }
+
+    try {
+      final String avroVersion = Schema.Parser.class.getPackage().getImplementationVersion();
+      // Avro 1.8 doesn't include conversions in the MODEL$ field
+      if (avroVersion.startsWith("1.8.")) {

Review Comment:
   > so I've tested 1.7 and 1.8; since 1.9 Avro has stably used the `MODEL$` field to hold all conversions, so I feel reasonably confident about relying on this. If that changes, we'll catch it in the new unit tests +1
   
   This sounds perfect to me. Thanks a lot for the additional work!
   
   > If you want, I can surround invocations of `getModelForSchema` in a try/catch (in `AvroReadSupport`/`AvroWriteSupport`), and just use the default SpecificDataSupplier if they throw anything. That way any unexpected behavior would just result in logical types not being used.
   
   Yes, I think this fallback mechanism sounds reasonable to me.
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] clairemcginty commented on a diff in pull request #1078: PARQUET-2292: Default SpecificRecord model reflects from MODEL$ field

Posted by "clairemcginty (via GitHub)" <gi...@apache.org>.
clairemcginty commented on code in PR #1078:
URL: https://github.com/apache/parquet-mr/pull/1078#discussion_r1175253810


##########
parquet-avro/src/main/java/org/apache/parquet/avro/AvroWriteSupport.java:
##########
@@ -400,9 +400,18 @@ private Binary fromAvroString(Object value) {
     return Binary.fromCharSequence(value.toString());
   }
 
-  private static GenericData getDataModel(Configuration conf) {
+  private static GenericData getDataModel(Configuration conf, Schema schema) {
+    if (conf.get(AVRO_DATA_SUPPLIER) == null) {
+      final GenericData modelForSchema = AvroRecordConverter.getModelForSchema(schema);

Review Comment:
   From where it's called in the code, `schema` should never be null -- but I guess it would be safest to add a null-check here to future-proof it a bit. Added!



##########
parquet-avro/src/main/java/org/apache/parquet/avro/AvroReadSupport.java:
##########
@@ -149,12 +149,22 @@ private static <T> RecordMaterializer<T> newCompatMaterializer(
         parquetSchema, avroSchema, model);
   }
 
-  private GenericData getDataModel(Configuration conf) {
+  private GenericData getDataModel(Configuration conf, Schema schema) {
     if (model != null) {
       return model;
     }
+
+    if (conf.get(AVRO_DATA_SUPPLIER) == null) {
+      final GenericData modelForSchema = AvroRecordConverter.getModelForSchema(schema);
+
+      if (modelForSchema != null) {
+        return modelForSchema;
+      }
+    }
+
     Class<? extends AvroDataSupplier> suppClass = conf.getClass(
-        AVRO_DATA_SUPPLIER, SpecificDataSupplier.class, AvroDataSupplier.class);
-    return ReflectionUtils.newInstance(suppClass, conf).get();
+      AVRO_DATA_SUPPLIER, SpecificDataSupplier.class, AvroDataSupplier.class);
+
+      return ReflectionUtils.newInstance(suppClass, conf).get();

Review Comment:
   done, sorry about that!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] clairemcginty commented on a diff in pull request #1078: PARQUET-2292: Default SpecificRecord model reflects from MODEL$ field

Posted by "clairemcginty (via GitHub)" <gi...@apache.org>.
clairemcginty commented on code in PR #1078:
URL: https://github.com/apache/parquet-mr/pull/1078#discussion_r1174106674


##########
parquet-avro/src/main/java/org/apache/parquet/avro/AvroRecordConverter.java:
##########
@@ -169,6 +172,46 @@ public void add(Object value) {
     }
   }
 
+  /**
+   * Returns the specific data model for a given SpecificRecord schema by reflecting the underlying
+   * Avro class's `MODEL$` field, or Null if the class is not on the classpath or reflection fails.
+   */
+  static SpecificData getModelForSchema(Schema schema) {

Review Comment:
   The reflection logic here is all based on the existing Avro behavior for its model cache: https://github.com/apache/avro/blob/release-1.11.1/lang/java/avro/src/main/java/org/apache/avro/specific/SpecificData.java#L76-L86



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org