You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@drill.apache.org by GitBox <gi...@apache.org> on 2021/10/27 09:24:28 UTC

[GitHub] [drill] dzamo opened a new pull request #2351: DRILL1282: Move parquet to use v2 format as default

dzamo opened a new pull request #2351:
URL: https://github.com/apache/drill/pull/2351


   # [DRILL1282](https://issues.apache.org/jira/browse/DRILL-1282): Move parquet to use v2 format as default
   
   ## Description
   
   In this second and final PR, Parquet v2 write support is added along with an option and v2 is selected by default.
   
   ## Documentation
   New option includes a description.  General information about Parquet versions and codecs to be added to docs.
   
   ## Testing
   New unit test `TestParquetWriter#testTPCHReadWriteFormatV2`
   
   Interactive test of CTAS, query, metadata inspection from PyArrow.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] dzamo commented on a change in pull request #2351: DRILL-1282: Move parquet to use v2 format as default

Posted by GitBox <gi...@apache.org>.
dzamo commented on a change in pull request #2351:
URL: https://github.com/apache/drill/pull/2351#discussion_r751941469



##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetRecordWriter.java
##########
@@ -263,20 +269,29 @@ private void newSchema() throws IOException {
     // We don't want this number to be too small either. Ideally, slightly bigger than the page size,
     // but not bigger than the block buffer
     int initialPageBufferSize = max(MINIMUM_BUFFER_SIZE, min(pageSize + pageSize / 10, initialBlockBufferSize));
+    ValuesWriterFactory valWriterFactory = writerVersion == WriterVersion.PARQUET_1_0
+      ? new DefaultV1ValuesWriterFactory()
+      : new DefaultV2ValuesWriterFactory();
+
     ParquetProperties parquetProperties = ParquetProperties.builder()
         .withPageSize(pageSize)
         .withDictionaryEncoding(enableDictionary)
         .withDictionaryPageSize(initialPageBufferSize)
-        .withWriterVersion(writerVersion)
         .withAllocator(new ParquetDirectByteBufferAllocator(oContext))
-        .withValuesWriterFactory(new DefaultV1ValuesWriterFactory())
+        .withValuesWriterFactory(valWriterFactory)
+        .withWriterVersion(writerVersion)
         .build();
+
     // TODO: Replace ParquetColumnChunkPageWriteStore with ColumnChunkPageWriteStore from parquet library
     //   once DRILL-7906 (PARQUET-1006) will be resolved
     pageStore = new ParquetColumnChunkPageWriteStore(codecFactory.getCompressor(codec), schema,
             parquetProperties.getInitialSlabSize(), pageSize, parquetProperties.getAllocator(),
             parquetProperties.getColumnIndexTruncateLength(), parquetProperties.getPageWriteChecksumEnabled());
-    store = new ColumnWriteStoreV1(pageStore, parquetProperties);
+
+    store = writerVersion == WriterVersion.PARQUET_1_0

Review comment:
       @vdiravka do you mean a switch statement could be easier to read than this new ternary conditional?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] vdiravka commented on a change in pull request #2351: DRILL1282: Move parquet to use v2 format as default

Posted by GitBox <gi...@apache.org>.
vdiravka commented on a change in pull request #2351:
URL: https://github.com/apache/drill/pull/2351#discussion_r738450978



##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetFormatPlugin.java
##########
@@ -167,6 +167,10 @@ public RecordWriter getRecordWriter(FragmentContext context, ParquetWriter write
     options.put(ExecConstants.PARQUET_WRITER_USE_PRIMITIVE_TYPES_FOR_DECIMALS,
         context.getOptions().getOption(ExecConstants.PARQUET_WRITER_USE_PRIMITIVE_TYPES_FOR_DECIMALS).bool_val.toString());
 
+    options.put(
+      ExecConstants.PARQUET_WRITER_FORMAT_VERSION,
+      context.getOptions().getOption(ExecConstants.PARQUET_WRITER_FORMAT_VERSION).string_val);

Review comment:
       ```suggestion
   context.getOptions().getOption(ExecConstants.PARQUET_WRITER_FORMAT_VERSION).getValue());
   ```
   

##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/ExecConstants.java
##########
@@ -375,6 +375,12 @@ private ExecConstants() {
   public static final OptionValidator PARQUET_WRITER_LOGICAL_TYPE_FOR_DECIMALS_VALIDATOR = new EnumeratedStringValidator(PARQUET_WRITER_LOGICAL_TYPE_FOR_DECIMALS,
       new OptionDescription("Parquet writer logical type for decimal; supported types \'fixed_len_byte_array\' and \'binary\'"),
       "fixed_len_byte_array", "binary");
+  public static final String PARQUET_WRITER_FORMAT_VERSION = "store.parquet.writer.format_version";
+  public static final OptionValidator PARQUET_WRITER_FORMAT_VERSION_VALIDATOR = new EnumeratedStringValidator(
+    PARQUET_WRITER_FORMAT_VERSION,
+    new OptionDescription("Parquet format version used for storing Parquet output.  Allowed values: PARQUET_1_0, PARQUET_2_0"),
+    "PARQUET_1_0", "PARQUET_2_0"

Review comment:
       Optionally you can create String array of parquet versions (possibly in `ParquetFormatPlugin`) and use it for description and `EnumeratedStringValidator` values. Also (minor) consider adding short description.
   
   ```
   public static final String[] parquetVersions= {"PARQUET_1_0", "PARQUET_2_0"};
   ```
   ```
   public static final OptionValidator PARQUET_WRITER_FORMAT_VERSION_VALIDATOR = new EnumeratedStringValidator(
       PARQUET_WRITER_FORMAT_VERSION,
       new OptionDescription("Parquet format version used for storing Parquet output.  Allowed values:" + Arrays.toString(parquetVersions)), "Parquet format version",
       parquetVersions);
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] dzamo commented on a change in pull request #2351: DRILL1282: Move parquet to use v2 format as default

Posted by GitBox <gi...@apache.org>.
dzamo commented on a change in pull request #2351:
URL: https://github.com/apache/drill/pull/2351#discussion_r738475956



##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/ExecConstants.java
##########
@@ -375,6 +375,12 @@ private ExecConstants() {
   public static final OptionValidator PARQUET_WRITER_LOGICAL_TYPE_FOR_DECIMALS_VALIDATOR = new EnumeratedStringValidator(PARQUET_WRITER_LOGICAL_TYPE_FOR_DECIMALS,
       new OptionDescription("Parquet writer logical type for decimal; supported types \'fixed_len_byte_array\' and \'binary\'"),
       "fixed_len_byte_array", "binary");
+  public static final String PARQUET_WRITER_FORMAT_VERSION = "store.parquet.writer.format_version";
+  public static final OptionValidator PARQUET_WRITER_FORMAT_VERSION_VALIDATOR = new EnumeratedStringValidator(
+    PARQUET_WRITER_FORMAT_VERSION,
+    new OptionDescription("Parquet format version used for storing Parquet output.  Allowed values: PARQUET_1_0, PARQUET_2_0"),
+    "PARQUET_1_0", "PARQUET_2_0"

Review comment:
       @vdiravka nice, thanks!  I'll do this...
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] dzamo commented on a change in pull request #2351: DRILL-1282: Add read and write support for Parquet v2

Posted by GitBox <gi...@apache.org>.
dzamo commented on a change in pull request #2351:
URL: https://github.com/apache/drill/pull/2351#discussion_r752001140



##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetFormatPlugin.java
##########
@@ -78,6 +80,8 @@
 
 public class ParquetFormatPlugin implements FormatPlugin {
 
+  public static final String[] PARQUET_VERSIONS = {"PARQUET_1_0", "PARQUET_2_0"};

Review comment:
       @vdiravka I followed `ParquetProperties#WriterVersion`, that's where `PARQUET_1_0` and `PARQUET_2_0` come from.  I agree it's clunky, but on the other hand I did not have introduce any new version format strings or case statements.  Which do you think is preferable?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] vdiravka commented on a change in pull request #2351: DRILL-1282: Add read and write support for Parquet v2

Posted by GitBox <gi...@apache.org>.
vdiravka commented on a change in pull request #2351:
URL: https://github.com/apache/drill/pull/2351#discussion_r753613660



##########
File path: exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetWriterConfig.java
##########
@@ -0,0 +1,67 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.parquet;
+
+import org.apache.drill.exec.ExecConstants;
+import org.apache.drill.test.ClusterFixture;
+import org.apache.drill.test.ClusterFixtureBuilder;
+import org.apache.drill.test.ClusterTest;
+import org.junit.BeforeClass;
+import org.junit.Test;
+
+import java.nio.file.Paths;
+
+public class TestParquetWriterConfig extends ClusterTest {

Review comment:
       Not sure about format config in query plan, but there are a lot of test cases with checking patterns in the plan.
   They are based on the old `PlanTestBase#testPlanMatchingPatterns` method. For instance `TestPlanVerificationUtilities` has only one test case `testPlanVerifier`. You can change this one or add smth similar

##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetFormatPlugin.java
##########
@@ -78,6 +80,8 @@
 
 public class ParquetFormatPlugin implements FormatPlugin {
 
+  public static final String[] PARQUET_VERSIONS = {"PARQUET_1_0", "PARQUET_2_0"};

Review comment:
       You can use string values from `ParquetProperties#WriterVersion` enum for `PARQUET_VERSIONS`, so they will be tied up.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] dzamo commented on a change in pull request #2351: DRILL-1282: Add read and write support for Parquet v2

Posted by GitBox <gi...@apache.org>.
dzamo commented on a change in pull request #2351:
URL: https://github.com/apache/drill/pull/2351#discussion_r754050973



##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetFormatPlugin.java
##########
@@ -141,40 +145,94 @@ public AbstractWriter getWriter(PhysicalOperator child, String location, List<St
   }
 
   public RecordWriter getRecordWriter(FragmentContext context, ParquetWriter writer) throws IOException, OutOfMemoryException {
-    Map<String, String> options = new HashMap<>();
+    Map<String, String> writerOpts = new HashMap<>();
+    OptionManager contextOpts = context.getOptions();
 
-    options.put("location", writer.getLocation());
+    writerOpts.put("location", writer.getLocation());
 
     FragmentHandle handle = context.getHandle();
     String fragmentId = String.format("%d_%d", handle.getMajorFragmentId(), handle.getMinorFragmentId());
-    options.put("prefix", fragmentId);
-
-    options.put(ExecConstants.PARQUET_BLOCK_SIZE, context.getOptions().getOption(ExecConstants.PARQUET_BLOCK_SIZE).num_val.toString());
-    options.put(ExecConstants.PARQUET_WRITER_USE_SINGLE_FS_BLOCK,
-      context.getOptions().getOption(ExecConstants.PARQUET_WRITER_USE_SINGLE_FS_BLOCK).bool_val.toString());
-    options.put(ExecConstants.PARQUET_PAGE_SIZE, context.getOptions().getOption(ExecConstants.PARQUET_PAGE_SIZE).num_val.toString());
-    options.put(ExecConstants.PARQUET_DICT_PAGE_SIZE, context.getOptions().getOption(ExecConstants.PARQUET_DICT_PAGE_SIZE).num_val.toString());
-
-    options.put(ExecConstants.PARQUET_WRITER_COMPRESSION_TYPE,
-        context.getOptions().getOption(ExecConstants.PARQUET_WRITER_COMPRESSION_TYPE).string_val);
-
-    options.put(ExecConstants.PARQUET_WRITER_ENABLE_DICTIONARY_ENCODING,
-        context.getOptions().getOption(ExecConstants.PARQUET_WRITER_ENABLE_DICTIONARY_ENCODING).bool_val.toString());
-
-    options.put(ExecConstants.PARQUET_WRITER_LOGICAL_TYPE_FOR_DECIMALS,
-        context.getOptions().getOption(ExecConstants.PARQUET_WRITER_LOGICAL_TYPE_FOR_DECIMALS).string_val);
-
-    options.put(ExecConstants.PARQUET_WRITER_USE_PRIMITIVE_TYPES_FOR_DECIMALS,
-        context.getOptions().getOption(ExecConstants.PARQUET_WRITER_USE_PRIMITIVE_TYPES_FOR_DECIMALS).bool_val.toString());
+    writerOpts.put("prefix", fragmentId);
+
+    // Many options which follow may be set as Drill config options or in the parquet format
+    // plugin config.  If there is a Drill option set at session scope or narrower it takes precendence.
+    OptionValue.OptionScope minScope = OptionValue.OptionScope.SESSION;

Review comment:
       @vdiravka Okay let's discuss a design for this for a new PR.  I guess it would impact all the existing format plugins, and how they  access format configs.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] vdiravka commented on pull request #2351: DRILL1282: Move parquet to use v2 format as default

Posted by GitBox <gi...@apache.org>.
vdiravka commented on pull request #2351:
URL: https://github.com/apache/drill/pull/2351#issuecomment-958960612


   Could you solve check-style errors to proceed with review?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] dzamo commented on pull request #2351: DRILL1282: Move parquet to use v2 format as default

Posted by GitBox <gi...@apache.org>.
dzamo commented on pull request #2351:
URL: https://github.com/apache/drill/pull/2351#issuecomment-958961461


   @
   
   > Could you solve check-style errors to proceed with review?
   
   Yes I did, now just adding the format config option that Vova suggested.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] lgtm-com[bot] commented on pull request #2351: DRILL-1282: Add read and write support for Parquet v2

Posted by GitBox <gi...@apache.org>.
lgtm-com[bot] commented on pull request #2351:
URL: https://github.com/apache/drill/pull/2351#issuecomment-975400967


   This pull request **fixes 1 alert** when merging 021be9a7e0d47c2e97ee5e2a6d9da9187cd4b794 into baf698f63a1e549346fba59254472995e0f7be16 - [view on LGTM.com](https://lgtm.com/projects/g/apache/drill/rev/pr-78a017446219374fcc2f484dbb2cec6b1aafebff)
   
   **fixed alerts:**
   
   * 1 for Unused format argument


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] vdiravka commented on a change in pull request #2351: DRILL-1282: Move parquet to use v2 format as default

Posted by GitBox <gi...@apache.org>.
vdiravka commented on a change in pull request #2351:
URL: https://github.com/apache/drill/pull/2351#discussion_r751467675



##########
File path: exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/writer/TestParquetWriter.java
##########
@@ -963,6 +963,17 @@ private void compareParquetInt96Converters(String selection, String table) throw
     }
   }
 
+  @Test
+  public void testTPCHReadWriteFormatV2() throws Exception {
+    try {
+      alterSession(ExecConstants.PARQUET_WRITER_FORMAT_VERSION, "parquet_2_0");

Review comment:
       Do we need enabling it here, if the aim of the PR is enabling V2 by default?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] dzamo commented on a change in pull request #2351: DRILL-1282: Add read and write support for Parquet v2

Posted by GitBox <gi...@apache.org>.
dzamo commented on a change in pull request #2351:
URL: https://github.com/apache/drill/pull/2351#discussion_r754042756



##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderConfig.java
##########
@@ -186,11 +187,13 @@ public ParquetReaderConfig build() {
         readerConfig.enableTimeReadCounter = conf.getBoolean(ENABLE_TIME_READ_COUNTER, readerConfig.enableTimeReadCounter);
       }
 
-      // last assign values from session options, session options have higher priority than other configurations
+      // last assign values from session or query scoped options which have higher priority than other configurations
       if (options != null) {
-        String option = options.getOption(ExecConstants.PARQUET_READER_STRINGS_SIGNED_MIN_MAX_VALIDATOR);
-        if (!option.isEmpty()) {
-          readerConfig.enableStringsSignedMinMax = Boolean.valueOf(option);
+        String optVal  = (String) options.getOption(
+          ExecConstants.PARQUET_READER_STRINGS_SIGNED_MIN_MAX
+        ).getValueMinScope(OptionValue.OptionScope.SESSION);

Review comment:
       @vdiravka The idea for `getValueMinScope` is that it returns the value of an option only if it is defined at scope as least as narrow as the passed-in minimum.  If you call it with `SESSION` as the minimum while the identified option only has a value in `SYSTEM` or `BOOT` then null is returned.  This is useful for implementing the option priority QUERY > SESSION > FORMAT > SYSTEM




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] vdiravka commented on a change in pull request #2351: DRILL-1282: Add read and write support for Parquet v2

Posted by GitBox <gi...@apache.org>.
vdiravka commented on a change in pull request #2351:
URL: https://github.com/apache/drill/pull/2351#discussion_r755060172



##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderConfig.java
##########
@@ -186,11 +187,13 @@ public ParquetReaderConfig build() {
         readerConfig.enableTimeReadCounter = conf.getBoolean(ENABLE_TIME_READ_COUNTER, readerConfig.enableTimeReadCounter);
       }
 
-      // last assign values from session options, session options have higher priority than other configurations
+      // last assign values from session or query scoped options which have higher priority than other configurations
       if (options != null) {
-        String option = options.getOption(ExecConstants.PARQUET_READER_STRINGS_SIGNED_MIN_MAX_VALIDATOR);
-        if (!option.isEmpty()) {
-          readerConfig.enableStringsSignedMinMax = Boolean.valueOf(option);
+        String optVal  = (String) options.getOption(
+          ExecConstants.PARQUET_READER_STRINGS_SIGNED_MIN_MAX
+        ).getValueMinScope(OptionValue.OptionScope.SESSION);

Review comment:
       Sounds reasonable. Possibly this logic will be migrated to `EffectiveConfigResolver` eventually

##########
File path: exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/writer/TestParquetWriter.java
##########
@@ -137,14 +140,14 @@ public static void setupTestFiles() {
   public int repeat = 1;
 
   @BeforeClass

Review comment:
       For sure we can still use JUnit4 here, but interested have you tries JUnit5? 
   I suppose just updating annotations is required for that, for instance `@BeforeClass` -> `@BeforeAll`..

##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetFormatPlugin.java
##########
@@ -141,40 +145,94 @@ public AbstractWriter getWriter(PhysicalOperator child, String location, List<St
   }
 
   public RecordWriter getRecordWriter(FragmentContext context, ParquetWriter writer) throws IOException, OutOfMemoryException {
-    Map<String, String> options = new HashMap<>();
+    Map<String, String> writerOpts = new HashMap<>();
+    OptionManager contextOpts = context.getOptions();
 
-    options.put("location", writer.getLocation());
+    writerOpts.put("location", writer.getLocation());
 
     FragmentHandle handle = context.getHandle();
     String fragmentId = String.format("%d_%d", handle.getMajorFragmentId(), handle.getMinorFragmentId());
-    options.put("prefix", fragmentId);
-
-    options.put(ExecConstants.PARQUET_BLOCK_SIZE, context.getOptions().getOption(ExecConstants.PARQUET_BLOCK_SIZE).num_val.toString());
-    options.put(ExecConstants.PARQUET_WRITER_USE_SINGLE_FS_BLOCK,
-      context.getOptions().getOption(ExecConstants.PARQUET_WRITER_USE_SINGLE_FS_BLOCK).bool_val.toString());
-    options.put(ExecConstants.PARQUET_PAGE_SIZE, context.getOptions().getOption(ExecConstants.PARQUET_PAGE_SIZE).num_val.toString());
-    options.put(ExecConstants.PARQUET_DICT_PAGE_SIZE, context.getOptions().getOption(ExecConstants.PARQUET_DICT_PAGE_SIZE).num_val.toString());
-
-    options.put(ExecConstants.PARQUET_WRITER_COMPRESSION_TYPE,
-        context.getOptions().getOption(ExecConstants.PARQUET_WRITER_COMPRESSION_TYPE).string_val);
-
-    options.put(ExecConstants.PARQUET_WRITER_ENABLE_DICTIONARY_ENCODING,
-        context.getOptions().getOption(ExecConstants.PARQUET_WRITER_ENABLE_DICTIONARY_ENCODING).bool_val.toString());
-
-    options.put(ExecConstants.PARQUET_WRITER_LOGICAL_TYPE_FOR_DECIMALS,
-        context.getOptions().getOption(ExecConstants.PARQUET_WRITER_LOGICAL_TYPE_FOR_DECIMALS).string_val);
-
-    options.put(ExecConstants.PARQUET_WRITER_USE_PRIMITIVE_TYPES_FOR_DECIMALS,
-        context.getOptions().getOption(ExecConstants.PARQUET_WRITER_USE_PRIMITIVE_TYPES_FOR_DECIMALS).bool_val.toString());
+    writerOpts.put("prefix", fragmentId);
+
+    // Many options which follow may be set as Drill config options or in the parquet format
+    // plugin config.  If there is a Drill option set at session scope or narrower it takes precendence.
+    OptionValue.OptionScope minScope = OptionValue.OptionScope.SESSION;

Review comment:
       Sounds good. In Drill V2 we can define the motivation for using SESSION, QUERY and FORMAT options. And we can discuss whether we need to manage format config options via SESSION options.
   At least we are fully free with choosing the best behavior for Drill users in spite of backward compatibility

##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetFormatPlugin.java
##########
@@ -78,6 +81,9 @@
 
 public class ParquetFormatPlugin implements FormatPlugin {
 
+  // {@link org.apache.parquet.column.ParquetProperties#WriterVersion}

Review comment:
       It can be added via javadoc so `@link` will allow to use reference as a hyperlink in IDE




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] vdiravka commented on a change in pull request #2351: DRILL-1282: Add read and write support for Parquet v2

Posted by GitBox <gi...@apache.org>.
vdiravka commented on a change in pull request #2351:
URL: https://github.com/apache/drill/pull/2351#discussion_r755196800



##########
File path: exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/writer/TestParquetWriter.java
##########
@@ -137,14 +140,14 @@ public static void setupTestFiles() {
   public int repeat = 1;
 
   @BeforeClass

Review comment:
       That's fine. Need the separate global update of current JUnit4 test cases with JUnit5




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] dzamo commented on pull request #2351: DRILL-1282: Move parquet to use v2 format as default

Posted by GitBox <gi...@apache.org>.
dzamo commented on pull request #2351:
URL: https://github.com/apache/drill/pull/2351#issuecomment-967044407


   @vvysotskyi @vdiravka judging by that those test results, I did an incomplete job of changing `PARQUET_READER_STRINGS_SIGNED_MIN_MAX` from string to boolean.  Just ignore that for now please, since before I fix it (or revert it) I'd like to ask you some things.
   
   1. Is it possible to change an existing config option to a more restrictive type, as mentioned above, with breaking upgradeability.  I don't need to do that for my PR, I was just trying to clean up things nearby.
   2. The code to apply parquet config opts and session opts according to priority has been put in ParquetFormatPlugin.  Should I move it ParquetFormatConfig instead?  Perhaps in the individual getters?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] vdiravka commented on pull request #2351: DRILL1282: Move parquet to use v2 format as default

Posted by GitBox <gi...@apache.org>.
vdiravka commented on pull request #2351:
URL: https://github.com/apache/drill/pull/2351#issuecomment-958960612






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] vdiravka edited a comment on pull request #2351: DRILL1282: Move parquet to use v2 format as default

Posted by GitBox <gi...@apache.org>.
vdiravka edited a comment on pull request #2351:
URL: https://github.com/apache/drill/pull/2351#issuecomment-958960612


   Could you solve check-style errors to proceed with review?
   nit: noticed missing dash in ticket description


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] vvysotskyi commented on pull request #2351: DRILL1282: Move parquet to use v2 format as default

Posted by GitBox <gi...@apache.org>.
vvysotskyi commented on pull request #2351:
URL: https://github.com/apache/drill/pull/2351#issuecomment-953117856


   @dzamo, could you please add a storage format option too, so it will be more flexible. Changes for adding the system option look fine but also fix checkstyle errors.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] dzamo edited a comment on pull request #2351: DRILL-1282: Move parquet to use v2 format as default

Posted by GitBox <gi...@apache.org>.
dzamo edited a comment on pull request #2351:
URL: https://github.com/apache/drill/pull/2351#issuecomment-967044407


   @vvysotskyi @vdiravka judging by those test results, I did an incomplete job of changing `PARQUET_READER_STRINGS_SIGNED_MIN_MAX` from string to boolean.  Just ignore that for now please, since before I fix it (or revert it) I'd like to ask you some things.
   
   1. Is it possible to change an existing config option to a more restrictive type, as mentioned above, without breaking upgradeability.  I don't need to do that for my PR, I was just trying to clean up things nearby.
   2. The code to apply parquet config opts and session opts according to priority has been put in ParquetFormatPlugin.  Should I move it ParquetFormatConfig instead?  Perhaps in the individual getters?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] dzamo commented on a change in pull request #2351: DRILL-1282: Add read and write support for Parquet v2

Posted by GitBox <gi...@apache.org>.
dzamo commented on a change in pull request #2351:
URL: https://github.com/apache/drill/pull/2351#discussion_r754090152



##########
File path: exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetWriterConfig.java
##########
@@ -0,0 +1,67 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.parquet;
+
+import org.apache.drill.exec.ExecConstants;
+import org.apache.drill.test.ClusterFixture;
+import org.apache.drill.test.ClusterFixtureBuilder;
+import org.apache.drill.test.ClusterTest;
+import org.junit.BeforeClass;
+import org.junit.Test;
+
+import java.nio.file.Paths;
+
+public class TestParquetWriterConfig extends ClusterTest {

Review comment:
       Okay, I decided to upgrade `TestParquetWriter` to `ClusterTest` so that this test can live in there.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] vdiravka edited a comment on pull request #2351: DRILL1282: Move parquet to use v2 format as default

Posted by GitBox <gi...@apache.org>.
vdiravka edited a comment on pull request #2351:
URL: https://github.com/apache/drill/pull/2351#issuecomment-958960612






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] vdiravka commented on a change in pull request #2351: DRILL-1282: Add read and write support for Parquet v2

Posted by GitBox <gi...@apache.org>.
vdiravka commented on a change in pull request #2351:
URL: https://github.com/apache/drill/pull/2351#discussion_r753617750



##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetRecordWriter.java
##########
@@ -263,20 +269,29 @@ private void newSchema() throws IOException {
     // We don't want this number to be too small either. Ideally, slightly bigger than the page size,
     // but not bigger than the block buffer
     int initialPageBufferSize = max(MINIMUM_BUFFER_SIZE, min(pageSize + pageSize / 10, initialBlockBufferSize));
+    ValuesWriterFactory valWriterFactory = writerVersion == WriterVersion.PARQUET_1_0
+      ? new DefaultV1ValuesWriterFactory()
+      : new DefaultV2ValuesWriterFactory();
+
     ParquetProperties parquetProperties = ParquetProperties.builder()
         .withPageSize(pageSize)
         .withDictionaryEncoding(enableDictionary)
         .withDictionaryPageSize(initialPageBufferSize)
-        .withWriterVersion(writerVersion)
         .withAllocator(new ParquetDirectByteBufferAllocator(oContext))
-        .withValuesWriterFactory(new DefaultV1ValuesWriterFactory())
+        .withValuesWriterFactory(valWriterFactory)
+        .withWriterVersion(writerVersion)
         .build();
+
     // TODO: Replace ParquetColumnChunkPageWriteStore with ColumnChunkPageWriteStore from parquet library
     //   once DRILL-7906 (PARQUET-1006) will be resolved
     pageStore = new ParquetColumnChunkPageWriteStore(codecFactory.getCompressor(codec), schema,
             parquetProperties.getInitialSlabSize(), pageSize, parquetProperties.getAllocator(),
             parquetProperties.getColumnIndexTruncateLength(), parquetProperties.getPageWriteChecksumEnabled());
-    store = new ColumnWriteStoreV1(pageStore, parquetProperties);
+
+    store = writerVersion == WriterVersion.PARQUET_1_0

Review comment:
       oh yes, but starting from `java14`. Sorry. 
   ```
   store = switch(writerVersion) {
         case WriterVersion.PARQUET_1_0 -> new ColumnWriteStoreV1(schema, pageStore, parquetProperties);
         case WriterVersion.PARQUET_2_0 -> new ColumnWriteStoreV2(schema, pageStore, parquetProperties);
       };
   ```

##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderConfig.java
##########
@@ -186,11 +187,13 @@ public ParquetReaderConfig build() {
         readerConfig.enableTimeReadCounter = conf.getBoolean(ENABLE_TIME_READ_COUNTER, readerConfig.enableTimeReadCounter);
       }
 
-      // last assign values from session options, session options have higher priority than other configurations
+      // last assign values from session or query scoped options which have higher priority than other configurations
       if (options != null) {
-        String option = options.getOption(ExecConstants.PARQUET_READER_STRINGS_SIGNED_MIN_MAX_VALIDATOR);
-        if (!option.isEmpty()) {
-          readerConfig.enableStringsSignedMinMax = Boolean.valueOf(option);
+        String optVal  = (String) options.getOption(
+          ExecConstants.PARQUET_READER_STRINGS_SIGNED_MIN_MAX
+        ).getValueMinScope(OptionValue.OptionScope.SESSION);

Review comment:
       I thought about other direction :) 
   Not sure I understood the purpose for this method fully. If session and query scope options are absent, the system and boot scope options will not be set to `optVal`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] dzamo commented on a change in pull request #2351: DRILL-1282: Add read and write support for Parquet v2

Posted by GitBox <gi...@apache.org>.
dzamo commented on a change in pull request #2351:
URL: https://github.com/apache/drill/pull/2351#discussion_r754090152



##########
File path: exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetWriterConfig.java
##########
@@ -0,0 +1,67 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.parquet;
+
+import org.apache.drill.exec.ExecConstants;
+import org.apache.drill.test.ClusterFixture;
+import org.apache.drill.test.ClusterFixtureBuilder;
+import org.apache.drill.test.ClusterTest;
+import org.junit.BeforeClass;
+import org.junit.Test;
+
+import java.nio.file.Paths;
+
+public class TestParquetWriterConfig extends ClusterTest {

Review comment:
       @vdiravka Okay, I decided to upgrade `TestParquetWriter` to `ClusterTest` so that this test can live in there.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] dzamo commented on a change in pull request #2351: DRILL-1282: Move parquet to use v2 format as default

Posted by GitBox <gi...@apache.org>.
dzamo commented on a change in pull request #2351:
URL: https://github.com/apache/drill/pull/2351#discussion_r751942715



##########
File path: exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetWriterConfig.java
##########
@@ -0,0 +1,67 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.parquet;
+
+import org.apache.drill.exec.ExecConstants;
+import org.apache.drill.test.ClusterFixture;
+import org.apache.drill.test.ClusterFixtureBuilder;
+import org.apache.drill.test.ClusterTest;
+import org.junit.BeforeClass;
+import org.junit.Test;
+
+import java.nio.file.Paths;
+
+public class TestParquetWriterConfig extends ClusterTest {

Review comment:
       @vdiravka, thanks I will do this.  My first question here is whether I should even introduce a new test class here.  I don't think there is any other test class in Drill that only contains a single test of whether format config options are present in the query plan?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] dzamo commented on a change in pull request #2351: DRILL-1282: Move parquet to use v2 format as default

Posted by GitBox <gi...@apache.org>.
dzamo commented on a change in pull request #2351:
URL: https://github.com/apache/drill/pull/2351#discussion_r751940823



##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetFormatPlugin.java
##########
@@ -141,40 +145,94 @@ public AbstractWriter getWriter(PhysicalOperator child, String location, List<St
   }
 
   public RecordWriter getRecordWriter(FragmentContext context, ParquetWriter writer) throws IOException, OutOfMemoryException {
-    Map<String, String> options = new HashMap<>();
+    Map<String, String> writerOpts = new HashMap<>();
+    OptionManager contextOpts = context.getOptions();
 
-    options.put("location", writer.getLocation());
+    writerOpts.put("location", writer.getLocation());
 
     FragmentHandle handle = context.getHandle();
     String fragmentId = String.format("%d_%d", handle.getMajorFragmentId(), handle.getMinorFragmentId());
-    options.put("prefix", fragmentId);
-
-    options.put(ExecConstants.PARQUET_BLOCK_SIZE, context.getOptions().getOption(ExecConstants.PARQUET_BLOCK_SIZE).num_val.toString());
-    options.put(ExecConstants.PARQUET_WRITER_USE_SINGLE_FS_BLOCK,
-      context.getOptions().getOption(ExecConstants.PARQUET_WRITER_USE_SINGLE_FS_BLOCK).bool_val.toString());
-    options.put(ExecConstants.PARQUET_PAGE_SIZE, context.getOptions().getOption(ExecConstants.PARQUET_PAGE_SIZE).num_val.toString());
-    options.put(ExecConstants.PARQUET_DICT_PAGE_SIZE, context.getOptions().getOption(ExecConstants.PARQUET_DICT_PAGE_SIZE).num_val.toString());
-
-    options.put(ExecConstants.PARQUET_WRITER_COMPRESSION_TYPE,
-        context.getOptions().getOption(ExecConstants.PARQUET_WRITER_COMPRESSION_TYPE).string_val);
-
-    options.put(ExecConstants.PARQUET_WRITER_ENABLE_DICTIONARY_ENCODING,
-        context.getOptions().getOption(ExecConstants.PARQUET_WRITER_ENABLE_DICTIONARY_ENCODING).bool_val.toString());
-
-    options.put(ExecConstants.PARQUET_WRITER_LOGICAL_TYPE_FOR_DECIMALS,
-        context.getOptions().getOption(ExecConstants.PARQUET_WRITER_LOGICAL_TYPE_FOR_DECIMALS).string_val);
-
-    options.put(ExecConstants.PARQUET_WRITER_USE_PRIMITIVE_TYPES_FOR_DECIMALS,
-        context.getOptions().getOption(ExecConstants.PARQUET_WRITER_USE_PRIMITIVE_TYPES_FOR_DECIMALS).bool_val.toString());
+    writerOpts.put("prefix", fragmentId);
+
+    // Many options which follow may be set as Drill config options or in the parquet format
+    // plugin config.  If there is a Drill option set at session scope or narrower it takes precendence.
+    OptionValue.OptionScope minScope = OptionValue.OptionScope.SESSION;

Review comment:
       @vdiravka yes I thought exactly the same except I made up the name `EffectiveConfigResolver`.  Plugin developers should not have to learn and rewrite the config option priority logic every time.  I'm not exactly sure where the most natural home for this animal would be - which package...




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] lgtm-com[bot] commented on pull request #2351: DRILL-1282: Add read and write support for Parquet v2

Posted by GitBox <gi...@apache.org>.
lgtm-com[bot] commented on pull request #2351:
URL: https://github.com/apache/drill/pull/2351#issuecomment-976428309


   This pull request **fixes 1 alert** when merging 828cd2b7570bf6aecd6cbaa4d786509c66d9ef71 into baf698f63a1e549346fba59254472995e0f7be16 - [view on LGTM.com](https://lgtm.com/projects/g/apache/drill/rev/pr-9b20835c28a4c30ab6e6807d6a42ccea9ac20fca)
   
   **fixed alerts:**
   
   * 1 for Unused format argument


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] lgtm-com[bot] commented on pull request #2351: DRILL-1282: Add read and write support for Parquet v2

Posted by GitBox <gi...@apache.org>.
lgtm-com[bot] commented on pull request #2351:
URL: https://github.com/apache/drill/pull/2351#issuecomment-976674102


   This pull request **fixes 1 alert** when merging 16fada7b3900f0dabc1daea36f7b4987ac12fd02 into baf698f63a1e549346fba59254472995e0f7be16 - [view on LGTM.com](https://lgtm.com/projects/g/apache/drill/rev/pr-74cd9da25155d4b64a6c6316826b9b14707228a0)
   
   **fixed alerts:**
   
   * 1 for Unused format argument


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] dzamo merged pull request #2351: DRILL-1282: Add read and write support for Parquet v2

Posted by GitBox <gi...@apache.org>.
dzamo merged pull request #2351:
URL: https://github.com/apache/drill/pull/2351


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] dzamo commented on a change in pull request #2351: DRILL-1282: Add read and write support for Parquet v2

Posted by GitBox <gi...@apache.org>.
dzamo commented on a change in pull request #2351:
URL: https://github.com/apache/drill/pull/2351#discussion_r755125855



##########
File path: exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/writer/TestParquetWriter.java
##########
@@ -137,14 +140,14 @@ public static void setupTestFiles() {
   public int repeat = 1;
 
   @BeforeClass

Review comment:
       @vdiravka I did try to upgrade this class to JUnit5 but `@RunWith(Parameterized.class)` does not seem to have a simple direct translation (there are different translations and I do no have experience with any) so I rolled it back.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] vdiravka commented on a change in pull request #2351: DRILL-1282: Move parquet to use v2 format as default

Posted by GitBox <gi...@apache.org>.
vdiravka commented on a change in pull request #2351:
URL: https://github.com/apache/drill/pull/2351#discussion_r751467675



##########
File path: exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/writer/TestParquetWriter.java
##########
@@ -963,6 +963,17 @@ private void compareParquetInt96Converters(String selection, String table) throw
     }
   }
 
+  @Test
+  public void testTPCHReadWriteFormatV2() throws Exception {
+    try {
+      alterSession(ExecConstants.PARQUET_WRITER_FORMAT_VERSION, "parquet_2_0");

Review comment:
       Do we need enabling it, if the aim of the PR is enabling V2 by default?

##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetFormatPlugin.java
##########
@@ -78,6 +80,8 @@
 
 public class ParquetFormatPlugin implements FormatPlugin {
 
+  public static final String[] PARQUET_VERSIONS = {"PARQUET_1_0", "PARQUET_2_0"};

Review comment:
       I think we can simplify the values, the key already has `parquet` keyword - `PARQUET_VERSIONS`. Similar to `MetadataVersion.Constants` the array can be:
   `{"1.0", "2.0"}`.
   
   Or make it similar/corresponding to `ParquetProperties#WriterVersion` enum

##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetFormatConfig.java
##########
@@ -18,81 +18,95 @@
 package org.apache.drill.exec.store.parquet;
 
 import com.fasterxml.jackson.annotation.JsonCreator;
-import com.fasterxml.jackson.annotation.JsonIgnore;
 import com.fasterxml.jackson.annotation.JsonInclude;
 import com.fasterxml.jackson.annotation.JsonProperty;
 
-import java.util.Objects;
+import lombok.Builder;
+import lombok.EqualsAndHashCode;
+import lombok.Getter;
 
 import org.apache.drill.common.PlanStringBuilder;
 import org.apache.drill.common.logical.FormatPluginConfig;
 
 import com.fasterxml.jackson.annotation.JsonTypeName;
 
+@EqualsAndHashCode
 @JsonTypeName("parquet") @JsonInclude(JsonInclude.Include.NON_DEFAULT)
 public class ParquetFormatConfig implements FormatPluginConfig {
 
-  private final boolean autoCorrectCorruptDates;
-  private final boolean enableStringsSignedMinMax;
-
-  public ParquetFormatConfig() {
-    this(true, false);
-  }
-
-  @JsonCreator
-  public ParquetFormatConfig(@JsonProperty("autoCorrectCorruptDates") Boolean autoCorrectCorruptDates,
-      @JsonProperty("enableStringsSignedMinMax") boolean enableStringsSignedMinMax) {
-    this.autoCorrectCorruptDates = autoCorrectCorruptDates == null ? true : autoCorrectCorruptDates;
-    this.enableStringsSignedMinMax = enableStringsSignedMinMax;
-  }
+  @Getter private final boolean autoCorrectCorruptDates;

Review comment:
       `enableStringsSignedMinMax` property has a good javadoc.
   But `autoCorrectCorruptDates` has a good javadoc in `TestCorruptParquetDateCorrection`. Could you borrow some info from there or add a link to that javadoc?

##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetFormatPlugin.java
##########
@@ -141,40 +145,94 @@ public AbstractWriter getWriter(PhysicalOperator child, String location, List<St
   }
 
   public RecordWriter getRecordWriter(FragmentContext context, ParquetWriter writer) throws IOException, OutOfMemoryException {
-    Map<String, String> options = new HashMap<>();
+    Map<String, String> writerOpts = new HashMap<>();
+    OptionManager contextOpts = context.getOptions();
 
-    options.put("location", writer.getLocation());
+    writerOpts.put("location", writer.getLocation());
 
     FragmentHandle handle = context.getHandle();
     String fragmentId = String.format("%d_%d", handle.getMajorFragmentId(), handle.getMinorFragmentId());
-    options.put("prefix", fragmentId);
-
-    options.put(ExecConstants.PARQUET_BLOCK_SIZE, context.getOptions().getOption(ExecConstants.PARQUET_BLOCK_SIZE).num_val.toString());
-    options.put(ExecConstants.PARQUET_WRITER_USE_SINGLE_FS_BLOCK,
-      context.getOptions().getOption(ExecConstants.PARQUET_WRITER_USE_SINGLE_FS_BLOCK).bool_val.toString());
-    options.put(ExecConstants.PARQUET_PAGE_SIZE, context.getOptions().getOption(ExecConstants.PARQUET_PAGE_SIZE).num_val.toString());
-    options.put(ExecConstants.PARQUET_DICT_PAGE_SIZE, context.getOptions().getOption(ExecConstants.PARQUET_DICT_PAGE_SIZE).num_val.toString());
-
-    options.put(ExecConstants.PARQUET_WRITER_COMPRESSION_TYPE,
-        context.getOptions().getOption(ExecConstants.PARQUET_WRITER_COMPRESSION_TYPE).string_val);
-
-    options.put(ExecConstants.PARQUET_WRITER_ENABLE_DICTIONARY_ENCODING,
-        context.getOptions().getOption(ExecConstants.PARQUET_WRITER_ENABLE_DICTIONARY_ENCODING).bool_val.toString());
-
-    options.put(ExecConstants.PARQUET_WRITER_LOGICAL_TYPE_FOR_DECIMALS,
-        context.getOptions().getOption(ExecConstants.PARQUET_WRITER_LOGICAL_TYPE_FOR_DECIMALS).string_val);
-
-    options.put(ExecConstants.PARQUET_WRITER_USE_PRIMITIVE_TYPES_FOR_DECIMALS,
-        context.getOptions().getOption(ExecConstants.PARQUET_WRITER_USE_PRIMITIVE_TYPES_FOR_DECIMALS).bool_val.toString());
+    writerOpts.put("prefix", fragmentId);
+
+    // Many options which follow may be set as Drill config options or in the parquet format
+    // plugin config.  If there is a Drill option set at session scope or narrower it takes precendence.
+    OptionValue.OptionScope minScope = OptionValue.OptionScope.SESSION;
+
+    writerOpts.put(ExecConstants.PARQUET_BLOCK_SIZE,
+      ObjectUtils.firstNonNull(
+        contextOpts.getOption(ExecConstants.PARQUET_BLOCK_SIZE).getValueMinScope(minScope),
+        config.getBlockSize(),
+        contextOpts.getInt(ExecConstants.PARQUET_BLOCK_SIZE)
+      ).toString()
+    );
+
+    writerOpts.put(ExecConstants.PARQUET_WRITER_USE_SINGLE_FS_BLOCK,
+      ObjectUtils.firstNonNull(
+        contextOpts.getOption(ExecConstants.PARQUET_WRITER_USE_SINGLE_FS_BLOCK).getValueMinScope(minScope),
+        config.getUseSingleFSBlock(),
+        contextOpts.getBoolean(ExecConstants.PARQUET_WRITER_USE_SINGLE_FS_BLOCK)
+      ).toString()
+    );
+
+    writerOpts.put(ExecConstants.PARQUET_PAGE_SIZE,
+      ObjectUtils.firstNonNull(
+        contextOpts.getOption(ExecConstants.PARQUET_PAGE_SIZE).getValueMinScope(minScope),
+        config.getPageSize(),
+        contextOpts.getInt(ExecConstants.PARQUET_PAGE_SIZE)
+      ).toString()
+    );
+
+    // "internal use" so not settable in format config
+    writerOpts.put(ExecConstants.PARQUET_DICT_PAGE_SIZE,
+      contextOpts.getOption(ExecConstants.PARQUET_DICT_PAGE_SIZE).num_val.toString()
+    );
+
+    // "internal use" so not settable in format config
+    writerOpts.put(ExecConstants.PARQUET_WRITER_ENABLE_DICTIONARY_ENCODING,
+      contextOpts.getOption(ExecConstants.PARQUET_WRITER_ENABLE_DICTIONARY_ENCODING).bool_val.toString()
+    );
+
+    writerOpts.put(ExecConstants.PARQUET_WRITER_COMPRESSION_TYPE,
+      ObjectUtils.firstNonNull(
+        contextOpts.getOption(ExecConstants.PARQUET_WRITER_COMPRESSION_TYPE).getValueMinScope(minScope),
+        config.getWriterCompressionType(),
+        contextOpts.getString(ExecConstants.PARQUET_WRITER_COMPRESSION_TYPE)
+      ).toString()
+    );
+
+    writerOpts.put(ExecConstants.PARQUET_WRITER_LOGICAL_TYPE_FOR_DECIMALS,
+      ObjectUtils.firstNonNull(
+        contextOpts.getOption(ExecConstants.PARQUET_WRITER_LOGICAL_TYPE_FOR_DECIMALS).getValueMinScope(minScope),
+        config.getWriterLogicalTypeForDecimals(),
+        contextOpts.getString(ExecConstants.PARQUET_WRITER_LOGICAL_TYPE_FOR_DECIMALS)
+      ).toString()
+    );
+
+    writerOpts.put(ExecConstants.PARQUET_WRITER_USE_PRIMITIVE_TYPES_FOR_DECIMALS,
+      ObjectUtils.firstNonNull(
+        contextOpts.getOption(ExecConstants.PARQUET_WRITER_USE_PRIMITIVE_TYPES_FOR_DECIMALS).getValueMinScope(minScope),
+        config.getWriterUsePrimitivesForDecimals(),
+        contextOpts.getBoolean(ExecConstants.PARQUET_WRITER_USE_PRIMITIVE_TYPES_FOR_DECIMALS)
+      ).toString()
+    );
+
+    writerOpts.put(ExecConstants.PARQUET_WRITER_FORMAT_VERSION,
+      ObjectUtils.firstNonNull(
+        contextOpts.getOption(ExecConstants.PARQUET_WRITER_FORMAT_VERSION).getValueMinScope(minScope),
+        config.getWriterFormatVersion(),
+        contextOpts.getString(ExecConstants.PARQUET_WRITER_FORMAT_VERSION)
+      ).toString()
+    );
 
     RecordWriter recordWriter = new ParquetRecordWriter(context, writer);
-    recordWriter.init(options);
+    recordWriter.init(writerOpts);
 
     return recordWriter;
   }
 
   public WriterRecordBatch getWriterBatch(FragmentContext context, RecordBatch incoming, ParquetWriter writer)
           throws ExecutionSetupException {
+    // getConfig().get

Review comment:
       ```suggestion
   ```

##########
File path: exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetWriterConfig.java
##########
@@ -0,0 +1,67 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.parquet;
+
+import org.apache.drill.exec.ExecConstants;
+import org.apache.drill.test.ClusterFixture;
+import org.apache.drill.test.ClusterFixtureBuilder;
+import org.apache.drill.test.ClusterTest;
+import org.junit.BeforeClass;
+import org.junit.Test;
+
+import java.nio.file.Paths;
+
+public class TestParquetWriterConfig extends ClusterTest {

Review comment:
       I recommend introducing new test classes based on `JUnit5` framework. As an example you can check my recent update `TestDrillbitResilience` onto the JUnit5 (it is also extends `ClusterTest`)

##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderConfig.java
##########
@@ -186,11 +187,13 @@ public ParquetReaderConfig build() {
         readerConfig.enableTimeReadCounter = conf.getBoolean(ENABLE_TIME_READ_COUNTER, readerConfig.enableTimeReadCounter);
       }
 
-      // last assign values from session options, session options have higher priority than other configurations
+      // last assign values from session or query scoped options which have higher priority than other configurations
       if (options != null) {
-        String option = options.getOption(ExecConstants.PARQUET_READER_STRINGS_SIGNED_MIN_MAX_VALIDATOR);
-        if (!option.isEmpty()) {
-          readerConfig.enableStringsSignedMinMax = Boolean.valueOf(option);
+        String optVal  = (String) options.getOption(
+          ExecConstants.PARQUET_READER_STRINGS_SIGNED_MIN_MAX
+        ).getValueMinScope(OptionValue.OptionScope.SESSION);

Review comment:
       What about `OptionScope.QUERY` (according to above comment)?

##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetRecordWriter.java
##########
@@ -263,20 +269,29 @@ private void newSchema() throws IOException {
     // We don't want this number to be too small either. Ideally, slightly bigger than the page size,
     // but not bigger than the block buffer
     int initialPageBufferSize = max(MINIMUM_BUFFER_SIZE, min(pageSize + pageSize / 10, initialBlockBufferSize));
+    ValuesWriterFactory valWriterFactory = writerVersion == WriterVersion.PARQUET_1_0
+      ? new DefaultV1ValuesWriterFactory()
+      : new DefaultV2ValuesWriterFactory();
+
     ParquetProperties parquetProperties = ParquetProperties.builder()
         .withPageSize(pageSize)
         .withDictionaryEncoding(enableDictionary)
         .withDictionaryPageSize(initialPageBufferSize)
-        .withWriterVersion(writerVersion)
         .withAllocator(new ParquetDirectByteBufferAllocator(oContext))
-        .withValuesWriterFactory(new DefaultV1ValuesWriterFactory())
+        .withValuesWriterFactory(valWriterFactory)
+        .withWriterVersion(writerVersion)
         .build();
+
     // TODO: Replace ParquetColumnChunkPageWriteStore with ColumnChunkPageWriteStore from parquet library
     //   once DRILL-7906 (PARQUET-1006) will be resolved
     pageStore = new ParquetColumnChunkPageWriteStore(codecFactory.getCompressor(codec), schema,
             parquetProperties.getInitialSlabSize(), pageSize, parquetProperties.getAllocator(),
             parquetProperties.getColumnIndexTruncateLength(), parquetProperties.getPageWriteChecksumEnabled());
-    store = new ColumnWriteStoreV1(pageStore, parquetProperties);
+
+    store = writerVersion == WriterVersion.PARQUET_1_0

Review comment:
       switch-case?

##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetFormatPlugin.java
##########
@@ -141,40 +145,94 @@ public AbstractWriter getWriter(PhysicalOperator child, String location, List<St
   }
 
   public RecordWriter getRecordWriter(FragmentContext context, ParquetWriter writer) throws IOException, OutOfMemoryException {
-    Map<String, String> options = new HashMap<>();
+    Map<String, String> writerOpts = new HashMap<>();
+    OptionManager contextOpts = context.getOptions();
 
-    options.put("location", writer.getLocation());
+    writerOpts.put("location", writer.getLocation());
 
     FragmentHandle handle = context.getHandle();
     String fragmentId = String.format("%d_%d", handle.getMajorFragmentId(), handle.getMinorFragmentId());
-    options.put("prefix", fragmentId);
-
-    options.put(ExecConstants.PARQUET_BLOCK_SIZE, context.getOptions().getOption(ExecConstants.PARQUET_BLOCK_SIZE).num_val.toString());
-    options.put(ExecConstants.PARQUET_WRITER_USE_SINGLE_FS_BLOCK,
-      context.getOptions().getOption(ExecConstants.PARQUET_WRITER_USE_SINGLE_FS_BLOCK).bool_val.toString());
-    options.put(ExecConstants.PARQUET_PAGE_SIZE, context.getOptions().getOption(ExecConstants.PARQUET_PAGE_SIZE).num_val.toString());
-    options.put(ExecConstants.PARQUET_DICT_PAGE_SIZE, context.getOptions().getOption(ExecConstants.PARQUET_DICT_PAGE_SIZE).num_val.toString());
-
-    options.put(ExecConstants.PARQUET_WRITER_COMPRESSION_TYPE,
-        context.getOptions().getOption(ExecConstants.PARQUET_WRITER_COMPRESSION_TYPE).string_val);
-
-    options.put(ExecConstants.PARQUET_WRITER_ENABLE_DICTIONARY_ENCODING,
-        context.getOptions().getOption(ExecConstants.PARQUET_WRITER_ENABLE_DICTIONARY_ENCODING).bool_val.toString());
-
-    options.put(ExecConstants.PARQUET_WRITER_LOGICAL_TYPE_FOR_DECIMALS,
-        context.getOptions().getOption(ExecConstants.PARQUET_WRITER_LOGICAL_TYPE_FOR_DECIMALS).string_val);
-
-    options.put(ExecConstants.PARQUET_WRITER_USE_PRIMITIVE_TYPES_FOR_DECIMALS,
-        context.getOptions().getOption(ExecConstants.PARQUET_WRITER_USE_PRIMITIVE_TYPES_FOR_DECIMALS).bool_val.toString());
+    writerOpts.put("prefix", fragmentId);
+
+    // Many options which follow may be set as Drill config options or in the parquet format
+    // plugin config.  If there is a Drill option set at session scope or narrower it takes precendence.
+    OptionValue.OptionScope minScope = OptionValue.OptionScope.SESSION;

Review comment:
       As far as I understand SESSION and QUERY scopes had precedence over plugin config earlier too. The aim of these changes for giving higher precedence for plugin config over SYSTEM options, right?
   It is fine but could we manage plugin configs within a new OptionManager? And this manager can be in the managers fallback chain, so then the priority of the options will be resolved automatically for all options and plugins. Just consider it as an idea for enhancement (this PR or the new one)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] dzamo commented on a change in pull request #2351: DRILL-1282: Move parquet to use v2 format as default

Posted by GitBox <gi...@apache.org>.
dzamo commented on a change in pull request #2351:
URL: https://github.com/apache/drill/pull/2351#discussion_r751945090



##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderConfig.java
##########
@@ -186,11 +187,13 @@ public ParquetReaderConfig build() {
         readerConfig.enableTimeReadCounter = conf.getBoolean(ENABLE_TIME_READ_COUNTER, readerConfig.enableTimeReadCounter);
       }
 
-      // last assign values from session options, session options have higher priority than other configurations
+      // last assign values from session or query scoped options which have higher priority than other configurations
       if (options != null) {
-        String option = options.getOption(ExecConstants.PARQUET_READER_STRINGS_SIGNED_MIN_MAX_VALIDATOR);
-        if (!option.isEmpty()) {
-          readerConfig.enableStringsSignedMinMax = Boolean.valueOf(option);
+        String optVal  = (String) options.getOption(
+          ExecConstants.PARQUET_READER_STRINGS_SIGNED_MIN_MAX
+        ).getValueMinScope(OptionValue.OptionScope.SESSION);

Review comment:
       @vdiravka if you look at the new `getValueMinScope` method, it works returns values from all scopes from `SESSION` _and up_ under the ordering implied by the `OptionScope` enum, namely
   ```
     public enum OptionScope {
       BOOT, SYSTEM, SESSION, QUERY
     }
   ```
   
   So `SESSION` and `QUERY` would both override here.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] dzamo commented on a change in pull request #2351: DRILL-1282: Move parquet to use v2 format as default

Posted by GitBox <gi...@apache.org>.
dzamo commented on a change in pull request #2351:
URL: https://github.com/apache/drill/pull/2351#discussion_r751937946



##########
File path: exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/writer/TestParquetWriter.java
##########
@@ -963,6 +963,17 @@ private void compareParquetInt96Converters(String selection, String table) throw
     }
   }
 
+  @Test
+  public void testTPCHReadWriteFormatV2() throws Exception {
+    try {
+      alterSession(ExecConstants.PARQUET_WRITER_FORMAT_VERSION, "parquet_2_0");

Review comment:
       @vdiravka I discussed the default format version with @vvysotskyi who said that v2 does not have wide adoption.  I then checked Apache Spark and found that, indeed, reading v2 will not work without a non-default setting there.  So it was conceded that v1 should remain the default, in spite of the title of the old Jira ticket.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] lgtm-com[bot] commented on pull request #2351: DRILL-1282: Move parquet to use v2 format as default

Posted by GitBox <gi...@apache.org>.
lgtm-com[bot] commented on pull request #2351:
URL: https://github.com/apache/drill/pull/2351#issuecomment-967069142


   This pull request **fixes 1 alert** when merging df7929f0fef8b28011f76cd1bdaa8aa8ead343c8 into 18110e0a8f6e9b1511b31f4458068ea38d8c5731 - [view on LGTM.com](https://lgtm.com/projects/g/apache/drill/rev/pr-cfb969be4b37ee1638e0bb9c6b7b301bd581a41b)
   
   **fixed alerts:**
   
   * 1 for Unused format argument


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] lgtm-com[bot] commented on pull request #2351: DRILL-1282: Move parquet to use v2 format as default

Posted by GitBox <gi...@apache.org>.
lgtm-com[bot] commented on pull request #2351:
URL: https://github.com/apache/drill/pull/2351#issuecomment-970642277


   This pull request **fixes 1 alert** when merging 0068450220d864f87d0f088a036c961af0669367 into 9370ceb60e5dbf12851cc1ce4bf067123dd28159 - [view on LGTM.com](https://lgtm.com/projects/g/apache/drill/rev/pr-10258e8cb2eceaf1e3bdfa0eef194c641cb979ff)
   
   **fixed alerts:**
   
   * 1 for Unused format argument


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] lgtm-com[bot] commented on pull request #2351: DRILL-1282: Move parquet to use v2 format as default

Posted by GitBox <gi...@apache.org>.
lgtm-com[bot] commented on pull request #2351:
URL: https://github.com/apache/drill/pull/2351#issuecomment-971324438


   This pull request **fixes 1 alert** when merging 628fa9637db3a3bd24485ef5372854a9791d25a4 into 9370ceb60e5dbf12851cc1ce4bf067123dd28159 - [view on LGTM.com](https://lgtm.com/projects/g/apache/drill/rev/pr-4412314bbbf1d52374830996c0dd0fbad019c9a4)
   
   **fixed alerts:**
   
   * 1 for Unused format argument


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] dzamo commented on a change in pull request #2351: DRILL-1282: Add read and write support for Parquet v2

Posted by GitBox <gi...@apache.org>.
dzamo commented on a change in pull request #2351:
URL: https://github.com/apache/drill/pull/2351#discussion_r752001140



##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetFormatPlugin.java
##########
@@ -78,6 +80,8 @@
 
 public class ParquetFormatPlugin implements FormatPlugin {
 
+  public static final String[] PARQUET_VERSIONS = {"PARQUET_1_0", "PARQUET_2_0"};

Review comment:
       @vdiravka I followed `ParquetProperties#WriterVersion`, that's where `PARQUET_1_0` and `PARQUET_2_0` come from.  I agree it's clunky, but on the plus side I did not have to introduce any new version format strings or case statements.  Which do you think is preferable?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] lgtm-com[bot] commented on pull request #2351: DRILL-1282: Move parquet to use v2 format as default

Posted by GitBox <gi...@apache.org>.
lgtm-com[bot] commented on pull request #2351:
URL: https://github.com/apache/drill/pull/2351#issuecomment-971695847


   This pull request **fixes 1 alert** when merging f05cc2ef2cdc662e07c613c231bcfb1fdd48a261 into 9370ceb60e5dbf12851cc1ce4bf067123dd28159 - [view on LGTM.com](https://lgtm.com/projects/g/apache/drill/rev/pr-c2d88b9b5905709a7e98c8018888bde45c028d70)
   
   **fixed alerts:**
   
   * 1 for Unused format argument


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] dzamo commented on pull request #2351: DRILL1282: Move parquet to use v2 format as default

Posted by GitBox <gi...@apache.org>.
dzamo commented on pull request #2351:
URL: https://github.com/apache/drill/pull/2351#issuecomment-952750430


   @vvysotskyi In the end I never quite understood where I was supposed (not) to add code for the new Drill option, please check this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] dzamo commented on pull request #2351: DRILL1282: Move parquet to use v2 format as default

Posted by GitBox <gi...@apache.org>.
dzamo commented on pull request #2351:
URL: https://github.com/apache/drill/pull/2351#issuecomment-952722817


   A different possibility for the Parquet version Drill option is to add a value like "parquet2" to the `store.format` option instead of introducing the new `store.parquet.writer.format_version` option...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] dzamo commented on pull request #2351: DRILL1282: Move parquet to use v2 format as default

Posted by GitBox <gi...@apache.org>.
dzamo commented on pull request #2351:
URL: https://github.com/apache/drill/pull/2351#issuecomment-958961461


   @
   
   > Could you solve check-style errors to proceed with review?
   
   Yes I did, now just adding the format config option that Vova suggested.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] vdiravka commented on pull request #2351: DRILL1282: Move parquet to use v2 format as default

Posted by GitBox <gi...@apache.org>.
vdiravka commented on pull request #2351:
URL: https://github.com/apache/drill/pull/2351#issuecomment-958960612


   Could you solve check-style errors to proceed with review?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] dzamo commented on pull request #2351: DRILL1282: Move parquet to use v2 format as default

Posted by GitBox <gi...@apache.org>.
dzamo commented on pull request #2351:
URL: https://github.com/apache/drill/pull/2351#issuecomment-958961461






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] vdiravka edited a comment on pull request #2351: DRILL1282: Move parquet to use v2 format as default

Posted by GitBox <gi...@apache.org>.
vdiravka edited a comment on pull request #2351:
URL: https://github.com/apache/drill/pull/2351#issuecomment-958960612


   Could you solve check-style errors to proceed with review?
   nit: noticed missing dash in ticket description


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] dzamo commented on a change in pull request #2351: DRILL-1282: Add read and write support for Parquet v2

Posted by GitBox <gi...@apache.org>.
dzamo commented on a change in pull request #2351:
URL: https://github.com/apache/drill/pull/2351#discussion_r754980730



##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetFormatPlugin.java
##########
@@ -141,40 +145,94 @@ public AbstractWriter getWriter(PhysicalOperator child, String location, List<St
   }
 
   public RecordWriter getRecordWriter(FragmentContext context, ParquetWriter writer) throws IOException, OutOfMemoryException {
-    Map<String, String> options = new HashMap<>();
+    Map<String, String> writerOpts = new HashMap<>();
+    OptionManager contextOpts = context.getOptions();
 
-    options.put("location", writer.getLocation());
+    writerOpts.put("location", writer.getLocation());
 
     FragmentHandle handle = context.getHandle();
     String fragmentId = String.format("%d_%d", handle.getMajorFragmentId(), handle.getMinorFragmentId());
-    options.put("prefix", fragmentId);
-
-    options.put(ExecConstants.PARQUET_BLOCK_SIZE, context.getOptions().getOption(ExecConstants.PARQUET_BLOCK_SIZE).num_val.toString());
-    options.put(ExecConstants.PARQUET_WRITER_USE_SINGLE_FS_BLOCK,
-      context.getOptions().getOption(ExecConstants.PARQUET_WRITER_USE_SINGLE_FS_BLOCK).bool_val.toString());
-    options.put(ExecConstants.PARQUET_PAGE_SIZE, context.getOptions().getOption(ExecConstants.PARQUET_PAGE_SIZE).num_val.toString());
-    options.put(ExecConstants.PARQUET_DICT_PAGE_SIZE, context.getOptions().getOption(ExecConstants.PARQUET_DICT_PAGE_SIZE).num_val.toString());
-
-    options.put(ExecConstants.PARQUET_WRITER_COMPRESSION_TYPE,
-        context.getOptions().getOption(ExecConstants.PARQUET_WRITER_COMPRESSION_TYPE).string_val);
-
-    options.put(ExecConstants.PARQUET_WRITER_ENABLE_DICTIONARY_ENCODING,
-        context.getOptions().getOption(ExecConstants.PARQUET_WRITER_ENABLE_DICTIONARY_ENCODING).bool_val.toString());
-
-    options.put(ExecConstants.PARQUET_WRITER_LOGICAL_TYPE_FOR_DECIMALS,
-        context.getOptions().getOption(ExecConstants.PARQUET_WRITER_LOGICAL_TYPE_FOR_DECIMALS).string_val);
-
-    options.put(ExecConstants.PARQUET_WRITER_USE_PRIMITIVE_TYPES_FOR_DECIMALS,
-        context.getOptions().getOption(ExecConstants.PARQUET_WRITER_USE_PRIMITIVE_TYPES_FOR_DECIMALS).bool_val.toString());
+    writerOpts.put("prefix", fragmentId);
+
+    // Many options which follow may be set as Drill config options or in the parquet format
+    // plugin config.  If there is a Drill option set at session scope or narrower it takes precendence.
+    OptionValue.OptionScope minScope = OptionValue.OptionScope.SESSION;

Review comment:
       @vdiravka maybe we should put something about a new options manager which completely handles the different priorities into the Drill v2 ideas wiki page?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org