Posted to dev@drill.apache.org by GitBox <gi...@apache.org> on 2021/01/14 11:37:06 UTC

[GitHub] [drill] vdiravka opened a new pull request #2143: DRILL-7825: Error: SYSTEM ERROR: RuntimeException: Unknown logical type <LogicalType UUID:UUIDType()>

vdiravka opened a new pull request #2143:
URL: https://github.com/apache/drill/pull/2143


   
   # [DRILL-7825](https://issues.apache.org/jira/browse/DRILL-7825): Error: SYSTEM ERROR: RuntimeException: Unknown logical type <LogicalType UUID:UUIDType()>
   
   This PR adds support for the UUID logical type introduced in `parquet-format` 2.4:
   https://github.com/apache/parquet-format/blob/master/CHANGES.md
   
   `parquet-mr` supports it starting from version 1.12.0 (should be released soon: https://issues.apache.org/jira/browse/PARQUET-1898).
   TODO: wait for the parquet 1.12.0 release before merging.
   ## Description
   
   Adds support for the Parquet `UUID` logical type, which Drill previously rejected with `SYSTEM ERROR: RuntimeException: Unknown logical type <LogicalType UUID:UUIDType()>`.
   
   ## Documentation
   https://drill.apache.org/docs/parquet-format/
   https://drill.apache.org/docs/supported-data-types/
   should be updated to reflect:
   https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#uuid
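   
   For reference, the spec maps `UUID` onto a 16-byte `FIXED_LEN_BYTE_ARRAY`. A minimal parquet-mr sketch of declaring such a schema (illustrative only, not part of this change; assumes the `Types`/`LogicalTypeAnnotation` builder API available since parquet-mr 1.11):
   ```java
   import org.apache.parquet.schema.LogicalTypeAnnotation;
   import org.apache.parquet.schema.MessageType;
   import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
   import org.apache.parquet.schema.Types;
   
   public class UuidSchemaExample {
     public static void main(String[] args) {
       // UUID annotates a 16-byte fixed-length binary field.
       MessageType schema = Types.buildMessage()
           .required(PrimitiveTypeName.FIXED_LEN_BYTE_ARRAY).length(16)
               .as(LogicalTypeAnnotation.uuidType())
               .named("id")
           .named("uuid_example");
       System.out.println(schema);
     }
   }
   ```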
   
   ## Testing
   All test cases pass. Checked manually in drill-embedded.
   TODO: test the guava update on an Apache Hadoop test cluster with Drill, HBase, and ZooKeeper.
   





[GitHub] [drill] vvysotskyi commented on pull request #2143: DRILL-7825: Unknown logical type in Parquet

Posted by GitBox <gi...@apache.org>.
vvysotskyi commented on pull request #2143:
URL: https://github.com/apache/drill/pull/2143#issuecomment-824385357


   @vdiravka, let's have a single commit, and don't hesitate to remove `Co-authored-by`; GitHub inserts it when accepting changes after CR, but I don't think it is needed here.





[GitHub] [drill] cgivre commented on a change in pull request #2143: DRILL-7825: Unknown logical type in Parquet

Posted by GitBox <gi...@apache.org>.
cgivre commented on a change in pull request #2143:
URL: https://github.com/apache/drill/pull/2143#discussion_r618332415



##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetRecordWriter.java
##########
@@ -263,10 +258,11 @@ private void newSchema() throws IOException {
         .withAllocator(new ParquetDirectByteBufferAllocator(oContext))
         .withValuesWriterFactory(new DefaultV1ValuesWriterFactory())
         .build();
-    pageStore = new ParquetColumnChunkPageWriteStore(codecFactory.getCompressor(codec), schema, initialSlabSize,
-        pageSize, parquetProperties.getAllocator(), parquetProperties.getPageWriteChecksumEnabled(),
-        parquetProperties.getColumnIndexTruncateLength()
-    );
+    // TODO: Replace ParquetColumnChunkPageWriteStore with ColumnChunkPageWriteStore from parquet library
+    // once PARQUET-1006 is resolved

Review comment:
       @vdiravka Could you please create a JIRA for this and any other TODOs from this PR?

##########
File path: exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/ParquetSimpleTestFileGenerator.java
##########
@@ -46,7 +46,8 @@
  * that are supported by Drill. Embedded types specified in the Parquet specification are not covered by the
  * examples but can be added.
  * To create a new parquet file, define a schema, create a GroupWriter based on the schema, then add values
- * for individual records to the GroupWriter.
+ * for individual records to the GroupWriter.<br>
+ *     TODO: to run this tool please use 28.2-jre <guava.version> instead of 19.0 in main POM file

Review comment:
       See comment above re: TODOs.  I know you already created one, but could you please update the comment with the JIRA?







[GitHub] [drill] vdiravka commented on a change in pull request #2143: DRILL-7825: Unknown logical type in Parquet

Posted by GitBox <gi...@apache.org>.
vdiravka commented on a change in pull request #2143:
URL: https://github.com/apache/drill/pull/2143#discussion_r613456253



##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet2/DrillParquetGroupConverter.java
##########
@@ -328,24 +329,26 @@ protected PrimitiveConverter getConverterForType(String name, PrimitiveType type
         }
       }
       case FIXED_LEN_BYTE_ARRAY:
-        switch (type.getOriginalType()) {
-          case DECIMAL: {
-            ParquetReaderUtility.checkDecimalTypeEnabled(options);
-            return getVarDecimalConverter(name, type);
-          }
-          case INTERVAL: {
-            IntervalWriter writer = type.isRepetition(Repetition.REPEATED)
-                ? getWriter(name, (m, f) -> m.list(f).interval(), l -> l.list().interval())
-                : getWriter(name, (m, f) -> m.interval(f), l -> l.interval());
-            return new DrillFixedLengthByteArrayToInterval(writer);
-          }
-          default: {
-            VarBinaryWriter writer = type.isRepetition(Repetition.REPEATED)
-                ? getWriter(name, (m, f) -> m.list(f).varBinary(), l -> l.list().varBinary())
-                : getWriter(name, (m, f) -> m.varBinary(f), l -> l.varBinary());
-            return new DrillFixedBinaryToVarbinaryConverter(writer, type.getTypeLength(), mutator.getManagedBuffer());
+        // TODO: to follow the latest parquet code, rewrite it by using LogicalTypeAnnotation instead of OriginalType
+        OriginalType originalType = type.getOriginalType();
+        if (originalType != null) {
+          switch (originalType) {
+            case DECIMAL: {
+              ParquetReaderUtility.checkDecimalTypeEnabled(options);
+              return getVarDecimalConverter(name, type);
+            }
+            case INTERVAL: {
+              IntervalWriter writer = type.isRepetition(Repetition.REPEATED)
+                      ? getWriter(name, (m, f) -> m.list(f).interval(), l -> l.list().interval())
+                      : getWriter(name, (m, f) -> m.interval(f), l -> l.interval());
+              return new DrillFixedLengthByteArrayToInterval(writer);
+            }
           }
         }
+        VarBinaryWriter writer = type.isRepetition(Repetition.REPEATED)
+                ? getWriter(name, (m, f) -> m.list(f).varBinary(), l -> l.list().varBinary())
+                : getWriter(name, MapWriter::varBinary, ListWriter::varBinary);
+        return new DrillFixedBinaryToVarbinaryConverter(writer, type.getTypeLength(), mutator.getManagedBuffer());

Review comment:
       ok, will do

##########
File path: pom.xml
##########
@@ -47,9 +47,9 @@
     <junit.version>4.12</junit.version>
     <slf4j.version>1.7.26</slf4j.version>
     <shaded.guava.version>28.2-jre</shaded.guava.version>
-    <guava.version>19.0</guava.version>
+    <guava.version>19.0</guava.version> <!--todo: 28.2-jre guava can be used here-->

Review comment:
       OK. Yes, I know about the newer version, but a newer version can bring new issues of its own.
   The plan is to check `<guava.version>28.2-jre</guava.version>` first; if that is fine, drop `<shaded.guava.version>` entirely, and then update guava to the newest version.

##########
File path: exec/jdbc-all/pom.xml
##########
@@ -575,7 +575,7 @@
 
         <build>
           <plugins>
-            <plugin>
+            <plugin> <!-- TODO: this plugin shares common parts with the default profile. Factor out these common parts to avoid duplicate code -->

Review comment:
       ok :)

##########
File path: exec/java-exec/src/main/java/org/apache/parquet/hadoop/ParquetColumnChunkPageWriteStore.java
##########
@@ -260,14 +260,16 @@ public long getMemSize() {
     }
 
     /**
-     * Writes a number of pages within corresponding column chunk
+     * Writes a number of pages within corresponding column chunk <br>
+     * // TODO: the Bloom Filter can be useful in filtering entire row groups,
+     *     see <a href="https://issues.apache.org/jira/browse/DRILL-7895">DRILL-7895</a>

Review comment:
       Yes, you are right. The best way is to remove Drill's copy of this class; we have a `todo` about it in `ParquetRecordWriter#256`, but we can't do that before [PARQUET-1006](https://issues.apache.org/jira/browse/PARQUET-1006) is resolved. Porting the latest functionality from the Parquet version of this class requires deeper work (it requires proper instantiation of `ParquetColumnChunkPageWriteStore`) and isn't related to the `UUID` logical type. That's why DRILL-7895 was created.

##########
File path: exec/java-exec/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java
##########
@@ -0,0 +1,1633 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.parquet.hadoop;
+
+import static org.apache.parquet.format.Util.writeFileCryptoMetaData;
+import static org.apache.parquet.format.Util.writeFileMetaData;
+import static org.apache.parquet.format.converter.ParquetMetadataConverter.MAX_STATS_SIZE;
+import static org.apache.parquet.hadoop.ParquetWriter.DEFAULT_BLOCK_SIZE;
+import static org.apache.parquet.hadoop.ParquetWriter.MAX_PADDING_SIZE_DEFAULT;
+
+import java.io.ByteArrayOutputStream;
+import java.io.IOException;
+import java.nio.charset.StandardCharsets;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.LinkedHashSet;
+import java.util.List;
+import java.util.Map;
+import java.util.Map.Entry;
+import java.util.Set;
+import java.util.zip.CRC32;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+
+import org.apache.parquet.Preconditions;
+import org.apache.parquet.Version;
+import org.apache.parquet.bytes.BytesInput;
+import org.apache.parquet.bytes.BytesUtils;
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.column.Encoding;
+import org.apache.parquet.column.EncodingStats;
+import org.apache.parquet.column.ParquetProperties;
+import org.apache.parquet.column.page.DictionaryPage;
+import org.apache.parquet.column.statistics.Statistics;
+import org.apache.parquet.column.values.bloomfilter.BloomFilter;
+import org.apache.parquet.crypto.AesCipher;
+import org.apache.parquet.crypto.ColumnEncryptionProperties;
+import org.apache.parquet.crypto.FileEncryptionProperties;
+import org.apache.parquet.crypto.InternalColumnEncryptionSetup;
+import org.apache.parquet.crypto.InternalFileEncryptor;
+import org.apache.parquet.crypto.ModuleCipherFactory;
+import org.apache.parquet.crypto.ModuleCipherFactory.ModuleType;
+import org.apache.parquet.crypto.ParquetCryptoRuntimeException;
+import org.apache.parquet.hadoop.ParquetOutputFormat.JobSummaryLevel;
+import org.apache.parquet.hadoop.metadata.ColumnPath;
+import org.apache.parquet.format.BlockCipher;
+import org.apache.parquet.format.Util;
+import org.apache.parquet.format.converter.ParquetMetadataConverter;
+import org.apache.parquet.hadoop.metadata.BlockMetaData;
+import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
+import org.apache.parquet.hadoop.metadata.CompressionCodecName;
+import org.apache.parquet.hadoop.metadata.StrictKeyValueMetadataMergeStrategy;
+import org.apache.parquet.hadoop.metadata.FileMetaData;
+import org.apache.parquet.hadoop.metadata.GlobalMetaData;
+import org.apache.parquet.hadoop.metadata.KeyValueMetadataMergeStrategy;
+import org.apache.parquet.hadoop.metadata.ParquetMetadata;
+import org.apache.parquet.hadoop.util.HadoopOutputFile;
+import org.apache.parquet.hadoop.util.HadoopStreams;
+import org.apache.parquet.internal.column.columnindex.ColumnIndex;
+import org.apache.parquet.internal.column.columnindex.ColumnIndexBuilder;
+import org.apache.parquet.internal.column.columnindex.OffsetIndex;
+import org.apache.parquet.internal.column.columnindex.OffsetIndexBuilder;
+import org.apache.parquet.internal.hadoop.metadata.IndexReference;
+import org.apache.parquet.io.InputFile;
+import org.apache.parquet.io.OutputFile;
+import org.apache.parquet.io.SeekableInputStream;
+import org.apache.parquet.io.ParquetEncodingException;
+import org.apache.parquet.io.PositionOutputStream;
+import org.apache.parquet.schema.MessageType;
+import org.apache.parquet.schema.PrimitiveType;
+import org.apache.parquet.schema.TypeUtil;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * Internal implementation of the Parquet file writer as a block container<br>
+ * Note: this is temporary Drill-Parquet class needed to write empty parquet files. Details in
+ * <a href="https://issues.apache.org/jira/browse/PARQUET-2026">PARQUET-2026</a>
+ */
+public class ParquetFileWriter {

Review comment:
       I wanted to create a class extending the original one and override the `endBlock()` method, but unfortunately there are several private fields that can only be used within the scope of the original class, so extending can't help. Anyway, this is a temporary solution: the ticket to revisit dropping empty files has been created, and even if it is rejected, I'll create another ticket to make `endBlock()` overridable.
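   
   A toy illustration of the Java limitation described above (hypothetical classes, not the actual Parquet code): state touched by the base method is private, so an override cannot reproduce that bookkeeping.
   ```java
   class Writer {
     private long recordCount;      // private: invisible to subclasses
   
     public void endBlock() {
       recordCount = 0;             // bookkeeping an override cannot touch
     }
   }
   
   class SkipEmptyBlocksWriter extends Writer {
     @Override
     public void endBlock() {
       // Writer.recordCount cannot be read or reset here, so this override
       // cannot decide whether the current block is empty before ending it.
       super.endBlock();
     }
   }
   ```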







[GitHub] [drill] vdiravka merged pull request #2143: DRILL-7825: Unknown logical type in Parquet

Posted by GitBox <gi...@apache.org>.
vdiravka merged pull request #2143:
URL: https://github.com/apache/drill/pull/2143


   





[GitHub] [drill] vvysotskyi commented on a change in pull request #2143: DRILL-7825: Unknown logical type in Parquet

Posted by GitBox <gi...@apache.org>.
vvysotskyi commented on a change in pull request #2143:
URL: https://github.com/apache/drill/pull/2143#discussion_r615276673



##########
File path: exec/java-exec/src/main/java/org/apache/parquet/hadoop/ParquetColumnChunkPageWriteStore.java
##########
@@ -260,14 +260,16 @@ public long getMemSize() {
     }
 
     /**
-     * Writes a number of pages within corresponding column chunk
+     * Writes a number of pages within corresponding column chunk <br>
+     * // TODO: the Bloom Filter can be useful in filtering entire row groups,
+     *     see <a href="https://issues.apache.org/jira/browse/DRILL-7895">DRILL-7895</a>

Review comment:
       @vdiravka, are you sure that heap memory usage is the same? I assumed that the main reason for using `ParquetColumnChunkPageWriteStore` was to use direct memory instead of heap memory...
   From the code perspective, it looks like nothing was done in this direction for `ColumnChunkPageWriteStore`: it still uses `ConcatenatingByteArrayCollector` to collect data before writing it to the file, whereas our version uses `CapacityByteArrayOutputStream`, which uses the provided allocator.
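   
   For readers less familiar with the distinction being drawn here, a standalone toy example (plain JDK, not Drill code) contrasting heap and direct buffers:
   ```java
   import java.nio.ByteBuffer;
   
   public class HeapVsDirect {
     public static void main(String[] args) {
       ByteBuffer heap = ByteBuffer.allocate(1024);         // byte[]-backed, GC-managed
       ByteBuffer direct = ByteBuffer.allocateDirect(1024); // allocated outside the Java heap
   
       System.out.println(heap.isDirect());   // false
       System.out.println(direct.isDirect()); // true
     }
   }
   ```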







[GitHub] [drill] vvysotskyi commented on a change in pull request #2143: DRILL-7825: Unknown logical type in Parquet

Posted by GitBox <gi...@apache.org>.
vvysotskyi commented on a change in pull request #2143:
URL: https://github.com/apache/drill/pull/2143#discussion_r612752350



##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet2/DrillParquetGroupConverter.java
##########
@@ -328,24 +329,26 @@ protected PrimitiveConverter getConverterForType(String name, PrimitiveType type
         }
       }
       case FIXED_LEN_BYTE_ARRAY:
-        switch (type.getOriginalType()) {
-          case DECIMAL: {
-            ParquetReaderUtility.checkDecimalTypeEnabled(options);
-            return getVarDecimalConverter(name, type);
-          }
-          case INTERVAL: {
-            IntervalWriter writer = type.isRepetition(Repetition.REPEATED)
-                ? getWriter(name, (m, f) -> m.list(f).interval(), l -> l.list().interval())
-                : getWriter(name, (m, f) -> m.interval(f), l -> l.interval());
-            return new DrillFixedLengthByteArrayToInterval(writer);
-          }
-          default: {
-            VarBinaryWriter writer = type.isRepetition(Repetition.REPEATED)
-                ? getWriter(name, (m, f) -> m.list(f).varBinary(), l -> l.list().varBinary())
-                : getWriter(name, (m, f) -> m.varBinary(f), l -> l.varBinary());
-            return new DrillFixedBinaryToVarbinaryConverter(writer, type.getTypeLength(), mutator.getManagedBuffer());
+        // TODO: to follow the latest parquet code, rewrite it by using LogicalTypeAnnotation instead of OriginalType
+        OriginalType originalType = type.getOriginalType();
+        if (originalType != null) {
+          switch (originalType) {
+            case DECIMAL: {
+              ParquetReaderUtility.checkDecimalTypeEnabled(options);
+              return getVarDecimalConverter(name, type);
+            }
+            case INTERVAL: {
+              IntervalWriter writer = type.isRepetition(Repetition.REPEATED)
+                      ? getWriter(name, (m, f) -> m.list(f).interval(), l -> l.list().interval())
+                      : getWriter(name, (m, f) -> m.interval(f), l -> l.interval());
+              return new DrillFixedLengthByteArrayToInterval(writer);
+            }
           }
         }
+        VarBinaryWriter writer = type.isRepetition(Repetition.REPEATED)
+                ? getWriter(name, (m, f) -> m.list(f).varBinary(), l -> l.list().varBinary())
+                : getWriter(name, MapWriter::varBinary, ListWriter::varBinary);
+        return new DrillFixedBinaryToVarbinaryConverter(writer, type.getTypeLength(), mutator.getManagedBuffer());

Review comment:
       Don't you want to implement this TODO? :) It looks like it could be a simple visitor (by the way, the outer switch could be rewritten in a similar way).
   ```suggestion
           LogicalTypeAnnotation.LogicalTypeAnnotationVisitor<PrimitiveConverter> typeAnnotationVisitor = new LogicalTypeAnnotation.LogicalTypeAnnotationVisitor<PrimitiveConverter>() {
             @Override
             public Optional<PrimitiveConverter> visit(LogicalTypeAnnotation.DecimalLogicalTypeAnnotation decimalLogicalType) {
               ParquetReaderUtility.checkDecimalTypeEnabled(options);
               return Optional.of(getVarDecimalConverter(name, type));
             }
   
             @Override
             public Optional<PrimitiveConverter> visit(LogicalTypeAnnotation.IntervalLogicalTypeAnnotation intervalLogicalType) {
               IntervalWriter writer = type.isRepetition(Repetition.REPEATED)
                   ? getWriter(name, (m, f) -> m.list(f).interval(), l -> l.list().interval())
                   : getWriter(name, MapWriter::interval, ListWriter::interval);
               return Optional.of(new DrillFixedLengthByteArrayToInterval(writer));
             }
           };
   
           // Fall back to VARBINARY when there is no annotation or the visitor has no mapping for it.
           Supplier<PrimitiveConverter> varBinaryConverter = () -> {
             VarBinaryWriter writer = type.isRepetition(Repetition.REPEATED)
                 ? getWriter(name, (m, f) -> m.list(f).varBinary(), l -> l.list().varBinary())
                 : getWriter(name, MapWriter::varBinary, ListWriter::varBinary);
             return new DrillFixedBinaryToVarbinaryConverter(writer, type.getTypeLength(), mutator.getManagedBuffer());
           };
   
           LogicalTypeAnnotation logicalTypeAnnotation = type.getLogicalTypeAnnotation();
           return logicalTypeAnnotation == null
               ? varBinaryConverter.get()
               : logicalTypeAnnotation.accept(typeAnnotationVisitor).orElseGet(varBinaryConverter);
   ```
   (Note: this also needs `java.util.function.Supplier` imported.)

##########
File path: exec/java-exec/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java
##########
@@ -0,0 +1,1633 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.parquet.hadoop;
+
+import static org.apache.parquet.format.Util.writeFileCryptoMetaData;
+import static org.apache.parquet.format.Util.writeFileMetaData;
+import static org.apache.parquet.format.converter.ParquetMetadataConverter.MAX_STATS_SIZE;
+import static org.apache.parquet.hadoop.ParquetWriter.DEFAULT_BLOCK_SIZE;
+import static org.apache.parquet.hadoop.ParquetWriter.MAX_PADDING_SIZE_DEFAULT;
+
+import java.io.ByteArrayOutputStream;
+import java.io.IOException;
+import java.nio.charset.StandardCharsets;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.LinkedHashSet;
+import java.util.List;
+import java.util.Map;
+import java.util.Map.Entry;
+import java.util.Set;
+import java.util.zip.CRC32;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+
+import org.apache.parquet.Preconditions;
+import org.apache.parquet.Version;
+import org.apache.parquet.bytes.BytesInput;
+import org.apache.parquet.bytes.BytesUtils;
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.column.Encoding;
+import org.apache.parquet.column.EncodingStats;
+import org.apache.parquet.column.ParquetProperties;
+import org.apache.parquet.column.page.DictionaryPage;
+import org.apache.parquet.column.statistics.Statistics;
+import org.apache.parquet.column.values.bloomfilter.BloomFilter;
+import org.apache.parquet.crypto.AesCipher;
+import org.apache.parquet.crypto.ColumnEncryptionProperties;
+import org.apache.parquet.crypto.FileEncryptionProperties;
+import org.apache.parquet.crypto.InternalColumnEncryptionSetup;
+import org.apache.parquet.crypto.InternalFileEncryptor;
+import org.apache.parquet.crypto.ModuleCipherFactory;
+import org.apache.parquet.crypto.ModuleCipherFactory.ModuleType;
+import org.apache.parquet.crypto.ParquetCryptoRuntimeException;
+import org.apache.parquet.hadoop.ParquetOutputFormat.JobSummaryLevel;
+import org.apache.parquet.hadoop.metadata.ColumnPath;
+import org.apache.parquet.format.BlockCipher;
+import org.apache.parquet.format.Util;
+import org.apache.parquet.format.converter.ParquetMetadataConverter;
+import org.apache.parquet.hadoop.metadata.BlockMetaData;
+import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
+import org.apache.parquet.hadoop.metadata.CompressionCodecName;
+import org.apache.parquet.hadoop.metadata.StrictKeyValueMetadataMergeStrategy;
+import org.apache.parquet.hadoop.metadata.FileMetaData;
+import org.apache.parquet.hadoop.metadata.GlobalMetaData;
+import org.apache.parquet.hadoop.metadata.KeyValueMetadataMergeStrategy;
+import org.apache.parquet.hadoop.metadata.ParquetMetadata;
+import org.apache.parquet.hadoop.util.HadoopOutputFile;
+import org.apache.parquet.hadoop.util.HadoopStreams;
+import org.apache.parquet.internal.column.columnindex.ColumnIndex;
+import org.apache.parquet.internal.column.columnindex.ColumnIndexBuilder;
+import org.apache.parquet.internal.column.columnindex.OffsetIndex;
+import org.apache.parquet.internal.column.columnindex.OffsetIndexBuilder;
+import org.apache.parquet.internal.hadoop.metadata.IndexReference;
+import org.apache.parquet.io.InputFile;
+import org.apache.parquet.io.OutputFile;
+import org.apache.parquet.io.SeekableInputStream;
+import org.apache.parquet.io.ParquetEncodingException;
+import org.apache.parquet.io.PositionOutputStream;
+import org.apache.parquet.schema.MessageType;
+import org.apache.parquet.schema.PrimitiveType;
+import org.apache.parquet.schema.TypeUtil;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * Internal implementation of the Parquet file writer as a block container<br>
+ * Note: this is temporary Drill-Parquet class needed to write empty parquet files. Details in
+ * <a href="https://issues.apache.org/jira/browse/PARQUET-2026">PARQUET-2026</a>
+ */
+public class ParquetFileWriter {

Review comment:
       Is it possible somehow to reuse the original `ParquetFileWriter` to avoid such large code duplication?

##########
File path: exec/java-exec/src/main/java/org/apache/parquet/hadoop/ParquetColumnChunkPageWriteStore.java
##########
@@ -260,14 +260,16 @@ public long getMemSize() {
     }
 
     /**
-     * Writes a number of pages within corresponding column chunk
+     * Writes a number of pages within corresponding column chunk <br>
+     * // TODO: the Bloom Filter can be useful in filtering entire row groups,
+     *     see <a href="https://issues.apache.org/jira/browse/DRILL-7895">DRILL-7895</a>

Review comment:
       This class was created as a copy of the `ColumnChunkPageWriteStore` class from the parquet library (see DRILL-5544 for details).
   
   Since it is a copy, it is better to sync it with the original version instead of adding a TODO about porting specific features from it...

##########
File path: exec/jdbc-all/pom.xml
##########
@@ -575,7 +575,7 @@
 
         <build>
           <plugins>
-            <plugin>
+            <plugin> <!-- TODO: this plugin shares common parts with the default profile. Factor out these common parts to avoid duplicate code -->

Review comment:
       Could you please implement this TODO? It doesn't look complicated.

##########
File path: pom.xml
##########
@@ -47,9 +47,9 @@
     <junit.version>4.12</junit.version>
     <slf4j.version>1.7.26</slf4j.version>
     <shaded.guava.version>28.2-jre</shaded.guava.version>
-    <guava.version>19.0</guava.version>
+    <guava.version>19.0</guava.version> <!--todo: 28.2-jre guava can be used here-->

Review comment:
       Let's create a ticket instead of adding the comment. Also, there are newer versions of Guava.







[GitHub] [drill] cgivre commented on a change in pull request #2143: DRILL-7825: Unknown logical type in Parquet

Posted by GitBox <gi...@apache.org>.
cgivre commented on a change in pull request #2143:
URL: https://github.com/apache/drill/pull/2143#discussion_r613468149



##########
File path: pom.xml
##########
@@ -47,9 +47,9 @@
     <junit.version>4.12</junit.version>
     <slf4j.version>1.7.26</slf4j.version>
     <shaded.guava.version>28.2-jre</shaded.guava.version>
-    <guava.version>19.0</guava.version>
+    <guava.version>19.0</guava.version> <!--todo: 28.2-jre guava can be used here-->

Review comment:
       Actually, there's a CVE for guava < 29.  Could we upgrade to guava 30-jre?
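   
       For context, this likely refers to CVE-2020-8908, in which Guava's `Files.createTempDir()` creates a directory that other local users can read; Guava 30.0 deprecated the method. A sketch of the affected call (illustrative only):
   ```java
   import com.google.common.io.Files;
   
   import java.io.File;
   
   public class GuavaTempDirExample {
     public static void main(String[] args) {
       // CVE-2020-8908: on POSIX systems this directory is not restricted to the
       // current user; java.nio.file.Files.createTempDirectory is the recommended
       // replacement.
       File tmp = Files.createTempDir();
       System.out.println(tmp.getAbsolutePath());
     }
   }
   ```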







[GitHub] [drill] vvysotskyi commented on a change in pull request #2143: DRILL-7825: Unknown logical type in Parquet

Posted by GitBox <gi...@apache.org>.
vvysotskyi commented on a change in pull request #2143:
URL: https://github.com/apache/drill/pull/2143#discussion_r615276820



##########
File path: pom.xml
##########
@@ -47,9 +47,9 @@
     <junit.version>4.12</junit.version>
     <slf4j.version>1.7.26</slf4j.version>
     <shaded.guava.version>28.2-jre</shaded.guava.version>
-    <guava.version>19.0</guava.version>
+    <guava.version>19.0</guava.version> <!--todo: 28.2-jre guava can be used here-->

Review comment:
       So the comment can be removed?







[GitHub] [drill] vdiravka commented on a change in pull request #2143: DRILL-7825: Unknown logical type in Parquet

Posted by GitBox <gi...@apache.org>.
vdiravka commented on a change in pull request #2143:
URL: https://github.com/apache/drill/pull/2143#discussion_r619089030



##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetRecordWriter.java
##########
@@ -263,10 +258,11 @@ private void newSchema() throws IOException {
         .withAllocator(new ParquetDirectByteBufferAllocator(oContext))
         .withValuesWriterFactory(new DefaultV1ValuesWriterFactory())
         .build();
-    pageStore = new ParquetColumnChunkPageWriteStore(codecFactory.getCompressor(codec), schema, initialSlabSize,
-        pageSize, parquetProperties.getAllocator(), parquetProperties.getPageWriteChecksumEnabled(),
-        parquetProperties.getColumnIndexTruncateLength()
-    );
+    // TODO: Replace ParquetColumnChunkPageWriteStore with ColumnChunkPageWriteStore from parquet library
+    // once PARQUET-1006 is resolved

Review comment:
       [DRILL-7906](https://issues.apache.org/jira/browse/DRILL-7906) and [DRILL-7907](https://issues.apache.org/jira/browse/DRILL-7907) created







[GitHub] [drill] cgivre commented on a change in pull request #2143: DRILL-7825: Unknown logical type in Parquet

Posted by GitBox <gi...@apache.org>.
cgivre commented on a change in pull request #2143:
URL: https://github.com/apache/drill/pull/2143#discussion_r612546497



##########
File path: pom.xml
##########
@@ -47,9 +47,9 @@
     <junit.version>4.12</junit.version>
     <slf4j.version>1.7.26</slf4j.version>
     <shaded.guava.version>28.2-jre</shaded.guava.version>
-    <guava.version>19.0</guava.version>
+    <guava.version>19.0</guava.version> <!--todo: 28.2-jre guava can be used here-->

Review comment:
       Can we create a JIRA for this?







[GitHub] [drill] vdiravka commented on pull request #2143: DRILL-7825: Unknown logical type in Parquet

Posted by GitBox <gi...@apache.org>.
vdiravka commented on pull request #2143:
URL: https://github.com/apache/drill/pull/2143#issuecomment-824370190


   @vvysotskyi Thanks for the code review. I've squashed my changes into one commit and left your CR suggestion as a separate one. Is it fine for you to go with 2 commits, or do you prefer that I squash them?





[GitHub] [drill] vdiravka commented on a change in pull request #2143: DRILL-7825: Unknown logical type in Parquet

Posted by GitBox <gi...@apache.org>.
vdiravka commented on a change in pull request #2143:
URL: https://github.com/apache/drill/pull/2143#discussion_r614836457



##########
File path: pom.xml
##########
@@ -47,9 +47,9 @@
     <junit.version>4.12</junit.version>
     <slf4j.version>1.7.26</slf4j.version>
     <shaded.guava.version>28.2-jre</shaded.guava.version>
-    <guava.version>19.0</guava.version>
+    <guava.version>19.0</guava.version> <!--todo: 28.2-jre guava can be used here-->

Review comment:
       WIP: [DRILL-7904: Update to 30-jre Guava version](https://issues.apache.org/jira/browse/DRILL-7904)

##########
File path: exec/java-exec/src/main/java/org/apache/parquet/hadoop/ParquetColumnChunkPageWriteStore.java
##########
@@ -260,14 +260,16 @@ public long getMemSize() {
     }
 
     /**
-     * Writes a number of pages within corresponding column chunk
+     * Writes a number of pages within corresponding column chunk <br>
+     * // TODO: the Bloom Filter can be useful in filtering entire row groups,
+     *     see <a href="https://issues.apache.org/jira/browse/DRILL-7895">DRILL-7895</a>

Review comment:
       I double-checked Parquet's `ColumnChunkPageWriteStore`, and it looks like we still use `ParquetDirectByteBufferAllocator` and allocate `DrillBuf`, because `ParquetProperties` is initialized with the proper allocator (see `ParquetRecordWriter#258`). I also debugged the `TestParquetWriter.testTPCHReadWriteRunRepeated` test case and found that Drill allocates the same amount of heap memory for `byte[]` with `ColumnChunkPageWriteStore` and the old `ParquetColumnChunkPageWriteStore` (~50% with my default settings).
   So we can update `ParquetRecordWriter` to use `ColumnChunkPageWriteStore`.

##########
File path: exec/jdbc-all/pom.xml
##########
@@ -575,7 +575,7 @@
 
         <build>
           <plugins>
-            <plugin>
+            <plugin> <!-- TODO: this plugin shares common parts with the default profile. Factor out these common parts to avoid duplicate code -->

Review comment:
       `maven-enforcer-plugin` is removed from the `mapr` profile because exactly the same plugin exists in the default profile.
   There is also a very similar `maven-shade-plugin`, but with some differences, so before merging that plugin it is better to check it on a `mapr` cluster, I think.

##########
File path: pom.xml
##########
@@ -47,9 +47,9 @@
     <junit.version>4.12</junit.version>
     <slf4j.version>1.7.26</slf4j.version>
     <shaded.guava.version>28.2-jre</shaded.guava.version>
-    <guava.version>19.0</guava.version>
+    <guava.version>19.0</guava.version> <!--todo: 28.2-jre guava can be used here-->

Review comment:
       WIP: [DRILL-7904: Update to 30-jre Guava version](https://issues.apache.org/jira/browse/DRILL-7904)







[GitHub] [drill] vvysotskyi commented on a change in pull request #2143: DRILL-7825: Unknown logical type in Parquet

Posted by GitBox <gi...@apache.org>.
vvysotskyi commented on a change in pull request #2143:
URL: https://github.com/apache/drill/pull/2143#discussion_r615276714



##########
File path: exec/jdbc-all/pom.xml
##########
@@ -575,7 +575,7 @@
 
         <build>
           <plugins>
-            <plugin>
+            <plugin> <!-- TODO: this plugin shares common parts with the default profile. Factor out these common parts to avoid duplicate code -->

Review comment:
       Ok, thanks!







[GitHub] [drill] vvysotskyi commented on a change in pull request #2143: DRILL-7825: Unknown logical type in Parquet

Posted by GitBox <gi...@apache.org>.
vvysotskyi commented on a change in pull request #2143:
URL: https://github.com/apache/drill/pull/2143#discussion_r617911298



##########
File path: exec/java-exec/src/main/java/org/apache/parquet/hadoop/ParquetColumnChunkPageWriteStore.java
##########
@@ -260,14 +260,16 @@ public long getMemSize() {
     }
 
     /**
-     * Writes a number of pages within corresponding column chunk
+     * Writes a number of pages within corresponding column chunk <br>
+     * // TODO: the Bloom Filter can be useful in filtering entire row groups,
+     *     see <a href="https://issues.apache.org/jira/browse/DRILL-7895">DRILL-7895</a>

Review comment:
       @vdiravka, thanks for sharing the screenshots and providing more details.
   
   > 3. And we convert that buf to bytes via BytesInput.from(buf) and compressedBytes.writeAllTo(buf), so all data is still placed on the heap.
   
   Please note that calling `BytesInput.from(buf)` doesn't convert all bytes of the buffer at once; it creates a `CapacityBAOSBytesInput` that wraps the provided `CapacityByteArrayOutputStream` and uses it when writing to the OutputStream.
   Regarding the `compressedBytes.writeAllTo(buf)` call, it is fine to have bytes here since GC will take care of them; there is no risk of leaks, and data that should be processed later will be stored in direct memory.
   
   But when using `ConcatenatingByteArrayCollector`, all bytes are stored on the heap (including data that should be processed later), so GC has no power here.
   
   Not sure why the heap usage you measured is similar; perhaps it will make a difference with more data, or GC does its work right before flushing data from `buf`...
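   
   A hedged sketch of the lazy wrapping described above (assuming the parquet-common `BytesInput.from(CapacityByteArrayOutputStream)` overload and the allocator-taking constructor as of parquet 1.12):
   ```java
   import org.apache.parquet.bytes.BytesInput;
   import org.apache.parquet.bytes.CapacityByteArrayOutputStream;
   import org.apache.parquet.bytes.DirectByteBufferAllocator;
   
   import java.io.ByteArrayOutputStream;
   import java.io.IOException;
   
   public class BytesInputWrapExample {
     public static void main(String[] args) throws IOException {
       // Slabs are obtained from the allocator (direct memory here), not the heap.
       CapacityByteArrayOutputStream buf =
           new CapacityByteArrayOutputStream(64, 1024, new DirectByteBufferAllocator());
       buf.write(new byte[] {1, 2, 3});
   
       BytesInput wrapped = BytesInput.from(buf); // wraps the stream, no copy yet
   
       ByteArrayOutputStream sink = new ByteArrayOutputStream();
       wrapped.writeAllTo(sink); // bytes reach the sink only at this point
     }
   }
   ```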







[GitHub] [drill] vvysotskyi commented on a change in pull request #2143: DRILL-7825: Unknown logical type in Parquet

Posted by GitBox <gi...@apache.org>.
vvysotskyi commented on a change in pull request #2143:
URL: https://github.com/apache/drill/pull/2143#discussion_r613471215



##########
File path: exec/java-exec/src/main/java/org/apache/parquet/hadoop/ParquetColumnChunkPageWriteStore.java
##########
@@ -260,14 +260,16 @@ public long getMemSize() {
     }
 
     /**
-     * Writes a number of pages within corresponding column chunk
+     * Writes a number of pages within corresponding column chunk <br>
+     * // TODO: the Bloom Filter can be useful in filtering entire row groups,
+     *     see <a href="https://issues.apache.org/jira/browse/DRILL-7895">DRILL-7895</a>

Review comment:
       I mean applying the same changes that were made earlier to the copy of this class to the newer version; it shouldn't be too complex...







[GitHub] [drill] vdiravka commented on a change in pull request #2143: DRILL-7825: Unknown logical type in Parquet

Posted by GitBox <gi...@apache.org>.
vdiravka commented on a change in pull request #2143:
URL: https://github.com/apache/drill/pull/2143#discussion_r617883888



##########
File path: exec/java-exec/src/main/java/org/apache/parquet/hadoop/ParquetColumnChunkPageWriteStore.java
##########
@@ -260,14 +260,16 @@ public long getMemSize() {
     }
 
     /**
-     * Writes a number of pages within corresponding column chunk
+     * Writes a number of pages within corresponding column chunk <br>
+     * // TODO: the Bloom Filter can be useful in filtering entire row groups,
+     *     see <a href="https://issues.apache.org/jira/browse/DRILL-7895">DRILL-7895</a>

Review comment:
       1. I have checked heap memory after creating `ColumnChunkPageWriteStore` with `VisualVM`; the size is the same:
   https://ibb.co/xFYqC0m
   https://ibb.co/fNB7MBq
   2. The allocator is passed to `ColumnChunkPageWriteStore` and to `ColumnChunkPageWriter` too, and `DrillBuf` really is used in the process of writing the parquet file.
   3. And we convert that `buf` to bytes via `BytesInput.from(buf)` and `compressedBytes.writeAllTo(buf)`, so all data is still placed on the heap.
   4. We already have several other places where `ColumnChunkPageWriteStore` is used indirectly.
   
   So it looks like the updated `ColumnChunkPageWriteStore` will manage heap memory even better when creating parquet files via Drill, and we are safe to go with the current change.
   
   And the proper way to use direct memory more than we do now is to make improvements in Parquet. One of them is [PARQUET-1771](https://github.com/apache/parquet-mr/pull/749), but that one will not help here. So I want to proceed with PARQUET-1006. It looks like we can use a direct-memory `buf` for `ColumnChunkPageWriteStore`, `ParquetFileWriter`, and `ByteArrayOutputStream`. I am planning to ask the community about it.



