You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@orc.apache.org by GitBox <gi...@apache.org> on 2021/02/19 00:22:32 UTC

[GitHub] [orc] omalley commented on a change in pull request #635: ORC-742: LazyIO for non-filter columns

omalley commented on a change in pull request #635:
URL: https://github.com/apache/orc/pull/635#discussion_r578836875



##########
File path: java/core/src/java/org/apache/orc/impl/TreeReaderFactory.java
##########
@@ -254,7 +273,7 @@ protected static IntegerReader createIntegerReader(OrcProto.ColumnEncoding.Kind
       }
     }
 
-    public void startStripe(StripePlanner planner) throws IOException {
+    public void startStripe(StripePlanner planner, ReadLevel readLevel) throws IOException {

Review comment:
       Do we really need the readLevel for start stripe? is startStripe(planner, FOLLOW) where we plan/read the next set of data?

##########
File path: java/core/src/java/org/apache/orc/Reader.java
##########
@@ -181,6 +183,8 @@
    * Options for creating a RecordReader.
    */
   class Options implements Cloneable {
+    private static final Logger LOG = LoggerFactory.getLogger(Options.class);

Review comment:
       Is this used?

##########
File path: java/core/src/java/org/apache/orc/impl/reader/tree/ReadLevel.java
##########
@@ -0,0 +1,25 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl.reader.tree;
+
+public enum ReadLevel {
+  ALL,     // Read every column

Review comment:
       I think that we should remove ReadLevel.ALL and the functions like nextBatch should pass an EnumSet<ReadLevel>. Otherwise, I'm afraid that too many uses will fall through the cracks where ALL doesn't include the other values.

##########
File path: java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java
##########
@@ -83,17 +85,24 @@
   private final boolean[] fileIncluded;
   private final long rowIndexStride;
   private long rowInStripe = 0;
+  private long followRowInStripe = 0;
   private int currentStripe = -1;
   private long rowBaseInStripe = 0;
   private long rowCountInStripe = 0;
   private final BatchReader reader;
   private final OrcIndex indexes;
+  // identifies the columns requiring row indexes
+  private final boolean[] rowIndexCols;
   private final SargApplier sargApp;
   // an array about which row groups aren't skipped
   private boolean[] includedRowGroups = null;
   private final DataReader dataReader;
   private final int maxDiskRangeChunkLimit;
   private final StripePlanner planner;
+  // identifies the type of read: FULL (no filters), LEAD (filters are present)
+  private final ReadLevel readLevel;

Review comment:
       I think I'd remove this and just check whether the row level filter is null or not.

##########
File path: java/core/src/java/org/apache/orc/filter/OrcFilterContext.java
##########
@@ -0,0 +1,169 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.filter;
+
+import org.apache.hadoop.hive.ql.exec.vector.ColumnVector;
+import org.apache.hadoop.hive.ql.exec.vector.StructColumnVector;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
+import org.apache.hadoop.hive.ql.io.filter.MutableFilterContext;
+import org.apache.orc.TypeDescription;
+import org.jetbrains.annotations.NotNull;
+
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.regex.Pattern;
+
+/**
+ * This defines the input for any filter operation.
+ * <p>
+ * It offers a convenience method for finding the column vector from a given name, that the filters
+ * can invoke to get access to the column vector.
+ */
+public class OrcFilterContext implements MutableFilterContext {
+
+  VectorizedRowBatch batch = null;
+  // Cache of field to ColumnVector, this is reset everytime the batch reference changes
+  private final Map<String, ColumnVector> vectors;
+  private final TypeDescription readSchema;
+
+  public OrcFilterContext(TypeDescription readSchema) {
+    this.readSchema = readSchema;
+    this.vectors = new HashMap<>();
+  }
+
+  public OrcFilterContext setBatch(@NotNull VectorizedRowBatch batch) {
+    if (batch != this.batch) {
+      this.batch = batch;
+      vectors.clear();
+    }
+    return this;
+  }
+
+  @Override
+  public void setFilterContext(boolean selectedInUse, int[] selected, int selectedSize) {
+    batch.setFilterContext(selectedInUse, selected, selectedSize);
+  }
+
+  @Override
+  public boolean validateSelected() {
+    return batch.validateSelected();
+  }
+
+  @Override
+  public int[] updateSelected(int i) {
+    return batch.updateSelected(i);
+  }
+
+  @Override
+  public void setSelectedInUse(boolean b) {
+    batch.setSelectedInUse(b);
+  }
+
+  @Override
+  public void setSelected(int[] ints) {
+    batch.setSelected(ints);
+  }
+
+  @Override
+  public void setSelectedSize(int i) {
+    batch.setSelectedSize(i);
+  }
+
+  @Override
+  public void reset() {
+    batch.reset();
+  }
+
+  @Override
+  public boolean isSelectedInUse() {
+    return batch.isSelectedInUse();
+  }
+
+  @Override
+  public int[] getSelected() {
+    return batch.getSelected();
+  }
+
+  @Override
+  public int getSelectedSize() {
+    return batch.getSelectedSize();
+  }
+
+  public ColumnVector[] getCols() {
+    return batch.cols;
+  }
+
+  /**
+   * Retrieves the column vector that matches the specified name. Allows support for nested struct
+   * references e.g. order.date where data is a field in a struct called order.
+   *
+   * @param name The column name whose vector should be retrieved
+   * @return The column vector
+   * @throws IllegalArgumentException if the field is not found or if the nested field is not part
+   *                                  of a struct
+   */
+  public ColumnVector findColumnVector(String name) {
+    if (!vectors.containsKey(name)) {
+      vectors.put(name, findVector(name));
+    }
+
+    return vectors.get(name);
+  }
+
+  private ColumnVector findVector(String name) {
+    String[] refs = name.split(SPLIT_ON_PERIOD);

Review comment:
       +1 with the exception that this code needs to navigate the VectorizedRowBatch/ColumnVector tree.

##########
File path: java/core/src/java/org/apache/orc/impl/BufferChunkList.java
##########
@@ -22,6 +22,14 @@
  * Builds a list of buffer chunks
  */
 public class BufferChunkList {
+  public BufferChunk getHead() {

Review comment:
       +1

##########
File path: java/core/src/java/org/apache/orc/impl/reader/tree/LevelTypeReader.java
##########
@@ -0,0 +1,107 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl.reader.tree;
+
+import org.apache.hadoop.hive.ql.exec.vector.ColumnVector;
+import org.apache.hadoop.hive.ql.io.filter.FilterContext;
+import org.apache.orc.OrcProto;
+import org.apache.orc.impl.PositionProvider;
+import org.apache.orc.impl.reader.StripePlanner;
+
+import java.io.IOException;
+
+/**
+ * Wrapper reader that implements the level logic.
+ * The methods are invoked on the reader only if the read level matches.
+ */
+public class LevelTypeReader implements TypeReader {
+  private final TypeReader reader;

Review comment:
       Since the ReadLevel is only used here, can we take the ReadLevel field and getReadLevel() and move it from the TypeReader to here?

##########
File path: java/core/src/java/org/apache/orc/filter/OrcFilterContext.java
##########
@@ -0,0 +1,169 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.filter;
+
+import org.apache.hadoop.hive.ql.exec.vector.ColumnVector;
+import org.apache.hadoop.hive.ql.exec.vector.StructColumnVector;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
+import org.apache.hadoop.hive.ql.io.filter.MutableFilterContext;
+import org.apache.orc.TypeDescription;
+import org.jetbrains.annotations.NotNull;
+
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.regex.Pattern;
+
+/**
+ * This defines the input for any filter operation.
+ * <p>
+ * It offers a convenience method for finding the column vector from a given name, that the filters
+ * can invoke to get access to the column vector.
+ */
+public class OrcFilterContext implements MutableFilterContext {

Review comment:
       I'd suggest that we make this a new interface and make a corresponding Impl class. I believe the only function that we need in the interface is ColumnVector findVector(String). 
   
   The changes to Reader should stay with the interface, which the field in ReaderImpl should become the impl class.

##########
File path: java/core/src/java/org/apache/orc/impl/reader/StripePlanner.java
##########
@@ -363,11 +396,26 @@ private void addChunk(BufferChunkList list, StreamInformation stream,
         stream.firstChunk = chunk;
       }
       list.add(chunk);
+      stream.lastChunk = chunk;
       offset += thisLen;
       length -= thisLen;
     }
   }
 
+  private BufferChunkList addFollowChunk(long offset, long length) {

Review comment:
       Agreed.

##########
File path: java/core/src/java/org/apache/orc/impl/reader/tree/LevelTypeReader.java
##########
@@ -0,0 +1,107 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl.reader.tree;
+
+import org.apache.hadoop.hive.ql.exec.vector.ColumnVector;
+import org.apache.hadoop.hive.ql.io.filter.FilterContext;
+import org.apache.orc.OrcProto;
+import org.apache.orc.impl.PositionProvider;
+import org.apache.orc.impl.reader.StripePlanner;
+
+import java.io.IOException;
+
+/**
+ * Wrapper reader that implements the level logic.
+ * The methods are invoked on the reader only if the read level matches.
+ */
+public class LevelTypeReader implements TypeReader {
+  private final TypeReader reader;
+
+  public LevelTypeReader(TypeReader reader) {
+    this.reader = reader;
+  }
+
+  @Override
+  public void checkEncoding(OrcProto.ColumnEncoding encoding) throws IOException {
+    reader.checkEncoding(encoding);
+  }
+
+  @Override
+  public void startStripe(StripePlanner planner, ReadLevel readLevel) throws IOException {
+    if (reader.getReadLevel() != readLevel) {
+      return;
+    }
+
+    reader.startStripe(planner, readLevel);
+  }
+
+  @Override
+  public void seek(PositionProvider[] index, ReadLevel readLevel) throws IOException {
+    if (reader.getReadLevel() != readLevel) {
+      return;
+    }
+
+    reader.seek(index, readLevel);
+  }
+
+  @Override
+  public void seek(PositionProvider index, ReadLevel readLevel) throws IOException {
+    if (reader.getReadLevel() != readLevel) {
+      return;
+    }
+
+    reader.seek(index, readLevel);
+  }
+
+  @Override
+  public void skipRows(long rows, ReadLevel readLevel) throws IOException {
+    if (reader.getReadLevel() != readLevel) {
+      return;
+    }
+
+    reader.skipRows(rows, readLevel);
+  }
+
+  @Override
+  public void nextVector(ColumnVector previous,
+                         boolean[] isNull,
+                         int batchSize,
+                         FilterContext filterContext,
+                         ReadLevel readLevel) throws IOException {
+    if (reader.getReadLevel() != readLevel) {

Review comment:
       Like I wrote elsewhere, I think this parameter should be an EnumSet<ReadLevel> and the test should be readLevel.contains(reader.getLevel()).

##########
File path: java/core/src/java/org/apache/orc/impl/TreeReaderFactory.java
##########
@@ -2854,11 +2924,19 @@ public static TypeReader createTreeReader(TypeDescription readerType,
     }
   }
 
+  public static boolean isDecimalAsLong(OrcFile.Version version, int precision) {
+    return version == OrcFile.Version.UNSTABLE_PRE_2_0 &&
+    precision <= TypeDescription.MAX_DECIMAL64_PRECISION;
+  }
+
   public static BatchReader createRootReader(TypeDescription readerType, Context context)
           throws IOException {
     TypeReader reader = createTreeReader(readerType, context);
     if (reader instanceof StructTreeReader) {
-      return new StructBatchReader((StructTreeReader) reader, context);
+      return new StructBatchReader(reader, context);
+    } else if (reader instanceof LevelTypeReader
+               && ((LevelTypeReader)reader).getReader() instanceof StructTreeReader) {
+      return new LevelStructBatchReader((LevelTypeReader) reader, context);

Review comment:
       I'm worried that doubling the number of TypeReaders in the tree is fair amount of overhead. Could we make it just at the transitions from LEAD to FOLLOW where we insert the extra LevelTypeReaders?

##########
File path: java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java
##########
@@ -83,17 +85,21 @@
   private final boolean[] fileIncluded;
   private final long rowIndexStride;
   private long rowInStripe = 0;
+  private long followRowInStripe = 0;
   private int currentStripe = -1;
   private long rowBaseInStripe = 0;
   private long rowCountInStripe = 0;
   private final BatchReader reader;
   private final OrcIndex indexes;
+  private final boolean[] rowIndexCols;
   private final SargApplier sargApp;
   // an array about which row groups aren't skipped
   private boolean[] includedRowGroups = null;
   private final DataReader dataReader;
   private final int maxDiskRangeChunkLimit;
   private final StripePlanner planner;
+  private final ReadLevel readLevel;
+  private boolean needFollowStripe;

Review comment:
       I think this should be inverted and name haveReadFollowData.

##########
File path: java/core/src/java/org/apache/orc/filter/OrcFilterContext.java
##########
@@ -0,0 +1,169 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.filter;
+
+import org.apache.hadoop.hive.ql.exec.vector.ColumnVector;
+import org.apache.hadoop.hive.ql.exec.vector.StructColumnVector;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
+import org.apache.hadoop.hive.ql.io.filter.MutableFilterContext;
+import org.apache.orc.TypeDescription;
+import org.jetbrains.annotations.NotNull;
+
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.regex.Pattern;
+
+/**
+ * This defines the input for any filter operation.
+ * <p>
+ * It offers a convenience method for finding the column vector from a given name, that the filters
+ * can invoke to get access to the column vector.
+ */
+public class OrcFilterContext implements MutableFilterContext {

Review comment:
       Should the name be findColumnVector?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org