You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@orc.apache.org by GitBox <gi...@apache.org> on 2022/11/04 09:39:15 UTC

[GitHub] [orc] luffy-zh opened a new pull request, #1307: ORC-1298:[C++] Support dedicated ColumnVectorBatch of numeric types

luffy-zh opened a new pull request, #1307:
URL: https://github.com/apache/orc/pull/1307

   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. File a JIRA issue first and use it as a prefix of your PR title, e.g., `ORC-001: Fix ABC`.
     2. Use your PR title to summarize what this PR proposes instead of describing the problem.
     3. Make PR title and description complete because these will be the permanent commit log.
     4. If possible, provide a concise and reproducible example to reproduce the issue for a faster review.
     5. If the PR is unfinished, use GitHub PR Draft feature.
   -->
   
   ### What changes were proposed in this pull request?
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If there is a discussion in the mailing list, please add the link.
   -->
   This patch implements int8/int16/int32/float buffer and refactors the RLE encoder/decoder, IntegerColumnWriter/Reader, and DoubleColumnWriter/Reader.
   
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   For the context of [ORC-1298](https://issues.apache.org/jira/projects/ORC/issues/ORC-1298?filter=allopenissues), this change makes user no need to focus on overflow.
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   -->
   Add the WriterTest.testWriterFixedWidthNumericVectorBatch UT.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] dongjoon-hyun commented on a diff in pull request #1307: ORC-1298:[C++] Support dedicated ColumnVectorBatch of numeric types

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on code in PR #1307:
URL: https://github.com/apache/orc/pull/1307#discussion_r1239389952


##########
c++/src/ColumnReader.cc:
##########
@@ -1696,13 +1690,25 @@ namespace orc {
   /**
    * Create a reader for the given stripe.
    */
-  std::unique_ptr<ColumnReader> buildReader(const Type& type, StripeStreams& stripe) {
+  std::unique_ptr<ColumnReader> buildReader(const Type& type, StripeStreams& stripe,
+                                            bool useTightNumericVector) {
     switch (static_cast<int64_t>(type.getKind())) {
-      case DATE:
-      case INT:
+      case SHORT: {
+        if (useTightNumericVector) {
+          return std::unique_ptr<ColumnReader>(
+              new IntegerColumnReader<ShortVectorBatch>(type, stripe));
+        }
+      }
+      case INT: {

Review Comment:
   Yes, we need to fix or mitigate this for Apache ORC 1.9.0 because this causes Docker Test failures.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] dongjoon-hyun commented on a diff in pull request #1307: ORC-1298:[C++] Support dedicated ColumnVectorBatch of numeric types

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on code in PR #1307:
URL: https://github.com/apache/orc/pull/1307#discussion_r1239431010


##########
c++/src/ColumnReader.cc:
##########
@@ -1696,13 +1690,25 @@ namespace orc {
   /**
    * Create a reader for the given stripe.
    */
-  std::unique_ptr<ColumnReader> buildReader(const Type& type, StripeStreams& stripe) {
+  std::unique_ptr<ColumnReader> buildReader(const Type& type, StripeStreams& stripe,
+                                            bool useTightNumericVector) {
     switch (static_cast<int64_t>(type.getKind())) {
-      case DATE:
-      case INT:
+      case SHORT: {
+        if (useTightNumericVector) {
+          return std::unique_ptr<ColumnReader>(
+              new IntegerColumnReader<ShortVectorBatch>(type, stripe));
+        }
+      }
+      case INT: {

Review Comment:
   I filed a JIRA and PRs.
   - https://issues.apache.org/jira/browse/ORC-1453
   - https://github.com/apache/orc/pull/1548 (for branch-1.9)
   - https://github.com/apache/orc/pull/1549 (for main)
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] coderex2522 commented on a diff in pull request #1307: ORC-1298:[C++] Support dedicated ColumnVectorBatch of numeric types

Posted by GitBox <gi...@apache.org>.
coderex2522 commented on code in PR #1307:
URL: https://github.com/apache/orc/pull/1307#discussion_r1017802578


##########
.gitignore:
##########
@@ -1,14 +0,0 @@
-build

Review Comment:
   This file is not associated with this pull request. Whether it is a misoperation?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] dongjoon-hyun commented on a diff in pull request #1307: ORC-1298:[C++] Support dedicated ColumnVectorBatch of numeric types

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on code in PR #1307:
URL: https://github.com/apache/orc/pull/1307#discussion_r1014774175


##########
c++/include/orc/Type.hh:
##########
@@ -70,10 +70,11 @@ namespace orc {
     /**
      * Create a row batch for this type.
      */
-    virtual ORC_UNIQUE_PTR<ColumnVectorBatch> createRowBatch(uint64_t size,
-                                                             MemoryPool& pool,
-                                                             bool encoded = false
-                                                             ) const = 0;
+    virtual ORC_UNIQUE_PTR<ColumnVectorBatch> createRowBatch(
+      uint64_t size,
+      MemoryPool& pool,
+      bool encoded = false,
+      bool enableFixedWidthNumericVectorBatch = false) const = 0;

Review Comment:
   @luffy-zh When you add one parameter, please avoid any change to the existing parameters. For example, indentation changes on line 73 to 76.



##########
c++/include/orc/Type.hh:
##########
@@ -70,10 +70,11 @@ namespace orc {
     /**
      * Create a row batch for this type.
      */
-    virtual ORC_UNIQUE_PTR<ColumnVectorBatch> createRowBatch(uint64_t size,
-                                                             MemoryPool& pool,
-                                                             bool encoded = false
-                                                             ) const = 0;
+    virtual ORC_UNIQUE_PTR<ColumnVectorBatch> createRowBatch(
+      uint64_t size,
+      MemoryPool& pool,
+      bool encoded = false,
+      bool enableFixedWidthNumericVectorBatch = false) const = 0;

Review Comment:
   @luffy-zh When you add one new parameter, please avoid any change to the existing parameters. For example, indentation changes on line 73 to 76.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] wgtmac commented on a diff in pull request #1307: ORC-1298:[C++] Support dedicated ColumnVectorBatch of numeric types

Posted by GitBox <gi...@apache.org>.
wgtmac commented on code in PR #1307:
URL: https://github.com/apache/orc/pull/1307#discussion_r1018098146


##########
c++/include/orc/Vector.hh:
##########
@@ -88,28 +88,116 @@ namespace orc {
     ColumnVectorBatch& operator=(const ColumnVectorBatch&);
   };
 
-  struct LongVectorBatch : public ColumnVectorBatch {
-    LongVectorBatch(uint64_t capacity, MemoryPool& pool);
-    virtual ~LongVectorBatch();
+  template <typename ValueType>
+  struct IntegerVectorBatch : public ColumnVectorBatch {
+    IntegerVectorBatch(uint64_t cap, MemoryPool& pool)
+        : ColumnVectorBatch(cap, pool), data(pool, cap) {
+      // PASS
+    }
 
-    DataBuffer<int64_t> data;
-    std::string toString() const;
-    void resize(uint64_t capacity);
-    void clear();
-    uint64_t getMemoryUsage();
+    virtual ~IntegerVectorBatch() = default;
+
+    inline std::string toString() const;
+
+    void resize(uint64_t cap) {
+      if (capacity < cap) {
+        ColumnVectorBatch::resize(cap);
+        data.resize(cap);
+      }
+    }
+
+    void clear() {
+      numElements = 0;
+    }
+
+    uint64_t getMemoryUsage() {
+      return ColumnVectorBatch::getMemoryUsage() +
+             static_cast<uint64_t>(data.capacity() * sizeof(int64_t));

Review Comment:
   ```suggestion
                static_cast<uint64_t>(data.capacity() * sizeof(ValueType));
   ```



##########
c++/src/ColumnReader.cc:
##########
@@ -474,31 +465,32 @@ namespace orc {
     return numValues;
   }
 
-  template <TypeKind columnKind, bool isLittleEndian>
-  void DoubleColumnReader<columnKind, isLittleEndian>::next(ColumnVectorBatch& rowBatch,
-                                                            uint64_t numValues, char* notNull) {
+  template <TypeKind columnKind, bool isLittleEndian, typename ValueType, typename BatchType>
+  void DoubleColumnReader<columnKind, isLittleEndian, ValueType, BatchType>::next(
+      ColumnVectorBatch& rowBatch, uint64_t numValues, char* notNull) {
     ColumnReader::next(rowBatch, numValues, notNull);
     // update the notNull from the parent class
     notNull = rowBatch.hasNulls ? rowBatch.notNull.data() : nullptr;
-    double* outArray = dynamic_cast<DoubleVectorBatch&>(rowBatch).data.data();
+    ValueType* outArray =
+        reinterpret_cast<ValueType*>(dynamic_cast<BatchType&>(rowBatch).data.data());
 
     if (columnKind == FLOAT) {

Review Comment:
   ```suggestion
       if constexpr (columnKind == FLOAT) {
   ```



##########
tools/src/FileMetadata.cc:
##########
@@ -26,7 +26,7 @@
 #include "orc/Exceptions.hh"
 #include "orc/OrcFile.hh"
 
-//#include "Adaptor.hh"
+// #include "Adaptor.hh"

Review Comment:
   Remove this line



##########
c++/src/ColumnReader.cc:
##########
@@ -129,6 +129,12 @@ namespace orc {
     }
   }
 
+  void expandBytesToLongs(int8_t* buffer, uint64_t numValues) {

Review Comment:
   if sizeof(char) == sizeof(int8_t), this function can return early. BTW, the name is strange since the output type is no longer LONG.



##########
c++/src/Timezone.cc:
##########
@@ -24,6 +24,7 @@
 #include <stdlib.h>
 #include <string.h>
 #include <time.h>
+#include <iostream>

Review Comment:
   Remove it



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] luffy-zh closed pull request #1307: ORC-1298:[C++] Support dedicated ColumnVectorBatch of numeric types

Posted by GitBox <gi...@apache.org>.
luffy-zh closed pull request #1307: ORC-1298:[C++] Support dedicated ColumnVectorBatch of numeric types
URL: https://github.com/apache/orc/pull/1307


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] wgtmac commented on a diff in pull request #1307: ORC-1298:[C++] Support dedicated ColumnVectorBatch of numeric types

Posted by GitBox <gi...@apache.org>.
wgtmac commented on code in PR #1307:
URL: https://github.com/apache/orc/pull/1307#discussion_r1015546519


##########
c++/src/ColumnWriter.cc:
##########
@@ -456,126 +456,107 @@ namespace orc {
     }
   }
 
+  template <typename BatchType, typename IntType>
   class IntegerColumnWriter : public ColumnWriter {
   public:
     IntegerColumnWriter(
                         const Type& type,
                         const StreamsFactory& factory,
-                        const WriterOptions& options);
+                        const WriterOptions& options) :
+                          ColumnWriter(type, factory, options),
+                          rleVersion(options.getRleVersion()) {
+      std::unique_ptr<BufferedOutputStream> dataStream =
+        factory.createStream(proto::Stream_Kind_DATA);
+      rleEncoder = createRleEncoder(
+                                    std::move(dataStream),
+                                    true,
+                                    rleVersion,
+                                    memPool,
+                                    options.getAlignedBitpacking());
+
+      if (enableIndex) {
+        recordPosition();
+      }
+    }
 
     virtual void add(ColumnVectorBatch& rowBatch,
                      uint64_t offset,
                      uint64_t numValues,
-                     const char* incomingMask) override;
-
-    virtual void flush(std::vector<proto::Stream>& streams) override;
-
-    virtual uint64_t getEstimatedSize() const override;
+                     const char* incomingMask) override {
+      const BatchType* intBatch =
+        dynamic_cast<const BatchType*>(&rowBatch);
+      if (intBatch == nullptr) {
+        throw InvalidArgument("Failed to cast to LongVectorBatch");

Review Comment:
   It may not be LongVectorBatch now. This error message should be changed as well.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] coderex2522 commented on a diff in pull request #1307: ORC-1298:[C++] Support dedicated ColumnVectorBatch of numeric types

Posted by GitBox <gi...@apache.org>.
coderex2522 commented on code in PR #1307:
URL: https://github.com/apache/orc/pull/1307#discussion_r1016113031


##########
c++/test/TestWriter.cc:
##########
@@ -2044,5 +2048,91 @@ namespace orc {
     testSuppressPresentStream(CompressionKind_SNAPPY);
   }
 
+  TEST_P(WriterTest, testWriteFixedWidthNumericVectorBatch) {
+    MemoryOutputStream memStream(DEFAULT_MEM_STREAM_SIZE);
+    MemoryPool * pool = getDefaultPool();
+    std::unique_ptr<Type> type(Type::buildTypeFromString(
+      "struct<col1:double,col2:float,col3:int,col4:smallint,col5:tinyint>"));

Review Comment:
   It's better to test the bigint type as well.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] wgtmac merged pull request #1307: ORC-1298:[C++] Support dedicated ColumnVectorBatch of numeric types

Posted by GitBox <gi...@apache.org>.
wgtmac merged PR #1307:
URL: https://github.com/apache/orc/pull/1307


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] dongjoon-hyun commented on a diff in pull request #1307: ORC-1298:[C++] Support dedicated ColumnVectorBatch of numeric types

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on code in PR #1307:
URL: https://github.com/apache/orc/pull/1307#discussion_r1238290036


##########
c++/src/ColumnReader.cc:
##########
@@ -1696,13 +1690,25 @@ namespace orc {
   /**
    * Create a reader for the given stripe.
    */
-  std::unique_ptr<ColumnReader> buildReader(const Type& type, StripeStreams& stripe) {
+  std::unique_ptr<ColumnReader> buildReader(const Type& type, StripeStreams& stripe,
+                                            bool useTightNumericVector) {
     switch (static_cast<int64_t>(type.getKind())) {
-      case DATE:
-      case INT:
+      case SHORT: {
+        if (useTightNumericVector) {
+          return std::unique_ptr<ColumnReader>(
+              new IntegerColumnReader<ShortVectorBatch>(type, stripe));
+        }
+      }
+      case INT: {

Review Comment:
   This seems to cause `implicit-fallthrough` warning and compilation failure.



##########
c++/src/ColumnReader.cc:
##########
@@ -1696,13 +1690,25 @@ namespace orc {
   /**
    * Create a reader for the given stripe.
    */
-  std::unique_ptr<ColumnReader> buildReader(const Type& type, StripeStreams& stripe) {
+  std::unique_ptr<ColumnReader> buildReader(const Type& type, StripeStreams& stripe,
+                                            bool useTightNumericVector) {
     switch (static_cast<int64_t>(type.getKind())) {
-      case DATE:
-      case INT:
+      case SHORT: {
+        if (useTightNumericVector) {
+          return std::unique_ptr<ColumnReader>(
+              new IntegerColumnReader<ShortVectorBatch>(type, stripe));
+        }
+      }
+      case INT: {

Review Comment:
   This seems to cause `implicit-fallthrough` warning and compilation failure in Clang environment.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] wgtmac commented on a diff in pull request #1307: ORC-1298:[C++] Support dedicated ColumnVectorBatch of numeric types

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #1307:
URL: https://github.com/apache/orc/pull/1307#discussion_r1238746703


##########
c++/src/ColumnReader.cc:
##########
@@ -1696,13 +1690,25 @@ namespace orc {
   /**
    * Create a reader for the given stripe.
    */
-  std::unique_ptr<ColumnReader> buildReader(const Type& type, StripeStreams& stripe) {
+  std::unique_ptr<ColumnReader> buildReader(const Type& type, StripeStreams& stripe,
+                                            bool useTightNumericVector) {
     switch (static_cast<int64_t>(type.getKind())) {
-      case DATE:
-      case INT:
+      case SHORT: {
+        if (useTightNumericVector) {
+          return std::unique_ptr<ColumnReader>(
+              new IntegerColumnReader<ShortVectorBatch>(type, stripe));
+        }
+      }
+      case INT: {

Review Comment:
   Should we fix it?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] dongjoon-hyun commented on a diff in pull request #1307: ORC-1298:[C++] Support dedicated ColumnVectorBatch of numeric types

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on code in PR #1307:
URL: https://github.com/apache/orc/pull/1307#discussion_r1239431010


##########
c++/src/ColumnReader.cc:
##########
@@ -1696,13 +1690,25 @@ namespace orc {
   /**
    * Create a reader for the given stripe.
    */
-  std::unique_ptr<ColumnReader> buildReader(const Type& type, StripeStreams& stripe) {
+  std::unique_ptr<ColumnReader> buildReader(const Type& type, StripeStreams& stripe,
+                                            bool useTightNumericVector) {
     switch (static_cast<int64_t>(type.getKind())) {
-      case DATE:
-      case INT:
+      case SHORT: {
+        if (useTightNumericVector) {
+          return std::unique_ptr<ColumnReader>(
+              new IntegerColumnReader<ShortVectorBatch>(type, stripe));
+        }
+      }
+      case INT: {

Review Comment:
   I filed a JIRA and PR for `branch-1.9` first. Will make another PR for main branch..
   - https://issues.apache.org/jira/browse/ORC-1453
   - https://github.com/apache/orc/pull/1548 (for branch-1.9)
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] wgtmac commented on a diff in pull request #1307: ORC-1298:[C++] Support dedicated ColumnVectorBatch of numeric types

Posted by GitBox <gi...@apache.org>.
wgtmac commented on code in PR #1307:
URL: https://github.com/apache/orc/pull/1307#discussion_r1015150300


##########
c++/src/ColumnReader.cc:
##########
@@ -531,33 +524,34 @@ namespace orc {
     return numValues;
   }
 
-  template<TypeKind columnKind, bool isLittleEndian>
-  void DoubleColumnReader<columnKind, isLittleEndian>::next(
-      ColumnVectorBatch& rowBatch,
-      uint64_t numValues,
-      char *notNull) {
+  template<TypeKind columnKind, bool isLittleEndian, typename ValueType, typename BatchType>
+  void DoubleColumnReader<columnKind, isLittleEndian, ValueType, BatchType>::next(
+    ColumnVectorBatch& rowBatch,
+    uint64_t numValues,
+    char *notNull) {
     ColumnReader::next(rowBatch, numValues, notNull);
     // update the notNull from the parent class
     notNull = rowBatch.hasNulls ? rowBatch.notNull.data() : nullptr;
-    double* outArray = dynamic_cast<DoubleVectorBatch&>(rowBatch).data.data();
+    ValueType* outArray =
+      reinterpret_cast<ValueType*>(dynamic_cast<BatchType&>(rowBatch).data.data());
 
     if (columnKind == FLOAT) {

Review Comment:
   if constexpr (columnKind == FLOAT)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] wgtmac commented on a diff in pull request #1307: ORC-1298:[C++] Support dedicated ColumnVectorBatch of numeric types

Posted by GitBox <gi...@apache.org>.
wgtmac commented on code in PR #1307:
URL: https://github.com/apache/orc/pull/1307#discussion_r1018628411


##########
c++/src/ColumnWriter.cc:
##########
@@ -415,109 +415,97 @@ namespace orc {
     }
   }
 
+  template <typename BatchType>
   class IntegerColumnWriter : public ColumnWriter {
    public:
     IntegerColumnWriter(const Type& type, const StreamsFactory& factory,
-                        const WriterOptions& options);
+                        const WriterOptions& options)
+        : ColumnWriter(type, factory, options), rleVersion(options.getRleVersion()) {
+      std::unique_ptr<BufferedOutputStream> dataStream =
+          factory.createStream(proto::Stream_Kind_DATA);
+      rleEncoder = createRleEncoder(std::move(dataStream), true, rleVersion, memPool,
+                                    options.getAlignedBitpacking());
 
-    virtual void add(ColumnVectorBatch& rowBatch, uint64_t offset, uint64_t numValues,
-                     const char* incomingMask) override;
+      if (enableIndex) {
+        recordPosition();
+      }
+    }
 
-    virtual void flush(std::vector<proto::Stream>& streams) override;
+    virtual void add(ColumnVectorBatch& rowBatch, uint64_t offset, uint64_t numValues,
+                     const char* incomingMask) override {
+      const BatchType* intBatch = dynamic_cast<const BatchType*>(&rowBatch);
+      if (intBatch == nullptr) {
+        throw InvalidArgument("Failed to cast to LongVectorBatch");

Review Comment:
   ```suggestion
           throw InvalidArgument("Failed to cast to integer vector batch");
   ```



##########
c++/src/MemoryPool.cc:
##########
@@ -180,6 +198,60 @@ namespace orc {
     currentSize = newSize;
   }
 
+  // Specializations for int32_t
+
+  template <>
+  DataBuffer<int32_t>::~DataBuffer() {
+    if (buf) {
+      memoryPool.free(reinterpret_cast<char*>(buf));
+    }
+  }
+
+  template <>
+  void DataBuffer<int32_t>::resize(uint64_t newSize) {
+    reserve(newSize);
+    if (newSize > currentSize) {
+      memset(buf + currentSize, 0, (newSize - currentSize) * sizeof(int32_t));
+    }
+    currentSize = newSize;
+  }
+
+  // Specializations for int16_t
+
+  template <>
+  DataBuffer<int16_t>::~DataBuffer() {
+    if (buf) {
+      memoryPool.free(reinterpret_cast<char*>(buf));
+    }
+  }
+
+  template <>
+  void DataBuffer<int16_t>::resize(uint64_t newSize) {

Review Comment:
   resize of different vectors should also be in the unit test



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] dongjoon-hyun commented on a diff in pull request #1307: ORC-1298:[C++] Support dedicated ColumnVectorBatch of numeric types

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on code in PR #1307:
URL: https://github.com/apache/orc/pull/1307#discussion_r1014774374


##########
c++/include/orc/Type.hh:
##########
@@ -70,10 +70,11 @@ namespace orc {
     /**
      * Create a row batch for this type.
      */
-    virtual ORC_UNIQUE_PTR<ColumnVectorBatch> createRowBatch(uint64_t size,
-                                                             MemoryPool& pool,
-                                                             bool encoded = false
-                                                             ) const = 0;
+    virtual ORC_UNIQUE_PTR<ColumnVectorBatch> createRowBatch(
+      uint64_t size,
+      MemoryPool& pool,
+      bool encoded = false,
+      bool enableFixedWidthNumericVectorBatch = false) const = 0;

Review Comment:
   In addition, can we keep the original three-parameter method and add new four-parameter method with new parameter additionally?
   ```
   bool enableFixedWidthNumericVectorBatch = false) const = 0;
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] wgtmac commented on pull request #1307: ORC-1298:[C++] Support dedicated ColumnVectorBatch of numeric types

Posted by GitBox <gi...@apache.org>.
wgtmac commented on PR #1307:
URL: https://github.com/apache/orc/pull/1307#issuecomment-1303785824

   Thanks @luffy-zh for the contribution!
   
   cc @stiga-huang @coderex2522 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] wgtmac commented on a diff in pull request #1307: ORC-1298:[C++] Support dedicated ColumnVectorBatch of numeric types

Posted by GitBox <gi...@apache.org>.
wgtmac commented on code in PR #1307:
URL: https://github.com/apache/orc/pull/1307#discussion_r1015142239


##########
c++/include/orc/Reader.hh:
##########
@@ -325,6 +325,18 @@ namespace orc {
      * Get the IdReadIntentMap map that was supplied by client.
      */
     const IdReadIntentMap getIdReadIntentMap() const;
+
+    /**
+     * Set the enableFixedWidthNumericVectorBatch.

Review Comment:
   Please refine the comment to help understanding because this is a public API.



##########
c++/include/orc/Reader.hh:
##########
@@ -325,6 +325,18 @@ namespace orc {
      * Get the IdReadIntentMap map that was supplied by client.
      */
     const IdReadIntentMap getIdReadIntentMap() const;
+
+    /**
+     * Set the enableFixedWidthNumericVectorBatch.
+     */
+    RowReaderOptions& setEnableFixedWidthNumericVectorBatch(

Review Comment:
   This name is too long. Maybe we can change it to setUseTightNumericVector?



##########
c++/include/orc/Type.hh:
##########
@@ -75,6 +75,12 @@ namespace orc {
                                                              bool encoded = false
                                                              ) const = 0;
 
+    virtual ORC_UNIQUE_PTR<ColumnVectorBatch> createRowBatch(uint64_t size,
+                                                             MemoryPool& pool,
+                                                             bool encoded = false,
+                                                             bool enableFixedWidthNumericVectorBatch = false

Review Comment:
   Please shorten the new parameter



##########
c++/include/orc/Vector.hh:
##########
@@ -88,28 +89,88 @@ namespace orc {
     ColumnVectorBatch& operator=(const ColumnVectorBatch&);
   };
 
-  struct LongVectorBatch: public ColumnVectorBatch {
-    LongVectorBatch(uint64_t capacity, MemoryPool& pool);
-    virtual ~LongVectorBatch();
+  template <typename ValueType>
+  struct IntegerVectorBatch: public ColumnVectorBatch {
+    IntegerVectorBatch(uint64_t cap, MemoryPool& pool)
+                       : ColumnVectorBatch(cap, pool),
+                       data(pool, cap) {
+      // PASS
+    }
 
-    DataBuffer<int64_t> data;
-    std::string toString() const;
-    void resize(uint64_t capacity);
-    void clear();
-    uint64_t getMemoryUsage();
+    virtual ~IntegerVectorBatch() {
+      // PASS
+    }
+
+    std::string toString() const {
+      std::ostringstream buffer;
+      buffer << "Long vector <" << numElements << " of " << capacity << ">";
+      return buffer.str();
+    }
+
+    void resize(uint64_t cap) {
+      if (capacity < cap) {
+        ColumnVectorBatch::resize(cap);
+        data.resize(cap);
+      }
+    }
+
+    void clear() {
+      numElements = 0;
+    }
+
+    uint64_t getMemoryUsage() {
+      return ColumnVectorBatch::getMemoryUsage() +
+        static_cast<uint64_t>(data.capacity() * sizeof(int64_t));
+    }
+
+    DataBuffer<ValueType> data;
   };
 
-  struct DoubleVectorBatch: public ColumnVectorBatch {
-    DoubleVectorBatch(uint64_t capacity, MemoryPool& pool);
-    virtual ~DoubleVectorBatch();
-    std::string toString() const;
-    void resize(uint64_t capacity);
-    void clear();
-    uint64_t getMemoryUsage();
+  using LongVectorBatch = IntegerVectorBatch<int64_t>;
+  using IntVectorBatch = IntegerVectorBatch<int32_t>;
+  using ShortVectorBatch = IntegerVectorBatch<int16_t>;
+  using ByteVectorBatch = IntegerVectorBatch<int8_t>;
+
+  template <typename FloatType>
+  struct BaseDoubleVectorBatch: public ColumnVectorBatch {
+    BaseDoubleVectorBatch(uint64_t cap, MemoryPool& pool)
+      : ColumnVectorBatch(cap, pool),
+        data(pool, cap) {
+      // PASS
+    }
+
+    virtual ~BaseDoubleVectorBatch() {

Review Comment:
   ~BaseDoubleVectorBatch() = default;



##########
c++/include/orc/Vector.hh:
##########
@@ -88,28 +89,88 @@ namespace orc {
     ColumnVectorBatch& operator=(const ColumnVectorBatch&);
   };
 
-  struct LongVectorBatch: public ColumnVectorBatch {
-    LongVectorBatch(uint64_t capacity, MemoryPool& pool);
-    virtual ~LongVectorBatch();
+  template <typename ValueType>
+  struct IntegerVectorBatch: public ColumnVectorBatch {
+    IntegerVectorBatch(uint64_t cap, MemoryPool& pool)
+                       : ColumnVectorBatch(cap, pool),
+                       data(pool, cap) {
+      // PASS
+    }
 
-    DataBuffer<int64_t> data;
-    std::string toString() const;
-    void resize(uint64_t capacity);
-    void clear();
-    uint64_t getMemoryUsage();
+    virtual ~IntegerVectorBatch() {

Review Comment:
   ~IntegerVectorBatch() = default;



##########
c++/include/orc/Vector.hh:
##########
@@ -30,6 +30,7 @@
 #include <stdexcept>
 #include <cstdlib>
 #include <iostream>

Review Comment:
   Can you also remove iostream ?



##########
c++/src/ColumnWriter.cc:
##########
@@ -456,126 +456,107 @@ namespace orc {
     }
   }
 
+  template <typename BatchType, typename IntType>

Review Comment:
   IntType can be deduced from BatchType, isn't it?



##########
c++/src/ColumnWriter.cc:
##########
@@ -801,6 +782,7 @@ namespace orc {
     rleEncoder->recordPosition(rowIndexPosition.get());
   }
 
+  template<typename ValueType, typename BatchType>
   class DoubleColumnWriter : public ColumnWriter {

Review Comment:
   DoubleColumnWriter -> FloatingColumnWriter



##########
c++/src/ColumnWriter.cc:
##########
@@ -828,16 +810,17 @@ namespace orc {
     DataBuffer<char> buffer;
   };
 
-  DoubleColumnWriter::DoubleColumnWriter(
-                          const Type& type,
-                          const StreamsFactory& factory,
-                          const WriterOptions& options,
-                          bool isFloatType) :
-                              ColumnWriter(type, factory, options),
-                              isFloat(isFloatType),
-                              buffer(*options.getMemoryPool()) {
+  template<typename ValueType, typename BatchType>
+  DoubleColumnWriter<ValueType, BatchType>::DoubleColumnWriter(
+    const Type& type,
+    const StreamsFactory& factory,
+    const WriterOptions& options,
+    bool isFloatType) :

Review Comment:
   We should throw if isFloatType == false and ValueType is float



##########
c++/src/ColumnWriter.cc:
##########
@@ -3083,18 +3093,26 @@ namespace orc {
                                   options));
       case DOUBLE:
         return std::unique_ptr<ColumnWriter>(
-          new DoubleColumnWriter(
-                                 type,
-                                 factory,
-                                 options,
-                                 false));
+          new DoubleColumnWriter<double, DoubleVectorBatch>(

Review Comment:
   ditto



##########
c++/src/RLE.hh:
##########
@@ -59,8 +59,16 @@ namespace orc {
      *    pointer is not null, positions that are false are skipped.
      */
     virtual void add(const int64_t* data, uint64_t numValues,
-                      const char* notNull);
+                     const char* notNull);
 
+    virtual void add(const int32_t* data, uint64_t numValues,
+                     const char* notNull);
+
+    virtual void add(const int16_t* data, uint64_t numValues,
+                     const char* notNull);
+
+    virtual void add(const int8_t* data, uint64_t numValues,

Review Comment:
   RLE only supports int64_t, int32_t, and int16_t. Please remove the int8_t overload.



##########
c++/src/RLE.cc:
##########
@@ -71,14 +71,41 @@ namespace orc {
   }
 
   void RleEncoder::add(const int64_t* data, uint64_t numValues,
-                         const char* notNull) {
+                       const char* notNull) {
     for (uint64_t i = 0; i < numValues; ++i) {
       if (!notNull || notNull[i]) {
         write(data[i]);
       }
     }
   }
 
+  void RleEncoder::add(const int32_t* data, uint64_t numValues,

Review Comment:
   Why not use a template function to repeat the code? static_cast does not have any runtime penalty.



##########
c++/src/ColumnWriter.cc:
##########
@@ -456,126 +456,107 @@ namespace orc {
     }
   }
 
+  template <typename BatchType, typename IntType>
   class IntegerColumnWriter : public ColumnWriter {
   public:
     IntegerColumnWriter(
                         const Type& type,
                         const StreamsFactory& factory,
-                        const WriterOptions& options);
+                        const WriterOptions& options) :
+                          ColumnWriter(type, factory, options),
+                          rleVersion(options.getRleVersion()) {
+      std::unique_ptr<BufferedOutputStream> dataStream =

Review Comment:
   nit: you can still separate template class declaration with definition like before to keep the class declaration compact. See here for reference: https://stackoverflow.com/questions/13351474/c-separate-declaration-definition-for-template-function-in-template-class



##########
c++/src/ColumnWriter.cc:
##########
@@ -456,126 +456,107 @@ namespace orc {
     }
   }
 
+  template <typename BatchType, typename IntType>
   class IntegerColumnWriter : public ColumnWriter {
   public:
     IntegerColumnWriter(
                         const Type& type,
                         const StreamsFactory& factory,
-                        const WriterOptions& options);
+                        const WriterOptions& options) :
+                          ColumnWriter(type, factory, options),
+                          rleVersion(options.getRleVersion()) {
+      std::unique_ptr<BufferedOutputStream> dataStream =
+        factory.createStream(proto::Stream_Kind_DATA);
+      rleEncoder = createRleEncoder(
+                                    std::move(dataStream),
+                                    true,
+                                    rleVersion,
+                                    memPool,
+                                    options.getAlignedBitpacking());
+
+      if (enableIndex) {
+        recordPosition();
+      }
+    }
 
     virtual void add(ColumnVectorBatch& rowBatch,
                      uint64_t offset,
                      uint64_t numValues,
-                     const char* incomingMask) override;
-
-    virtual void flush(std::vector<proto::Stream>& streams) override;
-
-    virtual uint64_t getEstimatedSize() const override;
+                     const char* incomingMask) override {
+      const BatchType* intBatch =
+        dynamic_cast<const BatchType*>(&rowBatch);
+      if (intBatch == nullptr) {
+        throw InvalidArgument("Failed to cast to LongVectorBatch");

Review Comment:
   It may not be LongVectorBatch now. This error message should be changed as well.



##########
c++/src/ColumnReader.hh:
##########
@@ -167,7 +167,8 @@ namespace orc {
    * Create a reader for the given stripe.
    */
   std::unique_ptr<ColumnReader> buildReader(const Type& type,
-                                            StripeStreams& stripe);
+                                            StripeStreams& stripe,
+                                            bool enableFixedWidthNumericVectorBatch = false);

Review Comment:
   Shorten it



##########
c++/src/ColumnReader.cc:
##########
@@ -531,33 +524,34 @@ namespace orc {
     return numValues;
   }
 
-  template<TypeKind columnKind, bool isLittleEndian>
-  void DoubleColumnReader<columnKind, isLittleEndian>::next(
-      ColumnVectorBatch& rowBatch,
-      uint64_t numValues,
-      char *notNull) {
+  template<TypeKind columnKind, bool isLittleEndian, typename ValueType, typename BatchType>
+  void DoubleColumnReader<columnKind, isLittleEndian, ValueType, BatchType>::next(
+    ColumnVectorBatch& rowBatch,
+    uint64_t numValues,
+    char *notNull) {
     ColumnReader::next(rowBatch, numValues, notNull);
     // update the notNull from the parent class
     notNull = rowBatch.hasNulls ? rowBatch.notNull.data() : nullptr;
-    double* outArray = dynamic_cast<DoubleVectorBatch&>(rowBatch).data.data();
+    ValueType* outArray =
+      reinterpret_cast<ValueType*>(dynamic_cast<BatchType&>(rowBatch).data.data());
 
     if (columnKind == FLOAT) {

Review Comment:
   if constexpr (columnKind == FLOAT)



##########
c++/include/orc/Vector.hh:
##########
@@ -88,28 +89,88 @@ namespace orc {
     ColumnVectorBatch& operator=(const ColumnVectorBatch&);
   };
 
-  struct LongVectorBatch: public ColumnVectorBatch {
-    LongVectorBatch(uint64_t capacity, MemoryPool& pool);
-    virtual ~LongVectorBatch();
+  template <typename ValueType>
+  struct IntegerVectorBatch: public ColumnVectorBatch {
+    IntegerVectorBatch(uint64_t cap, MemoryPool& pool)
+                       : ColumnVectorBatch(cap, pool),
+                       data(pool, cap) {
+      // PASS
+    }
 
-    DataBuffer<int64_t> data;
-    std::string toString() const;
-    void resize(uint64_t capacity);
-    void clear();
-    uint64_t getMemoryUsage();
+    virtual ~IntegerVectorBatch() {
+      // PASS
+    }
+
+    std::string toString() const {
+      std::ostringstream buffer;
+      buffer << "Long vector <" << numElements << " of " << capacity << ">";
+      return buffer.str();
+    }
+
+    void resize(uint64_t cap) {
+      if (capacity < cap) {
+        ColumnVectorBatch::resize(cap);
+        data.resize(cap);
+      }
+    }
+
+    void clear() {
+      numElements = 0;
+    }
+
+    uint64_t getMemoryUsage() {
+      return ColumnVectorBatch::getMemoryUsage() +
+        static_cast<uint64_t>(data.capacity() * sizeof(int64_t));
+    }
+
+    DataBuffer<ValueType> data;
   };
 
-  struct DoubleVectorBatch: public ColumnVectorBatch {
-    DoubleVectorBatch(uint64_t capacity, MemoryPool& pool);
-    virtual ~DoubleVectorBatch();
-    std::string toString() const;
-    void resize(uint64_t capacity);
-    void clear();
-    uint64_t getMemoryUsage();
+  using LongVectorBatch = IntegerVectorBatch<int64_t>;
+  using IntVectorBatch = IntegerVectorBatch<int32_t>;
+  using ShortVectorBatch = IntegerVectorBatch<int16_t>;
+  using ByteVectorBatch = IntegerVectorBatch<int8_t>;
+
+  template <typename FloatType>
+  struct BaseDoubleVectorBatch: public ColumnVectorBatch {

Review Comment:
   Rename it to FloatingVectorBatch



##########
c++/src/ColumnWriter.cc:
##########
@@ -3061,15 +3050,36 @@ namespace orc {
                                  type,
                                  factory,
                                  options));
+      case SHORT:
+        if (options.getEnableFixedWidthNumericVectorBatch()) {
+          return std::unique_ptr<ColumnWriter>(
+            new IntegerColumnWriter<ShortVectorBatch, int16_t>(
+                                                               type,
+                                                               factory,
+                                                               options));
+        }
       case INT:
+        if (options.getEnableFixedWidthNumericVectorBatch()) {
+          return std::unique_ptr<ColumnWriter>(
+            new IntegerColumnWriter<IntVectorBatch, int32_t>(
+                                                             type,
+                                                             factory,
+                                                             options));
+        }
       case LONG:
-      case SHORT:
         return std::unique_ptr<ColumnWriter>(
-          new IntegerColumnWriter(
-                                  type,
-                                  factory,
-                                  options));
+          new IntegerColumnWriter<LongVectorBatch, int64_t>(

Review Comment:
   It seems redundant to define the 2nd template parameter



##########
c++/include/orc/Vector.hh:
##########
@@ -88,28 +89,88 @@ namespace orc {
     ColumnVectorBatch& operator=(const ColumnVectorBatch&);
   };
 
-  struct LongVectorBatch: public ColumnVectorBatch {
-    LongVectorBatch(uint64_t capacity, MemoryPool& pool);
-    virtual ~LongVectorBatch();
+  template <typename ValueType>
+  struct IntegerVectorBatch: public ColumnVectorBatch {
+    IntegerVectorBatch(uint64_t cap, MemoryPool& pool)
+                       : ColumnVectorBatch(cap, pool),
+                       data(pool, cap) {
+      // PASS
+    }
 
-    DataBuffer<int64_t> data;
-    std::string toString() const;
-    void resize(uint64_t capacity);
-    void clear();
-    uint64_t getMemoryUsage();
+    virtual ~IntegerVectorBatch() {
+      // PASS
+    }
+
+    std::string toString() const {
+      std::ostringstream buffer;
+      buffer << "Long vector <" << numElements << " of " << capacity << ">";
+      return buffer.str();
+    }
+
+    void resize(uint64_t cap) {
+      if (capacity < cap) {
+        ColumnVectorBatch::resize(cap);
+        data.resize(cap);
+      }
+    }
+
+    void clear() {
+      numElements = 0;
+    }
+
+    uint64_t getMemoryUsage() {
+      return ColumnVectorBatch::getMemoryUsage() +
+        static_cast<uint64_t>(data.capacity() * sizeof(int64_t));
+    }
+
+    DataBuffer<ValueType> data;
   };
 
-  struct DoubleVectorBatch: public ColumnVectorBatch {
-    DoubleVectorBatch(uint64_t capacity, MemoryPool& pool);
-    virtual ~DoubleVectorBatch();
-    std::string toString() const;
-    void resize(uint64_t capacity);
-    void clear();
-    uint64_t getMemoryUsage();
+  using LongVectorBatch = IntegerVectorBatch<int64_t>;
+  using IntVectorBatch = IntegerVectorBatch<int32_t>;
+  using ShortVectorBatch = IntegerVectorBatch<int16_t>;
+  using ByteVectorBatch = IntegerVectorBatch<int8_t>;
+
+  template <typename FloatType>
+  struct BaseDoubleVectorBatch: public ColumnVectorBatch {
+    BaseDoubleVectorBatch(uint64_t cap, MemoryPool& pool)
+      : ColumnVectorBatch(cap, pool),
+        data(pool, cap) {
+      // PASS
+    }
+
+    virtual ~BaseDoubleVectorBatch() {
+      // PASS
+    }
+
+    std::string toString() const {
+      std::ostringstream buffer;
+      buffer << "Double vector <" << numElements << " of " << capacity << ">";

Review Comment:
   It may not be "Double vector" any more.



##########
c++/src/RLE.hh:
##########
@@ -126,6 +134,15 @@ namespace orc {
     virtual void next(int64_t* data, uint64_t numValues,
                       const char* notNull) = 0;
 
+    virtual void next(int32_t* data, uint64_t numValues,
+                      const char* notNull) = 0;
+
+    virtual void next(int16_t* data, uint64_t numValues,
+                      const char* notNull) = 0;
+
+    virtual void next(int8_t* data, uint64_t numValues,

Review Comment:
   ditto



##########
c++/include/orc/Vector.hh:
##########
@@ -88,28 +89,88 @@ namespace orc {
     ColumnVectorBatch& operator=(const ColumnVectorBatch&);
   };
 
-  struct LongVectorBatch: public ColumnVectorBatch {
-    LongVectorBatch(uint64_t capacity, MemoryPool& pool);
-    virtual ~LongVectorBatch();
+  template <typename ValueType>
+  struct IntegerVectorBatch: public ColumnVectorBatch {
+    IntegerVectorBatch(uint64_t cap, MemoryPool& pool)
+                       : ColumnVectorBatch(cap, pool),
+                       data(pool, cap) {
+      // PASS
+    }
 
-    DataBuffer<int64_t> data;
-    std::string toString() const;
-    void resize(uint64_t capacity);
-    void clear();
-    uint64_t getMemoryUsage();
+    virtual ~IntegerVectorBatch() {
+      // PASS
+    }
+
+    std::string toString() const {
+      std::ostringstream buffer;
+      buffer << "Long vector <" << numElements << " of " << capacity << ">";

Review Comment:
   It may not be "Long vector" any more



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org