Posted to commits@doris.apache.org by GitBox <gi...@apache.org> on 2022/05/13 03:43:30 UTC

[GitHub] [incubator-doris] yinzhijian opened a new pull request, #9541: [feature-wip](parquet-orc) Support orc scanner in vectorized engine

yinzhijian opened a new pull request, #9541:
URL: https://github.com/apache/incubator-doris/pull/9541

   # Proposed changes
   
   Issue Number: close #xxx
   
   ## Problem Summary:
   
   Describe the overview of changes.
   
   ## Checklist(Required)
   
   1. Does it affect the original behavior: (No)
   2. Has unit tests been added: (No)
   3. Has document been added or modified: (No Need)
   4. Does it need to update dependencies: (No)
   5. Are there any changes that cannot be rolled back: (No)
   
   ## Further comments
   
   If this is a relatively large or complex change, kick off the discussion at [dev@doris.apache.org](mailto:dev@doris.apache.org) by explaining why you chose the solution you did and what alternatives you considered, etc...
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] yinzhijian commented on a diff in pull request #9541: [feature-wip](parquet-orc) Support orc scanner in vectorized engine

Posted by GitBox <gi...@apache.org>.
yinzhijian commented on code in PR #9541:
URL: https://github.com/apache/incubator-doris/pull/9541#discussion_r879138403


##########
be/src/vec/data_types/data_type_factory.cpp:
##########
@@ -310,10 +310,10 @@ DataTypePtr DataTypeFactory::create_data_type(const arrow::Type::type& type, boo
         nested = std::make_shared<vectorized::DataTypeString>();
         break;
     case ::arrow::Type::DECIMAL:
-        nested = std::make_shared<vectorized::DataTypeDecimal<vectorized::Decimal128>>(27, 9);
+        nested = std::make_shared<vectorized::DataTypeDecimal<vectorized::Decimal128>>();

Review Comment:
   27 and 9 are default values





[GitHub] [incubator-doris] Gabriel39 commented on pull request #9541: [feature-wip](parquet-orc) Support orc scanner in vectorized engine

Posted by GitBox <gi...@apache.org>.
Gabriel39 commented on PR #9541:
URL: https://github.com/apache/incubator-doris/pull/9541#issuecomment-1139252755

   Should we update the Docker image?




[GitHub] [incubator-doris] aopangzi commented on pull request #9541: [feature-wip](parquet-orc) Support orc scanner in vectorized engine

Posted by GitBox <gi...@apache.org>.
aopangzi commented on PR #9541:
URL: https://github.com/apache/incubator-doris/pull/9541#issuecomment-1145693726

   /be/src/exec/arrow/orc_reader.h:20:10: fatal error: arrow/adapters/orc/adapter.h: No such file or directory
      20 | #include <arrow/adapters/orc/adapter.h>
   




[GitHub] [incubator-doris] HappenLee commented on a diff in pull request #9541: [feature-wip](parquet-orc) Support orc scanner in vectorized engine

Posted by GitBox <gi...@apache.org>.
HappenLee commented on code in PR #9541:
URL: https://github.com/apache/incubator-doris/pull/9541#discussion_r880007777


##########
be/src/exec/arrow/arrow_reader.cpp:
##########
@@ -0,0 +1,156 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+#include "exec/arrow/arrow_reader.h"
+
+#include <arrow/array.h>
+#include <arrow/status.h>
+#include <time.h>
+
+#include "common/logging.h"
+#include "exec/file_reader.h"
+#include "gen_cpp/PaloBrokerService_types.h"
+#include "gen_cpp/TPaloBrokerService.h"
+#include "runtime/broker_mgr.h"
+#include "runtime/client_cache.h"
+#include "runtime/descriptors.h"
+#include "runtime/exec_env.h"
+#include "runtime/mem_pool.h"
+#include "runtime/tuple.h"
+#include "util/thrift_util.h"
+
+namespace doris {
+
+// Broker
+
+ArrowReaderWrap::ArrowReaderWrap(FileReader* file_reader, int64_t batch_size,
+                                 int32_t num_of_columns_from_file)
+        : _batch_size(batch_size), _num_of_columns_from_file(num_of_columns_from_file) {
+    _arrow_file = std::shared_ptr<ArrowFile>(new ArrowFile(file_reader));
+    _rb_reader = nullptr;
+    _total_groups = 0;
+    _current_group = 0;
+}
+
+ArrowReaderWrap::~ArrowReaderWrap() {
+    close();
+}
+
+void ArrowReaderWrap::close() {
+    arrow::Status st = _arrow_file->Close();
+    if (!st.ok()) {
+        LOG(WARNING) << "close file error: " << st.ToString();
+    }
+}
+
+Status ArrowReaderWrap::column_indices(const std::vector<SlotDescriptor*>& tuple_slot_descs) {
+    _include_column_ids.clear();
+    for (int i = 0; i < _num_of_columns_from_file; i++) {
+        auto slot_desc = tuple_slot_descs.at(i);
+        // Get the Column Reader for the boolean column
+        auto iter = _map_column.find(slot_desc->col_name());
+        if (iter != _map_column.end()) {
+            _include_column_ids.emplace_back(iter->second);
+        } else {
+            std::stringstream str_error;
+            str_error << "Invalid Column Name:" << slot_desc->col_name();
+            LOG(WARNING) << str_error.str();
+            return Status::InvalidArgument(str_error.str());
+        }
+    }
+    return Status::OK();
+}
+
+ArrowFile::ArrowFile(FileReader* file) : _file(file) {}
+
+ArrowFile::~ArrowFile() {
+    arrow::Status st = Close();
+    if (!st.ok()) {
+        LOG(WARNING) << "close file error: " << st.ToString();
+    }
+}
+
+arrow::Status ArrowFile::Close() {
+    if (_file != nullptr) {
+        _file->close();
+        delete _file;
+        _file = nullptr;
+    }
+    return arrow::Status::OK();
+}
+
+bool ArrowFile::closed() const {
+    if (_file != nullptr) {
+        return _file->closed();
+    } else {
+        return true;
+    }
+}
+
+arrow::Result<int64_t> ArrowFile::Read(int64_t nbytes, void* buffer) {
+    return ReadAt(_pos, nbytes, buffer);
+}
+
+arrow::Result<int64_t> ArrowFile::ReadAt(int64_t position, int64_t nbytes, void* out) {
+    int64_t reads = 0;
+    int64_t bytes_read = 0;
+    _pos = position;
+    while (nbytes > 0) {
+        Status result = _file->readat(_pos, nbytes, &reads, out);
+        if (!result.ok()) {
+            bytes_read = 0;

Review Comment:
   Useless code.



##########
be/src/exec/arrow/arrow_reader.cpp:
##########
@@ -0,0 +1,156 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+#include "exec/arrow/arrow_reader.h"
+
+#include <arrow/array.h>
+#include <arrow/status.h>
+#include <time.h>
+
+#include "common/logging.h"
+#include "exec/file_reader.h"
+#include "gen_cpp/PaloBrokerService_types.h"
+#include "gen_cpp/TPaloBrokerService.h"
+#include "runtime/broker_mgr.h"
+#include "runtime/client_cache.h"
+#include "runtime/descriptors.h"
+#include "runtime/exec_env.h"
+#include "runtime/mem_pool.h"
+#include "runtime/tuple.h"
+#include "util/thrift_util.h"
+
+namespace doris {
+
+// Broker
+
+ArrowReaderWrap::ArrowReaderWrap(FileReader* file_reader, int64_t batch_size,
+                                 int32_t num_of_columns_from_file)
+        : _batch_size(batch_size), _num_of_columns_from_file(num_of_columns_from_file) {
+    _arrow_file = std::shared_ptr<ArrowFile>(new ArrowFile(file_reader));
+    _rb_reader = nullptr;
+    _total_groups = 0;
+    _current_group = 0;
+}
+
+ArrowReaderWrap::~ArrowReaderWrap() {
+    close();
+}
+
+void ArrowReaderWrap::close() {
+    arrow::Status st = _arrow_file->Close();
+    if (!st.ok()) {
+        LOG(WARNING) << "close file error: " << st.ToString();
+    }
+}
+
+Status ArrowReaderWrap::column_indices(const std::vector<SlotDescriptor*>& tuple_slot_descs) {
+    _include_column_ids.clear();

Review Comment:
   `DCHECK(_num_of_columns_from_file >= tuple_slot_descs.size());` The `at` operation can throw an exception.
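
To illustrate the concern in this comment: `std::vector::at` throws `std::out_of_range` when the index is not smaller than the vector's size. The loop in the hunk reads `tuple_slot_descs.at(i)` for `i < _num_of_columns_from_file`, so it stays in range only while the column count is no larger than the vector's size; the sketch below checks that direction, which is a reading of the loop rather than something stated in the thread. `FakeSlot`, `can_index_first_n`, and `at_throws` are hypothetical names, not Doris code.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for SlotDescriptor; only the column name matters here.
struct FakeSlot {
    std::string name;
};

// Precondition a debug check could assert before the loop:
// the first num_cols entries of slots must exist.
bool can_index_first_n(const std::vector<FakeSlot*>& slots, int num_cols) {
    return num_cols >= 0 && static_cast<std::size_t>(num_cols) <= slots.size();
}

// Walks the first num_cols entries with at() and reports whether
// std::out_of_range was thrown along the way.
bool at_throws(const std::vector<FakeSlot*>& slots, int num_cols) {
    try {
        for (int i = 0; i < num_cols; i++) {
            (void)slots.at(i);
        }
    } catch (const std::out_of_range&) {
        return true;
    }
    return false;
}
```

With two slots, `can_index_first_n(slots, 3)` is false and `at_throws(slots, 3)` is true, so a check placed before the loop would flag the mismatch in debug builds before the exception is raised.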



##########
be/src/exec/arrow/parquet_reader.h:
##########
@@ -55,44 +56,27 @@ class SlotDescriptor;
 class MemPool;
 class FileReader;
 
-class ParquetFile : public arrow::io::RandomAccessFile {
-public:
-    ParquetFile(FileReader* file);
-    ~ParquetFile() override;
-    arrow::Result<int64_t> Read(int64_t nbytes, void* buffer) override;
-    arrow::Result<int64_t> ReadAt(int64_t position, int64_t nbytes, void* out) override;
-    arrow::Result<int64_t> GetSize() override;
-    arrow::Status Seek(int64_t position) override;
-    arrow::Result<std::shared_ptr<arrow::Buffer>> Read(int64_t nbytes) override;
-    arrow::Result<int64_t> Tell() const override;
-    arrow::Status Close() override;
-    bool closed() const override;
-
-private:
-    FileReader* _file;
-    int64_t _pos = 0;
-};
-
 // Reader of broker parquet file
-class ParquetReaderWrap {
+class ParquetReaderWrap : public ArrowReaderWrap {
 public:
-    ParquetReaderWrap(FileReader* file_reader, int32_t num_of_columns_from_file);
-    virtual ~ParquetReaderWrap();
+    // batch_size is not use here
+    ParquetReaderWrap(FileReader* file_reader, int64_t batch_size,
+                      int32_t num_of_columns_from_file);
+    virtual ~ParquetReaderWrap() {}
 
     // Read
     Status read(Tuple* tuple, const std::vector<SlotDescriptor*>& tuple_slot_descs,
-                MemPool* mem_pool, bool* eof);
-    void close();
-    Status size(int64_t* size);
-    Status init_parquet_reader(const std::vector<SlotDescriptor*>& tuple_slot_descs,
-                               const std::string& timezone);
+                MemPool* mem_pool, bool* eof) override;
+    Status size(int64_t* size) override;
+    Status init_reader(const std::vector<SlotDescriptor*>& tuple_slot_descs,
+                       const std::string& timezone) override;
     Status next_batch(std::shared_ptr<arrow::RecordBatch>* batch,
-                      const std::vector<SlotDescriptor*>& tuple_slot_descs, bool* eof);
+                      const std::vector<SlotDescriptor*>& tuple_slot_descs, bool* eof) override;
+    void close() override;
 
 private:
     void fill_slot(Tuple* tuple, SlotDescriptor* slot_desc, MemPool* mem_pool, const uint8_t* value,
                    int32_t len);
-    Status column_indices(const std::vector<SlotDescriptor*>& tuple_slot_descs);
     Status set_field_null(Tuple* tuple, const SlotDescriptor* slot_desc);
     Status read_record_batch(const std::vector<SlotDescriptor*>& tuple_slot_descs, bool* eof);

Review Comment:
   The `tuple_slot_descs` parameter is useless here; delete it.






[GitHub] [incubator-doris] cambyzju commented on a diff in pull request #9541: [feature-wip](parquet-orc) Support orc scanner in vectorized engine

Posted by GitBox <gi...@apache.org>.
cambyzju commented on code in PR #9541:
URL: https://github.com/apache/incubator-doris/pull/9541#discussion_r871983000


##########
be/src/vec/data_types/data_type_factory.cpp:
##########
@@ -260,4 +260,67 @@ DataTypePtr DataTypeFactory::create_data_type(const PColumnMeta& pcolumn) {
     return nested;
 }
 
+DataTypePtr DataTypeFactory::create_data_type(const arrow::Type::type& type, bool is_nullable) {

Review Comment:
   Please pass `arrow::DataType` as the input argument to avoid a refactor later, because creating complex types needs the children info.
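
A minimal sketch of why a bare type id is not enough for nested types. It does not use the Arrow classes: `TypeId`, `TypeDesc`, `name_from_id`, and `name_from_desc` are hypothetical stand-ins, and the output strings are illustrative only. The point is that a function receiving only the id has nowhere to look up the element type of a list, while one receiving the full descriptor can recurse into the children.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-ins, not the Arrow classes.
enum class TypeId { INT32, STRING, LIST };

struct TypeDesc {
    TypeId id;
    std::vector<std::shared_ptr<TypeDesc>> children; // empty for scalar types
};

// With only the id, the element type of a LIST is unknown.
std::string name_from_id(TypeId id) {
    switch (id) {
    case TypeId::INT32:
        return "Int32";
    case TypeId::STRING:
        return "String";
    case TypeId::LIST:
        return "Array(?)";
    }
    return "Unknown";
}

// With the full descriptor, the function can recurse into the children.
std::string name_from_desc(const TypeDesc& t) {
    if (t.id == TypeId::LIST && !t.children.empty()) {
        return "Array(" + name_from_desc(*t.children[0]) + ")";
    }
    return name_from_id(t.id);
}
```

For a list of `INT32`, `name_from_id` can only produce a placeholder, whereas `name_from_desc` resolves the element type through the child descriptor.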





[GitHub] [incubator-doris] HappenLee commented on a diff in pull request #9541: [feature-wip](parquet-orc) Support orc scanner in vectorized engine

Posted by GitBox <gi...@apache.org>.
HappenLee commented on code in PR #9541:
URL: https://github.com/apache/incubator-doris/pull/9541#discussion_r879094893


##########
be/src/vec/data_types/data_type_factory.cpp:
##########
@@ -310,10 +310,10 @@ DataTypePtr DataTypeFactory::create_data_type(const arrow::Type::type& type, boo
         nested = std::make_shared<vectorized::DataTypeString>();
         break;
     case ::arrow::Type::DECIMAL:
-        nested = std::make_shared<vectorized::DataTypeDecimal<vectorized::Decimal128>>(27, 9);
+        nested = std::make_shared<vectorized::DataTypeDecimal<vectorized::Decimal128>>();

Review Comment:
   Why delete the `27, 9`?



##########
be/src/vec/exec/vorc_reader.h:
##########
@@ -0,0 +1,57 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include <arrow/adapters/orc/adapter.h>
+#include <arrow/api.h>
+#include <arrow/buffer.h>
+
+#include <stdint.h>
+#include <map>
+#include <string>
+#include "common/status.h"
+#include "exec/arrow_reader.h"
+namespace doris::vectorized {
+
+// Reader of orc file
+class VORCReaderWrap : public ArrowReaderWrap {

Review Comment:
   Maybe the file should be in `be/exec`, the same as `parquet_reader`. Also, the class does not use `vectorized::block`, so maybe there is no need to name it `VORC`?



##########
be/src/vec/exec/vorc_reader.h:
##########
@@ -0,0 +1,57 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include <arrow/adapters/orc/adapter.h>
+#include <arrow/api.h>
+#include <arrow/buffer.h>
+
+#include <stdint.h>
+#include <map>
+#include <string>
+#include "common/status.h"
+#include "exec/arrow_reader.h"
+namespace doris::vectorized {
+
+// Reader of orc file
+class VORCReaderWrap : public ArrowReaderWrap {
+public:
+    VORCReaderWrap(FileReader* file_reader, int64_t batch_size, int32_t num_of_columns_from_file);
+    virtual ~VORCReaderWrap();
+
+    Status init_reader(const std::vector<SlotDescriptor*>& tuple_slot_descs,
+                       const std::string& timezone) override;
+    Status next_batch(std::shared_ptr<arrow::RecordBatch>* batch,
+                      const std::vector<SlotDescriptor*>& tuple_slot_descs, bool* eof) override;
+
+private:
+    Status _column_indices(const std::vector<SlotDescriptor*>& tuple_slot_descs);
+    Status _next_stripe_reader(bool* eof);
+
+private:
+    // orc file reader object
+    std::shared_ptr<::arrow::RecordBatchReader> _rb_reader;
+    std::unique_ptr<arrow::adapters::orc::ORCFileReader> _reader;
+    std::map<std::string, int> _map_column; // column-name <---> column-index

Review Comment:
   Variables that are common to both `orc` and `parquet`, like `_map_column`, should be put in the parent class.
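
A sketch of the layout this comment asks for, under the assumption that both readers resolve column names the same way. `ReaderBase`, `FakeOrcReader`, and `FakeParquetReader` are hypothetical names, and the lookup takes plain strings instead of slot descriptors to stay self-contained; only `_map_column` and the idea of a shared `column_indices` come from the diff.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical parent class: owns the state both readers need.
class ReaderBase {
public:
    virtual ~ReaderBase() = default;

    // Shared lookup: resolves each requested name against _map_column.
    // Returns false on the first name that is not present.
    bool column_indices(const std::vector<std::string>& names, std::vector<int>* out) const {
        out->clear();
        for (const auto& name : names) {
            auto it = _map_column.find(name);
            if (it == _map_column.end()) {
                return false;
            }
            out->push_back(it->second);
        }
        return true;
    }

protected:
    std::map<std::string, int> _map_column; // column-name -> column-index
};

// Each subclass only fills the map from its own file metadata (faked here).
class FakeOrcReader : public ReaderBase {
public:
    FakeOrcReader() { _map_column = {{"k1", 0}, {"k2", 1}}; }
};

class FakeParquetReader : public ReaderBase {
public:
    FakeParquetReader() { _map_column = {{"v1", 0}}; }
};
```

Keeping the map and the lookup in one place means a fix to the name resolution applies to both formats, and each subclass is left with only the format-specific work of populating the map.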





[GitHub] [incubator-doris] yangzhg commented on a diff in pull request #9541: [feature-wip](parquet-orc) Support orc scanner in vectorized engine

Posted by GitBox <gi...@apache.org>.
yangzhg commented on code in PR #9541:
URL: https://github.com/apache/incubator-doris/pull/9541#discussion_r880373945


##########
thirdparty/build-thirdparty.sh:
##########
@@ -643,12 +643,13 @@ build_arrow() {
     export ARROW_SNAPPY_URL=${TP_SOURCE_DIR}/${SNAPPY_NAME}
     export ARROW_ZLIB_URL=${TP_SOURCE_DIR}/${ZLIB_NAME}
     export ARROW_XSIMD_URL=${TP_SOURCE_DIR}/${XSIMD_NAME}
+    export ARROW_ORC_URL=${TP_SOURCE_DIR}/${ORC_NAME}
 
     LDFLAGS="-L${TP_LIB_DIR} -static-libstdc++ -static-libgcc" \
     ${CMAKE_CMD} -G "${GENERATOR}" -DARROW_PARQUET=ON -DARROW_IPC=ON -DARROW_BUILD_SHARED=OFF \
     -DARROW_BUILD_STATIC=ON -DARROW_WITH_BROTLI=ON -DARROW_WITH_LZ4=ON -DARROW_USE_GLOG=ON \
     -DARROW_WITH_SNAPPY=ON -DARROW_WITH_ZLIB=ON -DARROW_WITH_ZSTD=ON -DARROW_JSON=ON \
-    -DARROW_WITH_UTF8PROC=OFF -DARROW_WITH_RE2=OFF \
+    -DARROW_WITH_UTF8PROC=OFF -DARROW_WITH_RE2=OFF -DARROW_ORC=ON\

Review Comment:
   ```suggestion
       -DARROW_WITH_UTF8PROC=OFF -DARROW_WITH_RE2=OFF -DARROW_ORC=ON \
   ```





[GitHub] [incubator-doris] cambyzju commented on a diff in pull request #9541: [feature-wip](parquet-orc) Support orc scanner in vectorized engine

Posted by GitBox <gi...@apache.org>.
cambyzju commented on code in PR #9541:
URL: https://github.com/apache/incubator-doris/pull/9541#discussion_r871985155


##########
be/src/vec/data_types/data_type_factory.cpp:
##########
@@ -260,4 +260,67 @@ DataTypePtr DataTypeFactory::create_data_type(const PColumnMeta& pcolumn) {
     return nested;
 }
 
+DataTypePtr DataTypeFactory::create_data_type(const arrow::Type::type& type, bool is_nullable) {
+    DataTypePtr nested = nullptr;
+    switch (type) {
+    case ::arrow::Type::BOOL:
+        nested = std::make_shared<vectorized::DataTypeUInt8>();
+        break;
+    case ::arrow::Type::INT8:
+        nested = std::make_shared<vectorized::DataTypeInt8>();
+        break;
+    case ::arrow::Type::UINT8:
+        nested = std::make_shared<vectorized::DataTypeUInt8>();
+        break;
+    case ::arrow::Type::INT16:
+        nested = std::make_shared<vectorized::DataTypeInt16>();
+        break;
+    case ::arrow::Type::UINT16:
+        nested = std::make_shared<vectorized::DataTypeUInt16>();
+        break;
+    case ::arrow::Type::INT32:
+        nested = std::make_shared<vectorized::DataTypeInt32>();
+        break;
+    case ::arrow::Type::UINT32:
+        nested = std::make_shared<vectorized::DataTypeUInt32>();
+        break;
+    case ::arrow::Type::INT64:
+        nested = std::make_shared<vectorized::DataTypeInt64>();
+        break;
+    case ::arrow::Type::UINT64:
+        nested = std::make_shared<vectorized::DataTypeUInt64>();
+        break;
+    case ::arrow::Type::HALF_FLOAT:
+    case ::arrow::Type::FLOAT:
+        nested = std::make_shared<vectorized::DataTypeFloat32>();
+        break;
+    case ::arrow::Type::DOUBLE:
+        nested = std::make_shared<vectorized::DataTypeFloat64>();
+        break;
+    case ::arrow::Type::DATE32:
+        nested = std::make_shared<vectorized::DataTypeDate>();
+        break;
+    case ::arrow::Type::DATE64:
+    case ::arrow::Type::TIMESTAMP:
+        nested = std::make_shared<vectorized::DataTypeDateTime>();
+        break;
+    case ::arrow::Type::BINARY:
+    case ::arrow::Type::FIXED_SIZE_BINARY:
+    case ::arrow::Type::STRING:
+        nested = std::make_shared<vectorized::DataTypeString>();
+        break;
+    case ::arrow::Type::DECIMAL:
+        nested = std::make_shared<vectorized::DataTypeDecimal<vectorized::Decimal128>>(27, 9);

Review Comment:
   ```suggestion
           nested = std::make_shared<vectorized::DataTypeDecimal<vectorized::Decimal128>>();
   ```
   
   27 and 9 are the default values, so they do not need to be passed. Please also change the other calls inside this file.
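
A small sketch of the language rule behind this suggestion: when a constructor declares default arguments, constructing with no arguments yields the same object as passing those values explicitly. `FakeDecimalType` is a hypothetical stand-in; the real `DataTypeDecimal` constructor is not shown in this thread, and the defaults of 27 and 9 are taken from the comment, not verified against the source.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for a decimal data type whose constructor
// declares default precision and scale (values assumed from the comment).
class FakeDecimalType {
public:
    explicit FakeDecimalType(int precision = 27, int scale = 9)
            : _precision(precision), _scale(scale) {}

    int precision() const { return _precision; }
    int scale() const { return _scale; }

private:
    int _precision;
    int _scale;
};

// Relies on the default arguments.
std::shared_ptr<FakeDecimalType> make_default_decimal() {
    return std::make_shared<FakeDecimalType>();
}

// Spells the same values out at the call site.
std::shared_ptr<FakeDecimalType> make_explicit_decimal() {
    return std::make_shared<FakeDecimalType>(27, 9);
}
```

Both factories return objects with identical precision and scale. The trade-off is that the shorter call silently follows any later change to the defaults, while the explicit call keeps the values pinned at the call site.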





[GitHub] [incubator-doris] HappenLee merged pull request #9541: [feature-wip](parquet-orc) Support orc scanner in vectorized engine

Posted by GitBox <gi...@apache.org>.
HappenLee merged PR #9541:
URL: https://github.com/apache/incubator-doris/pull/9541

