Posted to commits@doris.apache.org by GitBox <gi...@apache.org> on 2022/07/20 10:33:29 UTC

[GitHub] [doris] SaintBacchus opened a new pull request, #11046: [feature-wip][multi-catalog] Support orc format file split for file scan node

SaintBacchus opened a new pull request, #11046:
URL: https://github.com/apache/doris/pull/11046

   # Proposed changes
   
   Issue Number: close #xxx
   
   ## Problem Summary:
   
   Describe the overview of changes.
   
   ## Checklist(Required)
   
   1. Does it affect the original behavior: (Yes/No/I Don't know)
   2. Has unit tests been added: (Yes/No/No Need)
   3. Has document been added or modified: (Yes/No/No Need)
   4. Does it need to update dependencies: (Yes/No)
   5. Are there any changes that cannot be rolled back: (Yes/No)
   
   ## Further comments
   
   If this is a relatively large or complex change, kick off the discussion at [dev@doris.apache.org](mailto:dev@doris.apache.org) by explaining why you chose the solution you did and what alternatives you considered, etc...
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] carlvinhust2012 commented on pull request #11046: [feature-wip][multi-catalog] Support orc format file split for file scan node

Posted by GitBox <gi...@apache.org>.
carlvinhust2012 commented on PR #11046:
URL: https://github.com/apache/doris/pull/11046#issuecomment-1193773724

   > Compile error:
   > 
   > ```
   >   /root/doris/be/src/exec/arrow/orc_reader.cpp:62:32: error: 'class arrow::adapters::orc::ORCFileReader' has no member named 'GetRawORCReader'
   > ```
   
   Should we recompile the third-party libraries to resolve this error?




[GitHub] [doris] SaintBacchus commented on a diff in pull request #11046: [feature-wip][multi-catalog] Support orc format file split for file scan node

Posted by GitBox <gi...@apache.org>.
SaintBacchus commented on code in PR #11046:
URL: https://github.com/apache/doris/pull/11046#discussion_r926346121


##########
be/src/exec/arrow/orc_reader.cpp:
##########
@@ -52,6 +55,31 @@ Status ORCReaderWrap::init_reader(const TupleDescriptor* tuple_desc,
         return Status::EndOfFile("Empty Orc File");
     }
 
+    int64_t row_number = 0;
+    int end_group = _total_groups;
+    for (int i = 0; i < _total_groups; i++) {
+        int64_t _offset = _reader->GetRawORCReader()->getStripe(i)->getOffset();
+        int64_t row = _reader->GetRawORCReader()->getStripe(i)->getNumberOfRows();
+        if (_offset < _range_start_offset) {
+            row_number += row;
+        } else if (_offset == _range_start_offset) {
+            _current_group = i;
+        }
+        if (_range_start_offset + _range_size <= _offset) {
+            end_group = i;
+            break;
+        }
+    }
+    LOG(INFO) << "This reader read orc file from offset: " << _range_start_offset
+              << " with size: " << _range_size << ". Also mean that read from strip id from "
+              << _current_group << " to " << end_group;
+    _total_groups = end_group;

Review Comment:
   Yeah





[GitHub] [doris] SaintBacchus commented on a diff in pull request #11046: [feature-wip][multi-catalog] Support orc format file split for file scan node

Posted by GitBox <gi...@apache.org>.
SaintBacchus commented on code in PR #11046:
URL: https://github.com/apache/doris/pull/11046#discussion_r926346121


##########
be/src/exec/arrow/orc_reader.cpp:
##########
@@ -52,6 +55,31 @@ Status ORCReaderWrap::init_reader(const TupleDescriptor* tuple_desc,
         return Status::EndOfFile("Empty Orc File");
     }
 
+    int64_t row_number = 0;
+    int end_group = _total_groups;
+    for (int i = 0; i < _total_groups; i++) {
+        int64_t _offset = _reader->GetRawORCReader()->getStripe(i)->getOffset();
+        int64_t row = _reader->GetRawORCReader()->getStripe(i)->getNumberOfRows();
+        if (_offset < _range_start_offset) {
+            row_number += row;
+        } else if (_offset == _range_start_offset) {
+            _current_group = i;
+        }
+        if (_range_start_offset + _range_size <= _offset) {
+            end_group = i;
+            break;
+        }
+    }
+    LOG(INFO) << "This reader read orc file from offset: " << _range_start_offset
+              << " with size: " << _range_size << ". Also mean that read from strip id from "
+              << _current_group << " to " << end_group;
+    _total_groups = end_group;

Review Comment:
   In Arrow's implementation, the `Seek` function only checks bounds and records the target row:
   ```
     Status Seek(int64_t row_number) {
       ARROW_RETURN_IF(row_number >= NumberOfRows(),
                       Status::Invalid("Out of bounds row number: ", row_number));
   
       current_row_ = row_number;
       return Status::OK();
     }
   ```





[GitHub] [doris] SaintBacchus commented on a diff in pull request #11046: [feature-wip][multi-catalog] Support orc format file split for file scan node

Posted by GitBox <gi...@apache.org>.
SaintBacchus commented on code in PR #11046:
URL: https://github.com/apache/doris/pull/11046#discussion_r926345730


##########
be/src/exec/arrow/orc_reader.cpp:
##########
@@ -52,6 +55,31 @@ Status ORCReaderWrap::init_reader(const TupleDescriptor* tuple_desc,
         return Status::EndOfFile("Empty Orc File");
     }
 
+    int64_t row_number = 0;
+    int end_group = _total_groups;
+    for (int i = 0; i < _total_groups; i++) {
+        int64_t _offset = _reader->GetRawORCReader()->getStripe(i)->getOffset();
+        int64_t row = _reader->GetRawORCReader()->getStripe(i)->getNumberOfRows();
+        if (_offset < _range_start_offset) {
+            row_number += row;
+        } else if (_offset == _range_start_offset) {
+            _current_group = i;

Review Comment:
   In my test with an ORC file, `_range_start_offset` is one of the values in the `_offset` list.





[GitHub] [doris] morningman merged pull request #11046: [feature-wip][multi-catalog] Support orc format file split for file scan node

Posted by GitBox <gi...@apache.org>.
morningman merged PR #11046:
URL: https://github.com/apache/doris/pull/11046




[GitHub] [doris] SaintBacchus commented on pull request #11046: [feature-wip][multi-catalog] Support orc format file split for file scan node

Posted by GitBox <gi...@apache.org>.
SaintBacchus commented on PR #11046:
URL: https://github.com/apache/doris/pull/11046#issuecomment-1193956936

   @carlvinhust2012 Yes, please compile it with the latest Docker image.




[GitHub] [doris] morningman commented on a diff in pull request #11046: [feature-wip][multi-catalog] Support orc format file split for file scan node

Posted by GitBox <gi...@apache.org>.
morningman commented on code in PR #11046:
URL: https://github.com/apache/doris/pull/11046#discussion_r926296324


##########
be/src/exec/arrow/orc_reader.cpp:
##########
@@ -52,6 +55,31 @@ Status ORCReaderWrap::init_reader(const TupleDescriptor* tuple_desc,
         return Status::EndOfFile("Empty Orc File");
     }
 
+    int64_t row_number = 0;
+    int end_group = _total_groups;
+    for (int i = 0; i < _total_groups; i++) {
+        int64_t _offset = _reader->GetRawORCReader()->getStripe(i)->getOffset();
+        int64_t row = _reader->GetRawORCReader()->getStripe(i)->getNumberOfRows();
+        if (_offset < _range_start_offset) {
+            row_number += row;
+        } else if (_offset == _range_start_offset) {
+            _current_group = i;
+        }
+        if (_range_start_offset + _range_size <= _offset) {
+            end_group = i;
+            break;
+        }
+    }
+    LOG(INFO) << "This reader read orc file from offset: " << _range_start_offset
+              << " with size: " << _range_size << ". Also mean that read from strip id from "
+              << _current_group << " to " << end_group;
+    _total_groups = end_group;

Review Comment:
   If `_total_groups` is 0 here, then we don't need to call `_reader->Seek()`.
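
Editorial sketch (not part of the PR): the `Seek` snippet quoted elsewhere in this thread rejects any `row_number >= NumberOfRows()`, so a split that owns no stripes and has skipped every row would likely get an error back if it still called `Seek`. The block below illustrates the suggested guard under stated assumptions. `SimpleStatus`, `MockOrcReader`, and `seek_if_needed` are hypothetical names introduced only for this illustration; they are not Doris or Arrow APIs, and the mock only mirrors the bounds check shown in the quoted snippet.

```cpp
#include <cstdint>
#include <string>

// Hypothetical minimal status type, standing in for arrow::Status.
struct SimpleStatus {
    bool ok;
    std::string message;
};

// Hypothetical mock reader whose Seek mirrors the bounds check quoted in
// this thread: it rejects out-of-range rows and otherwise stores the row.
class MockOrcReader {
public:
    explicit MockOrcReader(int64_t number_of_rows) : number_of_rows_(number_of_rows) {}

    SimpleStatus Seek(int64_t row_number) {
        if (row_number >= number_of_rows_) {
            return {false, "Out of bounds row number: " + std::to_string(row_number)};
        }
        current_row_ = row_number;
        return {true, ""};
    }

    int64_t current_row() const { return current_row_; }

private:
    int64_t number_of_rows_;
    int64_t current_row_ = 0;
};

// Seeks only when the split owns at least one stripe. An empty split is
// reported as success without touching the reader.
SimpleStatus seek_if_needed(MockOrcReader& reader, int groups_in_split, int64_t rows_to_skip) {
    if (groups_in_split == 0) {
        return {true, ""};
    }
    return reader.Seek(rows_to_skip);
}
```

With a 60-row mock reader, `reader.Seek(60)` fails the bounds check, while `seek_if_needed(reader, 0, 60)` returns success because the seek is skipped entirely.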



##########
be/src/exec/arrow/orc_reader.cpp:
##########
@@ -52,6 +55,31 @@ Status ORCReaderWrap::init_reader(const TupleDescriptor* tuple_desc,
         return Status::EndOfFile("Empty Orc File");
     }
 
+    int64_t row_number = 0;
+    int end_group = _total_groups;
+    for (int i = 0; i < _total_groups; i++) {
+        int64_t _offset = _reader->GetRawORCReader()->getStripe(i)->getOffset();
+        int64_t row = _reader->GetRawORCReader()->getStripe(i)->getNumberOfRows();
+        if (_offset < _range_start_offset) {
+            row_number += row;
+        } else if (_offset == _range_start_offset) {
+            _current_group = i;

Review Comment:
   `_current_group` may never be set here if no stripe offset equals `_range_start_offset`.
   In that case, would it always stay 0?
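
Editorial sketch (not part of the PR): one way to address the concern above is to assign a stripe to a split whenever its start offset falls inside the half-open byte range `[range_start, range_start + range_size)`, instead of requiring an exact match with the range start. The block below is a minimal, self-contained illustration of that rule. `StripeInfo`, `StripeRange`, and `select_stripes` are hypothetical names introduced here; in the PR the offsets and row counts come from the raw ORC reader's stripe metadata, which this sketch replaces with a plain vector.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the per-stripe metadata used in the diff.
struct StripeInfo {
    int64_t offset;  // byte offset of the stripe within the file
    int64_t rows;    // number of rows stored in the stripe
};

// Result of mapping a byte range onto stripe indices.
struct StripeRange {
    int first_group;       // index of the first stripe owned by the split
    int end_group;         // one past the last stripe owned by the split
    int64_t rows_to_skip;  // rows in stripes that start before the range
};

// A stripe belongs to the split if its start offset lies in
// [range_start, range_start + range_size). The first owned stripe is
// found even when range_start is not itself a stripe offset.
StripeRange select_stripes(const std::vector<StripeInfo>& stripes, int64_t range_start,
                           int64_t range_size) {
    const int total = static_cast<int>(stripes.size());
    const int64_t range_end = range_start + range_size;
    StripeRange result{total, total, 0};  // empty split unless a stripe matches
    for (int i = 0; i < total; ++i) {
        const int64_t offset = stripes[i].offset;
        if (offset < range_start) {
            // Stripe starts before the range: it belongs to an earlier split.
            result.rows_to_skip += stripes[i].rows;
            continue;
        }
        if (result.first_group == total) {
            result.first_group = i;  // first stripe at or after range_start
        }
        if (offset >= range_end) {
            result.end_group = i;  // first stripe past the range ends the split
            break;
        }
    }
    return result;
}
```

A split is empty when `first_group == end_group`. For stripes at offsets 3, 103, and 203 holding 10, 20, and 30 rows, both `select_stripes(stripes, 103, 100)` and the unaligned `select_stripes(stripes, 50, 100)` select only the second stripe and skip 10 rows.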


