Posted to commits@doris.apache.org by GitBox <gi...@apache.org> on 2022/07/14 07:55:58 UTC

[GitHub] [doris] wsjz opened a new pull request, #10843: [feature] (multi-catalog) read parquet file by start/offset

wsjz opened a new pull request, #10843:
URL: https://github.com/apache/doris/pull/10843

   # Proposed changes
   
   Issue Number: close #xxx
   
   ## Problem Summary:
   
   Describe the overview of changes.
   
   ## Checklist(Required)
   
   1. Does it affect the original behavior: (Yes/No/I Don't know)
   2. Has unit tests been added: (Yes/No/No Need)
   3. Has document been added or modified: (Yes/No/No Need)
   4. Does it need to update dependencies: (Yes/No)
   5. Are there any changes that cannot be rolled back: (Yes/No)
   
   ## Further comments
   
   If this is a relatively large or complex change, kick off the discussion at [dev@doris.apache.org](mailto:dev@doris.apache.org) by explaining why you chose the solution you did and what alternatives you considered, etc...
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] github-actions[bot] commented on pull request #10843: [feature] (multi-catalog) read parquet file by start/offset

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #10843:
URL: https://github.com/apache/doris/pull/10843#issuecomment-1186808346

   PR approved by anyone and no changes requested.




[GitHub] [doris] github-actions[bot] commented on pull request #10843: [feature] (multi-catalog) read parquet file by start/offset

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #10843:
URL: https://github.com/apache/doris/pull/10843#issuecomment-1186808329

   PR approved by at least one committer and no changes requested.




[GitHub] [doris] Zasek commented on pull request #10843: [feature] (multi-catalog) read parquet file by start/offset

Posted by GitBox <gi...@apache.org>.
Zasek commented on PR #10843:
URL: https://github.com/apache/doris/pull/10843#issuecomment-1186739609

   Hello, wsjz, excellent job! My team is also optimizing the Parquet file scanner/reader, but in a different way. We implemented this split read by modifying some Arrow source code. I hope we can discuss this topic😁; you may contact me via email.




[GitHub] [doris] dujl commented on pull request #10843: [feature] (multi-catalog) read parquet file by start/offset

Posted by GitBox <gi...@apache.org>.
dujl commented on PR #10843:
URL: https://github.com/apache/doris/pull/10843#issuecomment-1189021264

   @wsjz @morningman our Parquet alignment strategy is not the same as the Parquet community's.
   The Parquet community checks whether a row group's midpoint falls within the scan range.
   If the midpoint is within the scan range, the row group is added to the scan list.
   I suggest that we align with the Parquet community.
   
   For parquet
   ```
         long midPoint = startIndex + totalSize / 2;
         if (filter.contains(midPoint)) {
           newRowGroups.add(rowGroup);
         }
   ```
   ```
       public boolean contains(long offset) {
         return offset >= this.startOffset && offset < this.endOffset;
       }
   ```
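
The two fragments above can be combined into a small runnable sketch. It is only an illustration of the midpoint rule as described in this comment, not the parquet-mr or Doris implementation: the `RowGroup` and `ScanRange` classes and the `select` method are hypothetical stand-ins, and the offsets are made-up values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MidpointRowGroupFilter {
    /** Hypothetical stand-in for row group metadata: start offset and total size in bytes. */
    public static final class RowGroup {
        final long startOffset;
        final long totalSize;

        public RowGroup(long startOffset, long totalSize) {
            this.startOffset = startOffset;
            this.totalSize = totalSize;
        }
    }

    /** Hypothetical half-open scan range [startOffset, endOffset). */
    public static final class ScanRange {
        final long startOffset;
        final long endOffset;

        public ScanRange(long startOffset, long endOffset) {
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }

        public boolean contains(long offset) {
            return offset >= this.startOffset && offset < this.endOffset;
        }
    }

    /** Keeps only the row groups whose midpoint lies inside the scan range. */
    public static List<RowGroup> select(List<RowGroup> rowGroups, ScanRange range) {
        List<RowGroup> selected = new ArrayList<>();
        for (RowGroup rowGroup : rowGroups) {
            long midPoint = rowGroup.startOffset + rowGroup.totalSize / 2;
            if (range.contains(midPoint)) {
                selected.add(rowGroup);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // Three made-up row groups of 100 bytes each; midpoints are 54, 154 and 254.
        List<RowGroup> rowGroups = Arrays.asList(
                new RowGroup(4, 100), new RowGroup(104, 100), new RowGroup(204, 100));
        // Two adjacent scan ranges that split the file at byte 152.
        System.out.println(select(rowGroups, new ScanRange(0, 152)).size());
        System.out.println(select(rowGroups, new ScanRange(152, 304)).size());
    }
}
```

Because the ranges are half-open and adjacent, each midpoint falls in exactly one range, so a row group that straddles a split boundary is read by exactly one scanner.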




[GitHub] [doris] morningman merged pull request #10843: [feature] (multi-catalog) read parquet file by start/offset

Posted by GitBox <gi...@apache.org>.
morningman merged PR #10843:
URL: https://github.com/apache/doris/pull/10843




[GitHub] [doris] morningman commented on pull request #10843: [feature] (multi-catalog) read parquet file by start/offset

Posted by GitBox <gi...@apache.org>.
morningman commented on PR #10843:
URL: https://github.com/apache/doris/pull/10843#issuecomment-1186799104

   > Hello, wsjz, excellent job! My team is also optimizing the Parquet file scanner/reader, but in a different way. We implemented this split read by modifying some Arrow source code. I hope we can discuss this topic😁; you may contact me via email.
   
   Modifying Arrow source code is hard to maintain, but I am still interested in your approach. Could you share some documentation about it?




[GitHub] [doris] wsjz commented on pull request #10843: [feature] (multi-catalog) read parquet file by start/offset

Posted by GitBox <gi...@apache.org>.
wsjz commented on PR #10843:
URL: https://github.com/apache/doris/pull/10843#issuecomment-1186825360

   Test parameter control:
   1. Create an HMS catalog and use it in the current version.
   2. Split size setting at FE: change the argument of inputFormat.getSplits(jobConf, 0) in ExternalHiveScanProvider. Changing the second argument from '0' to a larger value such as 10 divides the input into that many splits, each of the same size.
   3. Row group size setting (MB): set parquet.block.size = N * 1024 * 1024 when loading data with Hive/Spark.
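
To make the two size settings concrete, here is a minimal sketch of the arithmetic. It is a simplified model under stated assumptions, not the Hadoop or Doris split logic: `planSplits` and `rowGroupBytes` are hypothetical helpers, and the real split computation may also take block size and minimum split size into account.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitPlanSketch {
    /**
     * Cuts a file of fileLength bytes into at most numSplits ranges of equal size
     * (the last range may be shorter). Each entry is {start, length}.
     * Assumption: a numSplits of 0 is treated as a single split.
     */
    public static List<long[]> planSplits(long fileLength, int numSplits) {
        int parts = Math.max(1, numSplits);
        long splitSize = (fileLength + parts - 1) / parts; // ceiling division
        List<long[]> splits = new ArrayList<>();
        for (long start = 0; start < fileLength; start += splitSize) {
            long length = Math.min(splitSize, fileLength - start);
            splits.add(new long[] {start, length});
        }
        return splits;
    }

    /** Value for parquet.block.size when the row group size is given as N megabytes. */
    public static long rowGroupBytes(int megabytes) {
        return megabytes * 1024L * 1024L;
    }

    public static void main(String[] args) {
        // A made-up 1000-byte file divided into 10 splits of 100 bytes each.
        for (long[] split : planSplits(1000, 10)) {
            System.out.println("start=" + split[0] + " length=" + split[1]);
        }
        System.out.println("parquet.block.size for N=8: " + rowGroupBytes(8));
    }
}
```

Varying the number of splits against the row group size is what exercises the case where a split boundary falls inside a row group.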

