Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/01/28 14:35:10 UTC

[GitHub] [arrow-datafusion] alamb opened a new pull request #1695: Lazy TempDir creation in DiskManager

alamb opened a new pull request #1695:
URL: https://github.com/apache/arrow-datafusion/pull/1695


   # Which issue does this PR close?
   
   Related https://github.com/apache/arrow-datafusion/issues/1690
   
   First part of https://github.com/apache/arrow-datafusion/issues/1690: only do IO / `TempDir` creation when tempfiles are actually needed in a plan
   
   I plan a second PR to avoid creating so many `DiskManager` instances in the first place.
   
   # Rationale for this change
   Creating temp files is expensive, and DataFusion currently does it frequently during processing (see https://github.com/apache/arrow-datafusion/issues/1690 for more backstory).
   
   # What changes are included in this PR?
   
   Changes:
   If the user doesn't specify explicit temp directories, do not create a system-assigned `TempDir` unless a tempfile is actually requested.
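
The lazy behaviour described above can be illustrated with a minimal, standard-library-only sketch. This is not the code from this PR: `LazyDiskManager`, `is_initialized`, and the directory naming scheme are hypothetical names chosen for illustration, and a real implementation would more likely rely on the `tempfile` crate's `TempDir` for secure, collision-free directory names. The point is only that construction does no IO, and the scratch directory appears on the first request for a temp file.

```rust
use std::fs;
use std::io;
use std::path::PathBuf;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;

/// Process-wide counter so that two managers never share a directory name.
static MANAGER_SEQ: AtomicU64 = AtomicU64::new(0);

/// Hypothetical stand-in for a disk manager (an assumption for this sketch,
/// not DataFusion's actual type).
pub struct LazyDiskManager {
    /// `None` until the first temp file is requested.
    scratch_dir: Mutex<Option<PathBuf>>,
    /// Gives each temp file created by this manager a distinct name.
    next_file_id: AtomicU64,
}

impl LazyDiskManager {
    /// Construction performs no IO at all.
    pub fn new() -> Self {
        Self {
            scratch_dir: Mutex::new(None),
            next_file_id: AtomicU64::new(0),
        }
    }

    /// Returns true once the scratch directory has been created.
    pub fn is_initialized(&self) -> bool {
        self.scratch_dir.lock().unwrap().is_some()
    }

    /// Creates an empty temp file, creating the scratch directory first
    /// if this is the first request.
    pub fn create_tmp_file(&self) -> io::Result<PathBuf> {
        let mut guard = self.scratch_dir.lock().unwrap();
        if guard.is_none() {
            // Simplified naming: process id plus a per-process sequence number.
            let seq = MANAGER_SEQ.fetch_add(1, Ordering::Relaxed);
            let name = format!("lazy-disk-manager-{}-{}", std::process::id(), seq);
            let dir = std::env::temp_dir().join(name);
            fs::create_dir_all(&dir)?;
            *guard = Some(dir);
        }
        let dir = guard.as_ref().expect("scratch directory was just initialized");
        let id = self.next_file_id.fetch_add(1, Ordering::Relaxed);
        let path = dir.join(format!("tmp-{}", id));
        fs::File::create(&path)?;
        Ok(path)
    }
}

impl Drop for LazyDiskManager {
    /// Best-effort cleanup; a no-op if no temp file was ever requested.
    fn drop(&mut self) {
        if let Ok(slot) = self.scratch_dir.get_mut() {
            if let Some(dir) = slot.take() {
                let _ = fs::remove_dir_all(dir);
            }
        }
    }
}
```

With this shape, a plan that never spills to disk can construct and drop a manager without touching the filesystem; only a call to `create_tmp_file` pays for directory creation.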
   
   # Are there any user-facing changes?
   Fewer tempfiles!
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] yjshen commented on pull request #1695: Lazy TempDir creation in DiskManager

Posted by GitBox <gi...@apache.org>.
yjshen commented on pull request #1695:
URL: https://github.com/apache/arrow-datafusion/pull/1695#issuecomment-1024387697


   https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/datasource/listing/helpers.rs#L244
   https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/optimizer/simplify_expressions.rs#L293
   
   two more.





[GitHub] [arrow-datafusion] yjshen edited a comment on pull request #1695: Lazy TempDir creation in DiskManager

Posted by GitBox <gi...@apache.org>.
yjshen edited a comment on pull request #1695:
URL: https://github.com/apache/arrow-datafusion/pull/1695#issuecomment-1024387697


   https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/datasource/listing/helpers.rs#L244
   https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/optimizer/simplify_expressions.rs#L293
   
   two more.
   
   I think the main reason is that the optimizers are detached from the execution context but still need some of its functionality.





[GitHub] [arrow-datafusion] alamb merged pull request #1695: Lazy TempDir creation in DiskManager

Posted by GitBox <gi...@apache.org>.
alamb merged pull request #1695:
URL: https://github.com/apache/arrow-datafusion/pull/1695


   





[GitHub] [arrow-datafusion] yjshen edited a comment on pull request #1695: Lazy TempDir creation in DiskManager

Posted by GitBox <gi...@apache.org>.
yjshen edited a comment on pull request #1695:
URL: https://github.com/apache/arrow-datafusion/pull/1695#issuecomment-1024387697


   https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/datasource/listing/helpers.rs#L244
   https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/optimizer/simplify_expressions.rs#L293
   
   two more.
   
   I think the main reason is that the optimizers/planners are detached from the execution context but still need some of its functionality.





[GitHub] [arrow-datafusion] alamb commented on pull request #1695: Lazy TempDir creation in DiskManager

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #1695:
URL: https://github.com/apache/arrow-datafusion/pull/1695#issuecomment-1024353939


   > If I read our code correctly, when executing a plan we need to create a RuntimeEnv, which will create a DiskManager instance. So I can't understand how there would be so many DiskManager instances?
   
   
   @xudong963  I think there are (non-obviously) several more surprising places that create a DiskManager instance today -- for example https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/physical_optimizer/pruning.rs#L132





[GitHub] [arrow-datafusion] alamb commented on pull request #1695: Lazy TempDir creation in DiskManager

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #1695:
URL: https://github.com/apache/arrow-datafusion/pull/1695#issuecomment-1024492619


   > two more.
   > I think the main reason is that the optimizers/planners are detached from the execution context but still need some of its functionality.
   
   💯  @yjshen  -- see https://github.com/apache/arrow-datafusion/pull/1700 :)





[GitHub] [arrow-datafusion] alamb commented on pull request #1695: Lazy TempDir creation in DiskManager

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #1695:
URL: https://github.com/apache/arrow-datafusion/pull/1695#issuecomment-1024282103


   cc @yjshen 





[GitHub] [arrow-datafusion] alamb edited a comment on pull request #1695: Lazy TempDir creation in DiskManager

Posted by GitBox <gi...@apache.org>.
alamb edited a comment on pull request #1695:
URL: https://github.com/apache/arrow-datafusion/pull/1695#issuecomment-1024353939


   > If I read our code correctly, when executing a plan we need to create a RuntimeEnv, which will create a DiskManager instance. So I can't understand how there would be so many DiskManager instances?
   
   
   @xudong963  I think there are (non-obviously) several more surprising places that create a DiskManager instance today -- for example https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/physical_optimizer/pruning.rs#L132 (indirectly, by creating an ExecutionContext)

