You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by "alamb (via GitHub)" <gi...@apache.org> on 2023/04/24 16:59:35 UTC

[GitHub] [arrow-datafusion] alamb commented on pull request #6049: MemoryExec INSERT INTO refactor to use ExecutionPlan

alamb commented on PR #6049:
URL: https://github.com/apache/arrow-datafusion/pull/6049#issuecomment-1520524109

   > If you believe that simplifying the PR without losing functionality is the best course of action, I am willing to make the necessary changes.
   
   This is my preference -- and if/when we find that having an `ExecutionPlan` that implements the write logic, we can perhaps revive that part of this PR
   
   >  However, I would appreciate further discussion on how to align this PR with the goals outlined in issue https://github.com/apache/arrow-datafusion/issues/5076 while addressing your concerns about the resulting ExecutionPlan.
   
   My high level thinking is that a TableProvider should be a "Sink" that takes streams from executing
   
   After thinking about it more, I think an `ExecutionPlan` for implementing Inserts might be ok for the reasons you mention. 
   
   However,  concerns I have with the `MemoryWriteExec` in this PR is:
   1. It is tightly bound to the specific `MemTable` TableProvider but is in the general purpose physical plan area -- it seems the MemTable specific functionality should stay near / within the `datasource` module. 
   2. Most of the logic in `MemoryWriteExec` seems like it is general purpose -- only the actual `RwLock<Vec<RecordBatch>>>>` seems specific to Memory
   
   My reading of https://github.com/apache/arrow-datafusion/issues/5076 and https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-SparkPlan-DataWritingCommandExec.html suggest that there is single specific physical operator that handles writes to arbitrary datasources which is how I would prefer DataFusion proceeds. 
   
   I left some additional comments about this plan on https://github.com/apache/arrow-datafusion/issues/5076#issuecomment-1520521168


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org