Posted to github@arrow.apache.org by "tustvold (via GitHub)" <gi...@apache.org> on 2023/03/01 11:39:14 UTC

[GitHub] [arrow-datafusion] tustvold commented on issue #5130: Using INSERT INTO query to append a file

tustvold commented on issue #5130:
URL: https://github.com/apache/arrow-datafusion/issues/5130#issuecomment-1449968315

   > To support mutability the TableProvider implementation would need to implement "interior mutability"
   
   I think this touches on a key point: there needs to be some sort of consistency/atomicity story here. Most users would likely expect `INSERT INTO` to be atomic, i.e. a query sees either all of the inserted data or none of it. They may additionally have expectations with respect to transaction isolation / serializability.
   
   **Blindly appending to a CSV / JSON file without any external coordination will result in queries seeing partial or potentially corrupted data** 
   
   One common approach is for new data to always be written to a new file, thus ensuring atomicity. 
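   
   A minimal sketch of the write-to-a-new-file idea, using only the Rust standard library. The helper names `insert_batch` and `visible_files`, the `batch-<id>.csv` naming scheme, and the `.tmp` suffix are all assumptions made for illustration and are not DataFusion APIs. The sketch also assumes a local POSIX filesystem, where a rename within one filesystem is atomic; object stores generally need a different way to publish a file.
   
   ```rust
   use std::fs::{self, File};
   use std::io::{self, Write};
   use std::path::{Path, PathBuf};
   
   /// Hypothetical helper: write one batch of rows as a brand-new file in `table_dir`.
   /// The rows go to a `.tmp` file first, which is then renamed into place, so a
   /// reader never observes a half-written `.csv` file.
   fn insert_batch(table_dir: &Path, batch_id: u64, rows: &[&str]) -> io::Result<PathBuf> {
       fs::create_dir_all(table_dir)?;
       let tmp = table_dir.join(format!("batch-{batch_id}.csv.tmp"));
       let dst = table_dir.join(format!("batch-{batch_id}.csv"));
   
       let mut file = File::create(&tmp)?;
       for row in rows {
           writeln!(file, "{row}")?;
       }
       // Flush the contents to disk before the file becomes visible.
       file.sync_all()?;
       // Publish the file. Assumption: `tmp` and `dst` are on the same filesystem.
       fs::rename(&tmp, &dst)?;
       Ok(dst)
   }
   
   /// Hypothetical helper: list the files a query should scan.
   /// Anything still carrying the `.tmp` suffix is ignored.
   fn visible_files(table_dir: &Path) -> io::Result<Vec<PathBuf>> {
       let mut files = Vec::new();
       for entry in fs::read_dir(table_dir)? {
           let path = entry?.path();
           if path.extension().map_or(false, |ext| ext == "csv") {
               files.push(path);
           }
       }
       files.sort();
       Ok(files)
   }
   ```
   
   With this layout a query that lists the directory before the rename sees none of the batch, and one that lists it afterwards sees all of it. Existing files are never modified.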
   
   This basic approach can then be optionally extended with things like:
   
   * A Write-Ahead Log and MemTable to reduce file churn
   * Catalog functionality, such as that provided by deltalake or lakehouse, to support in-place atomic rewrites, transactions, etc.
   * Compaction functionality (deltalake calls this [bin-packing](https://docs.delta.io/1.2.1/optimizations-oss.html#compaction-bin-packing)) to coalesce small files into larger ones
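   
   To make the compaction item concrete, here is a rough sketch of the planning step only: deciding which small files to merge. It uses a first-fit-decreasing heuristic, which is a choice made for this example and not necessarily what deltalake does. The function name `plan_compaction` and its `(file name, size in bytes)` input are hypothetical.
   
   ```rust
   /// Hypothetical planner: group small files into bins of at most `target_bytes`.
   /// Files that already meet the target size are left alone.
   fn plan_compaction(files: &[(String, u64)], target_bytes: u64) -> Vec<Vec<String>> {
       // Only files below the target are candidates; place the largest first.
       let mut small: Vec<&(String, u64)> = files
           .iter()
           .filter(|(_, size)| *size < target_bytes)
           .collect();
       small.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
   
       // Each bin tracks (bytes used so far, names of the files assigned to it).
       let mut bins: Vec<(u64, Vec<String>)> = Vec::new();
       for (name, size) in small {
           // First fit: use the first bin that still has room for this file.
           let slot = bins.iter().position(|(used, _)| used + size <= target_bytes);
           match slot {
               Some(i) => {
                   bins[i].0 += *size;
                   bins[i].1.push(name.clone());
               }
               None => bins.push((*size, vec![name.clone()])),
           }
       }
   
       // A bin holding a single file would only be rewritten as-is, so skip it.
       bins.into_iter()
           .map(|(_, names)| names)
           .filter(|names| names.len() > 1)
           .collect()
   }
   ```
   
   Each returned group would then be rewritten as one new file and swapped in for its inputs, which is where a catalog or some other form of coordination is needed to keep the swap atomic.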
   
   I think adding some pieces of functionality for this to DataFusion would be amazing, and may even be of interest to the delta-rs folks (FYI @roeap), but it may benefit from having a more fleshed-out catalog story first (#5291).

