
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/03/07 20:49:12 UTC

[GitHub] [arrow-datafusion] matthewmturner opened a new issue #1777: Improve DataFusions ability to write files

matthewmturner opened a new issue #1777:
URL: https://github.com/apache/arrow-datafusion/issues/1777


   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   
   I would like to add functionality for writing files from DataFusion.  To start, I've thought of the below.
   
   - [ ] (1) Add write functionality to `ObjectStore`
   - [ ] (2) Add `write_json` to `ExecutionContext`
   - [ ] (3) Add `write_ipc` to `ExecutionContext`
   - [ ] (4) Add `COPY` / `COPY TO` command for SQL (like PostgreSQL: https://www.postgresql.org/docs/current/sql-copy.html)
   - [ ] (5) Add ability to write partitioned datasets
   - [ ] (6) Add support for writing metadata
   - [x] (7) Add `write_csv` method to `DataFrame` trait
   - [ ] (8) Add `write_parquet` method to `DataFrame` trait
   - [ ] (9) Add `write_ipc` method to `DataFrame` trait
   - [ ] (10) Add `write_json` method to `DataFrame` trait
   
   I will use this as a parent / tracker issue for the points above, each of which will have its own issue.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Igosuki commented on issue #1777: Improve DataFusions ability to write files

Posted by GitBox <gi...@apache.org>.

Igosuki commented on issue #1777:
URL: https://github.com/apache/arrow-datafusion/issues/1777#issuecomment-1060044229


   Exactly, like we can do with Spark.
   
   On Sun, Mar 6, 2022 at 16:55, Matthew Turner ***@***.***> wrote:
   
   > @Igosuki <https://github.com/Igosuki> just to confirm what you're looking
   > for - you want to be able to write partitioned parquet files from SQL?
   



[GitHub] [arrow-datafusion] xudong963 commented on issue #1777: Improve DataFusions ability to write files

Posted by GitBox <gi...@apache.org>.

xudong963 commented on issue #1777:
URL: https://github.com/apache/arrow-datafusion/issues/1777#issuecomment-1034318468


   > To confirm - does `COPY FROM` functionality already exist
   
   No, but I'm looking forward to seeing it. I was just describing my idea of how to implement it.




[GitHub] [arrow-datafusion] alamb commented on issue #1777: Improve DataFusions ability to write files

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #1777:
URL: https://github.com/apache/arrow-datafusion/issues/1777#issuecomment-1061122771


   reopened -- I think GitHub got a little overeager and interpreted the "closes #1777 task three" comment to mean it should close the ticket
   
   ![Screen Shot 2022-03-07 at 3 49 19 PM](https://user-images.githubusercontent.com/490673/157115585-d9c56d20-d7c9-435f-899a-28eb13399590.png)
    



[GitHub] [arrow-datafusion] matthewmturner commented on issue #1777: Improve DataFusions ability to write files

Posted by GitBox <gi...@apache.org>.

matthewmturner commented on issue #1777:
URL: https://github.com/apache/arrow-datafusion/issues/1777#issuecomment-1034168277


   > @matthewmturner I'd love to see PARTITION BY be implemented, which would output typical k=v partitions usable by the ListingTableProvider.
   
   @Igosuki  Agreed that would be great, I've added it to the list.



[GitHub] [arrow-datafusion] matthewmturner commented on issue #1777: Improve DataFusions ability to write files

Posted by GitBox <gi...@apache.org>.

matthewmturner commented on issue #1777:
URL: https://github.com/apache/arrow-datafusion/issues/1777#issuecomment-1033373385


   @yjshen also interested in your view - in particular on the `ObjectStore` point.



[GitHub] [arrow-datafusion] Igosuki commented on issue #1777: Improve DataFusions ability to write files

Posted by GitBox <gi...@apache.org>.

Igosuki commented on issue #1777:
URL: https://github.com/apache/arrow-datafusion/issues/1777#issuecomment-1034146126


   @matthewmturner I'd love to see PARTITION BY be implemented, which would output typical k=v partitions usable by the ListingTableProvider. 
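
To make the `k=v` layout concrete: a partitioned writer would typically put each output file under one directory per partition column, named `column=value`. The sketch below only builds such a path with the Rust standard library. `partition_path`, the `abc` base directory, and the `year`/`month` columns are hypothetical names chosen for illustration; none of this is an existing DataFusion API.

```rust
use std::path::PathBuf;

/// Build a `k=v` style directory path for one partition.
/// `base`, `partition`, and `file_name` are illustrative inputs only.
fn partition_path(base: &str, partition: &[(&str, &str)], file_name: &str) -> PathBuf {
    let mut path = PathBuf::from(base);
    for (column, value) in partition {
        // One directory level per partition column, e.g. `year=2022`.
        path.push(format!("{}={}", column, value));
    }
    // The data file itself sits at the bottom of the partition directories.
    path.push(file_name);
    path
}
```

For example, calling it with base `abc`, the pairs `("year", "2022")` and `("month", "03")`, and file name `part-0.parquet` yields a path with the components `abc`, `year=2022`, `month=03`, `part-0.parquet`, which is the kind of layout a listing-based table provider could later discover.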



[GitHub] [arrow-datafusion] Igosuki commented on issue #1777: Improve DataFusions ability to write files

Posted by GitBox <gi...@apache.org>.

Igosuki commented on issue #1777:
URL: https://github.com/apache/arrow-datafusion/issues/1777#issuecomment-1059983372


   @alamb is `write_parquet` out of scope now?



[GitHub] [arrow-datafusion] matthewmturner edited a comment on issue #1777: Improve DataFusions ability to write files

Posted by GitBox <gi...@apache.org>.

matthewmturner edited a comment on issue #1777:
URL: https://github.com/apache/arrow-datafusion/issues/1777#issuecomment-1059987921


   @Igosuki just to confirm what you're looking for - you want to be able to write partitioned parquet files from SQL?
   
   Plus writing more metadata. 



[GitHub] [arrow-datafusion] matthewmturner commented on issue #1777: Improve DataFusions ability to write files

Posted by GitBox <gi...@apache.org>.

matthewmturner commented on issue #1777:
URL: https://github.com/apache/arrow-datafusion/issues/1777#issuecomment-1059984946


   @alamb FYI I don't think this issue should be closed yet. I'm using it as a tracker with the task list in the description. 




[GitHub] [arrow-datafusion] matthewmturner commented on issue #1777: Improve DataFusions ability to write files

Posted by GitBox <gi...@apache.org>.

matthewmturner commented on issue #1777:
URL: https://github.com/apache/arrow-datafusion/issues/1777#issuecomment-1034969587


   @Igosuki Thx much.
   
   Will keep you posted when I start working on this. 



[GitHub] [arrow-datafusion] matthewmturner commented on issue #1777: Improve DataFusions ability to write files

Posted by GitBox <gi...@apache.org>.

matthewmturner commented on issue #1777:
URL: https://github.com/apache/arrow-datafusion/issues/1777#issuecomment-1034315867


   > A direct implementation of `COPY FROM` (copying data from a file into a table) would be: if the table exists, create a new table from the file and `union` it with the old table; otherwise, `union` the new table with an empty table that has the same schema.
   > 
   > BTW, we should make `sqlparser-rs` support those commands first.
   
   @xudong963 To confirm - does `COPY FROM` functionality already exist?  I'm not sure if I understood correctly, but I was focused on writing tables/dataframes from DataFusion to files for this issue.
   
   And definitely agree on checking `sqlparser-rs`. I will check that out.



[GitHub] [arrow-datafusion] xudong963 commented on issue #1777: Improve DataFusions ability to write files

Posted by GitBox <gi...@apache.org>.

xudong963 commented on issue #1777:
URL: https://github.com/apache/arrow-datafusion/issues/1777#issuecomment-1034302444


   A direct implementation of `COPY FROM` (copying data from a file into a table) would be: if the table exists, create a new table from the file and `union` it with the old table; otherwise, `union` the new table with an empty table that has the same schema.
   
   BTW, we should make `sqlparser-rs` support those commands first.
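
The union-based idea above can be sketched with plain in-memory tables. This is only an illustration using the Rust standard library: `Catalog`, `Row`, and `copy_from` are hypothetical names, the rows are assumed to have been parsed from the file already, and the union is assumed to keep duplicates (as `UNION ALL` would). A real implementation would go through DataFusion's own table and plan types instead.

```rust
use std::collections::HashMap;

/// A row is modeled as a list of string cells; purely illustrative.
type Row = Vec<String>;

/// Hypothetical in-memory catalog mapping table names to their rows.
#[derive(Default)]
struct Catalog {
    tables: HashMap<String, Vec<Row>>,
}

impl Catalog {
    /// Sketch of `COPY FROM`: union the rows loaded from a file with the
    /// existing table, or with an empty table if none exists yet.
    fn copy_from(&mut self, table: &str, file_rows: Vec<Row>) {
        // `or_default()` supplies the empty table for the "otherwise" case.
        let existing = self.tables.entry(table.to_string()).or_default();
        // Assumed `UNION ALL` semantics: append without removing duplicates.
        existing.extend(file_rows);
    }
}
```

Both branches of the description collapse into the same code path here, because a missing table is treated as an empty one before the rows are appended.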



[GitHub] [arrow-datafusion] Igosuki edited a comment on issue #1777: Improve DataFusions ability to write files

Posted by GitBox <gi...@apache.org>.

Igosuki edited a comment on issue #1777:
URL: https://github.com/apache/arrow-datafusion/issues/1777#issuecomment-1034741516


   @matthewmturner 
   For instance: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/package.scala
   Or after writing a parquet file with pandas:
   ```
   ############ file meta data ############
   created_by: parquet-cpp-arrow version 6.0.1
   ```
   
   And then, of course, other arbitrary metadata for other systems such as warehouses.



[GitHub] [arrow-datafusion] matthewmturner commented on issue #1777: Improve DataFusions ability to work with files

Posted by GitBox <gi...@apache.org>.

matthewmturner commented on issue #1777:
URL: https://github.com/apache/arrow-datafusion/issues/1777#issuecomment-1073346393


   OK great - using the DataFrame API aligns more with my thinking. It was just the SQL part that was throwing me off. Thx.




[GitHub] [arrow-datafusion] matthewmturner commented on issue #1777: Improve DataFusions ability to write files

Posted by GitBox <gi...@apache.org>.

matthewmturner commented on issue #1777:
URL: https://github.com/apache/arrow-datafusion/issues/1777#issuecomment-1031982456


   @seddonm1 @houqp FYI - in case you have any thoughts.



[GitHub] [arrow-datafusion] Igosuki commented on issue #1777: Improve DataFusions ability to write files

Posted by GitBox <gi...@apache.org>.

Igosuki commented on issue #1777:
URL: https://github.com/apache/arrow-datafusion/issues/1777#issuecomment-1034219306


   And, probably, support writing metadata, both arbitrary and specific to DataFusion, like other Apache engines do?



[GitHub] [arrow-datafusion] Igosuki commented on issue #1777: Improve DataFusions ability to work with files

Posted by GitBox <gi...@apache.org>.

Igosuki commented on issue #1777:
URL: https://github.com/apache/arrow-datafusion/issues/1777#issuecomment-1073344972


   Personally, I'm fine using the DataFrame API, but partitioned output isn't available in the API right now, is it? Spark goes with `write.partitionBy(...).parquet("")`.



[GitHub] [arrow-datafusion] matthewmturner commented on issue #1777: Improve DataFusions ability to write files

Posted by GitBox <gi...@apache.org>.

matthewmturner commented on issue #1777:
URL: https://github.com/apache/arrow-datafusion/issues/1777#issuecomment-1034466504


   @xudong963 very helpful, thank you.




[GitHub] [arrow-datafusion] alamb closed issue #1777: Improve DataFusions ability to write files

Posted by GitBox <gi...@apache.org>.

alamb closed issue #1777:
URL: https://github.com/apache/arrow-datafusion/issues/1777


   



[GitHub] [arrow-datafusion] matthewmturner commented on issue #1777: Improve DataFusions ability to write files

Posted by GitBox <gi...@apache.org>.

matthewmturner commented on issue #1777:
URL: https://github.com/apache/arrow-datafusion/issues/1777#issuecomment-1046999903


   From what I see, it looks like only the execution context can write files right now - let me know if I'm mistaken.  I think it makes sense to add write functionality to DataFrames as well.  I've added it to the list.



[GitHub] [arrow-datafusion] matthewmturner commented on issue #1777: Add support for COPY command

Posted by GitBox <gi...@apache.org>.

matthewmturner commented on issue #1777:
URL: https://github.com/apache/arrow-datafusion/issues/1777#issuecomment-1031926470


   Just wanted to create this to start collecting feedback / thoughts.  I haven't had a chance to really look into the details of this yet.  I hope to start in a couple of weeks when I have some more time.



[GitHub] [arrow-datafusion] matthewmturner commented on issue #1777: Improve DataFusions ability to write files

Posted by GitBox <gi...@apache.org>.

matthewmturner commented on issue #1777:
URL: https://github.com/apache/arrow-datafusion/issues/1777#issuecomment-1034242254


   > And, probably, support writing metadata, both arbitrary and specific to DataFusion, like other Apache engines do?
   
   Added that as well.
   
   For my information, do you have any examples from other systems that I could reference?



[GitHub] [arrow-datafusion] xudong963 commented on issue #1777: Improve DataFusions ability to write files

Posted by GitBox <gi...@apache.org>.

xudong963 commented on issue #1777:
URL: https://github.com/apache/arrow-datafusion/issues/1777#issuecomment-1034336447


   > And definitely agree on checking `sqlparser-rs`. I will check that out.
   
   FYI @matthewmturner  https://github.com/sqlparser-rs/sqlparser-rs/pull/409/



[GitHub] [arrow-datafusion] matthewmturner commented on issue #1777: Improve DataFusions ability to work with files

Posted by GitBox <gi...@apache.org>.

matthewmturner commented on issue #1777:
URL: https://github.com/apache/arrow-datafusion/issues/1777#issuecomment-1072470491


   @Igosuki sorry if I'm being dumb / bad at searching Google, but I haven't been able to find an example / docs of writing partitioned parquet files from SQL, only of writing with the DataFrame API or reading a partitioned parquet dataset with SQL.  I haven't had the chance to test, but is the command you're looking for something like:
   
   ```
   COPY TABLE abc
   TO `abc`
   STORED AS PARQUET
   PARTITION BY year, month
   ```

