Posted to github@arrow.apache.org by "oridag (via GitHub)" <gi...@apache.org> on 2023/06/11 08:00:46 UTC
[GitHub] [arrow-datafusion] oridag opened a new issue, #6630: allow writing compressed files
oridag opened a new issue, #6630:
URL: https://github.com/apache/arrow-datafusion/issues/6630
### Is your feature request related to a problem or challenge?
It would be useful to allow stream compression for output file formats that, unlike Parquet, don't have a built-in compression spec.
### Describe the solution you'd like
`DataFrame::write_json` and `DataFrame::write_csv` can take an optional `Compression` argument
### Describe alternatives you've considered
An alternative would be to post-process the output files and compress them, but that is less efficient than stream-compressing them while writing.
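To illustrate the difference between the two approaches, here is a minimal, std-only sketch. `RleWriter` and `write_rows` are hypothetical names, and the toy run-length encoder is only a stand-in for a real codec such as gzip; none of this is DataFusion code. The point is that a compressing `Write` adapter can sit between the row writer and the sink, so no uncompressed copy has to be materialized and re-read.

```rust
use std::io::{self, Write};

/// Toy run-length encoder standing in for a real codec such as gzip
/// (an assumption for illustration; not a DataFusion type).
/// Each run of identical bytes is emitted as the pair (count, byte).
struct RleWriter<W: Write> {
    inner: W,
    /// (byte, count) of the run currently being built.
    run: Option<(u8, u8)>,
}

impl<W: Write> RleWriter<W> {
    fn new(inner: W) -> Self {
        Self { inner, run: None }
    }

    /// Emit the pending run, if any.
    fn flush_run(&mut self) -> io::Result<()> {
        if let Some((byte, count)) = self.run.take() {
            self.inner.write_all(&[count, byte])?;
        }
        Ok(())
    }

    /// Emit the final run and hand back the underlying sink.
    fn finish(mut self) -> io::Result<W> {
        self.flush_run()?;
        self.inner.flush()?;
        Ok(self.inner)
    }
}

impl<W: Write> Write for RleWriter<W> {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        for &b in buf {
            let current = self.run;
            match current {
                // Extend the current run while the byte repeats and the counter fits.
                Some((byte, count)) if byte == b && count < u8::MAX => {
                    self.run = Some((byte, count + 1));
                }
                // Otherwise emit the finished run and start a new one.
                _ => {
                    self.flush_run()?;
                    self.run = Some((b, 1));
                }
            }
        }
        Ok(buf.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        self.inner.flush()
    }
}

/// Hypothetical row writer: it only needs `Write`, so it does not care
/// whether the sink compresses or not.
fn write_rows<W: Write>(mut out: W, rows: &[&str]) -> io::Result<W> {
    for row in rows {
        out.write_all(row.as_bytes())?;
        out.write_all(b"\n")?;
    }
    Ok(out)
}

fn main() -> io::Result<()> {
    let rows = ["aaaa,bbbb", "aaaa,bbbb"];

    // Streaming: rows are encoded as they are produced.
    let streamed = write_rows(RleWriter::new(Vec::new()), &rows)?.finish()?;

    // Post-processing: write the plain output first, then make a second pass over it.
    let plain = write_rows(Vec::new(), &rows)?;
    let mut second_pass = RleWriter::new(Vec::new());
    second_pass.write_all(&plain)?;
    let post_processed = second_pass.finish()?;

    // Both routes yield the same bytes; streaming just skips the intermediate copy.
    assert_eq!(streamed, post_processed);
    println!("{} plain bytes, {} encoded bytes", plain.len(), streamed.len());
    Ok(())
}
```

With a real codec the adapter would be swapped for that codec's encoder, and `write_rows` would stay unchanged.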
### Additional context
_No response_
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
Re: [I] allow writing compressed files [arrow-datafusion]
Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #6630:
URL: https://github.com/apache/arrow-datafusion/issues/6630#issuecomment-1787415031
Thanks for the ping @devinjdangelo
Re: [I] allow writing compressed files [arrow-datafusion]
Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb closed issue #6630: allow writing compressed files
URL: https://github.com/apache/arrow-datafusion/issues/6630
[GitHub] [arrow-datafusion] oridag commented on issue #6630: allow writing compressed files
Posted by "oridag (via GitHub)" <gi...@apache.org>.
oridag commented on issue #6630:
URL: https://github.com/apache/arrow-datafusion/issues/6630#issuecomment-1606725177
@jiangzhx I'm not, go ahead.
Re: [I] allow writing compressed files [arrow-datafusion]
Posted by "devinjdangelo (via GitHub)" <gi...@apache.org>.
devinjdangelo commented on issue #6630:
URL: https://github.com/apache/arrow-datafusion/issues/6630#issuecomment-1786182560
This is supported today in both SQL (see [write_options](https://arrow.apache.org/datafusion/user-guide/sql/write_options.html)) and DataFrames (see the `DataFrameWriteOptions` parameter).
@alamb I believe we can close this issue as complete.
[GitHub] [arrow-datafusion] jiangzhx commented on issue #6630: allow writing compressed files
Posted by "jiangzhx (via GitHub)" <gi...@apache.org>.
jiangzhx commented on issue #6630:
URL: https://github.com/apache/arrow-datafusion/issues/6630#issuecomment-1609184216
After doing some research, I found that PR #6526 allows you to write compressed data to a CSV file. However, currently it only works with SQL and not with DataFrames.
https://github.com/apache/arrow-datafusion/blob/36292f63e3c7dfbd5bca77f5e36c2aa43048d2be/datafusion/core/src/datasource/file_format/csv.rs#L473-L521
Example:
```sql
CREATE EXTERNAL TABLE test (
    c1 TINYINT NOT NULL,
    c2 SMALLINT NOT NULL
)
STORED AS CSV
WITH HEADER ROW
COMPRESSION TYPE GZIP
LOCATION '/Users/sylar/workspace/opensource/arrow-datafusion/testing/data/csv/test.csv.gz';

INSERT INTO test SELECT * FROM test;
```
[GitHub] [arrow-datafusion] alamb commented on issue #6630: allow writing compressed files
Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #6630:
URL: https://github.com/apache/arrow-datafusion/issues/6630#issuecomment-1610025223
> After doing some research, I found that PR https://github.com/apache/arrow-datafusion/pull/6526 allows you to write compressed data to a CSV file. However, currently it only works with SQL and not with DataFrames.
The more we can unify the SQL / DataFrame codepaths the better. Since SQL has just started supporting writes, there is likely some non-trivial unification work that would be helpful to do.
[GitHub] [arrow-datafusion] jiangzhx commented on issue #6630: allow writing compressed files
Posted by "jiangzhx (via GitHub)" <gi...@apache.org>.
jiangzhx commented on issue #6630:
URL: https://github.com/apache/arrow-datafusion/issues/6630#issuecomment-1606534410
> I suggest we add some sort of `CsvWriteOption` structure to make the API more general. Perhaps something like
>
> ```rust
> /// DataFusion specific writing properties
> struct WriteProperties {
>     /// underlying arrow csv writer properties
>     csv: CsvWriterProperties,
>     compression: Compression,
> }
>
> pub async fn write_csv(self, path: &str, writer_properties: Option<WriteProperties>) -> Result<()>
> ```
>
> This would be consistent with the parquet write API: https://docs.rs/datafusion/latest/datafusion/dataframe/struct.DataFrame.html#method.write_parquet
@oridag are you working on this?
If not, I will give this a try.
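For reference, a rough, std-only sketch of how the quoted options shape could behave. `Compression`, `CsvWriterProperties`, and `WriteProperties` here are simplified stand-ins, and `planned_file_name` with its `part-0.csv` naming is a hypothetical helper, not DataFusion's API. It only shows that an `Option<WriteProperties>` argument lets existing callers pass `None` and keep the uncompressed default.

```rust
/// Stand-in for a compression choice (assumption; not DataFusion's actual type).
#[derive(Debug, Clone, Copy, PartialEq, Default)]
enum Compression {
    #[default]
    Uncompressed,
    Gzip,
    Zstd,
}

impl Compression {
    /// Conventional file suffix for each codec.
    fn file_suffix(self) -> &'static str {
        match self {
            Compression::Uncompressed => "",
            Compression::Gzip => ".gz",
            Compression::Zstd => ".zst",
        }
    }
}

/// Stand-in for the underlying arrow CSV writer settings (assumption).
#[derive(Debug, Clone, PartialEq)]
struct CsvWriterProperties {
    has_header: bool,
    delimiter: u8,
}

impl Default for CsvWriterProperties {
    fn default() -> Self {
        Self { has_header: true, delimiter: b',' }
    }
}

/// Mirrors the struct from the quoted proposal.
#[derive(Debug, Clone, PartialEq, Default)]
struct WriteProperties {
    csv: CsvWriterProperties,
    compression: Compression,
}

/// Hypothetical helper: the file name a `write_csv` call might produce.
/// `None` falls back to the defaults, so existing callers keep working.
fn planned_file_name(path: &str, props: Option<WriteProperties>) -> String {
    let props = props.unwrap_or_default();
    format!("{path}/part-0.csv{}", props.compression.file_suffix())
}

fn main() {
    // No options: plain CSV output.
    println!("{}", planned_file_name("out", None));

    // Opt in to gzip while keeping the default CSV settings.
    let gzip = WriteProperties {
        compression: Compression::Gzip,
        ..Default::default()
    };
    println!("{}", planned_file_name("out", Some(gzip)));
}
```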
[GitHub] [arrow-datafusion] alamb commented on issue #6630: allow writing compressed files
Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #6630:
URL: https://github.com/apache/arrow-datafusion/issues/6630#issuecomment-1590038340
I suggest we add some sort of `CsvWriteOption` structure to make the API more general. Perhaps something like
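```rust
/// DataFusion specific writing properties
struct WriteProperties {
    /// underlying arrow csv writer properties
    csv: CsvWriterProperties,
    compression: Compression,
}

pub async fn write_csv(self, path: &str, writer_properties: Option<WriteProperties>) -> Result<()>
```
This would be consistent with the parquet write API: https://docs.rs/datafusion/latest/datafusion/dataframe/struct.DataFrame.html#method.write_parquet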