Posted to github@arrow.apache.org by "oridag (via GitHub)" <gi...@apache.org> on 2023/06/11 08:00:46 UTC

[GitHub] [arrow-datafusion] oridag opened a new issue, #6630: allow writing compressed files

oridag opened a new issue, #6630:
URL: https://github.com/apache/arrow-datafusion/issues/6630

   ### Is your feature request related to a problem or challenge?
   
   It would be useful to support streaming compression for output file formats that, unlike Parquet, have no built-in compression spec.
   
   ### Describe the solution you'd like
   
   `DataFrame::write_json` and `DataFrame::write_csv` could take an optional `Compression` argument.
   
   ### Describe alternatives you've considered
   
   An alternative would be to post-process the output files and compress them, but that is less efficient than compressing the stream while writing.
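
To make the difference concrete, here is a minimal, std-only sketch of compressing while writing. It is not DataFusion code: `RleWriter` and `write_rows` are hypothetical names, and the toy run-length encoder merely stands in for a real codec such as gzip. The point is that the encoder wraps the output writer, so rows are encoded as they are produced and no uncompressed file ever needs to be written and re-read.

```rust
use std::io::{self, Write};

/// Toy run-length encoder standing in for a real codec (gzip, zstd, ...).
/// It emits (count, byte) pairs and holds only the current run in memory.
struct RleWriter<W: Write> {
    inner: W,
    run: Option<(u8, u8)>, // (byte, count) of the run being accumulated
}

impl<W: Write> RleWriter<W> {
    fn new(inner: W) -> Self {
        Self { inner, run: None }
    }

    /// Write the pending run, if any, to the underlying writer.
    fn flush_run(&mut self) -> io::Result<()> {
        if let Some((byte, count)) = self.run.take() {
            self.inner.write_all(&[count, byte])?;
        }
        Ok(())
    }

    /// Emit the last pending run and hand back the underlying writer.
    fn finish(mut self) -> io::Result<W> {
        self.flush_run()?;
        Ok(self.inner)
    }
}

impl<W: Write> Write for RleWriter<W> {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        for &b in buf {
            match self.run {
                // Same byte as the current run and room left in the counter.
                Some((byte, count)) if byte == b && count < u8::MAX => {
                    self.run = Some((byte, count + 1));
                }
                // Different byte (or full counter): emit the run, start a new one.
                _ => {
                    self.flush_run()?;
                    self.run = Some((b, 1));
                }
            }
        }
        Ok(buf.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        self.inner.flush()
    }
}

/// Hypothetical row writer: it only needs `Write`, so it works unchanged
/// whether `out` is a plain sink or a compressing wrapper.
fn write_rows<W: Write>(out: &mut W, rows: &[&str]) -> io::Result<()> {
    for row in rows {
        out.write_all(row.as_bytes())?;
        out.write_all(b"\n")?;
    }
    Ok(())
}

fn main() -> io::Result<()> {
    let rows = ["aaaa,bbbb", "aaaa,bbbb"];

    let mut plain = Vec::new();
    write_rows(&mut plain, &rows)?;

    let mut encoder = RleWriter::new(Vec::new());
    write_rows(&mut encoder, &rows)?;
    let encoded = encoder.finish()?;

    println!("plain: {} bytes, encoded: {} bytes", plain.len(), encoded.len());
    Ok(())
}
```

Because the row writer depends only on `Write`, switching codecs or turning compression off would not change the writing code, which is roughly what an optional `Compression` argument would select.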
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] allow writing compressed files [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #6630:
URL: https://github.com/apache/arrow-datafusion/issues/6630#issuecomment-1787415031

   Thanks for the ping @devinjdangelo 




Re: [I] allow writing compressed files [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb closed issue #6630: allow writing compressed files
URL: https://github.com/apache/arrow-datafusion/issues/6630




[GitHub] [arrow-datafusion] oridag commented on issue #6630: allow writing compressed files

Posted by "oridag (via GitHub)" <gi...@apache.org>.
oridag commented on issue #6630:
URL: https://github.com/apache/arrow-datafusion/issues/6630#issuecomment-1606725177

   @jiangzhx I'm not, go ahead.




Re: [I] allow writing compressed files [arrow-datafusion]

Posted by "devinjdangelo (via GitHub)" <gi...@apache.org>.
devinjdangelo commented on issue #6630:
URL: https://github.com/apache/arrow-datafusion/issues/6630#issuecomment-1786182560

   This is supported today in both SQL (see [write_options](https://arrow.apache.org/datafusion/user-guide/sql/write_options.html)) and DataFrames (see `DataFrameWriteOptions` parameter).
   
   @alamb I believe we can close this issue as complete. 




[GitHub] [arrow-datafusion] jiangzhx commented on issue #6630: allow writing compressed files

Posted by "jiangzhx (via GitHub)" <gi...@apache.org>.
jiangzhx commented on issue #6630:
URL: https://github.com/apache/arrow-datafusion/issues/6630#issuecomment-1609184216

   After doing some research, I found that PR #6526 allows you to write compressed data to a CSV file. However, currently it only works with SQL and not with DataFrames.
   
   https://github.com/apache/arrow-datafusion/blob/36292f63e3c7dfbd5bca77f5e36c2aa43048d2be/datafusion/core/src/datasource/file_format/csv.rs#L473-L521
   
   
   example:
   ```
   CREATE EXTERNAL TABLE test (
           c1  TINYINT NOT NULL,
           c2  SMALLINT NOT NULL
       )
   STORED AS CSV
   WITH HEADER ROW
   COMPRESSION TYPE GZIP
   LOCATION '/Users/sylar/workspace/opensource/arrow-datafusion/testing/data/csv/test.csv.gz';
   
   INSERT INTO test select * from test;
   ```
   




[GitHub] [arrow-datafusion] alamb commented on issue #6630: allow writing compressed files

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #6630:
URL: https://github.com/apache/arrow-datafusion/issues/6630#issuecomment-1610025223

   > After doing some research, I found that PR https://github.com/apache/arrow-datafusion/pull/6526 allows you to write compressed data to a CSV file. However, currently it only works with SQL and not with DataFrames.
   
   The more we can unify the SQL / DataFrame codepaths the better. Since SQL has only just started supporting writes, there is likely some non-trivial unification that would be helpful to do.




[GitHub] [arrow-datafusion] jiangzhx commented on issue #6630: allow writing compressed files

Posted by "jiangzhx (via GitHub)" <gi...@apache.org>.
jiangzhx commented on issue #6630:
URL: https://github.com/apache/arrow-datafusion/issues/6630#issuecomment-1606534410

   > I suggest we add some sort of `CsvWriteOption` structure to make the API more general. Perhaps something like
   > 
   > ```rust
   > /// DataFusion specific writing properties
   > struct WriteProperties {
   >   /// underlying arrow csv writer properties
   >   csv: CsvWriterProperties,
   >   compression: Compression
   > }
   > 
   > pub async fn write_csv(self, path: &str, writer_properties: Option<WriteProperties>) -> Result<()>
   > ```
   > 
   > This would be consistent with the parquet write API: https://docs.rs/datafusion/latest/datafusion/dataframe/struct.DataFrame.html#method.write_parquet
   
   @oridag are you working on this?
   If not, I will give it a try.




[GitHub] [arrow-datafusion] alamb commented on issue #6630: allow writing compressed files

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #6630:
URL: https://github.com/apache/arrow-datafusion/issues/6630#issuecomment-1590038340

   I suggest we add some sort of `CsvWriteOption` structure to make the API more general. Perhaps something like
   
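   ```rust
   /// DataFusion specific writing properties
   struct WriteProperties {
     /// underlying arrow csv writer properties
     csv: CsvWriterProperties,
     compression: Compression
   }
   
   pub async fn write_csv(self, path: &str, writer_properties: Option<WriteProperties>) -> Result<()>
   ```
   
   This would be consistent with the parquet write API: https://docs.rs/datafusion/latest/datafusion/dataframe/struct.DataFrame.html#method.write_parquet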

