Posted to github@arrow.apache.org by "ismail (via GitHub)" <gi...@apache.org> on 2023/03/20 20:00:26 UTC

[GitHub] [arrow-datafusion] ismail opened a new issue, #5657: Request for documentation for compressed CSV/JSON support

ismail opened a new issue, #5657:
URL: https://github.com/apache/arrow-datafusion/issues/5657

   Hi,
   
   Support for compressed CSV/JSON was added in https://github.com/apache/arrow-datafusion/commit/b8a3a78f833fae8faace8d7542a1fb3d7a497b6a. Trying to use it in this sample program
   
   ```
   use datafusion::prelude::*;
   use datafusion::datasource::file_format::file_type::FileCompressionType;
   
   #[tokio::main]
   async fn main() -> datafusion::error::Result<()> {
       let ctx = SessionContext::new();
       let csv_options = CsvReadOptions::default()
           .has_header(true)
           .file_compression_type(FileCompressionType::BZIP2);
       let df = ctx.read_csv("summary.csv.bz2", csv_options).await?;
       let df = df
           .filter(col("status").eq(lit("OK")))?
           .select_columns(&["name", "id"])?;
   
       df.show().await?;
       Ok(())
   }
   ```
   
   results in
   
   ```
   Error: SchemaError(FieldNotFound { field: Column { relation: None, name: "status" }, valid_fields: [] })
   ```
   
   The code works fine on the uncompressed CSV. Since the documentation for this feature is missing, I am wondering if I'm holding it wrong. I would appreciate it if the documentation could give an example of sample usage.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] ismail commented on issue #5657: Request for documentation for compressed CSV/JSON support

Posted by "ismail (via GitHub)" <gi...@apache.org>.
ismail commented on issue #5657:
URL: https://github.com/apache/arrow-datafusion/issues/5657#issuecomment-1500435274

   Sorry for the late reply. @jiangzhx I tested your branch directly, and it resolves the issue, thanks a lot!




[GitHub] [arrow-datafusion] jiangzhx commented on issue #5657: Request for documentation for compressed CSV/JSON support

Posted by "jiangzhx (via GitHub)" <gi...@apache.org>.
jiangzhx commented on issue #5657:
URL: https://github.com/apache/arrow-datafusion/issues/5657#issuecomment-1495432405

   @ismail 
   
   Following @Jefffrey's clue, I finally solved this problem.
   When the file name ends with "csv.bz2", the read options must also set .file_extension("csv.bz2").
   
   ```
       let csv_options = CsvReadOptions::default()
           .has_header(true)
           .file_compression_type(FileCompressionType::BZIP2)
           .file_extension("csv.bz2");
       let df = ctx
           .read_csv(&format!("{testdata}/csv/summary.csv.bz2"), csv_options)
           .await?;
       let df = df
           .filter(col("status").eq(lit("OK")))?
           .select_columns(&["name", "id"])?;
   
       df.show().await?;
   ```
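
The workaround above only works while `.file_extension(...)` and `.file_compression_type(...)` both agree with the file name. As a minimal sketch of keeping them in sync, the helpers below derive a compound extension such as `csv.bz2` and a codec from a path. Both function names and the `Compression` enum are hypothetical and local to this sketch; they are not part of DataFusion, and the list of codec suffixes is an assumption.

```rust
/// Hypothetical local enum for this sketch only (not DataFusion's FileCompressionType).
#[derive(Debug, PartialEq)]
enum Compression {
    Gzip,
    Bzip2,
    Xz,
    Uncompressed,
}

/// Returns the extension of the final path component without a leading dot.
/// If the last suffix is a known codec suffix, the format extension before it
/// is included as well, so "summary.csv.bz2" yields "csv.bz2".
fn compound_extension(path: &str) -> Option<&str> {
    // Only '/' is treated as a separator in this sketch.
    let name = path.rsplit('/').next()?;
    let last = name.rfind('.')?;
    // Assumed set of codec suffixes.
    let is_codec = matches!(&name[last + 1..], "gz" | "bz2" | "xz");
    let start = if is_codec {
        // Step back one more dot to pick up the format extension, if there is one.
        name[..last].rfind('.').unwrap_or(last)
    } else {
        last
    };
    Some(&name[start + 1..])
}

/// Maps the last suffix of an extension to a codec; anything else is uncompressed.
fn compression_for(ext: &str) -> Compression {
    match ext.rsplit('.').next() {
        Some("gz") => Compression::Gzip,
        Some("bz2") => Compression::Bzip2,
        Some("xz") => Compression::Xz,
        _ => Compression::Uncompressed,
    }
}
```

For `summary.csv.bz2`, `compound_extension` gives the string that would be passed to `.file_extension(...)`, and `compression_for` picks the variant that would be translated into the matching `FileCompressionType`.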




[GitHub] [arrow-datafusion] alamb closed issue #5657: Request for documentation for compressed CSV/JSON support

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb closed issue #5657: Request for documentation for compressed CSV/JSON support
URL: https://github.com/apache/arrow-datafusion/issues/5657




[GitHub] [arrow-datafusion] jiangzhx commented on issue #5657: Request for documentation for compressed CSV/JSON support

Posted by "jiangzhx (via GitHub)" <gi...@apache.org>.
jiangzhx commented on issue #5657:
URL: https://github.com/apache/arrow-datafusion/issues/5657#issuecomment-1501528510

   > Sorry for the late reply. @jiangzhx I tested your branch directly, and it resolves the issue, thanks a lot!
   
   You are welcome 😁




[GitHub] [arrow-datafusion] alamb commented on issue #5657: Request for documentation for compressed CSV/JSON support

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #5657:
URL: https://github.com/apache/arrow-datafusion/issues/5657#issuecomment-1490452786

   I am not sure what is going on here -- it would be great if someone could investigate further




[GitHub] [arrow-datafusion] ismail commented on issue #5657: Request for documentation for compressed CSV/JSON support

Posted by "ismail (via GitHub)" <gi...@apache.org>.
ismail commented on issue #5657:
URL: https://github.com/apache/arrow-datafusion/issues/5657#issuecomment-1486467447

   Here is a randomly generated example that shows the issue on my machine. Just rename it to `summary.csv.bz2`.
   
   [summary.csv.bz2.zip](https://github.com/apache/arrow-datafusion/files/11087333/summary.csv.bz2.zip)
   




[GitHub] [arrow-datafusion] alamb commented on issue #5657: Request for documentation for compressed CSV/JSON support

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #5657:
URL: https://github.com/apache/arrow-datafusion/issues/5657#issuecomment-1484898745

   I wonder if you could provide an example file?




[GitHub] [arrow-datafusion] Jefffrey commented on issue #5657: Request for documentation for compressed CSV/JSON support

Posted by "Jefffrey (via GitHub)" <gi...@apache.org>.
Jefffrey commented on issue #5657:
URL: https://github.com/apache/arrow-datafusion/issues/5657#issuecomment-1491561571

   The specific issue seems to be in this function:
   
   https://github.com/apache/arrow-datafusion/blob/667f19ebad216b7592af5a91b70a24fb21c3bb64/datafusion/core/src/datasource/listing/table.rs#L431-L444
   
   Because the file extension is `.csv.bz2` and not just `.csv`, the function doesn't list the file. The schema is then inferred from an empty list of files, which produces an empty schema.
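
The effect described above can be reproduced without DataFusion. The following is a simplified sketch, not the actual code from `table.rs`: `list_matching` and `infer_columns` are hypothetical stand-ins, the extension check is assumed to be a plain suffix match, and the column names are placeholders taken from the example in this issue.

```rust
/// Keeps only the paths that end with the configured extension.
/// A plain suffix match is an assumption of this sketch.
fn list_matching<'a>(paths: &[&'a str], file_extension: &str) -> Vec<&'a str> {
    paths
        .iter()
        .copied()
        .filter(|p| p.ends_with(file_extension))
        .collect()
}

/// Stand-in for schema inference: with no files to read, no columns are found.
fn infer_columns(files: &[&str]) -> Vec<String> {
    if files.is_empty() {
        return Vec::new();
    }
    // Placeholder header; a real reader would parse it from the first file.
    vec!["name".to_string(), "id".to_string(), "status".to_string()]
}
```

With the paths `["summary.csv.bz2"]` and the extension `.csv`, `list_matching` returns an empty list and `infer_columns` returns no columns, which matches the `valid_fields: []` in the reported error. With the extension `.csv.bz2`, the file is listed and the columns are found.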
   
   As a temporary workaround, I renamed the file from `summary.csv.bz2` to `summary.csv`. The file then seemed to be picked up properly, but the read ran into another issue:
   
   `Error: ArrowError(CsvError("decompression not finished but EOF reached"))`
   
   This specifically stems from here:
   
   https://github.com/apache/arrow-datafusion/blob/667f19ebad216b7592af5a91b70a24fb21c3bb64/datafusion/core/src/datasource/file_format/csv.rs#L208-L215
   
   I haven't looked into it much, but it seems similar to these issues:
   
   - https://github.com/apache/arrow-datafusion/issues/1736
   - https://github.com/apache/arrow-datafusion/issues/5041

