Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/10/10 20:02:56 UTC

[GitHub] [arrow-datafusion] Miyake-Diogo opened a new issue, #3783: Write csv not save all lines

Miyake-Diogo opened a new issue, #3783:
URL: https://github.com/apache/arrow-datafusion/issues/3783

   **Describe the bug**
   When I try to save a dataframe as CSV, only around 400K lines are saved, although the data has more than 1M lines.
   
   **To Reproduce**
   My code: 
   ``` rust
   use datafusion::prelude::*;
   use log::{debug, info, LevelFilter, trace};
   use crate::datapipeline::data_utils::*;
   pub mod datapipeline;
   use datafusion::logical_plan::when;
   
   use datafusion::arrow::datatypes::DataType::{Int64,Utf8};
   #[tokio::main]
   async fn main() -> datafusion::error::Result<()> {
     let ctx: SessionContext = SessionContext::new();
     let raw_fato_path: &str = "data/minilake/raw/fato_census/Data8277.csv";
     let stage_fato_path: &str = "data/minilake/stage/fato_census/";
     let fato_census_df = ctx.read_csv(raw_fato_path,  
                                     CsvReadOptions::new()).await?;
     
     let fato_census_df = fato_census_df.with_column("area",cast(
       col("Area"),
       Utf8))?;
   
     let fato_census_df = fato_census_df
       //.with_column("Area",concat_ws("-", &vec![lit("A"),col("Area")]))?
       .select(vec![
         col("Year").alias("year"),
         col("Age").alias("age"),
         col("Ethnic").alias("ethnic"),
         col("Sex").alias("sex"),
         col("Area").alias("area"),
         col("count").alias("total_count")
         ])?;
     
     // We can see the ..C values in Count column
     fato_census_df.show_limit(5).await?;
     print_schema_of_dataframe(&fato_census_df).await?;
     // Build an expression that replaces the "..C" placeholder with 0
     let transform_count_data = when(col("total_count")
       .eq(lit("..C")), lit(0_u32))
       .otherwise(col("total_count"))?;
   
     //Cast column datatype
     let fato_census_df = fato_census_df.with_column(
       "total_count",
       cast(transform_count_data, Int64))?;
     
     fato_census_df.write_csv(stage_fato_path).await?;
   
     Ok(())
   }
   ```
   Dataset: 
   
   [Age and sex by ethnic group (grouped total responses), for census usually resident population counts, 2006, 2013, and 2018 Censuses (RC, TA, SA2, DHB)](https://www3.stats.govt.nz/2018census/Age-sex-by-ethnic-group-grouped-total-responses-census-usually-resident-population-counts-2006-2013-2018-Censuses-RC-TA-SA2-DHB.zip?_ga=2.148542962.457556406.1664998127-985979153.1663098055)
   **Expected behavior**
   See all lines saved: 
   
   <img width="845" alt="image" src="https://user-images.githubusercontent.com/24550387/194943530-8082c81a-18d1-45be-89fc-df4e54bec121.png">
   
   
   But only this many lines are saved:
   <img width="790" alt="image" src="https://user-images.githubusercontent.com/24550387/194943662-df2a4f9a-b7cf-419f-a08f-d66b3c80eb08.png">
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] andygrove commented on issue #3783: Write csv not save all lines of dataframe

Posted by GitBox <gi...@apache.org>.

andygrove commented on issue #3783:
URL: https://github.com/apache/arrow-datafusion/issues/3783#issuecomment-1275273200

   @Miyake-Diogo The issue is that this error is happening:
   
   ```
   Error: ArrowError(ParseError("Error while parsing value CMB07601 for column 4 at line 431740"))
   ```
   
   I recommend specifying the schema for the file since it contains mixed types for this column.
   
   You did not see the error because of a bug that caused it to be ignored. The fix for that issue is in https://github.com/apache/arrow-datafusion/pull/3801
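   
   As a rough illustration of why a mixed-type column produces this kind of parse error, the sketch below models a reader that infers a column type from only a leading sample of rows and then parses the whole column under that type. It is a toy model in plain Rust, not DataFusion's implementation; `ColumnType`, `infer_type`, `parse_column`, and the sample values are hypothetical names and data chosen for illustration.
   
   ```rust
   // Illustrative only: a toy model of schema inference, not DataFusion's code.
   
   #[derive(Debug, PartialEq, Clone, Copy)]
   enum ColumnType {
       Int64,
       Utf8,
   }
   
   /// Infer a column's type from at most `max_records` leading values:
   /// Int64 if every sampled value parses as an integer, otherwise Utf8.
   fn infer_type(values: &[&str], max_records: usize) -> ColumnType {
       let all_ints = values
           .iter()
           .take(max_records)
           .all(|v| v.parse::<i64>().is_ok());
       if all_ints {
           ColumnType::Int64
       } else {
           ColumnType::Utf8
       }
   }
   
   /// Parse every value under the chosen type, reporting the first bad row.
   /// Returns the number of values accepted.
   fn parse_column(values: &[&str], ty: ColumnType) -> Result<usize, String> {
       for (row, v) in values.iter().enumerate() {
           if ty == ColumnType::Int64 && v.parse::<i64>().is_err() {
               return Err(format!("Error while parsing value {} at row {}", v, row));
           }
       }
       Ok(values.len())
   }
   
   fn main() {
       // Hypothetical "Area"-like column: numeric codes first, a text code later.
       let area = ["01", "02", "03", "CMB07601"];
   
       // Sampling only the first 3 rows concludes the column is Int64,
       // so parsing fails once the text code is reached.
       let inferred = infer_type(&area, 3);
       println!("inferred type: {:?}", inferred);
       println!("parse with inferred type: {:?}", parse_column(&area, inferred));
   
       // Declaring the column as Utf8 up front avoids the failure.
       println!("parse with explicit Utf8: {:?}", parse_column(&area, ColumnType::Utf8));
   }
   ```
   
   Declaring the column as a string type up front, as recommended above, sidesteps the bad guess because no sampling is involved.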
   



[GitHub] [arrow-datafusion] andygrove closed issue #3783: Write csv not save all lines of dataframe

Posted by GitBox <gi...@apache.org>.

andygrove closed issue #3783: Write csv not save all lines of dataframe
URL: https://github.com/apache/arrow-datafusion/issues/3783



[GitHub] [arrow-datafusion] andygrove commented on issue #3783: Write csv not save all lines of dataframe

Posted by GitBox <gi...@apache.org>.

andygrove commented on issue #3783:
URL: https://github.com/apache/arrow-datafusion/issues/3783#issuecomment-1296304358

   @Miyake-Diogo Apologies for the late reply, but the schema can be set in `CsvReadOptions`.
   
   The root issue of not writing all results was fixed in https://github.com/apache/arrow-datafusion/pull/3801
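   
   To illustrate how an ignored error can show up as a truncated output file, here is a minimal sketch in plain Rust. It is a toy model and an assumption about the general shape of such a bug, not the actual DataFusion code or the change made in the pull request; `write_ignoring_errors`, `write_propagating_errors`, and `sample_batches` are hypothetical names.
   
   ```rust
   // Illustrative only: a toy writer, not DataFusion's implementation.
   
   type Batch = Result<Vec<String>, String>;
   
   /// Buggy pattern: stop at the first failed batch and report success anyway.
   /// The caller sees a short output and no error.
   fn write_ignoring_errors(batches: Vec<Batch>) -> Result<usize, String> {
       let mut written = 0;
       for batch in batches {
           match batch {
               Ok(rows) => written += rows.len(),
               Err(_) => break, // the error is dropped here
           }
       }
       Ok(written)
   }
   
   /// Safer pattern: propagate the first error to the caller with `?`.
   fn write_propagating_errors(batches: Vec<Batch>) -> Result<usize, String> {
       let mut written = 0;
       for batch in batches {
           written += batch?.len();
       }
       Ok(written)
   }
   
   /// Hypothetical input: two good rows, a parse failure, then one more row.
   fn sample_batches() -> Vec<Batch> {
       vec![
           Ok(vec!["row1".to_string(), "row2".to_string()]),
           Err("Error while parsing value CMB07601".to_string()),
           Ok(vec!["row3".to_string()]),
       ]
   }
   
   fn main() {
       // Reports success with only the rows before the failure.
       println!("{:?}", write_ignoring_errors(sample_batches()));
       // Surfaces the parse error instead of a partial result.
       println!("{:?}", write_propagating_errors(sample_batches()));
   }
   ```
   
   In the first function the caller has no way to tell a complete write from a partial one, which matches the symptom reported in this issue: a file that simply ends early.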



[GitHub] [arrow-datafusion] Miyake-Diogo commented on issue #3783: Write csv not save all lines of dataframe

Posted by GitBox <gi...@apache.org>.

Miyake-Diogo commented on issue #3783:
URL: https://github.com/apache/arrow-datafusion/issues/3783#issuecomment-1297823260

   Don't worry @andygrove thanks for answering me.



[GitHub] [arrow-datafusion] Miyake-Diogo commented on issue #3783: Write csv not save all lines of dataframe

Posted by GitBox <gi...@apache.org>.

Miyake-Diogo commented on issue #3783:
URL: https://github.com/apache/arrow-datafusion/issues/3783#issuecomment-1275380610

   Hi @andygrove, all the code is in this repo: https://gitlab.com/miyake-diogo/rust-big-data-playground
   How can I specify the schema on read? I couldn't find any example in the documentation.



[GitHub] [arrow-datafusion] andygrove commented on issue #3783: Write csv not save all lines of dataframe

Posted by GitBox <gi...@apache.org>.

andygrove commented on issue #3783:
URL: https://github.com/apache/arrow-datafusion/issues/3783#issuecomment-1275193291

   Here is a smaller repro case:
   
   ``` rust
   use datafusion::prelude::*;
   
   #[tokio::main]
   async fn main() -> datafusion::error::Result<()> {
       let ctx: SessionContext = SessionContext::new();
       let raw_fato_path: &str = "/mnt/bigdata/census/Data8277.csv";
       let stage_fato_path: &str = "/tmp/stage";
       let fato_census_df = ctx.read_csv(raw_fato_path, CsvReadOptions::new()).await?;
       fato_census_df.write_csv(stage_fato_path).await?;
       Ok(())
   }
   ```
   
   ```
   $ wc -l /tmp/stage/part-0.csv 
   425985 /tmp/stage/part-0.csv
   ```
   
   I tested with DataFusion 11, 12, and 13, and all of them have the same issue.



[GitHub] [arrow-datafusion] andygrove commented on issue #3783: Write csv not save all lines of dataframe

Posted by GitBox <gi...@apache.org>.

andygrove commented on issue #3783:
URL: https://github.com/apache/arrow-datafusion/issues/3783#issuecomment-1275173481

   @Miyake-Diogo So `part-0.csv` only has 400k lines but were there other csv files? 
   
   I tried running this code but it has dependencies that are not here:
   
   ```
   error[E0583]: file not found for module `datapipeline`
    --> src/main.rs:4:1
     |
   4 | pub mod datapipeline;
   ```
   
   Do you have this code in GitHub somewhere? I am happy to help debug if you have a public repro case.

