Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/03/21 01:37:03 UTC

[GitHub] [arrow-datafusion] alitrack opened a new issue #2044: wrong result when operation parquet

alitrack opened a new issue #2044:
URL: https://github.com/apache/arrow-datafusion/issues/2044


   **Describe the bug**
   When a Parquet file is registered with `register_parquet`, datetime columns come back with wrong values. Registering the same data with `register_csv` works without problems.
   Reading the file into a pandas DataFrame and registering it with `register_record_batches` is also OK.
   
   **To Reproduce**
   Steps to reproduce the behavior:
   ```python
   import datafusion
   import pyarrow as pa
   
   ctx = datafusion.ExecutionContext()
   ctx.register_parquet('taxi_sample','yellow_taxi_sample.parquet')
   sql = "select * from taxi_sample"
   pydf = ctx.sql(sql)
   pa.Table.from_batches(pydf.collect()).to_pandas()
   ```
   
   **Expected behavior**
   Expected result:
   
   ```
   	pickup_datetime
   0	2009-01-04 02:52:00
   1	2009-01-04 03:31:00
   2	2009-01-03 15:43:00
   ```
   
   Actual result:
   ```
   pickup_datetime
   0	1970-01-15 05:57:17.520
   1	1970-01-15 05:57:19.860
   2	1970-01-15 05:56:37.380
   ```
   
   **Additional context**
   The sample data is part of [Year 2009-2015 - 1 billion rows - 107GB](https://vaex.s3.us-east-2.amazonaws.com/taxi/yellow_taxi_2009_2015_f32.hdf5).
   
   [yellow_taxi_sample.parquet.zip](https://github.com/apache/arrow-datafusion/files/8312598/yellow_taxi_sample.parquet.zip)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] jiangzhx commented on issue #2044: wrong result when operation parquet

Posted by GitBox <gi...@apache.org>.

jiangzhx commented on issue #2044:
URL: https://github.com/apache/arrow-datafusion/issues/2044#issuecomment-1084095786


   > @alitrack, the issue may be caused by "ARROW:schema" key-value pair in .parquet metadata - it contains schema which treats pickup/dropoff_datatime fields as Timestamp(Nanosecond) instead of Timestamp(Microseconds) in actual file schema. I suppose removing this tag from file metadata should help.
   
   I did more research and read the Parquet metadata with `parquet = { version = "9.0.0" }`.
   The value of the `ARROW:schema` key is base64-encoded.
   
   @korowa was right: the data type of the `pickup_datetime` column is `datetime64[ns]`.
   
   ```
   {
       "name": "pickup_datetime",
       "field_name": "pickup_datetime",
       "pandas_type": "datetime",
       "numpy_type": "datetime64[ns]",
       "metadata": null
   },
   ```
   
   
   
   
   
   


[GitHub] [arrow-datafusion] jiangzhx edited a comment on issue #2044: wrong result when operation parquet

Posted by GitBox <gi...@apache.org>.

jiangzhx edited a comment on issue #2044:
URL: https://github.com/apache/arrow-datafusion/issues/2044#issuecomment-1084142721


   I'm confused.
   
   The `print_row_with_parquet` test case gets the right result:
   pickup_datetime:2009-01-04 02:52:00 +00:00
   
   The `print_row_with_datafusion` test case gets the wrong result:
   pickup_datetime:1970-01-15 05:57:17.520
   
   
   ```Rust
   use datafusion::error::Result;
   use datafusion::prelude::ExecutionContext;
   use std::convert::TryFrom;
   use std::fs::File;
   use std::path::Path;
   
   use parquet::file::reader::FileReader;
   use parquet::file::serialized_reader::SerializedFileReader;
   
   #[tokio::test]
   async fn print_row_with_parquet() -> Result<()> {
   	let path = Path::new("yellow_taxi_sample.parquet");
   	let row_iter = SerializedFileReader::try_from(path).unwrap().into_iter();
   
   	for row in row_iter {
   		println!("{}", row);
   	}
   	Ok(())
   }
   
   #[tokio::test]
   async fn print_row_with_datafusion() -> Result<()> {
   	let mut ctx = ExecutionContext::new();
   	ctx.register_parquet("taxi_sample", "yellow_taxi_sample.parquet")
   		.await?;
   	let df = ctx.sql("SELECT * from taxi_sample").await?;
   	df.show().await?;
   
   	Ok(())
   }
   
   
   ```


[GitHub] [arrow-datafusion] jiangzhx commented on issue #2044: wrong result when operation parquet

Posted by GitBox <gi...@apache.org>.

jiangzhx commented on issue #2044:
URL: https://github.com/apache/arrow-datafusion/issues/2044#issuecomment-1073636705


   I also tested with the Python package, version 0.5.1;
   the query result is the same as with the Rust DataFusion version.
   


[GitHub] [arrow-datafusion] jiangzhx commented on issue #2044: wrong result when operation parquet

Posted by GitBox <gi...@apache.org>.

jiangzhx commented on issue #2044:
URL: https://github.com/apache/arrow-datafusion/issues/2044#issuecomment-1073685318


       import pandas as pd
       df = pd.read_parquet('yellow_taxi_sample.parquet')
       #df = pd.read_parquet('yellow_taxi_sample.parquet',engine='pyarrow')
       print(df.head())
   
   This prints the correct result:
   
   <img width="1257" alt="image" src="https://user-images.githubusercontent.com/494507/159236046-690e9a69-8cc1-456b-9238-f44b518975da.png">
   
   
   
   


[GitHub] [arrow-datafusion] jiangzhx commented on issue #2044: wrong result when operation parquet

Posted by GitBox <gi...@apache.org>.

jiangzhx commented on issue #2044:
URL: https://github.com/apache/arrow-datafusion/issues/2044#issuecomment-1073634377


   I tested your Parquet file [yellow_taxi_sample.parquet.zip](https://github.com/apache/arrow-datafusion/files/8312598/yellow_taxi_sample.parquet.zip)
   with the master branch of Rust DataFusion.
   
   Query response:
   pickup_datetime
   0	1970-01-15 05:57:17.520
   1	1970-01-15 05:57:19.860
   2	1970-01-15 05:56:37.380
   
   
   Is the Parquet file itself correct?


[GitHub] [arrow-datafusion] korowa commented on issue #2044: wrong result when operation parquet

Posted by GitBox <gi...@apache.org>.

korowa commented on issue #2044:
URL: https://github.com/apache/arrow-datafusion/issues/2044#issuecomment-1074835005


   @alitrack, I guess the issue is caused by the "ARROW:schema" key-value pair in the .parquet metadata: it contains a schema that treats the pickup/dropoff_datetime fields as Timestamp(Nanosecond), whereas the actual file schema stores them as Timestamp(Microsecond). I suppose removing this key from the file metadata should help.
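
The unit-mismatch explanation can be sanity-checked with plain arithmetic. The sketch below is not DataFusion code; the helper name `misread_micros_as_nanos` is made up for illustration, and it assumes the stored values are microseconds since the Unix epoch in UTC. It takes the three expected timestamps from the report, keeps the same integer, and reinterprets it as a nanosecond count.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def misread_micros_as_nanos(ts):
    """Return what `ts` looks like if its microsecond count is read as nanoseconds."""
    micros = (ts - EPOCH) // timedelta(microseconds=1)
    # Same integer, but each unit now counts as 1 ns, so the value shrinks 1000x.
    return EPOCH + timedelta(microseconds=micros // 1000)


for text in ("2009-01-04 02:52:00", "2009-01-04 03:31:00", "2009-01-03 15:43:00"):
    expected = datetime.strptime(text, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
    wrong = misread_micros_as_nanos(expected)
    print(text, "->", wrong.strftime("%Y-%m-%d %H:%M:%S.%f")[:-3])
```

The values this prints (1970-01-15 05:57:17.520, 05:57:19.860 and 05:56:37.380) are the same as the wrong values in the report, which supports the microsecond/nanosecond explanation.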


[GitHub] [arrow-datafusion] jiangzhx edited a comment on issue #2044: wrong result when operation parquet

Posted by GitBox <gi...@apache.org>.

jiangzhx edited a comment on issue #2044:
URL: https://github.com/apache/arrow-datafusion/issues/2044#issuecomment-1084142721


   I'm confused.
   
   The `print_row_with_parquet` test case gets the right result:
   pickup_datetime:2009-01-04 02:52:00 +00:00
   
   The `print_row_with_datafusion` test case gets the wrong result:
   pickup_datetime:1970-01-15 05:57:17.520
   
   
   ```Rust
   use datafusion::error::Result;
   use datafusion::prelude::ExecutionContext;
   use std::convert::TryFrom;
   use std::fs::File;
   use std::path::Path;
   
   use parquet::file::reader::FileReader;
   use parquet::file::serialized_reader::SerializedFileReader;
   
   #[tokio::test]
   async fn print_row_with_parquet() -> Result<()> {
   	let path = Path::new("yellow_taxi_sample.parquet");
   	let row_iter = SerializedFileReader::try_from(path).unwrap().into_iter();
   
   	for row in row_iter {
   		let s = row.to_string();
   		println!("{}", s);
   	}
   	Ok(())
   }
   
   #[tokio::test]
   async fn print_row_with_datafusion() -> Result<()> {
   	let mut ctx = ExecutionContext::new();
   	ctx.register_parquet("taxi_sample", "yellow_taxi_sample.parquet")
   		.await?;
   	let df = ctx.sql("SELECT * from taxi_sample").await?;
   	df.show().await?;
   
   	Ok(())
   }
   
   
   
   
   ```


[GitHub] [arrow-datafusion] alitrack commented on issue #2044: wrong result when operation parquet

Posted by GitBox <gi...@apache.org>.

alitrack commented on issue #2044:
URL: https://github.com/apache/arrow-datafusion/issues/2044#issuecomment-1073605603


   @jiangzhx  0.5.1
   


[GitHub] [arrow-datafusion] alitrack commented on issue #2044: wrong result when operation parquet

Posted by GitBox <gi...@apache.org>.

alitrack commented on issue #2044:
URL: https://github.com/apache/arrow-datafusion/issues/2044#issuecomment-1074557275


   Yes, but yellow_taxi_2009_2015_f32.parquet is about 28 GB, so I want to use `register_parquet` directly rather than reading the file with pandas or vaex first.


[GitHub] [arrow-datafusion] jiangzhx commented on issue #2044: wrong result when operation parquet

Posted by GitBox <gi...@apache.org>.

jiangzhx commented on issue #2044:
URL: https://github.com/apache/arrow-datafusion/issues/2044#issuecomment-1073600418


   @alitrack, which version are you using?


[GitHub] [arrow-datafusion] alitrack commented on issue #2044: wrong result when operation parquet

Posted by GitBox <gi...@apache.org>.

alitrack commented on issue #2044:
URL: https://github.com/apache/arrow-datafusion/issues/2044#issuecomment-1073644144


   Please try pandas, pyarrow, or vaex; they all give the same (correct) result:
   
   
   ```python
   import pandas as pd
   #pd.read_parquet("yellow_taxi_sample.parquet", engine='pyarrow')
   pd.read_parquet("yellow_taxi_sample.parquet", engine='fastparquet')
   ```
   
   
   


[GitHub] [arrow-datafusion] jiangzhx commented on issue #2044: wrong result when operation parquet

Posted by GitBox <gi...@apache.org>.

jiangzhx commented on issue #2044:
URL: https://github.com/apache/arrow-datafusion/issues/2044#issuecomment-1084142721


   I'm confused.
   
   The `print_row_with_parquet` test case gets the right result:
   pickup_datetime:2009-01-04 02:52:00 +00:00
   
   The `print_row_with_datafusion` test case gets the wrong result:
   pickup_datetime:1970-01-15 xx:xx:xxxx
   
   
   ```Rust
   use datafusion::error::Result;
   use datafusion::prelude::ExecutionContext;
   use std::convert::TryFrom;
   use std::fs::File;
   use std::path::Path;
   
   use parquet::file::reader::FileReader;
   use parquet::file::serialized_reader::SerializedFileReader;
   
   #[tokio::test]
   async fn print_row_with_parquet() -> Result<()> {
   	let path = Path::new("yellow_taxi_sample.parquet");
   	let row_iter = SerializedFileReader::try_from(path).unwrap().into_iter();
   
   	for row in row_iter {
   		println!("{}", row);
   	}
   	Ok(())
   }
   
   #[tokio::test]
   async fn print_row_with_datafusion() -> Result<()> {
   	let mut ctx = ExecutionContext::new();
   	ctx.register_parquet("taxi_sample", "yellow_taxi_sample.parquet")
   		.await?;
   	let df = ctx.sql("SELECT * from taxi_sample").await?;
   	df.show().await?;
   
   	Ok(())
   }
   
   
   ```


[GitHub] [arrow-datafusion] tustvold commented on issue #2044: wrong result when operation parquet

Posted by GitBox <gi...@apache.org>.

tustvold commented on issue #2044:
URL: https://github.com/apache/arrow-datafusion/issues/2044#issuecomment-1085943742


   This is likely related to https://github.com/apache/arrow-rs/issues/1459


[GitHub] [arrow-datafusion] korowa edited a comment on issue #2044: wrong result when operation parquet

Posted by GitBox <gi...@apache.org>.

korowa edited a comment on issue #2044:
URL: https://github.com/apache/arrow-datafusion/issues/2044#issuecomment-1074835005


   @alitrack, the issue may be caused by the "ARROW:schema" key-value pair in the .parquet metadata: it contains a schema that treats the pickup/dropoff_datetime fields as Timestamp(Nanosecond), whereas the actual file schema stores them as Timestamp(Microsecond). I suppose removing this key from the file metadata should help.

