You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/07/14 11:58:06 UTC

[GitHub] [arrow] jorgecarleitao opened a new pull request #7751: ARROW-9461: [Rust] Fixed error in reading Date32 and Date64.

jorgecarleitao opened a new pull request #7751:
URL: https://github.com/apache/arrow/pull/7751


   This caused an error as the column.schema() became different from the expected schema, as the conversion was incorrect.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] sunchao closed pull request #7751: ARROW-9461: [Rust] Fixed error in reading Date32 and Date64.

Posted by GitBox <gi...@apache.org>.
sunchao closed pull request #7751:
URL: https://github.com/apache/arrow/pull/7751


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #7751: ARROW-9461: [Rust] Fixed error in reading Date32 and Date64.

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #7751:
URL: https://github.com/apache/arrow/pull/7751#issuecomment-658141485


   https://issues.apache.org/jira/browse/ARROW-9461


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao edited a comment on pull request #7751: ARROW-9461: [Rust] Fixed error in reading Date32 and Date64.

Posted by GitBox <gi...@apache.org>.
jorgecarleitao edited a comment on pull request #7751:
URL: https://github.com/apache/arrow/pull/7751#issuecomment-660449463


   @sunchao , I was able to partially test the date64: I tested the conversion from i64 to date64 on the converters. Does not cover 100%, since it does not cover the `array_reader`, but it helps. I also added a test for the value conversion of date32.
   
   I also found how to test this in the `array_reader`; it just takes time because data64 cannot be tested by the macro `test_primitive_array_reader_one_type`: it requires us to use `ComplexObjectArrayReader::<Int64Type, Date64Converter>` instead of `PrimitiveArrayReader::<$arrow_parquet_type>` (used in `test_primitive_array_reader_one_type`).
   
   I would note that we currently do not test any type conversion on `array_reader` apart from the String and structs: the tests use the `PrimitiveArrayReader`.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao edited a comment on pull request #7751: ARROW-9461: [Rust] Fixed error in reading Date32 and Date64.

Posted by GitBox <gi...@apache.org>.
jorgecarleitao edited a comment on pull request #7751:
URL: https://github.com/apache/arrow/pull/7751#issuecomment-660449463


   @sunchao , I was able to partially test the date64: I tested the conversion from i64 to date64 on the converters. Does not cover 100%, since it does not cover the array_reader, but it helps. I also added a test for the value conversion of date32.
   
   I also found how to test this in the `array_reader`; it just takes time because data64 cannot be tested by the macro `test_primitive_array_reader_one_type´: it requires us to use `ComplexObjectArrayReader::<Int64Type, Date64Converter>` instead of `PrimitiveArrayReader::<$arrow_parquet_type>` (used in `test_primitive_array_reader_one_type`).
   
   I would note that we currently do not test any type conversion on `array_reader` apart from the String and structs: the tests use the `PrimitiveArrayReader`.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] sunchao commented on pull request #7751: ARROW-9461: [Rust] Fixed error in reading Date32 and Date64.

Posted by GitBox <gi...@apache.org>.
sunchao commented on pull request #7751:
URL: https://github.com/apache/arrow/pull/7751#issuecomment-658968604


   > Date64 is an arrow specific format, and thus to test it we need to have a file in parquet with that format, since we do not have a writer yet to play around.
   
   I don't fully understand this. You mean we need Parquet file with the Date64 type? 
   
   To my understanding, there is no conversion from Parquet type to Arrow's Date64 type, therefore the conversion clause may never happen. See [here](https://github.com/apache/arrow/blob/master/rust/parquet/src/arrow/schema.rs#L409) for Rust and [here](https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/schema_internal.cc#L161) for C++


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #7751: ARROW-9461: [Rust] Fixed error in reading Date32 and Date64.

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #7751:
URL: https://github.com/apache/arrow/pull/7751#issuecomment-660449463


   @sunchao , I was able to partially test the date64: I tested the conversion from i64 to date64 on the converters. Does not cover 100%, since it does not cover the array_reader, but it helps. I also added a test for the value conversion of date32.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #7751: ARROW-9461: [Rust] Fixed error in reading Date32 and Date64.

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #7751:
URL: https://github.com/apache/arrow/pull/7751#issuecomment-659168585


   I filed this under a separate issue for Python,C++ [here](https://issues.apache.org/jira/browse/ARROW-9502).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] sunchao commented on pull request #7751: ARROW-9461: [Rust] Fixed error in reading Date32 and Date64.

Posted by GitBox <gi...@apache.org>.
sunchao commented on pull request #7751:
URL: https://github.com/apache/arrow/pull/7751#issuecomment-660286242


   Right. I think this case is impossible as Parquet don't support it. Also the change you made for this case:
   ```rust
                   (ArrowType::Date64(unit), PhysicalType::INT64) => match unit {
                       DateUnit::Millisecond => Date64Converter::new()
                           .convert(self.record_reader.cast::<Int64Type>()),
                       _ => Err(general_err!("No conversion from parquet type to arrow type for date with unit {:?}", unit)),
   ```
   will never happen as we'll never convert Parquet's INT64 to Arrow's Date64. I think we just need the change for Date32.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #7751: ARROW-9461: [Rust] Fixed error in reading Date32 and Date64.

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #7751:
URL: https://github.com/apache/arrow/pull/7751#issuecomment-660503558


   @sunchao, Yeah, re-reading the parquet specification, I agree with you. I will change this to throw an error for the Data64 branch. Sorry for the confusion and thank you for the persistence.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #7751: ARROW-9461: [Rust] Fixed error in reading Date32 and Date64.

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #7751:
URL: https://github.com/apache/arrow/pull/7751#issuecomment-658854948


   Thanks @andygrove and @sunchao for taking the time to look at this.
   
   I had to [change an existing test](https://github.com/apache/arrow/pull/7751/files#diff-fca053527b552a9f2abb1e280e570cb6R1165) for this to pass the tests, and thus understood that this was already covered.
   
   My understanding is that the current test is incorrect because it asserts that the logical type string `"DATE"` is represented by the logical type `UInt32` and native type `u32` -- this PR changes this assert so that `"DATE"` is represented by `Date32`, of native type `i32`, to be consistent with the statement
   
   ```
   make_type!(Date32Type, i32, DataType::Date32(DateUnit::Day), 32, 0i32);
   ```
   
   that we use to declare the DateType on our (in-memory representation) side.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #7751: ARROW-9461: [Rust] Fixed error in reading Date32 and Date64.

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #7751:
URL: https://github.com/apache/arrow/pull/7751#issuecomment-660445062


   I understood that many of the arrow data types are written to parquet with different types, and are casted to the correct type on read/write.
   
   IMO the reason we cannot test it is because the current C++ does not write date64 as a 64 bit int in parquet, and thus we have no way of constructing a 64int column in parquet with data type date64.
   
   However, if another implementation writes a date64 to parquet, we need a method to read from it and AFAI can tell, the previous code is wrong. IMO this PR contains the most correct code to handle date64 - it is just untestable atm, like the previous code. So, between a certainly wrong and untested code and a potentially correct but untested code, I would go for the latter :P
   
   We could also just throw an error, but we also can't test that because that branch is unreachable in tests. ^_^
   
   I rest my case, and I will change the code to whatever version you prefer, @sunchao  :-)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] sunchao commented on pull request #7751: ARROW-9461: [Rust] Fixed error in reading Date32 and Date64.

Posted by GitBox <gi...@apache.org>.
sunchao commented on pull request #7751:
URL: https://github.com/apache/arrow/pull/7751#issuecomment-660496848


   Thanks @jorgecarleitao for being patient with me and updating the test :)
   
   > IMO the reason we cannot test it is because the current C++ does not write date64 as a 64 bit int in parquet, and thus we have no way of constructing a 64int column in parquet with data type date64.
   
   This is the part I don't understand. Since Parquet only support annotate DATE to int32 type according to its specification, how can you write a date64 as int64 in Parquet: it has to be annotated with something to indicate it is a DATE type so that when  Arrow converts a int64 type from Parquet it knows what type in Arrow it should go to (int64, decimal, timestamp, etc).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] sunchao commented on a change in pull request #7751: ARROW-9461: [Rust] Fixed error in reading Date32 and Date64.

Posted by GitBox <gi...@apache.org>.
sunchao commented on a change in pull request #7751:
URL: https://github.com/apache/arrow/pull/7751#discussion_r455189234



##########
File path: rust/parquet/src/arrow/array_reader.rs
##########
@@ -196,11 +196,13 @@ impl<T: DataType> ArrayReader for PrimitiveArrayReader<T> {
                         .convert(self.record_reader.cast::<Int64Type>()),
                     _ => Err(general_err!("No conversion from parquet type to arrow type for timestamp with unit {:?}", unit)),
                 },
-                (ArrowType::Date32(_), PhysicalType::INT32) => {
-                    UInt32Converter::new().convert(self.record_reader.cast::<Int32Type>())
+                (ArrowType::Date32(unit), PhysicalType::INT32) => match unit {
+                    DateUnit::Day => Date32Converter::new().convert(self.record_reader.cast::<Int32Type>()),

Review comment:
       nit: long line. Please limit width to 90 chars. Same below.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #7751: ARROW-9461: [Rust] Fixed error in reading Date32 and Date64.

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #7751:
URL: https://github.com/apache/arrow/pull/7751#issuecomment-659041261


   > You mean we need Parquet file with the Date64 type?
   
   Yes. That would be the most reliable way to test this. Something equivalent to the Rust's counter-part of:
   
   ```python
   import datetime
   import pyarrow as pa
   import pyarrow.parquet
   import os.path
   
   data = [
       datetime.datetime(2000, 1, 1, 12, 34, 56, 123456), 
       datetime.datetime(2000, 1, 1)
   ]
   
   data32 = pa.array(data, type='date32')
   data64 = pa.array(data, type='date64')
   table = pyarrow.Table.from_arrays([data32, data64], names=['a', 'b'])
   
   pyarrow.parquet.write_table(table, 'a.parquet')
   ```
   
   >  there is no conversion from Parquet type to Arrow's Date64 type, therefore the conversion clause may never happen. See here for Rust and here for C++
   
   Well, I do not know the details, but the fact that when I run
   
   ```python
   print(table)
   print()
   print(pyarrow.parquet.read_table('a.parquet'))
   ```
   
   I get 
   
   ```
   pyarrow.Table
   a: date32[day]
   b: date64[ms]
   
   pyarrow.Table
   a: date32[day]
   b: date32[day]  <----------- why?
   ```
   
   does not inspire me much confidence in the current C++/Python implementation of this particular type. I sense that there is a bug in the current C++'s implementation of read or write to parquet for this type.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] sunchao commented on pull request #7751: ARROW-9461: [Rust] Fixed error in reading Date32 and Date64.

Posted by GitBox <gi...@apache.org>.
sunchao commented on pull request #7751:
URL: https://github.com/apache/arrow/pull/7751#issuecomment-658580751


   Thanks @jorgecarleitao . LGTM but some simple tests will be appreciated.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #7751: ARROW-9461: [Rust] Fixed error in reading Date32 and Date64.

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #7751:
URL: https://github.com/apache/arrow/pull/7751#issuecomment-658906785


   > Thanks @jorgecarleitao . Great that the `Date32` is covered. Is it possible to add a test for `Date64` as well or it is also covered? Ideally we want to test the error case as well but not sure how difficult it is to come up with such a test.
   
   I tried, but I admit that I am not very familiar with the code base nor the theory, so you will have to help me out here.
   
   My current hypothesis is that parquet's logical type for dates is "DATE" annotated as a i32 [as per perquet-format](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#date). Date64 is an arrow specific format, and thus to test it we need to have a file in parquet with that format, since we do not have a writer yet to play around. We could do it in memory, but then I was unable to find a good example in the tests to base myself on.
   
   I was hoping to find a place in the code where we do a serialize-deserialize roundtrip and assert that the result is the same for *every* type that we declare. I have been doing this on a package that I have been writing, [here](https://github.com/jorgecarleitao/datafusion-python/blob/master/tests/test_sql.py#L202), and that is where I found the bug, but I can't find a place in `arrow/rust` where we are performing this test consistently. Do you know where I can find this?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] andygrove commented on pull request #7751: ARROW-9461: [Rust] Fixed error in reading Date32 and Date64.

Posted by GitBox <gi...@apache.org>.
andygrove commented on pull request #7751:
URL: https://github.com/apache/arrow/pull/7751#issuecomment-658333275


   @jorgecarleitao Could you add a unit test?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #7751: ARROW-9461: [Rust] Fixed error in reading Date32 and Date64.

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #7751:
URL: https://github.com/apache/arrow/pull/7751#discussion_r455229742



##########
File path: rust/parquet/src/arrow/array_reader.rs
##########
@@ -196,11 +196,13 @@ impl<T: DataType> ArrayReader for PrimitiveArrayReader<T> {
                         .convert(self.record_reader.cast::<Int64Type>()),
                     _ => Err(general_err!("No conversion from parquet type to arrow type for timestamp with unit {:?}", unit)),
                 },
-                (ArrowType::Date32(_), PhysicalType::INT32) => {
-                    UInt32Converter::new().convert(self.record_reader.cast::<Int32Type>())
+                (ArrowType::Date32(unit), PhysicalType::INT32) => match unit {
+                    DateUnit::Day => Date32Converter::new().convert(self.record_reader.cast::<Int32Type>()),

Review comment:
       Fixed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] sunchao commented on pull request #7751: ARROW-9461: [Rust] Fixed error in reading Date32 and Date64.

Posted by GitBox <gi...@apache.org>.
sunchao commented on pull request #7751:
URL: https://github.com/apache/arrow/pull/7751#issuecomment-659329634


   @jorgecarleitao thanks! seems this is not a bug in C++ as Parquet only uses int32 to represent date while Arrow uses both int32 and int64. Therefore, when you write a int64 date from Arrow to Parquet and then read it back to Parquet, it first convert it to a int32 in Parquet with logical type "DATE", and then read it back to Arrow as int32 date, which matches to what you observed above. 
   
   I also think there is no precision loss in the transformation since milliseconds in the int64 Arrow date must be evenly divisible by 86400000 (milliseconds in a day).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #7751: ARROW-9461: [Rust] Fixed error in reading Date32 and Date64.

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #7751:
URL: https://github.com/apache/arrow/pull/7751#issuecomment-659971852


   Well, then isn't it fair to say that we have no way of testing that we are correctly reading from date64? we have no way of writing a parquet file with i64 representing a date - every date is written to parquet as i32.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org