You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/10/30 15:39:52 UTC

[GitHub] [arrow-rs] pacman82 opened a new issue, #2984: Implicity setting of converted type is missing then setting

pacman82 opened a new issue, #2984:
URL: https://github.com/apache/arrow-rs/issues/2984

   **Describe the bug**
   <!--
   A clear and concise description of what the bug is.
   -->
   Parquet converted type is not set implicitly then logical type is set to timestamp.
   
   **To Reproduce**
   <!--
   Steps to reproduce the behavior:
   -->
   Write a parquet file with a 64Bit Integer primitive array column and logical type timestamp of milliseconds. Inspecting the output reveals missing converted type. Code snippet declaring the primitive array with the erroneous behaviour:
   
   ```rust
   Type::primitive_type_builder(name, PhysicalType::INT64)
               .with_logical_type(Some(LogicalType::Timestamp {
                   is_adjusted_to_u_t_c: false,
                   unit: precision_to_time_unit(self.precision),
               }))
               .with_repetition(self.repetition)
               .build()
               .unwrap()
   ```
   
   **Expected behavior**
   <!--
   A clear and concise description of what you expected to happen.
   -->
   According to the documentation here the converted type is supposed to be populated automatically then setting the logical type. <https://docs.rs/parquet/latest/parquet/schema/types/struct.PrimitiveTypeBuilder.html#method.with_logical_type>
   
   **Additional context**
   <!--
   Add any other context about the problem here.
   -->
   This problem occured in upstream `odbc2parquet` then users queried `DATETIME` columns from a Microsoft Database. If a consumer of the written parquet file relies on the converted type, it is empty. See this issue: <https://github.com/pacman82/odbc2parquet/issues/284>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] pacman82 commented on issue #2984: Implicity setting of converted type is missing then setting

Posted by GitBox <gi...@apache.org>.
pacman82 commented on issue #2984:
URL: https://github.com/apache/arrow-rs/issues/2984#issuecomment-1302308902

   @tustvold Would you prefer I would open a new bug ticket for this and leaving this one closed? My initial assumption that it is not set in the parquet metadata is falsified by the test you have written. Still the converted type does not show up in the written file, at least according to `parquet-tools`. Could it be that the `parquet` crate is inferring the `converted-type` from the logic type then reading the file, but of course an older reader relying on the converted type being actually written in the file could not do that? Just a guess on my side.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on issue #2984: Implicity setting of converted type is missing then setting

Posted by GitBox <gi...@apache.org>.
tustvold commented on issue #2984:
URL: https://github.com/apache/arrow-rs/issues/2984#issuecomment-1296336652

   The PrimitiveTypeBuilder shoudl add the converted type automatically when applicable - https://github.com/apache/arrow-rs/blob/master/parquet/src/schema/types.rs#L307.
   
   However, according to the [format docs](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#deprecated-time-convertedtype).
   
   > TIME_MILLIS is the deprecated ConvertedType counterpart of a TIME logical type that is UTC normalized and has MILLIS precision. Like the logical type counterpart, it must annotate an int32.
   >
   > TIME_MICROS is the deprecated ConvertedType counterpart of a TIME logical type that is UTC normalized and has MICROS precision. Like the logical type counterpart, it must annotate an int64.
   
   Are any of the following possible:
   
   * The time unit is nanoseconds
   * The timeunit is milliseconds but being written as Int64
   * The timeunit is microseconds but being written as Int32


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] pacman82 commented on issue #2984: Implicity setting of converted type is missing then setting

Posted by GitBox <gi...@apache.org>.
pacman82 commented on issue #2984:
URL: https://github.com/apache/arrow-rs/issues/2984#issuecomment-1299226278

   To reproduce try creating a file and inspect it with parquet tools. I am not surprised to see this test pass, as it failed for me even if I set the converted type explicitly.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] pacman82 commented on issue #2984: Implicity setting of converted type is missing then setting

Posted by GitBox <gi...@apache.org>.
pacman82 commented on issue #2984:
URL: https://github.com/apache/arrow-rs/issues/2984#issuecomment-1296638647

   Seems like it is the third option. Thank you again. Fine with me to close this issue.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on issue #2984: Implicity setting of converted type is missing then setting

Posted by GitBox <gi...@apache.org>.
tustvold commented on issue #2984:
URL: https://github.com/apache/arrow-rs/issues/2984#issuecomment-1299074771

   Oh I got confused, parquet has TIME and TIMESTAMP :facepalm: 
   
   > TIMESTAMP_MILLIS is the deprecated ConvertedType counterpart of a TIMESTAMP logical type that is UTC normalized and has MILLIS precision. Like the logical type counterpart, it must annotate an int64.
   > 
   > TIMESTAMP_MICROS is the deprecated ConvertedType counterpart of a TIMESTAMP logical type that is UTC normalized and has MICROS precision. Like the logical type counterpart, it must annotate an int64.
   
   So they should both be Int64.
   
   Taking a step back, the following test passes so I am not sure how to reproduce your issue
   
   ```
   #[test]
   fn test_timestamp() {
       let t= Type::primitive_type_builder("foo", PhysicalType::INT64)
           .with_logical_type(Some(LogicalType::Timestamp {
               is_adjusted_to_u_t_c: false,
               unit: TimeUnit::MICROS(MicroSeconds {}),
           }))
           .with_repetition(Repetition::OPTIONAL)
           .build()
           .unwrap();
       assert_eq!(t.get_basic_info().converted_type, ConvertedType::TIMESTAMP_MICROS);
   
       let t= Type::primitive_type_builder("foo", PhysicalType::INT64)
           .with_logical_type(Some(LogicalType::Timestamp {
               is_adjusted_to_u_t_c: false,
               unit: TimeUnit::MILLIS(MilliSeconds {}),
           }))
           .with_repetition(Repetition::OPTIONAL)
           .build()
           .unwrap();
       assert_eq!(t.get_basic_info().converted_type, ConvertedType::TIMESTAMP_MILLIS);
   }
   ```
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on issue #2984: Implicity setting of converted type is missing then setting

Posted by GitBox <gi...@apache.org>.
tustvold commented on issue #2984:
URL: https://github.com/apache/arrow-rs/issues/2984#issuecomment-1299397502

   I'm not sure what I am doing wrong here
   
   ```
   #[test]
       fn test_timestamp_round_trip() {
           let mut data = Vec::with_capacity(1024);
           let field = Arc::new(
               types::Type::primitive_type_builder("col1", Type::INT64)
                   .with_logical_type(Some(LogicalType::Timestamp {
                       is_adjusted_to_u_t_c: false,
                       unit: TimeUnit::MICROS(MicroSeconds {}),
                   }))
                   .with_repetition(Repetition::REQUIRED)
                   .build()
                   .unwrap(),
           );
           let schema = Arc::new(
               types::Type::group_type_builder("schema")
                   .with_fields(&mut vec![field.clone()])
                   .build()
                   .unwrap(),
           );
   
           let props = Arc::new(WriterProperties::builder().build());
           let mut writer =
               SerializedFileWriter::new(&mut data, schema.clone(), props).unwrap();
           let mut row_group_writer = writer.next_row_group().unwrap();
           let mut column_writer = row_group_writer.next_column().unwrap().unwrap();
           column_writer
               .typed::<Int64Type>()
               .write_batch(&[1, 2, 3, 4], None, None)
               .unwrap();
           column_writer.close().unwrap();
           row_group_writer.close().unwrap();
           writer.close().unwrap();
   
           let bytes = Bytes::from(data);
           let reader = SerializedFileReader::new(bytes).unwrap();
           assert_eq!(reader.metadata().file_metadata().schema(), schema.as_ref())
       }
   ```cccccchdviitcbjieuttjinuietjbtufibvefdvgbbbd


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] pacman82 commented on issue #2984: Implicity setting of converted type is missing then setting

Posted by GitBox <gi...@apache.org>.
pacman82 commented on issue #2984:
URL: https://github.com/apache/arrow-rs/issues/2984#issuecomment-1297421411

   Here is a minimal reproducing example:
   
   ```rust
   use parquet::{
       basic::{LogicalType, Repetition},
       data_type::{DataType, Int32Type},
       format::{MilliSeconds, TimeUnit},
       schema::types::Type,
   };
   
   fn main() {
       Type::primitive_type_builder("field_name", Int32Type::get_physical_type())
           .with_logical_type(Some(LogicalType::Timestamp {
               is_adjusted_to_u_t_c: false,
               unit: TimeUnit::MILLIS(MilliSeconds {}),
           }))
           .with_repetition(Repetition::OPTIONAL)
           .build()
           .unwrap();
   }
   ```
   
   It works if I exchange the type for `Int64Type` but then I run into the original issue of the converted type being off.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on issue #2984: Implicity setting of converted type is missing then setting

Posted by GitBox <gi...@apache.org>.
tustvold commented on issue #2984:
URL: https://github.com/apache/arrow-rs/issues/2984#issuecomment-1302554218

   > Could it be that the parquet crate is inferring the converted-type from the logic type
   
   This is definitely possible, looking at the code it will use the same type builder that will set the type. However, the following test passes
   
   ```
   #[test]
       fn test_writes_converted_type() {
           use crate::format::{MicroSeconds, FileMetaData, TimeUnit};
           let mut data = Vec::with_capacity(1024);
           let logical_type = LogicalType::Timestamp {
               is_adjusted_to_u_t_c: false,
               unit: TimeUnit::MICROS(MicroSeconds {}),
           };
           let field = Arc::new(
               types::Type::primitive_type_builder("col1", Type::INT64)
                   .with_logical_type(Some(logical_type.clone()))
                   .with_repetition(Repetition::REQUIRED)
                   .build()
                   .unwrap(),
           );
           assert_eq!(
               field.get_basic_info().converted_type(),
               ConvertedType::TIMESTAMP_MICROS
           );
           let schema = Arc::new(
               types::Type::group_type_builder("schema")
                   .with_fields(&mut vec![field.clone()])
                   .build()
                   .unwrap(),
           );
   
           // Write data
           let props = Arc::new(WriterProperties::builder().build());
           let mut writer =
               SerializedFileWriter::new(&mut data, schema.clone(), props).unwrap();
           let mut row_group_writer = writer.next_row_group().unwrap();
           let mut column_writer = row_group_writer.next_column().unwrap().unwrap();
           column_writer
               .typed::<Int64Type>()
               .write_batch(&[1, 2, 3, 4], None, None)
               .unwrap();
           column_writer.close().unwrap();
           row_group_writer.close().unwrap();
           writer.close().unwrap();
   
           // Must read footer manually as default logic automatically infers
           let footer = data[data.len() - 8..].try_into().unwrap();
           let metadata_len = crate::file::footer::decode_footer(footer).unwrap();
           let metadata = &data[data.len() - 8 - metadata_len..data.len() - 8];
           let mut prot = TCompactInputProtocol::new(metadata);
           let t_file_metadata = FileMetaData::read_from_in_protocol(&mut prot).unwrap();
           assert_eq!(t_file_metadata.schema.len(), 2);
   
           let element = &t_file_metadata.schema[1];
           assert_eq!(&element.name, "col1");
           let actual = ConvertedType::try_from(element.converted_type).unwrap();
           assert_eq!(actual, ConvertedType::TIMESTAMP_MICROS);
           let actual = LogicalType::from(element.logical_type.clone().unwrap());
           assert_eq!(actual, logical_type);
       }
   ```
   
   > Would you prefer I would open a new bug ticket for this and leaving this one closed
   
   If you could that would be grand, I'm struggling to understand what is going on
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold closed issue #2984: Implicity setting of converted type is missing then setting

Posted by GitBox <gi...@apache.org>.
tustvold closed issue #2984: Implicity setting of converted type is missing then setting
URL: https://github.com/apache/arrow-rs/issues/2984


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] pacman82 commented on issue #2984: Implicity setting of converted type is missing then setting

Posted by GitBox <gi...@apache.org>.
pacman82 commented on issue #2984:
URL: https://github.com/apache/arrow-rs/issues/2984#issuecomment-1299672225

   Great test. I am very sorry. I should have been much clearer how to reproduce the symptom I am seeing. I did take your test, but modified it to write it into an acutal file called `tmp.par`.
   
   ```rust
   std::fs::write("tmp.par", data).unwrap();
   // let bytes = bytes::Bytes::from(data);
   // let reader = SerializedFileReader::new(bytes).unwrap();
   
   // assert_eq!(reader.metadata().file_metadata().schema(), schema.as_ref());
   // assert_eq!(
   //     reader.metadata().file_metadata().schema().get_fields()[0]
   //         .get_basic_info()
   //         .converted_type(),
   //     ConvertedType::TIMESTAMP_MICROS
   // );
   ```
   
   Now if insepecting the file with parquet tools:
   
   ```shell
   pip install parquet-tools
   parquet-tools inspect tmp.par
   ```
   
   It yields
   
   ```
   serialized_size: 143
   
   
   ############ Columns ############
   col1
   
   ############ Column(col1) ############
   name: col1
   path: col1
   max_definition_level: 0
   max_repetition_level: 0
   physical_type: INT64
   logical_type: Timestamp(isAdjustedToUTC=false, timeUnit=microseconds, is_from_converted_type=false, force_set_converted_type=false)
   converted_type (legacy): NONE
   compression: UNCOMPRESSED (space_saved: 0%)
   ```
   
   It is very interessting to see the parquet reader implementations disagree here. In context the issue occurred that timestamps outputed by `odbc2parquet` can not interpreted by Azure Data Lake anymore, since migrating to logical types.
   
   Thanks for your help so far, and sorry for not being clearer in the beginning. I typed some of these into a phone, which made me error on the site of briefety.
   
   Best, Markus


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] pacman82 commented on issue #2984: Implicity setting of converted type is missing then setting

Posted by GitBox <gi...@apache.org>.
pacman82 commented on issue #2984:
URL: https://github.com/apache/arrow-rs/issues/2984#issuecomment-1296622469

   Hello @tustvold ,
   
   thank you for your feedback and the quick repsonse. I suspect that it is the third option. I'll go check the code of `odbc2parquet` and will report back here.
   
   Best, Markus


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] pacman82 commented on issue #2984: Implicity setting of converted type is missing then setting

Posted by GitBox <gi...@apache.org>.
pacman82 commented on issue #2984:
URL: https://github.com/apache/arrow-rs/issues/2984#issuecomment-1297408481

   Hello @tustvold ,
   
   so I tried to make the changes to `odbc2parquet`, yet now I am again unsure, if there is an issue with the `parquet` crates itself. So it turns out I actually messed up the second bullet point you presented (instead of the third, like I previously stated).
   
   > The timeunit is milliseconds but being written as Int64
   
   After trying to fix that, by using `INT32` as the Physical type instead, my code now "raises" an error right after I call `build` on the `primitive_type_builder`.
   
   ```
   General("Cannot annotate Timestamp { is_adjusted_to_u_t_c: false, unit: MILLIS(MilliSeconds) } from INT32 for field 'l'")
   ```
   
   Sorry if I just do not get something. Any help is appreciated.
   
   Best, Markus


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] pacman82 commented on issue #2984: Implicity setting of converted type is missing then setting

Posted by GitBox <gi...@apache.org>.
pacman82 commented on issue #2984:
URL: https://github.com/apache/arrow-rs/issues/2984#issuecomment-1297423009

   I am happy to post this example into a new issue, if it helps or reopening this one.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org