Posted to dev@drill.apache.org by "cgivre (via GitHub)" <gi...@apache.org> on 2023/02/01 13:57:48 UTC
[GitHub] [drill] cgivre commented on issue #2746: [DISCUSSION] Use INT96 as default timestamp format in Parquet tables

cgivre commented on issue #2746:
URL: https://github.com/apache/drill/issues/2746#issuecomment-1412101491

   I'll weigh in here.  Since this is user-configurable, it seems sensible to make INT96 the default and fix the UDFs.  We're about to release 1.21, which has a lot of major improvements, so IMHO it would be a good time to do so.
   
   Vova, would you mind explaining how this will break UDFs?
   Best,
   -- C
   
   
   
   > On Feb 1, 2023, at 7:54 AM, Christian Pfarr ***@***.***> wrote:
   > 
   > 
   > Hi everyone,
   > 
   > I want to raise a discussion about Drill's current handling of Parquet timestamps.
   > 
   > Drill uses INT64 for timestamps, and you can switch to INT96 by setting store.parquet.reader.int96_as_timestamp to true. That makes it easy enough to work with both kinds of Parquet timestamps, but since Spark uses INT96 by default, you have to change this setting in almost every situation, especially when working with newer lakehouse architectures such as Delta Lake and Iceberg.
   > 
   > For Spark, it's clearly documented that they use INT96 in all scenarios:
   > 
   > For reading: https://spark.apache.org/docs/latest/sql-data-sources-parquet.html
   > 
   > Some Parquet-producing systems, in particular Impala and Hive, store Timestamp into INT96. This flag tells Spark SQL to interpret INT96 data as a timestamp to provide compatibility with these systems.
   > 
   > For writing: https://spark.apache.org/docs/latest/configuration.html
   > 
   > Sets which Parquet timestamp type to use when Spark writes data to Parquet files. INT96 is a non-standard but commonly used timestamp type in Parquet. TIMESTAMP_MICROS is a standard timestamp type in Parquet, which stores number of microseconds from the Unix epoch. TIMESTAMP_MILLIS is also standard, but with millisecond precision, which means Spark has to truncate the microsecond portion of its timestamp value.
   > 
   > Of course, we could advise every Drill user to write their Spark jobs with spark.sql.parquet.outputTimestampType set to TIMESTAMP_MICROS or TIMESTAMP_MILLIS, or to always toggle this Drill option after startup, but it's still an additional step.
   > 
   > @vvysotskyi <https://github.com/vvysotskyi> mentioned that switching this default now would cause issues with some UDFs, so I think it could be a topic for the upcoming Drill 2.0.0 as a breaking change.
   > 
   > What do you think?
   > 
   > 
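
For readers less familiar with the two encodings discussed in the quoted message, here is a minimal Python sketch of how an INT96 timestamp relates to a TIMESTAMP_MICROS value. It assumes the layout commonly written by Impala, Hive and Spark: 12 little-endian bytes, with 8 bytes of nanoseconds within the day followed by 4 bytes holding the Julian day number. The function names are hypothetical and this is not Drill's reader code, only an illustration of the conversion involved.

```python
import struct
from datetime import datetime, timedelta, timezone

# Julian day number of 1970-01-01, the Unix epoch.
JULIAN_DAY_OF_UNIX_EPOCH = 2440588
MICROS_PER_DAY = 86_400_000_000


def int96_to_epoch_micros(raw: bytes) -> int:
    """Convert a 12-byte INT96 timestamp to microseconds since the Unix epoch.

    Assumed layout (little-endian): 8 bytes nanoseconds-of-day,
    then 4 bytes Julian day number.
    """
    if len(raw) != 12:
        raise ValueError("an INT96 value must be exactly 12 bytes")
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    days_since_epoch = julian_day - JULIAN_DAY_OF_UNIX_EPOCH
    # Sub-microsecond precision is dropped, as TIMESTAMP_MICROS cannot hold it.
    return days_since_epoch * MICROS_PER_DAY + nanos_of_day // 1000


def int96_to_datetime(raw: bytes) -> datetime:
    """Same conversion, returned as a timezone-aware UTC datetime."""
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    return epoch + timedelta(microseconds=int96_to_epoch_micros(raw))


if __name__ == "__main__":
    # One second past midnight on the day after the epoch.
    sample = struct.pack("<qi", 1_000_000_000, JULIAN_DAY_OF_UNIX_EPOCH + 1)
    print(int96_to_epoch_micros(sample))
    print(int96_to_datetime(sample).isoformat())
```

The INT64 types store a single offset from the epoch, while INT96 splits the value into a day number and a time of day, which is why a reader has to be told how to interpret the column.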
   
   

