Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/09/26 14:16:57 UTC

[GitHub] [hudi] sstimmel opened a new issue, #6798: [SUPPORT] - can't retrieve the partition field in stored parquet file

sstimmel opened a new issue, #6798:
URL: https://github.com/apache/hudi/issues/6798

   **Describe the problem you faced**
   
   I am testing partitioning a dataset by eventTime, which is a timestamp column, but I only want day precision for the partition path. Is there a way to read the original value back from Hudi instead of the truncated value?
   
   **To Reproduce**
   
   Configs
   hoodie.datasource.write.recordkey.field=companyId
   hoodie.datasource.write.precombine.field=eventTime
   hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKeysValueExtractor
   hoodie.datasource.write.hive_style_partitioning=false
   hoodie.datasource.hive_sync.enable=false
   hoodie.datasource.write.drop.partition.columns=false
   hoodie.deltastreamer.source.dfs.root=s3://blah/tenantconfig/raw
   hoodie.deltastreamer.source.hoodieincr.read_latest_on_missing_ckpt=true
   hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.TimestampBasedKeyGenerator
   hoodie.datasource.write.partitionpath.field=eventTime
   hoodie.datasource.write.keygen.timebased.timestamp.type=SCALAR
   hoodie.datasource.write.keygen.timebased.timezone=UTC
   hoodie.datasource.write.keygen.timebased.timestamp.scalar.time.unit=microseconds
   hoodie.datasource.write.keygen.timebased.output.dateformat=yyyy-MM-dd
   hoodie.deltastreamer.keygen.timebased.timestamp.type=SCALAR
   hoodie.deltastreamer.keygen.timebased.timestamp.scalar.time.unit=microseconds
   hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy-MM-dd
   hoodie.deltastreamer.keygen.timebased.timezone=UTC
   hoodie.deltastreamer.source.s3incr.fs.prefix=s3a
   hoodie.index.type=GLOBAL_SIMPLE
   hoodie.simple.index.update.partition.path=true
   hoodie.cleaner.policy=KEEP_LATEST_COMMITS
   hoodie.cleaner.commits.retained=200
   hoodie.keep.min.commits=250
   hoodie.keep.max.commits=500
   hoodie.allow.empty.commit=false
   hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled=true
   hoodie.parquet.outputtimestamptype=TIMESTAMP_MICROS
    
    
   If I read a particular partition folder in parquet format, I can get the original eventTime values
    
   spark.read.format("parquet").load(testPath+"/2022-06-14").createOrReplaceTempView("t")
   spark.sql("select eventId, eventTime,companyId from t").show(10, false)
    
    
   +------------------------------------+-----------------------+---------+
   |eventId                             |eventTime              |companyId|
   +------------------------------------+-----------------------+---------+
   |f08eae6d-6103-4b4d-8d3f-348477ab055c|2022-06-14 05:34:49.128|1285302  |
   |49ecd5b2-c782-482f-b796-b008b9091d8b|2022-06-14 05:34:52.83 |1285306  |
   |b6eab34e-9e7d-4365-87ef-36086e18a3a0|2022-06-14 11:00:30.96 |1285489  |
   |1697c79d-0180-42bc-89f8-e29d3bb806c7|2022-06-14 08:27:49.169|1285375  |
   |6ecf4ffe-a937-4d3e-928e-3edfb09becdd|2022-06-14 08:28:21.978|1285379  |
   |cc774a92-ee81-4e41-9228-b636af58e48c|2022-06-14 05:34:07.788|1285261  |
   |c26f12ba-8b6a-4eef-a9ff-3051d65f72d2|2022-06-14 11:02:37.454|1285492  |
   |e70af180-fd97-48d8-9ab2-d386154f9aad|2022-06-14 08:28:24.475|1285380  |
   |7b3be223-05a2-4a77-9136-899eb2fb05d7|2022-06-14 08:31:14.847|1285383  |
   |29c9afa0-5aa9-4c4d-972b-542ea1762daa|2022-06-14 08:31:16.055|1285385  |
   +------------------------------------+-----------------------+---------+
   only showing top 10 rows
    
    
   Reading in Hudi format with the following option still returns eventTime as a string holding the partition path value
    
    
   val df = spark.read.option("hoodie.datasource.read.extract.partition.values.from.path", "false").format("org.apache.hudi").load(testPath)
   df.printSchema()
   spark.read.format("org.apache.hudi").option("hoodie.datasource.read.extract.partition.values.from.path", "false").load(testPath).createOrReplaceTempView("temp2")
   spark.sql("select _hoodie_partition_path, eventId, eventTime,companyId from temp2").show(10, false)
    
    
   root
   |-- _hoodie_commit_time: string (nullable = true)
   |-- _hoodie_commit_seqno: string (nullable = true)
   |-- _hoodie_record_key: string (nullable = true)
   |-- _hoodie_partition_path: string (nullable = true)
   |-- _hoodie_file_name: string (nullable = true)
   |-- eventId: string (nullable = true)
   |-- companyId: long (nullable = true)
   |-- configId: string (nullable = true)
   |-- tenantType: string (nullable = true)
   |-- label: string (nullable = true)
   |-- propertyId: long (nullable = true)
   |-- created: timestamp (nullable = true)
   |-- deleted: timestamp (nullable = true)
   |-- eventTime: string (nullable = true)
    
   +----------------------+------------------------------------+----------+---------+
   |_hoodie_partition_path|eventId                             |eventTime |companyId|
   +----------------------+------------------------------------+----------+---------+
   |2022-08-27            |6ec9a519-40a3-4955-be63-df42627d9898|2022-08-27|600953   |
   |2022-08-27            |859c5d1e-f458-44e1-a14f-de2a9db16bc2|2022-08-27|223727   |
   |2022-08-27            |797f4c95-c5f3-4034-9c09-d55d0d24f0fc|2022-08-27|730148   |
   |2022-08-27            |c02cb2be-1113-44a4-8d60-53ad77873da3|2022-08-27|413799   |
   |2022-08-27            |a276685f-22a3-4a21-94fd-5f4242216abd|2022-08-27|824036   |
   |2022-08-27            |d1043a46-f3bf-46c3-b3ff-d1f64f2f6829|2022-08-27|647835   |
   |2022-08-27            |a8d1f925-f55e-4900-a266-491d496e5f0e|2022-08-27|187089   |
   |2022-08-27            |2ee0f31c-5691-4a6b-af46-50652aa8617e|2022-08-27|683024   |
   |2022-08-27            |cbb24162-b41d-4c98-baa6-6aae59123753|2022-08-27|780756   |
   |2022-08-27            |82e22107-5393-4be0-8d92-6e0f71757f67|2022-08-27|203468   |
   +----------------------+------------------------------------+----------+---------+
   only showing top 10 rows
   
   
   **Expected behavior**
   With the option hoodie.datasource.read.extract.partition.values.from.path = false, shouldn't Hudi read eventTime from the parquet file instead of from the partition path?
   
   **Environment Description**
   
   * Hudi version : 0.11.1
   
   * Spark version : 3.2 (and 3.1)
   
   * Hive version :
   
   * Hadoop version : (3.3)
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] codope closed issue #6798: [SUPPORT] - can't retrieve the partition field in stored parquet file

Posted by GitBox <gi...@apache.org>.

codope closed issue #6798: [SUPPORT] - can't retrieve the partition field in stored parquet file
URL: https://github.com/apache/hudi/issues/6798



[GitHub] [hudi] alexeykudinkin commented on issue #6798: [SUPPORT] - can't retrieve the partition field in stored parquet file

Posted by GitBox <gi...@apache.org>.

alexeykudinkin commented on issue #6798:
URL: https://github.com/apache/hudi/issues/6798#issuecomment-1262803683

   @sstimmel this is a known issue caused by how Spark treats partition columns: by default, Spark doesn't persist them in the data files but encodes them into the partition path instead. Since we rely on some of the Spark infra to read the data, so that Hudi tables stay compatible with the Spark execution engine's optimizations, we're unfortunately constrained by these limitations for now, but we're actively looking for solutions.
   
   You can find more details in HUDI-3204.
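
To make the mismatch above concrete, here is a minimal sketch in plain Scala (java.time only, no Spark or Hudi on the classpath) of how a microsecond timestamp collapses to a yyyy-MM-dd string in UTC, matching the dateformat and timezone in the reported configs. It is not Hudi's key generator code; the object name `PartitionPathDemo` and the helpers `toPartitionPath` and `toEpochMicros` are hypothetical and exist only for this illustration. The two sample values are eventTime rows from the parquet output in the report.

```scala
import java.time.{Instant, LocalDateTime, ZoneOffset}
import java.time.format.DateTimeFormatter
import java.time.temporal.ChronoUnit

object PartitionPathDemo {
  // Same pattern and time zone as in the reported configs (yyyy-MM-dd, UTC).
  private val formatter: DateTimeFormatter =
    DateTimeFormatter.ofPattern("yyyy-MM-dd").withZone(ZoneOffset.UTC)

  // Hypothetical helper: microseconds since the epoch -> day-level partition string.
  def toPartitionPath(epochMicros: Long): String =
    formatter.format(Instant.EPOCH.plus(epochMicros, ChronoUnit.MICROS))

  // Hypothetical helper: a UTC wall-clock time -> microseconds since the epoch.
  def toEpochMicros(utc: LocalDateTime): Long =
    ChronoUnit.MICROS.between(Instant.EPOCH, utc.toInstant(ZoneOffset.UTC))

  def main(args: Array[String]): Unit = {
    // Two eventTime values taken from the parquet output in the report.
    val first  = toEpochMicros(LocalDateTime.parse("2022-06-14T05:34:49.128"))
    val second = toEpochMicros(LocalDateTime.parse("2022-06-14T11:00:30.960"))

    println(toPartitionPath(first))   // 2022-06-14
    println(toPartitionPath(second))  // 2022-06-14
    println(first == second)          // false: the originals differ
  }
}
```

Both timestamps map to the same partition string, so a reader that fills eventTime from the partition path cannot reconstruct the original values; they are only available from the column stored in the parquet files.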



[GitHub] [hudi] codope commented on issue #6798: [SUPPORT] - can't retrieve the partition field in stored parquet file

Posted by GitBox <gi...@apache.org>.

codope commented on issue #6798:
URL: https://github.com/apache/hudi/issues/6798#issuecomment-1263395871

   Closing this, as the issue has already been triaged and a fix is being worked on.



[GitHub] [hudi] nsivabalan commented on issue #6798: [SUPPORT] - can't retrieve the partition field in stored parquet file

Posted by GitBox <gi...@apache.org>.

nsivabalan commented on issue #6798:
URL: https://github.com/apache/hudi/issues/6798#issuecomment-1261797405

   @alexeykudinkin: can you take this up?
   

