Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/09/26 14:16:57 UTC
[GitHub] [hudi] sstimmel opened a new issue, #6798: [SUPPORT] - can't retrieve the partition field in stored parquet file
sstimmel opened a new issue, #6798:
URL: https://github.com/apache/hudi/issues/6798
**Describe the problem you faced**
I am testing partitioning a dataset by eventTime, which is a timestamp column, but I only want day precision for the partition path. Is there a way to read back the original value from Hudi instead of the truncated value?
**To Reproduce**
Configs
hoodie.datasource.write.recordkey.field=companyId
hoodie.datasource.write.precombine.field=eventTime
hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKeysValueExtractor
hoodie.datasource.write.hive_style_partitioning=false
hoodie.datasource.hive_sync.enable=false
hoodie.datasource.write.drop.partition.columns=false
hoodie.deltastreamer.source.dfs.root=s3://blah/tenantconfig/raw
hoodie.deltastreamer.source.hoodieincr.read_latest_on_missing_ckpt=true
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.TimestampBasedKeyGenerator
hoodie.datasource.write.partitionpath.field=eventTime
hoodie.datasource.write.keygen.timebased.timestamp.type=SCALAR
hoodie.datasource.write.keygen.timebased.timezone=UTC
hoodie.datasource.write.keygen.timebased.timestamp.scalar.time.unit=microseconds
hoodie.datasource.write.keygen.timebased.output.dateformat=yyyy-MM-dd
hoodie.deltastreamer.keygen.timebased.timestamp.type=SCALAR
hoodie.deltastreamer.keygen.timebased.timestamp.scalar.time.unit=microseconds
hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy-MM-dd
hoodie.deltastreamer.keygen.timebased.timezone=UTC
hoodie.deltastreamer.source.s3incr.fs.prefix=s3a
hoodie.index.type=GLOBAL_SIMPLE
hoodie.simple.index.update.partition.path=true
hoodie.cleaner.policy=KEEP_LATEST_COMMITS
hoodie.cleaner.commits.retained=200
hoodie.keep.min.commits=250
hoodie.keep.max.commits=500
hoodie.allow.empty.commit=false
hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled=true
hoodie.parquet.outputtimestamptype=TIMESTAMP_MICROS
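For illustration, here is a minimal sketch of what the keygen settings above amount to: an epoch value in microseconds is formatted as yyyy-MM-dd in UTC to form the partition path. This is not Hudi's actual TimestampBasedKeyGenerator code. The names PartitionPathSketch, partitionPath and toEpochMicros are hypothetical and used only for this example.

```scala
import java.time.{Instant, ZoneOffset}
import java.time.format.DateTimeFormatter
import java.time.temporal.ChronoUnit

// Hypothetical helper, not part of Hudi: mimics the effect of
// timestamp.type=SCALAR, scalar.time.unit=microseconds,
// output.dateformat=yyyy-MM-dd and timezone=UTC from the configs above.
object PartitionPathSketch {
  private val formatter: DateTimeFormatter =
    DateTimeFormatter.ofPattern("yyyy-MM-dd").withZone(ZoneOffset.UTC)

  // Epoch microseconds -> day-precision partition path.
  def partitionPath(epochMicros: Long): String = {
    val instant = Instant.EPOCH.plus(epochMicros, ChronoUnit.MICROS)
    formatter.format(instant)
  }

  // Convenience for the example: ISO-8601 UTC timestamp -> epoch microseconds.
  def toEpochMicros(isoTimestamp: String): Long =
    ChronoUnit.MICROS.between(Instant.EPOCH, Instant.parse(isoTimestamp))
}
```

With this sketch, 2022-06-14T05:34:49.128Z and 2022-06-14T11:00:30.960Z both map to the path 2022-06-14, so the original timestamp cannot be recovered from the partition path alone.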
If I read a particular partition folder in parquet format, I can get the original eventTime values:
spark.read.format("parquet").load(testPath+"/2022-06-14").createOrReplaceTempView("t")
spark.sql("select eventId, eventTime,companyId from t").show(10, false)
+------------------------------------+-----------------------+---------+
|eventId |eventTime |companyId|
+------------------------------------+-----------------------+---------+
|f08eae6d-6103-4b4d-8d3f-348477ab055c|2022-06-14 05:34:49.128|1285302 |
|49ecd5b2-c782-482f-b796-b008b9091d8b|2022-06-14 05:34:52.83 |1285306 |
|b6eab34e-9e7d-4365-87ef-36086e18a3a0|2022-06-14 11:00:30.96 |1285489 |
|1697c79d-0180-42bc-89f8-e29d3bb806c7|2022-06-14 08:27:49.169|1285375 |
|6ecf4ffe-a937-4d3e-928e-3edfb09becdd|2022-06-14 08:28:21.978|1285379 |
|cc774a92-ee81-4e41-9228-b636af58e48c|2022-06-14 05:34:07.788|1285261 |
|c26f12ba-8b6a-4eef-a9ff-3051d65f72d2|2022-06-14 11:02:37.454|1285492 |
|e70af180-fd97-48d8-9ab2-d386154f9aad|2022-06-14 08:28:24.475|1285380 |
|7b3be223-05a2-4a77-9136-899eb2fb05d7|2022-06-14 08:31:14.847|1285383 |
|29c9afa0-5aa9-4c4d-972b-542ea1762daa|2022-06-14 08:31:16.055|1285385 |
+------------------------------------+-----------------------+---------+
only showing top 10 rows
Reading in Hudi format with the following option still returns the value as a string:
val df = spark.read.option("hoodie.datasource.read.extract.partition.values.from.path", "false").format("org.apache.hudi").load(testPath)
df.printSchema()
spark.read.format("org.apache.hudi").option("hoodie.datasource.read.extract.partition.values.from.path", "false").load(testPath).createOrReplaceTempView("temp2")
spark.sql("select _hoodie_partition_path, eventId, eventTime,companyId from temp2").show(10, false)
root
|-- _hoodie_commit_time: string (nullable = true)
|-- _hoodie_commit_seqno: string (nullable = true)
|-- _hoodie_record_key: string (nullable = true)
|-- _hoodie_partition_path: string (nullable = true)
|-- _hoodie_file_name: string (nullable = true)
|-- eventId: string (nullable = true)
|-- companyId: long (nullable = true)
|-- configId: string (nullable = true)
|-- tenantType: string (nullable = true)
|-- label: string (nullable = true)
|-- propertyId: long (nullable = true)
|-- created: timestamp (nullable = true)
|-- deleted: timestamp (nullable = true)
|-- eventTime: string (nullable = true)
+----------------------+------------------------------------+----------+---------+
|_hoodie_partition_path|eventId |eventTime |companyId|
+----------------------+------------------------------------+----------+---------+
|2022-08-27 |6ec9a519-40a3-4955-be63-df42627d9898|2022-08-27|600953 |
|2022-08-27 |859c5d1e-f458-44e1-a14f-de2a9db16bc2|2022-08-27|223727 |
|2022-08-27 |797f4c95-c5f3-4034-9c09-d55d0d24f0fc|2022-08-27|730148 |
|2022-08-27 |c02cb2be-1113-44a4-8d60-53ad77873da3|2022-08-27|413799 |
|2022-08-27 |a276685f-22a3-4a21-94fd-5f4242216abd|2022-08-27|824036 |
|2022-08-27 |d1043a46-f3bf-46c3-b3ff-d1f64f2f6829|2022-08-27|647835 |
|2022-08-27 |a8d1f925-f55e-4900-a266-491d496e5f0e|2022-08-27|187089 |
|2022-08-27 |2ee0f31c-5691-4a6b-af46-50652aa8617e|2022-08-27|683024 |
|2022-08-27 |cbb24162-b41d-4c98-baa6-6aae59123753|2022-08-27|780756 |
|2022-08-27 |82e22107-5393-4be0-8d92-6e0f71757f67|2022-08-27|203468 |
+----------------------+------------------------------------+----------+---------+
only showing top 10 rows
**Expected behavior**
With the option hoodie.datasource.read.extract.partition.values.from.path=false, should it be reading eventTime from the parquet file instead of the path value?
**Environment Description**
* Hudi version : 0.11.1
* Spark version : 3.2 (and 3.1)
* Hive version :
* Hadoop version : (3.3)
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [hudi] codope closed issue #6798: [SUPPORT] - can't retrieve the partition field in stored parquet file
Posted by GitBox <gi...@apache.org>.
codope closed issue #6798: [SUPPORT] - can't retrieve the partition field in stored parquet file
URL: https://github.com/apache/hudi/issues/6798
[GitHub] [hudi] alexeykudinkin commented on issue #6798: [SUPPORT] - can't retrieve the partition field in stored parquet file
Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on issue #6798:
URL: https://github.com/apache/hudi/issues/6798#issuecomment-1262803683
@sstimmel this is a known issue caused by how Spark treats partition columns: by default, Spark doesn't persist them in the data files, but instead encodes them into the partition path. Since we rely on some of the Spark infra to read the data, so that Hudi tables stay compatible with Spark's execution-engine optimizations, we're unfortunately constrained by these limitations currently, but we're actively looking for solutions there.
You can find more details in HUDI-3204.
[GitHub] [hudi] codope commented on issue #6798: [SUPPORT] - can't retrieve the partition field in stored parquet file
Posted by GitBox <gi...@apache.org>.
codope commented on issue #6798:
URL: https://github.com/apache/hudi/issues/6798#issuecomment-1263395871
Closing this, as the issue has already been triaged and the fix is being worked on.
[GitHub] [hudi] nsivabalan commented on issue #6798: [SUPPORT] - can't retrieve the partition field in stored parquet file
Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #6798:
URL: https://github.com/apache/hudi/issues/6798#issuecomment-1261797405
@alexeykudinkin: can you take this up?