Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/09/13 03:00:49 UTC
[GitHub] [hudi] nsivabalan commented on issue #5211: [SUPPORT] Glob pattern to pick specific subfolders not working while reading in Spark
nsivabalan commented on issue #5211:
URL: https://github.com/apache/hudi/issues/5211#issuecomment-1244834419
I was able to verify that @haggy's solution worked for me. I used the 0.12 artifacts and re-used our quickstart example to try it out.
I just added one additional config to what's given [here](https://hudi.apache.org/docs/quick-start-guide):
`option("hoodie.datasource.write.hive_style_partitioning","true")`
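To see why that config produces the layout below, here is a minimal Python sketch of hive-style path construction (illustrative only, not Hudi's actual writer code — the `partition_dir` helper is hypothetical):

```python
def partition_dir(field: str, value: str, hive_style: bool) -> str:
    # With hive-style partitioning the partition field name is prefixed
    # as "<field>=" onto the partition value; slashes in the value still
    # create nested directories. Illustrative sketch, not Hudi's writer.
    return f"{field}={value}" if hive_style else value

# The quickstart's single partition field "partitionpath" holds a
# slash-separated value, so hive-style yields the layout listed below:
print(partition_dir("partitionpath", "americas/brazil/sao_paulo", True))
# -> partitionpath=americas/brazil/sao_paulo
```

With hive-style enabled, the top-level directories become `partitionpath=americas` and `partitionpath=asia` instead of the bare `americas` and `asia`, which is what makes the `partitionpath=...` glob below resolvable.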
Directory structure:
```
ls -ltr /tmp/hudi_trips_cow/
total 0
drwxr-xr-x  4 nsb  wheel  128 Sep 12 19:48 partitionpath=americas
drwxr-xr-x  3 nsb  wheel   96 Sep 12 19:48 partitionpath=asia

americas/
  brazil/
    sao_paulo/
      aeb137dc-734a-4b4b-a43d-f74f46ab876e-0_1-86-311_20220912194850933.parquet
      aeb137dc-734a-4b4b-a43d-f74f46ab876e-0_1-119-356_20220912194956750.parquet
  united_states/
    san_francisco/
      89f9ed49-fa61-4fc5-bf34-264d3bdbc54f-0_2-86-312_20220912194850933.parquet
      89f9ed49-fa61-4fc5-bf34-264d3bdbc54f-0_2-119-357_20220912194956750.parquet
asia/
  india/
    chennai/
      aaea3d1d-63b6-4b40-9c8d-930a3f071dfa-0_0-86-310_20220912194850933.parquet
      aaea3d1d-63b6-4b40-9c8d-930a3f071dfa-0_0-119-355_20220912194956750.parquet
```
The table has 2 commits: the 1st is an insert and the 2nd is an update.
Reading just one partition path:
```
val tripsSnapshotDF1 = spark.read.format("hudi").load(basePath + "/partitionpath=americas/brazil/sao_paulo/*")
tripsSnapshotDF1.createOrReplaceTempView("hudi_trips_snapshot1")
spark.sql("select _hoodie_partition_path, _hoodie_record_key, count(*) from hudi_trips_snapshot1 group by 1,2 order by 1,2 ").show(false)
```
Output:
```
+---------------------------------------+------------------------------------+--------+
|_hoodie_partition_path |_hoodie_record_key |count(1)|
+---------------------------------------+------------------------------------+--------+
|partitionpath=americas/brazil/sao_paulo|0ab55b9a-8d92-4bf2-8bba-16b03b2b511f|1 |
|partitionpath=americas/brazil/sao_paulo|144c54c9-e237-4bdc-bc94-b7db15e1e98b|1 |
|partitionpath=americas/brazil/sao_paulo|9f4c2420-982b-433d-8b92-65f39fbc3e4c|1 |
+---------------------------------------+------------------------------------+--------+
```
Reading multiple partitions with `*`:
```
val tripsSnapshotDF2 = spark.read.format("hudi").load(basePath + "/partitionpath=americas/*/*/*")
tripsSnapshotDF2.createOrReplaceTempView("hudi_trips_snapshot2")
spark.sql("select _hoodie_partition_path, _hoodie_record_key, count(*) from hudi_trips_snapshot2 group by 1,2 order by 1,2 ").show(false)
```
Output:
```
+--------------------------------------------------+------------------------------------+--------+
|_hoodie_partition_path |_hoodie_record_key |count(1)|
+--------------------------------------------------+------------------------------------+--------+
|partitionpath=americas/brazil/sao_paulo |0ab55b9a-8d92-4bf2-8bba-16b03b2b511f|1 |
|partitionpath=americas/brazil/sao_paulo |144c54c9-e237-4bdc-bc94-b7db15e1e98b|1 |
|partitionpath=americas/brazil/sao_paulo |9f4c2420-982b-433d-8b92-65f39fbc3e4c|1 |
|partitionpath=americas/united_states/san_francisco|57444250-3c2d-44a5-93e0-7a546d9dafef|1 |
|partitionpath=americas/united_states/san_francisco|8833aba3-8510-4f04-af47-d67277c1d043|1 |
|partitionpath=americas/united_states/san_francisco|d1e3e865-71e5-4889-9524-90afc328aadb|1 |
|partitionpath=americas/united_states/san_francisco|dc6e7c24-0e57-4a96-b7b2-e8fe49947a22|1 |
|partitionpath=americas/united_states/san_francisco|e9fb07de-dc36-49b8-b909-1e3fc59dd15e|1 |
+--------------------------------------------------+------------------------------------+--------+
```
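The glob above works because each `*` expands against a single directory level. A quick pure-Python check illustrates this (the `glob_match` helper is a hypothetical sketch mimicking Spark/Hadoop path globbing, not Hadoop's actual implementation, and the parquet file names are simplified stand-ins for the ones listed above):

```python
from fnmatch import fnmatchcase

def glob_match(pattern: str, path: str) -> bool:
    # Match segment by segment so each '*' stays within one directory
    # level, mimicking how Spark/Hadoop expand path globs. Illustrative
    # sketch only, not Hadoop's actual GlobFilter implementation.
    pat, parts = pattern.split("/"), path.split("/")
    return len(pat) == len(parts) and all(
        fnmatchcase(p, g) for g, p in zip(pat, parts)
    )

# Simplified file names standing in for the parquet files listed above
files = [
    "partitionpath=americas/brazil/sao_paulo/file1.parquet",
    "partitionpath=americas/united_states/san_francisco/file2.parquet",
    "partitionpath=asia/india/chennai/file3.parquet",
]

matched = [f for f in files if glob_match("partitionpath=americas/*/*/*", f)]
# matched -> the two americas files; the asia file is excluded
```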
@kartik18: does this work, or are you looking for something else?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org