Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/09/13 03:00:49 UTC

[GitHub] [hudi] nsivabalan commented on issue #5211: [SUPPORT] Glob pattern to pick specific subfolders not working while reading in Spark

nsivabalan commented on issue #5211:
URL: https://github.com/apache/hudi/issues/5211#issuecomment-1244834419

   I was able to verify that @haggy 's solution worked for me. I used 0.12 artifacts and re-used our quickstart example to try it out. 
   
   I just added 1 additional config to what's given [here](https://hudi.apache.org/docs/quick-start-guide):
   `option("hoodie.datasource.write.hive_style_partitioning","true")`
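
   For reference, here is roughly what the quickstart insert looks like with that one extra option added. This is a sketch of the 0.12 Scala quickstart using plain config keys rather than the option constants from the docs, so double-check the key names against your version:
   
   ```
   import org.apache.hudi.QuickstartUtils._
   import scala.collection.JavaConversions._
   import org.apache.spark.sql.SaveMode._
   
   val tableName = "hudi_trips_cow"
   val basePath = "file:///tmp/hudi_trips_cow"
   val dataGen = new DataGenerator
   
   // 1st commit: insert the sample trips, with hive-style partitioning enabled
   val inserts = convertToStringList(dataGen.generateInserts(10))
   val insertDf = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
   insertDf.write.format("hudi").
     options(getQuickstartWriteConfigs).
     option("hoodie.datasource.write.precombine.field", "ts").
     option("hoodie.datasource.write.recordkey.field", "uuid").
     option("hoodie.datasource.write.partitionpath.field", "partitionpath").
     option("hoodie.datasource.write.hive_style_partitioning", "true").
     option("hoodie.table.name", tableName).
     mode(Overwrite).
     save(basePath)
   ```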
   
   
   Dir structure:
   
   ls -ltr /tmp/hudi_trips_cow/
   total 0
   drwxr-xr-x  4 nsb  wheel  128 Sep 12 19:48 partitionpath=americas
   drwxr-xr-x  3 nsb  wheel   96 Sep 12 19:48 partitionpath=asia
   
   partitionpath=americas/
       brazil/
           sao_paulo/
               aeb137dc-734a-4b4b-a43d-f74f46ab876e-0_1-86-311_20220912194850933.parquet
               aeb137dc-734a-4b4b-a43d-f74f46ab876e-0_1-119-356_20220912194956750.parquet
       united_states/
           san_francisco/
               89f9ed49-fa61-4fc5-bf34-264d3bdbc54f-0_2-86-312_20220912194850933.parquet
               89f9ed49-fa61-4fc5-bf34-264d3bdbc54f-0_2-119-357_20220912194956750.parquet
   partitionpath=asia/
       india/
           chennai/
               aaea3d1d-63b6-4b40-9c8d-930a3f071dfa-0_0-86-310_20220912194850933.parquet
               aaea3d1d-63b6-4b40-9c8d-930a3f071dfa-0_0-119-355_20220912194956750.parquet
   
   
   The table has 2 commits: the 1st is an insert and the 2nd is an update.
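
   The 2nd commit is roughly the standard quickstart update, written in `Append` mode on top of the insert. Sketch below, reusing `dataGen`, `basePath` and the same write options as the insert above:
   
   ```
   // 2nd commit: update a subset of the inserted records (Append mode)
   val updates = convertToStringList(dataGen.generateUpdates(10))
   val updateDf = spark.read.json(spark.sparkContext.parallelize(updates, 2))
   updateDf.write.format("hudi").
     options(getQuickstartWriteConfigs).
     option("hoodie.datasource.write.precombine.field", "ts").
     option("hoodie.datasource.write.recordkey.field", "uuid").
     option("hoodie.datasource.write.partitionpath.field", "partitionpath").
     option("hoodie.datasource.write.hive_style_partitioning", "true").
     option("hoodie.table.name", tableName).
     mode(Append).
     save(basePath)
   ```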
   
   Reading just 1 partition path:
   ```
   val tripsSnapshotDF1 = spark.read.format("hudi").load(basePath + "/partitionpath=americas/brazil/sao_paulo/*")
   tripsSnapshotDF1.createOrReplaceTempView("hudi_trips_snapshot1")
   spark.sql("select _hoodie_partition_path, _hoodie_record_key, count(*) from  hudi_trips_snapshot1 group by 1,2 order by 1,2 ").show(false)
   ```
   
   Output:
   ```
   +---------------------------------------+------------------------------------+--------+
   |_hoodie_partition_path                 |_hoodie_record_key                  |count(1)|
   +---------------------------------------+------------------------------------+--------+
   |partitionpath=americas/brazil/sao_paulo|0ab55b9a-8d92-4bf2-8bba-16b03b2b511f|1       |
   |partitionpath=americas/brazil/sao_paulo|144c54c9-e237-4bdc-bc94-b7db15e1e98b|1       |
   |partitionpath=americas/brazil/sao_paulo|9f4c2420-982b-433d-8b92-65f39fbc3e4c|1       |
   +---------------------------------------+------------------------------------+--------+
   ```
   
   Reading multiple partitions w/ `*`:
   
   ```
   val tripsSnapshotDF2 = spark.read.format("hudi").load(basePath + "/partitionpath=americas/*/*/*")
   tripsSnapshotDF2.createOrReplaceTempView("hudi_trips_snapshot2")
   spark.sql("select _hoodie_partition_path, _hoodie_record_key, count(*) from  hudi_trips_snapshot2 group by 1,2 order by 1,2 ").show(false)
   ```
   
   Output:
   ```
   +--------------------------------------------------+------------------------------------+--------+
   |_hoodie_partition_path                            |_hoodie_record_key                  |count(1)|
   +--------------------------------------------------+------------------------------------+--------+
   |partitionpath=americas/brazil/sao_paulo           |0ab55b9a-8d92-4bf2-8bba-16b03b2b511f|1       |
   |partitionpath=americas/brazil/sao_paulo           |144c54c9-e237-4bdc-bc94-b7db15e1e98b|1       |
   |partitionpath=americas/brazil/sao_paulo           |9f4c2420-982b-433d-8b92-65f39fbc3e4c|1       |
   |partitionpath=americas/united_states/san_francisco|57444250-3c2d-44a5-93e0-7a546d9dafef|1       |
   |partitionpath=americas/united_states/san_francisco|8833aba3-8510-4f04-af47-d67277c1d043|1       |
   |partitionpath=americas/united_states/san_francisco|d1e3e865-71e5-4889-9524-90afc328aadb|1       |
   |partitionpath=americas/united_states/san_francisco|dc6e7c24-0e57-4a96-b7b2-e8fe49947a22|1       |
   |partitionpath=americas/united_states/san_francisco|e9fb07de-dc36-49b8-b909-1e3fc59dd15e|1       |
   +--------------------------------------------------+------------------------------------+--------+
   
   ```
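
   And since the original ask was about picking only specific sub-folders: the paths here are resolved as globs, so a more selective pattern should also work. This is a sketch I haven't run, assuming the hive-style layout above and standard Hadoop glob syntax (which supports `{a,b}` alternation):
   
   ```
   // pick only the brazil and india sub-folders, skipping united_states
   val tripsSelectedDF = spark.read.format("hudi").
     load(basePath + "/partitionpath=*/{brazil,india}/*/*")
   tripsSelectedDF.createOrReplaceTempView("hudi_trips_selected")
   spark.sql("select _hoodie_partition_path, count(*) from hudi_trips_selected group by 1 order by 1").show(false)
   ```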
   
   @kartik18 : does this work, or are you looking for something else?

