Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2019/12/11 19:13:29 UTC

[GitHub] [incubator-hudi] lamber-ken edited a comment on issue #828: Synchronizing to hive partition is incorrect

URL: https://github.com/apache/incubator-hudi/issues/828#issuecomment-564689247
 
 
   @imperio-wxm, you need to set the value of `DataSourceWriteOptions.HIVE_ASSUME_DATE_PARTITION_OPT_KEY()` to `false`. 
   
   ### Why the first sync can't find the data
   
   When you override the partitioning logic to parse the `yyyy-mm-dd` form, the result depends on what `PartitionValueExtractor#extractPartitionValuesInPath` returns.
   
   For example, with a custom extractor like the one below, each partition folder looks like `part_date=2019-11-12`, so `HIVE_ASSUME_DATE_PARTITION_OPT_KEY` must be set to `false`.
   With that setting, HoodieHiveClient lists every partition folder under the base path; otherwise it assumes a three-level `yyyy/mm/dd` layout and looks for partitions there.
   For details, see `FSUtils#getAllPartitionPaths`.
   
   ```
   package org.apache.hudi.hive;
   
   import com.beust.jcommander.internal.Lists;
   import org.joda.time.DateTime;
   import org.joda.time.format.DateTimeFormat;
   import org.joda.time.format.DateTimeFormatter;
   
   import java.util.List;
   
   public class DayValueExtractor implements PartitionValueExtractor {
   
       private transient DateTimeFormatter dtfOut;
   
       public DayValueExtractor() {
           this.dtfOut = DateTimeFormat.forPattern("yyyy-MM-dd");
       }
   
       private DateTimeFormatter getDtfOut() {
           if (dtfOut == null) {
               dtfOut = DateTimeFormat.forPattern("yyyy-MM-dd");
           }
           return dtfOut;
       }
   
       @Override
       public List<String> extractPartitionValuesInPath(String partitionPath) {
           // partition path is expected to be in the form yyyy-mm-dd
           String[] splits = partitionPath.split("-");
           if (splits.length != 3) {
               throw new IllegalArgumentException(
                       "Partition path " + partitionPath + " is not in the form yyyy-mm-dd ");
           }
           // Parse year, month and day from the single path segment
           int year = Integer.parseInt(splits[0]);
           int mm = Integer.parseInt(splits[1]);
           int dd = Integer.parseInt(splits[2]);
           DateTime dateTime = new DateTime(year, mm, dd, 0, 0);
           return Lists.newArrayList(getDtfOut().print(dateTime));
       }
   }
   ```
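   To see what the extractor produces for a flat partition folder, here is a standalone sketch of the same parsing logic. It swaps Joda-Time for the JDK's `java.time` so it runs without extra dependencies, and it does not implement the Hudi `PartitionValueExtractor` interface; `DayValueExtractorDemo` is just an illustrative name.

   ```java
   import java.time.LocalDate;
   import java.time.format.DateTimeFormatter;
   import java.util.Collections;
   import java.util.List;

   public class DayValueExtractorDemo {

       // Mirrors DayValueExtractor#extractPartitionValuesInPath, but uses
       // java.time so the sketch runs without Joda-Time on the classpath.
       static List<String> extractPartitionValuesInPath(String partitionPath) {
           String[] splits = partitionPath.split("-");
           if (splits.length != 3) {
               throw new IllegalArgumentException(
                       "Partition path " + partitionPath + " is not in the form yyyy-mm-dd");
           }
           LocalDate date = LocalDate.of(
                   Integer.parseInt(splits[0]),
                   Integer.parseInt(splits[1]),
                   Integer.parseInt(splits[2]));
           // Normalize back to yyyy-MM-dd, e.g. "2019-1-2" becomes "2019-01-02"
           return Collections.singletonList(
                   date.format(DateTimeFormatter.ofPattern("yyyy-MM-dd")));
       }

       public static void main(String[] args) {
           // A flat partition folder like part_date=2019-11-12 yields the
           // single Hive partition value "2019-11-12".
           System.out.println(extractPartitionValuesInPath("2019-11-12")); // prints [2019-11-12]
       }
   }
   ```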
   
   ### Right example
   ```
   import org.apache.spark.sql.SaveMode
   val basePath = "/flink/hudi/hoodie_test"
   val datas = List("{ \"key\": \"uuid\", \"event_time\": 1574297893836, \"part_date\": \"2019-11-12\"}")
   val df = spark.read.json(spark.sparkContext.parallelize(datas, 2))
   
   df.write.format("hudi").
       option("hoodie.insert.shuffle.parallelism", "10").
       option("hoodie.upsert.shuffle.parallelism", "10").
       option("hoodie.delete.shuffle.parallelism", "10").
       option("hoodie.bulkinsert.shuffle.parallelism", "10").
   
       option("hoodie.datasource.hive_sync.enable", true).
       option("hoodie.datasource.hive_sync.jdbcurl", "jdbc:hive2://xxxx:12326").
       option("hoodie.datasource.hive_sync.username", "dcadmin").
       option("hoodie.datasource.hive_sync.password", "dcadmin").
       option("hoodie.datasource.hive_sync.database", "default").
       option("hoodie.datasource.hive_sync.table", "hoodie_test").
       option("hoodie.datasource.hive_sync.assume_date_partitioning", false).
       option("hoodie.datasource.hive_sync.partition_fields", "part_date").
   
       option("hoodie.datasource.hive_sync.partition_extractor_class", "org.apache.hudi.hive.DayValueExtractor").
   
       option("hoodie.datasource.write.precombine.field", "event_time").
       option("hoodie.datasource.write.recordkey.field", "key").
       option("hoodie.datasource.write.partitionpath.field", "part_date").
   
       option("hoodie.table.name", "hoodie_test").
       mode(SaveMode.Append).
       save(basePath);
   
   ```
   
   
