Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/05/12 16:43:25 UTC

[GitHub] [hudi] bkosuru opened a new issue, #5569: [SUPPORT] Issues with URL_ENCODE_PARTITIONING_OPT_KEY in hudi 0.11.0

bkosuru opened a new issue, #5569:
URL: https://github.com/apache/hudi/issues/5569

   Hello, 
   
   We created a Hudi table on HDFS using version 0.8.0. We are planning to upgrade to Hudi 0.11.0. There seems to be a difference in URL encoding. We are using 
   option(URL_ENCODE_PARTITIONING_OPT_KEY, value = true)
   
   For partition <http://purl.obolibrary.org/obo/uberon.owl>
   0.8.0 encoding is 
   %3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2Fuberon.owl%3E
   
   0.11.0 encoding is
   <http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2Fuberon.owl>
   
   When we insert new data with hudi 0.11.0 it creates a different partition.
   /data/spog/g=%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2Fuberon.owl%3E
   /data/spog/g=<http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2Fuberon.owl>
   
   Is this a bug? Any workaround?
   
   Thanks,
   Bindu
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] bettermouse commented on issue #5569: [SUPPORT] Issues with URL_ENCODE_PARTITIONING_OPT_KEY in hudi 0.11.0

Posted by GitBox <gi...@apache.org>.
bettermouse commented on issue #5569:
URL: https://github.com/apache/hudi/issues/5569#issuecomment-1126979897

           //hudi-client/hudi-client-common/src/main/java/org/apache/hudi/keygen/KeyGenUtils.java   getRecordPartitionPath
           //https://github.com/apache/hudi/pull/2645
           String encodeVersion08 = URLEncoder.encode("<http://purl.obolibrary.org/obo/uberon.owl>", StandardCharsets.UTF_8.toString());
           String encodeVersion11 = PartitionPathEncodeUtils.escapePathName("<http://purl.obolibrary.org/obo/uberon.owl>");
   
           System.out.println(encodeVersion08);
           System.out.println(encodeVersion11);
   
   It seems the partition path is encoded differently in versions 0.8 and 0.11.
   You may need to rebuild the table with Hudi 0.11?
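
To make the difference concrete without a Hudi dependency, here is a minimal Java sketch. `legacyEncode` is the plain `java.net.URLEncoder` call quoted above. `selectiveEscape` and its character set are hypothetical: they only mimic the behavior observed in this thread (`:` and `/` escaped, `<` and `>` left alone) and are not Hudi's actual `PartitionPathEncodeUtils` implementation or character list.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class PartitionEncodingComparison {

    // Hypothetical set, for illustration only: the thread shows ':' and '/' being
    // escaped while '<' and '>' are not. '%' is added so the escaping stays
    // unambiguous. This is NOT Hudi's real character list.
    private static final String ESCAPED_CHARS = ":/%";

    // 0.8.0-style encoding as described in the thread: plain URLEncoder.
    static String legacyEncode(String partitionValue) {
        try {
            return URLEncoder.encode(partitionValue, StandardCharsets.UTF_8.toString());
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(e);
        }
    }

    // Escapes only the characters in ESCAPED_CHARS as %XX and copies the rest.
    static String selectiveEscape(String partitionValue) {
        StringBuilder sb = new StringBuilder();
        for (char c : partitionValue.toCharArray()) {
            if (ESCAPED_CHARS.indexOf(c) >= 0) {
                sb.append('%').append(String.format("%02X", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String graph = "<http://purl.obolibrary.org/obo/uberon.owl>";
        System.out.println(legacyEncode(graph));
        System.out.println(selectiveEscape(graph));
    }
}
```

Under these assumptions the two methods agree on every character except the angle brackets, which is why the same input lands in two different partition directories.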




[GitHub] [hudi] bkosuru commented on issue #5569: [SUPPORT] Issues with URL_ENCODE_PARTITIONING_OPT_KEY in hudi 0.11.0

Posted by GitBox <gi...@apache.org>.
bkosuru commented on issue #5569:
URL: https://github.com/apache/hudi/issues/5569#issuecomment-1130108911

   ```
   // Another issue: in the Spark reader, we have to change
   //   val urlEncodedGraph = URLEncoder.encode(s"<http://purl.obolibrary.org/obo/uberon.owl>", StandardCharsets.UTF_8.toString)
   // to
   val urlEncodedGraph = PartitionPathEncodeUtils.escapePathName("<http://purl.obolibrary.org/obo/uberon.owl>")

   // to make the incremental query work:
   val tmp = spark.read.format("hudi")
     .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY, DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL)
     .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY, "20220511133204671")
     .option("hoodie.file.index.enable", false)
     .option(DataSourceReadOptions.INCR_PATH_GLOB_OPT_KEY, s"/g=$urlEncodedGraph/p=*")
     .load("/testing/hudi_11/spog")
   ```
    
   




[GitHub] [hudi] bkosuru commented on issue #5569: [SUPPORT] Issues with URL_ENCODE_PARTITIONING_OPT_KEY in hudi 0.11.0

Posted by GitBox <gi...@apache.org>.
bkosuru commented on issue #5569:
URL: https://github.com/apache/hudi/issues/5569#issuecomment-1126346985

   Please add angle brackets on each side of the input. They are getting deleted after I submit.
   




[GitHub] [hudi] nsivabalan commented on issue #5569: [SUPPORT] Issues with URL_ENCODE_PARTITIONING_OPT_KEY in hudi 0.11.0

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #5569:
URL: https://github.com/apache/hudi/issues/5569#issuecomment-1126265194

   Do you happen to know what you see with the local FS? I will give it a try. 




[GitHub] [hudi] xushiyan commented on issue #5569: [SUPPORT] Issues with URL_ENCODE_PARTITIONING_OPT_KEY in hudi 0.11.0

Posted by GitBox <gi...@apache.org>.
xushiyan commented on issue #5569:
URL: https://github.com/apache/hudi/issues/5569#issuecomment-1129802845

   @nsivabalan the problem is that newer encoding logic, introduced after 0.8, does not encode `<` and `>`, which resulted in the difference. We need to make a call on whether to fix this in a minor release.




[GitHub] [hudi] xushiyan closed issue #5569: [SUPPORT] Issues with URL_ENCODE_PARTITIONING_OPT_KEY in hudi 0.11.0

Posted by GitBox <gi...@apache.org>.
xushiyan closed issue #5569: [SUPPORT] Issues with URL_ENCODE_PARTITIONING_OPT_KEY in hudi 0.11.0
URL: https://github.com/apache/hudi/issues/5569




[GitHub] [hudi] xushiyan commented on issue #5569: [SUPPORT] Issues with URL_ENCODE_PARTITIONING_OPT_KEY in hudi 0.11.0

Posted by GitBox <gi...@apache.org>.
xushiyan commented on issue #5569:
URL: https://github.com/apache/hudi/issues/5569#issuecomment-1133928928

   @bkosuru You're right that users shouldn't have to worry about the encoding. This was changed in the 0.9.0 release, where `<` and `>` are no longer encoded, and the behavior has stayed the same since. We suggest that you either keep the current conversion logic as your own workaround, or do a one-time migration of the data. I've linked a PR to update the 0.9.0 release notes on this.




[GitHub] [hudi] nsivabalan commented on issue #5569: [SUPPORT] Issues with URL_ENCODE_PARTITIONING_OPT_KEY in hudi 0.11.0

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #5569:
URL: https://github.com/apache/hudi/issues/5569#issuecomment-1126344542

   I was not able to reproduce this, at least with the local FS. 
   
   ```
   
   import org.apache.hudi.QuickstartUtils._
   import scala.collection.JavaConversions._
   import org.apache.spark.sql.SaveMode._
   import org.apache.hudi.DataSourceReadOptions._
   import org.apache.hudi.DataSourceWriteOptions._
   import org.apache.hudi.config.HoodieWriteConfig._
   
   val tableName = "hudi_trips_cow"
   val basePath = "file:///tmp/hudi_trips_cow"
   val dataGen = new DataGenerator
   
   // spark-shell
   val inserts = convertToStringList(dataGen.generateInserts(10))
   val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
   
   import org.apache.spark.sql.functions.lit;
   
   
   df.withColumn("ppath",lit("http://purl.obolibrary.org/obo/uberon.owl")).write.format("hudi").
     options(getQuickstartWriteConfigs).
     option(PRECOMBINE_FIELD_OPT_KEY, "ts").
     option(RECORDKEY_FIELD_OPT_KEY, "uuid").
     option(PARTITIONPATH_FIELD_OPT_KEY, "ppath").
     option(TABLE_NAME, tableName).
     option(URL_ENCODE_PARTITIONING_OPT_KEY,"true").
     mode(Append).
     save(basePath)
   
   ```
   
   I tried the above script with 0.8.0, 0.10.0 and 0.11.0 and the result is the same in all of them. In fact, the same file group was updated for every commit I tried with the newer versions. 
   
   




[GitHub] [hudi] nsivabalan commented on issue #5569: [SUPPORT] Issues with URL_ENCODE_PARTITIONING_OPT_KEY in hudi 0.11.0

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #5569:
URL: https://github.com/apache/hudi/issues/5569#issuecomment-1126344910

   ```
   nsb$ ls -ltr /tmp/hudi_trips_cow/
   total 0
   drwxr-xr-x  10 nsb  wheel  320 May 13 14:39 http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2Fuberon.owl
   nsb$ ls -ltr /tmp/hudi_trips_cow/http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2Fuberon.owl/
   total 2592
   -rw-r--r--  1 nsb  wheel  438813 May 13 14:37 9a5cf8d4-029c-4389-9184-0c43d1b22d13-0_0-84-106_20220513143702.parquet
   -rw-r--r--  1 nsb  wheel  439521 May 13 14:38 9a5cf8d4-029c-4389-9184-0c43d1b22d13-0_0-27-30_20220513143842591.parquet
   -rw-r--r--  1 nsb  wheel  440341 May 13 14:39 9a5cf8d4-029c-4389-9184-0c43d1b22d13-0_0-36-43_20220513143934780.parquet
   
   ```




[GitHub] [hudi] bkosuru commented on issue #5569: [SUPPORT] Issues with URL_ENCODE_PARTITIONING_OPT_KEY in hudi 0.11.0

Posted by GitBox <gi...@apache.org>.
bkosuru commented on issue #5569:
URL: https://github.com/apache/hudi/issues/5569#issuecomment-1126345831

   I am sorry, the input is <http://purl.obolibrary.org/obo/uberon.owl>




[GitHub] [hudi] bkosuru commented on issue #5569: [SUPPORT] Issues with URL_ENCODE_PARTITIONING_OPT_KEY in hudi 0.11.0

Posted by GitBox <gi...@apache.org>.
bkosuru commented on issue #5569:
URL: https://github.com/apache/hudi/issues/5569#issuecomment-1126291826

   I see the same difference on the local FS




[GitHub] [hudi] xushiyan commented on issue #5569: [SUPPORT] Issues with URL_ENCODE_PARTITIONING_OPT_KEY in hudi 0.11.0

Posted by GitBox <gi...@apache.org>.
xushiyan commented on issue #5569:
URL: https://github.com/apache/hudi/issues/5569#issuecomment-1135207869

   @bkosuru this is a historical issue with 0.8->0.9 and the discrepancy is a minor case, so we have to keep 0.9+ versions consistent. Also once https://issues.apache.org/jira/browse/HUDI-512 is implemented then you won't need to worry about this situation any more.




[GitHub] [hudi] bkosuru commented on issue #5569: [SUPPORT] Issues with URL_ENCODE_PARTITIONING_OPT_KEY in hudi 0.11.0

Posted by GitBox <gi...@apache.org>.
bkosuru commented on issue #5569:
URL: https://github.com/apache/hudi/issues/5569#issuecomment-1126267752

   I did not check. I tested it on the cluster. We are not able to upgrade to 0.11.0 because of this issue. Thanks




[GitHub] [hudi] bkosuru commented on issue #5569: [SUPPORT] Issues with URL_ENCODE_PARTITIONING_OPT_KEY in hudi 0.11.0

Posted by GitBox <gi...@apache.org>.
bkosuru commented on issue #5569:
URL: https://github.com/apache/hudi/issues/5569#issuecomment-1126349096

   ![Screen Shot 2022-05-13 at 2 47 09 PM](https://user-images.githubusercontent.com/7408351/168347823-08d12376-436c-41ef-bee7-d43f385473e4.png)
   




[GitHub] [hudi] bkosuru commented on issue #5569: [SUPPORT] Issues with URL_ENCODE_PARTITIONING_OPT_KEY in hudi 0.11.0

Posted by GitBox <gi...@apache.org>.
bkosuru commented on issue #5569:
URL: https://github.com/apache/hudi/issues/5569#issuecomment-1133983759

   @xushiyan If you are not going to fix this issue, we will have to do 
   (1) A one-time migration of the data, because the current partition has ```<``` and ```>``` encoded. 
   (2) Change the encoding for the reader to
   ```val urlEncodedGraph = PartitionPathEncodeUtils.escapePathName("<http://purl.obolibrary.org/obo/uberon.owl>")```
   for 
   ```option(DataSourceReadOptions.INCR_PATH_GLOB_OPT_KEY, s"/g=$urlEncodedGraph/p=*")```
   
   And we will lose all our commit history with the one-time migration. There is no other workaround.
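
Before committing to a migration, it may help to know which existing partitions are actually affected. The Java sketch below decodes each 0.8.0-style directory value and re-escapes it. `LegacyPartitionCheck`, `selectiveEscape` and the escaped character set are hypothetical stand-ins that only mimic the behavior described in this thread; in a real job the re-escaping step would presumably call Hudi's own `PartitionPathEncodeUtils.escapePathName` instead. The sketch only reports names and does not rename or rewrite anything.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class LegacyPartitionCheck {

    // Hypothetical set mimicking the thread's observation (':' and '/' escaped,
    // '<' and '>' untouched). NOT Hudi's real character list.
    private static final String ESCAPED_CHARS = ":/%";

    // Stand-in for the newer escaping; swap in the real Hudi utility when available.
    static String selectiveEscape(String value) {
        StringBuilder sb = new StringBuilder();
        for (char c : value.toCharArray()) {
            if (ESCAPED_CHARS.indexOf(c) >= 0) {
                sb.append('%').append(String.format("%02X", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Recovers the raw partition value from a URLEncoder-style directory value.
    static String decodeLegacy(String encoded) {
        try {
            return URLDecoder.decode(encoded, StandardCharsets.UTF_8.toString());
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(e);
        }
    }

    // Returns the legacy values whose directory name would change after re-escaping.
    static List<String> affected(List<String> legacyEncodedValues) {
        List<String> result = new ArrayList<>();
        for (String legacy : legacyEncodedValues) {
            String reEscaped = selectiveEscape(decodeLegacy(legacy));
            if (!reEscaped.equals(legacy)) {
                result.add(legacy);
            }
        }
        return result;
    }
}
```

With the two values seen in this thread, only the one wrapped in angle brackets is reported. That matches the earlier observation that the bracket-free input produced the same directory on 0.8.0, 0.10.0 and 0.11.0.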

