You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/07/22 14:25:42 UTC

[GitHub] [iceberg] RussellSpitzer opened a new pull request #1228: Fix Handling of SparkPartitions with Whitepsace in Location

RussellSpitzer opened a new pull request #1228:
URL: https://github.com/apache/iceberg/pull/1228


   More info at apache/iceberg#1224 
   
   Previously the listPartition method would use a string representation
   of the Partition's URI when creating the Hadoop Path. This constructor would
   handle the string as the literal input for the path and ignore any encoded
   characters. This casues an issue if the directory or filename has whitespace
   or some other special character being reported as it's encoded version
    resulting in FNF exceptions. By switching to passing a URI as
   the SparkPartition's location uri, the escaped characters are correctly handled
   by the Hadoop Path class..


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1228: Fix Handling of SparkPartitions with Whitepsace in Location

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1228:
URL: https://github.com/apache/iceberg/pull/1228#discussion_r458929956



##########
File path: spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java
##########
@@ -386,7 +386,7 @@ private static SparkPartition toSparkPartition(CatalogTablePartition partition,
     Preconditions.checkArgument(serde.nonEmpty() || table.provider().nonEmpty(),
         "Partition format should be defined");
 
-    String uri = String.valueOf(locationUri.get());
+    URI uri = locationUri.get();

Review comment:
       +1 for using the approach that @aokolnychyi pasted.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #1228: Fix Handling of SparkPartitions with Whitepsace in Location

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #1228:
URL: https://github.com/apache/iceberg/pull/1228#issuecomment-662535053


   cc @raptond and @chenjunjiedada as you worked on this part
   
   cc @prodeezy @rdblue as well


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #1228: Fix Handling of SparkPartitions with Whitepsace in Location

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #1228:
URL: https://github.com/apache/iceberg/pull/1228#discussion_r458933173



##########
File path: spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java
##########
@@ -386,7 +386,7 @@ private static SparkPartition toSparkPartition(CatalogTablePartition partition,
     Preconditions.checkArgument(serde.nonEmpty() || table.provider().nonEmpty(),
         "Partition format should be defined");
 
-    String uri = String.valueOf(locationUri.get());
+    URI uri = locationUri.get();

Review comment:
       Works at least for my little example
   
   ```scala> val uri = new URI("file:///has%20spaces")
   uri: java.net.URI = file:///has%20spaces
   
   scala> new Path(uri).toString
   res0: String = file:/has spaces
   
   scala> new Path(new Path(uri).toString)
   res1: org.apache.hadoop.fs.Path = file:/has spaces```

##########
File path: spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java
##########
@@ -386,7 +386,7 @@ private static SparkPartition toSparkPartition(CatalogTablePartition partition,
     Preconditions.checkArgument(serde.nonEmpty() || table.provider().nonEmpty(),
         "Partition format should be defined");
 
-    String uri = String.valueOf(locationUri.get());
+    URI uri = locationUri.get();

Review comment:
       Works at least for my little example
   
   ```scala> val uri = new URI("file:///has%20spaces")
   uri: java.net.URI = file:///has%20spaces
   
   scala> new Path(uri).toString
   res0: String = file:/has spaces
   
   scala> new Path(new Path(uri).toString)
   res1: org.apache.hadoop.fs.Path = file:/has spaces
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on pull request #1228: Fix Handling of SparkPartitions with Whitepsace in Location

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on pull request #1228:
URL: https://github.com/apache/iceberg/pull/1228#issuecomment-662569759


   Are there any objections to using the CatalogUtils method from Spark directly for this purpose? We could duplicate the code path as well but I feel like if we are going to do it the way Spark does it we should explicitly use the method. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #1228: Fix Handling of SparkPartitions with Whitepsace in Location

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #1228:
URL: https://github.com/apache/iceberg/pull/1228#issuecomment-662533530


   Based on the discussion in the issue, this looks like the correct direction. One question, though. How will this work when we construct a dataframe of partitions in `partitionDF`? I ran tests in `TestSparkTableUtil` locally and they failed too.
   
   I believe `SparkPartition` is a bean and Spark requires that all fields of it must be valid beans too. URI isn't, right?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #1228: Fix Handling of SparkPartitions with Whitepsace in Location

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #1228:
URL: https://github.com/apache/iceberg/pull/1228#issuecomment-662671744


   Thanks @RussellSpitzer!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #1228: Fix Handling of SparkPartitions with Whitepsace in Location

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #1228:
URL: https://github.com/apache/iceberg/pull/1228#discussion_r458928543



##########
File path: spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java
##########
@@ -386,7 +386,7 @@ private static SparkPartition toSparkPartition(CatalogTablePartition partition,
     Preconditions.checkArgument(serde.nonEmpty() || table.provider().nonEmpty(),
         "Partition format should be defined");
 
-    String uri = String.valueOf(locationUri.get());
+    URI uri = locationUri.get();

Review comment:
       Here what Spark does:
   
   ```
     /**
      * Convert URI to String.
      * Since URI.toString does not decode the uri, e.g. change '%25' to '%'.
      * Here we create a hadoop Path with the given URI, and rely on Path.toString
      * to decode the uri
      * @param uri the URI of the path
      * @return the String of the path
      */
     def URIToString(uri: URI): String = {
       new Path(uri).toString
     }
   
     /**
      * Convert String to URI.
      * Since new URI(string) does not encode string, e.g. change '%' to '%25'.
      * Here we create a hadoop Path with the given String, and rely on Path.toUri
      * to encode the string
      * @param str the String of the path
      * @return the URI of the path
      */
     def stringToURI(str: String): URI = {
       new Path(str).toUri
     }
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #1228: Fix Handling of SparkPartitions with Whitepsace in Location

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #1228:
URL: https://github.com/apache/iceberg/pull/1228#issuecomment-662561301


   @aokolnychyi, I think we just need to change how we convert a URI to a String, by going through Path as an intermediate step. Then we don't need to change how any of the other parts work.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #1228: Fix Handling of SparkPartitions with Whitepsace in Location

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #1228:
URL: https://github.com/apache/iceberg/pull/1228#discussion_r458931331



##########
File path: spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java
##########
@@ -386,7 +386,7 @@ private static SparkPartition toSparkPartition(CatalogTablePartition partition,
     Preconditions.checkArgument(serde.nonEmpty() || table.provider().nonEmpty(),
         "Partition format should be defined");
 
-    String uri = String.valueOf(locationUri.get());
+    URI uri = locationUri.get();

Review comment:
       Sounds good to me. I'd rather we didn't have 3 classes involved :) but If Spark does it this way at least we will be broken in similar ways if something doesn't work ;)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #1228: Fix Handling of SparkPartitions with Whitepsace in Location

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #1228:
URL: https://github.com/apache/iceberg/pull/1228#discussion_r458919714



##########
File path: spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java
##########
@@ -386,7 +386,7 @@ private static SparkPartition toSparkPartition(CatalogTablePartition partition,
     Preconditions.checkArgument(serde.nonEmpty() || table.provider().nonEmpty(),
         "Partition format should be defined");
 
-    String uri = String.valueOf(locationUri.get());
+    URI uri = locationUri.get();

Review comment:
       ```
   scala> import java.net.URI
   import java.net.URI
   
   scala> import org.apache.hadoop.fs.Path
   import org.apache.hadoop.fs.Path
   
   scala> val uri = new URI("file:///has%20spaces")
   uri: java.net.URI = file:///has%20spaces
   
   scala> new Path(uri).toString
   res4: String = file:/has spaces
   
   scala> new Path(uri.toString).toString
   res5: String = file:/has%20spaces
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on pull request #1228: Fix Handling of SparkPartitions with Whitepsace in Location

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on pull request #1228:
URL: https://github.com/apache/iceberg/pull/1228#issuecomment-662543772


   I think I mentioned this before, but we can pass the URI around as a string if Neccessary, we just need to reconvert it back into a URI before passing it into the Hadoop Path constructor. If we pass it as a String literal it assumes it has been un-escaped. This is just a little weirder solution because we end up keeping it in a different format that we actually use it.
   
   If we want to go that route, we can keep everything as a string, and then just at the last moment instead of passing the string directly do
   
   partitionUri = new URI(partitionUriString)
   path = new Path(partitionUri)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #1228: Fix Handling of SparkPartitions with Whitepsace in Location

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #1228:
URL: https://github.com/apache/iceberg/pull/1228#discussion_r458933173



##########
File path: spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java
##########
@@ -386,7 +386,7 @@ private static SparkPartition toSparkPartition(CatalogTablePartition partition,
     Preconditions.checkArgument(serde.nonEmpty() || table.provider().nonEmpty(),
         "Partition format should be defined");
 
-    String uri = String.valueOf(locationUri.get());
+    URI uri = locationUri.get();

Review comment:
       Works at least for my little example
   
   `scala> val uri = new URI("file:///has%20spaces")
   uri: java.net.URI = file:///has%20spaces
   
   scala> new Path(uri).toString
   res0: String = file:/has spaces
   
   scala> new Path(new Path(uri).toString)
   res1: org.apache.hadoop.fs.Path = file:/has spaces`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #1228: Fix Handling of SparkPartitions with Whitepsace in Location

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #1228:
URL: https://github.com/apache/iceberg/pull/1228#discussion_r458924749



##########
File path: spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java
##########
@@ -313,7 +313,7 @@ private SparkTableUtil() {
     }
   }
 
-  private static List<DataFile> listParquetPartition(Map<String, String> partitionPath, String partitionUri,
+  private static List<DataFile> listParquetPartition(Map<String, String> partitionPath, URI partitionUri,
                                                      PartitionSpec spec, Configuration conf,
                                                      MetricsConfig metricsSpec) {
     try {

Review comment:
       The alternate fix is on the line below (an all other times we make a Path from a string), where instead of just passing through the string we pass back through the URI version of the string, or possibly just decode the string before passing it through. I think it's probably safest to use URI the whole time, second safest to change back to URI at the last moment, third safest to attempt to decode and hope Hadoop parses the string like we want it too.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #1228: Fix Handling of SparkPartitions with Whitepsace in Location

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #1228:
URL: https://github.com/apache/iceberg/pull/1228#discussion_r458998201



##########
File path: spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java
##########
@@ -105,6 +105,20 @@
   private SparkTableUtil() {
   }
 
+  /**
+   * From Apache Spark
+   *
+   * Convert URI to String.
+   * Since URI.toString does not decode the uri, e.g. change '%25' to '%'.
+   * Here we create a hadoop Path with the given URI, and rely on Path.toString
+   * to decode the uri
+   * @param uri the URI of the path
+   * @return the String of the path
+   */
+  public static String uriToString(URI uri) {

Review comment:
       I'll move it to Util, since this will probably come up again in the future an possibly in a non Spark Context (pretty sure all HadoopFS interactions will run into this if we ever implement methods for another system)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1228: Fix Handling of SparkPartitions with Whitepsace in Location

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1228:
URL: https://github.com/apache/iceberg/pull/1228#discussion_r458928613



##########
File path: spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java
##########
@@ -386,7 +386,7 @@ private static SparkPartition toSparkPartition(CatalogTablePartition partition,
     Preconditions.checkArgument(serde.nonEmpty() || table.provider().nonEmpty(),
         "Partition format should be defined");
 
-    String uri = String.valueOf(locationUri.get());
+    URI uri = locationUri.get();

Review comment:
       Okay, it sounds like we can keep this fix scoped to just these classes where we need to preserve the original URI instead of modifying it. That's a good rule for us to follow: if we have a URI, then use the URI constructor. If we have a String, use the String constructor.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi edited a comment on pull request #1228: Fix Handling of SparkPartitions with Whitepsace in Location

Posted by GitBox <gi...@apache.org>.
aokolnychyi edited a comment on pull request #1228:
URL: https://github.com/apache/iceberg/pull/1228#issuecomment-662533530


   Based on the discussion in the issue, this looks like the correct direction. One question, though. How will this work when we construct a dataframe of partitions in `partitionDF`? I ran tests in `TestSparkTableUtil` locally and they failed too.
   
   I believe `SparkPartition` is a bean and Spark requires that all fields of it to be valid beans too. URI isn't a bean, right?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1228: Fix Handling of SparkPartitions with Whitepsace in Location

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1228:
URL: https://github.com/apache/iceberg/pull/1228#discussion_r458916359



##########
File path: spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java
##########
@@ -386,7 +386,7 @@ private static SparkPartition toSparkPartition(CatalogTablePartition partition,
     Preconditions.checkArgument(serde.nonEmpty() || table.provider().nonEmpty(),
         "Partition format should be defined");
 
-    String uri = String.valueOf(locationUri.get());
+    URI uri = locationUri.get();

Review comment:
       Looks like the main problem is that `new Path(locationUri.get())` is not the same as `new Path(locationUri.get().toString())`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on pull request #1228: Fix Handling of SparkPartitions with Whitepsace in Location

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on pull request #1228:
URL: https://github.com/apache/iceberg/pull/1228#issuecomment-662600900


   Changes pushed! I also added a test for the "Import Unpartitioned" which has a slightly different place where it was using the URI as a string.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #1228: Fix Handling of SparkPartitions with Whitepsace in Location

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #1228:
URL: https://github.com/apache/iceberg/pull/1228#discussion_r458998201



##########
File path: spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java
##########
@@ -105,6 +105,20 @@
   private SparkTableUtil() {
   }
 
+  /**
+   * From Apache Spark
+   *
+   * Convert URI to String.
+   * Since URI.toString does not decode the uri, e.g. change '%25' to '%'.
+   * Here we create a hadoop Path with the given URI, and rely on Path.toString
+   * to decode the uri
+   * @param uri the URI of the path
+   * @return the String of the path
+   */
+  public static String uriToString(URI uri) {

Review comment:
       I'll move it to Util, since this will probably come up again in the future and possibly in a non Spark Context (pretty sure all HadoopFS interactions will run into this if we ever implement methods for another system)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on pull request #1228: Fix Handling of SparkPartitions with Whitepsace in Location

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on pull request #1228:
URL: https://github.com/apache/iceberg/pull/1228#issuecomment-662672884


   Thanks for the help with review everyone!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1228: Fix Handling of SparkPartitions with Whitepsace in Location

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1228:
URL: https://github.com/apache/iceberg/pull/1228#discussion_r458929096



##########
File path: spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java
##########
@@ -271,7 +271,7 @@ private SparkTableUtil() {
    * @param metricsConfig a metrics conf
    * @return a List of DataFile
    */
-  public static List<DataFile> listPartition(Map<String, String> partition, String uri, String format,
+  public static List<DataFile> listPartition(Map<String, String> partition, URI uri, String format,

Review comment:
       I'd like to avoid changing this method since it is public and using a URI will probably change behavior for users passing strings (String -> URI -> Path instead of String -> Path).




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #1228: Fix Handling of SparkPartitions with Whitepsace in Location

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #1228:
URL: https://github.com/apache/iceberg/pull/1228#issuecomment-662548210


   I'm really skeptical about introducing URI to solve this problem. We purposely use only string internally to avoid introducing more complicated behavior when converting between the 3 classes (String, URI, and Path). Path and URI do not reliably convert back and forth, if I remember correctly.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on pull request #1228: Fix Handling of SparkPartitions with Whitepsace in Location

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on pull request #1228:
URL: https://github.com/apache/iceberg/pull/1228#issuecomment-662575275


   Yeah I was thinking a bit about the long term version incompatibility, I think the risk is small but it's easy enough to copy. PR will be up in a few.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi edited a comment on pull request #1228: Fix Handling of SparkPartitions with Whitepsace in Location

Posted by GitBox <gi...@apache.org>.
aokolnychyi edited a comment on pull request #1228:
URL: https://github.com/apache/iceberg/pull/1228#issuecomment-662533530


   Based on the discussion in the issue, this looks like the correct direction. One question, though. How will this work when we construct a dataframe of partitions in `partitionDF`? I ran tests in `TestSparkTableUtil` locally and they failed too.
   
   I believe `SparkPartition` is a bean and Spark requires that all fields of it to be valid beans too. URI isn't a bean, right?
   
   I think we either need to switch to Java serialization in `partitionDF` or do a trick like `CatalogUtils` in Spark that converts a URI to String via Hadoop Path.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1228: Fix Handling of SparkPartitions with Whitepsace in Location

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1228:
URL: https://github.com/apache/iceberg/pull/1228#discussion_r458991231



##########
File path: spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java
##########
@@ -105,6 +105,20 @@
   private SparkTableUtil() {
   }
 
+  /**
+   * From Apache Spark
+   *
+   * Convert URI to String.
+   * Since URI.toString does not decode the uri, e.g. change '%25' to '%'.
+   * Here we create a hadoop Path with the given URI, and rely on Path.toString
+   * to decode the uri
+   * @param uri the URI of the path
+   * @return the String of the path
+   */
+  public static String uriToString(URI uri) {

Review comment:
       If this is going to be public, I'd rather find a better place than in `SparkTableUtil`, like `org.apache.iceberg.hadoop.Util`. I'm also fine with this being a private method here.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue merged pull request #1228: Fix Handling of SparkPartitions with Whitepsace in Location

Posted by GitBox <gi...@apache.org>.
rdblue merged pull request #1228:
URL: https://github.com/apache/iceberg/pull/1228


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #1228: Fix Handling of SparkPartitions with Whitepsace in Location

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #1228:
URL: https://github.com/apache/iceberg/pull/1228#issuecomment-662572191


   I'd rather not depend on Spark here because we will probably need to do this elsewhere eventually. The idea is simple (produce strings with Path if we want to use the Path(String) constructor) and the code is small enough that I think it is reasonable to reproduce it.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #1228: Fix Handling of SparkPartitions with Whitepsace in Location

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #1228:
URL: https://github.com/apache/iceberg/pull/1228#issuecomment-662560097


   Using Java serialization in methods like `partitionDF` is probably not an option too. Right now, we return 3 columns. We cannot return the whole row as bytes.
   
   One more question, should methods like `partitionDF` return already decoded uris as strings? What if we decode it before creating `SparkPartition`?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #1228: Fix Handling of SparkPartitions with Whitepsace in Location

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #1228:
URL: https://github.com/apache/iceberg/pull/1228#discussion_r458918870



##########
File path: spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java
##########
@@ -386,7 +386,7 @@ private static SparkPartition toSparkPartition(CatalogTablePartition partition,
     Preconditions.checkArgument(serde.nonEmpty() || table.provider().nonEmpty(),
         "Partition format should be defined");
 
-    String uri = String.valueOf(locationUri.get());
+    URI uri = locationUri.get();

Review comment:
       Yes, Like I wrote on the issue. The constructor of Path(string) assumes the string is decoded. The constructor of Path(Uri) has a URI so it knows it's encoded. That's why we can also fix this by reconverting back to a URI at the last moment (basically re asserting the encoding)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue edited a comment on pull request #1228: Fix Handling of SparkPartitions with Whitepsace in Location

Posted by GitBox <gi...@apache.org>.
rdblue edited a comment on pull request #1228:
URL: https://github.com/apache/iceberg/pull/1228#issuecomment-662572191


   I'd rather not depend on Spark here because we will probably need to do this elsewhere eventually. The idea is simple (produce strings with Path if we want to use the Path(String) constructor) and the code is small enough that I think it is reasonable to reproduce it.
   
   Also, we depend on multiple Spark versions so compatibility with whatever we use in Spark is a long-term risk.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org