Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/07/21 19:47:07 UTC

[GitHub] [iceberg] RussellSpitzer opened a new issue #1224: Importing a Spark table with Whitespace in Partition URI Results in FileNotFound Exception

RussellSpitzer opened a new issue #1224:
URL: https://github.com/apache/iceberg/issues/1224


   Because `SparkPartition` stores partition URIs as a `String`, converting that string back into a Hadoop `Path` goes wrong when it contains URI-encoded characters (for example, a space encoded as `%20`). This causes `FileNotFoundException` errors when attempting the import into Iceberg.
   
   Example Test Case
   
   ```java
       // Partition column name and value both contain spaces.
       String partitionCol = "dAtA sPaced";
       String spacedTableName = "whitespacetable";
       String whiteSpaceKey = "some key value";
   
       List<SimpleRecord> spacedRecords = Lists.newArrayList(new SimpleRecord(1, whiteSpaceKey));
   
       File location = temp.newFolder("partitioned_table");
   
       // Write a partitioned Parquet table whose partition directory name contains spaces.
       spark.createDataFrame(spacedRecords, SimpleRecord.class)
           .withColumnRenamed("data", partitionCol)
           .write().mode("overwrite").partitionBy(partitionCol).format("parquet")
           .saveAsTable(spacedTableName);
   
       TableIdentifier source = spark.sessionState().sqlParser()
           .parseTableIdentifier(spacedTableName);
       HadoopTables tables = new HadoopTables(spark.sessionState().newHadoopConf());
       Table table = tables.create(SparkSchemaUtil.schemaForTable(spark, spacedTableName),
           SparkSchemaUtil.specForTable(spark, spacedTableName),
           ImmutableMap.of(),
           location.getCanonicalPath());
       File stagingDir = temp.newFolder("staging-dir");
       // The import fails here while listing the files of the spaced partition.
       SparkTableUtil.importSparkTable(spark, source, table, stagingDir.toString());
       List<Row> results = spark.read().format("iceberg").load(location.toString()).collectAsList();
   ```
   
   This throws:
   
   ```
       Caused by:
           org.apache.iceberg.exceptions.RuntimeIOException: Unable to list files in partition: file:/var/folders/yl/6cwgks7919s1td2mfdq86cbm0000gn/T/hive4638538718213787020/whitespacetable/dAtA%20sPaced=some%20key%20value
               Caused by:
               java.io.FileNotFoundException: File file:/var/folders/yl/6cwgks7919s1td2mfdq86cbm0000gn/T/hive4638538718213787020/whitespacetable/dAtA%20sPaced=some%20key%20value does not exist
   ```
   
   I am currently working on a PR to fix this (and add a failing test case) and will have it up soon.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer closed issue #1224: Importing a Spark table with Whitespace in Partition URI Results in FileNotFound Exception

Posted by GitBox <gi...@apache.org>.
RussellSpitzer closed issue #1224:
URL: https://github.com/apache/iceberg/issues/1224


   




[GitHub] [iceberg] RussellSpitzer edited a comment on issue #1224: Importing a Spark table with Whitespace in Partition URI Results in FileNotFound Exception

Posted by GitBox <gi...@apache.org>.
RussellSpitzer edited a comment on issue #1224:
URL: https://github.com/apache/iceberg/issues/1224#issuecomment-662111528


   Yes. Basically, the constructor `public Path(String pathString)` just takes the string apart as is, without interpreting any percent-encoding, and attempts to break it into the correct parts.
   
   The constructor `public Path(URI)`, by contrast, takes the URI as is and sets it as the internal representation of the path, so it handles the encoding properly.
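
The same split between the encoded and decoded forms can be seen with the JDK alone. This is a minimal sketch using only `java.net.URI`, not Hadoop's `Path`; the class name `EncodedUriDemo` and its two helper methods are hypothetical and exist only for illustration.

```java
import java.net.URI;

// Hypothetical demo class; it only uses the JDK's java.net.URI.
public class EncodedUriDemo {

    // The string form keeps percent-escapes exactly as they were given.
    public static String encodedForm(URI uri) {
        return uri.toString();
    }

    // getPath() returns the path component with percent-escapes decoded.
    public static String decodedPath(URI uri) {
        return uri.getPath();
    }

    public static void main(String[] args) {
        URI uri = URI.create("file:///has%20spaces");
        System.out.println(encodedForm(uri));
        System.out.println(decodedPath(uri));
    }
}
```

If the encoded string is later handed to an API that treats it as a literal path, that API looks for a name containing `%20`, which is consistent with the failure in the stack trace of the original report.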
   
   



[GitHub] [iceberg] RussellSpitzer edited a comment on issue #1224: Importing a Spark table with Whitespace in Partition URI Results in FileNotFound Exception

Posted by GitBox <gi...@apache.org>.
RussellSpitzer edited a comment on issue #1224:
URL: https://github.com/apache/iceberg/issues/1224#issuecomment-662112918


   ```scala
   scala> import java.net.URI
   import java.net.URI
   
   scala> import org.apache.hadoop.fs.Path
   import org.apache.hadoop.fs.Path
   
   scala> val uri = new URI("file:///has%20spaces")
   uri: java.net.URI = file:///has%20spaces
   
   scala> new Path(uri).toString
   res4: String = file:/has spaces
   
   scala> new Path(uri.toString).toString
   res5: String = file:/has%20spaces
   ```



[GitHub] [iceberg] raptond commented on issue #1224: Importing a Spark table with Whitespace in Partition URI Results in FileNotFound Exception

Posted by GitBox <gi...@apache.org>.
raptond commented on issue #1224:
URL: https://github.com/apache/iceberg/issues/1224#issuecomment-662130198


   👍  to keep the paths as `URI` objects and not convert them to `String`s. I thought perhaps we should `unescape`, but it seems a better idea to delegate that to `Path`.
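
One way to picture that suggestion, as a sketch only: hold the location as a `URI` and convert it at the point of use, so that decoding is left to the path API. The `PartitionLocation` class below is hypothetical and uses `java.nio.file` as a stand-in for Hadoop's `Path`; it is not Iceberg's `SparkPartition`, and it assumes a Unix-like local filesystem.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical holder that keeps the partition location as a URI.
public class PartitionLocation {
    private final URI uri;

    public PartitionLocation(URI uri) {
        this.uri = uri;
    }

    // Hand the URI itself to the path API, which decodes percent-escapes.
    public boolean existsViaUri() {
        return Files.exists(Paths.get(uri));
    }

    // Treat the still-encoded path text as a literal file name (the buggy route).
    public boolean existsViaEncodedString() {
        return Files.exists(Paths.get(uri.getRawPath()));
    }

    public static void main(String[] args) throws IOException {
        // Temporary directory whose name contains spaces, like the partition value.
        Path dir = Files.createTempDirectory("some key value");
        PartitionLocation location = new PartitionLocation(dir.toUri());
        System.out.println("via URI: " + location.existsViaUri());
        System.out.println("via encoded string: " + location.existsViaEncodedString());
        Files.delete(dir);
    }
}
```

Keeping the `URI` type until the last moment means there is no intermediate string whose encoding state has to be remembered by the caller.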



[GitHub] [iceberg] aokolnychyi commented on issue #1224: Importing a Spark table with Whitespace in Partition URI Results in FileNotFound Exception

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on issue #1224:
URL: https://github.com/apache/iceberg/issues/1224#issuecomment-662101559


   Is it because the `URI.toString` we are using does not decode the URI, and we later create a `Path` from that `String`? Spark has `CatalogUtils`, which handles `URI` to `String` conversion through Hadoop `Path`, but we should be OK with just using the `URI` to construct the `Path`?



[GitHub] [iceberg] aokolnychyi commented on issue #1224: Importing a Spark table with Whitespace in Partition URI Results in FileNotFound Exception

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on issue #1224:
URL: https://github.com/apache/iceberg/issues/1224#issuecomment-662133811


   We would need to check the remove orphan files action in a follow-up too to make sure we are safe.

