Posted to reviews@spark.apache.org by kevinyu98 <gi...@git.apache.org> on 2018/05/09 21:02:18 UTC

[GitHub] spark pull request #21285: [SPARK-24176][SQL] LOAD DATA can't identify wildc...

GitHub user kevinyu98 opened a pull request:

    https://github.com/apache/spark/pull/21285

    [SPARK-24176][SQL] LOAD DATA can't identify wildcard in the hdfs file path 

    ## What changes were proposed in this pull request?
    
    When the LOAD DATA command's path name contains wildcard characters (such as "?"), the path-related APIs (Hadoop and URI) do not parse it correctly. For example,
    `val srcPath = new Path(hdfsUri)` in `tables.scala` returns the wrong result in the following cases:
    - `hdfsUri` = `file:/user/testdemo1/user1/t??eddata60.txt`, 
       `srcPath` = `file:/user/testdemo1/user1/t`
    - `hdfsUri` = `file:/user/testdemo1/user1/?eddata60.txt`, 
       `srcPath` = `file:/user/testdemo1/user1/`
    (The same problem exists at `val uriPath = uri.getPath()`.)
    
    LOAD DATA LOCAL works because the local case calls the utility `Utils.resolveURI`, which replaces "?" with "%3F", so the Path API does not truncate the file name.
    
    This fix uses `Utils.resolveURI` method for both local and non-local cases.
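
The truncation described above can be reproduced with `java.net.URI` alone, because an unescaped "?" starts the query component of a URI. The following is a minimal sketch, not the code from this patch: the class and method names are hypothetical, and the `escaped` helper uses the multi-argument `URI` constructor as a stand-in for the escaping step rather than Spark's actual `Utils.resolveURI`.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class UriQuestionMarkDemo {
    // Parse the raw string as a URI. An unescaped '?' ends the path and
    // starts the query, so getPath() returns only the part before it.
    static String naivePath(String raw) {
        return URI.create(raw).getPath();
    }

    // Hypothetical stand-in for the escaping step (NOT Spark's Utils.resolveURI):
    // the (scheme, host, path, fragment) constructor percent-encodes characters
    // that are not legal in a path, including '?'.
    static URI escaped(String scheme, String path) {
        try {
            return new URI(scheme, null, path, null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String raw = "file:/user/testdemo1/user1/t??eddata60.txt";
        System.out.println(naivePath(raw));   // path cut off at the first '?'

        URI u = escaped("file", "/user/testdemo1/user1/t??eddata60.txt");
        System.out.println(u);                // '?' appears as %3F
        System.out.println(u.getPath());      // decoded path keeps the full file name
    }
}
```

Once the "?" is percent-encoded, `getPath()` decodes it back and returns the whole file name, which is consistent with the behavior the description reports for the local case.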
    
    I ran a similar test on Hive, and it seems to have the same behavior.
    
    `hive> load data inpath 'hdfs:/tmp/?evin.txt' into table foo1;
    FAILED: SemanticException Line 1:17 Invalid path ''hdfs:/tmp/?evin.txt'': No files matching path hdfs://stcindia-node-6.fyre.ibm.com:8020/tmp/%3Fevin.txt
    hive> load data inpath 'hdfs:/tmp/k?evin.txt' into table foo1;
    FAILED: SemanticException Line 1:17 Invalid path ''hdfs:/tmp/k?evin.txt'': No files matching path hdfs://stcindia-node-6.fyre.ibm.com:8020/tmp/k%3Fevin.txt
    hive> 
    `
    ## How was this patch tested?
    Ran the unit tests locally and added new test cases.
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/kevinyu98/spark spark-24176

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21285.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21285
    
----
commit 3c1a1cf9fbf23fe9c6a0c32090558dc8d7156871
Author: Kevin Yu <qy...@...>
Date:   2018-05-09T18:53:31Z

    resolve the path string for load data before using it

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21285: [SPARK-24176][SQL] LOAD DATA can't identify wildcard in ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21285
  
    Can one of the admins verify this patch?


---



[GitHub] spark pull request #21285: [SPARK-24176][SQL] LOAD DATA can't identify wildc...

Posted by kevinyu98 <gi...@git.apache.org>.
Github user kevinyu98 closed the pull request at:

    https://github.com/apache/spark/pull/21285


---



[GitHub] spark issue #21285: [SPARK-24176][SQL] LOAD DATA can't identify wildcard in ...

Posted by kevinyu98 <gi...@git.apache.org>.
Github user kevinyu98 commented on the issue:

    https://github.com/apache/spark/pull/21285
  
    Closing this PR; PR #20611 has incorporated this fix.


---







[GitHub] spark issue #21285: [SPARK-24176][SQL] LOAD DATA can't identify wildcard in ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21285
  
    cc @wzhfy and @sujith71955


---



[GitHub] spark issue #21285: [SPARK-24176][SQL] LOAD DATA can't identify wildcard in ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21285
  
    Is this a duplicate of https://github.com/apache/spark/pull/20611?


---



[GitHub] spark issue #21285: [SPARK-24176][SQL] LOAD DATA can't identify wildcard in ...

Posted by kevinyu98 <gi...@git.apache.org>.
Github user kevinyu98 commented on the issue:

    https://github.com/apache/spark/pull/21285
  
    @HyukjinKwon Thanks for reviewing this PR. I didn't notice that PR until you pointed it out. If we plan to support wildcards in the LOAD DATA command, then we can close this PR. 
    But with his current code, the problem reported in this JIRA still exists, because in the non-local case the path is truncated after `val srcPath = new Path(loadPath)`. I downloaded his code, and it still has the same issue that this PR reports.
    I created text1.txt on my local machine, then ran LOAD DATA:
    `load data inpath '/Users/qianyangyu/IdeaProjects/spark/??xt1.txt' into table foo1;` succeeded, but it didn't load any data into the table.
    `load data inpath '/Users/qianyangyu/IdeaProjects/spark/t?xt1.txt' into table foo1;` failed.
    `spark-sql> load data inpath '/Users/qianyangyu/IdeaProjects/spark/??xt1.txt' into table foo1;
    Time taken: 0.112 seconds
    
    spark-sql> select * from foo1;
    18/05/09 23:23:51 INFO DAGScheduler: Job 1 finished: processCmd at CliDriver.java:376, took 0.000029 s
    Time taken: 0.056 seconds
    18/05/09 23:23:51 INFO SparkSQLCLIDriver: Time taken: 0.056 seconds
    
    spark-sql> load data inpath '/Users/qianyangyu/IdeaProjects/spark/t?xt1.txt' into table foo1;
    Error in query: LOAD DATA input path does not exist: /Users/qianyangyu/IdeaProjects/spark/t?xt1.txt;
    `
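
For reference, if "?" were treated as a single-character wildcard, both patterns above would be expected to match `text1.txt`. The sketch below illustrates that expectation using the JDK's own glob matcher as a stand-in; it is an assumption for illustration only and is not the matching code used by Spark or Hadoop. The class and method names are hypothetical.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

public class GlobSketch {
    // Returns true if fileName matches the glob pattern, where '?' stands
    // for exactly one character. Uses the JDK matcher, not Hadoop's.
    static boolean matches(String pattern, String fileName) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + pattern);
        return matcher.matches(Paths.get(fileName));
    }

    public static void main(String[] args) {
        System.out.println(matches("??xt1.txt", "text1.txt")); // both leading characters are wildcards
        System.out.println(matches("t?xt1.txt", "text1.txt")); // only the second character is a wildcard
        System.out.println(matches("t?xt1.txt", "txt1.txt"));  // one character short, so no match
    }
}
```

Under these semantics neither command should silently load nothing or report a missing path, which is the discrepancy the output above shows.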


---
