Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/11/13 05:34:17 UTC

[GitHub] [hudi] BalaMahesh opened a new issue #2251: [SUPPORT] select queries failing with InvalidInputException: Input path does not exist even though file is present in directory

BalaMahesh opened a new issue #2251:
URL: https://github.com/apache/hudi/issues/2251


   **Describe the problem you faced**
   
   When running `select col1, col2, ...` queries on Hudi tables, I get the error:
   
   org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: s3a://path.
   
   But if I run hdfs dfs -cat on the same file, I can see the data. The failure is also intermittent: in some cases the query returns a result, but in most cases it fails.
   
   However, running select count(*), dt from _ro group by dt does not throw any error. Where could the problem be?
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Ingest data with Delta streamer
   2. Query the _ro table
   
   **Expected behavior**
   
   Query should return the rows.
   
   **Environment Description**
   
   * Hudi version : 0.6.1
   
   * Spark version : 2.4.7
   
   * Hive version : 1.2
   
   * Hadoop version : 2.7.1
   
   * Storage (HDFS/S3/GCS..) : s3a
   
   * Running on Docker? (yes/no) : no
   
   
   **Stacktrace**
   
   ```
   org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: s3a://xx/test/hudi/data/xx/xx/dt=2020-11-12/.hoodie_partition_metadata
   	at org.apache.hadoop.mapred.LocatedFileStatusFetcher.getFileStatuses(LocatedFileStatusFetcher.java:155)
   	at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:237)
   	at org.apache.hudi.hadoop.HoodieParquetInputFormat.listStatus(HoodieParquetInputFormat.java:105)
   	at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
   	at org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:307)
   	at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:409)
   	at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:155)
   	at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:273)
   	at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:266)
   	at java.security.AccessController.doPrivileged(Native Method)
   	at javax.security.auth.Subject.doAs(Subject.java:422)
   	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
   	at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:266)
   	at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:253)
   	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
   	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   	at java.lang.Thread.run(Thread.java:745)
   ]
   DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:0
   FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, vertexName=Map 1, vertexId=vertex_1599034786224_1149165_1_00, diagnostics=[Vertex vertex_1599034786224_1149165_1_00 [Map 1] killed/failed due to:ROOT_INPUT_INIT_FAILURE, Vertex Input: accounting_new_zealand_crn_tracker_ro initializer failed, vertex=vertex_1599034786224_1149165_1_00 [Map 1], org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: s3a://xx/test/hudi/data/xx/xx/dt=2020-11-12/.hoodie_partition_metadata
   	at org.apache.hadoop.mapred.LocatedFileStatusFetcher.getFileStatuses(LocatedFileStatusFetcher.java:155)
   	at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:237)
   	at org.apache.hudi.hadoop.HoodieParquetInputFormat.listStatus(HoodieParquetInputFormat.java:105)
   	at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
   	at org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:307)
   	at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:409)
   	at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:155)
   	at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:273)
   	at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:266)
   	at java.security.AccessController.doPrivileged(Native Method)
   	at javax.security.auth.Subject.doAs(Subject.java:422)
   	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
   	at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:266)
   	at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:253)
   	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
   	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   	at java.lang.Thread.run(Thread.java:745)
   ]DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:0
   ```
   
   dfs -cat s3a://xx/test/hudi/data/xx/xx/dt=2020-11-12/.hoodie_partition_metadata;
   #partition metadata
   #Thu Nov 12 06:14:36 IST 2020
   commitTime=20201112061416
   partitionDepth=1
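
The file printed above is in Java properties format. As a rough illustration of how its two keys can be read, here is a minimal Python sketch. It assumes that `partitionDepth` is the number of directory levels between the table base path and the partition directory; that interpretation is an assumption on my part, not something confirmed in this thread. The helper names `parse_properties` and `nth_parent` are hypothetical and are not part of Hudi.

```python
import posixpath


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def nth_parent(path, levels):
    """Walk `levels` directories up from `path` (hypothetical helper)."""
    for _ in range(levels):
        path = posixpath.dirname(path)
    return path


# Contents copied from the dfs -cat output in this report.
metadata_text = """#partition metadata
#Thu Nov 12 06:14:36 IST 2020
commitTime=20201112061416
partitionDepth=1
"""

props = parse_properties(metadata_text)
partition_dir = "s3a://xx/test/hudi/data/xx/xx/dt=2020-11-12"

# Assumption: going up partitionDepth levels yields the table base path.
base_path = nth_parent(partition_dir, int(props["partitionDepth"]))
print(base_path)  # prints s3a://xx/test/hudi/data/xx/xx
```

Under that assumption, a depth of 1 places the table base path one level above the `dt=...` directory, which matches a table partitioned on a single column.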
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] BalaMahesh commented on issue #2251: [SUPPORT] select queries failing with InvalidInputException: Input path does not exist even though file is present in directory

Posted by GitBox <gi...@apache.org>.

BalaMahesh commented on issue #2251:
URL: https://github.com/apache/hudi/issues/2251#issuecomment-749933935


   @bvaradar I tried with beeline on the same Hive servers and it didn't throw any exception. The issue occurs only when running the query from the Hive CLI. I haven't spent enough time understanding how Hive treats the CLI and beeline differently.



[GitHub] [hudi] bvaradar edited a comment on issue #2251: [SUPPORT] select queries failing with InvalidInputException: Input path does not exist even though file is present in directory

Posted by GitBox <gi...@apache.org>.

bvaradar edited a comment on issue #2251:
URL: https://github.com/apache/hudi/issues/2251#issuecomment-728214460


   You can try printing the jobConf when HoodieParquetInputFormat.listStatus is called. You can check if hive partitions are configured correctly. My guess is that they may not be configured correctly.
   
   describe formatted tbl_name partition (dt='<partition>')
   



[GitHub] [hudi] BalaMahesh commented on issue #2251: [SUPPORT] select queries failing with InvalidInputException: Input path does not exist even though file is present in directory

Posted by GitBox <gi...@apache.org>.

BalaMahesh commented on issue #2251:
URL: https://github.com/apache/hudi/issues/2251#issuecomment-780248837


   @nsivabalan 
   I was doing a POC.
   This issue was observed when we were trying to query `hudi` tables from the Hive CLI. All the error logs we had are posted here. But when we ran the same query from beeline, it didn't throw any errors. Our suspicion is that a query submitted directly from the Hive CLI goes through an unwanted planner and optimizer; we are still not sure which optimization or plan it was choosing. With beeline we didn't see this issue.



[GitHub] [hudi] bvaradar commented on issue #2251: [SUPPORT] select queries failing with InvalidInputException: Input path does not exist even though file is present in directory

Posted by GitBox <gi...@apache.org>.

bvaradar commented on issue #2251:
URL: https://github.com/apache/hudi/issues/2251#issuecomment-728214460


   You can try printing the jobConf when HoodieParquetInputFormat.listStatus is called. You can also check whether the Hive partitions are configured correctly:
   
   describe formatted tbl_name partition (dt='<partition>')
   



[GitHub] [hudi] nsivabalan commented on issue #2251: [SUPPORT] select queries failing with InvalidInputException: Input path does not exist even though file is present in directory

Posted by GitBox <gi...@apache.org>.

nsivabalan commented on issue #2251:
URL: https://github.com/apache/hudi/issues/2251#issuecomment-774518949


   @BalaMahesh : Sorry for the late follow-up. It would be nice if you could post the logs as requested. Without the logs, I'm not sure how much further we can debug.
   If you were able to fix the issue, can you post what the fix was and close out the ticket?



[GitHub] [hudi] BalaMahesh edited a comment on issue #2251: [SUPPORT] select queries failing with InvalidInputException: Input path does not exist even though file is present in directory

Posted by GitBox <gi...@apache.org>.

BalaMahesh edited a comment on issue #2251:
URL: https://github.com/apache/hudi/issues/2251#issuecomment-780248837


   @nsivabalan 
   I was doing a POC.
   This issue was observed when we were trying to query `hudi` tables from the Hive CLI. All the error logs we had are posted here. But when we ran the same query from beeline, it didn't throw any errors. Our suspicion is that a query submitted directly from the Hive CLI goes through an unwanted planner and optimizer; we are still not sure which optimization or plan it was choosing. With beeline we didn't see this issue.



[GitHub] [hudi] BalaMahesh closed issue #2251: [SUPPORT] select queries failing with InvalidInputException: Input path does not exist even though file is present in directory

Posted by GitBox <gi...@apache.org>.

BalaMahesh closed issue #2251:
URL: https://github.com/apache/hudi/issues/2251


   



[GitHub] [hudi] bvaradar commented on issue #2251: [SUPPORT] select queries failing with InvalidInputException: Input path does not exist even though file is present in directory

Posted by GitBox <gi...@apache.org>.

bvaradar commented on issue #2251:
URL: https://github.com/apache/hudi/issues/2251#issuecomment-737416280


   Can you enable debug logs in the Hive CLI and post the logs here?



[GitHub] [hudi] bvaradar commented on issue #2251: [SUPPORT] select queries failing with InvalidInputException: Input path does not exist even though file is present in directory

Posted by GitBox <gi...@apache.org>.

bvaradar commented on issue #2251:
URL: https://github.com/apache/hudi/issues/2251#issuecomment-744704364


   @BalaMahesh : Any updates on this?



[GitHub] [hudi] nsivabalan commented on issue #2251: [SUPPORT] select queries failing with InvalidInputException: Input path does not exist even though file is present in directory

Posted by GitBox <gi...@apache.org>.

nsivabalan commented on issue #2251:
URL: https://github.com/apache/hudi/issues/2251#issuecomment-780281907


   @BalaMahesh : Thanks for the update, appreciate it.
   @bvaradar @n3nash : Do we know if there could be a difference between beeline and the Hive CLI? And is there any known issue with the Hive CLI?



[GitHub] [hudi] BalaMahesh edited a comment on issue #2251: [SUPPORT] select queries failing with InvalidInputException: Input path does not exist even though file is present in directory

Posted by GitBox <gi...@apache.org>.

BalaMahesh edited a comment on issue #2251:
URL: https://github.com/apache/hudi/issues/2251#issuecomment-726732563


   Update 1: After adding an additional log statement in the HoodieParquetInputFormat and InputPathHandler classes, I found this:
   
   1) [InputInitializer {Map 1} #0] |hadoop.InputPathHandler|: Got the input paths : [s3a://xxx/test/hudi/data/xxx/xxx/dt=2020-11-13/.hoodie_partition_metadata, s3a://xxx/test/hudi/data/xxx/xxx/dt=2020-11-13/4e5582b0-ceb4-4d7c-ab98-bb9dfb0962e6-0_0-17038-5024094_20201113170011.parquet]conf : Configuration: incrementalTables : []
   
   The query job received the files inside the partition directory as its input paths, instead of the partition directory itself. The Hudi MR bundle then tries to append the metadata file name to these base file paths and fails to find the metadata file path.
   
   In the same Hive session, a query on a different Hudi table produces the log below:
   
   hadoop.InputPathHandler|: Got the input paths : [s3a://xxxx/test/hudi/data/xxx/xxx/dt=2020-11-13]conf : Configuration: incrementalTables : []
   
   Here the input path goes only up to the partition directory, unlike the base file paths above. In this case the partition metadata file is accessible and the query finishes.
   
   I need help figuring out where the job is getting the base files as the input path instead of the directory. I ran describe formatted table partition(val) on both tables, and they have the same directory structure.
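
The difference between the two log lines above can be checked mechanically. The following Python sketch is an illustration only and does not reproduce Hudi's or Hive's code: it pulls the bracketed path list out of a `Got the input paths` log line, flags entries that look like files rather than partition directories, and maps them back to their partition directory. The function names and the file-detection heuristic (the `.hoodie_partition_metadata` name or a `.parquet` suffix) are assumptions made for this example.

```python
import posixpath
import re

# File name taken from the stack trace in this issue.
PARTITION_METAFILE = ".hoodie_partition_metadata"


def parse_input_paths(log_line):
    """Extract the comma-separated path list that follows 'Got the input paths : ['."""
    match = re.search(r"Got the input paths : \[(.*?)\]", log_line)
    if match is None:
        return []
    return [p.strip() for p in match.group(1).split(",") if p.strip()]


def looks_like_file(path):
    """Heuristic (assumption): metadata file or *.parquet means a file, not a directory."""
    name = posixpath.basename(path.rstrip("/"))
    return name == PARTITION_METAFILE or name.endswith(".parquet")


def partition_dirs(paths):
    """Map each input path to its partition directory, de-duplicated in order."""
    seen = []
    for path in paths:
        directory = posixpath.dirname(path) if looks_like_file(path) else path.rstrip("/")
        if directory not in seen:
            seen.append(directory)
    return seen


# Log lines copied from this comment (prefix shortened).
failing_line = (
    "hadoop.InputPathHandler|: Got the input paths : "
    "[s3a://xxx/test/hudi/data/xxx/xxx/dt=2020-11-13/.hoodie_partition_metadata, "
    "s3a://xxx/test/hudi/data/xxx/xxx/dt=2020-11-13/"
    "4e5582b0-ceb4-4d7c-ab98-bb9dfb0962e6-0_0-17038-5024094_20201113170011.parquet]"
    "conf : Configuration: incrementalTables : []"
)
working_line = (
    "hadoop.InputPathHandler|: Got the input paths : "
    "[s3a://xxxx/test/hudi/data/xxx/xxx/dt=2020-11-13]"
    "conf : Configuration: incrementalTables : []"
)

for line in (failing_line, working_line):
    paths = parse_input_paths(line)
    file_like = [p for p in paths if looks_like_file(p)]
    print(len(paths), "input path(s),", len(file_like), "file-like ->", partition_dirs(paths))
```

Running this over debug logs from both the Hive CLI and beeline would show whether the file-level input paths appear only in the CLI sessions, which is the pattern described in this thread.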
   
   
   
    
   



[GitHub] [hudi] BalaMahesh commented on issue #2251: [SUPPORT] select queries failing with InvalidInputException: Input path does not exist even though file is present in directory

Posted by GitBox <gi...@apache.org>.

BalaMahesh commented on issue #2251:
URL: https://github.com/apache/hudi/issues/2251#issuecomment-726732563


   Update 1: After adding an additional log statement in the HoodieParquetInputFormat and InputPathHandler classes, I found this:
   
   1) [InputInitializer {Map 1} #0] |hadoop.InputPathHandler|: Got the input paths : [s3a://xxx/test/hudi/data/xxx/xxx/dt=2020-11-13/.hoodie_partition_metadata, s3a://xxx/test/hudi/data/xxx/xxx/dt=2020-11-13/4e5582b0-ceb4-4d7c-ab98-bb9dfb0962e6-0_0-17038-5024094_20201113170011.parquet]conf : Configuration: incrementalTables : []
   
   The query job received the files inside the partition directory as its input paths, instead of the partition directory itself. The Hudi MR bundle then tries to append the metadata file name to these base file paths and fails to find the metadata file path.
   
   In the same Hive session, a query on a different Hudi table produces the log below:
   
   hadoop.InputPathHandler|: Got the input paths : [s3a://xxxx/test/hudi/data/xxx/xxx/dt=2020-11-13]conf : Configuration: incrementalTables : []
   
   Here the input path goes only up to the partition directory, unlike the base file paths above. In this case the partition metadata file is accessible and the query finishes.
   
   I need help figuring out where the job is getting the base files as the input path instead of the directory. I ran describe formatted table partition(val) on both tables, and they have the same directory structure.
   
   
   
    
   



[GitHub] [hudi] BalaMahesh commented on issue #2251: [SUPPORT] select queries failing with InvalidInputException: Input path does not exist even though file is present in directory

Posted by GitBox <gi...@apache.org>.

BalaMahesh commented on issue #2251:
URL: https://github.com/apache/hudi/issues/2251#issuecomment-736996338


   As I mentioned in update 1, describe formatted tbl_name partition (dt='') shows the path up to the partition directory, not the base files. Interestingly, the queries run from beeline without any issue but fail from the Hive CLI on the same installation and cluster.



[GitHub] [hudi] nsivabalan commented on issue #2251: [SUPPORT] select queries failing with InvalidInputException: Input path does not exist even though file is present in directory

Posted by GitBox <gi...@apache.org>.

nsivabalan commented on issue #2251:
URL: https://github.com/apache/hudi/issues/2251#issuecomment-774519270


   Also, a few quick questions as we triage the issue:
   - Were you running an older version of Hudi and encountered this while upgrading to a later version?
   - Is this affecting your production? Trying to gauge the severity.
   - Or are you trying out a POC, and is this your first time trying out Hudi?
   

