Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/08/04 15:54:28 UTC

[GitHub] [hudi] luffyd opened a new issue #1913: [SUPPORT][MOR]Too many open files on IOException and Crash

luffyd opened a new issue #1913:
URL: https://github.com/apache/hudi/issues/1913


   **_Tips before filing an issue_**
   
   - Have you gone through our [FAQs](https://cwiki.apache.org/confluence/display/HUDI/FAQ)? yes
   
   - Join the mailing list to engage in conversations and get faster support at dev-subscribe@hudi.apache.org.
   
   - If you have triaged this as a bug, then file an [issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.
   
   **Describe the problem you faced**
   
   I have an MOR table with continuous upserts running in a loop, with maxFileLimit set to 2 GB and smallFileLimit set to 1 GB.
   I am seeing `java.io.IOException: Too many open files` in the Hudi logs, and the Spark job terminates with a FATAL error.
   
   Every upsert contains 3M new records.
   This happens after about 200M records have accumulated in the table.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   ```
    val parallelism = options.getOrElse("parallelism", Math.max(2, upsertCount / 100000).toString).toInt
    println("parallelism", parallelism)

    var fileSizeOptions = Map[String, String]()
    fileSizeOptions += (HoodieStorageConfig.PARQUET_FILE_MAX_BYTES -> String.valueOf(2013265920))   // 2GB
    fileSizeOptions += (HoodieStorageConfig.PARQUET_BLOCK_SIZE_BYTES -> String.valueOf(2013265920)) // 2GB
    fileSizeOptions += (HoodieCompactionConfig.PARQUET_SMALL_FILE_LIMIT_BYTES -> String.valueOf(1006632960)) // 1GB
   
   (inputDF
           .write
           .format("org.apache.hudi")
           .option(HoodieWriteConfig.TABLE_NAME, options.getOrElse("tableName", "facestest"))
           .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, DataSourceWriteOptions.MOR_STORAGE_TYPE_OPT_VAL)
           .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "partitionKey")
           .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "nodeId")
           .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "ts")
           .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
           .option("hoodie.upsert.shuffle.parallelism", parallelism)//not the ideal way but works
           .option("hoodie.bulkinsert.shuffle.parallelism", parallelism)//not the ideal way but works
           .option("hoodie.cleaner.commits.retained", "1")
           .options(fileSizeOptions)
           .mode(SaveMode.Append)
           .save(getHudiPath(spark)))
   ```
   **Expected behavior**
   
   Upserts should continue to commit without the driver exhausting its file descriptors.
   
   **Environment Description**
   
   * Hudi version : 0.5.3
   
   * Spark version : 2.4.5
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no
   
   
   **Additional context**
   
   I notice lots of FileSystemViewHandler activity happening like this
   ```
   20/08/03 03:26:49 INFO FileSystemViewHandler: TimeTakenMillis[Total=101, Refresh=98, handle=3, Check=0], Success=true, Query=basepath=s3%3A%2F%2Fchelan-dev-mock-faces%2FTestFacesUpserForLoop%2F1000P2G&lastinstantts=20200803032307&timelinehash=9ef96ca1f93d48b24891b749ead94dfeb95ac7123fe538eb09733a1494befe77, Host=ip-10-0-1-42.us-west-2.compute.internal:39767, synced=false
   ```
   Looking at the host in the above logs, it is the driver host where these logs are being emitted. Can you explain why the driver has to access the filesystem so much?
   
   ```
   [hadoop@ip-10-0-1-42 s-1DFOYHZTX6MAG]$ zgrep FileSystemViewHandler stderr.gz |wc -l
   506548
   [hadoop@ip-10-0-1-42 s-1DFOYHZTX6MAG]$ zgrep 'Too many open files' stderr.gz |wc -l
   17054
   ```
   
   
   **Stacktrace**
   
   ```
   20/08/03 10:00:44 INFO FileSystemViewHandler: TimeTakenMillis[Total=1, Refresh=0, handle=0, Check=0], Success=false, Query=partition=670&basepath=s3%3A%2F%2Fchelan-dev-mock-faces%2FTestFacesUpserForLoop%2F1000P2G&lastinstantts=20200803094501&timelinehash=96d2a8ca3e0bb58b3
   82377345ed35eaf99f09f97540b13bf6295682fc42eade0, Host=ip-10-0-1-42.us-west-2.compute.internal:44483, synced=false
   20/08/03 10:00:44 WARN ExceptionMapper: Uncaught exception
   com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.SdkClientException: Unable to execute HTTP request: chelan-dev-mock-faces.s3.us-west-2.amazonaws.com
           at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleRetryableException(AmazonHttpClient.java:1189)
           at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1135)
           at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:784)
           at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:752)
           at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:726)
           at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:686)
           at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:668)
           at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:532)
           at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:512)
           at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5052)
           at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4998)
           at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1335)
           at com.amazon.ws.emr.hadoop.fs.s3.lite.call.GetObjectMetadataCall.perform(GetObjectMetadataCall.java:22)
           at com.amazon.ws.emr.hadoop.fs.s3.lite.call.GetObjectMetadataCall.perform(GetObjectMetadataCall.java:8)
           at com.amazon.ws.emr.hadoop.fs.s3.lite.executor.GlobalS3Executor.execute(GlobalS3Executor.java:114)
           at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.invoke(AmazonS3LiteClient.java:189)
           at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.invoke(AmazonS3LiteClient.java:184)
           at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.getObjectMetadata(AmazonS3LiteClient.java:96)
           at com.amazon.ws.emr.hadoop.fs.s3.lite.AbstractAmazonS3Lite.getObjectMetadata(AbstractAmazonS3Lite.java:43)
           at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.getFileMetadataFromCacheOrS3(Jets3tNativeFileSystemStore.java:497)
           at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:223)
           at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:590)
           at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1440)
           at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.exists(EmrFileSystem.java:357)
           at org.apache.hudi.common.fs.HoodieWrapperFileSystem.exists(HoodieWrapperFileSystem.java:460)
           at org.apache.hudi.common.fs.FSUtils.createPathIfNotExists(FSUtils.java:517)
           at org.apache.hudi.common.table.view.AbstractTableFileSystemView.lambda$ensurePartitionLoadedCorrectly$5(AbstractTableFileSystemView.java:223)
           at java.util.concurrent.ConcurrentHashMap.computeIfAbsent(ConcurrentHashMap.java:1660)
           at org.apache.hudi.common.table.view.AbstractTableFileSystemView.ensurePartitionLoadedCorrectly(AbstractTableFileSystemView.java:214)
           at org.apache.hudi.common.table.view.AbstractTableFileSystemView.getAllFileGroups(AbstractTableFileSystemView.java:545)
           at org.apache.hudi.timeline.service.handlers.FileSliceHandler.getAllFileGroups(FileSliceHandler.java:88)
           at org.apache.hudi.timeline.service.FileSystemViewHandler.lambda$registerFileSlicesAPI$17(FileSystemViewHandler.java:281)
           at org.apache.hudi.timeline.service.FileSystemViewHandler$ViewHandler.handle(FileSystemViewHandler.java:329)
            at io.javalin.security.SecurityUtil.noopAccessManager(SecurityUtil.kt:22)
           at io.javalin.Javalin.lambda$addHandler$0(Javalin.java:606)
           at io.javalin.core.JavalinServlet$service$2$1.invoke(JavalinServlet.kt:46)
           at io.javalin.core.JavalinServlet$service$2$1.invoke(JavalinServlet.kt:17)
           at io.javalin.core.JavalinServlet$service$1.invoke(JavalinServlet.kt:143)
           at io.javalin.core.JavalinServlet$service$2.invoke(JavalinServlet.kt:41)
           at io.javalin.core.JavalinServlet.service(JavalinServlet.kt:107)
           at io.javalin.core.util.JettyServerUtil$initialize$httpHandler$1.doHandle(JettyServerUtil.kt:72)
           at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
           at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:480)
           at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1668)
           at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)
           at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1247)
           at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
           at org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:61)
           at org.eclipse.jetty.server.handler.StatisticsHandler.handle(StatisticsHandler.java:174)
           at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
           at org.eclipse.jetty.server.Server.handle(Server.java:502)
           at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:370)
           at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:267)
           at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:305)
           at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
           at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)
           at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)
           at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)
           at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)
           at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126)
           at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366)
           at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:765)
           at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:683)
           at java.lang.Thread.run(Thread.java:748)
   Caused by: java.net.UnknownHostException: chelan-dev-mock-faces.s3.us-west-2.amazonaws.com
           at java.net.InetAddress.getAllByName0(InetAddress.java:1281)
           at java.net.InetAddress.getAllByName(InetAddress.java:1193)
           at java.net.InetAddress.getAllByName(InetAddress.java:1127)
           at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.SystemDefaultDnsResolver.resolve(SystemDefaultDnsResolver.java:27)
           at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.DelegatingDnsResolver.resolve(DelegatingDnsResolver.java:38)
           at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:112)
           at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:374)
           at sun.reflect.GeneratedMethodAccessor294.invoke(Unknown Source)
           at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
           at java.lang.reflect.Method.invoke(Method.java:498)
           at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.conn.ClientConnectionManagerFactory$Handler.invoke(ClientConnectionManagerFactory.java:76)
           at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.conn.$Proxy27.connect(Unknown Source)
           at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393)
           at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
           at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
           at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
           at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
           at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
           at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.apache.client.impl.SdkHttpClient.execute(SdkHttpClient.java:72)
           at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1311)
           at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1127)
           ... 62 more
   20/08/03 10:00:44 WARN AbstractConnector:
   java.io.IOException: Too many open files
           at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
           at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:419)
           at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:247)
           at org.spark_project.jetty.server.ServerConnector.accept(ServerConnector.java:397)
           at org.spark_project.jetty.server.AbstractConnector$Acceptor.run(AbstractConnector.java:601)
           at org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
           at org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
           at java.lang.Thread.run(Thread.java:748)
   ```
   
   





[GitHub] [hudi] Ares-W commented on issue #1913: [SUPPORT][MOR]Too many open files on IOException and Crash

Posted by GitBox <gi...@apache.org>.
Ares-W commented on issue #1913:
URL: https://github.com/apache/hudi/issues/1913#issuecomment-669866341


   Maybe https://issues.apache.org/jira/browse/PARQUET-783 causes this exception.





[GitHub] [hudi] luffyd commented on issue #1913: [SUPPORT][MOR]Too many open files on IOException and Crash

Posted by GitBox <gi...@apache.org>.
luffyd commented on issue #1913:
URL: https://github.com/apache/hudi/issues/1913#issuecomment-670211630


   From the Spark Environment tab, the parquet version seems to be this:
   /mnt2/yarn/usercache/hadoop/appcache/application_1596743154329_0001/container_1596743154329_0001_01_000001/__spark_libs__/parquet-format-2.4.0.jar





[GitHub] [hudi] luffyd commented on issue #1913: [SUPPORT][MOR]Too many open files on IOException and Crash

Posted by GitBox <gi...@apache.org>.
luffyd commented on issue #1913:
URL: https://github.com/apache/hudi/issues/1913#issuecomment-670385009


   It is this; it seems to be the latest. This is whatever ships with AWS EMR:
   /mnt2/yarn/usercache/hadoop/appcache/application_1596743154329_0001/container_1596743154329_0001_01_000001/__spark_libs__/parquet-hadoop-1.10.1-spark-amzn-1.jar
   
   /mnt2/yarn/usercache/hadoop/appcache/application_1596743154329_0001/container_1596743154329_0001_01_000001/__spark_libs__/parquet-hadoop-bundle-1.6.0.jar
   
   





[GitHub] [hudi] bvaradar commented on issue #1913: [SUPPORT][MOR]Too many open files on IOException and Crash

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #1913:
URL: https://github.com/apache/hudi/issues/1913#issuecomment-669661304


   Are you using the 0.5.3 version of Hudi or master?
   
   Hudi uses driver-side caching of file listings to prevent executors from repeatedly making duplicate file-listing calls; this is controlled by the config "hoodie.embed.timeline.server".
   
   This is not expected. Are you seeing exceptions when writing the data? Are you seeing successful commits? Can you turn on INFO-level logging and provide the logs here? Can you also list the timeline folder (.hoodie) and attach it here?
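   If it helps to rule the embedded timeline server in or out, it can be switched off for a write, in which case executors fall back to listing the filesystem themselves. A minimal sketch, reusing the write options from the report above; `basePath` stands in for the real table path:
   
   ```
   // Hedged sketch: disable the embedded timeline server for one write to
   // see whether the FileSystemViewHandler traffic is tied to the fd leak.
   inputDF
     .write
     .format("org.apache.hudi")
     .option(HoodieWriteConfig.TABLE_NAME, "facestest")
     .option("hoodie.embed.timeline.server", "false") // defaults to true
     .mode(SaveMode.Append)
     .save(basePath) // placeholder for getHudiPath(spark)
   ```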
   





[GitHub] [hudi] bvaradar commented on issue #1913: [SUPPORT][MOR]Too many open files on IOException and Crash

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #1913:
URL: https://github.com/apache/hudi/issues/1913#issuecomment-670015540


   @luffyd : I spent some time trying to understand your use-case.
   
   To your question: Hudi needs to list partitions in order to figure out the list of valid files that constitute the latest snapshot. It looks like your use-case is such that you are writing to a lot of partitions, and Hudi needs to list all of them to perform the write. I did check the code, and I don't think the leak is coming from Hudi. Can you look at the parquet version being used in your runtime, as @Ares-W suggested?
   
   On a different note, regarding the looping: are you writing the same data to Hudi again and again? If not, have you considered looking at Spark Structured Streaming? I do see occasional compactions. With the latest master, we have added async compaction support for Structured Streaming.
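   For reference, a minimal sketch of what the Structured Streaming route could look like, assuming a streaming DataFrame `streamDF` with the schema used above; the async-compaction key is the one added on master (assumption: `hoodie.datasource.compaction.async.enable`), and the paths are placeholders:
   
   ```
   // Hedged sketch, not from this thread: continuous upserts via Structured
   // Streaming instead of a batch write inside a loop.
   streamDF
     .writeStream
     .format("org.apache.hudi")
     .option(HoodieWriteConfig.TABLE_NAME, "facestest")
     .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
     .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "nodeId")
     .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "partitionKey")
     .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "ts")
     .option("hoodie.datasource.compaction.async.enable", "true") // async compaction on master; assumed key
     .option("checkpointLocation", "/tmp/hudi-checkpoints") // placeholder
     .outputMode("append")
     .start(basePath) // placeholder table path
   ```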





[GitHub] [hudi] luffyd commented on issue #1913: [SUPPORT][MOR]Too many open files on IOException and Crash

Posted by GitBox <gi...@apache.org>.
luffyd commented on issue #1913:
URL: https://github.com/apache/hudi/issues/1913#issuecomment-668679565


   A Spark-side configuration question: I have been using client mode; does it make a difference to run in cluster mode?





[GitHub] [hudi] luffyd commented on issue #1913: [SUPPORT][MOR]Too many open files on IOException and Crash

Posted by GitBox <gi...@apache.org>.
luffyd commented on issue #1913:
URL: https://github.com/apache/hudi/issues/1913#issuecomment-670039732


   Thanks for the input @bvaradar 
   The "Too many open files" IOException also seems to be correlated with having 2 GB as the max file limit.
   Will confirm the parquet version.
   
   Regarding the tests, you are correct: the write is touching all 1000 partitions. I am simulating the prod write pattern.
   Will check on the Spark Structured Streaming options.
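   One way to confirm that correlation is to watch the driver's file-descriptor usage while the loop runs. A diagnostic sketch using standard Linux tools; `<pid>` stands for the driver process id:
   
   ```
   # Soft limit on open files for new processes in this shell
   ulimit -n
   # Effective limit for the running driver process
   grep 'open files' /proc/<pid>/limits
   # Current number of open descriptors (regular files plus sockets)
   ls /proc/<pid>/fd | wc -l
   # Break open descriptors down by type to see what is accumulating
   lsof -p <pid> | awk '{print $5}' | sort | uniq -c | sort -rn | head
   ```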





[GitHub] [hudi] bvaradar commented on issue #1913: [SUPPORT][MOR]Too many open files on IOException and Crash

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #1913:
URL: https://github.com/apache/hudi/issues/1913#issuecomment-670356269


   Please check the version of parquet-hadoop





[GitHub] [hudi] luffyd commented on issue #1913: [SUPPORT][MOR]Too many open files on IOException and Crash

Posted by GitBox <gi...@apache.org>.
luffyd commented on issue #1913:
URL: https://github.com/apache/hudi/issues/1913#issuecomment-669674913


   I have built jars from the master branch; the latest commit at the time of the build was:
   ```
   commit 8c4ff185f1752b5041c4e66ac595bd90c2693137 (HEAD -> master, 0r)
   Author: mabin001 <ma...@163.com>
   Date:   Tue Jul 7 19:10:16 2020 +0800
   
       [HUDI-1064]Trim hoodie table name (#1805)
   ```
   
   I did see successful commits; after a certain point in time it crashed. I tried to look at the Spark UI; it did not load and was slow.
   Moving from client mode to cluster mode, I am not noticing this issue. But I am curious what the driver is doing by accessing the filesystem, even when there is a cache. Why does the driver need to look at these files?
   
   s3://dev-anushb-emr/publicLogs/tooManyFiles-stderr.gz -- Please ignore log lines containing "luffyd"; I added them to figure out the S3 call pattern.
   s3://dev-anushb-emr/publicLogs/tooManyFiles-hoodieLog





[GitHub] [hudi] kimmazhenxin commented on issue #1913: [SUPPORT][MOR]Too many open files on IOException and Crash

Posted by GitBox <gi...@apache.org>.
kimmazhenxin commented on issue #1913:
URL: https://github.com/apache/hudi/issues/1913#issuecomment-674878876


   > [full issue description quoted verbatim -- see the original report at the top of this thread]
   
   @luffyd Did you solve this problem? I have the same problem.





[GitHub] [hudi] luffyd edited a comment on issue #1913: [SUPPORT][MOR]Too many open files on IOException and Crash

Posted by GitBox <gi...@apache.org>.
luffyd edited a comment on issue #1913:
URL: https://github.com/apache/hudi/issues/1913#issuecomment-669674913


   Same comment as above, with the log links updated to working HTTPS URLs:
   
   https://dev-anushb-emr.s3.amazonaws.com/publicLogs/tooManyFiles-stderr.gz -- Please ignore log lines containing luffyd, I have added them to figure the s3 call pattern
   https://dev-anushb-emr.s3.amazonaws.com/publicLogs/tooManyFiles-hoodieLog





[GitHub] [hudi] bvaradar closed issue #1913: [SUPPORT][MOR]Too many open files on IOException and Crash

Posted by GitBox <gi...@apache.org>.
bvaradar closed issue #1913:
URL: https://github.com/apache/hudi/issues/1913


   





[GitHub] [hudi] luffyd commented on issue #1913: [SUPPORT][MOR]Too many open files on IOException and Crash

Posted by GitBox <gi...@apache.org>.
luffyd commented on issue #1913:
URL: https://github.com/apache/hudi/issues/1913#issuecomment-677888677


   We started executing the EMR job in cluster mode and are not seeing this issue now.
   Is your job running in client mode or cluster mode?
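   For anyone comparing the two setups, a sketch of the deploy-mode switch; the class and jar names are placeholders, not from this thread. In cluster mode the driver (and the embedded timeline server it hosts) runs in a YARN container on a core node rather than on the EMR master node:
   
   ```
   # Hedged sketch: run the same job with the driver inside the cluster.
   spark-submit \
     --master yarn \
     --deploy-mode cluster \
     --class com.example.HudiUpsertLoop \
     s3://your-bucket/jars/hudi-upsert-test.jar
   ```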





[GitHub] [hudi] bvaradar commented on issue #1913: [SUPPORT][MOR]Too many open files on IOException and Crash

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #1913:
URL: https://github.com/apache/hudi/issues/1913#issuecomment-691739504


   There is a WIP PR (https://github.com/apache/hudi/pull/2069) which addresses early cleanup of local files that are opened. This is the closest thing I can suspect that is related to Hudi; otherwise, this is not a Hudi issue per se. Closing this ticket, as there is a related PR and a Jira for the known issue.

