Posted to issues@spark.apache.org by "Dongjoon Hyun (Jira)" <ji...@apache.org> on 2019/11/03 18:41:00 UTC

[jira] [Commented] (SPARK-29735) DataSource V2 CSVDataSource leaks file system

    [ https://issues.apache.org/jira/browse/SPARK-29735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16965710#comment-16965710 ] 

Dongjoon Hyun commented on SPARK-29735:
---------------------------------------

Since this test suite has long caught file-leak issues in the ORC and Parquet data sources, I've been monitoring it carefully. The new Data Source V2 CSV source now also seems to hit failures in these test cases. Note that the failure is intermittent; it does not happen on every run.
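For context, the "Leaked filesystem connection" warnings in the log below come from Spark's test-only org.apache.spark.DebugFilesystem, which records a stack trace for every stream it opens and reports anything still open at suite teardown. Below is a minimal sketch of that pattern; the names TrackingFileSystem, openStreams, and assertNoOpenStreams are illustrative, not Spark's actual identifiers:

{code}
import java.util.concurrent.ConcurrentHashMap

import org.apache.hadoop.fs.{FSDataInputStream, Path, RawLocalFileSystem}

// Sketch of a leak-tracking FileSystem in the spirit of DebugFilesystem.
// It remembers the allocation stack of every opened stream; any entry that
// is still present at teardown is a leak, and its Throwable pinpoints the
// call site, exactly like the stack trace in the log below.
class TrackingFileSystem extends RawLocalFileSystem {
  private val openStreams = new ConcurrentHashMap[FSDataInputStream, Throwable]()

  override def open(f: Path, bufferSize: Int): FSDataInputStream = {
    val inner = super.open(f, bufferSize)
    openStreams.put(inner, new Throwable())
    // Wrap the stream so that a proper close() removes its entry.
    new FSDataInputStream(inner) {
      override def close(): Unit =
        try super.close() finally openStreams.remove(inner)
    }
  }

  // Invoked from test teardown: fail loudly if any stream was never closed.
  def assertNoOpenStreams(): Unit = {
    if (!openStreams.isEmpty) {
      val origin = openStreams.values().iterator().next()
      throw new IllegalStateException(
        s"${openStreams.size()} possibly leaked file stream(s)", origin)
    }
  }
}
{code}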

> DataSource V2 CSVDataSource leaks file system
> ---------------------------------------------
>
>                 Key: SPARK-29735
>                 URL: https://issues.apache.org/jira/browse/SPARK-29735
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Dongjoon Hyun
>            Priority: Major
>
> - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/113178/consoleFull
> {code}
> [info] FileBasedDataSourceSuite:
> [info] - Writing empty datasets should not fail - orc (309 milliseconds)
> [info] - Writing empty datasets should not fail - parquet (367 milliseconds)
> [info] - Writing empty datasets should not fail - csv (171 milliseconds)
> [info] - Writing empty datasets should not fail - json (130 milliseconds)
> [info] - Writing empty datasets should not fail - text (423 milliseconds)
> [info] - SPARK-23072 Write and read back unicode column names - orc (274 milliseconds)
> [info] - SPARK-23072 Write and read back unicode column names - parquet (318 milliseconds)
> [info] - SPARK-23072 Write and read back unicode column names - csv (358 milliseconds)
> [info] - SPARK-23072 Write and read back unicode column names - json (290 milliseconds)
> [info] - SPARK-15474 Write and read back non-empty schema with empty dataframe - orc (327 milliseconds)
> [info] - SPARK-15474 Write and read back non-empty schema with empty dataframe - parquet (334 milliseconds)
> [info] - SPARK-23271 empty RDD when saved should write a metadata only file - orc (273 milliseconds)
> [info] - SPARK-23271 empty RDD when saved should write a metadata only file - parquet (352 milliseconds)
> [info] - SPARK-23372 error while writing empty schema files using orc (29 milliseconds)
> [info] - SPARK-23372 error while writing empty schema files using parquet (15 milliseconds)
> [info] - SPARK-23372 error while writing empty schema files using csv (12 milliseconds)
> [info] - SPARK-23372 error while writing empty schema files using json (10 milliseconds)
> [info] - SPARK-23372 error while writing empty schema files using text (11 milliseconds)
> [info] - SPARK-22146 read files containing special characters using orc (256 milliseconds)
> [info] - SPARK-22146 read files containing special characters using parquet (380 milliseconds)
> [info] - SPARK-22146 read files containing special characters using csv (428 milliseconds)
> [info] - SPARK-22146 read files containing special characters using json (284 milliseconds)
> [info] - SPARK-22146 read files containing special characters using text (254 milliseconds)
> [info] - SPARK-23148 read files containing special characters using json with multiline enabled (557 milliseconds)
> [info] - SPARK-23148 read files containing special characters using csv with multiline enabled (424 milliseconds)
> [info] - Enabling/disabling ignoreMissingFiles using orc (1 second, 605 milliseconds)
> [info] - Enabling/disabling ignoreMissingFiles using parquet (1 second, 895 milliseconds)
> 09:26:51.342 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in stage 94.0 (TID 125, amp-jenkins-worker-04.amp, executor driver): TaskKilled (Stage cancelled)
> [info] - Enabling/disabling ignoreMissingFiles using csv (1 second, 672 milliseconds)
> 09:26:51.344 WARN org.apache.spark.DebugFilesystem: Leaked filesystem connection created at:
> java.lang.Throwable
> 	at org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:35)
> 	at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:69)
> 	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
> 	at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.initialize(LineRecordReader.java:85)
> 	at org.apache.spark.sql.execution.datasources.HadoopFileLinesReader.<init>(HadoopFileLinesReader.scala:65)
> 	at org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.readFile(CSVDataSource.scala:99)
> 	at org.apache.spark.sql.execution.datasources.v2.csv.CSVPartitionReaderFactory.buildReader(CSVPartitionReaderFactory.scala:68)
> 	at org.apache.spark.sql.execution.datasources.v2.FilePartitionReaderFactory.$anonfun$createReader$1(FilePartitionReaderFactory.scala:29)
> 	at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
> 	at org.apache.spark.sql.execution.datasources.v2.FilePartitionReader.getNextReader(FilePartitionReader.scala:109)
> 	at org.apache.spark.sql.execution.datasources.v2.FilePartitionReader.next(FilePartitionReader.scala:42)
> 	at org.apache.spark.sql.execution.datasources.v2.FilePartitionReader.next(FilePartitionReader.scala:95)
> 	at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:62)
> 	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
> 	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
> 	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
> 	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> 	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:726)
> 	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
> 	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
> 	at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1832)
> 	at org.apache.spark.rdd.RDD.$anonfun$count$1(RDD.scala:1227)
> 	at org.apache.spark.rdd.RDD.$anonfun$count$1$adapted(RDD.scala:1227)
> 	at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2135)
> 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> 	at org.apache.spark.scheduler.Task.run(Task.scala:127)
> 	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:455)
> 	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:458)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> 	at java.lang.Thread.run(Thread.java:748)
> {code}
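Reading the trace bottom-up: CSVPartitionReaderFactory.buildReader creates a HadoopFileLinesReader, whose constructor opens the underlying file via LineRecordReader.initialize. If that reader is not closed when the partition reader finishes, or when the task is killed mid-stage (note the TaskKilled warning just before the leak report), the stream stays open. A common remedy, sketched below under the assumption that it applies here, is to tie the reader's lifetime both to the PartitionReader's close() and to a task-completion listener; the class ClosingCsvPartitionReader and its wiring are illustrative, not Spark's actual fix:

{code}
import java.io.Closeable

import org.apache.spark.TaskContext
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.read.PartitionReader

// Illustrative partition reader that releases the underlying lines reader
// on every exit path. `linesReader` stands in for the HadoopFileLinesReader
// created in CSVPartitionReaderFactory.buildReader.
class ClosingCsvPartitionReader(
    linesReader: Closeable,
    rows: Iterator[InternalRow]) extends PartitionReader[InternalRow] {

  @volatile private var closed = false

  // Runs even when the task is killed mid-stage, so the file handle is
  // released on cancellation as well as on normal completion.
  Option(TaskContext.get()).foreach { ctx =>
    ctx.addTaskCompletionListener[Unit](_ => closeOnce())
  }

  private var current: InternalRow = _

  override def next(): Boolean =
    if (rows.hasNext) { current = rows.next(); true } else false

  override def get(): InternalRow = current

  // The normal path: closing the partition reader closes the file too.
  override def close(): Unit = closeOnce()

  // Guard against double close: both the listener and close() may fire
  // for the same reader.
  private def closeOnce(): Unit = synchronized {
    if (!closed) { closed = true; linesReader.close() }
  }
}
{code}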


