Posted to issues@spark.apache.org by "Lantao Jin (JIRA)" <ji...@apache.org> on 2016/11/02 13:39:58 UTC

[jira] [Comment Edited] (SPARK-18227) Parquet file stream sink creates a hidden directory "_spark_metadata" that causes DataFrame reads to fail

    [ https://issues.apache.org/jira/browse/SPARK-18227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15628997#comment-15628997 ] 

Lantao Jin edited comment on SPARK-18227 at 11/2/16 1:39 PM:
-------------------------------------------------------------

hadoop fs -ls hdfs:///path/out
Found 3 items
-rw-r--r--   3 hdfs hdfs        962 2016-11-02 03:46 hdfs:///path/out/095ed2d6-f9d3-4ecf-b0b7-48d0d6173cf8
-rw-r--r--   3 hdfs hdfs        956 2016-11-02 04:00 hdfs:///path/out/626d1b92-cd28-43dc-b7cd-09c0b31ff3e3
drwxr-xr-x   - hdfs hdfs          0 2016-11-02 04:00 hdfs:///path/out/_spark_metadata

The parquet files in the out path are named with random UUIDs (that's done by the writeStream framework), so I can't load them with load("/path/out/*.parquet").
And load("/path/out/*") also picks up the files inside the hidden "_spark_metadata" directory.
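The mismatch between the two read patterns can be sketched as follows. This is a minimal illustration, not Spark's actual listing code; the entry names are taken from the `hadoop fs -ls` output above:

```scala
// The listing below stands in for the hadoop fs -ls output of /path/out.
val listing = Seq(
  "095ed2d6-f9d3-4ecf-b0b7-48d0d6173cf8",
  "626d1b92-cd28-43dc-b7cd-09c0b31ff3e3",
  "_spark_metadata")

// A glob such as load("/path/out/*") matches every entry, hidden or not:
val globbed = listing // still includes _spark_metadata

// What the reader would need instead: drop underscore/dot-prefixed names
// before handing paths to the Parquet footer reader (sketched as a filter).
val dataFiles = listing.filterNot(n => n.startsWith("_") || n.startsWith("."))
```

With the glob, `_spark_metadata` reaches the Parquet reader and triggers the failure shown below; with the filter, only the two data files remain.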



> Parquet file stream sink creates a hidden directory "_spark_metadata" that causes DataFrame reads to fail
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-18227
>                 URL: https://issues.apache.org/jira/browse/SPARK-18227
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 2.0.1
>            Reporter: Lantao Jin
>
> When we set an output directory as a streaming sink with parquet format in Structured Streaming, all output parquet files are written to that directory as the streaming job runs. However, the sink also creates a hidden directory called "_spark_metadata" inside the output directory. If we then load the parquet files from the output directory with "load", it throws a RuntimeException and the task fails.
> {code:java}
> val stream = modifiedData.writeStream.format("parquet")
>   .option("checkpointLocation", "/path/ck/")
>   .start("/path/out/")
> val df1 = spark.read.format("parquet").load("/path/out/*")
> {code}
> {panel}
> 16/11/02 03:49:40 WARN TaskSetManager: Lost task 1.0 in stage 110.0 (TID 3131, cupid044.stratus.phx.ebay.com): java.lang.RuntimeException: hdfs:///path/out/_spark_metadata/0 is not a Parquet file (too small)
>         at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:412)
>         at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:385)
>         at org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:107)
>         at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:109)
>         at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:367)
>         at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:341)
>         at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:116)
>         at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
>         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown Source)
>         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source)
>         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
>         at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>         at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>         at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>         at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
> {panel}
> That's because the ParquetFileReader tries to read the metadata file as a Parquet file.
> I thought the cleanest fix would be to move the metadata directory to another path, but from the code in DataSource.scala, no path information is available other than the output directory to store it in. So skipping hidden files and paths may be a better way. But from the stack trace above, the failure happens in initialize() in SpecificParquetRecordReaderBase, which means the metadata files in the hidden directory have already been traversed by the caller (FileScanRDD). And there, no format information is available to decide whether to skip a hidden directory (or it is outside its authority).
> So, what is the best way to fix it?
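The "skip hidden files" option discussed above could look something like the sketch below. This is a hypothetical illustration, not Spark's actual implementation: a predicate applied during file listing that treats underscore- and dot-prefixed names as non-data paths, while keeping the Parquet summary files _metadata and _common_metadata, which legitimately start with an underscore.

```scala
// Hypothetical sketch of a hidden-path filter for file listing.
// Names starting with "_" or "." are excluded from the data-file set,
// except the Parquet summary files, which readers may still want.
object HiddenPathFilter {
  def shouldFilterOut(name: String): Boolean =
    (name.startsWith("_") || name.startsWith(".")) &&
      name != "_metadata" && name != "_common_metadata"
}
```

Applied at listing time (e.g. in the code that feeds FileScanRDD), such a filter would drop "_spark_metadata" before any format-specific reader sees it, which sidesteps the problem that SpecificParquetRecordReaderBase has no way to skip it later.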



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org