Posted to dev@parquet.apache.org by "Alex Levenson (JIRA)" <ji...@apache.org> on 2015/06/25 01:03:04 UTC

[jira] [Resolved] (PARQUET-284) Should use ConcurrentHashMap instead of HashMap in ParquetMetadataConverter

     [ https://issues.apache.org/jira/browse/PARQUET-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Levenson resolved PARQUET-284.
-----------------------------------
       Resolution: Fixed
    Fix Version/s: 2.0.0

Issue resolved by pull request 220
[https://github.com/apache/parquet-mr/pull/220]
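
The actual change is in the pull request linked above; as a rough, illustrative sketch of the general direction (the identifiers below are hypothetical, not necessarily those used in the PR), the per-instance HashMap cache becomes a ConcurrentHashMap populated with an atomic putIfAbsent, so concurrent readers and writers stay safe:

    import java.util.Collections;
    import java.util.HashSet;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    // Illustrative sketch only (hypothetical names): a thread-safe memoization
    // cache. putIfAbsent keeps the mapping consistent even when two threads
    // convert the same key at the same time, and lookups never corrupt state.
    final class EncodingSetCache<K, V> {
      private final ConcurrentHashMap<K, Set<V>> cache =
          new ConcurrentHashMap<K, Set<V>>();

      Set<V> intern(K key, Set<V> converted) {
        Set<V> immutable = Collections.unmodifiableSet(new HashSet<V>(converted));
        Set<V> existing = cache.putIfAbsent(key, immutable);
        // If another thread won the race, reuse its value so callers share one set.
        return existing != null ? existing : immutable;
      }
    }

Worst case, two threads compute the same conversion twice, but the cache itself can never be corrupted the way a bare HashMap can.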

> Should use ConcurrentHashMap instead of HashMap in ParquetMetadataConverter
> ---------------------------------------------------------------------------
>
>                 Key: PARQUET-284
>                 URL: https://issues.apache.org/jira/browse/PARQUET-284
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>    Affects Versions: 1.6.1
>         Environment: Spark 1.2.1, CentOS 6.4
>            Reporter: Tony Yan
>            Assignee: Alex Levenson
>             Fix For: 2.0.0
>
>
> When using Parquet in a Spark environment, tasks sometimes hang with the following thread dump:
> "Executor task launch worker-0" daemon prio=10 tid=0x000000004073d000 nid=0xd6c5 runnable [0x00007ff3fda40000]
> java.lang.Thread.State: RUNNABLE
> at java.util.HashMap.get(HashMap.java:303)
> at parquet.format.converter.ParquetMetadataConverter.fromFormatEncodings(ParquetMetadataConverter.java:218)
> at parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:543)
> at parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:520)
> at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:426)
> at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:381)
> at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:161)
> at parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138)
> at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:135)
> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:107)
> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
> at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
> at org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
> at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:56)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
> at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at java.lang.Thread.run(Thread.java:662)
> From the source code of ParquetMetadataConverter:
> private Map<EncodingList, Set<parquet.column.Encoding>> encodingLists = new HashMap<EncodingList, Set<parquet.column.Encoding>>();
> It uses a HashMap instead of a ConcurrentHashMap. HashMap is not thread-safe: a racy resize can corrupt its internal bucket chains, after which get() can spin in an infinite loop, which matches the RUNNABLE thread stuck in HashMap.get above. The map should be changed to a ConcurrentHashMap.
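
To illustrate the failure mode the reporter describes: on older JVMs (such as the Java 6 runtime visible in the stack trace), a racy HashMap resize can link bucket entries into a cycle, after which get() walks that cycle forever while remaining RUNNABLE. A hypothetical stress harness, not code from parquet-mr, that can trigger the race:

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical demo: hammer an unsynchronized HashMap from two threads.
    // The hang is timing-dependent; any given run may also finish cleanly.
    public class HashMapRaceDemo {
      public static void main(String[] args) throws InterruptedException {
        final Map<Integer, Integer> map = new HashMap<Integer, Integer>();
        Runnable worker = new Runnable() {
          public void run() {
            for (int i = 0; i < 1000000; i++) {
              map.put(i, i); // concurrent puts race on resize
              map.get(i);    // can spin forever on a corrupted bucket chain
            }
          }
        };
        Thread t1 = new Thread(worker);
        Thread t2 = new Thread(worker);
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println("completed without hanging this time");
      }
    }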



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)