Posted to user@ignite.apache.org by joewang <me...@joewang.net> on 2017/06/01 00:06:39 UTC

Re: igfs-meta behavior when node restarts

A bug to track this would be great. I've also noticed this behavior occur
without nodes rejoining, so node restarts may not be the root cause, though
they do seem to make the loop more likely to trigger. I've tried the
following configurations, and they all exhibit this issue:

- 1.9 with igfs-meta replicated
- 2.0 with igfs-meta replicated
- 2.0 with igfs-meta partitioned + 1 backup

I don't think there's anything unusual about my configuration; my workload is
a set of Spark jobs running at constant intervals.
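
For reference, here is roughly what the meta cache variants above look like
when expressed through the Java API rather than Spring XML (a sketch against
the Ignite 2.0 API; on 1.9, if I recall correctly, the meta cache is wired up
by name through the regular cache configuration list instead):

    import org.apache.ignite.cache.CacheAtomicityMode;
    import org.apache.ignite.cache.CacheMode;
    import org.apache.ignite.configuration.CacheConfiguration;
    import org.apache.ignite.configuration.FileSystemConfiguration;

    FileSystemConfiguration igfsCfg = new FileSystemConfiguration();

    // Meta cache must be TRANSACTIONAL; statistics let JConsole show TX rates.
    CacheConfiguration metaCfg = new CacheConfiguration();
    metaCfg.setAtomicityMode(CacheAtomicityMode.TRANSACTIONAL);
    metaCfg.setStatisticsEnabled(true);

    // Variant tried on both 1.9 and 2.0: replicated meta cache.
    metaCfg.setCacheMode(CacheMode.REPLICATED);

    // Variant tried on 2.0: partitioned meta cache with one backup.
    // metaCfg.setCacheMode(CacheMode.PARTITIONED);
    // metaCfg.setBackups(1);

    igfsCfg.setMetaCacheConfiguration(metaCfg);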



--
View this message in context: http://apache-ignite-users.70518.x6.nabble.com/igfs-meta-behavior-when-node-restarts-tp13155p13299.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.

Re: igfs-meta behavior when node restarts

Posted by joewang <me...@joewang.net>.
It's been a little difficult to get a repro, since the behavior only shows up
under more activity than a two-node setup generates. However, I do have two
observations:

1. The Hadoop FileOutputCommitter has two modes, 1 and 2: mode 1 performs all
the renames from a single node, whereas mode 2 performs them from all the
nodes with executors. Switching from mode 2 to mode 1 reduced (but did not
eliminate) the occurrences of this behavior.
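
(For anyone following along: the mode is controlled by the standard Hadoop
property mapreduce.fileoutputcommitter.algorithm.version. A minimal Java
sketch of pinning it to mode 1 from a Spark job; the session setup here is
illustrative, not my actual job code:)

    import org.apache.spark.sql.SparkSession;

    SparkSession spark = SparkSession.builder().appName("igfs-job").getOrCreate();

    // Algorithm v1 funnels all commit-time renames through the driver,
    // instead of letting every executor rename its own task output (v2).
    spark.sparkContext().hadoopConfiguration()
        .set("mapreduce.fileoutputcommitter.algorithm.version", "1");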

2. Around the time the behavior happens, I see errors like these in the IGFS
Hadoop clients:

17/06/10 06:33:41 WARN TaskSetManager: Lost task 6.0 in stage 4.0 (TID 15, BN4SCH102092522, executor 6): org.apache.spark.SparkException: Task failed while writing rows
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: class org.apache.ignite.internal.IgniteInterruptedCheckedException: null
    at org.apache.ignite.internal.processors.hadoop.impl.igfs.HadoopIgfsUtils.cast(HadoopIgfsUtils.java:133)
    at org.apache.ignite.internal.processors.hadoop.impl.igfs.HadoopIgfsUtils.cast(HadoopIgfsUtils.java:105)
    at org.apache.ignite.internal.processors.hadoop.impl.igfs.HadoopIgfsOutProc.closeStream(HadoopIgfsOutProc.java:446)
    at org.apache.ignite.internal.processors.hadoop.impl.igfs.HadoopIgfsOutputStream.close(HadoopIgfsOutputStream.java:142)
    at java.io.FilterOutputStream.close(FilterOutputStream.java:159)
    at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
    at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
    at org.apache.hadoop.hive.ql.io.orc.WriterImpl.close(WriterImpl.java:2429)
    at org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat$OrcRecordWriter.close(OrcOutputFormat.java:106)
    at org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat$OrcRecordWriter.close(OrcOutputFormat.java:91)
    at org.apache.spark.sql.hive.orc.OrcOutputWriter.close(OrcFileFormat.scala:251)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.releaseResources(FileFormatWriter.scala:252)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:191)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193)
    ... 8 more
    Suppressed: java.io.IOException: class org.apache.ignite.internal.IgniteInterruptedCheckedException: null
        at org.apache.ignite.internal.processors.hadoop.impl.igfs.HadoopIgfsUtils.cast(HadoopIgfsUtils.java:133)
        at org.apache.ignite.internal.processors.hadoop.impl.igfs.HadoopIgfsWrapper.withReconnectHandling(HadoopIgfsWrapper.java:329)
        at org.apache.ignite.internal.processors.hadoop.impl.igfs.HadoopIgfsWrapper.delete(HadoopIgfsWrapper.java:163)
        at org.apache.ignite.hadoop.fs.v1.IgniteHadoopFileSystem.delete(IgniteHadoopFileSystem.java:838)
        at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.abortTask(FileOutputCommitter.java:615)
        at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.abortTask(FileOutputCommitter.java:604)
        at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.abortTask(HadoopMapReduceCommitProtocol.scala:153)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$1.apply$mcV$sp(FileFormatWriter.scala:198)
        at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1350)
        ... 9 more
    Caused by: class org.apache.ignite.internal.IgniteInterruptedCheckedException: null
        at org.apache.ignite.internal.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:177)
        at org.apache.ignite.internal.util.future.GridFutureAdapter.get(GridFutureAdapter.java:119)
        at org.apache.ignite.internal.processors.hadoop.impl.igfs.HadoopIgfsOutProc.delete(HadoopIgfsOutProc.java:259)
        at org.apache.ignite.internal.processors.hadoop.impl.igfs.HadoopIgfsWrapper$6.apply(HadoopIgfsWrapper.java:166)
        at org.apache.ignite.internal.processors.hadoop.impl.igfs.HadoopIgfsWrapper$6.apply(HadoopIgfsWrapper.java:163)
        at org.apache.ignite.internal.processors.hadoop.impl.igfs.HadoopIgfsWrapper.withReconnectHandling(HadoopIgfsWrapper.java:312)
        ... 16 more
    Caused by: java.lang.InterruptedException
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1302)
        at org.apache.ignite.internal.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:161)
        ... 21 more
Caused by: class org.apache.ignite.internal.IgniteInterruptedCheckedException: null
    at org.apache.ignite.internal.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:177)
    at org.apache.ignite.internal.util.future.GridFutureAdapter.get(GridFutureAdapter.java:119)
    at org.apache.ignite.internal.processors.hadoop.impl.igfs.HadoopIgfsOutProc.closeStream(HadoopIgfsOutProc.java:443)
    ... 21 more
Caused by: java.lang.InterruptedException
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:998)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
    at org.apache.ignite.internal.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:161)
    ... 23 more




17/06/10 06:32:59 WARN TaskSetManager: Lost task 15.1 in stage 4.0 (TID 291, BN4SCH102092623, executor 5): java.io.IOException: class org.apache.ignite.internal.IgniteInterruptedCheckedException: null
    at org.apache.ignite.internal.processors.hadoop.impl.igfs.HadoopIgfsUtils.cast(HadoopIgfsUtils.java:133)
    at org.apache.ignite.internal.processors.hadoop.impl.igfs.HadoopIgfsUtils.cast(HadoopIgfsUtils.java:105)
    at org.apache.ignite.internal.processors.hadoop.impl.igfs.HadoopIgfsInputStream.read(HadoopIgfsInputStream.java:198)
    at java.io.DataInputStream.readFully(DataInputStream.java:195)
    at org.apache.hadoop.hive.ql.io.orc.RecordReaderUtils.readDiskRanges(RecordReaderUtils.java:272)
    at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readPartialDataStreams(RecordReaderImpl.java:964)
    at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readStripe(RecordReaderImpl.java:793)
    at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:986)
    at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1019)
    at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.<init>(RecordReaderImpl.java:205)
    at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.rowsOptions(ReaderImpl.java:539)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.createReaderFromFile(OrcInputFormat.java:230)
    at org.apache.hadoop.hive.ql.io.orc.SparkOrcNewRecordReader.<init>(SparkOrcNewRecordReader.java:48)
    at org.apache.spark.sql.hive.orc.OrcFileFormat$$anonfun$buildReader$2.apply(OrcFileFormat.scala:155)
    at org.apache.spark.sql.hive.orc.OrcFileFormat$$anonfun$buildReader$2.apply(OrcFileFormat.scala:129)
    at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:138)
    at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:122)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:150)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:102)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
    at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: class org.apache.ignite.internal.IgniteInterruptedCheckedException: null
    at org.apache.ignite.internal.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:177)
    at org.apache.ignite.internal.util.future.GridFutureAdapter.get(GridFutureAdapter.java:119)
    at org.apache.ignite.internal.processors.hadoop.impl.igfs.HadoopIgfsInputStream.read(HadoopIgfsInputStream.java:186)
    ... 31 more
Caused by: java.lang.InterruptedException
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:998)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
    at org.apache.ignite.internal.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:161)
    ... 33 more



--
View this message in context: http://apache-ignite-users.70518.x6.nabble.com/igfs-meta-behavior-when-node-restarts-tp13155p13599.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.

Re: igfs-meta behavior when node restarts

Posted by Vladimir Ozerov <vo...@gridgain.com>.
Hi,

Is it possible to create an isolated reproducer that will show the gradual
increase of failed operations over time?

On Wed, Jun 7, 2017 at 6:39 AM, joewang <me...@joewang.net> wrote:

> Would it be helpful if I uploaded logs collected from the cluster? I can
> point to the time when the behavior begins to occur.
>
> My concern with this is that the recurrent transactions (removing entries
> from the TRASH) and rollbacks are being performed over and over again for
> the same set of entries, and over time this set continues to grow. In my
> observation, over the course of 3-4 days the initial 4 TX op/s grows to
> ~200 TX op/s across all my nodes. I'm assuming that if this grows
> unbounded, cluster performance and stability will eventually be affected.
>
>
>
> --
> View this message in context: http://apache-ignite-users.70518.x6.nabble.com/igfs-meta-behavior-when-node-restarts-tp13155p13446.html
> Sent from the Apache Ignite Users mailing list archive at Nabble.com.
>

Re: igfs-meta behavior when node restarts

Posted by joewang <me...@joewang.net>.
Would it be helpful if I uploaded logs collected from the cluster? I can
point to the time when the behavior begins to occur.

My concern with this is that the recurrent transactions (removing entries
from the TRASH) and rollbacks are being performed over and over again for
the same set of entries, and over time this set continues to grow. In my
observation, over the course of 3-4 days the initial 4 TX op/s grows to
~200 TX op/s across all my nodes. I'm assuming that if this grows unbounded,
cluster performance and stability will eventually be affected.
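
(The TX op/s numbers above come from cache metrics. A rough Java sketch of
the kind of polling I do is below; it assumes the meta cache is visible as a
regular named cache called "igfs-meta" with statistics enabled, which holds
for my 1.9 setup; on 2.0 the cache may be internal and ignite.cache() may
return null here.)

    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteCache;
    import org.apache.ignite.cache.CacheMetrics;

    static void watchMetaTx(Ignite ignite) throws InterruptedException {
        IgniteCache<?, ?> meta = ignite.cache("igfs-meta"); // null if internal
        long prevCommits = 0;
        long prevRollbacks = 0;

        while (meta != null) {
            CacheMetrics m = meta.metrics();
            long commits = m.getCacheTxCommits();
            long rollbacks = m.getCacheTxRollbacks();

            // Per-second rates over the last one-minute window.
            System.out.printf("tx commits/s=%d rollbacks/s=%d%n",
                (commits - prevCommits) / 60, (rollbacks - prevRollbacks) / 60);

            prevCommits = commits;
            prevRollbacks = rollbacks;
            Thread.sleep(60_000);
        }
    }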



--
View this message in context: http://apache-ignite-users.70518.x6.nabble.com/igfs-meta-behavior-when-node-restarts-tp13155p13446.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.

Re: igfs-meta behavior when node restarts

Posted by "Ivan V." <iv...@gridgain.com>.
I tried to reproduce this with a 2-node Ignite cluster running on top of HDFS.
I tried action sequences like this: put some files/dirs, move/delete some
files/dirs, then restart one of the nodes.
The metrics were observed via JConsole.
I used Ignite 2.1.
Unfortunately, I was not able to reproduce the issue.
My configs look like this:

                    <property name="metaCacheConfiguration">
                         <bean
class="org.apache.ignite.configuration.CacheConfiguration">
                              <property name="atomicityMode"
value="TRANSACTIONAL"/>
                              <property name="cacheMode"
value="PARTITIONED"/>
                              <property name="backups" value="1"/>
                              <property name="statisticsEnabled"
value="true"/>
                         </bean>
                    </property>

                    <property name="dataCacheConfiguration">
                         <bean
class="org.apache.ignite.configuration.CacheConfiguration">
                              <property name="atomicityMode"
value="TRANSACTIONAL"/>
                              <property name="cacheMode"
value="PARTITIONED"/>
                              <property name="backups" value="0"/>
                              <property name="statisticsEnabled"
value="true"/>
                         </bean>
                    </property>

....
                    <property name="secondaryFileSystem">
                        <bean
class="org.apache.ignite.hadoop.fs.IgniteHadoopIgfsSecondaryFileSystem">
                            <property name="fileSystemFactory">
                                <bean
class="org.apache.ignite.hadoop.fs.CachingHadoopFileSystemFactory">
                                    <property name="uri"
value="hdfs://localhost:9000/"/>
                                </bean>
                            </property>
                        </bean>
                    </property>
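
The action sequence itself was along these lines (a rough Java sketch against
the public IgniteFileSystem API; the paths and the IGFS instance name are
illustrative):

    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteFileSystem;
    import org.apache.ignite.igfs.IgfsOutputStream;
    import org.apache.ignite.igfs.IgfsPath;

    static void exercise(Ignite ignite) throws Exception {
        IgniteFileSystem fs = ignite.fileSystem("igfs");

        // Put some files/dirs.
        fs.mkdirs(new IgfsPath("/test/dir"));
        try (IgfsOutputStream out = fs.create(new IgfsPath("/test/dir/file1"), true)) {
            out.write(new byte[1024]);
        }

        // Move/delete some files/dirs.
        fs.rename(new IgfsPath("/test/dir/file1"), new IgfsPath("/test/file1"));
        fs.delete(new IgfsPath("/test/dir"), true);

        // ...then restart one of the nodes and watch the meta cache
        // transaction metrics in JConsole.
    }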

Re: igfs-meta behavior when node restarts

Posted by Vladimir Ozerov <vo...@gridgain.com>.
Hi,

Ignite performs deletes in a "soft" fashion:
1) When a "remove" command is executed, we propagate it to the secondary file
system;
2) For the IGFS meta cache, we do not remove all records immediately, but
rather execute a single "move" operation that moves the removed tree to a
hidden "TRASH" folder;
3) The "TRASH" folder is cleared periodically - this is what you see in the
logs.

Removal of trash content should not interfere with normal operations in any
way, nor should it cause any performance issues. Do you observe a real
slowdown, or are you only concerned by the metrics summary?
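
To illustrate from the client side (a minimal Java sketch; the IGFS name and
path are hypothetical):

    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteFileSystem;
    import org.apache.ignite.igfs.IgfsPath;

    static void softDelete(Ignite ignite) {
        IgniteFileSystem fs = ignite.fileSystem("igfs");

        // Returns quickly: the subtree is re-linked under the hidden TRASH
        // folder in the meta cache rather than removed entry by entry.
        fs.delete(new IgfsPath("/data/path/output/_temporary"), true);

        // The entries under TRASH are then purged asynchronously by a
        // background worker; those purge transactions are what appear in
        // the periodic log output.
    }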

Vladimir.


On Thu, Jun 1, 2017 at 8:39 PM, joewang <me...@joewang.net> wrote:

> The reads are from a non-IGFS source, but the writes are through IGFS. Spark
> uses Hadoop's FileOutputCommitter to write the output to IGFS. I think what
> happens is essentially:
>
> - During processing, temporary files are written by each of n executors
> running on different nodes to some /data/path/output/_temporary/part-n...
> - When the job completes, each of the executors performs the final "commit"
> by renaming the files under /data/path/output/_temporary/part-n... to
> /data/path/output/part-n... and deletes the _temporary directory.
>
>
>
> --
> View this message in context: http://apache-ignite-users.70518.x6.nabble.com/igfs-meta-behavior-when-node-restarts-tp13155p13322.html
> Sent from the Apache Ignite Users mailing list archive at Nabble.com.
>

Re: igfs-meta behavior when node restarts

Posted by joewang <me...@joewang.net>.
The reads are from a non-IGFS source, but the writes are through IGFS. Spark
uses Hadoop's FileOutputCommitter to write the output to IGFS. I think what
happens is essentially:

- During processing, temporary files are written by each of n executors
running on different nodes to some /data/path/output/_temporary/part-n...
- When the job completes, each of the executors performs the final "commit"
by renaming the files under /data/path/output/_temporary/part-n... to
/data/path/output/part-n... and deletes the _temporary directory (a sketch
of this commit step follows below).
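
Here is roughly what that commit step does against the file system, as a
minimal Java sketch (the igfs:// URI and paths are illustrative, not my
actual config):

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    static void commitTask(int n) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("igfs://igfs@localhost/"),
            new Configuration());

        // Rename the task's temporary output to its final location...
        fs.rename(new Path("/data/path/output/_temporary/part-" + n),
                  new Path("/data/path/output/part-" + n));

        // ...and remove the temporary directory once all tasks are committed.
        fs.delete(new Path("/data/path/output/_temporary"), true);
    }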



--
View this message in context: http://apache-ignite-users.70518.x6.nabble.com/igfs-meta-behavior-when-node-restarts-tp13155p13322.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.

Re: igfs-meta behavior when node restarts

Posted by Ivan Veselovsky <iv...@gridgain.com>.
I suppose the file system contents and the sequence of operations on them may
play a role here.
What is the sequence of operations in your case, at least approximately?



--
View this message in context: http://apache-ignite-users.70518.x6.nabble.com/igfs-meta-behavior-when-node-restarts-tp13155p13308.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.