Posted to user@ignite.apache.org by Andrey Mashenkov <an...@gmail.com> on 2018/01/15 14:50:19 UTC

Re: Segmentation fault (JVM crash) while restoring memory on start with native persistence

Hi Arseny,

Have you had any success reproducing the issue and getting a stack trace?
Do you observe the same behavior on OracleJDK?

On Tue, Dec 26, 2017 at 2:43 PM, Andrey Mashenkov <
andrey.mashenkov@gmail.com> wrote:

> Hi Arseny,
>
> This looks like a known issue that is still unresolved [1],
> but we can't be sure it is the same issue, as there is no stack trace in
> the attached logs.
>
>
> [1] https://issues.apache.org/jira/browse/IGNITE-7278
>
> On Tue, Dec 26, 2017 at 12:54 PM, Arseny Kovalchuk <
> arseny.kovalchuk@synesis.ru> wrote:
>
>> Hi guys.
>>
>> We've successfully tested Ignite as an in-memory solution, and it showed
>> acceptable performance. But we cannot get the Ignite cluster to work stably
>> with native persistence enabled. The first error we got is a segmentation
>> fault (JVM crash) while restoring memory on start.
>>
>> [2017-12-22 11:11:51,992]  INFO [exchange-worker-#46%ignite-instance-0%]
>> org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager:
>> - Read checkpoint status [startMarker=/ignite-work-directory/db/ignite_instance_0/cp/1513938154201-8c574131-763d-4cfa-99b6-0ce0321d61ab-START.bin,
>> endMarker=/ignite-work-directory/db/ignite_instance_0/cp/1513932413840-55ea1713-8e9e-44cd-b51a-fcad8fb94de1-END.bin]
>> [2017-12-22 11:11:51,993]  INFO [exchange-worker-#46%ignite-instance-0%]
>> org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager:
>> - Checking memory state [lastValidPos=FileWALPointer [idx=391,
>> fileOffset=220593830, len=19573, forceFlush=false],
>> lastMarked=FileWALPointer [idx=394, fileOffset=38532201, len=19573,
>> forceFlush=false], lastCheckpointId=8c574131-763d-4cfa-99b6-0ce0321d61ab]
>> [2017-12-22 11:11:51,993]  WARN [exchange-worker-#46%ignite-instance-0%]
>> org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager:
>> - Ignite node stopped in the middle of checkpoint. Will restore memory
>> state and finish checkpoint on node start.
>> [CodeBlob (0x00007f9b58f24110)]
>> Framesize: 0
>> BufferBlob (0x00007f9b58f24110) used for StubRoutines (2)
>> #
>> # A fatal error has been detected by the Java Runtime Environment:
>> #
>> #  Internal Error (sharedRuntime.cpp:842), pid=221, tid=0x00007f9b473c1ae8
>> #  fatal error: exception happened outside interpreter, nmethods and
>> vtable stubs at pc 0x00007f9b58f248f6
>> #
>> # JRE version: OpenJDK Runtime Environment (8.0_151-b12) (build
>> 1.8.0_151-b12)
>> # Java VM: OpenJDK 64-Bit Server VM (25.151-b12 mixed mode linux-amd64
>> compressed oops)
>> # Derivative: IcedTea 3.6.0
>> # Distribution: Custom build (Tue Nov 21 11:22:36 GMT 2017)
>> # Core dump written. Default location: /opt/ignite/core or core.221
>> #
>> # An error report file with more information is saved as:
>> # /ignite-work-directory/core_dump_221.log
>> #
>> # If you would like to submit a bug report, please include
>> # instructions on how to reproduce the bug and visit:
>> #   http://icedtea.classpath.org/bugzilla
>> #
>>
>>
>>
>> Please find logs and configs attached.
>>
>> We deploy Ignite along with our services in Kubernetes (v 1.8) on premises.
>> The Ignite cluster is a StatefulSet of 5 Pods (5 instances) running Ignite
>> version 2.3. Each Pod mounts a PersistentVolume backed by CEPH RBD.
>>
>> We put about 230 events/second into Ignite; 70% of the events are ~200 KB
>> in size and 30% are 5000 KB. The smaller events have indexed fields, and we
>> query them via SQL.
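As a rough sketch only (the Event class, its fields, and the cache name "events" are illustrative assumptions, not taken from the attached configs), indexed fields and a SQL query over them in Ignite could look like this:

    import java.util.List;

    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteCache;
    import org.apache.ignite.cache.query.QueryCursor;
    import org.apache.ignite.cache.query.SqlFieldsQuery;
    import org.apache.ignite.cache.query.annotations.QuerySqlField;
    import org.apache.ignite.configuration.CacheConfiguration;

    public class EventModelSketch {
        /** Hypothetical event type: small indexed fields plus a large, non-indexed payload. */
        public static class Event {
            @QuerySqlField(index = true)
            long timestamp;

            @QuerySqlField(index = true)
            String source;

            byte[] payload; // ~200 KB to 5000 KB of binary data, not indexed
        }

        /** Creates (or gets) a cache whose Event fields are visible to SQL. */
        static IgniteCache<Long, Event> eventCache(Ignite ignite) {
            CacheConfiguration<Long, Event> ccfg = new CacheConfiguration<>("events");
            ccfg.setIndexedTypes(Long.class, Event.class); // register the SQL-visible type

            return ignite.getOrCreateCache(ccfg);
        }

        /** Queries only the small indexed fields via SQL. */
        static void printRecentSources(IgniteCache<Long, Event> cache, long since) {
            SqlFieldsQuery qry = new SqlFieldsQuery(
                "select timestamp, source from Event where timestamp > ?").setArgs(since);

            try (QueryCursor<List<?>> cur = cache.query(qry)) {
                for (List<?> row : cur)
                    System.out.println(row);
            }
        }
    }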
>>
>> The cluster is activated from a client node, which also streams events
>> into Ignite from Kafka. We use a custom streamer implementation that uses
>> the cache.putAll() API.
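The custom streamer itself is not shown in the thread; purely as a sketch under assumed names (config path, cache name, batch size, and fabricated values instead of real Kafka records), a putAll()-based batching loop on a client node might look like this:

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteCache;
    import org.apache.ignite.Ignition;

    public class PutAllStreamerSketch {
        static final int BATCH_SIZE = 500; // arbitrary batch size for illustration

        public static void main(String[] args) {
            Ignition.setClientMode(true);

            // Start a client node; the Spring XML path is hypothetical.
            Ignite ignite = Ignition.start("client-config.xml");

            // With native persistence the cluster starts inactive and must be activated once.
            ignite.cluster().active(true);

            IgniteCache<Long, byte[]> cache = ignite.getOrCreateCache("events"); // assumed cache name

            Map<Long, byte[]> batch = new HashMap<>();

            // In the real setup the key/value pairs come from Kafka; here they are fabricated.
            for (long key = 0; key < 1_000; key++) {
                batch.put(key, new byte[200 * 1024]);

                if (batch.size() >= BATCH_SIZE) {
                    cache.putAll(batch); // one bulk update per batch
                    batch.clear();
                }
            }

            if (!batch.isEmpty())
                cache.putAll(batch);
        }
    }

For comparison, IgniteDataStreamer is Ignite's built-in API for bulk loading; the setup described above uses putAll() instead.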
>>
>> We got the error when we stopped and restarted the cluster. It happened
>> on only one instance.
>>
>> The general question is:
>>
>> *Is it possible to configure (or implement) native persistence so that it
>> simply reports an error for corrupted data, skips it, and continues to work
>> without the corrupted part, so that the cluster keeps operating regardless
>> of storage errors?*
>>
>>
>> Arseny Kovalchuk
>>
>> Senior Software Engineer at Synesis
>> skype: arseny.kovalchuk
>> mobile: +375 (29) 666-16-16
>> LinkedIn Profile <http://www.linkedin.com/in/arsenykovalchuk/en>
>>
>
>
>
> --
> Best regards,
> Andrey V. Mashenkov
>



-- 
Best regards,
Andrey V. Mashenkov

Re: Segmentation fault (JVM crash) while restoring memory on start with native persistence

Posted by Arseny Kovalchuk <ar...@synesis.ru>.
Hi Andrey.

Unfortunately, I couldn't copy all the data from the file system to try to
reproduce the issue locally or in our cluster. That was very likely due to
issues with the underlying CEPH storage; we also had problems with CEPH in
our cluster at the same time, which might have caused the data corruption.
So, no results with OracleJDK.

On the other hand, we disabled backup copies of data ("backups=0", taking
into account the information from the JIRA issues mentioned earlier), and we
haven't had any severe issues with Ignite persistence so far.
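For context, a minimal sketch of how such a configuration might look in Java for Ignite 2.3; the cache name is an assumption, and newer Ignite versions (2.4+) configure persistence through DataStorageConfiguration rather than PersistentStoreConfiguration:

    import org.apache.ignite.cache.CacheMode;
    import org.apache.ignite.configuration.CacheConfiguration;
    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.configuration.PersistentStoreConfiguration;

    public class NoBackupsConfigSketch {
        static IgniteConfiguration config() {
            CacheConfiguration<Long, byte[]> ccfg = new CacheConfiguration<>("events"); // assumed cache name
            ccfg.setCacheMode(CacheMode.PARTITIONED);
            ccfg.setBackups(0); // no backup copies, as described above

            return new IgniteConfiguration()
                .setCacheConfiguration(ccfg)
                // Ignite 2.3-era way of enabling native persistence.
                .setPersistentStoreConfiguration(new PersistentStoreConfiguration());
        }
    }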



​
Arseny Kovalchuk

Senior Software Engineer at Synesis
skype: arseny.kovalchuk
mobile: +375 (29) 666-16-16
LinkedIn Profile <http://www.linkedin.com/in/arsenykovalchuk/en>


Re: Segmentation fault (JVM crash) while restoring memory on start with native persistence

Posted by Andrey Mashenkov <an...@gmail.com>.
Hi Arseny,

Have you had any success reproducing the issue and getting a stack trace?
Do you observe the same behavior on OracleJDK?




-- 
Best regards,
Andrey V. Mashenkov