Posted to user@usergrid.apache.org by Michael Russo <mi...@gmail.com> on 2016/02/03 00:39:02 UTC

Re: Usergrid2 - Reindexing ES data

Can you take a build from the latest master branch?  We recently made a
fix in Usergrid that improves memory usage and the overall speed of iterating
entities, both during a re-index and for normal GET requests where the limit
query parameter has a value > 1.

Those errors basically show that the Tomcat JVM was having a full GC event
on a total heap of 11.8 GB.  When a full GC occurs, the application will
pause.  All of the testing we have been doing is on an instance where we give
the Tomcat JVM only 4.5 GB of heap allocation.  This would make a full GC
shorter than what you experienced, and you should also see this happen less
frequently with the improvements in master.
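
For reference, capping the heap is just a JVM flag change.  A minimal sketch,
assuming a standard Tomcat layout where bin/setenv.sh is sourced on startup
(the file path and GC log location are assumptions, adjust to your install):

# bin/setenv.sh -- give the Usergrid Tomcat JVM a fixed 4.5 GB heap
export CATALINA_OPTS="$CATALINA_OPTS -Xms4500m -Xmx4500m"
# Optional: log GC activity so long pauses like the one above are easy to spot
export CATALINA_OPTS="$CATALINA_OPTS -verbose:gc -XX:+PrintGCDetails -Xloggc:/var/log/tomcat/gc.log"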

Thanks.
-Michael

On Thu, Jan 28, 2016 at 6:06 AM, Harish Singh Bisht <
harish.bisht@comprotechnologies.com> wrote:

> Hi Michael,
>
> Thank you very much for the help. I checked the logs as you suggested:
> 1. ElasticSearch Logs show the below warnings:
> Jan 28 19:19:19 Elasticsearch elasticsearch.log:  [2016-01-28 08:49:47,115][WARN
> ][transport                ] [Zach] Received response for a request that
> has timed out, sent [56430ms] ago, timed out [26430ms] ago, action
> [internal:discovery/zen/fd/ping], node
> [[default][AGRTPd64QyuiYtE4QKxfYQ][ip-10-0-0-237][inet[/10.0.0.237:9301]]{client=true,
> data=false}], id [157852]
>
> 2. Usergrid logs show the below warning:
> Jan 28 19:20:29 Usergrid-2 catalina.out:  2016-01-28 08:50:29 WARN
> elasticsearch.monitor.jvm.monitorLongGc(203)<elasticsearch[default][scheduler][T#1]>-
> [default] [gc][old][102381][600] duration [56.4s], collections [1]/[57.2s],
> total [56.4s]/[3.8h], memory [11.8gb]->[11.7gb]/[11.9gb], all_pools
> {[young] [532.5mb]->[358.6mb]/[532.5mb]}{[survivor]
> [18.3mb]->[0b]/[66.5mb]}{[old] [11.3gb]->[11.3gb]/[11.3gb]}
>
> The re-indexing is still in progress, but in the last 6 hours it seems to be
> progressing very slowly. We have observed very high RAM usage on the Usergrid
> machine (which runs Tomcat and Cassandra). The warning logs on the ES machine
> appeared at the same time that the free RAM dropped on the Usergrid machine.
> (On the other hand, the ES machine's RAM / CPU usage seems fine.)
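>
> A minimal sketch of how heap and GC pressure on both machines can be
> spot-checked (standard JDK tooling and default ports assumed; the process
> lookup below is an approximation):
>
> # On the Usergrid machine: sample heap occupancy and GC time of the Tomcat JVM every 5s
> jstat -gcutil $(pgrep -f org.apache.catalina.startup.Bootstrap) 5s
> # Against the ES machine: heap and GC stats of the Elasticsearch node
> curl -s 'http://localhost:9200/_nodes/stats/jvm?pretty'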
>
> Please advise.
>
> Many thanks
> Harish
>
> On Thu, Jan 28, 2016 at 9:05 AM, Michael Russo <mi...@gmail.com>
> wrote:
>
>> 1.  This simply means that the processing is being run in the calling
>> thread (the queue consumer thread that reads off of SQS) instead of the
>> async pool, because there are no threads available in the async pool.  It's
>> typical during a full re-index or extreme write load.  The events are still
>> getting processed and indexed.  We implemented this to ensure that the
>> system does not get saturated and requests do not get rejected.
>>
>> 3.  It doesn't return the number of errors / failures, only the number of
>> entities that it has processed.  Currently you'd only be able to see this
>> information from parsing the log file.
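>>
>> Roughly, that looks like the following (the log path is a placeholder and
>> the grep patterns are taken from the log lines you posted earlier):
>>
>> # Poll the rebuild status for the processed-entity count
>> curl -i -X GET -u <sysadmin user>:<sysadmin pass> "http://localhost:8080/system/index/rebuild/<job-uuid>"
>> # Count indexing failures that Usergrid reported in catalina.out
>> grep -c "Failed to index message" /path/to/tomcat/logs/catalina.out
>> # Show the most recent entities that could not be indexed
>> grep "Unable to index" /path/to/tomcat/logs/catalina.out | tail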
>>
>> Thanks.
>> -Michael
>>
>> On Wed, Jan 27, 2016 at 5:28 AM, Harish Singh Bisht <
>> harish.bisht@comprotechnologies.com> wrote:
>>
>>> Hi Michael,
>>>
>>> Thank you very much for the detailed explanation. I deleted the ES data
>>> folders, increased the ES bulk queue size to 2000, restarted ES / Usergrid
>>> and then started the re-indexing process. This fixed most of the errors - a
>>> big thanks to you.
>>> Below are my observations when I started the re-indexing process today:
>>>
>>> 1. Usergrid logs show the below warnings very frequently
>>> (Usergrid-SQS-Pool task queue full):
>>> Jan 27 14:33:00 Usergrid-2 catalina.out:  2016-01-27 04:03:00 INFO
>>> corepersistence.index.ReIndexServiceImpl.lambda$rebuildIndex$97(131)<RxCachedThreadScheduler-14>-
>>> Sending batch of 1000 to be indexed.
>>> Jan 27 14:33:01 Usergrid-2 catalina.out:  2016-01-27 04:03:01 WARN
>>> core.executor.TaskExecutorFactory.rejectedExecution(171)<QueueConsumer_9>-
>>> Usergrid-SQS-Pool task queue full, rejecting task
>>> rx.schedulers.ExecutorScheduler$ExecutorSchedulerWorker@4ef107a and
>>> running in thread QueueConsumer_9
>>>
>>> Does this mean some index requests are being rejected, resulting in a
>>> loss of data in ES? Do I need to set scheduler.io.threads to a higher
>>> value in the usergrid property file? Will this prevent a loss of data
>>> during the re-indexing?
>>>
>>> 2. I needed to trigger the re-index API twice, as the first time it
>>> appears to have hung / stopped in between (only 2 indexes were created out
>>> of 6, and no docs were getting created after that).
>>> After waiting for a few hours, I re-triggered the rebuild API again,
>>> which seems to be working well now (the 5th index is in progress).
>>>
>>> 3. Does the reindex 'status' API "
>>> http://localhost:8080/system/index/rebuild/<job-uuid>" return the
>>> number of indexing errors / failures etc.?
>>>
>>> Please advise. Appreciate the help.
>>>
>>> Many Thanks,
>>> Harish
>>>
>>> On Mon, Jan 25, 2016 at 10:15 PM, Michael Russo <michaelarusso@gmail.com
>>> > wrote:
>>>
>>>> Hi Harish,
>>>>
>>>> I'm going to try explaining the errors by type:
>>>>
>>>> 1)
>>>>
>>>> This is occurring because Elasticsearch is not able to start the
>>>> shard for the index.  If your cluster health is red, it is possible that
>>>> you don't have enough disk space, or at one point you did not have enough
>>>> disk space, so the already-created shards for the Usergrid indexes are
>>>> corrupt.  Since Usergrid can completely rebuild the index, I'd suggest
>>>> that you completely delete the Elasticsearch data directories, restart
>>>> Elasticsearch, restart Usergrid, and begin the re-indexing process again.
>>>> As an FYI, we have configured in the code a default of 18 shards and 1
>>>> replica for each index.
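>>>>
>>>> Roughly, the reset looks like this (a sketch only -- the data path and
>>>> service names are assumptions for a package-based install, so adjust them
>>>> to your environment):
>>>>
>>>> # Stop both services before touching the ES data directory
>>>> sudo service tomcat7 stop
>>>> sudo service elasticsearch stop
>>>> # Remove the corrupt ES data so the indexes are recreated empty
>>>> sudo rm -rf /var/lib/elasticsearch/*
>>>> sudo service elasticsearch start
>>>> sudo service tomcat7 start
>>>> # Kick off the full re-index again
>>>> curl -i -X POST -u <sysadmin user>:<sysadmin pass> "http://localhost:8080/system/index/rebuild"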
>>>>
>>>> 2.)
>>>>
>>>> *Type 1* - Example exceptions that can occur due to the issues seen in #1.
>>>> Usergrid's use of the Elasticsearch SDK as a client will print these types
>>>> of statements because it's aware of the Elasticsearch cluster state.
>>>>
>>>> *Type 2* - This shows that the async index queue consumers were unable
>>>> to index an entity.  This is most likely due to #1 as well, but #1 does not
>>>> always have to be the cause of this exception.  If for any reason the
>>>> consumer threads get an exception while trying to index the entity in
>>>> Elasticsearch, this type of error will be seen.
>>>>
>>>> *Type 3 *- This message goes with 'Type 2'.
>>>>
>>>> *Type 4* - We have a finite threadpool size set for the asynchronous
>>>> indexing (Usergrid-SQS-Pool).  The default used to be 100, but is now
>>>> 40 in the master branch.  If there is no thread available in the pool (a
>>>> rejected task), we implemented a 'caller runs' strategy in which the
>>>> thread trying to hand off the event to the pool (Queue_Consumer) will
>>>> execute the request itself.  The number of threads available in this pool
>>>> can be configured with the *scheduler.io.threads* property (see the
>>>> configuration sketch after Type 5).
>>>>
>>>> *Type 5* - Elasticsearch also maintains threadpool sizes, and will
>>>> queue up requests when a thread is not available to process them.  The
>>>> default queue sizes are fairly small if you're trying for high concurrency.
>>>> However, if there is a problem with ES and it's unable to process requests
>>>> properly, the queue would likely fill up regardless.  See
>>>> https://www.elastic.co/guide/en/elasticsearch/reference/2.x/modules-threadpool.html.
>>>> As an FYI, when we scaled Usergrid to 10,000 requests per second, we set our
>>>> queue sizes for search and bulk to 5000.  That is a bit high; you should
>>>> probably start with something lower, like 1000, if you plan to change them.
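>>>>
>>>> To make the two knobs from Type 4 and Type 5 concrete, a sketch of the
>>>> settings (the file locations are assumptions for a typical install, and
>>>> the values are just the starting points mentioned above):
>>>>
>>>> # Usergrid (Type 4): raise the async indexing pool size.
>>>> # Use wherever your usergrid-deployment.properties actually lives.
>>>> echo "scheduler.io.threads=100" >> /path/to/usergrid-deployment.properties
>>>>
>>>> # Elasticsearch (Type 5): bump the bulk/search queue sizes, then restart ES.
>>>> cat >> /etc/elasticsearch/elasticsearch.yml <<'EOF'
>>>> threadpool.bulk.queue_size: 1000
>>>> threadpool.search.queue_size: 1000
>>>> EOF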
>>>>
>>>>
>>>> Hope this info helps!  Let me know how things progress.
>>>>
>>>> Thanks!
>>>> -Michael
>>>>
>>>>
>>>> On Mon, Jan 25, 2016 at 5:06 AM, Harish Singh Bisht <
>>>> harish.bisht@comprotechnologies.com> wrote:
>>>>
>>>>> Hi Michael,
>>>>>
>>>>> (resending email as I found another type of error in the logs)
>>>>> Thank you very much for the help. I tried the full system re-index
>>>>> API, but I am still seeing the below issues:
>>>>>
>>>>> 1. One of the ES indexes (usergrid_applications_4) still has 15 shards
>>>>> in "INITIALIZING" state. The ES cluster health is red.
>>>>>  Log Extract from ES (details in appendices) shows that:
>>>>> "[usergrid_applications_4][14] failed to start shard.".
>>>>>
>>>>> 2. Usergrid logs show the below types of errors after invoking the
>>>>> re-index api (details in appendices):
>>>>> Type 1
>>>>> UnavailableShardsException[[usergrid_applications_4][7] Primary shard
>>>>> is not active or isn't assigned is a known node. Timeout: [1m], request:
>>>>> org.elasticsearch.action.bulk.BulkShardRequest@155fc46c]
>>>>> Type 2
>>>>> corepersistence.asyncevents.AmazonAsyncEventService.lambda$callEventHandlers$65(359)<Usergrid-SQS-Pool-13>-
>>>>> Failed to index message: 17ed55a5-3091-4f0d-8620-12f2915668c1
>>>>> Type 3
>>>>> corepersistence.asyncevents.AmazonAsyncEventService.lambda$null$70(735)<Usergrid-SQS-Pool-16>-
>>>>> Missing messages from queue post operation
>>>>> Type 4
>>>>> core.executor.TaskExecutorFactory.rejectedExecution(171)<QueueConsumer_11>-
>>>>> Usergrid-SQS-Pool task queue full, rejecting task
>>>>> rx.schedulers.ExecutorScheduler$ExecutorSchedulerWorker@131a4c5 and
>>>>> running in thread QueueConsumer_11
>>>>> Type 5
>>>>> EsRejectedExecutionException[rejected execution (queue capacity 300)
>>>>> on
>>>>> org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1@13afccc8
>>>>> ];
>>>>>
>>>>> Please suggest the next steps. Appreciate the help.
>>>>>
>>>>> Many Thanks,
>>>>> Harish
>>>>>
>>>>>
>>>>>
>>>>> -------------------------------------------------------------------
>>>>> *Appendices*
>>>>> *1. ELASTICSEARCH LOG*
>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:  [2016-01-25 06:40:44,227][WARN
>>>>> ][indices.cluster          ] [Blindside] [usergrid_applications_4][14]
>>>>> failed to start shard
>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:
>>>>>  org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException:
>>>>> [usergrid_applications_4][14] failed to recover shard
>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   at
>>>>> org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:287)
>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   at
>>>>> org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:132)
>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   at
>>>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   at
>>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   at
>>>>> java.lang.Thread.run(Thread.java:745)
>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:  Caused by:
>>>>> org.elasticsearch.index.translog.TranslogCorruptedException: translog
>>>>> corruption while reading from stream
>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   at
>>>>> org.elasticsearch.index.translog.ChecksummedTranslogStream.read(ChecksummedTranslogStream.java:70)
>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   at
>>>>> org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:257)
>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   ... 4 more
>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:  Caused by:
>>>>> java.io.EOFException
>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   at
>>>>> org.elasticsearch.common.io.stream.InputStreamStreamInput.readBytes(InputStreamStreamInput.java:53)
>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   at
>>>>> org.elasticsearch.index.translog.BufferedChecksumStreamInput.readBytes(BufferedChecksumStreamInput.java:55)
>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   at
>>>>> org.elasticsearch.common.io.stream.StreamInput.readBytesReference(StreamInput.java:86)
>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   at
>>>>> org.elasticsearch.common.io.stream.StreamInput.readBytesReference(StreamInput.java:74)
>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   at
>>>>> org.elasticsearch.index.translog.Translog$Index.readFrom(Translog.java:495)
>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   at
>>>>> org.elasticsearch.index.translog.ChecksummedTranslogStream.read(ChecksummedTranslogStream.java:68)
>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   ... 5 more
>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:  [2016-01-25 06:40:44,279][WARN
>>>>> ][cluster.action.shard     ] [Blindside] [usergrid_applications_4][14]
>>>>> sending failed shard for [usergrid_applications_4][14],
>>>>> node[lb-HRQpWRQGCeIadzTEHSw], [P], s[INITIALIZING], indexUUID
>>>>> [fdyaoJQZQKuFeBONTQSD1g], reason [Failed to start shard, message
>>>>> [IndexShardGatewayRecoveryException[[usergrid_applications_4][14] failed to
>>>>> recover shard]; nested: TranslogCorruptedException[translog corruption
>>>>> while reading from stream]; nested: EOFException; ]]
>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:  [2016-01-25 06:40:44,279][WARN
>>>>> ][cluster.action.shard     ] [Blindside] [usergrid_applications_4][14]
>>>>> received shard failed for [usergrid_applications_4][14],
>>>>> node[lb-HRQpWRQGCeIadzTEHSw], [P], s[INITIALIZING], indexUUID
>>>>> [fdyaoJQZQKuFeBONTQSD1g], reason [Failed to start shard, message
>>>>> [IndexShardGatewayRecoveryException[[usergrid_applications_4][14] failed to
>>>>> recover shard]; nested: TranslogCorruptedException[translog corruption
>>>>> while reading from stream]; nested: EOFException; ]]
>>>>> ......
>>>>> *2. USERGRID LOG*
>>>>> Jan 25 16:05:19 Usergrid-2 catalina.out:  2016-01-25 05:35:19 INFO
>>>>> rest.system.IndexResource.rebuildIndexesPost(78)<http-bio-80-exec-92>-
>>>>> Rebuilding all applications
>>>>> Jan 25 16:05:19 Usergrid-2 catalina.out:  2016-01-25 05:35:19 INFO
>>>>> corepersistence.index.ReIndexServiceImpl.lambda$rebuildIndex$97(131)<RxCachedThreadScheduler-35>-
>>>>> Sending batch of 1000 to be indexed.
>>>>> Jan 25 16:05:21 Usergrid-2 catalina.out:  2016-01-25 05:35:20 ERROR
>>>>> index.impl.EsIndexProducerImpl.sendRequest(209)<Usergrid-SQS-Pool-13>-
>>>>> Unable to index
>>>>> id=appId(cd2bd460-a3e8-11e5-a327-0a75091e6d25,application).entityId(7c3328cc-bdd1-11e5-88d3-0a75091e6d25,activity).version(7c3328cd-bdd1-11e5-88d3-0a75091e6d25).nodeId(99400999-a3ef-11e5-a327-0a75091e6d25,group).edgeName(zzzcollzzz|feed).nodeType(TARGET),
>>>>> type=entity, index=usergrid_applications_4,
>>>>> failureMessage=UnavailableShardsException[[usergrid_applications_4][4]
>>>>> Primary shard is not active or isn't assigned is a known node. Timeout:
>>>>> [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@3d277a1
>>>>> ]
>>>>> .....
>>>>> Jan 25 16:05:21 Usergrid-2 catalina.out:  2016-01-25 05:35:20 ERROR
>>>>> corepersistence.asyncevents.AmazonAsyncEventService.lambda$callEventHandlers$65(359)<Usergrid-SQS-Pool-13>-
>>>>> Failed to index message: 17ed55a5-3091-4f0d-8620-12f2915668c1
>>>>> Jan 25 16:05:21 Usergrid-2 catalina.out:  java.lang.RuntimeException:
>>>>> Error during processing of bulk index operations one of the responses
>>>>> failed.
>>>>> Jan 25 16:05:21 Usergrid-2 catalina.out:
>>>>>  UnavailableShardsException[[usergrid_applications_4][4] Primary shard is
>>>>> not active or isn't assigned is a known node. Timeout: [1m], request:
>>>>> org.elasticsearch.action.bulk.BulkShardRequest@3d277a1]
>>>>> Jan 25 16:05:21 Usergrid-2 catalina.out:
>>>>>  UnavailableShardsException[[usergrid_applications_4][7] Primary shard is
>>>>> not active or isn't assigned is a known node. Timeout: [1m], request:
>>>>> org.elasticsearch.action.bulk.BulkShardRequest@155fc46c]
>>>>> Jan 25 16:05:21 Usergrid-2 catalina.out:
>>>>>  UnavailableShardsException[[usergrid_applications_4][10] Primary shard is
>>>>> not active or isn't assigned is a known node. Timeout: [1m], request:
>>>>> org.elasticsearch.action.bulk.BulkShardRequest@445968e3]
>>>>> .......
>>>>> Jan 25 16:53:02 Usergrid-2 catalina.out: 2016-01-25 06:23:02 ERROR
>>>>> index.impl.EsIndexProducerImpl.sendRequest(209)<Usergrid-SQS-Pool-97>-
>>>>> Unable to index
>>>>> id=appId(65ccc2b7-bde0-11e5-88d3-0a75091e6d25,application).entityId(b7dcf6c9-bf62-11e5-88d3-0a75091e6d25,activity).version(b7dcf6cb-bf62-11e5-88d3-0a75091e6d25).nodeId(65ccc2b7-bde0-11e5-88d3-0a75091e6d25,application).edgeName(zzzcollzzz|activities).nodeType(TARGET),
>>>>> type=entity, index=usergrid_applications_3,
>>>>> failureMessage=RemoteTransportException[[Blindside][inet[/10.0.0.148:9300]][indices:data/write/bulk[s]]];
>>>>> nested: EsRejectedExecutionException[rejected execution (queue capacity
>>>>> 300) on
>>>>> org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1@13afccc8
>>>>> ];
>>>>> Jan 25 16:53:03 Usergrid-2 catalina.out:
>>>>> RemoteTransportException[[Blindside][inet[/10.0.0.148:9300]][indices:data/write/bulk[s]]];
>>>>> nested: EsRejectedExecutionException[rejected execution (queue capacity
>>>>> 300) on
>>>>> org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1@59da0cf2
>>>>> ];
>>>>>
>>>>> On Mon, Jan 25, 2016 at 5:54 PM, Harish Singh Bisht <
>>>>> harish.bisht@comprotechnologies.com> wrote:
>>>>>
>>>>>> Hi Michael,
>>>>>>
>>>>>> Thank you very much for the help. I tried the full system re-index
>>>>>> API, but I am still seeing the below issues:
>>>>>>
>>>>>> 1. One of the ES indexes (usergrid_applications_4) still has 15
>>>>>> shards in "INITIALIZING" state. The ES cluster health is red.
>>>>>>  Log Extract from ES (details in appendices) shows that:
>>>>>> "[usergrid_applications_4][14] failed to start shard.".
>>>>>>
>>>>>> 2. Usergrid logs show the below types of errors after invoking the
>>>>>> re-index api (details in appendices):
>>>>>> Type 1
>>>>>> UnavailableShardsException[[usergrid_applications_4][7] Primary shard
>>>>>> is not active or isn't assigned is a known node. Timeout: [1m], request:
>>>>>> org.elasticsearch.action.bulk.BulkShardRequest@155fc46c]
>>>>>> Type 2
>>>>>> corepersistence.asyncevents.AmazonAsyncEventService.lambda$callEventHandlers$65(359)<Usergrid-SQS-Pool-13>-
>>>>>> Failed to index message: 17ed55a5-3091-4f0d-8620-12f2915668c1
>>>>>> Type 3
>>>>>> corepersistence.asyncevents.AmazonAsyncEventService.lambda$null$70(735)<Usergrid-SQS-Pool-16>-
>>>>>> Missing messages from queue post operation
>>>>>> Type 4
>>>>>> core.executor.TaskExecutorFactory.rejectedExecution(171)<QueueConsumer_11>-
>>>>>> Usergrid-SQS-Pool task queue full, rejecting task
>>>>>> rx.schedulers.ExecutorScheduler$ExecutorSchedulerWorker@131a4c5 and
>>>>>> running in thread QueueConsumer_11
>>>>>>
>>>>>> Please suggest the next steps. Appreciate the help.
>>>>>>
>>>>>> Many Thanks,
>>>>>> Harish
>>>>>>
>>>>>>
>>>>>>
>>>>>> -------------------------------------------------------------------
>>>>>> *Appendices*
>>>>>> *1. ELASTICSEARCH LOG*
>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:  [2016-01-25 06:40:44,227][WARN
>>>>>> ][indices.cluster          ] [Blindside] [usergrid_applications_4][14]
>>>>>> failed to start shard
>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:
>>>>>>  org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException:
>>>>>> [usergrid_applications_4][14] failed to recover shard
>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   at
>>>>>> org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:287)
>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   at
>>>>>> org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:132)
>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   at
>>>>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   at
>>>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   at
>>>>>> java.lang.Thread.run(Thread.java:745)
>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:  Caused by:
>>>>>> org.elasticsearch.index.translog.TranslogCorruptedException: translog
>>>>>> corruption while reading from stream
>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   at
>>>>>> org.elasticsearch.index.translog.ChecksummedTranslogStream.read(ChecksummedTranslogStream.java:70)
>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   at
>>>>>> org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:257)
>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   ... 4 more
>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:  Caused by:
>>>>>> java.io.EOFException
>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   at
>>>>>> org.elasticsearch.common.io.stream.InputStreamStreamInput.readBytes(InputStreamStreamInput.java:53)
>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   at
>>>>>> org.elasticsearch.index.translog.BufferedChecksumStreamInput.readBytes(BufferedChecksumStreamInput.java:55)
>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   at
>>>>>> org.elasticsearch.common.io.stream.StreamInput.readBytesReference(StreamInput.java:86)
>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   at
>>>>>> org.elasticsearch.common.io.stream.StreamInput.readBytesReference(StreamInput.java:74)
>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   at
>>>>>> org.elasticsearch.index.translog.Translog$Index.readFrom(Translog.java:495)
>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   at
>>>>>> org.elasticsearch.index.translog.ChecksummedTranslogStream.read(ChecksummedTranslogStream.java:68)
>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   ... 5 more
>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:  [2016-01-25 06:40:44,279][WARN
>>>>>> ][cluster.action.shard     ] [Blindside] [usergrid_applications_4][14]
>>>>>> sending failed shard for [usergrid_applications_4][14],
>>>>>> node[lb-HRQpWRQGCeIadzTEHSw], [P], s[INITIALIZING], indexUUID
>>>>>> [fdyaoJQZQKuFeBONTQSD1g], reason [Failed to start shard, message
>>>>>> [IndexShardGatewayRecoveryException[[usergrid_applications_4][14] failed to
>>>>>> recover shard]; nested: TranslogCorruptedException[translog corruption
>>>>>> while reading from stream]; nested: EOFException; ]]
>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:  [2016-01-25 06:40:44,279][WARN
>>>>>> ][cluster.action.shard     ] [Blindside] [usergrid_applications_4][14]
>>>>>> received shard failed for [usergrid_applications_4][14],
>>>>>> node[lb-HRQpWRQGCeIadzTEHSw], [P], s[INITIALIZING], indexUUID
>>>>>> [fdyaoJQZQKuFeBONTQSD1g], reason [Failed to start shard, message
>>>>>> [IndexShardGatewayRecoveryException[[usergrid_applications_4][14] failed to
>>>>>> recover shard]; nested: TranslogCorruptedException[translog corruption
>>>>>> while reading from stream]; nested: EOFException; ]]
>>>>>> ......
>>>>>> *2. USERGRID LOG*
>>>>>> Jan 25 16:05:19 Usergrid-2 catalina.out:  2016-01-25 05:35:19 INFO
>>>>>> rest.system.IndexResource.rebuildIndexesPost(78)<http-bio-80-exec-92>-
>>>>>> Rebuilding all applications
>>>>>> Jan 25 16:05:19 Usergrid-2 catalina.out:  2016-01-25 05:35:19 INFO
>>>>>> corepersistence.index.ReIndexServiceImpl.lambda$rebuildIndex$97(131)<RxCachedThreadScheduler-35>-
>>>>>> Sending batch of 1000 to be indexed.
>>>>>> Jan 25 16:05:21 Usergrid-2 catalina.out:  2016-01-25 05:35:20 ERROR
>>>>>> index.impl.EsIndexProducerImpl.sendRequest(209)<Usergrid-SQS-Pool-13>-
>>>>>> Unable to index
>>>>>> id=appId(cd2bd460-a3e8-11e5-a327-0a75091e6d25,application).entityId(7c3328cc-bdd1-11e5-88d3-0a75091e6d25,activity).version(7c3328cd-bdd1-11e5-88d3-0a75091e6d25).nodeId(99400999-a3ef-11e5-a327-0a75091e6d25,group).edgeName(zzzcollzzz|feed).nodeType(TARGET),
>>>>>> type=entity, index=usergrid_applications_4,
>>>>>> failureMessage=UnavailableShardsException[[usergrid_applications_4][4]
>>>>>> Primary shard is not active or isn't assigned is a known node. Timeout:
>>>>>> [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@3d277a1
>>>>>> ]
>>>>>> .....
>>>>>> Jan 25 16:05:21 Usergrid-2 catalina.out:  2016-01-25 05:35:20 ERROR
>>>>>> corepersistence.asyncevents.AmazonAsyncEventService.lambda$callEventHandlers$65(359)<Usergrid-SQS-Pool-13>-
>>>>>> Failed to index message: 17ed55a5-3091-4f0d-8620-12f2915668c1
>>>>>> Jan 25 16:05:21 Usergrid-2 catalina.out:  java.lang.RuntimeException:
>>>>>> Error during processing of bulk index operations one of the responses
>>>>>> failed.
>>>>>> Jan 25 16:05:21 Usergrid-2 catalina.out:
>>>>>>  UnavailableShardsException[[usergrid_applications_4][4] Primary shard is
>>>>>> not active or isn't assigned is a known node. Timeout: [1m], request:
>>>>>> org.elasticsearch.action.bulk.BulkShardRequest@3d277a1]
>>>>>> Jan 25 16:05:21 Usergrid-2 catalina.out:
>>>>>>  UnavailableShardsException[[usergrid_applications_4][7] Primary shard is
>>>>>> not active or isn't assigned is a known node. Timeout: [1m], request:
>>>>>> org.elasticsearch.action.bulk.BulkShardRequest@155fc46c]
>>>>>> Jan 25 16:05:21 Usergrid-2 catalina.out:
>>>>>>  UnavailableShardsException[[usergrid_applications_4][10] Primary shard is
>>>>>> not active or isn't assigned is a known node. Timeout: [1m], request:
>>>>>> org.elasticsearch.action.bulk.BulkShardRequest@445968e3]
>>>>>>
>>>>>>
>>>>>> On Fri, Jan 22, 2016 at 10:29 PM, Michael Russo <
>>>>>> michaelarusso@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Harish,
>>>>>>>
>>>>>>> Yeah, in Usergrid 2 you can re-index all of the entity data that
>>>>>>> exists in Cassandra.  Here are example curl requests that invoke the
>>>>>>> reindex APIs:
>>>>>>>
>>>>>>> *Full system re-index:*
>>>>>>>
>>>>>>> curl -i -X POST -u <sysadmin user>:<sysadmin pass> "
>>>>>>> http://localhost:8080/system/index/rebuild"
>>>>>>>
>>>>>>> *Per application re-index:*
>>>>>>>
>>>>>>> curl -i -X POST -u <sysadmin user>:<sysadmin pass> "
>>>>>>> http://localhost:8080/system/index/rebuild/<application uuid>"
>>>>>>>
>>>>>>> Thanks.
>>>>>>> -Michael
>>>>>>>
>>>>>>> On Fri, Jan 22, 2016 at 4:37 AM, Harish Singh Bisht <
>>>>>>> harish.bisht@comprotechnologies.com> wrote:
>>>>>>>
>>>>>>>> Hi Team,
>>>>>>>>
>>>>>>>> We have been testing our application based on Usergrid 2
>>>>>>>> (master branch) and started noticing unusually poor performance, with
>>>>>>>> spikes in the response time.
>>>>>>>>
>>>>>>>> Our investigation revealed that during the load testing we ran out
>>>>>>>> of HDD space on the single-node Elasticsearch cluster. This led to
>>>>>>>> indexing failures.
>>>>>>>>
>>>>>>>> So we increased the HDD space and restarted ES. But now the cluster
>>>>>>>> health is red and a lot of shards are in the INITIALIZING state. It seems
>>>>>>>> data has been lost on the ES node.
>>>>>>>>
>>>>>>>> Is there any way to recover the lost data in ES? Specifically, is
>>>>>>>> there a way to trigger a re-index of data from Cassandra to ES?
>>>>>>>>
>>>>>>>> Appreciate the help.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Harish
>>>>>>>>
>>>>>>>> --
>>>>>>>> Regards,
>>>>>>>> Harish Singh Bisht
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Regards,
>>>>>> Harish Singh Bisht
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Regards,
>>>>> Harish Singh Bisht
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> Regards,
>>> Harish Singh Bisht
>>>
>>>
>>
>
>
> --
> Regards,
> Harish Singh Bisht
>
>

Re: Usergrid2 - Reindexing ES data

Posted by Harish Singh Bisht <ha...@comprotechnologies.com>.
Hi Michael,

Thank you for your email.
I deployed the latest Usergrid code (master branch), reduced the heap to 4.5
GB and then triggered the re-index job. After 20 hours of re-indexing data,
Usergrid seems to have hung / stopped because the Java heap was exhausted.
Below are the logs:

*1. Usergrid logs (out of Java heap space)*
Feb 04 17:06:32 Usergrid-2 catalina.out:  06:36:31,961  WARN
OioServerSocketPipelineSink:83 - Failed to accept a connection.
Feb 04 17:06:32 Usergrid-2 catalina.out:  java.lang.OutOfMemoryError: Java
heap space
Feb 04 15:05:03 Usergrid-2 catalina.out:  04:34:25,166  WARN jvm:203 -
[default] [gc][old][29454][2683] duration [54.2s], collections [3]/[54.3s],
total [54.2s]/[13.2h], memory [4.4gb]->[4.4gb]/[4.4gb], all_pools {[young]
[532.5mb]->[532.5mb]/[532.5mb]}{[survivor] [62.5mb]->[65mb]/[66.5mb]}{[old]
[3.8gb]->[3.8gb]/[3.8gb]}
Feb 03 20:38:34 Usergrid-2 catalina.out: 10:08:34,616 ERROR
AmazonAsyncEventService:361 - Failed to index message:
886ea9bd-708d-4bea-ab1c-844ff97c947c

*2. ES logs (time out, removed non-data node from cluster)*
Feb 04 16:00:33 Elasticsearch elasticsearch.log:  [2016-02-04
05:31:17,243][INFO
][cluster.service          ] [Arlette Truffaut] removed
{[default][3GuynlamR6GvdiCAhhTBmw][ip-10-0-0-237][inet[/10.0.0.237:9301]]{client=true,
data=false},}, reason:
zen-disco-node_failed([default][3GuynlamR6GvdiCAhhTBmw][ip-10-0-0-237][inet[/10.0.0.237:9301]]{client=true,
data=false}), reason failed to ping, tried [3] times, each with maximum
[30s] timeout
Feb 04 16:00:33 Elasticsearch elasticsearch.log:  [2016-02-04
05:31:17,256][DEBUG][action.admin.cluster.node.stats]
[Arlette Truffaut] failed to execute on node [3GuynlamR6GvdiCAhhTBmw]
Feb 04 16:00:33 Elasticsearch elasticsearch.log:
 org.elasticsearch.transport.NodeDisconnectedException:
[default][inet[/10.0.0.237:9301]][cluster:monitor/nodes/stats[n]]
disconnected


Is it possible that the re-indexing code in Usergrid has a memory leak and is
therefore using up all of the Java heap memory?
Please suggest.
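
If it helps narrow this down, the next run could be started with the standard
heap-dump-on-OOM flags so the retained objects can be inspected afterwards (a
sketch only; the dump path is a placeholder):

# Added to CATALINA_OPTS before restarting Tomcat
export CATALINA_OPTS="$CATALINA_OPTS -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/tmp/usergrid-oom.hprof"
# While the JVM is still up, a live histogram of the biggest heap consumers:
jmap -histo:live <tomcat-pid> | head -30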

Many Thanks
Harish

On Wed, Feb 3, 2016 at 5:09 AM, Michael Russo <mi...@gmail.com>
wrote:

> Can you take a build from the latest master branch?  We recently made a
> fix in Usergrid that improves memory usage and the overall speed of iterating
> entities, both during a re-index and for normal GET requests where the limit
> query parameter has a value > 1.
>
> Those errors basically show that the Tomcat JVM was having a full GC event
> on a total heap of 11.8 GB.  When a full GC occurs, the application will
> pause.  All of the testing we have been doing is on an instance where we
> give the Tomcat JVM only 4.5 GB of heap allocation.  This would make a full
> GC shorter than what you experienced, and you should also see this happen
> less frequently with the improvements in master.
>
> Thanks.
> -Michael
>
> On Thu, Jan 28, 2016 at 6:06 AM, Harish Singh Bisht <
> harish.bisht@comprotechnologies.com> wrote:
>
>> Hi Michael,
>>
>> Thank you very much for the help. I checked the logs as you suggested:
>> 1. ElasticSearch Logs show the below warnings:
>> Jan 28 19:19:19 Elasticsearch elasticsearch.log:  [2016-01-28 08:49:47,115][WARN
>> ][transport                ] [Zach] Received response for a request that
>> has timed out, sent [56430ms] ago, timed out [26430ms] ago, action
>> [internal:discovery/zen/fd/ping], node
>> [[default][AGRTPd64QyuiYtE4QKxfYQ][ip-10-0-0-237][inet[/10.0.0.237:9301]]{client=true,
>> data=false}], id [157852]
>>
>> 2. Usergrid logs show the below warning:
>> Jan 28 19:20:29 Usergrid-2 catalina.out:  2016-01-28 08:50:29 WARN
>> elasticsearch.monitor.jvm.monitorLongGc(203)<elasticsearch[default][scheduler][T#1]>-
>> [default] [gc][old][102381][600] duration [56.4s], collections [1]/[57.2s],
>> total [56.4s]/[3.8h], memory [11.8gb]->[11.7gb]/[11.9gb], all_pools
>> {[young] [532.5mb]->[358.6mb]/[532.5mb]}{[survivor]
>> [18.3mb]->[0b]/[66.5mb]}{[old] [11.3gb]->[11.3gb]/[11.3gb]}
>>
>> The re-indexing is still in progress, but in the last 6 hours it seems to be
>> progressing very slowly. We have observed very high RAM usage on the Usergrid
>> machine (which runs Tomcat and Cassandra). The warning logs on the ES machine
>> appeared at the same time that the free RAM dropped on the Usergrid machine.
>> (On the other hand, the ES machine's RAM / CPU usage seems fine.)
>>
>> Please advise.
>>
>> Many thanks
>> Harish
>>
>> On Thu, Jan 28, 2016 at 9:05 AM, Michael Russo <mi...@gmail.com>
>> wrote:
>>
>>> 1.  This simply means that the processing is being run in the calling
>>> thread (the queue consumer thread that reads off of SQS) instead of the
>>> async pool, because there are no threads available in the async pool.  It's
>>> typical during a full re-index or extreme write load.  The events are still
>>> getting processed and indexed.  We implemented this to ensure that the
>>> system does not get saturated and requests do not get rejected.
>>>
>>> 3.  It doesn't return the number of errors / failures, only the number
>>> of entities that it has processed.  Currently you'd only be able to see
>>> this information from parsing the log file.
>>>
>>> Thanks.
>>> -Michael
>>>
>>> On Wed, Jan 27, 2016 at 5:28 AM, Harish Singh Bisht <
>>> harish.bisht@comprotechnologies.com> wrote:
>>>
>>>> Hi Michael,
>>>>
>>>> Thank you very much for the detailed explanation. I deleted the ES data
>>>> folders, increased the ES bulk queue size to 2000, restarted ES / Usergrid
>>>> and then started the re-indexing process. This fixed most of the errors - a
>>>> big thanks to you.
>>>> Below are my observations when I started the re-indexing process today:
>>>>
>>>> 1. Usergrid logs show the below warnings very frequently
>>>> (Usergrid-SQS-Pool task queue full):
>>>> Jan 27 14:33:00 Usergrid-2 catalina.out:  2016-01-27 04:03:00 INFO
>>>> corepersistence.index.ReIndexServiceImpl.lambda$rebuildIndex$97(131)<RxCachedThreadScheduler-14>-
>>>> Sending batch of 1000 to be indexed.
>>>> Jan 27 14:33:01 Usergrid-2 catalina.out:  2016-01-27 04:03:01 WARN
>>>> core.executor.TaskExecutorFactory.rejectedExecution(171)<QueueConsumer_9>-
>>>> Usergrid-SQS-Pool task queue full, rejecting task
>>>> rx.schedulers.ExecutorScheduler$ExecutorSchedulerWorker@4ef107a and
>>>> running in thread QueueConsumer_9
>>>>
>>>> Does this mean some index requests are being rejected, resulting in a
>>>> loss of data in ES? Do I need to set scheduler.io.threads to a higher
>>>> value in the usergrid property file? Will this prevent a loss of data
>>>> during the re-indexing?
>>>>
>>>> 2. I needed to trigger the re-index API twice, as the first time it
>>>> appears to have hung / stopped in between (only 2 indexes were created out
>>>> of 6, and no docs were getting created after that).
>>>> After waiting for a few hours, I re-triggered the rebuild API again,
>>>> which seems to be working well now (the 5th index is in progress).
>>>>
>>>> 3. Does the reindex 'status' API "
>>>> http://localhost:8080/system/index/rebuild/<job-uuid>" return the
>>>> number of indexing errors / failures etc.?
>>>>
>>>> Please advise. Appreciate the help.
>>>>
>>>> Many Thanks,
>>>> Harish
>>>>
>>>> On Mon, Jan 25, 2016 at 10:15 PM, Michael Russo <
>>>> michaelarusso@gmail.com> wrote:
>>>>
>>>>> Hi Harish,
>>>>>
>>>>> I'm going to try explaining the errors by type:
>>>>>
>>>>> 1)
>>>>>
>>>>> This is occurring because Elasticsearch is not able to start the
>>>>> shard for the index.  If your cluster health is red, it is possible that
>>>>> you don't have enough disk space, or at one point you did not have enough
>>>>> disk space, so the already-created shards for the Usergrid indexes are
>>>>> corrupt.  Since Usergrid can completely rebuild the index, I'd suggest
>>>>> that you completely delete the Elasticsearch data directories, restart
>>>>> Elasticsearch, restart Usergrid, and begin the re-indexing process again.
>>>>> As an FYI, we have configured in the code a default of 18 shards and 1
>>>>> replica for each index.
>>>>>
>>>>> 2.)
>>>>>
>>>>> *Type 1* - Example exceptions that can occur due to the issues seen in
>>>>> #1.  Usergrid's use of the Elasticsearch SDK as a client will print these
>>>>> types of statements because it's aware of the Elasticsearch cluster state.
>>>>>
>>>>> *Type 2* - This shows that the async index queue consumers were
>>>>> unable to index an entity.  This is most likely due to #1 as well, but #1
>>>>> does not always have to be the cause of this exception.  If for any reason
>>>>> the consumer threads get an exception while trying to index the entity in
>>>>> Elasticsearch, this type of error will be seen.
>>>>>
>>>>> *Type 3 *- This message goes with 'Type 2'.
>>>>>
>>>>> *Type 4* - We have a finite threadpool size set for the asynchronous
>>>>> indexing (Usergrid-SQS-Pool).  The default used to be 100, but is now
>>>>> 40 in the master branch.  If there is no thread available in the pool
>>>>> (a rejected task), we implemented a 'caller runs' strategy in which the
>>>>> thread trying to hand off the event to the pool (Queue_Consumer)
>>>>> will execute the request itself.  The number of threads available in this
>>>>> pool can be configured with the *scheduler.io.threads* property.
>>>>>
>>>>> *Type 5* - Elasticsearch also maintains threadpool sizes, and will
>>>>> queue up requests when a thread is not available to process them.  The
>>>>> default queue sizes are fairly small if you're trying for high concurrency.
>>>>> However, if there is a problem with ES and it's unable to process requests
>>>>> properly, the queue would likely fill up regardless.  See
>>>>> https://www.elastic.co/guide/en/elasticsearch/reference/2.x/modules-threadpool.html.
>>>>> As an FYI, when we scaled Usergrid to 10,000 requests per second, we set our
>>>>> queue sizes for search and bulk to 5000.  That is a bit high; you should
>>>>> probably start with something lower, like 1000, if you plan to change them.
>>>>>
>>>>>
>>>>> Hope this info helps!  Let me know how things progress.
>>>>>
>>>>> Thanks!
>>>>> -Michael
>>>>>
>>>>>
>>>>> On Mon, Jan 25, 2016 at 5:06 AM, Harish Singh Bisht <
>>>>> harish.bisht@comprotechnologies.com> wrote:
>>>>>
>>>>>> Hi Michael,
>>>>>>
>>>>>> (resending email as I found another type of error in the logs)
>>>>>> Thank you very much for the help. I tried the full system re-index
>>>>>> API, but I am still seeing the below issues:
>>>>>>
>>>>>> 1. One of the ES indexes (usergrid_applications_4) still has 15
>>>>>> shards in "INITIALIZING" state. The ES cluster health is red.
>>>>>>  Log Extract from ES (details in appendices) shows that:
>>>>>> "[usergrid_applications_4][14] failed to start shard.".
>>>>>>
>>>>>> 2. Usergrid logs show the below types of errors after invoking the
>>>>>> re-index api (details in appendices):
>>>>>> Type 1
>>>>>> UnavailableShardsException[[usergrid_applications_4][7] Primary shard
>>>>>> is not active or isn't assigned is a known node. Timeout: [1m], request:
>>>>>> org.elasticsearch.action.bulk.BulkShardRequest@155fc46c]
>>>>>> Type 2
>>>>>> corepersistence.asyncevents.AmazonAsyncEventService.lambda$callEventHandlers$65(359)<Usergrid-SQS-Pool-13>-
>>>>>> Failed to index message: 17ed55a5-3091-4f0d-8620-12f2915668c1
>>>>>> Type 3
>>>>>> corepersistence.asyncevents.AmazonAsyncEventService.lambda$null$70(735)<Usergrid-SQS-Pool-16>-
>>>>>> Missing messages from queue post operation
>>>>>> Type 4
>>>>>> core.executor.TaskExecutorFactory.rejectedExecution(171)<QueueConsumer_11>-
>>>>>> Usergrid-SQS-Pool task queue full, rejecting task
>>>>>> rx.schedulers.ExecutorScheduler$ExecutorSchedulerWorker@131a4c5 and
>>>>>> running in thread QueueConsumer_11
>>>>>> Type 5
>>>>>> EsRejectedExecutionException[rejected execution (queue capacity 300)
>>>>>> on
>>>>>> org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1@13afccc8
>>>>>> ];
>>>>>>
>>>>>> Please suggest the next steps. Appreciate the help.
>>>>>>
>>>>>> Many Thanks,
>>>>>> Harish
>>>>>>
>>>>>>
>>>>>>
>>>>>> -------------------------------------------------------------------
>>>>>> *Appendices*
>>>>>> *1. ELASTICSEARCH LOG*
>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:  [2016-01-25 06:40:44,227][WARN
>>>>>> ][indices.cluster          ] [Blindside] [usergrid_applications_4][14]
>>>>>> failed to start shard
>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:
>>>>>>  org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException:
>>>>>> [usergrid_applications_4][14] failed to recover shard
>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   at
>>>>>> org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:287)
>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   at
>>>>>> org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:132)
>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   at
>>>>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   at
>>>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   at
>>>>>> java.lang.Thread.run(Thread.java:745)
>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:  Caused by:
>>>>>> org.elasticsearch.index.translog.TranslogCorruptedException: translog
>>>>>> corruption while reading from stream
>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   at
>>>>>> org.elasticsearch.index.translog.ChecksummedTranslogStream.read(ChecksummedTranslogStream.java:70)
>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   at
>>>>>> org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:257)
>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   ... 4 more
>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:  Caused by:
>>>>>> java.io.EOFException
>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   at
>>>>>> org.elasticsearch.common.io.stream.InputStreamStreamInput.readBytes(InputStreamStreamInput.java:53)
>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   at
>>>>>> org.elasticsearch.index.translog.BufferedChecksumStreamInput.readBytes(BufferedChecksumStreamInput.java:55)
>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   at
>>>>>> org.elasticsearch.common.io.stream.StreamInput.readBytesReference(StreamInput.java:86)
>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   at
>>>>>> org.elasticsearch.common.io.stream.StreamInput.readBytesReference(StreamInput.java:74)
>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   at
>>>>>> org.elasticsearch.index.translog.Translog$Index.readFrom(Translog.java:495)
>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   at
>>>>>> org.elasticsearch.index.translog.ChecksummedTranslogStream.read(ChecksummedTranslogStream.java:68)
>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   ... 5 more
>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:  [2016-01-25 06:40:44,279][WARN
>>>>>> ][cluster.action.shard     ] [Blindside] [usergrid_applications_4][14]
>>>>>> sending failed shard for [usergrid_applications_4][14],
>>>>>> node[lb-HRQpWRQGCeIadzTEHSw], [P], s[INITIALIZING], indexUUID
>>>>>> [fdyaoJQZQKuFeBONTQSD1g], reason [Failed to start shard, message
>>>>>> [IndexShardGatewayRecoveryException[[usergrid_applications_4][14] failed to
>>>>>> recover shard]; nested: TranslogCorruptedException[translog corruption
>>>>>> while reading from stream]; nested: EOFException; ]]
>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:  [2016-01-25 06:40:44,279][WARN
>>>>>> ][cluster.action.shard     ] [Blindside] [usergrid_applications_4][14]
>>>>>> received shard failed for [usergrid_applications_4][14],
>>>>>> node[lb-HRQpWRQGCeIadzTEHSw], [P], s[INITIALIZING], indexUUID
>>>>>> [fdyaoJQZQKuFeBONTQSD1g], reason [Failed to start shard, message
>>>>>> [IndexShardGatewayRecoveryException[[usergrid_applications_4][14] failed to
>>>>>> recover shard]; nested: TranslogCorruptedException[translog corruption
>>>>>> while reading from stream]; nested: EOFException; ]]
>>>>>> ......
>>>>>> *2. USERGRID LOG*
>>>>>> Jan 25 16:05:19 Usergrid-2 catalina.out:  2016-01-25 05:35:19 INFO
>>>>>> rest.system.IndexResource.rebuildIndexesPost(78)<http-bio-80-exec-92>-
>>>>>> Rebuilding all applications
>>>>>> Jan 25 16:05:19 Usergrid-2 catalina.out:  2016-01-25 05:35:19 INFO
>>>>>> corepersistence.index.ReIndexServiceImpl.lambda$rebuildIndex$97(131)<RxCachedThreadScheduler-35>-
>>>>>> Sending batch of 1000 to be indexed.
>>>>>> Jan 25 16:05:21 Usergrid-2 catalina.out:  2016-01-25 05:35:20 ERROR
>>>>>> index.impl.EsIndexProducerImpl.sendRequest(209)<Usergrid-SQS-Pool-13>-
>>>>>> Unable to index
>>>>>> id=appId(cd2bd460-a3e8-11e5-a327-0a75091e6d25,application).entityId(7c3328cc-bdd1-11e5-88d3-0a75091e6d25,activity).version(7c3328cd-bdd1-11e5-88d3-0a75091e6d25).nodeId(99400999-a3ef-11e5-a327-0a75091e6d25,group).edgeName(zzzcollzzz|feed).nodeType(TARGET),
>>>>>> type=entity, index=usergrid_applications_4,
>>>>>> failureMessage=UnavailableShardsException[[usergrid_applications_4][4]
>>>>>> Primary shard is not active or isn't assigned is a known node. Timeout:
>>>>>> [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@3d277a1
>>>>>> ]
>>>>>> .....
>>>>>> Jan 25 16:05:21 Usergrid-2 catalina.out:  2016-01-25 05:35:20 ERROR
>>>>>> corepersistence.asyncevents.AmazonAsyncEventService.lambda$callEventHandlers$65(359)<Usergrid-SQS-Pool-13>-
>>>>>> Failed to index message: 17ed55a5-3091-4f0d-8620-12f2915668c1
>>>>>> Jan 25 16:05:21 Usergrid-2 catalina.out:  java.lang.RuntimeException:
>>>>>> Error during processing of bulk index operations one of the responses
>>>>>> failed.
>>>>>> Jan 25 16:05:21 Usergrid-2 catalina.out:
>>>>>>  UnavailableShardsException[[usergrid_applications_4][4] Primary shard is
>>>>>> not active or isn't assigned is a known node. Timeout: [1m], request:
>>>>>> org.elasticsearch.action.bulk.BulkShardRequest@3d277a1]
>>>>>> Jan 25 16:05:21 Usergrid-2 catalina.out:
>>>>>>  UnavailableShardsException[[usergrid_applications_4][7] Primary shard is
>>>>>> not active or isn't assigned is a known node. Timeout: [1m], request:
>>>>>> org.elasticsearch.action.bulk.BulkShardRequest@155fc46c]
>>>>>> Jan 25 16:05:21 Usergrid-2 catalina.out:
>>>>>>  UnavailableShardsException[[usergrid_applications_4][10] Primary shard is
>>>>>> not active or isn't assigned is a known node. Timeout: [1m], request:
>>>>>> org.elasticsearch.action.bulk.BulkShardRequest@445968e3]
>>>>>> .......
>>>>>> Jan 25 16:53:02 Usergrid-2 catalina.out: 2016-01-25 06:23:02 ERROR
>>>>>> index.impl.EsIndexProducerImpl.sendRequest(209)<Usergrid-SQS-Pool-97>-
>>>>>> Unable to index
>>>>>> id=appId(65ccc2b7-bde0-11e5-88d3-0a75091e6d25,application).entityId(b7dcf6c9-bf62-11e5-88d3-0a75091e6d25,activity).version(b7dcf6cb-bf62-11e5-88d3-0a75091e6d25).nodeId(65ccc2b7-bde0-11e5-88d3-0a75091e6d25,application).edgeName(zzzcollzzz|activities).nodeType(TARGET),
>>>>>> type=entity, index=usergrid_applications_3,
>>>>>> failureMessage=RemoteTransportException[[Blindside][inet[/10.0.0.148:9300]][indices:data/write/bulk[s]]];
>>>>>> nested: EsRejectedExecutionException[rejected execution (queue capacity
>>>>>> 300) on
>>>>>> org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1@13afccc8
>>>>>> ];
>>>>>> Jan 25 16:53:03 Usergrid-2 catalina.out:
>>>>>> RemoteTransportException[[Blindside][inet[/10.0.0.148:9300]][indices:data/write/bulk[s]]];
>>>>>> nested: EsRejectedExecutionException[rejected execution (queue capacity
>>>>>> 300) on
>>>>>> org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1@59da0cf2
>>>>>> ];
>>>>>>
>>>>>> On Mon, Jan 25, 2016 at 5:54 PM, Harish Singh Bisht <
>>>>>> harish.bisht@comprotechnologies.com> wrote:
>>>>>>
>>>>>>> Hi Michael,
>>>>>>>
>>>>>>> Thank you very much for the help. I tried the full system re-index
>>>>>>> API, but I am still seeing the below issues:
>>>>>>>
>>>>>>> 1. One of the ES indexes (usergrid_applications_4) still has 15
>>>>>>> shards in "INITIALIZING" state. The ES cluster health is red.
>>>>>>>  Log Extract from ES (details in appendices) shows that:
>>>>>>> "[usergrid_applications_4][14] failed to start shard.".
>>>>>>>
>>>>>>> 2. Usergrid logs show the below types of errors after invoking the
>>>>>>> re-index api (details in appendices):
>>>>>>> Type 1
>>>>>>> UnavailableShardsException[[usergrid_applications_4][7] Primary
>>>>>>> shard is not active or isn't assigned is a known node. Timeout: [1m],
>>>>>>> request: org.elasticsearch.action.bulk.BulkShardRequest@155fc46c]
>>>>>>> Type 2
>>>>>>> corepersistence.asyncevents.AmazonAsyncEventService.lambda$callEventHandlers$65(359)<Usergrid-SQS-Pool-13>-
>>>>>>> Failed to index message: 17ed55a5-3091-4f0d-8620-12f2915668c1
>>>>>>> Type 3
>>>>>>> corepersistence.asyncevents.AmazonAsyncEventService.lambda$null$70(735)<Usergrid-SQS-Pool-16>-
>>>>>>> Missing messages from queue post operation
>>>>>>> Type 4
>>>>>>> core.executor.TaskExecutorFactory.rejectedExecution(171)<QueueConsumer_11>-
>>>>>>> Usergrid-SQS-Pool task queue full, rejecting task
>>>>>>> rx.schedulers.ExecutorScheduler$ExecutorSchedulerWorker@131a4c5 and
>>>>>>> running in thread QueueConsumer_11
>>>>>>>
>>>>>>> Please suggest the next steps. Appreciate the help.
>>>>>>>
>>>>>>> Many Thanks,
>>>>>>> Harish
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> -------------------------------------------------------------------
>>>>>>> *Appendices*
>>>>>>> *1. ELASTICSEARCH LOG*
>>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:  [2016-01-25 06:40:44,227][WARN
>>>>>>> ][indices.cluster          ] [Blindside] [usergrid_applications_4][14]
>>>>>>> failed to start shard
>>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:
>>>>>>>  org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException:
>>>>>>> [usergrid_applications_4][14] failed to recover shard
>>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   at
>>>>>>> org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:287)
>>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   at
>>>>>>> org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:132)
>>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   at
>>>>>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   at
>>>>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   at
>>>>>>> java.lang.Thread.run(Thread.java:745)
>>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:  Caused by:
>>>>>>> org.elasticsearch.index.translog.TranslogCorruptedException: translog
>>>>>>> corruption while reading from stream
>>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   at
>>>>>>> org.elasticsearch.index.translog.ChecksummedTranslogStream.read(ChecksummedTranslogStream.java:70)
>>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   at
>>>>>>> org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:257)
>>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   ... 4 more
>>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:  Caused by:
>>>>>>> java.io.EOFException
>>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   at
>>>>>>> org.elasticsearch.common.io.stream.InputStreamStreamInput.readBytes(InputStreamStreamInput.java:53)
>>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   at
>>>>>>> org.elasticsearch.index.translog.BufferedChecksumStreamInput.readBytes(BufferedChecksumStreamInput.java:55)
>>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   at
>>>>>>> org.elasticsearch.common.io.stream.StreamInput.readBytesReference(StreamInput.java:86)
>>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   at
>>>>>>> org.elasticsearch.common.io.stream.StreamInput.readBytesReference(StreamInput.java:74)
>>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   at
>>>>>>> org.elasticsearch.index.translog.Translog$Index.readFrom(Translog.java:495)
>>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   at
>>>>>>> org.elasticsearch.index.translog.ChecksummedTranslogStream.read(ChecksummedTranslogStream.java:68)
>>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   ... 5 more
>>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:  [2016-01-25 06:40:44,279][WARN
>>>>>>> ][cluster.action.shard     ] [Blindside] [usergrid_applications_4][14]
>>>>>>> sending failed shard for [usergrid_applications_4][14],
>>>>>>> node[lb-HRQpWRQGCeIadzTEHSw], [P], s[INITIALIZING], indexUUID
>>>>>>> [fdyaoJQZQKuFeBONTQSD1g], reason [Failed to start shard, message
>>>>>>> [IndexShardGatewayRecoveryException[[usergrid_applications_4][14] failed to
>>>>>>> recover shard]; nested: TranslogCorruptedException[translog corruption
>>>>>>> while reading from stream]; nested: EOFException; ]]
>>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:  [2016-01-25 06:40:44,279][WARN
>>>>>>> ][cluster.action.shard     ] [Blindside] [usergrid_applications_4][14]
>>>>>>> received shard failed for [usergrid_applications_4][14],
>>>>>>> node[lb-HRQpWRQGCeIadzTEHSw], [P], s[INITIALIZING], indexUUID
>>>>>>> [fdyaoJQZQKuFeBONTQSD1g], reason [Failed to start shard, message
>>>>>>> [IndexShardGatewayRecoveryException[[usergrid_applications_4][14] failed to
>>>>>>> recover shard]; nested: TranslogCorruptedException[translog corruption
>>>>>>> while reading from stream]; nested: EOFException; ]]
>>>>>>> ......
>>>>>>> *2. USERGRID LOG*
>>>>>>> Jan 25 16:05:19 Usergrid-2 catalina.out:  2016-01-25 05:35:19 INFO
>>>>>>> rest.system.IndexResource.rebuildIndexesPost(78)<http-bio-80-exec-92>-
>>>>>>> Rebuilding all applications
>>>>>>> Jan 25 16:05:19 Usergrid-2 catalina.out:  2016-01-25 05:35:19 INFO
>>>>>>> corepersistence.index.ReIndexServiceImpl.lambda$rebuildIndex$97(131)<RxCachedThreadScheduler-35>-
>>>>>>> Sending batch of 1000 to be indexed.
>>>>>>> Jan 25 16:05:21 Usergrid-2 catalina.out:  2016-01-25 05:35:20 ERROR
>>>>>>> index.impl.EsIndexProducerImpl.sendRequest(209)<Usergrid-SQS-Pool-13>-
>>>>>>> Unable to index
>>>>>>> id=appId(cd2bd460-a3e8-11e5-a327-0a75091e6d25,application).entityId(7c3328cc-bdd1-11e5-88d3-0a75091e6d25,activity).version(7c3328cd-bdd1-11e5-88d3-0a75091e6d25).nodeId(99400999-a3ef-11e5-a327-0a75091e6d25,group).edgeName(zzzcollzzz|feed).nodeType(TARGET),
>>>>>>> type=entity, index=usergrid_applications_4,
>>>>>>> failureMessage=UnavailableShardsException[[usergrid_applications_4][4]
>>>>>>> Primary shard is not active or isn't assigned is a known node. Timeout:
>>>>>>> [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@3d277a1
>>>>>>> ]
>>>>>>> .....
>>>>>>> Jan 25 16:05:21 Usergrid-2 catalina.out:  2016-01-25 05:35:20 ERROR
>>>>>>> corepersistence.asyncevents.AmazonAsyncEventService.lambda$callEventHandlers$65(359)<Usergrid-SQS-Pool-13>-
>>>>>>> Failed to index message: 17ed55a5-3091-4f0d-8620-12f2915668c1
>>>>>>> Jan 25 16:05:21 Usergrid-2 catalina.out:
>>>>>>>  java.lang.RuntimeException: Error during processing of bulk index
>>>>>>> operations one of the responses failed.
>>>>>>> Jan 25 16:05:21 Usergrid-2 catalina.out:
>>>>>>>  UnavailableShardsException[[usergrid_applications_4][4] Primary shard is
>>>>>>> not active or isn't assigned is a known node. Timeout: [1m], request:
>>>>>>> org.elasticsearch.action.bulk.BulkShardRequest@3d277a1]
>>>>>>> Jan 25 16:05:21 Usergrid-2 catalina.out:
>>>>>>>  UnavailableShardsException[[usergrid_applications_4][7] Primary shard is
>>>>>>> not active or isn't assigned is a known node. Timeout: [1m], request:
>>>>>>> org.elasticsearch.action.bulk.BulkShardRequest@155fc46c]
>>>>>>> Jan 25 16:05:21 Usergrid-2 catalina.out:
>>>>>>>  UnavailableShardsException[[usergrid_applications_4][10] Primary shard is
>>>>>>> not active or isn't assigned is a known node. Timeout: [1m], request:
>>>>>>> org.elasticsearch.action.bulk.BulkShardRequest@445968e3]
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Jan 22, 2016 at 10:29 PM, Michael Russo <
>>>>>>> michaelarusso@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Harish,
>>>>>>>>
>>>>>>>> Yeah, in Usergrid 2 you can re-index all of the entity data that
>>>>>>>> exists in Cassandra.  Here are example curl requests that invoke the
>>>>>>>> reindex APIs:
>>>>>>>>
>>>>>>>> *Full system re-index:*
>>>>>>>>
>>>>>>>> curl -i -X POST -u <sysadmin user>:<sysadmin pass> "
>>>>>>>> http://localhost:8080/system/index/rebuild"
>>>>>>>>
>>>>>>>> *Per application re-index:*
>>>>>>>>
>>>>>>>> curl -i -X POST -u <sysadmin user>:<sysadmin pass> "
>>>>>>>> http://localhost:8080/system/index/rebuild/<application uuid>"
>>>>>>>>
>>>>>>>> Thanks.
>>>>>>>> -Michael
>>>>>>>>
>>>>>>>> On Fri, Jan 22, 2016 at 4:37 AM, Harish Singh Bisht <
>>>>>>>> harish.bisht@comprotechnologies.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Team,
>>>>>>>>>
>>>>>>>>> We have been testing our application based on Usergrid 2
>>>>>>>>> (master branch) and started noticing unusually poor performance, with
>>>>>>>>> spikes in the response time.
>>>>>>>>>
>>>>>>>>> Our investigation revealed that during the load testing we ran
>>>>>>>>> out of HDD space on the single-node Elasticsearch cluster. This led to
>>>>>>>>> indexing failures.
>>>>>>>>>
>>>>>>>>> So we increased the HDD space and restarted ES. But now the
>>>>>>>>> cluster health is red and a lot of shards are in the INITIALIZING state.
>>>>>>>>> It seems data has been lost on the ES node.
>>>>>>>>>
>>>>>>>>> Is there any way to recover the lost data in ES? Specifically, is
>>>>>>>>> there a way to trigger a re-index of data from Cassandra to ES?
>>>>>>>>>
>>>>>>>>> Appreciate the help.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Harish
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Regards,
>>>>>>>>> Harish Singh Bisht
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Regards,
>>>>>>> Harish Singh Bisht
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Regards,
>>>>>> Harish Singh Bisht
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Regards,
>>>> Harish Singh Bisht
>>>>
>>>>
>>>
>>
>>
>> --
>> Regards,
>> Harish Singh Bisht
>>
>>
>


-- 
Regards,
Harish Singh Bisht