Posted to user@usergrid.apache.org by Jaskaran Singh <ja...@comprotechnologies.com> on 2015/12/05 16:07:06 UTC

Usergrid 2.x Issues

Hello All,

We are testing Usergrid 2.x (master branch) for our application, which was
previously being prototyped on Usergrid 1.x. We are noticing some odd
anomalies that cause errors in our application, which otherwise works fine
against Usergrid 1.x. Specifically, we are seeing empty responses when
querying custom collections for a particular entity record.
Following is an example of one such query:
http://server-name/b2perf1/default/userdata?client_id=<...>&client_secret=<....>&ql=userproductid='4d543507-9839-11e5-ba08-0a75091e6d25~~5c856de9-9828-11e5-ba08-0a75091e6d25'

In the above scenario, we are querying a custom collection "userdata". Under
high-load conditions (performance tests), this query starts returning an
empty entities array (see below), even though the entity did exist at one
point and we have no code or logic that deletes entities.
{
    "action": "get",
    "application": "0f7a2396-9826-11e5-ba08-0a75091e6d25",
    "params": {
        "ql": [

"userproductid='4d543507-9839-11e5-ba08-0a75091e6d25~~5c856de9-9828-11e5-ba08-0a75091e6d25'"
        ]
    },
    "path": "/userdata",
    "uri": "http://localhost:8080/b2perf1/default/userdata",
    "entities": [],
    "timestamp": 1449322746733,
    "duration": 1053,
    "organization": "b2perf1",
    "applicationName": "default",
    "count": 0
}

This has been happening quite randomly and intermittently, and we have not
been able to isolate any reproduction steps beyond running load /
performance tests until the problem eventually shows up.
Note that the entities are created before the load test, and we can confirm
that they existed before the test ran.

We have never noticed this issue for non-query calls (i.e. calls that do
not provide a field to query on).
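For example (credentials elided as above), a direct fetch of the same
entity by UUID keeps working even while the filtered query comes back
empty; the UUID below is a placeholder:

# query on an indexed field -- intermittently empty under load
curl "http://server-name/b2perf1/default/userdata?ql=userproductid='4d543507-9839-11e5-ba08-0a75091e6d25~~5c856de9-9828-11e5-ba08-0a75091e6d25'&client_id=<...>&client_secret=<....>"

# direct fetch by entity UUID -- no ql, never affected in our tests
curl "http://server-name/b2perf1/default/userdata/<entity-uuid>?client_id=<...>&client_secret=<....>"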

Our suspicion is that these records still exist in Cassandra (we have never
deleted them), but the Elasticsearch index is not in sync with it or is not
functioning properly.
How do we go about debugging this problem? Is there any particular logging
or metric we can check to confirm whether the Elasticsearch index is up to
date with the changes in Cassandra?

Any other suggestions will be greatly appreciated.

Thanks
Jaskaran

Re: Usergrid 2.x Issues

Posted by Michael Russo <mi...@gmail.com>.
There's already a ticket:
https://issues.apache.org/jira/browse/USERGRID-1051

Just as an FYI: when we load tested Usergrid at upwards of 10k TPS, we were
using a search queue size of 5000. In fact, we gave all of the Elasticsearch
queues a size of 5000, including the bulk queue.
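If it helps, on ES 1.x those queue sizes can also be raised on a live
cluster via the cluster settings API, without a restart (a sketch; the host
and sizes are illustrative):

# transiently raise the search and bulk queue sizes on a running ES 1.x node
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": {
    "threadpool.search.queue_size": 5000,
    "threadpool.bulk.queue_size": 5000
  }
}'

The persistent equivalent is threadpool.search.queue_size (and friends) in
elasticsearch.yml.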


Re: Usergrid 2.x Issues

Posted by Jaskaran Singh <ja...@comprotechnologies.com>.
Hi Michael,

I am providing an update on our situation. We have changed our application
logic to minimize the use of queries (i.e. calls with "ql=...") in Usergrid
2.x. This seems to have provided a significant benefit, and all the problems
reported earlier in this thread seem to have disappeared.
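As an illustration of the kind of change (a sketch; the naming scheme is
ours, not a Usergrid requirement): where we previously filtered on a
composite field, we now store that composite id as the entity name and
fetch it directly, which avoids the query index entirely:

# before: filtered query, served by the Elasticsearch index
curl "http://server-name/b2perf1/default/userdata?ql=userproductid='<userid>~~<productid>'&client_id=<...>&client_secret=<....>"

# after: direct fetch by entity name, no ql involved
curl "http://server-name/b2perf1/default/userdata/<userid>~~<productid>?client_id=<...>&client_secret=<....>"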

To some extent this is good news. However, we were lucky to be able to work
around this in our application logic, and we would like to understand any
limitations or best practices around the use of queries (which are serviced
by Elasticsearch in Usergrid 2.x) under high-load situations.

Also, please let me know if there is an existing Jira issue for addressing
the empty entity response when Elasticsearch is overloaded, or should I add
one?

Thanks in advance,
Jaskaran



Re: Usergrid 2.x Issues

Posted by Jaskaran Singh <ja...@comprotechnologies.com>.
Hi Michael,

This makes sense. I can confirm that while we have been seeing
missing-entity errors under high load, these automatically resolve
themselves as the load decreases.

Another anomaly we have noticed is that Usergrid responds with a "401" code
and the message "Unable to authenticate OAuth credentials" for certain
users' credentials under high load, while the same credentials work fine
once the load reduces. Can we assume that this issue (intermittently
invalid credentials) has the same underlying root cause (i.e. Elasticsearch
is not responding)? Below are a few examples of the error_description
values for such 401 errors:
1. 'invalid username or password'
2. 'Unable to authenticate OAuth credentials'
3. 'Unable to authenticate due to corrupt access token'
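For reference, the calls that intermittently 401 are standard user token
requests of this shape (a sketch; values elided):

# user token request against the app's token endpoint
curl -X POST "http://server-name/b2perf1/default/token" \
  -H "Content-Type: application/json" \
  -d '{"grant_type":"password","username":"<user>","password":"<pass>"}'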

Regarding your suggestion to increase the search thread pool queue size: we
were already using a setting of 1000 (with 320 threads). Should we consider
increasing it further, or simply provide additional resources (CPU / RAM)
to the ES process?

Additionally, we are seeing Cassandra connection timeouts under high-load
conditions, specifically exceptions like the one below:
ERROR stage.write.WriteCommit.call(132)<Usergrid-Collection-Pool-12>-
Failed to execute write asynchronously
com.netflix.astyanax.connectionpool.exceptions.TimeoutException:
TimeoutException: [host=10.0.0.237(10.0.0.237):9160, latency=2003(2003),
attempts=1]org.apache.thrift.transport.TTransportException:
java.net.SocketTimeoutException: Read timed out

These exceptions occur even though OpsCenter was reporting only medium load
on our cluster. Is there a way to tune the Astyanax connection pool? Please
let us know if you have any recommendations in this area.
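To cross-check the Cassandra side while the client is timing out, the
standard node tooling may help (a sketch; run on each Cassandra node):

# thread pool stages: look for pending/blocked counts and dropped READ/MUTATION messages
nodetool tpstats

# per-keyspace read/write latencies, to compare against the ~2s client-side timeout
nodetool cfstats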

Thanks a lot for the help.

Jaskaran


Re: Usergrid 2.x Issues

Posted by Michael Russo <mi...@gmail.com>.
Here are a couple of things to check:

1) Can you query all of these entities out when the system is not under load?
2) Elasticsearch has a search queue for index query requests
(https://www.elastic.co/guide/en/elasticsearch/reference/1.6/modules-threadpool.html).
When this queue is full, searches are rejected. Currently Usergrid surfaces
this as no results returned, rather than an "unable to query" or other
identifying error message (we're aware of this and plan to fix it in the
future). Try increasing the queue size to 1000. You might see delayed
results, but this can prevent empty results for data that's known to be in
the index.
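A quick way to confirm whether rejections are actually happening is to
watch the search pool counters while the load test runs (a sketch against
one ES 1.x node; the host is illustrative):

# active threads, queued requests, and cumulative rejections for the search pool
curl 'http://localhost:9200/_cat/thread_pool?v&h=host,search.active,search.queue,search.rejected'

If search.rejected climbs during the test, you are hitting the failure mode
described above.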

Thanks.
-Michael R.
