Posted to dev@shindig.apache.org by "Merrill, Matt" <mm...@mitre.org> on 2014/07/09 20:40:30 UTC

Performance problems with opensocial rpc calls

Hi shindig devs,

We are in the process of upgrading from shindig 2.0 to 2.5-update1, and everything had gone OK. However, once we got into our production environment, we started seeing significant slowdowns in the opensocial RPC calls that shindig makes to itself when rendering a gadget.

This obviously depends heavily on how we’ve implemented the shindig interfaces in our own code, and on our infrastructure. Still, we’re hoping someone on the list can suggest areas to investigate, inside shindig itself or in general.

Here’s what’s happening:
* Gadgets load fine when the app is not experiencing much load (< 10 users rendering 10-12 gadgets on a page)
* Once a reasonable subset of users begins rendering gadgets, gadget render calls through the “ifr” endpoint start taking a very long time to respond
* The problem gets worse from there
* Even with extensive load testing we can’t recreate this problem in our testing environments
* Our system administrators have assured us that the configurations of our servers are the same between int and prod

This is an example of what we’re seeing from the logs inside BasicHttpFetcher:
http://238redacteddnsprefix234.gadgetsv2.company.com:7001/gmodules/rpc?st=mycontainer%3AvY2rb-teGXuk9HX8d6W0rm6wE6hkLxM95ByaSMQlV8RudwohiAFqAliywVwc5yQ8maFSwK7IEhogNVnoUXa-doA3_h7EbSDGq_DW5i_VvC0CFEeaTKtr70A9XgYlAq5T95j7mivGO3lXVBTayU2PFNSdnLu8xtQEJJ7YrlmekEYyERmTSQmi7n2wZlmnG2puxVkegQKWNpdzOH4xCfgROnNCnAI is responding slowly. 12,449 ms elapsed.

We’ll continue to get these warnings for rpc calls for many different gadgets; the elapsed times grow, and ultimately every gadget render slows to a crawl.

Some other relevant information:
* We have implemented “throttling” logic in our own custom HttpFetcher, which extends the BasicHttpFetcher.  Basically, it keeps track of how many outgoing requests are in flight for a given url, and if too many are running concurrently, it starts rejecting new outgoing requests.  This was done to avoid a situation where a slowly responding external service ties up all of shindig’s external http connections.  In our case, I believe that because our rpc endpoint is taking so long to respond, we start rejecting these requests with our throttling logic.
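For illustration, the per-URL throttling described above can be sketched roughly like this (class name, constant, and the limit of 20 are all hypothetical, not our actual code; entries are never evicted in this sketch):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of per-URL request throttling for an HttpFetcher.
class ThrottlingSketch {
    private static final int MAX_CONCURRENT_PER_URL = 20; // assumed limit

    // Count of in-flight requests per URL. Entries are never removed here.
    private final ConcurrentHashMap<String, AtomicInteger> inFlight =
            new ConcurrentHashMap<>();

    // Returns true if the request may proceed; the caller must invoke
    // release(url) when the request completes (e.g. in a finally block).
    boolean tryAcquire(String url) {
        AtomicInteger count = inFlight.computeIfAbsent(url, k -> new AtomicInteger());
        if (count.incrementAndGet() > MAX_CONCURRENT_PER_URL) {
            count.decrementAndGet(); // over the limit: reject this outgoing request
            return false;
        }
        return true;
    }

    void release(String url) {
        AtomicInteger count = inFlight.get(url);
        if (count != null) {
            count.decrementAndGet();
        }
    }
}
```

A fetcher would call tryAcquire(url) before issuing the request and release(url) in a finally block; once a slow endpoint keeps the counter pinned at the limit, every new request to that URL is rejected, which matches the failure mode described.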

I have tried to trace through the rpc calls inside the shindig code starting in the RpcServlet, and as best I can tell, these rpc calls are used for:
* getting viewer data
* getting application data
* anything else?

I’ve also looked at the BasicHTTPFetcher, but nothing stands out at me at first glance that would cause such a difference in performance between environments if, as our sys admins say, they are the same.

Additionally, I’ve ensured that the database table which contains our Application Data has been indexed properly (by person ID and gadget url) and that person data is cached.
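For illustration only (table and column names are hypothetical; the actual schema is application-specific), a composite index covering lookups by person ID and gadget url would look like:

```sql
-- Hypothetical names; covers lookups filtered by person and gadget url.
CREATE INDEX idx_app_data_person_gadget
    ON application_data (person_id, gadget_url);
```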

Any other ideas, or areas in the codebase to explore are very much appreciated.

Thanks!
-Matt

Re: Performance problems with opensocial rpc calls

Posted by "Merrill, Matt" <mm...@mitre.org>.
Yep, that’s where I’m headed next.  Obviously there’s some hesitation to
do that on the part of our product owners so it takes a while to get to
that point.

Will let you know what I find.

Thanks!
-Matt


Re: Performance problems with opensocial rpc calls

Posted by Ryan Baxter <rb...@gmail.com>.
Matt, I think further investigation is warranted.  I really think you
need to find a way to trace through the code and find where the
slowdown is occurring.  That will help us narrow down what the problem
is.  I know it is production, but getting some code on there that
starts timing method calls and such can be very useful.
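As a sketch of this suggestion (names are hypothetical; this is not Shindig code), a minimal timing wrapper that can be dropped around suspect calls in production might look like:

```java
import java.util.function.Supplier;

// Hypothetical helper for narrowing down slow calls without a profiler.
final class Timed {
    // Runs the supplied call, logs its elapsed time, and returns its result.
    static <T> T time(String label, Supplier<T> call) {
        long start = System.nanoTime();
        try {
            return call.get();
        } finally {
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println(label + " took " + elapsedMs + " ms");
        }
    }
}
```

Wrapping, say, the application-data fetch and the person-data fetch separately would show which leg of the rpc handling is actually slow under load.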


Re: Performance problems with opensocial rpc calls

Posted by "Merrill, Matt" <mm...@mitre.org>.
Hi Ryan,

Thanks for responding!

I’ve attached our ehcacheConfig; comparing it to the default configuration,
the only differences are the overall number of elements (10000 in ours vs
1000 in the default) and the temp disk store location.

I’m assuming you are asking whether each user in our system has the exact
same set of gadgets to render. If so: different users have different sets of
gadgets, but many of them have a default set we give them when they are
initially set up in our system.  So many people will hit the same gadgets
over and over again.  This default subset is about 10-12 different gadgets
and is by and large what many users have.

However, we have a total of 48 different gadgets that could be rendered by
a user at any given time on this instance of shindig.  We do run another
instance of shindig which could render a different subset of gadgets, but
that has a much lower usage and only renders about 10 different gadgets
altogether.


I am admittedly rusty with my ehCache configuration knowledge, but here are
a couple of things I noticed:
* The maxBytesLocalHeap in the ehCacheConfig is 50MB, which seems low;
however, this is the same setting we had in shindig 2.0, so I have to
wonder whether it has anything to do with it.
* Our old ehCache configuration for shindig 2.0 specified a defaultCache
maxElementsInMemory of 1000 but NO sizeOfPolicy at all.
* Our new ehCache configuration for shindig 2.5 specifies a sizeOfPolicy
maxDepth of 10000 but NO defaultCache maxElementsInMemory.
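Concretely, the 2.5 settings being described would look something like this in ehcache.xml (a sketch, not our full config; attribute names are from the standard ehcache 2.x configuration schema):

```xml
<!-- Sketch of the shindig 2.5 settings described above. -->
<ehcache maxBytesLocalHeap="50M">
  <!-- sizeOfPolicy maxDepth limits how deep the sizing engine walks each
       cached object graph; it is not an element-count cap. -->
  <sizeOfPolicy maxDepth="10000" maxDepthExceededBehavior="continue"/>
  <!-- Note: no defaultCache maxElementsInMemory, unlike the 2.0 config. -->
</ehcache>
```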

Our heap sizes in Tomcat are 2048MB, which seems adequate given a 50MB max
heap for a cache. This is the same heap size from when we were using
shindig 2.0.  Unfortunately, we don’t have profiling tools enabled on our
Tomcat instances, so I can’t see what the heap looked like when things
crashed, and like I said, we’re unable to reproduce this in int.

I think we might be on to something here… I will keep searching but if any
devs out there have any ideas, please let me know.

Thanks shindig list!
-Matt


Re: Performance problems with opensocial rpc calls

Posted by Ryan Baxter <rb...@gmail.com>.
Matt, can you tell us more about how you have configured the caches in shindig?  When you are rendering these gadgets, are you rendering the same gadget across all users?

-Ryan


Re: Performance problems with opensocial rpc calls

Posted by "Merrill, Matt" <mm...@mitre.org>.
Stanton, 

Thanks for responding!

This is one instance of shindig.

If you mean the configuration within the container and for the shindig
java app, then yes, the locked domains are the same.  In fact, the
configuration, with the exception of shindig’s host URLs, is exactly the
same from what I can tell.

Unfortunately, I don't have any way to trace that exact message, but I did
do a traceroute from the server running shindig to the URL that is being
called for rpc calls to make sure there weren't any extra network hops.
There weren't; it actually had only one hop, as expected for an app making
an HTTP call to itself.

Thanks again for responding.

-Matt

On 7/9/14, 3:08 PM, "Stanton Sievers" <ss...@apache.org> wrote:

>Hi Matt,
>
>Is the configuration for locked domains and security tokens consistent
>between your test and production environments?
>
>Do you have any way of tracing the request in the log entry you provided
>through the network?  Is this a single Shindig server or is there any load
>balancing occurring?
>
>Regards,
>-Stanton
>
>
>On Wed, Jul 9, 2014 at 2:40 PM, Merrill, Matt <mm...@mitre.org> wrote:
>
>> Hi shindig devs,
>>
>> We are in the process of upgrading from shindig 2.0 to 2.5-update1 and
>> everything has gone ok, however, once we got into our production
>> environment, we are seeing significant slowdowns for the opensocial RPC
>> calls that shindig makes to itself when rendering a gadget.
>>
>> This is obviously very dependent on how we've implemented the shindig
>> interfaces in our own code, and also our infrastructure, however, so we're
>> hoping someone on the list can help give us some more ideas for areas to
>> investigate inside shindig itself or in general.
>>
>> Here's what's happening:
>> * Gadgets load fine when the app is not experiencing much load (< 10
>>users
>> rendering 10-12 gadgets on a page)
>> * Once a reasonable subset of users begins rendering gadgets, gadget
>> render calls through the "ifr" endpoint start taking a very long time to
>> respond
>> * The problem gets worse from there
>> * Even with extensive load testing we can't recreate this problem in our
>> testing environments
>> * Our system administrators have assured us that the configurations of
>> our servers are the same between int and prod
>>
>> This is an example of what we're seeing from the logs inside
>> BasicHttpFetcher:
>>
>> 
>>http://238redacteddnsprefix234.gadgetsv2.company.com:7001/gmodules/rpc?st
>>=mycontainer%3AvY2rb-teGXuk9HX8d6W0rm6wE6hkLxM95ByaSMQlV8RudwohiAFqAliywV
>>wc5yQ8maFSwK7IEhogNVnoUXa-doA3_h7EbSDGq_DW5i_VvC0CFEeaTKtr70A9XgYlAq5T95j
>>7mivGO3lXVBTayU2PFNSdnLu8xtQEJJ7YrlmekEYyERmTSQmi7n2wZlmnG2puxVkegQKWNpdz
>>OH4xCfgROnNCnAI
>> is responding slowly. 12,449 ms elapsed.
>>
>> We'll continue to get these warnings for rpc calls for many different
>> gadgets, the amount of time elapsed will grow, and ultimately every
>> gadget render slows to a crawl.
>>
>> Some other relevant information:
>> * We have implemented "throttling" logic in our own custom HttpFetcher
>> which extends the BasicHttpFetcher.  Basically, what this does is keep
>> track of how many outgoing requests are happening for a given url, and
>> if there are too many concurrent ones going at once, it will start
>> rejecting outgoing requests.  This was done to avoid a situation where
>> an external service is responding slowly and ties up all of shindig's
>> external http connections.  In our case, I believe that because our rpc
>> endpoint is taking so long to respond, we start rejecting these
>> requests with our throttling logic.
>>
>> I have tried to trace through the rpc calls inside the shindig code
>> starting in the RpcServlet, and as best I can tell, these rpc calls are
>> used for:
>> * getting viewer data
>> * getting application data
>> * anything else?
>>
>> I've also looked at the BasicHTTPFetcher, but nothing stands out at me
>> at
>> first glance that would cause such a difference in performance between
>> environments if, as our sys admins say, they are the same.
>>
>> Additionally, I've ensured that the database table which contains our
>> Application Data has been indexed properly (by person ID and gadget url)
>> and that person data is cached.
>>
>> Any other ideas, or areas in the codebase to explore are very much
>> appreciated.
>>
>> Thanks!
>> -Matt
>>


Re: Performance problems with opensocial rpc calls

Posted by Stanton Sievers <ss...@apache.org>.
Hi Matt,

Is the configuration for locked domains and security tokens consistent
between your test and production environments?

Do you have any way of tracing the request in the log entry you provided
through the network?  Is this a single Shindig server or is there any load
balancing occurring?

Regards,
-Stanton


On Wed, Jul 9, 2014 at 2:40 PM, Merrill, Matt <mm...@mitre.org> wrote:

> Hi shindig devs,
>
> We are in the process of upgrading from shindig 2.0 to 2.5-update1 and
> everything has gone ok, however, once we got into our production
> environment, we are seeing significant slowdowns for the opensocial RPC
> calls that shindig makes to itself when rendering a gadget.
>
> This is obviously very dependent on how we’ve implemented the shindig
> interfaces in our own code, and also our infrastructure, however, so we’re
> hoping someone on the list can help give us some more ideas for areas to
> investigate inside shindig itself or in general.
>
> Here’s what’s happening:
> * Gadgets load fine when the app is not experiencing much load (< 10 users
> rendering 10-12 gadgets on a page)
> * Once a reasonable subset of users begins rendering gadgets, gadget
> render calls through the “ifr” endpoint start taking a very long time to
> respond
> * The problem gets worse from there
> * Even with extensive load testing we can’t recreate this problem in our
> testing environments
> Our system administrators have assured us that the configurations of our
> servers are the same between int and prod
>
> This is an example of what we’re seeing from the logs inside
> BasicHttpFetcher:
>
> http://238redacteddnsprefix234.gadgetsv2.company.com:7001/gmodules/rpc?st=mycontainer%3AvY2rb-teGXuk9HX8d6W0rm6wE6hkLxM95ByaSMQlV8RudwohiAFqAliywVwc5yQ8maFSwK7IEhogNVnoUXa-doA3_h7EbSDGq_DW5i_VvC0CFEeaTKtr70A9XgYlAq5T95j7mivGO3lXVBTayU2PFNSdnLu8xtQEJJ7YrlmekEYyERmTSQmi7n2wZlmnG2puxVkegQKWNpdzOH4xCfgROnNCnAI
> is responding slowly. 12,449 ms elapsed.
>
> We’ll continue to get these warnings for rpc calls for many different
> gadgets, the amount of time elapsed will grow, and ultimately every gadget
> render slows to a crawl.
>
> Some other relevant information:
> * We have implemented “throttling” logic in our own custom HttpFetcher
> which extends the BasicHttpFetcher.  Basically, what this does is keep
> track of how many outgoing requests are happening for a given url, and if
> there are too many concurrent ones going at once, it will start rejecting
> outgoing requests.  This was done to avoid a situation where an external
> service is responding slowly and ties up all of shindig’s external http
> connections.  In our case, I believe that because our rpc endpoint is
> taking so long to respond, we start rejecting these requests with our
> throttling logic.
>
> I have tried to trace through the rpc calls inside the shindig code
> starting in the RpcServlet, and as best I can tell, these rpc calls are
> used for:
> * getting viewer data
> * getting application data
> * anything else?
>
> I’ve also looked at the BasicHTTPFetcher, but nothing stands out at me at
> first glance that would cause such a difference in performance between
> environments if, as our sys admins say, they are the same.
>
> Additionally, I’ve ensured that the database table which contains our
> Application Data has been indexed properly (by person ID and gadget url)
> and that person data is cached.
>
> Any other ideas, or areas in the codebase to explore are very much
> appreciated.
>
> Thanks!
> -Matt
>