Posted to dev@jmeter.apache.org by Philippe Mouawad <p....@ubik-ingenierie.com> on 2016/08/11 18:53:38 UTC

Axis of Performance improvement of Resource Download ?

Hello,

Today, when we activate the "Download embedded resources" in Http Request
this has a certain CPU impact related to HTML parsing to extract the links.

Do you think it would be interesting to cache the parsed links based on the
MD5 of the file (similar to what was done in 59885)?

The difference here is that HTML content will vary much more than CSS
content, so we may end up computing MD5 hashes (which costs some CPU) for
nothing and, with a mandatory LRU cache, evicting parsed pages too frequently.
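
To make the idea concrete, here is a minimal sketch of what such a cache could look like. All names here (ParsedLinksCache, MAX_ENTRIES) are invented for illustration, not existing JMeter code:

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: an LRU cache of parsed links, keyed by the MD5 of the page body.
public class ParsedLinksCache {
    private static final int MAX_ENTRIES = 1000; // bound chosen arbitrarily for the sketch

    // LinkedHashMap in access-order mode gives a simple LRU eviction policy.
    private final Map<String, List<String>> cache =
            new LinkedHashMap<String, List<String>>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, List<String>> eldest) {
                    return size() > MAX_ENTRIES;
                }
            };

    public static String md5Hex(byte[] body) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(body);
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is always available in the JDK
        }
    }

    public synchronized List<String> get(String md5) {
        return cache.get(md5);
    }

    public synchronized void put(String md5, List<String> links) {
        cache.put(md5, links);
    }
}
```

The open question below is exactly whether the md5Hex cost plus evictions would eat the savings.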

What do you think of this optimization?
Is it useless?

-- 
Regards.
Philippe

Re: Axis of Performance improvement of Resource Download ?

Posted by sebb <se...@gmail.com>.
Seems to me that there are several aspects to whether this is worth it:

1) the cost of hashing the page and maintaining the cache (not
ignoring the memory requirements)
2) the cost of parsing the page
3) the likelihood of cache hits.

It's only worth the effort if test plans are likely to result in cache hits.

As has been pointed out, for CSS URLs the content is generally
constant, so the likelihood of cache hits depends on the likelihood of
encountering the same content.
This will generally be quite high for a specific site, thus caching
CSS can make sense.

However, for HTML pages, even the same URL often returns different
content (timestamps, cookies, etc.).

One way to find out would be to measure this for some existing
real-world test plans.

This could be done with a simple Listener/Post-Processor that does the
hashing for each html page and logs the results. The hashes could be
extracted from the log and used to derive stats for the potential
cache hits.
(Or the listener could do the stats, but that would increase
complexity and resources. Or one could use the existing Save Responses
Listener and post-process the files, but that would require a lot more
storage.)
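
The core of that measurement could look like the following, stripped of any JMeter API (all names here are made up for illustration): hash each HTML response, count how often a hash repeats, and report the potential hit rate.

```java
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.Map;

// Illustrative only: feed one response body per sample; afterwards it reports
// what fraction of samples would have been cache hits under content hashing.
public class HashHitCounter {
    private final Map<String, Integer> counts = new HashMap<>();
    private long samples = 0;

    public void record(byte[] htmlBody) throws Exception {
        byte[] d = MessageDigest.getInstance("MD5").digest(htmlBody);
        StringBuilder sb = new StringBuilder(d.length * 2);
        for (byte b : d) sb.append(String.format("%02x", b));
        counts.merge(sb.toString(), 1, Integer::sum);
        samples++;
    }

    // A potential cache hit is any sample whose hash was already seen;
    // the first occurrence of each distinct hash is necessarily a miss.
    public double potentialHitRate() {
        if (samples == 0) return 0.0;
        long misses = counts.size();
        return (samples - misses) / (double) samples;
    }
}
```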

I don't think it's worth proceeding without some data showing that cache
hits are sufficiently frequent in practice.


On 11 August 2016 at 20:44, Philippe Mouawad <ph...@gmail.com> wrote:
> On Thu, Aug 11, 2016 at 9:36 PM, Vladimir Sitnikov <
> sitnikov.vladimir@gmail.com> wrote:
>
>> Philippe Mouawad>
>>
>> > "certain"  in my sentence does not mean "certainty" :-) at least from
>> what
>> > I understand in english.
>> >
>>
>> Of course I mean "please provide some measurements of the parsing overhead"
>>  :-)
>>
>> Philippe Mouawad>
>>
>> > It more means "an impact of a certain degree".
>> > No numbers, more of reasoning that Parsing (based on Jodd or JSoup) comes
>> > at the cost of Regexp parsing, which I think has certainly :-) a cost
>> right
>> > ?
>> >
>>
>> Do you have some numbers to compare?
>>
>
> No, before starting any work on this I wanted to have some feedback.
> I don't want to spend too much time on a potentially bad idea.
>
>
>
>> Of course HTML parsing is not free. The basic question is how much CPU does
>> it take, so we can analyze/compare/reproduce that.
>>
>>  Philippe Mouawad>
>>
>> > That was my doubt. But take an ecommerce website where part of users are
>> > navigating anonymously, don't you think an important part of the pages is
>> > similar ?
>> > - product page
>> > - home page
>> > - category page
>> > ...
>> >
>> I do not have such experience, so I cannot tell what would be the hit rate.
>>
>>
>> Philippe> Maybe user could indicate in a way when to optimize and when not
>> ?
>>
>> That reminds me
>> http://mrale.ph/blog/2015/01/11/whats-up-with-monomorphism.html
>> For instance: make each HTTP samplers store additional state.
>> The state is one of "unknown" (initial), "has duplicates" (that is when we
>> check cache first), "always unique" (avoid caching as sampler is known to
>> sending unique outputs).
>>
>> So the first several executions we estimate if the sampler is worth
>> caching, then we switch into "has duplicates" or "always unique" mode.
>>
>>
>> Philippe>Maybe user could indicate in a way when to optimize and when not ?
>>
>> The lesser the number of knobs the better the UX is. I would try some
>> automatic solution first, then semi-automatic, then fully manual.
>>
>>
>>
>> > > 4) What if we implement "fetch links only during the first sampler
>> > > execution"?
>> > >
>> >
>> > Can you give more details on your idea ?
>> >
>>
>> On the first sampler execution, do proper HTML parsing and collect the
>> external links. Then make a pokerface and just assume that this particular
>> test element would always return the same set of resources no matter what.
>> Of course it will not work for the cases like
>> url=${home_or_product_page_based_on_the_moons_phase}, but for certain
>> cases
>> where the sampler is dedicated to one particular type of page it might work
>> just fine.
>>
>>
>> Vladimir
>>
>
>
>
> --
> Regards.
> Philippe Mouawad.

Re: Axis of Performance improvement of Resource Download ?

Posted by Philippe Mouawad <ph...@gmail.com>.
On Thu, Aug 11, 2016 at 9:36 PM, Vladimir Sitnikov <
sitnikov.vladimir@gmail.com> wrote:

> Philippe Mouawad>
>
> > "certain"  in my sentence does not mean "certainty" :-) at least from
> what
> > I understand in english.
> >
>
> Of course I mean "please provide some measurements of the parsing overhead"
>  :-)
>
> Philippe Mouawad>
>
> > It more means "an impact of a certain degree".
> > No numbers, more of reasoning that Parsing (based on Jodd or JSoup) comes
> > at the cost of Regexp parsing, which I think has certainly :-) a cost
> right
> > ?
> >
>
> Do you have some numbers to compare?
>

No, before starting any work on this I wanted to have some feedback.
I don't want to spend too much time on a potentially bad idea.



> Of course HTML parsing is not free. The basic question is how much CPU does
> it take, so we can analyze/compare/reproduce that.
>
>  Philippe Mouawad>
>
> > That was my doubt. But take an ecommerce website where part of users are
> > navigating anonymously, don't you think an important part of the pages is
> > similar ?
> > - product page
> > - home page
> > - category page
> > ...
> >
> I do not have such experience, so I cannot tell what would be the hit rate.
>
>
> Philippe> Maybe user could indicate in a way when to optimize and when not
> ?
>
> That reminds me
> http://mrale.ph/blog/2015/01/11/whats-up-with-monomorphism.html
> For instance: make each HTTP samplers store additional state.
> The state is one of "unknown" (initial), "has duplicates" (that is when we
> check cache first), "always unique" (avoid caching as sampler is known to
> sending unique outputs).
>
> So the first several executions we estimate if the sampler is worth
> caching, then we switch into "has duplicates" or "always unique" mode.
>
>
> Philippe>Maybe user could indicate in a way when to optimize and when not ?
>
> The lesser the number of knobs the better the UX is. I would try some
> automatic solution first, then semi-automatic, then fully manual.
>
>
>
> > > 4) What if we implement "fetch links only during the first sampler
> > > execution"?
> > >
> >
> > Can you give more details on your idea ?
> >
>
> On the first sampler execution, do proper HTML parsing and collect the
> external links. Then make a pokerface and just assume that this particular
> test element would always return the same set of resources no matter what.
> Of course it will not work for the cases like
> url=${home_or_product_page_based_on_the_moons_phase}, but for certain
> cases
> where the sampler is dedicated to one particular type of page it might work
> just fine.
>
>
> Vladimir
>



-- 
Regards.
Philippe Mouawad.

Re: Axis of Performance improvement of Resource Download ?

Posted by Vladimir Sitnikov <si...@gmail.com>.
Philippe Mouawad>

> "certain"  in my sentence does not mean "certainty" :-) at least from what
> I understand in english.
>

Of course I mean "please provide some measurements of the parsing overhead"
 :-)

Philippe Mouawad>

> It more means "an impact of a certain degree".
> No numbers, more of reasoning that Parsing (based on Jodd or JSoup) comes
> at the cost of Regexp parsing, which I think has certainly :-) a cost right
> ?
>

Do you have some numbers to compare?
Of course HTML parsing is not free. The basic question is how much CPU does
it take, so we can analyze/compare/reproduce that.

 Philippe Mouawad>

> That was my doubt. But take an ecommerce website where part of users are
> navigating anonymously, don't you think an important part of the pages is
> similar ?
> - product page
> - home page
> - category page
> ...
>
I do not have such experience, so I cannot tell what would be the hit rate.


Philippe> Maybe user could indicate in a way when to optimize and when not ?

That reminds me
http://mrale.ph/blog/2015/01/11/whats-up-with-monomorphism.html
For instance: make each HTTP sampler store additional state.
The state is one of "unknown" (initial), "has duplicates" (that is when we
check the cache first), "always unique" (avoid caching, as the sampler is
known to produce unique outputs).

So during the first several executions we estimate whether the sampler is
worth caching, then we switch into "has duplicates" or "always unique" mode.
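
A rough sketch of that state machine (all names and the probe length are invented for illustration):

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical per-sampler state: observe content hashes for a few executions,
// then decide whether cache lookups are worth paying for at all.
public class SamplerCachingState {
    public enum Mode { UNKNOWN, HAS_DUPLICATES, ALWAYS_UNIQUE }

    private static final int PROBE_EXECUTIONS = 10; // how long we observe before deciding
    private Mode mode = Mode.UNKNOWN;
    private int executions = 0;
    private int duplicates = 0;
    private final Set<String> seenHashes = new HashSet<>();

    // Called with the content hash of each execution's response.
    public void observe(String hash) {
        if (mode != Mode.UNKNOWN) return; // decision already made
        executions++;
        if (!seenHashes.add(hash)) duplicates++;
        if (executions >= PROBE_EXECUTIONS) {
            mode = duplicates > 0 ? Mode.HAS_DUPLICATES : Mode.ALWAYS_UNIQUE;
        }
    }

    // Only pay the hashing/lookup cost while unknown or when duplicates were seen.
    public boolean shouldCheckCache() {
        return mode != Mode.ALWAYS_UNIQUE;
    }

    public Mode mode() { return mode; }
}
```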


Philippe>Maybe user could indicate in a way when to optimize and when not ?

The fewer the knobs, the better the UX. I would try some automatic
solution first, then semi-automatic, then fully manual.



> > 4) What if we implement "fetch links only during the first sampler
> > execution"?
> >
>
> Can you give more details on your idea ?
>

On the first sampler execution, do proper HTML parsing and collect the
external links. Then make a pokerface and just assume that this particular
test element would always return the same set of resources no matter what.
Of course it will not work for the cases like
url=${home_or_product_page_based_on_the_moons_phase}, but for certain cases
where the sampler is dedicated to one particular type of page it might work
just fine.


Vladimir

Re: Axis of Performance improvement of Resource Download ?

Posted by Philippe Mouawad <ph...@gmail.com>.
On Thu, Aug 11, 2016 at 9:05 PM, Vladimir Sitnikov <
sitnikov.vladimir@gmail.com> wrote:

> 1) Regarding content hashing there might be a question which hash function
> we should use.
> For instance, there's https://github.com/OpenHFT/Zero-Allocation-Hashing
> that
> offers fast implementations of some hash functions.
> FarmHash, CityHash, MurmurHash3
> We might want to apply it to other "MD5" usages.
>

Good idea.

>
> 2)
> Philippe>this has a certain CPU impact related to HTML parsing to extract
> the links.
>
> Do you have some numbers that represent "certainty"?
>

"certain"  in my sentence does not mean "certainty" :-) at least from what
I understand in english.
It more means "an impact of a certain degree".
No numbers, more of reasoning that Parsing (based on Jodd or JSoup) comes
at the cost of Regexp parsing, which I think has certainly :-) a cost right
?



> 3) Re "cache HTML parsing", it does not sound to be very useful. Typical
> pages I see have different content, so the cache there does not sound
> promising
>

That was my doubt. But take an e-commerce website where part of the users
are browsing anonymously: don't you think a significant part of the pages
is identical?
- product page
- home page
- category page
...

Isn't that why webperf SaaS offerings exist? I would say at least around
20% would be the same. Maybe the user could indicate somehow when to
optimize and when not?


>
> 4) What if we implement "fetch links only during the first sampler
> execution"?
>

Can you give more details on your idea ?

>
> As far as I understand, the idea of "fetching resources automatically" is
> that users do not have to hard-code the resources right into jmx.
> It might be OK if we implement Cache<TestElement, List<URL>> kind of thing.
>
>
> Vladimir
>



-- 
Regards.
Philippe Mouawad.

Re: Axis of Performance improvement of Resource Download ?

Posted by Vladimir Sitnikov <si...@gmail.com>.
1) Regarding content hashing, there might be a question of which hash
function we should use.
For instance, there's https://github.com/OpenHFT/Zero-Allocation-Hashing that
offers fast implementations of some hash functions.
FarmHash, CityHash, MurmurHash3
We might want to apply it to other "MD5" usages.
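
To get a feel for the difference, here is a rough (not rigorous: no warmup control, single run) comparison between a cryptographic digest (MD5) and a cheap non-cryptographic checksum available in the JDK (CRC32); the OpenHFT library above offers 64-bit hashes (xxHash, City, Murmur3) in the same non-cryptographic spirit. The class name and sizes are arbitrary:

```java
import java.security.MessageDigest;
import java.util.Random;
import java.util.zip.CRC32;

// Rough cost sketch: time MD5 vs CRC32 over a page-sized buffer.
public class HashCostSketch {
    static long timeMd5(byte[] page, int reps) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        long t0 = System.nanoTime();
        for (int i = 0; i < reps; i++) { md5.reset(); md5.digest(page); }
        return System.nanoTime() - t0;
    }

    static long timeCrc32(byte[] page, int reps) {
        CRC32 crc = new CRC32();
        long t0 = System.nanoTime();
        for (int i = 0; i < reps; i++) { crc.reset(); crc.update(page); crc.getValue(); }
        return System.nanoTime() - t0;
    }

    public static void main(String[] args) throws Exception {
        byte[] page = new byte[200 * 1024]; // ~200 KiB, a plausible HTML page size
        new Random(42).nextBytes(page);
        System.out.println("MD5:   " + timeMd5(page, 1000) / 1_000_000 + " ms / 1000 digests");
        System.out.println("CRC32: " + timeCrc32(page, 1000) / 1_000_000 + " ms / 1000 checksums");
    }
}
```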

2)
Philippe>this has a certain CPU impact related to HTML parsing to extract
the links.

Do you have some numbers that represent "certainty"?

3) Re "cache HTML parsing": it does not sound very useful. Typical
pages I see have different content, so a cache there does not sound
promising.

4) What if we implement "fetch links only during the first sampler
execution"?

As far as I understand, the idea of "fetching resources automatically" is
that users do not have to hard-code the resources right into jmx.
It might be OK if we implement a Cache<TestElement, List<URL>> kind of thing.
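
Something in this spirit, with a plain String id standing in for TestElement and a placeholder parser interface (all names invented for illustration):

```java
import java.net.URL;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the Cache<TestElement, List<URL>> idea: parse embedded resources only
// on the first execution of each test element, then reuse the list regardless of
// what subsequent responses actually contain.
public class FirstExecutionLinkCache {
    private final Map<String, List<URL>> byElement = new ConcurrentHashMap<>();

    public interface Parser { List<URL> parse(String html) throws Exception; }

    // Parse only if this element has no cached result yet; otherwise reuse it.
    public List<URL> resourcesFor(String elementId, String html, Parser parser) {
        return byElement.computeIfAbsent(elementId, id -> {
            try {
                return parser.parse(html);
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        });
    }
}
```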


Vladimir