Posted to dev@jackrabbit.apache.org by Timothée Maret <ti...@gmail.com> on 2017/03/06 15:43:28 UTC

[FileVault][discuss] performance improvement proposal

Hi,

With Sling content distribution (using FileVault), we observe a
significantly lower throughput for content packages containing binaries.
The main bottleneck seems to be the compression algorithm applied to every
element contained in the content package.

I think that we could improve the throughput significantly, simply by
avoiding re-compressing binaries that are already compressed.
In order to figure out which binaries are already compressed, we could
match the content type stored along with the binary against a list of
configurable content types.
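
A minimal sketch of the idea (illustrative only; the class name and the MIME
list are examples, not the actual patch in [0]):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public final class CompressedTypeCheck {

    // Example MIME types whose payload is typically already compressed.
    private static final Set<String> ALREADY_COMPRESSED = new HashSet<>(Arrays.asList(
            "image/jpeg", "image/png", "video/mp4",
            "application/zip", "application/java-archive"));

    // Returns true if re-compressing this binary inside the package would
    // mostly waste CPU for little or no size gain.
    public static boolean isAlreadyCompressed(String contentType) {
        return contentType != null && ALREADY_COMPRESSED.contains(contentType);
    }
}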

I have done some micro tests with this idea (patch in [0]). I think that
the results are promising.

Exporting a single 250 MB JPEG is 80% faster (22.4 sec -> 4.3 sec) for a 3%
bigger content package (233.2 MB -> 240.4 MB)
Exporting AEM OOTB /content/dam is 50% faster (11.9 sec -> 5.9 sec) for a
5% bigger content package (92.8 MB -> 97.4 MB)
Import for the same cases is 66% and 32% faster, respectively.

I think this could either be done by default, with the list of types that
skip compression made configurable.
Alternatively, it could be done at the project level, by extending FileVault
with the following

1. For each package, allow defining the default compression level (best
compression, best speed)
2. Expose an API that allows plugging in custom logic to decide how to
compress a given artefact (a rough sketch follows)
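
For point 2, the extension could look roughly like the sketch below.
CompressionLevelPolicy and its method are hypothetical names, only meant to
illustrate the shape of such a pluggable API; it is not an existing FileVault
interface:

import java.util.zip.Deflater;

public interface CompressionLevelPolicy {

    // Decide the Deflater level for one artefact, e.g. based on its path,
    // MIME type or size. Expected return values are Deflater.BEST_SPEED,
    // Deflater.BEST_COMPRESSION, Deflater.DEFAULT_COMPRESSION or
    // Deflater.NO_COMPRESSION.
    int levelFor(String path, String contentType, long size);
}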

In any case, the changes would be backward compatible. Content packages
created with the new code would be installable on instances running the old
code and vice versa.

wdyt ?

Regards,

Timothee


[0]
https://github.com/tmaret/jackrabbit-filevault/tree/performance-avoid-compressing-already-compressed-binaries-based-on-content-type-detection
[1]
https://docs.oracle.com/javase/7/docs/api/java/util/zip/Deflater.html#BEST_SPEED

Re: [FileVault][discuss] performance improvement proposal

Posted by Timothée Maret <ti...@gmail.com>.
Hi,

2017-03-06 20:52 GMT+01:00 Felix Meschberger <fm...@adobe.com>:

> Hi
>
> This looks great.
>
> As for configuration: What is the reason for having a configuration option
> ? Not being able to decide ? Or real customer need for having it
> configurable ?
>

Setting the compression level is a tradeoff between the compression speed
and the size of the compressed artefacts.

IMO, different use cases favour maximising either of the two, or keeping the
current default, which is a compromise between the two.

For instance, as I see it, Sling Content Distribution would maximise
compression speed, while the AEM Quickstart would maximise the compression of
its content packages.

Thus, IMO, it makes sense to allow configuring/specifying the compression
level per use case (not globally).


>
> I think we should start with reasonble heuristics first and consider
> configuration options in case there is a need/desire.
>

I have opened JCRVLT-163 to track this. We could indeed add the
configuration later, assuming the increased package size (expected to be <
5% for packages containing already compressed binaries, 0% for other
packages) is not an issue even for size-sensitive use cases (such as the
AEM Quickstart).

Regards,

Timothee


>
> Regards
> Felix
>
> Am 06.03.2017 um 16:43 schrieb Timothée Maret <ti...@gmail.com>:
>
> Hi,
>
> With Sling content distribution (using FileVault), we observe a
> significantly lower throughput for content packages containing binaries.
> The main bottleneck seems to be the compression algorithm applied to every
> element contained in the content package.
>
> I think that we could improve the throughput significantly, simply by
> avoiding to re-compress binaries that are already compressed.
> In order to figure out what binaries are already compressed, we could use
> match the content type stored along the binary against a list of
> configurable content types.
>
> I have done some micro tests with this idea (patch in [0]). I think that
> the results are promising.
>
> Exporting a single 250 MB JPEG is 80% faster (22.4 sec -> 4.3 sec) for a
> 3% bigger content package (233.2 MB -> 240.4 MB)
> Exporting AEM OOTB /content/dam is 50% faster (11.9 sec -> 5.9 sec) for a
> 5% bigger content package (92.8 MB -> 97.4 MB)
> Import for the same cases is 66% faster respectively 32% faster.
>
> I think this could either be done by default and allowing to configure the
> list of types that skip compression.
> Alternatively, it could be done on a project level, by extending FileVault
> with the following
>
> 1. For each package, allow to define the default compression level (best
> compression, best speed)
> 2. Expose an API that allow to plugin a custom logic to decide how to
> compress a given artefact
>
> In any case, the changes would be backward compatible. Content packages
> created with the new code would be installable on instances running the old
> code and vice versa.
>
> wdyt ?
>
> Regards,
>
> Timothee
>
>
> [0] https://github.com/tmaret/jackrabbit-filevault/tree/
> performance-avoid-compressing-already-compressed-binaries-
> based-on-content-type-detection
> [1] https://docs.oracle.com/javase/7/docs/api/java/util/
> zip/Deflater.html#BEST_SPEED
>
>

Re: [FileVault][discuss] performance improvement proposal

Posted by Timothée Maret <ti...@gmail.com>.
Hi Bertrand,

2017-03-07 14:42 GMT+01:00 Bertrand Delacretaz <bd...@apache.org>:

> Hi,
>
> On Tue, Mar 7, 2017 at 2:28 PM, Timothée Maret <ti...@gmail.com>
> wrote:
> > ...IMO we should still allow to tweak between best performance and best
> > compression though, in order to accommodate different use cases...
>
> Best compression is probably just "compress everything regardless", right?
> If yes that's a simple configuration switch.
>
> -Bertrand
>

Yes, the JCRVLT-163 improvement would be disabled by a simple config switch.
I think we can consider that compressing everything (feature disabled)
produces the best compression.

For the JCRVLT-164 improvement, best compression would be obtained by
configuring the java.util.zip.Deflater#BEST_COMPRESSION compression
level.

The two features are complementary. You could also use settings that avoid
compressing binaries while selecting the best compression for the remaining
data.

Regards,

Timothee

Re: [FileVault][discuss] performance improvement proposal

Posted by Bertrand Delacretaz <bd...@apache.org>.
Hi,

On Tue, Mar 7, 2017 at 2:28 PM, Timothée Maret <ti...@gmail.com> wrote:
> ...IMO we should still allow to tweak between best performance and best
> compression though, in order to accommodate different use cases...

Best compression is probably just "compress everything regardless", right?
If yes that's a simple configuration switch.

-Bertrand

Re: [FileVault][discuss] performance improvement proposal

Posted by Timothée Maret <ti...@gmail.com>.
Hi Toby and Thomas,

The idea is indeed to skip compressing incompressible binaries at the
entry level, by tweaking each entry's compression level.
This already seemed to be a good improvement in my tests. I have attached a
patch in JCRVLT-163 around that idea, including the suggestions from this
thread.

Reusing the dictionary between similar entries (e.g. .content.xml or based
on MIME type) may improve the throughput further; at least it seems
reasonable. I have not tested it yet; we may cover this in a separate issue.

Regards,

Timothee


2017-03-09 15:03 GMT+01:00 Thomas Mueller <mu...@adobe.com>:

> Hi,
>
>
>
> Entries in a zip file are compressed individually, so the dictionary is
> not kept or reused. See also https://en.wikipedia.org/wiki/
> Zip_(file_format) "Because the files in a .ZIP archive are compressed
> individually…"
>
>
>
> By the way, for Deflate, the sliding window size (dictionary size) is 32
> KB. That's quite small compared to, for example, LZMA (7z files), where it
> can be up to 4 GB (I think 64 MB by default nowadays). So even reusing the
> dictionary wouldn't help all that much for large files.
>
>
>
> > if you have a lot of small text files, interleaved with binaries, then
> the text files are probably not compressed
>
>
>
> I would assume it's the reverse: text files are compressed, even if the
> binaries can't be compressed.
>
>
>
> Regards,
>
> Thomas
>
>
>
>
>
>
>
> *From: *Tobias Bocanegra <tr...@apache.org>
> *Reply-To: *"dev@jackrabbit.apache.org" <de...@jackrabbit.apache.org>
> *Date: *Thursday, 9 March 2017 at 14:49
> *To: *"dev@jackrabbit.apache.org" <de...@jackrabbit.apache.org>
> *Subject: *Re: [FileVault][discuss] performance improvement proposal
>
>
>
> Hi,
>
>
>
> one issue to remember is that you can only change the compression level
> per zip-entry. I didn't test too much, but from the javadoc is says:
>
>
>
> public void setLevel(int level)
>
> Sets the compression level for subsequent entries which are DEFLATED. The
> default setting is DEFAULT_COMPRESSION.
>
>
>
> I'm not exactly sure if zip retains the dictionary if you switch
> compression levels, but I would assume not. i.e. if you have a lot of small
> text files, interleaved with binaries, then the text files are probably not
> compressed. which might not be a problem, though.
>
>
>
> it would be interresting to see some tests that take a typical content
> asset content package, that has many text files (.content.xml) and few
> compressed binaries (jpegs).
>
>
>
> - what is the size difference of the final binary with no compression at
> all?
>
> - what is the size difference of the final binary with interleaved
> compression?
>
> - what are the performance characteristics to unpack/pack the zips?
>
>
>
> regards, toby
>
>
>
>
>
>
>
>
>
>
>
>
>
> On Thu, Mar 9, 2017 at 8:10 PM, Thomas Mueller <mu...@adobe.com> wrote:
>
> Hi,
>
> > I think your help is mandatory, given the level of voodoo in the five
> lines you propose :-)
>
> Sure, I can help.
>
> > I did some preliminary tests with the "partial entropy" method … and it
> seems the algorithm works but it does not get as fast as the content type
> detection method.
>
> Note you only need to test about 256 bytes, not the whole binary. Sure,
> the more the better.
>
> > Maybe ultimately we could keep both heuristics.
>
> I agree. But not to speed up things: to avoid false positives / negatives.
> Auto-detection is far from perfect.
>
> > Start with the content type detection that would match against MIME
> types we know for sure are compressed (expected to be a reasonably fixed
> and short list of MIME types).
>
> I would probably use the following logic:
>
> * list of mime types that are compressed (text/plain and so on)
> * list of mime types that should not be compressed (application/zip,
> application/java-archive, and so on)
>
> For the remainder, and if you don't know the mime type, I would use
> auto-detection.
>
> Regards,
> Thomas
>
>
>



-- 
Timothée Maret

Re: [FileVault][discuss] performance improvement proposal

Posted by Thomas Mueller <mu...@adobe.com>.
Hi,

Entries in a zip file are compressed individually, so the dictionary is not kept or reused. See also https://en.wikipedia.org/wiki/Zip_(file_format) "Because the files in a .ZIP archive are compressed individually…"

By the way, for Deflate, the sliding window size (dictionary size) is 32 KB. That's quite small compared to, for example, LZMA (7z files), where it can be up to 4 GB (I think 64 MB by default nowadays). So even reusing the dictionary wouldn't help all that much for large files.

> if you have a lot of small text files, interleaved with binaries, then the text files are probably not compressed

I would assume it's the reverse: text files are compressed, even if the binaries can't be compressed.

Regards,
Thomas



From: Tobias Bocanegra <tr...@apache.org>
Reply-To: "dev@jackrabbit.apache.org" <de...@jackrabbit.apache.org>
Date: Thursday, 9 March 2017 at 14:49
To: "dev@jackrabbit.apache.org" <de...@jackrabbit.apache.org>
Subject: Re: [FileVault][discuss] performance improvement proposal

Hi,

one issue to remember is that you can only change the compression level per zip-entry. I didn't test too much, but from the javadoc is says:

public void setLevel(int level)
Sets the compression level for subsequent entries which are DEFLATED. The default setting is DEFAULT_COMPRESSION.

I'm not exactly sure if zip retains the dictionary if you switch compression levels, but I would assume not. i.e. if you have a lot of small text files, interleaved with binaries, then the text files are probably not compressed. which might not be a problem, though.

it would be interresting to see some tests that take a typical content asset content package, that has many text files (.content.xml) and few compressed binaries (jpegs).

- what is the size difference of the final binary with no compression at all?
- what is the size difference of the final binary with interleaved compression?
- what are the performance characteristics to unpack/pack the zips?

regards, toby






On Thu, Mar 9, 2017 at 8:10 PM, Thomas Mueller <mu...@adobe.com>> wrote:
Hi,

> I think your help is mandatory, given the level of voodoo in the five lines you propose :-)

Sure, I can help.

> I did some preliminary tests with the "partial entropy" method … and it seems the algorithm works but it does not get as fast as the content type detection method.

Note you only need to test about 256 bytes, not the whole binary. Sure, the more the better.

> Maybe ultimately we could keep both heuristics.

I agree. But not to speed up things: to avoid false positives / negatives. Auto-detection is far from perfect.

> Start with the content type detection that would match against MIME types we know for sure are compressed (expected to be a reasonably fixed and short list of MIME types).

I would probably use the following logic:

* list of mime types that are compressed (text/plain and so on)
* list of mime types that should not be compressed (application/zip, application/java-archive, and so on)

For the remainder, and if you don't know the mime type, I would use auto-detection.

Regards,
Thomas



Re: [FileVault][discuss] performance improvement proposal

Posted by Tobias Bocanegra <tr...@apache.org>.
Hi,

One issue to remember is that you can only change the compression level per
zip entry. I didn't test too much, but the javadoc says:

public void setLevel(int level)
Sets the compression level for subsequent entries which are DEFLATED. The
default setting is DEFAULT_COMPRESSION.

I'm not exactly sure if zip retains the dictionary if you switch
compression levels, but I would assume not. I.e. if you have a lot of small
text files, interleaved with binaries, then the text files are probably not
compressed, which might not be a problem, though.
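
To illustrate the per-entry mechanics (a minimal, self-contained example of
java.util.zip usage, not FileVault code): the level set via setLevel() applies
to the entries written after the call, so it can be switched between entries.

import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class PerEntryLevelDemo {
    public static void main(String[] args) throws IOException {
        try (ZipOutputStream zos = new ZipOutputStream(new FileOutputStream("demo.zip"))) {
            // Text entry: keep the default compression level.
            zos.setLevel(Deflater.DEFAULT_COMPRESSION);
            zos.putNextEntry(new ZipEntry("jcr_root/content/.content.xml"));
            zos.write("<jcr:root/>".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();

            // Already-compressed binary: store the next entry without compression.
            zos.setLevel(Deflater.NO_COMPRESSION);
            zos.putNextEntry(new ZipEntry("jcr_root/content/dam/photo.jpg"));
            zos.write(new byte[] {(byte) 0xFF, (byte) 0xD8}); // JPEG payload would be streamed here
            zos.closeEntry();
        }
    }
}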

It would be interesting to see some tests that take a typical asset
content package, with many text files (.content.xml) and a few
compressed binaries (JPEGs).

- what is the size difference of the final binary with no compression at
all?
- what is the size difference of the final binary with interleaved
compression?
- what are the performance characteristics to unpack/pack the zips?

regards, toby






On Thu, Mar 9, 2017 at 8:10 PM, Thomas Mueller <mu...@adobe.com> wrote:

> Hi,
>
> > I think your help is mandatory, given the level of voodoo in the five
> lines you propose :-)
>
> Sure, I can help.
>
> > I did some preliminary tests with the "partial entropy" method … and it
> seems the algorithm works but it does not get as fast as the content type
> detection method.
>
> Note you only need to test about 256 bytes, not the whole binary. Sure,
> the more the better.
>
> > Maybe ultimately we could keep both heuristics.
>
> I agree. But not to speed up things: to avoid false positives / negatives.
> Auto-detection is far from perfect.
>
> > Start with the content type detection that would match against MIME
> types we know for sure are compressed (expected to be a reasonably fixed
> and short list of MIME types).
>
> I would probably use the following logic:
>
> * list of mime types that are compressed (text/plain and so on)
> * list of mime types that should not be compressed (application/zip,
> application/java-archive, and so on)
>
> For the remainder, and if you don't know the mime type, I would use
> auto-detection.
>
> Regards,
> Thomas
>
>
>

Re: [FileVault][discuss] performance improvement proposal

Posted by Thomas Mueller <mu...@adobe.com>.
Hi,

> I think your help is mandatory, given the level of voodoo in the five lines you propose :-)

Sure, I can help.

> I did some preliminary tests with the "partial entropy" method … and it seems the algorithm works but it does not get as fast as the content type detection method.

Note you only need to test about 256 bytes, not the whole binary. Sure, the more the better.

> Maybe ultimately we could keep both heuristics.

I agree. But not to speed up things: to avoid false positives / negatives. Auto-detection is far from perfect.

> Start with the content type detection that would match against MIME types we know for sure are compressed (expected to be a reasonably fixed and short list of MIME types).

I would probably use the following logic:

* list of mime types that should be compressed (text/plain and so on)
* list of mime types that should not be compressed (application/zip, application/java-archive, and so on)

For the remainder, and if you don't know the mime type, I would use auto-detection.
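
A sketch of that decision order (the lists are illustrative examples, and
looksCompressible() stands in for the auto-detection heuristic discussed
elsewhere in this thread):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public final class CompressionDecision {

    private static final Set<String> COMPRESS = new HashSet<>(Arrays.asList(
            "text/plain", "text/css", "application/javascript", "application/xml"));

    private static final Set<String> NEVER_COMPRESS = new HashSet<>(Arrays.asList(
            "application/zip", "application/java-archive",
            "image/png", "image/jpeg", "video/mp4"));

    public static boolean shouldCompress(String mimeType, byte[] firstBytes) {
        if (mimeType != null) {
            if (COMPRESS.contains(mimeType)) {
                return true;
            }
            if (NEVER_COMPRESS.contains(mimeType)) {
                return false;
            }
        }
        // Unknown or unlisted MIME type: fall back to auto-detection on a small prefix.
        return looksCompressible(firstBytes);
    }

    private static boolean looksCompressible(byte[] prefix) {
        return true; // placeholder for the entropy-based heuristic
    }
}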

Regards,
Thomas



Re: [FileVault][discuss] performance improvement proposal

Posted by Timothée Maret <ti...@gmail.com>.
Hi Thomas,

2017-03-07 15:09 GMT+01:00 Thomas Mueller <mu...@adobe.com>:

> Hi,
>
>
>
> I'm be glad to help you with the auto-detection, as I wrote that code (a
> long time ago).
>

I think your help is mandatory, given the level of voodoo in the five lines
you propose :-)
Actually I think it'd make sense if you contributed the code, either as a
patch in JIRA or on your GitHub, and I'll pull from there. I think for now
the methods could go in a utility class under

https://github.com/apache/jackrabbit-filevault/tree/trunk/vault-core/src/main/java/org/apache/jackrabbit/vault/fs/impl/io

unless you see a better place to make it available.


> As I said, it's not a "perfect" solution, and you might want to tweak it
> for best results.
>

I agree. I did some preliminary tests with the "partial entropy" method on
exporting the OOTB AEM /content/dam and it seems the algorithm works,
but it does not get as fast as the content type detection method.

Maybe ultimately we could keep both heuristics.

Start with the content type detection that would match against MIME types
we know for sure are compressed (expected to be a reasonably fixed and
short list of MIME types).
If the content type is not matched, apply the auto-detection method for
files which are big enough to offset the extra processing.

To be tested further. I'll post more observations in JCRVLT-163 when I get
the time.


>
>
> I run a small test against a "out-of-the-box" repository and found 99% of
> the binaries are in a jcr:data property, and a mime type is available. This
> might not be the case for all repositories. The mix of mime types probably
> varies even more; in my case, over 90% were from 6 mime types
> (application/zip, application/java-archive, image/png,
> application/javascript, image/jpeg, text/css).
>
>
>
> > IMO we should still allow to tweak between best performance and best
> compression
>
>
>
> Yes, that makes sense!
>
>
>
> A global switch "compress everything regardless" sounds easy.
>
>
>
> A more complex solution would be to use a list of configurable mime types
> to _never_ compress, probably application/zip, application/java-archive,
> image/png, image/jpeg, video/mp4 or so. And for the rest a threshold, at
> which point to compress (an extreme value means compress everything else).
>

+1 the list of MIME types to never compress is what I used initially in
https://github.com/tmaret/jackrabbit-filevault/commit/e2630268833a9d69d5a7bb0064b87eaa4b6b0254

If we use the auto-detection method as a fallback, I think we have a good
tradeoff between configuration annoyance (no exposed list of MIME types)
and performance gain (high for well-known compressed formats of any size,
high for any MIME type above a size threshold).

Regards,

Timothee


>
>
> Regards,
>
> Thomas
>
>
>
>
>
> *From: *<ma...@gmail.com> on behalf of Timothée Maret <
> timothee.maret@gmail.com>
> *Reply-To: *"dev@jackrabbit.apache.org" <de...@jackrabbit.apache.org>
> *Date: *Tuesday, 7 March 2017 at 14:28
> *To: *"dev@jackrabbit.apache.org" <de...@jackrabbit.apache.org>
> *Subject: *Re: [FileVault][discuss] performance improvement proposal
>
>
>
> Hi Thomas,
>
>
>
> 2017-03-07 11:27 GMT+01:00 Thomas Mueller <mu...@adobe.com>:
>
> Hi,
>
> > As for configuration: What is the reason for having a configuration
> option ?
>
> Detecting if data is compressible can be done with low overhead, without
> having to look at the content type, and without having to use configuration
> options:
>
> http://stackoverflow.com/questions/7027022/how-to-
> efficiently-predict-if-data-is-compressible
>
> Sample code is available in one of the answers ("I implemented a few
> methods to test if data is compressible…"). It is quite simple, and only
> needs to process 256 bytes. Both the "Partial Entropy" and the "Simplified
> Compression" work relatively well.
>
> This is not designed to be a "perfect" solution for the problem. It's a
> low-overhead heuristic, that will reduce the compression overhead on the
> average.
>
>
>
> This sounds very nice :-) we could indeed drop the list of MIME type
> configuration.
>
>
>
> IMO we should still allow to tweak between best performance and best
> compression though, in order to accommodate different use cases.
>
> I thought about covering the two aspects in JCRVLT-163, but now changed
> the focus of JCRVLT-163 on avoiding compressing binaries (with or without
> auto-detection) and created JCRVLT-164 for allowing to tweak the default
> compression level.
>
>
>
>
>
> Regards,
>
>
>
> Timothee
>
>
>
>
> Regards,
> Thomas
>
>
>
>
>
> Am 06.03.2017 um 16:43 schrieb Timothée Maret <ti...@gmail.com>:
>
> Hi,
>
> With Sling content distribution (using FileVault), we observe a
> significantly lower throughput for content packages containing binaries.
> The main bottleneck seems to be the compression algorithm applied to every
> element contained in the content package.
>
> I think that we could improve the throughput significantly, simply by
> avoiding to re-compress binaries that are already compressed.
> In order to figure out what binaries are already compressed, we could use
> match the content type stored along the binary against a list of
> configurable content types.
>
> I have done some micro tests with this idea (patch in [0]). I think that
> the results are promising.
>
> Exporting a single 250 MB JPEG is 80% faster (22.4 sec -> 4.3 sec) for a
> 3% bigger content package (233.2 MB -> 240.4 MB)
> Exporting AEM OOTB /content/dam is 50% faster (11.9 sec -> 5.9 sec) for a
> 5% bigger content package (92.8 MB -> 97.4 MB)
> Import for the same cases is 66% faster respectively 32% faster.
>
> I think this could either be done by default and allowing to configure the
> list of types that skip compression.
> Alternatively, it could be done on a project level, by extending FileVault
> with the following
>
> 1. For each package, allow to define the default compression level (best
> compression, best speed)
> 2. Expose an API that allow to plugin a custom logic to decide how to
> compress a given artefact
>
> In any case, the changes would be backward compatible. Content packages
> created with the new code would be installable on instances running the old
> code and vice versa.
>
> wdyt ?
>
> Regards,
>
> Timothee
>
>
> [0] https://github.com/tmaret/jackrabbit-filevault/tree/
> performance-avoid-compressing-already-compressed-binaries-
> based-on-content-type-detection
> [1] https://docs.oracle.com/javase/7/docs/api/java/util/
> zip/Deflater.html#BEST_SPEED
>
>
>
>
>
>
> --
>
> Timothée Maret
>



-- 
Timothée Maret

Re: [FileVault][discuss] performance improvement proposal

Posted by Thomas Mueller <mu...@adobe.com>.
Hi,

I'd be glad to help you with the auto-detection, as I wrote that code (a long time ago). As I said, it's not a "perfect" solution, and you might want to tweak it for best results.

I ran a small test against an "out-of-the-box" repository and found that 99% of the binaries are in a jcr:data property, and a mime type is available. This might not be the case for all repositories. The mix of mime types probably varies even more; in my case, over 90% were from 6 mime types (application/zip, application/java-archive, image/png, application/javascript, image/jpeg, text/css).

> IMO we should still allow to tweak between best performance and best compression

Yes, that makes sense!

A global switch "compress everything regardless" sounds easy.

A more complex solution would be to use a list of configurable mime types to _never_ compress, probably application/zip, application/java-archive, image/png, image/jpeg, video/mp4 or so. And for the rest a threshold, at which point to compress (an extreme value means compress everything else).

Regards,
Thomas


From: <ma...@gmail.com> on behalf of Timothée Maret <ti...@gmail.com>
Reply-To: "dev@jackrabbit.apache.org" <de...@jackrabbit.apache.org>
Date: Tuesday, 7 March 2017 at 14:28
To: "dev@jackrabbit.apache.org" <de...@jackrabbit.apache.org>
Subject: Re: [FileVault][discuss] performance improvement proposal

Hi Thomas,

2017-03-07 11:27 GMT+01:00 Thomas Mueller <mu...@adobe.com>>:
Hi,

> As for configuration: What is the reason for having a configuration option ?

Detecting if data is compressible can be done with low overhead, without having to look at the content type, and without having to use configuration options:

http://stackoverflow.com/questions/7027022/how-to-efficiently-predict-if-data-is-compressible

Sample code is available in one of the answers ("I implemented a few methods to test if data is compressible…"). It is quite simple, and only needs to process 256 bytes. Both the "Partial Entropy" and the "Simplified Compression" work relatively well.

This is not designed to be a "perfect" solution for the problem. It's a low-overhead heuristic, that will reduce the compression overhead on the average.

This sounds very nice :-) we could indeed drop the list of MIME type configuration.

IMO we should still allow to tweak between best performance and best compression though, in order to accommodate different use cases.
I thought about covering the two aspects in JCRVLT-163, but now changed the focus of JCRVLT-163 on avoiding compressing binaries (with or without auto-detection) and created JCRVLT-164 for allowing to tweak the default compression level.


Regards,

Timothee


Regards,
Thomas




Am 06.03.2017 um 16:43 schrieb Timothée Maret <ti...@gmail.com>>:

Hi,

With Sling content distribution (using FileVault), we observe a significantly lower throughput for content packages containing binaries.
The main bottleneck seems to be the compression algorithm applied to every element contained in the content package.

I think that we could improve the throughput significantly, simply by avoiding to re-compress binaries that are already compressed.
In order to figure out what binaries are already compressed, we could use match the content type stored along the binary against a list of configurable content types.

I have done some micro tests with this idea (patch in [0]). I think that the results are promising.

Exporting a single 250 MB JPEG is 80% faster (22.4 sec -> 4.3 sec) for a 3% bigger content package (233.2 MB -> 240.4 MB)
Exporting AEM OOTB /content/dam is 50% faster (11.9 sec -> 5.9 sec) for a 5% bigger content package (92.8 MB -> 97.4 MB)
Import for the same cases is 66% faster respectively 32% faster.

I think this could either be done by default and allowing to configure the list of types that skip compression.
Alternatively, it could be done on a project level, by extending FileVault with the following

1. For each package, allow to define the default compression level (best compression, best speed)
2. Expose an API that allow to plugin a custom logic to decide how to compress a given artefact

In any case, the changes would be backward compatible. Content packages created with the new code would be installable on instances running the old code and vice versa.

wdyt ?

Regards,

Timothee


[0] https://github.com/tmaret/jackrabbit-filevault/tree/performance-avoid-compressing-already-compressed-binaries-based-on-content-type-detection
[1] https://docs.oracle.com/javase/7/docs/api/java/util/zip/Deflater.html#BEST_SPEED





--
Timothée Maret

Re: [FileVault][discuss] performance improvement proposal

Posted by Timothée Maret <ti...@gmail.com>.
Hi Thomas,

2017-03-07 11:27 GMT+01:00 Thomas Mueller <mu...@adobe.com>:

> Hi,
>
> > As for configuration: What is the reason for having a configuration
> option ?
>
> Detecting if data is compressible can be done with low overhead, without
> having to look at the content type, and without having to use configuration
> options:
>
> http://stackoverflow.com/questions/7027022/how-to-
> efficiently-predict-if-data-is-compressible
>
> Sample code is available in one of the answers ("I implemented a few
> methods to test if data is compressible…"). It is quite simple, and only
> needs to process 256 bytes. Both the "Partial Entropy" and the "Simplified
> Compression" work relatively well.
>
> This is not designed to be a "perfect" solution for the problem. It's a
> low-overhead heuristic, that will reduce the compression overhead on the
> average.
>

This sounds very nice :-) we could indeed drop the list of MIME type
configuration.

IMO we should still allow tweaking between best performance and best
compression though, in order to accommodate different use cases.
I thought about covering both aspects in JCRVLT-163, but have now changed the
focus of JCRVLT-163 to avoiding compressing binaries (with or without
auto-detection) and created JCRVLT-164 for allowing the default compression
level to be tweaked.


Regards,

Timothee


>
> Regards,
> Thomas
>
>
>
>
> Am 06.03.2017 um 16:43 schrieb Timothée Maret <ti...@gmail.com>:
>
> Hi,
>
> With Sling content distribution (using FileVault), we observe a
> significantly lower throughput for content packages containing binaries.
> The main bottleneck seems to be the compression algorithm applied to every
> element contained in the content package.
>
> I think that we could improve the throughput significantly, simply by
> avoiding to re-compress binaries that are already compressed.
> In order to figure out what binaries are already compressed, we could use
> match the content type stored along the binary against a list of
> configurable content types.
>
> I have done some micro tests with this idea (patch in [0]). I think that
> the results are promising.
>
> Exporting a single 250 MB JPEG is 80% faster (22.4 sec -> 4.3 sec) for a
> 3% bigger content package (233.2 MB -> 240.4 MB)
> Exporting AEM OOTB /content/dam is 50% faster (11.9 sec -> 5.9 sec) for a
> 5% bigger content package (92.8 MB -> 97.4 MB)
> Import for the same cases is 66% faster respectively 32% faster.
>
> I think this could either be done by default and allowing to configure the
> list of types that skip compression.
> Alternatively, it could be done on a project level, by extending FileVault
> with the following
>
> 1. For each package, allow to define the default compression level (best
> compression, best speed)
> 2. Expose an API that allow to plugin a custom logic to decide how to
> compress a given artefact
>
> In any case, the changes would be backward compatible. Content packages
> created with the new code would be installable on instances running the old
> code and vice versa.
>
> wdyt ?
>
> Regards,
>
> Timothee
>
>
> [0] https://github.com/tmaret/jackrabbit-filevault/tree/
> performance-avoid-compressing-already-compressed-binaries-
> based-on-content-type-detection
> [1] https://docs.oracle.com/javase/7/docs/api/java/util/
> zip/Deflater.html#BEST_SPEED
>
>
>
>


-- 
Timothée Maret

Re: [FileVault][discuss] performance improvement proposal

Posted by Thomas Mueller <mu...@adobe.com>.
Hi,

> As for configuration: What is the reason for having a configuration option ? 

Detecting if data is compressible can be done with low overhead, without having to look at the content type, and without having to use configuration options:

http://stackoverflow.com/questions/7027022/how-to-efficiently-predict-if-data-is-compressible

Sample code is available in one of the answers ("I implemented a few methods to test if data is compressible…"). It is quite simple, and only needs to process 256 bytes. Both the "Partial Entropy" and the "Simplified Compression" work relatively well.

This is not designed to be a "perfect" solution for the problem. It's a low-overhead heuristic that will reduce the compression overhead on average.
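
A simplified sketch of such a check on the first 256 bytes (illustrative only,
not the code from the linked answer; the 7.5 bits/byte threshold is an
arbitrary example, not a tuned value):

public final class EntropyCheck {

    // Estimate the Shannon entropy of a small prefix; data that is already
    // compressed or encrypted approaches 8 bits per byte.
    public static boolean looksCompressible(byte[] data) {
        int len = Math.min(data.length, 256);
        if (len == 0) {
            return true;
        }
        int[] counts = new int[256];
        for (int i = 0; i < len; i++) {
            counts[data[i] & 0xFF]++;
        }
        double bitsPerByte = 0.0;
        for (int c : counts) {
            if (c > 0) {
                double p = (double) c / len;
                bitsPerByte -= p * (Math.log(p) / Math.log(2));
            }
        }
        return bitsPerByte < 7.5;
    }
}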

Regards,
Thomas




Am 06.03.2017 um 16:43 schrieb Timothée Maret <ti...@gmail.com>:

Hi, 

With Sling content distribution (using FileVault), we observe a significantly lower throughput for content packages containing binaries.
The main bottleneck seems to be the compression algorithm applied to every element contained in the content package.

I think that we could improve the throughput significantly, simply by avoiding to re-compress binaries that are already compressed.
In order to figure out what binaries are already compressed, we could use match the content type stored along the binary against a list of configurable content types.

I have done some micro tests with this idea (patch in [0]). I think that the results are promising.

Exporting a single 250 MB JPEG is 80% faster (22.4 sec -> 4.3 sec) for a 3% bigger content package (233.2 MB -> 240.4 MB)
Exporting AEM OOTB /content/dam is 50% faster (11.9 sec -> 5.9 sec) for a 5% bigger content package (92.8 MB -> 97.4 MB)
Import for the same cases is 66% faster respectively 32% faster.  

I think this could either be done by default and allowing to configure the list of types that skip compression.
Alternatively, it could be done on a project level, by extending FileVault with the following

1. For each package, allow to define the default compression level (best compression, best speed)
2. Expose an API that allow to plugin a custom logic to decide how to compress a given artefact

In any case, the changes would be backward compatible. Content packages created with the new code would be installable on instances running the old code and vice versa.

wdyt ?

Regards, 

Timothee


[0] https://github.com/tmaret/jackrabbit-filevault/tree/performance-avoid-compressing-already-compressed-binaries-based-on-content-type-detection
[1] https://docs.oracle.com/javase/7/docs/api/java/util/zip/Deflater.html#BEST_SPEED




Re: [FileVault][discuss] performance improvement proposal

Posted by Felix Meschberger <fm...@adobe.com>.
Hi

This looks great.

As for configuration: What is the reason for having a configuration option ? Not being able to decide ? Or real customer need for having it configurable ?

I think we should start with reasonable heuristics first and consider configuration options in case there is a need/desire.

Regards
Felix

Am 06.03.2017 um 16:43 schrieb Timothée Maret <ti...@gmail.com>>:

Hi,

With Sling content distribution (using FileVault), we observe a significantly lower throughput for content packages containing binaries.
The main bottleneck seems to be the compression algorithm applied to every element contained in the content package.

I think that we could improve the throughput significantly, simply by avoiding to re-compress binaries that are already compressed.
In order to figure out what binaries are already compressed, we could use match the content type stored along the binary against a list of configurable content types.

I have done some micro tests with this idea (patch in [0]). I think that the results are promising.

Exporting a single 250 MB JPEG is 80% faster (22.4 sec -> 4.3 sec) for a 3% bigger content package (233.2 MB -> 240.4 MB)
Exporting AEM OOTB /content/dam is 50% faster (11.9 sec -> 5.9 sec) for a 5% bigger content package (92.8 MB -> 97.4 MB)
Import for the same cases is 66% faster respectively 32% faster.

I think this could either be done by default and allowing to configure the list of types that skip compression.
Alternatively, it could be done on a project level, by extending FileVault with the following

1. For each package, allow to define the default compression level (best compression, best speed)
2. Expose an API that allow to plugin a custom logic to decide how to compress a given artefact

In any case, the changes would be backward compatible. Content packages created with the new code would be installable on instances running the old code and vice versa.

wdyt ?

Regards,

Timothee


[0] https://github.com/tmaret/jackrabbit-filevault/tree/performance-avoid-compressing-already-compressed-binaries-based-on-content-type-detection
[1] https://docs.oracle.com/javase/7/docs/api/java/util/zip/Deflater.html#BEST_SPEED