Posted to oak-dev@jackrabbit.apache.org by Alexander Klimetschek <ak...@adobe.com.INVALID> on 2018/02/15 21:06:11 UTC

[SegmentStore] Blobs under 16 KB always inlined in tar files?

Hi,

It seems the segment store inlines any binary blob up to ~16 KB in the tar files instead of storing it in the BlobStore [1]. The 16 KB limit (Segment.MEDIUM_LIMIT [2]) is hardcoded and not configurable.
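
To make the behaviour described above concrete, here is a minimal sketch of the inline-or-externalize decision. It is not Oak's code: the class, the enum, and the constant INLINE_LIMIT_BYTES are hypothetical, the value 16 * 1024 only approximates the "~16 KB" mentioned above rather than the exact Segment.MEDIUM_LIMIT, and the strict "less than" comparison is an assumption.

```java
// Hypothetical sketch of the decision described in the thread; not Oak's actual code.
class InlineThresholdSketch {

    // Assumed stand-in for Segment.MEDIUM_LIMIT. The thread only says "~16 KB",
    // so the exact value used by Oak may differ.
    static final int INLINE_LIMIT_BYTES = 16 * 1024;

    enum Target { INLINED_IN_SEGMENT, BLOB_STORE }

    // Decide where a binary of the given length would end up.
    // The strict "<" comparison is an assumption made for this sketch.
    static Target chooseTarget(long lengthInBytes) {
        if (lengthInBytes < 0) {
            throw new IllegalArgumentException("length must not be negative");
        }
        return lengthInBytes < INLINE_LIMIT_BYTES
                ? Target.INLINED_IN_SEGMENT
                : Target.BLOB_STORE;
    }

    public static void main(String[] args) {
        long[] sizes = {100, 4 * 1024, 16 * 1024, 64 * 1024};
        for (long size : sizes) {
            System.out.println(size + " bytes -> " + chooseTarget(size));
        }
    }
}
```

Because the limit is a compile-time constant in this sketch, there is no way to tune it per deployment, which mirrors the "hardcoded and not configurable" observation.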

I can see this in action when debugging, and when looking at the S3 datastore of a full Oak segment store + S3 datastore installation, where the smallest binaries in S3 are slightly over 16 KB.

As Ian pointed out:

> This could bloat Tar files, impact memory mapping, and may be a major consumer of RAM for TarMK mmap mode, but I don't know TarMK well enough to know the logic behind doing that. The OS disk cache is the correct place to deal with any file over 1 block in size, especially if it's accessed sporadically.

I would agree at first sight. However, there might be good reasons for the current design, and these concerns might not hold in practice. The same setting is essentially used for both STRING and BINARY properties - maybe it makes a lot of sense for strings, but not so much for immutable binaries?

Could someone shed some light?

IIUC, it also makes the minRecordLength config [3] of the datastore(s) have no effect. That value should probably be rather low (the default is 100 bytes), given that a binary below it is encoded in the blob id itself. But since only binaries larger than 16 KB will ever reach the blob store (in a segment store setup), all binaries will effectively always be larger than minRecordLength.
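
The interplay of the two thresholds can be illustrated with a small sketch. Everything in it is hypothetical (class, enum, method names, and the 16 * 1024 stand-in for the segment store limit), and the strict "less than" comparisons are assumptions. It only models the argument made above: the segment store decides first, so a minRecordLength below the segment limit can never apply.

```java
// Hypothetical model of the two-stage decision; not Oak's or the datastore's actual code.
class MinRecordLengthSketch {

    // Assumed stand-in for the hardcoded segment store limit (~16 KB per the thread).
    static final int SEGMENT_INLINE_LIMIT_BYTES = 16 * 1024;

    enum Placement { SEGMENT_INLINE, BLOB_ID_INLINE, BLOB_STORE_RECORD }

    // The segment store is consulted first; only binaries it does not inline
    // are handed to the blob store, which then applies minRecordLength.
    static Placement place(long length, int minRecordLength) {
        if (length < SEGMENT_INLINE_LIMIT_BYTES) {
            return Placement.SEGMENT_INLINE;
        }
        if (length < minRecordLength) {
            return Placement.BLOB_ID_INLINE;
        }
        return Placement.BLOB_STORE_RECORD;
    }

    // Brute-force check: is there any length up to maxLength for which
    // the blob store's own inlining would actually be used?
    static boolean blobIdInlineReachable(int minRecordLength, long maxLength) {
        for (long length = 0; length <= maxLength; length++) {
            if (place(length, minRecordLength) == Placement.BLOB_ID_INLINE) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println("minRecordLength=100:   reachable="
                + blobIdInlineReachable(100, 64 * 1024));
        System.out.println("minRecordLength=32768: reachable="
                + blobIdInlineReachable(32 * 1024, 64 * 1024));
    }
}
```

In this model, the blob store's inlining branch only becomes reachable when minRecordLength is set above the segment store limit.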

[1] https://github.com/apache/jackrabbit-oak/blob/58fdaf0dc0786f4cc9e39e7d26684fda04b32e78/oak-segment-tar/src/main/java/org/apache/jackrabbit/oak/segment/DefaultSegmentWriter.java#L648
[2] https://github.com/apache/jackrabbit-oak/blob/trunk/oak-segment-tar/src/main/java/org/apache/jackrabbit/oak/segment/Segment.java#L111-L118
[3] https://jackrabbit.apache.org/oak/docs/osgi_config.html

Cheers,
Alex

Re: [SegmentStore] Blobs under 16 KB always inlined in tar files?

Posted by Michael Dürig <md...@apache.org>.

On 15.02.18 22:06, Alexander Klimetschek wrote:

> I would agree at first sight. However, there might be good reasons for the current design, and these concerns might not hold in practice. The same setting is essentially used for both STRING and BINARY properties - maybe it makes a lot of sense for strings, but not so much for immutable binaries?
> 
> Could someone shed some light?

The current threshold is based on some statistics collected early on in 
the history of Oak. The numbers might have changed in the meantime, so 
re-evaluating this makes sense.

> IIUC, it also makes the minRecordLength config [3] of the datastore(s) have no effect. That value should probably be rather low (the default is 100 bytes), given that a binary below it is encoded in the blob id itself. But since only binaries larger than 16 KB will ever reach the blob store (in a segment store setup), all binaries will effectively always be larger than minRecordLength.

That configuration is about the blob store. The segment store can make 
its own decisions independently of that setting on whether to inline a 
binary or not.

Michael

Re: [SegmentStore] Blobs under 16 KB always inlined in tar files?

Posted by Alexander Klimetschek <ak...@adobe.com.INVALID>.
Thanks!

The real-world example of "In one setup out of 370 GB segmentstore size 290GB is due to inlined binary" shows that Ian's hunch was pretty spot on. It's clear we don't want these binaries wasting memory by being memory-mapped. The same would apply to longer strings IMO, say rich text HTML or Markdown snippets that would be typical in a CMS and stored in JCR STRING properties.

Cheers,
Alex

> On 15.02.2018, at 23:24, Chetan Mehrotra <ch...@gmail.com> wrote:
> 
> See OAK-6911
> Chetan Mehrotra


Re: [SegmentStore] Blobs under 16 KB always inlined in tar files?

Posted by Chetan Mehrotra <ch...@gmail.com>.
See OAK-6911
Chetan Mehrotra

