Posted to oak-issues@jackrabbit.apache.org by "Jukka Zitting (JIRA)" <ji...@apache.org> on 2014/02/10 12:16:22 UTC

[jira] [Commented] (OAK-1392) SegmentBlob.equals() optimization

    [ https://issues.apache.org/jira/browse/OAK-1392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13896405#comment-13896405 ] 

Jukka Zitting commented on OAK-1392:
------------------------------------

I think there's still some work to do:

* Adler-32 is a simple checksum, not a secure hash function, so it can be used to tell that two binaries are different, but not that they're equal. Thus the "{{return getChecksum() == other.getChecksum()}}" statement is incorrect. For the approach in the patch to work, we'd need something like a SHA hash of the binary.
* If the checksum/hash isn't stored along with the binary value, it doesn't help performance, as the stream will in almost all cases get scanned over and over again for each new comparison. It's very unlikely for a blob instance to be reused for another equality comparison, since for example {{SegmentPropertyState.getValue(BINARY)}} will always return a new {{SegmentBlob}} instance. In fact the patch might even make comparisons _slower_, since we will always need to scan the entire stream, whereas the earlier {{ByteStreams.equal()}} call would at least fail fast as soon as it encounters the first non-equal byte.
* Instead of computing (and storing) the checksum/hash for the entire binary value, it would be better to do it separately for each block record. This way we wouldn't need to re-scan the entire stream if just a part of it has changed. (Of course that feature is not used anywhere ATM, but it would be nice if we didn't introduce any functionality that would break our ability to do highly efficient in-place updates of binary values.)
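
To illustrate the first two points, here is a minimal sketch of using a stored Adler-32 value only as a negative filter, with a fail-fast byte comparison as the fallback. It is not the code from the patch: the class and method names ({{ChecksumFilter}}, {{contentEquals}}, {{bytesEqual}}) are hypothetical, plain byte arrays stand in for blob streams, and the checksums are assumed to have been computed once and stored next to the value.

```java
import java.util.zip.Adler32;

// Hypothetical helper, not part of Oak: byte arrays stand in for blob streams.
public class ChecksumFilter {

    // Computes the Adler-32 checksum of the given bytes. In the scenario
    // discussed above this would be done once, when the value is written.
    static long adler32(byte[] data) {
        Adler32 checksum = new Adler32();
        checksum.update(data, 0, data.length);
        return checksum.getValue();
    }

    // Byte-by-byte comparison that stops at the first difference.
    static boolean bytesEqual(byte[] left, byte[] right) {
        if (left.length != right.length) {
            return false;
        }
        for (int i = 0; i < left.length; i++) {
            if (left[i] != right[i]) {
                return false; // fail fast on the first non-equal byte
            }
        }
        return true;
    }

    // Different checksums prove the contents differ. Equal checksums prove
    // nothing, so that case still falls back to the full comparison.
    static boolean contentEquals(
            byte[] left, long leftChecksum, byte[] right, long rightChecksum) {
        if (leftChecksum != rightChecksum) {
            return false;
        }
        return bytesEqual(left, right);
    }
}
```

The checksums are passed in as arguments rather than computed inside {{contentEquals}}; computing them on every call would scan both values in full and lose the fail-fast behavior.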

I'll follow up with some ideas on #3.

> SegmentBlob.equals() optimization
> ---------------------------------
>
>                 Key: OAK-1392
>                 URL: https://issues.apache.org/jira/browse/OAK-1392
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: core
>            Reporter: Jukka Zitting
>         Attachments: OAK-1392-v0.patch
>
>
> The current {{SegmentBlob.equals()}} method only checks for reference equality before falling back to the {{AbstractBlob.equals()}} method that just scans the entire byte stream.
> This works well for the majority of cases where a binary won't change at all or at least not often. However, there are some cases where a client frequently updates a binary or even rewrites it with the exact same contents. We should also optimize the handling of those cases.
> Some ideas on different things we can/should do:
> # Make {{AbstractBlob.equals()}} compare the blob lengths before scanning the byte streams. If a blob has changed, its length is likely also different, in which case the length check should provide a quick shortcut.
> # Keep a simple checksum like Adler-32 along with medium-sized value records and the block record references of a large value record. Compare those checksums before falling back to a full byte scan. This should capture practically all cases where the binaries are different even with equal lengths, but still not the case where they're equal.
> # When updating a binary value, do an equality check with the previous value and reuse the previous value if equal. The extra cost of doing this should get recovered already when the commit hooks that look at the change won't have to consider an unchanged binary.
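
Ideas 1 and 3 in the description above could be sketched roughly as follows. This is only an illustration under simplifying assumptions, not Oak code: {{BlobUpdate}}, {{sameContent}} and {{update}} are hypothetical names, and byte arrays stand in for blobs whose length is known without reading the stream.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Oak: byte arrays stand in for blobs.
public class BlobUpdate {

    // Idea 1: check the cheap conditions (same reference, different length)
    // before scanning any content.
    static boolean sameContent(byte[] previous, byte[] updated) {
        if (previous == updated) {
            return true;
        }
        // Arrays.equals also compares lengths; the explicit check mirrors
        // what a stream-based blob would do before opening its streams.
        if (previous.length != updated.length) {
            return false;
        }
        return Arrays.equals(previous, updated);
    }

    // Idea 3: when the new value has the same content, keep the previous
    // instance so later comparisons can succeed on reference equality.
    static byte[] update(byte[] previous, byte[] updated) {
        return sameContent(previous, updated) ? previous : updated;
    }
}
```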



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)