Posted to derby-dev@db.apache.org by "Kristian Waagan (JIRA)" <ji...@apache.org> on 2008/08/05 19:14:44 UTC
[jira] Resolved: (DERBY-3766) EmbedBlob.setPosition is highly ineffective for streams

     [ https://issues.apache.org/jira/browse/DERBY-3766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kristian Waagan resolved DERBY-3766.
------------------------------------

       Resolution: Fixed
    Fix Version/s: 10.5.0.0
                   10.4.2.0

Backported patches 1a and 2a to 10.4 with revision 682803. They went in together with the second patch for DERBY-3783.

> EmbedBlob.setPosition is highly ineffective for streams
> -------------------------------------------------------
>
>                 Key: DERBY-3766
>                 URL: https://issues.apache.org/jira/browse/DERBY-3766
>             Project: Derby
>          Issue Type: Bug
>          Components: JDBC, Network Server
>    Affects Versions: 10.1.3.1, 10.2.2.0, 10.3.3.0, 10.4.1.3, 10.5.0.0
>            Reporter: Kristian Waagan
>            Assignee: Kristian Waagan
>             Fix For: 10.4.2.0, 10.5.0.0
>
>         Attachments: derby-3766-1a-preparations.diff, derby-3766-2a-position_fix.diff
>
>
> The EmbedBlob.setPosition implementation has two performance problems when the
> Blob is represented by a store stream, at least one of them rather significant:
>   1) The store stream is reset to position zero for each position request.
>      Data is then read until the requested position has been reached.
>   2) 'read' is used instead of 'skip', which causes Derby to miss out on the
>      optimization potential with streams that have a more efficient skip mechanism.
> My gut feeling is that once point 1) has been fixed, point 2) will have
> disappeared. Also note that the unconditional reset approach was chosen
> because the blob implementation couldn't keep track of the underlying
> stream's position. This issue still has to be addressed.
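
To make point 1) concrete, here is a minimal sketch that counts how many bytes get pulled from a stream when every chunk request first rewinds to position zero and reads forward again. It is not Derby code: the class and method names are hypothetical, and a ByteArrayInputStream stands in for the store stream (its reset() returns to position zero when no mark has been set).

```java
import java.io.ByteArrayInputStream;

// Hypothetical illustration, not part of Derby.
class ResetPerRequestDemo {

    /**
     * Reads the blob chunk by chunk. Before each chunk the stream is reset to
     * position zero and read forward to the requested position, mimicking the
     * behaviour described in point 1). Returns the total bytes read.
     */
    static long bytesProcessedWithReset(byte[] blob, int bufferSize) {
        // Assumption: ByteArrayInputStream stands in for the store stream.
        ByteArrayInputStream in = new ByteArrayInputStream(blob);
        byte[] buf = new byte[bufferSize];
        long processed = 0;
        for (long pos = 0; pos < blob.length; pos += bufferSize) {
            in.reset();                       // back to position zero
            long remaining = pos;
            while (remaining > 0) {           // 'read' forward, no 'skip'
                int r = in.read(buf, 0, (int) Math.min(buf.length, remaining));
                if (r < 0) {
                    throw new IllegalStateException("unexpected end of stream");
                }
                remaining -= r;
                processed += r;
            }
            int r = in.read(buf, 0, buf.length);  // the chunk actually requested
            if (r > 0) {
                processed += r;
            }
        }
        return processed;
    }

    public static void main(String[] args) {
        // 1'024'000 bytes read with a 32'000 byte buffer (32 full chunks).
        System.out.println(bytesProcessedWithReset(new byte[1_024_000], 32_000));
    }
}
```

When the blob size is an exact multiple of the buffer size, the count works out to b * n * (n+1) / 2, with b the buffer size and n the number of chunks.
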
>
> Performance degradation
> =======================
> Observations suggest the following approximation can be used to quantify the
> number of bytes that have to be processed on the server. It applies to both
> the embedded and the client driver when using the EmbedBlob.getBytes call.
>     s = size of the blob
>     b = buffer/block size
>     n = s/b      (number of iterations needed to read the whole Blob)
>     s + b * n * (n+1) / 2 = number of bytes processed on the server side
>     From now on, I ignore the s.
> For a 1 MB Blob when using a buffer of 32 000 bytes, we get:
>    n = 1 * 1'024 * 1'024 / 32'000 ~ 33
>    32'000 * 33 * (33+1) / 2 = 17'952'000 ~ 17 MB
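
The approximation is easy to evaluate mechanically. The sketch below is only an illustration with hypothetical names; it assumes n is rounded to the nearest integer, which is what the figures in the description (33 for the 1 MB Blob) appear to use, and it leaves out the s term as the description does.

```java
// Hypothetical helper, not part of Derby.
class BlobReadCost {

    /**
     * Evaluates b * n * (n+1) / 2, where n = s/b.
     * Assumption: n is rounded to the nearest integer.
     */
    static long approxBytesProcessed(long blobSize, long bufferSize) {
        long n = Math.round((double) blobSize / bufferSize);
        return bufferSize * n * (n + 1) / 2;
    }

    public static void main(String[] args) {
        // 1 MB Blob, 32'000 byte buffer: n = 33
        System.out.println(approxBytesProcessed(1L * 1024 * 1024, 32_000));
        // 64 MB Blob, 32'000 byte buffer: n = 2097
        System.out.println(approxBytesProcessed(64L * 1024 * 1024, 32_000));
    }
}
```
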
> To quickly verify the approximation I summed up the bytes processed by
> EmbedBlob.getBytes(long,int) and EmbedBlob.setPosition(long) with the 64 MB
> Blob used by the repro for DERBY-550 (modified to use a 32'000 byte read
> buffer):
>   - approximation: 32'000 * 2097 * (2097+1) / 2 =  70392096000 ~  66 GB
>   - Derby byte count                            =  70459204864 ~  66 GB
>   - Derby byte count (buffer 33'000, see below) = 136526134864 ~ 127 GB
> I'll explain the biggest number further down.
> As you see, the number of bytes processed is huge. Note that all of the
> gigabytes are just the 64 MB repeated over and over again. Since the actual data
> volume is so small, all the data will be in the caches of Derby and the
> operating system. Note that only 64 MB is actually transferred to the client
> when using the client driver.
> Another consequence of processing all this data repeatedly is the effect it
> has on the page cache. Pages have to be evicted and read back in. The
> performance hit taken by this depends on the page cache size, operating system
> buffers and other database activities on the system.
> The explanation for the largest number above is another performance issue in
> the client driver, related to locators. I'll explain it in more detail in a
> separate Jira issue, but in short the issue causes the stream to be reset even
> more frequently for read buffer sizes over 32'672 bytes.
>
> Suggested fix
> =============
> My initial analysis suggests that the problem can be fixed by using the
> functionality provided by PositionedStoreStream. There are a few complicating
> issues:
>  a) The length encoding bytes must be accounted for properly.
>  b) One must make sure all stream access happens through the
>     PositionedStoreStream, otherwise the position will be incorrect and the
>     wrong data will be returned.
> With a prototype fix, the repro duration went down from minutes (7 minutes
> for the 127 GB case) to between 2 and 4 seconds with a sane build, running
> on localhost.
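
The general idea of a position-aware stream can be sketched as below. This is not Derby's PositionedStoreStream and says nothing about how the actual patch works: TrackingStream and its methods are hypothetical, a ByteArrayInputStream stands in for the store stream, and the length encoding bytes from point a) are ignored entirely. The wrapper remembers where the underlying stream is, only rewinds when asked to move backwards, and moves forward with 'skip' rather than 'read'.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical wrapper, not Derby's PositionedStoreStream.
class TrackingStream {
    // Assumption: reset() on this stream rewinds to position zero.
    private final InputStream in;
    private long pos = 0;        // current position in the underlying stream
    private long processed = 0;  // bytes read or skipped so far

    TrackingStream(InputStream in) {
        this.in = in;
    }

    /** Moves to the target position, rewinding only when going backwards. */
    void reposition(long target) throws IOException {
        if (target < pos) {
            in.reset();
            pos = 0;
        }
        while (pos < target) {
            long skipped = in.skip(target - pos);
            if (skipped <= 0) {
                // skip may return 0 before EOF, so probe with a single read
                if (in.read() < 0) {
                    throw new EOFException("position " + target + " is past the end");
                }
                skipped = 1;
            }
            pos += skipped;
            processed += skipped;
        }
    }

    /** Reads into buf and keeps the tracked position in sync. */
    int read(byte[] buf) throws IOException {
        int r = in.read(buf, 0, buf.length);
        if (r > 0) {
            pos += r;
            processed += r;
        }
        return r;
    }

    long bytesProcessed() {
        return processed;
    }

    /** Reads the blob chunk by chunk, repositioning before each chunk. */
    static long sequentialReadCost(byte[] blob, int bufferSize) {
        try {
            TrackingStream ts = new TrackingStream(new ByteArrayInputStream(blob));
            byte[] buf = new byte[bufferSize];
            for (long p = 0; p < blob.length; p += bufferSize) {
                ts.reposition(p);  // already at p, so nothing is re-read
                ts.read(buf);
            }
            return ts.bytesProcessed();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sequentialReadCost(new byte[1_024_000], 32_000));
    }
}
```

For a sequential read, the number of bytes processed is then just the size of the blob. As point b) says, this only holds if every access goes through the wrapper. Anything that touches the underlying stream directly would leave the tracked position stale.
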
>
> Affected versions
> =================
> The code suffering from the performance issues is old, but because it isn't
> used in the same way in all versions, some releases are more affected than
> others.
> Version     Embedded.getBytes   Client.getXXX
> 10.5                X                   X
> 10.4.1.3            X                   X
> 10.3.3.0            X                   X
> 10.2.2.1            X                   _
> 10.1.3.1            X                   _
> (X = issue present)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.