Posted to oak-dev@jackrabbit.apache.org by Mete Atamel <ma...@adobe.com> on 2012/09/26 09:38:19 UTC

[MongoMK] Reading blobs incrementally

Hi,

I realized that MicroKernelIT#testBlobs takes a while to complete on
MongoMK. This is partly due to how the test was written and partly due to
how the blob read offset is implemented in MongoMK. I'm looking for
feedback on where to fix this.

To give you an idea of testBlobs: it first writes a blob using the MK. Then,
it verifies that the blob bytes were written correctly by reading the blob
back from the MK. However, the blob is not read from the MK in one shot.
Instead, it's done via this input stream:

InputStream in2 = new BufferedInputStream(new MicroKernelInputStream(mk,
id));


MicroKernelInputStream reads from the MK, and BufferedInputStream buffers
the reads in 8K chunks. Then there's a while loop with in2.read() to read
the blob fully. This results in a call to the MicroKernel#read method with
the right offset for every 8K chunk until the blob bytes are fully read.
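
For reference, the reading side of the test looks roughly like this (a
sketch, not the actual MicroKernelIT code; it only assumes the MicroKernel
and MicroKernelInputStream types already mentioned above):

import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;

// each refill of the 8K buffer turns into one MicroKernel#read call
// with an increasing offset until the whole blob has been consumed
static long readBlobFully(MicroKernel mk, String id) throws IOException {
    InputStream in2 = new BufferedInputStream(new MicroKernelInputStream(mk, id));
    try {
        long total = 0;
        while (in2.read() != -1) {   // the real test also compares each byte to what was written
            total++;
        }
        return total;
    } finally {
        in2.close();
    }
}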

This is not a problem for small blobs, but for bigger blobs reading in 8K
chunks can be slow, because in MongoMK every read with an offset triggers
the following:
- Find the blob in GridFS
- Retrieve its input stream
- Skip to the right offset
- Read 8K
- Close the input stream
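
In code, those steps correspond roughly to the following (a sketch against
the mongo-java-driver GridFS API; the lookup key and the actual MongoMK
code may differ):

import com.mongodb.DB;
import com.mongodb.gridfs.GridFS;
import com.mongodb.gridfs.GridFSDBFile;
import java.io.IOException;
import java.io.InputStream;

static int readChunk(DB db, String blobId, long pos, byte[] buff, int off, int len)
        throws IOException {
    GridFS gridFS = new GridFS(db);               // find the blob in GridFS
    GridFSDBFile file = gridFS.findOne(blobId);   // lookup by id/filename (assumed)
    InputStream in = file.getInputStream();       // retrieve its input stream
    try {
        long skipped = 0;
        while (skipped < pos) {                   // skip to the requested offset
            long n = in.skip(pos - skipped);
            if (n <= 0) {
                break;
            }
            skipped += n;
        }
        return in.read(buff, off, len);           // read the next ~8K chunk
    } finally {
        in.close();                               // the stream is closed after every call
    }
}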

I could fix this by changing the test to read the blob bytes in one shot
and then do the comparison. However, I was wondering whether we should also
work on an optimization for successive reads from the same blob with
increasing offsets. Maybe we could keep the input streams of recently read
blobs around for some time before closing them?

Best,
Mete



Re: [MongoMK] Reading blobs incrementally

Posted by Michael Dürig <md...@apache.org>.

On 17.10.12 10:20, Stefan Guggisberg wrote:
> it could, but then why should it?

Because there are pros and cons to both approaches, and I wonder what
the considerations were when the decision was made for the current API.

Michael

Re: [MongoMK] Reading blobs incrementally

Posted by Stefan Guggisberg <st...@gmail.com>.
On Wed, Oct 17, 2012 at 11:08 AM, Michael Dürig <md...@apache.org> wrote:
>
>
> On 17.10.12 10:03, Stefan Guggisberg wrote:
>>
>> On Wed, Oct 17, 2012 at 10:42 AM, Michael Dürig <md...@apache.org>
>> wrote:
>>>
>>>
>>> I wonder why the Microkernel API has an asymmetry here: for writing a
>>> binary
>>> you can pass a stream where as for reading you need to pass a byte array.
>>
>>
>> the write method implies a content-addressable storage for blobs,
>> i.e. identical binary content is identified by identical identifiers.
>> the identifier
>> needs to be computed from the entire blob content. that's why the
>> signature takes
>> a stream rather than supporting chunked writes.
>
>
> Makes sense so far but this is only half of the story ;-) Why couldn't the
> read method also return a stream?

it could, but then why should it? for cosmetic reasons? personally I prefer
the current signature for its cleaner semantics and ease of implementation.

cheers
stefan

>
> Michael
>
>
>>
>> cheers
>> stefan
>>
>>>
>>> Michael
>>>
>>>
>>> On 26.9.12 8:38, Mete Atamel wrote:
>>>>
>>>>
>>>> Hi,
>>>>
>>>> I realized that MicroKernelIT#testBlobs takes a while to complete on
>>>> MongoMK. This is partly due to how the test was written and partly due
>>>> to
>>>> how the blob read offset is implemented in MongoMK. I'm looking for
>>>> feedback on where to fix this.
>>>>
>>>> To give you an idea on testBlobs, it first writes a blob using MK. Then,
>>>> it verifies that the blob bytes were written correctly by reading the
>>>> blob
>>>> from MK. However, blob read from MK is not done in one shot. Instead,
>>>> it's
>>>> done via this input stream:
>>>>
>>>> InputStream in2 = new BufferedInputStream(new MicroKernelInputStream(mk,
>>>> id));
>>>>
>>>>
>>>> MicroKernelInputStream reads from the MK and BufferedInputStream buffers
>>>> the reads in 8K chunks. Then, there's a while loop with in2.read() to
>>>> read
>>>> the blob fully. This makes a call to MicroKernel#read method with the
>>>> right offset for every 8K chunk until the blob bytes are fully read.
>>>>
>>>> This is not a problem for small blob sizes but for bigger blob sizes,
>>>> reading 8K chunks can be slow because in MongoMK, every read with offset
>>>> triggers the following:
>>>> -Find the blob from GridFS
>>>> -Retrieve its input stream
>>>> -Skip to the right offset
>>>> -Read 8K
>>>> -Close the input stream
>>>>
>>>> I could fix this by changing the test to read the blob bytes in one shot
>>>> and then do the comparison. However, I was wondering if we should also
>>>> work on an optimization for successive reads from the blob with
>>>> incremental offsets? Maybe we could keep the input stream of recently
>>>> read
>>>> blobs around for some time before closing them?
>>>>
>>>> Best,
>>>> Mete
>>>>
>>>>
>>>
>

Re: [MongoMK] Reading blobs incrementally

Posted by Michael Dürig <md...@apache.org>.

On 17.10.12 10:03, Stefan Guggisberg wrote:
> On Wed, Oct 17, 2012 at 10:42 AM, Michael Dürig <md...@apache.org> wrote:
>>
>> I wonder why the Microkernel API has an asymmetry here: for writing a binary
>> you can pass a stream where as for reading you need to pass a byte array.
>
> the write method implies a content-addressable storage for blobs,
> i.e. identical binary content is identified by identical identifiers.
> the identifier
> needs to be computed from the entire blob content. that's why the
> signature takes
> a stream rather than supporting chunked writes.

Makes sense so far but this is only half of the story ;-) Why couldn't 
the read method also return a stream?

Michael

>
> cheers
> stefan
>
>>
>> Michael
>>
>>
>> On 26.9.12 8:38, Mete Atamel wrote:
>>>
>>> Hi,
>>>
>>> I realized that MicroKernelIT#testBlobs takes a while to complete on
>>> MongoMK. This is partly due to how the test was written and partly due to
>>> how the blob read offset is implemented in MongoMK. I'm looking for
>>> feedback on where to fix this.
>>>
>>> To give you an idea on testBlobs, it first writes a blob using MK. Then,
>>> it verifies that the blob bytes were written correctly by reading the blob
>>> from MK. However, blob read from MK is not done in one shot. Instead, it's
>>> done via this input stream:
>>>
>>> InputStream in2 = new BufferedInputStream(new MicroKernelInputStream(mk,
>>> id));
>>>
>>>
>>> MicroKernelInputStream reads from the MK and BufferedInputStream buffers
>>> the reads in 8K chunks. Then, there's a while loop with in2.read() to read
>>> the blob fully. This makes a call to MicroKernel#read method with the
>>> right offset for every 8K chunk until the blob bytes are fully read.
>>>
>>> This is not a problem for small blob sizes but for bigger blob sizes,
>>> reading 8K chunks can be slow because in MongoMK, every read with offset
>>> triggers the following:
>>> -Find the blob from GridFS
>>> -Retrieve its input stream
>>> -Skip to the right offset
>>> -Read 8K
>>> -Close the input stream
>>>
>>> I could fix this by changing the test to read the blob bytes in one shot
>>> and then do the comparison. However, I was wondering if we should also
>>> work on an optimization for successive reads from the blob with
>>> incremental offsets? Maybe we could keep the input stream of recently read
>>> blobs around for some time before closing them?
>>>
>>> Best,
>>> Mete
>>>
>>>
>>

Re: [MongoMK] Reading blobs incrementally

Posted by Stefan Guggisberg <st...@gmail.com>.
On Wed, Oct 17, 2012 at 10:42 AM, Michael Dürig <md...@apache.org> wrote:
>
> I wonder why the Microkernel API has an asymmetry here: for writing a binary
> you can pass a stream where as for reading you need to pass a byte array.

the write method implies content-addressable storage for blobs, i.e.
identical binary content is identified by identical identifiers. the
identifier needs to be computed from the entire blob content; that's why
the signature takes a stream rather than supporting chunked writes.
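
for illustration, the constraint looks like this (a sketch, assuming the id
is a content hash such as SHA-256; the actual hash used by the MK may
differ):

import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// the id can only be known after the whole stream has been consumed,
// which is why write(InputStream) fits and chunked writes would not
static String contentId(InputStream in) throws IOException, NoSuchAlgorithmException {
    MessageDigest digest = MessageDigest.getInstance("SHA-256");
    byte[] buff = new byte[8192];
    int n;
    while ((n = in.read(buff)) != -1) {
        digest.update(buff, 0, n);
    }
    StringBuilder hex = new StringBuilder();
    for (byte b : digest.digest()) {
        hex.append(String.format("%02x", b));
    }
    return hex.toString();
}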

cheers
stefan

>
> Michael
>
>
> On 26.9.12 8:38, Mete Atamel wrote:
>>
>> Hi,
>>
>> I realized that MicroKernelIT#testBlobs takes a while to complete on
>> MongoMK. This is partly due to how the test was written and partly due to
>> how the blob read offset is implemented in MongoMK. I'm looking for
>> feedback on where to fix this.
>>
>> To give you an idea on testBlobs, it first writes a blob using MK. Then,
>> it verifies that the blob bytes were written correctly by reading the blob
>> from MK. However, blob read from MK is not done in one shot. Instead, it's
>> done via this input stream:
>>
>> InputStream in2 = new BufferedInputStream(new MicroKernelInputStream(mk,
>> id));
>>
>>
>> MicroKernelInputStream reads from the MK and BufferedInputStream buffers
>> the reads in 8K chunks. Then, there's a while loop with in2.read() to read
>> the blob fully. This makes a call to MicroKernel#read method with the
>> right offset for every 8K chunk until the blob bytes are fully read.
>>
>> This is not a problem for small blob sizes but for bigger blob sizes,
>> reading 8K chunks can be slow because in MongoMK, every read with offset
>> triggers the following:
>> -Find the blob from GridFS
>> -Retrieve its input stream
>> -Skip to the right offset
>> -Read 8K
>> -Close the input stream
>>
>> I could fix this by changing the test to read the blob bytes in one shot
>> and then do the comparison. However, I was wondering if we should also
>> work on an optimization for successive reads from the blob with
>> incremental offsets? Maybe we could keep the input stream of recently read
>> blobs around for some time before closing them?
>>
>> Best,
>> Mete
>>
>>
>

Re: [MongoMK] Reading blobs incrementally

Posted by Michael Dürig <md...@apache.org>.
I wonder why the MicroKernel API has an asymmetry here: for writing a
binary you can pass a stream, whereas for reading you need to pass a
byte array.
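
For reference, the two signatures in question look roughly like this
(paraphrased from memory; the exact interface may differ in details such as
declared exceptions, but the shape of the asymmetry is the point):

public interface MicroKernel {
    /** write: the whole binary goes in as a stream, the id comes back */
    String write(java.io.InputStream in);

    /** read: the caller supplies a byte array and an offset into the blob */
    int read(String blobId, long pos, byte[] buff, int off, int length);
}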

Michael

On 26.9.12 8:38, Mete Atamel wrote:
> Hi,
>
> I realized that MicroKernelIT#testBlobs takes a while to complete on
> MongoMK. This is partly due to how the test was written and partly due to
> how the blob read offset is implemented in MongoMK. I'm looking for
> feedback on where to fix this.
>
> To give you an idea on testBlobs, it first writes a blob using MK. Then,
> it verifies that the blob bytes were written correctly by reading the blob
> from MK. However, blob read from MK is not done in one shot. Instead, it's
> done via this input stream:
>
> InputStream in2 = new BufferedInputStream(new MicroKernelInputStream(mk,
> id));
>
>
> MicroKernelInputStream reads from the MK and BufferedInputStream buffers
> the reads in 8K chunks. Then, there's a while loop with in2.read() to read
> the blob fully. This makes a call to MicroKernel#read method with the
> right offset for every 8K chunk until the blob bytes are fully read.
>
> This is not a problem for small blob sizes but for bigger blob sizes,
> reading 8K chunks can be slow because in MongoMK, every read with offset
> triggers the following:
> -Find the blob from GridFS
> -Retrieve its input stream
> -Skip to the right offset
> -Read 8K
> -Close the input stream
>
> I could fix this by changing the test to read the blob bytes in one shot
> and then do the comparison. However, I was wondering if we should also
> work on an optimization for successive reads from the blob with
> incremental offsets? Maybe we could keep the input stream of recently read
> blobs around for some time before closing them?
>
> Best,
> Mete
>
>

Re: [MongoMK] Reading blobs incrementally

Posted by Julian Reschke <ju...@gmx.de>.
On 2012-09-26 09:38, Mete Atamel wrote:
> ...
> I could fix this by changing the test to read the blob bytes in one shot
> and then do the comparison. However, I was wondering if we should also
> work on an optimization for successive reads from the blob with
> incremental offsets? Maybe we could keep the input stream of recently read
> blobs around for some time before closing them?
> ...

One use case is streaming content from the repository to a web client; 
that usually involves a loop reading from the repo and writing to the 
ServletOutputStream. This will need to work even for large binaries.
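
A minimal sketch of that loop (hypothetical servlet code, not from any
existing module; with the current API every buffer refill below is a
separate offset-based MicroKernel#read on the storage side):

import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import javax.servlet.http.HttpServletResponse;

static void sendBlob(MicroKernel mk, String blobId, HttpServletResponse response)
        throws IOException {
    InputStream in = new BufferedInputStream(new MicroKernelInputStream(mk, blobId));
    try {
        OutputStream out = response.getOutputStream();
        byte[] buff = new byte[8192];
        int n;
        while ((n = in.read(buff)) != -1) {   // repo -> servlet response, chunk by chunk
            out.write(buff, 0, n);
        }
        out.flush();
    } finally {
        in.close();
    }
}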

Best regards, Julian

Re: [MongoMK] Reading blobs incrementally

Posted by Mete Atamel <ma...@adobe.com>.
Thanks for the feedback. Using AbstractBlobStore instead of GridFS is
indeed on the list of things I want to try out once the rest of the missing
functionality is done in MongoMK. I'll report back once I get a chance to
implement that.

-Mete

On 10/17/12 10:26 AM, "Thomas Mueller" <mu...@adobe.com> wrote:

>Hi,
>
>As a workaround, you could keep the last few streams open in the Mongo MK
>for some time (a cache) together with the current position. That way seek
>is not required in most cases, as usually binaries are read as a stream.
>
>However, keeping resources open is problematic (we do that in the
>DbDataStore in Jackrabbit, and we ran into various problems), and I would
>avoid it if possible. I would probably use the AbstractBlobStore instead
>which splits blobs into blocks, I believe that way you can just use
>regular MongoDB features and don't need to use GridFS. But you might want
>to test which approach is faster / easier.
>
>Regards,
>Thomas
>
>
>
>On 9/26/12 9:48 AM, "Mete Atamel" <ma...@adobe.com> wrote:
>
>>Forgot to mention. I could also increase the BufferedInputStream's buffer
>>size to something high to speed up the large blob read. That's probably
>>what I'll do in the short term but my question is more about whether the
>>optimization I mentioned in my previous email is worth pursuing at some
>>point.
>>
>>Best,
>>Mete
>>
>>On 9/26/12 9:38 AM, "Mete Atamel" <ma...@adobe.com> wrote:
>>
>>>Hi,
>>>
>>>I realized that MicroKernelIT#testBlobs takes a while to complete on
>>>MongoMK. This is partly due to how the test was written and partly due
>>>to
>>>how the blob read offset is implemented in MongoMK. I'm looking for
>>>feedback on where to fix this.
>>>
>>>To give you an idea on testBlobs, it first writes a blob using MK. Then,
>>>it verifies that the blob bytes were written correctly by reading the
>>>blob
>>>from MK. However, blob read from MK is not done in one shot. Instead,
>>>it's
>>>done via this input stream:
>>>
>>>InputStream in2 = new BufferedInputStream(new MicroKernelInputStream(mk,
>>>id));
>>>
>>>
>>>MicroKernelInputStream reads from the MK and BufferedInputStream buffers
>>>the reads in 8K chunks. Then, there's a while loop with in2.read() to
>>>read
>>>the blob fully. This makes a call to MicroKernel#read method with the
>>>right offset for every 8K chunk until the blob bytes are fully read.
>>>
>>>This is not a problem for small blob sizes but for bigger blob sizes,
>>>reading 8K chunks can be slow because in MongoMK, every read with offset
>>>triggers the following:
>>>-Find the blob from GridFS
>>>-Retrieve its input stream
>>>-Skip to the right offset
>>>-Read 8K 
>>>-Close the input stream
>>>
>>>I could fix this by changing the test to read the blob bytes in one shot
>>>and then do the comparison. However, I was wondering if we should also
>>>work on an optimization for successive reads from the blob with
>>>incremental offsets? Maybe we could keep the input stream of recently
>>>read
>>>blobs around for some time before closing them?
>>>
>>>Best,
>>>Mete
>>>
>>>
>>
>


Re: [MongoMK] Reading blobs incrementally

Posted by Thomas Mueller <mu...@adobe.com>.
Hi,

As a workaround, you could keep the last few streams open in the Mongo MK
for some time (a cache) together with the current position. That way seek
is not required in most cases, as usually binaries are read as a stream.
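
A rough sketch of what such a cache could look like (hypothetical; class
names, size limit and eviction policy are made up, and it still has the
lifecycle problems mentioned below):

import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;

// one entry per blob id: the open stream plus the position it has reached,
// so a read whose offset matches the cached position can continue without
// the find / skip / close round trip
class OpenBlobStream {
    InputStream in;
    long pos;
}

class BlobStreamCache extends LinkedHashMap<String, OpenBlobStream> {
    private static final int MAX_OPEN = 16;

    BlobStreamCache() {
        super(MAX_OPEN, 0.75f, true);   // access order, i.e. LRU eviction
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<String, OpenBlobStream> eldest) {
        if (size() > MAX_OPEN) {
            try {
                eldest.getValue().in.close();   // don't leak the underlying stream
            } catch (IOException ignore) {
                // best effort
            }
            return true;
        }
        return false;
    }
}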

However, keeping resources open is problematic (we do that in the
DbDataStore in Jackrabbit, and we ran into various problems), and I would
avoid it if possible. I would probably use the AbstractBlobStore instead,
which splits blobs into blocks; I believe that way you can just use
regular MongoDB features and don't need GridFS. But you might want
to test which approach is faster / easier.
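
For comparison, the block-splitting approach might store something like
this in a plain collection (a sketch of the idea only, not how
AbstractBlobStore actually lays out its data):

import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;

// one document per block, addressed by the hash of its content; reading at
// an offset then means fetching block number (offset / BLOCK_SIZE) directly
// instead of skipping through a single large stream
static final int BLOCK_SIZE = 2 * 1024 * 1024;   // arbitrary block size for the sketch

static void putBlock(DBCollection blocks, String blockHash, byte[] data) {
    blocks.save(new BasicDBObject("_id", blockHash).append("data", data));
}

static byte[] getBlock(DBCollection blocks, String blockHash) {
    DBObject doc = blocks.findOne(new BasicDBObject("_id", blockHash));
    return doc == null ? null : (byte[]) doc.get("data");
}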

Regards,
Thomas



On 9/26/12 9:48 AM, "Mete Atamel" <ma...@adobe.com> wrote:

>Forgot to mention. I could also increase the BufferedInputStream's buffer
>size to something high to speed up the large blob read. That's probably
>what I'll do in the short term but my question is more about whether the
>optimization I mentioned in my previous email is worth pursuing at some
>point.
>
>Best,
>Mete
>
>On 9/26/12 9:38 AM, "Mete Atamel" <ma...@adobe.com> wrote:
>
>>Hi,
>>
>>I realized that MicroKernelIT#testBlobs takes a while to complete on
>>MongoMK. This is partly due to how the test was written and partly due to
>>how the blob read offset is implemented in MongoMK. I'm looking for
>>feedback on where to fix this.
>>
>>To give you an idea on testBlobs, it first writes a blob using MK. Then,
>>it verifies that the blob bytes were written correctly by reading the
>>blob
>>from MK. However, blob read from MK is not done in one shot. Instead,
>>it's
>>done via this input stream:
>>
>>InputStream in2 = new BufferedInputStream(new MicroKernelInputStream(mk,
>>id));
>>
>>
>>MicroKernelInputStream reads from the MK and BufferedInputStream buffers
>>the reads in 8K chunks. Then, there's a while loop with in2.read() to
>>read
>>the blob fully. This makes a call to MicroKernel#read method with the
>>right offset for every 8K chunk until the blob bytes are fully read.
>>
>>This is not a problem for small blob sizes but for bigger blob sizes,
>>reading 8K chunks can be slow because in MongoMK, every read with offset
>>triggers the following:
>>-Find the blob from GridFS
>>-Retrieve its input stream
>>-Skip to the right offset
>>-Read 8K 
>>-Close the input stream
>>
>>I could fix this by changing the test to read the blob bytes in one shot
>>and then do the comparison. However, I was wondering if we should also
>>work on an optimization for successive reads from the blob with
>>incremental offsets? Maybe we could keep the input stream of recently
>>read
>>blobs around for some time before closing them?
>>
>>Best,
>>Mete
>>
>>
>


Re: [MongoMK] Reading blobs incrementally

Posted by Mete Atamel <ma...@adobe.com>.
Forgot to mention. I could also increase the BufferedInputStream's buffer
size to something high to speed up the large blob read. That's probably
what I'll do in the short term but my question is more about whether the
optimization I mentioned in my previous email is worth pursuing at some
point.
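
Concretely, the short-term change would just be something like this (the
1 MB buffer size is an arbitrary example):

// a larger buffer means far fewer MicroKernel#read round trips per blob
InputStream in2 = new BufferedInputStream(new MicroKernelInputStream(mk, id), 1024 * 1024);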

Best,
Mete

On 9/26/12 9:38 AM, "Mete Atamel" <ma...@adobe.com> wrote:

>Hi,
>
>I realized that MicroKernelIT#testBlobs takes a while to complete on
>MongoMK. This is partly due to how the test was written and partly due to
>how the blob read offset is implemented in MongoMK. I'm looking for
>feedback on where to fix this.
>
>To give you an idea on testBlobs, it first writes a blob using MK. Then,
>it verifies that the blob bytes were written correctly by reading the blob
>from MK. However, blob read from MK is not done in one shot. Instead, it's
>done via this input stream:
>
>InputStream in2 = new BufferedInputStream(new MicroKernelInputStream(mk,
>id));
>
>
>MicroKernelInputStream reads from the MK and BufferedInputStream buffers
>the reads in 8K chunks. Then, there's a while loop with in2.read() to read
>the blob fully. This makes a call to MicroKernel#read method with the
>right offset for every 8K chunk until the blob bytes are fully read.
>
>This is not a problem for small blob sizes but for bigger blob sizes,
>reading 8K chunks can be slow because in MongoMK, every read with offset
>triggers the following:
>-Find the blob from GridFS
>-Retrieve its input stream
>-Skip to the right offset
>-Read 8K 
>-Close the input stream
>
>I could fix this by changing the test to read the blob bytes in one shot
>and then do the comparison. However, I was wondering if we should also
>work on an optimization for successive reads from the blob with
>incremental offsets? Maybe we could keep the input stream of recently read
>blobs around for some time before closing them?
>
>Best,
>Mete
>
>