Posted to oak-dev@jackrabbit.apache.org by Ian Boston <ie...@tfd.co.uk> on 2016/09/05 13:21:07 UTC

Seekable access to a Binary

Hi,
Is it possible to write to an Oak Binary via the JCR API at an offset ?

I am asking because I am working on the Sling Upload mechanism to make it
streamable in an attempt to eliminate the many duplicate IO operations. A
whole body upload works, and depending on the DS being used shows good
improvements in speed resulting from less IO.

The Sling Chunked Upload protocol, documented at [1] generates 3x the IO
compared against a streamed upload and more compared to a non streamed
upload. IIUC the protocol was implemented in that way as the only way to
update a Binary is to re-write it from scratch with a fresh InputStream
every time.
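
To make the cost of "re-write it from scratch" concrete, here is a minimal sketch that counts bytes written when a binary grows chunk by chunk. It is a toy model only: it does not reproduce the Sling protocol's actual IO pattern or the 3x figure above, and the class and method names are invented for illustration.

```java
import java.io.ByteArrayOutputStream;

/** Toy model: compares bytes written when chunks are added to an immutable binary. */
final class RewriteCost {

    /** Rewrites the whole binary for every chunk, as a create-from-InputStream-only API forces. */
    static long bytesWrittenWithRewrite(int chunkCount, int chunkSize) {
        byte[] current = new byte[0];
        long written = 0;
        for (int i = 0; i < chunkCount; i++) {
            byte[] chunk = new byte[chunkSize];
            ByteArrayOutputStream out = new ByteArrayOutputStream(current.length + chunkSize);
            out.write(current, 0, current.length); // existing content is copied again
            out.write(chunk, 0, chunk.length);     // followed by the new chunk
            current = out.toByteArray();
            written += current.length;
        }
        return written;
    }

    /** Hypothetical write-at-offset: only the new chunk is written each time. */
    static long bytesWrittenWithAppend(int chunkCount, int chunkSize) {
        long written = 0;
        for (int i = 0; i < chunkCount; i++) {
            written += chunkSize;
        }
        return written;
    }

    public static void main(String[] args) {
        System.out.println("rewrite: " + bytesWrittenWithRewrite(4, 1024));
        System.out.println("append:  " + bytesWrittenWithAppend(4, 1024));
    }
}
```

With 4 chunks of 1024 bytes the rewrite path writes 10240 bytes against 4096 for the append path, and the gap widens as the number of chunks grows.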


Is there an alternative, more efficient way to achieve this that would not
require the Binary to be read from the DS, updated and written back to the
DS ?

e.g.
<AFAIKDoesntExist>
valueFactory.createBinary(inputStream, startingAtByteOffsetLong);

or

OutputStream binaryOutputStream = node.getOutputStream();
binaryOutputStream.seek(startingAtByteOffsetLong);
IOUtils.copy(inputStream, binaryOutputStream);
node.getSession().save();

</AFAIKDoesntExist>

The Sling issue being worked on is [2]


Best Regards
Ian

[1] https://cwiki.apache.org/confluence/display/SLING/Chunked+File+Upload+Support
[2] https://issues.apache.org/jira/browse/SLING-6027

Re: Seekable access to a Binary

Posted by Ian Boston <ie...@tfd.co.uk>.
Hi,

On 6 September 2016 at 11:34, Bertrand Delacretaz <bd...@apache.org>
wrote:

> Hi,
>
> On Tue, Sep 6, 2016 at 9:49 AM, Marcel Reutegger <mr...@adobe.com>
> wrote:
> > ...we'd still have to add
> > Jackrabbit API to support it. E.g. something like:
> >
> > valueFactory.createBinary(existingBinary, appendThisInputStream); ...
>
> And maybe a way to mark the binary as "in progress" to avoid
> applications using half-uploaded binaries?
>

Yes, that's also needed where an incremental upload is being performed.
AWS and the Google Data API both have the concept of a session ID when
performing incremental uploads, to avoid conflicts between multiple clients
operating on the same resource.
The current impl in Sling assumes only 1 upload is being performed per
resource. If there are 2, a 500 will be issued and the client will probably
reset the state, breaking the other upload session.
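
A minimal sketch of the session ID idea, assuming each incremental upload gets its own ID and buffer so that two clients targeting the same resource do not disturb each other. The class and method names are hypothetical; this is neither the AWS or Google API nor what Sling implements, and it keeps state in memory only.

```java
import java.io.ByteArrayOutputStream;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

/** Toy registry: every incremental upload has its own session ID and buffer. */
final class UploadSessions {
    private final Map<String, ByteArrayOutputStream> buffers = new ConcurrentHashMap<>();
    private final Map<String, String> targets = new ConcurrentHashMap<>();

    /** Starts an upload for a resource path and returns a fresh session ID. */
    String start(String resourcePath) {
        String id = UUID.randomUUID().toString();
        buffers.put(id, new ByteArrayOutputStream());
        targets.put(id, resourcePath);
        return id;
    }

    /** Returns the resource path a session is uploading to, or null if unknown. */
    String targetOf(String sessionId) {
        return targets.get(sessionId);
    }

    /** Appends a chunk to one session only; other sessions on the same path are untouched. */
    void append(String sessionId, byte[] chunk) {
        ByteArrayOutputStream buffer = buffers.get(sessionId);
        if (buffer == null) {
            throw new IllegalArgumentException("Unknown upload session: " + sessionId);
        }
        synchronized (buffer) {
            buffer.write(chunk, 0, chunk.length);
        }
    }

    /** Ends the session and returns the bytes collected for it. */
    byte[] complete(String sessionId) {
        ByteArrayOutputStream buffer = buffers.remove(sessionId);
        targets.remove(sessionId);
        if (buffer == null) {
            throw new IllegalArgumentException("Unknown upload session: " + sessionId);
        }
        synchronized (buffer) {
            return buffer.toByteArray();
        }
    }
}
```

Because chunks are keyed by session ID rather than by resource path, a second upload to the same path starts its own session instead of resetting the first one.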

@Marcel I'll document the use case on the wiki. Thanks for the pointer.

Best Regards
Ian


>
> Maybe just a boolean property convention that application developers
> are supposed to take into account, as I don't think JCR Sessions work
> in that use case.
>
> -Bertrand
>

Re: Seekable access to a Binary

Posted by Bertrand Delacretaz <bd...@apache.org>.
Hi,

On Tue, Sep 6, 2016 at 1:25 PM, Ian Boston <ie...@tfd.co.uk> wrote:
> On 6 September 2016 at 12:14, Marcel Reutegger <mr...@adobe.com> wrote:
>>... This can easily be prevented if the 'in progress' binary is
>> uploaded to a temporary location first and then copied over
>> to the correct location once complete....

> Is that exposed as an API that can be used by Sling ?...

I think Marcel just means uploading to a node that's not visible to
clients first, and then moving that node to its final destination, so
that the binary is not visible too early there.

-Bertrand

Re: Seekable access to a Binary

Posted by Marcel Reutegger <mr...@adobe.com>.
Hi,

On 06/09/16 13:25, Ian Boston wrote:
> On 6 September 2016 at 12:14, Marcel Reutegger <mr...@adobe.com> wrote:
>> This can easily be prevented if the 'in progress' binary is
>> uploaded to a temporary location first and then copied over
>> to the correct location once complete. Keep in mind that
>> copying a large existing binary in Oak is simply a cheap
>> copy of the reference.
>
> Is that exposed as an API that can be used by Sling ?

You would use standard JCR API:

node.setProperty("jcr:data", binaryFromTempLocation)
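
A toy model of why that assignment is cheap, assuming (as described above) that a property holds a reference to stored content rather than the bytes themselves. `ReferenceCopy`, its blob IDs and its methods are hypothetical and are not Oak or JCR API.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy repository: properties point at blob IDs, so copying a property copies a reference. */
final class ReferenceCopy {
    private final Map<String, byte[]> blobs = new HashMap<>();      // blob ID -> content
    private final Map<String, String> properties = new HashMap<>(); // property path -> blob ID
    private long stored = 0;
    private int nextId = 0;

    /** Stores the content once and points the property at it. */
    void upload(String propertyPath, byte[] content) {
        String id = "blob-" + nextId++;
        blobs.put(id, content.clone());
        stored += content.length;
        properties.put(propertyPath, id);
    }

    /** Copies only the reference to another property; no blob bytes are read or written. */
    void copyReference(String fromPath, String toPath) {
        String id = properties.get(fromPath);
        if (id == null) {
            throw new IllegalArgumentException("No binary at " + fromPath);
        }
        properties.put(toPath, id);
    }

    /** Returns the content a property points at, or null if the property is not set. */
    byte[] read(String propertyPath) {
        String id = properties.get(propertyPath);
        return id == null ? null : blobs.get(id).clone();
    }

    /** Total number of content bytes written to the toy store so far. */
    long bytesStored() {
        return stored;
    }
}
```

Uploading to a temporary path and then calling `copyReference` to the final path leaves `bytesStored()` unchanged, which is the property the temporary-location approach relies on.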

> For the Sling use case it might not be sufficient, as the chunks that are
> uploaded need to be joined together to form the binary. Streaming to the
> chunks is supported (in the patch I am working on), concatenating the
> chunks into a final binary requires copying in and out of the DS.

I see. Then you would rather need a factory method that combines
existing binary values instead. Maybe something like:

ValueFactory.createBinary(String name, Binary... parts)

> BTW, the Wiki page is immutable so I can't edit.

I can add you to the contributors group. What is your wiki
user name?

Regards
  Marcel

Re: Seekable access to a Binary

Posted by Ian Boston <ie...@tfd.co.uk>.
Hi,

On 6 September 2016 at 12:14, Marcel Reutegger <mr...@adobe.com> wrote:

> Hi,
>
> On 06/09/16 12:34, Bertrand Delacretaz wrote:
>
>> On Tue, Sep 6, 2016 at 9:49 AM, Marcel Reutegger <mr...@adobe.com>
>> wrote:
>>
>>> ...we'd still have to add
>>> Jackrabbit API to support it. E.g. something like:
>>>
>>> valueFactory.createBinary(existingBinary, appendThisInputStream); ...
>>>
>>
>> And maybe a way to mark the binary as "in progress" to avoid
>> applications using half-uploaded binaries?
>>
>
> This can easily be prevented if the 'in progress' binary is
> uploaded to a temporary location first and then copied over
> to the correct location once complete. Keep in mind that
> copying a large existing binary in Oak is simply a cheap
> copy of the reference.
>


Is that exposed as an API that can be used by Sling ?

For the Sling use case it might not be sufficient, as the chunks that are
uploaded need to be joined together to form the binary. Streaming to the
chunks is supported (in the patch I am working on), concatenating the
chunks into a final binary requires copying in and out of the DS.

BTW, the Wiki page is immutable so I can't edit.
This use case is encapsulated in the documentation of the feature,
https://cwiki.apache.org/confluence/display/SLING/Chunked+File+Upload+Support
There are references in there to other implementations, including an AWS S3
API.
The Google Drive API is documented here
https://developers.google.com/drive/v3/web/manage-uploads
They all do a similar thing.

Best Regards
Ian




>
> Regards
>  Marcel
>

Re: Seekable access to a Binary

Posted by Ian Boston <ie...@tfd.co.uk>.
Hi,

On 7 September 2016 at 12:26, Marcel Reutegger <mr...@adobe.com> wrote:

> On 07/09/16 12:23, Julian Reschke wrote:
>
>> Maybe we could have a Oak specific InputStream implementation that wraps
>> a series of existing Binary implementations, and which Oak, when writing
>> a new binary, could leverage? (by not actually reading the binaries, but
>> just copying references around...)
>>
>
> or introduce a new Binary class that wraps existing Binary objects...
>

List<Binary> binaryList = new ArrayList<>();
populateBinaryList(binaryList, "jcr:data", listOfResources);
node.setProperty("jcr:data", new WrappedBinary(binaryList));

where WrappedBinary implements JCR Binary and is something Oak understands
how to deal with efficiently.
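
A sketch of what such a wrapper could look like, under stated assumptions: `SimpleBinary` is a stand-in for the stream and size methods of the JCR Binary interface, `BytesBinary` stands in for a chunk already held by the repository, and `WrappedBinarySketch` is a hypothetical name. Nothing here is existing Oak API, and the "understands how to deal with efficiently" part is only hinted at by `getParts()`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Stand-in for the stream and size methods of a JCR Binary. */
interface SimpleBinary {
    InputStream getStream();
    long getSize();
}

/** In-memory part, standing in for a chunk already stored in the repository. */
final class BytesBinary implements SimpleBinary {
    private final byte[] data;
    BytesBinary(byte[] data) { this.data = data.clone(); }
    public InputStream getStream() { return new ByteArrayInputStream(data); }
    public long getSize() { return data.length; }
}

/** Presents an ordered list of parts as a single binary without copying them up front. */
final class WrappedBinarySketch implements SimpleBinary {
    private final List<SimpleBinary> parts;

    WrappedBinarySketch(List<? extends SimpleBinary> parts) {
        this.parts = new ArrayList<>(parts);
    }

    /** Exposes the parts so a store that recognises this type could copy references instead. */
    List<SimpleBinary> getParts() {
        return Collections.unmodifiableList(parts);
    }

    /** Opens every part's stream and chains them in order. */
    public InputStream getStream() {
        List<InputStream> streams = new ArrayList<>();
        for (SimpleBinary part : parts) {
            streams.add(part.getStream());
        }
        return new SequenceInputStream(Collections.enumeration(streams));
    }

    public long getSize() {
        long total = 0;
        for (SimpleBinary part : parts) {
            total += part.getSize();
        }
        return total;
    }

    /** Drains a binary's stream into a byte array; used to check the sketch. */
    static byte[] readFully(SimpleBinary binary) {
        try (InputStream in = binary.getStream()) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A consumer that does not know about the wrapper still sees one continuous stream, while a store that does know about it could walk `getParts()` and avoid reading any bytes.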

I think that works for me.
I will need to make some changes to the Sling Resource API as I am binding
to that and talking in terms of InputStreams rather than Binary objects.

Best Regards
Ian





>
> Regards
>  Marcel
>

Re: Seekable access to a Binary

Posted by Marcel Reutegger <mr...@adobe.com>.
On 07/09/16 12:23, Julian Reschke wrote:
> Maybe we could have a Oak specific InputStream implementation that wraps
> a series of existing Binary implementations, and which Oak, when writing
> a new binary, could leverage? (by not actually reading the binaries, but
> just copying references around...)

or introduce a new Binary class that wraps existing Binary objects...

Regards
  Marcel

Re: Seekable access to a Binary

Posted by Julian Reschke <ju...@gmx.de>.
On 2016-09-07 11:06, Michael Marth wrote:
> Hi,
>
> I believe Oak has no notion of requests - the 1-1 binding of a request to a session is done in Sling.
> However, having said that: I was not aware of all the complexities you mention. To add one more: probably the design would have to encounter for different clustered Sling instances (that share 1 repository) that receive chunks belonging to the same binary. Is that right?
>
> Afaik branches are not exposed into userland, but are an implementation detail.  When I made my comment below, I did not realize that in order for this to work branches would have exposed. I am not sure if that's a good idea. Also not sure if it would even solve the problem.
> Maybe a better approach could be to persist the chunks in a temp space, similar to what Marcel suggested. But maybe that temp space could be a functionality of the datastore (I believe Marcel suggested to create a temp location by the user itself via the JCR API)
>
> Michael
>
> Sent from a mobile device


Maybe we could have an Oak-specific InputStream implementation that wraps
a series of existing Binary implementations, and which Oak, when writing
a new binary, could leverage? (by not actually reading the binaries, but
just copying references around...)

Best regards, Julian

Re: Seekable access to a Binary

Posted by Michael Marth <mm...@adobe.com>.
Hi,

I believe Oak has no notion of requests - the 1-1 binding of a request to a session is done in Sling.
However, having said that: I was not aware of all the complexities you mention. To add one more: the design would probably have to account for different clustered Sling instances (that share 1 repository) receiving chunks that belong to the same binary. Is that right?

AFAIK branches are not exposed into userland, but are an implementation detail. When I made my comment below, I did not realize that in order for this to work branches would have to be exposed. I am not sure if that's a good idea. Also not sure if it would even solve the problem.
Maybe a better approach could be to persist the chunks in a temp space, similar to what Marcel suggested. But maybe that temp space could be a functionality of the datastore (I believe Marcel suggested that the user create a temp location themselves via the JCR API).
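
A minimal sketch of the bookkeeping such a temp space would need, assuming every chunk arrives with its byte offset and the total length is known up front. The names are hypothetical and the state lives in a single JVM; in a clustered setup the same bookkeeping would have to sit in storage shared by all instances, which this sketch does not attempt.

```java
import java.io.ByteArrayOutputStream;
import java.util.Map;
import java.util.TreeMap;

/** Toy assembler: chunks are recorded by byte offset and may arrive in any order. */
final class ChunkAssembler {
    private final long totalLength;
    private final TreeMap<Long, byte[]> chunksByOffset = new TreeMap<>();

    ChunkAssembler(long totalLength) {
        this.totalLength = totalLength;
    }

    /** Records a chunk at its offset; a chunk re-sent at the same offset replaces the old one. */
    synchronized void addChunk(long offset, byte[] chunk) {
        if (offset < 0 || offset + chunk.length > totalLength) {
            throw new IllegalArgumentException("Chunk outside of binary: offset " + offset);
        }
        chunksByOffset.put(offset, chunk.clone());
    }

    /** True when the chunks cover the range from 0 to totalLength with no gaps or overlaps. */
    synchronized boolean isComplete() {
        long expected = 0;
        for (Map.Entry<Long, byte[]> entry : chunksByOffset.entrySet()) {
            if (entry.getKey() != expected) {
                return false;
            }
            expected += entry.getValue().length;
        }
        return expected == totalLength;
    }

    /** Joins the chunks in offset order; fails if the upload is not complete yet. */
    synchronized byte[] assemble() {
        if (!isComplete()) {
            throw new IllegalStateException("Upload is incomplete");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] chunk : chunksByOffset.values()) {
            out.write(chunk, 0, chunk.length);
        }
        return out.toByteArray();
    }
}
```

Only once `isComplete()` returns true would the assembled binary be attached to its final location, so readers never see a partial upload.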

Michael

Sent from a mobile device

_____________________________
From: Ian Boston <ie...@tfd.co.uk>
Sent: Wednesday, September 7, 2016 9:36 AM
Subject: Re: Seekable access to a Binary
To: <oa...@jackrabbit.apache.org>


Hi,

On 6 September 2016 at 18:00, Michael Marth <mm...@adobe.com> wrote:

> Hi,
>
> I think it would be neat if we could utilize our existing mechanism rather
> than a new flag. In particular, MVCC and branches for session isolation.
> And also simply use session.save() to indicate that an upload is complete
> (and the branch containing the binaries/chunks can be merged).
>

Do branches and sessions hang around between requests ?

Each body part will come from different requests, sometimes separated by
hours and possibly even from different source IP addresses, especially
under upload restart conditions. At present, in streaming mode, as each
body part is encountered a session.save is performed to cause JCR/Oak to
read that input stream from the request, since JCR does not expose anything
that can be used to write binary data to the repository.

Best Regards
Ian



>
> Michael
>
> Sent from a mobile device
>
>
>
>
> On Tue, Sep 6, 2016 at 1:15 PM +0200, "Marcel Reutegger" <
> mreutegg@adobe.com<ma...@adobe.com>> wrote:
>
> Hi,
>
> On 06/09/16 12:34, Bertrand Delacretaz wrote:
> > On Tue, Sep 6, 2016 at 9:49 AM, Marcel Reutegger <mr...@adobe.com>>
> wrote:
> >> ...we'd still have to add
> >> Jackrabbit API to support it. E.g. something like:
> >>
> >> valueFactory.createBinary(existingBinary, appendThisInputStream); ...
> >
> > And maybe a way to mark the binary as "in progress" to avoid
> > applications using half-uploaded binaries?
>
> This can easily be prevented if the 'in progress' binary is
> uploaded to a temporary location first and then copied over
> to the correct location once complete. Keep in mind that
> copying a large existing binary in Oak is simply a cheap
> copy of the reference.
>
> Regards
> Marcel
>



Re: Seekable access to a Binary

Posted by Ian Boston <ie...@tfd.co.uk>.
Hi,

On 6 September 2016 at 18:00, Michael Marth <mm...@adobe.com> wrote:

> Hi,
>
> I think it would be neat if we could utilize our existing mechanism rather
> than a new flag. In particular, MVCC and branches for session isolation.
> And also simply use session.save() to indicate that an upload is complete
> (and the branch containing the binaries/chunks can be merged).
>

Do branches and sessions hang around between requests ?

Each body part will come from different requests, sometimes separated by
hours and possibly even from different source IP addresses, especially
under upload restart conditions. At present, in streaming mode, as each
body part is encountered a session.save is performed to cause JCR/Oak to
read that input stream from the request, since JCR does not expose anything
that can be used to write binary data to the repository.

Best Regards
Ian



>
> Michael
>
> Sent from a mobile device
>
>
>
>
> On Tue, Sep 6, 2016 at 1:15 PM +0200, "Marcel Reutegger" <
> mreutegg@adobe.com<ma...@adobe.com>> wrote:
>
> Hi,
>
> On 06/09/16 12:34, Bertrand Delacretaz wrote:
> > On Tue, Sep 6, 2016 at 9:49 AM, Marcel Reutegger <mr...@adobe.com>
> wrote:
> >> ...we'd still have to add
> >> Jackrabbit API to support it. E.g. something like:
> >>
> >> valueFactory.createBinary(existingBinary, appendThisInputStream); ...
> >
> > And maybe a way to mark the binary as "in progress" to avoid
> > applications using half-uploaded binaries?
>
> This can easily be prevented if the 'in progress' binary is
> uploaded to a temporary location first and then copied over
> to the correct location once complete. Keep in mind that
> copying a large existing binary in Oak is simply a cheap
> copy of the reference.
>
> Regards
>   Marcel
>

Re: Seekable access to a Binary

Posted by Michael Marth <mm...@adobe.com>.
Hi,

I think it would be neat if we could utilize our existing mechanism rather than a new flag. In particular, MVCC and branches for session isolation. And also simply use session.save() to indicate that an upload is complete (and the branch containing the binaries/chunks can be merged).

Michael

Sent from a mobile device




On Tue, Sep 6, 2016 at 1:15 PM +0200, "Marcel Reutegger" <mr...@adobe.com> wrote:

Hi,

On 06/09/16 12:34, Bertrand Delacretaz wrote:
> On Tue, Sep 6, 2016 at 9:49 AM, Marcel Reutegger <mr...@adobe.com> wrote:
>> ...we'd still have to add
>> Jackrabbit API to support it. E.g. something like:
>>
>> valueFactory.createBinary(existingBinary, appendThisInputStream); ...
>
> And maybe a way to mark the binary as "in progress" to avoid
> applications using half-uploaded binaries?

This can easily be prevented if the 'in progress' binary is
uploaded to a temporary location first and then copied over
to the correct location once complete. Keep in mind that
copying a large existing binary in Oak is simply a cheap
copy of the reference.

Regards
  Marcel

Re: Seekable access to a Binary

Posted by Marcel Reutegger <mr...@adobe.com>.
Hi,

On 06/09/16 12:34, Bertrand Delacretaz wrote:
> On Tue, Sep 6, 2016 at 9:49 AM, Marcel Reutegger <mr...@adobe.com> wrote:
>> ...we'd still have to add
>> Jackrabbit API to support it. E.g. something like:
>>
>> valueFactory.createBinary(existingBinary, appendThisInputStream); ...
>
> And maybe a way to mark the binary as "in progress" to avoid
> applications using half-uploaded binaries?

This can easily be prevented if the 'in progress' binary is
uploaded to a temporary location first and then copied over
to the correct location once complete. Keep in mind that
copying a large existing binary in Oak is simply a cheap
copy of the reference.

Regards
  Marcel

Re: Seekable access to a Binary

Posted by Bertrand Delacretaz <bd...@apache.org>.
Hi,

On Tue, Sep 6, 2016 at 9:49 AM, Marcel Reutegger <mr...@adobe.com> wrote:
> ...we'd still have to add
> Jackrabbit API to support it. E.g. something like:
>
> valueFactory.createBinary(existingBinary, appendThisInputStream); ...

And maybe a way to mark the binary as "in progress" to avoid
applications using half-uploaded binaries?

Maybe just a boolean property convention that application developers
are supposed to take into account, as I don't think JCR Sessions work
in that use case.
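
A small sketch of how such a convention might be honoured by a reader, assuming nodes are modelled as plain property maps. The property name `uploadInProgress` is invented for illustration; the thread does not propose a name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Toy reader that honours a boolean "in progress" property convention. */
final class InProgressConvention {
    /** Hypothetical property name, chosen for this sketch only. */
    static final String IN_PROGRESS = "uploadInProgress";

    /** Returns the paths of nodes that are not flagged as being uploaded. */
    static List<String> readablePaths(Map<String, Map<String, Object>> nodes) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Map<String, Object>> entry : nodes.entrySet()) {
            // A missing or false flag means the binary is treated as complete.
            if (!Boolean.TRUE.equals(entry.getValue().get(IN_PROGRESS))) {
                result.add(entry.getKey());
            }
        }
        return result;
    }
}
```

The weakness of a pure convention shows up here: the filter only helps if every application that reads binaries applies it.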

-Bertrand

Re: Seekable access to a Binary

Posted by Marcel Reutegger <mr...@adobe.com>.
Hi Ian,

On 05/09/16 15:21, Ian Boston wrote:
> Is it possible to write to an Oak Binary via the JCR API at an offset ?

No, this is not possible with the JCR API, nor with the current Jackrabbit
API extensions.

There is a wiki page with use cases that are currently not well
supported by JCR/Oak:

https://wiki.apache.org/jackrabbit/JCR%20Binary%20Usecase

Your request is similar to UC7 and UC10 and could be supported with
a repository backed by a BlobStore, but we'd still have to add
Jackrabbit API to support it. E.g. something like:

valueFactory.createBinary(existingBinary, appendThisInputStream);
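
A toy model of how such an append could avoid rewriting existing content, assuming a binary is represented as an immutable list of block IDs in a block-based store. `BlockStoreSketch` and both `createBinary` overloads are hypothetical; this is not the Oak BlobStore interface, only an illustration of the idea.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Toy block store: a binary is an immutable list of block IDs. */
final class BlockStoreSketch {
    private final List<byte[]> blocks = new ArrayList<>(); // list index = block ID
    private final int blockSize;
    private long written = 0;

    BlockStoreSketch(int blockSize) {
        this.blockSize = blockSize;
    }

    /** Reads the stream into new blocks and returns their IDs in order. */
    private List<Integer> writeBlocks(InputStream in) {
        List<Integer> ids = new ArrayList<>();
        byte[] buffer = new byte[blockSize];
        try {
            int read;
            while ((read = in.read(buffer)) > 0) {
                blocks.add(Arrays.copyOf(buffer, read));
                written += read;
                ids.add(blocks.size() - 1);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return ids;
    }

    /** Creates a binary from a stream. */
    List<Integer> createBinary(InputStream in) {
        return Collections.unmodifiableList(writeBlocks(in));
    }

    /** Hypothetical append: reuses the existing block IDs and writes only the new data. */
    List<Integer> createBinary(List<Integer> existing, InputStream append) {
        List<Integer> ids = new ArrayList<>(existing);
        ids.addAll(writeBlocks(append));
        return Collections.unmodifiableList(ids);
    }

    /** Reassembles a binary's content from its blocks. */
    byte[] read(List<Integer> binary) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int id : binary) {
            byte[] block = blocks.get(id);
            out.write(block, 0, block.length);
        }
        return out.toByteArray();
    }

    /** Total number of bytes written to the toy store so far. */
    long bytesWritten() {
        return written;
    }
}
```

The original binary stays valid after the append, because the new binary is a separate ID list that shares the old blocks.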

Feel free to update the wiki page with references to the work
done in Sling. This will help in designing the API extensions on
our side.

Regards
  Marcel