
Posted to dev@jackrabbit.apache.org by Felix Meschberger <fm...@adobe.com> on 2013/03/12 12:32:01 UTC

Getting a value by its data identifier

Hi all

We have a couple of use cases where we would like to leverage the global data store to avoid sending and copying large binary data unnecessarily. We have two separate Jackrabbit instances configured to use the same DataStore (for the sake of this discussion, assume we have the problems of concurrent access and garbage collection under control). When sending content from one instance to the other, we don't want to send potentially large binary data (e.g. video files) unless it is needed.

The idea is for the sender to just send the content identity from JackrabbitValue.getContentIdentity(). The receiver would then check whether such content already exists and, if so, reuse it:

  String ci = contentIdentity_from_sender;
  try {
    Value v = session.getValueByContentIdentity(ci);
    Property p = targetNode.setProperty(propName, v);
  } catch (ItemNotFoundException ie) {
    // unknown or invalid content Identity
  } catch (RepositoryException re) {
    // some other exception
  }

Thus the proposed JackrabbitSession.getValueByContentIdentity(String) method would allow round-tripping the value of JackrabbitValue.getContentIdentity(), preventing superfluous copying and moving of binary data.
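
To make the intended round trip concrete, here is a minimal, self-contained sketch. It does not use the Jackrabbit API: SharedStoreSketch and its methods are hypothetical stand-ins for a DataStore shared by both instances, and it assumes the content identity is a hex-encoded SHA-1 of the bytes.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;
import java.util.NoSuchElementException;

// Hypothetical stand-in for a content-addressed store shared by two instances.
class SharedStoreSketch {
    private final Map<String, byte[]> records = new HashMap<>();

    // Stores the bytes at most once and returns their content identity.
    String put(byte[] data) {
        String id = sha1Hex(data);
        records.putIfAbsent(id, data.clone());
        return id;
    }

    // Resolves an identity back to the stored bytes; unknown ids are "not found".
    byte[] getByContentIdentity(String id) {
        byte[] data = records.get(id);
        if (data == null) {
            throw new NoSuchElementException("unknown content identity: " + id);
        }
        return data.clone();
    }

    int size() {
        return records.size();
    }

    // Assumption for this sketch: the identity is the hex SHA-1 of the content.
    static String sha1Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(data)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }
}
```

The sender would only transmit the string returned by put(); the receiver calls getByContentIdentity() and falls back to a regular binary transfer if the lookup fails.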

Questions:

(a) Would such a method technically be possible (preventing actual large binary data copy !) ?
(b) Would a patch be accepted ?
(c) Can we and if yes, how can we control access ?
(d) What else ?

Regards
Felix

--
Felix Meschberger | Principal Scientist | Adobe

Re: Getting a value by its data identifier

Posted by Felix Meschberger <fm...@adobe.com>.

Hi

I created https://issues.apache.org/jira/browse/JCR-3534 with a patch implementing the proposed method along with a unit test validating round tripping.

Regards
Felix

--
Felix Meschberger | Principal Scientist | Adobe

Re: Getting a value by its data identifier

Posted by Alexander Klimetschek <ak...@adobe.com>.

On 15.03.2013, at 09:01, Felix Meschberger <fm...@adobe.com> wrote:

> I would prefer a SecurityException, but JCR has a notion of "no access looks the same as non-existing",

Right, it should not be possible to find out if something exists if you don't have the permission!

> so an ItemNotFoundException would probably be thrown in this case (due to JCR throwing an exception if something does not exist instead of just returning null).

I always found that a bad design: returning "null" for not found allows much more readable code, since it is an expected case and not the unexpected failure case that exceptions are designed for. Not sure if a custom Jackrabbit API extension has to follow the same design...

Cheers,
Alex

Re: Getting a value by its data identifier

Posted by Felix Meschberger <fm...@adobe.com>.

Hi,

On 12.03.2013, at 15:02, Alexander Klimetschek wrote:

> On 12.03.2013, at 12:32, Felix Meschberger <fm...@adobe.com> wrote:
> 
>> Thus the proposed JackrabbitSession.getValueByContentIdentity(String) method would allow for round tripping the JackrabbitValue.getContentIdentity() preventing superfluous binary data copying and moving.
> 
> The idea sounds good to me :-) (Disclaimer: discussed this with Felix f2f before)
> 
>> Questions:
>> 
>> (c) Can we and if yes, how can we control access ?
> 
> It's a bit tricky, and I think the best way to do it is:
> - by default no access at all (getValueByContentIdentity() returns null aka not found)

I would prefer a SecurityException, but JCR has a notion of "no access looks the same as non-existing", so an ItemNotFoundException would probably be thrown in this case (due to JCR throwing an exception if something does not exist instead of just returning null).

> - have a special privilege for this feature, that you only want to enable for users that need this feature
> - because such a repository-wide optimization feature generally does require a user with wide permissions

+1

We could use a repository-level permission like the one we have for workspace creation.

> - nice to have: avoid that the content ID is a hash of the binary, so that an attacker (who already got the above privilege) still cannot infer existence of a binary he knows; but then he might have enough read & write access already, as a user with that permission is likely to have broad rights, since copying things over from one instance to another requires that

We don't do such "security by obscurity" things for regular path and node ID access, so we might not want to try it here. Rather, we should provide proper access control for the lookup.

> 
>> (d) What else ?
> 
> This is practically only about Binaries and the FileDataStore, but the JackrabbitValue.getContentIdentity() is generic across all value types. If there might be such a store for other properties in the future, the content id must uniquely identify that store (e.g. value type) as well.

I would expect such a content identity to be "globally unique" and internally handled by the repository such that round-tripping between getContentIdentity and getValueByContentIdentity can be guaranteed (provided access control allows for it).

Regards
Felix

> 
> Cheers,
> Alex
> 
> 

--
Felix Meschberger | Principal Scientist | Adobe

Re: Getting a value by its data identifier

Posted by Alexander Klimetschek <ak...@adobe.com>.

On 12.03.2013, at 12:32, Felix Meschberger <fm...@adobe.com> wrote:

> Thus the proposed JackrabbitSession.getValueByContentIdentity(String) method would allow for round tripping the JackrabbitValue.getContentIdentity() preventing superfluous binary data copying and moving.

The idea sounds good to me :-) (Disclaimer: discussed this with Felix f2f before)

> Questions:
> 
> (c) Can we and if yes, how can we control access ?

It's a bit tricky, and I think the best way to do it is:
- by default no access at all (getValueByContentIdentity() returns null aka not found)
- have a special privilege for this feature, that you only want to enable for users that need this feature
- because such a repository-wide optimization feature generally does require a user with wide permissions
- nice to have: avoid that the content ID is a hash of the binary, so that an attacker (who already got the above privilege) still cannot infer existence of a binary he knows; but then he might have enough read & write access already, as a user with that permission is likely to have broad rights, since copying things over from one instance to another requires that

> (d) What else ?

This is practically only about Binaries and the FileDataStore, but the JackrabbitValue.getContentIdentity() is generic across all value types. If there might be such a store for other properties in the future, the content id must uniquely identify that store (e.g. value type) as well.

Cheers,
Alex

Re: Getting a value by its data identifier

Posted by Felix Meschberger <fm...@adobe.com>.

Hi,

I think there really are two sides to the story:

(a) getting an ID
(b) getting the data for that ID

We may or may not be able -- on a large scale -- to prevent (a). After all, "getting an ID" might just be the result of wild guessing or a brute-force attack.

We have to be able to limit (b): While restricting to "admin" sessions might be an option, I think that is not the right way to do it. I tend to agree with AlexK that a permission might be the way to do it. The problematic thing really is that permission checking is hooked to a repository path (and thus related to an Item), whereas here we don't have an item: The DataStore BLOB does not know where it belongs -- and in a shared DataStore setup, there might not even be an "owner" property.

In short: forget about (a). For (b), use a custom permission on / to grant access to the new method (denied by default, of course).
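
A rough sketch of what "denied by default" could look like, independent of the real JCR access control API. GuardedLookupSketch, grantLookup and the user ids are hypothetical names for illustration only; the one point it tries to show is that a caller without the permission gets exactly the same "not found" outcome as for a missing identity.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.Set;

// Hypothetical lookup guarded by a single repository-wide permission.
class GuardedLookupSketch {
    private final Map<String, byte[]> records = new HashMap<>();
    private final Set<String> grantedUsers = new HashSet<>(); // empty: denied by default

    void store(String contentIdentity, byte[] data) {
        records.put(contentIdentity, data.clone());
    }

    // Stands in for granting the custom permission on "/" to a user.
    void grantLookup(String userId) {
        grantedUsers.add(userId);
    }

    byte[] getByContentIdentity(String userId, String contentIdentity) {
        // Only consult the store if the caller holds the permission.
        byte[] data = grantedUsers.contains(userId) ? records.get(contentIdentity) : null;
        if (data == null) {
            // Same exception and message for "no access" and "does not exist",
            // so the caller cannot tell the two cases apart.
            throw new NoSuchElementException("no such content identity");
        }
        return data.clone();
    }
}
```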

Regards
Felix

On 12.03.2013, at 16:09, Thomas Mueller wrote:

> Hi,
> 
>> (a) Would such a method technically be possible (preventing actual large
>> binary data copy !) ?
> 
> Yes I think it's possible. Would this be needed for Oak or Jackrabbit 2.x
> or both?
> 
>> (c) Can we and if yes, how can we control access ?
> 
> Currently the content identifier is the content hash (SHA-1), so there is
> no risk of "enumeration" or "scanning" attack (not sure what is the right
> word for this - where the attacker blindly tries out many possible ids in
> the hope to find one).
> 
> One risk is that an attacker can "prove" a certain document is stored in
> the repository, where the attacker already has the document or at least
> knows the hash code. For example he could prove the "wikileaks file x" is
> stored in the repository, which might be a problem if possession of the
> "wikileaks file x" is illegal. Not sure if we need protection against
> that; if yes, we might only allow this method to be called for admin
> sessions or so.
> 
> Another risk is that an attacker that has a list of identifiers might be
> able to get the documents in that way, if they are stored in the
> repository. The question is how did the attacker get the identifier, but
> if it's a simple SHA-1 it might be a bigger risk. One way to protect
> against that might be to encrypt the SHA-1 hash code with a
> repository-wide, configurable "private key" or so.
> 
> Regards,
> Thomas
> 


--
Felix Meschberger | Principal Scientist | Adobe

Re: Getting a value by its data identifier

Posted by Thomas Mueller <mu...@adobe.com>.

Hi,

>(a) Would such a method technically be possible (preventing actual large
>binary data copy !) ?

Yes I think it's possible. Would this be needed for Oak or Jackrabbit 2.x
or both?

>(c) Can we and if yes, how can we control access ?

Currently the content identifier is the content hash (SHA-1), so there is
no risk of "enumeration" or "scanning" attack (not sure what is the right
word for this - where the attacker blindly tries out many possible ids in
the hope to find one).

One risk is that an attacker can "prove" a certain document is stored in
the repository, where the attacker already has the document or at least
knows the hash code. For example he could prove the "wikileaks file x" is
stored in the repository, which might be a problem if possession of the
"wikileaks file x" is illegal. Not sure if we need protection against
that; if yes, we might only allow this method to be called for admin
sessions or so.

Another risk is that an attacker that has a list of identifiers might be
able to get the documents in that way, if they are stored in the
repository. The question is how did the attacker get the identifier, but
if it's a simple SHA-1 it might be a bigger risk. One way to protect
against that might be to encrypt the SHA-1 hash code with a
repository-wide, configurable "private key" or so.
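
One possible way to sketch this, using a keyed hash (HMAC) rather than actual encryption: the identifier handed out is derived from the SHA-1 and a repository-wide secret, so knowing a document's plain SHA-1 is not enough to construct a valid identifier. KeyedIdentitySketch is a hypothetical name, and since an HMAC cannot be reversed, this variant assumes the repository keeps its own mapping from the keyed identifier back to the record.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical helper: derives an external identifier from the internal SHA-1.
class KeyedIdentitySketch {
    static String keyedIdentity(byte[] repositorySecret, String sha1Hex) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(repositorySecret, "HmacSHA256"));
            byte[] tag = mac.doFinal(sha1Hex.getBytes(StandardCharsets.US_ASCII));
            StringBuilder sb = new StringBuilder();
            for (byte b : tag) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("HMAC not available", e);
        }
    }
}
```

Two instances sharing a DataStore would need the same secret for the identifiers to round-trip between them.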

Regards,
Thomas