Posted to users@jackrabbit.apache.org by Lei Zhou <Le...@pointalliance.com> on 2006/11/21 23:16:23 UTC

Question on jcr:deref usage

Hi,

I'd like to build a custom query which is quite complicated. I'd like to 
use jcr:deref to achieve the SQL Join style query but am not sure if this 
is doable in Jackrabbit. Could anyone comment on the following use case? 

The objects: a Document has text properties and Category references.
The query: users need to search for documents by specifying any
combination of text values and/or category values.
The query result: users want a categorized result view containing
expandable/collapsible categories.

For example, a document may have text properties Subject, author, 
Description; and refers to one or more entries of category "Products". 
I'd like to be able to create something like the query below:

SELECT  Document.Subject, Document.Description, 
Document.ProductReference->categoryName
where Subject='Manual' AND description contains 'maintenance' AND 
Document.ProductReference->categoryName='Product #1'
order by Document.ProductReference->categoryName, Document.Subject

Is this possible in Jackrabbit with Xpath query? 

Thanks,

Re: Question on jcr:deref usage

Posted by Lei Zhou <Le...@pointalliance.com>.

Hi,

Not sure I understand what you meant by "extend JCR API" - I guess adding 
new features to JCR2.0?  Or adding new features to Jackrabbit? 

Simply put, it would be good to see added support for SELECT DISTINCT and
GROUP BY. I haven't tried a JOIN between primary node type tables in
Jackrabbit; it would be nice if that is supported.

For example, if one needs to create a content navigator (similar to the
JTree format) for search results, one way is to grab ALL data and use
application code to create the navigational structure. Another way is to
use one query (like SELECT DISTINCT and/or GROUP BY) to get the
navigation data (tree nodes), and use another query (when the user selects
one category) to get the requested details.


>Isn't this now more an implementation issue (how to do it) than a
>user issue (what features are required, what the API looks like)?

No. See below.

>AFAIK, Lucene (not the RDBMS) is used for queries (except for
>direct, relative Node access and references).

That's what I thought. Just wonder, wouldn't the RDBMS search/index engine 
be faster than an added layer of external code? This is part of the reason 
I feel it may be beneficial to expand the DB schema and not use Blobs - 
let the DB do the indexing and search - if I'm using DB repository.

>For me, the main reasons to use RDBMS are transaction support and speed.

So I have a file system index from Lucene, on top of another DB index for
Nodes...

>I know, many people think like that. On the other hand, if using Blobs
>is faster, and people don't want to access the data store directly,
>why not use Blobs?

Agreed. Like I said before, this is more of a "convincing others" issue. 
In business, it could be one of the factors that affect decision making by 
business users. (Aren't we IT people paid by business? :-)

Regards,
Lei

Re: Question on jcr:deref usage

Posted by Thomas Mueller <th...@gmail.com>.

Hi,

> So I wouldn't dare to say "here is how I
> think things should be done".

Maybe you have some ideas how to extend the JCR API to simplify using
it, or just what additional features are required.

> This approach also has a drawback: by treating local file systems and RDB
> systems the same way - same set of features, same indexing & searching API
> (??), a lot of good stuff from RDBMS is wasted.

Isn't this now more an implementation issue (how to do it) than a
user issue (what features are required, what the API looks like)?

> These features are very useful even in querying structural data model

AFAIK, Lucene (not the RDBMS) is used for queries (except for
direct, relative Node access and references).

> My personal experience is that production-level content management systems
> are more often implemented on an RDBMS than on a local file system.

For me, the main reasons to use RDBMS are transaction support and speed.

> 2.  When presenting architecture design to a business client (usually with
>     'some' knowledge of the IT systems/products), the first question would
>     be "is this a serious design? why are all the data in Blobs?". Although
>     we as developers know that there are good reasons for that, it may not
>     be easily conveyed to the client.

I know, many people think like that. On the other hand, if using Blobs
is faster, and people don't want to access the data store directly,
why not use Blobs?

Thomas

Re: Question on jcr:deref usage

Posted by Lei Zhou <Le...@pointalliance.com>.

Hi Thomas, 

Thanks for responding.

>I think there are two solutions: add the missing features to the JCR
>API, or provide the missing features in some other way. Do you have a
>suggestion how to extend the API to support the features you'd like to
>have (seems to be: aggregation, join, ordering)?

I came to JCR and Jackrabbit more from the perspective of an
integrator/end user, and have not had a chance to study the implementation
in great technical detail. So I wouldn't dare to say "here is how I
think things should be done".
I could provide some thoughts & observations though.

Jackrabbit (or JCR) handles all kinds of possible persistence storage
(local files, DB, etc.) through a unified PersistenceManager interface. This
is great because it lets the product adapt easily to any usage scenario.

This approach also has a drawback: by treating local file systems and RDB
systems the same way - same set of features, same indexing & searching API
(??), a lot of good stuff from RDBMS is wasted. For example, SELECT
DISTINCT, JOIN, GROUP BY, triggers, stored procedures etc. Some may argue 
that JCR is about "Structural", not "Relational" data, why would we care? 
These features are very useful even in querying structural data model - my 
previous email discussed one use case as an example.

I'm not saying we should use all RDBMS features wherever they exist, because
there are compatibility & portability issues. But since we have already
provided a DB persistence manager and schema DDLs for several RDBMS, it
wouldn't hurt to extend that effort and do more with the 'native'
features of the supported RDBMS.

One idea is to have an "extended set" of features for RDBMS, that can be 
queried by Repository.getDescriptorKeys(). These features would support 
extended SQL capabilities like SELECT DISTINCT, JOIN, GROUP BY, and ORDER 
BY etc.
And I'm not proposing to completely "normalize" the DB schema, there is 
always a line between "better" and "extreme".

My personal experience is that production-level content management systems
are more often implemented on an RDBMS than on a local file system. If this
applies to most of the community (??), why would we restrict ourselves?

>If you want to integrate other products at the DB schema level, then
>the current schema may not be the best. However, I don't think it was
>the idea that other software accesses the schema of Jackrabbit
>directly.

As described above, I'm not trying to manipulate the repository at DB 
level. There are two reasons for me to raise that point: 

1.  For the same reasons as mentioned above and in previous emails, I feel
    it would be more beneficial for people who use an RDBMS for the
    repository - and I would bet that represents a good portion of
    JCR/Jackrabbit based applications.

2.  When presenting architecture design to a business client (usually with
    'some' knowledge of the IT systems/products), the first question would
    be "is this a serious design? why are all the data in Blobs?". Although
    we as developers know that there are good reasons for that, it may not
    be easily conveyed to the client.

Again, these are just personal observations and I'm not yet an expert in 
JCR/Jackrabbit. Any comments / corrections are appreciated.

Best regards,
Lei

From: "Thomas Mueller" <th...@gmail.com>
Sent: 11/24/06 03:53 AM
Please respond to: users@jackrabbit.apache.org
To: users@jackrabbit.apache.org
Subject: Re: Question on jcr:deref usage

Hi,

> So it seems that due to the limitation of JCR (no aggregation query
> support).

I think there are two solutions: add the missing features to the JCR
API, or provide the missing features in some other way. Do you have a
suggestion how to extend the API to support the features you'd like to
have (seems to be: aggregation, join, ordering)?

One option is to make the structured part of the JCR repository
accessible like a 'standard' SQL database. Existing (SQL based) report
generators could then be used as well. If you could access the data
stored in the repository through the JDBC API with the following SQL
query, would this provide the convenience you are looking for?

select m.uuid from manual m, product p, region r
where p.uuid = m.product and r.uuid = m.region
and p.name in ('TV', 'VCR', 'DVD')
and r.name in ('North America', 'Europe')
and p.availableFor in ('distributor', 'repairHouse')
order by r.name, p.name

My idea is to add support for 'jcr views' to my database
(http://www.h2database.com).

> #2. The RDBMS based repository, current DB schema is not very convincing
> for large enterprise level applications. A more normalized schema might
> help both performance and #1, but yes, more DB level code may be needed
> (for performance's sake) and that may limit the portability of the
> product.

If you want to integrate other products at the DB schema level, then
the current schema may not be the best. However, I don't think it was
the idea that other software accesses the schema of Jackrabbit
directly.

Thomas

Re: Question on jcr:deref usage

Posted by Marcel Reutegger <ma...@gmx.net>.

Hi Lei,

the exception message is indeed wrong. I've just fixed it (see:
http://issues.apache.org/jira/browse/JCR-646)

Thanks for reporting this issue.

regards
  marcel

Lei Zhou wrote:
> Hi,
> 
> Just found out that I have to quote the * char, and '%' indeed doesn't 
> work, although it is the one used for jcr:like.
> 
> //element(*, Document)[@Subject='Manual']/jcr:deref(@ProductReference,'*')
> 
> regards,
> Lei

Re: Question on jcr:deref usage

Posted by Lei Zhou <Le...@pointalliance.com>.

Hi,

Just found out that I have to quote the * char, and '%' indeed doesn't 
work, although it is the one used for jcr:like.

//element(*, Document)[@Subject='Manual']/jcr:deref(@ProductReference,'*')

regards,
Lei

>Thanks Marcel,
>I just tried the query below and got exceptions complaining about "second
>argument type for jcr:like".
>//element(*, Document)[@Subject = 'Manual']/jcr:deref(@ProductReference, *)
>Here is the error: javax.jcr.query.InvalidQueryException: Wrong second
>argument type for jcr:like
>I then replaced '*' with '%', which seems to be accepted, but did not
>return any result:
>//element(*, Document)[@Subject = 'Manual']/jcr:deref(@ProductReference, '%')
>Did I miss anything?

Re: Question on jcr:deref usage

Posted by Lei Zhou <Le...@pointalliance.com>.

Thanks Marcel,
I just tried the query below and got exceptions complaining about "second
argument type for jcr:like".

//element(*, Document)[@Subject = 'Manual']/jcr:deref(@ProductReference, *)

Here is the error: javax.jcr.query.InvalidQueryException: Wrong second
argument type for jcr:like

I then replaced '*' with '%', which seems to be accepted, but did not
return any result:

//element(*, Document)[@Subject = 'Manual']/jcr:deref(@ProductReference, '%')

Did I miss anything?

Thanks,
Lei

Re: Question on jcr:deref usage

Posted by Marcel Reutegger <ma...@gmx.net>.

Lei Zhou wrote:
> Thanks Marcel!
> 
> So it seems that due to the limitation of JCR (no aggregation query 
> support), it would be much slower to support this type of application than 
> RDBMS. 
> 
> Is that a correct assessment? 

An RDBMS certainly provides a wider range of operations through SQL than JCR
with the current set of XPath or SQL syntax. Depending on your needs, some of
the queries won't be possible in JCR, but others will simply become obsolete.
E.g. in JCR you don't have to execute a query to follow a reference; you
simply call the method Property.getNode().

> Also, to articulate, if I have to present users with a query result
> view that is categorized (or grouped) by ProductName, I'd have to do the
> following:
>
> 1. Run query #1
>    //element(*, Document)[@Subject = 'Manual' and
>    jcr:contains(@description, 'maintenance')]
>
> 2. Iterate through the entire RowIterator (may have thousands of
>    entries) and use Java code to create an aggregated collection of
>    ProductNames/ProductReference pairs (since JCR doesn't have this
>    type of query),
>
> 3. No "Order By" clause is used because the ProductReferences won't be
>    in the same order as the ProductNames; manual sorting is required in
>    the Java post-processing

The same can be achieved in one step:

//element(*, Document)[@Subject = 'Manual' and jcr:contains(@description, 
'maintenance')]/jcr:deref(@ProductReference, *) order by @ProductName

this will return an ordered list of product names which contain matches.
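
As reported elsewhere in this thread, Jackrabbit rejected the unquoted * as the second argument of jcr:deref and accepted the quoted form '*'. A minimal Java sketch of assembling the one-step query with the quoted name test follows. The class and method names are hypothetical, and doubling a single quote inside a string literal is an assumption taken from XPath literal syntax, not something stated in this thread.

```java
// Hypothetical helper, not part of Jackrabbit: it only builds the XPath statement text.
final class DerefQuerySketch {

    // Assumption: a single quote inside an XPath string literal is escaped by doubling it.
    static String quote(String value) {
        return "'" + value.replace("'", "''") + "'";
    }

    // Selects the nodes referenced by @ProductReference of the matching documents.
    // The name test is written as the quoted literal '*' (any node name).
    static String productsOfMatchingDocuments(String subject, String searchTerm) {
        return "//element(*, Document)[@Subject = " + quote(subject)
                + " and jcr:contains(@description, " + quote(searchTerm) + ")]"
                + "/jcr:deref(@ProductReference, '*') order by @ProductName";
    }

    public static void main(String[] args) {
        System.out.println(productsOfMatchingDocuments("Manual", "maintenance"));
    }
}
```

The resulting string would then be handed to QueryManager.createQuery(statement, Query.XPATH).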

> 4. Depending on which category has been selected by the user to expand,
>    run query #2, limiting results to that single product category:
>    (query #2)
>    //element(*, Document)[@Subject = 'Manual' and
>    jcr:contains(@description, 'maintenance') and
>    @ProductReference = '<uuid-of-Product-#1>']

Correct.

> 5. Again, product names have to be de-referenced manually, and ordering
>    has to be moved from the query to the Java post-processing

This step I don't understand. What's the purpose of this step and why is it 
needed? Isn't all information already available?

> I'm fairly new to JCR and Jackrabbit. I've found them very helpful in many
> aspects of managing content. But I do feel that certain improvements
> could make Jackrabbit a better choice for enterprise use.
>
> #1. In many years of enterprise application development, I've seen a
> lot of our content based applications in need of support for complicated
> search, e.g., search by arbitrary combination of document properties, and
> grouping of search results (it is not uncommon to see 2, even 3 levels of
> nested grouping).
>      -- Aggregations and Joins are definitely a big plus for querying a
>         complicated content model.

Such requirements are also discussed in the expert group of JSR 283. You can 
comment on the current spec and post enhancement wishes to jsr-283-comments@jcp.org.

> I've seen posts mentioning use of Node references to compensate the lack 
> of SQL Join, but what if I need to perform a search like below 
> (ProductNames, Regions and AvailableFors would most likely be categories 
> that are referenced by all documents): 
>     FIND all manuals
>     THAT (ProductName is 'TV' or 'VCR' or 'DVD') 
>          and (Region is 'North America' or 'Europe') 
>          and (AvailableFor is 'distributor' or 'repairHouse')
>      GROUP BY Region, ProductName

Such a query is certainly not possible with the current set of XPath or SQL in
JCR. You would have to break up the query into multiple queries, e.g. retrieve
the uuids for products with names 'TV', 'VCR' and 'DVD' and use those uuids in a
query. The same applies to Region and AvailableFor.
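
A rough Java sketch of the second half of that approach, assuming the uuids have already been collected by earlier queries: it combines them into one XPath predicate per category. The helper names and the RegionReference property are hypothetical (only ProductReference appears in this thread), and the uuid strings are placeholders.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: turns uuids found by earlier queries into one XPath statement.
final class ReferencePredicateSketch {

    // Builds "(@prop = 'u1' or @prop = 'u2')" for one reference property.
    static String anyOf(String property, List<String> uuids) {
        if (uuids.isEmpty()) {
            throw new IllegalArgumentException("at least one uuid is required");
        }
        List<String> terms = new ArrayList<String>();
        for (String uuid : uuids) {
            terms.add("@" + property + " = '" + uuid + "'");
        }
        return "(" + String.join(" or ", terms) + ")";
    }

    // Combines the per-category groups with "and".
    // RegionReference is an assumed property name used only for illustration.
    static String manualsQuery(List<String> productUuids, List<String> regionUuids) {
        return "//element(*, Document)[@Subject = 'Manual' and "
                + anyOf("ProductReference", productUuids) + " and "
                + anyOf("RegionReference", regionUuids) + "]";
    }

    public static void main(String[] args) {
        System.out.println(manualsQuery(
                Arrays.asList("<uuid-of-TV>", "<uuid-of-VCR>"),
                Arrays.asList("<uuid-of-Europe>")));
    }
}
```

Grouping by Region and ProductName would still have to happen in application code after the query returns.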

IMO XQuery would be a nice fit for those requirements.

> #2. The RDBMS based repository, current DB schema is not very convincing 
> for large enterprise level applications. A more normalized schema might 
> help both performance and #1, but yes, more DB level code may be needed 
> (for performance's sake) and that may limit the portability of the 
> product. 

I'm not sure that's really the case. Usually a normalized schema means less 
performance. There were attempts to create a persistence manager using a 
normalized schema, but in the end the currently used schema turned out to be the 
most practical one.

regards
  marcel

Re: Question on jcr:deref usage

Posted by Lei Zhou <Le...@pointalliance.com>.

Thanks Marcel!

So it seems that due to the limitation of JCR (no aggregation query 
support), it would be much slower to support this type of application than 
RDBMS. 

Is that a correct assessment? 

Also, to articulate, if I have to present users with a query result
view that is categorized (or grouped) by ProductName, I'd have to do the
following:

1. Run query #1
   //element(*, Document)[@Subject = 'Manual' and
   jcr:contains(@description, 'maintenance')]

2. Iterate through the entire RowIterator (may have thousands of
   entries) and use Java code to create an aggregated collection of
   ProductNames/ProductReference pairs (since JCR doesn't have this
   type of query),

3. No "Order By" clause is used because the ProductReferences won't be
   in the same order as the ProductNames; manual sorting is required in
   the Java post-processing

4. Depending on which category has been selected by the user to expand,
   run query #2, limiting results to that single product category:
   (query #2)
   //element(*, Document)[@Subject = 'Manual' and
   jcr:contains(@description, 'maintenance') and
   @ProductReference = '<uuid-of-Product-#1>']

5. Again, product names have to be de-referenced manually, and ordering
   has to be moved from the query to the Java post-processing
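
The Java post-processing in steps 2 and 3 could be sketched roughly as below. This is only an illustration under stated assumptions: each String[] stands in for one query row already reduced to a product name and a document subject, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical post-processing helper; each String[] stands in for one
// query row already reduced to {productName, subject}.
final class GroupByProductSketch {

    // Groups subjects under their product name. A TreeMap keeps the
    // product names sorted, which replaces the missing "order by".
    static Map<String, List<String>> groupByProduct(List<String[]> rows) {
        Map<String, List<String>> groups = new TreeMap<String, List<String>>();
        for (String[] row : rows) {
            List<String> subjects = groups.get(row[0]);
            if (subjects == null) {
                subjects = new ArrayList<String>();
                groups.put(row[0], subjects);
            }
            subjects.add(row[1]);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
                new String[] {"VCR", "Manual B"},
                new String[] {"DVD", "Manual A"},
                new String[] {"VCR", "Manual C"});
        System.out.println(groupByProduct(rows));
    }
}
```

The map keys give the expandable categories for the result view, and each value holds the documents shown when a category is expanded.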

I'm fairly new to JCR and Jackrabbit. I've found them very helpful in many
aspects of managing content. But I do feel that certain improvements
could make Jackrabbit a better choice for enterprise use.

#1. In many years of enterprise application development, I've seen a
lot of our content based applications in need of support for complicated
search, e.g., search by arbitrary combination of document properties, and
grouping of search results (it is not uncommon to see 2, even 3 levels of
nested grouping).
     -- Aggregations and Joins are definitely a big plus for querying a
        complicated content model.

I've seen posts mentioning use of Node references to compensate the lack 
of SQL Join, but what if I need to perform a search like below 
(ProductNames, Regions and AvailableFors would most likely be categories 
that are referenced by all documents): 
    FIND all manuals
    THAT (ProductName is 'TV' or 'VCR' or 'DVD') 
         and (Region is 'North America' or 'Europe') 
         and (AvailableFor is 'distributor' or 'repairHouse')
     GROUP BY Region, ProductName

#2. The RDBMS based repository, current DB schema is not very convincing 
for large enterprise level applications. A more normalized schema might 
help both performance and #1, but yes, more DB level code may be needed 
(for performance's sake) and that may limit the portability of the 
product. 

This is just my perspective from evaluating Jackrabbit, and I'd welcome any
comments or corrections of misunderstandings.

All in all, I understand that Jackrabbit is only a reference
implementation of JCR 1.0, and it is really a great product. I just hope it
can become even better and be as widely adopted as the Apache HTTP
server.

Best Regards,
Lei 

From: Marcel Reutegger <ma...@gmx.net>
Sent: 11/23/06 09:25 AM
Please respond to: users@jackrabbit.apache.org
To: users@jackrabbit.apache.org
Subject: Re: Question on jcr:deref usage

Lei Zhou wrote:
> SELECT  Document.Subject, Document.Description, 
> Document.ProductReference->categoryName
> where Subject='Manual' AND description contains 'maintenance' AND 
> Document.ProductReference->categoryName='Product #1'
> order by Document.ProductReference->categoryName, Document.Subject
> 
> Is this possible in Jackrabbit with Xpath query? 

No, not quite. The jcr:deref() function cannot be used in a predicate,
which would be required for that use case. Furthermore, the select clause
and order by clause may only contain property names.

The closest you can get is something like:

//element(*, Document)[@Subject = 'Manual' and jcr:contains(@description, 
'maintenance') and @ProductReference = '<uuid-of-Product-#1>'] order by 
@ProductReference, @Subject

and then you have to do some post processing. basically dereferencing the 
ProductReference to get the name of the product.
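
A minimal Java sketch of that post-processing for the general case where the result spans several products: the uuid-to-name map stands in for names that real code would read by dereferencing each reference (for example via Property.getNode()), and each String[] stands in for one query row. All class, method, and variable names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical post-processing helper, independent of the JCR API.
final class DerefPostProcessingSketch {

    // rows: {productUuid, subject}; productNames: uuid -> product name.
    // Assumes every uuid in rows has an entry in productNames.
    // Returns {productName, subject} pairs sorted by name, then subject,
    // because ordering by @ProductReference sorts by uuid, not by name.
    static List<String[]> resolveAndSort(List<String[]> rows,
                                         Map<String, String> productNames) {
        List<String[]> resolved = new ArrayList<String[]>();
        for (String[] row : rows) {
            resolved.add(new String[] {productNames.get(row[0]), row[1]});
        }
        Collections.sort(resolved, new Comparator<String[]>() {
            @Override
            public int compare(String[] a, String[] b) {
                int byName = a[0].compareTo(b[0]);
                return byName != 0 ? byName : a[1].compareTo(b[1]);
            }
        });
        return resolved;
    }
}
```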

regards
  marcel
