Posted to users@jackrabbit.apache.org by Boomah <ni...@gmail.com> on 2010/02/17 18:44:49 UTC

A couple of simple questions

Hi all.

I'm currently investigating Jackrabbit and am very new to it so please
forgive me if these questions are obvious.

Logically I have a "Contract" that has a bit of meta data (e.g. id) and
maybe a pdf file associated with it.

I set up a TransientRepository and added a node for each "Contract" with a
unique path. I then set a property on this node called id with the
associated string value of the id:

node.setProperty("id", "123")

I have a lot of these (about 200000) and when I do a search on the id using:

SELECT * FROM [nt:unstructured] WHERE id = '123'

it seems to take longer than I would expect.

1) So my first question is, is there an index on my id property by default?
If not, how do I add an index to it?

If the "Contract" has a pdf file associated with it, at the moment I'm just
adding a BinaryValue to the same node:

node.setProperty("pdfFile", new BinaryValue(pdfInputStream))

2) My next question is, does the pdf file get indexed such that I can search
for text inside it? If not how can I add it in such a way that it does? Once
it has been, what is the SQL2 to query for the string "test"?

Many thanks for any help anyone can provide.

Nick.

Re: A couple of simple questions

Posted by Alexander Klimetschek <ak...@day.com>.

On Wed, Feb 17, 2010 at 18:44, Boomah <ni...@gmail.com> wrote:
> Logically I have a "Contract" that has a bit of meta data (e.g. id) and
> maybe a pdf file associated with it.
>
> I set up a TransientRepository and added a node for each "Contract" with a
> unique path. I then set a property on this node called id with the
> associated string value of the id:
>
> node.setProperty("id", "123")
>
> I have a lot of these (about 200000) and when I do a search on the id using:
>
> SELECT * FROM [nt:unstructured] WHERE id = '123'
>
> it seems to take longer than I would expect.

How long does the query itself take, and how long does iterating over
the result nodes and working with them take? The latter could be slow if
you have a flat hierarchy, for which Jackrabbit isn't optimized. A
typical approach is to use something like date folders, e.g. 2010/02/03.
See also https://issues.apache.org/jira/browse/JCR-642
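
A minimal sketch of the date-folder idea, in case it helps. The class
name DateFolders, the "contracts" root and the use of the contract id as
the leaf node name are assumptions made for illustration; they are not
part of Jackrabbit.

```java
import java.time.LocalDate;
import java.util.Locale;

/** Hypothetical helper: builds a date-folder path for a contract node. */
final class DateFolders {
    private DateFolders() {}

    /**
     * Returns a relative path of the form contracts/yyyy/MM/dd/id.
     * The "contracts" root and the id as leaf name are assumptions.
     */
    static String pathFor(LocalDate date, String contractId) {
        // Locale.ROOT keeps the digits ASCII regardless of the default locale.
        return String.format(Locale.ROOT, "contracts/%04d/%02d/%02d/%s",
                date.getYear(), date.getMonthValue(), date.getDayOfMonth(),
                contractId);
    }
}
```

Each intermediate folder node has to exist before the leaf is added, for
example by walking the path segments and calling Node.addNode for any
segment that Node.hasNode reports as missing.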

> 1) So my first question is, is there an index on my id property by default?
> If not, how do I add an index to it?

All properties except binaries are indexed by default.

> If the "Contract" has a pdf file associated with it, at the moment I'm just
> adding a BinaryValue to the same node:
>
> node.setProperty("pdfFile", new BinaryValue(pdfInputStream))
>
> 2) My next question is, does the pdf file get indexed such that I can search
> for text inside it? If not how can I add it in such a way that it does? Once
> it has been, what is the SQL2 to query for the string "test"?

Binary properties are indexed if they are part of an nt:file, i.e. the
jcr:content/jcr:data property (when storing files in the repository,
you should always use the standard nt:file node type for that anyway;
it pays off for integrations). A range of built-in text extractors
(using Apache Tika) will first try to extract the text, which is then
full-text indexed. PDF is supported, using PDFBox.
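
For the SQL2 part of the question, a hedged sketch follows. It assumes
the PDF is stored as an nt:file whose jcr:content child is an nt:resource
carrying jcr:data and jcr:mimeType, and it only builds the statement
text. The class ContractQueries and its methods are hypothetical helpers,
and the quoting assumes apostrophes are escaped by doubling, as in SQL.

```java
/** Hypothetical helper: builds JCR-SQL2 statements for the contract example. */
final class ContractQueries {
    private ContractQueries() {}

    /** Wraps a value in single quotes, doubling any embedded apostrophes. */
    static String quote(String value) {
        return "'" + value.replace("'", "''") + "'";
    }

    /**
     * Full-text search over all indexed properties of nt:resource nodes,
     * i.e. the jcr:content child of an nt:file.
     * Only the string literal is escaped; full-text operators inside the
     * search expression are passed through unchanged.
     */
    static String fullText(String searchExpression) {
        return "SELECT * FROM [nt:resource] AS r WHERE CONTAINS(r.*, "
                + quote(searchExpression) + ")";
    }
}
```

The resulting statement can be handed to QueryManager.createQuery with
Query.JCR_SQL2 as the language. The rows returned are the jcr:content
nodes, so the nt:file itself is the parent of each result node.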

Regards,
Alex

-- 
Alexander Klimetschek
alexander.klimetschek@day.com

Re: A couple of simple questions

Posted by Bertrand Delacretaz <bd...@apache.org>.

Hi,

On Wed, Feb 17, 2010 at 6:44 PM, Boomah <ni...@gmail.com> wrote:
> ...I set up a TransientRepository and added a node for each "Contract" with a
> unique path. I then set a property on this node called id with the
> associated string value of the id:
>
> node.setProperty("id", "123")
>
> I have a lot of these (about 200000) and when I do a search on the id using:
>
> SELECT * FROM [nt:unstructured] WHERE id = '123'
>
> it seems to take longer than I would expect....

Can't you make the id part of the unique path of each contract?

That way you would use tree navigation instead of queries to find a
node, which is much faster.

Note that it's not recommended to have too many (> 10K) child nodes
for the same parent, so you'd need to break down your paths, storing
ID 123456 under /contracts/12/34/123456 for example.
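
A minimal sketch of that path scheme, assuming string IDs. The class name
ContractPaths is hypothetical, and left-padding short IDs with zeros (for
the folder names only) is an assumption made here so that an ID such as
"123" still gets two folder levels.

```java
/** Hypothetical helper: maps a contract ID to a two-level bucketed path. */
final class ContractPaths {
    private ContractPaths() {}

    /** Maps "123456" to "/contracts/12/34/123456". */
    static String pathFor(String id) {
        // Assumption: IDs shorter than four characters are left-padded
        // with zeros, but only to compute the two folder names.
        StringBuilder padded = new StringBuilder(id);
        while (padded.length() < 4) {
            padded.insert(0, '0');
        }
        return "/contracts/" + padded.substring(0, 2)
                + "/" + padded.substring(2, 4)
                + "/" + id; // the leaf keeps the original, unpadded ID
    }
}
```

With numeric IDs of up to six digits, every folder then holds on the
order of a hundred children, well below the 10K mark. A lookup becomes a
direct Session.getNode call on the computed path instead of a query.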

-Bertrand