Posted to users@jackrabbit.apache.org by Sergio <ti...@gmail.com> on 2008/06/10 13:43:49 UTC

Searching inside binary contents and other queries

Hi,

I am new to JCR technology and I have a couple of questions I would like to
ask.
I am working on a project where I need to store some document files (PDFs,
DOCs, XMLs and text files). The complete datamodel will be stored in a
RDBMS (Oracle). However, there's the need for searching inside those
documents efficiently. That's where I think Jackrabbit will come to the
rescue.

I have been reading the docs and wikis in Jackrabbit's site for about a week
now. I understand some of the basics, but I feel lost most of the time. For
instance:

1) As our database will be holding most of the data, I thought about the
following schema: storing the documents inside BLOBs in the database (in
case we need to access them using some other criteria) AND in Jackrabbit's
repository. While storing those documents using Jackrabbit, I plan to keep
the RDBMS' pointers (probably the document's record primary key) using
properties. The question is: does this make sense? Is it a common practice?
And if not, what is the standard approach?

2) Do I need to define node types for representing my documents? If not, is
there some standard type I can use?

3) I have read that Jackrabbit is able to read inside some document types,
how do you accomplish that? Using TextExtractors? How? Could you point me
to some examples? I failed to find any. Does it depend on the way I store
those documents? If so, how do you do it?

I know that's a lot of questions. If someone could point me in the right
direction (maybe by pointing me to some code samples), I would be very
thankful.

Best regards.

-- 
Sergio Tridente

Re: Searching inside binary contents and other queries

Posted by Sergio Tridente <ti...@gmail.com>.

Thank you Marcel, that was it. I tried it again with lucene-core 2.2.0 and
it worked flawlessly.


Marcel Reutegger wrote:

> Hi Sergio,
> 
> Sergio Tridente wrote:
>> So far I think I got it right. But when I try to do a search I get the
>> following message:
>> Exception in thread "main"
>> org.apache.lucene.store.AlreadyClosedException: this IndexReader is
>> closed
> 
> this sounds like a lucene version mismatch. please make sure you use
> lucene-core 2.2.0. Jackrabbit does not work with the most recent lucene
> version!
> 
> regards
>   marcel

-- 
Best regards

Sergio Tridente

Re: Searching inside binary contents and other queries

Posted by Marcel Reutegger <ma...@gmx.net>.

Hi Sergio,

Sergio Tridente wrote:
> So far I think I got it right. But when I try to do a search I get the
> following message:
> Exception in thread "main" org.apache.lucene.store.AlreadyClosedException:
> this IndexReader is closed

this sounds like a lucene version mismatch. please make sure you use lucene-core 
2.2.0. Jackrabbit does not work with the most recent lucene version!
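
One way to check which Lucene JAR is actually being picked up is to ask the
JVM where a Lucene class was loaded from. The following is a minimal sketch
using only the standard library. The names ClasspathCheck and describe are
made up for illustration, and the Implementation-Version value is only
meaningful if the JAR's manifest carries that attribute, which is an
assumption and not something confirmed in this thread.

```java
import java.security.CodeSource;

final class ClasspathCheck {

    /** Describes where the named class was loaded from, without initializing it. */
    static String describe(String className) {
        try {
            Class<?> c = Class.forName(className, false,
                    ClasspathCheck.class.getClassLoader());
            // Classes loaded by the bootstrap loader have no code source.
            CodeSource src = c.getProtectionDomain().getCodeSource();
            String location = (src == null || src.getLocation() == null)
                    ? "(bootstrap or unknown)"
                    : src.getLocation().toString();
            // Read from the JAR manifest; null if the attribute is absent.
            Package p = c.getPackage();
            String version = (p == null) ? null : p.getImplementationVersion();
            return className + " loaded from " + location
                    + ", Implementation-Version: " + version;
        } catch (ClassNotFoundException e) {
            return className + " not found on classpath";
        }
    }

    public static void main(String[] args) {
        // IndexReader is the class named in the exception message.
        System.out.println(describe("org.apache.lucene.index.IndexReader"));
    }
}
```

If the printed location points at a JAR other than lucene-core 2.2.0, a
second Lucene version is probably shadowing the expected one.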

regards
  marcel

Re: Searching inside binary contents and other queries

Posted by Alexander Klimetschek <ak...@day.com>.

On Wed, Jun 11, 2008 at 2:56 AM, Sergio Tridente <ti...@gmail.com> wrote:
> Here's the code that stores the pdf:
> [...]
> session.save();
> pdf.close();

Not sure, but I think you should not close the input stream, it will
be done by Jackrabbit. The exception below might be related.

> Exception in thread "main" org.apache.lucene.store.AlreadyClosedException:
> this IndexReader is closed

Regards,
Alex

-- 
Alexander Klimetschek
alexander.klimetschek@day.com

Re: Searching inside binary contents and other queries

Posted by Sergio Tridente <ti...@gmail.com>.

Thank you Marcel and Florian for your answers.

For the moment we are evaluating whether we'll use Jackrabbit for our
application or go for an RDBMS. But that's another thing.

Please, bear with my ignorance. I have coded a simple program for storing a
pdf file and searching inside its contents.

Here's the code that stores the pdf:
Node root = session.getRootNode();
Node folder = root.addNode("my.project", "nt:folder");
Node file = folder.addNode("my.pdf", "nt:file");
Node resource = file.addNode("jcr:content", "nt:resource");
resource.setProperty("jcr:mimeType", "application/pdf");
FileInputStream pdf = new
FileInputStream("/home/sergio/Documents/10gR2_openSUSE102_introduction.pdf");
resource.setProperty("jcr:data", pdf);
Calendar cal = Calendar.getInstance();
cal.set(2008, Calendar.JUNE, 10);
resource.setProperty("jcr:lastModified", cal);
session.save();
pdf.close();

So far I think I got it right. But when I try to do a search I get the
following message:
Exception in thread "main" org.apache.lucene.store.AlreadyClosedException:
this IndexReader is closed

Here I am pasting the code for performing the search inside the PDF's
contents:
Workspace ws = session.getWorkspace();
QueryManager qm = ws.getQueryManager();
Query q = qm.createQuery(
      "select * from nt:resource where jcr:contains='Oracle'", Query.SQL);
QueryResult res = q.execute();
NodeIterator it = res.getNodes();
while (it.hasNext()) {
   Node n = it.nextNode();
   Property prop = n.getProperty("jcr:lastModified");
   System.out.println("Found document containing the word 'Oracle', last modified date: "
         + prop.getDate());
}

I don't know if I got it right with the query language syntax. If you can
point me to some resources where I can take a look, it would be great.
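
For reference, JCR 1.0 (JSR 170) SQL expresses full-text search with the
CONTAINS function, and the XPath flavour uses jcr:contains as a function,
rather than comparing a property named jcr:contains. Below is a minimal
sketch that only builds the query strings. The class FullTextQuery and its
methods are hypothetical helpers, not part of the JCR API, and the escaping
covers only the single quote of the string literal, not the full-text
operators inside the search expression.

```java
final class FullTextQuery {

    /** Doubles single quotes so the term can sit inside a quoted literal. */
    private static String escapeLiteral(String term) {
        return term.replace("'", "''");
    }

    /** JCR 1.0 SQL: search all indexed content of nodes of the given type. */
    static String sqlContains(String nodeType, String term) {
        return "SELECT * FROM " + nodeType
                + " WHERE CONTAINS(., '" + escapeLiteral(term) + "')";
    }

    /** JCR 1.0 XPath equivalent of the statement above. */
    static String xpathContains(String nodeType, String term) {
        return "//element(*, " + nodeType + ")"
                + "[jcr:contains(., '" + escapeLiteral(term) + "')]";
    }

    public static void main(String[] args) {
        System.out.println(sqlContains("nt:resource", "Oracle"));
        System.out.println(xpathContains("nt:resource", "Oracle"));
    }
}
```

The SQL string would be passed to qm.createQuery(statement, Query.SQL) and
the XPath string to qm.createQuery(statement, Query.XPATH), in place of the
statement used in the search code above.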

I also wanted to point out that the TextFilters are declared inside the
repository.xml:
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
    <param name="path" value="${wsp.home}/index"/>
    <param name="textFilterClasses"
value="org.apache.jackrabbit.extractor.MsWordTextExtractor,org.apache.jackrabbit.extractor.MsExcelTextExtractor,org.apache.jackrabbit.extractor.MsPowerPointTextExtractor,org.apache.jackrabbit.extractor.PdfTextExtractor,org.apache.jackrabbit.extractor.OpenOfficeTextExtractor,org.apache.jackrabbit.extractor.RTFTextExtractor,org.apache.jackrabbit.extractor.HTMLTextExtractor,org.apache.jackrabbit.extractor.XMLTextExtractor"/>
    <param name="extractorPoolSize" value="2"/>
    <param name="supportHighlighting" value="true"/>
</SearchIndex>

The following JARs are in my CLASSPATH: PDFBox.jar, poi.jar and
tm-extractors.jar

If you could point me in the right direction, I would highly appreciate it.

-- 
Best regards

Sergio Tridente


Marcel Reutegger wrote:
> Sergio wrote:
>> 1) As our database will be holding most of the data, I thought about the
>> following schema: storing the documents inside BLOBs in the database (in
>> case we need to access them using some other criteria) AND in
>> Jackrabbit's repository. While storing those documents using Jackrabbit,
>> I plan to keep the RDBMS' pointers (probably the document's record
>> primary key) using properties. The question is: does this make sense? Is
>> it a common practice? And if not, what is the standard approach?
> 
> well, the recommended approach is to replace your RDBMS with Jackrabbit.
> 
>> 2) Do I need to define node types for representing my documents? If not,
>> is there some standard type I can use?
> 
> for files and folders there's nt:file and nt:folder. See:
> http://wiki.apache.org/jackrabbit/NodeTypeRegistry and of course the JSR
> 170 specification.
> 
>> 3) I have read that Jackrabbit is able to read inside some document
>> types, how do you accomplish that? Using TextExtractors?
> 
> correct. see: http://jackrabbit.apache.org/jackrabbit-text-extractors.html
> 
>> How? Could you point me
>> to some examples? I failed to find any. Does it depend on the way I store
>> those documents? If so, how do you do it?
> 
> the text extractors only work with nt:resource nodes. this means your
> content structure would look like this:
> 
> + my.pdf (nt:file)
>    - jcr:created=20080101 (DATE)
>    + jcr:content (nt:resource)
>      - jcr:mimeType=application/pdf (STRING)
>      - jcr:lastModified=20080101 (DATE)
>      - jcr:data=<pdf-binary> (BINARY)
> 
> regards
>   marcel

Re: Searching inside binary contents and other queries

Posted by Marcel Reutegger <ma...@gmx.net>.

Sergio wrote:
> 1) As our database will be holding most of the data, I thought about the
> following schema: storing the documents inside BLOBs in the database (in
> case we need to access them using some other criteria) AND in Jackrabbit's
> repository. While storing those documents using Jackrabbit, I plan to keep
> the RDBMS' pointers (probably the document's record primary key) using
> properties. The question is: does this make sense? Is it a common practice?
> And if not, what is the standard approach?

well, the recommended approach is to replace your RDBMS with Jackrabbit.

> 2) Do I need to define node types for representing my documents? If not, is
> there some standard type I can use?

for files and folders there's nt:file and nt:folder. See: 
http://wiki.apache.org/jackrabbit/NodeTypeRegistry and of course the JSR 170 
specification.

> 3) I have read that Jackrabbit is able to read inside some document types,
> how do you accomplish that? Using TextExtractors?

correct. see: http://jackrabbit.apache.org/jackrabbit-text-extractors.html

> How? Could you point me
> to some examples? I failed to find any. Does it depend on the way I store
> those documents? If so, how do you do it?

the text extractors only work with nt:resource nodes. this means your content 
structure would look like this:

+ my.pdf (nt:file)
   - jcr:created=20080101 (DATE)
   + jcr:content (nt:resource)
     - jcr:mimeType=application/pdf (STRING)
     - jcr:lastModified=20080101 (DATE)
     - jcr:data=<pdf-binary> (BINARY)

regards
  marcel

Re: Searching inside binary contents and other queries

Posted by Florian Holeczek <fl...@holeczek.de>.

Hello Sergio,

if you absolutely need to store your data in your relational database,
I don't think Jackrabbit is the right tool for you (otherwise maybe!).

Did you have a look at the possibilities your DBMS provides? I am thinking
of pre-built extensions for search indexes, or of writing your own index
that suits your requirements.

Regards,
 Florian