Posted to dev@jackrabbit.apache.org by Martin Perez <mp...@gmail.com> on 2005/10/28 11:05:22 UTC

Another issue with text filtering

If you want to add a PDF document to a repository using a PdfTextFilter, and
you do the following steps:

session.save();
node.checkin();

The method PdfTextFilter.doFilter() gets called 4 times!

The session's save method calls doFilter once. This is normal.

But the checkin method calls doFilter three times. Is this normal? I do not
see the sense in it.

Thanks.

Martin

Re: Another issue with text filtering

Posted by Martin Perez <mp...@gmail.com>.

I'll post the issues then.

Really, with most file types this isn't very important; the big problem is
with PDF files. PDF extraction is very slow whatever library you use
(pdfbox, transient tools, etc.), so with these files this is a big
issue. Word, POI, Excel, and plain-text extraction is much faster.

Regards,

Martin

On 10/28/05, Marcel Reutegger <ma...@gmx.net> wrote:
>
> Hi Martin,
>
> this is unfortunate and should be improved. the reason why this happens
> is the following:
> the search index implementation always indexes a node as a whole to
> improve query performance. that means even if a single property changes
> the parent node with all its properties is re-indexed.
>
> unfortunately the checkin method sets properties in three separate
> 'transactions', causing the search to re-index the corresponding node three
> times.
>
> usually this is not an issue, because the index implementation keeps a
> buffer for pending index work. that is, if you change the same property
> several times and save after each setProperty() call, it won't actually
> get re-indexed several times. but text filters behave differently here,
> because they extract the text even though the text will never be used.
>
> eventually this will improve without any change to the search index
> implementation, because as soon as versioning participates properly in
> transactions there will only be one call to index a node on checkin().
>
> as a quick fix we could improve the text filter classes to only parse
> the binary when the returned reader is actually used.
>
> could you create a jira issue for this enhancement too?
>
> thanks a lot.
>
> regards
> marcel
>
>
> Martin Perez wrote:
> > If you want to add a PDF document to a repository using a PdfTextFilter, and
> > you do the following steps:
> >
> > session.save()
> > node.checkin();
> >
> > The method PdfTextFilter.doFilter() gets called 4 times!!!
> >
> > session's save method calls doFilter one time. This is normal
> >
> > But checkin method calls doFilter three times. Is this normal? I do not see
> > the sense.
> >
> > Thanks.
> >
> > Martin
> >
>

Re: Another issue with text filtering

Posted by Marcel Reutegger <ma...@gmx.net>.

Hi Martin,

this is unfortunate and should be improved. the reason why this happens 
is the following:
the search index implementation always indexes a node as a whole to 
improve query performance. that means even if a single property changes 
the parent node with all its properties is re-indexed.

unfortunately the checkin method sets properties in three separate
'transactions', causing the search to re-index the corresponding node three
times.

usually this is not an issue, because the index implementation keeps a 
buffer for pending index work. that is, if you change the same property 
several times and save after each setProperty() call, it won't actually 
get re-indexed several times. but text filters behave differently here, 
because they extract the text even though the text will never be used.
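
To illustrate the buffering point, here is a minimal sketch in plain Java. It is not Jackrabbit code: `PendingIndexBuffer`, `addPending` and `flush` are hypothetical names, and the "extraction" is only a counter. Under those assumptions it shows that coalescing pending work per node saves index writes, but not the extraction cost when the text is extracted eagerly at queue time.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical stand-in for a buffer of pending index work; not Jackrabbit code.
class PendingIndexBuffer {
    // Latest pending text per node id; a newer entry replaces an older one.
    private final Map<String, String> pending = new LinkedHashMap<>();
    int extractions = 0;   // how often the (expensive) text extraction ran
    int indexWrites = 0;   // how often a node was actually written to the index

    // Eager variant: the text is extracted as soon as the work is queued,
    // even if a later entry for the same node replaces it before flush().
    void addPending(String nodeId, Supplier<String> extractor) {
        extractions++;
        pending.put(nodeId, extractor.get());
    }

    // Writes each buffered node once and clears the buffer.
    void flush() {
        indexWrites += pending.size();
        pending.clear();
    }
}

class PendingIndexBufferDemo {
    // Simulates checkin() touching the same node in three separate steps.
    static PendingIndexBuffer run() {
        PendingIndexBuffer buffer = new PendingIndexBuffer();
        for (int i = 0; i < 3; i++) {
            buffer.addPending("node-1", () -> "extracted pdf text");
        }
        buffer.flush();
        return buffer;
    }

    public static void main(String[] args) {
        PendingIndexBuffer b = run();
        System.out.println("extractions=" + b.extractions
                + ", indexWrites=" + b.indexWrites);
    }
}
```

In this sketch the node is written once, yet the extractor runs three times, which matches the behaviour reported for PDF files.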

eventually this will improve without any change to the search index 
implementation, because as soon as versioning participates properly in 
transactions there will only be one call to index a node on checkin().

as a quick fix we could improve the text filter classes to only parse
the binary when the returned reader is actually used.
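
A minimal sketch of that idea in plain Java, assuming the expensive parsing can be wrapped in a `Supplier<String>`. `DeferredReader` and `slowExtract` are hypothetical names used only for illustration; no PDF library is involved.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical sketch: defers the expensive extraction until the first read().
class DeferredReader extends Reader {
    private final Supplier<String> extractor;
    private Reader delegate; // created on first read, null until then

    DeferredReader(Supplier<String> extractor) {
        this.extractor = extractor;
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        if (delegate == null) {
            delegate = new StringReader(extractor.get());
        }
        return delegate.read(cbuf, off, len);
    }

    @Override
    public void close() throws IOException {
        if (delegate != null) {
            delegate.close();
        }
    }
}

class DeferredReaderDemo {
    // Creates three readers but consumes only the last one, then returns
    // how many extractions actually ran.
    static int run() {
        AtomicInteger extractions = new AtomicInteger();
        Supplier<String> slowExtract = () -> {
            extractions.incrementAndGet();
            return "extracted pdf text";
        };
        try {
            Reader first = new DeferredReader(slowExtract);
            Reader second = new DeferredReader(slowExtract);
            Reader third = new DeferredReader(slowExtract);

            char[] buf = new char[64];
            third.read(buf, 0, buf.length); // triggers the only extraction
            first.close();
            second.close();
            third.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return extractions.get();
    }
}
```

Readers that are created but never read cost nothing here, so three index requests for the same node would lead to a single extraction if only the last reader is consumed.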

could you create a jira issue for this enhancement too?

thanks a lot.

regards
  marcel

Martin Perez wrote:
> If you want to add a PDF document to a repository using a PdfTextFilter, and
> you do the following steps:
> 
> session.save()
> node.checkin();
> 
> The method PdfTextFilter.doFilter() gets called 4 times!!!
> 
> session's save method calls doFilter one time. This is normal
> 
> But checkin method calls doFilter three times. Is this normal? I do not see
> the sense.
> 
> Thanks.
> 
> Martin
>

Re: Another issue with text filtering

Posted by Martin Perez <mp...@gmail.com>.

Thanks Marcel :)

Re: Another issue with text filtering

Posted by Marcel Reutegger <ma...@gmx.net>.

that would be another enhancement. I associated the binary content with
the node scope index because that way it is closer to what the
spec requires. jcr:contains on a specific property is
optional, and I didn't want to index the content twice, so I added it to
the node scope.

you may also write your own custom query handler and custom node indexer 
that extend the current ones and index the binary content also for the 
jcr:data property scope.

regards
  marcel

Martin Perez wrote:
> mmmm
> 
> but that (//*[jcr:contains(., 'phrase')]) will search all nodes with that
> phrase in *any* property. I want to search only the indexed text. (This
> is because my users can choose where they want to search: keywords, name,
> notes, *content*.)
> 
> Isn't it possible?
> 
> Thanks

Re: Another issue with text filtering

Posted by Martin Perez <mp...@gmail.com>.

mmmm

but that (//*[jcr:contains(., 'phrase')]) will search all nodes with that
phrase in *any* property. I want to search only the indexed text. (This
is because my users can choose where they want to search: keywords, name,
notes, *content*.)

Isn't it possible?

Thanks


On 10/28/05, Marcel Reutegger <ma...@gmx.net> wrote:
>
> Martin Perez wrote:
> > Sorry for the delay Marcel, I had other issues here at work.
>
> no worries, we all have ;)
>
> > The patch compiles and works. Now the same number of operations (4) is done,
> > but they happen in the background. Well, I'm not sure about the impact of
> > this. What if a user wants to work with the object?
>
> ok, thanks for the feedback, I'll have a deeper look at it and post my
> findings in the jira issue you created.
>
> > One last question (I promise it :)). How can I search now on that text
> > parsed content? I was using something like [jcr:contains(@jcr:data,phrase)]
> > but this, as could be expected, does not work.
>
> //*[jcr:contains(., 'phrase')]
>
> will do it. the extracted text is indexed as a hidden fulltext field
> that is associated with the node and not the jcr:data property.
>
> regards
> marcel
>

Re: Another issue with text filtering

Posted by Marcel Reutegger <ma...@gmx.net>.

Martin Perez wrote:
> Sorry for the delay Marcel, I had other issues here at work.

no worries, we all have ;)

> The patch compiles and works. Now the same number of operations (4) is done,
> but they happen in the background. Well, I'm not sure about the impact of
> this. What if a user wants to work with the object?

ok, thanks for the feedback, I'll have a deeper look at it and post my 
findings in the jira issue you created.

> One last question (I promise it :)). How can I search now on that text
> parsed content? I was using something like [jcr:contains(@jcr:data,phrase)]
> but this, as could be expected, does not work.

//*[jcr:contains(., 'phrase')]

will do it. the extracted text is indexed as a hidden fulltext field 
that is associated with the node and not the jcr:data property.
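
As a small illustration of the node-scope query above, here is a sketch that builds the statement for an arbitrary phrase. `NodeScopeQuery` is a hypothetical helper, and doubling apostrophes is assumed to be the right way to keep the phrase inside the XPath string literal. Special characters of the jcr:contains search syntax itself are not handled.

```java
// Hypothetical helper; only the shape of the query comes from this thread.
class NodeScopeQuery {
    // Builds //*[jcr:contains(., '<phrase>')], doubling apostrophes so the
    // phrase stays inside the XPath string literal (an assumption, see above).
    static String contains(String phrase) {
        String escaped = phrase.replace("'", "''");
        return "//*[jcr:contains(., '" + escaped + "')]";
    }

    public static void main(String[] args) {
        System.out.println(contains("phrase"));
    }
}
```

With the JCR API on the classpath, the resulting string would typically be passed to QueryManager.createQuery(statement, Query.XPATH).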

regards
  marcel

Re: Another issue with text filtering

Posted by Martin Perez <mp...@gmail.com>.

Sorry for the delay Marcel, I had other issues here at work.

The patch compiles and works. Now the same number of operations (4) is done,
but they happen in the background. Well, I'm not sure about the impact of
this. What if a user wants to work with the object?

One last question (I promise :)). How can I now search that parsed text
content? I was using something like [jcr:contains(@jcr:data,phrase)]
but this, as could be expected, does not work.

Re: Another issue with text filtering

Posted by Martin Perez <mp...@gmail.com>.

I'm testing it :)

In a few minutes you'll have a result. I have also submitted the two JIRA
issues we talked about previously.

On 10/28/05, Marcel Reutegger <ma...@gmx.net> wrote:
>
> nope, I don't commit untested changes ;)
>
> it's just something I put together in a couple of minutes. I thought you
> might be interested in testing it out...
>
> regards
> marcel
>
> Martin Perez wrote:
> > Are these changes on the SVN repository?
> >
> > I can't see them...
> >
> > Regards,
> >
> > Martin
> >
> > On 10/28/05, Marcel Reutegger <ma...@gmx.net> wrote:
> >
> >>Hi Martin,
> >>
> >>I quickly put together a patch for the pdf text filter. completely
> >>untested because I'm a bit short of time at the moment.
> >>
> >>Any feedback if it works is appreciated.
> >>
> >>regards
> >>marcel
>

Re: Another issue with text filtering

Posted by Marcel Reutegger <ma...@gmx.net>.

nope, I don't commit untested changes ;)

it's just something I put together in a couple of minutes. I thought you
might be interested in testing it out...

regards
  marcel

Martin Perez wrote:
> Are these changes on the SVN repository?
> 
> I can't see them...
> 
> Regards,
> 
> Martin
> 
> On 10/28/05, Marcel Reutegger <ma...@gmx.net> wrote:
> 
>>Hi Martin,
>>
>>I quickly put together a patch for the pdf text filter. completely
>>untested because I'm a bit short of time at the moment.
>>
>>Any feedback if it works is appreciated.
>>
>>regards
>>marcel

Re: Another issue with text filtering

Posted by Martin Perez <mp...@gmail.com>.

Are these changes on the SVN repository?

I can't see them...

Regards,

Martin

On 10/28/05, Marcel Reutegger <ma...@gmx.net> wrote:
>
> Hi Martin,
>
> I quickly put together a patch for the pdf text filter. completely
> untested because I'm a bit short of time at the moment.
>
> Any feedback if it works is appreciated.
>
> regards
> marcel
>
> Martin Perez wrote:
> > If you want to add a PDF document to a repository using a PdfTextFilter, and
> > you do the following steps:
> >
> > session.save()
> > node.checkin();
> >
> > The method PdfTextFilter.doFilter() gets called 4 times!!!
> >
> > session's save method calls doFilter one time. This is normal
> >
> > But checkin method calls doFilter three times. Is this normal? I do not see
> > the sense.
> >
> > Thanks.
> >
> > Martin
> >
>
>
> Index: java/org/apache/jackrabbit/core/query/LazyReader.java
> ===================================================================
> --- java/org/apache/jackrabbit/core/query/LazyReader.java (revision 0)
> +++ java/org/apache/jackrabbit/core/query/LazyReader.java (revision 0)
> @@ -0,0 +1,66 @@
> +/*
> + * Copyright 2004-2005 The Apache Software Foundation or its licensors,
> + * as applicable.
> + *
> + * Licensed under the Apache License, Version 2.0 (the "License");
> + * you may not use this file except in compliance with the License.
> + * You may obtain a copy of the License at
> + *
> + * http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing, software
> + * distributed under the License is distributed on an "AS IS" BASIS,
> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> + * See the License for the specific language governing permissions and
> + * limitations under the License.
> + */
> +package org.apache.jackrabbit.core.query;
> +
> +import java.io.Reader;
> +import java.io.IOException;
> +
> +/**
> + * <code>LazyReader</code> implement an utility that allows an implementing
> + * class to lazy initialize an actual reader.
> + */
> +public abstract class LazyReader extends Reader {
> +
> + /**
> + * The actual reader, set by concrete sub class.
> + */
> + protected Reader delegate;
> +
> + /**
> + * Implementation must set the actual reader {@link #delegate} when
> + * this method is called.
> + *
> + * @throws IOException if an error occurs.
> + */
> + protected abstract void initializeReader() throws IOException;
> +
> + /**
> + * Closes the underlying reader.
> + *
> + * @throws IOException if an exception occurs while closing the underlying
> + * reader.
> + */
> + public void close() throws IOException {
> + if (delegate != null) {
> + delegate.close();
> + }
> + }
> +
> + /**
> + * @inheritDoc
> + */
> + public int read(char cbuf[], int off, int len) throws IOException {
> + if (delegate == null) {
> + initializeReader();
> + }
> + // be suspicious
> + if (delegate == null) {
> + throw new IOException("reader not initialized");
> + }
> + return delegate.read(cbuf, off, len);
> + }
> +}
>
> Property changes on: java/org/apache/jackrabbit/core/query/LazyReader.java
> ___________________________________________________________________
> Name: svn:eol-style
> + native
>
> Index: java/org/apache/jackrabbit/core/query/PdfTextFilter.java
> ===================================================================
> --- java/org/apache/jackrabbit/core/query/PdfTextFilter.java (revision 329171)
> +++ java/org/apache/jackrabbit/core/query/PdfTextFilter.java (working copy)
> @@ -57,31 +57,37 @@
> public Map doFilter(PropertyState data, String encoding) throws RepositoryException {
> InternalValue[] values = data.getValues();
> if (values.length > 0) {
> - BLOBFileValue blob = (BLOBFileValue) values[0].internalValue();
> -
> - try {
> - PDFParser parser = new PDFParser(blob.getStream());
> - parser.parse();
> -
> - PDDocument document = parser.getPDDocument();
> -
> - CharArrayWriter writer = new CharArrayWriter();
> -
> - PDFTextStripper stripper = new PDFTextStripper();
> - stripper.setLineSeparator("\n");
> - stripper.writeText(document, writer);
> -
> - document.close();
> - writer.close();
> -
> - Map result = new HashMap();
> - result.put(FieldNames.FULLTEXT, new CharArrayReader(writer.toCharArray()));
> - return result;
> - }
> - catch (IOException ex) {
> - throw new RepositoryException(ex);
> - }
> - }
> + final BLOBFileValue blob = (BLOBFileValue) values[0].internalValue();
> + LazyReader reader = new LazyReader() {
> + protected void initializeReader() throws IOException {
> + PDFParser parser;
> + try {
> + parser = new PDFParser(blob.getStream());
> + } catch (RepositoryException e) {
> + throw new IOException(e.getMessage());
> + }
> + parser.parse();
> +
> + PDDocument document = parser.getPDDocument();
> +
> + CharArrayWriter writer = new CharArrayWriter();
> +
> + PDFTextStripper stripper = new PDFTextStripper();
> + stripper.setLineSeparator("\n");
> + stripper.writeText(document, writer);
> +
> + document.close();
> + writer.close();
> +
> + delegate = new CharArrayReader(writer.toCharArray());
> + }
> + };
> +
> +
> + Map result = new HashMap();
> + result.put(FieldNames.FULLTEXT, reader);
> + return result;
> + }
> else {
> // multi value not supported
> throw new RepositoryException("Multi-valued binary properties not supported.");
>
>
>

Re: Another issue with text filtering

Posted by Marcel Reutegger <ma...@gmx.net>.

Hi Martin,

I quickly put together a patch for the pdf text filter. completely 
untested because I'm a bit short of time at the moment.

Any feedback if it works is appreciated.

regards
  marcel

Martin Perez wrote:
> If you want to add a PDF document to a repository using a PdfTextFilter, and
> you do the following steps:
> 
> session.save()
> node.checkin();
> 
> The method PdfTextFilter.doFilter() gets called 4 times!!!
> 
> session's save method calls doFilter one time. This is normal
> 
> But checkin method calls doFilter three times. Is this normal? I do not see
> the sense.
> 
> Thanks.
> 
> Martin
>