Posted to users@jackrabbit.apache.org by Kurz Wolfgang <wo...@gwvs.de> on 2009/03/26 16:20:17 UTC

Problem getting full textual search to work with textextractors

Hello everyone,

I am trying to get full-text search to work with text extractors.


I uploaded a PDF file as a resource into Jackrabbit. That part works: I can download it again and get the same file back.

Now I want to search the text inside the uploaded document, but the search does not find it, even though the document contains the strings I am searching for.

What I did is this:

I added these jar files to my Tomcat server's lib folder, since I am using JNDI to connect:

-jackrabbit-text-extractors-1.5.0.jar
-fontbox-0.1.0.jar
-junit-3.8.1.jar
-nekohtml-1.9.7.jar
-pdfbox-0.7.3.jar
-poi-3.0.2-FINAL.jar
-poi-scratchpad-3.0.2-FINAL.jar
-tm-extractors-0.4.jar

My XPath query looks like this:

//*[((jcr:contains(.,'consetetur')) or (jcr:contains(.,'sadipscing')))]

Both of those words are in the PDF, but the search result is empty.
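
For illustration, a query of this shape can be assembled from a list of search terms instead of being written by hand. The sketch below is a hypothetical helper and not part of the Jackrabbit API: the names ContainsQueryBuilder, quote and anyOf are made up for this example. It assumes that an apostrophe inside a term is escaped by doubling it in the XPath string literal, and it does not escape full-text operator characters such as - or ".

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical helper (not part of Jackrabbit): builds an XPath full-text query string. */
final class ContainsQueryBuilder {

    private ContainsQueryBuilder() {
    }

    /** Wraps a term in single quotes, doubling embedded apostrophes (assumed escaping rule). */
    static String quote(String term) {
        return "'" + term.replace("'", "''") + "'";
    }

    /** Emits one jcr:contains() predicate per term and joins them with "or". */
    static String anyOf(List<String> terms) {
        if (terms.isEmpty()) {
            throw new IllegalArgumentException("at least one search term is required");
        }
        return terms.stream()
                .map(t -> "jcr:contains(., " + quote(t) + ")")
                .collect(Collectors.joining(" or ", "//*[", "]"));
    }

    public static void main(String[] args) {
        // Prints: //*[jcr:contains(., 'consetetur') or jcr:contains(., 'sadipscing')]
        System.out.println(anyOf(Arrays.asList("consetetur", "sadipscing")));
    }
}
```

The resulting string is equivalent to the query above without the redundant parentheses, and could be passed to createQuery() together with the XPath language constant.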

Here is the code I use to run the search:

javax.jcr.query.Query jcrQuery;
try {
    // Build the query from the statement and query language passed in
    jcrQuery = session.getWorkspace().getQueryManager().createQuery(query, language);
    // Execute it and hand back the matching nodes
    QueryResult queryResult = jcrQuery.execute();
    NodeIterator nodeIterator = queryResult.getNodes();
    return nodeIterator;
}
catch (InvalidQueryException iqe) {
    // Wrap the JCR exception in its OCM counterpart
    throw new org.apache.jackrabbit.ocm.exception.InvalidQueryException(iqe);
}
catch (RepositoryException re) {
    throw new ObjectContentManagerException(re.getMessage(), re);
}


It would be great if anyone had an idea why this doesn't work.

Thanks a lot in advance
Wolfgang

Re: Problem getting full textual search to work with textextractors

Posted by Kurz Wolfgang <wo...@gwvs.de>.
Thx for the info!

I have it working now even without jempbox.

I will add that to my jars though.

Don't ask me why it didn't work before :-) I haven't changed anything :-)

-----Original Message-----
From: mreutegg@day.com [mailto:mreutegg@day.com] On Behalf Of Marcel Reutegger
Sent: Friday, 27 March 2009 09:31
To: users@jackrabbit.apache.org
Subject: Re: Problem getting full textual search to work with textextractors

Hi Wolfgang,

pdfbox has an additional dependency on jempbox:

[dependency:tree]
org.apache.jackrabbit:jackrabbit-text-extractors:jar:1.5.0
+- org.apache.poi:poi:jar:3.0.2-FINAL:compile
|  \- commons-logging:commons-logging:jar:1.1:compile
|     \- log4j:log4j:jar:1.2.14:compile
+- org.apache.poi:poi-scratchpad:jar:3.0.2-FINAL:compile
+- pdfbox:pdfbox:jar:0.7.3:compile
|  +- org.fontbox:fontbox:jar:0.1.0:compile
|  \- org.jempbox:jempbox:jar:0.2.0:compile   <=====
+- net.sourceforge.nekohtml:nekohtml:jar:1.9.7:compile
|  \- xerces:xercesImpl:jar:2.8.1:compile
|     \- xml-apis:xml-apis:jar:1.3.03:compile
+- org.slf4j:slf4j-api:jar:1.5.3:compile

Did you see any warnings or errors in the logs?

regards
 marcel


Re: Problem getting full textual search to work with textextractors

Posted by Marcel Reutegger <ma...@gmx.net>.
Hi Wolfgang,

pdfbox has an additional dependency on jempbox:

[dependency:tree]
org.apache.jackrabbit:jackrabbit-text-extractors:jar:1.5.0
+- org.apache.poi:poi:jar:3.0.2-FINAL:compile
|  \- commons-logging:commons-logging:jar:1.1:compile
|     \- log4j:log4j:jar:1.2.14:compile
+- org.apache.poi:poi-scratchpad:jar:3.0.2-FINAL:compile
+- pdfbox:pdfbox:jar:0.7.3:compile
|  +- org.fontbox:fontbox:jar:0.1.0:compile
|  \- org.jempbox:jempbox:jar:0.2.0:compile   <=====
+- net.sourceforge.nekohtml:nekohtml:jar:1.9.7:compile
|  \- xerces:xercesImpl:jar:2.8.1:compile
|     \- xml-apis:xml-apis:jar:1.3.03:compile
+- org.slf4j:slf4j-api:jar:1.5.3:compile

Did you see any warnings or errors in the logs?

regards
 marcel
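
One way to check whether a missing jar such as jempbox is involved is to probe for one class from each library on the classpath that the repository actually uses. The following is a minimal sketch, not a Jackrabbit facility: ClasspathProbe and missing are hypothetical names, and the three class names in main are assumptions about what the jackrabbit-text-extractors 1.5.0, pdfbox 0.7.3 and jempbox 0.2.0 jars contain. Adjust them if your versions differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical diagnostic: reports which of the given classes cannot be loaded. */
final class ClasspathProbe {

    private ClasspathProbe() {
    }

    /** Returns the subset of classNames that the probe's class loader cannot resolve. */
    static List<String> missing(List<String> classNames) {
        List<String> notFound = new ArrayList<>();
        for (String name : classNames) {
            try {
                // initialize=false: only locate the class, do not run its static initializers
                Class.forName(name, false, ClasspathProbe.class.getClassLoader());
            } catch (ClassNotFoundException | LinkageError e) {
                notFound.add(name);
            }
        }
        return notFound;
    }

    public static void main(String[] args) {
        // Assumed class names, one per jar; they are not taken from the thread.
        List<String> required = Arrays.asList(
                "org.apache.jackrabbit.extractor.PdfTextExtractor",
                "org.pdfbox.pdmodel.PDDocument",
                "org.jempbox.xmp.XMPMetadata");
        System.out.println("Missing: " + missing(required));
    }
}
```

The result is only meaningful if the probe runs under the same class loader as the repository, for example from code deployed alongside the jars in the Tomcat lib folder. Run from a plain command line without those jars, it will report all three classes as missing.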
