Posted to users@jackrabbit.apache.org by Katia Santos <ka...@gmail.com> on 2008/03/05 13:09:22 UTC

Search in binary Content

I'm trying to search in PDF binary content. The text is being extracted, but
when I run the query, I get no results :(
Does anyone have the same problem, or know what the problem is?

My query is:

//*[jcr:contains(.,'myword')]



I have another problem: text extraction works fine for xls, odt, odp, and ods
files, but not for pdf, xml, txt, rtf, doc, or ppt files :(
No text is extracted for these last file types. If someone could help me with
that...

Thanks

Re: Search in binary Content

Posted by Marcel Reutegger <ma...@gmx.net>.

Hi Katia,

Katia Santos wrote:
> I'm trying to search in PDF binary content. The text is being extracted, but
> when I run the query, I get no results :(
> Does anyone have the same problem, or know what the problem is?
> 
> my query is:
> 
> //*[jcr:contains(.,'myword')]

Did you set the textFilterClasses parameter in your workspace.xml? Please also
make sure all the jar files the extractors depend on are on your classpath.
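
As a rough illustration of what that parameter looks like, here is a small
Java sketch that renders the <param> line for the SearchIndex element of
workspace.xml. The class names are assumed from the
jackrabbit-text-extractors module and should be checked against the list
linked in this message; the helper class TextFilterParam itself is
hypothetical and not part of Jackrabbit.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Jackrabbit: it only builds the text of
// the textFilterClasses parameter so the value format is easy to see.
final class TextFilterParam {

    // Assumed extractor class names; verify them against your Jackrabbit
    // version before copying them into workspace.xml.
    static final List<String> EXTRACTORS = Arrays.asList(
            "org.apache.jackrabbit.extractor.PdfTextExtractor",
            "org.apache.jackrabbit.extractor.MsWordTextExtractor",
            "org.apache.jackrabbit.extractor.MsExcelTextExtractor",
            "org.apache.jackrabbit.extractor.MsPowerPointTextExtractor",
            "org.apache.jackrabbit.extractor.OpenOfficeTextExtractor",
            "org.apache.jackrabbit.extractor.RTFTextExtractor",
            "org.apache.jackrabbit.extractor.HTMLTextExtractor",
            "org.apache.jackrabbit.extractor.XMLTextExtractor",
            "org.apache.jackrabbit.extractor.PlainTextExtractor");

    // Joins fully qualified class names with commas (no spaces) into the
    // value of a single <param> element.
    static String render(List<String> classNames) {
        return "<param name=\"textFilterClasses\" value=\""
                + String.join(",", classNames) + "\"/>";
    }

    public static void main(String[] args) {
        // Prints the line to place inside the <SearchIndex> element.
        System.out.println(render(EXTRACTORS));
    }
}
```

Each extractor only works if its parser library (for example pdfbox for
PdfTextExtractor) is also on the classpath.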

Here's a list of supported classes and the corresponding mime types that are 
recognized:
http://jackrabbit.apache.org/jackrabbit-text-extractors.html

See also the query section in the FAQ:
http://jackrabbit.apache.org/frequently-asked-questions.html

regards
  marcel

> I have another problem: text extraction works fine for xls, odt, odp, and ods
> files, but not for pdf, xml, txt, rtf, doc, or ppt files :(
> No text is extracted for these last file types. If someone could help me with
> that...
> 
> Thanks
>

Re: Search in binary Content

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

On Wed, Mar 5, 2008 at 7:10 PM, Prakash Reddy K. L. V.
<Pr...@sun.com> wrote:
>  I am sorry for giving out the wrong info but I somehow have this vague
>  memory of reading somewhere that jackrabbit does not support this.

No problem. The current binary indexing feature only supports a rather
specific category of binary properties, so I wouldn't be surprised if
there are claims that Jackrabbit doesn't support binary indexing in
general.

We have a feature request
(https://issues.apache.org/jira/browse/JCR-729) for adding support for
indexing all binary properties, but that depends on having code that
is able to automatically detect and parse the various types of
binaries. Currently we are hoping to use Apache Tika
(http://incubator.apache.org/tika/) for this purpose once the project
graduates from the Apache Incubator.

BR,

Jukka Zitting

Re: Search in binary Content

Posted by "Prakash Reddy K. L. V." <Pr...@Sun.COM>.

Hi Jukka,

I am sorry for giving out the wrong info but I somehow have this vague 
memory of reading somewhere that jackrabbit does not support this.
Thanks for clarifying this.

Sorry again.

Prakash

Jukka Zitting wrote:
> Hi,
>
> On Wed, Mar 5, 2008 at 4:15 PM, Prakash Reddy K. L. V.
> <Pr...@sun.com> wrote:
>   
>>  Jackrabbit does not support searching in binary content.
>>     
>
> It does, but there are certain restrictions before that happens.
>
> You need to put your binary content in a jcr:data property, and have a
> related jcr:mimeType string property with the exact MIME type of the
> binary data.
>
> Then, if you've configured the appropriate Jackrabbit text extractors
> in the repository configuration file and have all the required parser
> libraries (e.g. pdfbox for PDFs) available, Jackrabbit will index such
> binary properties.
>
> BR,
>
> Jukka Zitting
>

Re: Search in binary Content

Posted by Katia Santos <ka...@gmail.com>.

Thanks Jukka,

I can now search in PDF binary content, but I still can't extract anything
from txt, rtf, xml, html, doc or ppt files, and I don't know why! I have the
text extractors in the workspace configuration, and I use the MIME types
listed on the Jackrabbit website, but it does not work. It's not the query,
it's the extractor, because no text is being extracted when the types are
txt, rtf, xml, html, doc and ppt. Are there any libraries that I don't know
of for these types of files?

If someone knows, please give me a hint :)

Re: Search in binary Content

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

On Wed, Mar 5, 2008 at 4:15 PM, Prakash Reddy K. L. V.
<Pr...@sun.com> wrote:
>  Jackrabbit does not support searching in binary content.

It does, but there are certain restrictions before that happens.

You need to put your binary content in a jcr:data property, and have a
related jcr:mimeType string property with the exact MIME type of the
binary data.

Then, if you've configured the appropriate Jackrabbit text extractors
in the repository configuration file and have all the required parser
libraries (e.g. pdfbox for PDFs) available, Jackrabbit will index such
binary properties.
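
To make the "exact MIME type" point concrete, here is a minimal Java sketch,
not taken from Jackrabbit itself. It picks a MIME string for a file name, to
be stored in jcr:mimeType next to jcr:data, and builds the full-text query
from the original post. The class BinarySearchSketch and the mapping table
are assumptions for illustration; the MIME strings are common registered
values and should be compared with the ones your configured extractors
actually accept.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of the JCR or Jackrabbit APIs.
final class BinarySearchSketch {

    // Assumed extension-to-MIME mapping; an extractor that expects a
    // different string will not extract text from the file.
    private static final Map<String, String> MIME_BY_EXTENSION = new HashMap<>();
    static {
        MIME_BY_EXTENSION.put("pdf", "application/pdf");
        MIME_BY_EXTENSION.put("txt", "text/plain");
        MIME_BY_EXTENSION.put("html", "text/html");
        MIME_BY_EXTENSION.put("xml", "text/xml");
        MIME_BY_EXTENSION.put("rtf", "application/rtf");
        MIME_BY_EXTENSION.put("doc", "application/msword");
        MIME_BY_EXTENSION.put("xls", "application/vnd.ms-excel");
        MIME_BY_EXTENSION.put("ppt", "application/vnd.ms-powerpoint");
        MIME_BY_EXTENSION.put("odt", "application/vnd.oasis.opendocument.text");
        MIME_BY_EXTENSION.put("ods", "application/vnd.oasis.opendocument.spreadsheet");
        MIME_BY_EXTENSION.put("odp", "application/vnd.oasis.opendocument.presentation");
    }

    // Returns the MIME type to store in jcr:mimeType, or null when the
    // extension is missing or unknown (such content would not be indexed).
    static String mimeTypeFor(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return null;
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return MIME_BY_EXTENSION.get(ext);
    }

    // Builds the XPath full-text query for a single plain word. Anything
    // other than letters and digits is rejected, because quoting rules for
    // the search expression are outside the scope of this sketch.
    static String containsQuery(String word) {
        if (word.isEmpty() || !word.chars().allMatch(Character::isLetterOrDigit)) {
            throw new IllegalArgumentException("plain word expected: " + word);
        }
        return "//*[jcr:contains(., '" + word + "')]";
    }

    public static void main(String[] args) {
        System.out.println(mimeTypeFor("report.pdf"));
        System.out.println(containsQuery("myword"));
    }
}
```

In a real repository the returned string would be set as the jcr:mimeType
property on the same node that holds jcr:data, and the query string would be
passed to the workspace's query manager.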

BR,

Jukka Zitting

Re: Search in binary Content

Posted by "Prakash Reddy K. L. V." <Pr...@Sun.COM>.

Hi Katia,

Jackrabbit does not support searching in binary content.

Regards,
Prakash

Katia Santos wrote:
> I'm trying to search in PDF binary content. The text is being extracted, but
> when I run the query, I get no results :(
> Does anyone have the same problem, or know what the problem is?
>
> my query is:
>
> //*[jcr:contains(.,'myword')]
>
>
>
> I have another problem: text extraction works fine for xls, odt, odp, and ods
> files, but not for pdf, xml, txt, rtf, doc, or ppt files :(
> No text is extracted for these last file types. If someone could help me with
> that...
>
> Thanks
>
>