Posted to users@jackrabbit.apache.org by Philipp Bunge <bu...@crimson.ch> on 2010/01/07 11:48:35 UTC

Fulltext indexing of binary property

Hi,

we have a custom node type that does not extend nt:resource. It looks
like this (the original also has a subnode and other inherited types):

[doc:dokument] > nt:base, mix:mimeType
   - filename (string)
   - original (binary) primary

According to the Spec (3.7.11.10 mix:mimeType) this should provide
what is needed to index the binary property "original".

We use the standard indexing configuration, i.e. no custom configuration,
and still cannot do a fulltext search on the content of "original".

The reason we want to do it like this is that "original" is not
mandatory (jcr:data in nt:resource is mandatory).

Is there a way to get this working?

We could also define "original" as a non-mandatory sub-node of type
nt:resource. For simplicity and other reasons we would prefer the
definition as a property.

Cheers,
Philipp

Re: Fulltext indexing of binary property

Posted by Philipp Bunge <bu...@crimson.ch>.
>> Does that mean Jackrabbit can text extract binary properties on
>> arbitrary nodes?
>
> Currently not, but now that we use Tika for full text extraction it
> should be fairly straightforward to implement this. See
> https://issues.apache.org/jira/browse/JCR-729 for the related issue.

Ah, thanks for clearing that up, Jukka!
In that case we'll resort to using nt:resource for the moment.

Thanks,
Philipp

Re: Fulltext indexing of binary property

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Wed, Jan 13, 2010 at 2:55 PM, Philipp Bunge <bu...@crimson.ch> wrote:
> Does that mean Jackrabbit can text extract binary properties on
> arbitrary nodes?

Currently not, but now that we use Tika for full text extraction it
should be fairly straightforward to implement this. See
https://issues.apache.org/jira/browse/JCR-729 for the related issue.

> How do you know what mimetype/encoding to extract with?

Tika supports pretty accurate automatic type detection based on the
first few bytes of the binary.
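
As a rough illustration of what detection "based on the first few
bytes" means, here is a minimal Java sketch that compares the start of
a binary against a few well-known signatures. It is not Tika's
implementation or API: the class name MagicSniffer and its detect
method are hypothetical, and Tika's real detector covers far more
formats than the three shown here.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration; not part of Tika or Jackrabbit. */
final class MagicSniffer {

    // Leading byte signatures ("magic numbers") of a few common formats.
    private static final byte[][] MAGIC = {
        "%PDF-".getBytes(StandardCharsets.US_ASCII),   // PDF header
        {'P', 'K', 0x03, 0x04},                        // ZIP local file header
        {(byte) 0x89, 'P', 'N', 'G'},                  // PNG signature prefix
    };

    // Media type for the signature at the same index in MAGIC.
    private static final String[] TYPES = {
        "application/pdf",
        "application/zip",
        "image/png",
    };

    private MagicSniffer() {
    }

    /** Returns a media type guessed from the first bytes of the content. */
    static String detect(byte[] prefix) {
        for (int i = 0; i < MAGIC.length; i++) {
            if (startsWith(prefix, MAGIC[i])) {
                return TYPES[i];
            }
        }
        // Generic fallback when no known signature matches.
        return "application/octet-stream";
    }

    private static boolean startsWith(byte[] data, byte[] magic) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Passing the first bytes of a PDF file to MagicSniffer.detect yields
"application/pdf" without consulting a stored jcr:mimeType value;
content with no known signature falls back to
"application/octet-stream".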

BR,

Jukka Zitting

Re: Fulltext indexing of binary property

Posted by Philipp Bunge <bu...@crimson.ch>.
Hi Alex

Thanks for your reply.

> Binary properties are indexed with the help of full text extractors.
> The recent Jackrabbit versions use Apache Tika for that. You can
> provide your custom extractors.

Does that mean Jackrabbit can text extract binary properties on
arbitrary nodes? How do you know what mimetype/encoding to extract
with?

(According to the spec mix:mimeType only applies to binary properties
marked as primary).

Thanks,
Philipp

Re: Fulltext indexing of binary property

Posted by Alexander Klimetschek <ak...@day.com>.
On Thu, Jan 7, 2010 at 11:48, Philipp Bunge <bu...@crimson.ch> wrote:
> We use the standard indexing configuration, i.e. no custom configuration,
> and still cannot do a fulltext search on the content of "original".

Binary properties are indexed with the help of full text extractors.
The recent Jackrabbit versions use Apache Tika for that. You can
provide your custom extractors.

See
http://wiki.apache.org/jackrabbit/Search
http://wiki.apache.org/jackrabbit/TextExtractorExamples
http://lucene.apache.org/tika/

> We could also define "original" as a non-mandatory sub-node of type
> nt:resource. For simplicity and other reasons we would prefer the
> definition as a property.

Indexing does not care whether an item is mandatory; it indexes what is
stored and what the indexing configuration defines.

Regards,
Alex

-- 
Alexander Klimetschek
alexander.klimetschek@day.com