Posted to users@jackrabbit.apache.org by ChadDavis <ch...@gmail.com> on 2010/02/18 18:35:53 UTC

jackrabbit 2.0 binary search indexing

I'm looking for information on how to enable binary search indexing.
I found documentation for pre-2.0 jackrabbit, and reference to the
fact that Tika is now used internally for the binary indexing.
However, I can't find any documentation of how to enable the binary
indexing . . ..

Can someone point me to some docs, or impart their wisdom directly?

Re: jackrabbit 2.0 binary search indexing [SEC=UNCLASSIFIED]

Posted by ChadDavis <ch...@gmail.com>.

On Thu, Feb 18, 2010 at 5:41 PM,  <Ro...@ipaustralia.gov.au> wrote:
> I only have a small dataset in my test application (<100 docs), it certainly
> only takes a few seconds to be available for the keyword search.

I figured it out.  User error.

Re: jackrabbit 2.0 binary search indexing [SEC=UNCLASSIFIED]

Posted by Ro...@ipaustralia.gov.au.

I only have a small dataset in my test application (<100 docs); it
certainly only takes a few seconds for a document to become available to
the keyword search.

ChadDavis <ch...@gmail.com> wrote on 19/02/2010 11:33:27 AM:

> From: ChadDavis <ch...@gmail.com>
> To: users@jackrabbit.apache.org
> Date: 19/02/2010 11:34 AM
> Subject: Re: jackrabbit 2.0 binary search indexing [SEC=UNCLASSIFIED]
> 
> On Thu, Feb 18, 2010 at 5:30 PM,  <Ro...@ipaustralia.gov.au> wrote:
> > My binary files are all PDFs, so the text is extracted with PdfBox toolkit
> > and the full text becomes keyword searchable.
> > All done using the default configuration, except I extended nt:resource to
> > add a few attributes.
> >
> > The mimeType attribute will be application/octet-stream.
> > Perhaps there is no plug-in that knows how to extract text from your binary
> > files?
> 
> I tried pdf, word, and a plain text file . . . how long does it take
> for a doc to be indexed?
> 
> >
> >
> >
> >
> > From:        ChadDavis <ch...@gmail.com>
> > To:        users@jackrabbit.apache.org
> > Date:        19/02/2010 11:13 AM
> > Subject:        Re: jackrabbit 2.0 binary search indexing
> > ________________________________
> >
> >
> > On Thu, Feb 18, 2010 at 2:39 PM, Alexander Klimetschek <ak...@day.com>
> > wrote:
> >> On Thu, Feb 18, 2010 at 18:35, ChadDavis <ch...@gmail.com>
> >> wrote:
> >>> I'm looking for information on how to enable binary search indexing.
> >>> I found documentation for pre-2.0 jackrabbit, and reference to the
> >>> fact that Tika is now used internally for the binary indexing.
> >>> However, I can't find any documentation of how to enable the binary
> >>> indexing . . ..
> >>
> >> It is enabled for all nt:file binaries, i.e. the jcr:content/jcr:data
> >> property. The mimetype for text extraction is taken from the
> >> jcr:content/jcr:mimeType property. I don't know if you can enable it
> >> for other binary properties.
> >>
> >
> > Just to clarify, you are saying that the binary indexing, as long as
> > I'm using the JCR built-in node types for my binary file storage, e.g.
> > nt:file --> jcr:content <nt:resource> -->jcr:data ( binary property
> > with my file ), occurs automatically?
> >
> > If so, then something's not working for me.  Can you recommend some
> > troubleshooting tips?  How can I determine whether the binaries are
> > being indexed?  Note, I'm doing a full text search and it DOES hit
> > other node properties, etc.
> >
> >
> >
> > --
> > This message contains privileged and confidential information only
> > for use by the intended recipient.  If you are not the intended
> > recipient of this message, you must not disseminate, copy or use
> > it in any manner.  If you have received this message in error,
> > please advise the sender by reply e-mail.  Please ensure all
> > e-mail attachments are scanned for viruses prior to opening or
> > using.
> >
> >

Re: jackrabbit 2.0 binary search indexing [SEC=UNCLASSIFIED]

Posted by ChadDavis <ch...@gmail.com>.

On Thu, Feb 18, 2010 at 5:30 PM,  <Ro...@ipaustralia.gov.au> wrote:
> My binary files are all PDFs, so the text is extracted with PdfBox toolkit
> and the full text becomes keyword searchable.
> All done using the default configuration, except I extended nt:resource to
> add a few attributes.
>
> The mimeType attribute will be application/octet-stream.
> Perhaps there is no plug-in that knows how to extract text from your binary
> files?

I tried pdf, word, and a plain text file . . . how long does it take
for a doc to be indexed?

>
>
>
>
> From:        ChadDavis <ch...@gmail.com>
> To:        users@jackrabbit.apache.org
> Date:        19/02/2010 11:13 AM
> Subject:        Re: jackrabbit 2.0 binary search indexing
> ________________________________
>
>
> On Thu, Feb 18, 2010 at 2:39 PM, Alexander Klimetschek <ak...@day.com>
> wrote:
>> On Thu, Feb 18, 2010 at 18:35, ChadDavis <ch...@gmail.com>
>> wrote:
>>> I'm looking for information on how to enable binary search indexing.
>>> I found documentation for pre-2.0 jackrabbit, and reference to the
>>> fact that Tika is now used internally for the binary indexing.
>>> However, I can't find any documentation of how to enable the binary
>>> indexing . . ..
>>
>> It is enabled for all nt:file binaries, i.e. the jcr:content/jcr:data
>> property. The mimetype for text extraction is taken from the
>> jcr:content/jcr:mimeType property. I don't know if you can enable it
>> for other binary properties.
>>
>
> Just to clarify, you are saying that the binary indexing, as long as
> I'm using the JCR built-in node types for my binary file storage, e.g.
> nt:file --> jcr:content <nt:resource> -->jcr:data ( binary property
> with my file ), occurs automatically?
>
> If so, then something's not working for me.  Can you recommend some
> troubleshooting tips?  How can I determine whether the binaries are
> being indexed?  Note, I'm doing a full text search and it DOES hit
> other node properties, etc.

Re: jackrabbit 2.0 binary search indexing [SEC=UNCLASSIFIED]

Posted by Ro...@ipaustralia.gov.au.

My binary files are all PDFs, so the text is extracted with PdfBox toolkit 
and the full text becomes keyword searchable.
All done using the default configuration, except I extended nt:resource to 
add a few attributes.

The mimeType attribute will be application/octet-stream. 
Perhaps there is no plug-in that knows how to extract text from your 
binary files?

From:   ChadDavis <ch...@gmail.com>
To:     users@jackrabbit.apache.org
Date:   19/02/2010 11:13 AM
Subject:        Re: jackrabbit 2.0 binary search indexing

On Thu, Feb 18, 2010 at 2:39 PM, Alexander Klimetschek <ak...@day.com> wrote:
> On Thu, Feb 18, 2010 at 18:35, ChadDavis <ch...@gmail.com> wrote:
>> I'm looking for information on how to enable binary search indexing.
>> I found documentation for pre-2.0 jackrabbit, and reference to the
>> fact that Tika is now used internally for the binary indexing.
>> However, I can't find any documentation of how to enable the binary
>> indexing . . ..
>
> It is enabled for all nt:file binaries, i.e. the jcr:content/jcr:data
> property. The mimetype for text extraction is taken from the
> jcr:content/jcr:mimeType property. I don't know if you can enable it
> for other binary properties.
>

Just to clarify, you are saying that the binary indexing, as long as
I'm using the JCR built-in node types for my binary file storage, e.g.
nt:file --> jcr:content <nt:resource> -->jcr:data ( binary property
with my file ), occurs automatically?

If so, then something's not working for me.  Can you recommend some
troubleshooting tips?  How can I determine whether the binaries are
being indexed?  Note, I'm doing a full text search and it DOES hit
other node properties, etc.

Re: jackrabbit 2.0 binary search indexing

Posted by ChadDavis <ch...@gmail.com>.

> Make sure you have configured all the extractors you need. This is
> done in the <SearchIndex> configuration. You can also look this up on
> the wiki.


Is this necessary for jackrabbit 2.0?

Re: jackrabbit 2.0 binary search indexing

Posted by Ard Schrijvers <a....@onehippo.com>.

On Fri, Feb 19, 2010 at 1:13 AM, ChadDavis <ch...@gmail.com> wrote:
> On Thu, Feb 18, 2010 at 2:39 PM, Alexander Klimetschek <ak...@day.com> wrote:
>> On Thu, Feb 18, 2010 at 18:35, ChadDavis <ch...@gmail.com> wrote:
>>> I'm looking for information on how to enable binary search indexing.
>>> I found documentation for pre-2.0 jackrabbit, and reference to the
>>> fact that Tika is now used internally for the binary indexing.
>>> However, I can't find any documentation of how to enable the binary
>>> indexing . . ..
>>
>> It is enabled for all nt:file binaries, i.e. the jcr:content/jcr:data
>> property. The mimetype for text extraction is taken from the
>> jcr:content/jcr:mimeType property. I don't know if you can enable it
>> for other binary properties.
>>
>
> Just to clarify, you are saying that the binary indexing, as long as
> I'm using the JCR built-in node types for my binary file storage, e.g.
> nt:file --> jcr:content <nt:resource> -->jcr:data ( binary property
> with my file ), occurs automatically?
>
> If so, then something's not working for me.  Can you recommend some
> troubleshooting tips?  How can I determine whether the binaries are
> being indexed?  Note, I'm doing a full text search and it DOES hit
> other node properties, etc.

Make sure you have configured all the extractors you need. This is
done in the <SearchIndex> configuration. You can also look this up on
the wiki.

Ard


Re: jackrabbit 2.0 binary search indexing

Posted by ChadDavis <ch...@gmail.com>.

On Thu, Feb 18, 2010 at 2:39 PM, Alexander Klimetschek <ak...@day.com> wrote:
> On Thu, Feb 18, 2010 at 18:35, ChadDavis <ch...@gmail.com> wrote:
>> I'm looking for information on how to enable binary search indexing.
>> I found documentation for pre-2.0 jackrabbit, and reference to the
>> fact that Tika is now used internally for the binary indexing.
>> However, I can't find any documentation of how to enable the binary
>> indexing . . ..
>
> It is enabled for all nt:file binaries, i.e. the jcr:content/jcr:data
> property. The mimetype for text extraction is taken from the
> jcr:content/jcr:mimeType property. I don't know if you can enable it
> for other binary properties.
>

Just to clarify, you are saying that the binary indexing, as long as
I'm using the JCR built-in node types for my binary file storage, e.g.
nt:file --> jcr:content <nt:resource> -->jcr:data ( binary property
with my file ), occurs automatically?

If so, then something's not working for me.  Can you recommend some
troubleshooting tips?  How can I determine whether the binaries are
being indexed?  Note, I'm doing a full text search and it DOES hit
other node properties, etc.
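
One way to check whether the binaries themselves are being indexed is to
restrict the full-text query to the nt:resource nodes, so that hits on other
node properties cannot mask a missing binary index. The sketch below only
builds such an XPath statement; the class and method names are hypothetical,
and it assumes the file content sits under jcr:content of type nt:resource as
described above. It does not escape the jcr:contains search syntax itself
(quotes, minus signs, OR), only the XPath string literal.

```java
// Hypothetical helper, not part of Jackrabbit or the JCR API.
// Only the string it returns is meant to be JCR XPath.
class FullTextProbe {

    // Builds a full-text query limited to nt:resource nodes, which is
    // where the extracted text of jcr:data would be searchable.
    static String xpathFor(String term) {
        // XPath string literals escape a single quote by doubling it.
        String escaped = term.replace("'", "''");
        return "//element(*, nt:resource)[jcr:contains(., '" + escaped + "')]";
    }

    public static void main(String[] args) {
        // Use a word that occurs only inside the uploaded file's text.
        System.out.println(xpathFor("invoice"));
    }
}
```

The resulting statement could then be run through the workspace's
QueryManager as an XPath query. If a word that exists only inside the file
body returns no nt:resource node, the binary was probably not extracted.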

Re: jackrabbit 2.0 binary search indexing

Posted by Alexander Klimetschek <ak...@day.com>.

On Thu, Feb 18, 2010 at 18:35, ChadDavis <ch...@gmail.com> wrote:
> I'm looking for information on how to enable binary search indexing.
> I found documentation for pre-2.0 jackrabbit, and reference to the
> fact that Tika is now used internally for the binary indexing.
> However, I can't find any documentation of how to enable the binary
> indexing . . ..

It is enabled for all nt:file binaries, i.e. the jcr:content/jcr:data
property. The mimetype for text extraction is taken from the
jcr:content/jcr:mimeType property. I don't know if you can enable it
for other binary properties.
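
Since the extractor is selected from jcr:content/jcr:mimeType, it may be worth
checking what value is actually stored there when a file is uploaded; a
generic value such as application/octet-stream may leave nothing to extract
with. The following is a minimal sketch of choosing that value from the file
name. The class name, method name, and the small extension table are
illustrative assumptions, not anything Jackrabbit provides.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Jackrabbit or the JCR API.
class MimeTypeGuess {

    // Small illustrative table; extend it for the formats you store.
    private static final Map<String, String> BY_EXTENSION = Map.of(
            "pdf", "application/pdf",
            "doc", "application/msword",
            "txt", "text/plain");

    // Generic type used when the extension is missing or unknown.
    static final String FALLBACK = "application/octet-stream";

    // Returns the MIME type to store in jcr:mimeType for a file name.
    static String mimeTypeFor(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return FALLBACK;
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return BY_EXTENSION.getOrDefault(ext, FALLBACK);
    }

    public static void main(String[] args) {
        for (String name : new String[] {"report.pdf", "notes.TXT", "blob.bin"}) {
            System.out.println(name + " -> " + mimeTypeFor(name));
        }
    }
}
```

The returned string would be set as the jcr:mimeType property on the
jcr:content node before saving, alongside jcr:data.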

For the search configuration in general, see [1]. I don't know if the
TextExtractor config described there is still valid with respect to the
use of Tika in Jackrabbit 2.0.

[1] http://wiki.apache.org/jackrabbit/Search#Search_Configuration

Regards,
Alex

-- 
Alexander Klimetschek
alexander.klimetschek@day.com