Posted to users@jackrabbit.apache.org by Ro...@ipaustralia.gov.au on 2010/02/19 01:30:55 UTC
Re: jackrabbit 2.0 binary search indexing [SEC=UNCLASSIFIED]
My binary files are all PDFs, so the text is extracted with PdfBox toolkit
and the full text becomes keyword searchable.
All done using the default configuration, except I extended nt:resource to
add a few attributes.
The mimeType attribute will be application/octet-stream.
Perhaps there is no plug-in that knows how to extract text from your
binary files?
From: ChadDavis <ch...@gmail.com>
To: users@jackrabbit.apache.org
Date: 19/02/2010 11:13 AM
Subject: Re: jackrabbit 2.0 binary search indexing
On Thu, Feb 18, 2010 at 2:39 PM, Alexander Klimetschek <ak...@day.com> wrote:
> On Thu, Feb 18, 2010 at 18:35, ChadDavis <ch...@gmail.com> wrote:
>> I'm looking for information on how to enable binary search indexing.
>> I found documentation for pre-2.0 jackrabbit, and reference to the
>> fact that Tika is now used internally for the binary indexing.
>> However, I can't find any documentation of how to enable the binary
>> indexing...
>
> It is enabled for all nt:file binaries, i.e. the jcr:content/jcr:data
> property. The mimetype for text extraction is taken from the
> jcr:content/jcr:mimeType property. I don't know if you can enable it
> for other binary properties.
>
Just to clarify: are you saying that binary indexing occurs automatically
as long as I'm using the JCR built-in node types for my binary file
storage, e.g. nt:file --> jcr:content <nt:resource> --> jcr:data (the
binary property holding my file)?
If so, then something's not working for me. Can you recommend some
troubleshooting tips? How can I determine whether the binaries are
being indexed? Note, I'm doing a full text search and it DOES hit
other node properties, etc.
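
Since the reply above says the MIME type used for text extraction is read from jcr:content/jcr:mimeType, and the thread suggests that a generic value such as application/octet-stream may leave no extractor able to handle the file, it can help to set that property deliberately when storing a file. What follows is only a minimal sketch: the class name MimeTypeGuess and its small extension table are hypothetical and are not part of Jackrabbit or the JCR API.

```java
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: picks a value to store in jcr:content/jcr:mimeType. */
final class MimeTypeGuess {
    // Small illustrative table (an assumption); extend it for the formats you store.
    private static final Map<String, String> BY_EXTENSION = Map.of(
            "pdf", "application/pdf",
            "doc", "application/msword",
            "txt", "text/plain");

    // Generic type used when the extension is missing or unknown.
    static final String FALLBACK = "application/octet-stream";

    private MimeTypeGuess() {}

    /** Returns the MIME type for the file name's extension, or FALLBACK. */
    static String forFileName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return FALLBACK;
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return BY_EXTENSION.getOrDefault(ext, FALLBACK);
    }

    public static void main(String[] args) {
        for (String name : new String[] {"report.pdf", "notes.TXT", "blob"}) {
            System.out.println(name + " -> " + forFileName(name));
        }
    }
}
```

The returned string would be stored with setProperty("jcr:mimeType", ...) on the jcr:content node when the file is added. If the helper falls back to application/octet-stream, that is a hint that the file may not end up in the full-text index.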
Re: jackrabbit 2.0 binary search indexing [SEC=UNCLASSIFIED]
Posted by ChadDavis <ch...@gmail.com>.
On Thu, Feb 18, 2010 at 5:41 PM, <Ro...@ipaustralia.gov.au> wrote:
> I only have a small dataset in my test application (<100 docs), and a
> document certainly takes only a few seconds to become available for
> keyword search.
I figured it out. User error.
Re: jackrabbit 2.0 binary search indexing [SEC=UNCLASSIFIED]
Posted by Ro...@ipaustralia.gov.au.
I only have a small dataset in my test application (<100 docs), and a
document certainly takes only a few seconds to become available for
keyword search.
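
One way to check whether a binary has been indexed is to search for a word that occurs only inside the file and not in any other property. The sketch below builds a JCR-SQL2 full-text query string for that purpose. The class name FullTextQuery is hypothetical, and searching nt:resource nodes is an assumption based on the nt:file --> jcr:content layout discussed in this thread. The sketch escapes only the SQL2 string literal and does not handle full-text operators inside the term.

```java
/** Hypothetical helper: builds a JCR-SQL2 full-text query statement. */
final class FullTextQuery {
    private FullTextQuery() {}

    /** Returns a statement that searches all properties of nt:resource nodes. */
    static String forTerm(String term) {
        // SQL2 string literals escape a single quote by doubling it.
        String literal = term.replace("'", "''");
        return "SELECT * FROM [nt:resource] AS r WHERE CONTAINS(r.*, '" + literal + "')";
    }

    public static void main(String[] args) {
        System.out.println(forTerm("quarterly"));
    }
}
```

The statement can be passed to session.getWorkspace().getQueryManager().createQuery(statement, Query.JCR_SQL2) and executed. If the query finds nothing for a word that is certainly in the file, the binary was probably not indexed.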
ChadDavis <ch...@gmail.com> wrote on 19/02/2010 11:33:27 AM:
> From: ChadDavis <ch...@gmail.com>
> To: users@jackrabbit.apache.org
> Date: 19/02/2010 11:34 AM
> Subject: Re: jackrabbit 2.0 binary search indexing [SEC=UNCLASSIFIED]
>
> On Thu, Feb 18, 2010 at 5:30 PM, <Ro...@ipaustralia.gov.au> wrote:
> > My binary files are all PDFs, so the text is extracted with PdfBox toolkit
> > and the full text becomes keyword searchable.
> > All done using the default configuration, except I extended nt:resource to
> > add a few attributes.
> >
> > The mimeType attribute will be application/octet-stream.
> > Perhaps there is no plug-in that knows how to extract text from your binary
> > files?
>
> I tried a PDF, a Word document, and a plain text file. How long does it
> take for a document to be indexed?
Re: jackrabbit 2.0 binary search indexing [SEC=UNCLASSIFIED]
Posted by ChadDavis <ch...@gmail.com>.
On Thu, Feb 18, 2010 at 5:30 PM, <Ro...@ipaustralia.gov.au> wrote:
> My binary files are all PDFs, so the text is extracted with PdfBox toolkit
> and the full text becomes keyword searchable.
> All done using the default configuration, except I extended nt:resource to
> add a few attributes.
>
> The mimeType attribute will be application/octet-stream.
> Perhaps there is no plug-in that knows how to extract text from your binary
> files?
I tried a PDF, a Word document, and a plain text file. How long does it
take for a document to be indexed?