Posted to dev@jackrabbit.apache.org by Martin Perez <mp...@gmail.com> on 2005/10/27 14:34:35 UTC
Searching in file contents
Hi again. Here goes another one about searching.
I'm storing files in Jackrabbit for later searching (how innovative!).
I'm storing the content using the "jcr:data" property:
node.setProperty("jcr:data",inputstream), where inputstream is the stream with
the file contents.
The problem is that I don't know how to search within that content later.
The content is sometimes binary (images, video, PDFs, ...) and sometimes
text (HTML, XML, TXT, ...). Currently I'm using the following query statement:
//*[jcr:contains(@jcr:data,'phrase')]
So, first question: how do I search within stream properties?
And the second one. I'm migrating a repository system that was based on
Lucene. In that system, I followed this process to index
binary content:
1 - Try to extract the text from the file (PDF extractors, Word extractors,
Excel extractors, etc.)
2 - Store the file contents in database or filesystem storage
3 - Index the text content.
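A minimal sketch of the three steps above, assuming a toy in-memory inverted index in place of Lucene and a map in place of the database or filesystem. The class and method names (ExtractStoreIndex, add, search, load) are hypothetical, and the "extractor" simply decodes UTF-8, whereas real extractors are format specific.

```java
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class ExtractStoreIndex {
    // Step 2 stand-in: original bytes kept once, keyed by document id.
    private final Map<String, byte[]> storage = new HashMap<>();
    // Step 3 stand-in: term -> ids of the documents that contain it.
    private final Map<String, Set<String>> index = new HashMap<>();

    /** Step 1 stand-in: treats the bytes as UTF-8 text (an assumption). */
    static String extractText(byte[] content) {
        return new String(content, StandardCharsets.UTF_8);
    }

    void add(String id, byte[] content) {
        String text = extractText(content);   // 1 - extract
        storage.put(id, content);             // 2 - store the original bytes
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty()) {            // 3 - index the extracted terms
                index.computeIfAbsent(term, k -> new TreeSet<>()).add(id);
            }
        }
    }

    Set<String> search(String term) {
        Set<String> ids = index.get(term.toLowerCase(Locale.ROOT));
        return ids == null ? Collections.<String>emptySet() : ids;
    }

    byte[] load(String id) {
        return storage.get(id);
    }

    public static void main(String[] args) {
        ExtractStoreIndex repo = new ExtractStoreIndex();
        repo.add("doc1", "Hello Jackrabbit".getBytes(StandardCharsets.UTF_8));
        System.out.println(repo.search("jackrabbit")); // [doc1]
        System.out.println(repo.search("missing"));    // []
    }
}
```

Note that the extracted text only feeds the index here; it is never stored next to the original bytes.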
But now I have the problem of how to handle Word, PDF, Excel, etc. One
option is to extract the text and store both "extracted-text" and "content"
as properties, but this would duplicate storage for these files.
So, how would you handle storage and searching within binary documents like
PDF or Word files?
Thanks!
Martin
Re: Searching in file contents
Posted by Martin Perez <mp...@gmail.com>.
That's great, Marcel.
I'll take a look at it.
Thanks,
Martin
Re: Searching in file contents
Posted by Marcel Reutegger <ma...@gmx.net>.
Hi Martin,
Oh, I forgot: of course you also need the jar files that the text
filters depend on, such as pdfbox. The project.xml file
lists them all.
The mapping is more or less hard-coded in each text filter
implementation. Jackrabbit asks every known text filter
implementation whether it supports a given MIME type, and lets
the filter do its work if that call returns true.
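The lookup described above can be sketched roughly as follows. This is only an illustration in plain Java: the TextFilter interface, its method names, and both filter classes are hypothetical stand-ins, not Jackrabbit's actual classes, and the "PDF" filter returns canned text where a real one would call a library such as pdfbox.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

class FilterLookup {

    /** Hypothetical contract for illustration; not Jackrabbit's real interface. */
    interface TextFilter {
        boolean canFilter(String mimeType);
        Reader extract(InputStream data) throws IOException;
    }

    /** Each filter hard-codes the MIME type it understands. */
    static class PlainTextFilter implements TextFilter {
        @Override public boolean canFilter(String mimeType) {
            return "text/plain".equalsIgnoreCase(mimeType);
        }
        @Override public Reader extract(InputStream data) {
            return new InputStreamReader(data, StandardCharsets.UTF_8); // UTF-8 assumed
        }
    }

    /** Stand-in for a PDF filter; ignores its input and returns canned text. */
    static class FakePdfFilter implements TextFilter {
        @Override public boolean canFilter(String mimeType) {
            return "application/pdf".equalsIgnoreCase(mimeType);
        }
        @Override public Reader extract(InputStream data) {
            return new StringReader("text pulled out of the pdf");
        }
    }

    /** Asks every filter in turn; the first one that accepts the type does the work. */
    static Optional<String> extractText(List<TextFilter> filters, String mimeType, byte[] data) {
        for (TextFilter filter : filters) {
            if (filter.canFilter(mimeType)) {
                try (Reader reader = filter.extract(new ByteArrayInputStream(data))) {
                    StringBuilder text = new StringBuilder();
                    char[] buffer = new char[1024];
                    int n;
                    while ((n = reader.read(buffer)) != -1) {
                        text.append(buffer, 0, n);
                    }
                    return Optional.of(text.toString());
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }
        }
        return Optional.empty(); // no filter claimed this MIME type
    }

    public static void main(String[] args) {
        List<TextFilter> filters = Arrays.asList(new PlainTextFilter(), new FakePdfFilter());
        byte[] hello = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(extractText(filters, "text/plain", hello));            // Optional[hello]
        System.out.println(extractText(filters, "image/png", hello).isPresent()); // false
    }
}
```

With this shape, a type that no filter claims (image/png in the example) simply yields no text to index.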
regards
marcel
Re: Searching in file contents
Posted by Martin Perez <mp...@gmail.com>.
Marcel, that is interesting.
So to get it working you only have to add those contrib .jar files to the
classpath and set the correct MIME type? Is there any place where you can
look up the mappings between MIME types and converter classes? I suppose by
looking into the contrib code :D (sorry for these obvious questions, but I can't
look at the contrib code until tonight ;))
Thanks
Martin
Re: Searching in file contents
Posted by Marcel Reutegger <ma...@gmx.net>.
Hi Martin,
Jackrabbit comes with an extension mechanism that allows you to plug in
text filters. Those filters basically convert a binary stream into a
character stream that can be indexed by Lucene.
The core classes contain a sample implementation that filters binaries
of type text/plain (also not very innovative, but it takes the
encoding into account, which is at least something ;)).
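As a rough illustration of what "takes the encoding into account" means for text/plain, here is a minimal plain-Java sketch of turning a binary stream into a character stream. The class and method names are hypothetical, and the UTF-8 fallback for a missing or unknown encoding is an assumption rather than Jackrabbit's documented behaviour.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class PlainTextDecoding {

    /** Wraps a binary stream in a Reader, using the declared encoding when it is usable. */
    static Reader toReader(InputStream data, String encoding) {
        Charset charset = StandardCharsets.UTF_8; // assumed fallback
        if (encoding != null) {
            try {
                charset = Charset.forName(encoding);
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: keep the fallback.
            }
        }
        return new InputStreamReader(data, charset);
    }

    /** Convenience helper: decodes a whole byte array into a String. */
    static String decode(byte[] data, String encoding) {
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[1024];
        try (Reader reader = toReader(new ByteArrayInputStream(data), encoding)) {
            int n;
            while ((n = reader.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // "café" encoded as ISO-8859-1: the final byte 0xE9 is not valid UTF-8 on its own.
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("caf\u00e9".equals(decode(latin1, "ISO-8859-1"))); // true
        System.out.println("caf\u00e9".equals(decode(latin1, null)));         // false
    }
}
```

Decoding with the wrong charset silently corrupts the indexed terms, which is why even a plain-text filter needs to know the encoding.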
There are additional text filters in contrib, if I remember correctly
for some MS Office documents and PDF.
Simply build the text filter contrib and put it on the classpath; that
should do it.
Btw, this mechanism doesn't need an additional property to store the
text version of the binary.
regards
marcel