Posted to dev@nutch.apache.org by Jukka Zitting <ju...@gmail.com> on 2006/07/17 23:59:48 UTC

Library for extracting text content from binaries

Hi,

I'm a committer of the Apache Jackrabbit project, and I've recently
been working on improving the full text indexing support in
Jackrabbit. We've used standard Lucene Java as the embedded full text
search engine in Jackrabbit, but created our own set of parsers for
extracting text content from binary files. So far our parser interface
TextFilter [1] has been Jackrabbit-specific, but my recent refactoring
proposal, TextExtractor, [2] aims for a generic solution that converts
a generic InputStream into a Reader for passing to Lucene Java.
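
To make the InputStream-to-Reader idea concrete, here is a minimal Java sketch. It is not the actual TextExtractor API from JCR-415: the names StreamTextExtractor, PlainTextExtractor and ExtractorDemo, the supports() method, and the UTF-8 fallback are all assumptions made for illustration. It depends only on the JRE.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical interface for illustration only; not the JCR-415 API.
interface StreamTextExtractor {
    /** Returns true if this extractor handles the given MIME type. */
    boolean supports(String contentType);

    /** Returns a Reader that yields the text content of the binary stream. */
    Reader extractText(InputStream stream, String contentType, String encoding)
            throws IOException;
}

// Trivial implementation for text/plain: decode the bytes with a charset.
class PlainTextExtractor implements StreamTextExtractor {
    @Override
    public boolean supports(String contentType) {
        return "text/plain".equals(contentType);
    }

    @Override
    public Reader extractText(InputStream stream, String contentType, String encoding)
            throws IOException {
        // Assumption: fall back to UTF-8 when no encoding is given.
        Charset charset = (encoding != null)
                ? Charset.forName(encoding)
                : StandardCharsets.UTF_8;
        return new InputStreamReader(stream, charset);
    }
}

class ExtractorDemo {
    // Drains a Reader into a String so the result can be inspected.
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        StreamTextExtractor extractor = new PlainTextExtractor();
        InputStream in = new ByteArrayInputStream(
                "hello world".getBytes(StandardCharsets.UTF_8));
        if (extractor.supports("text/plain")) {
            try (Reader reader = extractor.extractText(in, "text/plain", "UTF-8")) {
                System.out.println(readAll(reader));
            }
        }
    }
}
```

In an indexing setup the returned Reader would be handed to a Lucene field that accepts a Reader rather than drained into a String. Extractors for real binary formats would do their parsing inside extractText().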

Before coming up with the proposal I tried looking for similar
solutions, but couldn't find any that would have satisfied my
requirement of no external dependencies other than the JRE. Your
o.a.nutch.parse.Parser interface however came quite close, and you
already have an extensive set of existing implementations, so I'd like
to leverage your work with the Parser implementations while finding a
way to avoid the full Nutch and Hadoop dependencies. I believe that
there are a number of other Lucene users who have similar needs.

Thus I'd like to ask if there would be interest in making your Parser
interface and implementations more easily accessible to external
projects, perhaps as a separate library. If you're interested, I'd be
happy to participate in such an effort.

[1] http://svn.apache.org/viewvc/jackrabbit/trunk/jackrabbit/src/main/java/org/apache/jackrabbit/core/query/TextFilter.java?view=markup
[2] http://issues.apache.org/jira/browse/JCR-415


BR,

Jukka Zitting

-- 
Yukatan - http://yukatan.fi/ - info@yukatan.fi
Software craftsmanship, JCR consulting, and Java development

Re: Library for extracting text content from binaries

Posted by Michael Wechner <mi...@wyona.com>.

Jukka Zitting wrote:

> Hi,
>
> Any interest in this?


definitely :-)

Michi

> If not, is there some other Lucene project that
> I should approach?
>
> BR,
>
> Jukka Zitting


-- 
Michael Wechner
Wyona      -   Open Source Content Management   -    Apache Lenya
http://www.wyona.com                      http://lenya.apache.org
michael.wechner@wyona.com                        michi@apache.org
+41 44 272 91 61

Re: Library for extracting text content from binaries

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

On 7/24/06, Chris Mattmann <ch...@jpl.nasa.gov> wrote:
> Thanks for your email. Jerome Charron and I proposed a project with a
> similar goal in mind that we wanted to dub "Tika". Tika would effectively be
> a Lucene sub-project, and would factor out some of the capabilities you
> mention below from Nutch, incl:

Sounds very useful! Jackrabbit could certainly use not only the
generalized parser functionality but also the other proposed features
like language identifiers, etc. Count me in.

> If you're interested in this idea, maybe it would be a good idea to contact Jerome
> and me off-list, and maybe we could get going on a proposal.

OK.

BR,

Jukka Zitting

-- 
Yukatan - http://yukatan.fi/ - info@yukatan.fi
Software craftsmanship, JCR consulting, and Java development

RE: Library for extracting text content from binaries

Posted by Chris Mattmann <ch...@jpl.nasa.gov>.

Hi Jukka,

  Thanks for your email. Jerome Charron and I proposed a project with a
similar goal in mind that we wanted to dub "Tika". Tika would effectively be
a Lucene sub-project, and would factor out some of the capabilities you
mention below from Nutch, incl:

1. MimeType repository
2. Parser interface and Parser plugins
3. Metadata infrastructure
4. LanguageIdentifier

And a few others. Here is the mailing list thread discussion that we had a
few months back:

http://mail-archives.apache.org/mod_mbox/lucene-nutch-dev/200604.mbox/%3cc822c4ce0604070126h28af2c1du59319b28df91b971@mail.gmail.com%3e

Jerome and I have been quite busy lately, however, and we haven't had a
chance to draft the proposal to send to the Lucene PMC. Doug (and a few
others) told us that if we garner enough support and feel that the project
would make a significant contribution as its own Lucene sub-project, we
should email the PMC and see what happens. If you're interested in this
idea, maybe it would be a good idea to contact Jerome and me off-list, and
maybe we could get going on a proposal.

Thanks!

Cheers,
  Chris

______________________________________________
Chris A. Mattmann
Chris.Mattmann@jpl.nasa.gov 
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                        Mailstop:  171-246
_______________________________________________________

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.

> -----Original Message-----
> From: Jukka Zitting [mailto:jukka.zitting@gmail.com]
> Sent: Monday, July 24, 2006 11:29 AM
> To: nutch-dev@lucene.apache.org
> Subject: Re: Library for extracting text content from binaries
> 
> Hi,
> 
> Any interest in this? If not, is there some other Lucene project that
> I should approach?
> 
> BR,
> 
> Jukka Zitting
> 

Re: Library for extracting text content from binaries

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

Any interest in this? If not, is there some other Lucene project that
I should approach?

BR,

Jukka Zitting
