Posted to user@nutch.apache.org by Andrzej Bialecki <ab...@getopt.org> on 2006/09/08 10:22:44 UTC

Re: Indexing MS Powerpoint files with Lucene

(moved to nutch-user)

Tomi NA wrote:
> On 9/7/06, Andrzej Bialecki <ab...@getopt.org> wrote:
>> Tomi NA wrote:
>> > On 9/7/06, Nick Burch <ni...@torchbox.com> wrote:
>> >> On Thu, 7 Sep 2006, Tomi NA wrote:
>> >> > On 9/7/06, Venkateshprasanna <pr...@yahoo.co.in> wrote:
>> >> >> Is there any filter available for extracting text from MS
>> >> >> Powerpoint files and indexing them?
>> >> >> The lucene website suggests the POI project, which, it seems
>> >> >> does not support PPT files as of now.
>> >> >
>> >> > http://jakarta.apache.org/poi/hslf/index.html
>> >> >
>> >> > It doesn't say poi doesn't support ppt. It just says support is
>> >> > limited. Don't know exactly how limited, but certainly not useless
>> >> > for indexing purposes.
>> >>
>> >> Support for editing and adding things to PowerPoint files is
>> >> limited, as is getting out the finer points of fonts and positioning.
>> >
>> > Which brings me to another (off)topic: can lucene/nutch assign
>> > different weights to tokens in the same document field? An obvious
>> > example would be: "this text seems to be in large, bold, blinking
>> > letters: I'll assume it's more important than the surrounding 8px
>> > text."
>>
>> No, it can't (at least not yet). As a workaround you can extract these
>> portions of text to another field (or multiple fields), and then add
>> them with a higher boost. Then, expand your queries so that they include
>> also this field. This way, if query matches these special tokens,
>> results will get higher rank because of matching on this boosted field.
>
> I thought a workaround like that would be needed. Still, it could give
> useful results...though as a nutch user, the possibility is mostly
> theoretical for me, as probably none of the existing parsers take into
> account the formatting information. I could be completely wrong here,
> so please, feel free to correct me.

You can write an HtmlParseFilter that extracts these portions of
text and puts them into ParseData.metadata. Then, during indexing, you can
check whether such metadata exists and, if so, add it as separate fields.
You will also need to modify the QueryFilters so that user queries are
expanded to include clauses for these additional fields.
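
To make the idea concrete, here is a minimal sketch of the boosted-field
workaround in plain Java. It deliberately does not use the Nutch or Lucene
APIs: the class name, the field names "content" and "emphasized", the boost
value of 3.0, the regex-based tag handling and the term-count scoring are all
assumptions made up for illustration. It only shows the three steps: pull the
emphasized text into its own field, keep the full text in the body field, and
score an expanded query that matches both fields with a higher weight on the
emphasized one.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: a toy stand-in for parse filter + indexer + query expansion.
class BoostedFieldSketch {
    // Hypothetical field names and boost, not taken from Nutch or Lucene.
    static final String BODY = "content";
    static final String EMPHASIZED = "emphasized";
    static final double EMPHASIZED_BOOST = 3.0;

    // Treat <b>, <strong> and <h1> as "important" markup (an assumption).
    private static final Pattern EMPHASIS = Pattern.compile(
            "<(b|strong|h1)>(.*?)</\\1>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    // Parse/index step: full text goes to one field, emphasized text to another.
    static Map<String, String> buildDocument(String html) {
        List<String> emphasized = new ArrayList<>();
        Matcher m = EMPHASIS.matcher(html);
        while (m.find()) {
            emphasized.add(stripTags(m.group(2)));
        }
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put(BODY, stripTags(html));
        doc.put(EMPHASIZED, String.join(" ", emphasized));
        return doc;
    }

    // Replace tags with spaces, then collapse runs of whitespace.
    static String stripTags(String s) {
        return TAG.matcher(s).replaceAll(" ").trim().replaceAll("\\s+", " ");
    }

    // Query step: the user's term is matched against both fields, and
    // matches in the emphasized field count EMPHASIZED_BOOST times as much.
    static double score(Map<String, String> doc, String term) {
        return count(doc.get(BODY), term)
                + EMPHASIZED_BOOST * count(doc.get(EMPHASIZED), term);
    }

    // Case-insensitive count of whole-token occurrences of term in text.
    static int count(String text, String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        int n = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (token.equals(needle)) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        Map<String, String> bold = buildDocument("<p>The <b>budget</b> slides</p>");
        Map<String, String> plain = buildDocument("<p>The budget slides</p>");
        System.out.println("bold:  " + score(bold, "budget"));
        System.out.println("plain: " + score(plain, "budget"));
    }
}
```

Both documents contain the same words, but the one where the term was
emphasized scores higher, because it also matches in the boosted field. In a
real setup the extraction would live in the parse filter, the extra field
would be added at indexing time, and the two-field match would come from the
expanded query rather than from a hand-written score function.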

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

Re: Indexing MS Powerpoint files with Lucene

Posted by Tomi NA <he...@gmail.com>.

On 9/8/06, Andrzej Bialecki <ab...@getopt.org> wrote:
> (moved to nutch-user)
>
> Tomi NA wrote:
> > On 9/7/06, Andrzej Bialecki <ab...@getopt.org> wrote:
> >> Tomi NA wrote:
> >> > On 9/7/06, Nick Burch <ni...@torchbox.com> wrote:
> >> >> On Thu, 7 Sep 2006, Tomi NA wrote:
> >> >> > On 9/7/06, Venkateshprasanna <pr...@yahoo.co.in> wrote:
> >> >> >> Is there any filter available for extracting text from MS
> >> >> >> Powerpoint files and indexing them?
> >> >> >> The lucene website suggests the POI project, which, it seems
> >> >> >> does not support PPT files as of now.
> >> >> >
> >> >> > http://jakarta.apache.org/poi/hslf/index.html
> >> >> >
> >> >> > It doesn't say poi doesn't support ppt. It just says support is
> >> >> > limited. Don't know exactly how limited, but certainly not useless
> >> >> > for indexing purposes.
> >> >>
> >> >> Support for editing and adding things to PowerPoint files is
> >> >> limited, as is getting out the finer points of fonts and positioning.
> >> >
> >> > Which brings me to another (off)topic: can lucene/nutch assign
> >> > different weights to tokens in the same document field? An obvious
> >> > example would be: "this text seems to be in large, bold, blinking
> >> > letters: I'll assume it's more important than the surrounding 8px
> >> > text."
> >>
> >> No, it can't (at least not yet). As a workaround you can extract these
> >> portions of text to another field (or multiple fields), and then add
> >> them with a higher boost. Then, expand your queries so that they include
> >> also this field. This way, if query matches these special tokens,
> >> results will get higher rank because of matching on this boosted field.
> >
> > I thought a workaround like that would be needed. Still, it could give
> > useful results...though as a nutch user, the possibility is mostly
> > theoretical for me, as probably none of the existing parsers take into
> > account the formatting information. I could be completely wrong here,
> > so please, feel free to correct me.
>
> You can write an HtmlParseFilter that extracts these portions of
> text and puts them into ParseData.metadata. Then, during indexing, you can
> check whether such metadata exists and, if so, add it as separate fields.
> You will also need to modify the QueryFilters so that user queries are
> expanded to include clauses for these additional fields.

Thanks Andrzej, I understand the concepts involved now. If the need
arises, I'll see what I can do about making it work as intended.

t.n.a.