Posted to java-user@lucene.apache.org by Venkateshprasanna <pr...@yahoo.co.in> on 2006/09/07 08:41:12 UTC

Indexing MS Powerpoint files with Lucene

Is there any filter available for extracting text from MS PowerPoint files
and indexing them?
The Lucene website suggests the POI project, which, it seems, does not
support PPT files as of now.

Regards,
Venkateshprasanna

-- 
View this message in context: http://www.nabble.com/which-way-to-index-pdf%2Cword%2Cexcel-tf2224468.html#a6185039
Sent from the Lucene - Java Users forum at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Indexing MS Powerpoint files with Lucene

Posted by Tomi NA <he...@gmail.com>.
On 9/8/06, Andrzej Bialecki <ab...@getopt.org> wrote:
> (moved to nutch-user)
>
> Tomi NA wrote:
> > On 9/7/06, Andrzej Bialecki <ab...@getopt.org> wrote:
> >> Tomi NA wrote:
> >> > On 9/7/06, Nick Burch <ni...@torchbox.com> wrote:
> >> >> On Thu, 7 Sep 2006, Tomi NA wrote:
> >> >> > On 9/7/06, Venkateshprasanna <pr...@yahoo.co.in> wrote:
> >> >> >> Is there any filter available for extracting text from MS
> >> >> Powerpoint files
> >> >> >> and indexing them?
> >> >> >> The lucene website suggests the POI project, which, it seems
> >> does not
> >> >> >> support PPT files as of now.
> >> >> >
> >> >> > http://jakarta.apache.org/poi/hslf/index.html
> >> >> >
> >> >> > It doesn't say poi doesn't support ppt. It just says support is
> >> >> limited.
> >> >> > Don't know exactly how limited, but certainly not useless for
> >> indexing
> >> >> > purposes.
> >> >>
> >> >> Support for editing and adding things to PowerPoint files is
> >> limited, as
> >> >> is getting out the finer points of fonts and positioning.
> >> >
> >> > Which brings me to another (off)topic: can lucene/nutch assign
> >> > different weights to tokens in the same document field? An obvious
> >> > example would be: "this text seems to be in large, bold, blinking
> >> > letters: I'll assume it's more important than the surrounding 8px
> >> > text."
> >>
> >> No, it can't (at least not yet). As a workaround you can extract these
> >> portions of text to another field (or multiple fields), and then add
> >> them with a higher boost. Then, expand your queries so that they include
> >> also this field. This way, if query matches these special tokens,
> >> results will get higher rank because of matching on this boosted field.
> >
> > I thought a workaround like that would be needed. Still, it could give
> > useful results...though as a nutch user, the possibility is mostly
> > theoretical for me, as probably none of the existing parsers take into
> > account the formatting information. I could be completely wrong here,
> > so please, feel free to correct me.
>
> You can write a HtmlParseFilter, which will extract these portions of
> text and put them into ParseData.metadata. Then, during indexing you can
> check if such metadata exists and if yes - add it as separate fields.
> You will need also to modify the QueryFilters, to expand user queries to
> also include clauses for these additional fields.

Thanks Andrzej, I understand the concepts involved now. If the need
arises, I'll see what I can do about making it work as intended.

t.n.a.

Re: Indexing MS Powerpoint files with Lucene

Posted by Andrzej Bialecki <ab...@getopt.org>.
(moved to nutch-user)

Tomi NA wrote:
> On 9/7/06, Andrzej Bialecki <ab...@getopt.org> wrote:
>> Tomi NA wrote:
>> > On 9/7/06, Nick Burch <ni...@torchbox.com> wrote:
>> >> On Thu, 7 Sep 2006, Tomi NA wrote:
>> >> > On 9/7/06, Venkateshprasanna <pr...@yahoo.co.in> wrote:
>> >> >> Is there any filter available for extracting text from MS
>> >> Powerpoint files
>> >> >> and indexing them?
>> >> >> The lucene website suggests the POI project, which, it seems 
>> does not
>> >> >> support PPT files as of now.
>> >> >
>> >> > http://jakarta.apache.org/poi/hslf/index.html
>> >> >
>> >> > It doesn't say poi doesn't support ppt. It just says support is
>> >> limited.
>> >> > Don't know exactly how limited, but certainly not useless for 
>> indexing
>> >> > purposes.
>> >>
>> >> Support for editing and adding things to PowerPoint files is 
>> limited, as
>> >> is getting out the finer points of fonts and positioning.
>> >
>> > Which brings me to another (off)topic: can lucene/nutch assign
>> > different weights to tokens in the same document field? An obvious
>> > example would be: "this text seems to be in large, bold, blinking
>> > letters: I'll assume it's more important than the surrounding 8px
>> > text."
>>
>> No, it can't (at least not yet). As a workaround you can extract these
>> portions of text to another field (or multiple fields), and then add
>> them with a higher boost. Then, expand your queries so that they include
>> also this field. This way, if query matches these special tokens,
>> results will get higher rank because of matching on this boosted field.
>
> I thought a workaround like that would be needed. Still, it could give
> useful results...though as a nutch user, the possibility is mostly
> theoretical for me, as probably none of the existing parsers take into
> account the formatting information. I could be completely wrong here,
> so please, feel free to correct me.

You can write an HtmlParseFilter that extracts these portions of
text and puts them into ParseData.metadata. Then, during indexing, you can
check whether such metadata exists and, if so, add it as separate fields.
You will also need to modify the QueryFilters to expand user queries to
include clauses for these additional fields.
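
The parse-then-index flow described above can be sketched roughly as
follows. This is a toy model in plain Java, not the Nutch plugin API: the
class name, the "emphasized" metadata key, the field names, and the
regex-based tag matching are all assumptions made for illustration. A real
filter would work on the parsed document rather than on raw HTML, and the
query-side expansion is not shown here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy stand-in for the parse-filter and indexing steps; not the Nutch API. */
class EmphasisFilterSketch {

    /** Hypothetical metadata key and index field name for emphasized text. */
    static final String EMPHASIS_KEY = "emphasized";

    /** Matches text inside b, strong, and h1..h3 tags (good enough for a sketch). */
    static final Pattern EMPHASIS = Pattern.compile(
            "<(b|strong|h[1-3])\\b[^>]*>(.*?)</\\1>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Parse step: collect emphasized text into a metadata map. */
    static Map<String, String> extractMetadata(String html) {
        List<String> parts = new ArrayList<>();
        Matcher m = EMPHASIS.matcher(html);
        while (m.find()) {
            // Drop any nested tags and keep only the visible text.
            String text = m.group(2).replaceAll("<[^>]+>", " ").trim();
            if (!text.isEmpty()) {
                parts.add(text);
            }
        }
        Map<String, String> metadata = new HashMap<>();
        if (!parts.isEmpty()) {
            metadata.put(EMPHASIS_KEY, String.join(" ", parts));
        }
        return metadata;
    }

    /** Indexing step: always add the body; add the extra field only if metadata exists. */
    static Map<String, String> buildFields(String html, Map<String, String> metadata) {
        Map<String, String> fields = new HashMap<>();
        String body = html.replaceAll("<[^>]+>", " ").replaceAll("\\s+", " ").trim();
        fields.put("content", body);
        String emphasized = metadata.get(EMPHASIS_KEY);
        if (emphasized != null) {
            fields.put(EMPHASIS_KEY, emphasized);
        }
        return fields;
    }

    public static void main(String[] args) {
        String html = "<p>Intro <b>Lucene</b> text</p><h1>Big News</h1>";
        Map<String, String> metadata = extractMetadata(html);
        System.out.println(buildFields(html, metadata));
    }
}
```

Keeping the emphasized text in its own field is what allows it to be
boosted separately from the body at indexing time.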

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Indexing MS Powerpoint files with Lucene

Posted by Tomi NA <he...@gmail.com>.
On 9/7/06, Andrzej Bialecki <ab...@getopt.org> wrote:
> Tomi NA wrote:
> > On 9/7/06, Nick Burch <ni...@torchbox.com> wrote:
> >> On Thu, 7 Sep 2006, Tomi NA wrote:
> >> > On 9/7/06, Venkateshprasanna <pr...@yahoo.co.in> wrote:
> >> >> Is there any filter available for extracting text from MS
> >> Powerpoint files
> >> >> and indexing them?
> >> >> The lucene website suggests the POI project, which, it seems does not
> >> >> support PPT files as of now.
> >> >
> >> > http://jakarta.apache.org/poi/hslf/index.html
> >> >
> >> > It doesn't say poi doesn't support ppt. It just says support is
> >> limited.
> >> > Don't know exactly how limited, but certainly not useless for indexing
> >> > purposes.
> >>
> >> Support for editing and adding things to PowerPoint files is limited, as
> >> is getting out the finer points of fonts and positioning.
> >
> > Which brings me to another (off)topic: can lucene/nutch assign
> > different weights to tokens in the same document field? An obvious
> > example would be: "this text seems to be in large, bold, blinking
> > letters: I'll assume it's more important than the surrounding 8px
> > text."
>
> No, it can't (at least not yet). As a workaround you can extract these
> portions of text to another field (or multiple fields), and then add
> them with a higher boost. Then, expand your queries so that they include
> also this field. This way, if query matches these special tokens,
> results will get higher rank because of matching on this boosted field.

I thought a workaround like that would be needed. Still, it could give
useful results...though as a nutch user, the possibility is mostly
theoretical for me, as probably none of the existing parsers take into
account the formatting information. I could be completely wrong here,
so please, feel free to correct me.

t.n.a.



Re: Indexing MS Powerpoint files with Lucene

Posted by Andrzej Bialecki <ab...@getopt.org>.
Tomi NA wrote:
> On 9/7/06, Nick Burch <ni...@torchbox.com> wrote:
>> On Thu, 7 Sep 2006, Tomi NA wrote:
>> > On 9/7/06, Venkateshprasanna <pr...@yahoo.co.in> wrote:
>> >> Is there any filter available for extracting text from MS 
>> Powerpoint files
>> >> and indexing them?
>> >> The lucene website suggests the POI project, which, it seems does not
>> >> support PPT files as of now.
>> >
>> > http://jakarta.apache.org/poi/hslf/index.html
>> >
>> > It doesn't say poi doesn't support ppt. It just says support is 
>> limited.
>> > Don't know exactly how limited, but certainly not useless for indexing
>> > purposes.
>>
>> Support for editing and adding things to PowerPoint files is limited, as
>> is getting out the finer points of fonts and positioning.
>
> Which brings me to another (off)topic: can lucene/nutch assign
> different weights to tokens in the same document field? An obvious
> example would be: "this text seems to be in large, bold, blinking
> letters: I'll assume it's more important than the surrounding 8px
> text."

No, it can't (at least not yet). As a workaround, you can extract these
portions of text to another field (or multiple fields) and add them
with a higher boost. Then expand your queries so that they also include
this field. This way, if a query matches these special tokens, the
results will rank higher because of the match on this boosted field.
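
To make the effect of the workaround concrete, here is a minimal sketch in
plain Java. It is a toy scoring model, not the Lucene API and not Lucene's
actual scoring formula: the class name, the field names "content" and
"emphasized", and the boost value 3.0 are all assumptions chosen for
illustration. The only point it shows is that a query expanded over both
fields scores a match in the boosted field higher than the same match in
the ordinary field.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model of the "boosted extra field" workaround; not the Lucene API. */
class BoostedFieldSketch {

    /** Hypothetical per-field boosts: emphasized text counts three times as much. */
    static final Map<String, Double> FIELD_BOOSTS = new LinkedHashMap<>();
    static {
        FIELD_BOOSTS.put("content", 1.0);
        FIELD_BOOSTS.put("emphasized", 3.0);
    }

    /** Counts whole-word, case-insensitive occurrences of term in text. */
    static int termFrequency(String text, String term) {
        String wanted = term.toLowerCase();
        int count = 0;
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.equals(wanted)) {
                count++;
            }
        }
        return count;
    }

    /** Expanded query: look for the term in every field, weighting each by its boost. */
    static double score(Map<String, String> doc, String term) {
        double total = 0.0;
        for (Map.Entry<String, Double> boost : FIELD_BOOSTS.entrySet()) {
            String text = doc.getOrDefault(boost.getKey(), "");
            total += boost.getValue() * termFrequency(text, term);
        }
        return total;
    }

    /** Builds a two-field document; emphasized holds the text extracted from the body. */
    static Map<String, String> doc(String content, String emphasized) {
        Map<String, String> fields = new HashMap<>();
        fields.put("content", content);
        fields.put("emphasized", emphasized);
        return fields;
    }

    public static void main(String[] args) {
        Map<String, String> passing = doc("a passing mention of lucene", "");
        Map<String, String> headline = doc("indexing overview", "Lucene");
        System.out.println("passing mention:  " + score(passing, "lucene"));
        System.out.println("emphasized match: " + score(headline, "lucene"));
    }
}
```

In a real index the boost would be set on the extra field at indexing time
and the query expansion would add a clause per field; the arithmetic above
only mirrors the idea.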

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com





Re: Indexing MS Powerpoint files with Lucene

Posted by Tomi NA <he...@gmail.com>.
On 9/7/06, Nick Burch <ni...@torchbox.com> wrote:
> On Thu, 7 Sep 2006, Tomi NA wrote:
> > On 9/7/06, Venkateshprasanna <pr...@yahoo.co.in> wrote:
> >> Is there any filter available for extracting text from MS Powerpoint files
> >> and indexing them?
> >> The lucene website suggests the POI project, which, it seems does not
> >> support PPT files as of now.
> >
> > http://jakarta.apache.org/poi/hslf/index.html
> >
> > It doesn't say poi doesn't support ppt. It just says support is limited.
> > Don't know exactly how limited, but certainly not useless for indexing
> > purposes.
>
> Support for editing and adding things to PowerPoint files is limited, as
> is getting out the finer points of fonts and positioning.

Which brings me to another (off)topic: can lucene/nutch assign
different weights to tokens in the same document field? An obvious
example would be: "this text seems to be in large, bold, blinking
letters: I'll assume it's more important than the surrounding 8px
text."

t.n.a.



Re: Indexing MS Powerpoint files with Lucene

Posted by Nick Burch <ni...@torchbox.com>.
On Thu, 7 Sep 2006, Tomi NA wrote:
> On 9/7/06, Venkateshprasanna <pr...@yahoo.co.in> wrote:
>> Is there any filter available for extracting text from MS Powerpoint files
>> and indexing them?
>> The lucene website suggests the POI project, which, it seems does not
>> support PPT files as of now.
>
> http://jakarta.apache.org/poi/hslf/index.html
>
> It doesn't say poi doesn't support ppt. It just says support is limited. 
> Don't know exactly how limited, but certainly not useless for indexing 
> purposes.

Support for editing and adding things to PowerPoint files is limited, as 
is getting out the finer points of fonts and positioning.

Getting text out should "just work" for you. The only thing you'll need to 
decide is if you want hslf.PowerPointExtractor to give you slide and notes 
text, or just slide text :)

Nick



Re: Indexing MS Powerpoint files with Lucene

Posted by Tomi NA <he...@gmail.com>.
On 9/7/06, Venkateshprasanna <pr...@yahoo.co.in> wrote:
>
> Is there any filter available for extracting text from MS Powerpoint files
> and indexing them?
> The lucene website suggests the POI project, which, it seems does not
> support PPT files as of now.

http://jakarta.apache.org/poi/hslf/index.html

It doesn't say poi doesn't support ppt. It just says support is
limited. Don't know exactly how limited, but certainly not useless for
indexing purposes.
Don't know if the plugin works, though. :)

t.n.a.



Re: Indexing MS Powerpoint files with Lucene

Posted by Gopikrishnan Subramani <go...@gmail.com>.
Did you check the POI javadocs? Look for
org.apache.poi.hslf.extractor.PowerPointExtractor. It's one of the most
straightforward classes in POI as far as extracting text for indexing is
concerned.

-Gopi

On 9/7/06, Venkateshprasanna <pr...@yahoo.co.in> wrote:
>
>
> Is there any filter available for extracting text from MS Powerpoint files
> and indexing them?
> The lucene website suggests the POI project, which, it seems does not
> support PPT files as of now.
>
> Regards,
> Venkateshprasanna