Posted to java-user@lucene.apache.org by James liu <li...@gmail.com> on 2006/09/06 04:14:24 UTC

which way to index pdf,word,excel

I've run into many problems with LIUS, so I want to give it up and find something else.

Can anyone recommend an alternative?

Re: which way to index pdf,word,excel

Posted by Christiaan Fluit <ch...@aduna-software.com>.

Have a look at Aperture: http://aperture.sourceforge.net/
It provides components for crawling and text and metadata extraction. 
It's still in alpha stage though. The development code in CVS has 
already improved a lot over the last official alpha release.

Chris
--

James liu wrote:
> i wanna find frame which can index xml,word,excel,pdf,,,not one.
> 
> 
> 2006/9/6, Doron Cohen <DO...@il.ibm.com>:
>>
>> Lucene FAQ - http://wiki.apache.org/jakarta-lucene/LuceneFAQ - has a few
>> entries just for this:
>>
>>   How can I index HTML documents?
>>   How can I index XML documents?
>>   How can I index OpenOffice.org files?
>>   How can I index MS-Word documents?
>>   How can I index MS-Excel documents?
>>   How can I index MS-Powerpoint documents?
>>   How can I index Email (from MS-Exchange or another IMAP server) ?
>>   How can I index RTF documents?
>>   How can I index PDF documents?
>>   How can I index JSP files?
>>
>>
>> "James liu" <li...@gmail.com> wrote on 05/09/2006 19:14:24:
>>
>> > i find lius many question ,,,,so i wanna give up and find new.
>> >
>> > who recommend ?
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
> 


Kind regards,

Christiaan Fluit
-- 
Aduna - Guided Exploration
www.aduna-software.com

Prinses Julianaplein 14-b
3817 CS Amersfoort
The Netherlands
+31-33-4659987 (office)

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: which way to index pdf,word,excel

Posted by James liu <li...@gmail.com>.

Thanks, Cohen and Lin.



2006/9/6, Doron Cohen <DO...@il.ibm.com>:
>
> I think that Nutch would crawl and search all these 3 types. Not sure that
> Nutch would provide the framework you seem to look for, but perhaps it is
> worth to take a look - http://lucene.apache.org/nutch/
>
> "James liu" <li...@gmail.com> wrote on 05/09/2006 23:10:16:
>
> > i wanna find frame which can index xml,word,excel,pdf,,,not one.
> >
> > i just wanna know who know the frame like what i wanna.
> >
> >
> > 2006/9/6, yueyu lin <po...@gmail.com>:
> > >
> > > First, Lucene is just a index toolkit, you have to USE it to implement
> > > your application.
> > >
> > > If you want to index something, you must have knowledge how to extract
> > > information from them and what kind of keys they need to be set.
> > >
> > > Then you can do what you want to.
> > > On 9/5/06, James liu <li...@gmail.com> wrote:
> > > >
> > > > i wanna find frame which can index xml,word,excel,pdf,,,not one.
> > > >
> > > >
> > > > 2006/9/6, Doron Cohen <DO...@il.ibm.com>:
> > > > >
> > > > > Lucene FAQ - http://wiki.apache.org/jakarta-lucene/LuceneFAQ - has a few
> > > > > entries just for this:
> > > > >
> > > > >   How can I index HTML documents?
> > > > >   How can I index XML documents?
> > > > >   How can I index OpenOffice.org files?
> > > > >   How can I index MS-Word documents?
> > > > >   How can I index MS-Excel documents?
> > > > >   How can I index MS-Powerpoint documents?
> > > > >   How can I index Email (from MS-Exchange or another IMAP server) ?
> > > > >   How can I index RTF documents?
> > > > >   How can I index PDF documents?
> > > > >   How can I index JSP files?
> > > > >
> > > > >
> > > > > "James liu" <li...@gmail.com> wrote on 05/09/2006 19:14:24:
> > > > >
> > > > > > i find lius many question ,,,,so i wanna give up and find new.
> > > > > >
> > > > > > who recommend ?
> > > > >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Re: which way to index pdf,word,excel

Posted by Doron Cohen <DO...@il.ibm.com>.

I think that Nutch would crawl and search all three of these types. I'm not sure
that Nutch provides the framework you seem to be looking for, but perhaps it is
worth taking a look: http://lucene.apache.org/nutch/

"James liu" <li...@gmail.com> wrote on 05/09/2006 23:10:16:

> i wanna find frame which can index xml,word,excel,pdf,,,not one.
>
> i just wanna know who know the frame like what i wanna.
>
>
> 2006/9/6, yueyu lin <po...@gmail.com>:
> >
> > First, Lucene is just a index toolkit, you have to USE it to implement
> > your application.
> >
> > If you want to index something, you must have knowledge how to extract
> > information from them and what kind of keys they need to be set.
> >
> > Then you can do what you want to.
> > On 9/5/06, James liu <li...@gmail.com> wrote:
> > >
> > > i wanna find frame which can index xml,word,excel,pdf,,,not one.
> > >
> > >
> > > 2006/9/6, Doron Cohen <DO...@il.ibm.com>:
> > > >
> > > > Lucene FAQ - http://wiki.apache.org/jakarta-lucene/LuceneFAQ - has a few
> > > > entries just for this:
> > > >
> > > >   How can I index HTML documents?
> > > >   How can I index XML documents?
> > > >   How can I index OpenOffice.org files?
> > > >   How can I index MS-Word documents?
> > > >   How can I index MS-Excel documents?
> > > >   How can I index MS-Powerpoint documents?
> > > >   How can I index Email (from MS-Exchange or another IMAP server) ?
> > > >   How can I index RTF documents?
> > > >   How can I index PDF documents?
> > > >   How can I index JSP files?
> > > >
> > > >
> > > > "James liu" <li...@gmail.com> wrote on 05/09/2006 19:14:24:
> > > >
> > > > > i find lius many question ,,,,so i wanna give up and find new.
> > > > >
> > > > > who recommend ?
> > > >



Re: which way to index pdf,word,excel

Posted by James liu <li...@gmail.com>.

I want to find a framework that can index XML, Word, Excel, PDF and so on, not
just one format.

I just want to know whether anyone knows of a framework like that.


2006/9/6, yueyu lin <po...@gmail.com>:
>
> First, Lucene is just a index toolkit, you have to USE it to implement
> your application.
>
> If you want to index something, you must have knowledge how to extract
> information from them and what kind of keys they need to be set.
>
> Then you can do what you want to.
> On 9/5/06, James liu <li...@gmail.com> wrote:
> >
> > i wanna find frame which can index xml,word,excel,pdf,,,not one.
> >
> >
> > 2006/9/6, Doron Cohen <DO...@il.ibm.com>:
> > >
> > > Lucene FAQ - http://wiki.apache.org/jakarta-lucene/LuceneFAQ - has a few
> > > entries just for this:
> > >
> > >   How can I index HTML documents?
> > >   How can I index XML documents?
> > >   How can I index OpenOffice.org files?
> > >   How can I index MS-Word documents?
> > >   How can I index MS-Excel documents?
> > >   How can I index MS-Powerpoint documents?
> > >   How can I index Email (from MS-Exchange or another IMAP server) ?
> > >   How can I index RTF documents?
> > >   How can I index PDF documents?
> > >   How can I index JSP files?
> > >
> > >
> > > "James liu" <li...@gmail.com> wrote on 05/09/2006 19:14:24:
> > >
> > > > i find lius many question ,,,,so i wanna give up and find new.
> > > >
> > > > who recommend ?
> > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > >
> > >
> >
> >
>
>
> --
> --
> Yueyu Lin
>
>

Re: which way to index pdf,word,excel

Posted by yueyu lin <po...@gmail.com>.

First, Lucene is just an indexing toolkit; you have to USE it to implement your
application.

If you want to index something, you must know how to extract information
from it and what kinds of keys need to be set for it.

Then you can do what you want to.
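
A minimal sketch of that division of labor, in plain Java with no Lucene
dependency: a per-format extractor produces text, and the application decides
which keys to fill before handing them to the index. The class name
ExtractorRegistry, the keys "path" and "contents", and the toy XML extractor
are all assumptions made for illustration; Word, Excel and PDF would need a
real parser library plugged in the same way.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper: maps a file extension to a function that turns raw bytes into text.
class ExtractorRegistry {
    private final Map<String, Function<byte[], String>> extractors = new HashMap<>();

    void register(String extension, Function<byte[], String> extractor) {
        extractors.put(extension.toLowerCase(Locale.ROOT), extractor);
    }

    // Builds the key/value pairs an application would then add to its index.
    // Throws if no extractor is registered for the file's extension.
    Map<String, String> toFields(String path, byte[] data) {
        int dot = path.lastIndexOf('.');
        String ext = dot < 0 ? "" : path.substring(dot + 1).toLowerCase(Locale.ROOT);
        Function<byte[], String> extractor = extractors.get(ext);
        if (extractor == null) {
            throw new IllegalArgumentException("No extractor registered for: " + path);
        }
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("path", path);
        fields.put("contents", extractor.apply(data));
        return fields;
    }

    // Registry with two toy extractors; binary formats need a real parser library.
    static ExtractorRegistry withToyExtractors() {
        ExtractorRegistry registry = new ExtractorRegistry();
        registry.register("txt", bytes -> new String(bytes, StandardCharsets.UTF_8));
        // Crude XML handling: replace tags with spaces and collapse whitespace.
        registry.register("xml", bytes -> new String(bytes, StandardCharsets.UTF_8)
                .replaceAll("<[^>]*>", " ")
                .replaceAll("\\s+", " ")
                .trim());
        return registry;
    }

    public static void main(String[] args) {
        ExtractorRegistry registry = withToyExtractors();
        byte[] xml = "<doc><title>Hello</title><body>indexing world</body></doc>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(registry.toFields("sample.xml", xml));
    }
}
```

Keeping extraction behind one small interface means a new format only needs a
new registered function; the indexing code itself does not change.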
On 9/5/06, James liu <li...@gmail.com> wrote:
>
> i wanna find frame which can index xml,word,excel,pdf,,,not one.
>
>
> 2006/9/6, Doron Cohen <DO...@il.ibm.com>:
> >
> > Lucene FAQ - http://wiki.apache.org/jakarta-lucene/LuceneFAQ - has a few
> > entries just for this:
> >
> >   How can I index HTML documents?
> >   How can I index XML documents?
> >   How can I index OpenOffice.org files?
> >   How can I index MS-Word documents?
> >   How can I index MS-Excel documents?
> >   How can I index MS-Powerpoint documents?
> >   How can I index Email (from MS-Exchange or another IMAP server) ?
> >   How can I index RTF documents?
> >   How can I index PDF documents?
> >   How can I index JSP files?
> >
> >
> > "James liu" <li...@gmail.com> wrote on 05/09/2006 19:14:24:
> >
> > > i find lius many question ,,,,so i wanna give up and find new.
> > >
> > > who recommend ?
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
>
>


-- 
--
Yueyu Lin

Re: which way to index pdf,word,excel

Posted by James liu <li...@gmail.com>.

I want to find a framework that can index XML, Word, Excel, PDF and so on, not
just one format.


2006/9/6, Doron Cohen <DO...@il.ibm.com>:
>
> Lucene FAQ - http://wiki.apache.org/jakarta-lucene/LuceneFAQ - has a few
> entries just for this:
>
>   How can I index HTML documents?
>   How can I index XML documents?
>   How can I index OpenOffice.org files?
>   How can I index MS-Word documents?
>   How can I index MS-Excel documents?
>   How can I index MS-Powerpoint documents?
>   How can I index Email (from MS-Exchange or another IMAP server) ?
>   How can I index RTF documents?
>   How can I index PDF documents?
>   How can I index JSP files?
>
>
> "James liu" <li...@gmail.com> wrote on 05/09/2006 19:14:24:
>
> > i find lius many question ,,,,so i wanna give up and find new.
> >
> > who recommend ?
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Re: Indexing MS Powerpoint files with Lucene

Posted by Tomi NA <he...@gmail.com>.

On 9/8/06, Andrzej Bialecki <ab...@getopt.org> wrote:
> (moved to nutch-user)
>
> Tomi NA wrote:
> > On 9/7/06, Andrzej Bialecki <ab...@getopt.org> wrote:
> >> Tomi NA wrote:
> >> > On 9/7/06, Nick Burch <ni...@torchbox.com> wrote:
> >> >> On Thu, 7 Sep 2006, Tomi NA wrote:
> >> >> > On 9/7/06, Venkateshprasanna <pr...@yahoo.co.in> wrote:
> >> >> >> Is there any filter available for extracting text from MS Powerpoint files
> >> >> >> and indexing them?
> >> >> >> The lucene website suggests the POI project, which, it seems does not
> >> >> >> support PPT files as of now.
> >> >> >
> >> >> > http://jakarta.apache.org/poi/hslf/index.html
> >> >> >
> >> >> > It doesn't say poi doesn't support ppt. It just says support is limited.
> >> >> > Don't know exactly how limited, but certainly not useless for indexing
> >> >> > purposes.
> >> >>
> >> >> Support for editing and adding things to PowerPoint files is limited, as
> >> >> is getting out the finer points of fonts and positioning.
> >> >
> >> > Which brings me to another (off)topic: can lucene/nutch assign
> >> > different weights to tokens in the same document field? An obvious
> >> > example would be: "this text seems to be in large, bold, blinking
> >> > letters: I'll assume it's more important than the surrounding 8px
> >> > text."
> >>
> >> No, it can't (at least not yet). As a workaround you can extract these
> >> portions of text to another field (or multiple fields), and then add
> >> them with a higher boost. Then, expand your queries so that they include
> >> also this field. This way, if query matches these special tokens,
> >> results will get higher rank because of matching on this boosted field.
> >
> > I thought a workaround like that would be needed. Still, it could give
> > useful results...though as a nutch user, the possibility is mostly
> > theoretical for me, as probably none of the existing parsers take into
> > account the formatting information. I could be completely wrong here,
> > so please, feel free to correct me.
>
> You can write a HtmlParseFilter, which will extract these portions of
> text and put them into ParseData.metadata. Then, during indexing you can
> check if such metadata exists and if yes - add it as separate fields.
> You will need also to modify the QueryFilters, to expand user queries to
> also include clauses for these additional fields.

Thanks Andrzej, I understand the concepts involved now. If the need
arises, I'll see what I can do about making it work as intended.

t.n.a.

Re: Indexing MS Powerpoint files with Lucene

Posted by Andrzej Bialecki <ab...@getopt.org>.

(moved to nutch-user)

Tomi NA wrote:
> On 9/7/06, Andrzej Bialecki <ab...@getopt.org> wrote:
>> Tomi NA wrote:
>> > On 9/7/06, Nick Burch <ni...@torchbox.com> wrote:
>> >> On Thu, 7 Sep 2006, Tomi NA wrote:
>> >> > On 9/7/06, Venkateshprasanna <pr...@yahoo.co.in> wrote:
>> >> >> Is there any filter available for extracting text from MS Powerpoint files
>> >> >> and indexing them?
>> >> >> The lucene website suggests the POI project, which, it seems does not
>> >> >> support PPT files as of now.
>> >> >
>> >> > http://jakarta.apache.org/poi/hslf/index.html
>> >> >
>> >> > It doesn't say poi doesn't support ppt. It just says support is limited.
>> >> > Don't know exactly how limited, but certainly not useless for indexing
>> >> > purposes.
>> >>
>> >> Support for editing and adding things to PowerPoint files is limited, as
>> >> is getting out the finer points of fonts and positioning.
>> >
>> > Which brings me to another (off)topic: can lucene/nutch assign
>> > different weights to tokens in the same document field? An obvious
>> > example would be: "this text seems to be in large, bold, blinking
>> > letters: I'll assume it's more important than the surrounding 8px
>> > text."
>>
>> No, it can't (at least not yet). As a workaround you can extract these
>> portions of text to another field (or multiple fields), and then add
>> them with a higher boost. Then, expand your queries so that they include
>> also this field. This way, if query matches these special tokens,
>> results will get higher rank because of matching on this boosted field.
>
> I thought a workaround like that would be needed. Still, it could give
> useful results...though as a nutch user, the possibility is mostly
> theoretical for me, as probably none of the existing parsers take into
> account the formatting information. I could be completely wrong here,
> so please, feel free to correct me.

You can write an HtmlParseFilter that extracts these portions of
text and puts them into ParseData.metadata. Then, during indexing, you can
check whether such metadata exists and, if so, add it as separate fields.
You will also need to modify the QueryFilters to expand user queries so that
they include clauses for these additional fields.
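
The extraction step could look roughly like the following. This is not a Nutch
HtmlParseFilter (the plugin interface is not reproduced here); it is a
standalone Java sketch that only shows pulling emphasized text out of HTML so
that it can be stored under a separate key. The tag list, the class name
EmphasisExtractor, and the regex-based matching are assumptions; a real filter
would more likely walk the parsed DOM.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmphasisExtractor {
    // Matches <b>, <strong>, <h1> and <h2> elements (assumed tag list).
    // Same-tag nesting is not handled; this is a sketch, not an HTML parser.
    private static final Pattern EMPHASIZED = Pattern.compile(
            "<(b|strong|h1|h2)(?:\\s[^>]*)?>(.*?)</\\1\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the text content of each emphasized element, in document order.
    static List<String> emphasizedText(String html) {
        List<String> found = new ArrayList<>();
        Matcher m = EMPHASIZED.matcher(html);
        while (m.find()) {
            // Drop any markup nested inside the emphasized element.
            String text = m.group(2)
                    .replaceAll("<[^>]*>", " ")
                    .replaceAll("\\s+", " ")
                    .trim();
            if (!text.isEmpty()) {
                found.add(text);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String html = "<h1>Release notes</h1><p>Minor fixes and <b>one breaking change</b>.</p>";
        System.out.println(emphasizedText(html));
    }
}
```

The returned strings are what would go into the parse metadata and later into
the additional index fields.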

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

Re: Indexing MS Powerpoint files with Lucene

Posted by Tomi NA <he...@gmail.com>.

On 9/7/06, Andrzej Bialecki <ab...@getopt.org> wrote:
> Tomi NA wrote:
> > On 9/7/06, Nick Burch <ni...@torchbox.com> wrote:
> >> On Thu, 7 Sep 2006, Tomi NA wrote:
> >> > On 9/7/06, Venkateshprasanna <pr...@yahoo.co.in> wrote:
> >> >> Is there any filter available for extracting text from MS Powerpoint files
> >> >> and indexing them?
> >> >> The lucene website suggests the POI project, which, it seems does not
> >> >> support PPT files as of now.
> >> >
> >> > http://jakarta.apache.org/poi/hslf/index.html
> >> >
> >> > It doesn't say poi doesn't support ppt. It just says support is limited.
> >> > Don't know exactly how limited, but certainly not useless for indexing
> >> > purposes.
> >>
> >> Support for editing and adding things to PowerPoint files is limited, as
> >> is getting out the finer points of fonts and positioning.
> >
> > Which brings me to another (off)topic: can lucene/nutch assign
> > different weights to tokens in the same document field? An obvious
> > example would be: "this text seems to be in large, bold, blinking
> > letters: I'll assume it's more important than the surrounding 8px
> > text."
>
> No, it can't (at least not yet). As a workaround you can extract these
> portions of text to another field (or multiple fields), and then add
> them with a higher boost. Then, expand your queries so that they include
> also this field. This way, if query matches these special tokens,
> results will get higher rank because of matching on this boosted field.

I thought a workaround like that would be needed. Still, it could give
useful results...though as a nutch user, the possibility is mostly
theoretical for me, as probably none of the existing parsers take into
account the formatting information. I could be completely wrong here,
so please, feel free to correct me.

t.n.a.


Re: Indexing MS Powerpoint files with Lucene

Posted by Andrzej Bialecki <ab...@getopt.org>.

Tomi NA wrote:
> On 9/7/06, Nick Burch <ni...@torchbox.com> wrote:
>> On Thu, 7 Sep 2006, Tomi NA wrote:
>> > On 9/7/06, Venkateshprasanna <pr...@yahoo.co.in> wrote:
>> >> Is there any filter available for extracting text from MS Powerpoint files
>> >> and indexing them?
>> >> The lucene website suggests the POI project, which, it seems does not
>> >> support PPT files as of now.
>> >
>> > http://jakarta.apache.org/poi/hslf/index.html
>> >
>> > It doesn't say poi doesn't support ppt. It just says support is limited.
>> > Don't know exactly how limited, but certainly not useless for indexing
>> > purposes.
>>
>> Support for editing and adding things to PowerPoint files is limited, as
>> is getting out the finer points of fonts and positioning.
>
> Which brings me to another (off)topic: can lucene/nutch assign
> different weights to tokens in the same document field? An obvious
> example would be: "this text seems to be in large, bold, blinking
> letters: I'll assume it's more important than the surrounding 8px
> text."

No, it can't (at least not yet). As a workaround, you can extract these
portions of text to another field (or multiple fields) and then add
them with a higher boost. Then expand your queries so that they also include
this field. This way, if a query matches these special tokens, the
results will get a higher rank because of the match on this boosted field.
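
A toy model of this workaround, written in plain Java rather than against the
Lucene API: emphasized text goes into its own field, that field carries a
higher boost, and the query is expanded to search both fields. The field names
"body" and "emphasized", the boost values, and the term-count scoring are
assumptions chosen for illustration; Lucene's actual scoring is more involved.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Toy scorer: counts query-term occurrences per field and multiplies by the field's boost.
class BoostedFieldSketch {
    // Hypothetical boosts: a match in "emphasized" counts three times as much as one in "body".
    static final Map<String, Float> FIELD_BOOSTS = new LinkedHashMap<>();
    static {
        FIELD_BOOSTS.put("body", 1.0f);
        FIELD_BOOSTS.put("emphasized", 3.0f);
    }

    // The "expanded query": the same term is looked up in every boosted field.
    static float score(Map<String, String> doc, String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        float total = 0f;
        for (Map.Entry<String, Float> e : FIELD_BOOSTS.entrySet()) {
            String text = doc.getOrDefault(e.getKey(), "");
            long hits = Arrays.stream(text.toLowerCase(Locale.ROOT).split("\\W+"))
                    .filter(needle::equals)
                    .count();
            total += hits * e.getValue();
        }
        return total;
    }

    // Convenience constructor for a two-field document.
    static Map<String, String> doc(String body, String emphasized) {
        Map<String, String> d = new LinkedHashMap<>();
        d.put("body", body);
        d.put("emphasized", emphasized);
        return d;
    }

    public static void main(String[] args) {
        Map<String, String> plain = doc("quarterly sales report", "");
        Map<String, String> bold = doc("quarterly report", "sales");
        System.out.println("plain: " + score(plain, "sales"));
        System.out.println("bold:  " + score(bold, "sales"));
    }
}
```

In this model a document that mentions the term only in its emphasized field
outranks one that mentions it only in ordinary body text, which is the effect
the workaround is after.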

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: Indexing MS Powerpoint files with Lucene

Posted by Tomi NA <he...@gmail.com>.

On 9/7/06, Nick Burch <ni...@torchbox.com> wrote:
> On Thu, 7 Sep 2006, Tomi NA wrote:
> > On 9/7/06, Venkateshprasanna <pr...@yahoo.co.in> wrote:
> >> Is there any filter available for extracting text from MS Powerpoint files
> >> and indexing them?
> >> The lucene website suggests the POI project, which, it seems does not
> >> support PPT files as of now.
> >
> > http://jakarta.apache.org/poi/hslf/index.html
> >
> > It doesn't say poi doesn't support ppt. It just says support is limited.
> > Don't know exactly how limited, but certainly not useless for indexing
> > purposes.
>
> Support for editing and adding things to PowerPoint files is limited, as
> is getting out the finer points of fonts and positioning.

Which brings me to another (off)topic: can lucene/nutch assign
different weights to tokens in the same document field? An obvious
example would be: "this text seems to be in large, bold, blinking
letters: I'll assume it's more important than the surrounding 8px
text."

t.n.a.


Re: Indexing MS Powerpoint files with Lucene

Posted by Nick Burch <ni...@torchbox.com>.

On Thu, 7 Sep 2006, Tomi NA wrote:
> On 9/7/06, Venkateshprasanna <pr...@yahoo.co.in> wrote:
>> Is there any filter available for extracting text from MS Powerpoint files
>> and indexing them?
>> The lucene website suggests the POI project, which, it seems does not
>> support PPT files as of now.
>
> http://jakarta.apache.org/poi/hslf/index.html
>
> It doesn't say poi doesn't support ppt. It just says support is limited. 
> Don't know exactly how limited, but certainly not useless for indexing 
> purposes.

Support for editing and adding things to PowerPoint files is limited, as 
is getting out the finer points of fonts and positioning.

Getting text out should "just work" for you. The only thing you'll need to 
decide is if you want hslf.PowerPointExtractor to give you slide and notes 
text, or just slide text :)

Nick


Re: Indexing MS Powerpoint files with Lucene

Posted by Tomi NA <he...@gmail.com>.

On 9/7/06, Venkateshprasanna <pr...@yahoo.co.in> wrote:
>
> Is there any filter available for extracting text from MS Powerpoint files
> and indexing them?
> The lucene website suggests the POI project, which, it seems does not
> support PPT files as of now.

http://jakarta.apache.org/poi/hslf/index.html

It doesn't say poi doesn't support ppt. It just says support is
limited. Don't know exactly how limited, but certainly not useless for
indexing purposes.
Don't know if the plugin works, though. :)

t.n.a.


Re: Indexing MS Powerpoint files with Lucene

Posted by Gopikrishnan Subramani <go...@gmail.com>.

Did you check the POI javadocs? Look for
org.apache.poi.hslf.extractor.PowerPointExtractor. It's one of the most
straightforward classes in POI as far as extracting text for indexing is
concerned.

-Gopi

On 9/7/06, Venkateshprasanna <pr...@yahoo.co.in> wrote:
>
>
> Is there any filter available for extracting text from MS Powerpoint files
> and indexing them?
> The lucene website suggests the POI project, which, it seems does not
> support PPT files as of now.
>
> Regards,
> Venkateshprasanna
>
> --
> View this message in context:
> http://www.nabble.com/which-way-to-index-pdf%2Cword%2Cexcel-tf2224468.html#a6185039
> Sent from the Lucene - Java Users forum at Nabble.com.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Indexing MS Powerpoint files with Lucene

Posted by Venkateshprasanna <pr...@yahoo.co.in>.

Is there any filter available for extracting text from MS PowerPoint files
and indexing them?
The Lucene website suggests the POI project, which, it seems, does not
support PPT files as of now.

Regards,
Venkateshprasanna

-- 
View this message in context: http://www.nabble.com/which-way-to-index-pdf%2Cword%2Cexcel-tf2224468.html#a6185039
Sent from the Lucene - Java Users forum at Nabble.com.



Re: which way to index pdf,word,excel

Posted by Doron Cohen <DO...@il.ibm.com>.

Lucene FAQ - http://wiki.apache.org/jakarta-lucene/LuceneFAQ - has a few
entries just for this:

  How can I index HTML documents?
  How can I index XML documents?
  How can I index OpenOffice.org files?
  How can I index MS-Word documents?
  How can I index MS-Excel documents?
  How can I index MS-Powerpoint documents?
  How can I index Email (from MS-Exchange or another IMAP server) ?
  How can I index RTF documents?
  How can I index PDF documents?
  How can I index JSP files?


"James liu" <li...@gmail.com> wrote on 05/09/2006 19:14:24:

> i find lius many question ,,,,so i wanna give up and find new.
>
> who recommend ?

