Posted to java-user@lucene.apache.org by Jun Zhou <AC...@sheffield.ac.uk> on 2002/07/26 16:52:00 UTC

index other document types

Dear all,
 
 I learned from the Lucene FAQ that to index other document types, we need to provide a parser or extractor for each type. I know there are tools available that can convert other document types to plain text. Does such a converter count as a parser or extractor?
 
 Thank you for your kind assistance in advance.
 
 Best regards
Jun Zhou
acp01jz@sheffield.ac.uk

Re: index other document types

Posted by Jun Zhou <AC...@sheffield.ac.uk>.

Thank you very much, Dave! I am now confident that I can use Lucene for my project.

Best regards
Jun Zhou
ACP01JZ@sheffield.ac.uk

----- Original Message ----- 
From: "Dave Peixotto" <pe...@geofolio.com>
To: "Lucene Users List" <lu...@jakarta.apache.org>
Sent: Friday, July 26, 2002 4:34 PM
Subject: Re: index other document types


> Lucene is very good at indexing and searching text documents.  If you need
> to index other types of documents (Word docs, PDFs, etc.) then a good
> strategy is to convert those documents to text and use Lucene to index the
> text version of the document.  If you already have a tool to convert other
> document types to text, then you should have no trouble indexing those
> documents.

Re: index other document types

Posted by Dave Peixotto <pe...@geofolio.com>.

Lucene is very good at indexing and searching text documents.  If you need
to index other types of documents (Word docs, PDFs, etc.) then a good
strategy is to convert those documents to text and use Lucene to index the
text version of the document.  If you already have a tool to convert other
document types to text, then you should have no trouble indexing those
documents.
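
As a rough illustration of the convert-to-text strategy described above, here is a minimal Java sketch. It is not part of Lucene: the class name TextExtractors and its register/extract methods are hypothetical, invented only for this example. Each converter is modeled as a function from raw file bytes to plain text, looked up by file extension; a real converter for Word or PDF files would be plugged in the same way.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical helper (not a Lucene class): one text converter per file extension. */
class TextExtractors {
    // Maps a lower-case extension such as "txt" to a bytes-to-text converter.
    private final Map<String, Function<byte[], String>> byExtension = new HashMap<>();

    /** Registers a converter for the given extension (case-insensitive). */
    void register(String extension, Function<byte[], String> extractor) {
        byExtension.put(extension.toLowerCase(Locale.ROOT), extractor);
    }

    /** Returns the plain text for a file, or throws if no converter is registered. */
    String extract(String fileName, byte[] content) {
        int dot = fileName.lastIndexOf('.');
        String ext = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        Function<byte[], String> extractor = byExtension.get(ext);
        if (extractor == null) {
            throw new IllegalArgumentException("No text extractor registered for: " + fileName);
        }
        return extractor.apply(content);
    }
}
```

The string returned by extract is what would then be handed to Lucene as the text to index. In that sense an existing converter tool can play the role of the "parser or extractor" the FAQ mentions, as long as it produces plain text.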
