Posted to dev@lucene.apache.org by ganesh H D <ga...@sigmainfo.net> on 2008/11/21 13:49:47 UTC

Indexing Open office documents

Hi,

I have been working with Apache Lucene for the past 3 days. I deployed the
sample application that ships with the Lucene distribution, and it works
fine. It indexes all types of files, such as .pdf, .xml, .java, and .txt.
It also indexes OpenOffice documents, but when I search for words that
occur in those documents, it does not return the expected results. I later
learned that OpenOffice documents are ZIP archives that contain XML
files. We need to uncompress the file using Java's ZIP support, then parse
meta.xml to get the title etc. and content.xml to get the document's content.
I couldn't find much information about how to do this. Please help me
solve this issue.
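
A minimal sketch of the steps described above, assuming only plain JDK
classes (java.util.zip and the SAX parser). The class name OdfTextExtractor
and its methods are hypothetical and are not part of Lucene or its demo
application. It opens the document as a ZIP archive, reads one XML entry
such as content.xml or meta.xml, and collects the text nodes into a single
string that could then be added to a Lucene field.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper (not a Lucene class): extracts plain text from an
// OpenDocument file so the text can be handed to an indexer.
final class OdfTextExtractor {

    // Returns the character data of the named XML entry (for example
    // "content.xml" or "meta.xml") stored inside the ODF ZIP archive.
    static String extractText(String odfPath, String entryName)
            throws IOException, SAXException, ParserConfigurationException {
        try (ZipFile zip = new ZipFile(odfPath)) {
            ZipEntry entry = zip.getEntry(entryName);
            if (entry == null) {
                throw new IOException(entryName + " not found in " + odfPath);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return parseText(in);
            }
        }
    }

    // Collects every text node in the XML stream. Simplification: a space is
    // appended after each closing element so that adjacent paragraphs do not
    // run together; this would also split a word that is broken up by inline
    // markup.
    static String parseText(InputStream xml)
            throws IOException, SAXException, ParserConfigurationException {
        final StringBuilder text = new StringBuilder();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.newSAXParser().parse(xml, new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                text.append(' ');
            }
        });
        // Collapse runs of whitespace into single spaces.
        return text.toString().replaceAll("\\s+", " ").trim();
    }
}
```

Calling OdfTextExtractor.extractText(path, "content.xml") would return the
body text. The same call with "meta.xml" would return all metadata values
concatenated; picking out the title alone would need a handler that looks
at element names.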

regards,
ganesh

-- 
View this message in context: http://www.nabble.com/Indexing-Open-office-documents-tp20620421p20620421.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

RE: Indexing Open office documents

Posted by ganesh H D <ga...@sigmainfo.net>.

Hi,
OpenOffice documents are getting indexed, but when I search for words from
those documents I do not see the correct results.

regards,
ganesh 

Uwe Schindler wrote:
> 
> For converting documents to plain text for indexing, look at Apache Tika,
> which has a converter for OpenDocument: http://lucene.apache.org/tika/
> 
> This mailing list is *about* the development of Lucene, not for questions
> on *how* to develop your own code that uses Lucene.
> 
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
> 

-- 
View this message in context: http://www.nabble.com/Indexing-Open-office-documents-tp20620421p20658947.html

RE: Indexing Open office documents

Posted by Uwe Schindler <uw...@thetaphi.de>.

For converting documents to plain text for indexing, look at Apache Tika,
which has a converter for OpenDocument: http://lucene.apache.org/tika/

This mailing list is *about* the development of Lucene, not for questions
on *how* to develop your own code that uses Lucene.

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


Re: Indexing Open office documents

Posted by ganesh H D <ga...@sigmainfo.net>.

Hi,
OpenOffice documents are getting indexed, but when I search for words from
those documents I do not see the correct results.

regards,
ganesh


-- 
View this message in context: http://www.nabble.com/Indexing-Open-office-documents-tp20620421p20658923.html