Posted to java-user@lucene.apache.org by dipesh <di...@gmail.com> on 2008/11/12 04:37:45 UTC

Parsing MSWord

Hello,
I wanted to know if there are classes in Lucene that support parsing MSWord
documents.
Many thanks,
Dipesh

----------------------------------------
"Help Ever Hurt Never"- Baba

Re: Parsing MSWord

Posted by Alexander Aristov <al...@gmail.com>.
Antiword would be hard to integrate into Nutch as it is not Java-based. It
would require native calls.

Alexander

2008/11/12 Sertic Mirko, Bedag <Mi...@bedag.ch>

> Hi
>
> You can also use a tool called "antiword" to extract the text from a .doc
> file, and then
> give the text to lucene.
>
> See here : http://en.wikipedia.org/wiki/Antiword
>
> Regards
> Mirko
>
> -----Original Message-----
> From: dipesh [mailto:dipshrestha@gmail.com]
> Sent: Wednesday, 12 November 2008 04:38
> To: java-user@lucene.apache.org
> Subject: Parsing MSWord
>
> Hello,
> I wanted to know if there are classes in Lucene that support parsing MSWord
> documents.
> Many thanks,
> Dipesh
>
> ----------------------------------------
> "Help Ever Hurt Never"- Baba
>



-- 
Best Regards
Alexander Aristov

Re: AW: Parsing MSWord

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Or Tika, Lucene's cousin: http://incubator.apache.org/tika/
(which uses POI under the hood, but goes beyond MS Word parsing)

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch




________________________________
From: Donna L Gresh <gr...@us.ibm.com>
To: java-user@lucene.apache.org
Sent: Wednesday, November 12, 2008 8:25:43 AM
Subject: Re: AW: Parsing MSWord

Check out POI; that's what I use

http://poi.apache.org/


"Sertic Mirko, Bedag" <Mi...@bedag.ch> wrote on 11/12/2008 03:25:47 AM:

> Hi
> 
> You can also use a tool called "antiword" to extract the text from a
> .doc file, and then
> give the text to lucene.
> 
> See here : http://en.wikipedia.org/wiki/Antiword
> 
> Regards
> Mirko
> 
> -----Original Message-----
> From: dipesh [mailto:dipshrestha@gmail.com]
> Sent: Wednesday, 12 November 2008 04:38
> To: java-user@lucene.apache.org
> Subject: Parsing MSWord
> 
> Hello,
> I wanted to know if there are classes in Lucene that support parsing MSWord
> documents.
> Many thanks,
> Dipesh
> 
> ----------------------------------------
> "Help Ever Hurt Never"- Baba

Re: AW: Parsing MSWord

Posted by Donna L Gresh <gr...@us.ibm.com>.
Check out POI; that's what I use

http://poi.apache.org/


"Sertic Mirko, Bedag" <Mi...@bedag.ch> wrote on 11/12/2008 03:25:47 AM:

> Hi
> 
> You can also use a tool called "antiword" to extract the text from a
> .doc file, and then
> give the text to lucene.
> 
> See here : http://en.wikipedia.org/wiki/Antiword
> 
> Regards
> Mirko
> 
> -----Original Message-----
> From: dipesh [mailto:dipshrestha@gmail.com]
> Sent: Wednesday, 12 November 2008 04:38
> To: java-user@lucene.apache.org
> Subject: Parsing MSWord
> 
> Hello,
> I wanted to know if there are classes in Lucene that support parsing MSWord
> documents.
> Many thanks,
> Dipesh
> 
> ----------------------------------------
> "Help Ever Hurt Never"- Baba

AW: Parsing MSWord

Posted by "Sertic Mirko, Bedag" <Mi...@bedag.ch>.
Hi

You can also use a tool called "antiword" to extract the text from a .doc file, and then
give the text to lucene.

See here : http://en.wikipedia.org/wiki/Antiword

Regards
Mirko

-----Original Message-----
From: dipesh [mailto:dipshrestha@gmail.com]
Sent: Wednesday, 12 November 2008 04:38
To: java-user@lucene.apache.org
Subject: Parsing MSWord

Hello,
I wanted to know if there are classes in Lucene that support parsing MSWord
documents.
Many thanks,
Dipesh

----------------------------------------
"Help Ever Hurt Never"- Baba
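
For reference, the antiword approach described in the message above could be
sketched roughly as follows. This is only an illustration, not code from the
thread. It assumes the antiword binary is installed and on the PATH, that it
prints the extracted text to standard output, and that the output is UTF-8
(the real encoding depends on antiword's character mapping). The class and
method names are hypothetical. antiword runs as a separate process here, so
it remains a non-Java dependency that has to be present on each machine.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper: runs the external "antiword" tool on a .doc file
// and returns whatever it writes to standard output.
class AntiwordExtractor {

    // Builds the command line. Assumes "antiword" can be found on the PATH.
    static List<String> buildCommand(Path docFile) {
        return List.of("antiword", docFile.toString());
    }

    // Starts antiword, reads its standard output, and waits for it to exit.
    static String extractText(Path docFile) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(docFile))
                // Send antiword's error messages to our own stderr so the
                // child cannot block on a full, unread error pipe.
                .redirectError(ProcessBuilder.Redirect.INHERIT)
                .start();

        String text;
        try (InputStream out = process.getInputStream()) {
            // Assumption: the output is UTF-8 encoded.
            text = new String(out.readAllBytes(), StandardCharsets.UTF_8);
        }

        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("antiword exited with code " + exitCode);
        }
        return text;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java AntiwordExtractor <file.doc>");
            return;
        }
        System.out.println(extractText(Path.of(args[0])));
    }
}
```

The returned string can then be added to a Lucene document as an indexed
field. The exact Field constructor to use depends on the Lucene version.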

RE: Parsing MSWord

Posted by John Griffin <jg...@thebluezone.net>.
Dipesh,

Start here.

http://poi.apache.org/

John G.

-----Original Message-----
From: dipesh [mailto:dipshrestha@gmail.com] 
Sent: Tuesday, November 11, 2008 8:38 PM
To: java-user@lucene.apache.org
Subject: Parsing MSWord

Hello,
I wanted to know if there are classes in Lucene that support parsing MSWord
documents.
Many thanks,
Dipesh

----------------------------------------
"Help Ever Hurt Never"- Baba


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Parsing MSWord

Posted by dipesh <di...@gmail.com>.
Thank you,
It was really helpful. I also found some similar work being done in the
Nutch project.
Regards,
Dipesh


On Wed, Nov 12, 2008 at 12:52 PM, Dave Newton <ne...@yahoo.com> wrote:

> --- On Tue, 11/11/08, dipesh wrote:
> > I wanted to know if there are classes in Lucene that support
> > parsing MSWord documents.
>
> Searching the web might help:
>
> http://www.google.com/search?q=lucene+%2Bword
>
> The Apache Tika project (http://incubator.apache.org/tika/) might also be
> of interest.
>
> Dave
>


-- 
----------------------------------------
"Help Ever Hurt Never"- Baba

Re: Parsing MSWord

Posted by Dave Newton <ne...@yahoo.com>.
--- On Tue, 11/11/08, dipesh wrote:
> I wanted to know if there are classes in Lucene that support 
> parsing MSWord documents.

Searching the web might help:

http://www.google.com/search?q=lucene+%2Bword

The Apache Tika project (http://incubator.apache.org/tika/) might also be of interest.

Dave