Posted to java-user@lucene.apache.org by dipesh <di...@gmail.com> on 2008/11/12 04:37:45 UTC
Parsing MSWord
Hello,
I wanted to know if there are classes in Lucene that support parsing MSWord
documents.
Many thanks,
Dipesh
----------------------------------------
"Help Ever Hurt Never"- Baba
Re: Parsing MSWord
Posted by Alexander Aristov <al...@gmail.com>.
Antiword would be hard to integrate into Nutch as it is not Java-based. It
would require native calls.
Alexander
2008/11/12 Sertic Mirko, Bedag <Mi...@bedag.ch>
> Hi
>
> You can also use a tool called "antiword" to extract the text from a .doc
> file, and then
> give the text to lucene.
>
> See here : http://en.wikipedia.org/wiki/Antiword
>
> Regards
> Mirko
>
> -----Original Message-----
> From: dipesh [mailto:dipshrestha@gmail.com]
> Sent: Wednesday, November 12, 2008 04:38
> To: java-user@lucene.apache.org
> Subject: Parsing MSWord
>
> Hello,
> I wanted to know if there are classes in Lucene that support parsing MSWord
> documents.
> Many thanks,
> Dipesh
>
> ----------------------------------------
> "Help Ever Hurt Never"- Baba
>
--
Best Regards
Alexander Aristov
Re: AW: Parsing MSWord
Posted by Otis Gospodnetic <ot...@yahoo.com>.
Or Tika, Lucene's cousin: http://incubator.apache.org/tika/
(which uses POI under the hood, but goes beyond MS Word parsing)
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
________________________________
From: Donna L Gresh <gr...@us.ibm.com>
To: java-user@lucene.apache.org
Sent: Wednesday, November 12, 2008 8:25:43 AM
Subject: Re: AW: Parsing MSWord
Check out POI; that's what I use
http://poi.apache.org/
"Sertic Mirko, Bedag" <Mi...@bedag.ch> wrote on 11/12/2008 03:25:47 AM:
> Hi
>
> You can also use a tool called "antiword" to extract the text from a
> .doc file, and then
> give the text to lucene.
>
> See here : http://en.wikipedia.org/wiki/Antiword
>
> Regards
> Mirko
>
> -----Original Message-----
> From: dipesh [mailto:dipshrestha@gmail.com]
> Sent: Wednesday, November 12, 2008 04:38
> To: java-user@lucene.apache.org
> Subject: Parsing MSWord
>
> Hello,
> I wanted to know if there are classes in Lucene that support parsing
> MSWord documents.
> Many thanks,
> Dipesh
>
> ----------------------------------------
> "Help Ever Hurt Never"- Baba
Re: AW: Parsing MSWord
Posted by Donna L Gresh <gr...@us.ibm.com>.
Check out POI; that's what I use
http://poi.apache.org/
"Sertic Mirko, Bedag" <Mi...@bedag.ch> wrote on 11/12/2008 03:25:47 AM:
> Hi
>
> You can also use a tool called "antiword" to extract the text from a
> .doc file, and then
> give the text to lucene.
>
> See here : http://en.wikipedia.org/wiki/Antiword
>
> Regards
> Mirko
>
> -----Original Message-----
> From: dipesh [mailto:dipshrestha@gmail.com]
> Sent: Wednesday, November 12, 2008 04:38
> To: java-user@lucene.apache.org
> Subject: Parsing MSWord
>
> Hello,
> I wanted to know if there are classes in Lucene that support parsing
> MSWord documents.
> Many thanks,
> Dipesh
>
> ----------------------------------------
> "Help Ever Hurt Never"- Baba
AW: Parsing MSWord
Posted by "Sertic Mirko, Bedag" <Mi...@bedag.ch>.
Hi
You can also use a tool called "antiword" to extract the text from a .doc file, and then
give the text to lucene.
See here : http://en.wikipedia.org/wiki/Antiword
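One way this could look from Java is to start antiword as an external process and read what it prints. The following is only a rough sketch, not a tested integration: the class name AntiwordExtractor and its methods are made up for illustration, and it assumes that an "antiword" binary is on the PATH, that it writes the plain text to standard output, and that the output is UTF-8 (this may depend on how antiword is configured).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical helper class; the name and methods are illustrative only.
class AntiwordExtractor {

    // Runs an external command and returns everything it wrote to standard output.
    static String runAndCapture(List<String> command) {
        try {
            Process process = new ProcessBuilder(command)
                    .redirectError(ProcessBuilder.Redirect.DISCARD) // drop stderr so it cannot block
                    .start();
            byte[] output = process.getInputStream().readAllBytes();
            int exitCode = process.waitFor();
            if (exitCode != 0) {
                throw new IllegalStateException("Command failed with exit code " + exitCode);
            }
            // Assumption: the tool emits UTF-8 text.
            return new String(output, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for command", e);
        }
    }

    // Assumption: "antiword <file>" prints the document text to standard output.
    static String extractText(String docPath) {
        return runAndCapture(List.of("antiword", docPath));
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: AntiwordExtractor <file.doc>");
            return;
        }
        System.out.println(extractText(args[0]));
    }
}
```

The returned string is ordinary text, so it could then be added to a Lucene document as a field in the usual way. Starting a separate process per file keeps the Java side free of native bindings, at the cost of process start-up overhead for every document.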
Regards
Mirko
-----Original Message-----
From: dipesh [mailto:dipshrestha@gmail.com]
Sent: Wednesday, November 12, 2008 04:38
To: java-user@lucene.apache.org
Subject: Parsing MSWord
Hello,
I wanted to know if there are classes in Lucene that support parsing MSWord
documents.
Many thanks,
Dipesh
----------------------------------------
"Help Ever Hurt Never"- Baba
RE: Parsing MSWord
Posted by John Griffin <jg...@thebluezone.net>.
Dipesh,
Start here.
http://poi.apache.org/
John G.
-----Original Message-----
From: dipesh [mailto:dipshrestha@gmail.com]
Sent: Tuesday, November 11, 2008 8:38 PM
To: java-user@lucene.apache.org
Subject: Parsing MSWord
Hello,
I wanted to know if there are classes in Lucene that support parsing MSWord
documents.
Many thanks,
Dipesh
----------------------------------------
"Help Ever Hurt Never"- Baba
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Parsing MSWord
Posted by dipesh <di...@gmail.com>.
Thank you,
It was really helpful. I also found some similar work being done in the
Nutch project.
Regards,
Dipesh
On Wed, Nov 12, 2008 at 12:52 PM, Dave Newton <ne...@yahoo.com> wrote:
> --- On Tue, 11/11/08, dipesh wrote:
> > I wanted to know if there are classes in Lucene that support
> > parsing MSWord documents.
>
> Searching the web might help:
>
> http://www.google.com/search?q=lucene+%2Bword
>
> The Apache Tika project (http://incubator.apache.org/tika/) might also be
> of interest.
>
> Dave
--
----------------------------------------
"Help Ever Hurt Never"- Baba
Re: Parsing MSWord
Posted by Dave Newton <ne...@yahoo.com>.
--- On Tue, 11/11/08, dipesh wrote:
> I wanted to know if there are classes in Lucene that support
> parsing MSWord documents.
Searching the web might help:
http://www.google.com/search?q=lucene+%2Bword
The Apache Tika project (http://incubator.apache.org/tika/) might also be of interest.
Dave