
Posted to java-user@lucene.apache.org by e....@student.utwente.nl on 2007/03/23 18:48:56 UTC

index word files ( doc )

Hello,
 
I am planning to index Word 2003 files. I read that I have to use Jakarta Apache POI, but I also read on the POI site that their work with .doc files is in an early stage.
 
Is POI advisable? Or are there better alternatives?
Please give some advice.
 
Regards,
 
Erik

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: index word files ( doc )

Posted by John Haxby <jc...@scalix.com>.

Daniel Noll wrote:
> The only screenshots I can see look like plain text to me, and I'm 
> currently working on something which needs to convert Word to HTML, 
> which is why I ask.
wvWare, which I mentioned earlier, can convert Word to HTML and does a 
pretty good job of maintaining formatting.  abiword is better though 
(because it goes through a different internal representation).

jch

Re: index word files ( doc )

Posted by Daniel Noll <da...@nuix.com>.

Ryan Ackley wrote:
> As the author of both Word POI and textmining.org, I recommend using
> textmining.org. POI is for general purpose manipulation of Word
> documents. textmining's only purpose is extracting text.

I wish the two would collaborate though.  It's true that POI contains 
code for writing which isn't necessary for indexing.  But it's also true 
that POI contains code for extracting images, which for many projects 
*is* necessary.

> Also, people recommend using POI for text extraction but the only
> place I've seen an actual how-to on this is in the "Lucene in Action"
> book.

It's not too difficult though:

   doc.getTextTable().getTextPieces();

Downside of that approach is that some of the text you get back isn't 
"text" in the sense that you might expect.  (I consider it an upside 
myself, because sometimes it's good to find all this otherwise hidden text.)

Daniel

-- 
Daniel Noll

Nuix Pty Ltd
Suite 79, 89 Jones St, Ultimo NSW 2007, Australia    Ph: +61 2 9280 0699
Web: http://nuix.com/                               Fax: +61 2 9212 6902

This message is intended only for the named recipient. If you are not
the intended recipient you are notified that disclosing, copying,
distributing or taking any action in reliance on the contents of this
message or attachment is strictly prohibited.

Re: index word files ( doc )

Posted by Daniel Noll <da...@nuix.com>.

Ryan Ackley wrote:
 >> Any comments on this are appreciated. One thing I thought of would be
 >> to continue to offer the text extraction as open source but add html
 >> conversion with hit highlighting for a variety of file formats as a
 >> commercial add on. Is this something anyone would pay for? What are
 >> some other pain points of the Lucene community besides text
 >> extraction?

jafarim wrote:
> Good to know that your devised commercial feature is already offered by
> Enhydra Snapper as an open-source feature.
> Check here: http://www.enhydra.org/apps/snapper/index.html

Snapper can convert to HTML?  i.e. while maintaining formatting?

The only screenshots I can see look like plain text to me, and I'm 
currently working on something which needs to convert Word to HTML, 
which is why I ask.

Daniel

-- 
Daniel Noll

Nuix Pty Ltd
Suite 79, 89 Jones St, Ultimo NSW 2007, Australia    Ph: +61 2 9280 0699
Web: http://nuix.com/                               Fax: +61 2 9212 6902

Re: index word files ( doc )

Posted by Ryan Ackley <ry...@gmail.com>.

That is good to know, thank you. Looking at their documentation, their
preview seems to show the contents of the index for a particular file,
and you can transform this using XML. I can see how this would be
useful. What I was proposing was a conversion from the binary format
to HTML, including the rich formatting.

On 3/26/07, jafarim <ja...@gmail.com> wrote:
> Good to know that your devised commercial feature is already offered by
> Enhydra Snapper as an open-source feature.
> Check here: http://www.enhydra.org/apps/snapper/index.html
>
> On 3/26/07, Ryan Ackley <ry...@gmail.com> wrote:
> >
> > Yes I do have plans for adding fast save support and support for more
> > file formats. The time frame for this happening is the next couple of
> > months.
> >
> > I'm playing with the idea of offering a commercial version. I want to
> > continue to support the open source community so I want to keep it
> > open source or free and add value that people would be willing to pay
> > for.
> >
> > Any comments on this are appreciated. One thing I thought of would be
> > to continue to offer the text extraction as open source but add html
> > conversion with hit highlighting for a variety of file formats as a
> > commercial add on. Is this something anyone would pay for? What are
> > some other pain points of the Lucene community besides text
> > extraction?
> >
> > On 3/25/07, Antony Bowesman <ad...@teamware.com> wrote:
> > > I've been using Ryan's textmining in preference to POI, as internally
> > > TM uses POI and the Word6 extractor, so it handles a greater variety of
> > > files.
> > >
> > > Ryan, thanks for fixing your site.  Do you have any plans/ideas on how
> > > to parse the 'fast-saved' files, and any ideas on Word files older than
> > > the Word 6 format?
> > >
> > > Regards
> > > Antony
> > >
> > >
> > > Ryan Ackley wrote:
> > > > As the author of both Word POI and textmining.org, I recommend using
> > > > textmining.org. POI is for general purpose manipulation of Word
> > > > documents. textmining's only purpose is extracting text.
> > > >
> > > > Also, people recommend using POI for text extraction but the only
> > > > place I've seen an actual how-to on this is in the "Lucene in Action"
> > > > book.
> > >
> > >
> > >

Re: index word files ( doc )

Posted by jafarim <ja...@gmail.com>.

Good to know that your devised commercial feature is already offered by
Enhydra Snapper as an open-source feature.
Check here: http://www.enhydra.org/apps/snapper/index.html

On 3/26/07, Ryan Ackley <ry...@gmail.com> wrote:
>
> Yes I do have plans for adding fast save support and support for more
> file formats. The time frame for this happening is the next couple of
> months.
>
> I'm playing with the idea of offering a commercial version. I want to
> continue to support the open source community so I want to keep it
> open source or free and add value that people would be willing to pay
> for.
>
> Any comments on this are appreciated. One thing I thought of would be
> to continue to offer the text extraction as open source but add html
> conversion with hit highlighting for a variety of file formats as a
> commercial add on. Is this something anyone would pay for? What are
> some other pain points of the Lucene community besides text
> extraction?
>
> On 3/25/07, Antony Bowesman <ad...@teamware.com> wrote:
> > I've been using Ryan's textmining in preference to POI, as internally
> > TM uses POI and the Word6 extractor, so it handles a greater variety of
> > files.
> >
> > Ryan, thanks for fixing your site.  Do you have any plans/ideas on how
> > to parse the 'fast-saved' files, and any ideas on Word files older than
> > the Word 6 format?
> >
> > Regards
> > Antony
> >
> >
> > Ryan Ackley wrote:
> > > As the author of both Word POI and textmining.org, I recommend using
> > > textmining.org. POI is for general purpose manipulation of Word
> > > documents. textmining's only purpose is extracting text.
> > >
> > > Also, people recommend using POI for text extraction but the only
> > > place I've seen an actual how-to on this is in the "Lucene in Action"
> > > book.
> >
> >
> >

Re: index word files ( doc )

Posted by Antony Bowesman <ad...@teamware.com>.

Ryan Ackley wrote:
> The 512 byte thing is a limitation of POIFS I think. I could be wrong
> though. Have you tried opening the file with just POIFS?

It was some time ago, but it looks like I used both

org.apache.poi.hwpf.extractor.WordExtractor
org.apache.poi.hdf.extractor.WordDocument

with the same problem.
Antony


Re: index word files ( doc )

Posted by Ryan Ackley <ry...@gmail.com>.

The 512 byte thing is a limitation of POIFS I think. I could be wrong
though. Have you tried opening the file with just POIFS?

On 3/26/07, Antony Bowesman <ad...@teamware.com> wrote:
> Ryan Ackley wrote:
> > Yes I do have plans for adding fast save support and support for more
> > file formats. The time frame for this happening is the next couple of
> > months.
>
> That would be good when it comes.  It would be nice if it could handle a 'brute
> force' mode where in the event of problems, it will just allow the text it can
> find to be extracted.  Currently if there is an Exception, I just run a raw
> strings parser on the file to fetch what I can.
>
> One problem I found is that files not padded to 512 byte blocks cannot be
> parsed, but Word reads them happily.  They seem to be valid in other respects,
> i.e. they have the 1Table, Root Entry and other recognisable parts.  Padding the
> file to 512 byte boundary with nulls parses OK.
>
> Antony
>
>
>

Re: index word files ( doc )

Posted by Antony Bowesman <ad...@teamware.com>.

Ryan Ackley wrote:
> Yes I do have plans for adding fast save support and support for more
> file formats. The time frame for this happening is the next couple of
> months.

That would be good when it comes.  It would be nice if it could handle a 'brute 
force' mode where in the event of problems, it will just allow the text it can 
find to be extracted.  Currently if there is an Exception, I just run a raw 
strings parser on the file to fetch what I can.
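
A minimal sketch of what such a raw strings fallback could look like, assuming it only needs to collect runs of printable single-byte characters. The class name RawStrings and the minimum run length are illustrative and come from neither POI nor textmining.org. Text that Word stored as 16-bit Unicode has a zero byte between characters, so this single-byte scan would miss it.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: a crude "strings"-style scan over raw file bytes. */
final class RawStrings {
    private RawStrings() {}

    /** Returns every run of at least minLen printable ASCII bytes in data. */
    static List<String> extract(byte[] data, int minLen) {
        List<String> runs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (byte b : data) {
            int c = b & 0xFF;
            if (c >= 0x20 && c <= 0x7E) {
                // Printable ASCII: extend the current run.
                current.append((char) c);
            } else {
                // Any other byte ends the run; keep it only if long enough.
                if (current.length() >= minLen) {
                    runs.add(current.toString());
                }
                current.setLength(0);
            }
        }
        // Flush a run that reaches the end of the data.
        if (current.length() >= minLen) {
            runs.add(current.toString());
        }
        return runs;
    }
}
```

Calling RawStrings.extract(bytes, 4) inside the catch block of the normal extractor would give the indexer something to work with even when parsing fails, at the cost of some noise from binary data that happens to look like text.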

One problem I found is that files not padded to 512 byte blocks cannot be 
parsed, but Word reads them happily.  They seem to be valid in other respects, 
i.e. they have the 1Table, Root Entry and other recognisable parts.  Padding the 
file to 512 byte boundary with nulls parses OK.
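
The padding workaround can be sketched as follows, assuming the whole file fits in memory as a byte array. BlockPadding and padToBlock are hypothetical names, not part of POI; the 512-byte block size is the one mentioned above.

```java
import java.util.Arrays;

/** Hypothetical helper: pads file contents with nulls to a 512-byte boundary. */
final class BlockPadding {
    static final int BLOCK = 512;

    private BlockPadding() {}

    /**
     * Returns data unchanged if its length is already a multiple of BLOCK,
     * otherwise a copy extended with zero bytes up to the next multiple.
     */
    static byte[] padToBlock(byte[] data) {
        int remainder = data.length % BLOCK;
        if (remainder == 0) {
            return data;
        }
        // Arrays.copyOf fills the added tail with zeros.
        return Arrays.copyOf(data, data.length + (BLOCK - remainder));
    }
}
```

The padded array can then be wrapped in a java.io.ByteArrayInputStream and handed to the parser in place of the original file stream.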

Antony

Re: index word files ( doc )

Posted by Ryan Ackley <ry...@gmail.com>.

Yes I do have plans for adding fast save support and support for more
file formats. The time frame for this happening is the next couple of
months.

I'm playing with the idea of offering a commercial version. I want to
continue to support the open source community so I want to keep it
open source or free and add value that people would be willing to pay
for.

Any comments on this are appreciated. One thing I thought of would be
to continue to offer the text extraction as open source but add html
conversion with hit highlighting for a variety of file formats as a
commercial add on. Is this something anyone would pay for? What are
some other pain points of the Lucene community besides text
extraction?

On 3/25/07, Antony Bowesman <ad...@teamware.com> wrote:
> I've been using Ryan's textmining in preference to POI, as internally TM uses
> POI and the Word6 extractor, so it handles a greater variety of files.
>
> Ryan, thanks for fixing your site.  Do you have any plans/ideas on how to parse
> the 'fast-saved' files and any ideas on Word files older than the Word 6 format?
>
> Regards
> Antony
>
>
> Ryan Ackley wrote:
> > As the author of both Word POI and textmining.org, I recommend using
> > textmining.org. POI is for general purpose manipulation of Word
> > documents. textmining's only purpose is extracting text.
> >
> > Also, people recommend using POI for text extraction but the only
> > place I've seen an actual how-to on this is in the "Lucene in Action"
> > book.
>
>
>

Re: index word files ( doc )

Posted by Antony Bowesman <ad...@teamware.com>.

I've been using Ryan's textmining in preference to POI, as internally TM uses 
POI and the Word6 extractor, so it handles a greater variety of files.

Ryan, thanks for fixing your site.  Do you have any plans/ideas on how to parse 
the 'fast-saved' files and any ideas on Word files older than the Word 6 format?

Regards
Antony

Ryan Ackley wrote:
> As the author of both Word POI and textmining.org, I recommend using
> textmining.org. POI is for general purpose manipulation of Word
> documents. textmining's only purpose is extracting text.
> 
> Also, people recommend using POI for text extraction but the only
> place I've seen an actual how-to on this is in the "Lucene in Action"
> book.

Re: index word files ( doc )

Posted by Ryan Ackley <ry...@gmail.com>.

As the author of both Word POI and textmining.org, I recommend using
textmining.org. POI is for general purpose manipulation of Word
documents. textmining's only purpose is extracting text.

Also, people recommend using POI for text extraction but the only
place I've seen an actual how-to on this is in the "Lucene in Action"
book.

On 3/24/07, jafarim <ja...@gmail.com> wrote:
> Can anyone make a comparison between the two, namely POI API and the one
> from textmining.org?
>
> On 3/24/07, Ryan Ackley <ry...@gmail.com> wrote:
> >
> > The site is down but you can download the word extractor library direct
> > here:
> >
> > http://www.textmining.org/textmining.zip
> >
> > Going to fix the site this weekend.
> >
> > On 3/24/07, Sami Siren <ss...@gmail.com> wrote:
> > > Antony Bowesman wrote:
> > >
> > > >> Are there other solutions?
> > >
> > > There's also antiword [1] which can convert your .doc to plain text or
> > > PS, not sure how good it is.
> > >
> > > --
> > >  Sami Siren
> > >
> > > [1] http://www.winfield.demon.nl/
> > >

Re: index word files ( doc )

Posted by jafarim <ja...@gmail.com>.

Can anyone make a comparison between the two, namely POI API and the one
from textmining.org?

On 3/24/07, Ryan Ackley <ry...@gmail.com> wrote:
>
> The site is down but you can download the word extractor library direct
> here:
>
> http://www.textmining.org/textmining.zip
>
> Going to fix the site this weekend.
>
> On 3/24/07, Sami Siren <ss...@gmail.com> wrote:
> > Antony Bowesman wrote:
> >
> > >> Are there other solutions?
> >
> > There's also antiword [1] which can convert your .doc to plain text or
> > PS, not sure how good it is.
> >
> > --
> >  Sami Siren
> >
> > [1] http://www.winfield.demon.nl/
> >

Re: index word files ( doc )

Posted by Ryan Ackley <ry...@gmail.com>.

The site is down but you can download the word extractor library direct here:

http://www.textmining.org/textmining.zip

Going to fix the site this weekend.

On 3/24/07, Sami Siren <ss...@gmail.com> wrote:
> Antony Bowesman wrote:
>
> >> Are there other solutions?
>
> There's also antiword [1] which can convert your .doc to plain text or
> PS, not sure how good it is.
>
> --
>  Sami Siren
>
> [1] http://www.winfield.demon.nl/
>

Re: index word files ( doc )

Posted by John Haxby <jc...@scalix.com>.

John Haxby wrote:
> Sami Siren wrote:
>> There's also antiword [1] which can convert your .doc to plain text or
>> PS, not sure how good it is.
>>   
> antiword isn't very good.  I use wvWare 
> (http://wvware.sourceforge.net/) directly, but you may find that using 
> abiword is better for you (abiword is an editor, but it also does 
> conversions and actually it's quite fast for that).  It also deals 
> with fast-saved word docs.
>
Sigh.   Must remember to read messages before sending.   abiword uses 
wvWare -- both will deal with fast-saved word docs, not just abiword.
> We used to use POI, but it's a little fragile and people were getting 
> all upset when a word document gummed up the works.  Using an external 
> executable seems to be no slower and is certainly less problematic.
>
> jch
>

Re: index word files ( doc )

Posted by John Haxby <jc...@scalix.com>.

Sami Siren wrote:
> There's also antiword [1] which can convert your .doc to plain text or
> PS, not sure how good it is.
>   
antiword isn't very good.  I use wvWare (http://wvware.sourceforge.net/) 
directly, but you may find that using abiword is better for you (abiword 
is an editor, but it also does conversions and actually it's quite fast 
for that).  It also deals with fast-saved word docs.

We used to use POI, but it's a little fragile and people were getting 
all upset when a word document gummed up the works.  Using an external 
executable seems to be no slower and is certainly less problematic.
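
A rough sketch of the external-executable approach, assuming the converter writes its text to standard output and that a hard timeout is wanted so a bad document cannot stall the indexer. ExternalConverter is a hypothetical name, the command line is supplied entirely by the caller (no particular tool or flag is assumed here), and decoding the output as UTF-8 is an assumption to adjust for the tool in use.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: runs an external converter with a hard timeout. */
final class ExternalConverter {
    private ExternalConverter() {}

    /** Runs command and returns what it wrote to stdout, or throws on failure. */
    static String run(List<String> command, long timeoutSeconds)
            throws IOException, InterruptedException {
        // Send stdout to a temp file so a hung or chatty process cannot block us.
        Path out = Files.createTempFile("convert", ".out");
        try {
            Process p = new ProcessBuilder(command)
                    .redirectOutput(out.toFile())
                    .redirectError(ProcessBuilder.Redirect.DISCARD) // Java 9+
                    .start();
            if (!p.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                // The converter is stuck on this document: kill it and report.
                p.destroyForcibly();
                throw new IOException("converter timed out: " + command);
            }
            if (p.exitValue() != 0) {
                throw new IOException("converter exited with " + p.exitValue());
            }
            // Assumption: the tool emits UTF-8 text.
            return new String(Files.readAllBytes(out), StandardCharsets.UTF_8);
        } finally {
            Files.deleteIfExists(out);
        }
    }
}
```

Because the converter runs in its own process, a crash or runaway loop on one document surfaces as an IOException for that file instead of taking the indexing JVM down with it.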

jch

Re: index word files ( doc )

Posted by Sami Siren <ss...@gmail.com>.

Antony Bowesman wrote:

>> Are there other solutions?

There's also antiword [1] which can convert your .doc to plain text or
PS, not sure how good it is.

--
 Sami Siren

[1] http://www.winfield.demon.nl/

Re: index word files ( doc )

Posted by Antony Bowesman <ad...@teamware.com>.

There is www.textmining.org, but the site is no longer accessible.  Check Nutch, which has 
a Word parser; it seems to be the original textmining.org Word6+POI parser.

Pre-Word6 and "fast-saved" files will not work.  I've not found a solution for those.
Antony

e.j.w.vanbloem@student.utwente.nl wrote:
> Thank you,
>  
> Are there other solutions?

RE: index word files ( doc )

Posted by e....@student.utwente.nl.

Thank you,
 
Are there other solutions?

________________________________

From: jafarim [mailto:jafarim@gmail.com]
Sent: Fri 23-3-2007 18:55
To: java-user@lucene.apache.org
Subject: Re: index word files ( doc )



Hi
My experience has not been very satisfactory. It breaks very easily on many files.

On 3/23/07, e.j.w.vanbloem@student.utwente.nl <
e.j.w.vanbloem@student.utwente.nl> wrote:
>
> Hello,
>
> I am planning to index Word 2003 files. I read I have to use Jakarta
> Apache POI, but I also read on the POI site that their work with doc's is in
> an early stage.
>
> Is POI advisable? Or are there better alternatives?
> Please give some advice.
>
> Regards,
>
> Erik
>

Re: index word files ( doc )

Posted by jafarim <ja...@gmail.com>.

Hi
My experience has not been very satisfactory. It breaks very easily on many files.

On 3/23/07, e.j.w.vanbloem@student.utwente.nl <
e.j.w.vanbloem@student.utwente.nl> wrote:
>
> Hello,
>
> I am planning to index Word 2003 files. I read I have to use Jakarta
> Apache POI, but I also read on the POI site that their work with doc's is in
> an early stage.
>
> Is POI advisable? Or are there better alternatives?
> Please give some advice.
>
> Regards,
>
> Erik
>