Posted to user@nutch.apache.org by "Trym B. Asserson" <tr...@creuna.no> on 2006/09/21 13:04:58 UTC

Nutch 0.8 - MS Word document parse failure : "Can't be handled as micrsosoft document. java.util.NoSuchElementException"

Hi,

We've run into an error while parsing a particular MS Word document. The
parser throws an exception with the message:

"Can't be handled as micrsosoft document.
java.util.NoSuchElementException"

I've experimented with the source Word document and found that parsing
fails on the first page. That page has two rectangular elements (simple
drawing objects with colours) and three TextBox objects, and this
combination produces the error above. However, removing any one of the
three TextBox objects (it doesn't matter which) makes the document parse
just fine.

Has anyone encountered anything similar, or does anyone know a work-around
that makes the MS Word parser ignore this and parse whatever it can?
I have the source document available and can attach it if anyone is
willing to give it a go and see for themselves.


Best regards,
Trym

--
Trym Asserson
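
One possible shape for the "parse whatever it can" work-around asked about
above, as a minimal Java sketch. Everything here is an assumption for
illustration: the class LenientParse, the methods parseLeniently and
printableRuns, and the idea of passing the real parser in as a function are
hypothetical names, not Nutch or POI API. The fallback simply keeps runs of
printable ASCII bytes (like the Unix "strings" tool), so it will miss any
text the document stores as UTF-16.

```java
import java.nio.charset.StandardCharsets;
import java.util.NoSuchElementException;
import java.util.function.Function;

public class LenientParse {

    /** Text recovered from a document, plus a flag saying the primary parser failed. */
    static final class Result {
        final String text;
        final boolean degraded;

        Result(String text, boolean degraded) {
            this.text = text;
            this.degraded = degraded;
        }
    }

    /** Tries the primary extractor; on any unchecked exception, uses the fallback instead. */
    static Result parseLeniently(byte[] content,
                                 Function<byte[], String> primary,
                                 Function<byte[], String> fallback) {
        try {
            return new Result(primary.apply(content), false);
        } catch (RuntimeException e) {
            // Covers java.util.NoSuchElementException, the error reported above.
            return new Result(fallback.apply(content), true);
        }
    }

    /** Crude fallback: joins runs of at least minRun printable ASCII bytes with spaces. */
    static String printableRuns(byte[] content, int minRun) {
        StringBuilder out = new StringBuilder();
        StringBuilder run = new StringBuilder();
        for (byte b : content) {
            int c = b & 0xFF;
            if (c >= 0x20 && c < 0x7F) {
                run.append((char) c);
            } else {
                flush(run, out, minRun);
            }
        }
        flush(run, out, minRun);
        return out.toString();
    }

    /** Appends the current run to out if it is long enough, then resets the run. */
    private static void flush(StringBuilder run, StringBuilder out, int minRun) {
        if (run.length() >= minRun) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(run);
        }
        run.setLength(0);
    }

    public static void main(String[] args) {
        // Stand-in bytes, not a real .doc file; the primary parser is simulated as failing.
        byte[] doc = "\u0001\u0002Hello world\u0000ab\u0003second run\u0004"
                .getBytes(StandardCharsets.ISO_8859_1);
        Result r = parseLeniently(doc,
                bytes -> { throw new NoSuchElementException(); },
                bytes -> printableRuns(bytes, 4));
        System.out.println(r.degraded + ": " + r.text);
    }
}
```

The degraded flag lets the caller decide whether to index the partial text or
just log the document as a parse failure.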

Re: Nutch 0.8 - MS Word document parse failure : "Can't be handled as micrsosoft document. java.util.NoSuchElementException"

Posted by Tomi NA <he...@gmail.com>.
On 9/21/06, Jim Wilson <wi...@gmail.com> wrote:
> I haven't had this particular problem, but here's something to consider:
> After you remove the TextBox objects you have to re-save the document.  Is
> the new document the same version as the previous one?  By this I mean, the
> same Word version (97, 2000, etc).

I've had some difficulties with various MS Office documents and it makes
me wonder: would using OpenOffice.org to parse the files make more
sense than using POI? OO.org uses the UNO framework, which has a Java
API, so conceivably anything OO.org understands, Nutch would
understand too.
The fact that OO.org is able to parse MS formats fairly well (better
than most other libraries/applications) suggests that it'd give the
best results if at some point Nutch/Lucene supported weighted
relations between a term and a field/document. That would make, e.g.,
words appearing in headers more important than words in
footnotes.
Returning to the subject of parsing MS document formats at all: has
anyone considered or attempted using OO.org UNO to parse them? Are there
any major shortcomings to the approach?

t.n.a.
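
A rough Java sketch of the office-suite route, with one substitution stated
plainly: instead of the UNO Java API discussed above, it shells out to the
suite's command-line converter. It assumes a build whose soffice binary
accepts --headless and --convert-to (later OpenOffice.org/LibreOffice
releases do; older ones may not), and the class and method names are
hypothetical.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

public class OfficeConvert {

    /** Builds the converter command line; sofficeBinary is the path to the executable. */
    static List<String> buildCommand(String sofficeBinary, Path input, Path outDir) {
        return Arrays.asList(
                sofficeBinary,
                "--headless",               // run without a GUI
                "--convert-to", "txt:Text", // plain-text export filter
                "--outdir", outDir.toString(),
                input.toString());
    }

    /** Runs the conversion and returns the process exit code. */
    static int convert(String sofficeBinary, Path input, Path outDir)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(sofficeBinary, input, outDir))
                .inheritIO()
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) {
        // Only prints the command; call convert(...) where an office suite is installed.
        List<String> cmd = buildCommand("soffice", Paths.get("report.doc"), Paths.get("out"));
        System.out.println(String.join(" ", cmd));
    }
}
```

Starting an office process per document is slow, so for a crawl it would make
more sense to keep one instance running and talk to it, which is where UNO
itself would come in.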

Re: Nutch 0.8 - MS Word document parse failure : "Can't be handled as micrsosoft document. java.util.NoSuchElementException"

Posted by Jim Wilson <wi...@gmail.com>.
I haven't had this particular problem, but here's something to consider:
After you remove the TextBox objects you have to re-save the document.  Is
the new document the same version as the previous one?  By this I mean, the
same Word version (97, 2000, etc).

It's possible that the Java code can only handle Word Docs up to a certain
MS Office version (XP for example).  If you were using Word XP to save the
new document, but the original was made with Word 2003, then the new
document would be parsable regardless of the TextBox change.

I suggest opening the Doc in Word, then (without making any changes to the
doc) doing a File -> Save As.  In the type drop down, select Word Doc
(97/2000/XP) or similar.  I'm interested to see if this new one works. If it
doesn't, then I have no idea :(

-- Jim
