Posted to user@poi.apache.org by "chris.b" <om...@gmail.com> on 2007/11/26 21:18:50 UTC

Problem with word documents

I posted this to the wrong forum the first time round, so here goes...

I'm very new to Lucene, so this may be my mistake. I can get it to index
.txt files, but when I try to index Word documents (using POI), the program
runs until it reaches a .doc file and then fails with the following error:

Exception in thread "main"
org.apache.poi.hpsf.IllegalPropertySetDataException: The property set claims
to have a size of 16 bytes. However, it exceeds 16 bytes.
        at org.apache.poi.hpsf.Section.<init>(Section.java:255)
        at org.apache.poi.hpsf.PropertySet.init(PropertySet.java:454)
        at org.apache.poi.hpsf.PropertySet.<init>(PropertySet.java:249)
        at
org.apache.poi.hpsf.PropertySetFactory.create(PropertySetFactory.java:61)
        at org.apache.poi.POIDocument.getPropertySet(POIDocument.java:92)
        at org.apache.poi.POIDocument.readProperties(POIDocument.java:69)
        at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:147)
        at
org.apache.poi.hwpf.extractor.WordExtractor.<init>(WordExtractor.java:56)
        at
org.apache.poi.hwpf.extractor.WordExtractor.<init>(WordExtractor.java:48)
        at Indexer.indexFile(Indexer.java:76)
        at Indexer.indexDirectory(Indexer.java:57)
        at Indexer.index(Indexer.java:38)
        at Indexer.main(Indexer.java:20)

and my code is as follows:

        private static void indexFile(IndexWriter writer, File f)
                        throws IOException {
                if (f.isHidden() || !f.exists() || !f.canRead()) {
                        return;
                }

                System.out.println("Adding " + f.getCanonicalPath()
                                + " to the index.");

                Document doc = new Document();

                if (f.getName().endsWith(".doc")) {
                        // For .doc files: extract the text with POI's WordExtractor.
                        FileInputStream docfin = new FileInputStream(f);
                        try {
                                WordExtractor docextractor = new WordExtractor(docfin);
                                String content = docextractor.getText();
                                doc.add(new Field("contents", content,
                                                Field.Store.NO, Field.Index.TOKENIZED));
                        } finally {
                                // Release the file handle even if POI throws.
                                docfin.close();
                        }
                } else if (f.getName().endsWith(".txt")) {
                        // For .txt files: let Lucene read the contents from the Reader.
                        doc.add(new Field("contents", new FileReader(f)));
                }

                doc.add(new Field("filename", f.getCanonicalPath(),
                                Field.Store.YES, Field.Index.TOKENIZED));
                writer.addDocument(doc);
        }

(I think i included all that's necessary)
Thanks in advance for any help.
-- 
View this message in context: http://www.nabble.com/Problem-with-word-documents-tf4877644.html#a13957674
Sent from the POI - User mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org


RE: Problem with word documents

Posted by "chris.b" <om...@gmail.com>.
When using your method I get "Unhandled exception type Exception" :s
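
That compiler message usually means the calling code neither catches nor declares the checked exception: the quoted getMSWordContent method is declared "throws Exception", so every caller has to deal with it. A minimal sketch of both options follows. The extract method is a hypothetical stand-in for getMSWordContent, not the real extractor, and the class name is made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;

public class CheckedExceptionDemo {

    // Hypothetical stand-in for getMSWordContent: it is declared
    // "throws Exception", so the compiler forces callers to handle it.
    public static String extract(InputStream is) throws Exception {
        if (is == null) {
            throw new Exception("no input stream");
        }
        return "text from a stream with " + is.available() + " bytes";
    }

    // Option 1: catch the exception at the call site and use a fallback value.
    public static String extractOrEmpty(InputStream is) {
        try {
            return extract(is);
        } catch (Exception e) {
            System.err.println("extraction failed: " + e.getMessage());
            return "";
        }
    }

    // Option 2: declare the exception and leave the decision to the caller.
    public static String extractOrThrow(InputStream is) throws Exception {
        return extract(is);
    }

    public static void main(String[] args) {
        System.out.println(extractOrEmpty(new ByteArrayInputStream(new byte[3])));
        System.out.println("[" + extractOrEmpty(null) + "]");
    }
}
```

In an indexer, option 1 is probably the more practical choice, since one unreadable document then does not abort the whole run.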


WATHELET Thomas-2 wrote:
> 
> I have a solution. I will try to attach it to this mail; if the attachment
> is stripped for security reasons, please let me know and I will send it to
> your private email address.
> 
> Add this jar to the classpath of your project and use it like this:
> 
> public String getMSWordContent(InputStream is) throws Exception {
>     String temp = null;
>     europarl.trad.sild.msword.extraction.WordExtractor wd =
>             new europarl.trad.sild.msword.extraction.WordExtractor();
>     try {
>         temp = wd.extractText(is);
>     } catch (Exception e) {
>         throw new Exception("getTextMiningContent: " + e.toString());
>     }
>     return temp;
> }


RE: Problem with word documents

Posted by WATHELET Thomas <th...@europarl.europa.eu>.
I have a solution. I will try to attach it to this mail; if the attachment
is stripped for security reasons, please let me know and I will send it to
your private email address.

Add this jar to the classpath of your project and use it like this:

public String getMSWordContent(InputStream is) throws Exception {
    String temp = null;
    europarl.trad.sild.msword.extraction.WordExtractor wd =
            new europarl.trad.sild.msword.extraction.WordExtractor();
    try {
        temp = wd.extractText(is);
    } catch (Exception e) {
        throw new Exception("getTextMiningContent: " + e.toString());
    }
    return temp;
}
 




RE: Problem with word documents

Posted by WATHELET Thomas <th...@europarl.europa.eu>.
I forgot to tell you to also add the POI jar files to your classpath.



Re: Problem with word documents

Posted by "chris.b" <om...@gmail.com>.
Is there any way to catch the IllegalPropertySetDataException?
It seems to me that the documents must have an RTF header but the text
encoding of a Word document (I don't know if this makes sense), because the
RTF handler reads the documents but doesn't index the contents :s

thank you,
Chris




Re: Problem with word documents

Posted by Rainer Schwarze <rs...@admadic.de>.
chris.b wrote:
> i found a simpler way round this, i read somewhere that open office simply
> converted .rtf files to .doc, so i just handle it using the rtf handler, but
> i wanted to do it so it would try handling it with poi, in case it catches a
> IllegalPropertySetDataException, to handle it with rtf handler, but it
> always gives me the exception...

The RTF handler should not be able to read a Word document. Word can
pretend that an RTF file is a Word file while it is actually still RTF,
but if the file really is a Word file and not RTF, then RTF readers cannot
handle it.
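
One way to tell which case applies is to look at the first bytes of the file rather than at its extension. An RTF file starts with the ASCII text "{\rtf", while a binary Word document is an OLE2 compound file that starts with the bytes D0 CF 11 E0 A1 B1 1A E1. A minimal sketch of such a check; the class and method names are made up for illustration and are not part of POI:

```java
import java.nio.charset.StandardCharsets;

public class DocSniffer {

    // Signature of an OLE2 compound file, the container of binary .doc files.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // RTF files start with this ASCII text.
    private static final byte[] RTF_MAGIC =
            "{\\rtf".getBytes(StandardCharsets.US_ASCII);

    // True if data begins with all bytes of prefix.
    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    // Classifies the leading bytes of a file as "ole2", "rtf" or "unknown".
    public static String sniff(byte[] head) {
        if (startsWith(head, OLE2_MAGIC)) {
            return "ole2";
        }
        if (startsWith(head, RTF_MAGIC)) {
            return "rtf";
        }
        return "unknown";
    }

    public static void main(String[] args) {
        byte[] sample = "{\\rtf1\\ansi Hello}".getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniff(sample));
    }
}
```

To use it on a real file, read the first eight bytes into an array and pass that to sniff; only a file reported as "rtf" is worth handing to an RTF reader.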

What do you mean by "always gives me the exception"? Do all your files
throw this exception or is the exception propagated out of the try/catch
range? (If all your files are created using open office, all your files
are likely to throw that exception if one is doing that.) BTW: which
version of OpenOffice was used to create the files?

> Example of what i'm trying to do:
[...]
> I don't know if there's any "bad programming in this", for now i just wanted
> it to work :p

This is what I tried:
(The declaration of fisdoc was not included in your copied code - I
assume you created a second FileInputStream...)

---8<--------------
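// Imports needed by this snippet (the enclosing class is omitted):
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import javax.swing.text.BadLocationException;
import javax.swing.text.DefaultStyledDocument;
import javax.swing.text.rtf.RTFEditorKit;
import org.apache.poi.hpsf.IllegalPropertySetDataException;
import org.apache.poi.hwpf.extractor.WordExtractor;

public static void main(String[] args)
throws IOException, BadLocationException {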
  String content = null;
  File f = new File("monte.doc");
  FileInputStream docfin = new FileInputStream(f.getAbsolutePath());
  try {
    WordExtractor docextractor = new WordExtractor(docfin);
    content = docextractor.getText();
  } catch (IllegalPropertySetDataException e) {
    try {
      System.out.println("exc caught");
      FileInputStream fisdoc = new FileInputStream(f.getAbsolutePath());
      DefaultStyledDocument styledDoc = new DefaultStyledDocument();
      new RTFEditorKit().read(fisdoc, styledDoc, 0);
      System.out.println(
        "styledDoc.getLength() = " + styledDoc.getLength());
      content = styledDoc.getText(0, styledDoc.getLength());
    } catch (Throwable t) {
      System.out.println("exc2 caught");
    }
  }
  System.out.println("[" + content + "]");
}
---8<--------------

It prints this:

exc caught
styledDoc.getLength() = 0
[]


I assume that the RTFReader (used in the RTFEditorKit) only encounters
binary data and never hits a keyword which it interprets as RTF, so it
does not read any real content from the file. (I would have expected it
to throw an exception, which is why I wanted to see for myself what
happens :-) )

Best wishes,
Rainer


Re: Problem with word documents

Posted by "chris.b" <om...@gmail.com>.
I found a simpler way round this: I read somewhere that OpenOffice simply
converts .rtf files to .doc, so I just handle these files with the RTF
handler. What I wanted, though, was to try POI first and fall back to the
RTF handler only if an IllegalPropertySetDataException is caught, but it
always gives me the exception...

Example of what i'm trying to do:

        try {
                content = docextractor.getText();
        } catch (IllegalPropertySetDataException ipsde) {
                try {
                        DefaultStyledDocument styledDoc = new DefaultStyledDocument();
                        new RTFEditorKit().read(fisdoc, styledDoc, 0);
                        content = styledDoc.getText(0, styledDoc.getLength());
                } catch (BadLocationException ble) {
                        return;
                } finally {
                        return;
                }
        }

I don't know if there's any "bad programming" in this; for now I just want
it to work :p
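
Judging from the stack trace in the first message, the exception is thrown inside the WordExtractor constructor (WordExtractor.<init>), not by getText(). If docextractor is created before the try block, the catch clause in the snippet never sees the exception. A small stand-alone sketch of that effect, using a hypothetical FailingExtractor class in place of POI's WordExtractor and IllegalStateException in place of IllegalPropertySetDataException:

```java
public class TryScopeDemo {

    // Hypothetical stand-in for an extractor whose constructor rejects the input.
    static class FailingExtractor {
        FailingExtractor() {
            throw new IllegalStateException("bad property set");
        }

        String getText() {
            return "never reached";
        }
    }

    // Constructor call sits outside the try block: the exception escapes.
    public static String constructOutsideTry() {
        FailingExtractor extractor = new FailingExtractor();
        try {
            return extractor.getText();
        } catch (IllegalStateException e) {
            return "fallback";
        }
    }

    // Constructor call sits inside the try block: the catch clause handles it.
    public static String constructInsideTry() {
        try {
            FailingExtractor extractor = new FailingExtractor();
            return extractor.getText();
        } catch (IllegalStateException e) {
            return "fallback";
        }
    }

    public static void main(String[] args) {
        System.out.println(constructInsideTry());
        try {
            constructOutsideTry();
        } catch (IllegalStateException e) {
            System.out.println("escaped: " + e.getMessage());
        }
    }
}
```

Separately, if the snippet is complete as posted, the "finally { return; }" on the inner try returns on every path through the fallback, so whatever the RTF handler reads would never reach the index.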

thank you, 

chris


Re: Problem with word documents

Posted by Rainer Schwarze <rs...@admadic.de>.
chris.b wrote:
> here's a sample file that i wasn't able to index
> http://www.nabble.com/file/p13972759/monte.doc monte.doc 
> thanks for the help :)

As a last thing today I took a quick look at the file. A quick solution
might be to skip the readProperties() call in the HWPFDocument
constructor (don't know right now, whether the properties are really
needed if you only read the Word file):

 public HWPFDocument(POIFSFileSystem pfilesystem) throws IOException
  {
    // Sort out the hpsf properties
    filesystem = pfilesystem;
    readProperties();    // <---- remove that one
    ...

Depending on how much work you intend to do, you could either comment
the line out and rebuild the library or subclass HWPFDocument and
override readProperties() with an empty method (what I would recommend
to try first). For the second case, you should get along by changing the
WordExtractor constructor call in the code which you posted to:

WordExtractor docextractor = new WordExtractor(new MyHWPFDoc(docfin));

(MyHWPFDoc being a subclass of HWPFDocument with the empty
readProperties() )
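
The subclass approach relies on Java dispatching to the overriding method even when the call comes from the superclass constructor. A minimal stand-alone sketch of that mechanism follows. BaseDoc and LenientDoc are hypothetical stand-ins for HWPFDocument and MyHWPFDoc, and whether readProperties() can actually be overridden depends on the POI version in use.

```java
public class OverrideDemo {

    // Hypothetical stand-in for HWPFDocument: its constructor calls
    // readProperties(), which fails on a damaged property set.
    static class BaseDoc {
        protected String text;

        BaseDoc(String raw) {
            readProperties();
            text = raw;
        }

        protected void readProperties() {
            throw new IllegalStateException("property set is corrupt");
        }

        String getText() {
            return text;
        }
    }

    // Hypothetical stand-in for MyHWPFDoc: skips the property parsing.
    static class LenientDoc extends BaseDoc {
        LenientDoc(String raw) {
            super(raw);
        }

        @Override
        protected void readProperties() {
            // Intentionally empty: the document properties are ignored.
        }
    }

    // Builds the lenient document and returns its text.
    public static String read(String raw) {
        return new LenientDoc(raw).getText();
    }

    public static void main(String[] args) {
        System.out.println(read("body text"));
    }
}
```

One caveat: the overriding method runs before the subclass's own fields are initialised, so it should not touch them. An empty override avoids that problem.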

Best wishes, Rainer


Re: Problem with word documents

Posted by "chris.b" <om...@gmail.com>.
here's a sample file that i wasn't able to index
http://www.nabble.com/file/p13972759/monte.doc monte.doc 
thanks for the help :)




Re: Problem with word documents

Posted by Rainer Schwarze <rs...@admadic.de>.
chris.b wrote:
> seen as it only happens with documents that are created and saved using open
> office, could that be the problem?
> i know that this part has nothing to do with poi, but should i in that case
> try using the open office sdk to handle word documents?
> 
> thank you
> 
> Chris

I just tried to open an OpenOffice-generated Word file in my HWPF version
and it worked without problems. If you can send me a sample file, I can
take a look at it.
Regarding the OpenOffice SDK: I have never used it, so I can't say much
about it. If it works for you, you may have less trouble reading all the
different Word files. It depends a bit on how complex your files are
(full-blown fancy flyers made in Word or simple reports consisting mainly
of text...).

Best wishes, Rainer


Re: Problem with word documents

Posted by "chris.b" <om...@gmail.com>.
Seeing as it only happens with documents that are created and saved using
OpenOffice, could that be the problem?
I know that this part has nothing to do with POI, but should I in that case
try using the OpenOffice SDK to handle Word documents?

thank you

Chris



RE: Problem with word documents

Posted by WATHELET Thomas <th...@europarl.europa.eu>.
OK, thanks.
In fact, this document was badly converted from WordPerfect to Word.
That's the problem.
Thanks



Re: Problem with word documents

Posted by "Rainer Klute (Rainer Klute IT-Consulting GmbH)" <kl...@rainer-klute.de>.
chris.b wrote:
> okay, so i'm very new to lucene, so it may be my bad, but i can get it to
> index .txt files, and when trying to index word documents (using poi), the
> program starts running and when it reaches a .doc file, i get the following
> errors:
>
> Exception in thread "main"
> org.apache.poi.hpsf.IllegalPropertySetDataException: The property set claims
> to have a size of 16 bytes. However, it exceeds 16 bytes.
>   

According to the exception's message it seems to be a problem in the
document.

Best regards
Rainer Klute

                           Rainer Klute IT-Consulting
  Dipl.-Inform.
  Rainer Klute             E-Mail:  klute@rainer-klute.de
  Körner Grund 24          Telefon: +49 172 2324824
D-44143 Dortmund           Telefax: +49 231 5349423

OpenPGP fingerprint: E4E4386515EE0BED5C162FBB5343461584B5A42E