
Posted to user@poi.apache.org by Prateek Jain <pj...@sapient.com> on 2015/04/08 12:03:29 UTC

Reading .doc/.docx files


Hi,

I am trying to read Microsoft Word documents with Apache POI and found a couple of convenient methods for scanning a document, such as getText() and getParagraphList(). My use case is slightly different: I want to walk through the document and receive events or information for each heading, paragraph, and table in the same order in which they appear in the document. That would help me build a document structure like this:

<content>
  <section>
    <heading>ABC</heading>
    <paragraph>xyz</paragraph>
    <paragraph>scanning through APIs</paragraph>
  </section>
  .
  .
  .
</content>

The main intent is to preserve the relationship between headings and paragraphs as it is in the original document. I am not sure, but could something like this work for me?

Iterator<IBodyElement> itr = doc.getBodyElementsIterator();
while (itr.hasNext()) {
    IBodyElement ele = itr.next();
    // Prints the element type in document order, e.g. PARAGRAPH or TABLE
    System.out.println(ele.getElementType());
}

With this code I was able to get the paragraphs, but not the heading information. I am interested in all headings, whether they are explicitly marked with a heading style or only set apart by a large font size. A prompt response would be highly appreciated.
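
To make the intended grouping concrete, here is a minimal sketch of the step that turns an ordered stream of elements into heading-led sections. It uses only the JDK. SectionBuilder, Element, Kind, looksLikeHeading and the minHeadingPt threshold are hypothetical names made up for this illustration and are not part of POI. In a real run the elements would presumably come from the body element iteration above, with the style name and font size read from each paragraph (for example via XWPFParagraph.getStyle()); that wiring is an assumption and is left out here.

```java
import java.util.List;
import java.util.Locale;

/** Groups body elements, given in document order, into heading-led sections. */
class SectionBuilder {

    /** Hypothetical element kinds for this sketch; these are not POI types. */
    enum Kind { HEADING, PARAGRAPH, TABLE }

    /** One body element in document order (hypothetical holder class). */
    static final class Element {
        final Kind kind;
        final String text;

        Element(Kind kind, String text) {
            this.kind = kind;
            this.text = text;
        }
    }

    /**
     * Assumed heading rule: the style name starts with "heading"
     * (case-insensitive), or the font size reaches minHeadingPt.
     * A font size of -1 is treated as "not set".
     */
    static boolean looksLikeHeading(String styleName, int fontSizePt, int minHeadingPt) {
        if (styleName != null
                && styleName.toLowerCase(Locale.ROOT).startsWith("heading")) {
            return true;
        }
        return fontSizePt >= minHeadingPt;
    }

    /** Emits one section per heading; content before the first heading gets its own section. */
    static String toXml(List<Element> elements) {
        StringBuilder out = new StringBuilder("<content>\n");
        boolean open = false;
        for (Element e : elements) {
            if (e.kind == Kind.HEADING) {
                // A new heading closes the previous section and opens the next one.
                if (open) {
                    out.append("</section>\n");
                }
                out.append("<section>\n");
                open = true;
                out.append("  <heading>").append(escape(e.text)).append("</heading>\n");
            } else {
                if (!open) {
                    out.append("<section>\n");
                    open = true;
                }
                String tag = (e.kind == Kind.TABLE) ? "table" : "paragraph";
                out.append("  <").append(tag).append('>')
                   .append(escape(e.text))
                   .append("</").append(tag).append(">\n");
            }
        }
        if (open) {
            out.append("</section>\n");
        }
        return out.append("</content>\n").toString();
    }

    /** Escapes the characters that would break the XML text content. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

The font-size rule is only a heuristic: a threshold that works for one document may misclassify large body text in another, so the style-based check is likely the more reliable of the two.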

Regards,
Prateek

RE: Reading .doc/.docx files

Posted by Prateek Jain <pj...@sapient.com>.

I looked into the org.apache.tika.parser.microsoft.WordExtractor class, and it will most probably get me where I want to go.

Regards,
Prateek Jain

