Posted to dev@poi.apache.org by Ramani Routray <ro...@gmail.com> on 2017/05/09 16:42:45 UTC

Java (Apache POI) : How to retrieve comment/annotation and associated highlight text from Microsoft Word?

I have a Microsoft Word (.docx) file and am trying to retrieve the comments and their associated highlighted text. Can you please help?

Attaching a picture of the sample Word document and the Java code for extracting the comments. [ A file with a line "My name is John". The word "John" is highlighted with a comment "Noun" ]

I am able to extract the comments (Noun, Adjective). I would like to extract the text associated with the comment "Noun" (Noun = John, Adjective = great)

    FileInputStream fis = new FileInputStream(new File(msWordFilePath));
    XWPFDocument adoc = new XWPFDocument(fis);
    XWPFWordExtractor xwe = new XWPFWordExtractor(adoc);
    XWPFComment[] comments = adoc.getComments();

    for (int idx = 0; idx < comments.length; idx++)
    {
        MSWordAnnotation annot = new MSWordAnnotation();
        annot.setAnnotationName(comments[idx].getId());
        annot.setAnnotationValue(comments[idx].getText());
        aList.add(annot);
    }

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org

Re: Java (Apache POI) : How to retrieve comment/annotation and associated highlight text from Microsoft Word?

Posted by Javen O'Neal <on...@apache.org>.

A few additions. The critical thing is the paragraph structure
<paragraph><commentRangeStart id="commentId"/><run><text>John</text></run><commentRangeEnd id="commentId"/></paragraph>
(element names simplified). From the example document.xml and from wml.xsd:

        <!-- comment range, text run "John" -->
        <w:commentRangeStart w:id="0"/>
        <w:r w:rsidDel="00000000" w:rsidR="00000000" w:rsidRPr="00000000">
            <w:rPr><w:rtl w:val="0"/></w:rPr>
            <w:t xml:space="preserve">John</w:t>
        </w:r>
        <w:commentRangeEnd w:id="0"/>

      <xsd:element name="commentRangeStart" type="CT_MarkupRange">
        <xsd:annotation>
          <xsd:documentation>Comment Anchor Range Start</xsd:documentation>
        </xsd:annotation>
      </xsd:element>
      <xsd:element name="commentRangeEnd" type="CT_MarkupRange">
        <xsd:annotation>
          <xsd:documentation>Comment Anchor Range End</xsd:documentation>
        </xsd:annotation>
      </xsd:element>

So if performance isn't a concern here (you don't need to save
pointers to where the comment ranges are), the pseudo-code for an
XWPFComment method that gets the text that a comment refers to would
be:

    public String getRefersToText() {
        StringBuilder refersTo = new StringBuilder();
        boolean inRange = false;
        for each CTParagraph in document:
            for each child element of the CTParagraph:
                if child element is a commentRangeStart and id==this.id
                    inRange = true   // collect the text runs that follow
                    continue
                if inRange and child element is a text run
                    append this text run to the refersTo buffer
                if child element is a commentRangeEnd and id==this.id
                    // assuming that one comment may not refer to
                    // multiple text ranges
                    return refersTo.toString()
        return refersTo.toString()   // no matching commentRangeEnd found
    }

This would require searching the entire document for every comment.
https://svn.apache.org/viewvc/poi/trunk/src/ooxml/java/org/apache/poi/xwpf/usermodel/XWPFDocument.java?view=markup
https://svn.apache.org/viewvc/poi/trunk/src/ooxml/java/org/apache/poi/xwpf/usermodel/XWPFParagraph.java?view=markup
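
For reference, the pseudo-code above can also be tried out without POI, by
running the JDK's DOM parser over the XML of word/document.xml. This is only a
minimal sketch under stated assumptions: the class CommentRangeText and its
methods parse and refersToText are made-up names (not POI API), and it assumes
each comment id has at most one commentRangeStart/commentRangeEnd pair.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper for illustration; not part of POI.
class CommentRangeText {
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Parses the XML text of word/document.xml with namespace support. */
    static Document parse(String documentXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder().parse(
                    new ByteArrayInputStream(documentXml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse document XML", e);
        }
    }

    /** Returns the text between the range markers of one comment, or "" if none. */
    static String refersToText(Document doc, String commentId) {
        StringBuilder refersTo = new StringBuilder();
        collect(doc.getDocumentElement(), commentId, new boolean[] {false}, refersTo);
        return refersTo.toString();
    }

    // Walks elements in document order; returns true once commentRangeEnd is reached.
    private static boolean collect(Node parent, String id, boolean[] inRange,
                                   StringBuilder out) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (!(n instanceof Element)) {
                continue;
            }
            Element e = (Element) n;
            boolean wml = W_NS.equals(e.getNamespaceURI());
            boolean sameId = id.equals(e.getAttributeNS(W_NS, "id"));
            if (wml && sameId && "commentRangeStart".equals(e.getLocalName())) {
                inRange[0] = true;
            } else if (wml && sameId && "commentRangeEnd".equals(e.getLocalName())) {
                return true;
            } else if (wml && "t".equals(e.getLocalName())) {
                if (inRange[0]) {
                    out.append(e.getTextContent());
                }
            } else if (collect(e, id, inRange, out)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String xml = "<w:document xmlns:w=\"" + W_NS + "\"><w:body><w:p>"
                + "<w:r><w:t xml:space=\"preserve\">My name is </w:t></w:r>"
                + "<w:commentRangeStart w:id=\"0\"/>"
                + "<w:r><w:t xml:space=\"preserve\">John</w:t></w:r>"
                + "<w:commentRangeEnd w:id=\"0\"/>"
                + "<w:r><w:t xml:space=\"preserve\">.</w:t></w:r>"
                + "</w:p></w:body></w:document>";
        System.out.println(refersToText(parse(xml), "0")); // prints "John"
    }
}
```

Because the walk recurses through every element rather than only the direct
children of a paragraph, a range that starts in one paragraph and ends in
another is still collected. Calling refersToText once per comment is the
"search the entire document for every comment" cost mentioned above.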

Re: Java (Apache POI) : How to retrieve comment/annotation and associated highlight text from Microsoft Word?

Posted by Javen O'Neal <on...@apache.org>.

First, if you're using Java 5 (1.5) or later, you can use for-each loops for
more readable code.
for (final XWPFComment comment : adoc.getComments()) {
    final String id = comment.getId();
    final String author = comment.getAuthor();
    final String text = comment.getText();
}

I don't see anything in POI right now that makes it possible to extract
the annotated text that a track changes comment refers to.

Here's the current implementation of XWPFComment:
https://svn.apache.org/viewvc/poi/trunk/src/ooxml/java/org/apache/poi/xwpf/usermodel/XWPFComment.java?view=markup

Taking a look at the OOXML 2006 schemas wml.xsd (download from
http://www.ecma-international.org/publications/files/ECMA-ST/Office%20Open%20XML%201st%20edition%20Part%204%20(PDF).zip,
extract OfficeOpenXML-Part4a.zip, extract OfficeOpenXML-XMLSchema.zip,
open wml.xsd), I see that the comment (*.docx/word/comments.xml)
doesn't refer to the document text.

  <xsd:complexType name="CT_Comment">
    <xsd:complexContent>
      <xsd:extension base="CT_TrackChange">
        <xsd:sequence>
          <xsd:group ref="EG_BlockLevelElts" minOccurs="0" maxOccurs="unbounded"></xsd:group>
        </xsd:sequence>
        <xsd:attribute name="initials" type="ST_String" use="optional">
          <xsd:annotation>
            <xsd:documentation>Initials of Comment Author</xsd:documentation>
          </xsd:annotation>
        </xsd:attribute>
      </xsd:extension>
    </xsd:complexContent>
  </xsd:complexType>

  <xsd:complexType name="CT_TrackChange">
    <xsd:complexContent>
      <xsd:extension base="CT_Markup">
        <xsd:attribute name="author" type="ST_String" use="required">
          <xsd:annotation>
            <xsd:documentation>Annotation Author</xsd:documentation>
          </xsd:annotation>
        </xsd:attribute>
        <xsd:attribute name="date" type="ST_DateTime" use="optional">
          <xsd:annotation>
            <xsd:documentation>Annotation Date</xsd:documentation>
          </xsd:annotation>
        </xsd:attribute>
      </xsd:extension>
    </xsd:complexContent>
  </xsd:complexType>

  <xsd:complexType name="CT_Markup">
    <xsd:attribute name="id" type="ST_DecimalNumber" use="required">
      <xsd:annotation>
        <xsd:documentation>Annotation Identifier</xsd:documentation>
      </xsd:annotation>
    </xsd:attribute>
  </xsd:complexType>
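
The attributes declared in these types (id, author, date, initials) are what
can be read back from word/comments.xml. As a rough illustration only, here is
a sketch that reads them with the JDK's DOM parser rather than POI. The class
CommentsPartReader, its nested Comment type and readComments are hypothetical
names; the sketch assumes the part consists of w:comment elements under a
w:comments root and treats the comment text as the concatenation of its w:t
elements.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration; not part of POI.
class CommentsPartReader {
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** One w:comment: the CT_Markup id plus CT_TrackChange/CT_Comment attributes. */
    static final class Comment {
        final String id;
        final String author;
        final String initials; // optional in the schema; "" when absent
        final String text;

        Comment(String id, String author, String initials, String text) {
            this.id = id;
            this.author = author;
            this.initials = initials;
            this.text = text;
        }
    }

    /** Reads every w:comment from the XML text of word/comments.xml. */
    static List<Comment> readComments(String commentsXml) {
        Document doc;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            doc = factory.newDocumentBuilder().parse(
                    new ByteArrayInputStream(commentsXml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse comments XML", e);
        }
        List<Comment> comments = new ArrayList<>();
        NodeList nodes = doc.getElementsByTagNameNS(W_NS, "comment");
        for (int i = 0; i < nodes.getLength(); i++) {
            Element c = (Element) nodes.item(i);
            // Join the text of all w:t elements inside this comment.
            StringBuilder text = new StringBuilder();
            NodeList texts = c.getElementsByTagNameNS(W_NS, "t");
            for (int j = 0; j < texts.getLength(); j++) {
                text.append(texts.item(j).getTextContent());
            }
            comments.add(new Comment(
                    c.getAttributeNS(W_NS, "id"),
                    c.getAttributeNS(W_NS, "author"),
                    c.getAttributeNS(W_NS, "initials"),
                    text.toString()));
        }
        return comments;
    }

    public static void main(String[] args) {
        String xml = "<w:comments xmlns:w=\"" + W_NS + "\">"
                + "<w:comment w:id=\"0\" w:author=\"Jane Doe\" w:initials=\"JD\">"
                + "<w:p><w:r><w:t>Noun</w:t></w:r></w:p></w:comment></w:comments>";
        for (Comment c : readComments(xml)) {
            System.out.println(c.id + " [" + c.author + "]: " + c.text);
        }
    }
}
```

Nothing in a comment entry points at the annotated text, which matches the
schema: the id is the only link between the two parts.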

Examining the zipped xml contents of a simple comment example docx
file that I created, I see that the relationship is the other way
around: the document refers to the comments (this ordering makes more
sense anyways).
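
If you would rather inspect the zipped XML from code than unzip the file by
hand, a .docx can be read with java.util.zip. The following is a small sketch,
not POI API: DocxPartReader and readPart are hypothetical names, and it assumes
the part is UTF-8 encoded (as the XML declaration in the example below says).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper for illustration; not part of POI.
class DocxPartReader {

    /** Returns the named zip entry as a UTF-8 string, or null if it is absent. */
    static String readPart(InputStream docx, String partName) {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            for (ZipEntry entry = zip.getNextEntry(); entry != null;
                    entry = zip.getNextEntry()) {
                if (!entry.getName().equals(partName)) {
                    continue;
                }
                // Copy the current entry's bytes; read() returns -1 at entry end.
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buffer = new byte[8192];
                for (int n = zip.read(buffer); n != -1; n = zip.read(buffer)) {
                    out.write(buffer, 0, n);
                }
                return new String(out.toByteArray(), StandardCharsets.UTF_8);
            }
            return null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java DocxPartReader <file.docx>");
            return;
        }
        // args[0] is the path to any .docx file, e.g. a local CommentExample.docx.
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println(readPart(in, "word/document.xml"));
        }
    }
}
```

The same call with "word/comments.xml" returns the comment table.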

For a simple file that I created with the text "My name is John." and
annotated the word John with a comment with the message "Noun", here's
what I got in CommentExample.docx/word/document.xml:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document xmlns....>
<w:body>
    <!-- text paragraph: "My name is [[John]]." -->
    <w:p w:rsidR="00000000" w:rsidDel="00000000" w:rsidP="00000000" w:rsidRDefault="00000000" w:rsidRPr="00000000">
        <w:pPr>
            <w:pBdr/>
            <w:contextualSpacing w:val="0"/>
            <w:rPr/>
        </w:pPr>

        <!-- text run "My name is " -->
        <w:r w:rsidDel="00000000" w:rsidR="00000000" w:rsidRPr="00000000">
            <w:rPr><w:rtl w:val="0"/></w:rPr>
            <w:t xml:space="preserve">My name is </w:t>
        </w:r>

        <!-- comment range, text run "John" -->
        <w:commentRangeStart w:id="0"/>
        <w:r w:rsidDel="00000000" w:rsidR="00000000" w:rsidRPr="00000000">
            <w:rPr><w:rtl w:val="0"/></w:rPr>
            <w:t xml:space="preserve">John</w:t>
        </w:r>
        <w:commentRangeEnd w:id="0"/>

        <w:r w:rsidDel="00000000" w:rsidR="00000000" w:rsidRPr="00000000">
            <w:commentReference w:id="0"/>
        </w:r>

        <!-- text run "." -->
        <w:r w:rsidDel="00000000" w:rsidR="00000000" w:rsidRPr="00000000">
            <w:rPr><w:rtl w:val="0"/></w:rPr>
            <w:t xml:space="preserve">.</w:t>
        </w:r>

    </w:p>
    <w:sectPr>
        <w:pgSz w:h="15840" w:w="12240"/>
        <w:pgMar w:bottom="1440" w:top="1440" w:left="1440" w:right="1440" w:header="0"/>
        <w:pgNumType w:start="1"/>
    </w:sectPr>
</w:body>
</w:document>

So to solve your problem, you could either:
1. scan document.xml for every commentRangeStart/commentRangeEnd pair,
join all the text contained between the two markers, and use the ID on
the markers to look up the comment's author and text, or
2. for each comment in the comment table, find the corresponding
commentRangeStart and commentRangeEnd tags in document.xml and get the
text that was annotated (in this example, John).
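
Option 1 might look roughly like the sketch below: one pass over the elements
of document.xml that builds a map from comment ID to annotated text. It uses
the JDK's DOM parser instead of POI, and CommentRangeIndex and
annotatedTextById are made-up names. It assumes the text of interest lives in
w:t elements, and it tracks a set of open ranges so that overlapping comments
each get their own text.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration; not part of POI.
class CommentRangeIndex {
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Maps each comment id found in document.xml to the text inside its range. */
    static Map<String, String> annotatedTextById(String documentXml) {
        Document doc;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            doc = factory.newDocumentBuilder().parse(
                    new ByteArrayInputStream(documentXml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse document XML", e);
        }
        Map<String, StringBuilder> collected = new LinkedHashMap<>();
        Set<String> open = new LinkedHashSet<>(); // ids whose range has started
        // All WordprocessingML elements, in document order.
        NodeList all = doc.getElementsByTagNameNS(W_NS, "*");
        for (int i = 0; i < all.getLength(); i++) {
            Element e = (Element) all.item(i);
            String id = e.getAttributeNS(W_NS, "id");
            switch (e.getLocalName()) {
                case "commentRangeStart":
                    open.add(id);
                    collected.putIfAbsent(id, new StringBuilder());
                    break;
                case "commentRangeEnd":
                    open.remove(id);
                    break;
                case "t":
                    // Text belongs to every comment range that is currently open.
                    for (String openId : open) {
                        collected.get(openId).append(e.getTextContent());
                    }
                    break;
                default:
                    break;
            }
        }
        Map<String, String> result = new LinkedHashMap<>();
        collected.forEach((id, text) -> result.put(id, text.toString()));
        return result;
    }

    public static void main(String[] args) {
        String xml = "<w:document xmlns:w=\"" + W_NS + "\"><w:body><w:p>"
                + "<w:r><w:t xml:space=\"preserve\">My name is </w:t></w:r>"
                + "<w:commentRangeStart w:id=\"0\"/>"
                + "<w:r><w:t xml:space=\"preserve\">John</w:t></w:r>"
                + "<w:commentRangeEnd w:id=\"0\"/>"
                + "<w:r><w:t xml:space=\"preserve\">.</w:t></w:r>"
                + "</w:p></w:body></w:document>";
        System.out.println(annotatedTextById(xml)); // prints {0=John}
    }
}
```

The keys are the same strings that comment.getId() returns in the for-each
loop earlier, so the author and comment text can be joined on that ID. This
avoids re-scanning the document once per comment, which is the cost of
option 2.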

If you don't already have a development environment set up, I
encourage you to do so. Patches are greatly appreciated.
