Posted to user@poi.apache.org by Julien Nioche <li...@gmail.com> on 2011/08/23 12:07:29 UTC

Extract watermarks from Word docs

Hi,

I use Tika to extract text and metadata from Word documents and need to get
the text of watermarks. Is there a standard way of doing this?

Thanks

Julien

-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

Re: Extract watermarks from Word docs

Posted by Nick Burch <ni...@alfresco.com>.
On Tue, 23 Aug 2011, Julien Nioche wrote:
> I can't see the watermark after saving and reopening the doc in .docx
> format, so I have attached the .doc example

We really want to use a .docx file for the initial investigation, as
they're much easier to pick apart. Once we know which part of the file
the watermarks live in for .docx, we can find the equivalent part in .doc.

(It will all be documented, but probably not under such a simple name, so
we need to track the text down to figure out what it's called in the file
format docs!)

Nick

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org


Re: Extract watermarks from Word docs

Posted by Julien Nioche <li...@gmail.com>.
Hi Sergey,
Sounds like we can't extract the watermarks in a generic way then.
Thanks for your comments

Julien

On 23 August 2011 16:40, Sergey Vladimirov <vl...@gmail.com> wrote:

> In the specified file, the watermark is not text but an OfficeDrawing
> shape anchored to the header part of the document.
>
> Adding this text to the document metadata would rely on too many
> assumptions:
>  - we assume there is only one header (i.e. single page structure, no
>    even/odd pages, no first/last pages, etc.)
>  - we assume the first office art is actually the watermark



Re: Extract watermarks from Word docs

Posted by Sergey Vladimirov <vl...@gmail.com>.
In the specified file, the watermark is not text but an OfficeDrawing shape
anchored to the header part of the document. Check the following example
from POI trunk:

    public void testWatermark() throws UnsupportedEncodingException
    {
        HWPFDocument hwpfDocument = HWPFTestDataSamples
                .openSampleFile( "watermark.doc" );
        // Take the first drawing anchored in the header part
        OfficeDrawing drawing = hwpfDocument.getOfficeDrawingsHeaders()
                .getOfficeDrawings().iterator().next();
        EscherContainerRecord escherContainerRecord = drawing
                .getOfficeArtSpContainer();

        // 0xF00B is the EscherOptRecord holding the shape properties
        EscherOptRecord officeArtFOPT = escherContainerRecord
                .getChildById( (short) 0xF00B );
        // Property 0x00C0 (192) is geotext.unicode, the watermark text
        EscherComplexProperty gtextUNICODE = (EscherComplexProperty) officeArtFOPT
                .lookup( 0x00c0 );

        // The data is UTF-16LE and ends with a NUL terminator
        String text = new String( gtextUNICODE.getComplexData(), "UTF-16LE" );
        assertEquals( "DRAFT CONTRACT\0", text );
    }
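
Note the trailing "\0" in the assertion: the property data carries a NUL
terminator. A minimal sketch of decoding such data into a clean string is
given here, using only the JDK. The class name EscherText and the method
decode are hypothetical helpers, not part of POI; the byte values are the
geotext.unicode data from the dump further down.

```java
import java.nio.charset.StandardCharsets;

final class EscherText {
    private EscherText() {}

    /** Decodes UTF-16LE bytes and drops any trailing NUL terminators. */
    static String decode(byte[] complexData) {
        String raw = new String(complexData, StandardCharsets.UTF_16LE);
        int end = raw.length();
        while (end > 0 && raw.charAt(end - 1) == '\0') {
            end--;
        }
        return raw.substring(0, end);
    }

    public static void main(String[] args) {
        // geotext.unicode bytes as listed by HWPFLister for watermark.doc
        byte[] data = {
            0x44, 0, 0x52, 0, 0x41, 0, 0x46, 0, 0x54, 0, 0x20, 0,
            0x43, 0, 0x4F, 0, 0x4E, 0, 0x54, 0, 0x52, 0, 0x41, 0,
            0x43, 0, 0x54, 0, 0, 0
        };
        System.out.println(decode(data));
    }
}
```

With POI on the classpath, the argument would be the result of
gtextUNICODE.getComplexData() from the test above.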

Adding this text to the document metadata would rely on too many assumptions:
 - we assume there is only one header (i.e. single page structure, no
   even/odd pages, no first/last pages, etc.)
 - we assume the first office art is actually the watermark
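
Both assumptions could be loosened by looking at every header shape and
keeping only those that look like a watermark. The sketch below is one
possible heuristic, not something POI or Tika does: it relies on the
groupshape.shapename value, which in this sample decodes to
"PowerPlusWaterMarkObject516101104". Treating that prefix as a watermark
marker is an assumption based on this one file. ShapeInfo and
WatermarkFilter are hypothetical names; filling ShapeInfo from the Escher
records is left out.

```java
import java.util.ArrayList;
import java.util.List;

final class WatermarkFilter {
    /** Hypothetical holder for the values read from one header shape. */
    static final class ShapeInfo {
        final String shapeName; // groupshape.shapename, may be null
        final String text;      // geotext.unicode, may be null

        ShapeInfo(String shapeName, String text) {
            this.shapeName = shapeName;
            this.text = text;
        }
    }

    // Assumed marker, taken from the shape name seen in watermark.doc
    static final String NAME_PREFIX = "PowerPlusWaterMarkObject";

    /** Returns the distinct texts of all shapes whose name has the prefix. */
    static List<String> watermarkTexts(List<ShapeInfo> headerShapes) {
        List<String> result = new ArrayList<>();
        for (ShapeInfo shape : headerShapes) {
            boolean named = shape.shapeName != null
                    && shape.shapeName.startsWith(NAME_PREFIX);
            boolean hasText = shape.text != null && !shape.text.isEmpty();
            // The same watermark may be repeated in several headers
            if (named && hasText && !result.contains(shape.text)) {
                result.add(shape.text);
            }
        }
        return result;
    }
}
```

Documents whose watermark shape has been renamed would be missed, so this
narrows the guess rather than removing it.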

For your information, here is the relevant output of a document dump by
HWPFLister:
    HWPFLister watermark.doc --escher --officeDrawings

== ESCHER PROPERTIES (rebuilded) ==
org.apache.poi.ddf.EscherContainerRecord (DggContainer):
  isContainer: true
  options: 0x000F
  recordId: 0xF000
  numchildren: 2
  children:
   Child 0:
    org.apache.poi.ddf.EscherDggRecord:
      RecordId: 0xF006
      Options: 0x0000
      ShapeIdMax: 2050
      NumIdClusters: 3
      NumShapesSaved: 3
      DrawingsSaved: 2
      DrawingGroupId1: 1
      NumShapeIdsUsed1: 2
      DrawingGroupId2: 2
      NumShapeIdsUsed2: 2

   Child 1:
    org.apache.poi.ddf.EscherSplitMenuColorsRecord:
      RecordId: 0xF11E
      Options: 0x0040
      Color1: 0x08000004
      Color2: 0x08000001
      Color3: 0x08000002
      Color4: 0x100000F7

org.apache.poi.ddf.EscherContainerRecord (DgContainer):
  isContainer: true
  options: 0x000F
  recordId: 0xF002
  numchildren: 2
  children:
   Child 0:
    org.apache.poi.ddf.EscherDgRecord:
      RecordId: 0xF008
      Options: 0x0010
      NumShapes: 2
      LastMSOSPID: 1025

   Child 1:
    org.apache.poi.ddf.EscherContainerRecord (SpgrContainer):
      isContainer: true
      options: 0x000F
      recordId: 0xF003
      numchildren: 2
      children:
       Child 0:
        org.apache.poi.ddf.EscherContainerRecord (SpContainer):
          isContainer: true
          options: 0x000F
          recordId: 0xF004
          numchildren: 2
          children:
           Child 0:
            org.apache.poi.ddf.EscherSpgrRecord:
              RecordId: 0xF009
              Options: 0x0001
              RectX: 0
              RectY: 0
              RectWidth: -32767
              RectHeight: -32767

           Child 1:
            org.apache.poi.ddf.EscherSpRecord:
              RecordId: 0xF00A
              Options: 0x0002
              ShapeId: 1024
              Flags: GROUP|PATRIARCH (0x00000005)


       Child 1:
        org.apache.poi.ddf.EscherContainerRecord (SpContainer):
          isContainer: true
          options: 0x000F
          recordId: 0xF004
          numchildren: 3
          children:
           Child 0:
            org.apache.poi.ddf.EscherSpRecord:
              RecordId: 0xF00A
              Options: 0x0882
              ShapeId: 1025
              Flags: HAVEANCHOR|HASSHAPETYPE (0x00000A00)

           Child 1:
            org.apache.poi.ddf.EscherOptRecord:
              isContainer: false
              options: 0x0143
              recordId: 0xF00B
              numchildren: 0
              properties:
                propNum: 4, RAW: 0x0004, propName: transform.rotation, complex: false, blipId: false, value: 20643840 (0x013B0000)
                propNum: 133, RAW: 0x0085, propName: text.wraptext, complex: false, blipId: false, value: 2 (0x00000002)
                propNum: 135, RAW: 0x0087, propName: text.anchortext, complex: false, blipId: false, value: 1 (0x00000001)
                propNum: 192, propName: geotext.unicode, complex: true, blipId: true, data:
            00: 44, 00, 52, 00, 41, 00, 46, 00, 54, 00, 20, 00, 43, 00, 4F, 00, 4E, 00, 54, 00, 52, 00, 41, 00, 43, 00, 54, 00, 00, 00,
                propNum: 197, propName: geotext.fontfamilyname, complex: true, blipId: true, data:
            00: 43, 00, 61, 00, 6C, 00, 69, 00, 62, 00, 72, 00, 69, 00, 00, 00,
                propNum: 255, RAW: 0x00FF, propName: geotext.strikethroughfont, complex: false, blipId: false, value: -47872 (0xFFFF4500)
                propNum: 327, RAW: 0x0147, propName: geometry.adjustvalue, complex: false, blipId: false, value: 10800 (0x00002A30)
                propNum: 383, RAW: 0x017F, propName: geometry.fillok, complex: false, blipId: false, value: 262205 (0x0004003D)
                propNum: 384, RAW: 0x0180, propName: fill.filltype, complex: false, blipId: false, value: 0 (0x00000000)
                propNum: 385, RAW: 0x0181, propName: fill.fillcolor, complex: false, blipId: false, value: 0 (0x00000000)
                propNum: 386, RAW: 0x0182, propName: fill.fillopacity, complex: false, blipId: false, value: 32768 (0x00008000)
                propNum: 387, RAW: 0x0183, propName: fill.fillbackcolor, complex: false, blipId: false, value: 16777215 (0x00FFFFFF)
                propNum: 447, RAW: 0x01BF, propName: fill.nofillhittest, complex: false, blipId: false, value: 1048592 (0x00100010)
                propNum: 448, RAW: 0x01C0, propName: linestyle.color, complex: false, blipId: false, value: 0 (0x00000000)
                propNum: 450, RAW: 0x01C2, propName: linestyle.backcolor, complex: false, blipId: false, value: 16777215 (0x00FFFFFF)
                propNum: 470, RAW: 0x01D6, propName: linestyle.linejoinstyle, complex: false, blipId: false, value: 2 (0x00000002)
                propNum: 511, RAW: 0x01FF, propName: linestyle.nolinedrawdash, complex: false, blipId: false, value: 589824 (0x00090000)
                propNum: 575, RAW: 0x023F, propName: shadowstyle.shadowobsured, complex: false, blipId: false, value: 131072 (0x00020000)
                propNum: 896, propName: groupshape.shapename, complex: true, blipId: true, data:
            00: 50, 00, 6F, 00, 77, 00, 65, 00, 72, 00, 50, 00, 6C, 00, 75, 00, 73, 00, 57, 00, 61, 00, 74, 00, 65, 00, 72, 00, 4D, 00, 61, 00,
            32: 72, 00, 6B, 00, 4F, 00, 62, 00, 6A, 00, 65, 00, 63, 00, 74, 00, 35, 00, 31, 00, 36, 00, 31, 00, 30, 00, 31, 00, 31, 00, 30, 00,
            64: 34, 00, 00, 00,
                propNum: 959, RAW: 0x03BF, propName: groupshape.print, complex: false, blipId: false, value: 2097184 (0x00200020)

           Child 2:
            org.apache.poi.ddf.EscherTertiaryOptRecord:
              isContainer: false
              options: 0x0043
              recordId: 0xF122
              numchildren: 0
              properties:
                propNum: 911, RAW: 0x038F, propName: groupshape.posh, complex: false, blipId: false, value: 2 (0x00000002)
                propNum: 912, RAW: 0x0390, propName: groupshape.posrelh, complex: false, blipId: false, value: 0 (0x00000000)
                propNum: 913, RAW: 0x0391, propName: groupshape.posv, complex: false, blipId: false, value: 2 (0x00000002)
                propNum: 914, RAW: 0x0392, propName: groupshape.posrelv, complex: false, blipId: false, value: 0 (0x00000000)



org.apache.poi.ddf.EscherContainerRecord (DgContainer):
  isContainer: true
  options: 0x000F
  recordId: 0xF002
  numchildren: 3
  children:
   Child 0:
    org.apache.poi.ddf.EscherDgRecord:
      RecordId: 0xF008
      Options: 0x0020
      NumShapes: 1
      LastMSOSPID: 2049

   Child 1:
    org.apache.poi.ddf.EscherContainerRecord (SpgrContainer):
      isContainer: true
      options: 0x000F
      recordId: 0xF003
      numchildren: 1
      children:
       Child 0:
        org.apache.poi.ddf.EscherContainerRecord (SpContainer):
          isContainer: true
          options: 0x000F
          recordId: 0xF004
          numchildren: 2
          children:
           Child 0:
            org.apache.poi.ddf.EscherSpgrRecord:
              RecordId: 0xF009
              Options: 0x0001
              RectX: 0
              RectY: 0
              RectWidth: -32767
              RectHeight: -32767

           Child 1:
            org.apache.poi.ddf.EscherSpRecord:
              RecordId: 0xF00A
              Options: 0x0002
              ShapeId: 2048
              Flags: GROUP|PATRIARCH (0x00000005)



   Child 2:
    org.apache.poi.ddf.EscherContainerRecord (SpContainer):
      isContainer: true
      options: 0x000F
      recordId: 0xF004
      numchildren: 3
      children:
       Child 0:
        org.apache.poi.ddf.EscherSpRecord:
          RecordId: 0xF00A
          Options: 0x0012
          ShapeId: 2049
          Flags: HAVEANCHOR|BACKGROUND|HASSHAPETYPE (0x00000E00)

       Child 1:
        org.apache.poi.ddf.EscherOptRecord:
          isContainer: false
          options: 0x0043
          recordId: 0xF00B
          numchildren: 0
          properties:
            propNum: 448, RAW: 0x01C0, propName: linestyle.color, complex: false, blipId: false, value: 134217729 (0x08000001)
            propNum: 459, RAW: 0x01CB, propName: linestyle.linewidth, complex: false, blipId: false, value: 0 (0x00000000)
            propNum: 511, RAW: 0x01FF, propName: linestyle.nolinedrawdash, complex: false, blipId: false, value: 524296 (0x00080008)
            propNum: 513, RAW: 0x0201, propName: shadowstyle.color, complex: false, blipId: false, value: 134217730 (0x08000002)

       Child 2:
        org.apache.poi.ddf.EscherClientDataRecord:
          RecordId: 0xF011
          Options: 0x0000
          Extra Data:
        00000000 01 00 00 00                                     ....



== OFFICE DRAWINGS (rebuilded) ==
=== Document part: HEADER ===
OfficeDrawingImpl: [FSPA]
    .spid                 =  (1025 )
    .xaLeft               =  (14 )
    .yaTop                =  (2309 )
    .xaRight              =  (9346 )
    .yaBottom             =  (11640 )
    .flags                =  (16500 )
         .fHdr                     = false
         .bx                       = 2
         .by                       = 2
         .wr                       = 3
         .wrk                      = 0
         .fRcaSimple               = false
         .fBelowText               = true
         .fAnchorLock              = false
    .cTxbx                =  (0 )
[/FSPA]

=== Document part: MAIN ===
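
The indented fields under .flags in the FSPA block are bit fields packed
into the single value 16500 (0x4074). The sketch below unpacks them,
assuming field widths of 1, 2, 2, 4, 4, 1, 1 and 1 bits counted from the
least significant bit. That layout is an assumption here, although it does
reproduce every value shown in the dump. FspaFlags is a hypothetical class
name, not a POI type.

```java
final class FspaFlags {
    final boolean fHdr;
    final int bx;
    final int by;
    final int wr;
    final int wrk;
    final boolean fRcaSimple;
    final boolean fBelowText;
    final boolean fAnchorLock;

    FspaFlags(int flags) {
        fHdr = (flags & 0x1) != 0;               // bit 0
        bx = (flags >> 1) & 0x3;                 // bits 1-2
        by = (flags >> 3) & 0x3;                 // bits 3-4
        wr = (flags >> 5) & 0xF;                 // bits 5-8
        wrk = (flags >> 9) & 0xF;                // bits 9-12
        fRcaSimple = ((flags >> 13) & 0x1) != 0; // bit 13
        fBelowText = ((flags >> 14) & 0x1) != 0; // bit 14
        fAnchorLock = ((flags >> 15) & 0x1) != 0; // bit 15
    }

    public static void main(String[] args) {
        FspaFlags f = new FspaFlags(16500);
        System.out.println("bx=" + f.bx + " by=" + f.by + " wr=" + f.wr
                + " wrk=" + f.wrk + " fBelowText=" + f.fBelowText);
    }
}
```

For a watermark the interesting bit is fBelowText, which is set in this
sample: the shape is drawn behind the header text.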

-- 
Best regards,
Sergey



Re: Extract watermarks from Word docs

Posted by Julien Nioche <li...@gmail.com>.
Created https://issues.apache.org/jira/browse/TIKA-696 to track the issue.

I can't see the watermark after saving and reopening the doc in .docx
format, so I have attached the .doc example.

Thanks

Julien


Re: Extract watermarks from Word docs

Posted by Nick Burch <ni...@alfresco.com>.
On Tue, 23 Aug 2011, Julien Nioche wrote:
> We definitely don't get them in Tika. See docs attached (saved with
> OpenOffice)

It's probably worth putting these sample files on a Tika issue so they
don't get lost, and can be used in a future unit test.

The next thing to check is probably to unzip the .docx file, and see where
the watermark text lives. If it's in the main document part then it should
be fairly easy to get for Tika. If it's in a different part, then a little
bit of support will likely be needed on the POI side to allow easier
access to it.
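
A rough sketch of that check in Java, using only the JDK: a .docx is a ZIP
archive, so each part can be read and searched for the watermark text. The
class DocxPartSearch and its method are hypothetical names, and the search
is a plain substring match, so text that is XML-escaped or split across
elements would not be found.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

final class DocxPartSearch {
    private DocxPartSearch() {}

    /** Returns the names of all ZIP entries whose content contains needle. */
    static List<String> partsContaining(InputStream docx, String needle)
            throws IOException {
        List<String> hits = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                // Read the whole entry; read() returns -1 at the entry's end
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                byte[] chunk = new byte[8192];
                int n;
                while ((n = zip.read(chunk)) != -1) {
                    buf.write(chunk, 0, n);
                }
                String content = new String(buf.toByteArray(),
                        StandardCharsets.UTF_8);
                if (content.contains(needle)) {
                    hits.add(entry.getName());
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java DocxPartSearch <file.docx> <text to find>
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println(partsContaining(in, args[1]));
        }
    }
}
```

If the only hits are header parts rather than the main document part, that
would point to the second case described above.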

Nick



Re: Extract watermarks from Word docs

Posted by Julien Nioche <li...@gmail.com>.
Hi Nick,

We definitely don't get them in Tika. See docs attached (saved with
OpenOffice).

Thanks

Julien


Re: Extract watermarks from Word docs

Posted by Nick Burch <ni...@alfresco.com>.
On Tue, 23 Aug 2011, Julien Nioche wrote:
> I use Tika to extract text and metadata from Word documents and need to 
> get the text of watermarks. Is there a standard way of doing this?

I'm not sure if we currently extract watermarks. Do you have some small
sample files (ideally a matching .doc and .docx) with a watermark in them?

Nick
