Posted to user@poi.apache.org by Julien Nioche <li...@gmail.com> on 2011/08/23 12:07:29 UTC
Extract watermarks from Word docs
Hi,
I use Tika to extract text and metadata from Word documents and need to get
the text of watermarks. Is there a standard way of doing this?
Thanks
Julien
--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
Re: Extract watermarks from Word docs
Posted by Nick Burch <ni...@alfresco.com>.
On Tue, 23 Aug 2011, Julien Nioche wrote:
> Can't see the watermark when saving and reopening the doc at the .docx
> format, have attached the .doc example
We really want to use a .docx file for the initial investigation, as
they're much easier to pick apart. Once we know which part of the file
the watermark lives in for .docx, we can then find the equivalent part in .doc
(It will all be documented, but probably not under such a simple name, so
we need to track the text down to figure out what it's called in the file
format docs!)
Nick
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org
Re: Extract watermarks from Word docs
Posted by Julien Nioche <li...@gmail.com>.
Hi Sergey,
Sounds like we can't extract the watermarks in a generic way then.
Thanks for your comments
Julien
On 23 August 2011 16:40, Sergey Vladimirov <vl...@gmail.com> wrote:
> In specified file watermark is not a text, but a OfficeDrawing shape,
> anchored to header document part.
Re: Extract watermarks from Word docs
Posted by Sergey Vladimirov <vl...@gmail.com>.
In the specified file, the watermark is not text but an OfficeDrawing shape
anchored to the header document part. Check the following example from POI
trunk:
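// Needs java.io.UnsupportedEncodingException, org.apache.poi.hwpf.HWPFDocument,
// org.apache.poi.hwpf.usermodel.OfficeDrawing and the org.apache.poi.ddf
// classes EscherContainerRecord, EscherOptRecord and EscherComplexProperty.
public void testWatermark() throws UnsupportedEncodingException
{
    HWPFDocument hwpfDocument = HWPFTestDataSamples
            .openSampleFile( "watermark.doc" );

    // Take the first drawing anchored in the header part of the document
    OfficeDrawing drawing = hwpfDocument.getOfficeDrawingsHeaders()
            .getOfficeDrawings().iterator().next();
    EscherContainerRecord escherContainerRecord = drawing
            .getOfficeArtSpContainer();

    // 0xF00B is the record id of the shape's EscherOptRecord (see the dump)
    EscherOptRecord officeArtFOPT = escherContainerRecord
            .getChildById( (short) 0xF00B );
    // Property 0x00C0 (192) is geotext.unicode, which holds the watermark text
    EscherComplexProperty gtextUNICODE = (EscherComplexProperty) officeArtFOPT
            .lookup( 0x00c0 );

    // The data is UTF-16LE and includes a trailing NUL terminator
    String text = new String( gtextUNICODE.getComplexData(), "UTF-16LE" );
    assertEquals( "DRAFT CONTRACT\0", text );
}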
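The test expects the decoded string to end with a NUL character, because the property data carries its terminator. For anything shown to a user that terminator probably needs stripping. A minimal sketch using only the JDK (WatermarkTextDecoder is a made-up name, not a POI class):

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of POI: turns the raw bytes of a
// geotext.unicode property into a clean Java string.
class WatermarkTextDecoder {
    /** Decodes UTF-16LE bytes and drops any trailing NUL terminators. */
    static String decode(byte[] complexData) {
        String raw = new String(complexData, StandardCharsets.UTF_16LE);
        int end = raw.length();
        while (end > 0 && raw.charAt(end - 1) == '\0') {
            end--;
        }
        return raw.substring(0, end);
    }

    public static void main(String[] args) {
        // Bytes copied from the geotext.unicode property in the HWPFLister dump
        byte[] data = {
            0x44, 0, 0x52, 0, 0x41, 0, 0x46, 0, 0x54, 0, 0x20, 0,
            0x43, 0, 0x4F, 0, 0x4E, 0, 0x54, 0, 0x52, 0, 0x41, 0,
            0x43, 0, 0x54, 0, 0, 0
        };
        System.out.println(decode(data)); // prints "DRAFT CONTRACT"
    }
}
```

In the test, the byte array would come from gtextUNICODE.getComplexData().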
Adding this text to the document metadata relies on too many assumptions:
- we assume there is only one header (i.e. single page structure, no
even/odd pages, no first/last pages, etc.)
- we assume the first office art is actually the watermark
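One possible way to relax the second assumption, sketched with plain Java types rather than POI calls. In the dump further down, the watermark shape has a groupshape.shapename whose bytes decode to "PowerPlusWaterMarkObject516101104". Assuming, from this single sample only, that text watermark shapes get names starting with PowerPlusWaterMarkObject, the shapes from every header could be filtered by name instead of taking the first one. ShapeInfo and watermarkTexts are hypothetical names for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class WatermarkShapeFilter {
    // Assumption: prefix taken from the one shape name seen in the dump.
    static final String ASSUMED_NAME_PREFIX = "PowerPlusWaterMarkObject";

    /** Hypothetical holder for two decoded properties of one shape. */
    static final class ShapeInfo {
        final String shapeName; // decoded groupshape.shapename, may be null
        final String geoText;   // decoded geotext.unicode, may be null

        ShapeInfo(String shapeName, String geoText) {
            this.shapeName = shapeName;
            this.geoText = geoText;
        }
    }

    /** Returns the text of every shape whose name matches the assumed prefix. */
    static List<String> watermarkTexts(List<ShapeInfo> shapes) {
        List<String> texts = new ArrayList<>();
        for (ShapeInfo shape : shapes) {
            boolean nameMatches = shape.shapeName != null
                    && shape.shapeName.startsWith(ASSUMED_NAME_PREFIX);
            if (nameMatches && shape.geoText != null) {
                texts.add(shape.geoText);
            }
        }
        return texts;
    }
}
```

The list passed in would be built from the drawings of all headers, so documents with first-page or odd/even headers are covered too. Whether the name prefix is reliable across Word versions is not something this thread establishes.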
For your information, below are excerpts from the document dump produced by HWPFLister:
HWPFLister watermark.doc --escher --officeDrawings
== ESCHER PROPERTIES (rebuilded) ==
org.apache.poi.ddf.EscherContainerRecord (DggContainer):
isContainer: true
options: 0x000F
recordId: 0xF000
numchildren: 2
children:
Child 0:
org.apache.poi.ddf.EscherDggRecord:
RecordId: 0xF006
Options: 0x0000
ShapeIdMax: 2050
NumIdClusters: 3
NumShapesSaved: 3
DrawingsSaved: 2
DrawingGroupId1: 1
NumShapeIdsUsed1: 2
DrawingGroupId2: 2
NumShapeIdsUsed2: 2
Child 1:
org.apache.poi.ddf.EscherSplitMenuColorsRecord:
RecordId: 0xF11E
Options: 0x0040
Color1: 0x08000004
Color2: 0x08000001
Color3: 0x08000002
Color4: 0x100000F7
org.apache.poi.ddf.EscherContainerRecord (DgContainer):
isContainer: true
options: 0x000F
recordId: 0xF002
numchildren: 2
children:
Child 0:
org.apache.poi.ddf.EscherDgRecord:
RecordId: 0xF008
Options: 0x0010
NumShapes: 2
LastMSOSPID: 1025
Child 1:
org.apache.poi.ddf.EscherContainerRecord (SpgrContainer):
isContainer: true
options: 0x000F
recordId: 0xF003
numchildren: 2
children:
Child 0:
org.apache.poi.ddf.EscherContainerRecord (SpContainer):
isContainer: true
options: 0x000F
recordId: 0xF004
numchildren: 2
children:
Child 0:
org.apache.poi.ddf.EscherSpgrRecord:
RecordId: 0xF009
Options: 0x0001
RectX: 0
RectY: 0
RectWidth: -32767
RectHeight: -32767
Child 1:
org.apache.poi.ddf.EscherSpRecord:
RecordId: 0xF00A
Options: 0x0002
ShapeId: 1024
Flags: GROUP|PATRIARCH (0x00000005)
Child 1:
org.apache.poi.ddf.EscherContainerRecord (SpContainer):
isContainer: true
options: 0x000F
recordId: 0xF004
numchildren: 3
children:
Child 0:
org.apache.poi.ddf.EscherSpRecord:
RecordId: 0xF00A
Options: 0x0882
ShapeId: 1025
Flags: HAVEANCHOR|HASSHAPETYPE (0x00000A00)
Child 1:
org.apache.poi.ddf.EscherOptRecord:
isContainer: false
options: 0x0143
recordId: 0xF00B
numchildren: 0
properties:
propNum: 4, RAW: 0x0004, propName: transform.rotation, complex: false, blipId: false, value: 20643840 (0x013B0000)
propNum: 133, RAW: 0x0085, propName: text.wraptext, complex: false, blipId: false, value: 2 (0x00000002)
propNum: 135, RAW: 0x0087, propName: text.anchortext, complex: false, blipId: false, value: 1 (0x00000001)
propNum: 192, propName: geotext.unicode, complex: true, blipId: true, data:
00: 44, 00, 52, 00, 41, 00, 46, 00, 54, 00, 20, 00, 43, 00, 4F, 00, 4E, 00, 54, 00, 52, 00, 41, 00, 43, 00, 54, 00, 00, 00,
propNum: 197, propName: geotext.fontfamilyname, complex: true, blipId: true, data:
00: 43, 00, 61, 00, 6C, 00, 69, 00, 62, 00, 72, 00, 69, 00, 00, 00,
propNum: 255, RAW: 0x00FF, propName: geotext.strikethroughfont, complex: false, blipId: false, value: -47872 (0xFFFF4500)
propNum: 327, RAW: 0x0147, propName: geometry.adjustvalue, complex: false, blipId: false, value: 10800 (0x00002A30)
propNum: 383, RAW: 0x017F, propName: geometry.fillok, complex: false, blipId: false, value: 262205 (0x0004003D)
propNum: 384, RAW: 0x0180, propName: fill.filltype, complex: false, blipId: false, value: 0 (0x00000000)
propNum: 385, RAW: 0x0181, propName: fill.fillcolor, complex: false, blipId: false, value: 0 (0x00000000)
propNum: 386, RAW: 0x0182, propName: fill.fillopacity, complex: false, blipId: false, value: 32768 (0x00008000)
propNum: 387, RAW: 0x0183, propName: fill.fillbackcolor, complex: false, blipId: false, value: 16777215 (0x00FFFFFF)
propNum: 447, RAW: 0x01BF, propName: fill.nofillhittest, complex: false, blipId: false, value: 1048592 (0x00100010)
propNum: 448, RAW: 0x01C0, propName: linestyle.color, complex: false, blipId: false, value: 0 (0x00000000)
propNum: 450, RAW: 0x01C2, propName: linestyle.backcolor, complex: false, blipId: false, value: 16777215 (0x00FFFFFF)
propNum: 470, RAW: 0x01D6, propName: linestyle.linejoinstyle, complex: false, blipId: false, value: 2 (0x00000002)
propNum: 511, RAW: 0x01FF, propName: linestyle.nolinedrawdash, complex: false, blipId: false, value: 589824 (0x00090000)
propNum: 575, RAW: 0x023F, propName: shadowstyle.shadowobsured, complex: false, blipId: false, value: 131072 (0x00020000)
propNum: 896, propName: groupshape.shapename, complex: true, blipId: true, data:
00: 50, 00, 6F, 00, 77, 00, 65, 00, 72, 00, 50, 00, 6C, 00, 75, 00, 73, 00, 57, 00, 61, 00, 74, 00, 65, 00, 72, 00, 4D, 00, 61, 00,
32: 72, 00, 6B, 00, 4F, 00, 62, 00, 6A, 00, 65, 00, 63, 00, 74, 00, 35, 00, 31, 00, 36, 00, 31, 00, 30, 00, 31, 00, 31, 00, 30, 00,
64: 34, 00, 00, 00,
propNum: 959, RAW: 0x03BF, propName: groupshape.print, complex: false, blipId: false, value: 2097184 (0x00200020)
Child 2:
org.apache.poi.ddf.EscherTertiaryOptRecord:
isContainer: false
options: 0x0043
recordId: 0xF122
numchildren: 0
properties:
propNum: 911, RAW: 0x038F, propName: groupshape.posh, complex: false, blipId: false, value: 2 (0x00000002)
propNum: 912, RAW: 0x0390, propName: groupshape.posrelh, complex: false, blipId: false, value: 0 (0x00000000)
propNum: 913, RAW: 0x0391, propName: groupshape.posv, complex: false, blipId: false, value: 2 (0x00000002)
propNum: 914, RAW: 0x0392, propName: groupshape.posrelv, complex: false, blipId: false, value: 0 (0x00000000)
org.apache.poi.ddf.EscherContainerRecord (DgContainer):
isContainer: true
options: 0x000F
recordId: 0xF002
numchildren: 3
children:
Child 0:
org.apache.poi.ddf.EscherDgRecord:
RecordId: 0xF008
Options: 0x0020
NumShapes: 1
LastMSOSPID: 2049
Child 1:
org.apache.poi.ddf.EscherContainerRecord (SpgrContainer):
isContainer: true
options: 0x000F
recordId: 0xF003
numchildren: 1
children:
Child 0:
org.apache.poi.ddf.EscherContainerRecord (SpContainer):
isContainer: true
options: 0x000F
recordId: 0xF004
numchildren: 2
children:
Child 0:
org.apache.poi.ddf.EscherSpgrRecord:
RecordId: 0xF009
Options: 0x0001
RectX: 0
RectY: 0
RectWidth: -32767
RectHeight: -32767
Child 1:
org.apache.poi.ddf.EscherSpRecord:
RecordId: 0xF00A
Options: 0x0002
ShapeId: 2048
Flags: GROUP|PATRIARCH (0x00000005)
Child 2:
org.apache.poi.ddf.EscherContainerRecord (SpContainer):
isContainer: true
options: 0x000F
recordId: 0xF004
numchildren: 3
children:
Child 0:
org.apache.poi.ddf.EscherSpRecord:
RecordId: 0xF00A
Options: 0x0012
ShapeId: 2049
Flags: HAVEANCHOR|BACKGROUND|HASSHAPETYPE (0x00000E00)
Child 1:
org.apache.poi.ddf.EscherOptRecord:
isContainer: false
options: 0x0043
recordId: 0xF00B
numchildren: 0
properties:
propNum: 448, RAW: 0x01C0, propName: linestyle.color, complex: false, blipId: false, value: 134217729 (0x08000001)
propNum: 459, RAW: 0x01CB, propName: linestyle.linewidth, complex: false, blipId: false, value: 0 (0x00000000)
propNum: 511, RAW: 0x01FF, propName: linestyle.nolinedrawdash, complex: false, blipId: false, value: 524296 (0x00080008)
propNum: 513, RAW: 0x0201, propName: shadowstyle.color, complex: false, blipId: false, value: 134217730 (0x08000002)
Child 2:
org.apache.poi.ddf.EscherClientDataRecord:
RecordId: 0xF011
Options: 0x0000
Extra Data:
00000000 01 00 00 00 ....
== OFFICE DRAWINGS (rebuilded) ==
=== Document part: HEADER ===
OfficeDrawingImpl: [FSPA]
.spid = (1025 )
.xaLeft = (14 )
.yaTop = (2309 )
.xaRight = (9346 )
.yaBottom = (11640 )
.flags = (16500 )
.fHdr = false
.bx = 2
.by = 2
.wr = 3
.wrk = 0
.fRcaSimple = false
.fBelowText = true
.fAnchorLock = false
.cTxbx = (0 )
[/FSPA]
=== Document part: MAIN ===
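A note on reading some of the numeric values in the dump: in the Office drawing format, properties such as transform.rotation and fill.fillopacity are stored as signed 16.16 fixed-point numbers. Read that way, the rotation is 315 degrees and the fill opacity is 0.5, which fits a diagonal, half-transparent watermark. A small sketch of the conversion (EscherFixedPoint is an illustrative name, not a POI class):

```java
// Illustrative helper, not part of POI.
class EscherFixedPoint {
    /** Converts a signed 16.16 fixed-point value to a double. */
    static double toDouble(int raw) {
        return raw / 65536.0;
    }

    public static void main(String[] args) {
        // Values copied from the EscherOptRecord of shape 1025 in the dump
        System.out.println(toDouble(0x013B0000)); // transform.rotation -> 315.0
        System.out.println(toDouble(0x00008000)); // fill.fillopacity   -> 0.5
    }
}
```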
--
Best regards,
Sergey
On Tue, Aug 23, 2011 at 5:45 PM, Julien Nioche
<li...@gmail.com> wrote:
> Created https://issues.apache.org/jira/browse/TIKA-696 to track the issue.
>
> Can't see the watermark when saving and reopening the doc at the .docx
> format, have attached the .doc example
--
Sergey Vladimirov
Re: Extract watermarks from Word docs
Posted by Julien Nioche <li...@gmail.com>.
Created https://issues.apache.org/jira/browse/TIKA-696 to track the issue.
I can't see the watermark after saving and reopening the doc in .docx
format, so I have attached the .doc example
Thanks
Julien
On 23 August 2011 14:06, Nick Burch <ni...@alfresco.com> wrote:
> On Tue, 23 Aug 2011, Julien Nioche wrote:
>
>> We definitely don't get them in Tika. See docs attached (saved with
>> OpenOffice )
>>
>
> It's probably worth putting these sample files on a tika issue so they
> don't get lost, and can be used in a future unit test
>
> The next thing to check is probably to unzip the .docx file and see where
> the watermark text lives. If it's in the main document part, then it should
> be fairly easy for Tika to get. If it's in a different part, then a little
> bit of support will likely be needed on the POI side to allow easier access
> to it
>
>
> Nick
Re: Extract watermarks from Word docs
Posted by Nick Burch <ni...@alfresco.com>.
On Tue, 23 Aug 2011, Julien Nioche wrote:
> We definitely don't get them in Tika. See docs attached (saved with
> OpenOffice )
It's probably worth putting these sample files on a tika issue so they
don't get lost, and can be used in a future unit test
The next thing to check is probably to unzip the .docx file and see where
the watermark text lives. If it's in the main document part, then it should
be fairly easy for Tika to get. If it's in a different part, then a little
bit of support will likely be needed on the POI side to allow easier
access to it
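As a rough illustration of that check: a .docx file is a zip archive of XML parts, so the JDK's zip classes are enough to find which part holds a known piece of text. This is only a sketch. DocxPartSearch and partsContaining are made-up names, and it assumes the watermark text is stored literally, as UTF-8, in one of the parts.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

class DocxPartSearch {
    /** Returns the names of all zip entries whose content contains needle. */
    static List<String> partsContaining(InputStream docx, String needle)
            throws IOException {
        List<String> hits = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            byte[] buffer = new byte[8192];
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                // Read the whole entry into memory; parts are usually small
                ByteArrayOutputStream content = new ByteArrayOutputStream();
                int n;
                while ((n = zip.read(buffer)) != -1) {
                    content.write(buffer, 0, n);
                }
                String text = new String(content.toByteArray(),
                        StandardCharsets.UTF_8);
                if (text.contains(needle)) {
                    hits.add(entry.getName());
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path to the .docx file, args[1]: the watermark text
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println(partsContaining(in, args[1]));
        }
    }
}
```

If the only hit is a header part rather than the main document part, that would point to the second case, where some support on the POI side is needed.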
Nick
Re: Extract watermarks from Word docs
Posted by Julien Nioche <li...@gmail.com>.
Hi Nick,
We definitely don't get them in Tika. See docs attached (saved with
OpenOffice )
Thanks
Julien
On 23 August 2011 12:07, Nick Burch <ni...@alfresco.com> wrote:
> On Tue, 23 Aug 2011, Julien Nioche wrote:
>
>> I use Tika to extract text and metadata from Word documents and need to
>> get the text of watermarks. Is there a standard way of doing this?
>>
>
> I'm not sure if we currently extract watermarks. Do you have some sample
> small files (ideally a matching .doc and .docx) with a watermark in them?
>
> Nick
Re: Extract watermarks from Word docs
Posted by Nick Burch <ni...@alfresco.com>.
On Tue, 23 Aug 2011, Julien Nioche wrote:
> I use Tika to extract text and metadata from Word documents and need to
> get the text of watermarks. Is there a standard way of doing this?
I'm not sure if we currently extract watermarks. Do you have some sample
small files (ideally a matching .doc and .docx) with a watermark in them?
Nick