Posted to dev@tika.apache.org by "David A. Patterson (JIRA)" <ji...@apache.org> on 2012/10/11 16:29:02 UTC
[jira] [Created] (TIKA-1005) In Microsoft Office Word 2010
documents, text inside a textbox is not extracted/parsed out.
David A. Patterson created TIKA-1005:
----------------------------------------
Summary: In Microsoft Office Word 2010 documents, text inside a textbox is not extracted/parsed out.
Key: TIKA-1005
URL: https://issues.apache.org/jira/browse/TIKA-1005
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.2
Environment: Windows 7, Windows Server 2008, Windows Server 2008 R2 (32bit and 64bit each)
Reporter: David A. Patterson
Text inside a textbox, which can itself be in the body, the header or the footer, is not extracted by any parser (including AutoDetectParser), regardless of which ContentHandler is used. This is NOT a duplicate of TIKA-904; it specifically concerns the .docx file format.
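For readers unfamiliar with where this text lives: a .docx file is a ZIP package, and in WordprocessingML the paragraphs of a text box sit inside a w:txbxContent element nested within the drawing markup, rather than directly under w:body. The sketch below is only an illustration of that layout using the JDK's DOM parser. It is not the attached patch and not Tika's implementation; the class name, method name and the simplified XML fragment (which omits the VML/DrawingML wrapper elements a real document would have) are assumptions made for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, for illustration only: collects the text of every
// w:txbxContent element found in a WordprocessingML part.
public class TextboxTextSketch {
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    static List<String> textboxText(String partXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // required for the *NS lookups below
        Document doc = factory.newDocumentBuilder()
            .parse(new InputSource(new StringReader(partXml)));

        List<String> result = new ArrayList<>();
        NodeList boxes = doc.getElementsByTagNameNS(W_NS, "txbxContent");
        for (int i = 0; i < boxes.getLength(); i++) {
            Element box = (Element) boxes.item(i);
            // Concatenate the w:t text runs inside this text box.
            NodeList runs = box.getElementsByTagNameNS(W_NS, "t");
            StringBuilder text = new StringBuilder();
            for (int j = 0; j < runs.getLength(); j++) {
                text.append(runs.item(j).getTextContent());
            }
            result.add(text.toString());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Simplified fragment: one ordinary paragraph and one text box.
        String xml = "<w:document xmlns:w=\"" + W_NS + "\"><w:body>"
            + "<w:p><w:r><w:t>Body text</w:t></w:r></w:p>"
            + "<w:p><w:r><w:pict><w:txbxContent>"
            + "<w:p><w:r><w:t>Inside the box</w:t></w:r></w:p>"
            + "</w:txbxContent></w:pict></w:r></w:p>"
            + "</w:body></w:document>";
        System.out.println(textboxText(xml)); // prints [Inside the box]
    }
}
```

An extractor that only walks the paragraphs directly under the body (or under a header or footer part) would see "Body text" here and miss "Inside the box", which matches the symptom described above.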
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (TIKA-1005) In Microsoft Office Word 2010
documents, text inside a textbox is not extracted/parsed out.
Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless updated TIKA-1005:
-------------------------------------
Attachment: TIKA-1005.patch
Patch w/ test ...
[jira] [Resolved] (TIKA-1005) In Microsoft Office Word 2010
documents, text inside a textbox is not extracted/parsed out.
Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless resolved TIKA-1005.
--------------------------------------
Resolution: Fixed
Fix Version/s: 1.3
Thanks David!
[jira] [Commented] (TIKA-1005) In Microsoft Office Word 2010
documents, text inside a textbox is not extracted/parsed out.
Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13474250#comment-13474250 ]
Michael McCandless commented on TIKA-1005:
------------------------------------------
Could you attach an example showing the problem? Thanks.
[jira] [Updated] (TIKA-1005) In Microsoft Office Word 2010
documents, text inside a textbox is not extracted/parsed out.
Posted by "David A. Patterson (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
David A. Patterson updated TIKA-1005:
-------------------------------------
Attachment: Textbox example.docx
[jira] [Commented] (TIKA-1005) In Microsoft Office Word 2010
documents, text inside a textbox is not extracted/parsed out.
Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13474958#comment-13474958 ]
Michael McCandless commented on TIKA-1005:
------------------------------------------
Thanks David, I'll dig!
[jira] [Assigned] (TIKA-1005) In Microsoft Office Word 2010
documents, text inside a textbox is not extracted/parsed out.
Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless reassigned TIKA-1005:
----------------------------------------
Assignee: Michael McCandless