Posted to dev@tika.apache.org by "Peter Davies (JIRA)" <ji...@apache.org> on 2018/05/02 07:11:00 UTC
[jira] [Created] (TIKA-2640) MS Word document checkboxes and dropdowns not fully converted to text
Peter Davies created TIKA-2640:
----------------------------------
Summary: MS Word document checkboxes and dropdowns not fully converted to text
Key: TIKA-2640
URL: https://issues.apache.org/jira/browse/TIKA-2640
Project: Tika
Issue Type: Improvement
Components: core
Affects Versions: 1.18
Environment: [^MSWordDocWithCheckboxesAndDropdowns.doc]
Reporter: Peter Davies
Attachments: MSWordDocWithCheckboxesAndDropdowns.doc
When we use Tika to extract text from a Microsoft Word (.doc) file that contains a checkbox, we get the placeholder FORMCHECKBOX with no indication of whether the box is checked.
When the document contains a dropdown menu, we get the placeholder FORMDROPDOWN with no indication of which option was selected.
If we parse to XHTML instead, the field state is still missing. For example, we get:
{code:xml}
<tr> <td><p class="header">Another kind of incident</p>
</td> <td><p class="header"><a name="__Fieldmark__23_1777734196" /><a name="__Fieldmark__23_1777734196" /><a name="__Fieldmark__23_1777734196" />|_|</p>
</td> <td><p />
</td></tr>
{code}
even though the checkbox is ticked in the document (checkboxes always render as |_|, whether ticked or not).
Is there a way that Tika can be configured to return text showing what was selected in each case?
Our code:
{code:java}
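// Load the test document from the classpath.
InputStream stream = this.getClass().getResourceAsStream("/" + EXPECTED_LOCATION + fileName);
// A maxLength of -1 disables the limit on the length of the returned string.
String text = new Tika().parseToString(stream, new Metadata(), -1).trim();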
{code}
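To make the symptom easier to check for, here is a minimal sketch of a helper that scans already-extracted text for the two placeholder tokens described above. It is not part of Tika: the class name FormFieldPlaceholders, its methods, and the sample string in main are hypothetical and used only for illustration. It only detects that field state was lost; it cannot recover the checked state or the selected option.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, NOT part of Tika: counts the form-field placeholder
// tokens that show up in text extracted from a .doc file.
final class FormFieldPlaceholders {

    // The two placeholder tokens reported in this issue.
    private static final Pattern PLACEHOLDER =
            Pattern.compile("FORMCHECKBOX|FORMDROPDOWN");

    private FormFieldPlaceholders() {}

    // Returns a map from placeholder token to its number of occurrences.
    // Both tokens are always present as keys, with 0 when absent.
    static Map<String, Integer> count(String extractedText) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("FORMCHECKBOX", 0);
        counts.put("FORMDROPDOWN", 0);
        Matcher m = PLACEHOLDER.matcher(extractedText);
        while (m.find()) {
            counts.merge(m.group(), 1, Integer::sum);
        }
        return counts;
    }

    // True when at least one placeholder token is present, meaning the
    // extracted text is missing the state of at least one form field.
    static boolean hasUnresolvedFields(String extractedText) {
        return PLACEHOLDER.matcher(extractedText).find();
    }

    public static void main(String[] args) {
        // Made-up sample text standing in for the output of parseToString.
        String sample = "Injury FORMCHECKBOX\nSeverity FORMDROPDOWN\nOther FORMCHECKBOX";
        System.out.println(count(sample));
        System.out.println(hasUnresolvedFields(sample));
    }
}
```

In our case the string returned by parseToString would be passed to count or hasUnresolvedFields, so a test could flag any document whose form fields came through as bare placeholders.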
I have attached an example MS Word doc file with checkboxes and a dropdown.
Regards and thanks, Pete