Posted to dev@tika.apache.org by "Peter Davies (JIRA)" <ji...@apache.org> on 2018/05/03 08:12:00 UTC

[jira] [Updated] (TIKA-2640) MS Word document checkboxes and dropdowns not fully converted to text

     [ https://issues.apache.org/jira/browse/TIKA-2640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Davies updated TIKA-2640:
-------------------------------
    Description: 
When we use Tika to parse the text from a Microsoft Word document (.doc) file with a check box we get +FORMCHECKBOX+ with no indication as to whether it is checked or not.

When the doc has a dropdown menu we get _FORMDROPDOWN_ with no indication as to which was selected.

If we parse to XHTML instead we still get e.g.

 
{code:xml}
<tr> <td><p class="header">Another kind of incident</p>
</td> <td><p class="header"><a name="__Fieldmark__23_1777734196" /><a name="__Fieldmark__23_1777734196" /><a name="__Fieldmark__23_1777734196" />|_|</p>
</td> <td><p />
</td></tr>
 
{code}
even though the checkbox is ticked in the doc (checkboxes always show |_|, whether ticked or not).

Shouldn't the text reflect the checkbox state, as it does in the testCheckboxes() method in [https://svn.apache.org/repos/asf/poi/trunk/src/ooxml/testcases/org/apache/poi/xwpf/extractor/TestXWPFWordExtractor.java] (I realise this is POI, but that is what Tika uses)?

Snippet:

 
{code:java}
XWPFDocument doc = XWPFTestDataSamples.openSampleDocument("checkboxes.docx");
XWPFWordExtractor extractor = new XWPFWordExtractor(doc);
assertEquals(
        "This is a small test for checkboxes \nunchecked: |_| \n"
                + "Or checked: |X|\n\n\n\n\n"
                + "Test a checkbox within a textbox: |_| -> |X|\n\n\n"
                + "In Table:\n|_|\t|X|\n\n\n"
                + "In Sequence:\n|X||_||X|\n",
        extractor.getText());
{code}
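
To make the expectation concrete: the expected string in that test implies that a checked box is rendered as |X| and an unchecked one as |_|. The following is only a minimal sketch of that convention using the standard library. The class and method names (CheckboxText, render, renderSequence) are hypothetical and are not POI or Tika API. Note also that the test above opens a .docx sample, whereas the file in this report is a .doc, so the two may well go through different extraction code.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper for illustration only; not part of POI or Tika.
final class CheckboxText {

    // Render one checkbox the way the expected string in the POI test shows it.
    static String render(boolean checked) {
        return checked ? "|X|" : "|_|";
    }

    // Render several adjacent checkboxes with no separator between them.
    static String renderSequence(List<Boolean> states) {
        return states.stream()
                .map(CheckboxText::render)
                .collect(Collectors.joining());
    }

    public static void main(String[] args) {
        // Mirrors the "In Sequence" line of the expected text: prints |X||_||X|
        System.out.println(renderSequence(Arrays.asList(true, false, true)));
    }
}
```

This is the kind of output we would hope to see for the attached document in place of a constant |_|.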
 

 

Our code:
{code:java}
InputStream stream = this.getClass().getResourceAsStream("/" + EXPECTED_LOCATION + fileName);
// A maxLength of -1 removes the limit on the length of the returned string.
String text = new Tika().parseToString(stream, new Metadata(), -1).trim();
{code}
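
Until the state is available in the extracted text, one way to at least notice affected documents is to scan the returned string for the unresolved field names described above. This is a rough sketch with no Tika dependency. The class name FormPlaceholderScan and the sample string are made up for illustration, and it assumes the placeholders appear literally as FORMCHECKBOX and FORMDROPDOWN in the text.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Tika.
final class FormPlaceholderScan {

    // Assumption: the field names show up verbatim in the extracted text.
    private static final Pattern PLACEHOLDER =
            Pattern.compile("FORM(CHECKBOX|DROPDOWN)");

    // Count how often each unresolved form-field placeholder occurs.
    static Map<String, Integer> count(String extractedText) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("FORMCHECKBOX", 0);
        counts.put("FORMDROPDOWN", 0);
        Matcher m = PLACEHOLDER.matcher(extractedText);
        while (m.find()) {
            counts.merge(m.group(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample standing in for the result of parseToString.
        String sample = "Incident: FORMCHECKBOX  Severity: FORMDROPDOWN  Other: FORMCHECKBOX";
        System.out.println(count(sample)); // prints {FORMCHECKBOX=2, FORMDROPDOWN=1}
    }
}
```

A non-zero count only tells us that form fields were present and their values were lost; it does not recover the checked state or the selected option.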
 

I have attached an example MS Word doc file with checkboxes and a dropdown.

Regards and thanks, Pete

 

 

  was:
When we use Tika to parse the text from a Microsoft Word document (.doc) file with a check box we get +FORMCHECKBOX+ with no indication as to whether it is checked or not.

When the doc has a dropdown menu we get _FORMDROPDOWN_ with no indication as to which was selected.

If we parse to XHTML instead we still get e.g.

 
{code:xml}
<tr> <td><p class="header">Another kind of incident</p>
</td> <td><p class="header"><a name="__Fieldmark__23_1777734196" /><a name="__Fieldmark__23_1777734196" /><a name="__Fieldmark__23_1777734196" />|_|</p>
</td> <td><p />
</td></tr>
 
{code}
even though the checkbox is ticked in the doc (checkboxes always show |_|, whether ticked or not).

Is there a way that Tika can be configured to return text showing what was selected in each case?

Our code:

 
{code:java}
InputStream stream = this.getClass().getResourceAsStream("/" + EXPECTED_LOCATION + fileName);
String text = new Tika().parseToString(stream, new Metadata(), -1).trim();
{code}
 

I have attached an example MS Word doc file with checkboxes and a dropdown.

Regards and thanks, Pete

 

 

     Issue Type: Bug  (was: Improvement)

> MS Word document checkboxes and dropdowns not fully converted to text
> ---------------------------------------------------------------------
>
>                 Key: TIKA-2640
>                 URL: https://issues.apache.org/jira/browse/TIKA-2640
>             Project: Tika
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 1.18
>         Environment: [^MSWordDocWithCheckboxesAndDropdowns.doc]
>            Reporter: Peter Davies
>            Priority: Major
>         Attachments: MSWordDocWithCheckboxesAndDropdowns.doc



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)