Posted to dev@tika.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2011/09/09 18:41:09 UTC

[jira] [Created] (TIKA-711) Word parser doesn't extract optional hyphen correctly

Word parser doesn't extract optional hyphen correctly
-----------------------------------------------------

                 Key: TIKA-711
                 URL: https://issues.apache.org/jira/browse/TIKA-711
             Project: Tika
          Issue Type: Bug
          Components: parser
            Reporter: Michael McCandless
             Fix For: 1.0


We seem not to extract the optional hyphen character correctly in
the Word parser.

You can create this char in Word by pressing Ctrl+-.  It's hidden,
normally; you have to turn on display of formatting marks to see it.

Ideally we'd get U+00AD (Unicode soft hyphen), I think.

DOC produces a Unicode replacement char (U+FFFD), which is wrong.

DOCX and PDF drop the char (which seems acceptable).  RTF produces
U+2027 (hyphenation point) which also seems OK (in TIKA-683 it will
produce U+00AD).

PPT and PPTX work correctly (U+00AD).

So DOC is the only bug I think -- I haven't dug into what's wrong
yet...
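
Since these characters are mostly invisible, a small helper that spells out non-ASCII code points can make it easier to compare what each format yields. This is only a sketch; the class and method names (CodePointDump, dump) are made up for illustration and are not part of Tika or POI.

```java
// Hypothetical debugging helper, not part of Tika or POI.
final class CodePointDump {
    private CodePointDump() {}

    /** Returns the text with every non-printable-ASCII code point shown as <U+XXXX>. */
    static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (cp >= 0x20 && cp < 0x7F) {
                sb.appendCodePoint(cp);                    // printable ASCII stays as-is
            } else {
                sb.append(String.format("<U+%04X>", cp));  // everything else is spelled out
            }
        });
        return sb.toString();
    }
}
```

Running dump over the text extracted from each test file should make it obvious whether a soft hyphen, a hyphenation point, or a replacement char came out.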


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (TIKA-711) Word parser doesn't extract optional hyphen correctly

Posted by "Michael McCandless (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13118905#comment-13118905 ] 

Michael McCandless commented on TIKA-711:
-----------------------------------------

Curiously, if I use POI's WordToTextConverter command-line tool, it produces U+200b (ZERO WIDTH SPACE) for the optional hyphen, which I think is at least better than ASCII 31.  Still not sure if there's a POI option we can set to get this character out as U+00AD.
                

[jira] [Assigned] (TIKA-711) Word parser doesn't extract optional hyphen correctly

Posted by "Michael McCandless (Assigned) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless reassigned TIKA-711:
---------------------------------------

    Assignee: Michael McCandless
    

[jira] [Updated] (TIKA-711) Word parser doesn't extract optional hyphen correctly

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated TIKA-711:
------------------------------------

    Attachment: testOptionalHyphen.rtf
                testOptionalHyphen.pptx
                testOptionalHyphen.ppt
                testOptionalHyphen.pdf
                testOptionalHyphen.docx
                testOptionalHyphen.doc
                TIKA-711.patch

Patch.


[jira] [Resolved] (TIKA-711) Word parser doesn't extract optional hyphen correctly

Posted by "Michael McCandless (Resolved) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved TIKA-711.
-------------------------------------

       Resolution: Fixed
    Fix Version/s: 1.0
    

[jira] [Commented] (TIKA-711) Word parser doesn't extract optional hyphen correctly

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13101323#comment-13101323 ] 

Michael McCandless commented on TIKA-711:
-----------------------------------------

The WordExtractor seems to receive ASCII 31 ("unit separator") from POI for the optional hyphen, which SafeContentHandler then replaces with the Unicode replacement char (U+FFFD).

I don't think we can assume ASCII 31 will always mean soft hyphen though...

Not sure how to fix this.
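
To illustrate why ASCII 31 ends up as a replacement char, here is a rough sketch of that kind of filtering. It assumes the handler applies the XML 1.0 character rule, under which control characters below U+0020 other than tab, LF and CR are not allowed. It is not SafeContentHandler's actual code, and the names (XmlCharSketch, isValidXmlChar, sanitize) are invented for the example.

```java
// Illustrative only: a simplified stand-in, not Tika's SafeContentHandler.
final class XmlCharSketch {
    private XmlCharSketch() {}

    /** True if the code point matches the XML 1.0 "Char" production. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Replaces every code point that XML 1.0 disallows with U+FFFD. */
    static String sanitize(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        s.codePoints().forEach(cp -> sb.appendCodePoint(isValidXmlChar(cp) ? cp : 0xFFFD));
        return sb.toString();
    }
}
```

Under this rule ASCII 31 (0x1F) is invalid, so any text run that still contains it comes out with U+FFFD in its place.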


[jira] [Updated] (TIKA-711) Word parser doesn't extract optional hyphen correctly

Posted by "Michael McCandless (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated TIKA-711:
------------------------------------

    Attachment: TIKA-711.patch


OK, after digging I found out that in fact POI's AbstractWordConverter
converts ASCII 30 to Unicode non-breaking hyphen (U+2011) and ASCII 31
to Unicode zero-width space (U+200b), but Tika doesn't.  This is why I
see the "right" behavior when running POI's command-line conversion
but not with Tika.

So I think the fix is simple here: just do that same mapping in
WordExtractor.handleCharacterRun; attached patch does that, and
enables the test case (now passing).
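
Roughly, the mapping described above could look like the sketch below. This is not the attached patch: the class and method names (WordControlChars, map) are hypothetical, and the real change lives inside WordExtractor.handleCharacterRun. Only the two code-point mappings come from the comment above.

```java
// Hypothetical helper; names are illustrative and not taken from TIKA-711.patch.
final class WordControlChars {
    private WordControlChars() {}

    /** Maps Word's in-band control codes to the Unicode characters POI's converter uses. */
    static String map(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case 30:                  // ASCII 30: non-breaking hyphen in Word
                    sb.append('\u2011');
                    break;
                case 31:                  // ASCII 31: optional (soft) hyphen in Word
                    sb.append('\u200B');
                    break;
                default:                  // everything else passes through untouched
                    sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

Applying the mapping to each character run before the text reaches the content handler means the control codes never get a chance to be swapped for a replacement char.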

                

[jira] [Updated] (TIKA-711) Word parser doesn't extract optional hyphen correctly

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting updated TIKA-711:
-------------------------------

    Fix Version/s:     (was: 0.10)
