Posted to dev@tika.apache.org by "Shai Erera (JIRA)" <ji...@apache.org> on 2013/06/06 23:00:21 UTC

[jira] [Created] (TIKA-1131) Output sentence-break "hints" for files such as PPT/X

Shai Erera created TIKA-1131:
--------------------------------

             Summary: Output sentence-break "hints" for files such as PPT/X
                 Key: TIKA-1131
                 URL: https://issues.apache.org/jira/browse/TIKA-1131
             Project: Tika
          Issue Type: Improvement
          Components: parser
            Reporter: Shai Erera
            Priority: Minor


Spinoff from this thread: http://tika.markmail.org/thread/xk5sclapbeonifzr. I believe these files usually contain text that does not end with the usual sentence-ending punctuation. As I showed in the email, the parser seems to separate e.g. different bullets by inserting '\n' characters, but that is not enough per the sentence segmentation rules of UAX#29.

It would be better if the parser output a clearer marker which the user could then replace with a true sentence break (e.g. \u2029), rather than arbitrarily replacing every '\n', which I think is not a good general solution.

BTW, I parsed Impress files and it seems the parser does output some hints (I think <p> tags).
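
To illustrate what a user could do once the parser emits a clear hint, here is a minimal sketch, assuming the hint arrives as the end of a <p> element in the parser's SAX output. The class name ParagraphBreakHandler and the extract helper are hypothetical and not part of Tika; the sketch uses only the JDK's SAX classes and feeds in a hand-written snippet instead of real parser output.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler (not a Tika class): collects character data and
// appends U+2029 PARAGRAPH SEPARATOR whenever a <p> element closes.
public class ParagraphBreakHandler extends DefaultHandler {
    private static final char PARAGRAPH_SEPARATOR = '\u2029';
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Check both names so the handler works with and without
        // namespace-aware parsing.
        if ("p".equals(localName) || "p".equals(qName)) {
            text.append(PARAGRAPH_SEPARATOR);
        }
    }

    public String getText() {
        return text.toString();
    }

    // Assumption for this sketch: the markup is passed in as a string.
    // With a real parser, the handler would receive the SAX events directly.
    static String extract(String xhtml) throws Exception {
        ParagraphBreakHandler handler = new ParagraphBreakHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getText();
    }
}
```

The point of the sketch is that only the explicit hint is turned into a break; any '\n' inside the text itself is left untouched, so the user does not have to replace every '\n' blindly.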

I'll upload an isolated test that reproduces the output I included in the email.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira