Posted to dev@tika.apache.org by "Tyler Palsulich (JIRA)" <ji...@apache.org> on 2015/03/15 01:17:38 UTC

[jira] [Commented] (TIKA-1131) Output sentence-break "hints" for files such as PPT/X

    [ https://issues.apache.org/jira/browse/TIKA-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14362109#comment-14362109 ] 

Tyler Palsulich commented on TIKA-1131:
---------------------------------------

Hi [~shaie]. Sorry no one responded to this! Can you upload a file with the bullets (and *s) you described in your email? Thanks!

> Output sentence-break "hints" for files such as PPT/X
> -----------------------------------------------------
>
>                 Key: TIKA-1131
>                 URL: https://issues.apache.org/jira/browse/TIKA-1131
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Shai Erera
>            Priority: Minor
>
> Spinoff from here: http://tika.markmail.org/thread/xk5sclapbeonifzr. I believe that files such as PPT/X usually contain text that does not end with the usual sentence terminators. As I showed in the email, the parser seems to mark e.g. different bullets by inserting explicit '\n' characters, but that's not enough per the sentence segmentation rules of UAX#29.
> It would be better if the parser output a clearer marker which the user could then replace with a true sentence break (e.g. \u2029), rather than arbitrarily replacing every '\n', which I think is not a good general solution.
> BTW, I parsed Impress files and it seems the parser does output some hints (I think <p> tags).
> I'll upload an isolated test which generates the output as I put in the email.
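
For illustration only, here is a minimal sketch of the post-processing the description proposes: the caller swaps a dedicated hint marker for U+2029 and leaves ordinary '\n' characters alone. The HINT marker, the SentenceHints class, and both method names are hypothetical. Tika does not emit such a marker, and the private-use code point U+E000 is just a placeholder for whatever a parser might output.

```java
import java.util.ArrayList;
import java.util.List;

public class SentenceHints {
    // Hypothetical hint marker (assumption): a private-use code point standing in
    // for whatever marker a parser might emit. Not part of Tika's actual output.
    public static final String HINT = "\uE000";

    // U+2029 PARAGRAPH SEPARATOR, the "true sentence break" named in the issue.
    public static final char PARAGRAPH_SEPARATOR = '\u2029';

    // Replace only the dedicated hint marker, leaving ordinary '\n' untouched.
    public static String applyHints(String extracted) {
        return extracted.replace(HINT, String.valueOf(PARAGRAPH_SEPARATOR));
    }

    // Split on U+2029 so each hinted unit (e.g. one bullet) becomes its own segment.
    // Empty segments are dropped; surrounding whitespace is trimmed.
    public static List<String> splitOnParagraphSeparator(String text) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == PARAGRAPH_SEPARATOR) {
                String part = text.substring(start, i).trim();
                if (!part.isEmpty()) {
                    parts.add(part);
                }
                start = i + 1;
            }
        }
        String tail = text.substring(start).trim();
        if (!tail.isEmpty()) {
            parts.add(tail);
        }
        return parts;
    }

    public static void main(String[] args) {
        // The '\n' inside the second bullet is a soft line wrap, not a hint.
        String extracted = "First bullet" + HINT
                + "Second bullet\nwrapped line" + HINT
                + "Third bullet";
        for (String segment : splitOnParagraphSeparator(applyHints(extracted))) {
            System.out.println("[" + segment + "]");
        }
    }
}
```

The point of a dedicated marker is that a '\n' which is only a line wrap inside one bullet survives untouched, which blanket replacement of every '\n' cannot guarantee.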



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)