Posted to dev@tika.apache.org by "Claas Aug. (Jira)" <ji...@apache.org> on 2020/04/21 18:15:00 UTC

[jira] [Updated] (TIKA-3024) Extra whitespace appended within a tag element's text

     [ https://issues.apache.org/jira/browse/TIKA-3024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Claas Aug. updated TIKA-3024:
-----------------------------
    Attachment: one.odt
                one.odt-parsed.html

> Extra whitespace appended within a tag element's text
> -----------------------------------------------------
>
>                 Key: TIKA-3024
>                 URL: https://issues.apache.org/jira/browse/TIKA-3024
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.16, 1.20
>            Reporter: Vivek 
>            Priority: Major
>         Attachments: one.odt, one.odt-parsed.html
>
>
> Website: [http://www.thevanitycase.com/about-us.php]
> While parsing the content of this page, the Tika parser splits the text of a tag into pieces and sends them to crawler4j for content handling, even though the text is contained within a single tag (a span tag). The content handler appends extra whitespace ("  ") after each piece, as it normally does for any text it receives, so the whitespace ends up inside the element's text.
> Text: "Tel: +91-22-61801700".
> That is,
> Expected text: "<text before this>Tel: +91-22-61801700<text after this>"
> Actual text: "<text before this>Tel: +91-22-6180170  0<text after this>"
> The JS path of the element: body > div > div:nth-child(6) > div > div.footer-full.footer-btm > div > p > span
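
The behaviour described in the report is consistent with how SAX delivers text: a parser may report the character data of one element in several `characters()` calls, so a handler that pads after every call can insert whitespace in the middle of a word. The following is a minimal sketch of that mechanism, not the actual Tika or crawler4j code. The class names `ChunkedTextDemo`, `PaddingHandler` and `BufferingHandler` are hypothetical, and the split point of the text is assumed from the "Actual text" in the report.

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; it only imitates the suspected behaviour.
final class ChunkedTextDemo {

    // Common base so both handlers expose their collected text.
    abstract static class TextHandler extends DefaultHandler {
        protected final StringBuilder text = new StringBuilder();

        String getText() {
            return text.toString();
        }
    }

    // Pads after every characters() call, so a split chunk gets a gap.
    static final class PaddingHandler extends TextHandler {
        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length).append("  ");
        }
    }

    // Buffers chunks and pads only once, when the element ends.
    static final class BufferingHandler extends TextHandler {
        private final StringBuilder pending = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            pending.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            text.append(pending).append("  ");
            pending.setLength(0);
        }
    }

    // Simulates a parser reporting one <span> whose text arrives in chunks.
    static String feed(TextHandler handler, String... chunks) {
        try {
            handler.startElement("", "span", "span", new AttributesImpl());
            for (String chunk : chunks) {
                char[] ch = chunk.toCharArray();
                handler.characters(ch, 0, ch.length);
            }
            handler.endElement("", "span", "span");
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return handler.getText();
    }

    public static void main(String[] args) {
        // Assumed split point, taken from the "Actual text" in the report.
        String[] chunks = {"Tel: +91-22-6180170", "0"};
        System.out.println("[" + feed(new PaddingHandler(), chunks) + "]");
        System.out.println("[" + feed(new BufferingHandler(), chunks) + "]");
    }
}
```

With the assumed split, the padding handler yields `Tel: +91-22-6180170  0` followed by trailing padding, which matches the reported output, while the buffering handler keeps the number intact. Whether the fix belongs in the parser (delivering the text in one call) or in the handler (buffering until the element ends) is not something the report settles.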



--
This message was sent by Atlassian Jira
(v8.3.4#803005)