Posted to dev@tika.apache.org by "Sai Konuri (Jira)" <ji...@apache.org> on 2022/07/11 18:37:00 UTC
[jira] [Updated] (TIKA-3814) Extracted text from HTML file does not exclude newline chars from body
[ https://issues.apache.org/jira/browse/TIKA-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sai Konuri updated TIKA-3814:
-----------------------------
Priority: Blocker (was: Minor)
> Extracted text from HTML file does not exclude newline chars from body
> ----------------------------------------------------------------------
>
> Key: TIKA-3814
> URL: https://issues.apache.org/jira/browse/TIKA-3814
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.3.0
> Reporter: Sai Konuri
> Priority: Blocker
> Attachments: bug.html, image-2022-07-06-19-08-30-437.png, image-2022-07-06-19-09-54-534.png
>
>
> When a newline character ('\n') occurs within the text of an element such as <span>, <p>, or <text>, the extracted text still contains that newline instead of excluding it.
> A sample html file is attached.
>
> *Expected*:
> !image-2022-07-06-19-08-30-437.png!
>
> *Actual*:
> !image-2022-07-06-19-09-54-534.png!
>
>
> This is the code I am using to extract the text of the HTML file:
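> {code:java}
> import java.io.InputStream;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.parser.AutoDetectParser;
> import org.apache.tika.sax.BodyContentHandler;
>
> AutoDetectParser parser = new AutoDetectParser();
> BodyContentHandler handler = new BodyContentHandler();
> Metadata metadata = new Metadata();
> try (InputStream stream = this.getClass().getClassLoader().getResourceAsStream("bug.html")) {
>     parser.parse(stream, handler, metadata);
>     // Prints the extracted body text, which still contains the newlines
>     System.out.println(handler);
> }
> {code}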
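
As a possible workaround on the caller's side, the string returned by handler.toString() in the snippet above could be post-processed so that runs of whitespace, including '\n', become a single space. The following is only a minimal sketch and not part of Tika: the class name WhitespaceNormalizer and the method collapse are hypothetical, and it assumes that dropping every line break (including those between block elements) is acceptable.

```java
// Hypothetical helper, not a Tika API: collapses whitespace in extracted text.
final class WhitespaceNormalizer {

    private WhitespaceNormalizer() {
        // static utility, no instances
    }

    // Replaces every run of whitespace (spaces, tabs, '\n', '\r', ...) with a
    // single space and trims the ends. The input must be non-null.
    static String collapse(String extracted) {
        return extracted.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // Stand-in for the text obtained from handler.toString()
        String extracted = "Hello\n  world";
        System.out.println(collapse(extracted));
    }
}
```

Note that this also removes the line breaks that separate paragraphs, so it only fits cases where a single-line result is wanted.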
--
This message was sent by Atlassian Jira
(v8.20.10#820010)