Posted to dev@tika.apache.org by "Desmond David (Jira)" <ji...@apache.org> on 2019/08/23 08:07:00 UTC

[jira] [Comment Edited] (TIKA-2928) Less than sign within tag boundaries considered as start of a new tag.

    [ https://issues.apache.org/jira/browse/TIKA-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16914035#comment-16914035 ] 

Desmond David edited comment on TIKA-2928 at 8/23/19 8:06 AM:
--------------------------------------------------------------

Ok, I tested this out with Jsoup and it appears that Jsoup handles this correctly:
{code:java}
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
public class Test {
	public static void main(String[] args) {
		String str = "<tr ><td > GFR<60 = Chronic Kidney Disease, GFR<15 = Kidney Failure </td></tr>";
		Document doc = Jsoup.parse(str);
		Element e = doc.getAllElements().get(0);
		System.out.println(e.text());
	}
}{code}
Output:
{code:java}
GFR<60 = Chronic Kidney Disease, GFR<15 = Kidney Failure{code}
This is the expected result.
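For reference, this is consistent with the HTML5 tokenizer rule as I understand it: a `<` only opens a tag when it is followed by an ASCII letter, `/`, `!` or `?`; otherwise it is emitted as literal text. Below is a minimal, stdlib-only sketch of that rule applied to the same input. The class and method names (`LessThanSketch`, `opensTag`, `extractText`) are hypothetical and this is not Jsoup's or Tika's actual code, just an illustration of the behaviour under that assumption.

```java
class LessThanSketch {
	// Hypothetical helper: does the '<' at index i look like the start of markup?
	// Assumption: only '<' followed by an ASCII letter, '/', '!' or '?' opens a tag.
	static boolean opensTag(String s, int i) {
		if (i + 1 >= s.length()) {
			return false; // a trailing '<' is plain text
		}
		char next = s.charAt(i + 1);
		boolean asciiLetter = (next >= 'a' && next <= 'z') || (next >= 'A' && next <= 'Z');
		return asciiLetter || next == '/' || next == '!' || next == '?';
	}

	// Hypothetical helper: drop anything that looks like a tag, keep everything else as text.
	static String extractText(String html) {
		StringBuilder out = new StringBuilder();
		int i = 0;
		while (i < html.length()) {
			char c = html.charAt(i);
			if (c == '<' && opensTag(html, i)) {
				int close = html.indexOf('>', i);
				if (close < 0) {
					break; // unterminated tag: nothing more to extract
				}
				i = close + 1; // skip the whole tag
			} else {
				out.append(c); // includes a '<' that is followed by a digit, space, etc.
				i++;
			}
		}
		return out.toString().trim();
	}

	public static void main(String[] args) {
		String str = "<tr ><td > GFR<60 = Chronic Kidney Disease, GFR<15 = Kidney Failure </td></tr>";
		System.out.println(extractText(str));
	}
}
```

With this rule, `<60` and `<15` stay in the text because a digit cannot start a tag name, while `<tr >`, `<td >` and the closing tags are still stripped. A real parser does much more (attributes, comments, entities), so this only illustrates the single decision that matters for this issue.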



> Less than sign within tag boundaries considered as start of a new tag.
> ----------------------------------------------------------------------
>
>                 Key: TIKA-2928
>                 URL: https://issues.apache.org/jira/browse/TIKA-2928
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser, server
>    Affects Versions: 1.22
>            Reporter: Desmond David
>            Priority: Minor
>
> I have been trying to parse some (somewhat non-standard) HTML documents with Tika, and I have observed that if a document contains a less-than sign (<) in a tag's body, Tika treats it as the start of a new tag and omits the rest of the text from the final document, up to the point where the next newline would be emitted.
> For example, consider the following HTML snippet:
>  
> {code:html}
> <tr ><td > GFR<60 = Chronic Kidney Disease, GFR<15 = Kidney Failure </td></tr><tr ><td ></td></tr><tr ><td > ENZYMES & BILIRUBIN</td></tr>{code}
> The result is:
> {code:java}
> GFR
> ENZYMES & BILIRUBIN
> {code}
> Here, the rest of the content after the first `GFR` is omitted. Based on this observation, I think `<60` and the characters that follow it are being interpreted as part of a tag and are therefore ignored. Then, at some point, `</td></tr>` is encountered, which cuts this short, and processing moves on to the next line.
> This behaviour was observed with both the Tika App and the Tika Server.
> I think the expected behaviour is that all text within data tags (p, td, etc.) is treated as raw text, or at least that Tika can be configured to behave this way.
>  


