Posted to dev@tika.apache.org by "Florent Valdelievre (JIRA)" <ji...@apache.org> on 2016/06/15 14:05:09 UTC

[jira] [Created] (TIKA-2010) Unable to get <title> value when header is incorrect

Florent Valdelievre created TIKA-2010:
-----------------------------------------

             Summary: Unable to get <title> value when header is incorrect
                 Key: TIKA-2010
                 URL: https://issues.apache.org/jira/browse/TIKA-2010
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.12
            Reporter: Florent Valdelievre

Many websites have invalid content inside the <head></head> tag. Even when the head content is invalid (misplaced tags, etc.), we should still be able to get the <title> value if it is present.

Below is a straightforward unit test that reproduces the problem. Note that I have added an anchor between the <head> tags (<head><a></a></head>), which is not valid. If you remove it, the title value is found.

{code:java}
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.hadoop.conf.Configuration;
import org.apache.html.dom.HTMLDocumentImpl;
import org.apache.nutch.parse.html.DOMBuilder;
import org.apache.nutch.parse.tika.DOMContentUtils;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.junit.Assert;
import org.junit.Before;
import org.junit.Test;
import org.w3c.dom.DocumentFragment;

public class TestTikaGetTitleWithInvalidHeaders {

    private Configuration conf;

    // Helper that is not used by the test below; the encoding argument is ignored.
    static byte[] readFile(String path, Charset encoding) throws IOException {
        return Files.readAllBytes(Paths.get(path));
    }

    // Page whose <head> contains a misplaced <a> element before the <title>.
    private final static String WEBPAGE =
        "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML+RDFa 1.0//EN\" \"http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd\">"
        + "<html>"
        + "<head>"
        + "<a href=\"https://plus.google.com/113911985765464238166\" rel=\"publisher\">Google+</a> "
        + "<title>Welcome!</title>"
        + "</head>"
        + "<body>"
        + "content"
        + "</body>"
        + "</html>";

    @Before
    public void setUp() throws Exception {
        conf = new Configuration();
    }

    @Test
    public void testGetTitle() {
        HTMLDocumentImpl doc = new HTMLDocumentImpl();
        doc.setErrorChecking(false);
        DocumentFragment root = doc.createDocumentFragment();
        Parser parser = new org.apache.tika.parser.html.HtmlParser();
        DOMBuilder domBuilder = new DOMBuilder(doc, root);

        // Parse the page with Tika's HtmlParser, building the DOM under root.
        try {
            parser.parse(new ByteArrayInputStream(WEBPAGE.getBytes()), domBuilder,
                new Metadata(), new ParseContext());
        } catch (Exception e) {
            e.printStackTrace();
        }

        // Extract the title from the DOM; this assertion fails with the misplaced <a>.
        StringBuffer sb = new StringBuffer();
        new DOMContentUtils(conf).getTitle(sb, root);
        Assert.assertEquals("Welcome!", sb.toString());
    }
}
{code}

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
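
For illustration only, here is a minimal JDK-only sketch of the behaviour requested in the report: a title lookup that still succeeds when <title> is not where a strict lookup expects it. It does not use the Tika or Nutch APIs. The class name, the method names, and the hand-built DOM (with <title> ending up under <body>) are all assumptions made for the example; the report does not say where the parser actually places the element.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical example class; not part of Tika or Nutch.
public class TolerantTitleLookup {

    /** Strict lookup: only accepts a <title> that is a direct child of <head>. */
    static String titleUnderHead(Document doc) {
        NodeList heads = doc.getElementsByTagName("head");
        if (heads.getLength() == 0) {
            return "";
        }
        NodeList children = heads.item(0).getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if ("title".equalsIgnoreCase(child.getNodeName())) {
                return child.getTextContent().trim();
            }
        }
        return "";
    }

    /** Tolerant lookup: accepts the first <title> found anywhere in the document. */
    static String titleAnywhere(Document doc) {
        NodeList titles = doc.getElementsByTagName("title");
        return titles.getLength() == 0 ? "" : titles.item(0).getTextContent().trim();
    }

    /** Builds a DOM by hand in which <title> sits under <body> (assumed shape). */
    static Document sampleDocument() {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        Element html = doc.createElement("html");
        Element head = doc.createElement("head");
        Element body = doc.createElement("body");
        Element anchor = doc.createElement("a");
        anchor.setTextContent("Google+");
        Element title = doc.createElement("title");
        title.setTextContent("Welcome!");

        doc.appendChild(html);
        html.appendChild(head);   // <head> is left empty
        html.appendChild(body);
        body.appendChild(anchor); // the misplaced anchor
        body.appendChild(title);  // the title that a strict lookup misses
        body.appendChild(doc.createTextNode("content"));
        return doc;
    }

    public static void main(String[] args) {
        Document doc = sampleDocument();
        System.out.println("strict:   '" + titleUnderHead(doc) + "'");
        System.out.println("tolerant: '" + titleAnywhere(doc) + "'");
    }
}
```

On this hand-built document the strict lookup returns an empty string while the tolerant one returns the title text. Whether the same distinction explains the failing test above would need to be checked against the DOM that the parser actually produces.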