Posted to user@tika.apache.org by Markus Schuch <ma...@web.de> on 2013/03/18 10:23:51 UTC
A question about HTML metadata extraction
Hi,
I use Tika 1.2 to extract content from HTML documents.
Besides the content, I am also interested in the metadata, which is described using Dublin Core.
I would assume that Tika maps Dublin Core metadata to the basic Tika metadata properties defined in the class TikaCoreProperties.
But after parsing this document:
<html>
<head>
<title>Page Title (from title tag)</title>
<link rel="schema.DC" href="http://purl.org/dc/terms/"></link>
<meta name="DC.creator" content="John Doe" />
<meta name="DC.title" content="Page Title (from DC.title)" />
</head>
<body>
<p>content</p>
</body>
</html>
the Dublin Core metadata does not seem to be mapped correctly.
When I try to get the metadata by calling
metadata.get(TikaCoreProperties.CREATOR);
null is returned.
When using
metadata.get(TikaCoreProperties.TITLE);
the String "Page Title (from title tag)" is returned.
When iterating over all metadata entries, I see that all DC meta tags are present, so I could simply use
metadata.get("DC.creator");
to retrieve the creator, but that does not feel right...
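To make the workaround concrete, here is a minimal sketch of a fallback lookup: try the mapped key first, then the raw meta tag name. It uses a plain Map as a stand-in for the parsed metadata so that it runs without Tika on the classpath. The class DcFallback and the method firstPresent are hypothetical names, and "dc:creator" is only assumed to be the key behind TikaCoreProperties.CREATOR.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Tika: returns the value stored under the
// first key that is present, so a mapped key can fall back to a raw one.
class DcFallback {

    // Walks the keys in order and returns the first non-null value, or null
    // if none of the keys has a value.
    static String firstPresent(Map<String, String> metadata, String... keys) {
        for (String key : keys) {
            String value = metadata.get(key);
            if (value != null) {
                return value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Stand-in for the raw DC entries seen when iterating over the
        // metadata of the HTML document above.
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("DC.creator", "John Doe");
        metadata.put("DC.title", "Page Title (from DC.title)");

        // Assumption: "dc:creator" is the mapped key; "DC.creator" is the
        // raw name taken from the meta tag.
        String creator = firstPresent(metadata, "dc:creator", "DC.creator");
        System.out.println(creator); // prints "John Doe"
    }
}
```

Against a real Metadata object, the same loop would presumably call metadata.get(key) for each name instead of reading from a Map.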
Is my assumption wrong that Tika maps certain Dublin Core tags from an HTML document to the basic metadata properties?
Or is this a missing feature in the HtmlParser?
Many thanks in advance,
Markus