Posted to user@tika.apache.org by Markus Schuch <ma...@web.de> on 2013/03/18 10:23:51 UTC

A question about HTML metadata extraction

Hi,

I use Tika 1.2 to extract content from HTML documents.
Besides the content, I am also interested in the metadata, which is described using Dublin Core.

I would assume that Tika maps Dublin Core metadata to the basic Tika metadata properties described in the class TikaCoreProperties.

But after parsing this document:

<html>
  <head>
    <title>Page Title (from title tag)</title>
    <link rel="schema.DC" href="http://purl.org/dc/terms/"></link>
    <meta name="DC.creator" content="John Doe" />
    <meta name="DC.title" content="Page Title (from DC.title)" />
  </head>
  <body>
    <p>content</p>
  </body>
</html>

the Dublin Core metadata does not seem to be mapped correctly.

When I try to get the metadata by calling

  metadata.get(TikaCoreProperties.CREATOR);

null is returned. 

When using 

  metadata.get(TikaCoreProperties.TITLE);

the String "Page Title (from title tag)" is returned, i.e. the value of the title tag and not the one from DC.title.

When iterating over all metadata entries, I see that all DC meta tags are in there under their original names, so I could simply use

  metadata.get("DC.creator");

to retrieve the creator, but that does not feel right...
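
To make the workaround concrete, here is a minimal sketch of the fallback lookup I have in mind. It tries the mapped property name first and falls back to the raw meta tag name. The class MetadataFallback and the helper firstPresent are hypothetical names of my own, and a plain Map stands in for the Tika Metadata object so that the sketch runs without Tika. The key "dc:creator" is my assumption for the name behind TikaCoreProperties.CREATOR.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class MetadataFallback {

    // Hypothetical helper: returns the first non-null value found under
    // the given names, in order, or null if none of them is set.
    static String firstPresent(Function<String, String> lookup, String... names) {
        for (String name : names) {
            String value = lookup.apply(name);
            if (value != null) {
                return value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Stand-in for the Metadata object after parsing the HTML above.
        // Only the raw "DC.creator" entry is present, as observed.
        Map<String, String> metadata = new HashMap<>();
        metadata.put("DC.creator", "John Doe");

        // "dc:creator" is assumed to be the mapped property name; it is
        // missing here, so the raw meta tag name is used instead.
        String creator = firstPresent(metadata::get, "dc:creator", "DC.creator");
        System.out.println(creator); // prints "John Doe"
    }
}
```

With a real Metadata object, the lookup passed in would presumably be metadata::get (the String overload), with TikaCoreProperties.CREATOR.getName() as the first name. It still feels like something the parser should do for me.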

Is my assumption wrong that Tika maps certain Dublin Core tags from an HTML document to the basic metadata properties?
Or is this a missing feature in the HtmlParser?

Many thanks in advance,
Markus