Posted to user@tika.apache.org by zabrane Mikael <za...@gmail.com> on 2010/05/06 13:53:28 UTC

Append ... to metadata list?

Hi All,

I'm facing a small issue with my Tika-based metadata extractor.

Here's the problem. Let's assume a very basic HTML page
(page.html) with the following content:

8- 8- 8- 8- 8- 8- 8- 8- 8- 8- 8- 8-
...
<title>My amazing website</title>

<meta http-equiv="Content-type" content="text/html; charset=iso-8859-1" />
<meta name="title" content="" />
<meta name="description" content="foo" />
<meta name="keywords" content="bar" />
...
8- 8- 8- 8- 8- 8- 8- 8- 8- 8- 8- 8-

Please note that it contains a valid title between "<title> ... </title>"
and an empty "title" metadata entry (<meta name="title" content="" />).

When I try to extract metadata from it, here's what I get:

8- 8- 8- 8- 8- 8- 8- 8- 8- 8- 8- 8
$  java -jar tika-app.jar -m page.html
...
Content-type: text/html; charset=iso-8859-1
copyright: lefigaro.fr
description: foo
keywords: bar
resourceName: page.html
title:
8- 8- 8- 8- 8- 8- 8- 8- 8- 8- 8- 8

As you might expect, I got an empty "title" metadata entry,
which isn't what I'd like to get.

Since the page contains a valid title between the "<title> ... </title>" tags
(i.e. "My amazing website"), is there a way to tell Tika to return this title
instead of the empty one?
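
To make the behaviour I'm after concrete, here is a minimal sketch in plain
Java. It does not use any Tika API: the class TitleFallback, its pickTitle
method and the two regular expressions are hypothetical and only illustrate
the fallback I'd like (use the meta "title" when it is non-empty, otherwise
the text between the <title> tags). The regexes assume the attribute order
shown in page.html and are not a general HTML parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of Tika.
class TitleFallback {
    // Matches <meta name="title" content="..."> with the attribute order used in page.html.
    private static final Pattern META_TITLE = Pattern.compile(
            "<meta\\s+name=\"title\"\\s+content=\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);
    // Matches the text between <title> and </title>.
    private static final Pattern TITLE_TAG = Pattern.compile(
            "<title>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the trimmed first capture group, or "" when the pattern is absent.
    static String firstGroup(Pattern p, String html) {
        Matcher m = p.matcher(html);
        return m.find() ? m.group(1).trim() : "";
    }

    // Prefers a non-empty meta "title"; otherwise falls back to the <title> element.
    static String pickTitle(String html) {
        String meta = firstGroup(META_TITLE, html);
        return meta.isEmpty() ? firstGroup(TITLE_TAG, html) : meta;
    }

    public static void main(String[] args) {
        String html = "<title>My amazing website</title>\n"
                + "<meta name=\"title\" content=\"\" />\n";
        System.out.println("title: " + pickTitle(html)); // prints "title: My amazing website"
    }
}
```

In other words, an empty meta "title" should not override a non-empty
<title> element, while a non-empty meta "title" could still take precedence.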

Regards
Zabrane