Posted to user@tika.apache.org by zabrane Mikael <za...@gmail.com> on 2010/05/06 13:53:28 UTC
Append ... to metadata list?
Hi All,
I'm facing a small issue with my Tika-based metadata extractor.
Here's the problem. Let's assume a very basic HTML page
(page.html) with the following content:
8- 8- 8- 8- 8- 8- 8- 8- 8- 8- 8- 8-
...
<title>My amazing website</title>
<meta http-equiv="Content-type" content="text/html; charset=iso-8859-1" />
<meta name="title" content="" />
<meta name="description" content="foo" />
<meta name="keywords" content="bar" />
...
8- 8- 8- 8- 8- 8- 8- 8- 8- 8- 8- 8-
Please note that it contains a valid title between "<title> ... </title>"
and an empty "title" meta tag (<meta name="title" content="" />).
When I try to extract metadata from it, here's what I get:
8- 8- 8- 8- 8- 8- 8- 8- 8- 8- 8- 8
$ java -jar tika-app.jar -m page.html
...
Content-type: text/html; charset=iso-8859-1
copyright: lefigaro.fr
description: foo
keywords: bar
resourceName: page.html
title:
8- 8- 8- 8- 8- 8- 8- 8- 8- 8- 8- 8
As you might expect, the "title" metadata comes back empty,
which isn't what I'd like to get.
Since the page contains a valid title between the "<title> ... </title>" tags
(i.e. "My amazing website"),
is there a way to tell Tika to return this title instead of the empty one?
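To make the behaviour I'm after concrete, here is a minimal sketch in plain Java, outside of Tika. The class TitleFallback and the method effectiveTitle are hypothetical names of my own, not Tika APIs, and the regular expression is only a rough stand-in for real HTML parsing:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Tika): prefer the "title" metadata value,
// but fall back to the text of the <title> element when that value is empty.
public class TitleFallback {
    // Matches the first <title>...</title> element, ignoring case and line breaks.
    // Assumption: a regex is good enough for this illustration; it is not a real HTML parser.
    private static final Pattern TITLE_TAG = Pattern.compile(
            "<title[^>]*>(.*?)</title>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    public static String effectiveTitle(String metaTitle, String html) {
        // Keep the metadata value when it actually carries text.
        if (metaTitle != null && !metaTitle.trim().isEmpty()) {
            return metaTitle.trim();
        }
        // Otherwise use the <title> element, or an empty string if there is none.
        Matcher m = TITLE_TAG.matcher(html);
        return m.find() ? m.group(1).trim() : "";
    }

    public static void main(String[] args) {
        String html = "<title>My amazing website</title>\n"
                + "<meta name=\"title\" content=\"\" />\n";
        // The empty string stands for the empty "title" value reported above.
        System.out.println(effectiveTitle("", html));
    }
}
```

With the sample page above, this prints "My amazing website". Ideally I'd like to get the same result from Tika itself rather than post-processing the HTML a second time.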
Regards
Zabrane