Posted to dev@tika.apache.org by Nick Burch <ni...@alfresco.com> on 2010/01/19 14:41:45 UTC

Extracting dublin core metadata in HtmlParser?

Hi All

I've been taking a look at the HtmlParser, and I can't spot anything in 
there that extracts any of the Dublin Core metadata that could be there. 
It seems that only things like location and encoding get set onto the 
metadata object. Nothing like description, author, etc. seems to get 
set.

So, two questions: is that feature actually already there and I've just 
been useless at finding it? And if not, do people think it's a 
sufficiently useful feature that I should go and write a patch for it?

Cheers
Nick

Re: Extracting dublin core metadata in HtmlParser?

Posted by Ken Krugler <kk...@transpac.com>.
Hi Nick,

On Jan 19, 2010, at 5:41am, Nick Burch wrote:

> Hi All
>
> I've been taking a look at the HtmlParser, and I can't spot anything  
> in there that extracts any of the dublin core metadata that could be  
> there. It seems that it's only things like location and encoding  
> that get set onto the metadata object. Nothing like description,  
> author etc seems to get set.

Only location & encoding are explicitly looked for, but all meta tag  
values get put into the metadata map.

See HtmlHandler.startElement(), where it has:

         if (bodyLevel == 0 && discardLevel == 0) {
             if ("META".equals(name) && atts.getValue("content") != null) {
                 if (atts.getValue("http-equiv") != null) {
                     metadata.set(
                             atts.getValue("http-equiv"),
                             atts.getValue("content"));
                 }
                 if (atts.getValue("name") != null) {
                     metadata.set(
                             atts.getValue("name"),
                             atts.getValue("content"));
                 }


Note, though, that the names defined in Tika's DublinCore enum seem to be
missing the "dc." prefix, so a key such as "dc.title" stored by the code
above would not match them.
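
To make the behaviour described above concrete, here is a minimal, self-contained sketch. It is not Tika code: the class MetaTagSketch, the helper attrs() and the method demo() are hypothetical names, a plain Map stands in for Tika's Metadata object, and the bodyLevel/discardLevel check is left out. It also assumes, per the remark above, that a Dublin Core lookup would use the bare name "title" rather than "dc.title".

```java
import java.util.LinkedHashMap;
import java.util.Map;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;

class MetaTagSketch {

    // Mirrors the quoted startElement() logic: a META tag that has a content
    // attribute is stored under its http-equiv and/or name value.
    // The Map is a hypothetical stand-in for Tika's Metadata object.
    static void handleElement(String name, Attributes atts,
                              Map<String, String> metadata) {
        if ("META".equals(name) && atts.getValue("content") != null) {
            if (atts.getValue("http-equiv") != null) {
                metadata.put(atts.getValue("http-equiv"),
                             atts.getValue("content"));
            }
            if (atts.getValue("name") != null) {
                metadata.put(atts.getValue("name"),
                             atts.getValue("content"));
            }
        }
    }

    // Hypothetical helper: builds SAX attributes from name/value pairs.
    static Attributes attrs(String... pairs) {
        AttributesImpl atts = new AttributesImpl();
        for (int i = 0; i + 1 < pairs.length; i += 2) {
            atts.addAttribute("", pairs[i], pairs[i], "CDATA", pairs[i + 1]);
        }
        return atts;
    }

    // Feeds three sample META tags through the handler logic.
    static Map<String, String> demo() {
        Map<String, String> metadata = new LinkedHashMap<>();
        handleElement("META",
                attrs("name", "dc.title", "content", "Example page"),
                metadata);
        handleElement("META",
                attrs("http-equiv", "Content-Type",
                      "content", "text/html; charset=UTF-8"),
                metadata);
        // No content attribute, so this tag is skipped.
        handleElement("META", attrs("name", "author"), metadata);
        return metadata;
    }

    public static void main(String[] args) {
        Map<String, String> metadata = demo();
        // The value is stored under the key exactly as written in the tag.
        System.out.println(metadata.get("dc.title")); // prints "Example page"
        // A lookup under the unprefixed name finds nothing.
        System.out.println(metadata.get("title"));    // prints "null"
    }
}
```

In other words, the values do end up in the map, but a caller has to ask for them using the same key the page author used in the tag, prefix included.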

-- Ken



--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g