Posted to user@nutch.apache.org by Paul Williams <Pa...@becta.org.uk> on 2005/09/28 11:50:53 UTC

Parsing HTML meta tags

Hi,

I'm trying to crawl an external site whose pages contain meta tags
embedded in the HTML, such as:

<title>BBC - GCSE Bitesize - Homepage</title>
<meta name="description" content="Index for GCSE Bitesize" />
<meta name="keywords" content="BBC, bbc, GCSE, Revision, Revise,
Bitesize" />
<meta name="created" content="20041101">
 
Nutch fetches the pages, but none of the meta tags end up in the index.
Is there a way to get them indexed?
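
To make clear what I'm after, here is a rough standalone sketch of the
name/content pairs I'd like to see per page. It is plain Java with a regex
and does not use any Nutch API; the class and method names are made up for
illustration, and it assumes double-quoted attributes with name before
content, as in the sample above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: pulls name/content pairs out of raw HTML.
public class MetaTagSketch {

    // Assumes <meta name="..." content="..."> with double quotes and this attribute order.
    private static final Pattern META = Pattern.compile(
        "<meta\\s+name=\"([^\"]*)\"\\s+content=\"([^\"]*)\"\\s*/?>",
        Pattern.CASE_INSENSITIVE);

    public static Map<String, String> extractMeta(String html) {
        Map<String, String> tags = new LinkedHashMap<String, String>();
        Matcher m = META.matcher(html);
        while (m.find()) {
            // Collapse line breaks inside a wrapped content value into single spaces.
            String content = m.group(2).replaceAll("\\s+", " ").trim();
            tags.put(m.group(1), content);
        }
        return tags;
    }

    public static void main(String[] args) {
        String html =
            "<title>BBC - GCSE Bitesize - Homepage</title>\n"
            + "<meta name=\"description\" content=\"Index for GCSE Bitesize\" />\n"
            + "<meta name=\"keywords\" content=\"BBC, bbc, GCSE, Revision, Revise,\n"
            + "Bitesize\" />\n"
            + "<meta name=\"created\" content=\"20041101\">\n";
        // Print each pair, one per line, in document order.
        for (Map.Entry<String, String> e : extractMeta(html).entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
```

Ideally each of those pairs would become a searchable field in the index.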
 
Thanks,
Paul.