Posted to dev@nutch.apache.org by Yakn <bo...@yahoo.com> on 2007/05/23 17:02:21 UTC

Get meta name="description" and other meta tags from Content

I am using the SegmentReader, iterating over content. I have the Content,
ParseData, and ParseText objects, and I am looking for a way to get access
to the meta tags in the header of the HTML in my Content object. Is there
any way to get access to these meta tags? I do not want to have to use the
HtmlParseFilter. The only way I have seen for this to work is to use the
HTMLMetaTags.

Can I get what I need from Content? Please help, thanks.


Re: Get meta name="description" and other meta tags from Content

Posted by Andrzej Bialecki <ab...@getopt.org>.
Yakn wrote:
> I am using the SegmentReader, iterating over content. I have the Content,
> ParseData, and ParseText objects, and I am looking for a way to get access
> to the meta tags in the header of the HTML in my Content object. Is there
> any way to get access to these meta tags? I do not want to have to use the
> HtmlParseFilter. The only way I have seen for this to work is to use the
> HTMLMetaTags.

That's the whole purpose of HtmlParseFilters. Nutch doesn't store all
meta tags in ParseData: it would take too much space, and in the general
case it's not that useful, because we already handle all the critical
information (robots directives, redirects). If you want to use meta tags
in any other way, you should implement a simple HtmlParseFilter that
puts all meta tags into ParseData.
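
As a rough illustration of the core of such a filter, here is a minimal
sketch that walks a DOM tree and collects every <meta name="..."
content="..."> pair into a map. It uses only org.w3c.dom, so it is not
tied to a particular Nutch version. The class and method names
(MetaTagCollector, collect) are hypothetical. The assumption is that an
HtmlParseFilter implementation would call collect() on the DOM fragment
it is handed and copy the resulting entries into the parse metadata;
that wiring is left out here because the filter signature differs
between Nutch versions.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

import org.w3c.dom.Element;
import org.w3c.dom.Node;

/** Hypothetical helper: collects meta name/content pairs from a DOM tree. */
final class MetaTagCollector {

    private MetaTagCollector() {
    }

    /** Returns meta tags found under root, keyed by lower-cased name. */
    static Map<String, String> collect(Node root) {
        Map<String, String> tags = new LinkedHashMap<>();
        walk(root, tags);
        return tags;
    }

    private static void walk(Node node, Map<String, String> tags) {
        // HTML parsers may report element names in upper case,
        // so the comparison ignores case.
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "meta".equalsIgnoreCase(node.getNodeName())) {
            Element meta = (Element) node;
            // getAttribute returns "" when the attribute is absent.
            String name = meta.getAttribute("name");
            if (!name.isEmpty()) {
                tags.put(name.toLowerCase(Locale.ROOT),
                        meta.getAttribute("content"));
            }
        }
        for (Node child = node.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            walk(child, tags);
        }
    }
}
```

Meta tags without a name attribute (for example http-equiv ones) are
skipped in this sketch, since Nutch already handles the critical ones.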
> 
> Can I get what I need from Content? Please help, thanks.

If you parse it again, then yes. Otherwise you need to use an HtmlParseFilter.
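
To make "parse it again" concrete, here is a minimal sketch that
re-parses raw HTML and pulls out the meta tags. It deliberately uses the
JDK's built-in HTML parser (javax.swing.text.html.parser.ParserDelegator)
rather than Nutch's own parser, so it is a stand-in for the idea and not
what Nutch does internally. The names MetaTagReparse and extractMetaTags
are hypothetical, and the sketch assumes the raw bytes held by the
Content object have already been decoded into a String with the right
charset.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

/** Hypothetical helper: re-parses HTML text to recover its meta tags. */
final class MetaTagReparse {

    private MetaTagReparse() {
    }

    /** Returns meta name/content pairs, keyed by lower-cased name. */
    static Map<String, String> extractMetaTags(String html) {
        final Map<String, String> tags = new LinkedHashMap<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // META is an empty element, so it is reported as a simple tag.
                if (tag == HTML.Tag.META) {
                    Object name = attrs.getAttribute(HTML.Attribute.NAME);
                    Object content = attrs.getAttribute(HTML.Attribute.CONTENT);
                    if (name != null && content != null) {
                        tags.put(name.toString().toLowerCase(Locale.ROOT),
                                content.toString());
                    }
                }
            }
        };
        try {
            // The last argument tells the parser to ignore charset directives,
            // since the input is already decoded text.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tags;
    }
}
```

Re-parsing every page this way costs a second pass over the content,
which is why capturing the tags at parse time is usually the cheaper
option.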


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com