
Posted to user@nutch.apache.org by Paul Tomblin <pt...@xcski.com> on 2009/08/21 05:27:19 UTC

Keywords?

Is there a way to extract the keywords from an html page?  I can't
find it in ParseData or CrawlDatum anywhere.

-- 
http://www.linkedin.com/in/paultomblin

Re: Keywords?

Posted by Julien Nioche <li...@gmail.com>.

Paul,

You don't have to reimplement all of HtmlParser; just write an
HtmlParseFilter, which is much simpler. Otherwise you can of course
modify HtmlParser directly so that it does what you need.

J.


2009/8/21 Paul Tomblin <pt...@xcski.com>

> On Fri, Aug 21, 2009 at 4:20 AM, Julien
> Nioche<li...@gmail.com> wrote:
> > You'll need to write a custom parser implementing HtmlParseFilter and get it
> > to store the keywords found in the Metadata, then write a custom Indexer.
> >
> > By default the HTML parser does not do anything about meta tags.
>
> That's unfortunate, because org.apache.nutch.parse.html.HtmlParser
> actually extracts all the meta tags, and then takes a few and throws
> the rest away.  It's mildly annoying that I'm going to have to
> re-implement all of HtmlParser just to add two lines to take that data
> out of "metaTags" and put it in "content.getMetaData()".
>
> --
> http://www.linkedin.com/in/paultomblin
>



-- 
DigitalPebble Ltd
http://www.digitalpebble.com

Re: Keywords?

Posted by Paul Tomblin <pt...@xcski.com>.

On Fri, Aug 21, 2009 at 4:20 AM, Julien
Nioche<li...@gmail.com> wrote:
> You'll need to write a custom parser implementing HtmlParseFilter and get it
> to store the keywords found in the Metadata, then write a custom Indexer.
>
> By default the HTML parser does not do anything about meta tags.

That's unfortunate, because org.apache.nutch.parse.html.HtmlParser
actually extracts all the meta tags, and then takes a few and throws
the rest away.  It's mildly annoying that I'm going to have to
re-implement all of HtmlParser just to add two lines to take that data
out of "metaTags" and put it in "content.getMetaData()".

-- 
http://www.linkedin.com/in/paultomblin

Re: Keywords?

Posted by Julien Nioche <li...@gmail.com>.

Hi Paul,

You'll need to write a custom parser implementing HtmlParseFilter and get it
to store the keywords found in the Metadata, then write a custom Indexer.

By default the HTML parser does not do anything about meta tags.
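
As a rough sketch of the extraction step such a filter would perform, the
code below walks a DOM tree and collects the values of any
<meta name="keywords"> tag. It uses only the JDK's org.w3c.dom classes, not
the Nutch API: the class name KeywordExtractor and its methods are
hypothetical, and the sample page is made up for illustration. In a real
plugin the DOM would presumably be the one handed to the filter rather than
one parsed from a string, and the result would be stored in the parse
metadata instead of being printed.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch: finds <meta name="keywords"> in a DOM.
public class KeywordExtractor {

  /** Returns every comma-separated keyword found below the given node. */
  public static List<String> extractKeywords(Node root) {
    List<String> keywords = new ArrayList<>();
    collect(root, keywords);
    return keywords;
  }

  // Depth-first walk; tag and attribute names are matched case-insensitively.
  private static void collect(Node node, List<String> out) {
    if (node.getNodeType() == Node.ELEMENT_NODE
        && "meta".equalsIgnoreCase(node.getNodeName())) {
      String name = attr(node, "name");
      String content = attr(node, "content");
      if ("keywords".equalsIgnoreCase(name) && content != null) {
        for (String keyword : content.split(",")) {
          String trimmed = keyword.trim();
          if (!trimmed.isEmpty()) {
            out.add(trimmed);
          }
        }
      }
    }
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      collect(children.item(i), out);
    }
  }

  // Looks up an attribute value, ignoring the case of the attribute name.
  private static String attr(Node node, String attrName) {
    NamedNodeMap attrs = node.getAttributes();
    if (attrs == null) {
      return null;
    }
    for (int i = 0; i < attrs.getLength(); i++) {
      Node a = attrs.item(i);
      if (attrName.equalsIgnoreCase(a.getNodeName())) {
        return a.getNodeValue();
      }
    }
    return null;
  }

  /** Test helper: parses a well-formed XHTML string into a DOM document. */
  public static Document parse(String xhtml) throws Exception {
    DocumentBuilder builder =
        DocumentBuilderFactory.newInstance().newDocumentBuilder();
    return builder.parse(new InputSource(new StringReader(xhtml)));
  }

  public static void main(String[] args) throws Exception {
    String page = "<html><head>"
        + "<meta name=\"keywords\" content=\"nutch, crawler, search\"/>"
        + "</head><body/></html>";
    System.out.println(extractKeywords(parse(page)));
    // prints [nutch, crawler, search]
  }
}
```

The walk is kept separate from the parsing so that the same method could be
pointed at whatever DOM node the filter receives; only the parse() helper
assumes well-formed XHTML input.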

J.

-- 
DigitalPebble Ltd
http://www.digitalpebble.com

2009/8/21 Paul Tomblin <pt...@xcski.com>

> Is there a way to extract the keywords from an html page?  I can't
> find it in ParseData or CrawlDatum anywhere.
>
> --
> http://www.linkedin.com/in/paultomblin
>