Posted to user@nutch.apache.org by Joshua J Pavel <jp...@us.ibm.com> on 2011/01/31 21:52:54 UTC

Another question from a meta tag newbie


I've been crawling the user groups, and I feel like Nutch can do this by
default, but I just can't seem to crack it.

I want to grab meta tags from indexed pages and insert them in the
database.  Specifically, I'll have some meta tags that identify the type of
content on the page, so that I can group results as either video, photo,
news, etc.

I looked into 655 and 855, but I believe those are for adding metadata, not
utilizing metadata already in the page.

What I expect is that when I do a dump, the fields will be visible under
Metadata:

http://test.site.com/index.html   Version: 7
Status: 2 (db_fetched)
Fetch time: Wed Mar 02 20:22:33 UTC 2011
Modified time: Thu Jan 01 00:00:00 UTC 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.013783041
Signature: 26df10bef4cf4cebe3f1041ba121068d
Metadata: _pst_: success(1), lastModified=0, MYFIELD=MYVALUE

I think Nutch-779 may be what I need, and as I'm running version 1.2, I
should have this capability.  I'm filling in db.parsemeta.to.crawldb, but
is there something else I need to do?  Or is it populating it, and dumping
the database doesn't show me those values?
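
For reference, a minimal sketch of the kind of entry in question. It assumes
the property is set in conf/nutch-site.xml and that the parse metadata key is
named MYFIELD, the placeholder from the dump above. Both are assumptions for
illustration, not my exact configuration.

```xml
<!-- Hypothetical override in conf/nutch-site.xml.
     MYFIELD is a placeholder key, not a real field name. -->
<property>
  <name>db.parsemeta.to.crawldb</name>
  <value>MYFIELD</value>
</property>
```

As I understand it, the value is a comma-separated list of parse metadata keys,
so the key would have to be present in the parse metadata before it could be
copied to the crawldb.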