Posted to user@nutch.apache.org by postusenet <po...@gmail.com> on 2009/07/04 19:26:04 UTC

How to get lastModified or create-date content from html pages?

Hi

I am trying to use the creation date or the modification time, i.e. the
lastModified value, from HTML pages.

I found these similar postings, but they were barely helpful:
http://www.mail-archive.com/nutch-user@lucene.apache.org/msg12884.html (2009)
http://www.mail-archive.com/nutch-user@lucene.apache.org/msg09542.html (2007)
http://www.mail-archive.com/nutch-user@lucene.apache.org/msg07300.html (2007)
http://www.mail-archive.com/nutch-user@lucene.apache.org/msg09548.html (2007)
http://www.mail-archive.com/nutch-user@lucene.apache.org/msg08668.html (2007)
http://www.mail-archive.com/nutch-user@lucene.apache.org/msg01956.html (2005)

I am running Nutch 1.0 as an intranet crawl. Even with index-more and
query-more enabled (in nutch-site.xml), lastModified is 0 everywhere, i.e.
the modified time is shown as 01:00:00 CET 1970.

So I wonder why Nutch 1.0 does not use the date or Last-modified meta tags
from the following sample header:

http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<head>
  <meta http-equiv="Pragma" content="no-cache" />
  <meta http-equiv="Expires" content="-1" />
  <meta http-equiv="Last-modified" content="Sun, 3 May 2009 21:23:00 GMT" />
  <meta name="date" content="2009-05-03" />
  <title>some title</title>
...
  <meta name="title" content="our organisation" />
  <meta name="language" content="de" />
  <meta name="subject" content="our topics" />
...
</script></head>
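
To make clear what I would expect to be extracted, here is a minimal sketch
in Python. It is not Nutch code and says nothing about how Nutch parses
pages; the names MetaDateParser and extract_meta_dates are made up for
illustration. It assumes the Last-modified value is an RFC 1123 style date
and the "date" value is in YYYY-MM-DD form, as in the sample above.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from html.parser import HTMLParser

# Shortened copy of the sample header from this post.
SAMPLE = """
<head>
  <meta http-equiv="Pragma" content="no-cache" />
  <meta http-equiv="Last-modified" content="Sun, 3 May 2009 21:23:00 GMT" />
  <meta name="date" content="2009-05-03" />
  <title>some title</title>
</head>
"""


class MetaDateParser(HTMLParser):
    """Hypothetical helper: collects the two date-like meta tags."""

    def __init__(self):
        super().__init__()
        self.last_modified = None  # timezone-aware datetime, if found
        self.date = None           # plain date, if found

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names; self-closing
        # tags such as <meta ... /> also end up here by default.
        if tag != "meta":
            return
        attributes = dict(attrs)
        content = attributes.get("content")
        if content is None:
            return
        if (attributes.get("http-equiv") or "").lower() == "last-modified":
            parsed = parsedate_to_datetime(content)
            if parsed.tzinfo is None:
                # Assumption: treat a missing zone as UTC.
                parsed = parsed.replace(tzinfo=timezone.utc)
            self.last_modified = parsed
        elif (attributes.get("name") or "").lower() == "date":
            self.date = datetime.strptime(content, "%Y-%m-%d").date()


def extract_meta_dates(html):
    """Return (last_modified, date) found in the meta tags of html."""
    parser = MetaDateParser()
    parser.feed(html)
    parser.close()
    return parser.last_modified, parser.date


if __name__ == "__main__":
    last_modified, date = extract_meta_dates(SAMPLE)
    print(last_modified, date)
```

For the sample header I would expect a modification time in May 2009
rather than the 1970 epoch value that I currently see.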

Any help is greatly appreciated.

Thanks, MnT