Posted to user@nutch.apache.org by Tom Chiverton <tc...@extravision.com> on 2016/10/18 14:51:38 UTC
Date missing from Solr, even though in HTTP last-modified
I have "index-(basic|anchor|more|metadata)" and
"parse-(html|tika|metatags)" included in plugin.includes, but despite:
# bin/nutch parsechecker https:/..... |grep -i date
Date : Tue, 18 Oct 2016 14:37:40 GMT
The 'date' field in Solr for the document is wrong:
"date": "1970-01-01T00:00:00Z",
Why is this? Also, as I think 'date' is being inferred from the
'last-modified' header, I'd like it to go in 'lastModified' too.
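For comparison, here is a minimal Python sketch of what the header value above would look like in Solr's date format, next to what a zero timestamp looks like. The function name http_date_to_solr is made up for illustration and is not part of Nutch or Solr. That the 1970 value comes from a missing or zero timestamp is an assumption based on the value itself, not something confirmed from the Nutch code.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def http_date_to_solr(value):
    """Convert an HTTP date header value to a UTC string in Solr's date format.

    Hypothetical helper for illustration only; not a Nutch or Solr API.
    """
    parsed = parsedate_to_datetime(value)
    return parsed.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


# The value reported by parsechecker in the message above.
print(http_date_to_solr("Tue, 18 Oct 2016 14:37:40 GMT"))

# A timestamp of zero (the Unix epoch) rendered the same way. This matches
# the value seen in Solr, which hints (assumption) that no real date was set.
epoch_zero = datetime.fromtimestamp(0, tz=timezone.utc)
print(epoch_zero.strftime("%Y-%m-%dT%H:%M:%SZ"))
```

If the header had been picked up for this document, the first form is roughly what I would expect to see in the 'date' field.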
I saw some reference to adding this to solrindex-mapping.xml:
<field dest="lastModified" source="date"/>
but this dies during IndexingJob with:
Caused by:
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
ERROR: [doc=com.abloz:http/hbase/book.html] multiple values encountered
for non multiValued field lastModified: [Tue Jun 16 10:55:02 UTC 2015,
Tue Jun 16 10:55:02 UTC 2015]
which makes no sense. There aren't two last-modified HTTP headers? It
does at least confirm the value is going in...
The Solr schema is correct, I think (there's no real-world reason for
lastModified to be multi-valued!):
<field name="lastModified" type="date" stored="true" indexed="false"/>
--
*Tom Chiverton*
Lead Developer
Re: Date missing from Solr, even though in HTTP last-modified
Posted by Tom Chiverton <tc...@extravision.com>.
This turned out to be user error - not all pages in the site output a
last-modified, and those that did hadn't been indexed.
Tom
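
For anyone hitting the same thing, a quick way to see which pages actually send the header is to issue a HEAD request per URL. This is a minimal sketch using only the Python standard library. The function names are made up for illustration, and it assumes the server answers HEAD requests with the same headers it would send for GET, which not every server does.

```python
import urllib.request


def last_modified_of(headers):
    """Return the Last-Modified value from a header mapping, or None.

    Header names are compared case-insensitively.
    """
    for name, value in headers.items():
        if name.lower() == "last-modified":
            return value
    return None


def fetch_headers(url, timeout=10):
    """Fetch response headers for url with a HEAD request (needs network)."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return dict(response.headers.items())


def report(urls):
    """Print, for each URL, its Last-Modified value or a note that it is absent."""
    for url in urls:
        value = last_modified_of(fetch_headers(url))
        print(url, "->", value if value else "no Last-Modified header")
```

Calling report() with a list of the site's URLs shows which pages omit the header; those are the ones where I would not expect a usable date in Solr.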