Posted to user@nutch.apache.org by Tom Chiverton <tc...@extravision.com> on 2016/10/18 14:51:38 UTC

Date missing from Solr, even though in HTTP last-modified

I have "index-(basic|anchor|more|metadata)" and 
"parse-(html|tika|metatags)" included in plugin.includes, but despite:


# bin/nutch parsechecker https:/..... |grep -i date
Date :  Tue, 18 Oct 2016 14:37:40 GMT


The 'date' field in Solr for the document is wrong:

     "date": "1970-01-01T00:00:00Z",
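
For reference, 1970-01-01T00:00:00Z is how a timestamp of zero (the Unix epoch) renders in Solr's date format, which hints that the indexed time was unset rather than mis-parsed. A minimal Python sketch of the two conversions; the helper names are made up for illustration and this is not Nutch code:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Solr's canonical UTC date layout, e.g. 2016-10-18T14:37:40Z
SOLR_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def http_date_to_solr(value):
    """Convert an HTTP date header value to a Solr-style UTC string."""
    parsed = parsedate_to_datetime(value)
    return parsed.astimezone(timezone.utc).strftime(SOLR_FORMAT)


def epoch_millis_to_solr(millis):
    """Render milliseconds since the Unix epoch as a Solr-style UTC string."""
    moment = datetime.fromtimestamp(millis / 1000, tz=timezone.utc)
    return moment.strftime(SOLR_FORMAT)


# The header value reported by parsechecker above.
print(http_date_to_solr("Tue, 18 Oct 2016 14:37:40 GMT"))  # 2016-10-18T14:37:40Z
# A zero timestamp gives exactly the value seen in Solr.
print(epoch_millis_to_solr(0))  # 1970-01-01T00:00:00Z
```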


Why is this? Also, since I think 'date' is derived from the 
'Last-Modified' header, I'd like it to go into 'lastModified' too.

I saw a reference to adding this to solrindex-mapping.xml:
     <field dest="lastModified" source="date"/>
but with that in place IndexingJob dies with:
     Caused by: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
     ERROR: [doc=com.abloz:http/hbase/book.html] multiple values encountered
     for non multiValued field lastModified: [Tue Jun 16 10:55:02 UTC 2015,
     Tue Jun 16 10:55:02 UTC 2015]

which makes no sense: there aren't two Last-Modified HTTP headers. It 
does at least confirm that a value is going in.
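
One possible explanation, offered as an assumption and not confirmed anywhere in this thread: if the document already carries its own 'lastModified' value (for example from the index-more plugin) and the mapping also renames 'date' to 'lastModified', both values end up under the same field name. The toy Python model below only illustrates that collision; `apply_mapping` is a hypothetical helper, not Nutch's mapping code:

```python
def apply_mapping(doc, mapping):
    """Rename fields per `mapping`, appending values when names collide."""
    out = {}
    for field, values in doc.items():
        dest = mapping.get(field, field)  # unmapped fields keep their name
        out.setdefault(dest, []).extend(values)
    return out


# Assumed input: the document already has both fields, one value each.
doc = {
    "date": ["2015-06-16T10:55:02Z"],
    "lastModified": ["2015-06-16T10:55:02Z"],
}

mapped = apply_mapping(doc, {"date": "lastModified"})
print(mapped)
# 'lastModified' now holds two identical values, which a field that is
# not multiValued would reject.
```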

The Solr schema is correct, I think (there's no real-world reason for 
lastModified to be multivalued!):
      <field name="lastModified" type="date" stored="true" indexed="false"/>


-- 
*Tom Chiverton*
Lead Developer


Re: Date missing from Solr, even though in HTTP last-modified

Posted by Tom Chiverton <tc...@extravision.com>.
This turned out to be user error: not all pages on the site send a 
Last-Modified header, and those that did hadn't been indexed.
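
A check along these lines would have caught it earlier. This is a rough Python sketch, not part of Nutch; `fetch_headers` and `missing_last_modified` are hypothetical helper names, and the URL in the usage note is a placeholder:

```python
from email.message import Message
from urllib.request import Request, urlopen


def fetch_headers(url, timeout=10):
    """Issue a HEAD request and return the response headers."""
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=timeout) as response:
        return response.headers


def missing_last_modified(headers_by_url):
    """Return the URLs whose headers carry no Last-Modified value."""
    return [
        url
        for url, headers in headers_by_url.items()
        if not headers.get("Last-Modified")
    ]
```

Usage would be something like `missing_last_modified({u: fetch_headers(u) for u in urls})`, where `urls` is a list such as `["https://example.com/"]`; any URL it returns will not get a header-derived date.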

Tom