Posted to dev@nutch.apache.org by Markus Jelsma <ma...@openindex.io> on 2011/04/19 10:43:06 UTC

How could solrdedup work at all?

Hi,

I did a full crawl and ran solrdedup with Nutch 1.3. The dedup fails miserably.
The following exception is thrown in Solr:

Apr 19, 2011 10:34:50 AM org.apache.solr.request.BinaryResponseWriter$Resolver getDoc
WARNING: Error reading a field from document : SolrDocument[{digest=7ff92a31c58e43a34fd45bc6d87cda03}]
java.lang.NumberFormatException: For input string: "2011-04-19T08:16:31.675Z"
        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
        at java.lang.Long.parseLong(Long.java:419)
        at java.lang.Long.valueOf(Long.java:525)
        at org.apache.solr.schema.LongField.toObject(LongField.java:82)
....

Strangely enough, Solr seems to allow updates of long fields with a formatted
date. In Nutch 1.2 the tstamp field is actually a long, but in 1.3 the field is
a valid Solr date format. This exception is only triggered when using the
javabin response writer, so there's something weird in Solr too.

We need to either change the tstamp field back to a long or update the Solr 
example schema and fix SolrDeleteDuplicates to use the formatted date instead 
of the long.
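
To illustrate the mismatch, here is a minimal standalone sketch. It is not the
actual SolrDeleteDuplicates code. The class TstampParse and the method
toEpochMillis are hypothetical names, and java.time is assumed to be available
(Java 8 or later). It reproduces the NumberFormatException from the trace above
and shows one way a reader of tstamp could accept both the old long value and
the new formatted date:

```java
import java.time.Instant;

public class TstampParse {

    /**
     * Hypothetical helper: accepts either a numeric tstamp (as in Nutch 1.2)
     * or an ISO-8601 UTC date string (as in Nutch 1.3) and returns the
     * timestamp as milliseconds since the epoch.
     */
    public static long toEpochMillis(String tstamp) {
        try {
            // Old style: the field value is a plain long.
            return Long.parseLong(tstamp);
        } catch (NumberFormatException e) {
            // New style: the field value is a formatted date.
            return Instant.parse(tstamp).toEpochMilli();
        }
    }

    public static void main(String[] args) {
        String formatted = "2011-04-19T08:16:31.675Z";

        // Same failure as in the Solr log: a formatted date is not a long.
        try {
            Long.valueOf(formatted);
        } catch (NumberFormatException e) {
            System.out.println("Long.valueOf failed: " + e.getMessage());
        }

        // Parsing it as a date works and yields a comparable long.
        System.out.println("epoch millis: " + toEpochMillis(formatted));
    }
}
```

Whichever option is chosen, the point is that the field type in the Solr schema
and the type the dedup code expects have to agree.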

Cheers,

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350