Posted to dev@nutch.apache.org by "Hernan (JIRA)" <ji...@apache.org> on 2012/07/30 22:39:36 UTC

[jira] [Commented] (NUTCH-1100) SolrDedup broken

    [ https://issues.apache.org/jira/browse/NUTCH-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13425207#comment-13425207 ] 

Hernan commented on NUTCH-1100:
-------------------------------

These fields are required:

SolrConstants.ID_FIELD        ("id")
SolrConstants.BOOST_FIELD     ("boost")
SolrConstants.TIMESTAMP_FIELD ("tstamp")
SolrConstants.DIGEST_FIELD    ("digest")


If you indexed into Solr outside of Nutch, for example with DataImportHandler, you should set these fields in one of the following ways:

a) Add the fields when you index your documents

b) To copy the values from other fields, add the following to schema-solr4.xml:
  <copyField source="yourfield1" dest="boost"/>
  <copyField source="yourfield2" dest="tstamp"/>
  <copyField source="yourfield3" dest="digest"/>

c) Modify the SolrDeleteDuplicates source along the lines of the attached patch, but for all of the fields (boost, tstamp, digest); the id field must already be set.

d) Change SOLR_GET_ALL_QUERY to select only the records generated by Nutch. (This might be a good generic change.)
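
To illustrate options c) and d), here is a minimal sketch, not the attached patch and not the actual SolrDeleteDuplicates code. It assumes the four field names listed above and uses a plain Map as a stand-in for a Solr document. The class and method names (RequiredFieldsSketch, missingFields, requiredFieldsQuery) are hypothetical. The query uses the standard Solr range syntax field:[* TO *], which matches documents that have any value in that field. Whether filtering this way is the right way to select "records generated by Nutch" for a given index is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RequiredFieldsSketch {
    // Field names as listed in the comment (the values of the SolrConstants entries).
    public static final List<String> REQUIRED_FIELDS =
        Arrays.asList("id", "boost", "tstamp", "digest");

    // Returns the required fields that are absent or null in a document.
    // The Map is a hypothetical stand-in for a Solr document.
    public static List<String> missingFields(Map<String, Object> doc) {
        List<String> missing = new ArrayList<>();
        for (String field : REQUIRED_FIELDS) {
            if (doc.get(field) == null) {
                missing.add(field);
            }
        }
        return missing;
    }

    // Builds a query that matches only documents with a value in every required field.
    public static String requiredFieldsQuery() {
        StringBuilder query = new StringBuilder();
        for (String field : REQUIRED_FIELDS) {
            if (query.length() > 0) {
                query.append(" AND ");
            }
            query.append(field).append(":[* TO *]");
        }
        return query.toString();
    }

    public static void main(String[] args) {
        // A document indexed outside of Nutch that lacks tstamp and digest.
        Map<String, Object> doc = new HashMap<>();
        doc.put("id", "http://example.org/");
        doc.put("boost", 1.0f);

        System.out.println(missingFields(doc));
        System.out.println(requiredFieldsQuery());
    }
}
```

A document for which missingFields returns a non-empty list is the kind of record that would hand a null value to the dedup job, so it would have to be either skipped (option c) or excluded by the query (option d).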

Sorry for my lousy English.
                
> SolrDedup broken
> ----------------
>
>                 Key: NUTCH-1100
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1100
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.4
>            Reporter: Markus Jelsma
>             Fix For: 1.6
>
>         Attachments: NUTCH-1100-1.6-1.patch
>
>
> Some Solr indices are unable to be deduped from Nutch. For unknown reasons Nutch will throw the exception below. There are no peculiarities to be found in the Solr logs, the queries are normal and seem to succeed.
> {code}
> java.lang.NullPointerException
>         at org.apache.hadoop.io.Text.encode(Text.java:388)
>         at org.apache.hadoop.io.Text.set(Text.java:178)
>         at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:272)
>         at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:243)
>         at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
>         at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
>         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
>         at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
> {code}
