Posted to dev@nutch.apache.org by "Markus Jelsma (JIRA)" <ji...@apache.org> on 2011/08/30 14:51:37 UTC

[jira] [Created] (NUTCH-1100) SolrDedup broken

SolrDedup broken
----------------

                 Key: NUTCH-1100
                 URL: https://issues.apache.org/jira/browse/NUTCH-1100
             Project: Nutch
          Issue Type: Bug
          Components: indexer
    Affects Versions: 1.4
            Reporter: Markus Jelsma
             Fix For: 1.4


Nutch is unable to deduplicate some Solr indices. For unknown reasons Nutch throws the exception below. Nothing unusual shows up in the Solr logs; the queries are normal and appear to succeed.

{code}
java.lang.NullPointerException
        at org.apache.hadoop.io.Text.encode(Text.java:388)
        at org.apache.hadoop.io.Text.set(Text.java:178)
        at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:272)
        at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:243)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
{code}

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (NUTCH-1100) SolrDedup broken

Posted by "Ashish Shrowty (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13280357#comment-13280357 ] 

Ashish Shrowty commented on NUTCH-1100:
---------------------------------------

Were you able to resolve this issue? I am consistently getting this error.
                

[jira] [Commented] (NUTCH-1100) SolrDedup broken

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13094439#comment-13094439 ] 

Markus Jelsma commented on NUTCH-1100:
--------------------------------------

The above exception can appear out of thin air; I've seen it happen time and time again. Just now I saw a long-running test cycle magically repair itself. The dedup job failed for the first time weeks ago and continued to fail at each cycle until just now.

I still have no idea how to reproduce this behaviour consistently.


[jira] [Updated] (NUTCH-1100) SolrDedup broken

Posted by "Markus Jelsma (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1100:
---------------------------------

    Fix Version/s:     (was: 1.5)
                   1.6

20120304-push-1.6
                

[jira] [Updated] (NUTCH-1100) SolrDedup broken

Posted by "Markus Jelsma (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1100:
---------------------------------

    Fix Version/s:     (was: 1.4)
                   1.5

Cannot really reproduce. Marking as 1.5 in case it pops up again.
                

[jira] [Commented] (NUTCH-1100) SolrDedup broken

Posted by "lufeng (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13437843#comment-13437843 ] 

lufeng commented on NUTCH-1100:
-------------------------------

Maybe it is a configuration problem. Did you change the mapping field
<field dest="digest" source="digest"/>
in solrindex-mapping.xml? If you change the dest name of that field, Solr will not find the digest field.
                

[jira] [Commented] (NUTCH-1100) SolrDedup broken

Posted by "Luca Cavanna (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13445930#comment-13445930 ] 

Luca Cavanna commented on NUTCH-1100:
-------------------------------------

I agree; it would make even more sense to filter the query like this: digest:[* TO *].
That way Nutch wouldn't even iterate over documents that don't have a value for the digest field.
Unfortunately this problem is pretty common: it happens whenever Solr holds documents that don't come from Nutch alongside the crawled documents.
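
To make the suggested filter concrete, here is a small Java sketch that builds the open-ended range query for a given field and URL-encodes it as a q parameter. The class and method names are hypothetical, and the sketch only builds strings; it does not talk to Solr and does not reflect how SolrDeleteDuplicates actually issues its query.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class HasValueQuerySketch {

    // Builds an open-ended range query that matches documents having
    // any value in the given field, e.g. digest:[* TO *].
    public static String hasValueQuery(String field) {
        return field + ":[* TO *]";
    }

    // URL-encodes the query for use as the q parameter of a select request.
    // Requires Java 10+ for the Charset overload of URLEncoder.encode.
    public static String asQueryParam(String query) {
        return "q=" + URLEncoder.encode(query, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String q = hasValueQuery("digest");
        System.out.println(q);               // digest:[* TO *]
        System.out.println(asQueryParam(q)); // q=digest%3A%5B*+TO+*%5D
    }
}
```

A range query of this shape can only match on a field that is indexed, so whether it is usable depends on how digest is declared in the Solr schema.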
                

[jira] [Commented] (NUTCH-1100) SolrDedup broken

Posted by "Luca Cavanna (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13446002#comment-13446002 ] 

Luca Cavanna commented on NUTCH-1100:
-------------------------------------

The problem with the approach I mentioned before is that the digest field would need to be indexed in the Solr schema; otherwise that query would always return 0 results.

                

[jira] [Updated] (NUTCH-1100) SolrDedup broken

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1100:
---------------------------------

    Attachment: NUTCH-1100-1.6-1.patch

I finally got around to this again, and it is indeed a problem with the digest field not being there. Here's a patch that checks for null and skips the document.

Please check if this solves your problems.
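
The attached patch itself is not reproduced in this thread. Purely as an illustration of the null-check-and-skip idea, here is a minimal Java sketch. It uses a plain Map as a stand-in for a Solr document so that it runs without SolrJ or Hadoop; the class and method names are hypothetical and are not taken from SolrDeleteDuplicates.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class DigestNullCheckSketch {

    // Assumption: a document is modelled as field name -> value.
    // The real job reads Solr documents; this Map is only a stand-in.
    public static List<Map<String, Object>> skipDocsWithoutDigest(
            List<Map<String, Object>> docs) {
        List<Map<String, Object>> kept = new ArrayList<>();
        for (Map<String, Object> doc : docs) {
            if (doc.get("digest") == null) {
                // No digest value: skip the document instead of
                // passing null on to later processing.
                continue;
            }
            kept.add(doc);
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Object> crawled = new HashMap<>();
        crawled.put("id", "http://example.org/");
        crawled.put("digest", "abc123");

        Map<String, Object> foreign = new HashMap<>();
        foreign.put("id", "imported-1"); // indexed outside Nutch, no digest

        List<Map<String, Object>> docs = new ArrayList<>();
        docs.add(crawled);
        docs.add(foreign);

        // Only the crawled document survives the check.
        System.out.println(skipDocsWithoutDigest(docs).size());
    }
}
```

Skipping keeps the job running, but documents without a digest are then simply left out of deduplication.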
                

[jira] [Commented] (NUTCH-1100) SolrDedup broken

Posted by "Hernan (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13425207#comment-13425207 ] 

Hernan commented on NUTCH-1100:
-------------------------------

These fields are required:

SolrConstants.ID_FIELD        ("id")
SolrConstants.BOOST_FIELD     ("boost")
SolrConstants.TIMESTAMP_FIELD ("tstamp")
SolrConstants.DIGEST_FIELD    ("digest")


If you indexed documents into Solr from outside Nutch, for example with DataImportHandler, you should make sure these fields are set in one of these ways:

a) Add the fields when you index your documents.

b) To copy them from other fields, add the following to schema-solr4.xml:
  <copyField source="yourfield1" dest="boost"/>
  <copyField source="yourfield2" dest="tstamp"/>
  <copyField source="yourfield3" dest="digest"/>

c) Modify the SolrDeleteDuplicates source along the lines of the attached patch, but for all of the fields (boost, tstamp, digest); the id field should already be set.

d) Change SOLR_GET_ALL_QUERY so that it selects only the records generated by Nutch. (This might be a good generic change.)
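
Option c) amounts to checking every required field, not only digest, before a document is processed. A minimal Java sketch of such a check follows, assuming a plain Map as a stand-in for a Solr document; the class and method names are hypothetical, and only the four field names come from the list above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class RequiredFieldsSketch {

    // Field names as listed in the comment above.
    static final List<String> REQUIRED =
            Arrays.asList("id", "boost", "tstamp", "digest");

    // Returns the required fields that are absent or null in the document,
    // in the order of REQUIRED. An empty list means the document is usable.
    public static List<String> missingFields(Map<String, Object> doc) {
        List<String> missing = new ArrayList<>();
        for (String field : REQUIRED) {
            if (doc.get(field) == null) {
                missing.add(field);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        Map<String, Object> foreign = new HashMap<>();
        foreign.put("id", "imported-1"); // e.g. indexed outside Nutch

        System.out.println(missingFields(foreign)); // [boost, tstamp, digest]
    }
}
```

A caller could skip any document for which the returned list is non-empty, or log the missing field names to find out which indexing path produced it.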

Sorry for my lousy English.
                