Posted to user@nutch.apache.org by Jigal van Hemert | alterNET internet BV <ji...@alternet.nl> on 2015/03/10 09:30:45 UTC

Nutch documents have huge scores in Solr

Hi,

We use Nutch to index "external" sites along with a TYPO3 extension that
sends the page content from the CMS to the same Solr server. The author of
that extension has also made a configuration for Nutch with a few extra
plugins which add some extra fields to make the data compatible with the
documents that come from the CMS.

https://github.com/dkd/nutch-typo3-cms

This combination has worked fine for quite a few installations, but in one
installation the Nutch documents always end up with the highest scores. A
few days ago someone else told me he has the same problem. His websites are
completely different; only the TYPO3 CMS extension and the Nutch
configuration are almost identical (apart from the site-specific settings).

To rule out boosting, query fields and other settings, I ran some simple
queries in the Solr 4.8.1 admin interface, supplying just a single search
word. Below are some of the results for the word "afval" (Dutch for
"garbage").
What is remarkable is the huge difference in the fieldNorm values, which
seems to be the cause of the extreme differences in scores: CMS content
scored between 0 and 2.3, while Nutch documents scored between 6000 and
150,000 (rough numbers).
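
As a sanity check on that reading, here is a minimal Python sketch that
recomputes the content scores from the factors in the debugQuery output
below. It assumes the classic TF-IDF explain layout, where weight =
queryWeight * fieldWeight, fieldWeight = tf * idf * fieldNorm and tf is the
square root of termFreq. The function name and the dictionary labels are
only for illustration.

```python
import math

# Factors copied from the debugQuery output for content:afval^40.0
IDF = 4.804688             # idf(docFreq=168, maxDocs=7590)
QUERY_NORM = 0.0052032513  # queryNorm
BOOST = 40.0               # query-time boost on the content field


def content_score(term_freq, field_norm):
    """Recompute weight(content:afval^40.0) from the explain factors."""
    query_weight = BOOST * IDF * QUERY_NORM  # roughly 1.0 for this query
    field_weight = math.sqrt(term_freq) * IDF * field_norm
    return query_weight * field_weight


# (termFreq, fieldNorm) for the four documents shown in the output
docs = {
    "nutch doc 6617": (1, 3072.0),
    "nutch doc 5877": (1, 1280.0),
    "cms doc 493": (2, 0.3125),
    "cms doc 3750": (7, 0.09375),
}

scores = {name: content_score(tf, norm) for name, (tf, norm) in docs.items()}
for name, score in scores.items():
    print(f"{name}: {score:.4f}")
```

Since queryWeight and idf are the same for all four documents and tf only
varies between 1 and about 2.6, the fieldNorm is the only factor large
enough to explain the gap.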

I learned that the plugin "scoring-opic" is used to add scores to the Nutch
documents. This seems to work fine in most cases.

Any pointers as to why this results in mega-scores would be very welcome.

Excerpt of the debugQuery output ("domain" is a placeholder for the actual
domain name):

[... lots of nutch records skipped ...]
<str name="c293c0a7c8d3311249d309c91f39e5e5b192b6c0/tx_nutch_external/
https://domain/Loket/prodcat/products/getProductDetailsAction.do?name=Asbestverwijdering+bedrijfsmatig">

14760.001 = (MATCH) sum of:
  14760.001 = (MATCH) max of:
    14760.001 = (MATCH) weight(content:afval^40.0 in 6617), product of:
      0.99999994 = queryWeight(content:afval^40.0), product of:
        40.0 = boost
        4.804688 = idf(docFreq=168, maxDocs=7590)
        0.0052032513 = queryNorm
      14760.002 = (MATCH) fieldWeight(content:afval in 6617), product of:
        1.0 = tf(termFreq(content:afval)=1)
        4.804688 = idf(docFreq=168, maxDocs=7590)
        3072.0 = fieldNorm(field=content, doc=6617)
</str><str name="c293c0a7c8d3311249d309c91f39e5e5b192b6c0/tx_nutch_external/
https://domain/Loket/knowledgebase/faqs/getFaqContentAction.do?id=725">
6150.0 = (MATCH) sum of:
  6150.0 = (MATCH) max of:
    6150.0 = (MATCH) weight(content:afval^40.0 in 5877), product of:
      0.99999994 = queryWeight(content:afval^40.0), product of:
        40.0 = boost
        4.804688 = idf(docFreq=168, maxDocs=7590)
        0.0052032513 = queryNorm
      6150.0005 = (MATCH) fieldWeight(content:afval in 5877), product of:
        1.0 = tf(termFreq(content:afval)=1)
        4.804688 = idf(docFreq=168, maxDocs=7590)
        1280.0 = fieldNorm(field=content, doc=5877)
</str><str
name="102b19e401862068820dd53b4a1beccb286f03a7/pages/27363/0/0/0">
2.1233919 = (MATCH) sum of:
  2.1233919 = (MATCH) max of:
    2.1233919 = (MATCH) weight(content:afval^40.0 in 493), product of:
      0.99999994 = queryWeight(content:afval^40.0), product of:
        40.0 = boost
        4.804688 = idf(docFreq=168, maxDocs=7590)
        0.0052032513 = queryNorm
      2.123392 = (MATCH) fieldWeight(content:afval in 493), product of:
        1.4142135 = tf(termFreq(content:afval)=2)
        4.804688 = idf(docFreq=168, maxDocs=7590)
        0.3125 = fieldNorm(field=content, doc=493)
    1.1733533 = (MATCH) weight(title:afval^5.0 in 493), product of:
      0.17471766 = queryWeight(title:afval^5.0), product of:
        5.0 = boost
        6.715711 = idf(docFreq=24, maxDocs=7590)
        0.0052032513 = queryNorm
      6.715711 = (MATCH) fieldWeight(title:afval in 493), product of:
        1.0 = tf(termFreq(title:afval)=1)
        6.715711 = idf(docFreq=24, maxDocs=7590)
        1.0 = fieldNorm(field=title, doc=493)
    1.500486 = (MATCH) weight(tagsH2H3:afval^3.0 in 493), product of:
      0.11628768 = queryWeight(tagsH2H3:afval^3.0), product of:
        3.0 = boost
        7.4496803 = idf(docFreq=11, maxDocs=7590)
        0.0052032513 = queryNorm
      12.903225 = (MATCH) fieldWeight(tagsH2H3:afval in 493), product of:
        1.7320508 = tf(termFreq(tagsH2H3:afval)=3)
        7.4496803 = idf(docFreq=11, maxDocs=7590)
        1.0 = fieldNorm(field=tagsH2H3, doc=493)
</str><str
name="102b19e401862068820dd53b4a1beccb286f03a7/pages/7844/0/0/0">
1.7667065 = (MATCH) sum of:
  1.7667065 = (MATCH) max of:
    1.1917508 = (MATCH) weight(content:afval^40.0 in 3750), product of:
      0.99999994 = queryWeight(content:afval^40.0), product of:
        40.0 = boost
        4.804688 = idf(docFreq=168, maxDocs=7590)
        0.0052032513 = queryNorm
      1.1917509 = (MATCH) fieldWeight(content:afval in 3750), product of:
        2.6457512 = tf(termFreq(content:afval)=7)
        4.804688 = idf(docFreq=168, maxDocs=7590)
        0.09375 = fieldNorm(field=content, doc=3750)
    1.1733533 = (MATCH) weight(title:afval^5.0 in 3750), product of:
      0.17471766 = queryWeight(title:afval^5.0), product of:
        5.0 = boost
        6.715711 = idf(docFreq=24, maxDocs=7590)
        0.0052032513 = queryNorm
      6.715711 = (MATCH) fieldWeight(title:afval in 3750), product of:
        1.0 = tf(termFreq(title:afval)=1)
        6.715711 = idf(docFreq=24, maxDocs=7590)
        1.0 = fieldNorm(field=title, doc=3750)
    1.7667065 = (MATCH) weight(keywords:afval^2.0 in 3750), product of:
      0.08663568 = queryWeight(keywords:afval^2.0), product of:
        2.0 = boost
        8.325149 = idf(docFreq=4, maxDocs=7590)
        0.0052032513 = queryNorm
      20.392366 = (MATCH) fieldWeight(keywords:afval in 3750), product of:
        2.4494898 = tf(termFreq(keywords:afval)=6)
        8.325149 = idf(docFreq=4, maxDocs=7590)
        1.0 = fieldNorm(field=keywords, doc=3750)
    1.500486 = (MATCH) weight(tagsH2H3:afval^3.0 in 3750), product of:
      0.11628768 = queryWeight(tagsH2H3:afval^3.0), product of:
        3.0 = boost
        7.4496803 = idf(docFreq=11, maxDocs=7590)
        0.0052032513 = queryNorm
      12.903225 = (MATCH) fieldWeight(tagsH2H3:afval in 3750), product of:
        1.7320508 = tf(termFreq(tagsH2H3:afval)=3)
        7.4496803 = idf(docFreq=11, maxDocs=7590)
        1.0 = fieldNorm(field=tagsH2H3, doc=3750)
</str>
[... lots of page documents skipped ...]
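
To compare the norms across many records without reading every explain
block, the fieldNorm lines can be pulled out of the debugQuery text with a
small script. This is only a sketch: it assumes the explain text is
available as a plain string in the format shown above, the function name is
hypothetical, and the "greater than 1.0" filter assumes the classic
similarity, where a length norm without any index-time boost is at most 1.

```python
import re

# Matches lines such as "3072.0 = fieldNorm(field=content, doc=6617)"
FIELD_NORM = re.compile(
    r"([0-9.Ee+-]+) = fieldNorm\(field=(\w+), doc=(\d+)\)"
)


def field_norms(explain_text):
    """Return (doc, field, norm) tuples for every fieldNorm line."""
    return [
        (int(doc), field, float(norm))
        for norm, field, doc in FIELD_NORM.findall(explain_text)
    ]


# A few lines copied from the output above
sample = """
        3072.0 = fieldNorm(field=content, doc=6617)
        1280.0 = fieldNorm(field=content, doc=5877)
        0.3125 = fieldNorm(field=content, doc=493)
        1.0 = fieldNorm(field=title, doc=493)
"""

norms = field_norms(sample)

# Assumption: norms above 1.0 hint at an index-time boost on the document
suspicious = [row for row in norms if row[2] > 1.0]
for doc, field, norm in suspicious:
    print(doc, field, norm)
```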


-- 


Kind regards,


Jigal van Hemert | Developer



Langesteijn 124
3342LG Hendrik-Ido-Ambacht

T. +31 (0)78 635 1200
F. +31 (0)848 34 9697
KvK. 23 09 28 65

jigal@alternet.nl
www.alternet.nl

