Posted to user@nutch.apache.org by Bill Goffe <go...@oswego.edu> on 2006/09/21 20:37:48 UTC

Boost for Occurrences in a Page / Analyze > Once?

I'm trying to tweak the search results at my http://ese.rfe.org/ and I've
got two questions (I'm running 0.7.2):

  - In searching at the above for "unemployment" the leading results have
    10 or more occurrences of that word on the page. I'd like to reduce
    the influence of multiple occurrences of a word on a page and give more
    weight to links, titles, and such. But, in looking at
    nutch-default.xml I don't see any obvious parameters for this. I have
    upped the following to these values:
      indexer.score.power 2.5
      db.score.link.external 4.0
      query.url.boost 2.0
      query.anchor.boost 2.0
      query.title.boost 2.0
      query.phrase.boost 2.0
    As the top links for unemployment are state agencies, I think I will 
    switch db.ignore.internal.links back to true as there are more external
    links to where I would like users to go: http://www.bls.gov .

  - In the old 0.7 tutorial, I could swear that the example suggested
    running "nutch analyze", but it no longer mentions that (it's not on
    the Internet Archive). I believe it also suggested running it more
    than once. Thoughts on these lines? I currently run it once after
    "nutch updatedb," but would more runs aid link analysis?
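    (For reference, I set these as overrides in conf/nutch-site.xml rather
    than editing nutch-default.xml directly -- a sketch with two of the
    values above, assuming the 0.7-style <nutch-conf> root element:)

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: local overrides of nutch-default.xml -->
<nutch-conf>
  <property>
    <name>indexer.score.power</name>
    <value>2.5</value>
  </property>
  <property>
    <name>query.title.boost</name>
    <value>2.0</value>
  </property>
  <!-- ...the remaining boosts are set the same way... -->
</nutch-conf>
```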

         - Bill

-- 
         *------------------------------------------------------*
         | Bill Goffe                 goffe@oswego.edu          |
         | Department of Economics    voice: (315) 312-3444     |
         | SUNY Oswego                fax:   (315) 312-5444     |
         | 416 Mahar Hall             <http://cook.rfe.org>     |          
         | Oswego, NY  13126                                    |
*--------*------------------------------------------------------*-----------*
| "It's a scholarly activity that has nothing to do with his professional   |
|  activity."                                                               |
|    -- Samuel Landau, the lawyer for Roger Shepherd, who used to be a      |
|       professor at New School University until he left after admitting he |
|       plagiarized part of a book. Landau argues that his book was         |
|       "totally unrelated" to his university work and he is suing to get   |
|       his job back. "Professor Who Acknowledged Plagiarism Accuses New    |
|       School U. of Firing Him Unfairly," Chronicle of Higher Education,   |
|       November 17, 2004.                                                  |
*---------------------------------------------------------------------------*


Re: Boost for Occurrences in a Page / Analyze > Once?

Posted by Andrzej Bialecki <ab...@getopt.org>.
Bill Goffe wrote:
> Andrzej said:
>
>   
>> Nutch 0.7 uses a variant of PageRank link analysis, and the analyze tool 
>> would perform a couple of iterations to propagate the scores along links. 
>> However, it was a slow and very resource-hungry process, so it was 
>> sometimes impossible to complete the analysis step even for 
>> moderately sized collections. 
>>     
>
> Interesting. Invoking this as "bin/nutch analyze db_dir 3" (three
> rounds of analysis) took about 35 minutes on some 300,000 pages on a
> dual Xeon machine with 3 GB of RAM. That is a small share of the time
> spent fetching, generating segments, etc.
>   

300,000 is a relatively small database. With DBs of around 10-20 million 
docs this analyze step can literally take days and consume hundreds of 
GBs of disk space.

>   
>> 0.7 also offers an option to use a static ranking method, which doesn't
>> require running the analysis step, and which is based on the number of
>> outlinks and inlinks.
>>     
>
> Um, it isn't clear how to do this. I don't see anything in
> http://wiki.apache.org/nutch/CommandLineOptions nor nutch-default.xml.
>   

It's not a command-line option. This is documented in nutch-default.xml 
under "fetchlist.score.by.link.count" and "indexer.boost.by.link.count". 
There was a discussion about this on the mailing list about a year ago - 
search the archives for "link analysis".
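
Assuming both are boolean switches (as described in 0.7's 
nutch-default.xml), turning them on in conf/nutch-site.xml would look 
roughly like:

```xml
<!-- conf/nutch-site.xml: enable static, link-count-based scoring -->
<nutch-conf>
  <property>
    <name>fetchlist.score.by.link.count</name>
    <value>true</value>
  </property>
  <property>
    <name>indexer.boost.by.link.count</name>
    <value>true</value>
  </property>
</nutch-conf>
```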


> P.S. Any thoughts on how to downplay repeated instances of a word on 
>      a page?
>
>   

You should implement your own Similarity and override idf(Term term, 
Searcher searcher) - please see the Lucene javadoc for details. If 
searcher.docFreq(term) > threshold, you cap it at a fixed value, or even 
reduce the score factor. Be careful not to penalize common words, which 
may be very frequent for legitimate reasons (e.g. stopwords).
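
As a rough, Lucene-independent sketch of the capping idea, applied here 
to within-page term frequency (the CAP value and the sqrt shape are 
illustrative assumptions, not Nutch defaults):

```java
// Sketch of capping: treat term frequency sublinearly and clamp it above
// a threshold, so a page with 50 occurrences of "unemployment" scores no
// higher than one with 10. CAP and the sqrt curve are assumptions.
public class CappedTf {
    static final int CAP = 10; // assumed threshold; tune per collection

    // Lucene's default tf factor is sqrt(freq); here we clamp freq first.
    static float cappedTf(int rawTf) {
        return (float) Math.sqrt(Math.min(rawTf, CAP));
    }

    public static void main(String[] args) {
        System.out.println(cappedTf(5));   // below the cap: normal scaling
        System.out.println(cappedTf(50));  // clamped: same as 10 occurrences
    }
}
```

In a real setup this logic would live in your Similarity subclass, 
plugged into the searcher.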

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Boost for Occurrences in a Page / Analyze > Once?

Posted by Bill Goffe <go...@oswego.edu>.
Andrzej said:

> Nutch 0.7 uses a variant of PageRank link analysis, and the analyze tool 
> would perform a couple of iterations to propagate the scores along links. 
> However, it was a slow and very resource-hungry process, so it was 
> sometimes impossible to complete the analysis step even for 
> moderately sized collections. 

Interesting. Invoking this as "bin/nutch analyze db_dir 3" (three
rounds of analysis) took about 35 minutes on some 300,000 pages on a
dual Xeon machine with 3 GB of RAM. That is a small share of the time
spent fetching, generating segments, etc.

> 0.7 also offers an option to use a static ranking method, which doesn't
> require running the analysis step, and which is based on the number of
> outlinks and inlinks.

Um, it isn't clear how to do this. I don't see anything in
http://wiki.apache.org/nutch/CommandLineOptions nor nutch-default.xml.

> Nutch 0.8 uses scoring plugins, which can implement different scoring 
> algorithms. The default one is based on OPIC, which is again a variant 
> of link-based quality metrics - please see OPICScoringFilter for more 
> details.

That sounds useful. The referenced paper certainly makes it sound more
efficient.

Thanks and best wishes,

           Bill

P.S. Any thoughts on how to downplay repeated instances of a word on 
     a page?

-- 
         *------------------------------------------------------*
         | Bill Goffe                 goffe@oswego.edu          |
         | Department of Economics    voice: (315) 312-3444     |
         | SUNY Oswego                fax:   (315) 312-5444     |
         | 416 Mahar Hall             <http://cook.rfe.org>     |          
         | Oswego, NY  13126                                    |
*--------*------------------------------------------------------*-----------*
| "I have been informed by the senior neurosurgical society to discontinue  |
| expert testimony for plaintiffs or risk membership. Therefore I am        |
| withdrawing as your expert."                                              |
|  --  Dr. Robert W. Rand, a neurosurgeon, on why he couldn't testify       |
|      against another neurosurgeon, Dr. Edgar Housepian. Dr. Housepian was |
|      alleged to have accidentally cut a major artery in the brain of a 3  |
|      year old who ended up with permanent disabilities. "Making           |
|      Malpractice Harder to Prove," Michelle Andrews, New York Times,      |
|      12/21/03.                                                            |
*---------------------------------------------------------------------------*


Re: Boost for Occurrences in a Page / Analyze > Once?

Posted by Andrzej Bialecki <ab...@getopt.org>.
Bill Goffe wrote:
> I'm trying to tweak the search results at my http://ese.rfe.org/ and I've
> got two questions (I'm running 0.7.2):
>
>   - In searching at the above for "unemployment" the leading results have
>     10 or more occurrences of that word on the page. I'd like to reduce
>     the influence of multiple occurrences of a word on a page and give more
>     weight to links, titles, and such. But, in looking at
>     nutch-default.xml I don't see any obvious parameters for this. I have
>     upped the following to these values:
>       indexer.score.power 2.5
>       db.score.link.external 4.0
>       query.url.boost 2.0
>       query.anchor.boost 2.0
>       query.title.boost 2.0
>       query.phrase.boost 2.0
>     As the top links for unemployment are state agencies, I think I will 
>     switch db.ignore.internal.links back to true as there are more external
>     links to where I would like users to go: http://www.bls.gov .
>
>   - In the old 0.7 tutorial, I could swear that the example suggested
>     running "nutch analyze", but it no longer mentions that (it's not on
>     the Internet Archive). I believe it also suggested running it more
>     than once. Thoughts on these lines? I currently run it once after
>     "nutch updatedb," but would more runs aid link analysis?
>   

Nutch 0.7 uses a variant of PageRank link analysis, and the analyze tool 
would perform a couple of iterations to propagate the scores along links. 
However, it was a slow and very resource-hungry process, so it was 
sometimes impossible to complete the analysis step even for moderately 
sized collections. 0.7 also offers an option to use a static ranking 
method, which doesn't require running the analysis step, and which is 
based on the number of outlinks and inlinks.
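
To give a feel for what each analyze iteration does, here is a toy 
sketch of score propagation along links - not Nutch's actual 
implementation; the damping factor and the link graph are made up:

```java
// One simplified PageRank-style iteration: each page distributes a
// damped share of its score evenly across its outlinks, plus a small
// uniform "teleport" term. Repeating this propagates link-based quality.
public class TinyPageRank {
    static float[] iterate(int[][] outlinks, float[] scores, float damping) {
        int n = scores.length;
        float[] next = new float[n];
        for (int i = 0; i < n; i++) next[i] = (1 - damping) / n;
        for (int i = 0; i < n; i++) {
            if (outlinks[i].length == 0) continue;
            float share = damping * scores[i] / outlinks[i].length;
            for (int j : outlinks[i]) next[j] += share;
        }
        return next;
    }

    public static void main(String[] args) {
        // Three pages in a ring: 0 -> 1, 1 -> 2, 2 -> 0
        int[][] links = { {1}, {2}, {0} };
        float[] s = { 1f/3, 1f/3, 1f/3 };
        for (int k = 0; k < 3; k++) s = iterate(links, s, 0.85f);
        System.out.println(s[0] + " " + s[1] + " " + s[2]);
    }
}
```

Running several such iterations is what lets scores flow from 
well-linked pages to the pages they point at.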

Nutch 0.8 uses scoring plugins, which can implement different scoring 
algorithms. The default one is based on OPIC, which is again a variant 
of link-based quality metrics - please see OPICScoringFilter for more 
details.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com