Posted to solr-user@lucene.apache.org by Dan Davis <da...@gmail.com> on 2015/01/26 06:08:13 UTC

Weighting of prominent text in HTML

By examining solr.log, I can see that Nutch is using the /update request
handler rather than /update/extract.   So, this may be a more appropriate
question for the Nutch mailing list.   OTOH, y'all know the answer off the
top of your head.

Will Nutch boost text occurring in h1, h2, etc. more heavily than text in a
normal paragraph?   Can this weighting be tuned without writing a plugin?
Is writing a plugin often needed because of the flexibility that is
needed in practice?

I wanted to call this post *Anatomy of a small scale search engine*, but
lacked the nerve ;)

Thanks, all and many,

Dan Davis, Systems/Applications Architect
National Library of Medicine

Re: [MASSMAIL]Weighting of prominent text in HTML

Posted by Dan Davis <da...@danizen.net>.

Helps lots.   Thanks, Jorge Luis.   Good point about different fields -
I'll just put the h1 and h2 (however deep I want to go) into fields, and we
can sort out weighting and whether we want it later with edismax.   The
blogs on adding plugins for that sort of thing look straightforward.

On Mon, Jan 26, 2015 at 12:47 AM, Jorge Luis Betancourt González <
jlbetancourt@uci.cu> wrote:

> Hi Dan:
>
> Agreed, this question is more Nutch related than Solr ;)
>
> Nutch doesn't send any data to the /update/extract request handler; all
> text and metadata extraction happens on the Nutch side rather than relying
> on the ExtractingRequestHandler provided by Solr. Underneath, Nutch uses
> Tika, the same technology as the ExtractingRequestHandler, so there
> shouldn't be any great difference.
>
> By default Nutch doesn't boost anything, as it is Solr's job to boost the
> different content in the different fields, which is what happens when you
> query Solr. Nutch calculates LinkRank, a variation of the famous PageRank
> (or the OPIC score, another scoring algorithm implemented in Nutch, which
> I believe is the default in Nutch 2.x). What you can do is map the heading
> tags into different fields and then apply a different boost to each field.
>
> The general idea with Nutch is to "make pieces of the web page" and store
> each piece in a different field in Solr; then you can tweak your relevance
> function using the values you see fit. So you don't need to write any
> plugin to accomplish this (at least for the h1, h2, etc. example you
> provided; if you want to extract other parts of the web page, you'll need
> to write your own plugin to do so).
>
> Nutch is highly customizable: you can write a plugin for almost any piece
> of logic, from parsers to indexers, by way of URL filters, scoring
> algorithms, protocols, and a long list of others. The plugins are usually
> not difficult to write; the hard part is knowing which extension point you
> need to use, which comes with experience and a good dive into the source
> code.
>
> Hope this helps,
>

Re: [MASSMAIL]Weighting of prominent text in HTML

Posted by Jorge Luis Betancourt González <jl...@uci.cu>.

Hi Dan:

Agreed, this question is more Nutch related than Solr ;)

Nutch doesn't send any data to the /update/extract request handler; all text and metadata extraction happens on the Nutch side rather than relying on the ExtractingRequestHandler provided by Solr. Underneath, Nutch uses Tika, the same technology as the ExtractingRequestHandler, so there shouldn't be any great difference.
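
To illustrate the difference, here is a rough sketch of what a client sends to /update: a document whose text has already been extracted and split into fields, as opposed to the raw file that /update/extract expects. This is not how Nutch itself talks to Solr; the core URL, the field names, and the use of JSON are all assumptions made for the example, and the request is only built, never sent.

```python
import json
import urllib.request

# Hypothetical Solr core URL; adjust to your own installation.
SOLR_UPDATE_URL = "http://localhost:8983/solr/collection1/update"


def build_update_request(docs):
    """Build (but do not send) a JSON POST for Solr's /update handler."""
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        SOLR_UPDATE_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Field names here are illustrative, not a real Nutch schema.
doc = {
    "id": "http://example.org/",
    "title": "Example",
    "content": "Text that was already extracted on the crawler side.",
}
req = build_update_request([doc])
```

Passing req to urllib.request.urlopen would perform the actual update against a running Solr.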

By default Nutch doesn't boost anything, as it is Solr's job to boost the different content in the different fields, which is what happens when you query Solr. Nutch calculates LinkRank, a variation of the famous PageRank (or the OPIC score, another scoring algorithm implemented in Nutch, which I believe is the default in Nutch 2.x). What you can do is map the heading tags into different fields and then apply a different boost to each field.
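
A minimal sketch of per-field boosting at query time, assuming the headings have been indexed into fields named h1 and h2 alongside title and content (those field names and the boost values are made up for illustration). It builds the request parameters for Solr's edismax query parser, whose qf parameter takes a list of field^boost pairs.

```python
from urllib.parse import urlencode


def edismax_params(user_query, field_boosts):
    """Return Solr query parameters that weight each field via edismax qf."""
    qf = " ".join(f"{field}^{boost}" for field, boost in field_boosts.items())
    return {"q": user_query, "defType": "edismax", "qf": qf}


# Hypothetical fields and weights: tune these against your own relevance tests.
boosts = {"title": 5, "h1": 3, "h2": 2, "content": 1}
params = edismax_params("heart disease", boosts)
query_string = urlencode(params)  # append to .../select? on your Solr core
```

Because the weights live in the query rather than in the index, they can be changed without recrawling or reindexing.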

The general idea with Nutch is to "make pieces of the web page" and store each piece in a different field in Solr; then you can tweak your relevance function using the values you see fit. So you don't need to write any plugin to accomplish this (at least for the h1, h2, etc. example you provided; if you want to extract other parts of the web page, you'll need to write your own plugin to do so).
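
The "pieces of the web page" idea can be sketched outside Nutch with Python's standard HTML parser. This is only an illustration of the concept, not Nutch code or its plugin API; the class name, the field names, and the choice to keep heading text out of the content field are all assumptions.

```python
from html.parser import HTMLParser


class HeadingSplitter(HTMLParser):
    """Collect text into per-heading fields plus a catch-all 'content' field."""

    HEADING_TAGS = {"h1", "h2"}

    def __init__(self):
        super().__init__()
        self.fields = {"h1": [], "h2": [], "content": []}
        self._current = None  # heading tag we are currently inside, if any

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADING_TAGS:
            self._current = tag

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        text = data.strip()
        if text:
            # Text inside a heading goes to that heading's field, the rest to content.
            self.fields[self._current or "content"].append(text)


def split_page(html):
    """Return a dict mapping field name to the list of text pieces found."""
    parser = HeadingSplitter()
    parser.feed(html)
    parser.close()
    return parser.fields


page = (
    "<html><body><h1>Asthma</h1><p>Overview text.</p>"
    "<h2>Symptoms</h2><p>Wheezing.</p></body></html>"
)
fields = split_page(page)
```

Each key of the resulting dict would correspond to one Solr field, which is what makes separate boosts per heading level possible.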

Nutch is highly customizable: you can write a plugin for almost any piece of logic, from parsers to indexers, by way of URL filters, scoring algorithms, protocols, and a long list of others. The plugins are usually not difficult to write; the hard part is knowing which extension point you need to use, which comes with experience and a good dive into the source code.

Hope this helps,

