Posted to user@nutch.apache.org by Joshua J Pavel <jp...@us.ibm.com> on 2012/01/26 18:13:21 UTC

Nutch + Solr ... looking for context around Hits in search display


Hello all!

We're changing things up and integrating our nutch crawls with a solr
frontend.  It looks promising, but I might be missing something.  I get
most of the fields with little/no worry - title, time, boost, etc., but I
have NO content or context for the search term.  With the nutch .war, we
would get a field that we could configure to display a set number of words
before and after the query word.  I seem to be striking out getting that
same behavior in solr.

Of interest might be: Nutch 1.2, solr 3.5, and we are currently using a
publishing system to put the crawldb on the solr server (they are different
than our Nutch servers) instead of using the bin/nutch commands.  It seems
to work fine, as it reads the number of documents and does return
applicable results.

Thanks in advance!

Re: Nutch + Solr ... looking for context around Hits in search display

Posted by Tanguy Moal <ta...@gmail.com>.

Hello Joshua,

Le 26/01/2012 20:40, Joshua J Pavel a écrit :
>
> I *believe* I did that correctly, this is from my solrconf.xml:
>
> <field name="content" type="text" stored="true" indexed="true"/>.
>
> Perhaps I'm querying solr wrong? Sample query:
>
> http://testsite.com/solr/select/?q=<SEARCH_TERM>&fl=url,title,content&hl=true&hl.q=title,content&hl.fragsize=0
> <http://test.australianopen.com/solr/select/?q=Murray&fl=url,title,content&hl=true&hl.q=title,content&hl.fragsize=0>
>
Yes, I think you are using some arguments wrong: hl.q is used to specify
an alternative query (i.e. something related to but different from
<SEARCH_TERM>, see http://wiki.apache.org/solr/HighlightingParameters#hl.q).
The parameter that specifies which fields to highlight / summarize /
snippetize, however :-) is hl.fl
(http://wiki.apache.org/solr/HighlightingParameters#hl.fl), just like
fl is used to specify which fields to return in the returned documents.
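
As a rough illustration of the corrected query, here is a minimal Python sketch that builds the URL with hl.fl instead of hl.q. The helper name build_highlight_query is made up for this example, the host is the placeholder from the original query, and the fragsize value of 100 is only an assumption (the original query passed 0); adjust it to taste.

```python
from urllib.parse import urlencode


def build_highlight_query(base_url, term, hl_fields=("title", "content"), fragsize=100):
    """Build a Solr select URL that asks for highlighted snippets.

    build_highlight_query is a hypothetical helper, not part of Nutch or Solr.
    """
    params = {
        "q": term,                      # the user's search term
        "fl": "url,title,content",      # fields returned in each document
        "hl": "true",                   # switch highlighting on
        "hl.fl": ",".join(hl_fields),   # fields to highlight (not hl.q)
        "hl.fragsize": fragsize,        # snippet size in characters (assumed value)
        "wt": "json",                   # JSON response format
    }
    return base_url + "?" + urlencode(params)


# Placeholder host taken from the sample query in this thread.
url = build_highlight_query("http://testsite.com/solr/select/", "Murray")
print(url)
```

Note that urlencode percent-encodes the commas in the field lists, which Solr decodes like any other URL-encoded parameter.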

Beware that the highlighted parts will not be inside the documents array
in the response node, but in a separate highlighting node, where you'll
find your snippets per document (keyed by id) and per field (if available).

I found an example on this page: http://wiki.apache.org/solr/SolJSON,
which describes the JSON output format (which you can get by setting
&wt=json). As you'll see, you reach the highlighting the same way you
get facet counts.
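
To make the response layout concrete, here is a small Python sketch that pulls snippets out of the highlighting node. The sample response is invented for illustration (it is not output from a real server), and snippets_for is a hypothetical helper; it assumes the document id is the key under "highlighting", as described above.

```python
import json

# Invented sample shaped like a Solr JSON response with highlighting enabled.
sample_response = json.loads("""
{
  "response": {"numFound": 1, "start": 0, "docs": [
    {"id": "http://example.com/a", "title": "Example page"}
  ]},
  "highlighting": {
    "http://example.com/a": {
      "content": ["... <em>Murray</em> wins ..."]
    }
  }
}
""")


def snippets_for(response, doc_id, field):
    """Return the highlighted fragments for one document and field.

    Returns an empty list when the document or field has no highlighting.
    """
    return response.get("highlighting", {}).get(doc_id, {}).get(field, [])


# Walk the documents and look up each one's snippets by id.
for doc in sample_response["response"]["docs"]:
    fragments = snippets_for(sample_response, doc["id"], "content")
    print(doc["title"], "->", fragments)
```

The lookup goes through .get() at every level because a document with no match in a given field simply has no entry for it.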

I hope this helps.

--
Tanguy

For more information refer to
>
>
> Josh Pavel
>
> Phone: 919.601.7018
> Email: jpavel@us.ibm.com
> IBM Special Events Team
>
>
> From: Markus Jelsma <ma...@openindex.io>
> To: user@nutch.apache.org
> Date: 01/26/2012 02:25 PM
> Subject: Re: Nutch + Solr ... looking for context around Hits in search display
>
> ------------------------------------------------------------------------
>
>
>
> The Solr schema provided by Nutch does not store the content. To enable
> highlighting in Solr you have to enable the component (see wiki doc) and
> set the stored attr to true for the content field.
>
> > Hello all!
> >
> > We're changing things up and integrating our nutch crawls with a solr
> > frontend.  It looks promising, but I might be missing something.  I get
> > most of the fields with little/no worry - title, time, boost, etc., but I
> > have NO content or context for the search term.  With the nutch .war, we
> > would get a field that we could configure to display a set number of words
> > before and after the query word.  I seem to be striking out getting that
> > same behavior in solr.
> >
> > Of interest might be: Nutch 1.2, solr 3.5, and we are currently using a
> > publishing system to put the crawldb on the solr server (they are different
> > than our Nutch servers) instead of using the bin/nutch commands.  It seems
> > to work fine, as it reads the number of documents and does return
> > applicable results.
> >
> > Thanks in advance!
>
>
>

Re: Nutch + Solr ... looking for context around Hits in search display

Posted by Joshua J Pavel <jp...@us.ibm.com>.

I *believe* I did that correctly, this is from my solrconf.xml:

<field name="content" type="text" stored="true" indexed="true"/>.

Perhaps I'm querying solr wrong?  Sample query:

http://testsite.com/solr/select/?q=<SEARCH_TERM>&fl=url,title,content&hl=true&hl.q=title,content&hl.fragsize=0


Josh Pavel

Phone: 919.601.7018
Email: jpavel@us.ibm.com
IBM Special Events Team


From: Markus Jelsma <ma...@openindex.io>
To: user@nutch.apache.org
Date: 01/26/2012 02:25 PM
Subject: Re: Nutch + Solr ... looking for context around Hits in search display





The Solr schema provided by Nutch does not store the content. To enable
highlighting in Solr you have to enable the component (see wiki doc) and
set the stored attr to true for the content field.

> Hello all!
>
> We're changing things up and integrating our nutch crawls with a solr
> frontend.  It looks promising, but I might be missing something.  I get
> most of the fields with little/no worry - title, time, boost, etc., but I
> have NO content or context for the search term.  With the nutch .war, we
> would get a field that we could configure to display a set number of words
> before and after the query word.  I seem to be striking out getting that
> same behavior in solr.
>
> Of interest might be: Nutch 1.2, solr 3.5, and we are currently using a
> publishing system to put the crawldb on the solr server (they are different
> than our Nutch servers) instead of using the bin/nutch commands.  It seems
> to work fine, as it reads the number of documents and does return
> applicable results.
>
> Thanks in advance!

Re: Nutch + Solr ... looking for context around Hits in search display

Posted by Markus Jelsma <ma...@openindex.io>.

The Solr schema provided by Nutch does not store the content. To enable 
highlighting in Solr you have to enable the component (see wiki doc) and set 
the stored attr to true for the content field.
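
One way to double-check the stored attribute is a quick script over the schema file. The sketch below is only an illustration: the schema snippet is invented, and unstored_fields is a hypothetical helper. It flags only fields explicitly marked stored="false"; a field that omits the attribute inherits its setting from the field type, which this sketch does not inspect.

```python
import xml.etree.ElementTree as ET

# Invented schema fragment; a real schema.xml has many more fields and types.
SCHEMA_SNIPPET = """
<schema name="example">
  <fields>
    <field name="content" type="text" stored="false" indexed="true"/>
    <field name="title" type="text" stored="true" indexed="true"/>
  </fields>
</schema>
"""


def unstored_fields(schema_xml, names):
    """Return the requested field names that are explicitly stored="false"."""
    root = ET.fromstring(schema_xml)
    # Map each field name to its stored attribute (None when omitted).
    stored = {f.get("name"): f.get("stored") for f in root.iter("field")}
    return [name for name in names if stored.get(name) == "false"]


# Fields we would like to highlight on.
print(unstored_fields(SCHEMA_SNIPPET, ["title", "content"]))
```

Any field this reports would need stored="true" and a reindex before Solr can return snippets for it.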

> Hello all!
> 
> We're changing things up and integrating our nutch crawls with a solr
> frontend.  It looks promising, but I might be missing something.  I get
> most of the fields with little/no worry - title, time, boost, etc., but I
> have NO content or context for the search term.  With the nutch .war, we
> would get a field that we could configure to display a set number of words
> before and after the query word.  I seem to be striking out getting that
> same behavior in solr.
> 
> Of interest might be: Nutch 1.2, solr 3.5, and we are currently using a
> publishing system to put the crawldb on the solr server (they are different
> than our Nutch servers) instead of using the bin/nutch commands.  It seems
> to work fine, as it reads the number of documents and does return
> applicable results.
> 
> Thanks in advance!