
Posted to user@nutch.apache.org by Kiks <ki...@gmail.com> on 2011/08/03 08:31:16 UTC

Re: imported to solr

This question was posted on the Solr list but was not answered because it is
Nutch-related...


The indexed contents of 100 sites were imported to solr from nutch using:

bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*

Now, a Solr admin search for 'photography' includes these results:

  <doc>
    <float name="score">0.12570743</float>
    <float name="boost">1.0440307</float>
    <str name="digest">94d97f2806240d18d67cafe9c34f94e1</str>
    <str name="id">http://www.galleryhopper.org/</str>
    <str name="segment">...</str>
    <str name="title">Gallery Hopper: Todd Walker's photography ephemera. Read, enjoy, share, discard.</str>
    <date name="tstamp">...</date>
    <str name="url">http://www.galleryhopper.org/</str>
  </doc>

but highlighting is only offered on the title field, not on the page text.

My question: Where is the stored parsetext content of the pages? What is the
solr command to send it from nutch with url/id key? The information is
contained in the crawl segments with solr id field matching nutch url.

Thanks.

Re: imported to solr

Posted by lewis john mcgibbney <le...@gmail.com>.

Hi Kiks,

What kind of changes have you made to your schema when transferring it to
your Solr instance?

You ask about the stored parsed text content. The default Nutch schema sets
it to stored=false, as it is not always necessary to store all content.
Generally speaking, terms that occur in the title, meta, etc. fields are more
valuable to search across, especially when considering data stores. You
should be able to change this behaviour simply by changing that setting.
However, Solr does not take kindly to schema changes, so you will need to
reindex your data into your Solr core afterwards.
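
One way to see how a given schema currently treats the page text is to read the stored attribute of the field directly. The following is only a minimal Python sketch and not part of Nutch or Solr: the helper name field_is_stored and the sample schema string are made up for illustration, and the field name content is taken from the schema line quoted elsewhere in this thread. When the attribute is not set on the field itself, the sketch returns None rather than guessing what the field type provides.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; a real schema.xml has many more fields and types.
SAMPLE_SCHEMA = """
<schema name="nutch" version="1.3">
  <fields>
    <field name="title" type="text" stored="true" indexed="true"/>
    <field name="content" type="text" stored="false" indexed="true"/>
  </fields>
</schema>
"""


def field_is_stored(schema_xml, field_name):
    """Return True/False from the field's own stored attribute.

    Returns None if the field is missing or the attribute is not set
    on the field element itself.
    """
    root = ET.fromstring(schema_xml)
    for field in root.iter("field"):
        if field.get("name") == field_name:
            stored = field.get("stored")
            return None if stored is None else stored.lower() == "true"
    return None


print("content stored:", field_is_stored(SAMPLE_SCHEMA, "content"))
print("title stored:", field_is_stored(SAMPLE_SCHEMA, "title"))
```

To check a real installation, the same function could be given the text of your own schema.xml instead of the sample string.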


-- 
*Lewis*

Re: imported to solr

Posted by Way Cool <wa...@gmail.com>.

You are welcome. Glad it worked. Have fun.

On Wed, Aug 3, 2011 at 4:16 PM, Kiks <ki...@gmail.com> wrote:

> That worked thanks to you and lewis.
>
> One thing that came up was I first tried to delete the old
> /apache-solr-3.3.0/example/solr/data/index
> by renaming it and creating a new directory but solr wouldn't start.
>
> After restoring the folder, changing solr schema.xml to
> <field name="content" type="text" stored="true" indexed="true"/>
>
> and then re-running /bin/nutch solrindex... it was OK.

Re: imported to solr

Posted by Kiks <ki...@gmail.com>.

That worked, thanks to you and Lewis.

One thing that came up: I first tried to get rid of the old index at
/apache-solr-3.3.0/example/solr/data/index
by renaming it and creating a new directory in its place, but Solr wouldn't
start.

After restoring the folder, changing the Solr schema.xml to
<field name="content" type="text" stored="true" indexed="true"/>

and then re-running /bin/nutch solrindex... it was OK.
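
For anyone who wants to empty the index before reindexing without touching the index directory, a commonly used alternative is to post a delete-by-query message followed by a commit to Solr's XML update handler. This is not what was done in this thread; the Python sketch below only builds the two requests and does not send them. The update URL is an assumption based on the Solr address used above, and build_update_request is a made-up helper name.

```python
import urllib.request

# Assumed update endpoint for the Solr instance used in this thread.
SOLR_UPDATE_URL = "http://127.0.0.1:8983/solr/update"


def build_update_request(xml_body, update_url=SOLR_UPDATE_URL):
    """Build (but do not send) a POST request carrying an XML update message."""
    return urllib.request.Request(
        update_url,
        data=xml_body.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


# Delete every document, then make the deletion visible with a commit.
delete_all = build_update_request("<delete><query>*:*</query></delete>")
commit = build_update_request("<commit/>")

# To actually clear the index, send both requests in order, for example:
# for request in (delete_all, commit):
#     urllib.request.urlopen(request).read()
```

Since this removes all documents, it is worth trying against a test instance first.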



On Wed, Aug 3, 2011 at 2:42 PM, Way Cool <wa...@gmail.com> wrote:

> Potentially you need to make two changes:
> 1. As Lewis suggested, make sure to change the content field in
> solr/conf/schema.xml as below:
> <field name="content" type="text" stored="true" indexed="true"/>
> 2. Append the following as a part of search url:
> &hl=on&hl.fl=content site url title
> OR
> Add the following to solrconfig.xml as a part of browse search component if
> you are using solr/browse:
>  <str name="hl">on</str>
>  <str name="hl.fl">url site title content</str>
>
> You should be able to see something like this when you search in Solr:
> <lst name="highlighting">
>   <lst name="http://thetechietutorials.blogspot.com/">
>     <arr name="content">
>       <str>, June 15, 2011 A Custom <em>Solr</em> Search Component example - RedirectSearchComponent Currently Apache <em>Solr</em></str>
>     </arr>
>   </lst>
>   <lst name="http://thetechietutorials.blogspot.com/2011/06/working-example-of-java-annotations.html">
>     <arr name="content">
>       <str>) ▼  June (5) A working example of Java Annotations A Custom <em>Solr</em> Search Component example - Redirect</str>
>     </arr>
>   </lst>
>   ...
> </lst>
>
> You can also look at my blog about a customized solr browser interface for
> Nutch data if you are interested. Here is the url:
>
> http://thetechietutorials.blogspot.com/2011/07/customized-solr-browser-interface-for.html
>
> Thanks.

Re: imported to solr

Posted by Way Cool <wa...@gmail.com>.

Potentially you need to make two changes:
1. As Lewis suggested, make sure to change the content field in
solr/conf/schema.xml as below:
<field name="content" type="text" stored="true" indexed="true"/>
2. Append the following as a part of search url:
&hl=on&hl.fl=content site url title
OR
Add the following to solrconfig.xml as a part of browse search component if
you are using solr/browse:
 <str name="hl">on</str>
 <str name="hl.fl">url site title content</str>
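
The URL variant above can be sketched in Python as follows. This is only an illustration: the select endpoint is an assumption based on the Solr address used earlier in the thread, and build_highlight_url is a made-up helper name. The point it shows is that the spaces in the hl.fl field list need to be URL-encoded, which urlencode takes care of.

```python
from urllib.parse import urlencode

# Assumed select endpoint; adjust host, port and path to your setup.
SOLR_SELECT_URL = "http://127.0.0.1:8983/solr/select"


def build_highlight_url(query, fields, select_url=SOLR_SELECT_URL):
    """Return a search URL with highlighting switched on for the given fields."""
    params = {
        "q": query,
        "hl": "on",
        # Solr accepts a space-separated field list; urlencode escapes the spaces.
        "hl.fl": " ".join(fields),
    }
    return select_url + "?" + urlencode(params)


url = build_highlight_url("photography", ["content", "site", "url", "title"])
print(url)
```

Highlighting on content only returns snippets once that field is stored and the data has been reindexed, as described in point 1.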

You should be able to see something like this when you search in Solr:
<lst name="highlighting">
  <lst name="http://thetechietutorials.blogspot.com/">
    <arr name="content">
      <str>, June 15, 2011 A Custom <em>Solr</em> Search Component example - RedirectSearchComponent Currently Apache <em>Solr</em></str>
    </arr>
  </lst>
  <lst name="http://thetechietutorials.blogspot.com/2011/06/working-example-of-java-annotations.html">
    <arr name="content">
      <str>) ▼  June (5) A working example of Java Annotations A Custom <em>Solr</em> Search Component example - Redirect</str>
    </arr>
  </lst>
  ...
</lst>
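
If you want to read the snippets from a program rather than from the admin page, one option is to request the JSON response writer (wt=json) and pick the snippets out of the highlighting section, which is keyed by document id and then by field name. The sketch below assumes that JSON shape; the sample response, the example.org id and the helper name snippets_by_doc are made up for illustration and do not come from a real crawl.

```python
import json

# Hypothetical response body in the shape Solr returns for wt=json.
SAMPLE_RESPONSE = json.dumps({
    "response": {
        "numFound": 1,
        "start": 0,
        "docs": [{"id": "http://www.example.org/"}],
    },
    "highlighting": {
        "http://www.example.org/": {
            "content": ["A Custom <em>Solr</em> Search Component example"]
        }
    },
})


def snippets_by_doc(response_text, field="content"):
    """Map each document id to its list of highlight snippets for one field."""
    data = json.loads(response_text)
    highlighting = data.get("highlighting", {})
    return {
        doc_id: fields.get(field, [])
        for doc_id, fields in highlighting.items()
    }


for doc_id, snippets in snippets_by_doc(SAMPLE_RESPONSE).items():
    for snippet in snippets:
        print(doc_id, "->", snippet)
```

A document with no highlight for the requested field simply maps to an empty list.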

You can also look at my blog about a customized solr browser interface for
Nutch data if you are interested. Here is the url:
http://thetechietutorials.blogspot.com/2011/07/customized-solr-browser-interface-for.html

Thanks.
