Posted to user@nutch.apache.org by Max Lynch <ih...@gmail.com> on 2010/08/01 02:12:18 UTC

Nutch SolrIndex command not adding documents

Hi,
I'm following the nutch tutorial (http://wiki.apache.org/nutch/NutchTutorial)
and everything seems to be working fine, except when I try to run

bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb
crawl/segments/*

The document count on my solr server doesn't change (I'm viewing
/solr/admin/stats.jsp).  I've even gone so far as to explicitly issue a
<commit /> using curl, with no success.

It seems like my fetch routine grabs a ton of documents, but few if any
make it to solr (there are about 2000 in there already from a previous
nutch solrindex run that added a few).  How can I tell how many documents
nutch is sending to solr?  Should I just modify the solrindex driver
program?
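
In case it matters, this is roughly how I've been checking the count and
forcing the commit (assuming Solr on the default port and a single core):

$ curl 'http://127.0.0.1:8983/solr/update' -H 'Content-Type: text/xml' \
    --data-binary '<commit/>'
$ curl 'http://127.0.0.1:8983/solr/select?q=*:*&rows=0'

The numFound attribute in the second response should be the total document
count.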

Just for reference, my nutch cycle looks like this:

$ bin/nutch inject crawlwi/crawldb wiurls/
$ bin/nutch generate crawlwi/crawldb crawlwi/segments

Then I ran the following a few times, with the newest segment in a variable:
$ s1=`ls -d crawlwi/segments/2* | tail -1`
$ echo $s1
$ bin/nutch fetch $s1 -threads 15
$ bin/nutch updatedb crawlwi/crawldb $s1
$ bin/nutch generate crawlwi/crawldb crawlwi/segments -topN 5000
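
(A sketch of those repeated rounds as one loop -- not the exact script I
ran:)

$ for i in 1 2 3; do
>   s1=`ls -d crawlwi/segments/2* | tail -1`
>   bin/nutch fetch $s1 -threads 15
>   bin/nutch updatedb crawlwi/crawldb $s1
>   bin/nutch generate crawlwi/crawldb crawlwi/segments -topN 5000
> done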

Then
$ bin/nutch invertlinks crawlwi/linkdb -dir crawlwi/segments
$ bin/nutch index crawlwi/indexes crawlwi/crawldb crawlwi/linkdb
crawlwi/segments/*
$ bin/nutch solrindex http://127.0.0.1/solr/ crawlwi/crawldb crawlwi/linkdb
crawlwi/segments/*

But the new documents don't make it into the index.

Any ideas?
Thanks.

Re: Re: Nutch SolrIndex command not adding documents

Posted by Max Lynch <ih...@gmail.com>.
I set multiValued="true" on the title field in my schema and I don't see
the error anymore.  Could it be an interaction with the parse-feed plugin?

Either way, it's working so I'm happy.

I'm on nutch 1.1 and solr 1.4.1
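
For reference, the title field in my schema.xml now looks something like
this (paraphrasing from memory, so the type and other attributes may differ
from the stock Nutch schema):

<field name="title" type="text" stored="true" indexed="true"
       multiValued="true"/>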

On Mon, Aug 2, 2010 at 12:03 PM, Markus Jelsma <ma...@buyways.nl> wrote:

> [quoted message trimmed]

RE: Re: Nutch SolrIndex command not adding documents

Posted by Markus Jelsma <ma...@buyways.nl>.
Hi,

It makes no sense indeed, but check your solrindex-mapping.xml in the Nutch
configuration directory; it might copy the field. Also check your schema.xml
in the Solr configuration, as it might do the same.
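
For example, a hypothetical solrindex-mapping.xml (not your actual file) in
which two sources feed the same destination field, so Solr receives multiple
values for it; a copyField in the Solr schema.xml has the same effect:

<!-- conf/solrindex-mapping.xml -->
<mapping>
  <fields>
    <field dest="title" source="title"/>
    <field dest="title" source="anchor"/> <!-- second source: two values -->
  </fields>
  <uniqueKey>id</uniqueKey>
</mapping>

<!-- Solr conf/schema.xml -->
<copyField source="anchor" dest="title"/>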

To make it a bit more complicated: do you have some deduplication mechanism somewhere? If it isn't properly configured, for example with a recurring field value as the source for the signature, it can prevent any additions to the index.
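
For instance (a sketch; the property lives in nutch-default.xml and can be
overridden in nutch-site.xml, and dedup against Solr is its own command):

<property>
  <name>db.signature.class</name>
  <value>org.apache.nutch.crawl.TextProfileSignature</value>
</property>

$ bin/nutch solrdedup http://127.0.0.1:8983/solr/

If the signature is computed from a field that is identical across pages,
dedup will treat them all as duplicates.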

And, what Nutch and Solr versions are you using? I have had multiple setups with Nutch 1.0, 1.1 and trunk, and Solr 1.4 and 1.4.1, but never came across your error for the title field. Some shipped Nutch configurations did, however, mess up the url and id fields in the Solr index, which are not multi-valued.


Cheers,
 
-----Original message-----
From: Max Lynch <ih...@gmail.com>
Sent: Mon 02-08-2010 18:32
To: user@nutch.apache.org; 
Subject: Re: Nutch SolrIndex command not adding documents

[quoted message trimmed]

Re: Nutch SolrIndex command not adding documents

Posted by Max Lynch <ih...@gmail.com>.
So, I figured out the log debugging (I just had to modify a few settings in
log4j.properties), and I've found the source of my solrindex errors.  First,
many dates in my index fail to parse in MoreIndexingFilter.java, so I added
another date format, "EEE MMM dd HH:mm:ss zzz yyyy", for which I will file a
bug tracker entry and a patch.
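
The idea, as a standalone sketch (not the actual MoreIndexingFilter patch,
whose surrounding code I'm omitting):

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

public class DateParseSketch {
    // Patterns tried in order; the second is the newly added format.
    private static final String[] PATTERNS = {
        "EEE, dd MMM yyyy HH:mm:ss zzz",  // e.g. "Sun, 01 Aug 2010 02:12:18 GMT"
        "EEE MMM dd HH:mm:ss zzz yyyy"    // e.g. "Sun Aug 01 02:12:18 UTC 2010"
    };

    public static Date parse(String value) throws ParseException {
        ParseException last = null;
        for (String pattern : PATTERNS) {
            try {
                return new SimpleDateFormat(pattern, Locale.US).parse(value);
            } catch (ParseException e) {
                last = e;  // fall through and try the next pattern
            }
        }
        throw last;
    }

    public static void main(String[] args) throws ParseException {
        System.out.println(parse("Sun Aug 01 02:12:18 UTC 2010"));
    }
}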

However, I've also encountered this issue:
"multiple_values_encountered_for_non_multiValued_field_title"
which crashes the job.  In my solr schema I don't allow multiple values for
the "title" field (as per the nutch default).  Why would the parser find
multiple title values?  Seems to be another bug.

Any ideas?

Thanks.


On Sat, Jul 31, 2010 at 9:11 PM, Max Lynch <ih...@gmail.com> wrote:

> [quoted message trimmed]

Re: Nutch SolrIndex command not adding documents

Posted by Max Lynch <ih...@gmail.com>.
The solr schema and mappings all seem to work fine.  It's just that
sometimes I run solrindex and no documents get added to the solr index, and
I have no indication of why.  I see my fetcher grabbing thousands of pages,
yet my doc count on solr doesn't increase.

I've cleared my index and have been following the steps here:
http://wiki.apache.org/nutch/RunningNutchAndSolr and it seems to be working
better.  I'm just not sure why these steps work when the nutch tutorial
steps before didn't.  The only difference I can see is the added -noParse
flag and the separate parse step.

I think it's the non-determinism or lack of output that unsettles me.  Can I
enable debugging output or something?
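
To be concrete, I'd expect something along these lines in
conf/log4j.properties to do it, though I'm guessing at the exact logger and
appender names:

log4j.logger.org.apache.nutch=DEBUG,cmdstdout
log4j.logger.org.apache.solr=DEBUG,cmdstdout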

On Sat, Jul 31, 2010 at 8:34 PM, Scott Gonyea <sc...@aitrus.org> wrote:

> [quoted message trimmed]

Re: Nutch SolrIndex command not adding documents

Posted by Scott Gonyea <sc...@aitrus.org>.
Did you set up the solr mappings? When you index into nutch, do the documents appear when you query nutch's interface?
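
For example (command names from the stock bin/nutch script; adjust the
paths to your crawl directory):

$ bin/nutch readdb crawlwi/crawldb -stats
$ bin/nutch readseg -list -dir crawlwi/segments
$ bin/nutch org.apache.nutch.searcher.NutchBean apache

The last one runs a query against the local Nutch index (it reads the crawl
directory named by the searcher.dir property).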

On Jul 31, 2010, at 5:12 PM, Max Lynch <ih...@gmail.com> wrote:

> [quoted message trimmed]