Posted to user@nutch.apache.org by Bai Shen <ba...@gmail.com> on 2013/05/01 13:32:55 UTC

Solrindex -all not working correctly

My crawl loop consists of the following.

generate -topN
fetch -all
parse -all
updatedb
solrindex -all

With fetch and parse, -all pulls only the batch that was generated,
skipping all of the other URLs.  However, solrindex -all seems to be
equivalent to -reindex, committing everything, not just what hasn't been
sent.

Anyone else run into this issue?

Thanks.

Re: Solrindex -all not working correctly

Posted by Bai Shen <ba...@gmail.com>.
Just trying to speed things up.  The solrindex currently takes quite a
while because it's reindexing all of my documents instead of just the
latest segment.


On Wed, May 1, 2013 at 1:43 PM, AC Nutch <ac...@gmail.com> wrote:

> I have also run into this issue. Our problem was that we were performing
> analysis on the URLs in Solr and adding data in various fields which get
> overwritten at the next index. We had to edit the source to fix our issue.
>
> In terms of solving it - what is your main issue with that? Is it that you
> are looking for a more efficient workflow or is it something else?

Re: Solrindex -all not working correctly

Posted by AC Nutch <ac...@gmail.com>.
I have also run into this issue. Our problem was that we were performing
analysis on the URLs in Solr and adding data in various fields which get
overwritten at the next index. We had to edit the source to fix our issue.

In terms of solving it - what is your main issue with that? Is it that you
are looking for a more efficient workflow or is it something else?




On Wed, May 1, 2013 at 7:32 AM, Bai Shen <ba...@gmail.com> wrote:

> My crawl loop consists of the following.
>
> generate -topN
> fetch -all
> parse -all
> updatedb
> solrindex -all
>
> With fetch and parse, -all pulls only the batch that was generated,
> skipping all of the other URLs.  However, solrindex -all seems to be
> equivalent to -reindex, committing everything, not just what hasn't been
> sent.
>
> Anyone else run into this issue?
>
> Thanks.
>

Re: Solrindex -all not working correctly

Posted by Bai Shen <ba...@gmail.com>.
The problem is that it works differently than fetch or parse.  If I say
fetch -all I get every url that has been tagged for fetch, regardless of
the batch id.  Same with parse -all.

However, when I do solrindex -all, I get the same thing as solrindex
-reindex.  What I'm looking for is the same as the fetch and parse all
where it will index all of the items that have just been parsed.

As for passing in a batchid, how can you get the batch id from a CLI
generate call?  That is why I was using -all in the first place.
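
One possible workaround, sketched under assumptions: choose the batch id
yourself and hand the same value to every step, so that solrindex can take
the id instead of -all. The Python helper below only builds the command
lines and does not run anything. The -batchId option on generate, the
positional batch id on fetch, parse, and solrindex, the id format, and the
Solr URL are all assumptions not confirmed in this thread; check the usage
output of bin/nutch for your version before relying on them.

```python
import random
import time


def make_batch_id():
    """Build an id of the form '<epoch seconds>-<random int>'.

    ASSUMPTION: the format is illustrative; any unique string may do.
    """
    return f"{int(time.time())}-{random.randint(0, 32767)}"


def crawl_cycle_commands(batch_id, top_n, solr_url, nutch="bin/nutch"):
    """Return the argv lists for one crawl cycle. Nothing is executed."""
    return [
        # ASSUMPTION: generate accepts -batchId in your Nutch 2.x build.
        [nutch, "generate", "-topN", str(top_n), "-batchId", batch_id],
        # ASSUMPTION: fetch and parse take the batch id in place of -all.
        [nutch, "fetch", batch_id],
        [nutch, "parse", batch_id],
        [nutch, "updatedb"],
        # ASSUMPTION: solrindex takes <solr url> followed by the batch id.
        [nutch, "solrindex", solr_url, batch_id],
    ]


if __name__ == "__main__":
    # Hypothetical values: 1000 URLs per cycle and a local Solr instance.
    batch = make_batch_id()
    for argv in crawl_cycle_commands(batch, 1000, "http://localhost:8983/solr"):
        print(" ".join(argv))
```

Printing the commands first makes it easy to compare them against the
usage text of each job before wiring them into a real loop.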


On Wed, May 15, 2013 at 3:47 PM, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> Is this not the desired/intended behaviour for the -all switch?
> If not, then we should be able to pass an argument to index only a batchId
> or crawlId.

Re: Solrindex -all not working correctly

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Is this not the desired/intended behaviour for the -all switch?
If not, then we should be able to pass an argument to index only a batchId
or crawlId.


On Mon, May 13, 2013 at 11:06 AM, Bai Shen <ba...@gmail.com> wrote:

> I'm using 2.x HEAD now and I'm still seeing the same problem.  When I call
> solrindex -all it still indexes everything, not just the newly parsed
> items.



-- 
*Lewis*

Re: Solrindex -all not working correctly

Posted by Bai Shen <ba...@gmail.com>.
I'm using 2.x HEAD now and I'm still seeing the same problem.  When I call
solrindex -all it still indexes everything, not just the newly parsed items.


On Wed, May 1, 2013 at 2:13 PM, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> What version are you using?
> If you can I would advise you to upgrade to 2.x HEAD.

Re: Solrindex -all not working correctly

Posted by Bai Shen <ba...@gmail.com>.
I'm using 2.1.

Are there any other notable changes in HEAD that I should know about?


On Wed, May 1, 2013 at 2:13 PM, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> What version are you using?
> If you can I would advise you to upgrade to 2.x HEAD.

Re: Solrindex -all not working correctly

Posted by Lewis John Mcgibbney <le...@gmail.com>.
What version are you using?
If you can I would advise you to upgrade to 2.x HEAD.


On Wed, May 1, 2013 at 4:32 AM, Bai Shen <ba...@gmail.com> wrote:

> My crawl loop consists of the following.
>
> generate -topN
> fetch -all
> parse -all
> updatedb
> solrindex -all
>
> With fetch and parse, -all pulls only the batch that was generated,
> skipping all of the other URLs.  However, solrindex -all seems to be
> equivalent to -reindex, committing everything, not just what hasn't been
> sent.
>
> Anyone else run into this issue?
>
> Thanks.
>



-- 
*Lewis*