Posted to user@nutch.apache.org by Bai Shen <ba...@gmail.com> on 2012/08/01 19:36:31 UTC

Nutch 2 solrindex

I'm trying to crawl using Nutch 2.  However, I can't seem to get it to
index to Solr without adding -reindex to the command.  And at that point it
indexes everything I've crawled.  I've tried both -all and the batch id,
but neither one results in anything being indexed to Solr.

Any suggestions of what to look at?

Thanks.

Re: Nutch 2 solrindex

Posted by Ferdy Galema <fe...@kalooga.com>.
Thanks. I will see if I can reproduce and patch this (in case you do not
create a Jira issue).

On Thu, Aug 2, 2012 at 7:54 PM, <al...@aim.com> wrote:

> The current code putting updb_mrk in dbUpdateReducer is as follows
>
> Utf8 mark = Mark.PARSE_MARK.removeMarkIfExist(page);
>     if (mark != null) {
>       Mark.UPDATEDB_MARK.putMark(page, mark);
>    }
> the mark is always null, independent if there is PARSE_MARK or not.
>
> This function calls
>
>  public Utf8 removeFromMarkers(Utf8 key) {
>     if (markers == null) { return null; }
>     getStateManager().setDirty(this, 20);
>     return markers.remove(key);
>   }
>
> it seems to me that getStateManager().setDirty(this, 20); removes marker
> and that is why the last line  returns null.
>
> I tried to follow  getStateManager().setDirty(this, 20)  in the hierarchy
> of classes, but did not find anything useful.
>
> I  have fixed the issue by replacing the above lines with
>
>     Utf8 parse_mark = Mark.PARSE_MARK.checkMark(page);
>     if (parse_mark != null)
>     {
>         Mark.UPDATEDB_MARK.putMark(page, parse_mark);
>         Mark.PARSE_MARK.removeMark(page);
>      }
>
> Thanks.
> Alex.

Re: Nutch 2 solrindex

Posted by al...@aim.com.
The current code that puts updb_mrk in DbUpdateReducer is as follows:

    Utf8 mark = Mark.PARSE_MARK.removeMarkIfExist(page);
    if (mark != null) {
      Mark.UPDATEDB_MARK.putMark(page, mark);
    }

The mark is always null, regardless of whether PARSE_MARK is present.

This function calls:

    public Utf8 removeFromMarkers(Utf8 key) {
      if (markers == null) { return null; }
      getStateManager().setDirty(this, 20);
      return markers.remove(key);
    }

It seems to me that getStateManager().setDirty(this, 20) removes the marker, and that is why the last line returns null.

I tried to follow getStateManager().setDirty(this, 20) through the class hierarchy, but did not find anything useful.

I have fixed the issue by replacing the above lines with:

    Utf8 parse_mark = Mark.PARSE_MARK.checkMark(page);
    if (parse_mark != null) {
      Mark.UPDATEDB_MARK.putMark(page, parse_mark);
      Mark.PARSE_MARK.removeMark(page);
    }
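
To show the ordering this fix relies on (read the parse mark, write the update mark, and only then remove the parse mark), here is a minimal standalone sketch. It does not use the Nutch or Gora classes: the markers are modelled as a plain java.util.Map, and the class name, the method name and the two key strings are placeholders made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class MarkTransferSketch {
    // Placeholder marker keys (assumption): the real key strings are not given in this thread.
    static final String PARSE_MARK = "parse_mark";
    static final String UPDATEDB_MARK = "updatedb_mark";

    /**
     * Copies the parse mark to the update mark, then clears the parse mark.
     * Returns true if a parse mark was present and was transferred.
     */
    static boolean transferMark(Map<String, String> markers) {
        // Read the mark without removing it, so the value is still available below.
        String parseMark = markers.get(PARSE_MARK);
        if (parseMark == null) {
            return false; // nothing to transfer
        }
        markers.put(UPDATEDB_MARK, parseMark);
        // Remove the parse mark only after the update mark has been written.
        markers.remove(PARSE_MARK);
        return true;
    }

    public static void main(String[] args) {
        Map<String, String> markers = new HashMap<>();
        markers.put(PARSE_MARK, "batch-1");
        System.out.println("transferred: " + transferMark(markers));
        System.out.println("markers now: " + markers);
    }
}
```

With a plain map the original remove-then-put order would behave the same, so this sketch only illustrates the intended end state. It does not reproduce or explain why the removal returned null in the real code.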

Thanks.
Alex.



-----Original Message-----

From: Ferdy Galema <fe...@kalooga.com>
To: user <us...@nutch.apache.org>
Sent: Thu, Aug 2, 2012 12:16 am
Subject: Re: Nutch 2 solrindex


Hi,

Do you want to open a Jira and attach the patch over there? Or just explain
what the problem is caused. I'm curious to what this might be.

Thanks,
Ferdy.


Re: Nutch 2 solrindex

Posted by Ferdy Galema <fe...@kalooga.com>.
Hi,

Do you want to open a Jira issue and attach the patch there? Or just explain
what causes the problem? I'm curious what this might be.

Thanks,
Ferdy.

On Wed, Aug 1, 2012 at 9:27 PM, <al...@aim.com> wrote:

> This is directly related to the thread I have opened yesterday. I think
> this is a bug, since updatedb fails to put update mark.
> I have fixed it by modifying code. I have a patch, but not sure if I can
> send it as an attachment.
>
> Alex.

Re: Nutch 2 solrindex

Posted by al...@aim.com.
This is directly related to the thread I opened yesterday. I think this is a bug, since updatedb fails to put the update mark.
I have fixed it by modifying the code. I have a patch, but I am not sure if I can send it as an attachment.
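
To make the link to the reported symptom concrete, here is a rough sketch of how a missing update mark could lead to nothing being indexed with -all or a batch id, while -reindex still indexes everything. It assumes the indexer skips any page that has no update mark unless -reindex is given. That assumption matches the behaviour described in this thread but is not taken from the Nutch source, and the class, method and constant names are hypothetical.

```java
public class IndexFilterSketch {
    // Hypothetical sentinel arguments, mirroring the options mentioned in the thread.
    static final String ALL = "-all";
    static final String REINDEX = "-reindex";

    /**
     * Decides whether a page would be indexed, given the update mark stored on
     * the page (null if updatedb never wrote one) and the batch id argument
     * passed to the indexing command.
     */
    static boolean shouldIndex(String updateMark, String batchIdArg) {
        if (REINDEX.equals(batchIdArg)) {
            return true;  // assumed: reindex ignores marks entirely
        }
        if (updateMark == null) {
            return false; // assumed: no update mark means the page is skipped
        }
        // With a mark present, index on -all or when the batch ids match.
        return ALL.equals(batchIdArg) || updateMark.equals(batchIdArg);
    }

    public static void main(String[] args) {
        // A page whose update mark was never written by updatedb.
        System.out.println(shouldIndex(null, "-all"));      // false
        System.out.println(shouldIndex(null, "batch-1"));   // false
        System.out.println(shouldIndex(null, "-reindex"));  // true
    }
}
```

Under that assumption, a page without an update mark is reachable only through -reindex, which is what the original question reports.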

Alex.



-----Original Message-----
From: Bai Shen <ba...@gmail.com>
To: user <us...@nutch.apache.org>
Sent: Wed, Aug 1, 2012 10:37 am
Subject: Nutch 2 solrindex


I'm trying to crawl using Nutch 2.  However, I can't seem to get it to
index to solr without adding -reindex to the command.  And at that point it
indexes everything I've crawled.  I've tried both -all and the batch id,
but neither one results in anything being indexed to solr.

Any suggestions of what to look at?

Thanks.