Posted to user@nutch.apache.org by Bai Shen <ba...@gmail.com> on 2012/08/01 19:36:31 UTC
Nutch 2 solrindex
I'm trying to crawl using Nutch 2. However, I can't seem to get it to
index to Solr without adding -reindex to the command. And at that point it
indexes everything I've crawled. I've tried both -all and the batch id,
but neither one results in anything being indexed to Solr.
Any suggestions of what to look at?
Thanks.
Re: Nutch 2 solrindex
Posted by Ferdy Galema <fe...@kalooga.com>.
Thanks. I will see if I can reproduce and patch this (in case you do not
create a Jira issue).
On Thu, Aug 2, 2012 at 7:54 PM, <al...@aim.com> wrote:
> The current code that puts updb_mrk in DbUpdateReducer is as follows:
>
>   Utf8 mark = Mark.PARSE_MARK.removeMarkIfExist(page);
>   if (mark != null) {
>     Mark.UPDATEDB_MARK.putMark(page, mark);
>   }
>
> The mark is always null, regardless of whether PARSE_MARK is present.
>
> This function calls
>
>   public Utf8 removeFromMarkers(Utf8 key) {
>     if (markers == null) { return null; }
>     getStateManager().setDirty(this, 20);
>     return markers.remove(key);
>   }
>
> It seems to me that getStateManager().setDirty(this, 20); removes the
> marker, and that is why the last line returns null.
>
> I tried to follow getStateManager().setDirty(this, 20) through the class
> hierarchy, but did not find anything useful.
>
> I have fixed the issue by replacing the above lines with
>
>   Utf8 parse_mark = Mark.PARSE_MARK.checkMark(page);
>   if (parse_mark != null) {
>     Mark.UPDATEDB_MARK.putMark(page, parse_mark);
>     Mark.PARSE_MARK.removeMark(page);
>   }
>
> Thanks.
> Alex.
Re: Nutch 2 solrindex
Posted by al...@aim.com.
The current code that puts updb_mrk in DbUpdateReducer is as follows:

  Utf8 mark = Mark.PARSE_MARK.removeMarkIfExist(page);
  if (mark != null) {
    Mark.UPDATEDB_MARK.putMark(page, mark);
  }

The mark is always null, regardless of whether PARSE_MARK is present.

This function calls

  public Utf8 removeFromMarkers(Utf8 key) {
    if (markers == null) { return null; }
    getStateManager().setDirty(this, 20);
    return markers.remove(key);
  }

It seems to me that getStateManager().setDirty(this, 20); removes the marker, and that is why the last line returns null.

I tried to follow getStateManager().setDirty(this, 20) through the class hierarchy, but did not find anything useful.

I have fixed the issue by replacing the above lines with

  Utf8 parse_mark = Mark.PARSE_MARK.checkMark(page);
  if (parse_mark != null) {
    Mark.UPDATEDB_MARK.putMark(page, parse_mark);
    Mark.PARSE_MARK.removeMark(page);
  }
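
The order of operations in the replacement code above can be sketched outside Nutch as follows. This is only an illustration under stated assumptions: a plain java.util.Map stands in for the page's marker map, and the class name MarkerHandoffSketch, the method handOffMark, and the key strings are hypothetical, not part of the Nutch API. It shows the read, copy, then remove sequence; it does not reproduce or explain why removeMarkIfExist returned null.

```java
import java.util.HashMap;
import java.util.Map;

// Illustration only: a plain Map stands in for the page's marker map.
// The class, method, and key names are hypothetical, not Nutch API.
final class MarkerHandoffSketch {

    // Moves the value stored under fromKey to toKey.
    // Returns true if a value was present and moved, false otherwise.
    static boolean handOffMark(Map<String, String> markers, String fromKey, String toKey) {
        String value = markers.get(fromKey);  // read first (analogue of checkMark)
        if (value == null) {
            return false;                     // nothing to hand off
        }
        markers.put(toKey, value);            // analogue of putMark
        markers.remove(fromKey);              // analogue of removeMark
        return true;
    }

    public static void main(String[] args) {
        Map<String, String> markers = new HashMap<>();
        markers.put("parse_mark", "batch-1"); // hypothetical key and batch id
        boolean moved = handOffMark(markers, "parse_mark", "updatedb_mark");
        System.out.println("moved=" + moved + ", markers=" + markers);
    }
}
```

Reading the value before removing it means the hand-off never depends on what the removal call happens to return.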
Thanks.
Alex.
-----Original Message-----
From: Ferdy Galema <fe...@kalooga.com>
To: user <us...@nutch.apache.org>
Sent: Thu, Aug 2, 2012 12:16 am
Subject: Re: Nutch 2 solrindex
Hi,
Do you want to open a Jira issue and attach the patch there? Or just explain
what causes the problem? I'm curious as to what this might be.
Thanks,
Ferdy.
Re: Nutch 2 solrindex
Posted by Ferdy Galema <fe...@kalooga.com>.
Hi,
Do you want to open a Jira issue and attach the patch there? Or just explain
what causes the problem? I'm curious as to what this might be.
Thanks,
Ferdy.
On Wed, Aug 1, 2012 at 9:27 PM, <al...@aim.com> wrote:
> This is directly related to the thread I opened yesterday. I think
> this is a bug, since updatedb fails to set the update mark.
> I have fixed it by modifying the code. I have a patch, but I am not sure
> whether I can send it as an attachment.
>
> Alex.
Re: Nutch 2 solrindex
Posted by al...@aim.com.
This is directly related to the thread I opened yesterday. I think this is a bug, since updatedb fails to set the update mark.
I have fixed it by modifying the code. I have a patch, but I am not sure whether I can send it as an attachment.
Alex.
-----Original Message-----
From: Bai Shen <ba...@gmail.com>
To: user <us...@nutch.apache.org>
Sent: Wed, Aug 1, 2012 10:37 am
Subject: Nutch 2 solrindex
I'm trying to crawl using Nutch 2. However, I can't seem to get it to
index to Solr without adding -reindex to the command. And at that point it
indexes everything I've crawled. I've tried both -all and the batch id,
but neither one results in anything being indexed to Solr.
Any suggestions of what to look at?
Thanks.