Posted to user@nutch.apache.org by Benjamin Higgins <bh...@gmail.com> on 2006/09/20 03:14:20 UTC

Changing page injection behavior in Nutch 0.8

In Nutch 0.7, I wanted to change Nutch's behavior such that when I inject a
file it will add the page, even if it is already present.

I did this because I can prepare a list of changed files that I have on my
intranet and want Nutch to reindex them right away.

I made a change (suggested by Howie Wang) to
org.apache.nutch.db.WebDBInjector by changing the addPage method.  I
replaced the line:

  dbWriter.addPageIfNotPresent(page);

with:

  dbWriter.addPageWithScore(page);

Question: I'm moving to Nutch 0.8 and I'd like similar behavior, but I don't
know where to make the change, as a lot of the code has changed (and there's
no longer a WebDBInjector.java file).

How can I accomplish this?  If there is a more appropriate way to do this
please let me know that also.

Thanks,

Ben

Re: Changing page injection behavior in Nutch 0.8

Posted by Tomi NA <he...@gmail.com>.
On 9/20/06, Tomi NA <he...@gmail.com> wrote:
> I'm interested in this problem as well. Haven't had a chance yet to
> look into it, though.

I think the crawl.Injector.InjectReducer class is the one we're looking for.
Would this do the trick?

      // output.collect(key, (Writable) values.next()); // just collect first value
      while (values.hasNext()) {
        output.collect(key, (Writable) values.next());
      }

I can't verify this, as an IOException is giving me trouble (possibly because
I checked out 0.9-dev); someone else might have more luck with the
0.8(.1?) sources.
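
To make the difference concrete, here is a minimal sketch in plain Java that
uses no Nutch or Hadoop classes. It only models what a reducer would emit for
a single URL under three policies: the original "first value" line, the
"collect everything" loop above, and a third option that emits one value but
prefers a freshly injected one. The Entry class, its injected flag, and all
method names are hypothetical and exist only for illustration; whether real
crawl db entries can be told apart this way in 0.8 is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class InjectPolicySketch {

    /** Hypothetical stand-in for a crawl db entry; NOT a Nutch class. */
    static final class Entry {
        final String label;      // e.g. "existing" or "injected"
        final boolean injected;  // assumed flag; real entries may not carry one

        Entry(String label, boolean injected) {
            this.label = label;
            this.injected = injected;
        }
    }

    /** Mirrors the commented-out line: emit only the first value seen. */
    static List<Entry> firstOnly(Iterator<Entry> values) {
        List<Entry> out = new ArrayList<>();
        if (values.hasNext()) {
            out.add(values.next());
        }
        return out;
    }

    /** Mirrors the while loop: emit every value, so one URL yields several records. */
    static List<Entry> emitAll(Iterator<Entry> values) {
        List<Entry> out = new ArrayList<>();
        while (values.hasNext()) {
            out.add(values.next());
        }
        return out;
    }

    /** Alternative: emit exactly one value, preferring an injected entry if present. */
    static List<Entry> preferInjected(Iterator<Entry> values) {
        Entry chosen = null;
        while (values.hasNext()) {
            Entry candidate = values.next();
            if (chosen == null || (candidate.injected && !chosen.injected)) {
                chosen = candidate;
            }
        }
        List<Entry> out = new ArrayList<>();
        if (chosen != null) {
            out.add(chosen);
        }
        return out;
    }

    public static void main(String[] args) {
        // Two values arriving for the same URL: the old entry, then the new injection.
        List<Entry> values = Arrays.asList(
                new Entry("existing", false),
                new Entry("injected", true));

        System.out.println("firstOnly emits:      " + firstOnly(values.iterator()).size());
        System.out.println("emitAll emits:        " + emitAll(values.iterator()).size());
        System.out.println("preferInjected keeps: " + preferInjected(values.iterator()).get(0).label);
    }
}
```

The point of the third policy is that it still writes a single record per
URL, which the loop does not. Whether the crawl db writer tolerates several
records for one key is not something this sketch can show.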

t.n.a.

Re: Changing page injection behavior in Nutch 0.8

Posted by Tomi NA <he...@gmail.com>.
I'm interested in this problem as well. Haven't had a chance yet to
look into it, though.

t.n.a.