Posted to user@nutch.apache.org by Bai Shen <ba...@gmail.com> on 2013/05/17 14:36:41 UTC

Re: Example crawl script Nutch 2.1

I just tested the GeneratorJob portion and it works fine.  I have two
comments, though.

1.  I added braces around the -batchId arg if statement.  I don't like if
statements without them.
2.  BatchIds never get cleared.  So if you use the same batchId for
multiple crawl cycles, the number of URLs per batch will continue to grow.
There should probably be a note about this in the help printout.
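
On the question quoted below about capturing the generated batch id, here is a
minimal sketch of parsing it from the GeneratorJob output. It assumes the log
line has exactly the form quoted in this thread; other Nutch versions may
print it differently. The function name extract_batch_id and the sample text
around the log line are hypothetical, used only for illustration.

```python
import re
from typing import Optional

# Pattern for the log line quoted in this thread, e.g.
# "GeneratorJob: generated batch id: 1367327604-149897259".
# Assumption: the id is the first whitespace-free token after the prefix.
BATCH_ID_RE = re.compile(r"GeneratorJob: generated batch id: (\S+)")


def extract_batch_id(generator_output: str) -> Optional[str]:
    """Return the last batch id reported in the output, or None if absent."""
    matches = BATCH_ID_RE.findall(generator_output)
    return matches[-1] if matches else None


# Hypothetical captured output: only the second line comes from this thread,
# the first line is a placeholder for any unrelated log output.
sample = (
    "some unrelated log line\n"
    "GeneratorJob: generated batch id: 1367327604-149897259\n"
)

batch_id = extract_batch_id(sample)
print(batch_id)
```

A script could capture a fresh id this way on every cycle and hand it to the
later steps through the -batchId arg, instead of reusing one fixed id, which
would also sidestep the growth described in point 2.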




On Tue, Apr 30, 2013 at 10:37 AM, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> Hi James,
> Please look for NUTCH-1545 capture batchid...
> If you could review and use this patch it would be very very helpful.
> thank you
> lewis
>
> On Tuesday, April 30, 2013, James Ford <si...@gmail.com> wrote:
> > Thanks for your answer!
> >
> > I think I will create my own modified crawl script then. But I am pretty
> > confused about how to get a generated batchId. Should I just parse the id
> > from the output:
> >
> > GeneratorJob: generated batch id: 1367327604-149897259
> >
> > Or should I get the newly generated batchId from the datastore in my
> > script? Any best practices?
> >
> > Thanks
>
> --
> *Lewis*
>

Re: Example crawl script Nutch 2.1

Posted by Tejas Patil <te...@gmail.com>.
Hi Bai Shen,

Thanks for your comments. Could you please add them to the relevant JIRA
issue [0] so that they get tracked?

[0] https://issues.apache.org/jira/browse/NUTCH-1545

Thanks,
Tejas

