Posted to user@nutch.apache.org by Gabriele Kahlout <ga...@mysimpatico.com> on 2011/08/28 05:24:12 UTC

How to generate multiple small segments w/o -numFetchers?

Hello,

Throughout the FAQ <http://wiki.apache.org/nutch/FAQ>, the bin/nutch
-numFetchers option is documented as a way to generate multiple small
segments. However, that option doesn't seem to be available in either 1.3 or
1.4. So should the FAQ be updated, or am I missing something? How else could
I generate multiple small segments?
I can see doing that with -topN, but that's less convenient.

-- 
Regards,
K. Gabriele

--- unchanged since 20/9/10 ---
P.S. If the subject contains "[LON]" or the addressee acknowledges the
receipt within 48 hours then I don't resend the email.
subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x)
< Now + 48h) ⇒ ¬resend(I, this).

If an email is sent by a sender that is not a trusted contact or the email
does not contain a valid code then the email is not received. A valid code
starts with a hyphen and ends with "X".
∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
L(-[a-z]+[0-9]X)).

Re: How to generate multiple small segments w/o -numFetchers?

Posted by Gabriele Kahlout <ga...@mysimpatico.com>.
On Sun, Aug 28, 2011 at 2:37 PM, lewis john mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> Hi Gabriele, can you expand on your last comment... are you running in
> deploy mode?
>

I'm running Nutch locally, from deploy/local/bin/nutch. In Fetcher.java
there's special code to set numFetchers to 1 for some reason.


>
> And to reply to your first point, yes, you are correct: the FAQs need
> extensive updating. Please feel free to change anything you feel necessary;
> however, as a matter of retaining knowledge for the legacy of Nutch, we are
> now moving deprecated/old information resources to the archive section of
> the wiki.
>

Actually I was wrong: I somehow thought -numFetchers was a bin/nutch fetch
option, but it was, understandably, a bin/nutch generate option.

It's a pity that it's not possible to break big segments into smaller ones
on local machines.



-- 
Regards,
K. Gabriele


Re: How to generate multiple small segments w/o -numFetchers?

Posted by lewis john mcgibbney <le...@gmail.com>.
Hi Gabriele, can you expand on your last comment... are you running in deploy
mode?

And to reply to your first point, yes, you are correct: the FAQs need
extensive updating. Please feel free to change anything you feel necessary;
however, as a matter of retaining knowledge for the legacy of Nutch, we are
now moving deprecated/old information resources to the archive section of
the wiki.



-- 
*Lewis*

Re: How to generate multiple small segments w/o -numFetchers?

Posted by Gabriele Kahlout <ga...@mysimpatico.com>.
But that's not a solution in local mode:

if ("local".equals(job.get("mapred.job.tracker")) && numLists != 1) {
  // override: in local mode the requested number of partitions is ignored
  LOG.info("Generator: jobtracker is 'local', generating exactly one partition.");
  numLists = 1;
}
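
To make the effect of that check concrete, here is a minimal stand-alone
sketch of the same condition. It is not Nutch code: the class name
LocalPartitionOverride, the method effectiveNumLists and the job tracker
address "jobtracker.example:9001" are hypothetical names chosen only for
illustration, and the sketch assumes the check behaves exactly as quoted above.

```java
// Hypothetical illustration only; this class does not exist in Nutch.
class LocalPartitionOverride {

    // Returns the number of partitions (fetch lists) that would actually be
    // used, given the value of "mapred.job.tracker" and the requested count.
    static int effectiveNumLists(String jobTracker, int requestedNumLists) {
        // Same condition as the quoted snippet: in local mode, any request
        // other than 1 is overridden to exactly one partition.
        if ("local".equals(jobTracker) && requestedNumLists != 1) {
            return 1;
        }
        return requestedNumLists;
    }

    public static void main(String[] args) {
        // Local mode: the request for 4 partitions is overridden.
        System.out.println(effectiveNumLists("local", 4));                   // prints 1
        // Any other job tracker value: the request is kept.
        System.out.println(effectiveNumLists("jobtracker.example:9001", 4)); // prints 4
    }
}
```

So with the job tracker set to "local", whatever value is requested ends up as
a single partition, which is why the option has no visible effect there.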


-- 
Regards,
K. Gabriele


Re: How to generate multiple small segments w/o -numFetchers?

Posted by Gabriele Kahlout <ga...@mysimpatico.com>.
It was a bin/nutch generate option.


-- 
Regards,
K. Gabriele
