Posted to dev@nutch.apache.org by Julien Nioche <li...@gmail.com> on 2011/03/27 14:34:30 UTC

http://wiki.apache.org/nutch/Tutorial%20on%20incremental%20crawling

Gabriele,

 I think it is a good idea to have a script like this; however, your proposal
could be improved. It currently works only on a single machine and uses
commands such as mv, ls, etc., which won't work on a pseudo or fully
distributed cluster. You should use the 'hadoop fs' commands instead.
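
For instance (paths purely illustrative, not taken from your script), the local
commands map onto 'hadoop fs' roughly like this:

  hadoop fs -ls crawl/crawldb                 # instead of: ls crawl/crawldb
  hadoop fs -mv crawl/crawldb old_crawldb     # instead of: mv crawl/crawldb old_crawldb
  hadoop fs -rmr old_crawldb                  # instead of: rm -r old_crawldb

These go through the Hadoop FileSystem API, so they behave the same whether the
data sits on the local disk or on HDFS.
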
If I understand the script correctly, you then merge different crawldbs. Why
do you do that? There should be one crawldb per crawl so I don't think this
is at all necessary.

Having a script would definitely be a plus for beginners and would give more
flexibility than the crawl command.

Thanks

Julien

-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

Re: http://wiki.apache.org/nutch/Tutorial%20on%20incremental%20crawling

Posted by Gabriele Kahlout <ga...@mysimpatico.com>.
OK, I've hadoopized the script, though I've tried it only locally.
I've rethought it (laziness convinced me) and decided not to include the indexer parameter.

On Mon, Mar 28, 2011 at 10:50 AM, Gabriele Kahlout <gabriele@mysimpatico.com
> wrote:

>
>
> On Mon, Mar 28, 2011 at 10:43 AM, Julien Nioche <
> lists.digitalpebble@gmail.com> wrote:
>
>> Hi Gabriele
>>
>>
>>>> you don't need to have 2 *and* 3. The hadoop commands will work on the
>>>> local fs in a completely transparent way, it all depends on the way hadoop
>>>> is configured. It isolates the way data are stored (local or distrib) from
>>>> the client code, i.e. Nutch. By adding a separate script using fs, you'd add
>>>> more confusion and lead beginners to think that they HAVE to use fs.
>>>>
>>>
>>> I apologize for not having yet looked into hadoop in detail but I had
>>> understood that it would abstract over the single machine fs.
>>>
>>
>> No problems. It would be worth spending a bit of time reading about Hadoop
>> if you want to get a better understanding of Nutch. Tom White's book is an
>> excellent reference but the wikis and tutorials would be a good start
>>
>>
>>
>>> However, to get up and running after downloading nutch will the script
>>> just work or will I have to configure hadoop? I assume the latter.
>>>
>>
>> Nope. By default Hadoop uses the local FS. Nutch relies on the Hadoop API
>> for getting its inputs, so when you run it as you did what actually happens
>> is that you are getting the data from the local FS via Hadoop.
>>
>
> I'll look into it and update the script accordingly.
>
>>
>>
>>> From a beginner perspective I like to reduce the magic (at first) and see
>>> through the commands, and get up and running asap.
>>> Hence 2. I'll be using 3.
>>>
>>
>> Hadoop already reduces the magic for you :-)
>>
>>
> Okay, if so I'll put the equivalent unix commands (mv/rm) in the comments of
> the hadoop commands and get rid of 2.
>
>
>>
>>>
>>>>
>>>> As for the legacy-lucene vs SOLR what about having a parameter to
>>>> determine which one should be used and have a single script?
>>>>
>>>>
>>> Excellent idea. The default is solr for 1 and 3, but if one passes the
>>> parameter 'll' it will use the legacy lucene. The default for 2 is 'll', since
>>> we want to get up and running fast (before knowing what solr is and setting it up).
>>>
>>
>> It would be nice to have a third possible value (i.e. none) for the
>> parameter -indexer (besides solr and lucene). A lot of people use Nutch as a
>> crawling platform but do not do any indexing
>>
>> agreed. Will add that too.
>
>
>>
>>>> Why do you want to get the info about ALL the urls? There is a readdb
>>>> -stats command which gives a summary of the content of the crawldb. If you
>>>> need to check a particular URL or domain, just use readdb -url and readdb
>>>> -regex (or whatever the name of the param is)
>>>>
>>>
>>>>
>>> At least when debugging/troubleshooting I found it useful to see which
>>> urls were fetched and the responses (robot_blocked, etc.).
>>> I can do that by examining each $it_crawldb in turn, since I don't know
>>> when that url was fetched (although since the fetching is pretty linear I
>>> could also work it out, something like index in seeds/urls / $it_size).
>>>
>>
>> better to do that by looking at the content of the segments using 'nutch
>> readseg -dump' or using 'hadoop fs -libjars nutch.job
>> segment/SEGMENTNUM/crawl_data' for instance. That's probably not something
>> that most people will want to do so maybe comment it out in your script?
>>
>>
>> running hadoop in pseudo-distributed mode and looking at the hadoop web
>> guis (http://localhost:50030) gives you a lot of information about
>> your crawl
>>
>> It would definitely be better to have a single crawldb in your script.
>>
>>
> Agreed, maybe again an option, with the default being none. But keep every
> $it_crawldb instead of deleting and merging them.
> I should be looking into the necessary Hadoop today and start updating the
> script accordingly.
>
> Julien
>>
>> --
>> Open Source Solutions for Text Engineering
>>
>> http://digitalpebble.blogspot.com/
>> http://www.digitalpebble.com
>>
>
>
>
> --
> Regards,
> K. Gabriele
>
> --- unchanged since 20/9/10 ---
> P.S. If the subject contains "[LON]" or the addressee acknowledges the
> receipt within 48 hours then I don't resend the email.
> subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧
> time(x) < Now + 48h) ⇒ ¬resend(I, this).
>
> If an email is sent by a sender that is not a trusted contact or the email
> does not contain a valid code then the email is not received. A valid code
> starts with a hyphen and ends with "X".
> ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
> L(-[a-z]+[0-9]X)).
>
>


-- 
Regards,
K. Gabriele

--- unchanged since 20/9/10 ---
P.S. If the subject contains "[LON]" or the addressee acknowledges the
receipt within 48 hours then I don't resend the email.
subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x)
< Now + 48h) ⇒ ¬resend(I, this).

If an email is sent by a sender that is not a trusted contact or the email
does not contain a valid code then the email is not received. A valid code
starts with a hyphen and ends with "X".
∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
L(-[a-z]+[0-9]X)).

Re: http://wiki.apache.org/nutch/Tutorial%20on%20incremental%20crawling

Posted by Gabriele Kahlout <ga...@mysimpatico.com>.
On Mon, Mar 28, 2011 at 10:43 AM, Julien Nioche <
lists.digitalpebble@gmail.com> wrote:

> Hi Gabriele
>
>
>>> you don't need to have 2 *and* 3. The hadoop commands will work on the
>>> local fs in a completely transparent way, it all depends on the way hadoop
>>> is configured. It isolates the way data are stored (local or distrib) from
>>> the client code, i.e. Nutch. By adding a separate script using fs, you'd add
>>> more confusion and lead beginners to think that they HAVE to use fs.
>>>
>>
>> I apologize for not having yet looked into hadoop in detail but I had
>> understood that it would abstract over the single machine fs.
>>
>
> No problems. It would be worth spending a bit of time reading about Hadoop
> if you want to get a better understanding of Nutch. Tom White's book is an
> excellent reference but the wikis and tutorials would be a good start
>
>
>
>> However, to get up and running after downloading nutch will the script
>> just work or will I have to configure hadoop? I assume the latter.
>>
>
> Nope. By default Hadoop uses the local FS. Nutch relies on the Hadoop API
> for getting its inputs, so when you run it as you did what actually happens
> is that you are getting the data from the local FS via Hadoop.
>

I'll look into it and update the script accordingly.

>
>
>> From a beginner perspective I like to reduce the magic (at first) and see
>> through the commands, and get up and running asap.
>> Hence 2. I'll be using 3.
>>
>
> Hadoop already reduces the magic for you :-)
>
>
Okay, if so I'll put the equivalent unix commands (mv/rm) in the comments of
the hadoop commands and get rid of 2.
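
Something along these lines, say ($it_crawldb is the script's variable; the
other paths are made up for illustration):

  hadoop fs -mv "$it_crawldb" crawl/crawldb   # local equivalent: mv "$it_crawldb" crawl/crawldb
  hadoop fs -rmr crawl/old_segments           # local equivalent: rm -r crawl/old_segments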


>
>>
>>>
>>> As for the legacy-lucene vs SOLR what about having a parameter to
>>> determine which one should be used and have a single script?
>>>
>>>
>> Excellent idea. The default is solr for 1 and 3, but if one passes the
>> parameter 'll' it will use the legacy lucene. The default for 2 is 'll', since
>> we want to get up and running fast (before knowing what solr is and setting it up).
>>
>
> It would be nice to have a third possible value (i.e. none) for the
> parameter -indexer (besides solr and lucene). A lot of people use Nutch as a
> crawling platform but do not do any indexing
>
> agreed. Will add that too.


>
>>> Why do you want to get the info about ALL the urls? There is a readdb
>>> -stats command which gives a summary of the content of the crawldb. If you
>>> need to check a particular URL or domain, just use readdb -url and readdb
>>> -regex (or whatever the name of the param is)
>>>
>>
>>>
>> At least when debugging/troubleshooting I found it useful to see which
>> urls were fetched and the responses (robot_blocked, etc.).
>> I can do that by examining each $it_crawldb in turn, since I don't know when
>> that url was fetched (although since the fetching is pretty linear I could
>> also work it out, something like index in seeds/urls / $it_size).
>>
>
> better to do that by looking at the content of the segments using 'nutch
> readseg -dump' or using 'hadoop fs -libjars nutch.job
> segment/SEGMENTNUM/crawl_data' for instance. That's probably not something
> that most people will want to do so maybe comment it out in your script?
>
>
> running hadoop in pseudo-distributed mode and looking at the hadoop web guis
> (http://localhost:50030) gives you a lot of information about your
> crawl
>
> It would definitely be better to have a single crawldb in your script.
>
>
Agreed, maybe again an option, with the default being none. But keep every
$it_crawldb instead of deleting and merging them.
I should be looking into the necessary Hadoop today and start updating the
script accordingly.

Julien
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
>



-- 
Regards,
K. Gabriele

--- unchanged since 20/9/10 ---
P.S. If the subject contains "[LON]" or the addressee acknowledges the
receipt within 48 hours then I don't resend the email.
subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x)
< Now + 48h) ⇒ ¬resend(I, this).

If an email is sent by a sender that is not a trusted contact or the email
does not contain a valid code then the email is not received. A valid code
starts with a hyphen and ends with "X".
∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
L(-[a-z]+[0-9]X)).

Re: http://wiki.apache.org/nutch/Tutorial%20on%20incremental%20crawling

Posted by Julien Nioche <li...@gmail.com>.
Hi Gabriele


>> you don't need to have 2 *and* 3. The hadoop commands will work on the
>> local fs in a completely transparent way, it all depends on the way hadoop
>> is configured. It isolates the way data are stored (local or distrib) from
>> the client code, i.e. Nutch. By adding a separate script using fs, you'd add
>> more confusion and lead beginners to think that they HAVE to use fs.
>>
>
> I apologize for not having yet looked into hadoop in detail but I had
> understood that it would abstract over the single machine fs.
>

No problems. It would be worth spending a bit of time reading about Hadoop
if you want to get a better understanding of Nutch. Tom White's book is an
excellent reference but the wikis and tutorials would be a good start



> However, to get up and running after downloading nutch will the script just
> work or will I have to configure hadoop? I assume the latter.
>

Nope. By default Hadoop uses the local FS. Nutch relies on the Hadoop API
for getting its inputs, so when you run it as you did what actually happens
is that you are getting the data from the local FS via Hadoop.


> From a beginner perspective I like to reduce the magic (at first) and see
> through the commands, and get up and running asap.
> Hence 2. I'll be using 3.
>

Hadoop already reduces the magic for you :-)


>
>
>>
>> As for the legacy-lucene vs SOLR what about having a parameter to
>> determine which one should be used and have a single script?
>>
>>
> Excellent idea. The default is solr for 1 and 3, but if one passes the
> parameter 'll' it will use the legacy lucene. The default for 2 is 'll', since
> we want to get up and running fast (before knowing what solr is and setting it up).
>

It would be nice to have a third possible value (i.e. none) for the
parameter -indexer (besides solr and lucene). A lot of people use Nutch as a
crawling platform but do not do any indexing
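
Something along these lines in the script, maybe (variable and path names are
only illustrative; the lucene branch assumes the 1.x-era bin/nutch index usage):

  # sketch only: $INDEXER would be the script's -indexer parameter
  case "$INDEXER" in
    solr)   bin/nutch solrindex "$SOLR_URL" crawl/crawldb crawl/linkdb crawl/segments/* ;;
    lucene) bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/* ;;
    none)   : ;;  # crawl only, no indexing
  esac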


>> Why do you want to get the info about ALL the urls? There is a readdb
>> -stats command which gives a summary of the content of the crawldb. If you
>> need to check a particular URL or domain, just use readdb -url and readdb
>> -regex (or whatever the name of the param is)
>>
>
>>
> At least when debugging/troubleshooting I found it useful to see which urls
> were fetched and the responses (robot_blocked, etc.).
> I can do that by examining each $it_crawldb in turn, since I don't know when
> that url was fetched (although since the fetching is pretty linear I could
> also work it out, something like index in seeds/urls / $it_size).
>

better to do that by looking at the content of the segments using 'nutch
readseg -dump' or using 'hadoop fs -libjars nutch.job
segment/SEGMENTNUM/crawl_data' for instance. That's probably not something
that most people will want to do so maybe comment it out in your script?
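
e.g. something like this for readseg (segment name made up):

  # dump one segment into a local directory for inspection;
  # the -no* flags trim the parts of the output you don't need
  bin/nutch readseg -dump crawl/segments/20110327120000 segdump -nocontent -noparsetext
  less segdump/dump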

running hadoop in pseudo-distributed mode and looking at the hadoop web guis
(http://localhost:50030) gives you a lot of information about your crawl

It would definitely be better to have a single crawldb in your script.

Julien

-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

Re: http://wiki.apache.org/nutch/Tutorial%20on%20incremental%20crawling

Posted by Gabriele Kahlout <ga...@mysimpatico.com>.
On Sun, Mar 27, 2011 at 7:03 PM, Julien Nioche <
lists.digitalpebble@gmail.com> wrote:

>
>  I think it is a good idea to have a script like this; however, your proposal
>>> could be improved. It currently works only on a single machine and uses
>>> commands such as mv, ls, etc., which won't work on a pseudo or fully
>>> distributed cluster. You should use the 'hadoop fs' commands instead.
>>>
>>
>> Okay, let's go for 3 editions:
>> 1. that's abridged and works only with solr (tersest script)
>> 2. unabridged with local fs - for beginners
>> 3. hadoop unabridged
>>
>
> you don't need to have 2 *and* 3. The hadoop commands will work on the
> local fs in a completely transparent way, it all depends on the way hadoop
> is configured. It isolates the way data are stored (local or distrib) from
> the client code, i.e. Nutch. By adding a separate script using fs, you'd add
> more confusion and lead beginners to think that they HAVE to use fs.
>

I apologize for not having yet looked into hadoop in detail but I had
understood that it would abstract over the single machine fs. However, to
get up and running after downloading nutch will the script just work or will
I have to configure hadoop? I assume the latter. From a beginner perspective
I like to reduce the magic (at first) and see through the commands, and get
up and running asap.
Hence 2. I'll be using 3.


>
> As for the legacy-lucene vs SOLR what about having a parameter to determine
> which one should be used and have a single script?
>
>
Excellent idea. The default is solr for 1 and 3, but if one passes the parameter
'll' it will use the legacy lucene. The default for 2 is 'll', since we want to
get up and running fast (before knowing what solr is and setting it up).


>
>>
>>> If I understand the script correctly, you then merge different crawldbs.
>>> Why do you do that? There should be one crawldb per crawl so I don't think
>>> this is at all necessary.
>>>
>>> So that I get a single dump with info about all the urls crawled. On the
>> scale of the web this is probably a bad idea, isn't it?
>>
>
> it would be a bad idea even on a medium scale. That sort of works on a
> single machine but as soon as you'd get a bit of data you'd fill the space
> on the disks + the whole thing would take ages.
>

> However the point still is that there should be only one crawldb per crawl
> and it contains all the urls you've injected / discovered
>
>
>>  But then how else could you inspect all the crawled urls at once?
>>
>
> Why do you want to get the info about ALL the urls? There is a readdb
> -stats command which gives an summary of the content of the crawldb. If you
> need to check a particular URL or domain, just use readdb -url and readdb
> -regex (or whatever the name of the param is)
>

>
At least when debugging/troubleshooting I found it useful to see which urls
were fetched and the responses (robot_blocked, etc.).
I can do that by examining each $it_crawldb in turn, since I don't know when
that url was fetched (although since the fetching is pretty linear I could
also work it out, something like index in seeds/urls / $it_size).

Now I don't know whether, with hadoop, a single file is stored on a single
machine. My expectation/hope was that the underlying fs loads into memory only
the portions of text I'm viewing (I've seen that around), and I don't know how
it'll handle Ctrl+F (maybe some index). And if it has disk space issues, it
breaks the file up onto several machines, transparently.


>
>>
>>> Having a script would definitely be a plus for beginners and would give
>>> more flexibility than the crawl command.
>>>
>>
>> I count myself the first of those beginners. Crawl is not recommended for
>> whole-web crawling, I guess because it doesn't work incrementally. Why not add
>> such an option to crawl? Shall I file a feature request/patch for that?
>>
>
> IMHO I'd rather have a good and reliable script to replace the Crawl
> command.
>

> Thanks for your contribution BTW
>

You're welcome. I like Apache's stuff, and thank you for saving me the trouble
of re-implementing a search engine atop much more limited frameworks!


> Julien
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
>



-- 
Regards,
K. Gabriele

--- unchanged since 20/9/10 ---
P.S. If the subject contains "[LON]" or the addressee acknowledges the
receipt within 48 hours then I don't resend the email.
subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x)
< Now + 48h) ⇒ ¬resend(I, this).

If an email is sent by a sender that is not a trusted contact or the email
does not contain a valid code then the email is not received. A valid code
starts with a hyphen and ends with "X".
∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
L(-[a-z]+[0-9]X)).

Re: http://wiki.apache.org/nutch/Tutorial%20on%20incremental%20crawling

Posted by Julien Nioche <li...@gmail.com>.
>  I think it is a good idea to have a script like this; however, your proposal
>> could be improved. It currently works only on a single machine and uses
>> commands such as mv, ls, etc., which won't work on a pseudo or fully
>> distributed cluster. You should use the 'hadoop fs' commands instead.
>>
>
> Okay, let's go for 3 editions:
> 1. that's abridged and works only with solr (tersest script)
> 2. unabridged with local fs - for beginners
> 3. hadoop unabridged
>

you don't need to have 2 *and* 3. The hadoop commands will work on the local
fs in a completely transparent way, it all depends on the way hadoop is
configured. It isolates the way data are stored (local or distrib) from the
client code, i.e. Nutch. By adding a separate script using fs, you'd add more
confusion and lead beginners to think that they HAVE to use fs.
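
e.g. with the default configuration (fs.default.name is file:///) both of these
list the same local directory (path made up):

  hadoop fs -ls crawl/crawldb                           # resolved against fs.default.name, i.e. the local fs here
  hadoop fs -ls file:///home/user/nutch/crawl/crawldb   # same data, with the scheme spelled out

Point the configuration at a real cluster and the very same commands work unchanged.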

As for the legacy-lucene vs SOLR what about having a parameter to determine
which one should be used and have a single script?


>
>
>> If I understand the script correctly, you then merge different crawldbs.
>> Why do you do that? There should be one crawldb per crawl so I don't think
>> this is at all necessary.
>>
>> So that I get a single dump with info about all the urls crawled. On the
> scale of the web this is probably a bad idea, isn't it?
>

it would be a bad idea even on a medium scale. That sort of works on a
single machine but as soon as you'd get a bit of data you'd fill the space
on the disks + the whole thing would take ages.

However the point still is that there should be only one crawldb per crawl
and it contains all the urls you've injected / discovered
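
i.e. each iteration just updates that same crawldb rather than creating a new
one to merge later, roughly (paths and segment name made up):

  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  SEGMENT=crawl/segments/20110327120000      # the segment created by the generate step
  bin/nutch fetch $SEGMENT
  bin/nutch parse $SEGMENT
  bin/nutch updatedb crawl/crawldb $SEGMENT  # folds the results back into the one crawldb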


> But then how else could you inspect all the crawled urls at once?
>

Why do you want to get the info about ALL the urls? There is a readdb -stats
command which gives a summary of the content of the crawldb. If you need to
check a particular URL or domain, just use readdb -url and readdb -regex (or
whatever the name of the param is)
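
e.g. (crawldb path and url made up):

  bin/nutch readdb crawl/crawldb -stats                     # status counts, score min/max/avg, etc.
  bin/nutch readdb crawl/crawldb -url http://example.com/   # the CrawlDatum for one particular url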


>
>
>> Having a script would definitely be a plus for beginners and would give
>> more flexibility than the crawl command.
>>
>
> I count myself the first of those beginners. Crawl is not recommended for
> whole-web crawling, I guess because it doesn't work incrementally. Why not add
> such an option to crawl? Shall I file a feature request/patch for that?
>

IMHO I'd rather have a good and reliable script to replace the Crawl
command. It does not help people understand the underlying steps; with the
script it would be easier to recover when there is a failure; and there are
other issues with it, e.g. the runaway parsing threads which are kept in the
VM. I am sure the all-in-one Crawl command has helped many a user, but the
script would do just as well.

Thanks for your contribution BTW

Julien

-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

Re: http://wiki.apache.org/nutch/Tutorial%20on%20incremental%20crawling

Posted by Gabriele Kahlout <ga...@mysimpatico.com>.
P.S.
I'm still modifying.

On Sun, Mar 27, 2011 at 2:34 PM, Julien Nioche <
lists.digitalpebble@gmail.com> wrote:

> Gabriele,
>
>  I think it is a good idea to have a script like this; however, your proposal
> could be improved. It currently works only on a single machine and uses
> commands such as mv, ls, etc., which won't work on a pseudo or fully
> distributed cluster. You should use the 'hadoop fs' commands instead.
>

Okay, let's go for 3 editions:
1. that's abridged and works only with solr (tersest script)
2. unabridged with local fs - for beginners
3. hadoop unabridged


> If I understand the script correctly, you then merge different crawldbs.
> Why do you do that? There should be one crawldb per crawl so I don't think
> this is at all necessary.
>
> So that I get a single dump with info about all the urls crawled. On the
scale of the web this is probably a bad idea, isn't it? But then how else
could you inspect all the crawled urls at once?


> Having a script would definitely be a plus for beginners and would give
> more flexibility than the crawl command.
>

I count myself the first of those beginners. Crawl is not recommended for
whole-web crawling, I guess because it doesn't work incrementally. Why not add
such an option to crawl? Shall I file a feature request/patch for that?

Thanks
>
> Julien
>
> P.S. I'm still modifying the page.


> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
>



-- 
Regards,
K. Gabriele

--- unchanged since 20/9/10 ---
P.S. If the subject contains "[LON]" or the addressee acknowledges the
receipt within 48 hours then I don't resend the email.
subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x)
< Now + 48h) ⇒ ¬resend(I, this).

If an email is sent by a sender that is not a trusted contact or the email
does not contain a valid code then the email is not received. A valid code
starts with a hyphen and ends with "X".
∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
L(-[a-z]+[0-9]X)).