Posted to user@nutch.apache.org by webdev1977 <we...@gmail.com> on 2012/03/22 13:53:02 UTC

crawl and update one url already in crawldb

I have created an application that detects when files are created, modified,
or deleted on one of our Windows share drives. Upon such a notification, is
it possible to crawl just a single URL that is already in the crawldb?

I think it is possible to run an individual new crawl for each URL, with the
goal of merging the linkdbs and crawldbs at some point (once a night), but I
wonder if there is a more efficient way of doing this.  The other obstacle is
that the main crawldb is part of a continuous looping crawl that technically
could never end (unless I force it to).  Would it be an issue to update a
database that could potentially be locked at any point in time?
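
For reference, the nightly merge I have in mind would look roughly like this
(an untested sketch; all paths are made up for illustration):

    # Hypothetical nightly merge of per-URL crawl results into the main
    # dbs (Nutch 1.x, local mode); crawl_tmp/* are the one-off crawls.
    bin/nutch mergedb crawl/crawldb_merged crawl/crawldb crawl_tmp/*/crawldb
    bin/nutch mergelinkdb crawl/linkdb_merged crawl/linkdb crawl_tmp/*/linkdb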

Thanks!


Re: crawl and update one url already in crawldb

Posted by Markus Jelsma <ma...@openindex.io>.
Use Hadoop (run the jobs on an actual cluster) or set hadoop.tmp.dir per job. If you don't, things will break.
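
In local mode that could look something like this (an untested sketch; the
temp path is illustrative, and the property can also be set in
nutch-site.xml):

    # Give each concurrent local-mode job its own Hadoop temp dir so the
    # jobs don't clobber each other's mapred files ($$ is the shell PID).
    export NUTCH_OPTS="-Dhadoop.tmp.dir=/tmp/nutch-$$"
    bin/nutch fetch crawl/segments/20120322130000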

On Thursday 22 March 2012 15:29:50 webdev1977 wrote:
> I just tried it out and so far, so good. Not a near-instant solution, but
> it works ;-)  One last question:
> 
> If I am running a bunch of bin/nutch commands from the same directory, I
> seem to be having an issue.  I am assuming it is with the mapred system
> and its various tmp files (running in local mode).  Is it possible to run
> multiple commands from the same Nutch directory without causing
> conflicts?

-- 
Markus Jelsma - CTO - Openindex

Re: crawl and update one url already in crawldb

Posted by webdev1977 <we...@gmail.com>.
I just tried it out and so far, so good. Not a near-instant solution, but it
works ;-)  One last question:

If I am running a bunch of bin/nutch commands from the same directory, I seem
to be having an issue.  I am assuming it is with the mapred system and its
various tmp files (running in local mode).  Is it possible to run multiple
commands from the same Nutch directory without causing conflicts?


Re: crawl and update one url already in crawldb

Posted by Markus Jelsma <ma...@openindex.io>.

On Thursday 22 March 2012 14:10:41 webdev1977 wrote:
> Thanks for the quick response, Markus!
> 
> How would that fit into this continuous crawling scenario? (I am trying to
> get the updates into Solr as quickly as possible :-)
> 
> If I am running the generate --> fetch $SEGMENT --> parse $SEGMENT -->
> updatedb crawldb $SEGMENT --> solrindex --> solrdedup cycle, and I happen
> to be generating an "on the fly" segment (not yet finished) when the
> updatedb command runs (changed to use the -dir option), isn't that bad?

You can just fetch and parse that tiny segment and fold it into the crawldb 
together with other segments; you don't have to update with only one segment 
at a time. -dir is OK, but you can also list the segments explicitly.
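
Something like this (the segment names are invented for the example):

    # Update the crawldb from the tiny freegen segment together with a
    # regular segment in a single pass:
    bin/nutch updatedb crawl/crawldb crawl/segments/20120322130000 \
        crawl/segments/20120322140000

    # Or point updatedb at the whole segments directory:
    bin/nutch updatedb crawl/crawldb -dir crawl/segments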


> Has anyone tested the mergedb command with potentially hundreds and
> hundreds of dbs to merge (one per changed URL)?

I wouldn't try that. It means more scripting and locking horror, and it's a 
heavy I/O consumer.


-- 
Markus Jelsma - CTO - Openindex

Re: crawl and update one url already in crawldb

Posted by webdev1977 <we...@gmail.com>.
Thanks for the quick response, Markus!

How would that fit into this continuous crawling scenario? (I am trying to
get the updates into Solr as quickly as possible :-)

If I am running the generate --> fetch $SEGMENT --> parse $SEGMENT -->
updatedb crawldb $SEGMENT --> solrindex --> solrdedup cycle, and I happen to
be generating an "on the fly" segment (not yet finished) when the updatedb
command runs (changed to use the -dir option), isn't that bad?
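
For concreteness, one iteration of my loop looks roughly like this (a
simplified, untested sketch; the paths, the Solr URL, and the Nutch 1.4-era
solrindex argument order are assumptions):

    # One pass of the continuous crawl cycle (local mode).
    bin/nutch generate crawl/crawldb crawl/segments -topN 1000
    # Pick up the segment that generate just created (newest directory).
    SEGMENT=crawl/segments/$(ls crawl/segments | sort | tail -1)
    bin/nutch fetch $SEGMENT
    bin/nutch parse $SEGMENT
    bin/nutch updatedb crawl/crawldb $SEGMENT
    bin/nutch solrindex http://localhost:8983/solr crawl/crawldb \
        crawl/linkdb $SEGMENT
    bin/nutch solrdedup http://localhost:8983/solr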

Has anyone tested the mergedb command with potentially hundreds and hundreds
of dbs to merge (one per changed URL)?


Re: crawl and update one url already in crawldb

Posted by Markus Jelsma <ma...@openindex.io>.

On Thursday 22 March 2012 13:53:02 webdev1977 wrote:
> I have created an application that detects when files are created,
> modified, or deleted on one of our Windows share drives. Upon such a
> notification, is it possible to crawl just a single URL that is already in
> the crawldb?
> 

Easiest would be to use the freegenerator tool to generate a segment from a 
plain text file with seed URLs, much like the injector. That segment can then 
later join other segments when updating the crawldb.
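
For example (an untested sketch; the directory names and the URL are made up
for illustration):

    # Write the changed URL(s) into a plain text file under a seed
    # directory, one URL per line...
    echo "http://fileserver/share/changed-doc.pdf" > urls/changed.txt
    # ...then generate a fetch-ready segment from that directory, much
    # like inject but without touching the crawldb:
    bin/nutch freegen urls crawl/segments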

> I think it is possible to run an individual new crawl for each URL, with
> the goal of merging the linkdbs and crawldbs at some point (once a night),
> but I wonder if there is a more efficient way of doing this.  The other
> obstacle is that the main crawldb is part of a continuous looping crawl
> that technically could never end (unless I force it to).  Would it be an
> issue to update a database that could potentially be locked at any point
> in time?
> 
> Thanks!

-- 
Markus Jelsma - CTO - Openindex