Posted to user@nutch.apache.org by David Philip <da...@gmail.com> on 2013/03/05 06:28:39 UTC

Re: Nutch Incremental Crawl

Hi Markus,

  So I tried the db.injector.update setting that you mentioned; please see my
observations below.
Settings: I set db.injector.update to true and db.fetch.interval.default to
1 hour.
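
For reference, this corresponds roughly to the following entries in
conf/nutch-site.xml (a sketch; db.fetch.interval.default is given in seconds,
so 1 hour = 3600):

<property>
  <name>db.injector.update</name>
  <value>true</value>
</property>

<property>
  <name>db.fetch.interval.default</name>
  <value>3600</value>
</property>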
Observation:

On the first crawl, 14 URLs were successfully crawled and indexed to Solr.
Case 1:
Among those 14 URLs I modified the content and title of one URL (say Aurl) and
re-executed the crawl after one hour.
I see in the log that this URL (Aurl) was re-fetched, but at the Solr level
the content and title fields for that URL did not get updated.
Why? Do I need any additional configuration to make the Solr index get updated?

Case 2:
I added a new URL to the site being crawled.
That URL got indexed, which is a success. So I am interested to know why the
first case failed. What configuration needs to be made?


Thanks - David


PS:
Apologies that I am still asking questions on the same topic. I have not been
able to find a good way to do incremental crawls, so I am trying different
approaches. Once I am clear I will blog about this and share it. Thanks a lot
for the replies on this list.
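
For reference, the per-URL override Markus suggested earlier (quoted below)
would be a seed-file line roughly like the following; this is a sketch, the
URL is an example, and the separator before the metadata must be a single
literal tab:

# urls/seed.txt (the gap before the metadata is one literal tab)
http://www.example.com/	nutch.fixedFetchInterval=86400

# re-inject so the metadata is applied to the existing crawldb record
# (requires db.injector.update=true)
bin/nutch inject crawl/crawldb urls/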







On Wed, Feb 27, 2013 at 4:06 PM, Markus Jelsma
<ma...@openindex.io> wrote:

> You can simply reinject the records.  You can overwrite and/or update the
> current record. See the db.injector.update and overwrite settings.
>
> -----Original message-----
> > From:David Philip <da...@gmail.com>
> > Sent: Wed 27-Feb-2013 11:23
> > To: user@nutch.apache.org
> > Subject: Re: Nutch Incremental Crawl
> >
> > Hi Markus, I meant overriding the injected interval. How do I override
> > the injected fetch interval?
> > While crawling, the fetch interval was set to 30 days (the default). Now I
> > want to re-fetch the same site (that is, force a re-fetch) without waiting
> > for the fetch interval (30 days) to pass. How can we do that?
> >
> >
> > Feng Lu : Thank you for the reference link.
> >
> > Thanks - David
> >
> >
> >
> > On Wed, Feb 27, 2013 at 3:22 PM, Markus Jelsma
> > <ma...@openindex.io> wrote:
> >
> > > The default or the injected interval? The default interval can be set
> > > in the config (see nutch-default for example). Per-URL intervals can be
> > > set using the injector: <URL>\tnutch.fixedFetchInterval=86400
> > >
> > >
> > > -----Original message-----
> > > > From:David Philip <da...@gmail.com>
> > > > Sent: Wed 27-Feb-2013 06:21
> > > > To: user@nutch.apache.org
> > > > Subject: Re: Nutch Incremental Crawl
> > > >
> > > > Hi all,
> > > >
> > > >   Thank you very much for the replies. Very useful information to
> > > > understand how incremental crawling can be achieved.
> > > >
> > > > Dear Markus:
> > > > Can you please tell me how I can override this fetch interval, in
> > > > case I need to fetch the page before the time interval has passed?
> > > >
> > > >
> > > >
> > > > Thanks very much
> > > > - David
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > On Thu, Feb 14, 2013 at 2:57 PM, Markus Jelsma
> > > > <ma...@openindex.io> wrote:
> > > >
> > > > > If you want records to be fetched at a fixed interval, it's easier
> > > > > to inject them with a fixed fetch interval.
> > > > >
> > > > > nutch.fixedFetchInterval=86400
> > > > >
> > > > >
> > > > >
> > > > > -----Original message-----
> > > > > > From:kemical <mi...@gmail.com>
> > > > > > Sent: Thu 14-Feb-2013 10:15
> > > > > > To: user@nutch.apache.org
> > > > > > Subject: Re: Nutch Incremental Crawl
> > > > > >
> > > > > > Hi David,
> > > > > >
> > > > > > You can also consider setting a shorter fetch interval with nutch
> > > > > > inject. This way you'll set a higher score (so the url is always
> > > > > > taken in priority when you generate a segment) and a fetch
> > > > > > interval of 1 day.
> > > > > >
> > > > > > If you have a case similar to mine, you'll often want some
> > > > > > homepages fetched each day but not their inlinks. What you can do
> > > > > > is inject all your seed urls again (assuming those urls are only
> > > > > > homepages).
> > > > > >
> > > > > > #change the nutch option so existing urls can be injected again,
> > > > > > #in conf/nutch-default.xml or conf/nutch-site.xml
> > > > > > db.injector.update=true
> > > > > >
> > > > > > #Add metadata to update the score/fetch interval
> > > > > > #the following appends the new score / new interval to each line
> > > > > > #of your seed url files
> > > > > > perl -pi -e 's/^(.*)\n$/$1\tnutch.score=100\tnutch.fetchInterval=80000\n/' [your_seed_url_dir]/*
> > > > > >
> > > > > > #run command
> > > > > > bin/nutch inject crawl/crawldb [your_seed_url_dir]
> > > > > >
> > > > > > Now, the following crawl will take your urls with top priority and
> > > > > > crawl them once a day. I've used my situation to illustrate the
> > > > > > concept but I guess you can tweak the params to fit your needs.
> > > > > >
> > > > > > This approach is useful when you want a regular fetch of some
> > > > > > urls; if it occurs rarely I guess freegen is the right choice.
> > > > > >
> > > > > > Best,
> > > > > > Mike
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > View this message in context:
> > > > > > http://lucene.472066.n3.nabble.com/Nutch-Incremental-Crawl-tp4037903p4040400.html
> > > > > > Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Nutch Incremental Crawl

Posted by feng lu <am...@gmail.com>.
Hi

<<
  I used less command and checked, it shows the past content , not modified
one. Any other cache clearing from crawl db? or any property to set in
nutch-site so that it  does re-fetch modified content?
>>
As far as I know, the crawl db does not use a cache. As Markus said, you can
simply reinject the records. Nutch does not know which web pages to re-fetch;
that is controlled only by the fetchInterval in the nutch-site configuration
file.

Perhaps the only reason I can think of is that the modified URL's fetch status
is db_notmodified, so Nutch will not download that URL again. Maybe you can
use this command to check the status of the modified URL: bin/nutch readdb
crawldb/ -url http://www.example.com/ . If its status is 6, it indicates that
the web page is considered not modified.
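
For example, roughly (a sketch; the URL and crawldb path are placeholders):

# inspect the CrawlDb entry for the page that was modified
bin/nutch readdb crawltest/crawldb -url http://www.example.com/modified-page.html

A status of 6 (db_notmodified) in the output would mean Nutch computed the
same page signature as before, so it treats the page as unchanged.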







-- 
Don't Grow Old, Grow Up... :-)

Re: Nutch Incremental Crawl

Posted by David Philip <da...@gmail.com>.
Hi,
  I used the less command and checked: it shows the past content, not the
modified one. Is there any other cache to clear besides the crawl db? Or any
property to set in nutch-site so that it does re-fetch modified content?


   - Cleared the Tomcat cache
   - Settings:

<property>
  <name>db.fetch.interval.default</name>
  <value>600</value>
</property>

<property>
  <name>db.injector.update</name>
  <value>true</value>
</property>



Crawl command: bin/nutch crawl urls -solr
http://localhost:8080/solrnutch -dir crawltest -depth 10
I executed this command after 1 hour (having modified some sites' content and
titles), but the new title and content were still not fetched. The dump
(readseg -dump) shows the old content only :(
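
For reference, my understanding is that the one-shot crawl command above
corresponds roughly to these individual steps (a sketch; paths follow my
setup above, and the segment is whichever one generate created last):

bin/nutch inject crawltest/crawldb urls
bin/nutch generate crawltest/crawldb crawltest/segments
SEGMENT=$(ls -d crawltest/segments/2* | tail -1)
bin/nutch fetch $SEGMENT
bin/nutch parse $SEGMENT
bin/nutch updatedb crawltest/crawldb $SEGMENT
bin/nutch invertlinks crawltest/linkdb -dir crawltest/segments
bin/nutch solrindex http://localhost:8080/solrnutch/ crawltest/crawldb -linkdb crawltest/linkdb $SEGMENT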


To update Solr separately, I executed this command: bin/nutch solrindex
http://localhost:8080/solrnutch/ crawltest/crawldb -linkdb crawltest/linkdb
crawltest/segments/* -deleteGone
but no success; nothing was updated in Solr.

Trace:
SolrIndexer: starting at 2013-03-05 17:07:15
SolrIndexer: deleting gone documents
Indexing 16 documents
Deleting 1 documents
SolrIndexer: finished at 2013-03-05 17:09:38, elapsed: 00:02:22

But after this, when I check in Solr (http://localhost:8080/solrnutch/),
it still shows 16 docs. Why can that be? I use Nutch 1.5.1 and Solr 3.6.


Thanks - David

P.S.
I basically want to achieve an on-demand re-crawl so that all modified
websites get updated in Solr, and so that when a user searches, he gets
accurate results.











Re: Nutch Incremental Crawl

Posted by feng lu <am...@gmail.com>.
Hi David

Yes, I mean the Tomcat web server cache.

The dump file can be opened with the "less" command if you use a Linux OS, or
you can use
"bin/nutch readseg -get segments/20130121115214/ http://www.cnbeta.com/" to
dump the information for a specific URL.
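
If the full dump is too noisy, readseg can also skip parts of the segment; a
sketch (the segment name is an example, and the flags are as I understand
readseg's options):

# dump only the parsed text, skipping raw content and fetch/generate data
bin/nutch readseg -dump crawltest/segments/20130304185844 dumpdir -nocontent -nofetch -nogenerate -noparse -noparsedata
less dumpdir/dump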







-- 
Don't Grow Old, Grow Up... :-)


Re: Nutch Incremental Crawl

Posted by David Philip <da...@gmail.com>.
Hi,

    Web server cache: do you mean /tomcat/work/, where Solr is running? Is
that the cache you meant?

I tried to use the command below {bin/nutch readseg -dump
crawltest/segments/20130304185844/ crawltest/test} and it produces a dump
file whose format is reported as GMC link (application/x-gmc-link). I am not
able to open it. How do I open this file?

However, when I ran: bin/nutch readseg -list
crawltest/segments/20130304185844/
NAME GENERATED FETCHER START FETCHER END FETCHED PARSED
20130304185844 1 2013-03-04T18:58:53 2013-03-04T18:58:53 1 1


- David






Re: Nutch Incremental Crawl

Posted by feng lu <am...@gmail.com>.
Hi David

Did you clear the web server cache? Maybe the re-fetch is also getting the
old page.

Maybe you can dump the URL content to check for the modification, using the
bin/nutch readseg command.
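
For example, something like this (a sketch; the segment path is a
placeholder, and the output should be a plain-text file named "dump" in the
output directory):

bin/nutch readseg -dump crawl/segments/20130304185844 dumpdir
less dumpdir/dump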

Thanks





-- 
Don't Grow Old, Grow Up... :-)