Posted to solr-user@lucene.apache.org by "Musshorn, Kris T CTR USARMY RDECOM ARL (US)" <kr...@mail.mil> on 2016/08/03 18:08:09 UTC

RE: [Non-DoD Source] Re: SOLR + Nutch set up (UNCLASSIFIED)

CLASSIFICATION: UNCLASSIFIED

Shall I assume that, even though Nutch has an adaptive capability, I would still have to figure out how to trigger it to look for content that needs updating?

Thanks,
Kris

~~~~~~~~~~~~~~~~~~~~~~~~~~
Kris T. Musshorn
FileMaker Developer - Contractor – Catapult Technology Inc.      
US Army Research Lab 
Aberdeen Proving Ground 
Application Management & Development Branch 
410-278-7251
kris.t.musshorn.ctr@mail.mil
~~~~~~~~~~~~~~~~~~~~~~~~~~


-----Original Message-----
From: Walter Underwood [mailto:wunder@wunderwood.org] 
Sent: Wednesday, August 03, 2016 2:03 PM
To: solr-user@lucene.apache.org
Subject: [Non-DoD Source] Re: SOLR + Nutch set up (UNCLASSIFIED)


That’s good news.

It should reset the interval estimate on page change instead of slowly shortening it.

I’m pretty sure that Ultraseek used a bounded exponential backoff when the page had not changed.
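
A minimal sketch of the two ideas above: reset the interval when the page changes, and back off exponentially up to a bound while it stays unchanged. The function name, the constants, and the choice to reset to the minimum interval are all assumptions for illustration. They are not taken from Ultraseek or Nutch.

```python
# Illustrative values only; none of these come from Ultraseek or Nutch.
MIN_INTERVAL = 60.0                # seconds; floor, and the value restored on change
MAX_INTERVAL = 365 * 24 * 3600.0   # seconds; ceiling that bounds the backoff
BACKOFF = 2.0                      # growth factor while the page stays unchanged


def next_interval(current: float, changed: bool) -> float:
    """Return the next revisit interval in seconds."""
    if changed:
        # Reset outright instead of shrinking by a small factor.
        return MIN_INTERVAL
    # Unchanged: grow exponentially, but never past the ceiling.
    return min(current * BACKOFF, MAX_INTERVAL)


interval = MIN_INTERVAL
history = []
for changed in [False, False, False, True, False]:
    interval = next_interval(interval, changed)
    history.append(interval)
print(history)  # [120.0, 240.0, 480.0, 60.0, 120.0]
```

With a reset, a single detected change brings the page straight back to frequent checks. Shrinking by a factor would take many fetches to get there.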

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Aug 3, 2016, at 10:51 AM, Marco Scalone <ma...@gmail.com> wrote:
> 
> Nutch also has an adaptive strategy:
> 
>> This class implements an adaptive re-fetch algorithm. This works as
>> follows:
>> 
>>   - for pages that have changed since the last fetchTime, decrease their
>>   fetchInterval by a factor of DEC_FACTOR (default value is 0.2f).
>>   - for pages that haven't changed since the last fetchTime, increase
>>   their fetchInterval by a factor of INC_FACTOR (default value is 0.2f).
>>   If SYNC_DELTA property is true, then:
>>      - calculate a delta = fetchTime - modifiedTime
>>      - try to synchronize with the time of change, by shifting the next
>>      fetchTime by a fraction of the difference between the last modification
>>      time and the last fetch time. I.e. the next fetch time will be set to fetchTime
>>      + fetchInterval - delta * SYNC_DELTA_RATE
>>      - if the adjusted fetch interval is bigger than the delta, then fetchInterval
>>      = delta.
>>   - the minimum value of fetchInterval may not be smaller than
>>   MIN_INTERVAL (default is 1 minute).
>>   - the maximum value of fetchInterval may not be bigger than
>>   MAX_INTERVAL (default is 365 days).
>> 
>> NOTE: values of DEC_FACTOR and INC_FACTOR higher than 0.4f may 
>> destabilize the algorithm, so that the fetch interval either 
>> increases or decreases infinitely, with little relevance to the page 
>> changes. Please use
>> main(String[])
>> <https://nutch.apache.org/apidocs/apidocs-1.2/org/apache/nutch/crawl/AdaptiveFetchSchedule.html#main%28java.lang.String[]%29>
>> method to test the values before applying them in a production system.
>> 
> 
> From:
> https://nutch.apache.org/apidocs/apidocs-1.2/org/apache/nutch/crawl/AdaptiveFetchSchedule.html
> 
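
The rules in the Javadoc quoted above can be sketched roughly as follows. This is a literal reading of the quoted text, not the actual Nutch class, and the real implementation may differ in detail. The function name, the use of seconds, and the SYNC_DELTA_RATE value are assumptions. "By a factor of" is read here as multiplying by (1 - DEC_FACTOR) or (1 + INC_FACTOR).

```python
from typing import Tuple

# Defaults as stated in the quoted Javadoc; all times are in seconds.
INC_FACTOR = 0.2
DEC_FACTOR = 0.2
MIN_INTERVAL = 60.0                # 1 minute
MAX_INTERVAL = 365 * 24 * 3600.0   # 365 days
# Assumed value for illustration; the quoted text gives no default for it.
SYNC_DELTA_RATE = 0.3


def reschedule(fetch_time: float, modified_time: float, interval: float,
               changed: bool, sync_delta: bool = False) -> Tuple[float, float]:
    """Return (next_fetch_time, new_interval) per the quoted rules."""
    if changed:
        interval *= 1.0 - DEC_FACTOR   # page changed: revisit sooner
    else:
        interval *= 1.0 + INC_FACTOR   # page unchanged: revisit later

    shift = 0.0
    if sync_delta:
        # Try to move the next fetch closer to the time of change.
        delta = fetch_time - modified_time
        shift = delta * SYNC_DELTA_RATE
        if interval > delta:
            interval = delta

    # Keep the interval within [MIN_INTERVAL, MAX_INTERVAL].
    interval = max(MIN_INTERVAL, min(interval, MAX_INTERVAL))
    return fetch_time + interval - shift, interval


print(reschedule(10000.0, 0.0, 1000.0, changed=True))
print(reschedule(10000.0, 9500.0, 1000.0, changed=True, sync_delta=True))
```

Because each fetch changes the interval by only 20% in this reading, a page that starts changing often needs several fetches before its interval shrinks noticeably.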
> 
> 2016-08-03 14:45 GMT-03:00 Walter Underwood <wu...@wunderwood.org>:
> 
>> I’m pretty sure Nutch uses a batch crawler instead of the adaptive 
>> crawler in Ultraseek.
>> 
>> I think we were the only people who built an adaptive crawler for 
>> enterprise use. I tried to get Ultraseek open-sourced. I made the 
>> argument to Mike Lynch. He looked at me like I had three heads and 
>> didn’t even answer me.
>> 
>> Ultraseek also has great support for sites that need login. If you 
>> use that, you’ll need to find a way to do that with another crawler.
>> 
>> wunder
>> Walter Underwood
>> Former Ultraseek Principal Engineer
>> wunder@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>> 
>>> On Aug 3, 2016, at 10:12 AM, Musshorn, Kris T CTR USARMY RDECOM ARL (US) <kr...@mail.mil> wrote:
>>> 
>>> CLASSIFICATION: UNCLASSIFIED
>>> 
>>> We are currently using ultraseek and looking to deprecate it in favor of solr/nutch.
>>> Ultraseek runs all the time and auto detects when pages have changed and automatically reindexes them.
>>> Is this possible with SOLR/nutch?
>>> 
>>> Thanks,
>>> Kris
>>> 
>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~
>>> Kris T. Musshorn
>>> FileMaker Developer - Contractor - Catapult Technology Inc.
>>> US Army Research Lab
>>> Aberdeen Proving Ground
>>> Application Management & Development Branch
>>> 410-278-7251
>>> kris.t.musshorn.ctr@mail.mil
>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~
>>> 
>>> 
>>> 
>>> CLASSIFICATION: UNCLASSIFIED
>> 
>> 
