Posted to dev@nutch.apache.org by "Sami Siren (JIRA)" <ji...@apache.org> on 2009/03/02 13:32:17 UTC

[jira] Resolved: (NUTCH-669) Consolidate code for Fetcher and Fetcher2

     [ https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sami Siren resolved NUTCH-669.
------------------------------

    Resolution: Fixed

replaced fetcher with fetcher2

> Consolidate code for Fetcher and Fetcher2
> -----------------------------------------
>
>                 Key: NUTCH-669
>                 URL: https://issues.apache.org/jira/browse/NUTCH-669
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 0.9.0
>            Reporter: Todd Lipcon
>            Assignee: Sami Siren
>             Fix For: 1.0.0
>
>
> I'd like to consolidate a lot of the common code between Fetcher and Fetcher2.java.
> It seems to me like there are the following differences:
>   - Fetcher relies on the Protocol to obey robots.txt and crawl delay settings whereas Fetcher2 implements them itself
>   - Fetcher2 uses a different queueing model (queue per crawl host) to accomplish the per-host limiting without making the Protocol do it.
> I've begun work on this but want to check with people on the following:
> - What reason is there for Fetcher existing at all since Fetcher2 seems to be a superset of functionality?
> - Is it on the road map to remove the robots/delay logic from the Http protocol and make Fetcher2's delegation of duties the standard?
> - Any other improvements wanted for Fetcher while I am in and around the code?
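
For readers unfamiliar with the "queue per crawl host" model described above, here is a minimal sketch in Java. It is not taken from Fetcher2 or from Todd's branch; the class name PerHostQueues, its methods, and the fixed crawl delay are all hypothetical and only illustrate how per-host limiting can live in the fetcher rather than in the Protocol.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/**
 * Hypothetical illustration (not Nutch code): one FIFO queue per host,
 * plus the earliest time at which each host may be fetched again.
 */
class PerHostQueues {
    private final long crawlDelayMillis;
    private final Map<String, Deque<String>> queues = new LinkedHashMap<>();
    private final Map<String, Long> nextFetchTime = new LinkedHashMap<>();

    PerHostQueues(long crawlDelayMillis) {
        this.crawlDelayMillis = crawlDelayMillis;
    }

    /** Appends the URL to the queue of its host. */
    void add(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            throw new IllegalArgumentException("URL has no host: " + url);
        }
        queues.computeIfAbsent(host.toLowerCase(Locale.ROOT), h -> new ArrayDeque<>())
              .add(url);
    }

    /**
     * Returns the next URL whose host is allowed to be fetched at nowMillis,
     * or null if every non-empty queue is still inside its crawl delay.
     */
    String poll(long nowMillis) {
        for (Map.Entry<String, Deque<String>> entry : queues.entrySet()) {
            Deque<String> queue = entry.getValue();
            if (queue.isEmpty()) {
                continue;
            }
            long readyAt = nextFetchTime.getOrDefault(entry.getKey(), Long.MIN_VALUE);
            if (nowMillis >= readyAt) {
                // Reserve the host until the crawl delay has elapsed.
                nextFetchTime.put(entry.getKey(), nowMillis + crawlDelayMillis);
                return queue.poll();
            }
        }
        return null;
    }
}
```

With a delay of 1000 ms and two URLs queued for one host, the second URL is only handed out once the clock passed to poll has advanced by at least 1000 ms, while URLs for other hosts remain available in the meantime. A real fetcher would also need robots.txt handling and thread safety, which this sketch leaves out.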



Re: [jira] Resolved: (NUTCH-669) Consolidate code for Fetcher and Fetcher2

Posted by Todd Lipcon <tl...@gmail.com>.
Hey guys,
Sorry for the non-responsiveness here. I recently left my old employment and
have been packing for a cross-country move.

I agree that for 1.0 the best bet is what Sami has done. The code that I was
working on is available here:

http://github.com/toddlipcon/nutch/tree/nutch-669

But it is not production ready - notably there's a problem whereby it runs
out of memory even with a reasonably large heap.

I'm not sure I'll be able to finish working on it, given that the cluster
(and workload) I was using for testing belonged to my old job, but I'm happy
to help anyone understand the work I began if you'd like to try to integrate
it for 1.1.

-Todd

On Mon, Mar 2, 2009 at 9:48 AM, Andrzej Bialecki <ab...@getopt.org> wrote:

> Sami Siren wrote:
>
>> Andrzej Bialecki wrote:
>>
>>> Sami Siren (JIRA) wrote:
>>>
>>>>     [
>>>> https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel]
>>>>
>>>> Sami Siren resolved NUTCH-669.
>>>> ------------------------------
>>>>
>>>>    Resolution: Fixed
>>>>
>>>> replaced fetcher with fetcher2
>>>>
>>>
>>> I'm puzzled ..  it seemed the goal was to integrate Todd's patch, which
>>> effectively replaces both Fetchers. Does this mean that Todd's version was
>>> not ready, or is the current code based on Todd's version?
>>>
>> I never saw a patch from Todd; he never provided one even after being
>> asked multiple times: first by you in Dec 2008, then by Dogacan in Jan
>> 2009, and finally by me last week.
>>
>> My motivation to get this fixed was, as I understood most of the
>> developers thought too, to get rid of the burden of supporting two classes
>> providing roughly the same piece of functionality. I opened a jira for this
>> but closed it soon after as you told me it was a duplicate of this one.
>>
>> So, what I did was: replaced original Fetcher with Fetcher2. The Fetcher
>> is still there to be improved by Todd and others at will.
>>
>
> Ok, I understand now - given the circumstances I agree this was the right
> thing to do.
>
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>

Re: [jira] Resolved: (NUTCH-669) Consolidate code for Fetcher and Fetcher2

Posted by Andrzej Bialecki <ab...@getopt.org>.
Sami Siren wrote:
> Andrzej Bialecki wrote:
>> Sami Siren (JIRA) wrote:
>>>      [ 
>>> https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel 
>>> ]
>>>
>>> Sami Siren resolved NUTCH-669.
>>> ------------------------------
>>>
>>>     Resolution: Fixed
>>>
>>> replaced fetcher with fetcher2
>>
>> I'm puzzled ..  it seemed the goal was to integrate Todd's patch, 
>> which effectively replaces both Fetchers. Does this mean that Todd's 
>> version was not ready, or is the current code based on Todd's version?
> I never saw a patch from Todd; he never provided one even after being 
> asked multiple times: first by you in Dec 2008, then by Dogacan in Jan 
> 2009, and finally by me last week.
> 
> My motivation to get this fixed was, as I understood most of the 
> developers thought too, to get rid of the burden of supporting two 
> classes providing roughly the same piece of functionality. I opened a 
> jira for this but closed it soon after as you told me it was a duplicate 
> of this one.
> 
> So, what I did was: replaced original Fetcher with Fetcher2. The Fetcher 
> is still there to be improved by Todd and others at will.

Ok, I understand now - given the circumstances I agree this was the 
right thing to do.



Re: [jira] Resolved: (NUTCH-669) Consolidate code for Fetcher and Fetcher2

Posted by Sami Siren <ss...@gmail.com>.
Andrzej Bialecki wrote:
> Sami Siren (JIRA) wrote:
>>      [ 
>> https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel 
>> ]
>>
>> Sami Siren resolved NUTCH-669.
>> ------------------------------
>>
>>     Resolution: Fixed
>>
>> replaced fetcher with fetcher2
>
> I'm puzzled ..  it seemed the goal was to integrate Todd's patch, 
> which effectively replaces both Fetchers. Does this mean that Todd's 
> version was not ready, or is the current code based on Todd's version?
I never saw a patch from Todd; he never provided one even after being 
asked multiple times: first by you in Dec 2008, then by Dogacan in Jan 
2009, and finally by me last week.

My motivation to get this fixed was, as I understood most of the 
developers thought too, to get rid of the burden of supporting two 
classes providing roughly the same piece of functionality. I opened a 
jira for this but closed it soon after as you told me it was a duplicate 
of this one.

So, what I did was: replaced original Fetcher with Fetcher2. The Fetcher 
is still there to be improved by Todd and others at will.

--
 Sami Siren

Re: [jira] Resolved: (NUTCH-669) Consolidate code for Fetcher and Fetcher2

Posted by Andrzej Bialecki <ab...@getopt.org>.
Sami Siren (JIRA) wrote:
>      [ https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
> 
> Sami Siren resolved NUTCH-669.
> ------------------------------
> 
>     Resolution: Fixed
> 
> replaced fetcher with fetcher2

I'm puzzled ..  it seemed the goal was to integrate Todd's patch, which 
effectively replaces both Fetchers. Does this mean that Todd's version 
was not ready, or is the current code based on Todd's version?

