Posted to dev@nutch.apache.org by Lewis John Mcgibbney <le...@gmail.com> on 2012/01/20 18:16:07 UTC

[DISCUSS] Issues with Fetcher

Hi Everyone,

Since Eddie decided to chap in on the dev lists/Jira we have not been able
to get back to him. I'm referring specifically to NUTCH-1201 and his
comments therewith.

Doing a quick rekkie on the current fetcher issues I can see 32 issues with
7 of them claiming to be patched up... this kinda indicates that although
there are underlying problems with the fetcher we are currently not getting
the time to address them. It also indicates that there is quite a bit of
work to be done with the fetcher...

Has anyone had time to consider Eddie's comments or proposals for taking
the work forward? The last thing we would like to see is him allocating his
time elsewhere if we could have a real go at building a more appropriate
fetcher architecture (pluggable, etc).

I was thinking to myself all week that we would seriously be passing up an
opportunity if we didn't try to act on this one.

Thanks guys.

-- 
*Lewis*

Re: [DISCUSS] Issues with Fetcher

Posted by Ken Krugler <kk...@transpac.com>.
Thanks for the poke - I'd started writing up a comment to that issue, but got sidetracked by the day job.

-- Ken

On Jan 20, 2012, at 9:16am, Lewis John Mcgibbney wrote:

> Hi Everyone,
> 
> Since Eddie decided to chap in on the dev lists/Jira we have not been able to get back to him. I'm referring specifically to NUTCH-1201 and his comments therewith.
> 
> Doing a quick rekkie on the current fetcher issues I can see 32 issues with 7 of them claiming to be patched up... this kinda indicates that although there are underlying problems with the fetcher we are currently not getting the time to address them. It also indicates that there is quite a bit of work to be done with the fetcher...
> 
> Has anyone had time to consider Eddie's comments or proposals for taking the work forward? The last thing we would like to see is him allocating his time elsewhere if we could have a real go at building a more appropriate fetcher architecture (pluggable, etc).
> 
> I was thinking to myself all week that we would seriously be passing up an opportunity if we didn't try to act on this one.
> 
> Thanks guys. 
> 
> -- 
> Lewis 
> 

--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr





Re: [DISCUSS] Issues with Fetcher

Posted by Julien Nioche <li...@gmail.com>.
> Or, if you have experience with JSPs/GUI work, then I think there's this
> big open issue around improving the Nutch GUI, which would likely provide
> the most benefit to the most users. I haven't been following the current
> status, but I know that there have been periodic discussions, and I think
> 101tec did some work on this a while back (for a client), but I don't know
> if that's been contributed (or could be, for that matter).
>

A related issue is porting the REST-API from nutchgora to trunk (
https://issues.apache.org/jira/browse/NUTCH-880) which in turn could be
used by a GUI.

J.



>
> -- Ken
>
> On Jan 21, 2012, at 8:17am, Edward Drapkin wrote:
>
>  On 1/21/2012 8:27 AM, Lewis John Mcgibbney wrote:
>
> Hi Julien,
>
>
>> There are 8 issues in trunk about the fetcher - some of them unrelated
>> to the Fetcher (NUTCH-827 <https://issues.apache.org/jira/browse/NUTCH-827> /
>> NUTCH-1193) with most of the others being improvements (NUTCH-828
>> <https://issues.apache.org/jira/browse/NUTCH-828> / NUTCH-1079
>> <https://issues.apache.org/jira/browse/NUTCH-1079>) with possibly just a
>> very few being real issues.
>
>
> This puts the whole discussion into much better context, thanks for
> pointing this out. Maybe I should have made it more clear, that I only
> filtered the fetcher issues on our Jira and I was simply modelling my
> discussion around that. You are completely correct though, it would be
> different if the fetcher was in a similar state to protocol-httpclient...
> which it is obviously not.
>
>
>> I am also concerned about getting too radical changes to such a core part
>> of the framework, especially when more pressing issues could be looked
>> after instead.
>
> +1
>
>
>> Having said that if someone can come up with an interesting proposal for
>> improving the Fetcher that would be very good, I would simply suggest that
>> we then have a separate implementation for that.
>>
> +1
>
>
> Ok with this in mind then, is there some guidance we can communicate to
> Eddie? He has specifically mentioned that he shares similar opinions wrt
> the fetcher being a core part of Nutch, radical changes etc, and I also
> share this point of view. He has also added that he doesn't want to spend
> the time changing material which we may or may not merge with trunk, this
> also makes perfect sense. Additionally Ken's comments emphasise that this
> has been somewhat attempted in the past and that lessons have been learned
> and the implementation we have cuts the mustard as is.
> Maybe we could nudge Eddie in the right direction, which would benefit
> both himself and the project over the next while, I think this was the most
> important point I was trying to emphasise, however looking over my original
> comment this was maybe not how it was written.
>
> Thanks
> Lewis
>
>
> If there's more important and/or interesting things for me to work on,
> I'll be glad to.  I'm completely unfamiliar with the current state of the
> project as a whole - and looking through JIRA is a bit daunting.  The only
> reason I'm attracted to working on the fetcher is I think it's a really
> interesting and compelling problem to solve, and making it more
> flexible is something that would directly benefit our use for it, so it
> will be easier to devote time to it while I'm at the office.  I do have a
> glut of free time at the moment though, so I'm perfectly okay working on
> another area that's more pressing - I just don't know what it is.  I saw
> that protocol-httpclient needs to be rewritten, is there someone working on
> that?
>
> I can work on more important and less controversial / radical things, but
> I do think that having a more flexible, pluggable fetcher will be an
> enormous improvement to Nutch and can greatly expand the potential uses for
> it as a piece of software.  There's a ton of cases where pluggable fetching
> could have a huge improvement: local filesystem search, single-threaded /
> small site indexing, email indexing (SMTP, POP, etc.), etc.  I suggested an
> extremely (perhaps too much so) abstract architecture for fetching in ticket
> #1201, and for the sake of brevity I won't repeat myself here, but I think
> that would give Nutch a good base for flexible fetching, which I believe is
> a huge improvement to the project.  I'm obviously new to the development
> here and I'm willing to do whatever needs doing, I just believe the fetching
> is something that needs doing.  I just want to contribute!
>
> Thanks,
> Eddie
>
>
> --------------------------
> Ken Krugler
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Mahout & Solr
>
>
>
>
>


-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

Re: [DISCUSS] Issues with Fetcher

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hi Ken,

On Jan 21, 2012, at 10:33 AM, Ken Krugler wrote:
> 
> My own personal favorite area would be to integrate with crawler-commons.

+1. Would you crawler-commons guys be interested in bringing that code to Apache?
How about bringing it over to Nutch? 

Would that be something you'd be interested in?

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattmann@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


Re: [DISCUSS] Issues with Fetcher

Posted by Ken Krugler <kk...@transpac.com>.
Hi Eddie,

My own personal favorite area would be to integrate with crawler-commons.

There's been some occasional work done to move things into this shared project - e.g. robots parser & a base HTTP fetcher from Bixo.

I believe there's a Jira issue open to switch Nutch to using that robots.txt parser, which would be an improvement over what Nutch currently has.

There are other pieces of Nutch that could/eventually should be moved there, e.g. URL normalization, but that doesn't directly benefit Nutch, just other Java-based crawlers.

Or, if you have experience with JSPs/GUI work, then I think there's this big open issue around improving the Nutch GUI, which would likely provide the most benefit to the most users. I haven't been following the current status, but I know that there have been periodic discussions, and I think 101tec did some work on this a while back (for a client), but I don't know if that's been contributed (or could be, for that matter).

-- Ken

On Jan 21, 2012, at 8:17am, Edward Drapkin wrote:

> On 1/21/2012 8:27 AM, Lewis John Mcgibbney wrote:
>> 
>> Hi Julien,
>> 
>> 
>> There are 8 issues in trunk about the fetcher - some of them unrelated to the Fetcher (NUTCH-827 / Nutch-1193) with most of the others being improvements (NUTCH-828 / NUTCH-1079) with possibly just a very few being real issues.
>>  
>> This puts the whole discussion into much better context, thanks for pointing this out. Maybe I should have made it more clear, that I only filtered the fetcher issues on our Jira and I was simply modelling my discussion around that. You are completely correct though, it would be different if the fetcher was in a similar state to protocol-httpclient... which it is obviously not.
>>  
>> I am also concerned about getting too radical changes to such a core part of the framework, especially when more pressing issues could be looked after instead.
>> +1
>>  
>> Having said that if someone can come up with an interesting proposal for improving the Fetcher that would be very good, I would simply suggest that we then have a separate implementation for that.
>> +1
>>  
>> 
>> 
>> Ok with this in mind then, is there some guidance we can communicate to Eddie? He has specifically mentioned that he shares similar opinions wrt the fetcher being a core part of Nutch, radical changes etc, and I also share this point of view. He has also added that he doesn't want to spend the time changing material which we may or may not merge with trunk, this also makes perfect sense. Additionally Ken's comments emphasise that this has been somewhat attempted in the past and that lessons have been learned and the implementation we have cuts the mustard as is. 
>> Maybe we could nudge Eddie in the right direction, which would benefit both himself and the project over the next while, I think this was the most important point I was trying to emphasise, however looking over my original comment this was maybe not how it was written.
>> 
>> Thanks
>> Lewis
> 
> If there's more important and/or interesting things for me to work on, I'll be glad to.  I'm completely unfamiliar with the current state of the project as a whole - and looking through JIRA is a bit daunting.  The only reason I'm attracted to working on the fetcher is I think it's a really interesting and compelling problem to solve, and making it more flexible is something that would directly benefit our use for it, so it will be easier to devote time to it while I'm at the office.  I do have a glut of free time at the moment though, so I'm perfectly okay working on another area that's more pressing - I just don't know what it is.  I saw that protocol-httpclient needs to be rewritten, is there someone working on that?
> 
> I can work on more important and less controversial / radical things, but I do think that having a more flexible, pluggable fetcher will be an enormous improvement to Nutch and can greatly expand the potential uses for it as a piece of software.  There's a ton of cases where pluggable fetching could have a huge improvement: local filesystem search, single-threaded / small site indexing, email indexing (SMTP, POP, etc.), etc.  I suggested an extremely (perhaps too much so) abstract architecture for fetching in ticket #1201, and for the sake of brevity I won't repeat myself here, but I think that would give Nutch a good base for flexible fetching, which I believe is a huge improvement to the project.  I'm obviously new to the development here and I'm willing to do whatever needs doing, I just believe the fetching is something that needs doing.  I just want to contribute!
> 
> Thanks,
> Eddie

--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr





Re: [DISCUSS] Issues with Fetcher

Posted by Julien Nioche <li...@gmail.com>.
>
> If there's more important and/or interesting things for me to work on,
> I'll be glad to.  I'm completely unfamiliar with the current state of the
> project as a whole - and looking through JIRA is a bit daunting.  The only
> reason I'm attracted to working on the fetcher is I think it's a really
> interesting and compelling problem to solve, and making it more
> flexible is something that would directly benefit our use for it, so it
> will be easier to devote time to it while I'm at the office.  I do have a
> glut of free time at the moment though, so I'm perfectly okay working on
> another area that's more pressing - I just don't know what it is.  I saw
> that protocol-httpclient needs to be rewritten, is there someone working on
> that?
>

not that I am aware of.


>
> I can work on more important and less controversial / radical things, but
> I do think that having a more flexible, pluggable fetcher will be an
> enormous improvement to Nutch and can greatly expand the potential uses for
> it as a piece of software.  There's a ton of cases where pluggable fetching
> could have a huge improvement: local filesystem search, single-threaded /
> small site indexing, email indexing (SMTP, POP, etc.), etc.
>

isn't this already done at the protocol level?


>   I suggested an extremely (perhaps too much so) abstract architecture for
> fetching in ticket #1201, and for the sake of brevity I won't repeat myself
> here, but I think that would give Nutch a good base for flexible fetching,
> which I believe is a huge improvement to the project.  I'm obviously new to
> the development here and I'm willing to do whatever needs doing, I just
> believe the fetching is something that needs doing.  I just want to
> contribute!
>

you are of course free to work on anything you want and your contribution
would be more than welcome. I just reacted to Lewis' comments because I did
not want people to have the impression that the Fetcher was broken, and I also
see more urgent and useful things to do, but that's just my personal view.

Thanks

Julien


-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

Re: [DISCUSS] Issues with Fetcher

Posted by Edward Drapkin <ed...@wolfram.com>.
On 1/21/2012 8:27 AM, Lewis John Mcgibbney wrote:
> Hi Julien,
>
>
>     There are 8 issues in trunk about the fetcher - some of them
>     unrelated to the Fetcher (NUTCH-827
>     <https://issues.apache.org/jira/browse/NUTCH-827> / Nutch-1193)
>     with most of the others being improvements (NUTCH-828
>     <https://issues.apache.org/jira/browse/NUTCH-828> / NUTCH-1079
>     <https://issues.apache.org/jira/browse/NUTCH-1079>) with possibly
>     just a very few being real issues. 
>
> This puts the whole discussion into much better context, thanks for 
> pointing this out. Maybe I should have made it more clear, that I only 
> filtered the fetcher issues on our Jira and I was simply modelling my 
> discussion around that. You are completely correct though, it would be 
> different if the fetcher was in a similar state to 
> protocol-httpclient... which it is obviously not.
>
>     I am also concerned about getting too radical changes to such a
>     core part of the framework, especially when more pressing issues
>     could be looked after instead. 
>
> +1
>
>     Having said that if someone can come up with an interesting
>     proposal for improving the Fetcher that would be very good, I
>     would simply suggest that we then have a separate implementation
>     for that.
>
> +1
>
>
>
> Ok with this in mind then, is there some guidance we can communicate 
> to Eddie? He has specifically mentioned that he shares similar 
> opinions wrt the fetcher being a core part of Nutch, radical changes 
> etc, and I also share this point of view. He has also added that he 
> doesn't want to spend the time changing material which we may or may 
> not merge with trunk, this also makes perfect sense. Additionally 
> Ken's comments emphasise that this has been somewhat attempted in the 
> past and that lessons have been learned and the implementation we have 
> cuts the mustard as is.
> Maybe we could nudge Eddie in the right direction, which would benefit 
> both himself and the project over the next while, I think this was the 
> most important point I was trying to emphasise, however looking over 
> my original comment this was maybe not how it was written.
>
> Thanks
> Lewis

If there's more important and/or interesting things for me to work on, 
I'll be glad to.  I'm completely unfamiliar with the current state of 
the project as a whole - and looking through JIRA is a bit daunting.  
The only reason I'm attracted to working on the fetcher is I think it's 
a really interesting and compelling problem to solve, and making it
more flexible is something that would directly benefit our use for it, 
so it will be easier to devote time to it while I'm at the office.  I do 
have a glut of free time at the moment though, so I'm perfectly okay 
working on another area that's more pressing - I just don't know what it 
is.  I saw that protocol-httpclient needs to be rewritten, is there 
someone working on that?

I can work on more important and less controversial / radical things, 
but I do think that having a more flexible, pluggable fetcher will be an 
enormous improvement to Nutch and can greatly expand the potential uses 
for it as a piece of software.  There's a ton of cases where pluggable 
fetching could have a huge improvement: local filesystem search, 
single-threaded / small site indexing, email indexing (SMTP, POP, etc.), 
etc.  I suggested an extremely (perhaps too much so) abstract 
architecture for fetching in ticket #1201, and for the sake of brevity I
won't repeat myself here, but I think that would give Nutch a good base 
for flexible fetching, which I believe is a huge improvement to the 
project.  I'm obviously new to the development here and I'm willing to do
whatever needs doing, I just believe the fetching is something that 
needs doing.  I just want to contribute!

Thanks,
Eddie

Re: [DISCUSS] Issues with Fetcher

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Julien,


> There are 8 issues in trunk about the fetcher - some of them unrelated to
> the Fetcher (NUTCH-827 <https://issues.apache.org/jira/browse/NUTCH-827> /
> NUTCH-1193) with most of the others being improvements (NUTCH-828
> <https://issues.apache.org/jira/browse/NUTCH-828> / NUTCH-1079
> <https://issues.apache.org/jira/browse/NUTCH-1079>) with possibly just a
> very few being real issues.


This puts the whole discussion into much better context, thanks for
pointing this out. Maybe I should have made it more clear, that I only
filtered the fetcher issues on our Jira and I was simply modelling my
discussion around that. You are completely correct though, it would be
different if the fetcher was in a similar state to protocol-httpclient...
which it is obviously not.


> I am also concerned about getting too radical changes to such a core part
> of the framework, especially when more pressing issues could be looked
> after instead.

+1


> Having said that if someone can come up with an interesting proposal for
> improving the Fetcher that would be very good, I would simply suggest that
> we then have a separate implementation for that.
>
+1


Ok with this in mind then, is there some guidance we can communicate to
Eddie? He has specifically mentioned that he shares similar opinions wrt
the fetcher being a core part of Nutch, radical changes etc, and I also
share this point of view. He has also added that he doesn't want to spend
the time changing material which we may or may not merge with trunk, this
also makes perfect sense. Additionally Ken's comments emphasise that this
has been somewhat attempted in the past and that lessons have been learned
and the implementation we have cuts the mustard as is.
Maybe we could nudge Eddie in the right direction, which would benefit both
himself and the project over the next while, I think this was the most
important point I was trying to emphasise, however looking over my original
comment this was maybe not how it was written.

Thanks
Lewis

Re: [DISCUSS] Issues with Fetcher

Posted by Julien Nioche <li...@gmail.com>.
Hi Lewis

> Doing a quick rekkie on the current fetcher issues I can see 32 issues with
> 7 of them claiming to be patched up... this kinda indicates that although
> there are underlying problems with the fetcher we are currently not getting
> the time to address them. It also indicates that there is quite a bit of
> work to be done with the fetcher...
>
>
There are 8 issues in trunk about the fetcher - some of them unrelated to
the Fetcher (NUTCH-827 <https://issues.apache.org/jira/browse/NUTCH-827> /
NUTCH-1193) with most of the others being improvements
(NUTCH-828 <https://issues.apache.org/jira/browse/NUTCH-828> /
NUTCH-1079 <https://issues.apache.org/jira/browse/NUTCH-1079>) with
possibly just a very few being real issues. I completely disagree with your
statement that there are underlying problems with the fetcher and that
there is quite a bit of work to do with it.

The Fetcher could be made more flexible for sure and get other improvements
like any other part of the code but you cannot say it is broken or in need
of urgent repair. I am also concerned about getting too radical changes to
such a core part of the framework, especially when more pressing issues
could be looked after instead. Having said that if someone can come up with
an interesting proposal for improving the Fetcher that would be very good,
I would simply suggest that we then have a separate implementation for that.

Thanks

Julien


-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com