Posted to user@nutch.apache.org by Thorsten Scherler <th...@apache.org> on 2009/04/01 02:28:39 UTC

Re: The Future of Nutch

On Fri, 2009-03-13 at 19:42 -0700, buddha1021 wrote:
> hi dennis:
...
> I am confident that Hadoop can process the large data volumes of a www
> search engine! But Lucene? I am afraid the index size Lucene can handle
> per server is very small - 10 GB? or 30 GB? This is not enough for a www
> search engine! IMO, this is a bottleneck!

I agree that the real problem with accessing Lucene indexes is keeping
them small. What good is the possibility of a distributed ("clouded")
index if accessing it takes hours?

For me, this is where one of Nutch's core competences should lie: making
search in BIG indexes fast (as fast as in SMALL indexes).

salu2
-- 
Thorsten Scherler <thorsten.at.apache.org>
Open Source <consulting, training and solutions>


Re: The Future of Nutch

Posted by Thorsten Scherler <th...@juntadeandalucia.es>.
On Wed, 2009-04-01 at 07:42 -0700, Ken Krugler wrote:
...
> I would suggest looking at Katta (http://katta.sourceforge.net/). 
> It's one of several projects where the goal is to support very large 
> Lucene indexes via distributed shards. Solr has also added federated 
> search support.

Interesting. Thanks for the link, Ken.

salu2
-- 
Thorsten Scherler <thorsten.at.apache.org>
Open Source Java <consulting, training and solutions>

Sociedad Andaluza para el Desarrollo de la Sociedad 
de la Información, S.A.U. (SADESI)





Re: The Future of Nutch, reactivated

Posted by Raymond Balmès <ra...@gmail.com>.
I'm still a new user, so although I found it rather easy to get going and
build my own plugins, I have some suggestions.

One thing I'd like to see is a way to estimate how long a certain step
(fetch, ...) will take - something like a progress bar. You launch a step
and it can run for days, working perfectly, yet you have no idea when it
might eventually end.
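
A rough building block for this already exists: Hadoop reports per-job
map/reduce progress, which a small tool could poll while a Nutch step runs.
A minimal sketch against the old org.apache.hadoop.mapred API (the job id
is whatever the launched step printed; nothing here is Nutch-specific):

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

/** Rough progress display for a running Hadoop job (e.g. a fetch step).
 *  Sketch only - pass the job id printed when the step was launched. */
public class JobProgress {
  public static void main(String[] args) throws Exception {
    JobClient client = new JobClient(new JobConf());
    RunningJob job = client.getJob(args[0]); // e.g. "job_200905140001_0001"
    while (job != null && !job.isComplete()) {
      // mapProgress()/reduceProgress() return a fraction in [0, 1]
      System.out.printf("map %3.0f%%  reduce %3.0f%%%n",
          job.mapProgress() * 100, job.reduceProgress() * 100);
      Thread.sleep(10000);            // poll every ten seconds
      job = client.getJob(args[0]);   // refresh the status
    }
  }
}

This only gives a completed fraction, not a time estimate, but extrapolating
from the fraction and the elapsed time would get close to a progress bar.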

I find the web front end rather difficult to change, and I lost a lot of
time with the NutchBean trying to understand how it works.
Coming from Lucene, it took me a while to discover all the limitations it
has. I haven't played much with the Nutch-Solr integration yet, but from
the sound of it, it looks more powerful; whether it is also simpler is my
concern.
-Raymond-
2009/5/14 Mattmann, Chris A <ch...@jpl.nasa.gov>

...

Re: The Future of Nutch, reactivated

Posted by consultas <co...@qualidade.eng.br>.
Keep it simple.
Many people, it seems to me, use Nutch to exercise their programming
expertise and talents in some way.
I am just a user, and I think that users just want something that can index
the web and find results when they search.  I don't want to deal with
complicated application names; I just want to crawl and search.  And it
should be noted that for most users, myself included, it is not a
trivial job to get Nutch working, on Linux or Windows.
Anyway, I think that the biggest use for Nutch will be for vertical or
regional search purposes.
By the way, from this point of view I really didn't like the experience with
the original release of version 1.0: the crawling phase was too slow.


----- Original Message ----- 
From: "Andrzej Bialecki" <ab...@getopt.org>
To: <nu...@lucene.apache.org>
Sent: Thursday, May 14, 2009 10:45 AM
Subject: The Future of Nutch, reactivated


...




Re: The Future of Nutch, reactivated

Posted by AJ Chen <ca...@gmail.com>.
Andrzej, great summary. I played with Nutch before for a web search engine,
but have not used it for a while because it has become too complicated. Based
on my experience building a semantic search engine for the healthcare
vertical, I think it would be beneficial to separate crawling from search
architecturally and focus Nutch on just crawling.

My sense is that if Nutch can make crawling simple and deliver high-quality
crawled content along with important metadata like link structure, it will
have a much better chance of becoming an indispensable part of a search
engine. Of course, it's important to include an implementation for search as
well, so that Nutch can provide end-to-end (i.e. crawl and search) results
for evaluation. But don't get stuck in search, because there are a variety
of different search needs - static search, dynamic search, real-time
search, semantic search, etc. - and it's not easy to make Nutch meet all of
these real-world needs. Rather, Nutch should provide the crawled content in
a way that lets people easily apply different search tools or search
technology.

As for the audience, it makes sense to focus on the middle of the usage
spectrum, i.e. vertical or focused search at mid-range scale. But I
wouldn't ignore the small projects or developer projects, because these are
often the starting point for evaluating a new project.

-aj
-- 
AJ Chen, PhD
Co-Chair, Semantic Web SIG, sdforum.org
Technical Architect, healthline.com
http://web2express.org
Palo Alto, CA

On Thu, May 14, 2009 at 6:45 AM, Andrzej Bialecki <ab...@getopt.org> wrote:

...

Re: The Future of Nutch, reactivated

Posted by Julien Nioche <li...@gmail.com>.
Hi,

I am joining the conversation a bit late, but never mind...

In my view, the main target should be (2). As you pointed out, Solr covers
(3) and (4) quite well (or will progressively do so). As for (1), there is
definitely an audience, even if it is small, but it would certainly benefit
from the work done towards (2). As you said, operating on a large scale
(i.e. using more than 100 slaves) requires a lot of resources and a
dedicated team, and I expect that the people interested in large scale have
their own views on scoring and spam prevention anyway :-)

I completely agree that there should be as much delegation of
functionality to third parties as possible (e.g. parsing with Tika) in
order to focus on the core competences.
I really like your idea of doing template detection, for instance. Another
thing I find promising is the HBase integration (NUTCH-650), which would
also allow more interoperability with other tools such as Heritrix and make
the data structure a bit more open.

Talking about future functionality: we do quite a lot of text analysis
with tools like GATE or UIMA, and we have been working on things such as
detection of adult content and automatic text classification with Nutch.
There are plenty of interesting things that can be done for vertical search
systems, such as Named Entity Extraction, etc. Since NLP applications can
be quite greedy, leveraging Hadoop is definitely an advantage. I'd love to
see in future versions of Nutch a separation between format parsing (i.e.
Tika) and content analysis, where implementations would get a
semi-structured representation of the document, a bit like what extensions
of HTML parsers get currently, but regardless of the original format.
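
To illustrate the separation, a minimal sketch using Tika's generic parsing
API (assuming a Tika version with ParseContext; the ContentAnalyzer hook is
hypothetical and stands for whatever GATE/UIMA-style step comes next):

import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

/** Format parsing (Tika) kept separate from content analysis. */
public class ParseThenAnalyze {

  /** Hypothetical analysis hook - NER, classification, etc. */
  interface ContentAnalyzer {
    void analyze(String text, Metadata meta) throws Exception;
  }

  public static void process(InputStream in, ContentAnalyzer analyzer)
      throws Exception {
    Parser parser = new AutoDetectParser();          // any format in...
    BodyContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    parser.parse(in, handler, metadata, new ParseContext()); // ...text out
    // The analyzer never sees HTML, PDF, etc. - only extracted content
    // plus metadata, regardless of the original format.
    analyzer.analyze(handler.toString(), metadata);
  }
}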

Have a good weekend

Julien

-- 
DigitalPebble Ltd
http://www.digitalpebble.com

2009/5/14 Andrzej Bialecki <ab...@getopt.org>

...

Re: The Future of Nutch, reactivated

Posted by "Mattmann, Chris A" <ch...@jpl.nasa.gov>.
Hi Andrzej,

Great summary. My general feeling on this is similar to my prior comments on
similar threads from Otis and from Dennis. My personal pet projects for
Nutch2:

* refactored Nutch core data structures, modeled as POJOs (a rough sketch
follows below)
* refactored Nutch architecture where crawling/indexing/parsing/scoring/etc.
are insulated from the underlying messaging substrate (e.g., crawl over JMS,
EJB, Hadoop, RMI, etc., crawl using Heritrix, parse using Tika or some other
framework, etc.)
* simpler Nutch deployment mechanisms (separate Nutch deployment package
from source code package); think about using Maven2
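
A rough sketch of what such a POJO might look like (field names here are
hypothetical, not an agreed schema - serialization would live in a separate
layer rather than in Writable methods on the class itself):

/** Hypothetical plain-Java crawl record, free of Hadoop/Writable
 *  dependencies, so it can travel over JMS, RMI, etc. unchanged. */
public class CrawlRecord {
  private String url;          // fetch location
  private long fetchTime;      // epoch millis of the last fetch
  private int status;          // fetched / gone / redirect / ...
  private String contentType;  // as reported or sniffed
  private byte[] content;      // raw bytes as fetched

  public String getUrl() { return url; }
  public void setUrl(String url) { this.url = url; }
  public long getFetchTime() { return fetchTime; }
  public void setFetchTime(long fetchTime) { this.fetchTime = fetchTime; }
  public int getStatus() { return status; }
  public void setStatus(int status) { this.status = status; }
  // ...remaining getters/setters omitted for brevity
}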

+1 to all of those and other ideas for how to improve the project's focus.

Cheers,
Chris


On 5/14/09 6:45 AM, "Andrzej Bialecki" <ab...@getopt.org> wrote:

...

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: Chris.Mattmann@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++




The Future of Nutch, reactivated

Posted by Andrzej Bialecki <ab...@getopt.org>.
Hi all,

I'd like to revive this thread and gather additional feedback so that we
end up with concrete conclusions. Much of what I write below others have
said before; I'm trying here to express it as it looks from my point
of view.

Target audience
===============
I think that the Nutch project is experiencing an identity crisis now -
we are not sure who the target audience is, and we cannot satisfy
everyone. I think there are the following groups of Nutch users:

1. Large-scale Internet crawl & search: actually, there are only a few
such users, because it takes considerable resources to manage operations
on that scale. Scalability, manageability and ranking/spam prevention
are the chief concerns here.

2. Medium-scale vertical search: I suspect that many Nutch users fall
into this category. Modularity, flexibility in implementing custom
processing, and the ability to modify workflows and to use only some Nutch
components seem to be the chief concerns here. Scalability too, but only up
to a volume of ~100-200 million documents.

3. Small- to medium-scale enterprise search: there's a sizeable number
of Nutch users that fall into this category, for historical reasons.
Link-based ranking and resource discovery are not that important here,
but integration with Windows networking, Microsoft formats and databases,
as well as real-time indexing and easy index maintenance, are crucial.
This class of users often has to heavily customize Nutch to get any
sensible result. Also, this is where Solr really shines, so there is
little benefit in using Nutch here. I predict that Nutch will have fewer
and fewer users of this type.

4. Single desktop to small intranet search: as above, but the accent is 
on the ease of use out of the box, and an often requested feature is a 
GUI frontend. Currently, IMHO, Nutch is too complex and requires too much
command-line operation to make this use case attractive to casual users.

What is the target audience that we as a community want to support? By 
this I mean not only the moral support, but also active participation in 
the development process. From the place where we are at the moment we 
could go in any of the above directions.

Core competence
===============
This is a simple but important point. Currently we maintain several
major subsystems in Nutch that are also implemented by other projects, and
often in a better way. The plugin framework (and dependency injection) and
content parsing are two areas that we should delegate to third-party
libraries, such as Tika and OSGi or some other simple IoC container -
there are probably other components that we don't have to build ourselves.
Another thing that I'd love to delegate is distributed search and
index maintenance - either through Solr or Katta or something else.

The question then is, what is the core competence of this project? I see 
the following major areas that are unique to Nutch:

* crawling - this includes crawl scheduling (and re-crawl scheduling), 
discovery and classification of new resources, strategies for crawling 
specific sets of URLs (hosts and domains) under bandwidth and netiquette 
constraints, etc.

* web graph analysis - this includes link-based ranking, mirror 
detection (and URL "aliasing") but also link spam detection and a more 
complex control over the crawling frontier.

Anything more? I'm not sure - perhaps I would add template detection and 
pagelet-level crawling (i.e. sensible re-crawling of portal-type sites).

Nutch 1.0 already made some steps in this direction, with the new link 
analysis package and pluggable FetchSchedule and Signature. A lot 
remains to be done here, and we are still spending a lot of resources on 
dealing with issues outside this core competence.
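
For illustration, a minimal sketch of a pluggable Signature along the lines
of the existing TextProfileSignature - hashing parsed text instead of raw
bytes, so that byte-level noise (ads, counters) does not make an unchanged
page look modified. The exact calculate() signature is from memory, so
double-check it against the actual base class:

import org.apache.hadoop.io.MD5Hash;
import org.apache.nutch.crawl.Signature;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.protocol.Content;

/** Sketch: document signature computed over extracted text rather
 *  than over the raw fetched bytes. */
public class ParsedTextSignature extends Signature {
  public byte[] calculate(Content content, Parse parse) {
    String text = (parse != null) ? parse.getText() : "";
    return MD5Hash.digest(text.getBytes()).getDigest();
  }
}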

-------

So, what do we need to do next?

* we need to decide where we should commit our resources, as a community 
of users, contributors and committers, so that the project is most 
useful to our target audience. At this point there are few active
committers, so I don't think we can cover more than one direction at a
time ... ;)

* we need to re-architect Nutch to focus on our core competence, and 
delegate what we can to other projects.

Feel free to comment on the above, make suggestions or corrections. I'd 
like to wrap it up in a concise mission statement that would help us set 
the goals for the next couple months.

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: The Future of Nutch, reactivated

Posted by Aaron Binns <aa...@archive.org>.
Andrzej Bialecki <ab...@getopt.org> writes:

>> One of the biggest boons of Nutch is the Hadoop infrastructure.  When
>> indexing massive data sets, being able to fire up 60+ nodes in a
>> Hadoop system helps tremendously.
>
> Are you familiar with the distributed indexing package in Hadoop
> contrib/ ?

Only superficially at most.  Last I looked at it, it seemed to be a
"hello world" prototype.  If it's developed more, it might be worth
another look.

>> However, one of the biggest challenges to using Nutch is the fact
>> that the URL is used as the unique key for a document.
>
> Indeed, this change is something that I've been considering, too - 
> URL==page doesn't work that well in case of archives, but also when
> your unit of information is smaller (pagelet) or larger (compound
> docs) than a page.
>
> People can help with this by working on a patch that replaces this
> silent assumption with an explicit API, i.e. splitting recordId and
> URL into separate fields.

Patches are always welcome - it is an open source package, after all :) I'll
see about creating a patch set for the changes I've made in NutchWAX.

>> As for the future of Nutch, I am concerned over what I see to be an
>> increasing focus on crawling and fetching.  We have only lightly
>> evaluated other Open Source search projects, such as Solr, and are not
>> convinced any can be a drop-in replacement for Nutch.  It looks like
>> Solr has some nice features for certain, I'm just not convinced it can
>> scale up to the billion document level.
>
> What do you see as the unique strength of Nutch, then? IMHO there are
> existing frameworks for distributed indexing (on Hadoop) and
> distributed search (e.g. Katta). We would like to avoid the
> duplication of effort, and to focus instead on the aspects of Nutch
> functionality that are not available elsewhere.

Right now, the unique strength of Nutch -- to my organization -- is that
it has all the requisite pieces and comes closer to a complete solution
than other open source projects.  The features it lacks compared to
others are less important than the ones it has that others do not.

Two key features of Nutch indexing are the content parsing and the link
extraction.  The parsing plugins seem to work well enough, although
easier modification of content tokenizing and stop-list management would
be nice.  For example, using a config file to tweak the tokenizing for,
say, French or Spanish would be nicer than having to write a new .jj file
and a custom build.
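
A minimal sketch of what such config-driven tokenizing could look like,
mapping language codes to Lucene Analyzer classes in a properties file
(assuming analyzers with no-arg constructors, as in the Lucene 2.x contrib
package; the file format is made up for illustration):

import java.io.FileInputStream;
import java.util.Properties;
import org.apache.lucene.analysis.Analyzer;

/** Sketch: pick an Analyzer per language from a config file such as
 *    fr=org.apache.lucene.analysis.fr.FrenchAnalyzer
 *    de=org.apache.lucene.analysis.de.GermanAnalyzer
 *  so supporting a new language means editing a file, not writing a
 *  new .jj grammar and rebuilding. */
public class AnalyzerRegistry {
  private final Properties mapping = new Properties();

  public AnalyzerRegistry(String configFile) throws Exception {
    mapping.load(new FileInputStream(configFile));
  }

  public Analyzer forLanguage(String lang) throws Exception {
    String className = mapping.getProperty(lang);
    if (className == null) return null;   // caller falls back to a default
    return (Analyzer) Class.forName(className).newInstance();
  }
}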

Along the same lines, language awareness would have to be included in
the query processing as well.  And speaking of which, the way in which
Nutch query processing is optimized for web search makes sense.  I've
read that Solr can be configured to emulate the Nutch query processing;
if so, that would eliminate a competitive advantage of Nutch.

Nutch's summary/snippet generation approach works fine.  It's not clear
to me how this is done with the other tools.

On the search service side of things, Nutch is adequate, but I would
like to investigate other distributed search systems.  My main complaint
about Nutch's implementation is the use of the Hadoop RPC mechanism:
it's very difficult to diagnose and debug problems.  I'd prefer it if the
master just talked to the slaves over OpenSearch or a simple HTTP/JSON
interface.  That way, monitoring tools could easily ping the slaves and
check for sensible results.
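
A sketch of the kind of plain-HTTP status endpoint a slave could expose,
using only the JDK's built-in HttpServer (the JSON fields are made up for
illustration; real data would come from the slave's index state):

import java.io.OutputStream;
import java.net.InetSocketAddress;
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpHandler;
import com.sun.net.httpserver.HttpServer;

/** Sketch: a monitoring tool can GET http://slave:8081/status and
 *  check the reply, with no Hadoop RPC involved. */
public class SlaveStatusEndpoint {
  public static void main(String[] args) throws Exception {
    HttpServer server = HttpServer.create(new InetSocketAddress(8081), 0);
    server.createContext("/status", new HttpHandler() {
      public void handle(HttpExchange ex) throws java.io.IOException {
        // Hypothetical payload - shard name and doc count would come
        // from the actual index/segment deployed on this slave.
        String body = "{\"status\":\"ok\",\"shard\":\"part-00003\"," +
                      "\"docs\":12345678}";
        byte[] bytes = body.getBytes("UTF-8");
        ex.getResponseHeaders().set("Content-Type", "application/json");
        ex.sendResponseHeaders(200, bytes.length);
        OutputStream os = ex.getResponseBody();
        os.write(bytes);
        os.close();
      }
    });
    server.start();
  }
}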

Along the same diagnosis/debug lines, I've added more log messages to
the start-up code of the search slave.  Without these, it's very
difficult to diagnose some trivial mistake in the deployment of the
index/segment shards, such as a mis-named directory or the like.

Lastly, there's also the fact that Nutch is a known quantity and we've
already put non-trivial effort into using and adapting it to our needs.
It would be difficult to start all over again with another toolset, or
assemblage of tools.  We also have scaling expectations based on what
we've achieved so far with Nutch(WAX).  It would be painful to invest
the time and effort in, say, Solr only to discover it can't scale to the
same size with the same hardware.


Right now, the most interesting other project for us to consider is
Solr.  There seems to be more and more momentum behind it, and it does
have some neat features, such as the "did you mean?" suggestions and
the like.  However, the distributed search functionality is pretty
rudimentary IMO, and I am concerned about reports that it doesn't scale
beyond a few million or tens of millions of documents - although it
appears that some of this has to do with the modify/update capabilities,
which are mitigated by the use of read-only IndexReaders (or something
like that).


Aaron

-- 
Aaron Binns
Senior Software Engineer, Web Group
Internet Archive
aaron@archive.org

Re: The Future of Nutch, reactivated

Posted by Andrzej Bialecki <ab...@getopt.org>.
Aaron Binns wrote:

> Our usage of Nutch is focused on index building and search services.  We
> don't use the crawling/fetching features at all.  We use Heritrix.
> Typically, our large-scale harvests are performed over 8-12 week
> periods, then the archived data is handed off to me for full-text search
> indexing.  We deploy the indexes on a separate rack of machines
> dedicated to hosting the full-text search service.
> 
> One of the biggest boons of Nutch is the Hadoop infrastructure.  When
> indexing massive data sets, being able to fire up 60+ nodes in a Hadoop
> system helps tremendously.

Are you familiar with the distributed indexing package in Hadoop contrib/ ?

> 
> However, one of the biggest challenges to using Nutch is the fact
> that the URL is used as the unique key for a document.  This is usually
> a sensible thing to do, but for web archives, it doesn't work.  Our
> NutchWAX package contains all sorts of hacks to work around this
> assumption.

Indeed, this change is something that I've been considering, too -
URL==page doesn't work that well in the case of archives, but also when
your unit of information is smaller (a pagelet) or larger (compound docs)
than a page.

People can help with this by working on a patch that replaces this 
silent assumption with an explicit API, i.e. splitting recordId and URL 
into separate fields.
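
A minimal sketch of what that explicit API might look like (class and
field names are hypothetical, just to make the split concrete):

/** Hypothetical record identity, decoupled from the fetch URL: two
 *  captures of the same URL (e.g. in a web archive) get distinct
 *  recordIds, while the URL becomes an ordinary field. */
public final class RecordKey {
  private final String recordId; // unique per capture, e.g. digest of url+timestamp
  private final String url;      // where the record was fetched from

  public RecordKey(String recordId, String url) {
    this.recordId = recordId;
    this.url = url;
  }
  public String getRecordId() { return recordId; }
  public String getUrl() { return url; }

  // Identity is the recordId alone - never the URL.
  @Override public boolean equals(Object o) {
    return o instanceof RecordKey
        && ((RecordKey) o).recordId.equals(recordId);
  }
  @Override public int hashCode() { return recordId.hashCode(); }
}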

> 
> 
> As for the future of Nutch, I am concerned over what I see to be an
> increasing focus on crawling and fetching.  We have only lightly
> evaluated other Open Source search projects, such as Solr, and are not
> convinced any can be a drop-in replacement for Nutch.  It looks like
> Solr has some nice features for certain, I'm just not convinced it can
> scale up to the billion document level.

What do you see as the unique strength of Nutch, then? IMHO there are 
existing frameworks for distributed indexing (on Hadoop) and distributed 
search (e.g. Katta). We would like to avoid the duplication of effort, 
and to focus instead on the aspects of Nutch functionality that are not 
available elsewhere.


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: The Future of Nutch, reactivated

Posted by Aaron Binns <aa...@archive.org>.
Andrzej Bialecki <ab...@getopt.org> writes:

> Target audience
> ===============
> I think that the Nutch project experiences a crisis of personality now -
> we are not sure what is the target audience, and we cannot satisfy
> everyone. I think that there are following groups of Nutch users:
>
> 1. Large-scale Internet crawl & search: actually, there are only few
> such users, because it takes considerable resources to manage operations
> on that scale. Scalability, manage-ability and ranking/spam prevention
> are the chief concerns here.

We here at the Internet Archive are one of these users, and our numbers
are small, although the size of our data is big.  We routinely deal with
collections of documents (primarily web pages) in excess of 500 million.

We have developed a set of add-ons and modifications to Nutch called
NutchWAX (Web Archive eXtensions).  We use NutchWAX both for our
internal projects (such as archive-it.org) as well as with our national
library partners.

In the coming years, more and more national libraries will be building
their own web archives, mainly by performing "domain harvests" of
websites in a country's domain.  So I expect the list of users
operating at this scale to grow to a few dozen in the next few
years.

Our usage of Nutch is focused on index building and search services.  We
don't use the crawling/fetching features at all.  We use Heritrix.
Typically, our large-scale harvests are performed over 8-12 week
periods, then the archived data is handed off to me for full-text search
indexing.  We deploy the indexes on a separate rack of machines
dedicated to hosting the full-text search service.

One of the biggest boons of Nutch is the Hadoop infrastructure.  When
indexing massive data sets, being able to fire up 60+ nodes in a Hadoop
system helps tremendously.

However, one of the biggest challenges to using Nutch is the fact
that the URL is used as the unique key for a document.  This is usually
a sensible thing to do, but for web archives, it doesn't work.  Our
NutchWAX package contains all sorts of hacks to work around this
assumption.


As for the future of Nutch, I am concerned over what I see to be an
increasing focus on crawling and fetching.  We have only lightly
evaluated other Open Source search projects, such as Solr, and are not
convinced any can be a drop-in replacement for Nutch.  It looks like
Solr has some nice features for certain, I'm just not convinced it can
scale up to the billion document level.


Aaron

-- 
Aaron Binns
Senior Software Engineer, Web Group
Internet Archive
aaron@archive.org

Re: The Future of Nutch, reactivated

Posted by Otis Gospodnetic <og...@yahoo.com>.
Hello,
(I saw the first copy of this email went to nutch-user, but I assume nutch-dev was a resend and the right list to follow up on.)

I agree with the list of core competencies.  For example - and I don't know where I said/wrote this, but I know I have said it a few times before - I think Solr is the future of Nutch's search.  I have a feeling the original Nutch search components will die off with time: nobody is working on them, and Solr is making great progress.
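
(Nutch 1.0 already took a first step here with its solrindex command. For
anyone who hasn't seen the idea, a minimal SolrJ-style sketch of pushing a
crawled page into Solr instead of into Nutch's own search stack - the field
names assume a schema with url/title/content fields, using the SolrJ client
of that era:)

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

/** Sketch: hand indexing and search off to Solr via SolrJ. */
public class SolrPush {
  public static void main(String[] args) throws Exception {
    SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("url", "http://example.com/");
    doc.addField("title", "Example");
    doc.addField("content", "Extracted page text goes here.");
    solr.add(doc);   // buffer the document
    solr.commit();   // make it searchable
  }
}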

In my experience, most Nutch users fall under #2.  Most require web-wide crawling, but really care about a specific vertical slice.  So that's where I'd say the focus should be, theoretically.  I say theoretically because I don't think active Nutch developers can really choose a direction if it doesn't match their own itches.


Otis 
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch




The Future of Nutch, reactivated

Posted by Andrzej Bialecki <ab...@getopt.org>.
Hi all,

I'd like to revive this thread and gather additional feedback so that we
end up with concrete conclusions. Much of what I write below others have
said before; I'm trying here to express it as it looks from my point
of view.

Target audience
===============
I think that the Nutch project is experiencing an identity crisis right
now - we are not sure what the target audience is, and we cannot satisfy
everyone. I think there are the following groups of Nutch users:

1. Large-scale Internet crawl & search: actually, there are only a few
such users, because it takes considerable resources to manage operations
on that scale. Scalability, manageability and ranking/spam prevention
are the chief concerns here.

2. Medium-scale vertical search: I suspect that many Nutch users fall
into this category. Modularity, flexibility in implementing custom
processing, and the ability to modify workflows and to use only some
Nutch components seem to be the chief concerns here. Scalability matters
too, but only up to a volume of ~100-200 million documents.

3. Small- to medium-scale enterprise search: there's a sizeable number
of Nutch users that fall into this category, for historical reasons.
Link-based ranking and resource discovery are not that important here,
but integration with Windows networking, Microsoft formats and databases,
as well as real-time indexing and easy index maintenance, are crucial.
This class of users often has to heavily customize Nutch to get any
sensible result. Also, this is where Solr really shines, so there is
little benefit in using Nutch here. I predict that Nutch will have fewer
and fewer users of this type.

4. Single desktop to small intranet search: as above, but the emphasis is
on ease of use out of the box, and an often-requested feature is a
GUI frontend. Currently, IMHO, Nutch is too complex and requires too much
command-line operation for this use case to be attractive to casual users.

What is the target audience that we as a community want to support? By
this I mean not only moral support, but also active participation in
the development process. From where we stand at the moment, we
could go in any of the above directions.

Core competence
===============
This is a simple but important point. Currently we maintain several
major subsystems in Nutch that are also implemented by other projects,
often in a better way. The plugin framework (and dependency injection) and
content parsing are two areas that we should delegate to third-party
libraries, such as Tika and OSGi or some other simple IoC container -
and there are probably other components that we don't have to build ourselves.
Another thing that I'd love to delegate is the distributed search and
index maintenance - either through Solr or Katta or something else.

The question then is, what is the core competence of this project? I see
the following major areas that are unique to Nutch:

* crawling - this includes crawl scheduling (and re-crawl scheduling),
discovery and classification of new resources, strategies for crawling
specific sets of URLs (hosts and domains) under bandwidth and netiquette
constraints, etc.

* web graph analysis - this includes link-based ranking, mirror
detection (and URL "aliasing"), but also link spam detection and more
fine-grained control over the crawl frontier.

Anything more? I'm not sure - perhaps I would add template detection and
pagelet-level crawling (i.e. sensible re-crawling of portal-type sites).

Nutch 1.0 has already taken some steps in this direction, with the new
link analysis package and the pluggable FetchSchedule and Signature. A lot
remains to be done here, and we are still spending a lot of resources on
dealing with issues outside this core competence.
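
To make "pluggable" concrete, a custom Signature is about this small -- a
sketch with a made-up class, written against the 1.0 interface as I read
it (the signature decides what counts as "the same page" for
de-duplication):

    import org.apache.hadoop.io.MD5Hash;
    import org.apache.nutch.crawl.Signature;
    import org.apache.nutch.parse.Parse;
    import org.apache.nutch.protocol.Content;

    // Hypothetical plugin: hash only the parsed title and text, so that
    // cosmetic markup changes do not look like new content.
    public class TitleTextSignature extends Signature {
      public byte[] calculate(Content content, Parse parse) {
        String text = parse.getData().getTitle() + "\n" + parse.getText();
        return MD5Hash.digest(text.getBytes()).getDigest();
      }
    }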

-------

So, what do we need to do next?

* we need to decide where we should commit our resources, as a community
of users, contributors and committers, so that the project is most
useful to our target audience. At this point there are few active
committers, so I don't think we can cover more than one direction at a
time ... ;)

* we need to re-architect Nutch to focus on our core competence, and
delegate what we can to other projects.

Feel free to comment on the above, make suggestions or corrections. I'd
like to wrap it up in a concise mission statement that would help us set
the goals for the next couple months.

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: The Future of Nutch

Posted by Doğacan Güney <do...@gmail.com>.
On Wed, Apr 1, 2009 at 17:42, Ken Krugler <kk...@transpac.com> wrote:

>> On Fri, 2009-03-13 at 19:42 -0700, buddha1021 wrote:
>>> hi dennis:
>> ...
>>> I am confident that hadoop can process the large datas of the www
>>> search engine! But lucene? I am afraid of the limited size of lucene's
>>> index per server is very little, 10G? or 30G? this is not enough for
>>> the www search engine! IMO, this is a bottleneck!
>>
>> I agree that the actual problem/solution of accessing lucene indexes is
>> to keep them small. What does the possibility of having a clouded index
>> serve if accessing it takes hours?
>>
>> For me here should lie one of nutch core competences: making search in
>> BIG indexes fast (as fast as in SMALL indexes).
>
> I would suggest looking at Katta (http://katta.sourceforge.net/). It's one
> of several projects where the goal is to support very large Lucene indexes
> via distributed shards. Solr has also added federated search support.

I agree. I think the new indexing framework should be flexible enough that
we can support Katta along with Solr. Actually, this is one of the things
I want to do before the next major release.
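
For illustration, on the Solr side the fan-out is already just a query
parameter. A sketch with SolrJ (host names are made up; assumes Solr 1.3+
distributed search):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class ShardedSearch {
      public static void main(String[] args) throws Exception {
        // Any Solr node can coordinate the request.
        CommonsHttpSolrServer solr =
            new CommonsHttpSolrServer("http://search1:8983/solr");

        SolrQuery q = new SolrQuery("nutch hadoop");
        // The shards parameter scatters the query across the listed
        // index shards and merges the partial results.
        q.set("shards", "search1:8983/solr,search2:8983/solr");

        QueryResponse rsp = solr.query(q);
        System.out.println("hits: " + rsp.getResults().getNumFound());
      }
    }

The indexing framework would then only need to decide which shard each
document lands in; a Katta back-end could sit behind the same kind of
abstraction.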





-- 
Doğacan Güney
