Posted to user@nutch.apache.org by Matt Wilkie <ma...@gov.yk.ca> on 2006/03/04 00:34:21 UTC

project vitality?

Hi there, I'm new around here. The mailing lists seem to have a pretty
steady stream of traffic, but the website hasn't been updated since
August, and there's only a handful of news items before that. What is
the vitality of the Nutch project? Is it basically a laboratory proof of
concept, or a mature, production-ready product?

thanks for your time,

-- 
matt wilkie
--------------------------------------------
Geographic Information,
Information Management and Technology,
Yukon Department of Environment
10 Burns Road * Whitehorse, Yukon * Y1A 4Y9
867-667-8133 Tel * 867-393-7003 Fax
http://environmentyukon.gov.yk.ca/geomatics/
--------------------------------------------


Re: project vitality?

Posted by Matt Wilkie <ma...@gov.yk.ca>.
Thank you everyone for giving me a through-the-keyhole view of the Nutch
project. I really appreciate the time it takes to read messages and
compose a reply -- time which could otherwise be spent coding or
writing documentation. ;)

I am somewhat saddened, but unsurprised, to find a slightly antagonistic
relationship between the coders and the users. I've followed a number
of open source projects, and a polarised dialogue seems to arise of its
own accord; not always, mind you, but often. I'll depart with a
suggestion which in the past I've seen provide some lubrication and
make for a more easeful coder-user dialogue:

On the mailing list, initiate a "Summary" convention, wherein people who 
ask questions are politely asked to summarise the results and post back 
to the list. How it works:

Ms Newbie asks "how do I get Nutch to crawl my coffee pot archives?".
Half a dozen people reply with an assortment of tips. Some are terse,
"read faq 13.4", and some offer a little more hand holding, "first
connect the archive via caffeine plug 3a, then power up the filter
holder", and another adds "but don't forget to place receiving
receptacle A1:Final under the spout or your results will be all over the
floor". Ms Newbie then posts a message titled "SUM: crawling coffee pot
archives" back to the mailing list summarising the suggestions and her
results.

The "SUM" or "Summary:" part is important for people searching the 
archives. They want to start with the results before crawling back 
through the initiating questions.
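
As a minimal sketch of how an archive searcher might pull summary posts
to the top, the snippet below assumes subjects are available as plain
strings and that summaries start with "SUM:" or "Summary:" as proposed
above. The function names and the sample subject list are hypothetical
and not part of any mailing list software.

```python
import re

# Only the "SUM:" / "Summary:" prefixes come from the proposal above;
# everything else here is an illustrative assumption.
SUMMARY_PREFIX = re.compile(r"^\s*(SUM|Summary)\s*:", re.IGNORECASE)


def is_summary(subject: str) -> bool:
    """Return True when a subject line starts with a summary marker."""
    return bool(SUMMARY_PREFIX.match(subject))


def summaries_first(subjects):
    """Put summary posts first; sorted() is stable, so the original
    order is kept within each group."""
    return sorted(subjects, key=lambda s: not is_summary(s))


# Hypothetical archive listing for one thread.
subjects = [
    "how do I get Nutch to crawl my coffee pot archives?",
    "Re: how do I get Nutch to crawl my coffee pot archives?",
    "SUM: crawling coffee pot archives",
]
ordered = summaries_first(subjects)
print(ordered[0])
```

Matching only at the start of the subject means ordinary replies such as
"Re: SUM: ..." are not mistaken for the summary itself.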

Until the convention has been used enough to become natural, it will be 
necessary to *politely* remind/ask questioners to summarise.

There will always be some who just won't or can't summarise. Don't waste
time chewing them out for it; that just adds noise. After a suitable
interval, ask them nicely once or twice to summarise. If there is still
nothing forthcoming, simply stop responding to their questions in an
informative way.

Lead by example. It will take some time for the custom to gel. A small
handful will need to resign themselves to going it alone for a while. It
won't be forever; people know a good thing when they see it (eventually!).

Keep an eye on the SUMs and periodically grab the juicy ones and 
reformat for the wiki and/or documentation.

Summarisers: Please don't just concatenate the replies into one big 
verbatim message -- we can read the mailing list for full details! Keep 
only the core info which really helps. Strip out signatures, anecdotes, 
chatter, unneeded controversy and anything else which doesn't answer the 
question. Also, always credit the people who've taken time out of 
*their* work to help you with *yours*. P:-)

I've run out of time today; tomorrow I'll SUMmarise this thread to show
more concretely what I mean. After that, well, if it works, use it; if
not, leave it to collect dust in the bit bucket and move on.

cheers,

-- 
matt wilkie


Re: project vitality?

Posted by Chris Lamprecht <cl...@gmail.com>.
I think of the Nutch project as a marathon, not a sprint.  Nutch's
stated goals include:

* Scale to entire web
  - pages on millions of different servers
  - billions of pages
* Support high traffic
  - thousands of searches per second
* State-of-the-art search quality

(see http://wiki.apache.org/nutch/Presentations)

It's inspiring to see a project with such ambitious goals become a reality.


On 3/5/06, Byron Miller <by...@yahoo.com> wrote:
> I like to think of it as a framework: building blocks
> to build what you ultimately need.
>
> If you're after the one-stop shop, plug and play, no
> development necessary, then perhaps some other
> commercial systems may be your best bet.
>
> The mailing list is very active, and most people get
> responses fairly quickly. If a question is ignored, it's
> often because it has already been answered.
>
> To really understand Nutch you need to understand
> Lucene, Hadoop and search in general, and the wikis of
> both Lucene and Nutch are a great read.
>
> If all of this is above one's head or not within your
> time frame to bother with then, as I said, there are
> other products out there.
>
> Other than that, I'm happily running Nutch, looking
> forward to a billion+ page index and enjoying picking
> the brains of the talent pool we have here.
>
> Happy nutcher
>
> -byron
> http://www.mozdex.com

Re: project vitality?

Posted by Stefan Groschupf <sg...@media-style.com>.
Hi Thomas,
For this crawl setup we have a test environment running Nutch 0.8 on
10 AMD boxes with a custom Linux build, 100Mbit on eth1 and 1Gb on eth0;
each box has its own 'caching' DNS server.
Stefan

On 06.03.2006 at 15:59, TDLN wrote:

> Stefan.
>
>> I know people with indexes of more than 500 million pages, and I
>> personally run crawls at ~300 pages per second.
>
> Sorry, but I have to ask: what kind of setup do you have (network, hw,
> Nutch version) that you manage so many pages per second?
>
> Unless this is a "company secret", it would be very nice to know how
> you manage this.
>
> Rgrds, Thomas

---------------------------------------------------------------
company:        http://www.media-style.com
forum:        http://www.text-mining.org
blog:            http://www.find23.net



Re: project vitality?

Posted by TDLN <di...@gmail.com>.
Stefan.

> I know people with indexes of more than 500 million pages, and I
> personally run crawls at ~300 pages per second.

Sorry, but I have to ask: what kind of setup do you have (network, hw, Nutch
version) that you manage so many pages per second?

Unless this is a "company secret", it would be very nice to know how you
manage this.

Rgrds, Thomas

Re: project vitality?

Posted by Stefan Groschupf <sg...@media-style.com>.
Hi Richard,

> I told you I was more than willing to help, and I think many users feel
> the same way, but I for one feel that there is a lack of documentation
> and support.  This isn't meant to offend anyone; if you are offended you
> need to toughen up your skin a little bit.

Here you can find some more documentation:
http://wiki.media-style.com/display/nutchDocu/Home

It is the first hit when you search for Nutch documentation with Google.
Sure, it is full of typos and has many language issues, since my English
is terrible, but I guess it already helps some people get Nutch 0.7 or
Nutch 0.8 up and running.

Seriously, Nutch is as production-ready as a noncommercial open
source project can be.
I know people with indexes of more than 500 million pages, and I
personally run crawls at ~300 pages per second.
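
For a rough sense of scale, the arithmetic below relates the two figures
quoted above. It assumes the ~300 pages per second rate is sustained
around the clock and that no page is fetched twice, neither of which is
stated in the message. The two figures also come from different setups,
so this is only a back-of-the-envelope illustration.

```python
# Figures quoted in the message.
PAGES_PER_SECOND = 300        # "~300 pages per second"
INDEX_SIZE = 500_000_000      # ">500 mio pages", taken as a lower bound

# Assumption: the crawl runs continuously at the quoted rate.
SECONDS_PER_DAY = 24 * 60 * 60

pages_per_day = PAGES_PER_SECOND * SECONDS_PER_DAY
days_for_index = INDEX_SIZE / pages_per_day

print(f"{pages_per_day:,} pages/day")
print(f"about {days_for_index:.1f} days to fetch {INDEX_SIZE:,} pages")
```

Under those assumptions that is roughly 26 million pages a day, or a
little over 19 days of crawling for 500 million pages.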

I'm not sure what more than that you can expect from an open source
search project.

Stefan





RE: project vitality?

Posted by Howie Wang <ho...@hotmail.com>.
I agree that the doc could be better, but I still take issue with
the earlier use of the phrase "proof-of-concept". If there are
dozens of sites using it in production, several of them indexing
hundreds of millions of pages, I don't know how you can call it
"proof-of-concept".

Honestly, I'm not sure if there's any other choice for a scalable
open source search engine. Last I checked most of the other
free projects were better suited to small site searches -- nothing
on the scale of tens of millions of pages.

So kudos, Nutch developers!

Howie



RE: project vitality?

Posted by Richard Braman <rb...@bramantax.com>.
>don't expect polish.
You shouldn't need polish to be able to learn the command required to
resume an aborted crawl, or to index what you have already crawled.
Things like this shouldn't require an Easter egg hunt.  They are going
to happen to everyone doing more than a simple crawl.

>If you find a bug, please file a bug report, so that other folks are
>aware of it.
I have reported 2 so far.  I have a third one (and a patch) that I am
still in the process of developing and documenting, which relates to
parsing PDFs.

>Better yet, if you have a 
>solution or improvement, please construct a patch file (even for 
>documentation) and attach it to a bug report. On the wiki, anyone can 
>make themselves an account and update documentation. We don't boss 
>folks around here, or complain. We pitch in and help.

In the email I sent you I volunteered to help by offering to polish the
documentation myself.  I do need some answers first.  Many of the
questions that get asked on this list unfortunately go unanswered by the
experts.  If they go unanswered, it is impossible for those who would
otherwise share their solutions on the wiki to do so, because there is
no solution to share.

If I went and posted my knowledge about indexing and restarting crawls,
it wouldn't be any better than what is already up there, which is
incomplete and incorrect.  I know there are those of you who know Nutch
inside and out. Right now that's just a few guys.  I know I want to know
more about it; that's why I am spending my free time trying to learn.
Everything I am doing is part of an open source search project, not a
commercial endeavour. I always contribute my knowledge back by posting
answers to things I know about.

Documentation, whether we like it or not, is key to the use of the
product. The onus is on the developers to document the project, and to
provide support when the documentation is clearly lacking.  Once the
developers share more of their knowledge, there will be more
knowledgeable users and the developers won't need to spend as much time
on support and documentation.

I would agree that if you have 1 URL to crawl, and you crawl it with
depth = 3-6, Nutch is easy to use.  I tried with depth=10, and I hit a
snag.  This has been very hard to get through, given the lack of
documentation.  I have Nutch up and running fine here:
http://24.75.221.234:8080
But this is a simple crawl and doesn't reflect all of the pages needed
to make a good search engine.

I told you I was more than willing to help, and I think many users feel
the same way, but I for one feel that there is a lack of documentation
and support.  This isn't meant to offend anyone; if you are offended you
need to toughen up your skin a little bit.








Re: project vitality?

Posted by sudhendra seshachala <su...@yahoo.com>.
I could not agree with Doug more. This is one of the best. I am trying
UIMA too, though UIMA also uses Lucene; as of today, it is still a
framework and community in its early stages.

In fact, the nightly builds have good improvements over 0.71.
Any serious user or adopter should be trying a snapshot of the
nightly build.

Doug,
It would be better if there were an official 0.8 release, or at least an
RC, before the major 1.0 release. I am a newbie, so let me know about
ideas on releasing 0.8.

Thanks
Sudhi
  




  Sudhi Seshachala
  http://sudhilogs.blogspot.com/
   


		

Re: project vitality?

Posted by Doug Cutting <cu...@apache.org>.
Richard Braman wrote:
> I think it is still very much at proof of concept stage.  I think it is
> close, but as you have mentioned, the website is severely out of date
> and the information and documentation on it lacks luster.

It stands to reason that if the documentation lacks "luster" the project 
must be dead!  Seriously, this is an active project.  It is not yet 1.0, 
so don't expect polish.  If it doesn't look easily usable to you then 
perhaps it is not.  It's still for early adopters.

The commit list shows a fair amount of activity:

http://www.mail-archive.com/nutch-commits%40lucene.apache.org/maillist.html

Lots of public sites are using Nutch.  Some are listed at 
http://wiki.apache.org/nutch/PublicServers, but many are not, like 
http://search.bittorrent.com/.

> I have tried
> to get the tutorial and faqs updated, but I haven't heard back.

This is an all-volunteer project.  If you find a bug, please file a bug 
report, so that other folks are aware of it.  Better yet, if you have a 
solution or improvement, please construct a patch file (even for 
documentation) and attach it to a bug report.  On the wiki, anyone can 
make themselves an account and update documentation.  We don't boss 
folks around here, or complain.  We pitch in and help.

Doug

Re: project vitality?

Posted by gekkokid <me...@gekkokid.org.uk>.
It's past the concept stage; Technorati uses Lucene. In open source
projects, the last thing people want to do is documentation.

Does anybody know why Yahoo took down their Nutch server?


----- Original Message ----- 
From: "Howie Wang" <ho...@hotmail.com>
To: <rb...@bramantax.com>; <nu...@lucene.apache.org>
Sent: Saturday, March 04, 2006 1:09 AM
Subject: RE: project vitality?


>I wouldn't call Nutch 0.7.x proof-of-concept. There are several
> production sites running it already:
>
> http://wiki.apache.org/nutch/PublicServers
>
> Plus I think Technorati is built on Nutch and/or Lucene.
>
> That said, the doc could be better, and it's probably a good idea
> if you know Java since you might have to tweak the code a bit to
> get the exact behavior you want.  If you don't have special needs,
> you could get something like a site search up in very little time.
>
> The newer versions seem to be changing a lot still though. I've
> been waiting for the dust to settle before I see if I want to upgrade.
>
> Howie


RE: project vitality?

Posted by Howie Wang <ho...@hotmail.com>.
I wouldn't call Nutch 0.7.x proof-of-concept. There are several
production sites running it already:

http://wiki.apache.org/nutch/PublicServers

Plus I think Technorati is built on Nutch and/or Lucene.

That said, the doc could be better, and it's probably a good idea
if you know Java since you might have to tweak the code a bit to
get the exact behavior you want.  If you don't have special needs,
you could get something like a site search up in very little time.

The newer versions seem to be changing a lot still though. I've
been waiting for the dust to settle before I see if I want to upgrade.

Howie

>I think it is still very much at proof of concept stage.  I think it is
>close, but as you have mentioned, the website is severely out of date
>and the information and documentation on it lacks luster.  I have tried
>to get the tutorial and FAQs updated, but I haven't heard back.



RE: project vitality?

Posted by Richard Braman <rb...@bramantax.com>.
I think it is still very much at proof of concept stage.  I think it is
close, but as you have mentioned, the website is severely out of date
and the information and documentation on it lacks luster.  I have tried
to get the tutorial and FAQs updated, but I haven't heard back.



Re: project vitality?

Posted by Byron Miller <by...@yahoo.com>.
I like to think of it as a framework: building blocks
to build what you ultimately need.

If you're after the one-stop shop, plug and play, no
development necessary, then perhaps some other
commercial systems may be your best bet.

The mailing list is very active, and most people get
responses fairly quickly. If a question is ignored, it's
often because it has already been answered.

To really understand Nutch you need to understand
Lucene, Hadoop and search in general, and the wikis of
both Lucene and Nutch are a great read.

If all of this is above one's head or not within your
time frame to bother with then, as I said, there are
other products out there.

Other than that, I'm happily running Nutch, looking
forward to a billion+ page index and enjoying picking
the brains of the talent pool we have here.

Happy nutcher

-byron
http://www.mozdex.com

