Posted to dev@nutch.apache.org by Sami Siren <ss...@gmail.com> on 2007/01/16 16:53:41 UTC

Next Nutch release

Hello,

It has been a while since the previous release (0.8.1), and looking at the
great fixes done in trunk I'd like to start thinking about baking a new
release soon.

Looking at the jira roadmaps, there is 1 blocking issue (fixing the
license headers) for 0.8.2 and two other blocking issues for 0.9.0, of
which I think NUTCH-233 is safe to put in.

The top 10 voted issues are currently:

NUTCH-61 	Adaptive re-fetch interval. Detecting unmodified content
NUTCH-48 	"Did you mean" query enhancement/refinement feature
NUTCH-251 	Administration GUI
NUTCH-289 	CrawlDatum should store IP address
NUTCH-36 	Chinese in Nutch
NUTCH-185 	XMLParser is configurable xml parser plugin.
NUTCH-59 	meta data support in webdb
NUTCH-92 	DistributedSearch incorrectly scores results
NUTCH-68 	A tool to generate arbitrary fetchlists
NUTCH-87 	Efficient site-specific crawling for a large number of sites

Are there any opinions about issues that should go in before the next
release? (Answering yes means that you are willing to provide a patch for
it.)

--
 Sami Siren

Re: Next Nutch release

Posted by Sami Siren <ss...@gmail.com>.
>
> > The top 10 voted issues are currently:
> >
> > NUTCH-61       Adaptive re-fetch interval. Detecting unmodified content
> >
>
> Well ... I'm of a split mind on this. I can bring this patch up to date
> and apply it before 0.9.0, if we understand that this is a "0" release
> ... ;) Otherwise I'd prefer to wait with it right after the release.


+1 for putting it in after 0.9.0

> I would like also to proceed with NUTCH-339 (Fetcher2 patches plus
> some changes I made in the meantime), since I'd like to expose the new
> fetcher to a broader audience, and it doesn't affect the existing
> implementation.


+1 for putting it in before 0.9.0


> > NUTCH-48      "Did you mean" query enhancement/refinement feature
> > NUTCH-251     Administration GUI
> > NUTCH-289     CrawlDatum should store IP address
> >
>
> I'm still not entirely convinced about this - and there is already a
> mechanism in place to support it if someone really wishes to keep this
> particular info (CrawlDatum.metaData).
>
> > NUTCH-36      Chinese in Nutch
> > NUTCH-185     XMLParser is configurable xml parser plugin.
> > NUTCH-59      meta data support in webdb
> > NUTCH-92      DistributedSearch incorrectly scores results
>
> This is too intrusive to fix just before the release - and needs
> additional discussion.


+1

> > NUTCH-68      A tool to generate arbitrary fetchlists
>
> Easy to port this to 0.9.0 - I can do this.


cool.


I'll start working on the headers and stuff to get the blocking issue out of the way.

--
 Sami Siren

Re: Next Nutch release

Posted by Chris Mattmann <ch...@jpl.nasa.gov>.
Folks,

 When would you like to make the release? I've been working on NUTCH-185,
but got a bit bogged down with other work. If there is interest in having
NUTCH-185 included in the release, I could make a push to get out a patch by
week's end...

 As for the rest, my +1 for NUTCH-61 being included sooner rather than
later. It seems that the patch has garnered enough use and attention that
folks would like to see it in the release. I think the email from the user
trying to manage a terabyte of data a few days back was particularly
telling.

Cheers,
  Chris





RE: Next Nutch release

Posted by Alan Tanaman <al...@idna-solutions.com>.
All,

+5 on NUTCH-61
So far, we have been trying to use this patch with partial success on 0.8.1.
We would be happy to help with work on updating/testing this.

Obviously we are hardly impartial, and we would also like to have NUTCH-422
(index-extra plugin) incorporated (although we are aware that we still have
some cleanup to do and the provision of junit tests).

We have done some further work on NUTCH-185 (XMLParser is configurable xml
parser plugin), but haven't posted as yet because the work is perhaps too
highly-customized (we generate fields automatically without any need to
configure a specific Xpath).  We are still deliberating over the desired
configuration to do this without conflicting with those implementations
where it is necessary to specify which fields go into the index.

Apart from these, we would find the following candidates, which we hope to
use/work on very soon (but perhaps not soon enough for this release), very
useful:

NUTCH-48 	"Did you mean" query enhancement/refinement feature
NUTCH-251 	Administration GUI
NUTCH-36 	Chinese in Nutch
NUTCH-92 	DistributedSearch incorrectly scores results

Best regards,
Alan
_________________________
Alan Tanaman
iDNA Solutions
http://blog.idna-solutions.com





Re: Next Nutch release

Posted by Andrzej Bialecki <ab...@getopt.org>.
Sami Siren wrote:
> Hello,
>
> It has been a while from a previous release (0.8.1) and looking at the
> great fixes done in trunk I'd start thinking about baking a new release
> soon.
>
> Looking at the jira roadmaps there are 1 blocking issues (fixing the
> license headers) for 0.8.2 and two other blocking issues for 0.9.0 of
> which I think NUTCH-233 is safe to put in.
>   

Agreed. The replacement regex mentioned in the original comment seems 
safe enough, and simpler.

> The top 10 voted issues are currently:
>
> NUTCH-61  	 Adaptive re-fetch interval. Detecting unmodified content
>   

Well ... I'm of a split mind on this. I can bring this patch up to date 
and apply it before 0.9.0, if we understand that this is a "0" release 
... ;) Otherwise I'd prefer to wait with it right after the release.

I would like also to proceed with NUTCH-339 (Fetcher2 patches plus 
some changes I made in the meantime), since I'd like to expose the new 
fetcher to a broader audience, and it doesn't affect the existing 
implementation.


> NUTCH-48 	"Did you mean" query enhancement/refinement feature
> NUTCH-251 	Administration GUI
> NUTCH-289 	CrawlDatum should store IP address
>   

I'm still not entirely convinced about this - and there is already a 
mechanism in place to support it if someone really wishes to keep this 
particular info (CrawlDatum.metaData).

> NUTCH-36 	Chinese in Nutch
> NUTCH-185 	XMLParser is configurable xml parser plugin.
> NUTCH-59 	meta data support in webdb
> NUTCH-92 	DistributedSearch incorrectly scores results

This is too intrusive to fix just before the release - and needs 
additional discussion.


> NUTCH-68 	A tool to generate arbitrary fetchlists

Easy to port this to 0.9.0 - I can do this.


> NUTCH-87 	Efficient site-specific crawling for a large number of sites
>   



-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Next Nutch release

Posted by Andrzej Bialecki <ab...@getopt.org>.
Armel T. Nene wrote:
> I am willing to work with Andrzej to make it stable, as I understand he's the
> architect of this patch. I have the possibility of testing it in a mixed
> environment in our computer lab. This patch can be the stepping stone for
> other features such as real-time indexing and a fetch queue for index
> updating, as opposed to creating a new index each time.
>   

Thanks for taking the initiative! I'll be glad to review the patch and 
apply it right after the 0.9 release. The best way to keep the process 
open would be to make an svn diff and attach this new version of the patch 
to the JIRA issue.

-- 
Best regards,
Andrzej Bialecki     <><



RE: Next Nutch release

Posted by "Armel T. Nene" <ar...@idna-solutions.com>.
Hi guys,

I have been working on NUTCH-61 (Adaptive re-fetch interval. Detecting
unmodified content), applying it to Nutch 0.8.1. Here are some points:

1.    This feature is great for Nutch to have, as it differentiates between
modified and unmodified content and therefore avoids indexing a document
twice even if its fetch time has arrived.

a.    There are some performance issues here. Even with this patch, Nutch
still fetches the content and then checks its status against the last
modified time in the database. If it has to check 1000 files before
indexing the following 10 files, this will cause a real problem for those
who are after real-time indexing.

 

2.    Since I applied this patch to Nutch 0.8.1, when I try to parse xml
files with our modified version of the xmlparser/indexer plugin, the
fetcher throws the following exception:

 

WARN  fetcher.Fetcher - Error parsing:
file:/C:/880254/8802_583254_20051006_12.xml: failed(2,200):
java.lang.IllegalStateException: Root element not set

 

The system will not hang or crash, but the xml file will be indexed without
any generated fields. The plugins work fine without the patch. I have
another parser that parses graphics and other formats that fails when used
with the patch. So far this problem occurs when using the file protocol.

 

3.    The patch works fine when indexing web sites using the http protocol.
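
As an illustration of the idea discussed in point 1 (this is not the NUTCH-61 patch itself), here is a minimal Java sketch of an adaptive re-fetch interval driven by a content signature check. The class name, the doubling/halving rule, and the one-hour and thirty-day bounds are all assumptions made up for this example; the real patch may work quite differently.

```java
import java.util.Arrays;

/** Hypothetical sketch of an adaptive re-fetch schedule; not the NUTCH-61 patch. */
public class AdaptiveRefetchSketch {
    // Assumed bounds, chosen only for illustration.
    static final long MIN_INTERVAL_SEC = 3600L;             // one hour
    static final long MAX_INTERVAL_SEC = 30L * 24 * 3600;   // thirty days

    /** True when the newly fetched content has the same signature as the stored one. */
    static boolean isUnmodified(byte[] storedSignature, byte[] newSignature) {
        return Arrays.equals(storedSignature, newSignature);
    }

    /** Lengthen the interval for unchanged pages, shorten it for changed ones. */
    static long nextInterval(long currentSec, boolean unmodified) {
        long next = unmodified ? currentSec * 2 : currentSec / 2;
        // Clamp so a page is never re-fetched too often or forgotten entirely.
        return Math.max(MIN_INTERVAL_SEC, Math.min(MAX_INTERVAL_SEC, next));
    }

    public static void main(String[] args) {
        byte[] stored = {1, 2, 3};
        byte[] same = {1, 2, 3};
        byte[] changed = {1, 2, 4};
        long oneDay = 24L * 3600;
        System.out.println("unchanged page: " + nextInterval(oneDay, isUnmodified(stored, same)));
        System.out.println("changed page:   " + nextInterval(oneDay, isUnmodified(stored, changed)));
    }
}
```

Note that a check like this still requires the content to be fetched before the signatures can be compared, which is exactly the performance concern raised in point 1a.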

 

I am willing to work with Andrzej to make it stable, as I understand he's the
architect of this patch. I have the possibility of testing it in a mixed
environment in our computer lab. This patch can be the stepping stone for
other features such as real-time indexing and a fetch queue for index
updating, as opposed to creating a new index each time.

 

Best Regards,
Armel

-------------------------------------------------
Armel T. Nene
iDNA Solutions
Tel: +44 (207) 257 6124
Mobile: +44 (788) 695 0483
http://blog.idna-solutions.com



Re: Next Nutch release

Posted by Enis Soztutar <en...@gmail.com>.
Sami Siren wrote:
> 2007/1/17, Enis Soztutar <en...@gmail.com>:
>>
>> Hi all, for NUTCH-251:
>>
>> I suppose that NUTCH-251 is relatively a significant issue by the votes.
>> Stefan has written a good plugin for the admin gui and I have updated it
>> to work with nutch-0.8, hadoop 0.4.
>
>
> Good to hear someone is working on that! Why not target it to
> trunk version of Nutch?
It is targeted to the trunk already. The previous one was targeted to 
nutch-0.8, hadoop 0.4, since back then those versions were the latest in 
the trunk.
>
>>  - a web server to serve plugin jsp's
>
> Why not make it regular war? also please consider making a clean
> separation of view/logic when you implement the web ui.
As Stefan's version used an embedded Jetty server, I continued this way. 
But I will consider that possibility also.

>
> -- 
> Sami Siren
>


Re: Next Nutch release

Posted by Sami Siren <ss...@gmail.com>.
2007/1/17, Enis Soztutar <en...@gmail.com>:
>
> Hi all, for NUTCH-251:
>
> I suppose that NUTCH-251 is relatively a significant issue by the votes.
> Stefan has written a good plugin for the admin gui and I have updated it
> to work with nutch-0.8, hadoop 0.4.


Good to hear someone is working on that! Why not target it to
trunk version of Nutch?

>  - a web server to serve plugin jsp's

Why not make it a regular war? Also, please consider making a clean
separation of view/logic when you implement the web ui.

--
 Sami Siren

Re: Next Nutch release

Posted by Doug Cutting <cu...@apache.org>.
Stefan Groschupf wrote:
> I don't want to start a emotional discussion here, however talking about 
> the problem in public might help.

What, specifically, is the problem you perceive?

Doug

Re: Next Nutch release

Posted by Doug Cutting <cu...@apache.org>.
Dennis Kubes wrote:
> I will say that it is difficult for people to understand how to get more 
> involved.  I have been working with Nutch and Hadoop for almost a year 
> now on a daily basis and only now am I understanding how to contribute 
> through jira, etc.  There needs to be more guidance in helping 
> developers contribute.  For example if you want to develop a new piece 
> of function they do x, y, and z.  Here is how to patch your system. If 
> you want to develop a patch then here are the steps.

The closest thing we have currently are the HowToContribute pages:

http://wiki.apache.org/nutch/HowToContribute
http://wiki.apache.org/lucene-hadoop/HowToContribute
http://wiki.apache.org/jakarta-lucene/HowToContribute

These are not great, but they're a start.  Are there parts that are 
confusing?  Do they assume too much?  Are they missing things?  If so, 
please help to update these.

I note that the Nutch version is less evolved than the Lucene and Hadoop 
versions.

Doug


Re: Next Nutch release

Posted by Doug Cutting <cu...@apache.org>.
Dennis Kubes wrote:
> Andrzej Bialecki wrote:
>> I believe that at this point it's crucial to keep the project 
>> well-focused (at the moment I think the main focus is on larger 
>> installations, and not the small ones), and also to make Nutch 
>> attractive to developers as a reusable "search engine" component.
> 
> I think there are two areas.  One is to keep the focus as you stated 
> above.  The other is to provide a path to get more people involved.  If 
> no one objects I will continue working on such a path.

Please let me know if I can help in this "people" area.  I'm currently 
unable to assist with technical Nutch issues on a day-to-day basis, but 
I am still very interested in doing what I can to ensure Nutch's 
long-term vitality as a project.

Cheers,

Doug

Re: Next Nutch release

Posted by Dennis Kubes <nu...@dragonflymc.com>.

Andrzej Bialecki wrote:
> Dennis Kubes wrote:
>> I completely agree with this.  I am interested in devoting as much 
>> time as possible to seeing the success of Nutch, Hadoop, and Lucene.  
>> As our business grows I would also be willing to devote developers 
>> full time to work on Nutch, Hadoop, and Lucene.
>>
>> I think that at least one company needs to come out and have a 
>> production search engine that is competition, however small, to the 
>> googles and yahoos of the world, built on Nutch and Hadoop.  I thought 
>> that was the original goal of Nutch.  I know there are some out there 
>> right now like Mozdex, but I mean a true billion page system.  I think 
>> the .8 codebase, and yes improvements could be made, is capable of 
>> supporting such a system.  I think then you will see many more 
>> developers become interested in the project.  If you build it they 
>> will come.
> 
> Sure, I'd love to point people to such a system. But did you do a 
> calculation how much money in the initial investment, and then ongoing 
> costs, is needed to maintain such an index? It cannot happen just 
> because of someone's goodwill, there must be a sound business idea 
> behind it, and a team of dedicated people to make it happen and 
> persevere - not just to demonstrate how good Nutch is, but to keep up 
> for the sake of their own business.
> 
I completely agree.  We have been working on this business for almost a 
year.  We received significant seed capital to build the alpha version 
of the search, which is complete, and are in the process of securing 
first round private equity funding to scale to 100M pages this year and 
up to 1B pages in year 2.

Yes, the initial investment for hardware, data center costs, marketing 
costs, and most importantly development staff for, say, a 1 billion page 
index capable of supporting a constant 100 queries per second is around 5M, 
and as it grows into the 10-20 billion range costs can grow as high as 100M.

I think what many people don't understand is that search is as much a 
hardware (electricity, bandwidth) issue as it is a software issue.  I 
know that we couldn't have developed the systems we have without Nutch, 
Hadoop, and Lucene and that I personally and we as a company are 
completely committed to their development.

>>
>> I will say that it is difficult for people to understand how to get 
>> more involved.  I have been working with Nutch and Hadoop for almost a 
>> year now on a daily basis and only now am I understanding how to 
>> contribute through jira, etc.  There needs to be more guidance in 
>> helping developers contribute.  For example if you want to develop a 
>> new piece of function they do x, y, and z.  Here is how to patch your 
>> system. If you want to develop a patch then here are the steps.  I 
>> have programmed in Java for many years but haven't worked on many open 
>> source projects before.  The process of how they work isn't explicit 
>> and it needs to be.
> 
> Hmm. I might not be objective here anymore. There is however some 
> documentation already on the Wiki, which explains how to contribute - if 
> you feel it's inadequate please use your hard-earned experience to 
> improve it.
> 
I am in the middle of writing a new wiki page for contributing that will 
go into much more detail about the process.

>>
>> We worked up many patches for issues we came up against in the .8 and 
>> .4 codebases but they were never contributed because, as stupid as it 
>> might sound, we really don't know how to give it back.  The best thing 
>> I thought I could do was to help answer questions on the list.  Again 
>> just need a little guidance.
>>
>>>> Are you willing to spend the time and do the required refactoring? 
>>>> Anyone else?
>>
>> Yes, I am and I currently have 2 other developers that can help.
> 
> Sounds great. We could start by creating a new page on Wiki, which would 
> collect our vision for Nutch - as I mentioned to Stefan, I think we 
> should take a step back, and think about the strategy for the next 1-2 
> years of Nutch development, and what is the target audience.

I am all for this; just understand this is a new process for me, so I will 
need some guidance.

> 
>>> Sure if we start a 2.x branch and if I'm not developing for the trash 
>>> or "jira nirvana", I can imaging to contribute. I 
> 
> Just a quick comment: "jira nirvana" (which I believe refers to patches 
> sitting idle in Jira for a long time) is not caused by ill will or 
> disrespect for contributors, but foremost by limited human resources. If 
> we want to maintain a certain level of quality, these patches cannot be 
> applied blindly, but need to be reviewed, analyzed, applied, tested, and 
> committed. That's an awful lot of work for 2-3 people, who also have 
> other things to do ...
> 
> 
> 
>>> It is very less attractive to developers spending weeks to find a bug 
>>> like the regular expression one. Than such a bug sits there for month 
>>> in the jira being rejected. Sure if nobody of the contributors run 
>>> nutch with a 500 mio url 
> 
> It's not being rejected - see the comments on that issue, there is an 
> overall agreement that it's ok; it simply hasn't been applied yet. See 
> above for the why.
> 
> 
>>>> I'm slowly coming to a point where I should be able to fix it - but 
>>>> let's not throw out the baby with the water ...
>>> Wow, I hold my finger crossed!
>>
>> There is a great book on this.  It is 0691122024.  Andrzej send me 
>> your address and I will buy and ship you a copy if you don't have it.  
> 
> Too late :) I found it two weeks ago, and it's already on its merry way 
> - but thanks for the offer.
> 
>> We would also be willing to help develop this functionality further.
> 
> I started working on a testbed as a part of another commercial project, 
> it's likely that I could get a release from the customer to contribute 
> this code to the project. A testbed is a prerequisite for any serious 
> work on ranking and web graph.
> 
> (It's quite unfortunate that the best-of-breed open source framework for 
> working with web graphs is licensed under LGPL ...)
> 
>>
>> I can definitely see a desire to re-write but I think even if you 
>> re-write you are still going to have the same problem.  Search is hard 
>> and without guidance we can't get enough developers to understand what 
>> they need to know to help.
> 
> Indeed. People often don't appreciate how much heuristics and trials, 
> beyond pure academic-level IR, is needed to come up with a system that 
> gives a decent quality of results, and is manageable. Nutch may not be 
> perfect, but there's a lot of this specific knowledge already 
> accumulated here.
> 
Absolutely, and it is not knowledge that is easily found elsewhere.
> 
>> At this time I don't think it is a design problem I think it is a 
>> people problem.  I will be more than willing to head up training, 
>> documenting, and helping developers get up to speed. I just need 
>> direction in this area myself.
> 
> I believe that at this point it's crucial to keep the project 
> well-focused (at the moment I think the main focus is on larger 
> installations, and not the small ones), and also to make Nutch 
> attractive to developers as a reusable "search engine" component.

I think there are two areas.  One is to keep the focus as you stated 
above.  The other is to provide a path to get more people involved.  If 
no one objects I will continue working on such a path.

> 
> Let's continue the discussion. I'll create the page on Wiki, please feel 
> free to add your thoughts.
Will do.
> 

Re: Next Nutch release

Posted by Andrzej Bialecki <ab...@getopt.org>.
Dennis Kubes wrote:
> I completely agree with this.  I am interested in devoting as much 
> time as possible to seeing the success of Nutch, Hadoop, and Lucene.  
> As our business grows I would also be willing to devote developers 
> full time to work on Nutch, Hadoop, and Lucene.
>
> I think that at least one company needs to come out and have a 
> production search engine that is competition, however small, to the 
> googles and yahoos of the world, built on Nutch and Hadoop.  I thought 
> that was the original goal of Nutch.  I know there are some out there 
> right now like Mozdex, but I mean a true billion page system.  I think 
> the .8 codebase, and yes improvements could be made, is capable of 
> supporting such a system.  I think then you will see many more 
> developers become interested in the project.  If you build it they 
> will come.

Sure, I'd love to point people to such a system. But did you do a 
calculation how much money in the initial investment, and then ongoing 
costs, is needed to maintain such an index? It cannot happen just 
because of someone's goodwill, there must be a sound business idea 
behind it, and a team of dedicated people to make it happen and 
persevere - not just to demonstrate how good Nutch is, but to keep up 
for the sake of their own business.

>
> I will say that it is difficult for people to understand how to get 
> more involved.  I have been working with Nutch and Hadoop for almost a 
> year now on a daily basis and only now am I understanding how to 
> contribute through jira, etc.  There needs to be more guidance in 
> helping developers contribute.  For example if you want to develop a 
> new piece of function they do x, y, and z.  Here is how to patch your 
> system. If you want to develop a patch then here are the steps.  I 
> have programmed in Java for many years but haven't worked on many open 
> source projects before.  The process of how they work isn't explicit 
> and it needs to be.

Hmm. I might not be objective here anymore. There is however some 
documentation already on the Wiki, which explains how to contribute - if 
you feel it's inadequate please use your hard-earned experience to 
improve it.

>
> We worked up many patches for issues we came up against in the .8 and 
> .4 codebases but they were never contributed because, as stupid as it 
> might sound, we really don't know how to give it back.  The best thing 
> I thought I could do was to help answer questions on the list.  Again 
> just need a little guidance.
>
>>> Are you willing to spend the time and do the required refactoring? 
>>> Anyone else?
>
> Yes, I am and I currently have 2 other developers that can help.

Sounds great. We could start by creating a new page on Wiki, which would 
collect our vision for Nutch - as I mentioned to Stefan, I think we 
should take a step back, and think about the strategy for the next 1-2 
years of Nutch development, and what is the target audience.

>> Sure if we start a 2.x branch and if I'm not developing for the trash 
>> or "jira nirvana", I can imaging to contribute. I 

Just a quick comment: "jira nirvana" (which I believe refers to patches 
sitting idle in Jira for a long time) is not caused by ill will or 
disrespect for contributors, but foremost by limited human resources. If 
we want to maintain a certain level of quality, these patches cannot be 
applied blindly, but need to be reviewed, analyzed, applied, tested, and 
committed. That's an awful lot of work for 2-3 people, who also have 
other things to do ...



>> It is very less attractive to developers spending weeks to find a bug 
>> like the regular expression one. Than such a bug sits there for month 
>> in the jira being rejected. Sure if nobody of the contributors run 
>> nutch with a 500 mio url 

It's not being rejected - see the comments on that issue, there is an 
overall agreement that it's ok; it simply hasn't been applied yet. See 
above for the why.


>>> I'm slowly coming to a point where I should be able to fix it - but 
>>> let's not throw out the baby with the water ...
>> Wow, I hold my finger crossed!
>
> There is a great book on this.  It is 0691122024.  Andrzej send me 
> your address and I will buy and ship you a copy if you don't have it.  

Too late :) I found it two weeks ago, and it's already on its merry way 
- but thanks for the offer.

> We would also be willing to help develop this functionality further.

I started working on a testbed as a part of another commercial project, 
it's likely that I could get a release from the customer to contribute 
this code to the project. A testbed is a prerequisite for any serious 
work on ranking and web graph.

(It's quite unfortunate that the best-of-breed open source framework for 
working with web graphs is licensed under LGPL ...)

>
> I can definitely see a desire to re-write but I think even if you 
> re-write you are still going to have the same problem.  Search is hard 
> and without guidance we can't get enough developers to understand what 
> they need to know to help.

Indeed. People often don't appreciate how many heuristics and trials, 
beyond pure academic-level IR, are needed to come up with a system that 
gives a decent quality of results, and is manageable. Nutch may not be 
perfect, but there's a lot of this specific knowledge already 
accumulated here.


> At this time I don't think it is a design problem I think it is a 
> people problem.  I will be more than willing to head up training, 
> documenting, and helping developers get up to speed. I just need 
> direction in this area myself.

I believe that at this point it's crucial to keep the project 
well-focused (at the moment I think the main focus is on larger 
installations, and not the small ones), and also to make Nutch 
attractive to developers as a reusable "search engine" component.

Let's continue the discussion. I'll create the page on Wiki, please feel 
free to add your thoughts.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Next Nutch release

Posted by Dennis Kubes <nu...@dragonflymc.com>.
Just to put in my view.

Stefan Groschupf wrote:
> Hi Andrzej,
> 
> thank you for taking the time to comment, I highly value your comments.
> 
>> * I guess that for each case where Nutch seems inappropriate I could 
>> give you a counter-example of Nutch being used  commercially with much 
>> success. I guess it depends on a particular application and the type 
>> of customer.
> 
> Yes, it would be interesting to hear who use nutch .8 _successfully_ in 
> production.

Although I can't say who we are yet as we are in the middle of private 
equity funding,  we have built a production version categorization 
search engine that uses the Nutch .8 and hadoop .4 code base that we are 
currently in the process of scaling to 100M pages.
> 
>> * no doubt Nutch has its warts - the plugin system could be simpler, 
>> for example ;) but hey, it's great that we have a plugin system at 
>> all! It would be easier now to refactor Nutch to use a different 
>> plugin system than it was to go from the completely monolithic design 
>> to the plugin system ... As with any open source project - if you 
>> don't like it, fix it and contribute the fix.
> 
> Sure - I tried that more than once - but I do not want to start this 
> discussion again.
> 
>> * things won't happen magically unless there is a greater involvement 
>> of skilled developers. "One way road" - well, with limited resources 
>> that this project has at the moment the only way is to gradually 
>> improve, we cannot afford to abandon the current codebase and start 
>> from scratch.
> 
> I agree - the problem are skilled developers, I remember more than one 
> offer of different companies to dedicate developers to the project, but 
> looks like there was no interest.

I completely agree with this.  I am interested in devoting as much time 
as possible to seeing the success of Nutch, Hadoop, and Lucene.  As our 
business grows I would also be willing to devote developers full time to 
work on Nutch, Hadoop, and Lucene.

I think that at least one company needs to come out and have a 
production search engine that is competition, however small, to the 
googles and yahoos of the world, built on Nutch and Hadoop.  I thought 
that was the original goal of Nutch.  I know there are some out there 
right now like Mozdex, but I mean a true billion page system.  I think 
the .8 codebase, and yes improvements could be made, is capable of 
supporting such a system.  I think then you will see many more 
developers become interested in the project.  If you build it they will 
come.

I will say that it is difficult for people to understand how to get more 
involved.  I have been working with Nutch and Hadoop for almost a year 
now on a daily basis and only now am I understanding how to contribute 
through jira, etc.  There needs to be more guidance in helping 
developers contribute.  For example: if you want to develop a new piece 
of functionality, do x, y, and z.  Here is how to patch your system.  If 
you want to develop a patch, then here are the steps.  I have programmed 
in Java for many years but haven't worked on many open source projects 
before.  The process of how they work isn't explicit and it needs to be.

We worked up many patches for issues we came up against in the .8 and .4 
codebases but they were never contributed because, as stupid as it might 
sound, we really don't know how to give them back.  The best thing I 
thought I could do was to help answer questions on the list.  Again, we 
just need a little guidance.

>> Are you willing to spend the time and do the required refactoring? 
>> Anyone else?

Yes, I am and I currently have 2 other developers that can help.

> 
> In general there was some emotional discussion about API changes. Since 
> nutch is a 0.x and also a software and not a library more frequent 
> refactorings had may be improved the maintainability of the code over 
> the time.
> 
> Sure if we start a 2.x branch and if I'm not developing for the trash or 
> "jira nirvana", I can imaging to contribute. I would rethink and rewrite 
> some major parts (e.g. remove the reusage of objects with a complex 
> states and endless if than else conditions no body can debug) and I 
> believe that is not difficult. I'm not talking about the algorithm stuff 
> here.
> 
>>> May be one day we can get some developer together first think about a 
>>> good extendable design and than start a 2.x stream or a new project.
>>
>> I hope so too. But as Steve B. said once, what we need is "developers, 
>> developers, developers ..." ;)
> 
> I agree, however it must be attractive for developers to spend time in a 
> open source project. We saw many developers here. You are the only one 
> left that does some serious development and I can't find words how much 
> respect I have for your work. You are the only one that is able to fix 
> serious bugs.

We also have much respect for you Andrzej.

You may have more developers than you think.  They might just not know 
how to contribute.

> It is very less attractive to developers spending weeks to find a bug 
> like the regular expression one. Than such a bug sits there for month in 
> the jira being rejected. Sure if nobody of the contributors run nutch 
> with a 500 mio url web db, than it might be difficult to reproduce such 
> a bug. If you have a set of a such issues (another one is the gui etc.) 
> you decide to run your very own  nutch brunch in your home svn. At least 
> all of my customers did over the time. The result - no public nutch 
> contributions, no developers.
> 
>> Nutch, as it is now, is not too well-focused, so that may be the 
>> reason why it doesn't attract too many developers - and casual users 
>> find it perhaps too difficult to get interested enough to dig deeper.

Definitely agree.  Better documentation is needed to attract the more 
"casual" developers.  We would be willing to help produce this.

> 
> I agree that is another issue, since nutch tries to solve to many 
> problems at the same time the code is to difficult to understand for 
> newbies.
> 
> 
>> On one end of spectrum we have small desktop installations in mind, on 
>> the other end we have scalable 1 bln page server farms ... it's hard 
>> to satisfy everyone, and the current design is not that satisfactory 
>> for either group. So, I think a better focus is needed, combined with 
>> design that satisfies either one or the other group - or maybe two 
>> designs for each group, assuming we can motivate enough people to 
>> participate in each sub-project.
> 
> Sounds like a good idea! :-)

Agreed.
> 
>>> And ... yes no opic and yes definitely no plugin architecture (I feel 
>>> very sorry for all that wast so much life time
>>
>> Ah, the more I study the theory behind PageRank calculation the more I 
>> think OPIC is an excellent solution to this hard problem - but our 
>> current implementation is broken.
> 
> Yes - very much, a search engine that need to recrawl from scratch each 
> time to get sense-fully index scores - that is really broken.
> However at least the "page rank" implementation in nutch .7 worked great 
> for me, it just didn't scaled that well.
> 
> 
>> I'm slowly coming to a point where I should be able to fix it - but 
>> let's not throw out the baby with the water ...
> Wow, I hold my finger crossed!

There is a great book on this.  It is 0691122024.  Andrzej send me your 
address and I will buy and ship you a copy if you don't have it.  We 
would also be willing to help develop this functionality further.

> 
>>> because of my terrible complicate plugin system) but a clean IOC 
>>> design with lightweight default interface implementations and a great 
>>> test coverage.
>>> Anyway just my *very little* point of view based on 3.5 years nutch 
>>> experience.
>> I'm looking forward to your patches that implement the clean IOC 
>> design ;) Seriously - if you can show how to refactor a portion of 
>> Nutch to a clean IOC design, we will start refactoring the rest of it 
>> in this direction.
> 
Would be happy to help here as well.

> No - sorry I'm personal too tiered doing patches. I already talked to a 
> set of people and at least 2 good developers would be serious interested 
> in writing a search engine from scratch based on hadoop by "reusing" as 
> much nutch code as sense-fully. Another 2 would be interested. All those 
> people worked with nutch or working in the IR research area or in 
> vertical search companies. May be a interesting starting point for a 
> nice summer project.
> We might even find some company that would sponsor some work - I know at 
> least 2 that would be interested. You might know one as well. :-)
> 
> Anyway don't count on that - I don't know - at least in the moment I can 
> imaging more interesting things in my spear time than doing nutch patches.
> 
> 
> I don't want to start a emotional discussion here, however talking about 
> the problem in public might help.
> Cheers,
> Stefan
> 

I can definitely see a desire to re-write, but I think even if you 
re-write you are still going to have the same problem.  Search is hard 
and without guidance we can't get enough developers to understand what 
they need to know to help.  At this time I don't think it is a design 
problem; I think it is a people problem.  I will be more than willing to 
head up training, documenting, and helping developers get up to speed. 
I just need direction in this area myself.

Dennis Kubes

Re: Next Nutch release

Posted by Stefan Groschupf <sg...@101tec.com>.
Hi Andrzej,

thank you for taking the time to comment, I highly value your comments.

> * I guess that for each case where Nutch seems inappropriate I  
> could give you a counter-example of Nutch being used  commercially  
> with much success. I guess it depends on a particular application  
> and the type of customer.

Yes, it would be interesting to hear who uses nutch .8 _successfully_ 
in production.

> * no doubt Nutch has its warts - the plugin system could be  
> simpler, for example ;) but hey, it's great that we have a plugin  
> system at all! It would be easier now to refactor Nutch to use a  
> different plugin system than it was to go from the completely  
> monolithic design to the plugin system ... As with any open source  
> project - if you don't like it, fix it and contribute the fix.

Sure - I tried that more than once - but I do not want to start this  
discussion again.

> * things won't happen magically unless there is a greater  
> involvement of skilled developers. "One way road" - well, with  
> limited resources that this project has at the moment the only way  
> is to gradually improve, we cannot afford to abandon the current  
> codebase and start from scratch.

I agree - the problem is skilled developers. I remember more than 
one offer from different companies to dedicate developers to the 
project, but it looks like there was no interest.

> Are you willing to spend the time and do the required refactoring?  
> Anyone else?

In general there was some emotional discussion about API changes. 
Since nutch is a 0.x release, and an application rather than a library, 
more frequent refactorings might have improved the maintainability of 
the code over time.

Sure, if we start a 2.x branch and if I'm not developing for the trash 
or "jira nirvana", I can imagine contributing. I would rethink and 
rewrite some major parts (e.g. remove the reuse of objects with 
complex state and endless if-then-else conditions nobody can debug) 
and I believe that is not difficult. I'm not talking about the 
algorithm stuff here.

>> May be one day we can get some developer together first think  
>> about a good extendable design and than start a 2.x stream or a  
>> new project.
>
> I hope so too. But as Steve B. said once, what we need is  
> "developers, developers, developers ..." ;)

I agree, however it must be attractive for developers to spend time 
in an open source project. We saw many developers here. You are the 
only one left that does some serious development and I can't find 
words for how much respect I have for your work. You are the only one 
that is able to fix serious bugs.

It is much less attractive for developers to spend weeks finding a bug 
like the regular expression one. Then such a bug sits there for months 
in the jira being rejected. Sure, if none of the contributors runs 
nutch with a 500 million url web db, then it might be difficult to 
reproduce such a bug. If you have a set of such issues (another one 
is the gui etc.) you decide to run your very own nutch branch in 
your home svn. At least all of my customers did over time. The 
result - no public nutch contributions, no developers.


> Nutch, as it is now, is not too well-focused, so that may be the  
> reason why it doesn't attract too many developers - and casual  
> users find it perhaps too difficult to get interested enough to dig  
> deeper.

I agree, that is another issue: since nutch tries to solve too many 
problems at the same time, the code is too difficult to understand for 
newbies.


> On one end of spectrum we have small desktop installations in mind,  
> on the other end we have scalable 1 bln page server farms ... it's  
> hard to satisfy everyone, and the current design is not that  
> satisfactory for either group. So, I think a better focus is  
> needed, combined with design that satisfies either one or the other  
> group - or maybe two designs for each group, assuming we can  
> motivate enough people to participate in each sub-project.

Sounds like a good idea! :-)

>> And ... yes no opic and yes definitely no plugin architecture (I  
>> feel very sorry for all that wast so much life time
>
> Ah, the more I study the theory behind PageRank calculation the  
> more I think OPIC is an excellent solution to this hard problem -  
> but our current implementation is broken.

Yes - very much, a search engine that needs to recrawl from scratch 
each time to get sensible index scores - that is really broken.
However at least the "page rank" implementation in nutch .7 worked 
great for me, it just didn't scale that well.


> I'm slowly coming to a point where I should be able to fix it - but  
> let's not throw out the baby with the water ...
Wow, I'm keeping my fingers crossed!

>> because of my terrible complicate plugin system) but a clean IOC  
>> design with lightweight default interface implementations and a  
>> great test coverage.
>> Anyway just my *very little* point of view based on 3.5 years  
>> nutch experience.
> I'm looking forward to your patches that implement the clean IOC  
> design ;) Seriously - if you can show how to refactor a portion of  
> Nutch to a clean IOC design, we will start refactoring the rest of  
> it in this direction.

No - sorry, I'm personally too tired of doing patches. I already talked 
to a set of people and at least 2 good developers would be seriously 
interested in writing a search engine from scratch based on hadoop by 
"reusing" as much nutch code as is sensible. Another 2 would be 
interested. All those people have worked with nutch or are working in 
the IR research area or in vertical search companies. Maybe an 
interesting starting point for a nice summer project.
We might even find some company that would sponsor some work - I know 
at least 2 that would be interested. You might know one as well. :-)

Anyway don't count on that - I don't know - at least at the moment I 
can imagine more interesting things to do in my spare time than doing 
nutch patches.


I don't want to start an emotional discussion here, however talking 
about the problem in public might help.
Cheers,
Stefan


Re: Next Nutch release

Posted by Andrzej Bialecki <ab...@getopt.org>.
Stefan Groschupf wrote:
> Hi Scott,
>
> feel free - I have no options on that.
>
> From my very little point of view the nutch > .8 source stream is a 
> one way street.
> In all my projects we move as far as possible away from nutch. I like 
> hadoop a lot and writing customer tools on top of it is - that easy.
> But nutch .8 was a proof of concept for the early hadoop.  There is 
> only one serious developer left and wow how great he does his job - 
> but nutch >.8 is just too monolithic, to difficult to extend, to 
> difficult to debug, to difficult to integrate for a serious mission 
> critical application.
> I spend a signification part of my life daily working with nutch, but 
> if someone would ask - I would answer don't use it.

Let me comment on what you said:

* I guess that for each case where Nutch seems inappropriate I could 
give you a counter-example of Nutch being used  commercially with much 
success. I guess it depends on a particular application and the type of 
customer.

* no doubt Nutch has its warts - the plugin system could be simpler, for 
example ;) but hey, it's great that we have a plugin system at all! It 
would be easier now to refactor Nutch to use a different plugin system 
than it was to go from the completely monolithic design to the plugin 
system ... As with any open source project - if you don't like it, fix 
it and contribute the fix.

* things won't happen magically unless there is a greater involvement of 
skilled developers. "One way road" - well, with limited resources that 
this project has at the moment the only way is to gradually improve, we 
cannot afford to abandon the current codebase and start from scratch. 
Are you willing to spend the time and do the required refactoring? 
Anyone else?

> May be one day we can get some developer together first think about a 
> good extendable design and than start a 2.x stream or a new project.

I hope so too. But as Steve B. said once, what we need is "developers, 
developers, developers ..." ;)

Nutch, as it is now, is not too well-focused, so that may be the reason 
why it doesn't attract too many developers - and casual users find it 
perhaps too difficult to get interested enough to dig deeper. On one end 
of spectrum we have small desktop installations in mind, on the other 
end we have scalable 1 bln page server farms ... it's hard to satisfy 
everyone, and the current design is not that satisfactory for either 
group. So, I think a better focus is needed, combined with design that 
satisfies either one or the other group - or maybe two designs for each 
group, assuming we can motivate enough people to participate in each 
sub-project.

> And ... yes no opic and yes definitely no plugin architecture (I feel 
> very sorry for all that wast so much life time 

Ah, the more I study the theory behind PageRank calculation the more I 
think OPIC is an excellent solution to this hard problem - but our 
current implementation is broken. I'm slowly coming to a point where I 
should be able to fix it - but let's not throw out the baby with the 
water ...

> because of my terrible complicate plugin system) but a clean IOC 
> design with lightweight default interface implementations and a great 
> test coverage.
> Anyway just my *very little* point of view based on 3.5 years nutch 
> experience.
I'm looking forward to your patches that implement the clean IOC 
design ;) Seriously - if you can show how to refactor a portion of 
Nutch to a clean IOC design, we will start refactoring the rest of it 
in this direction.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Next Nutch release

Posted by Stefan Groschupf <sg...@101tec.com>.
Hi Scott,

feel free - I have no opinion on that.

From my very little point of view the nutch > .8 source stream is a 
one way street.
In all my projects we move as far as possible away from nutch. I like 
hadoop a lot and writing customer tools on top of it is - that easy.
But nutch .8 was a proof of concept for the early hadoop.  There is 
only one serious developer left and wow how great he does his job - 
but nutch >.8 is just too monolithic, too difficult to extend, too 
difficult to debug, too difficult to integrate for a serious mission 
critical application.
I spend a significant part of my life daily working with nutch, but 
if someone would ask - I would answer don't use it.
Maybe one day we can get some developers together, first think about a 
good extensible design, and then start a 2.x stream or a new project.
And ... yes no opic and yes definitely no plugin architecture (I feel 
very sorry for all who wasted so much of their lifetime because of my 
terribly complicated plugin system) but a clean IOC design with 
lightweight default interface implementations and great test coverage.
Anyway just my *very little* point of view based on 3.5 years nutch 
experience.

Stefan





On 18.01.2007, at 21:33, Scott Green wrote:

> Stefan,
>
> I also dived into contrib/web2 in nutch. The one and admin-gui are
> both owns some plugins based on nutch plugin architecture. So I think
> it is great if we extract something in high level and they should have
> a lot commons.  Well, i dont know it is the right time to do this job.
>
> On 1/19/07, Stefan Groschupf <sg...@101tec.com> wrote:
>> Hi,
>> > I just finished reading all source code about nutch gui. And
>> > personally i don't like putting a lot of code snippets into jsp  
>> files
>> > since it takes a lot time when refactoring. So how about to adopt
>> > using velocity/freemarker with servlet?
>>
>>
>> In general I agree it is the view layer and should have as less as
>> possible code, however the idea was to have as less as possible
>> dependencies to thirdparty tools and libraries and also getting
>> things realized with low tech (jsp).
>>
>> Stefan
>>
>>
>>
>>
>

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
101tec Inc.
Menlo Park, California
http://www.101tec.com




Re: Next Nutch release

Posted by Scott Green <sm...@gmail.com>.
Stefan,

I also dived into contrib/web2 in nutch. Both it and the admin-gui
own some plugins based on the nutch plugin architecture. So I think
it would be great if we extracted something at a higher level, since
they should have a lot in common.  Well, I don't know whether it is the
right time to do this job.

On 1/19/07, Stefan Groschupf <sg...@101tec.com> wrote:
> Hi,
> > I just finished reading all source code about nutch gui. And
> > personally i don't like putting a lot of code snippets into jsp files
> > since it takes a lot time when refactoring. So how about to adopt
> > using velocity/freemarker with servlet?
>
>
> In general I agree it is the view layer and should have as less as
> possible code, however the idea was to have as less as possible
> dependencies to thirdparty tools and libraries and also getting
> things realized with low tech (jsp).
>
> Stefan
>
>
>
>

Re: Next Nutch release

Posted by Stefan Groschupf <sg...@101tec.com>.
Hi,
> I just finished reading all source code about nutch gui. And
> personally i don't like putting a lot of code snippets into jsp files
> since it takes a lot time when refactoring. So how about to adopt
> using velocity/freemarker with servlet?


In general I agree it is the view layer and should have as little 
code as possible, however the idea was to have as few dependencies 
as possible on third-party tools and libraries and also to get 
things realized with low tech (jsp).

Stefan




Re: Next Nutch release

Posted by Scott Green <sm...@gmail.com>.
Hi,

I just finished reading all the source code of the nutch gui. And
personally I don't like putting a lot of code snippets into jsp files
since it takes a lot of time when refactoring. So how about adopting
velocity/freemarker with servlets?

On 1/17/07, Enis Soztutar <en...@gmail.com> wrote:
> Hi all, for NUTCH-251:
>
> I suppose that NUTCH-251 is relatively a significant issue by the votes.
> Stafan has written a good plugin for the admin gui and i have updated it
> to work with nutch-0.8, hadoop 0.4.
>
> Some of the features in the patch is not appropriate for our use cases
> and it requires hadoop changes, thus I am currently working on an
> alternative implementation of the administration gui, which runs a
> hadoop server( like JobTraker) to listen to submitted Jobs, an web Gui
> to submit and track the jobs from the browser and a job runner.
>
> The architechture details of the patch is as follows :
>
>   - An interface AdminJob which is an abstract class representing a Job
> in nutch.
>   - various classes extending AdminJob. for ex FetchAdminJob, IndexAdminJob.
>   - A queue which sorts the jobs in priority order, by a modified a
> topological sort(jobs can be dependent).
>   - an interface to submit Jobs
>   - a rpc server to listen to job submissions
>   - an extension point (basically same as the previous)
>   - a web server to serve plugin jsp's
>
> upon the features will be
>     - submitting jobs from code, command line or web interface,
>     - tracking jobs from the command line or web interface
>     - scheduling jobs
>
> I could send the code or details if anyone is interested in pretesting.
> And i will appreciate any comments and suggestions on this. I am
> planning to complete the patch and submit it to Jira ASAP.
>
> Sami Siren wrote:
> > Hello,
> >
> > It has been a while from a previous release (0.8.1) and looking at the
> > great fixes done in trunk I'd start thinking about baking a new release
> > soon.
> >
> > Looking at the jira roadmaps there are 1 blocking issues (fixing the
> > license headers) for 0.8.2 and two other blocking issues for 0.9.0 of
> > which I think NUTCH-233 is safe to put in.
> >
> > The top 10 voted issues are currently:
> >
> > NUTCH-61       Adaptive re-fetch interval. Detecting umodified content
> > NUTCH-48      "Did you mean" query enhancement/refignment feature
> > NUTCH-251     Administration GUI
> > NUTCH-289     CrawlDatum should store IP address
> > NUTCH-36      Chinese in Nutch
> > NUTCH-185     XMLParser is configurable xml parser plugin.            NUTCH-59        meta
> > data support in webdb
> > NUTCH-92      DistributedSearch incorrectly scores results            NUTCH-68        A
> > tool to generate arbitrary fetchlists                 NUTCH-87        Efficient
> > site-specific crawling for a large number of sites
> >
> > Are there any opinions about issues that should go in before the next
> > release (Answering yes means that you are willing to provide a patch for
> > it).
> >
> > --
> >  Sami Siren
> >
> >
>
>

Re: Next Nutch release

Posted by Stefan Groschupf <sg...@101tec.com>.
The old hadoop patch is here:
https://issues.apache.org/jira/browse/NUTCH-251
Also we had this conversation:
http://www.mail-archive.com/hadoop-dev@lucene.apache.org/msg00314.html
I guess after this we failed to post the patches we use internally.

If someone feels strongly about getting the gui working with hadoop, 
he/she should feel free to update the patch and post it in the hadoop jira.

Stefan







On 18.01.2007, at 15:39, Doug Cutting wrote:

> Stefan Groschupf wrote:
>> We run the gui in several production environemnts with patched  
>> hadoop code - since this is from our point of view the clean  
>> approach. Everything else feels like a workaround to fix some  
>> strange hadoop behaviors.
>
> Are there issues in Hadoop's Jira for these?  If so, do they have  
> patches attached?  Are they linked to the corresponding issue in  
> Nutch?
>
> Doug
>

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
101tec Inc.
Menlo Park, California
http://www.101tec.com




Re: Next Nutch release

Posted by Doug Cutting <cu...@apache.org>.
Stefan Groschupf wrote:
> We run the gui in several production environemnts with patched hadoop 
> code - since this is from our point of view the clean approach. 
> Everything else feels like a workaround to fix some strange hadoop 
> behaviors.

Are there issues in Hadoop's Jira for these?  If so, do they have 
patches attached?  Are they linked to the corresponding issue in Nutch?

Doug

Re: Next Nutch release

Posted by Stefan Groschupf <sg...@101tec.com>.
Hi,

great to hear people are still working on things. It shows once more 
that getting something in early would save some effort. :)
Just some random comments.

We run the gui in several production environments with patched hadoop 
code - since this is from our point of view the clean approach. 
Everything else feels like a workaround to fix some strange hadoop 
behaviors. It was maybe a long time ago that I spoke to Doug and some 
other Hadoop developers, but at that time I understood that there was 
a general interest in having a nutch gui and in supporting the 
required functionality in hadoop.
I'm not sure if that is still the case or if I had a wrong impression.
In any case from my p.o.v. the clean way would be getting the 
required minor changes into hadoop (not critical, simple stuff from my 
point of view) instead of implementing workarounds in nutch. Since 
hadoop is a kind of child of nutch there should be a close 
relationship, at least to discuss things.
Anyway no strong opinion, just my 2 cents. In any case I'm very happy 
if people now see the need for a gui as well and someone is working 
on that, since I'm kind of busy with other projects.

Thanks.
Stefan


On 17.01.2007, at 06:42, Enis Soztutar wrote:

> Hi all, for NUTCH-251:
>
> I suppose that NUTCH-251 is relatively a significant issue by the  
> votes. Stafan has written a good plugin for the admin gui and i  
> have updated it to work with nutch-0.8, hadoop 0.4.
>
> Some of the features in the patch is not appropriate for our use  
> cases and it requires hadoop changes, thus I am currently working  
> on an alternative implementation of the administration gui, which  
> runs a hadoop server( like JobTraker) to listen to submitted Jobs,  
> an web Gui to submit and track the jobs from the browser and a job  
> runner.
>
> The architechture details of the patch is as follows :
>
>  - An interface AdminJob which is an abstract class representing a  
> Job in nutch.
>  - various classes extending AdminJob. for ex FetchAdminJob,  
> IndexAdminJob.
>  - A queue which sorts the jobs in priority order, by a modified a  
> topological sort(jobs can be dependent).
>  - an interface to submit Jobs
>  - a rpc server to listen to job submissions
>  - an extension point (basically same as the previous)
>  - a web server to serve plugin jsp's
>
> upon the features will be
>    - submitting jobs from code, command line or web interface,
>    - tracking jobs from the command line or web interface
>    - scheduling jobs
>
> I could send the code or details if anyone is interested in  
> pretesting. And i will appreciate any comments and suggestions on  
> this. I am planning to complete the patch and submit it to Jira ASAP.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
101tec Inc.
Menlo Park, California
http://www.101tec.com




Re: Next Nutch release

Posted by Enis Soztutar <en...@gmail.com>.
Hi all, regarding NUTCH-251:

Judging by the votes, NUTCH-251 is a relatively significant issue. 
Stefan has written a good plugin for the admin GUI, and I have updated 
it to work with nutch-0.8 and hadoop 0.4.

Some of the features in the patch are not appropriate for our use cases, 
and it requires Hadoop changes, so I am currently working on an 
alternative implementation of the administration GUI, which runs a 
Hadoop server (like JobTracker) to listen for submitted jobs, a web GUI 
to submit and track the jobs from the browser, and a job runner.

The architecture of the patch is as follows:

  - An interface, AdminJob, which is an abstract class representing a 
    job in Nutch.
  - Various classes extending AdminJob, e.g. FetchAdminJob and IndexAdminJob.
  - A queue that sorts the jobs in priority order using a modified 
    topological sort (jobs can depend on each other).
  - An interface to submit jobs.
  - An RPC server to listen for job submissions.
  - An extension point (basically the same as the previous one).
  - A web server to serve plugin JSPs.
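
The queue item above (priority order combined with job dependencies) could 
be sketched roughly as follows. This is only an illustration, not the code 
from the patch: the class names AdminJob, FetchAdminJob and IndexAdminJob 
come from the list above, but the fields, constructors, the schedule() 
method and the "higher number runs first" convention are all assumptions. 
The sketch uses Kahn's topological sort with a priority queue holding the 
jobs whose dependencies are already satisfied.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical sketch: names other than AdminJob, FetchAdminJob and
// IndexAdminJob are invented for illustration.
class AdminJobQueueSketch {

    /** Stand-in for the abstract AdminJob described in the message. */
    static abstract class AdminJob {
        final String name;
        final int priority;              // assumption: higher value runs earlier
        final List<AdminJob> dependsOn;  // jobs that must finish first

        AdminJob(String name, int priority, AdminJob... deps) {
            this.name = name;
            this.priority = priority;
            this.dependsOn = Arrays.asList(deps);
        }

        abstract void run();
    }

    static class FetchAdminJob extends AdminJob {
        FetchAdminJob(String name, int priority, AdminJob... deps) { super(name, priority, deps); }
        @Override void run() { System.out.println("fetching: " + name); }
    }

    static class IndexAdminJob extends AdminJob {
        IndexAdminJob(String name, int priority, AdminJob... deps) { super(name, priority, deps); }
        @Override void run() { System.out.println("indexing: " + name); }
    }

    /** Orders jobs so dependencies come first; among runnable jobs, highest priority first. */
    static List<AdminJob> schedule(List<AdminJob> jobs) {
        Map<AdminJob, Integer> pending = new HashMap<>();            // unmet dependency counts
        Map<AdminJob, List<AdminJob>> dependents = new HashMap<>();  // reverse edges
        for (AdminJob job : jobs) {
            pending.put(job, job.dependsOn.size());
            for (AdminJob dep : job.dependsOn) {
                dependents.computeIfAbsent(dep, k -> new ArrayList<>()).add(job);
            }
        }
        PriorityQueue<AdminJob> ready = new PriorityQueue<>(
            Comparator.comparingInt((AdminJob j) -> j.priority).reversed()
                      .thenComparing(j -> j.name));
        for (AdminJob job : jobs) {
            if (job.dependsOn.isEmpty()) ready.add(job);
        }
        List<AdminJob> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            AdminJob next = ready.poll();
            order.add(next);
            for (AdminJob d : dependents.getOrDefault(next, Collections.<AdminJob>emptyList())) {
                if (pending.merge(d, -1, Integer::sum) == 0) ready.add(d);
            }
        }
        if (order.size() != jobs.size()) {
            throw new IllegalStateException("cyclic or unsubmitted dependency");
        }
        return order;
    }

    public static void main(String[] args) {
        AdminJob fetchA = new FetchAdminJob("fetchA", 1);
        AdminJob fetchB = new FetchAdminJob("fetchB", 3);
        AdminJob indexA = new IndexAdminJob("indexA", 5, fetchA);
        // indexA has the highest priority but must wait for fetchA.
        for (AdminJob job : schedule(Arrays.asList(indexA, fetchA, fetchB))) job.run();
    }
}
```

In this sketch a job with unmet dependencies never enters the ready queue, 
so priority only decides between jobs that could actually run at that moment.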

The features will include:
    - submitting jobs from code, the command line, or the web interface
    - tracking jobs from the command line or the web interface
    - scheduling jobs

I could send the code or details if anyone is interested in pretesting, 
and I would appreciate any comments and suggestions. I am planning to 
complete the patch and submit it to Jira ASAP.

Sami Siren wrote:
> Hello,
>
> It has been a while from a previous release (0.8.1) and looking at the
> great fixes done in trunk I'd start thinking about baking a new release
> soon.
>
> Looking at the jira roadmaps there are 1 blocking issues (fixing the
> license headers) for 0.8.2 and two other blocking issues for 0.9.0 of
> which I think NUTCH-233 is safe to put in.
>
> The top 10 voted issues are currently:
>
> NUTCH-61  	 Adaptive re-fetch interval. Detecting umodified content
> NUTCH-48 	"Did you mean" query enhancement/refignment feature
> NUTCH-251 	Administration GUI
> NUTCH-289 	CrawlDatum should store IP address
> NUTCH-36 	Chinese in Nutch
> NUTCH-185 	XMLParser is configurable xml parser plugin.
> NUTCH-59 	meta data support in webdb
> NUTCH-92 	DistributedSearch incorrectly scores results
> NUTCH-68 	A tool to generate arbitrary fetchlists
> NUTCH-87 	Efficient site-specific crawling for a large number of sites
>
> Are there any opinions about issues that should go in before the next
> release (Answering yes means that you are willing to provide a patch for
> it).
>
> --
>  Sami Siren
>
>   


Re: Next Nutch release

Posted by Thomas Müller <th...@gmx.net>.
Sami, thanks a lot.

I would like to see a feature where a link on a web page shows all already-indexed URLs.

That way other spiders can fetch this page and get the URLs that the open source Nutch already has to offer.


So we need to start having not only open source code for the machine, but also every node offering an open, downloadable database of URLs.

We also need a list of the URLs of other Nutch domains. With this list, each Nutch could crawl the URLs of the other Nutch installations that provide them on a website.

As millions of URLs are a lot, I suggest having 26 web pages, from a to z, each displaying all URLs for its letter; the page for "a" would also link to the other 25 pages, b to z.
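
One possible reading of the a-z suggestion, sketched in Java: group the 
known URLs into 26 buckets by the first letter of the host name, so that 
each bucket could back one listing page. Everything here is an assumption 
made for illustration (the class and method names, stripping a leading 
"www.", and skipping hosts that do not start with a letter); it is not an 
existing Nutch feature.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical helper, not part of Nutch.
class UrlLetterIndexSketch {

    /** Groups URLs into buckets 'a'..'z' by the first letter of their host name. */
    static SortedMap<Character, List<String>> bucketByFirstLetter(List<String> urls) {
        SortedMap<Character, List<String>> buckets = new TreeMap<>();
        for (char c = 'a'; c <= 'z'; c++) {
            buckets.put(c, new ArrayList<>());   // one (possibly empty) page per letter
        }
        for (String url : urls) {
            String host;
            try {
                host = URI.create(url).getHost();
            } catch (IllegalArgumentException e) {
                continue;                        // skip malformed URLs
            }
            if (host == null) continue;
            host = host.toLowerCase(Locale.ROOT);
            if (host.startsWith("www.")) host = host.substring(4);  // assumption: ignore "www."
            if (host.isEmpty()) continue;
            char first = host.charAt(0);
            if (first >= 'a' && first <= 'z') {
                buckets.get(first).add(url);     // hosts starting with digits are skipped here
            }
        }
        return buckets;
    }
}
```

Each of the 26 lists could then be rendered as one page that also links to 
the other 25 letters.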

Then several Nutch nodes could use a small p2p feature, and the sister project YaCy could also fetch the URLs from a central open source point: all Nutch domains.

Would it be possible to generate a web page link, somewhere on the Nutch home page of each individual server install, listing all URLs?

Open source has to be founded on solidarity, so please make the Nutch URL database open to other open source search engine spiders from central points.

Thanks

-------- Original Message --------
Date: Tue, 16 Jan 2007 17:53:41 +0200
From: Sami Siren <ss...@gmail.com>
To: nutch-dev@lucene.apache.org
Subject: Next Nutch release

> Hello,
> 
> It has been a while from a previous release (0.8.1) and looking at the
> great fixes done in trunk I'd start thinking about baking a new release
> soon.
> 
> Looking at the jira roadmaps there are 1 blocking issues (fixing the
> license headers) for 0.8.2 and two other blocking issues for 0.9.0 of
> which I think NUTCH-233 is safe to put in.
> 
> The top 10 voted issues are currently:
> 
> NUTCH-61  	 Adaptive re-fetch interval. Detecting umodified content
> NUTCH-48 	"Did you mean" query enhancement/refignment feature
> NUTCH-251 	Administration GUI
> NUTCH-289 	CrawlDatum should store IP address
> NUTCH-36 	Chinese in Nutch
> NUTCH-185 	XMLParser is configurable xml parser plugin.
> NUTCH-59 	meta data support in webdb
> NUTCH-92 	DistributedSearch incorrectly scores results
> NUTCH-68 	A tool to generate arbitrary fetchlists
> NUTCH-87 	Efficient site-specific crawling for a large number of sites
> 
> Are there any opinions about issues that should go in before the next
> release (Answering yes means that you are willing to provide a patch for
> it).
> 
> --
>  Sami Siren
