Posted to dev@nutch.apache.org by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov> on 2010/04/26 07:55:56 UTC

[VOTE] Apache Nutch 1.1 Release Candidate #2

Hi Folks,

I have posted an updated candidate for the Apache Nutch 1.1 release. The
source code is at:

http://people.apache.org/~mattmann/apache-nutch-1.1/rc2/

The major difference between this release and rc #1 is the application of
NUTCH-812 - Crawl.java incorrectly uses the Generator API resulting in NPE -
as well as some commits by Sami Siren to fix missing ASL license headers.

For more detailed information, see the included CHANGES.txt file for details
on release contents and latest changes. The release was made using the Nutch
release process, documented on the Wiki here:

http://bit.ly/d5ugid

A Nutch 1.1 tag is at:

http://svn.apache.org/repos/asf/lucene/nutch/tags/1.1/

<note>
There was a request by Sami Siren that the tutorial be updated to reflect
the fact that this release is a source-only release, as well as a request to
integrate RAT into the build, however, in the interest of getting this 1.1
out and getting going on the Nutch TLP, my proposal is:

* update the docs independent of this release (the tutorial as it exists
right now says 0.7 on it anyways and doesn't look like it's been updated in
a while, so I think users can live with what's there and support on
user@nutch.apache.org or dev@nutch.apache.org until it's updated)

* begin source only releases in general since we've long had the debate as
to the size of the Nutch release. Most folks that use Nutch are likely
familiar with running ant IMHO.

* run RAT and integrate into the build

</note>

Please vote on releasing these packages as Apache Nutch 1.1. The vote is
open for the next 72 hours.

Since Nutch is now a TLP and has its own PMC, there is a question of who are
the binding release VOTES in this particular thread. My gut reaction is that
since I started this release while we were under the Lucene PMC, for
continuity purposes, only votes from Lucene PMC are binding, but everyone
(especially newly minted Nutch PMC members!) is welcome to check the
release candidate and voice their approval or disapproval. The vote passes
if at least three binding +1 votes are cast.

[ ] +1 Release the packages as Apache Nutch 1.1.

[ ] -1 Do not release the packages because...

Thanks!

Cheers,
Chris

P.S. Here is my +1.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: Chris.Mattmann@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++




Re: [VOTE] Apache Nutch 1.1 Release Candidate #2

Posted by Bernd Fondermann <be...@googlemail.com>.
On Tue, Apr 27, 2010 at 04:15, Mattmann, Chris A (388J)
<ch...@jpl.nasa.gov> wrote:
> Hi Hoss,
>
>> : Thanks. I think it actually makes sense to finish off 1.1, and since
>> : there is overlap with the Nutch PMC and the Lucene PMC and since the
>> : thread started in Lucene before the TLP, I think it would be great e.g.,
>>
>> Except that once the Board officially passed the resolution creating the
>> Nutch TLP... "...all responsibilities pertaining to the Apache Lucene
>> Nutch sub-project encumbered upon the Apache Lucene Project are hereafter
>> discharged."
>>
>> The Lucene PMC can't hold a VOTE for Nutch releases as of that board
>> meeting.
>
> Interesting -- I wasn't so confident with my interpretation of it since:
>
> (a) I'm not a lawyer, and;
> (b) I can't find anything regarding ASF policy on existing VOTE threads
> started before TLP VOTEs and resolutions passed in that regard. My guess is
> that it's probably too specific to have an official policy written down for
> it, therefore it's open to interpretation.
>
> I'm still not convinced that it's 100% clear who in fact should be the
> binding VOTEs on something started in the Lucene-ville, but now existing in
> Nutch-ville,

If "something" equals "Nutch", it's the members of the Nutch PMC,
otherwise it's some other PMC.

> but in the interest of moving forward, that's why I just
> suggested those with overlap anyways (Andrzej, Sami, and myself) try and
> check out the release at the very least to see if we can move it forward.

It might be in Nutch's interest to first move to TLP, and only then
come out with a big 1.1-taa-daaa!

Just my 2 eurocents,

  Bernd

Re: [VOTE] Apache Nutch 1.1 Release Candidate #2

Posted by Andrzej Bialecki <ab...@getopt.org>.
On 2010-04-27 07:30, Mattmann, Chris A (388J) wrote:
> Ehrm, well OK, I technically started the 1.1 release VOTE (for rc #1)
> while in Lucene, so I thought that even voting on an rc#2 within that
> same 1.1 release counted as "same thread", but I suppose I'm just
> debating semantics.
> 
> In the end, it's probably simpler to be on the Nutch PMC anyways.
> We're in the process of moving - we'll continue that and then move
> forward with 1.1 rc #2 in Nutch-ville.

In the meantime we have discovered some issues with the rc2, so no
matter where I sit I would've given it -1 anyway ... so let's leave the
finer points of this situation to lawyers :) and proceed with the next
RC under the Nutch TLP.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: [VOTE] Apache Nutch 1.1 Release Candidate #2

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Ehrm, well OK, I technically started the 1.1 release VOTE (for rc #1) while in Lucene, so I thought that even voting on an rc#2 within that same 1.1 release counted as "same thread", but I suppose I'm just debating semantics.

In the end, it's probably simpler to be on the Nutch PMC anyways. We're in the process of moving - we'll continue that and then move forward with 1.1 rc #2 in Nutch-ville.

Thanks,
Chris


On 4/26/10 10:05 PM, "Chris Hostetter" <ho...@fucit.org> wrote:



: (b) I can't find anything regarding ASF policy on existing VOTE threads
: started before TLP VOTEs and resolutions passed in that regard. My guess is
: that it's probably too specific to have an official policy written down for
: it, therefore it's open to interpretation.

there is potential for some odd voting formalities if a vote is actively
in progress when a TLP switch happens, but the key thing to remember is
that when a VOTE is called it's over a specific set of artifacts.  it
doesn't matter how long the discussion about having a release has been
going on or how many release candidates were put up for a vote before,
what matters is that for this particular set of artifacts, the Nutch PMC
has the only binding votes, because the VOTE was called after the TLP was
created.

I do have to amend my earlier comments though, because i am now
remembering that strictly speaking: the Lucene PMC can vote to release any
arbitrary hunk of ASL licensed code as a release -- there are no formal
rules about the "products" just the "projects" ... so in theory we could
have a Lucene release of Nutch in the same way we could have a Lucene
release of Tomcat -- but that strikes me as being "uncool"

The fundamental point i'm trying to make is that the Nutch PMC does exist,
regardless of what domain names or mailing lists have or have not been
created, and all 7 Nutch Project Members have the power of binding votes
for artifacts the Nutch Project wants to release.



-Hoss






Re: [VOTE] Apache Nutch 1.1 Release Candidate #2

Posted by Chris Hostetter <ho...@fucit.org>.
: (b) I can't find anything regarding ASF policy on existing VOTE threads
: started before TLP VOTEs and resolutions passed in that regard. My guess is
: that it's probably too specific to have an official policy written down for
: it, therefore it's open to interpretation.

there is potential for some odd voting formalities if a vote is actively
in progress when a TLP switch happens, but the key thing to remember is
that when a VOTE is called it's over a specific set of artifacts.  it
doesn't matter how long the discussion about having a release has been
going on or how many release candidates were put up for a vote before,
what matters is that for this particular set of artifacts, the Nutch PMC
has the only binding votes, because the VOTE was called after the TLP was
created.

I do have to amend my earlier comments though, because i am now
remembering that strictly speaking: the Lucene PMC can vote to release any
arbitrary hunk of ASL licensed code as a release -- there are no formal
rules about the "products" just the "projects" ... so in theory we could
have a Lucene release of Nutch in the same way we could have a Lucene
release of Tomcat -- but that strikes me as being "uncool"

The fundamental point i'm trying to make is that the Nutch PMC does exist,
regardless of what domain names or mailing lists have or have not been
created, and all 7 Nutch Project Members have the power of binding votes
for artifacts the Nutch Project wants to release.



-Hoss


Re: [VOTE] Apache Nutch 1.1 Release Candidate #2

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hi Hoss,

> : Thanks. I think it actually makes sense to finish off 1.1, and since
> : there is overlap with the Nutch PMC and the Lucene PMC and since the
> : thread started in Lucene before the TLP, I think it would be great e.g.,
> 
> Except that once the Board officially passed the resolution creating the
> Nutch TLP... "...all responsibilities pertaining to the Apache Lucene
> Nutch sub-project encumbered upon the Apache Lucene Project are hereafter
> discharged."
> 
> The Lucene PMC can't hold a VOTE for Nutch releases as of that board
> meeting.

Interesting -- I wasn't so confident with my interpretation of it since:

(a) I'm not a lawyer, and;
(b) I can't find anything regarding ASF policy on existing VOTE threads
started before TLP VOTEs and resolutions passed in that regard. My guess is
that it's probably too specific to have an official policy written down for
it, therefore it's open to interpretation.

I'm still not convinced that it's 100% clear who in fact should be the
binding VOTEs on something started in the Lucene-ville, but now existing in
Nutch-ville, but in the interest of moving forward, that's why I just
suggested those with overlap anyways (Andrzej, Sami, and myself) try and
check out the release at the very least to see if we can move it forward.

Cheers,
Chris




Re: [VOTE] Apache Nutch 1.1 Release Candidate #2

Posted by Chris Hostetter <ho...@fucit.org>.
: Thanks. I think it actually makes sense to finish off 1.1, and since 
: there is overlap with the Nutch PMC and the Lucene PMC and since the 
: thread started in Lucene before the TLP, I think it would be great e.g., 

Except that once the Board officially passed the resolution creating the
Nutch TLP... "...all responsibilities pertaining to the Apache Lucene
Nutch sub-project encumbered upon the Apache Lucene Project are hereafter
discharged."

The Lucene PMC can't hold a VOTE for Nutch releases as of that board
meeting.

Which is not to say that the Nutch release VOTE needs to be put on hold
until all the Nutch TLP infra changes are made -- it just means that the
people on the new Nutch PMC are the only people with binding votes
(regardless of whether those votes are cast on general@lucene.a.o, or
nutch-dev@lucene.a.o, or some new @nutch.a.o list)



-Hoss


Re: [VOTE] Apache Nutch 1.1 Release Candidate #2

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hey Andrzej,

Okey dokey, np! Let's get the patch in first :) I can cut as many RCs as needed.

Cheers,
Chris

On 4/26/10 11:30 AM, "Andrzej Bialecki" <ab...@getopt.org> wrote:

On 2010-04-26 17:19, Mattmann, Chris A (388J) wrote:
> Hi Grant,
>
> Thanks. I think it actually makes sense to finish off 1.1, and since there is overlap with the Nutch PMC and the Lucene PMC and since the thread started in Lucene before the TLP, I think it would be great e.g., if Andrzej, and Sami could check the release and that way we still have the continuity and can safely push it out as the last Nutch rel under the Lucene umbrella...
>
> Then all releases post 1.1 can cleanly be done under the auspices of the new PMC :)

I know that Dennis Kubes just discovered a bug in SegmentMerger (he may
report on it in a moment) - this bug has been there for a while, it's
likely the cause of the mysterious "out of disk space" errors, and it
manifests itself only with input files larger than HDFS block size
(64MB). Since 1.1 is likely the final release of Nutch 1.x I think it
would make sense to fix this bug before we release ...

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com






Re: [VOTE] Apache Nutch 1.1 Release Candidate #2

Posted by Andrzej Bialecki <ab...@getopt.org>.
On 2010-04-26 17:19, Mattmann, Chris A (388J) wrote:
> Hi Grant,
> 
> Thanks. I think it actually makes sense to finish off 1.1, and since there is overlap with the Nutch PMC and the Lucene PMC and since the thread started in Lucene before the TLP, I think it would be great e.g., if Andrzej, and Sami could check the release and that way we still have the continuity and can safely push it out as the last Nutch rel under the Lucene umbrella...
> 
> Then all releases post 1.1 can cleanly be done under the auspices of the new PMC :)

I know that Dennis Kubes just discovered a bug in SegmentMerger (he may
report on it in a moment) - this bug has been there for a while, it's
likely the cause of the mysterious "out of disk space" errors, and it
manifests itself only with input files larger than HDFS block size
(64MB). Since 1.1 is likely the final release of Nutch 1.x I think it
would make sense to fix this bug before we release ...

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: [VOTE] Apache Nutch 1.1 Release Candidate #2

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hi Grant,

Thanks. I think it actually makes sense to finish off 1.1, and since there is overlap with the Nutch PMC and the Lucene PMC and since the thread started in Lucene before the TLP, I think it would be great e.g., if Andrzej, and Sami could check the release and that way we still have the continuity and can safely push it out as the last Nutch rel under the Lucene umbrella...

Then all releases post 1.1 can cleanly be done under the auspices of the new PMC :)

Cheers,
Chris


On 4/26/10 5:34 AM, "Grant Ingersoll" <gs...@apache.org> wrote:

Might I suggest, that since Nutch is now a TLP that you delay this release by a few weeks and have the vote done under the auspices of the Nutch PMC?

Cheers,
Grant

On Apr 26, 2010, at 1:55 AM, Mattmann, Chris A (388J) wrote:

> Hi Folks,
>
> I have posted an updated candidate for the Apache Nutch 1.1 release. The
> source code is at:
>
> http://people.apache.org/~mattmann/apache-nutch-1.1/rc2/
>
> The major difference between this release and rc #1 is the application of
> NUTCH-812 - Crawl.java incorrectly uses the Generator API resulting in NPE -
> as well as some commits by Sami Siren to fix missing ASL license headers.
>
> For more detailed information, see the included CHANGES.txt file for details
> on release contents and latest changes. The release was made using the Nutch
> release process, documented on the Wiki here:
>
> http://bit.ly/d5ugid
>
> A Nutch 1.1 tag is at:
>
> http://svn.apache.org/repos/asf/lucene/nutch/tags/1.1/
>
> <note>
> There was a request by Sami Siren that the tutorial be updated to reflect
> the fact that this release is a source-only release, as well as a request to
> integrate RAT into the build, however, in the interest of getting this 1.1
> out and getting going on the Nutch TLP, my proposal is:
>
> * update the docs independent of this release (the tutorial as it exists
> right now says 0.7 on it anyways and doesn't look like it's been updated in
> a while, so I think users can live with what's there and support on
> user@nutch.apache.org or dev@nutch.apache.org until it's updated)
>
> * begin source only releases in general since we've long had the debate as
> to the size of the Nutch release. Most folks that use Nutch are likely
> familiar with running ant IMHO.
>
> * run RAT and integrate into the build
>
> </note>
>
> Please vote on releasing these packages as Apache Nutch 1.1. The vote is
> open for the next 72 hours.
>
> Since Nutch is now a TLP and has its own PMC, there is a question of who are
> the binding release VOTES in this particular thread. My gut reaction is that
> since I started this release while we were under the Lucene PMC, for
> continuity purposes, only votes from Lucene PMC are binding, but everyone
> (especially newly minted Nutch PMC members!) is welcome to check the
> release candidate and voice their approval or disapproval. The vote passes
> if at least three binding +1 votes are cast.
>
> [ ] +1 Release the packages as Apache Nutch 1.1.
>
> [ ] -1 Do not release the packages because...
>
> Thanks!
>
> Cheers,
> Chris
>
> P.S. Here is my +1.
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: Chris.Mattmann@jpl.nasa.gov
> WWW:   http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>
>







Re: [VOTE] Apache Nutch 1.1 Release Candidate #2

Posted by Grant Ingersoll <gs...@apache.org>.
Might I suggest, that since Nutch is now a TLP that you delay this release by a few weeks and have the vote done under the auspices of the Nutch PMC?

Cheers,
Grant

On Apr 26, 2010, at 1:55 AM, Mattmann, Chris A (388J) wrote:

> Hi Folks,
> 
> I have posted an updated candidate for the Apache Nutch 1.1 release. The
> source code is at:
> 
> http://people.apache.org/~mattmann/apache-nutch-1.1/rc2/
> 
> The major difference between this release and rc #1 is the application of
> NUTCH-812 - Crawl.java incorrectly uses the Generator API resulting in NPE -
> as well as some commits by Sami Siren to fix missing ASL license headers.
> 
> For more detailed information, see the included CHANGES.txt file for details
> on release contents and latest changes. The release was made using the Nutch
> release process, documented on the Wiki here:
> 
> http://bit.ly/d5ugid
> 
> A Nutch 1.1 tag is at:
> 
> http://svn.apache.org/repos/asf/lucene/nutch/tags/1.1/
> 
> <note>
> There was a request by Sami Siren that the tutorial be updated to reflect
> the fact that this release is a source-only release, as well as a request to
> integrate RAT into the build, however, in the interest of getting this 1.1
> out and getting going on the Nutch TLP, my proposal is:
> 
> * update the docs independent of this release (the tutorial as it exists
> right now says 0.7 on it anyways and doesn't look like it's been updated in
> a while, so I think users can live with what's there and support on
> user@nutch.apache.org or dev@nutch.apache.org until it's updated)
> 
> * begin source only releases in general since we've long had the debate as
> to the size of the Nutch release. Most folks that use Nutch are likely
> familiar with running ant IMHO.
> 
> * run RAT and integrate into the build
> 
> </note>
> 
> Please vote on releasing these packages as Apache Nutch 1.1. The vote is
> open for the next 72 hours.
> 
> Since Nutch is now a TLP and has its own PMC, there is a question of who are
> the binding release VOTES in this particular thread. My gut reaction is that
> since I started this release while we were under the Lucene PMC, for
> continuity purposes, only votes from Lucene PMC are binding, but everyone
> (especially newly minted Nutch PMC members!) is welcome to check the
> release candidate and voice their approval or disapproval. The vote passes
> if at least three binding +1 votes are cast.
> 
> [ ] +1 Release the packages as Apache Nutch 1.1.
> 
> [ ] -1 Do not release the packages because...
> 
> Thanks!
> 
> Cheers,
> Chris
> 
> P.S. Here is my +1.
> 
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: Chris.Mattmann@jpl.nasa.gov
> WWW:   http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 
> 
> 



Re: Running ANT; was -- Re: [VOTE] Apache Nutch 1.1 Release Candidate #2

Posted by Andrzej Bialecki <ab...@getopt.org>.
On 2010-04-26 17:30, Mattmann, Chris A (388J) wrote:
> Hey Andrzej,
> 
>> Actually, we don't have a build target (yet) that produces a binary-only
>> distribution that we can ship and which you can run out of the box (not
>> counting the build/nutch.job alone, because it needs the Hadoop
>> infrastructure to run).
> 
> I thought ant tar did this? That's what it sez on the release guide [1] and
> what I'm familiar with when I did the Nutch 0.9 release.

ant tar packs everything, i.e. both source and binaries.


-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Running ANT; was -- Re: [VOTE] Apache Nutch 1.1 Release Candidate #2

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hey Andrzej,

> Actually, we don't have a build target (yet) that produces a binary-only
> distribution that we can ship and which you can run out of the box (not
> counting the build/nutch.job alone, because it needs the Hadoop
> infrastructure to run).

I thought ant tar did this? That's what it sez on the release guide [1] and
what I'm familiar with when I did the Nutch 0.9 release.

> 
> The current mixed (source+binary) distribution worked well enough so
> far, but the size of the distribution is becoming a concern, hence the
> idea to ship only the source. We may have been too hasty with that,
> though... What do others think?

Good question, Andrzej. I'll wait for feedback from others. My pref is for
source-only, but I might be in the minority. :)

Cheers,
Chris

[1] http://wiki.apache.org/nutch/Release_HOWTO

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: Chris.Mattmann@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++



Re: Running ANT; was -- Re: [VOTE] Apache Nutch 1.1 Release Candidate #2

Posted by Andrzej Bialecki <ab...@getopt.org>.
On 2010-04-26 16:24, David M. Cole wrote:
> At 10:55 PM -0700 4/25/10, Mattmann, Chris A (388J) wrote:
>> Most folks that use Nutch are likely
>> familiar with running ant IMHO.
> 
> I guess then I fall into the category of "not most folks." Have been
> running Nutch for about 14 months and I haven't a clue how to run ant.
> 
> If there's a place to vote to suggest that compiled versions still be
> distributed, I vote for that.

Actually, we don't have a build target (yet) that produces a binary-only
distribution that we can ship and which you can run out of the box (not
counting the build/nutch.job alone, because it needs the Hadoop
infrastructure to run).

The current mixed (source+binary) distribution worked well enough so
far, but the size of the distribution is becoming a concern, hence the
idea to ship only the source. We may have been too hasty with that,
though... What do others think?

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Running ANT; was -- Re: [VOTE] Apache Nutch 1.1 Release Candidate #2

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hi David,

Thanks. In fact, running ant is probably simpler than running Nutch. The steps would be:


 *   what OS are you on (Ant is available for all of them to my knowledge)?
 *   if you need ant, grab a distro from ant.apache.org, otherwise, I'll assume that you've got ant installed and callable from the command line.
 *   unpack the nutch src distribution, cd into that directory, type "ant job", and there you go.

HTH! You could try it out by taking the Nutch src code from SVN at: http://svn.apache.org/repos/asf/lucene/nutch/tags/1.1, and then trying the steps above.
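
As a rough sketch, the steps above could be scripted like this. The tag URL and the "ant job" target are the ones given in this message; the directory name nutch-1.1, the DRY_RUN switch and the helper function names are assumptions made for illustration. By default the script only prints what it would run.

```shell
#!/bin/sh
# Sketch only: DRY_RUN, WORK_DIR and the helper names are illustrative
# assumptions; the tag URL and the "ant job" target come from the message above.
NUTCH_TAG_URL="http://svn.apache.org/repos/asf/lucene/nutch/tags/1.1"
WORK_DIR="${WORK_DIR:-nutch-1.1}"
DRY_RUN="${DRY_RUN:-1}"

# Succeeds when the named tool is callable from the command line.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Run a command inside a directory, or just print it when DRY_RUN=1.
run_in() {
    dir=$1
    shift
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ (cd $dir && $*)"
    else
        (cd "$dir" && "$@")
    fi
}

build_nutch_job() {
    if ! have_tool ant; then
        echo "ant not found; grab a distribution from ant.apache.org" >&2
    fi
    # Fetch the 1.1 tag into WORK_DIR, then build the job file there.
    run_in . svn checkout "$NUTCH_TAG_URL" "$WORK_DIR"
    run_in "$WORK_DIR" ant job
}

build_nutch_job
```

Setting DRY_RUN=0 in the environment makes the script execute the commands instead of printing them.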

Cheers,
Chris


On 4/26/10 7:24 AM, "David M. Cole" <dm...@colegroup.com> wrote:

At 10:55 PM -0700 4/25/10, Mattmann, Chris A (388J) wrote:
>Most folks that use Nutch are likely
>familiar with running ant IMHO.

I guess then I fall into the category of "not most folks." Have been
running Nutch for about 14 months and I haven't a clue how to run ant.

If there's a place to vote to suggest that compiled versions still be
distributed, I vote for that.

Thanks.

\dmc

--
*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+
    David M. Cole                                            dmc@colegroup.com
    Editor & Publisher, NewsInc. <http://newsinc.net>        V: (650) 557-2993
    Consultant: The Cole Group <http://colegroup.com/>       F: (650) 475-8479
*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+



++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: Chris.Mattmann@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


Running ANT; was -- Re: [VOTE] Apache Nutch 1.1 Release Candidate #2

Posted by "David M. Cole" <dm...@colegroup.com>.
At 10:55 PM -0700 4/25/10, Mattmann, Chris A (388J) wrote:
>Most folks that use Nutch are likely
>familiar with running ant IMHO.

I guess then I fall into the category of "not most folks." Have been 
running Nutch for about 14 months and I haven't a clue how to run ant.

If there's a place to vote to suggest that compiled versions still be 
distributed, I vote for that.

Thanks.

\dmc

-- 
*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+
    David M. Cole                                            dmc@colegroup.com
    Editor & Publisher, NewsInc. <http://newsinc.net>        V: (650) 557-2993
    Consultant: The Cole Group <http://colegroup.com/>       F: (650) 475-8479
*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+

Re: [VOTE] Apache Nutch 1.1 Release Candidate #2

Posted by Phil Barnett <ph...@philb.us>.
On Sat, May 1, 2010 at 2:34 AM, Mattmann, Chris A (388J) <
chris.a.mattmann@jpl.nasa.gov> wrote:

>
> Sure, hopefully you'll find the answer you're looking for. In the meanwhile,
> it's my job to keep cutting release candidates as the RM, that at least pass
> the basic criteria for release and right now that involves what I mentioned
> above.
>

I hope to be able to get back onto my Nutch project when I get back to work
next Thursday.

Until then, it appears that someone else has reported the same behaviour
that I am experiencing.

Re: [VOTE] Apache Nutch 1.1 Release Candidate #2

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hi Phil,

Thanks for your comments. Mine below:

>> Unfortunately some parts of the documentation on Nutch (namely the tutorial,
>> and other parts of the static site) have been out of date for a while. This
>> has occurred really independent of the releases, and independent of the wiki
>> [1], which hasn't really fallen out of date as quickly.
>> 
> 
> While documentation may not be part of the code, it's certainly part of the
> project. And it's just as important as the code. Yes, I know that
> documentation is the bane of programmers everywhere. I'm a coder. I get it.
> But when you change the way things work in a fundamental way that leaves all
> of your documentation behind, it's time to spend some time on it.

Sure. So, what fundamental way has Nutch changed from 1.0 to 1.1? Can you
elaborate? Also, in terms of spending time on Nutch's documentation, I'll
try to as I get more time (as I'm sure other committers will as well), but
I'd also say: if there's something to be improved, by all means, go for it,
and patches welcome to contribute it back.

> 
> 
>>> 
>>> For example, my find of broken code in bin/nutch crawl, a most basic way of
>>> getting it running.
>> 
>> Can you elaborate on your find of broken code? Did you file a JIRA issue for
>> this in the Nutch JIRA system [2] ?
>> 
> 
> Yes, it led to another release. The bug fix I contributed was incorporated.

Great!

>> 
>> The more information you provide here about your environment and your
>> situation that caused the error, as well as e.g., detailed information (a
>> stack trace, an exception, something), the easier it is to track down what
>> you're seeing.
>> 
> 
> Yes, that was all in the unanswered emails. It would be easier for you to
> search your inbox than for me to send it all over again.

I wouldn't assume that the inboxes of folks watching the list are always
centered on the Nutch mailing lists. Realize that many of us are subscribed
to several mailing lists, and sometimes, emails go unanswered for a while.

> 
>> That said, one thing to realize is that this is open source software, so in
>> the end, as they say in Apache, "those that do, decide", or "patches
>> welcome!" In other words, if there are things that you see that could be
>> fixed, improved, made more configurable, etc., including the code, but *also
>> the documentation*, then by all means we'd appreciate your feedback and
>> contribution. Nutch is not simply a product of the developers that
>> contribute their (potentially and often unsalaried) time to work on it, but
>> of its user community as well.
>> 
> 
> I've been the leader of a major open source project for over 10 years. Last
> fall I relinquished the reins of that project to a new project leader, so I
> think I know how it works. We wrote an open source cross platform compiler
> for xBase (Clipper) code named Harbour Project, now in release 2.0.
> 
> That would be why I not only raised the flag that it's not ready to release,
> but I tracked down a bug and submitted a bug fix.
> 
> And I'm still saying it's not ready to release. There's still another bug
> that I have found that goes unanswered.

Right, so then you know that bugs aren't just "bugs" -- they must come with
a priority. There are several categories, "High", "Medium", "Critical", or
"Blocker", just to name a few.

When I cut a release as the Release Manager (RM), I always run unit tests
and try and at least run a basic crawl first before cutting the RC. So,
hopefully that catches anything that would be a big problem, but sometimes
even that process breaks down since not everyone has e.g., large scale
deployments, or maybe we're missing a unit test we need, etc. I'd say at ~10
releases of Nutch to date, and many many features, etc., we have fairly
decent regression.
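
The pre-RC routine described above (unit tests, then a basic crawl) could be scripted roughly as follows. This is a sketch under stated assumptions, not the actual release process: it assumes "ant test" and "bin/nutch crawl" are run from the top of a built Nutch source tree, and the seed directory "urls", the output directory "crawl-smoke", the depth and topN values, and the helper names are all hypothetical.

```shell
#!/bin/sh
# Hypothetical pre-RC smoke check; directory names and crawl limits are
# illustrative assumptions, not part of the documented release process.
CRAWL_DIR="${CRAWL_DIR:-crawl-smoke}"
SEED_DIR="${SEED_DIR:-urls}"

# Succeeds when the crawl output holds at least one segment directory.
has_segments() {
    for seg in "$1"/segments/*/; do
        if [ -d "$seg" ]; then
            return 0
        fi
    done
    return 1
}

smoke_test() {
    # Run the unit tests first, then a small crawl from the seed list.
    ant test || return 1
    bin/nutch crawl "$SEED_DIR" -dir "$CRAWL_DIR" -depth 2 -topN 10 || return 1
    if ! has_segments "$CRAWL_DIR"; then
        echo "crawl produced no segments" >&2
        return 1
    fi
    echo "smoke test passed"
}

# Run only when asked, so the file can also be sourced for its functions.
if [ "${RUN_SMOKE:-0}" = "1" ]; then
    smoke_test
fi
```

Checking for a segment directory is a deliberately weak test; it only catches a crawl that fetched nothing at all.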

>> 
>> In certain cases you are right, but I wouldn't take your above comments as
>> verbatim across the board. For example, if you believe there is
>> documentation lacking, then the first step is typically to file JIRA issues
>> to alert committers and other users of Nutch of your concern and then have
>> discussion on the lists regarding the issues. At some point a patch is
>> produced, and then attached to the issue, where the committers can review
>> the patches and then work to get them committed to the code base.
>> 
>> Nutch has a number of unit tests for regression that ship with the product
>> that tell me that it's not broken, and users that are able to make it work
>> in their environments. There have been some recent bug fixes in the 1.1 RC
>> that we caught which have been fixed (NUTCH-812, NUTCH-814, etc.), but
>> that's natural.
>> 
> 
> No, not we. Me. I found a bug, told you about it and provided the fix.
> Before I did that, I told you that your release candidate was broken. Just
> like I'm still saying, unless I'm doing something grossly wrong, it's still
> broken.

Right, gotcha. I didn't map that you had been the guy that contributed the
patch. Thanks for that.

>> 
>> Good question. I'm not super familiar with the nightly tests, but my guess
>> is that the scripts are outside the context of the tests since most of the
>> tests use Junit and are testing the Java API and classes. I may be wrong
>> though.
>> 
> 
> Then that means that you need more unit and process tests that are run
> before a release candidate. If the nightly build tests are this weak, you
> can't depend on them to tell you all you need to know. It would keep you
> from creating a release candidate that was plainly broken in a most
> fundamental way.

Hmmm...I'm not sure anything about Nutch is weak, and that's really a
subjective/qualitative judgment. If you have ideas about how to improve the
tests, we'd welcome them. Until then, the 100+ tests that exist are fairly
decent, at least in my experience using Nutch. Furthermore, in my software
development experience, I've never seen 100% coverage on tests -- it simply
doesn't work that way.

>> Ready in the sense of the release is a consensus decision made by the
>> developers and community based on a variety of things:
>> 
>> * issues being resolved in JIRA of a particular priority
>> * time in-between last release
>> * community requesting a release
>> * according to some pre-defined schedule
>> * making a feature release to get out new interesting features
>> etc etc.
>> 
> 
> Most of the above are Marketing issues, not release issues, but I'm not on
> the staff here, so I won't critique. You have your priorities, that's good
> enough for me.

Marketing issues? Huh? They are in fact release criteria, in just about
every software development job I've worked within.

> 
> One of the pleasures of Open Source is that there is no marketing department
> forcing you to release a product that is not yet ready. We've all lived with
> products like that. In the short run it's not fun. And in the long run it
> will give you a bad reputation.

That's probably why the Nutch 1.1 RC hasn't turned into the Nutch 1.1
release. We work with the community during the release process, just like we
do during development.

> I have found at least two bugs, one of them I tracked down and fixed and
> submitted code. The other I don't even know where to start the hunt and that
> is what led me to post some questions here.
> 
> I'd appreciate it if someone knowledgeable would look at those questions
> from last week and give me some feedback.
> 

Sure, hopefully you'll find the answer you're looking for. In the meanwhile,
it's my job to keep cutting release candidates as the RM, that at least pass
the basic criteria for release and right now that involves what I mentioned
above.

Cheers,
Chris


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: Chris.Mattmann@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++



Re: [VOTE] Apache Nutch 1.1 Release Candidate #2

Posted by Phil Barnett <ph...@philb.us>.
On Wed, Apr 28, 2010 at 11:01 AM, Mattmann, Chris A (388J) <
chris.a.mattmann@jpl.nasa.gov> wrote:

>
> Unfortunately some parts of the documentation on Nutch (namely the tutorial,
> and other parts of the static site) have been out of date for a while. This
> has occurred really independent of the releases, and independent of the wiki
> [1], which hasn't really fallen out of date as quickly.
>

While documentation may not be part of the code, it's certainly part of the
project. And it's just as important as the code. Yes, I know that
documentation is the bane of programmers everywhere. I'm a coder. I get it.
But when you change the way things work in a fundamental way that leaves all
of your documentation behind, it's time to spend some time on it.


> >
> > For example, my find of broken code in bin/nutch crawl, a most basic way of
> > getting it running.
>
> Can you elaborate on your find of broken code? Did you file a JIRA issue for
> this in the Nutch JIRA system [2] ?
>

Yes, it led to another release. The bug fix I contributed was incorporated.

> > And I have yet to get the deepcrawl script which seems to be the suggestion
> > of how to get beyond bin/nutch crawl. It doesn't return any data at all and
> > has an error in the middle of its run regarding a missing file which the last
> > stage apparently failed to write. (I believe because the scheduler excluded
> > everything)
>
> The more information you provide here about your environment and your
> situation that caused the error, as well as e.g., detailed information (a
> stack trace, an exception, something), the easier it is to track down what
> you're seeing.
>

Yes, that was all in the unanswered emails. It would be easier for you to
search your inbox than for me to send it all over again.


> > I wonder if the developers have advanced so far past these basic scripts as
> > to have pretty much left them behind. This leads to these basics that people
> > start with not working.
>
> I wouldn't say developers have advanced beyond anything really for that
> matter :) The number of active developers in Nutch these days is fairly
> small, but interest and the user community is stable and there are some
> pretty large scale deployments of Nutch to my knowledge. That said, those
> folks have been following the mailing lists for a while, have been using the
> software for a while and thus their level of entry into the documentation
> may be at a little higher bar than that of a newer user such as yourself.
>

bin/nutch crawl was plainly broken and it would never have worked for anyone
who tried it. 'nuff said.


> That said, one thing to realize is that this is open source software, so in
> the end, as they say in Apache, "those that do, decide", or "patches
> welcome!" In other words, if there are things that you see that could be
> fixed, improved, made more configurable, etc., including the code, but *also
> the documentation*, then by all means we'd appreciate your feedback and
> contribution. Nutch is not simply a product of the developers that
> contribute their (potentially and often unsalaried) time to work on it, but
> of its user community as well.
>

I've been the leader of a major open source project for over 10 years. Last
fall I relinquished the reins of that project to a new project leader, so I
think I know how it works. We wrote an open source cross platform compiler
for xBase (Clipper) code named Harbour Project, now in release 2.0.

That would be why I not only raised the flag that it's not ready to release,
but I tracked down a bug and submitted a bug fix.

And I'm still saying it's not ready to release. There's still another bug
that I have found that goes unanswered.


> > I've spent dozens of hours trying to get 1.1 to work anything like 1.0 and
> > I'm getting nowhere at all. It's pretty frustrating to spend that much time
> > trying to figure out how it works and keep hitting walls. And then asking
> > basic questions here that go unanswered.
>
> I apologize that your questions have gone unanswered and that you're hitting
> walls with regards to using Nutch. What questions did you ask? Perhaps it's
> the detail that you are providing (or not providing), or perhaps it's the
> way you're asking the questions. Or (even more likely) it's the fact that
> this is an open source project and thus the committers get around to user
> email lists as one of the multiple items on their plate that they are
> working on the project and us committers may have missed your question, or
> perhaps those that had the time weren't particular experts in the one area
> of Nutch that you were asking about. There could be a number of reasons.
> Regardless, persistence is key as is *patience* and respectfulness. This has
> always to my knowledge been a really friendly community, so if you hang
> around and keep asking questions they will get answered I'm confident of that.
>

Great. Now that it's out in the open, perhaps someone who does know about
the things I asked about can reply to my questions.


> > The view from the outside is not so good from my direction. If you don't
> > keep documentation up to date and you change the way things work, the
> > project as seen from the outside, is plainly broken.
>
> In certain cases you are right, but I wouldn't take your above comments as
> verbatim across the board. For example, if you believe there is
> documentation lacking, then the first step is typically to file JIRA issues
> to alert committers and other users of Nutch of your concern and then have
> discussion on the lists regarding the issues. At some point a patch is
> produced, and then attached to the issue, where the committers can review
> the patches and then work to get them committed to the code base.
>
> Nutch has a number of unit tests for regression that ship with the product
> that tell me that it's not broken, and users that are able to make it work
> in their environments. There have been some recent bug fixes in the 1.1 RC
> that we caught which have been fixed (NUTCH-812, NUTCH-814, etc.), but
> that's natural.
>

No, not we. Me. I found a bug, told you about it and provided the fix.
Before I did that, I told you that your release candidate was broken. Just
like I'm still saying, unless I'm doing something grossly wrong, it's still
broken.


> > I'd be happy to give you feedback on where I find these problems and I'll
> > even donate whatever fixes I can come up with, but Java is not a language
> > I'm familiar with and going is slow weeding through things. I really need
> > this project to work for me. I want to help.
>
> There are other ways to contribute to the project besides coding - I just
> thought of a really good reference that you can read in this regard put
> together by Dennis Kubes, one of the Nutch committers and PMC members. Check
> this out [3]. You may also want to check out our FAQ [4].
>

Yes, I've read the faq. I've searched for answers in the documentation for
weeks. Once I did that and came to a dead end, I asked questions here.

> > 1. Where is the scheduler documented? If I want to crawl everything from
> > scratch, where is the information from the last run stored? It seems like
> > the schedule is telling my crawl to ignore pages due to scheduler knocking
> > them out. It's not obvious to me why this is happening and how to stop it
> > from happening. I think right now this is my major roadblock in getting
> > bin/nutch crawl working. Maybe the scheduler code no longer works properly
> > in bin/nutch crawl. I can't tell if it's that or if the default
> > configurations don't work.
>
> I think you might be talking about the Fetcher: there is documentation of it
> here:
>
> http://bit.ly/alqFoA
> http://wiki.apache.org/nutch/FetchOptions
> http://wiki.apache.org/nutch/CommandLineOptions
>
>
I'm talking about the part of the fetcher that keeps it from fetching the
same document within a specific time frame.
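
To make the behaviour in question concrete, here is a generic sketch of a re-fetch interval check. It is an illustration only and not Nutch's implementation: the function name is_due, the 30-day interval and the example timestamps are all assumptions. The idea is simply that a page is skipped until a set interval has passed since it was last fetched.

```shell
#!/bin/sh
# Illustration only, not Nutch's implementation: a page is "due" again once
# at least interval_secs have passed since its last fetch.
# Usage: is_due LAST_FETCH_EPOCH INTERVAL_SECS NOW_EPOCH
is_due() {
    last_fetch=$1
    interval_secs=$2
    now=$3
    [ $((now - last_fetch)) -ge "$interval_secs" ]
}

# Example: 30-day interval (an assumed value), page fetched 10 days ago.
interval=$((30 * 24 * 60 * 60))
now=$(date +%s)
fetched=$((now - 10 * 24 * 60 * 60))

if is_due "$fetched" "$interval" "$now"; then
    echo "due: would be selected for fetching"
else
    echo "not due: skipped until the interval elapses"
fi
```

Under a rule like this, a second crawl run shortly after the first would select nothing from the pages it already knows, which matches the symptom described.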


> > 2. Where are the control files in conf documented? How do I know which ones
> > do what and when? There's a half dozen *-urlfilters. Why?
>
> Some of these are admittedly newer features but others are not:
>
> http://wiki.apache.org/nutch/RegexURLFiltersBenchs
> http://bit.ly/b99NLK
>
> >
> > 3. Why doesn't your post nightly compile tests include bin/nutch crawl or if
> > it does, why didn't it find the error that stopped it from running?
>
> Good question. I'm not super familiar with the nightly tests, but my guess
> is that the scripts are outside the context of the tests since most of the
> tests use Junit and are testing the Java API and classes. I may be wrong
> though.
>

Then that means that you need more unit and process tests that are run
before a release candidate. If the nightly build tests are this weak, you
can't depend on them to tell you all you need to know. It would keep you
from creating a release candidate that was plainly broken in a most
fundamental way.

> > 4. Where is the documentation on how to configure the new tika parser in
> > your environment? I see that the old parsers have been removed by default,
> > but there's nothing that shows me how to include/exclude document types.
>
> Julien Nioche put this together on the TikaPlugin:
>
> http://wiki.apache.org/nutch/TikaPlugin
>

Great, thanks. I'll try to get back into my studies of how 1.1 works as I
can. Work is very busy and full of demands. For now, I've been dodging
questions about Nutch so I can try to understand it better. But the few very
pointed questions that I asked here last week were not answered, so I started
working on another project. I believe I'm a pretty good communicator and I
think I asked complete questions that were not vague.


> > I believe your assessment of 'ready' is not inclusive of some very important
> > things and that you would be doing a service to newcomers to bring
> > documentation in line with current offerings. This is not trivial code and
> > it takes a long time for someone from the outside to understand it. That
> > process is being stifled on multiple fronts as far as I can see. Either that
> > or I have missed an important document that exists and I haven't read it.
>
> Ready in the sense of the release is a consensus decision made by the
> developers and community based on a variety of things:
>
> * issues being resolved in JIRA of a particular priority
> * time in-between last release
> * community requesting a release
> * according to some pre-defined schedule
> * making a feature release to get out new interesting features
> etc etc.
>

Most of the above are Marketing issues, not release issues, but I'm not on
the staff here, so I won't critique. You have your priorities, that's good
enough for me.

One of the pleasures of Open Source is that there is no marketing department
forcing you to release a product that is not yet ready. We've all lived with
products like that. In the short run it's not fun. And in the long run it
will give you a bad reputation.


> I'm sorry that you are experiencing problems, and our goal is to try and
> address as many as possible and prioritize them, but in the end, Apache has
> a process regarding releases [5], which is based somewhat on input from the
> community (usually in the form of simple majority), but ultimately based on
> a Project Management Committee [6] structure, whose votes are binding on a
> particular release.
>
> I hope that we can work with you to continue to use Nutch and make it useful
> in your environment, but in the meanwhile, I would suggest you keep plugging
> along, continue to push forward and check out some of the references I
> included in this email moving forward.
>

My first, second and third attempt at getting 1.1 working were to duplicate
what I did with 1.0 to get it working. Even going so far as to dump the
entire directory and start over multiple times in hopes that I just did
something wrong.

I have found at least two bugs, one of them I tracked down and fixed and
submitted code. The other I don't even know where to start the hunt and that
is what led me to post some questions here.

I'd appreciate it if someone knowledgeable would look at those questions
from last week and give me some feedback.

Re: [VOTE] Apache Nutch 1.1 Release Candidate #2

Posted by Phil Barnett <ph...@philb.us>.
Oh yeah, I built a presentation and gave it to our local Linux User Group
meeting. You might find it useful:

http://leap-cf.org/presentations/nutch/NutchWebCrawler.odp

On Sat, May 1, 2010 at 2:10 AM, Phil Barnett <ph...@philb.us> wrote:

>
>
> On Wed, Apr 28, 2010 at 10:27 AM, matthew a. grisius <mgrisius@comcast.net>
> wrote:
>
>> I also share many of Phil's sentiments. I really want the project
>> (bin/nutch crawl) to work for me as well and I want to help somehow. I
>> would like to share a 5gb 'intranet' web site with ~50 people. And I
>> have not graduated to making the 'deepcrawl' script work yet either, as
>> I'm thinking that maybe Nutch might not be the 'right tool' for 'little
>> projects' based on documentation, discussion list feedback, etc. . . .
>>
>
> I think it's exactly what you need to do that. I was able to get the 1.0
> release to work pretty quickly. Working 8 hour days, I had a server built
> and Nutch crawling sites within 40 hours. Actually after I found one
> specific tutorial I can get Nutch running in a basic bin/nutch crawl sort of
> way in about an hour. Wish I had found that site the first day...
>
> Going through that documentation, I found that it lacked one step and I fed
> that back to the author. He has already fixed it for 1.0 and if you follow
> his steps from top to bottom, you will get Nutch 1.0 running.
>
> The site is here:
>
>
> http://centoshelp.org/servers/installing-configuring-nutch-nutch-gui-sun-jdk-tomcat-6-on-centos-5.x
>
> Nutch 1.1 also follows the same installation steps and you get a working
> interface, but the crawls don't work well enough to get data into the
> indexes.
>

Re: [VOTE] Apache Nutch 1.1 Release Candidate #2

Posted by Phil Barnett <ph...@philb.us>.
On Wed, Apr 28, 2010 at 10:27 AM, matthew a. grisius
<mg...@comcast.net> wrote:

> I also share many of Phil's sentiments. I really want the project
> (bin/nutch crawl) to work for me as well and I want to help somehow. I
> would like to share a 5gb 'intranet' web site with ~50 people. And I
> have not graduated to making the 'deepcrawl' script work yet either, as
> I'm thinking that maybe Nutch might not be the 'right tool' for 'little
> projects' based on documentation, discussion list feedback, etc. . . .
>

I think it's exactly what you need to do that. I was able to get the 1.0
release to work pretty quickly. Working 8 hour days, I had a server built
and Nutch crawling sites within 40 hours. Actually after I found one
specific tutorial I can get Nutch running in a basic bin/nutch crawl sort of
way in about an hour. Wish I had found that site the first day...

Going through that documentation, I found that it lacked one step and I fed
that back to the author. He has already fixed it for 1.0 and if you follow
his steps from top to bottom, you will get Nutch 1.0 running.

The site is here:

http://centoshelp.org/servers/installing-configuring-nutch-nutch-gui-sun-jdk-tomcat-6-on-centos-5.x

Nutch 1.1 also follows the same installation steps and you get a working
interface, but the crawls don't work well enough to get data into the
indexes.

Re: [VOTE] Apache Nutch 1.1 Release Candidate #2

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hi Matthew,

Thanks for your feedback. If you have any specific updates, improvements, or actionable items based on your comments below, we'd love to have you contribute them back to the community. Otherwise, we will take your feedback and put it into the queue of other items in the Nutch issue tracking system for the project's committers to work on, as time permits.

Apache has a process for meritocracy [1] in terms of contributing to projects and being recognized for those contributions. We welcome feedback and actionable things in the form of patches that improve the code, documentation, add new features, etc., while maintaining backwards compatibility with existing deployments and existing users.

Thanks and hope to see some issues/feedback/patches continue to come!

Cheers,
Chris

[1] http://www.apache.org/foundation/how-it-works.html#meritocracy

On 4/28/10 7:27 AM, "matthew a. grisius" <mg...@comcast.net> wrote:

I also share many of Phil's sentiments. I really want the project
(bin/nutch crawl) to work for me as well and I want to help somehow. I
would like to share a 5gb 'intranet' web site with ~50 people. And I
have not graduated to making the 'deepcrawl' script work yet either, as
I'm thinking that maybe Nutch might not be the 'right tool' for 'little
projects' based on documentation, discussion list feedback, etc. . . .

-m.

On Wed, 2010-04-28 at 06:59 -0400, Phil Barnett wrote:
> On Mon, Apr 26, 2010 at 1:55 AM, Mattmann, Chris A (388J) <
> chris.a.mattmann@jpl.nasa.gov> wrote:
>
> >
> > Please vote on releasing these packages as Apache Nutch 1.1. The vote is
> > open for the next 72 hours.
> >
>
> How do you test to see if Nutch works like the documentation says it works?
> I still find major differences between how existing documentation tells me,
> a newcomer to the project, how to get it running.
>
> For example, my find of broken code in bin/nutch crawl, a most basic way of
> getting it running.
>
> And I have yet to get the deepcrawl script which seems to be the suggestion
> of how to get beyond bin/nutch crawl. It doesn't return any data at all and
> has an error in the middle of its run regarding a missing file which the last
> stage apparently failed to write. (I believe because the scheduler excluded
> everything)
>
> I wonder if the developers have advanced so far past these basic scripts as
> to have pretty much left them behind. This leads to these basics that people
> start with not working.
>
> I've spent dozens of hours trying to get 1.1 to work anything like 1.0 and
> I'm getting nowhere at all. It's pretty frustrating to spend that much time
> trying to figure out how it works and keep hitting walls. And then asking
> basic questions here that go unanswered.
>
> The view from the outside is not so good from my direction. If you don't
> keep documentation up to date and you change the way things work, the
> project as seen from the outside, is plainly broken.
>
> I'd be happy to give you feedback on where I find these problems and I'll
> even donate whatever fixes I can come up with, but Java is not a language
> I'm familiar with and going is slow weeding through things. I really need
> this project to work for me. I want to help.
>
> 1. Where is the scheduler documented? If I want to crawl everything from
> scratch, where is the information from the last run stored? It seems like
> the schedule is telling my crawl to ignore pages due to scheduler knocking
> them out. It's not obvious to me why this is happening and how to stop it
> from happening. I think right now this is my major roadblock in getting
> bin/nutch crawl working. Maybe the scheduler code no longer works properly
> in bin/nutch crawl. I can't tell if it's that or if the default
> configurations don't work.
>
> 2. Where are the control files in conf documented? How do I know which ones
> do what and when? There's a half dozen *-urlfilters. Why?
>
> 3. Why doesn't your post nightly compile tests include bin/nutch crawl or if
> it does, why didn't it find the error that stopped it from running?
>
> 4. Where is the documentation on how to configure the new tika parser in
> your environment? I see that the old parsers have been removed by default,
> but there's nothing that shows me how to include/exclude document types.
>
> I believe your assessment of 'ready' is not inclusive of some very important
> things and that you would be doing a service to newcomers to bring
> documentation in line with current offerings. This is not trivial code and
> it takes a long time for someone from the outside to understand it. That
> process is being stifled on multiple fronts as far as I can see. Either that
> or I have missed an important document that exists and I haven't read it.
>
> Phil Barnett
> Senior Programmer / Analyst
> Walt Disney World, Inc.




++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: Chris.Mattmann@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


Re: [VOTE] Apache Nutch 1.1 Release Candidate #2

Posted by "matthew a. grisius" <mg...@comcast.net>.
I also share many of Phil's sentiments. I really want the project
(bin/nutch crawl) to work for me as well and I want to help somehow. I
would like to share a 5gb 'intranet' web site with ~50 people. And I
have not graduated to making the 'deepcrawl' script work yet either, as
I'm thinking that maybe Nutch might not be the 'right tool' for 'little
projects' based on documentation, discussion list feedback, etc. . . .

-m.

On Wed, 2010-04-28 at 06:59 -0400, Phil Barnett wrote:
> On Mon, Apr 26, 2010 at 1:55 AM, Mattmann, Chris A (388J) <
> chris.a.mattmann@jpl.nasa.gov> wrote:
> 
> >
> > Please vote on releasing these packages as Apache Nutch 1.1. The vote is
> > open for the next 72 hours.
> >
> 
> How do you test to see if Nutch works like the documentation says it works?
> I still find major differences between how existing documentation tells me,
> a newcomer to the project, how to get it running.
> 
> For example, my find of broken code in bin/nutch crawl, a most basic way of
> getting it running.
> 
> And I have yet to get the deepcrawl script which seems to be the suggestion
> of how to get beyond bin/nutch crawl. It doesn't return any data at all and
> has an error in the middle of its run regarding a missing file which the last
> stage apparently failed to write. (I believe because the scheduler excluded
> everything)
> 
> I wonder if the developers have advanced so far past these basic scripts as
> to have pretty much left them behind. This leads to these basics that people
> start with not working.
> 
> I've spent dozens of hours trying to get 1.1 to work anything like 1.0 and
> I'm getting nowhere at all. It's pretty frustrating to spend that much time
> trying to figure out how it works and keep hitting walls. And then asking
> basic questions here that go unanswered.
> 
> The view from the outside is not so good from my direction. If you don't
> keep documentation up to date and you change the way things work, the
> project as seen from the outside, is plainly broken.
> 
> I'd be happy to give you feedback on where I find these problems and I'll
> even donate whatever fixes I can come up with, but Java is not a language
> I'm familiar with and going is slow weeding through things. I really need
> this project to work for me. I want to help.
> 
> 1. Where is the scheduler documented? If I want to crawl everything from
> scratch, where is the information from the last run stored? It seems like
> the schedule is telling my crawl to ignore pages due to scheduler knocking
> them out. It's not obvious to me why this is happening and how to stop it
> from happening. I think right now this is my major roadblock in getting
> bin/nutch crawl working. Maybe the scheduler code no longer works properly
> in bin/nutch crawl. I can't tell if it's that or if the default
> configurations don't work.
> 
> 2. Where are the control files in conf documented? How do I know which ones
> do what and when? There's a half dozen *-urlfilters. Why?
> 
> 3. Why doesn't your post nightly compile tests include bin/nutch crawl or if
> it does, why didn't it find the error that stopped it from running?
> 
> 4. Where is the documentation on how to configure the new tika parser in
> your environment? I see that the old parsers have been removed by default,
> but there's nothing that shows me how to include/exclude document types.
> 
> I believe your assessment of 'ready' is not inclusive of some very important
> things and that you would be doing a service to newcomers to bring
> documentation in line with current offerings. This is not trivial code and
> it takes a long time for someone from the outside to understand it. That
> process is being stifled on multiple fronts as far as I can see. Either that
> or I have missed an important document that exists and I haven't read it.
> 
> Phil Barnett
> Senior Programmer / Analyst
> Walt Disney World, Inc.


Re: [VOTE] Apache Nutch 1.1 Release Candidate #2

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hi Phil,

Thanks very much for the feedback. I'd like to take a second to address your
points:

> 
> How do you test to see if Nutch works like the documentation says it works?
> I still find major differences between how existing documentation tells me,
> a newcomer to the project, how to get it running.

Unfortunately some parts of the documentation on Nutch (namely the tutorial,
and other parts of the static site) have been out of date for a while. This
has occurred really independent of the releases, and independent of the wiki
[1], which hasn't really fallen out of date as quickly.

> 
> For example, my find of broken code in bin/nutch crawl, a most basic way of
> getting it running.

Can you elaborate on your find of broken code? Did you file a JIRA issue for
this in the Nutch JIRA system [2] ?

> 
> And I have yet to get the deepcrawl script which seems to be the suggestion
> of how to get beyond bin/nutch crawl. It doesn't return any data at all and
> has an error in the middle of its run regarding a missing file which the last
> stage apparently failed to write. (I believe because the scheduler excluded
> everything)

The more information you provide here about your environment and your
situation that caused the error, as well as e.g., detailed information (a
stack trace, an exception, something), the easier it is to track down what
you're seeing.

> 
> I wonder if the developers have advanced so far past these basic scripts as
> to have pretty much left them behind. This leads to these basics that people
> start with not working.

I wouldn't say developers have advanced beyond anything really for that
matter :) The number of active developers in Nutch these days is fairly
small, but interest and the user community is stable and there are some
pretty large scale deployments of Nutch to my knowledge. That said, those
folks have been following the mailing lists for a while, have been using the
software for a while and thus their level of entry into the documentation
may be at a little higher bar than that of a newer user such as yourself.

That said, one thing to realize is that this is open source software, so in
the end, as they say in Apache, "those that do, decide", or "patches
welcome!" In other words, if there are things that you see that could be
fixed, improved, made more configurable, etc., including the code, but *also
the documentation*, then by all means we'd appreciate your feedback and
contribution. Nutch is not simply a product of the developers that
contribute their (potentially and often unsalaried) time to work on it, but
of its user community as well.

> 
> I've spent dozens of hours trying to get 1.1 to work anything like 1.0 and
> I'm getting nowhere at all. It's pretty frustrating to spend that much time
> trying to figure out how it works and keep hitting walls. And then asking
> basic questions here that go unanswered.

I apologize that your questions have gone unanswered and that you're hitting
walls with regards to using Nutch. What questions did you ask? Perhaps it's
the detail that you are providing (or not providing), or perhaps it's the
way you're asking the questions. Or (even more likely) it's the fact that
this is an open source project, so answering the user mailing lists is only
one of several items on the committers' plates. We committers may have
missed your question, or perhaps those who had the time weren't particular
experts in the one area of Nutch that you were asking about. There could be
a number of reasons. Regardless, persistence is key, as are *patience* and
respectfulness. This has always, to my knowledge, been a really friendly
community, so if you hang around and keep asking questions they will get
answered; I'm confident of that.

> 
> The view from the outside is not so good from my direction. If you don't
> keep documentation up to date and you change the way things work, the
> project as seen from the outside, is plainly broken.

In certain cases you are right, but I would not take your above comments as
applying across the board. For example, if you believe there is
documentation lacking, then the first step is typically to file JIRA issues
to alert committers and other users of Nutch of your concern and then have
discussion on the lists regarding the issues. At some point a patch is
produced, and then attached to the issue, where the committers can review
the patches and then work to get them committed to the code base.

Nutch has a number of unit tests for regression that ship with the product
that tell me that it's not broken, and users that are able to make it work
in their environments. There have been some recent bug fixes in the 1.1 RC
that we caught which have been fixed (NUTCH-812, NUTCH-814, etc.), but
that's natural.  

> 
> I'd be happy to give you feedback on where I find these problems and I'll
> even donate whatever fixes I can come up with, but Java is not a language
> I'm familiar with and going is slow weeding through things. I really need
> this project to work for me. I want to help.

There are other ways to contribute to the project besides coding - I just
thought of a really good reference that you can read in this regard put
together by Dennis Kubes, one of the Nutch committers and PMC members. Check
this out [3]. You may also want to check out our FAQ [4].

> 
> 1. Where is the scheduler documented? If I want to crawl everything from
> scratch, where is the information from the last run stored? It seems like
> the schedule is telling my crawl to ignore pages due to scheduler knocking
> them out. It's not obvious to me why this is happening and how to stop it
> from happening. I think right now this is my major roadblock in getting
> bin/nutch crawl working. Maybe the scheduler code no longer works properly
> in bin/nutch crawl. I can't tell if it's that or if the default
> configurations don't work.

I think you might be talking about the Fetcher: there is documentation of it
here:

http://bit.ly/alqFoA
http://wiki.apache.org/nutch/FetchOptions
http://wiki.apache.org/nutch/CommandLineOptions

> 
> 2. Where are the control files in conf documented? How do I know which ones
> do what and when? There's a half dozen *-urlfilters. Why?

Some of these are admittedly newer features but others are not:

http://wiki.apache.org/nutch/RegexURLFiltersBenchs
http://bit.ly/b99NLK

> 
> 3. Why doesn't your post nightly compile tests include bin/nutch crawl or if
> it does, why didn't it find the error that stopped it from running?

Good question. I'm not super familiar with the nightly tests, but my guess
is that the scripts are outside the context of the tests since most of the
tests use JUnit and are testing the Java API and classes. I may be wrong
though.

> 
> 4. Where is the documentation on how to configure the new tika parser in
> your environment? I see that the old parsers have been removed by default,
> but there's nothing that shows me how to include/exclude document types.

Julien Nioche put this together on the TikaPlugin:

http://wiki.apache.org/nutch/TikaPlugin

> 
> I believe your assessment of 'ready' is not inclusive of some very important
> things and that you would be doing a service to newcomers to bring
> documentation in line with current offerings. This is not trivial code and
> it takes a long time for someone from the outside to understand it. That
> process is being stifled on multiple fronts as far as I can see. Either that
> or I have missed an important document that exists and I haven't read it.

Ready in the sense of the release is a consensus decision made by the
developers and community based on a variety of things:

* issues being resolved in JIRA of a particular priority
* time since the last release
* community requesting a release
* according to some pre-defined schedule
* making a feature release to get out new interesting features
etc etc.

I'm sorry that you are experiencing problems, and our goal is to try and
address as many as possible and prioritize them, but in the end, Apache has
a process regarding releases [5], which is based somewhat on input from the
community (usually in the form of simple majority), but ultimately based on
a Project Management Committee [6] structure, whose votes are binding on a
particular release.

I hope that we can work with you to continue to use Nutch and make it useful
in your environment, but in the meantime, I would suggest you keep plugging
along, continue to push forward and check out some of the references I
included in this email moving forward.

Thanks!

Cheers,
Chris


[1] http://wiki.apache.org/nutch/
[2] http://issues.apache.org/jira/browse/NUTCH
[3] http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer
[4] http://wiki.apache.org/nutch/FAQ
[5] http://www.apache.org/foundation/voting.html
[6] http://www.apache.org/dev/pmc.html

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: Chris.Mattmann@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++



Re: [VOTE] Apache Nutch 1.1 Release Candidate #2

Posted by Phil Barnett <ph...@philb.us>.
On Mon, Apr 26, 2010 at 1:55 AM, Mattmann, Chris A (388J) <
chris.a.mattmann@jpl.nasa.gov> wrote:

>
> Please vote on releasing these packages as Apache Nutch 1.1. The vote is
> open for the next 72 hours.
>

How do you test to see if Nutch works like the documentation says it works?
I still find major differences between how existing documentation tells me,
a newcomer to the project, how to get it running.

For example, my find of broken code in bin/nutch crawl, a most basic way of
getting it running.

And I have yet to get the deepcrawl script which seems to be the suggestion
of how to get beyond bin/nutch crawl. It doesn't return any data at all and
has an error in the middle of its run regarding a missing file which the last
stage apparently failed to write. (I believe because the scheduler excluded
everything)

I wonder if the developers have advanced so far past these basic scripts as
to have pretty much left them behind. This leads to these basics that people
start with not working.

I've spent dozens of hours trying to get 1.1 to work anything like 1.0 and
I'm getting nowhere at all. It's pretty frustrating to spend that much time
trying to figure out how it works and keep hitting walls. And then asking
basic questions here that go unanswered.

The view from the outside is not so good from my direction. If you don't
keep documentation up to date and you change the way things work, the
project as seen from the outside, is plainly broken.

I'd be happy to give you feedback on where I find these problems and I'll
even donate whatever fixes I can come up with, but Java is not a language
I'm familiar with and going is slow weeding through things. I really need
this project to work for me. I want to help.

1. Where is the scheduler documented? If I want to crawl everything from
scratch, where is the information from the last run stored? It seems like
the schedule is telling my crawl to ignore pages due to scheduler knocking
them out. It's not obvious to me why this is happening and how to stop it
from happening. I think right now this is my major roadblock in getting
bin/nutch crawl working. Maybe the scheduler code no longer works properly
in bin/nutch crawl. I can't tell if it's that or if the default
configurations don't work.

2. Where are the control files in conf documented? How do I know which ones
do what and when? There's a half dozen *-urlfilters. Why?

3. Why doesn't your post nightly compile tests include bin/nutch crawl or if
it does, why didn't it find the error that stopped it from running?

4. Where is the documentation on how to configure the new tika parser in
your environment? I see that the old parsers have been removed by default,
but there's nothing that shows me how to include/exclude document types.

I believe your assessment of 'ready' is not inclusive of some very important
things and that you would be doing a service to newcomers to bring
documentation in line with current offerings. This is not trivial code and
it takes a long time for someone from the outside to understand it. That
process is being stifled on multiple fronts as far as I can see. Either that
or I have missed an important document that exists and I haven't read it.

Phil Barnett
Senior Programmer / Analyst
Walt Disney World, Inc.

Re: [VOTE] Apache Nutch 1.1 Release Candidate #2

Posted by Grant Ingersoll <gs...@apache.org>.
Might I suggest, that since Nutch is now a TLP that you delay this release by a few weeks and have the vote done under the auspices of the Nutch PMC?

Cheers,
Grant

On Apr 26, 2010, at 1:55 AM, Mattmann, Chris A (388J) wrote:

> Hi Folks,
> 
> I have posted an updated candidate for the Apache Nutch 1.1 release. The
> source code is at:
> 
> http://people.apache.org/~mattmann/apache-nutch-1.1/rc2/
> 
> The major difference between this release and rc #1 is the application of
> NUTCH-812 - Crawl.java incorrectly uses the Generator API resulting in NPE -
> as well as some commits by Sami Siren to fix missing ASL license headers.
> 
> For more detailed information, see the included CHANGES.txt file for details
> on release contents and latest changes. The release was made using the Nutch
> release process, documented on the Wiki here:
> 
> http://bit.ly/d5ugid
> 
> A Nutch 1.1 tag is at:
> 
> http://svn.apache.org/repos/asf/lucene/nutch/tags/1.1/
> 
> <note>
> There was a request by Sami Siren that the tutorial be updated to reflect
> the fact that this release is a source-only release, as well as a request to
> integrate RAT into the build, however, in the interest of getting this 1.1
> out and getting going on the Nutch TLP, my proposal is:
> 
> * update the docs independent of this release (the tutorial as it exists
> right now says 0.7 on it anyways and doesn't look like it's been updated in
> a while, so I think users can live with what's there and support on
> user@nutch.apache.org or dev@nutch.apache.org until it's updated)
> 
> * begin source only releases in general since we've long had the debate as
> to the size of the Nutch release. Most folks that use Nutch are likely
> familiar with running ant IMHO.
> 
> * run RAT and integrate into the build
> 
> </note>
> 
> Please vote on releasing these packages as Apache Nutch 1.1. The vote is
> open for the next 72 hours.
> 
> Since Nutch is now a TLP and has its own PMC, there is a question of who are
> the binding release VOTES in this particular thread. My gut reaction is that
> since I started this release while we were under the Lucene PMC, for
> continuity purposes, only votes from Lucene PMC are binding, but everyone
> (especially newly minted Nutch PMC members!) are  welcome to check the
> release candidate and voice their approval or disapproval. The vote passes
> if at least three binding +1 votes are cast.
> 
> [ ] +1 Release the packages as Apache Nutch 1.1.
> 
> [ ] -1 Do not release the packages because...
> 
> Thanks!
> 
> Cheers,
> Chris
> 
> P.S. Here is my +1.
> 
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: Chris.Mattmann@jpl.nasa.gov
> WWW:   http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 
> 
> 


