
Posted to dev@nutch.apache.org by Andrzej Bialecki <ab...@getopt.org> on 2010/07/21 20:26:34 UTC

Nutchbase merge strategy

Hi all,

I'd like to discuss the best way forward for merging the
nutchbase code with trunk.

First some important facts:

* nutchbase is almost totally API incompatible with Nutch 1.x. While the 
main ideas remain the same, and most of the tools remain as well, their 
implementation is very different (and let me say, much cleaner) than 
that of Nutch 1.x. E.g. while nutchbase uses URLFilters and 
URLNormalizers, and IndexingFilters, etc, their method signatures have 
changed. To give you some idea how deep these changes go, let me say 
that CrawlDatum is gone now.

* for the last month or so, and I expect for another month or so,
Julien, Dogacan, Enis and I have been working on bringing nutchbase
(and Gora) as up to date with trunk as possible - in fact, you
could say we have been merging trunk into nutchbase... The original reason
for this was that we first wanted to bring nutchbase into a working
state and then start merging, but another important reason was the
one mentioned above - we didn't know how to prepare a meaningful patch
for trunk that wouldn't replace 90+% of the code in trunk...

So, I would like to propose an alternative strategy: we will keep
merging from trunk to nutchbase, with proper JIRA tracking (I created a
'nutchbase' tag in JIRA), and once we reach a state where nutchbase
offers roughly the same functionality as the code in trunk, we
simply replace trunk with nutchbase.

Current status of nutchbase is that the basic tools to implement a 
crawling workflow have been ported and work correctly, and we are able 
to execute a few unit tests on an SQL backend.

Regarding backwards-compatibility with Nutch 1.x: most config files are 
unchanged, and we should probably offer some data migration tools - I'm 
not sure whether it makes sense to create a segment converter, but we 
can certainly create a CrawlDb converter.

What do you think? Any comments / suggestions / ideas?

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

Re: Nutchbase merge strategy

Posted by Doğacan Güney <do...@gmail.com>.

Hey,

Sorry for the late answer everyone.

On Wed, Jul 21, 2010 at 21:26, Andrzej Bialecki <ab...@getopt.org> wrote:

> Hi all,
>
> I'd like to discuss the best way forward for merging the nutchbase
> code with trunk.
>
> First some important facts:
>
> * nutchbase is almost totally API incompatible with Nutch 1.x. While the
> main ideas remain the same, and most of the tools remain as well, their
> implementation is very different (and let me say, much cleaner) than that of
> Nutch 1.x. E.g. while nutchbase uses URLFilters and URLNormalizers, and
> IndexingFilters, etc, their method signatures have changed. To give you some
> idea how deep these changes go, let me say that CrawlDatum is gone now.
>
> * for the last month or so, and I expect for another month or so, Julien,
> Dogacan, Enis and I have been working on bringing nutchbase (and Gora)
> as up to date with trunk as possible - in fact, you could say we have
> been merging trunk into nutchbase... The original reason for this was that we
> first wanted to bring nutchbase into a working state and then start merging,
> but another important reason was the one mentioned above - we didn't
> know how to prepare a meaningful patch for trunk that wouldn't replace 90+%
> of the code in trunk...
>
> So, I would like to propose an alternative strategy: we will keep merging
> from trunk to nutchbase, with proper JIRA tracking (I created a 'nutchbase'
> tag in JIRA), and once we reach a state where nutchbase offers roughly the
> same functionality as the code in trunk, we simply replace trunk with
> nutchbase.
>
> Current status of nutchbase is that the basic tools to implement a crawling
> workflow have been ported and work correctly, and we are able to execute a
> few unit tests on an SQL backend.
>
> Regarding backwards-compatibility with Nutch 1.x: most config files are
> unchanged, and we should probably offer some data migration tools - I'm not
> sure whether it makes sense to create a segment converter, but we can
> certainly create a CrawlDb converter.
>
> What do you think? Any comments / suggestions / ideas?
>
>

I am ok with this approach. One thing that may be problematic is that this
flattens SVN history a lot, so history will be more difficult
to read AND code (that committers commit) will not be properly attributed.

Btw, I recently tried a git merge between trunk and nutchbase. IIRC, there
were only 10-15 file conflicts. So I think producing a patch
*may* be possible.




-- 
Doğacan Güney

Re: Nutchbase merge strategy

Posted by Julien Nioche <li...@gmail.com>.

On 23 July 2010 10:20, Julien Nioche <li...@gmail.com> wrote:

>
>
>>> Before doing so, let's:
>>>
>>> 1. tag current trunk as
>>> http://svn.apache.org/repos/asf/nutch/branches/branch-1.3 (EOL'ed won't be
>>> worked on, but nice to save). This way someone doesn't have to remember the
>>> Nutchbase rev # before the Nutchbase branch lands in the trunk.
>>>
>>> Then we can:
>>>
>>> 2. svn remove -m "n-1 before Nutchbase lands."
>>> https://svn.apache.org/repos/asf/nutch/trunk
>>> 3. svn copy -m "Nutchbase branch lands in trunk."
>>> https://svn.apache.org/repos/asf/nutch/branches/nutchbase
>>> https://svn.apache.org/repos/asf/nutch/trunk
>>>
>>> After doing that, we should also:
>>>
>>> 4. roll a 1.2 release, which I would say is the last major 1.x release.
>>> Andrzej and I and others have backported some pretty decent patches in the
>>> past few weeks and it probably makes sense to make a quick release. I'll
>>> happily be the RM for it.
>>>
>>
>> +1 to all of the above - see below.
>>
>>
>>
>>> So if 1-4 make sense, let's do 1, 2 and 3 today or tomorrow -- 4 can happen
>>> over the next few weeks. WDYT?
>>>
>>
>> This is a serious move - let's wait a bit, say until Monday, to give
>> others a chance to comment.
>>
>
> +1 from me as well
>

Before we turn NutchBase into trunk we need to make sure that all (more or
less) recent changes in the trunk have been ported to NutchBase. I have done
that recently but given that there is a very large number of changes I might
have missed a few things here and there.  I've created NUTCH-859 to track
this.

Thanks

J.

-- 
DigitalPebble Ltd

Open Source Solutions for Text Engineering
http://www.digitalpebble.com

Re: Nutchbase merge strategy

Posted by Julien Nioche <li...@gmail.com>.

>
>> Before doing so, let's:
>>
>> 1. tag current trunk as
>> http://svn.apache.org/repos/asf/nutch/branches/branch-1.3 (EOL'ed won't be
>> worked on, but nice to save). This way someone doesn't have to remember the
>> Nutchbase rev # before the Nutchbase branch lands in the trunk.
>>
>> Then we can:
>>
>> 2. svn remove -m "n-1 before Nutchbase lands."
>> https://svn.apache.org/repos/asf/nutch/trunk
>> 3. svn copy -m "Nutchbase branch lands in trunk."
>> https://svn.apache.org/repos/asf/nutch/branches/nutchbase
>> https://svn.apache.org/repos/asf/nutch/trunk
>>
>> After doing that, we should also:
>>
>> 4. roll a 1.2 release, which I would say is the last major 1.x release.
>> Andrzej and I and others have backported some pretty decent patches in the
>> past few weeks and it probably makes sense to make a quick release. I'll
>> happily be the RM for it.
>>
>
> +1 to all of the above - see below.
>
>
>
>> So if 1-4 make sense, let's do 1, 2 and 3 today or tomorrow -- 4 can happen
>> over the next few weeks. WDYT?
>>
>
> This is a serious move - let's wait a bit, say until Monday, to give
> others a chance to comment.
>

+1 from me as well

Julien


Re: Nutchbase merge strategy

Posted by Andrzej Bialecki <ab...@getopt.org>.

On 2010-07-21 21:12, Mattmann, Chris A (388J) wrote:
> Hey Andrzej,
>
>> +1 to all of the above - see below.
>>
>>>
>>> So if 1-4 make sense, let's do 1, 2 and 3 today or tomorrow -- 4 can happen
>>> over the next few weeks. WDYT?
>>
>> This is a serious move - let's wait a bit, say until Monday, to give
>> others a chance to comment.
>
> Agreed. Let's wait until Monday. If there aren't any objections, let's let
> 'er rip!
>
> BTW, #4 is independent of #1-3. WDYT about wrapping up the 1.x series of
> Nutch and rolling a 1.2 in the next few days (while I have some free
> cycles)? :) #4 is also in its own branch and therefore independent as well
> so it won't be as brave a move.
>
> Let me know what you (all) think.

If 1.2 is going to be the last release in the 1.x series then I think we
should review some pending issues, especially those reported after the 1.0
release:

https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&&pid=10680&updated%3Aprevious=-1w&created%3Aafter=1%2FApr%2F09&status=1&status=3&status=4&sorter/field=updated&sorter/order=DESC

Actually, just two issues are still unresolved... hmm, not bad.

-- 
Best regards,
Andrzej Bialecki     <><

Re: Nutchbase merge strategy

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.

Hey Andrzej,

> +1 to all of the above - see below.
> 
>> 
>> So if 1-4 make sense, let's do 1, 2 and 3 today or tomorrow -- 4 can happen
>> over the next few weeks. WDYT?
> 
> This is a serious move - let's wait a bit, say until Monday, to give
> others a chance to comment.

Agreed. Let's wait until Monday. If there aren't any objections, let's let
'er rip!

BTW, #4 is independent of #1-3. WDYT about wrapping up the 1.x series of
Nutch and rolling a 1.2 in the next few days (while I have some free
cycles)? :) #4 is also in its own branch and therefore independent as well
so it won't be as brave a move.

Let me know what you (all) think.

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: Chris.Mattmann@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Re: Nutchbase merge strategy

Posted by Andrzej Bialecki <ab...@getopt.org>.

On 2010-07-21 20:36, Mattmann, Chris A (388J) wrote:
> Hmmm....interesting.
>
> OK, my one comment would be: why wait? trunk is traditionally not guaranteed
> to be stable and it seems like you guys have nutchbase *sorta* working
> enough that the time is ripe to just switch now. And then you won't further
> confuse folks like me that are happy to check out the nutch trunk in
> Eclipse, but shudder when I have to manually check out multiple copies of
> Nutch as branches, etc. etc.
>
> In other words, my comment is, *let's just switch now*.

Hmm, well - a brave move. The reason why we waited was to bring 
nutchbase into a state that is more functional than just a bunch of 
ideas, i.e. into a state where one can meaningfully play with it and get 
some useful results. Perhaps it is in such a state now, perhaps not ... 
it depends how adventurous you are :)

> Before doing so,
> let's:
>
> 1. tag current trunk as
> http://svn.apache.org/repos/asf/nutch/branches/branch-1.3 (EOL'ed won't be
> worked on, but nice to save). This way someone doesn't have to remember the
> Nutchbase rev # before the Nutchbase branch lands in the trunk.
>
> Then we can:
>
> 2. svn remove -m "n-1 before Nutchbase lands."
> https://svn.apache.org/repos/asf/nutch/trunk
> 3. svn copy -m "Nutchbase branch lands in trunk."
> https://svn.apache.org/repos/asf/nutch/branches/nutchbase
> https://svn.apache.org/repos/asf/nutch/trunk
>
> After doing that, we should also:
>
> 4. roll a 1.2 release, which I would say is the last major 1.x release.
> Andrzej and I and others have backported some pretty decent patches in the
> past few weeks and it probably makes sense to make a quick release. I'll
> happily be the RM for it.

+1 to all of the above - see below.

>
> So if 1-4 make sense, let's do 1, 2 and 3 today or tomorrow -- 4 can happen
> over the next few weeks. WDYT?

This is a serious move - let's wait a bit, say until Monday, to give
others a chance to comment.

-- 
Best regards,
Andrzej Bialecki     <><

Re: Nutchbase merge strategy

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.

Hmmm....interesting.

OK, my one comment would be: why wait? trunk is traditionally not guaranteed
to be stable and it seems like you guys have nutchbase *sorta* working
enough that the time is ripe to just switch now. And then you won't further
confuse folks like me that are happy to check out the nutch trunk in
Eclipse, but shudder when I have to manually check out multiple copies of
Nutch as branches, etc. etc.

In other words, my comment is, *let's just switch now*. Before doing so,
let's:

1. tag current trunk as
http://svn.apache.org/repos/asf/nutch/branches/branch-1.3 (EOL'ed won't be
worked on, but nice to save). This way someone doesn't have to remember the
Nutchbase rev # before the Nutchbase branch lands in the trunk.

Then we can:

2. svn remove -m "n-1 before Nutchbase lands."
https://svn.apache.org/repos/asf/nutch/trunk
3. svn copy -m "Nutchbase branch lands in trunk."
https://svn.apache.org/repos/asf/nutch/branches/nutchbase
https://svn.apache.org/repos/asf/nutch/trunk

After doing that, we should also:

4. roll a 1.2 release, which I would say is the last major 1.x release.
Andrzej and I and others have backported some pretty decent patches in the
past few weeks and it probably makes sense to make a quick release. I'll
happily be the RM for it.

So if 1-4 make sense, let's do 1, 2 and 3 today or tomorrow -- 4 can happen
over the next few weeks. WDYT?
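
Steps 1-3 above could be sketched as a small shell script. This is only a
sketch under assumptions: the names REPO, DRY_RUN, run and land_nutchbase are
made up for illustration, the step 1 command and its log message are a guess
(that step is only described in words above), and every URL uses the https
form of the repository root. By default it just prints the commands.

```shell
#!/bin/sh
# Sketch only: prints the svn commands for steps 1-3 unless DRY_RUN=0.
# REPO, DRY_RUN, run and land_nutchbase are illustrative names (assumptions).

REPO="https://svn.apache.org/repos/asf/nutch"
DRY_RUN="${DRY_RUN:-1}"

# In dry-run mode print the command; otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

land_nutchbase() {
    # Step 1: keep the 1.x trunk as branch-1.3.
    # The command and its log message are assumptions; the step is only
    # described in words in the thread.
    run svn copy -m "Save 1.x trunk as branch-1.3." \
        "$REPO/trunk" "$REPO/branches/branch-1.3" || return 1

    # Step 2: remove the old trunk (log message as quoted in the thread).
    run svn remove -m "n-1 before Nutchbase lands." "$REPO/trunk" || return 1

    # Step 3: copy the nutchbase branch to trunk (log message as quoted).
    run svn copy -m "Nutchbase branch lands in trunk." \
        "$REPO/branches/nutchbase" "$REPO/trunk" || return 1
}

land_nutchbase
```

Running it with DRY_RUN=0 would execute the commands against the live
repository and needs commit access, so reviewing the printed commands first
seems the safer default.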

Cheers,
Chris

On 7/21/10 2:26 PM, "Andrzej Bialecki" <ab...@getopt.org> wrote:

> Hi all,
> 
> I'd like to discuss the best way forward for merging the
> nutchbase code with trunk.
> 
> First some important facts:
> 
> * nutchbase is almost totally API incompatible with Nutch 1.x. While the
> main ideas remain the same, and most of the tools remain as well, their
> implementation is very different (and let me say, much cleaner) than
> that of Nutch 1.x. E.g. while nutchbase uses URLFilters and
> URLNormalizers, and IndexingFilters, etc, their method signatures have
> changed. To give you some idea how deep these changes go, let me say
> that CrawlDatum is gone now.
> 
> * for the last month or so, and I expect for another month or so,
> Julien, Dogacan, Enis and I have been working on bringing nutchbase
> (and Gora) as up to date with trunk as possible - in fact, you
> could say we have been merging trunk into nutchbase... The original reason
> for this was that we first wanted to bring nutchbase into a working
> state and then start merging, but another important reason was the
> one mentioned above - we didn't know how to prepare a meaningful patch
> for trunk that wouldn't replace 90+% of the code in trunk...
> 
> So, I would like to propose an alternative strategy: we will keep
> merging from trunk to nutchbase, with proper JIRA tracking (I created a
> 'nutchbase' tag in JIRA), and once we reach a state where nutchbase
> offers roughly the same functionality as the code in trunk, we
> simply replace trunk with nutchbase.
> 
> Current status of nutchbase is that the basic tools to implement a
> crawling workflow have been ported and work correctly, and we are able
> to execute a few unit tests on an SQL backend.
> 
> Regarding backwards-compatibility with Nutch 1.x: most config files are
> unchanged, and we should probably offer some data migration tools - I'm
> not sure whether it makes sense to create a segment converter, but we
> can certainly create a CrawlDb converter.
> 
> What do you think? Any comments / suggestions / ideas?
> 