Posted to commits@nutch.apache.org by jn...@apache.org on 2010/07/20 13:47:23 UTC

svn commit: r965815 - in /nutch/branches/nutchbase/src: java/org/apache/nutch/parse/ParseStatus.java java/org/apache/nutch/parse/ParseText.java test/org/apache/nutch/parse/TestParseText.java

Author: jnioche
Date: Tue Jul 20 11:47:23 2010
New Revision: 965815

URL: http://svn.apache.org/viewvc?rev=965815&view=rev
Log:
remove deprecated objects ParseText and ParseStatus

Removed:
    nutch/branches/nutchbase/src/java/org/apache/nutch/parse/ParseStatus.java
    nutch/branches/nutchbase/src/java/org/apache/nutch/parse/ParseText.java
    nutch/branches/nutchbase/src/test/org/apache/nutch/parse/TestParseText.java


Re: svn commit: r965815 - in /nutch/branches/nutchbase/src: java/org/apache/nutch/parse/ParseStatus.java java/org/apache/nutch/parse/ParseText.java test/org/apache/nutch/parse/TestParseText.java

Posted by Julien Nioche <li...@gmail.com>.
Thanks for your comments Chris

>
> > However we still need to address the issue raised by Dogacan, i.e. shall we
> > provide tools to convert from 1.x structures to 2.0 and if so how shall we
> > organise it. Again - some things have been removed from NutchBase for the sake
> > of clarity but since they are in the trunk they are not lost and we can decide
> > what to do with them later.
>
> Maybe we can provide a couple of encapsulated upgrade tools that contain
> internal versions of the necessary Nutch 1.x classes that live inside of the
> Tool class. This way they are hidden and not cluttering the sources, but the
> point is still accomplished.
>


+1

Jul

-- 
DigitalPebble Ltd

Open Source Solutions for Text Engineering
http://www.digitalpebble.com

Re: svn commit: r965815 - in /nutch/branches/nutchbase/src: java/org/apache/nutch/parse/ParseStatus.java java/org/apache/nutch/parse/ParseText.java test/org/apache/nutch/parse/TestParseText.java

Posted by Andrzej Bialecki <ab...@getopt.org>.
On 2010-07-20 20:29, Julien Nioche wrote:

> I meant putting the migration code and 1.x Nutch jars in the contrib
> directory of the trunk - that shouldn't require a different committers
> list or should it?

I don't feel strongly about contrib... there is a different precedent:
for a while there were migration tools in the main tree for conversion
between 0.8 and 0.9+.


>     A. branch cleaned up, SVN commits, etc., stable working
>     B. at some point, branch ready to be merged (assumption: branch
>     devel stops)
>     C. define branch merge into 3-5 patches

Due to a total API incompatibility (CrawlDatum is replaced by a WebPage,
content and link storage is different, the way we run jobs in nutchbase
is also different) I don't expect more than 2 patches, of which the
first one will contain 90% of API changes...

>     D. foreach patch in C:
>        create JIRA issue for patch
>        call for review of patch
>        if no objections, then commit in 24-48 hours
> 
>     E. trunk now ready for 2.0 development
>     F. schedule current open issues for 2.0, grab any low hanging fruit (1-2
>     days)


>     G. all other issues pushed out to 2.1
>     H. release 2.0
> 
> 
> Andrzej and myself are in the process of porting the last missing tests
> in NutchBase and debugging Gora along the way. There is just a handful
> of plugins which have not been ported and I should have finished that
> pretty quickly. Hopefully we'll get to (A) soonish and can then follow
> the plan above.
> 
> However we still need to address the issue raised by Dogacan, i.e. shall we
> provide tools to convert from 1.x structures to 2.0 and if so how shall
> we organise it. Again - some things have been removed from NutchBase for
> the sake of clarity but since they are in the trunk they are not lost
> and we can decide what to do with them later.

IMO it would take enormous effort to implement a runtime compatibility
between 1.x and 2.x, so users will have to either convert or recrawl. I
think that at a minimum we should provide a clear procedure on how to
export the old crawldb and import into a new db.

If there's a strong desire to have a tool to convert 1.x segments into
the new crawl job data format we could also implement this - but I don't
expect there would be ... after all, segments are a throwaway property
with a limited time to live...

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: svn commit: r965815 - in /nutch/branches/nutchbase/src: java/org/apache/nutch/parse/ParseStatus.java java/org/apache/nutch/parse/ParseText.java test/org/apache/nutch/parse/TestParseText.java

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hey Julien,

>> I wouldn't favor a Nutch contrib going forward. Contribs lead to
>> umbrella-projects which Apache is moving away from b/c it typically creates
>> different committer lists (those who can commit to contrib and those with
>> commit privs to the full source code base, etc.), different lifecycles and
>> ultimately incubates/grows mini-projects within larger ones.
> 
> I meant putting the migration code and 1.x Nutch jars in the contrib directory
> of the trunk - that shouldn't require a different committers list or should
> it?

Not outright, but it's the principle that matters. Contribs typically
_imply_ what I suggested, but nothing really _enforces_ it, though it would
be confusing to have contribs just as a dir name. I favor getting rid of it.

>  
>> 
>> If someone needs Nutch 1.x jars they can grab them from the Apache distros
>> or we can publish them to Maven central. As for conversion and removal of
>> src/java, I'm not sure I get that? Why should we remove src/java? Merge
>> means "adapt existing" rather than "replace".
> 
> I was talking about removing deprecated Nutch objects (old Writables which we
> needed for storing things in Hadoop MapFiles) from the src after the merge
> once they are not used by Nutch2.0.

+1

> 
> The point made by Dogacan was that they would be needed if we want to provide
> conversion tools so that people could convert their old crawldbs and segments
> into our shiny new Gora-based architecture.

Gotcha, comments below.

>> 
>> Nah, IMHO I think it's OK to muck around in the branch, so long as when the
>> branch gets merged (incrementally rather than wholesale), we can review
>> those. So, the way it would work is this:
>> 
>> A. branch cleaned up, SVN commits, etc., stable working
>> B. at some point, branch ready to be merged (assumption: branch devel stops)
>> C. define branch merge into 3-5 patches
>> D. foreach patch in C:
>>     create JIRA issue for patch
>>     call for review of patch
>>     if no objections, then commit in 24-48 hours
>> 
>> E. trunk now ready for 2.0 development
>> F. schedule current open issues for 2.0, grab any low hanging fruit (1-2
>> days)
>> G. all other issues pushed out to 2.1
>> H. release 2.0
>> 
> 
> Andrzej and myself are in the process of porting the last missing tests in
> NutchBase and debugging Gora along the way. There is just a handful of plugins
> which have not been ported and I should have finished that pretty quickly.
> Hopefully we'll get to (A) soonish and can then follow the plan above.

+1

> 
> However we still need to address the issue raised by Dogacan, i.e. shall we
> provide tools to convert from 1.x structures to 2.0 and if so how shall we
> organise it. Again - some things have been removed from NutchBase for the sake
> of clarity but since they are in the trunk they are not lost and we can decide
> what to do with them later.

Maybe we can provide a couple of encapsulated upgrade tools that contain
internal versions of the necessary Nutch 1.x classes that live inside of the
Tool class. This way they are hidden and not cluttering the sources, but the
point is still accomplished.
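
To make the idea concrete, here is a minimal sketch, assuming a made-up
on-disk layout. It is not actual Nutch code: the names ParseTextUpgradeTool,
LegacyParseText and UpgradedRecord are hypothetical, and the "legacy" format
(a single writeUTF string) is only a stand-in for whatever the 1.x Writables
really serialize. The point it shows is just that the old class can live as a
private nested class of the tool, so nothing outside the tool can see it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

/** Hypothetical upgrade tool; names and byte layout are illustrative only. */
public class ParseTextUpgradeTool {

    /** Private copy of a 1.x-style record, visible only inside this tool. */
    private static final class LegacyParseText {
        private String text = "";

        /** Reads the assumed legacy layout: one modified-UTF-8 string. */
        void readFields(DataInputStream in) throws IOException {
            text = in.readUTF();
        }
    }

    /** Stand-in for whatever record the 2.0 storage layer would accept. */
    public static final class UpgradedRecord {
        public final String url;
        public final String text;

        UpgradedRecord(String url, String text) {
            this.url = url;
            this.text = text;
        }
    }

    /** Decodes one legacy value and maps it onto the new record type. */
    public static UpgradedRecord convert(String url, byte[] legacyBytes)
            throws IOException {
        LegacyParseText legacy = new LegacyParseText();
        try (DataInputStream in =
                 new DataInputStream(new ByteArrayInputStream(legacyBytes))) {
            legacy.readFields(in);
        }
        return new UpgradedRecord(url, legacy.text);
    }

    /** Demo helper: produces bytes in the assumed legacy layout. */
    public static byte[] sampleLegacyBytes(String text) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(text);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        UpgradedRecord r =
            convert("http://example.org/", sampleLegacyBytes("hello"));
        System.out.println(r.url + " -> " + r.text);
    }
}
```

Only convert() and the new record type are public here, so the legacy class
never shows up in the main source tree's API. A real tool would have to carry
faithful copies of the 1.x readFields logic, which this sketch does not
attempt.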

Thoughts?

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: Chris.Mattmann@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++



Re: svn commit: r965815 - in /nutch/branches/nutchbase/src: java/org/apache/nutch/parse/ParseStatus.java java/org/apache/nutch/parse/ParseText.java test/org/apache/nutch/parse/TestParseText.java

Posted by Julien Nioche <li...@gmail.com>.
>
> > Now that you mention upgrade solutions from 1.x to 2.0 I suggest that we open
> > a JIRA to discuss this. IMHO we probably don't want to keep the 'old' code in
> > src/java when we merge but could have the code for the conversion utilities
> > and the Nutch 1.x jars in the contrib/ directory instead.
>
> I wouldn't favor a Nutch contrib going forward. Contribs lead to
> umbrella-projects which Apache is moving away from b/c it typically creates
> different committer lists (those who can commit to contrib and those with
> commit privs to the full source code base, etc.), different lifecycles and
> ultimately incubates/grows mini-projects within larger ones.
>

I meant putting the migration code and 1.x Nutch jars in the contrib
directory of the trunk - that shouldn't require a different committers list
or should it?


>
> If someone needs Nutch 1.x jars they can grab them from the Apache distros
> or we can publish them to Maven central. As for conversion and removal of
> src/java, I'm not sure I get that? Why should we remove src/java? Merge
> means "adapt existing" rather than "replace".
>

I was talking about removing deprecated Nutch objects (old Writables which
we needed for storing things in Hadoop MapFiles) from the src after the
merge once they are not used by Nutch2.0.

The point made by Dogacan was that they would be needed if we want to
provide conversion tools so that people could convert their old crawldbs and
segments into our shiny new Gora-based architecture.


> >
> >>
> >> Also, I realize that I am the last person to talk about this, but can we get
> >> some reviews for these changes?
> >
> > I could have filed a JIRA for the branch NutchBase indeed (but haven't).
> > Again, NutchBase is a transitional / test / development repository before we
> > merge things into trunk. Changes to the trunk are made properly i.e. through
> > JIRA with patches and peer review. Or maybe I should indeed open a JIRA for
> > NutchBase every time I do a bit of cleanup or port new plugins to the 2.0 API?
>
> Nah, IMHO I think it's OK to muck around in the branch, so long as when the
> branch gets merged (incrementally rather than wholesale), we can review
> those. So, the way it would work is this:
>
> A. branch cleaned up, SVN commits, etc., stable working
> B. at some point, branch ready to be merged (assumption: branch devel
> stops)
> C. define branch merge into 3-5 patches
> D. foreach patch in C:
>    create JIRA issue for patch
>    call for review of patch
>    if no objections, then commit in 24-48 hours
>
> E. trunk now ready for 2.0 development
> F. schedule current open issues for 2.0, grab any low hanging fruit (1-2
> days)
> G. all other issues pushed out to 2.1
> H. release 2.0
>
>
Andrzej and myself are in the process of porting the last missing tests in
NutchBase and debugging Gora along the way. There is just a handful of
plugins which have not been ported and I should have finished that pretty
quickly. Hopefully we'll get to (A) soonish and can then follow the plan
above.

However we still need to address the issue raised by Dogacan, i.e. shall we
provide tools to convert from 1.x structures to 2.0 and if so how shall we
organise it. Again - some things have been removed from NutchBase for the
sake of clarity but since they are in the trunk they are not lost and we can
decide what to do with them later.

J.


Re: svn commit: r965815 - in /nutch/branches/nutchbase/src: java/org/apache/nutch/parse/ParseStatus.java java/org/apache/nutch/parse/ParseText.java test/org/apache/nutch/parse/TestParseText.java

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hi Guys,

> Now that you mention upgrade solutions from 1.x to 2.0 I suggest that we open
> a JIRA to discuss this. IMHO we probably don't want to keep the 'old' code in
> src/java when we merge but could have the code for the conversion utilities
> and the Nutch 1.x jars in the contrib/ directory instead.

I wouldn't favor a Nutch contrib going forward. Contribs lead to
umbrella-projects which Apache is moving away from b/c it typically creates
different committer lists (those who can commit to contrib and those with
commit privs to the full source code base, etc.), different lifecycles and
ultimately incubates/grows mini-projects within larger ones.

If someone needs Nutch 1.x jars they can grab them from the Apache distros
or we can publish them to Maven central. As for conversion and removal of
src/java, I'm not sure I get that? Why should we remove src/java? Merge
means "adapt existing" rather than "replace".

>  
>> 
>> Also, I realize that I am the last person to talk about this, but can we get
>> some reviews for these changes?
> 
> I could have filed a JIRA for the branch NutchBase indeed (but haven't).
> Again, NutchBase is a transitional / test / development repository before we
> merge things into trunk. Changes to the trunk are made properly i.e. through
> JIRA with patches and peer review. Or maybe I should indeed open a JIRA for
> NutchBase every time I do a bit of cleanup or port new plugins to the 2.0 API?

Nah, IMHO I think it's OK to muck around in the branch, so long as when the
branch gets merged (incrementally rather than wholesale), we can review
those. So, the way it would work is this:

A. branch cleaned up, SVN commits, etc., stable working
B. at some point, branch ready to be merged (assumption: branch devel stops)
C. define branch merge into 3-5 patches
D. foreach patch in C:
    create JIRA issue for patch
    call for review of patch
    if no objections, then commit in 24-48 hours

E. trunk now ready for 2.0 development
F. schedule current open issues for 2.0, grab any low hanging fruit (1-2
days)
G. all other issues pushed out to 2.1
H. release 2.0

Cheers,
Chris




Re: svn commit: r965815 - in /nutch/branches/nutchbase/src: java/org/apache/nutch/parse/ParseStatus.java java/org/apache/nutch/parse/ParseText.java test/org/apache/nutch/parse/TestParseText.java

Posted by Julien Nioche <li...@gmail.com>.
I have made some changes to the nutchbase branch indeed. Mostly porting
missing plugins to the new API but also retrofitting recent modifications
from the trunk to nutchbase in order to facilitate the later merge from
nutchbase to trunk.

I also removed some old Nutch objects which were not actively used in
nutchbase in order to check that it had no influence on the code but also to
facilitate the review and merging of the NutchBase code.


> If we remove these, it will be impossible to upgrade Nutch 1.1
> data to 2.0. It is already difficult, but I think at least for standard
> crawls we can offer an upgrade path.


Indeed, the conversion from 1.x to 2.0 has not been discussed so far. Bear in
mind that we will merge the changes from NutchBase to the trunk - so these
objects are not lost and will be kept in the trunk if we decide that this is
what we should do. Doing some clearing in the NutchBase branch allowed us to
improve the code by finding references to old Writable classes that we won't
necessarily keep in the trunk.

Now that you mention upgrade solutions from 1.x to 2.0 I suggest that we
open a JIRA to discuss this. IMHO we probably don't want to keep the 'old'
code in src/java when we merge but could have the code for the conversion
utilities and the Nutch 1.x jars in the contrib/ directory instead.


>
> Also, I realize that I am the last person to talk about this, but can we
> get some reviews for these changes?
>

I could have filed a JIRA for the branch NutchBase indeed (but haven't).
Again, NutchBase is a transitional / test / development repository before we
merge things into trunk. Changes to the trunk are made properly i.e. through
JIRA with patches and peer review. Or maybe I should indeed open a JIRA for
NutchBase every time I do a bit of cleanup or port new plugins to the 2.0
API?

Do you want to open a JIRA for discussing how we will upgrade from 1.x to
2.0?

J.



>
> On Tue, Jul 20, 2010 at 14:47, <jn...@apache.org> wrote:
>
>> Author: jnioche
>> Date: Tue Jul 20 11:47:23 2010
>> New Revision: 965815
>>
>> URL: http://svn.apache.org/viewvc?rev=965815&view=rev
>> Log:
>> remove deprecated objects ParseText and ParseStatus
>>
>> Removed:
>>    nutch/branches/nutchbase/src/java/org/apache/nutch/parse/ParseStatus.java
>>    nutch/branches/nutchbase/src/java/org/apache/nutch/parse/ParseText.java
>>    nutch/branches/nutchbase/src/test/org/apache/nutch/parse/TestParseText.java
>>
>>
>
>
> --
> Doğacan Güney
>
>



Re: svn commit: r965815 - in /nutch/branches/nutchbase/src: java/org/apache/nutch/parse/ParseStatus.java java/org/apache/nutch/parse/ParseText.java test/org/apache/nutch/parse/TestParseText.java

Posted by Doğacan Güney <do...@gmail.com>.
If we remove these, it will be impossible to upgrade Nutch 1.1
data to 2.0. It is already difficult, but I think at least for standard
crawls we can offer an upgrade path.

Also, I realize that I am the last person to talk about this, but can we get
some reviews for these changes?

On Tue, Jul 20, 2010 at 14:47, <jn...@apache.org> wrote:

> Author: jnioche
> Date: Tue Jul 20 11:47:23 2010
> New Revision: 965815
>
> URL: http://svn.apache.org/viewvc?rev=965815&view=rev
> Log:
> remove deprecated objects ParseText and ParseStatus
>
> Removed:
>    nutch/branches/nutchbase/src/java/org/apache/nutch/parse/ParseStatus.java
>    nutch/branches/nutchbase/src/java/org/apache/nutch/parse/ParseText.java
>    nutch/branches/nutchbase/src/test/org/apache/nutch/parse/TestParseText.java
>
>


-- 
Doğacan Güney