Posted to dev@nutch.apache.org by Julien Nioche <li...@gmail.com> on 2011/01/05 11:28:48 UTC

Backport to 1.3 (was: Release planning)

I've finished porting the changes from 1.2 which were missing in 1.3 and
were not related to the Lucene indexing or search:

   - NUTCH-878 ScoringFilters should not override the injected score
   - NUTCH-901 Make index-more plug-in configurable (Markus Jelsma via
   mattmann)
   - NUTCH-905 Configurable file protocol parent directory crawling
   (Thorsten Scherler, mattmann, ab)
   - NUTCH-855 ScoringFilter and IndexingFilter: To allow for the
   propagation of URL Metatags and their subsequent indexing (Scott Gonyea via
   mattmann)
   - NUTCH-716 Make subcollection index field multivalued (Dmitry Lihachev
   via jnioche)

I've compared the changes from 2.0 with 1.3 and found the following
differences (excluding anything specific to 2.0/GORA):

   - * NUTCH-564 External parser supports encoding attribute (Antony
   Bowesman, mattmann)*
   -  NUTCH-714 Need a SFTP and SCP Protocol Handler (Sanjoy Ghosh,
   mattmann)
   - * NUTCH-825 Publish nutch artifacts to central maven repository
   (mattmann)*
   -  NUTCH-851 Port logging to slf4j (jnioche)
   -  NUTCH-861 Renamed HTMLParseFilter into ParseFilter
   - * NUTCH-872 Change the default fetcher.parse to FALSE (ab).*
   - * NUTCH-876 Remove remaining robots/IP blocking code in lib-http (ab)*
   -  NUTCH-880 REST API for Nutch (ab)
   - * NUTCH-883 Remove unused parameters from nutch-default.xml (jnioche)*
   - * NUTCH-884 FetcherJob should run more reduce tasks than default (ab)*
   - * NUTCH-886 A .gitignore file for Nutch (dogacan)*
   - * NUTCH-894 Move statistical language identification from indexing to
   parsing step*
   - * NUTCH-921 Reduce dependency of Nutch on config files (ab)*
   - * NUTCH-930 Remove remaining dependencies on Lucene API (ab)*
   -  NUTCH-931 Simple admin API to fetch status and stop the service (ab)
   -  NUTCH-932 Bulk REST API to retrieve crawl results as JSON (ab)

I've created a new issue at
https://issues.apache.org/jira/browse/NUTCH-951 to track this. I'd be
in favour of porting only the things that are not new
functionalities; I have put them in bold above.

Any thoughts on this?

Julien

On 4 January 2011 21:44, Julien Nioche <li...@gmail.com> wrote:

> +1 from me. I've committed today a bunch of patches which were in 1.2 but
> not in 1.3 (just one last one to do) but haven't compared with 2.0
>
> Having a release based on 1.3 would be great as it would be a nice
> transition towards 2.0 (delegate indexing/search, dependency management with
> Ivy, separation between local and remote deployment, removal of redundant
> plugins, etc.).
>
> Julien
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
>
>
> On 4 January 2011 20:27, Andrzej Bialecki <ab...@getopt.org> wrote:
>
>> Hi users & devs,
>>
>> As you probably know, there are currently two active lines of development
>> for Nutch:
>>
>> * Nutch trunk, a.k.a. Nutch 2.0: this is based on a completely redesigned
>> storage layer that uses Apache Gora, which in turn can use various storage
>> implementations such as HBase, Cassandra, and MySQL. This branch is still
>> largely experimental and unstable, but work is progressing, and at the
>> current pace I think a release should be possible within the next ~6 months.
>> Another important addition on this branch is a REST API that allows using
>> Nutch as a black-box crawling service.
>>
>> * Nutch branch-1.3: this started as a snapshot of Nutch trunk just before
>> merging with nutchbase (i.e. switching to Gora as a storage layer). This
>> branch is still largely similar to the previous versions of Nutch, and uses
>> Hadoop MapFile/SequenceFile and "segments". As compared with release 1.2 it
>> does NOT ship with any search infrastructure, because all search
>> functionality has been delegated to Solr (via SolrIndexer). This is BTW also
>> true about Nutch trunk.
>>
>> Regarding branch-1.2 (which is a maintenance branch after release 1.2)
>> there have been pretty much no updates there, if any. Nutch committer resources
>> are very limited (when it comes to active committers), so I don't expect any
>> maintenance release from this branch to happen...
>>
>> I think that considering the relatively remote release date for Nutch 2.0
>> it would make sense to roll out a 1.3 release based on branch-1.3, after
>> making sure that all critical patches from trunk have been merged in there.
>>
>> What do you think?
>>
>> --
>> Best regards,
>> Andrzej Bialecki     <><
>>  ___. ___ ___ ___ _ _   __________________________________
>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>> http://www.sigram.com  Contact: info at sigram dot com
>>
>>
>
>
>


-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com