Posted to user@nutch.apache.org by Lewis John Mcgibbney <le...@gmail.com> on 2015/10/01 02:35:09 UTC

Re: [VOTE] Release Apache Nutch 2.3.1

Hi Folks,
Is anyone else able to test and run the release candidate for 2.3.1?
It would be great to get a release out if we can gather the VOTEs and the RC is
suitable.
Thanks in advance.
Best
Lewis

On Wed, Sep 23, 2015 at 9:46 PM, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> Hi Folks,
> It turns out the formatting for the original email below was terrible.
> Sorry about that.
> I've hopefully corrected formatting now. Please VOTE away!
>
> On Tue, Sep 22, 2015 at 6:45 PM, Lewis John Mcgibbney <
> lewis.mcgibbney@gmail.com> wrote:
>
>> Hi user@ & dev@,
>>
>> This thread is a VOTE for releasing Apache Nutch 2.3.1 RC#1.
>>
>> We addressed 32 issues in all, which can be seen in the release report
>> http://s.apache.org/nutch_2.3.1
>>
>> The release candidate comprises the following components.
>>
>> * A staging repository [0] containing various Maven artifacts
>> * A branch-2.3.1 of the 2.x code [1]
>> * The tagged source upon which we are VOTE'ing [2]
>> * Finally, the release artifacts [3], which I would encourage you to
>> test after verifying their signatures.
>>
>> You should use the following KEYS [4] file to verify the signatures of
>> all release artifacts.
>>
>> Please VOTE as follows
>>
>> [ ] +1 Push the release, I am happy :)
>> [ ] +/-0 I am not bothered either way
>> [ ] -1 I am not happy with this release candidate (please state why)
>>
>> Firstly, thank you to everyone who contributed to Nutch. Secondly, thank
>> you to everyone who VOTEs. It is appreciated.
>>
>> Thanks
>> Lewis
>> (on behalf of Nutch PMC)
>>
>> p.s. Here's my +1
>>
>> [0]
>> https://repository.apache.org/content/repositories/orgapachenutch-1005
>> [1] https://svn.apache.org/repos/asf/nutch/branches/branch-2.3.1
>> [2] https://svn.apache.org/repos/asf/nutch/tags/release-2.3.1
>> [3] https://dist.apache.org/repos/dist/dev/nutch/2.3.1
>> [4] http://www.apache.org/dist/nutch/KEYS
>>
>> --
>> *Lewis*
>>
>
>
>
> --
> *Lewis*
>



-- 
*Lewis*
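
The verification step requested in the VOTE email above (checking the release artifacts [3] against the KEYS file [4]) might look roughly like the sketch below. It assumes that `gpg` is installed, that the KEYS file and the artifact have already been downloaded into the working directory, and that the detached signature sits next to the artifact with an `.asc` suffix. The artifact file name is a placeholder and does not come from this thread.

```python
import subprocess
from typing import List


def verification_commands(keys_file: str, artifact: str) -> List[List[str]]:
    """Build the gpg commands: import the KEYS file, then verify the signature."""
    # Assumption: the detached signature is named "<artifact>.asc".
    signature = artifact + ".asc"
    return [
        ["gpg", "--import", keys_file],
        ["gpg", "--verify", signature, artifact],
    ]


def verify(keys_file: str, artifact: str) -> bool:
    """Run the commands in order; stop at the first one that fails."""
    for cmd in verification_commands(keys_file, artifact):
        if subprocess.run(cmd).returncode != 0:
            return False
    return True


if __name__ == "__main__":
    # Placeholder file names, not taken from the thread.
    ok = verify("KEYS", "apache-nutch-2.3.1-src.tar.gz")
    print("signature OK" if ok else "signature check FAILED")
```

A successful check only shows that the artifact was signed by a key in the KEYS file. Testing the release candidate itself, as the email asks, is a separate step.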

Re: [VOTE] Release Apache Nutch 2.3.1

Posted by "Drulea, Sherban" <sd...@rand.org>.
Thanks Sebastian. I’m running on OS X 10.9.5 btw.


On 10/5/15, 11:53 AM, "Sebastian Nagel" <wa...@googlemail.com> wrote:

>Hi Sherban,
>
>thanks for the detailed description and the attached log.
>I'll have a look at it and hope to be able to reproduce the
>problem.
>
>Sebastian
>
>
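
One detail in the crawl output quoted in this thread may be worth checking. The generate step is invoked with `-batchId 1443749246-29495`, the generator then reports `generated batch id: 1443749246-1282586680 containing 5 URLs`, and the fetcher, started with `1443749246-29495`, reports `QueueFeeder finished: total 0 records`. Whether this mismatch is what causes the empty fetch is an assumption, not something confirmed in the thread. A minimal sketch for spotting such a mismatch in saved, unquoted console output follows; the function name and the shortened sample text are illustrative.

```python
import re

# Batch id passed on the command line of the generate step.
REQUESTED = re.compile(r"-batchId\s+(\S+)")
# Batch id that GeneratorJob reports it actually generated.
GENERATED = re.compile(r"generated batch id:\s+(\S+)\s+containing\s+(\d+)\s+URLs")


def find_batch_mismatches(console_output: str):
    """Return (requested, generated, url_count) for each generate step whose ids differ."""
    # Collapse whitespace so that wrapped lines are still matched.
    text = " ".join(console_output.split())
    events = sorted(
        [(m.start(), "req", m) for m in REQUESTED.finditer(text)]
        + [(m.start(), "gen", m) for m in GENERATED.finditer(text)],
        key=lambda event: event[0],
    )
    mismatches = []
    requested = None
    for _, kind, match in events:
        if kind == "req":
            requested = match.group(1)
        elif requested is not None:
            generated, count = match.group(1), int(match.group(2))
            if generated != requested:
                mismatches.append((requested, generated, count))
            requested = None
    return mismatches


# Shortened excerpt of the console output from the thread.
sample = """
bin/nutch generate -topN 50000 -noNorm -noFilter -adddays 0
-crawlId methods -batchId 1443749246-29495
GeneratorJob: generated batch id: 1443749246-1282586680 containing 5 URLs
bin/nutch generate -topN 50000 -crawlId methods -batchId 1443749274-17203
GeneratorJob: generated batch id: 1443749275-2050785747 containing 0 URLs
"""

for requested, generated, count in find_batch_mismatches(sample):
    print(f"requested {requested}, generated {generated} ({count} URLs)")
```

If the ids differ in this way, one thing to try would be running the fetch step by hand with the batch id that the generator actually printed and seeing whether the QueueFeeder then picks up records.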


Re: [VOTE] Release Apache Nutch 2.3.1

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Sherban,

thanks for the detailed description and the attached log.
I'll have a look at it and hope to be able to reproduce the
problem.

Sebastian
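
On the question in the quoted report below of whether the `plugin.includes` value is valid: the value is a regular expression over plugin directory names, so it can be tried against a list of names outside Nutch. The sketch below assumes that the expression has to match a directory name in full. The list of candidate directories is hypothetical and would need to be replaced with the actual contents of the local plugins directory.

```python
import re

# Value from the thread, joined back into a single line.
PLUGIN_INCLUDES = (
    "protocol-(http|httpclient)|urlfilter-regex|parse-(html|tika)|"
    "index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|"
    "scoring-opic|indexer-solr"
)


def included_plugins(pattern: str, plugin_dirs):
    """Return the directory names that the pattern matches in full."""
    regex = re.compile(pattern)
    return [name for name in plugin_dirs if regex.fullmatch(name)]


# Hypothetical directory listing, for illustration only.
candidates = [
    "protocol-http",
    "protocol-httpclient",
    "protocol-file",
    "parse-html",
    "index-basic",
    "index-more",
    "indexer-solr",
    "scoring-opic",
]

for name in included_plugins(PLUGIN_INCLUDES, candidates):
    print("included:", name)
```

The expression itself is well-formed as a regular expression. If the line breaks visible in the email were also present inside the `<value>` element of the actual file, the pattern might no longer match as intended; the thread does not say whether that is the case.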

On 10/05/2015 07:53 PM, Drulea, Sherban wrote:
> Hi Sebastian,
> 
> I tried multiple URLs in my seed.txt file. None of them result in the
> nutch generator crawling any links.
> 
> Here’s my environment:
> java version "1.8.0_60"
> Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
> Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
> SOLR 4.6.0
> Mongo version 3.0.2.
> Nutch 2.3.1
> 
> ―――――――――――――――
> 
> regex-urlfilter.txt:
> ―――――――――――――――
> +.
> 
> ―――――――――――――――
> nutch-site.xml
> ―――――――――――――――
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
> 
> <!-- Put site-specific property overrides in this file. -->
> 
> <configuration>
> 
>     <property>
>         <name>http.agent.name</name>
>         <value>nutch Mongo Solr Crawler</value>
>     </property>
> 
>     <property>
>         <name>storage.data.store.class</name>
>         <value>org.apache.gora.mongodb.store.MongoStore</value>
>         <description>Default class for storing data</description>
>     </property>
>     
>     <property>
>         <name>plugin.includes</name>
>         
>         <value>protocol-(http|httpclient)|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|indexer-solr</value>
>         <description>Regular expression naming plugin directory names to
> include. </description>
>    </property>
>     
> </configuration>
> 
> 
> ―――――――――――――――
> gora.properties:
> ―――――――――――――――
> ############################
> # MongoDBStore properties  #
> ############################
> gora.datastore.default=org.apache.gora.mongodb.store.MongoStore
> gora.mongodb.override_hadoop_configuration=false
> gora.mongodb.mapping.file=/gora-mongodb-mapping.xml
> gora.mongodb.servers=localhost:27017
> gora.mongodb.db=method_centers
> 
> ―――――――――――――――
> seed.txt
> ―――――――――――――――
> http://punklawyer.com
> http://mail-archives.apache.org/mod_mbox/nutch-user/
> http://hbase.apache.org/index.html
> http://wiki.apache.org/nutch/FrontPage
> http://www.aintitcool.com/
> ―――――――――――――――
> 
> Here are the results of the crawl command "./bin/crawl urls methods
> http://127.0.0.1:8983/solr/ 2":
> Injecting seed URLs
> /Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch inject urls
> -crawlId methods
> InjectorJob: starting at 2015-10-01 18:27:23
> InjectorJob: Injecting urlDir: urls
> InjectorJob: Using class org.apache.gora.mongodb.store.MongoStore as the
> Gora storage class.
> InjectorJob: total number of urls rejected by filters: 0
> InjectorJob: total number of urls injected after normalization and
> filtering: 5
> Injector: finished at 2015-10-01 18:27:26, elapsed: 00:00:02
> Thu Oct 1 18:27:26 PDT 2015 : Iteration 1 of 2
> Generating batchId
> Generating a new fetchlist
> /Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch generate -D
> mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
> mapred.reduce.tasks.speculative.execution=false -D
> mapred.map.tasks.speculative.execution=false -D
> mapred.compress.map.output=true -topN 50000 -noNorm -noFilter -adddays 0
> -crawlId methods -batchId 1443749246-29495
> GeneratorJob: starting at 2015-10-01 18:27:26
> GeneratorJob: Selecting best-scoring urls due for fetch.
> GeneratorJob: starting
> GeneratorJob: filtering: false
> GeneratorJob: normalizing: false
> GeneratorJob: topN: 50000
> GeneratorJob: finished at 2015-10-01 18:27:29, time elapsed: 00:00:02
> GeneratorJob: generated batch id: 1443749246-1282586680 containing 5 URLs
> Fetching : 
> /Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch fetch -D
> mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
> mapred.reduce.tasks.speculative.execution=false -D
> mapred.map.tasks.speculative.execution=false -D
> mapred.compress.map.output=true -D fetcher.timelimit.mins=180
> 1443749246-29495 -crawlId methods -threads 50
> FetcherJob: starting at 2015-10-01 18:27:29
> FetcherJob: batchId: 1443749246-29495
> FetcherJob: threads: 50
> FetcherJob: parsing: false
> FetcherJob: resuming: false
> FetcherJob : timelimit set for : 1443760049865
> Using queue mode : byHost
> Fetcher: threads: 50
> QueueFeeder finished: total 0 records. Hit by time limit :0
> -finishing thread FetcherThread0, activeThreads=0
> ...
> -finishing thread FetcherThread49, activeThreads=0
> Fetcher: throughput threshold: -1
> Fetcher: throughput threshold sequence: 5
> 0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs
> in 0 queues
> -activeThreads=0
> Using queue mode : byHost
> Fetcher: threads: 50
> QueueFeeder finished: total 0 records. Hit by time limit :0
> -finishing thread FetcherThread0, activeThreads=0
> ...
> 
> -finishing thread FetcherThread48, activeThreads=0
> Fetcher: throughput threshold: -1
> Fetcher: throughput threshold sequence: 5
> -finishing thread FetcherThread49, activeThreads=0
> 0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs
> in 0 queues
> -activeThreads=0
> FetcherJob: finished at 2015-10-01 18:27:42, time elapsed: 00:00:12
> Parsing : 
> /Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch parse -D
> mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
> mapred.reduce.tasks.speculative.execution=false -D
> mapred.map.tasks.speculative.execution=false -D
> mapred.compress.map.output=true -D
> mapred.skip.attempts.to.start.skipping=2 -D
> mapred.skip.map.max.skip.records=1 1443749246-29495 -crawlId methods
> ParserJob: starting at 2015-10-01 18:27:43
> ParserJob: resuming:  false
> ParserJob: forced reparse:  false
> ParserJob: batchId: 1443749246-29495
> ParserJob: success
> ParserJob: finished at 2015-10-01 18:27:45, time elapsed: 00:00:02
> CrawlDB update for methods
> 
> /Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch updatedb -D
> mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
> mapred.reduce.tasks.speculative.execution=false -D
> mapred.map.tasks.speculative.execution=false -D
> mapred.compress.map.output=true 1443749246-29495 -crawlId methods
> DbUpdaterJob: starting at 2015-10-01 18:27:46
> DbUpdaterJob: batchId: 1443749246-29495
> DbUpdaterJob: finished at 2015-10-01 18:27:48, time elapsed: 00:00:02
> Indexing methods on SOLR index -> http://127.0.0.1:8983/solr/
> /Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch index -D
> mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
> mapred.reduce.tasks.speculative.execution=false -D
> mapred.map.tasks.speculative.execution=false -D
> mapred.compress.map.output=true -D
> solr.server.url=http://127.0.0.1:8983/solr/ -all -crawlId methods
> IndexingJob: starting
> Active IndexWriters :
> SOLRIndexWriter
> solr.server.url : URL of the SOLR instance (mandatory)
> solr.commit.size : buffer size when sending to SOLR (default 1000)
> solr.mapping.file : name of the mapping file for fields (default
> solrindex-mapping.xml)
> solr.auth : use authentication (default false)
> solr.auth.username : username for authentication
> solr.auth.password : password for authentication
> 
> 
> IndexingJob: done.
> SOLR dedup -> http://127.0.0.1:8983/solr/
> /Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch solrdedup -D
> mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
> mapred.reduce.tasks.speculative.execution=false -D
> mapred.map.tasks.speculative.execution=false -D
> mapred.compress.map.output=true http://127.0.0.1:8983/solr/
> Thu Oct 1 18:27:54 PDT 2015 : Iteration 2 of 2
> Generating batchId
> Generating a new fetchlist
> /Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch generate -D
> mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
> mapred.reduce.tasks.speculative.execution=false -D
> mapred.map.tasks.speculative.execution=false -D
> mapred.compress.map.output=true -topN 50000 -noNorm -noFilter -adddays 0
> -crawlId methods -batchId 1443749274-17203
> GeneratorJob: starting at 2015-10-01 18:27:55
> GeneratorJob: Selecting best-scoring urls due for fetch.
> GeneratorJob: starting
> GeneratorJob: filtering: false
> GeneratorJob: normalizing: false
> GeneratorJob: topN: 50000
> GeneratorJob: finished at 2015-10-01 18:27:57, time elapsed: 00:00:02
> GeneratorJob: generated batch id: 1443749275-2050785747 containing 0 URLs
> Generate returned 1 (no new segments created)
> Escaping loop: no more URLs to fetch now
> 
> There are no errors, but also no data. What else can I debug?
> 
> I see some warnings in my hadoop.log, but nothing glaring ...
> 
> 2015-10-01 18:19:29,430 WARN  util.NativeCodeLoader - Unable to load
> native-hadoop library for your platform... using builtin-java classes
> where applicable
> 2015-10-01 18:19:29,441 INFO  crawl.FetchScheduleFactory - Using
> FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
> 2015-10-01 18:19:29,441 INFO  crawl.AbstractFetchSchedule -
> defaultInterval=2592000
> 2015-10-01 18:19:29,442 INFO  crawl.AbstractFetchSchedule -
> maxInterval=7776000
> 2015-10-01 18:19:30,326 WARN  conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/staging/sdrulea1900181322/.staging/job_local1900181322_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
> 2015-10-01 18:19:30,327 WARN  conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/staging/sdrulea1900181322/.staging/job_local1900181322_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.
> 2015-10-01 18:19:30,405 WARN  conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/local/localRunner/sdrulea/job_local1900181322_0001/job_local1900181322_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
> 2015-10-01 18:19:30,406 WARN  conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/local/localRunner/sdrulea/job_local1900181322_0001/job_local1900181322_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.
> ….
> 2015-10-01 18:27:23,838 WARN  util.NativeCodeLoader - Unable to load
> native-hadoop library for your platform... using builtin-java classes
> where applicable
> 2015-10-01 18:27:24,567 INFO  crawl.InjectorJob - InjectorJob: Using class
> org.apache.gora.mongodb.store.MongoStore as the Gora storage class.
> 2015-10-01 18:27:24,969 WARN  conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/staging/sdrulea1182157052/.staging/job_local1182157052_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
> 2015-10-01 18:27:24,971 WARN  conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/staging/sdrulea1182157052/.staging/job_local1182157052_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.
> 2015-10-01 18:27:25,050 WARN  conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/local/localRunner/sdrulea/job_local1182157052_0001/job_local1182157052_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
> 2015-10-01 18:27:25,052 WARN  conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/local/localRunner/sdrulea/job_local1182157052_0001/job_local1182157052_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.
> 
> 2015-10-01 18:27:30,288 INFO  httpclient.Http - http.proxy.host = null
> 2015-10-01 18:27:30,288 INFO  httpclient.Http - http.proxy.port = 8080
> 2015-10-01 18:27:30,288 INFO  httpclient.Http - http.timeout = 10000
> 2015-10-01 18:27:30,288 INFO  httpclient.Http - http.content.limit = 65536
> 2015-10-01 18:27:30,288 INFO  httpclient.Http - http.agent = nutch Mongo
> Solr Crawler/Nutch-2.4-SNAPSHOT
> 2015-10-01 18:27:30,288 INFO  httpclient.Http - http.accept.language =
> en-us,en-gb,en;q=0.7,*;q=0.3
> 2015-10-01 18:27:30,288 INFO  httpclient.Http - http.accept =
> text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> 2015-10-01 18:27:30,292 INFO  httpclient.Http - http.proxy.host = null
> 2015-10-01 18:27:30,292 INFO  httpclient.Http - http.proxy.port = 8080
> 2015-10-01 18:27:30,292 INFO  httpclient.Http - http.timeout = 10000
> 2015-10-01 18:27:30,292 INFO  httpclient.Http - http.content.limit = 65536
> 2015-10-01 18:27:30,292 INFO  httpclient.Http - http.agent = nutch Mongo Solr Crawler/Nutch-2.4-SNAPSHOT
> 2015-10-01 18:27:30,292 INFO  httpclient.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
> 2015-10-01 18:27:30,292 INFO  httpclient.Http - http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> 
> I’ve been trying this for 3 days with no luck. I want to use nutch but may
> be forced to use other program.
> 
> My best guess is maybe something is borked with my plugin.includes:
> 
> <property>
>         <name>plugin.includes</name>
>         
>         <value>protocol-(http|httpclient)|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|indexer-solr</value>
>         <description>Regular expression naming plugin directory names to include.</description>
>    </property>
> 
> Are these valid? Is there a more minimal set to try?
> 
> Cheers,
> Sherban
> 
> 
> 
> 
> On 10/4/15, 12:23 PM, "Sebastian Nagel" <wa...@googlemail.com> wrote:
> 
>> Hi Sherban,
>>
>>> Right now it finds 0 URLs with no errors.
>>
>> Can you specify what's going wrong. It could
>> be everything, even a configuration problem.
>> What did you crawl? Using which storage back-end?
>>
>> Thanks,
>> Sebastian
>>
>>
>> On 10/02/2015 03:02 AM, Drulea, Sherban wrote:
>>> Hi Lewis,
>>>
>>> -1 until I verify nutch actually crawls. Right now it finds 0 URLs with
>>> no errors.
>>>
>>> 2.3.1 is an improvement over 2.3.0 which didn't work with Mongo at all.
>>>
>>> Cheers,
>>> Sherban
>>>
>>>
>>>
>>> On 9/30/15, 5:35 PM, "Lewis John Mcgibbney" <le...@gmail.com>
>>> wrote:
>>>
>>>> Hi Folks,
>>>> Is anyone else able to test and run the release candidate for 2.3.1?
>>>> It would be great to get a release if we can get the VOTE's and the RC
>>>> is suitable.
>>>> Thanks in advance.
>>>> Best
>>>> Lewis
>>>>
>>>> On Wed, Sep 23, 2015 at 9:46 PM, Lewis John Mcgibbney <
>>>> lewis.mcgibbney@gmail.com> wrote:
>>>>
>>>>> Hi Folks,
>>>>> It turns out the formatting for the original email below was terrible.
>>>>> Sorry about that.
>>>>> I've hopefully corrected formatting now. Please VOTE away!
>>>>>
>>>>> On Tue, Sep 22, 2015 at 6:45 PM, Lewis John Mcgibbney <
>>>>> lewis.mcgibbney@gmail.com> wrote:
>>>>>
>>>>>> Hi user@ & dev@,
>>>>>>
>>>>>> This thread is a VOTE for releasing Apache Nutch 2.3.1 RC#1.
>>>>>>
>>>>>> We addressed 32 issues in all which can been see at the release
>>>>>> report
>>>>>> http://s.apache.org/nutch_2.3.1
>>>>>>
>>>>>> The release candidate comprises the following components.
>>>>>>
>>>>>> * A staging repository [0] containing various Maven artifacts
>>>>>> * A branch-2.3.1 of the 2.x code [1]
>>>>>> * The tagged source upon which we are VOTE'ing [2]
>>>>>> * Finally, the release artifacts [3] which i would encourage you to
>>>>>> verify for signatures and test.
>>>>>>
>>>>>> You should use the following KEYS [4] file to verify the signatures
>>>>>> of
>>>>>> all release artifacts.
>>>>>>
>>>>>> Please VOTE as follows
>>>>>>
>>>>>> [ ] +1 Push the release, I am happy :)
>>>>>> [ ] +/-0 I am not bothered either way
>>>>>> [ ] -1 I am not happy with this release candidate (please state why)
>>>>>>
>>>>>> Firstly thank you to everyone that contributed to Nutch. Secondly,
>>>>>> thank
>>>>>> you to everyone that VOTE's. It is appreciated.
>>>>>>
>>>>>> Thanks
>>>>>> Lewis
>>>>>> (on behalf of Nutch PMC)
>>>>>>
>>>>>> p.s. Here's my +1
>>>>>>
>>>>>> [0]
>>>>>>
>>>>>> https://repository.apache.org/content/repositories/orgapachenutch-1005
>>>>>> [1] https://svn.apache.org/repos/asf/nutch/branches/branch-2.3.1
>>>>>> [2] https://svn.apache.org/repos/asf/nutch/tags/release-2.3.1
>>>>>> [3] https://dist.apache.org/repos/dist/dev/nutch/2.3.1
>>>>>> [4] http://www.apache.org/dist/nutch/KEYS
>>>>>>
>>>>>> --
>>>>>> *Lewis*
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> *Lewis*
>>>>>
>>>>
>>>>
>>>>
>>>> -- 
>>>> *Lewis*
>>>
>>>
>>>
>>>
>>
> 


Re: [VOTE] Release Apache Nutch 2.3.1

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Lewis, hi Sherban,

I have to turn my vote into a

-1

The crawl (if run from bin/crawl) isn't working because the
generator ignores the batch id passed via the -batchId option.
See https://issues.apache.org/jira/browse/NUTCH-2143.

Thanks, Sherban, for being insistent!

The logs you sent point to the same problem:
> Generating a new fetchlist
> .../bin/nutch generate ... -batchId 1443749246-29495
> ...
> GeneratorJob: generated batch id: 1443749246-1282586680 containing 5 URLs
> Fetching :
> .../bin/nutch fetch ... 1443749246-29495 ...
> ...
> FetcherJob: batchId: 1443749246-29495

If you use the batch id logged by the Generator (1443749246-1282586680)
for the "fetch", "parse", and "updatedb" steps, the crawl
should move forward.  Of course, this is not an option for a released 2.3.1!
We have to fix this bug. :)
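
Until the fix lands, the workaround described above could be scripted
roughly as follows. This is only a sketch, not part of bin/crawl: the
helper name extract_batch_id is made up for illustration, and it assumes
the Generator keeps printing the "generated batch id: ... containing N
URLs" line in the format seen in the logs above.

```shell
#!/bin/sh
# Hypothetical helper: read GeneratorJob output on stdin and print the
# batch id that the Generator actually used. It matches lines such as
#   GeneratorJob: generated batch id: <id> containing <n> URLs
# and keeps the last match if there are several.
extract_batch_id() {
  sed -n 's/^.*GeneratorJob: generated batch id: \([^ ]*\) containing .*$/\1/p' | tail -n 1
}

# Demo on the line taken from the log quoted above.
sample='GeneratorJob: generated batch id: 1443749246-1282586680 containing 5 URLs'
batch_id=$(printf '%s\n' "$sample" | extract_batch_id)
echo "$batch_id"   # prints 1443749246-1282586680

# Assumed usage against a real install (not run here); the options are
# the ones that appear in the crawl output above:
#   batch_id=$(bin/nutch generate -topN 50000 -crawlId methods 2>&1 | extract_batch_id)
#   bin/nutch fetch    "$batch_id" -crawlId methods -threads 50
#   bin/nutch parse    "$batch_id" -crawlId methods
#   bin/nutch updatedb "$batch_id" -crawlId methods
```

The point is only that the id passed to the later steps comes from what
the Generator reports, not from the id bin/crawl generated up front.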

Thanks,
Sebastian


On 10/05/2015 07:53 PM, Drulea, Sherban wrote:
> ...


Re: [VOTE] Release Apache Nutch 2.3.1

Posted by "Drulea, Sherban" <sd...@rand.org>.
Hi Sebastian,

I tried multiple URLs in my seed.txt file. None of them result in the
nutch generator crawling any links.

Here’s my environment:
java version "1.8.0_60"
Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
SOLR 4.6.0
Mongo version 3.0.2.
Nutch 2.3.1

―――――――――――――――
regex-urlfilter.txt:
―――――――――――――――
+.

―――――――――――――――
nutch-site.xml
―――――――――――――――
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

    <property>
        <name>http.agent.name</name>
        <value>nutch Mongo Solr Crawler</value>
    </property>

    <property>
        <name>storage.data.store.class</name>
        <value>org.apache.gora.mongodb.store.MongoStore</value>
        <description>Default class for storing data</description>
    </property>
    
    <property>
        <name>plugin.includes</name>
        
        <value>protocol-(http|httpclient)|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|indexer-solr</value>
        <description>Regular expression naming plugin directory names to include.</description>
   </property>
    
</configuration>


―――――――――――――――
gora.properties:
―――――――――――――――
############################
# MongoDBStore properties  #
############################
gora.datastore.default=org.apache.gora.mongodb.store.MongoStore
gora.mongodb.override_hadoop_configuration=false
gora.mongodb.mapping.file=/gora-mongodb-mapping.xml
gora.mongodb.servers=localhost:27017
gora.mongodb.db=method_centers

―――――――――――――――
seed.txt
―――――――――――――――
http://punklawyer.com
http://mail-archives.apache.org/mod_mbox/nutch-user/
http://hbase.apache.org/index.html
http://wiki.apache.org/nutch/FrontPage
http://www.aintitcool.com/
―――――――――――――――

Here are the results of the crawl command
"./bin/crawl urls methods http://127.0.0.1:8983/solr/ 2":
Injecting seed URLs
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch inject urls
-crawlId methods
InjectorJob: starting at 2015-10-01 18:27:23
InjectorJob: Injecting urlDir: urls
InjectorJob: Using class org.apache.gora.mongodb.store.MongoStore as the
Gora storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and
filtering: 5
Injector: finished at 2015-10-01 18:27:26, elapsed: 00:00:02
Thu Oct 1 18:27:26 PDT 2015 : Iteration 1 of 2
Generating batchId
Generating a new fetchlist
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch generate -D
mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
mapred.reduce.tasks.speculative.execution=false -D
mapred.map.tasks.speculative.execution=false -D
mapred.compress.map.output=true -topN 50000 -noNorm -noFilter -adddays 0
-crawlId methods -batchId 1443749246-29495
GeneratorJob: starting at 2015-10-01 18:27:26
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: false
GeneratorJob: normalizing: false
GeneratorJob: topN: 50000
GeneratorJob: finished at 2015-10-01 18:27:29, time elapsed: 00:00:02
GeneratorJob: generated batch id: 1443749246-1282586680 containing 5 URLs
Fetching : 
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch fetch -D
mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
mapred.reduce.tasks.speculative.execution=false -D
mapred.map.tasks.speculative.execution=false -D
mapred.compress.map.output=true -D fetcher.timelimit.mins=180
1443749246-29495 -crawlId methods -threads 50
FetcherJob: starting at 2015-10-01 18:27:29
FetcherJob: batchId: 1443749246-29495
FetcherJob: threads: 50
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : 1443760049865
Using queue mode : byHost
Fetcher: threads: 50
QueueFeeder finished: total 0 records. Hit by time limit :0
-finishing thread FetcherThread0, activeThreads=0
...
-finishing thread FetcherThread49, activeThreads=0
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs
in 0 queues
-activeThreads=0
Using queue mode : byHost
Fetcher: threads: 50
QueueFeeder finished: total 0 records. Hit by time limit :0
-finishing thread FetcherThread0, activeThreads=0
...

-finishing thread FetcherThread48, activeThreads=0
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
-finishing thread FetcherThread49, activeThreads=0
0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs
in 0 queues
-activeThreads=0
FetcherJob: finished at 2015-10-01 18:27:42, time elapsed: 00:00:12
Parsing : 
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch parse -D
mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
mapred.reduce.tasks.speculative.execution=false -D
mapred.map.tasks.speculative.execution=false -D
mapred.compress.map.output=true -D
mapred.skip.attempts.to.start.skipping=2 -D
mapred.skip.map.max.skip.records=1 1443749246-29495 -crawlId methods
ParserJob: starting at 2015-10-01 18:27:43
ParserJob: resuming:  false
ParserJob: forced reparse:  false
ParserJob: batchId: 1443749246-29495
ParserJob: success
ParserJob: finished at 2015-10-01 18:27:45, time elapsed: 00:00:02
CrawlDB update for methods

/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch updatedb -D
mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
mapred.reduce.tasks.speculative.execution=false -D
mapred.map.tasks.speculative.execution=false -D
mapred.compress.map.output=true 1443749246-29495 -crawlId methods
DbUpdaterJob: starting at 2015-10-01 18:27:46
DbUpdaterJob: batchId: 1443749246-29495
DbUpdaterJob: finished at 2015-10-01 18:27:48, time elapsed: 00:00:02
Indexing methods on SOLR index -> http://127.0.0.1:8983/solr/
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch index -D
mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
mapred.reduce.tasks.speculative.execution=false -D
mapred.map.tasks.speculative.execution=false -D
mapred.compress.map.output=true -D
solr.server.url=http://127.0.0.1:8983/solr/ -all -crawlId methods
IndexingJob: starting
Active IndexWriters :
SOLRIndexWriter
solr.server.url : URL of the SOLR instance (mandatory)
solr.commit.size : buffer size when sending to SOLR (default 1000)
solr.mapping.file : name of the mapping file for fields (default
solrindex-mapping.xml)
solr.auth : use authentication (default false)
solr.auth.username : username for authentication
solr.auth.password : password for authentication


IndexingJob: done.
SOLR dedup -> http://127.0.0.1:8983/solr/
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch solrdedup -D
mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
mapred.reduce.tasks.speculative.execution=false -D
mapred.map.tasks.speculative.execution=false -D
mapred.compress.map.output=true http://127.0.0.1:8983/solr/
Thu Oct 1 18:27:54 PDT 2015 : Iteration 2 of 2
Generating batchId
Generating a new fetchlist
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch generate -D
mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
mapred.reduce.tasks.speculative.execution=false -D
mapred.map.tasks.speculative.execution=false -D
mapred.compress.map.output=true -topN 50000 -noNorm -noFilter -adddays 0
-crawlId methods -batchId 1443749274-17203
GeneratorJob: starting at 2015-10-01 18:27:55
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: false
GeneratorJob: normalizing: false
GeneratorJob: topN: 50000
GeneratorJob: finished at 2015-10-01 18:27:57, time elapsed: 00:00:02
GeneratorJob: generated batch id: 1443749275-2050785747 containing 0 URLs
Generate returned 1 (no new segments created)
Escaping loop: no more URLs to fetch now

There are no errors but also no data. What else can I debug?

I see some warnings in my hadoop.log but nothing glaring:

2015-10-01 18:19:29,430 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2015-10-01 18:19:29,441 INFO  crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2015-10-01 18:19:29,441 INFO  crawl.AbstractFetchSchedule - defaultInterval=2592000
2015-10-01 18:19:29,442 INFO  crawl.AbstractFetchSchedule - maxInterval=7776000
2015-10-01 18:19:30,326 WARN  conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/staging/sdrulea1900181322/.staging/job_local1900181322_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
2015-10-01 18:19:30,327 WARN  conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/staging/sdrulea1900181322/.staging/job_local1900181322_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.
2015-10-01 18:19:30,405 WARN  conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/local/localRunner/sdrulea/job_local1900181322_0001/job_local1900181322_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
2015-10-01 18:19:30,406 WARN  conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/local/localRunner/sdrulea/job_local1900181322_0001/job_local1900181322_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.
….
2015-10-01 18:27:23,838 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2015-10-01 18:27:24,567 INFO  crawl.InjectorJob - InjectorJob: Using class org.apache.gora.mongodb.store.MongoStore as the Gora storage class.
2015-10-01 18:27:24,969 WARN  conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/staging/sdrulea1182157052/.staging/job_local1182157052_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
2015-10-01 18:27:24,971 WARN  conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/staging/sdrulea1182157052/.staging/job_local1182157052_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.
2015-10-01 18:27:25,050 WARN  conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/local/localRunner/sdrulea/job_local1182157052_0001/job_local1182157052_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
2015-10-01 18:27:25,052 WARN  conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/local/localRunner/sdrulea/job_local1182157052_0001/job_local1182157052_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.

2015-10-01 18:27:30,288 INFO  httpclient.Http - http.proxy.host = null
2015-10-01 18:27:30,288 INFO  httpclient.Http - http.proxy.port = 8080
2015-10-01 18:27:30,288 INFO  httpclient.Http - http.timeout = 10000
2015-10-01 18:27:30,288 INFO  httpclient.Http - http.content.limit = 65536
2015-10-01 18:27:30,288 INFO  httpclient.Http - http.agent = nutch Mongo Solr Crawler/Nutch-2.4-SNAPSHOT
2015-10-01 18:27:30,288 INFO  httpclient.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
2015-10-01 18:27:30,288 INFO  httpclient.Http - http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
2015-10-01 18:27:30,292 INFO  httpclient.Http - http.proxy.host = null
2015-10-01 18:27:30,292 INFO  httpclient.Http - http.proxy.port = 8080
2015-10-01 18:27:30,292 INFO  httpclient.Http - http.timeout = 10000
2015-10-01 18:27:30,292 INFO  httpclient.Http - http.content.limit = 65536
2015-10-01 18:27:30,292 INFO  httpclient.Http - http.agent = nutch Mongo Solr Crawler/Nutch-2.4-SNAPSHOT
2015-10-01 18:27:30,292 INFO  httpclient.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
2015-10-01 18:27:30,292 INFO  httpclient.Http - http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8

I’ve been trying this for 3 days with no luck. I want to use Nutch but may
be forced to use another program.

My best guess is maybe something is borked with my plugin.includes:

<property>
        <name>plugin.includes</name>
        <value>protocol-(http|httpclient)|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|indexer-solr</value>
        <description>Regular expression naming plugin directory names to include.</description>
</property>

Are these valid? Is there a more minimal set to try?
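
One way to sanity-check a plugin.includes value offline is to test it against a list of plugin directory names. The sketch below is only an illustration in Python, not Nutch code. It assumes the value is used as a regular expression that must match a whole plugin directory name, and the candidate names are a hypothetical sample rather than a listing of a real plugins directory.

```python
import re

# The plugin.includes value from the property above, joined onto one line.
PLUGIN_INCLUDES = (
    "protocol-(http|httpclient)|urlfilter-regex|parse-(html|tika)|"
    "index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|"
    "scoring-opic|indexer-solr"
)


def selected_plugins(pattern, plugin_dirs):
    """Return the directory names that the pattern matches in full."""
    regex = re.compile(pattern)
    return [name for name in plugin_dirs if regex.fullmatch(name)]


# Hypothetical sample of plugin directory names, for illustration only.
candidates = [
    "protocol-http", "protocol-httpclient", "protocol-file",
    "parse-html", "parse-tika", "indexer-solr", "indexer-elastic",
    "urlfilter-regex", "scoring-opic",
]

print(selected_plugins(PLUGIN_INCLUDES, candidates))
```

In this sketch, a stray line break inside the pattern (for example after "index-(") stops index-basic from matching, so it may be worth confirming that the value sits on a single line in the actual configuration file.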

Cheers,
Sherban




On 10/4/15, 12:23 PM, "Sebastian Nagel" <wa...@googlemail.com> wrote:

>Hi Sherban,
>
>> Right now it finds 0 URLs with no errors.
>
>Can you specify what's going wrong? It could
>be anything, even a configuration problem.
>What did you crawl? Using which storage back-end?
>
>Thanks,
>Sebastian
>
>
>On 10/02/2015 03:02 AM, Drulea, Sherban wrote:
>> Hi Lewis,
>> 
>> -1 until I verify nutch actually crawls. Right now it finds 0 URLs with no
>> errors.
>> 
>> 2.3.1 is an improvement over 2.3.0 which didn't work with Mongo at all.
>> 
>> Cheers,
>> Sherban
>> 
>> 
>> 
>> On 9/30/15, 5:35 PM, "Lewis John Mcgibbney" <le...@gmail.com>
>> wrote:
>> 
>>> Hi Folks,
>>> Is anyone else able to test and run the release candidate for 2.3.1?
>>> It would be great to get a release if we can get the VOTE's and the RC is
>>> suitable.
>>> Thanks in advance.
>>> Best
>>> Lewis
>>>
>>> On Wed, Sep 23, 2015 at 9:46 PM, Lewis John Mcgibbney <
>>> lewis.mcgibbney@gmail.com> wrote:
>>>
>>>> Hi Folks,
>>>> It turns out the formatting for the original email below was terrible.
>>>> Sorry about that.
>>>> I've hopefully corrected formatting now. Please VOTE away!
>>>>
>>>> On Tue, Sep 22, 2015 at 6:45 PM, Lewis John Mcgibbney <
>>>> lewis.mcgibbney@gmail.com> wrote:
>>>>
>>>>> Hi user@ & dev@,
>>>>>
>>>>> This thread is a VOTE for releasing Apache Nutch 2.3.1 RC#1.
>>>>>
>>>>> We addressed 32 issues in all, which can be seen in the release report
>>>>> http://s.apache.org/nutch_2.3.1
>>>>>
>>>>> The release candidate comprises the following components.
>>>>>
>>>>> * A staging repository [0] containing various Maven artifacts
>>>>> * A branch-2.3.1 of the 2.x code [1]
>>>>> * The tagged source upon which we are VOTE'ing [2]
>>>>> * Finally, the release artifacts [3] which i would encourage you to
>>>>> verify for signatures and test.
>>>>>
>>>>> You should use the following KEYS [4] file to verify the signatures of
>>>>> all release artifacts.
>>>>>
>>>>> Please VOTE as follows
>>>>>
>>>>> [ ] +1 Push the release, I am happy :)
>>>>> [ ] +/-0 I am not bothered either way
>>>>> [ ] -1 I am not happy with this release candidate (please state why)
>>>>>
>>>>> Firstly thank you to everyone that contributed to Nutch. Secondly, thank
>>>>> you to everyone that VOTE's. It is appreciated.
>>>>>
>>>>> Thanks
>>>>> Lewis
>>>>> (on behalf of Nutch PMC)
>>>>>
>>>>> p.s. Here's my +1
>>>>>
>>>>> [0]
>>>>> https://repository.apache.org/content/repositories/orgapachenutch-1005
>>>>> [1] https://svn.apache.org/repos/asf/nutch/branches/branch-2.3.1
>>>>> [2] https://svn.apache.org/repos/asf/nutch/tags/release-2.3.1
>>>>> [3] https://dist.apache.org/repos/dist/dev/nutch/2.3.1
>>>>> [4] http://www.apache.org/dist/nutch/KEYS
>>>>>
>>>>> --
>>>>> *Lewis*
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> *Lewis*
>>>>
>>>
>>>
>>>
>>> -- 
>>> *Lewis*
>> 
>> 
>> 
>> __________________________________________________________________________
>> 
>> This email message is for the sole use of the intended recipient(s) and
>> may contain confidential information. Any unauthorized review, use,
>> disclosure or distribution is prohibited. If you are not the intended
>> recipient, please contact the sender by reply email and destroy all copies
>> of the original message.
>> 
>


Re: [VOTE] Release Apache Nutch 2.3.1

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Sherban,

> Right now it finds 0 URLs with no errors.

Can you specify what's going wrong? It could
be anything, even a configuration problem.
What did you crawl? Using which storage back-end?

Thanks,
Sebastian


On 10/02/2015 03:02 AM, Drulea, Sherban wrote:
> Hi Lewis,
> 
> -1 until I verify nutch actually crawls. Right now it finds 0 URLs with no
> errors.
> 
> 2.3.1 is an improvement over 2.3.0 which didn't work with Mongo at all.
> 
> Cheers,
> Sherban
> 
> 
> 
> On 9/30/15, 5:35 PM, "Lewis John Mcgibbney" <le...@gmail.com>
> wrote:
> 
>> Hi Folks,
>> Is anyone else able to test and run the release candidate for 2.3.1?
>> It would be great to get a release if we can get the VOTE's and the RC is
>> suitable.
>> Thanks in advance.
>> Best
>> Lewis
>>
>> On Wed, Sep 23, 2015 at 9:46 PM, Lewis John Mcgibbney <
>> lewis.mcgibbney@gmail.com> wrote:
>>
>>> Hi Folks,
>>> It turns out the formatting for the original email below was terrible.
>>> Sorry about that.
>>> I've hopefully corrected formatting now. Please VOTE away!
>>>
>>> On Tue, Sep 22, 2015 at 6:45 PM, Lewis John Mcgibbney <
>>> lewis.mcgibbney@gmail.com> wrote:
>>>
>>>> Hi user@ & dev@,
>>>>
>>>> This thread is a VOTE for releasing Apache Nutch 2.3.1 RC#1.
>>>>
>>>> We addressed 32 issues in all, which can be seen in the release report
>>>> http://s.apache.org/nutch_2.3.1
>>>>
>>>> The release candidate comprises the following components.
>>>>
>>>> * A staging repository [0] containing various Maven artifacts
>>>> * A branch-2.3.1 of the 2.x code [1]
>>>> * The tagged source upon which we are VOTE'ing [2]
>>>> * Finally, the release artifacts [3] which i would encourage you to
>>>> verify for signatures and test.
>>>>
>>>> You should use the following KEYS [4] file to verify the signatures of
>>>> all release artifacts.
>>>>
>>>> Please VOTE as follows
>>>>
>>>> [ ] +1 Push the release, I am happy :)
>>>> [ ] +/-0 I am not bothered either way
>>>> [ ] -1 I am not happy with this release candidate (please state why)
>>>>
>>>> Firstly thank you to everyone that contributed to Nutch. Secondly, thank
>>>> you to everyone that VOTE's. It is appreciated.
>>>>
>>>> Thanks
>>>> Lewis
>>>> (on behalf of Nutch PMC)
>>>>
>>>> p.s. Here's my +1
>>>>
>>>> [0]
>>>> https://repository.apache.org/content/repositories/orgapachenutch-1005
>>>> [1] https://svn.apache.org/repos/asf/nutch/branches/branch-2.3.1
>>>> [2] https://svn.apache.org/repos/asf/nutch/tags/release-2.3.1
>>>> [3] https://dist.apache.org/repos/dist/dev/nutch/2.3.1
>>>> [4] http://www.apache.org/dist/nutch/KEYS
>>>>
>>>> --
>>>> *Lewis*
>>>>
>>>
>>>
>>>
>>> --
>>> *Lewis*
>>>
>>
>>
>>
>> -- 
>> *Lewis*
> 
> 
> __________________________________________________________________________
> 
> This email message is for the sole use of the intended recipient(s) and
> may contain confidential information. Any unauthorized review, use,
> disclosure or distribution is prohibited. If you are not the intended
> recipient, please contact the sender by reply email and destroy all copies
> of the original message.
> 


Re: [VOTE] Release Apache Nutch 2.3.1

Posted by "Drulea, Sherban" <sd...@rand.org>.
Hi Lewis,

-1 until I verify nutch actually crawls. Right now it finds 0 URLs with no
errors.

2.3.1 is an improvement over 2.3.0 which didn't work with Mongo at all.

Cheers,
Sherban



On 9/30/15, 5:35 PM, "Lewis John Mcgibbney" <le...@gmail.com>
wrote:

>Hi Folks,
>Is anyone else able to test and run the release candidate for 2.3.1?
>It would be great to get a release if we can get the VOTE's and the RC is
>suitable.
>Thanks in advance.
>Best
>Lewis
>
>On Wed, Sep 23, 2015 at 9:46 PM, Lewis John Mcgibbney <
>lewis.mcgibbney@gmail.com> wrote:
>
>> Hi Folks,
>> It turns out the formatting for the original email below was terrible.
>> Sorry about that.
>> I've hopefully corrected formatting now. Please VOTE away!
>>
>> On Tue, Sep 22, 2015 at 6:45 PM, Lewis John Mcgibbney <
>> lewis.mcgibbney@gmail.com> wrote:
>>
>>> Hi user@ & dev@,
>>>
>>> This thread is a VOTE for releasing Apache Nutch 2.3.1 RC#1.
>>>
>>> We addressed 32 issues in all, which can be seen in the release report
>>> http://s.apache.org/nutch_2.3.1
>>>
>>> The release candidate comprises the following components.
>>>
>>> * A staging repository [0] containing various Maven artifacts
>>> * A branch-2.3.1 of the 2.x code [1]
>>> * The tagged source upon which we are VOTE'ing [2]
>>> * Finally, the release artifacts [3] which i would encourage you to
>>> verify for signatures and test.
>>>
>>> You should use the following KEYS [4] file to verify the signatures of
>>> all release artifacts.
>>>
>>> Please VOTE as follows
>>>
>>> [ ] +1 Push the release, I am happy :)
>>> [ ] +/-0 I am not bothered either way
>>> [ ] -1 I am not happy with this release candidate (please state why)
>>>
>>> Firstly thank you to everyone that contributed to Nutch. Secondly, thank
>>> you to everyone that VOTE's. It is appreciated.
>>>
>>> Thanks
>>> Lewis
>>> (on behalf of Nutch PMC)
>>>
>>> p.s. Here's my +1
>>>
>>> [0]
>>> https://repository.apache.org/content/repositories/orgapachenutch-1005
>>> [1] https://svn.apache.org/repos/asf/nutch/branches/branch-2.3.1
>>> [2] https://svn.apache.org/repos/asf/nutch/tags/release-2.3.1
>>> [3] https://dist.apache.org/repos/dist/dev/nutch/2.3.1
>>> [4] http://www.apache.org/dist/nutch/KEYS
>>>
>>> --
>>> *Lewis*
>>>
>>
>>
>>
>> --
>> *Lewis*
>>
>
>
>
>-- 
>*Lewis*


__________________________________________________________________________

This email message is for the sole use of the intended recipient(s) and
may contain confidential information. Any unauthorized review, use,
disclosure or distribution is prohibited. If you are not the intended
recipient, please contact the sender by reply email and destroy all copies
of the original message.


Re: [VOTE] Release Apache Nutch 2.3.1

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
+1 from me:

[chipotle:~/tmp/nutch-2.3.1-rc1] mattmann% $HOME/bin/stage_apache_rc apache-nutch 2.3.1 https://dist.apache.org/repos/dist/dev/nutch/2.3.1
[chipotle:~/tmp/nutch-2.3.1-rc1] mattmann% $HOME/bin/stage_apache_rc apache-nutch 2.3.1-src https://dist.apache.org/repos/dist/dev/nutch/2.3.1
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 3988k  100 3988k    0     0  1804k      0  0:00:02  0:00:02 --:--:-- 1805k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   819  100   819    0     0   2547      0 --:--:-- --:--:-- --:--:-- 2551
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    80  100    80    0     0    278      0 --:--:-- --:--:-- --:--:-- 279
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 6257k  100 6257k    0     0  1629k      0  0:00:03  0:00:03 --:--:-- 1630k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   819  100   819    0     0   3074      0 --:--:-- --:--:-- --:--:-- 3078
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    77  100    77    0     0    272      0 --:--:-- --:--:-- --:--:-- 272
[chipotle:~/tmp/nutch-2.3.1-rc1] mattmann% $HOME/bin/verify_gpg_sigs
Verifying Signature for file apache-nutch-2.3.1-src.tar.gz.asc
gpg: Signature made Tue Sep 22 18:38:37 2015 PDT using RSA key ID 48BAEBF6
gpg: Good signature from "Lewis John McGibbney (CODE SIGNING KEY) <le...@apache.org>"
gpg: WARNING: This key is not certified with a trusted signature!
gpg:          There is no indication that the signature belongs to the owner.
Primary key fingerprint: DB7B 5199 121C 08A5 C8F4  052B 3A47 17F0 48BA EBF6
Verifying Signature for file apache-nutch-2.3.1-src.zip.asc
gpg: Signature made Tue Sep 22 18:38:18 2015 PDT using RSA key ID 48BAEBF6
gpg: Good signature from "Lewis John McGibbney (CODE SIGNING KEY) <le...@apache.org>"
gpg: WARNING: This key is not certified with a trusted signature!
gpg:          There is no indication that the signature belongs to the owner.
Primary key fingerprint: DB7B 5199 121C 08A5 C8F4  052B 3A47 17F0 48BA EBF6
[chipotle:~/tmp/nutch-2.3.1-rc1] mattmann% $HOME/bin/verify_md5_checksums
md5sum: stat '*.bz2': No such file or directory
md5sum: stat '*.tgz': No such file or directory
apache-nutch-2.3.1-src.tar.gz: OK
apache-nutch-2.3.1-src.zip: OK
[chipotle:~/tmp/nutch-2.3.1-rc1] mattmann%
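
The helper scripts used above (stage_apache_rc, verify_gpg_sigs, verify_md5_checksums) are not shown in this thread. As a rough idea of what the checksum step involves, here is a minimal Python sketch. The function names, and the assumption that the published digest is already available as a hex string, are illustrative only and are not taken from those scripts.

```python
import hashlib


def md5_hex(path, chunk_size=1 << 20):
    """Compute the MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_ok(path, expected_hex):
    """Compare a file's MD5 digest with an expected hex string.

    Surrounding whitespace and letter case in expected_hex are ignored.
    """
    return md5_hex(path) == expected_hex.strip().lower()
```

A caller would pass the path of a downloaded artifact and the digest published alongside it. A matching digest only shows the download is intact; the GPG signature check is what ties the artifact to the release manager's key.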




++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++





-----Original Message-----
From: Lewis John Mcgibbney <le...@gmail.com>
Reply-To: "dev@nutch.apache.org" <de...@nutch.apache.org>
Date: Wednesday, September 30, 2015 at 5:35 PM
To: "user@nutch.apache.org" <us...@nutch.apache.org>,
"dev@nutch.apache.org" <de...@nutch.apache.org>
Subject: Re: [VOTE] Release Apache Nutch 2.3.1

>Hi Folks,
>
>Is anyone else able to test and run the release candidate for 2.3.1?
>
>It would be great to get a release if we can get the VOTE's and the RC is
>suitable.
>
>Thanks in advance.
>
>Best
>
>Lewis
>
>
>On Wed, Sep 23, 2015 at 9:46 PM, Lewis John Mcgibbney
><le...@gmail.com> wrote:
>
>Hi Folks,
>
>It turns out the formatting for the original email below was terrible.
>Sorry about that.
>
>I've hopefully corrected formatting now. Please VOTE away!
>
>On Tue, Sep 22, 2015 at 6:45 PM, Lewis John Mcgibbney
><le...@gmail.com> wrote:
>
>Hi user@ & dev@,
>
>This thread is a VOTE for releasing Apache Nutch 2.3.1 RC#1.
>
>We addressed 32 issues in all, which can be seen in the release report
>http://s.apache.org/nutch_2.3.1
>
>The release candidate comprises the following components.
> 
>* A staging repository [0] containing various Maven artifacts
>* A branch-2.3.1 of the 2.x code [1]
>* The tagged source upon which we are VOTE'ing [2]
>* Finally, the release artifacts [3] which i would encourage you to
>verify for signatures and test.
>
>You should use the following KEYS [4] file to verify the signatures of
>all release artifacts.
>
>Please VOTE as follows
>
>[ ] +1 Push the release, I am happy :)
>[ ] +/-0 I am not bothered either way
>[ ] -1 I am not happy with this release candidate (please state why)
>
>Firstly thank you to everyone that contributed to Nutch. Secondly, thank
>you to everyone that VOTE's. It is appreciated.
>
>Thanks
>Lewis
>(on behalf of Nutch PMC)
>
>p.s. Here's my +1
> 
>[0] https://repository.apache.org/content/repositories/orgapachenutch-1005
>[1] https://svn.apache.org/repos/asf/nutch/branches/branch-2.3.1
>[2] https://svn.apache.org/repos/asf/nutch/tags/release-2.3.1
>[3] https://dist.apache.org/repos/dist/dev/nutch/2.3.1
>[4] http://www.apache.org/dist/nutch/KEYS
>
>-- 
>Lewis
>
>
>
>
>
>
>
>
>-- 
>Lewis
>
>
>
>
>
>
>
>
>
>
>
>
>-- 
>Lewis
>
>


Re: [VOTE] Release Apache Nutch 2.3.1

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
I’ll download and VOTE on the release right now, Lewis.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++





-----Original Message-----
From: Lewis John Mcgibbney <le...@gmail.com>
Reply-To: "dev@nutch.apache.org" <de...@nutch.apache.org>
Date: Wednesday, September 30, 2015 at 5:35 PM
To: "user@nutch.apache.org" <us...@nutch.apache.org>,
"dev@nutch.apache.org" <de...@nutch.apache.org>
Subject: Re: [VOTE] Release Apache Nutch 2.3.1

>Hi Folks,
>
>Is anyone else able to test and run the release candidate for 2.3.1?
>
>It would be great to get a release if we can get the VOTE's and the RC is
>suitable.
>
>Thanks in advance.
>
>Best
>
>Lewis
>
>
>On Wed, Sep 23, 2015 at 9:46 PM, Lewis John Mcgibbney
><le...@gmail.com> wrote:
>
>Hi Folks,
>
>It turns out the formatting for the original email below was terrible.
>Sorry about that.
>
>I've hopefully corrected formatting now. Please VOTE away!
>
>On Tue, Sep 22, 2015 at 6:45 PM, Lewis John Mcgibbney
><le...@gmail.com> wrote:
>
>Hi user@ & dev@,
>
>This thread is a VOTE for releasing Apache Nutch 2.3.1 RC#1.
>
>We addressed 32 issues in all, which can be seen in the release report
>http://s.apache.org/nutch_2.3.1
>
>The release candidate comprises the following components.
> 
>* A staging repository [0] containing various Maven artifacts
>* A branch-2.3.1 of the 2.x code [1]
>* The tagged source upon which we are VOTE'ing [2]
>* Finally, the release artifacts [3] which i would encourage you to
>verify for signatures and test.
>
>You should use the following KEYS [4] file to verify the signatures of
>all release artifacts.
>
>Please VOTE as follows
>
>[ ] +1 Push the release, I am happy :)
>[ ] +/-0 I am not bothered either way
>[ ] -1 I am not happy with this release candidate (please state why)
>
>Firstly thank you to everyone that contributed to Nutch. Secondly, thank
>you to everyone that VOTE's. It is appreciated.
>
>Thanks
>Lewis
>(on behalf of Nutch PMC)
>
>p.s. Here's my +1
> 
>[0] https://repository.apache.org/content/repositories/orgapachenutch-1005
>[1] https://svn.apache.org/repos/asf/nutch/branches/branch-2.3.1
>[2] https://svn.apache.org/repos/asf/nutch/tags/release-2.3.1
>[3] https://dist.apache.org/repos/dist/dev/nutch/2.3.1
>[4] http://www.apache.org/dist/nutch/KEYS
>
>-- 
>Lewis
>
>
>
>
>
>
>
>
>-- 
>Lewis
>
>
>
>
>
>
>
>
>
>
>
>
>-- 
>Lewis
>
>

