Posted to user@nutch.apache.org by David Philip <da...@gmail.com> on 2015/01/01 10:58:00 UTC
Re: No Crawl data in Solr

Hi,

   Can you check whether you have added any filters to
conf/regex-urlfilter.txt? To get the Nutch site crawled and indexed, the
file should contain either +. or the following rule:

+^http://([a-z0-9]*\.)*nutch.apache.org/

Also, please double-check that you have added seed.txt to your urls folder.
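
As a rough illustration of how these filter rules behave, here is a minimal
Python sketch. It is not Nutch code: the function name url_accepted is
hypothetical, and it assumes the behaviour described in the comments of
regex-urlfilter.txt (the first matching rule decides, and a URL that matches
no rule is ignored). Python's re module stands in for Java regular
expressions, which should be close enough for patterns this simple.

```python
import re


def url_accepted(url, rules):
    """Return True if the first rule whose regex matches `url` starts with '+'.

    Hypothetical helper: mimics first-match-wins filtering, where a URL
    that matches no rule at all is rejected.
    """
    for rule in rules:
        rule = rule.strip()
        if not rule or rule.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = rule[0], rule[1:]
        if sign not in "+-":
            continue  # not a filter rule
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched


# The two alternatives suggested above.
site_only = [r"+^http://([a-z0-9]*\.)*nutch.apache.org/"]
accept_all = ["+."]

for candidate in ("http://nutch.apache.org/", "http://example.com/"):
    print(candidate, url_accepted(candidate, site_only),
          url_accepted(candidate, accept_all))
```

In this sketch the site-only rule needs a slash after the host name, so a
URL written without the trailing slash would not match it unless something
adds the slash first.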

The best way to get Nutch up and running is to follow each step in the
wiki tutorial [1] diligently, along with a few other sites that I followed
for reference [2].

[1] http://wiki.apache.org/nutch/NutchTutorial
[2] https://sites.google.com/site/profilerajanimaski/technical/apache-solr/webcrawlers/apache-nutch




On Tue, Dec 30, 2014 at 10:39 PM, Mark Otero <ma...@gmail.com> wrote:

> Hi all -
>
> I'm stuck.  I'm a Nutch and Solr newbie.
>
> I'm trying to crawl "http://nutch.apache.org" and store the crawl results
> on Solr.  I'm uncertain if the crawl worked because I don't see the crawl
> results in Solr.  I figured crawling the apache.org site would be a safe
> test.
>
> -------------  Here's my console ---------------
>
> MARKs-Mac-Pro:local mark$ bin/crawl urls/ TestCrawl/ http://localhost:8983/solr/ 2
>
> InjectorJob: starting at 2014-12-30 09:01:57
>
> InjectorJob: Injecting urlDir: urls
>
> InjectorJob: Using class org.apache.gora.memory.store.MemStore as the Gora
> storage class.
>
> InjectorJob: total number of urls rejected by filters: 0
>
> InjectorJob: total number of urls injected after normalization and
> filtering: 1
>
> Injector: finished at 2014-12-30 09:01:58, elapsed: 00:00:01
>
> Tue Dec 30 09:01:58 PST 2014 : Iteration 1 of 2
>
> Generating batchId
>
> Generating a new fetchlist
>
> GeneratorJob: starting at 2014-12-30 09:01:59
>
> GeneratorJob: Selecting best-scoring urls due for fetch.
>
> GeneratorJob: starting
>
> GeneratorJob: filtering: false
>
> GeneratorJob: normalizing: false
>
> GeneratorJob: topN: 50000
>
> GeneratorJob: finished at 2014-12-30 09:02:00, time elapsed: 00:00:01
>
> GeneratorJob: generated batch id: 1419958918-5854
>
> Fetching :
>
> FetcherJob: starting
>
> FetcherJob: batchId: 1419958918-5854
>
> Fetcher: Your 'http.agent.name' value should be listed first in
> 'http.robots.agents' property.
>
> FetcherJob: threads: 50
>
> FetcherJob: parsing: false
>
> FetcherJob: resuming: false
>
> FetcherJob : timelimit set for : 1419969721241
>
> Using queue mode : byHost
>
> Fetcher: threads: 50
>
> QueueFeeder finished: total 0 records. Hit by time limit :0
>
> .... ( I removed the "-finishing thread FetcherThread0, activeThreads=0"
> messages for brevity)
>
> Fetcher: throughput threshold: -1
>
> -finishing thread FetcherThread49, activeThreads=0
>
> Fetcher: throughput threshold sequence: 5
>
> 0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs
> in 0 queues
>
> -activeThreads=0
>
> FetcherJob: done
>
> Parsing :
>
> ParserJob: starting
>
> ParserJob: resuming: false
>
> ParserJob: forced reparse: false
>
> ParserJob: batchId: 1419958918-5854
>
> ParserJob: success
>
> CrawlDB update for TestCrawl/
>
> DbUpdaterJob: starting
>
> DbUpdaterJob: done
>
> Indexing TestCrawl/ on SOLR index -> http://localhost:8983/solr/
>
> SolrIndexerJob: starting
>
> SolrIndexerJob: done.
>
> SOLR dedup -> http://localhost:8983/solr/
>
>
> Mark
>