Posted to user@nutch.apache.org by Gavin <27...@qq.com> on 2014/02/12 09:33:34 UTC

Nutch 2.2.1 can not index to solr

I compiled Nutch in Eclipse. My storage backend is HBase.
After I run bin/crawl, there are two tables in HBase: "webpage" and "%crawl_ID%webpage",
but there is no data in Solr and no exception.
Why?

(I can crawl and index to the Solr server using nutch1.7.bin, so I think my Solr server is OK.)

Re: Nutch 2.2.1 can not index to solr

Posted by Gavin <27...@qq.com>.
My Solr is 4.6.1.
I followed the steps in https://github.com/renepickhardt/metalcon/wiki/simpleNutchSolrSetup and https://wiki.apache.org/nutch/RunNutchInEclipse.
Maybe there is something wrong in Solr's schema.xml.
Could someone please share a working schema.xml? Thanks!
Also, where is the Solr log file? I can't find any exception in the Solr console.

thanks a lot!




Re: Nutch 2.2.1 can not index to solr

Posted by Gavin <27...@qq.com>.
I turned on the debug logs and the output is:

[root@Gavin local]#  bin/nutch solrindex http://127.0.0.1:8983/solr -all
SolrIndexerJob: starting
Skipping http://www.tianya.cn/; different batch id (null)
Skipping http://www.163.com/; different batch id (null)
Skipping http://www.taobao.com/; different batch id (null)
Skipping http://nutch.apache.org/; different batch id (null)
SolrIndexerJob: done.

The index job skipped every page!

What should I do to make it work?

Thank you!
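
One possible reading of the debug output above, offered as an assumption rather than a confirmed diagnosis: in Nutch 2.x the index job appears to compare each page's updatedb mark with the requested batch, and the runs shown in this thread go from parse straight to solrindex without a `bin/nutch updatedb` step, which would leave that mark unset (null). The sketch below only prints the step sequence with `updatedb` added so it can be reviewed before anything is run. The helper name `nutch_steps` is hypothetical, and whether `updatedb` takes no arguments in 2.2.1 is an assumption to check against the usage output of `bin/nutch`.

```shell
#!/bin/sh
# Dry-run sketch: prints the crawl steps in order, one per line.
# Assumption: an updatedb step between parse and solrindex sets the mark
# that the indexer checks. nutch_steps is a hypothetical helper name.
nutch_steps() {
  solr_url="$1"
  # printf reuses the format string for each remaining argument.
  printf 'bin/nutch %s\n' \
    "inject urls" \
    "generate -topN 5" \
    "fetch -all" \
    "parse -all" \
    "updatedb" \
    "solrindex $solr_url -all"
}

# Review the sequence first; run the steps by hand once it looks right.
nutch_steps "http://127.0.0.1:8983/solr"
```

If the debug log still reports a null batch id after an update step, the assumption above does not hold for this setup and the cause lies elsewhere.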





Re: Nutch 2.2.1 can not index to solr

Posted by d_k <ma...@gmail.com>.
Are you sure Solr is not throwing any errors?
Did you make any changes to the schema? What schema does Solr use? What
version of Solr are you using?
You can turn on the debug logs by changing the logging level to DEBUG in
the log4j.properties file inside the conf dir in the
runtime/local dir. (I assume this is your setup; let me know if it's not.)
You can also try to debug Nutch in Eclipse as described here:
https://wiki.apache.org/nutch/RunNutchInEclipse
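
The log-level change described above could look roughly like this. It is a sketch under assumptions: the property key to change (for example `log4j.rootLogger`) and its current value of INFO are what I would expect in conf/log4j.properties, but the lines in your copy may differ, and `enable_debug` is a hypothetical helper name. The function rewrites INFO to DEBUG only on the line that starts with the given key and leaves everything else alone.

```shell
#!/bin/sh
# enable_debug FILE KEY: switch one log4j property from INFO to DEBUG.
# Writes to a temp file and moves it over the original, which avoids the
# non-portable `sed -i`. enable_debug is a hypothetical helper name.
enable_debug() {
  file="$1"
  key="$2"
  tmp="$file.tmp.$$"
  # Only the line beginning with "<key>=" is rewritten.
  sed "/^$key=/s/=INFO/=DEBUG/" "$file" > "$tmp" && mv "$tmp" "$file"
}

# Example, run from runtime/local (the key name is an assumption):
# enable_debug conf/log4j.properties log4j.rootLogger
```

Keeping a backup copy of log4j.properties before editing makes it easy to go back to the quieter INFO level afterwards.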



Re: Nutch 2.2.1 can not index to solr

Posted by Gavin <27...@qq.com>.
And here are my Solr statistics:

Statistics

Last Modified:
Num Docs: 0
Max Doc: 0
Heap Memory Usage: 0
Deleted Docs: 0
Version: 1
Segment Count: 0
Optimized:
Current:

What is wrong?

Thanks for your help!!!
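
To confirm the zero document count without relying on the admin page, one option is to query Solr directly. This is a sketch, assuming a Solr 4.x instance whose default core answers at `/select` under the base URL used elsewhere in this thread. `num_found` is a hypothetical helper name, and the `sed` expression is a deliberately crude way of reading one number out of the JSON response, not a real JSON parser.

```shell
#!/bin/sh
# num_found: read a Solr JSON response on stdin and print its numFound value.
# Hypothetical helper; crude text match rather than real JSON parsing.
num_found() {
  sed -n 's/.*"numFound":[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Example against a live server (URL and default core are assumptions):
# curl -s 'http://127.0.0.1:8983/solr/select?q=*:*&rows=0&wt=json' | num_found
```

Running the check before and after the solrindex step shows whether any documents reached Solr at all.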






Re: Nutch 2.2.1 can not index to solr

Posted by Gavin <27...@qq.com>.
Here is my output:


[Gavin@Gavin local]$ bin/nutch  inject urls
InjectorJob: starting at 2014-02-12 17:16:20
InjectorJob: Injecting urlDir: urls
InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 1
Injector: finished at 2014-02-12 17:16:25, elapsed: 00:00:04
[Gavin@Gavin local]$ bin/nutch generate -topN 5
GeneratorJob: starting at 2014-02-12 17:16:46
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: true
GeneratorJob: normalizing: true
GeneratorJob: topN: 5
GeneratorJob: finished at 2014-02-12 17:16:51, time elapsed: 00:00:05
GeneratorJob: generated batch id: 1392196606-229189632
[Gavin@Gavin local]$ bin/nutch fetch -all
FetcherJob: starting
FetcherJob: fetching all
FetcherJob: threads: 10
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : -1
Using queue mode : byHost
Fetcher: threads: 10
QueueFeeder finished: total 5 records. Hit by time limit :0
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
fetching http://www.163.com/ (queue crawl delay=5000ms)
fetching http://nutch.apache.org/ (queue crawl delay=5000ms)
fetching http://www.tianya.cn/ (queue crawl delay=5000ms)
fetching http://www.taobao.com/ (queue crawl delay=5000ms)
-finishing thread FetcherThread5, activeThreads=8
-finishing thread FetcherThread6, activeThreads=8
-finishing thread FetcherThread4, activeThreads=7
-finishing thread FetcherThread3, activeThreads=6
-finishing thread FetcherThread2, activeThreads=5
fetching http://www.hao123.com/ (queue crawl delay=5000ms)
-finishing thread FetcherThread0, activeThreads=4
-finishing thread FetcherThread7, activeThreads=3
-finishing thread FetcherThread1, activeThreads=2
-finishing thread FetcherThread8, activeThreads=1
-finishing thread FetcherThread9, activeThreads=0
0/0 spinwaiting/active, 4 pages, 0 errors, 0.8 1 pages/s, 242 242 kb/s, 0 URLs in 0 queues
-activeThreads=0
FetcherJob: done
[Gavin@Gavin local]$ bin/nutch parse -all
ParserJob: starting
ParserJob: resuming:    false
ParserJob: forced reparse:    false
ParserJob: parsing all
Parsing http://www.tianya.cn/
Parsing http://www.163.com/
Parsing http://www.hao123.com/
Parsing http://www.taobao.com/
Parsing http://nutch.apache.org/
ParserJob: success
[Gavin@Gavin local]$ bin/nutch solrindex http://127.0.0.1:8983/solr -all
SolrIndexerJob: starting
SolrIndexerJob: done.


Thank you!



Re: Nutch 2.2.1 can not index to solr

Posted by d_k <ma...@gmail.com>.
What is the output of each of the steps when you execute them separately?
Did you edit regex-urlfilter.txt accordingly?

$ bin/nutch inject urls
$ bin/nutch generate -topN 5
$ bin/nutch fetch -all
$ bin/nutch parse -all

Taken from here:
https://github.com/renepickhardt/metalcon/wiki/simpleNutchSolrSetup



