Posted to user@nutch.apache.org by Néstor <ro...@gmail.com> on 2016/10/14 19:05:53 UTC

nutch 1.7 solr 5.5.2 ubuntu

BTW, I tried the steps below with several Nutch and Solr versions and had
errors, but now I am using Nutch 1.7 and Solr 5.5.2 on Ubuntu, and I am
trying to crawl a subfolder and anything under that subfolder. The
subfolder contains one yearly subfolder for every year since 2005 (12
year subfolders), each year subfolder has 12 month subfolders, and each
month subfolder has at least 30 day subfolders. I know that I have more
than 3,960 index.phtml files, plus some other regular .html, .phtml and
PDF files.

OK, so I start the crawl and follow the step-by-step instructions:
http://wiki.apache.org/nutch/NutchTutorial#A3._Crawl_your_first_website

bin/nutch inject crawl/crawldb urls
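
For reference, a sketch of the setup around that inject (the host and
path below are placeholders, not my real site). The seed file lives in
the urls directory passed to inject:

 # urls/seed.txt
 http://www.example.com/clips/

and conf/regex-urlfilter.txt is restricted so the crawl stays inside
that subtree (first matching rule wins; the stock file ends with +.
which accepts everything):

 # accept anything under the clips subfolder (hypothetical pattern)
 +^http://www\.example\.com/clips/
 # reject everything else
 -.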

After repeating the crawl cycle below at least 7 times:

 bin/nutch generate crawl/crawldb crawl/segments -topN 10000000 -Depth 100000
 s7=`ls -d crawl/segments/2* | tail -1`
 bin/nutch fetch $s7
 bin/nutch parse $s7
 bin/nutch updatedb crawl/crawldb $s7
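
(Put differently, those rounds amount to this loop; a sketch, where the
iteration count is what really controls depth, since as far as I can
tell generate in 1.7 has no -Depth option:)

 for i in 1 2 3 4 5 6 7; do
   bin/nutch generate crawl/crawldb crawl/segments -topN 10000000
   seg=`ls -d crawl/segments/2* | tail -1`
   bin/nutch fetch $seg
   bin/nutch parse $seg
   bin/nutch updatedb crawl/crawldb $seg
 done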


Followed by:

 bin/nutch invertlinks crawl/linkdb -dir crawl/segments
 bin/nutch solrindex http://localhost:9191/solr/clips crawl/crawldb/ \
   -linkdb crawl/linkdb/ crawl/segments/20161004205432/ -filter -normalize
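
Note that the solrindex call above pushes only the single segment
20161004205432. A sketch that indexes every segment instead (the same
command, just looped):

 for s in crawl/segments/2*; do
   bin/nutch solrindex http://localhost:9191/solr/clips crawl/crawldb/ \
     -linkdb crawl/linkdb/ "$s" -filter -normalize
 done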


But it only finds 289 records (docs) when I look at the Solr page.
It seems that it only sees clips/2016, clips/2015 and clips/2011.
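
(To see what the crawl actually collected, independent of Solr, I can
dump the crawldb counts; readdb's -stats option shows how many URLs
are fetched vs. unfetched:)

 bin/nutch readdb crawl/crawldb -stats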

-----------------------------------------
I also tried the all-in-one command, but it FAILS:
bin/nutch crawl urls -solr http://localhost:9191/solr/clips -dir newcrawl \
  -depth 3 -topN 3

Indexer: starting at 2016-10-14 18:53:55
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
SOLRIndexWriter
        solr.server.url : URL of the SOLR instance (mandatory)
        solr.commit.size : buffer size when sending to SOLR (default 1000)
        solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
        solr.auth : use authentication (default false)
        solr.auth.username : use authentication (default false)
        solr.auth : username for authentication
        solr.auth.password : password for authentication


Indexer: finished at 2016-10-14 18:53:57, elapsed: 00:00:01
SolrDeleteDuplicates: starting at 2016-10-14 18:53:57
SolrDeleteDuplicates: Solr url: http://localhost:9191/solr/clips
*Exception in thread "main" java.io.IOException: Job failed!*
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
        at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:373)
        at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:353)
        at org.apache.nutch.crawl.Crawl.run(Crawl.java:160)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
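
(Side note: the solr.server.url parameter listed by the indexer above
can also be pinned in conf/nutch-site.xml instead of being passed via
-solr; a sketch with my URL:)

 <property>
   <name>solr.server.url</name>
   <value>http://localhost:9191/solr/clips</value>
 </property>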


*How can I make it crawl the entire subfolder?*
*And what does that error mean?*


Thanks,

Néstor

-- 
Né§t☼r  *Authority gone to one's head is the greatest enemy of Truth*

Re: nutch 1.7 solr 5.5.2 ubuntu

Posted by Tom Chiverton <tc...@extravision.com>.
Try looking in ..../nutch/runtime/local/logs
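
For example (assuming the default local runtime layout, where
hadoop.log is the main log file):

 tail -n 100 runtime/local/logs/hadoop.log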

Tom


On 14/10/16 20:05, Néstor wrote:
> *And what does that error mean?*