Posted to user@nutch.apache.org by Néstor <ro...@gmail.com> on 2016/10/14 19:05:53 UTC
nutch 1.7 solr 5.52 ubuntu
BTW, I tried the steps below with several Nutch and Solr versions and had errors,
but now I am using Nutch 1.7 and Solr 5.52 on Ubuntu, and I am trying to crawl a
subfolder and everything under it. The subfolder contains a yearly subfolder for
every year since 2005 (12 year subfolders), each year subfolder has 12 month
subfolders, and each month subfolder has at least 30 day subfolders. I know that
there are more than 3,960 index.phtml files plus other regular .html, .phtml,
and PDF files.
Ok, so I start the crawl and follow the step-by-step instructions:
http://wiki.apache.org/nutch/NutchTutorial#A3._Crawl_your_first_website
bin/nutch inject crawl/crawldb urls
I repeated the following crawl cycle at least 7 times:
bin/nutch generate crawl/crawldb crawl/segments -topN 10000000 -Depth 100000
s7=`ls -d crawl/segments/2* | tail -1`
bin/nutch fetch $s7
bin/nutch parse $s7
bin/nutch updatedb crawl/crawldb $s7
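The repeated generate → fetch → parse → updatedb cycle above can be sketched as a
small loop. This is a dry-run sketch, not the poster's exact script: NUTCH
defaults to "echo bin/nutch" so it only prints the commands, and the SEGMENT
placeholder stands in for the directory that a real generate run would create.

```shell
#!/bin/sh
# Sketch of the generate/fetch/parse/updatedb rounds described above.
# crawl_rounds DB SEGDIR N runs N rounds. NUTCH defaults to "echo bin/nutch",
# so the sketch prints the commands instead of executing them; point NUTCH at
# the real bin/nutch to run it for real.
NUTCH="${NUTCH:-echo bin/nutch}"

crawl_rounds() {
  db=$1; segdir=$2; rounds=$3
  i=0
  while [ "$i" -lt "$rounds" ]; do
    $NUTCH generate "$db" "$segdir" -topN 10000000
    # newest segment: the directory generate just created
    seg=$(ls -d "$segdir"/2* 2>/dev/null | tail -1)
    seg=${seg:-"$segdir/SEGMENT"}   # placeholder when dry-running
    $NUTCH fetch "$seg"
    $NUTCH parse "$seg"
    $NUTCH updatedb "$db" "$seg"
    i=$((i + 1))
  done
}
```

Usage: crawl_rounds crawl/crawldb crawl/segments 7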
Followed by:
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
bin/nutch solrindex http://localhost:9191/solr/clips crawl/crawldb/
-linkdb crawl/linkdb/ crawl/segments/20161004205432/ -filter -normalize
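One thing worth checking: the solrindex command above passes only a single
segment (crawl/segments/20161004205432), so only that one segment's pages reach
Solr. A hedged sketch that loops over every segment directory instead (dry-run:
NUTCH defaults to echo, so it only prints what would run; the Solr URL and paths
are the ones from the post):

```shell
#!/bin/sh
# Index every segment under a segments directory, not just the newest one.
# NUTCH defaults to "echo bin/nutch" so the sketch prints the commands.
NUTCH="${NUTCH:-echo bin/nutch}"

index_all_segments() {
  solr_url=$1; db=$2; linkdb=$3; segdir=$4
  for seg in "$segdir"/*/; do
    [ -d "$seg" ] || continue
    $NUTCH solrindex "$solr_url" "$db" -linkdb "$linkdb" \
        "$seg" -filter -normalize
  done
}
```

Usage: index_all_segments http://localhost:9191/solr/clips crawl/crawldb/ crawl/linkdb/ crawl/segments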
But it only finds 289 records (docs) when I look at the Solr admin page;
it seems that it only sees clips/2016, clips/2015 and clips/2011.
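If whole year directories never show up, conf/regex-urlfilter.txt is worth a
look: the default file rejects URLs containing query-string characters, and a
too-narrow accept rule silently drops entire subtrees. An illustrative excerpt
(the clips path comes from the post; the hostname is an assumption to replace
with yours):

```
# conf/regex-urlfilter.txt (illustrative excerpt)
# skip URLs containing certain characters, as in the stock file
-[?*!@=]
# accept everything under the clips subtree (hostname is a placeholder)
+^http://your\.host\.example/clips/
# reject everything else
-.
```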
-----------------------------------------
I also tried doing it all in one command, but it FAILS:
bin/nutch crawl urls -solr http://localhost:9191/solr/clips -dir newcrawl
-depth 3 -topN 3
Indexer: starting at 2016-10-14 18:53:55
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
SOLRIndexWriter
solr.server.url : URL of the SOLR instance (mandatory)
solr.commit.size : buffer size when sending to SOLR (default 1000)
solr.mapping.file : name of the mapping file for fields (default
solrindex-mapping.xml)
solr.auth : use authentication (default false)
solr.auth.username : use authentication (default false)
solr.auth : username for authentication
solr.auth.password : password for authentication
Indexer: finished at 2016-10-14 18:53:57, elapsed: 00:00:01
SolrDeleteDuplicates: starting at 2016-10-14 18:53:57
SolrDeleteDuplicates: Solr url: http://localhost:9191/solr/clips
*Exception in thread "main" java.io.IOException: Job failed!*
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
at
org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:373)
at
org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:353)
at org.apache.nutch.crawl.Crawl.run(Crawl.java:160)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
*How can I make it crawl the entire subfolder?*
*And what does that error mean?*
Thanks,
Néstor
--
Né§t☼r *Authority gone to one's head is the greatest enemy of Truth*
Re: nutch 1.7 solr 5.52 ubuntu
Posted by Tom Chiverton <tc...@extravision.com>.
Try looking in ..../nutch/runtime/local/logs
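To pull the relevant lines out of that log quickly, something along these lines
works (hadoop.log is the usual log file name for a local Nutch run; the path
follows the hint above):

```shell
#!/bin/sh
# Print the last few ERROR/Exception lines from a Nutch log file.
show_errors() {
  grep -n -E 'ERROR|Exception' "$1" | tail -20
}
```

Usage: show_errors runtime/local/logs/hadoop.log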
Tom
On 14/10/16 20:05, Néstor wrote:
> *And what does that error mean?*