Posted to user@nutch.apache.org by "Musshorn, Kris T CTR USARMY RDECOM ARL (US)" <kr...@mail.mil> on 2016/10/04 10:38:00 UTC

RE: [Non-DoD Source] Re: crawling a subfolder (UNCLASSIFIED)

CLASSIFICATION: UNCLASSIFIED

Try using full paths, e.g.:
/home/whatever/nutch/bin/nutch mergesegs /home/whatever/nutch/crawl/merged /home/whatever/nutch/crawl/segments/*
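
If readseg still complains about missing input paths, it may help to check which parts of the merged segment actually exist on disk before dumping it. A minimal sketch, assuming a POSIX shell; check_segment is a hypothetical helper, not a Nutch command, and the subdirectory names are the ones listed in the SegmentMerger output quoted below:

```shell
# Hypothetical helper (not part of Nutch): report which of the segment
# subdirectories named in the SegmentMerger output are missing.
# Returns 0 if all are present, 1 if any is missing.
check_segment() {
    seg_dir=$1
    missing=0
    for part in content crawl_generate crawl_fetch crawl_parse parse_data parse_text; do
        if [ ! -d "$seg_dir/$part" ]; then
            echo "missing: $seg_dir/$part"
            missing=1
        fi
    done
    return "$missing"
}
```

It would be called with the full path of the merged segment, for example: check_segment /home/whatever/nutch/crawl/merged/20161003234422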

Thanks,
Kris

~~~~~~~~~~~~~~~~~~~~~~~~~~
Kris T. Musshorn
FileMaker Developer - Contractor - Catapult Technology Inc.      
US Army Research Lab 
Aberdeen Proving Ground 
Application Management & Development Branch 
410-278-7251
kris.t.musshorn.ctr@mail.mil
~~~~~~~~~~~~~~~~~~~~~~~~~~

-----Original Message-----
From: Nestor [mailto:rotsen@gmail.com] 
Sent: Monday, October 03, 2016 7:48 PM
To: user@nutch.apache.org
Subject: [Non-DoD Source] Re: crawling a subfolder

I looked at the link you sent and tried it, but it failed.

Thanks,

$ bin/nutch mergesegs crawl/merged crawl/segments/*
Merging 1 segments to crawl/merged/20161003234422
SegmentMerger:   adding crawl/segments/20161003222933
SegmentMerger: using segment data from: content crawl_generate crawl_fetch crawl_parse parse_data parse_text

$ bin/nutch readseg -dump crawl/merged/* dumpedContent
SegmentReader: dump segment: crawl/merged/20161003234422
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/ubuntu/temtomcat/apache-nutch-1.7/runtime/local/crawl/merged/20161003234422/crawl_parse
Input path does not exist: file:/home/ubuntu/temtomcat/apache-nutch-1.7/runtime/local/crawl/merged/20161003234422/content
Input path does not exist: file:/home/ubuntu/temtomcat/apache-nutch-1.7/runtime/local/crawl/merged/20161003234422/parse_data
Input path does not exist: file:/home/ubuntu/temtomcat/apache-nutch-1.7/runtime/local/crawl/merged/20161003234422/parse_text
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)




--
View this message in context: http://lucene.472066.n3.nabble.com/crawling-a-subfolder-tp4299300p4299375.html
Sent from the Nutch - User mailing list archive at Nabble.com.

