Posted to user@nutch.apache.org by Mohan Lal <mo...@gmail.com> on 2006/09/28 08:01:28 UTC

Problem in Distributed crawling using nutch 0.8


Hi all,

While I am trying to crawl using distributed machines, it throws an error:

bin/nutch crawl urls -dir crawl -depth 10 -topN 50
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 10
topN = 50
Injector: starting
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Exception in thread "main" java.io.IOException: Input directory
/user/root/urls in localhost:9000 is invalid.
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:138)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)

What's wrong with my configuration? Please help me.


Regards
Mohan Lal 


Re: Problem in Distributed crawling using nutch 0.8

Posted by Mohan Lal <mo...@gmail.com>.
Hi,

I have 3 slaves listed in the conf/slaves file. I started all the daemons
with bin/start-all.sh and started crawling with the command
bin/nutch crawl -dir crawld -depth 30 -topN 50, and it crawled
successfully, no problem.

But all the jobs are executed on the localhost machine. Is it possible to
split the jobs across the 3 slave machines? If so, how can I do it? (See the
configuration sketch below the table.) Please help me; it is urgent.

At http://localhost:50030/ only one node is displayed:

Maps	Reduces	Tasks/Node	Nodes
0	0	2	1
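
For reference, map/reduce work only spreads across the slaves when every node
uses a configuration whose filesystem and job tracker point at the master host
rather than localhost. Below is a minimal sketch of conf/hadoop-site.xml,
assuming a master host named "master" (substitute your own namenode/jobtracker
host) and the same file copied to every node:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
  <!-- DFS namenode; pointing this at localhost keeps everything on one machine -->
  <property>
    <name>fs.default.name</name>
    <value>master:9000</value>
  </property>

  <!-- MapReduce job tracker; must not be "local" or localhost for a real cluster -->
  <property>
    <name>mapred.job.tracker</name>
    <value>master:9001</value>
  </property>

  <!-- rough task counts for a 3-slave cluster -->
  <property>
    <name>mapred.map.tasks</name>
    <value>6</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>3</value>
  </property>
</configuration>

conf/slaves then lists one slave hostname per line (the names below are
placeholders):

slave1
slave2
slave3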


Regards
Mohan Lal



"Håvard W. Kongsgård" wrote:
> 
> see: 
> http://wiki.apache.org/nutch-data/attachments/FrontPage/attachments/Hadoop-Nutch%200.8%20Tutorial%2022-07-06%20%3CNavoni%20Roberto%3E
> 
> Before you start Tomcat, remember to change the path of your search
> directory in the nutch-site.xml file in the webapps/ROOT/WEB-INF/classes
> directory.
>
> # This is an example of my configuration:
> 
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
> 
> <!-- Put site-specific property overrides in this file. -->
> 
> <configuration>
>   <property>
>     <name>fs.default.name</name>
>     <value>LSearchDev01:9000</value>
>   </property>
> 
>   <property>
>     <name>searcher.dir</name>
>     <value>/user/root/crawld</value>
>   </property>
> 
> </configuration>
> 
> 
> 
> Mohan Lal wrote:
>> Hi,
>>
>> Thanks for your valuable information; I have solved that problem. After
>> that I am facing another problem.
>> I have 2 slaves:
>>  1) MAC1
>>  2) MAC2
>>
>> But the job was running on MAC1 itself, and it takes a long time to finish
>> the crawling process.
>> How can I assign the job to the distributed machines I specified in the
>> slaves file?
>>
>> Still, my crawling process completed successfully. Also, how can I specify
>> the searcher dir in the nutch-site.xml file?
>>
>>      <property>
>>           <name>searcher.dir</name>
>>           <value> ? </value>
>>      </property>
>>
>> Please help me.
>>
>>
>> I have done the following setup:
>>
>> [root@mohanlal ~]# cd /home/lucene/nutch-0.8.1/
>> [root@mohanlal nutch-0.8.1]# bin/hadoop namenode -format
>> Re-format filesystem in /tmp/hadoop/dfs/name ? (Y or N) Y
>> Formatted /tmp/hadoop/dfs/name
>> [root@mohanlal nutch-0.8.1]# bin/start-all.sh
>> starting namenode, logging to
>> /home/lucene/nutch-0.8.1/bin/../logs/hadoop-root-n
>> amenode-mohanlal.qburst.local.out
>> fpo: ssh: fpo: Name or service not known
>> localhost: starting datanode, logging to
>> /home/lucene/nutch-0.8.1/bin/../logs/ha
>> doop-root-datanode-mohanlal.qburst.local.out
>> starting jobtracker, logging to
>> /home/lucene/nutch-0.8.1/bin/../logs/hadoop-root
>> -jobtracker-mohanlal.qburst.local.out
>> fpo: ssh: fpo: Name or service not known
>> localhost: starting tasktracker, logging to
>> /home/lucene/nutch-0.8.1/bin/../logs
>> /hadoop-root-tasktracker-mohanlal.qburst.local.out
>> [root@mohanlal nutch-0.8.1]# bin/stop-all.sh
>> stopping jobtracker
>> localhost: stopping tasktracker
>> sonu: no tasktracker to stop
>> stopping namenode
>> sonu: no datanode to stop
>> localhost: stopping datanode
>> [root@mohanlal nutch-0.8.1]# bin/start-all.sh
>> starting namenode, logging to
>> /home/lucene/nutch-0.8.1/bin/../logs/hadoop-root-n
>> amenode-mohanlal.qburst.local.out
>> sonu: starting datanode, logging to
>> /home/lucene/nutch-0.8.1/bin/../logs/hadoop-
>> root-datanode-sonu.qburst.local.out
>> localhost: starting datanode, logging to
>> /home/lucene/nutch-0.8.1/bin/../logs/ha
>> doop-root-datanode-mohanlal.qburst.local.out
>> starting jobtracker, logging to
>> /home/lucene/nutch-0.8.1/bin/../logs/hadoop-root
>> -jobtracker-mohanlal.qburst.local.out
>> localhost: starting tasktracker, logging to
>> /home/lucene/nutch-0.8.1/bin/../logs
>> /hadoop-root-tasktracker-mohanlal.qburst.local.out
>> sonu: starting tasktracker, logging to
>> /home/lucene/nutch-0.8.1/bin/../logs/hado
>> op-root-tasktracker-sonu.qburst.local.out
>> [root@mohanlal nutch-0.8.1]# bin/hadoop dfs -put  urls urls
>> [root@mohanlal nutch-0.8.1]# bin/nutch crawl urls -dir crawl.1 -depth 2
>> -topN 10 crawl started in: crawl.1
>> rootUrlDir = urls
>> threads = 100
>> depth = 2
>> topN = 10
>> Injector: starting
>> Injector: crawlDb: crawl.1/crawldb
>> Injector: urlDir: urls
>> Injector: Converting injected urls to crawl db entries.
>> Injector: Merging injected urls into crawl db.
>> Injector: done
>> Generator: starting
>> Generator: segment: crawl.1/segments/20060929120038
>> Generator: Selecting best-scoring urls due for fetch.
>> Generator: Partitioning selected urls by host, for politeness.
>> Generator: done.
>> Fetcher: starting
>> Fetcher: segment: crawl.1/segments/20060929120038
>> Fetcher: done
>> CrawlDb update: starting
>> CrawlDb update: db: crawl.1/crawldb
>> CrawlDb update: segment: crawl.1/segments/20060929120038
>> CrawlDb update: Merging segment data into db.
>> CrawlDb update: done
>> Generator: starting
>> Generator: segment: crawl.1/segments/20060929120235
>> Generator: Selecting best-scoring urls due for fetch.
>> Generator: Partitioning selected urls by host, for politeness.
>> Generator: done.
>> Fetcher: starting
>> Fetcher: segment: crawl.1/segments/20060929120235
>> Fetcher: done
>> CrawlDb update: starting
>> CrawlDb update: db: crawl.1/crawldb
>> CrawlDb update: segment: crawl.1/segments/20060929120235
>> CrawlDb update: Merging segment data into db.
>> CrawlDb update: done
>> LinkDb: starting
>> LinkDb: linkdb: crawl.1/linkdb
>> LinkDb: adding segment: /user/root/crawl.1/segments/20060929120038
>> LinkDb: adding segment: /user/root/crawl.1/segments/20060929120235
>> LinkDb: done
>> Indexer: starting
>> Indexer: linkdb: crawl.1/linkdb
>> Indexer: adding segment: /user/root/crawl.1/segments/20060929120038
>> Indexer: adding segment: /user/root/crawl.1/segments/20060929120235
>> Indexer: done
>> Dedup: starting
>> Dedup: adding indexes in: crawl.1/indexes
>> Dedup: done
>> Adding /user/root/crawl.1/indexes/part-00000
>> Adding /user/root/crawl.1/indexes/part-00001
>> crawl finished: crawl.1
>>
>>
>> Thanks and Regards
>> Mohanlal
>>
>>
>> "Håvard W. Kongsgård" wrote:
>>   
>>> Does /user/root/urls exist? Have you uploaded the urls folder to your
>>> DFS?
>>>
>>> bin/hadoop dfs -mkdir urls
>>> bin/hadoop dfs -copyFromLocal urls.txt urls/urls.txt
>>>
>>> or
>>>
>>> bin/hadoop dfs -put <localsrc> <dst>
>>>
>>>
>>> Mohan Lal wrote:
>>>     
>>>> Hi all,
>>>>
>>>> While I am trying to crawl using distributed machines, it throws an error:
>>>>
>>>> bin/nutch crawl urls -dir crawl -depth 10 -topN 50
>>>> crawl started in: crawl
>>>> rootUrlDir = urls
>>>> threads = 10
>>>> depth = 10
>>>> topN = 50
>>>> Injector: starting
>>>> Injector: crawlDb: crawl/crawldb
>>>> Injector: urlDir: urls
>>>> Injector: Converting injected urls to crawl db entries.
>>>> Exception in thread "main" java.io.IOException: Input directory
>>>> /user/root/urls in localhost:9000 is invalid.
>>>>         at
>>>> org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)
>>>>         at
>>>> org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
>>>>         at org.apache.nutch.crawl.Injector.inject(Injector.java:138)
>>>>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)
>>>>
>>>> What's wrong with my configuration? Please help me.
>>>>
>>>>
>>>> Regards
>>>> Mohan Lal 
>>>>   
>>>>       
>>>
>>>     
>>
>>   
> 
> 
> 



Re: Problem in Distributed crawling using nutch 0.8

Posted by mohanlal sankaranarayanan <mo...@gmail.com>.
Thanks "Håvard"

now its working fine

Rgds
Mohan Lal

On 9/29/06, "Håvard W. Kongsgård" <nu...@niap.org> wrote:
>
> see:
>
> http://wiki.apache.org/nutch-data/attachments/FrontPage/attachments/Hadoop-Nutch%200.8%20Tutorial%2022-07-06%20%3CNavoni%20Roberto%3E
>
> Before you start Tomcat, remember to change the path of your search
> directory in the nutch-site.xml file in the webapps/ROOT/WEB-INF/classes
> directory.
>
> # This is an example of my configuration:
>
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>
> <!-- Put site-specific property overrides in this file. -->
>
> <configuration>
>   <property>
>     <name>fs.default.name</name>
>     <value>LSearchDev01:9000</value>
>   </property>
>
>   <property>
>     <name>searcher.dir</name>
>     <value>/user/root/crawld</value>
>   </property>
>
> </configuration>
>
>
>
> Mohan Lal wrote:
> > Hi,
> >
> > Thanks for your valuable information; I have solved that problem. After
> > that I am facing another problem.
> > I have 2 slaves:
> >  1) MAC1
> >  2) MAC2
> >
> > But the job was running on MAC1 itself, and it takes a long time to
> > finish the crawling process.
> > How can I assign the job to the distributed machines I specified in the
> > slaves file?
> >
> > Still, my crawling process completed successfully. Also, how can I
> > specify the searcher dir in the nutch-site.xml file?
> >
> >      <property>
> >           <name>searcher.dir</name>
> >           <value> ? </value>
> >      </property>
> >
> > Please help me.
> >
> >
> > I have done the following setup:
> >
> > [root@mohanlal ~]# cd /home/lucene/nutch-0.8.1/
> > [root@mohanlal nutch-0.8.1]# bin/hadoop namenode -format
> > Re-format filesystem in /tmp/hadoop/dfs/name ? (Y or N) Y
> > Formatted /tmp/hadoop/dfs/name
> > [root@mohanlal nutch-0.8.1]# bin/start-all.sh
> > starting namenode, logging to
> > /home/lucene/nutch-0.8.1/bin/../logs/hadoop-root-n
> > amenode-mohanlal.qburst.local.out
> > fpo: ssh: fpo: Name or service not known
> > localhost: starting datanode, logging to
> > /home/lucene/nutch-0.8.1/bin/../logs/ha
> > doop-root-datanode-mohanlal.qburst.local.out
> > starting jobtracker, logging to
> > /home/lucene/nutch-0.8.1/bin/../logs/hadoop-root
> > -jobtracker-mohanlal.qburst.local.out
> > fpo: ssh: fpo: Name or service not known
> > localhost: starting tasktracker, logging to
> > /home/lucene/nutch-0.8.1/bin/../logs
> > /hadoop-root-tasktracker-mohanlal.qburst.local.out
> > [root@mohanlal nutch-0.8.1]# bin/stop-all.sh
> > stopping jobtracker
> > localhost: stopping tasktracker
> > sonu: no tasktracker to stop
> > stopping namenode
> > sonu: no datanode to stop
> > localhost: stopping datanode
> > [root@mohanlal nutch-0.8.1]# bin/start-all.sh
> > starting namenode, logging to
> > /home/lucene/nutch-0.8.1/bin/../logs/hadoop-root-n
> > amenode-mohanlal.qburst.local.out
> > sonu: starting datanode, logging to
> > /home/lucene/nutch-0.8.1/bin/../logs/hadoop-
> > root-datanode-sonu.qburst.local.out
> > localhost: starting datanode, logging to
> > /home/lucene/nutch-0.8.1/bin/../logs/ha
> > doop-root-datanode-mohanlal.qburst.local.out
> > starting jobtracker, logging to
> > /home/lucene/nutch-0.8.1/bin/../logs/hadoop-root
> > -jobtracker-mohanlal.qburst.local.out
> > localhost: starting tasktracker, logging to
> > /home/lucene/nutch-0.8.1/bin/../logs
> > /hadoop-root-tasktracker-mohanlal.qburst.local.out
> > sonu: starting tasktracker, logging to
> > /home/lucene/nutch-0.8.1/bin/../logs/hado
> > op-root-tasktracker-sonu.qburst.local.out
> > [root@mohanlal nutch-0.8.1]# bin/hadoop dfs -put  urls urls
> > [root@mohanlal nutch-0.8.1]# bin/nutch crawl urls -dir crawl.1 -depth 2
> > -topN 10 crawl started in: crawl.1
> > rootUrlDir = urls
> > threads = 100
> > depth = 2
> > topN = 10
> > Injector: starting
> > Injector: crawlDb: crawl.1/crawldb
> > Injector: urlDir: urls
> > Injector: Converting injected urls to crawl db entries.
> > Injector: Merging injected urls into crawl db.
> > Injector: done
> > Generator: starting
> > Generator: segment: crawl.1/segments/20060929120038
> > Generator: Selecting best-scoring urls due for fetch.
> > Generator: Partitioning selected urls by host, for politeness.
> > Generator: done.
> > Fetcher: starting
> > Fetcher: segment: crawl.1/segments/20060929120038
> > Fetcher: done
> > CrawlDb update: starting
> > CrawlDb update: db: crawl.1/crawldb
> > CrawlDb update: segment: crawl.1/segments/20060929120038
> > CrawlDb update: Merging segment data into db.
> > CrawlDb update: done
> > Generator: starting
> > Generator: segment: crawl.1/segments/20060929120235
> > Generator: Selecting best-scoring urls due for fetch.
> > Generator: Partitioning selected urls by host, for politeness.
> > Generator: done.
> > Fetcher: starting
> > Fetcher: segment: crawl.1/segments/20060929120235
> > Fetcher: done
> > CrawlDb update: starting
> > CrawlDb update: db: crawl.1/crawldb
> > CrawlDb update: segment: crawl.1/segments/20060929120235
> > CrawlDb update: Merging segment data into db.
> > CrawlDb update: done
> > LinkDb: starting
> > LinkDb: linkdb: crawl.1/linkdb
> > LinkDb: adding segment: /user/root/crawl.1/segments/20060929120038
> > LinkDb: adding segment: /user/root/crawl.1/segments/20060929120235
> > LinkDb: done
> > Indexer: starting
> > Indexer: linkdb: crawl.1/linkdb
> > Indexer: adding segment: /user/root/crawl.1/segments/20060929120038
> > Indexer: adding segment: /user/root/crawl.1/segments/20060929120235
> > Indexer: done
> > Dedup: starting
> > Dedup: adding indexes in: crawl.1/indexes
> > Dedup: done
> > Adding /user/root/crawl.1/indexes/part-00000
> > Adding /user/root/crawl.1/indexes/part-00001
> > crawl finished: crawl.1
> >
> >
> > Thanks and Regards
> > Mohanlal
> >
> >
> > "Håvard W. Kongsgård" wrote:
> >
> >> Does /user/root/urls exist? Have you uploaded the urls folder to your
> >> DFS?
> >>
> >> bin/hadoop dfs -mkdir urls
> >> bin/hadoop dfs -copyFromLocal urls.txt urls/urls.txt
> >>
> >> or
> >>
> >> bin/hadoop dfs -put <localsrc> <dst>
> >>
> >>
> >> Mohan Lal wrote:
> >>
> >>> Hi all,
> >>>
> >>> While I am trying to crawl using distributed machines, it throws an error:
> >>>
> >>> bin/nutch crawl urls -dir crawl -depth 10 -topN 50
> >>> crawl started in: crawl
> >>> rootUrlDir = urls
> >>> threads = 10
> >>> depth = 10
> >>> topN = 50
> >>> Injector: starting
> >>> Injector: crawlDb: crawl/crawldb
> >>> Injector: urlDir: urls
> >>> Injector: Converting injected urls to crawl db entries.
> >>> Exception in thread "main" java.io.IOException: Input directory
> >>> /user/root/urls in localhost:9000 is invalid.
> >>>         at
> >>> org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)
> >>>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java
> :327)
> >>>         at org.apache.nutch.crawl.Injector.inject(Injector.java:138)
> >>>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)
> >>>
> >>> What's wrong with my configuration? Please help me.
> >>>
> >>>
> >>> Regards
> >>> Mohan Lal
> >>>
> >>>
> >>
> >>
> >
> >
>
>

Re: Problem in Distributed crawling using nutch 0.8

Posted by "Håvard W. Kongsgård" <nu...@niap.org>.
see: 
http://wiki.apache.org/nutch-data/attachments/FrontPage/attachments/Hadoop-Nutch%200.8%20Tutorial%2022-07-06%20%3CNavoni%20Roberto%3E

Before you start Tomcat, remember to change the path of your search directory in the nutch-site.xml file in the webapps/ROOT/WEB-INF/classes directory.

# This is an example of my configuration:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>LSearchDev01:9000</value>
  </property>

  <property>
    <name>searcher.dir</name>
    <value>/user/root/crawld</value>
  </property>

</configuration>
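
A rough sketch of that deployment step, assuming Tomcat is installed under
/usr/local/tomcat (the paths are placeholders; adjust them to your
installation):

cp conf/nutch-site.xml /usr/local/tomcat/webapps/ROOT/WEB-INF/classes/
/usr/local/tomcat/bin/shutdown.sh
/usr/local/tomcat/bin/startup.sh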



Mohan Lal wrote:
> Hi,
>
> Thanks for your valuable information; I have solved that problem. After
> that I am facing another problem.
> I have 2 slaves:
>  1) MAC1
>  2) MAC2
>
> But the job was running on MAC1 itself, and it takes a long time to finish
> the crawling process.
> How can I assign the job to the distributed machines I specified in the
> slaves file?
>
> Still, my crawling process completed successfully. Also, how can I specify
> the searcher dir in the nutch-site.xml file?
>
>      <property>
>           <name>searcher.dir</name>
>           <value> ? </value>
>      </property>
>
> Please help me.
>
>
> I have done the following setup:
>
> [root@mohanlal ~]# cd /home/lucene/nutch-0.8.1/
> [root@mohanlal nutch-0.8.1]# bin/hadoop namenode -format
> Re-format filesystem in /tmp/hadoop/dfs/name ? (Y or N) Y
> Formatted /tmp/hadoop/dfs/name
> [root@mohanlal nutch-0.8.1]# bin/start-all.sh
> starting namenode, logging to
> /home/lucene/nutch-0.8.1/bin/../logs/hadoop-root-n
> amenode-mohanlal.qburst.local.out
> fpo: ssh: fpo: Name or service not known
> localhost: starting datanode, logging to
> /home/lucene/nutch-0.8.1/bin/../logs/ha
> doop-root-datanode-mohanlal.qburst.local.out
> starting jobtracker, logging to
> /home/lucene/nutch-0.8.1/bin/../logs/hadoop-root
> -jobtracker-mohanlal.qburst.local.out
> fpo: ssh: fpo: Name or service not known
> localhost: starting tasktracker, logging to
> /home/lucene/nutch-0.8.1/bin/../logs
> /hadoop-root-tasktracker-mohanlal.qburst.local.out
> [root@mohanlal nutch-0.8.1]# bin/stop-all.sh
> stopping jobtracker
> localhost: stopping tasktracker
> sonu: no tasktracker to stop
> stopping namenode
> sonu: no datanode to stop
> localhost: stopping datanode
> [root@mohanlal nutch-0.8.1]# bin/start-all.sh
> starting namenode, logging to
> /home/lucene/nutch-0.8.1/bin/../logs/hadoop-root-n
> amenode-mohanlal.qburst.local.out
> sonu: starting datanode, logging to
> /home/lucene/nutch-0.8.1/bin/../logs/hadoop-
> root-datanode-sonu.qburst.local.out
> localhost: starting datanode, logging to
> /home/lucene/nutch-0.8.1/bin/../logs/ha
> doop-root-datanode-mohanlal.qburst.local.out
> starting jobtracker, logging to
> /home/lucene/nutch-0.8.1/bin/../logs/hadoop-root
> -jobtracker-mohanlal.qburst.local.out
> localhost: starting tasktracker, logging to
> /home/lucene/nutch-0.8.1/bin/../logs
> /hadoop-root-tasktracker-mohanlal.qburst.local.out
> sonu: starting tasktracker, logging to
> /home/lucene/nutch-0.8.1/bin/../logs/hado
> op-root-tasktracker-sonu.qburst.local.out
> [root@mohanlal nutch-0.8.1]# bin/hadoop dfs -put  urls urls
> [root@mohanlal nutch-0.8.1]# bin/nutch crawl urls -dir crawl.1 -depth 2
> -topN 10 crawl started in: crawl.1
> rootUrlDir = urls
> threads = 100
> depth = 2
> topN = 10
> Injector: starting
> Injector: crawlDb: crawl.1/crawldb
> Injector: urlDir: urls
> Injector: Converting injected urls to crawl db entries.
> Injector: Merging injected urls into crawl db.
> Injector: done
> Generator: starting
> Generator: segment: crawl.1/segments/20060929120038
> Generator: Selecting best-scoring urls due for fetch.
> Generator: Partitioning selected urls by host, for politeness.
> Generator: done.
> Fetcher: starting
> Fetcher: segment: crawl.1/segments/20060929120038
> Fetcher: done
> CrawlDb update: starting
> CrawlDb update: db: crawl.1/crawldb
> CrawlDb update: segment: crawl.1/segments/20060929120038
> CrawlDb update: Merging segment data into db.
> CrawlDb update: done
> Generator: starting
> Generator: segment: crawl.1/segments/20060929120235
> Generator: Selecting best-scoring urls due for fetch.
> Generator: Partitioning selected urls by host, for politeness.
> Generator: done.
> Fetcher: starting
> Fetcher: segment: crawl.1/segments/20060929120235
> Fetcher: done
> CrawlDb update: starting
> CrawlDb update: db: crawl.1/crawldb
> CrawlDb update: segment: crawl.1/segments/20060929120235
> CrawlDb update: Merging segment data into db.
> CrawlDb update: done
> LinkDb: starting
> LinkDb: linkdb: crawl.1/linkdb
> LinkDb: adding segment: /user/root/crawl.1/segments/20060929120038
> LinkDb: adding segment: /user/root/crawl.1/segments/20060929120235
> LinkDb: done
> Indexer: starting
> Indexer: linkdb: crawl.1/linkdb
> Indexer: adding segment: /user/root/crawl.1/segments/20060929120038
> Indexer: adding segment: /user/root/crawl.1/segments/20060929120235
> Indexer: done
> Dedup: starting
> Dedup: adding indexes in: crawl.1/indexes
> Dedup: done
> Adding /user/root/crawl.1/indexes/part-00000
> Adding /user/root/crawl.1/indexes/part-00001
> crawl finished: crawl.1
>
>
> Thanks and Regards
> Mohanlal
>
>
> "Håvard W. Kongsgård" wrote:
>   
>> Does /user/root/urls exist? Have you uploaded the urls folder to your
>> DFS?
>>
>> bin/hadoop dfs -mkdir urls
>> bin/hadoop dfs -copyFromLocal urls.txt urls/urls.txt
>>
>> or
>>
>> bin/hadoop dfs -put <localsrc> <dst>
>>
>>
>> Mohan Lal wrote:
>>     
>>> Hi all,
>>>
>>> While I am trying to crawl using distributed machines, it throws an error:
>>>
>>> bin/nutch crawl urls -dir crawl -depth 10 -topN 50
>>> crawl started in: crawl
>>> rootUrlDir = urls
>>> threads = 10
>>> depth = 10
>>> topN = 50
>>> Injector: starting
>>> Injector: crawlDb: crawl/crawldb
>>> Injector: urlDir: urls
>>> Injector: Converting injected urls to crawl db entries.
>>> Exception in thread "main" java.io.IOException: Input directory
>>> /user/root/urls in localhost:9000 is invalid.
>>>         at
>>> org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)
>>>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
>>>         at org.apache.nutch.crawl.Injector.inject(Injector.java:138)
>>>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)
>>>
>>> What's wrong with my configuration? Please help me.
>>>
>>>
>>> Regards
>>> Mohan Lal 
>>>   
>>>       
>>
>>     
>
>   


Re: Problem in Distributed crawling using nutch 0.8

Posted by Mohan Lal <mo...@gmail.com>.
Hi,

Thanks for your valuable information; I have solved that problem. After that
I am facing another problem.
I have 2 slaves:
 1) MAC1
 2) MAC2

But the job was running on MAC1 itself, and it takes a long time to finish
the crawling process.
How can I assign the job to the distributed machines I specified in the slaves
file?

Still, my crawling process completed successfully. Also, how can I specify
the searcher dir in the nutch-site.xml file?

     <property>
          <name>searcher.dir</name>
          <value> ? </value>
     </property>

Please help me.


I have done the following setup:

[root@mohanlal ~]# cd /home/lucene/nutch-0.8.1/
[root@mohanlal nutch-0.8.1]# bin/hadoop namenode -format
Re-format filesystem in /tmp/hadoop/dfs/name ? (Y or N) Y
Formatted /tmp/hadoop/dfs/name
[root@mohanlal nutch-0.8.1]# bin/start-all.sh
starting namenode, logging to
/home/lucene/nutch-0.8.1/bin/../logs/hadoop-root-n
amenode-mohanlal.qburst.local.out
fpo: ssh: fpo: Name or service not known
localhost: starting datanode, logging to
/home/lucene/nutch-0.8.1/bin/../logs/ha
doop-root-datanode-mohanlal.qburst.local.out
starting jobtracker, logging to
/home/lucene/nutch-0.8.1/bin/../logs/hadoop-root
-jobtracker-mohanlal.qburst.local.out
fpo: ssh: fpo: Name or service not known
localhost: starting tasktracker, logging to
/home/lucene/nutch-0.8.1/bin/../logs
/hadoop-root-tasktracker-mohanlal.qburst.local.out
[root@mohanlal nutch-0.8.1]# bin/stop-all.sh
stopping jobtracker
localhost: stopping tasktracker
sonu: no tasktracker to stop
stopping namenode
sonu: no datanode to stop
localhost: stopping datanode
[root@mohanlal nutch-0.8.1]# bin/start-all.sh
starting namenode, logging to
/home/lucene/nutch-0.8.1/bin/../logs/hadoop-root-n
amenode-mohanlal.qburst.local.out
sonu: starting datanode, logging to
/home/lucene/nutch-0.8.1/bin/../logs/hadoop-
root-datanode-sonu.qburst.local.out
localhost: starting datanode, logging to
/home/lucene/nutch-0.8.1/bin/../logs/ha
doop-root-datanode-mohanlal.qburst.local.out
starting jobtracker, logging to
/home/lucene/nutch-0.8.1/bin/../logs/hadoop-root
-jobtracker-mohanlal.qburst.local.out
localhost: starting tasktracker, logging to
/home/lucene/nutch-0.8.1/bin/../logs
/hadoop-root-tasktracker-mohanlal.qburst.local.out
sonu: starting tasktracker, logging to
/home/lucene/nutch-0.8.1/bin/../logs/hado
op-root-tasktracker-sonu.qburst.local.out
[root@mohanlal nutch-0.8.1]# bin/hadoop dfs -put  urls urls
[root@mohanlal nutch-0.8.1]# bin/nutch crawl urls -dir crawl.1 -depth 2
-topN 10 crawl started in: crawl.1
rootUrlDir = urls
threads = 100
depth = 2
topN = 10
Injector: starting
Injector: crawlDb: crawl.1/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: starting
Generator: segment: crawl.1/segments/20060929120038
Generator: Selecting best-scoring urls due for fetch.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl.1/segments/20060929120038
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl.1/crawldb
CrawlDb update: segment: crawl.1/segments/20060929120038
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: starting
Generator: segment: crawl.1/segments/20060929120235
Generator: Selecting best-scoring urls due for fetch.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl.1/segments/20060929120235
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl.1/crawldb
CrawlDb update: segment: crawl.1/segments/20060929120235
CrawlDb update: Merging segment data into db.
CrawlDb update: done
LinkDb: starting
LinkDb: linkdb: crawl.1/linkdb
LinkDb: adding segment: /user/root/crawl.1/segments/20060929120038
LinkDb: adding segment: /user/root/crawl.1/segments/20060929120235
LinkDb: done
Indexer: starting
Indexer: linkdb: crawl.1/linkdb
Indexer: adding segment: /user/root/crawl.1/segments/20060929120038
Indexer: adding segment: /user/root/crawl.1/segments/20060929120235
Indexer: done
Dedup: starting
Dedup: adding indexes in: crawl.1/indexes
Dedup: done
Adding /user/root/crawl.1/indexes/part-00000
Adding /user/root/crawl.1/indexes/part-00001
crawl finished: crawl.1


Thanks and Regards
Mohanlal


"Håvard W. Kongsgård" wrote:
> 
> Does /user/root/urls exist? Have you uploaded the urls folder to your
> DFS?
>
> bin/hadoop dfs -mkdir urls
> bin/hadoop dfs -copyFromLocal urls.txt urls/urls.txt
>
> or
>
> bin/hadoop dfs -put <localsrc> <dst>
> 
> 
> Mohan Lal wrote:
>> Hi all,
>>
>> While I am trying to crawl using distributed machines, it throws an error:
>>
>> bin/nutch crawl urls -dir crawl -depth 10 -topN 50
>> crawl started in: crawl
>> rootUrlDir = urls
>> threads = 10
>> depth = 10
>> topN = 50
>> Injector: starting
>> Injector: crawlDb: crawl/crawldb
>> Injector: urlDir: urls
>> Injector: Converting injected urls to crawl db entries.
>> Exception in thread "main" java.io.IOException: Input directory
>> /user/root/urls in localhost:9000 is invalid.
>>         at
>> org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)
>>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
>>         at org.apache.nutch.crawl.Injector.inject(Injector.java:138)
>>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)
>>
>> What's wrong with my configuration? Please help me.
>>
>>
>> Regards
>> Mohan Lal 
>>   
> 
> 
> 



Re: Problem in Distributed crawling using nutch 0.8

Posted by "Håvard W. Kongsgård" <nu...@niap.org>.
Does /user/root/urls exist? Have you uploaded the urls folder to your DFS?

bin/hadoop dfs -mkdir urls
bin/hadoop dfs -copyFromLocal urls.txt urls/urls.txt

or

bin/hadoop dfs -put <localsrc> <dst>
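
Putting the pieces together, a complete sequence might look like this (a
sketch only; the local seed file urls/seed.txt is an example name):

bin/hadoop dfs -mkdir urls
bin/hadoop dfs -put urls/seed.txt urls/seed.txt
bin/hadoop dfs -ls urls
bin/nutch crawl urls -dir crawl -depth 10 -topN 50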


Mohan Lal wrote:
> Hi all,
>
> While I am trying to crawl using distributed machines, it throws an error:
>
> bin/nutch crawl urls -dir crawl -depth 10 -topN 50
> crawl started in: crawl
> rootUrlDir = urls
> threads = 10
> depth = 10
> topN = 50
> Injector: starting
> Injector: crawlDb: crawl/crawldb
> Injector: urlDir: urls
> Injector: Converting injected urls to crawl db entries.
> Exception in thread "main" java.io.IOException: Input directory
> /user/root/urls in localhost:9000 is invalid.
>         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
>         at org.apache.nutch.crawl.Injector.inject(Injector.java:138)
>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)
>
> What's wrong with my configuration? Please help me.
>
>
> Regards
> Mohan Lal 
>