Posted to user@nutch.apache.org by Mohan Lal <mo...@gmail.com> on 2006/09/28 08:01:28 UTC
Problem in Distributed crawling using nutch 0.8
Hi all,
While I am trying to crawl using distributed machines, it throws an error:
bin/nutch crawl urls -dir crawl -depth 10 -topN 50
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 10
topN = 50
Injector: starting
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Exception in thread "main" java.io.IOException: Input directory
/user/root/urls in localhost:9000 is invalid.
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
at org.apache.nutch.crawl.Injector.inject(Injector.java:138)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)
What's wrong with my configuration? Please help me.
Regards
Mohan Lal
--
View this message in context: http://www.nabble.com/Problem-in-Distributed-crawling-using-nutch-0.8-tf2348922.html#a6540735
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: Problem in Distributed crawling using nutch 0.8
Posted by Mohan Lal <mo...@gmail.com>.
Hi,
I have 3 slaves mentioned in the conf/slaves file, and I have started all the
processes using bin/start-all.sh. I started crawling with the command
bin/nutch crawl -dir crawld -depth 30 -topN 50, and it crawled
successfully, no problem,
but all the jobs were executed on the localhost machine. Is it possible to
split the jobs across the 3 slave machines?
If so, how can I do it? Please help me, it's urgent.
http://localhost:50030/ displays only one node:
Maps    Reduces    Tasks/Node    Nodes
0       0          2             1
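[Editor's note: in Hadoop of this era, jobs stay on one machine when mapred.job.tracker is left at its default value of "local". A minimal hadoop-site.xml sketch of the relevant overrides; "master-host" is a placeholder hostname, not taken from this thread:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Point the DFS at the master node; "master-host" is a placeholder. -->
  <property>
    <name>fs.default.name</name>
    <value>master-host:9000</value>
  </property>
  <!-- If this stays at the default "local", all map/reduce jobs run
       in-process on the submitting machine. Set it to the JobTracker's
       host:port so tasks are farmed out to the TaskTrackers on the
       hosts listed in conf/slaves. -->
  <property>
    <name>mapred.job.tracker</name>
    <value>master-host:9001</value>
  </property>
</configuration>
```

After editing, the file would need to be copied to every node, followed by bin/stop-all.sh and bin/start-all.sh.]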
Regards
Mohan Lal
--
View this message in context: http://www.nabble.com/Problem-in-Distributed-crawling-using-nutch-0.8-tf2348922.html#a6576824
Re: Problem in Distributed crawling using nutch 0.8
Posted by mohanlal sankaranarayanan <mo...@gmail.com>.
Thanks, Håvard,
now it's working fine.
Regards
Mohan Lal
On 9/29/06, "Håvard W. Kongsgård" <nu...@niap.org> wrote:
Re: Problem in Distributed crawling using nutch 0.8
Posted by "Håvard W. Kongsgård" <nu...@niap.org>.
see:
http://wiki.apache.org/nutch-data/attachments/FrontPage/attachments/Hadoop-Nutch%200.8%20Tutorial%2022-07-06%20%3CNavoni%20Roberto%3E
Before you start Tomcat, remember to change the path of your search directory
in the nutch-site.xml file in the webapps/ROOT/WEB-INF/classes directory.
# This is an example of my configuration:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>LSearchDev01:9000</value>
  </property>
  <property>
    <name>searcher.dir</name>
    <value>/user/root/crawld</value>
  </property>
</configuration>
Re: Problem in Distributed crawling using nutch 0.8
Posted by Mohan Lal <mo...@gmail.com>.
Hi,
Thanks for your valuable information; I have solved that problem. After that,
I am facing another problem.
I have 2 slaves:
1) MAC1
2) MAC2
but the job was running on MAC1 itself, and it takes a long time to finish
the crawling process.
How can I assign jobs to the distributed machines I specified in the slaves
file?
My crawling process completed successfully. Also, how can I specify
the searcher dir in the nutch-site.xml file?
<property>
<name>searcher.dir</name>
<value> ? </value>
</property>
Please help me.
I have done the following setup:
[root@mohanlal ~]# cd /home/lucene/nutch-0.8.1/
[root@mohanlal nutch-0.8.1]# bin/hadoop namenode -format
Re-format filesystem in /tmp/hadoop/dfs/name ? (Y or N) Y
Formatted /tmp/hadoop/dfs/name
[root@mohanlal nutch-0.8.1]# bin/start-all.sh
starting namenode, logging to /home/lucene/nutch-0.8.1/bin/../logs/hadoop-root-namenode-mohanlal.qburst.local.out
fpo: ssh: fpo: Name or service not known
localhost: starting datanode, logging to /home/lucene/nutch-0.8.1/bin/../logs/hadoop-root-datanode-mohanlal.qburst.local.out
starting jobtracker, logging to /home/lucene/nutch-0.8.1/bin/../logs/hadoop-root-jobtracker-mohanlal.qburst.local.out
fpo: ssh: fpo: Name or service not known
localhost: starting tasktracker, logging to /home/lucene/nutch-0.8.1/bin/../logs/hadoop-root-tasktracker-mohanlal.qburst.local.out
[root@mohanlal nutch-0.8.1]# bin/stop-all.sh
stopping jobtracker
localhost: stopping tasktracker
sonu: no tasktracker to stop
stopping namenode
sonu: no datanode to stop
localhost: stopping datanode
[root@mohanlal nutch-0.8.1]# bin/start-all.sh
starting namenode, logging to /home/lucene/nutch-0.8.1/bin/../logs/hadoop-root-namenode-mohanlal.qburst.local.out
sonu: starting datanode, logging to /home/lucene/nutch-0.8.1/bin/../logs/hadoop-root-datanode-sonu.qburst.local.out
localhost: starting datanode, logging to /home/lucene/nutch-0.8.1/bin/../logs/hadoop-root-datanode-mohanlal.qburst.local.out
starting jobtracker, logging to /home/lucene/nutch-0.8.1/bin/../logs/hadoop-root-jobtracker-mohanlal.qburst.local.out
localhost: starting tasktracker, logging to /home/lucene/nutch-0.8.1/bin/../logs/hadoop-root-tasktracker-mohanlal.qburst.local.out
sonu: starting tasktracker, logging to /home/lucene/nutch-0.8.1/bin/../logs/hadoop-root-tasktracker-sonu.qburst.local.out
[root@mohanlal nutch-0.8.1]# bin/hadoop dfs -put urls urls
[root@mohanlal nutch-0.8.1]# bin/nutch crawl urls -dir crawl.1 -depth 2 -topN 10
crawl started in: crawl.1
rootUrlDir = urls
threads = 100
depth = 2
topN = 10
Injector: starting
Injector: crawlDb: crawl.1/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: starting
Generator: segment: crawl.1/segments/20060929120038
Generator: Selecting best-scoring urls due for fetch.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl.1/segments/20060929120038
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl.1/crawldb
CrawlDb update: segment: crawl.1/segments/20060929120038
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: starting
Generator: segment: crawl.1/segments/20060929120235
Generator: Selecting best-scoring urls due for fetch.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl.1/segments/20060929120235
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl.1/crawldb
CrawlDb update: segment: crawl.1/segments/20060929120235
CrawlDb update: Merging segment data into db.
CrawlDb update: done
LinkDb: starting
LinkDb: linkdb: crawl.1/linkdb
LinkDb: adding segment: /user/root/crawl.1/segments/20060929120038
LinkDb: adding segment: /user/root/crawl.1/segments/20060929120235
LinkDb: done
Indexer: starting
Indexer: linkdb: crawl.1/linkdb
Indexer: adding segment: /user/root/crawl.1/segments/20060929120038
Indexer: adding segment: /user/root/crawl.1/segments/20060929120235
Indexer: done
Dedup: starting
Dedup: adding indexes in: crawl.1/indexes
Dedup: done
Adding /user/root/crawl.1/indexes/part-00000
Adding /user/root/crawl.1/indexes/part-00001
crawl finished: crawl.1
Thanks and Regards
Mohanlal
--
View this message in context: http://www.nabble.com/Problem-in-Distributed-crawling-using-nutch-0.8-tf2348922.html#a6560245
Re: Problem in Distributed crawling using nutch 0.8
Posted by "Håvard W. Kongsgård" <nu...@niap.org>.
Does /user/root/urls exist? Have you uploaded the urls folder to your DFS system?
bin/hadoop dfs -mkdir urls
bin/hadoop dfs -copyFromLocal urls.txt urls/urls.txt
or
bin/hadoop dfs -put <localsrc> <dst>
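[Editor's note: a quick way to check is to list the DFS before injecting. A sketch against a running cluster (the exact -ls output format varies by Hadoop version):

```shell
# Upload the seed list from the local filesystem into the DFS home
# directory of the current user (/user/root when running as root).
bin/hadoop dfs -put urls urls

# Verify that the injector's input directory exists and contains the
# seed file before running bin/nutch crawl.
bin/hadoop dfs -ls
bin/hadoop dfs -ls urls
```
]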