Posted to user@nutch.apache.org by Alexei Korolev <al...@gmail.com> on 2012/08/03 10:53:58 UTC

crawling site without www

Hello,

I have a small script

$NUTCH_PATH inject crawl/crawldb seed.txt
$NUTCH_PATH generate crawl/crawldb crawl/crawldb/segments -adddays 0

s1=`ls -d crawl/crawldb/segments/* | tail -1`
$NUTCH_PATH fetch $s1
$NUTCH_PATH parse $s1
$NUTCH_PATH updatedb crawl/crawldb $s1
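
A guarded variant of this script (just a sketch, assuming $NUTCH_PATH points at
bin/nutch and the same crawl/ layout) skips fetch/parse/updatedb when generate
does not create a new segment; this avoids the "Segment already fetched!" errors
that come up later in the thread:

$NUTCH_PATH inject crawl/crawldb seed.txt
# remember the newest segment before generating
before=`ls -d crawl/crawldb/segments/* 2>/dev/null | tail -1`
$NUTCH_PATH generate crawl/crawldb crawl/crawldb/segments -adddays 0
s1=`ls -d crawl/crawldb/segments/* 2>/dev/null | tail -1`
if [ -n "$s1" ] && [ "$s1" != "$before" ]; then
  $NUTCH_PATH fetch $s1
  $NUTCH_PATH parse $s1
  $NUTCH_PATH updatedb crawl/crawldb $s1
else
  echo "generate selected no URLs, nothing to fetch"
fi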

In seed.txt I have just one site, for example "test.com". When I start the
script, it fails in the fetch phase.
If I change test.com to www.test.com, it works fine. The reason seems to be that
the outgoing links on test.com all have the www. prefix.
What do I need to change in the Nutch config to make it work with test.com?

Thank you in advance. I hope my explanation is clear :)

-- 
Alexei A. Korolev

Re: crawling site without www

Posted by Alexei Korolev <al...@gmail.com>.
Ok. Thank you a lot. I'll try later :)

On Wed, Aug 8, 2012 at 9:18 PM, Sebastian Nagel
<wa...@googlemail.com>wrote:

> Hi Alexei,
>
> > So I see just one solution for crawling limited count of sites with
> > behaviour like on mobile365. Its limit scope of sites using
> > regex-urlfilter.txt with list like this
> >
> > +^www.mobile365.ru
> > +^mobile365.ru
>
> Better:
> +^https?://(?:www\.)?mobile365\.ru/
> or to catch all of mobile365.ru
> +^https?://(?:[a-z0-9-]+\.)*mobile365\.ru/
>
> and don't forget to remove the final rule
>
> # accept anything else
> +.
>
> and replace it by
>
> # skip everything else
> -.
>
> If you have more than a few hosts / domains you want to allow
> the urlfilter-domain would be a more comfortable choice.
> Here a simple line has the desired effect:
> mobile365.ru
>
>
> Sebastian
>
> >
> > Thanks.
> >
> > On Wed, Aug 8, 2012 at 5:56 PM, Markus Jelsma <
> markus.jelsma@openindex.io>wrote:
> >
> >>
> >> If it starts to redirect and you are on the wrong side of the redirect,
> >> you're in trouble. But with the HostNormalizer you can then renormalize
> all
> >> URL's to the host that is being redirected to.
> >>
> >>
> >> -----Original message-----
> >>> From:Alexei Korolev <al...@gmail.com>
> >>> Sent: Wed 08-Aug-2012 15:55
> >>> To: user@nutch.apache.org
> >>> Subject: Re: crawling site without www
> >>>
> >>>> You can use the HostURLNormalizer for this task or just crawl the www
> >> OR
> >>>> the non-www, not both.
> >>>>
> >>>
> >>> I'm trying to crawl only version without www. As I see, I can remove
> www.
> >>> using proper configured regex-normalize.xml.
> >>> But will it work if mobile365.ru redirect on www.mobile365.ru (it's
> very
> >>> common situation in web)
> >>>
> >>> Thanks.
> >>>
> >>> Alexei
> >>>
> >>
> >
> >
> >
>
>


-- 
Alexei A. Korolev

Re: crawling site without www

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Alexei,

> So I see just one solution for crawling limited count of sites with
> behaviour like on mobile365. Its limit scope of sites using
> regex-urlfilter.txt with list like this
> 
> +^www.mobile365.ru
> +^mobile365.ru

Better:
+^https?://(?:www\.)?mobile365\.ru/
or to catch all of mobile365.ru
+^https?://(?:[a-z0-9-]+\.)*mobile365\.ru/

and don't forget to remove the final rule

# accept anything else
+.

and replace it by

# skip everything else
-.

If you have more than a few hosts / domains you want to allow,
the urlfilter-domain plugin would be a more comfortable choice.
Here, a simple line has the desired effect:
mobile365.ru
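
The urlfilter-domain plugin also needs to be activated via plugin.includes in
nutch-site.xml. A sketch, assuming the stock 1.x plugin list with
urlfilter-regex swapped for urlfilter-domain (verify the names against your
version), and with the allowed domain placed in conf/domain-urlfilter.txt:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-domain|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>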


Sebastian

> 
> Thanks.
> 
> On Wed, Aug 8, 2012 at 5:56 PM, Markus Jelsma <ma...@openindex.io>wrote:
> 
>>
>> If it starts to redirect and you are on the wrong side of the redirect,
>> you're in trouble. But with the HostNormalizer you can then renormalize all
>> URL's to the host that is being redirected to.
>>
>>
>> -----Original message-----
>>> From:Alexei Korolev <al...@gmail.com>
>>> Sent: Wed 08-Aug-2012 15:55
>>> To: user@nutch.apache.org
>>> Subject: Re: crawling site without www
>>>
>>>> You can use the HostURLNormalizer for this task or just crawl the www
>> OR
>>>> the non-www, not both.
>>>>
>>>
>>> I'm trying to crawl only version without www. As I see, I can remove www.
>>> using proper configured regex-normalize.xml.
>>> But will it work if mobile365.ru redirect on www.mobile365.ru (it's very
>>> common situation in web)
>>>
>>> Thanks.
>>>
>>> Alexei
>>>
>>
> 
> 
> 


Re: crawling site without www

Posted by Alexei Korolev <al...@gmail.com>.
So I see just one solution for crawling a limited set of sites with
behaviour like mobile365's: limit the scope of sites using
regex-urlfilter.txt with a list like this

+^www.mobile365.ru
+^mobile365.ru

Thanks.

On Wed, Aug 8, 2012 at 5:56 PM, Markus Jelsma <ma...@openindex.io>wrote:

>
> If it starts to redirect and you are on the wrong side of the redirect,
> you're in trouble. But with the HostNormalizer you can then renormalize all
> URL's to the host that is being redirected to.
>
>
> -----Original message-----
> > From:Alexei Korolev <al...@gmail.com>
> > Sent: Wed 08-Aug-2012 15:55
> > To: user@nutch.apache.org
> > Subject: Re: crawling site without www
> >
> > > You can use the HostURLNormalizer for this task or just crawl the www
> OR
> > > the non-www, not both.
> > >
> >
> > I'm trying to crawl only version without www. As I see, I can remove www.
> > using proper configured regex-normalize.xml.
> > But will it work if mobile365.ru redirect on www.mobile365.ru (it's very
> > common situation in web)
> >
> > Thanks.
> >
> > Alexei
> >
>



-- 
Alexei A. Korolev

RE: crawling site without www

Posted by Markus Jelsma <ma...@openindex.io>.
If it starts to redirect and you are on the wrong side of the redirect, you're in trouble. But with the HostNormalizer you can then renormalize all URL's to the host that is being redirected to.
 
 
-----Original message-----
> From:Alexei Korolev <al...@gmail.com>
> Sent: Wed 08-Aug-2012 15:55
> To: user@nutch.apache.org
> Subject: Re: crawling site without www
> 
> > You can use the HostURLNormalizer for this task or just crawl the www OR
> > the non-www, not both.
> >
> 
> I'm trying to crawl only version without www. As I see, I can remove www.
> using proper configured regex-normalize.xml.
> But will it work if mobile365.ru redirect on www.mobile365.ru (it's very
> common situation in web)
> 
> Thanks.
> 
> Alexei
> 

Re: crawling site without www

Posted by Alexei Korolev <al...@gmail.com>.
> You can use the HostURLNormalizer for this task or just crawl the www OR
> the non-www, not both.
>

I'm trying to crawl only the version without www. As I see it, I can remove the
www. using a properly configured regex-normalize.xml.
But will it work if mobile365.ru redirects to www.mobile365.ru? (It's a very
common situation on the web.)
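
For reference, a www-stripping entry for conf/regex-normalize.xml (the file read
by the urlnormalizer-regex plugin) might look like the sketch below; the exact
pattern is my guess, not a tested rule:

<regex>
  <!-- drop a leading www. from http/https URLs -->
  <pattern>^(https?://)www\.</pattern>
  <substitution>$1</substitution>
</regex>

Note that this only rewrites URLs on the Nutch side; it does not change what the
server does if it redirects to the www host.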

Thanks.

Alexei

RE: crawling site without www

Posted by Markus Jelsma <ma...@openindex.io>.

 
 
-----Original message-----
> From:Alexei Korolev <al...@gmail.com>
> Sent: Wed 08-Aug-2012 15:43
> To: user@nutch.apache.org
> Subject: Re: crawling site without www
> 
> Hi, Sebastian
> 
> Seems you are right. I have db.ignore.external.links is true.
> But how to configure nutch for processing mobile365.ru and www.mobile365 as
> single site?

You can use the HostURLNormalizer for this task or just crawl the www OR the non-www, not both.

> 
> Thanks.
> 
> On Tue, Aug 7, 2012 at 10:58 PM, Sebastian Nagel <wastl.nagel@googlemail.com
> > wrote:
> 
> > Hi Alexei,
> >
> > I tried a crawl with your script fragment and Nutch 1.5.1
> > and the URLs http://mobile365.ru as seed. It worked,
> > see annotated log below.
> >
> > Which version of Nutch do you use?
> >
> > Check the property db.ignore.external.links (default is false).
> > If true the link from mobile365.ru to www.mobile365.ru
> > is skipped.
> >
> > Look into your crawldb (bin/nutch readdb)
> >
> > Check your URL filters with
> >  bin/nutch org.apache.nutch.net.URLFilterChecker
> >
> > Finally, send the nutch-site.xml and every configuration
> > file you changed.
> >
> > Good luck,
> > Sebastian
> >
> > % nutch inject crawl/crawldb seed.txt
> > Injector: starting at 2012-08-07 20:31:00
> > Injector: crawlDb: crawl/crawldb
> > Injector: urlDir: seed.txt
> > Injector: Converting injected urls to crawl db entries.
> > Injector: Merging injected urls into crawl db.
> > Injector: finished at 2012-08-07 20:31:15, elapsed: 00:00:15
> >
> > % nutch generate crawl/crawldb crawl/crawldb/segments -adddays 0
> > Generator: starting at 2012-08-07 20:31:23
> > Generator: Selecting best-scoring urls due for fetch.
> > Generator: filtering: true
> > Generator: normalizing: true
> > Generator: jobtracker is 'local', generating exactly one partition.
> > Generator: Partitioning selected urls for politeness.
> > Generator: segment: crawl/crawldb/segments/20120807203131
> > Generator: finished at 2012-08-07 20:31:39, elapsed: 00:00:15
> >
> > # Note: personally, I would prefer not to place segments (also linkdb)
> > #       in the crawldb/ folder.
> >
> > % s1=`ls -d crawl/crawldb/segments/* | tail -1`
> >
> > % nutch fetch $s1
> > Fetcher: starting at 2012-08-07 20:32:00
> > Fetcher: segment: crawl/crawldb/segments/20120807203131
> > Using queue mode : byHost
> > Fetcher: threads: 10
> > Fetcher: time-out divisor: 2
> > QueueFeeder finished: total 1 records + hit by time limit :0
> > Using queue mode : byHost
> > fetching http://mobile365.ru/
> > Using queue mode : byHost
> > Using queue mode : byHost
> > -finishing thread FetcherThread, activeThreads=1
> > Using queue mode : byHost
> > -finishing thread FetcherThread, activeThreads=1
> > Using queue mode : byHost
> > -finishing thread FetcherThread, activeThreads=1
> > Using queue mode : byHost
> > -finishing thread FetcherThread, activeThreads=1
> > -finishing thread FetcherThread, activeThreads=1
> > Using queue mode : byHost
> > -finishing thread FetcherThread, activeThreads=1
> > Using queue mode : byHost
> > -finishing thread FetcherThread, activeThreads=1
> > Using queue mode : byHost
> > Using queue mode : byHost
> > Fetcher: throughput threshold: -1
> > -finishing thread FetcherThread, activeThreads=1
> > Fetcher: throughput threshold retries: 5
> > -finishing thread FetcherThread, activeThreads=1
> > -finishing thread FetcherThread, activeThreads=0
> > -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> > -activeThreads=0
> > Fetcher: finished at 2012-08-07 20:32:08, elapsed: 00:00:07
> >
> > % nutch parse $s1
> > ParseSegment: starting at 2012-08-07 20:32:12
> > ParseSegment: segment: crawl/crawldb/segments/20120807203131
> > Parsed (10ms):http://mobile365.ru/
> > ParseSegment: finished at 2012-08-07 20:32:20, elapsed: 00:00:07
> >
> > % nutch updatedb crawl/crawldb/ $s1
> > CrawlDb update: starting at 2012-08-07 20:32:24
> > CrawlDb update: db: crawl/crawldb
> > CrawlDb update: segments: [crawl/crawldb/segments/20120807203131]
> > CrawlDb update: additions allowed: true
> > CrawlDb update: URL normalizing: false
> > CrawlDb update: URL filtering: false
> > CrawlDb update: 404 purging: false
> > CrawlDb update: Merging segment data into db.
> > CrawlDb update: finished at 2012-08-07 20:32:38, elapsed: 00:00:13
> >
> > # see whether the outlink is now in crawldb:
> > % nutch readdb crawl/crawldb/ -stats
> > CrawlDb statistics start: crawl/crawldb/
> > Statistics for CrawlDb: crawl/crawldb/
> > TOTAL urls:     2
> > retry 0:        2
> > min score:      1.0
> > avg score:      1.0
> > max score:      1.0
> > status 1 (db_unfetched):        1
> > status 2 (db_fetched):  1
> > CrawlDb statistics: done
> > # => yes: http://mobile365.ru/ is fetched, outlink found
> >
> > %nutch generate crawl/crawldb crawl/crawldb/segments -adddays 0
> > Generator: starting at 2012-08-07 20:32:58
> > Generator: Selecting best-scoring urls due for fetch.
> > Generator: filtering: true
> > Generator: normalizing: true
> > Generator: jobtracker is 'local', generating exactly one partition.
> > Generator: Partitioning selected urls for politeness.
> > Generator: segment: crawl/crawldb/segments/20120807203307
> > Generator: finished at 2012-08-07 20:33:14, elapsed: 00:00:15
> >
> > % s1=`ls -d crawl/crawldb/segments/* | tail -1`
> >
> > % nutch fetch $s1
> > Fetcher: starting at 2012-08-07 20:33:34
> > Fetcher: segment: crawl/crawldb/segments/20120807203307
> > Using queue mode : byHost
> > Fetcher: threads: 10
> > Fetcher: time-out divisor: 2
> > QueueFeeder finished: total 1 records + hit by time limit :0
> > Using queue mode : byHost
> > fetching http://www.mobile365.ru/test.html
> > # got it
> >
> >
> > On 08/07/2012 04:37 PM, Alexei Korolev wrote:
> > > Hi,
> > >
> > > I made simple example
> > >
> > > Put in seed.txt
> > > http://mobile365.ru
> > >
> > > It will produce error.
> > >
> > > Put in seed.txt
> > > http://www.mobile365.ru
> > >
> > > and second launch of crawler script will work fine and fetch
> > > http://www.mobile365.ru/test.html page.
> > >
> > > On Tue, Aug 7, 2012 at 6:23 PM, Mathijs Homminga <
> > > mathijs.homminga@kalooga.com> wrote:
> > >
> > >> Hi,
> > >>
> > >> I read from your logs:
> > >> - test.com is injected.
> > >> - test.com is fetched and parsed successfully.
> > >> - but when you run a generate again (second launch), no segment is
> > created
> > >> (because no url is selected) and your script tries to fetch and parse
> > the
> > >> first segment again. Hence the errors.
> > >>
> > >> So test.com is fetched successfully. Question remains: why is no url
> > >> selected in the second generate?
> > >> Many answers possible. Can you tell us what urls you have in your
> > crawldb
> > >> after the first cycle? Perhaps no outlinks have been found / added.
> > >>
> > >> Mathijs
> > >>
> > >>
> > >>
> > >>
> > >> On Aug 7, 2012, at 16:02 , Alexei Korolev <al...@gmail.com>
> > >> wrote:
> > >>
> > >>> Hello,
> > >>>
> > >>> Yes, test.com and www.test.com exist.
> > >>> test.com does not redirect to www.test.com; it opens a page with outgoing
> > >> links
> > >>> with www. like www.test.com/page1 www.test.com/page2
> > >>>
> > >>> First launch of crawler script
> > >>>
> > >>> root@Ubuntu-1110-oneiric-64-minimal:/data/nutch# ./crawl.sh
> > >>> Injector: starting at 2012-08-07 16:00:30
> > >>> Injector: crawlDb: crawl/crawldb
> > >>> Injector: urlDir: seed.txt
> > >>> Injector: Converting injected urls to crawl db entries.
> > >>> Injector: Merging injected urls into crawl db.
> > >>> Injector: finished at 2012-08-07 16:00:32, elapsed: 00:00:02
> > >>> Generator: starting at 2012-08-07 16:00:33
> > >>> Generator: Selecting best-scoring urls due for fetch.
> > >>> Generator: filtering: true
> > >>> Generator: normalizing: true
> > >>> Generator: jobtracker is 'local', generating exactly one partition.
> > >>> Generator: Partitioning selected urls for politeness.
> > >>> Generator: segment: crawl/crawldb/segments/20120807160035
> > >>> Generator: finished at 2012-08-07 16:00:36, elapsed: 00:00:03
> > >>> Fetcher: Your 'http.agent.name' value should be listed first in
> > >>> 'http.robots.agents' property.
> > >>> Fetcher: starting at 2012-08-07 16:00:37
> > >>> Fetcher: segment: crawl/crawldb/segments/20120807160035
> > >>> Using queue mode : byHost
> > >>> Fetcher: threads: 10
> > >>> Fetcher: time-out divisor: 2
> > >>> QueueFeeder finished: total 1 records + hit by time limit :0
> > >>> Using queue mode : byHost
> > >>> Using queue mode : byHost
> > >>> -finishing thread FetcherThread, activeThreads=1
> > >>> Using queue mode : byHost
> > >>> fetching http://test.com
> > >>> -finishing thread FetcherThread, activeThreads=1
> > >>> Using queue mode : byHost
> > >>> -finishing thread FetcherThread, activeThreads=1
> > >>> Using queue mode : byHost
> > >>> -finishing thread FetcherThread, activeThreads=1
> > >>> Using queue mode : byHost
> > >>> -finishing thread FetcherThread, activeThreads=1
> > >>> Using queue mode : byHost
> > >>> -finishing thread FetcherThread, activeThreads=1
> > >>> Using queue mode : byHost
> > >>> -finishing thread FetcherThread, activeThreads=1
> > >>> Using queue mode : byHost
> > >>> -finishing thread FetcherThread, activeThreads=1
> > >>> Using queue mode : byHost
> > >>> -finishing thread FetcherThread, activeThreads=1
> > >>> Fetcher: throughput threshold: -1
> > >>> Fetcher: throughput threshold retries: 5
> > >>> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> > >>> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> > >>> -finishing thread FetcherThread, activeThreads=0
> > >>> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> > >>> -activeThreads=0
> > >>> Fetcher: finished at 2012-08-07 16:00:41, elapsed: 00:00:04
> > >>> ParseSegment: starting at 2012-08-07 16:00:41
> > >>> ParseSegment: segment: crawl/crawldb/segments/20120807160035
> > >>> Parsing: http://test.com
> > >>> ParseSegment: finished at 2012-08-07 16:00:44, elapsed: 00:00:02
> > >>> CrawlDb update: starting at 2012-08-07 16:00:44
> > >>> CrawlDb update: db: crawl/crawldb
> > >>> CrawlDb update: segments: [crawl/crawldb/segments/20120807160035]
> > >>> CrawlDb update: additions allowed: true
> > >>> CrawlDb update: URL normalizing: false
> > >>> CrawlDb update: URL filtering: false
> > >>> CrawlDb update: 404 purging: false
> > >>> CrawlDb update: Merging segment data into db.
> > >>> CrawlDb update: finished at 2012-08-07 16:00:45, elapsed: 00:00:01
> > >>> LinkDb: starting at 2012-08-07 16:00:46
> > >>> LinkDb: linkdb: crawl/crawldb/linkdb
> > >>> LinkDb: URL normalize: true
> > >>> LinkDb: URL filter: true
> > >>> LinkDb: adding segment:
> > >>> file:/data/nutch/crawl/crawldb/segments/20120807160035
> > >>> LinkDb: finished at 2012-08-07 16:00:47, elapsed: 00:00:01
> > >>>
> > >>> Second launch of script
> > >>>
> > >>> root@Ubuntu-1110-oneiric-64-minimal:/data/nutch# ./crawl.sh
> > >>> Injector: starting at 2012-08-07 16:01:30
> > >>> Injector: crawlDb: crawl/crawldb
> > >>> Injector: urlDir: seed.txt
> > >>> Injector: Converting injected urls to crawl db entries.
> > >>> Injector: Merging injected urls into crawl db.
> > >>> Injector: finished at 2012-08-07 16:01:32, elapsed: 00:00:02
> > >>> Generator: starting at 2012-08-07 16:01:33
> > >>> Generator: Selecting best-scoring urls due for fetch.
> > >>> Generator: filtering: true
> > >>> Generator: normalizing: true
> > >>> Generator: jobtracker is 'local', generating exactly one partition.
> > >>> Generator: 0 records selected for fetching, exiting ...
> > >>> Fetcher: Your 'http.agent.name' value should be listed first in
> > >>> 'http.robots.agents' property.
> > >>> Fetcher: starting at 2012-08-07 16:01:35
> > >>> Fetcher: segment: crawl/crawldb/segments/20120807160035
> > >>> Fetcher: java.io.IOException: Segment already fetched!
> > >>>    at
> > >>>
> > >>
> > org.apache.nutch.fetcher.FetcherOutputFormat.checkOutputSpecs(FetcherOutputFormat.java:58)
> > >>>    at
> > >>>
> > org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:772)
> > >>>    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
> > >>>    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
> > >>>    at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1204)
> > >>>    at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1240)
> > >>>    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> > >>>    at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1213)
> > >>>
> > >>> ParseSegment: starting at 2012-08-07 16:01:35
> > >>> ParseSegment: segment: crawl/crawldb/segments/20120807160035
> > >>> Exception in thread "main" java.io.IOException: Segment already parsed!
> > >>>    at
> > >>>
> > >>
> > org.apache.nutch.parse.ParseOutputFormat.checkOutputSpecs(ParseOutputFormat.java:87)
> > >>>    at
> > >>>
> > org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:772)
> > >>>    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
> > >>>    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
> > >>>    at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:157)
> > >>>    at org.apache.nutch.parse.ParseSegment.run(ParseSegment.java:178)
> > >>>    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> > >>>    at org.apache.nutch.parse.ParseSegment.main(ParseSegment.java:164)
> > >>> CrawlDb update: starting at 2012-08-07 16:01:36
> > >>> CrawlDb update: db: crawl/crawldb
> > >>> CrawlDb update: segments: [crawl/crawldb/segments/20120807160035]
> > >>> CrawlDb update: additions allowed: true
> > >>> CrawlDb update: URL normalizing: false
> > >>> CrawlDb update: URL filtering: false
> > >>> CrawlDb update: 404 purging: false
> > >>> CrawlDb update: Merging segment data into db.
> > >>> CrawlDb update: finished at 2012-08-07 16:01:37, elapsed: 00:00:01
> > >>> LinkDb: starting at 2012-08-07 16:01:37
> > >>> LinkDb: linkdb: crawl/crawldb/linkdb
> > >>> LinkDb: URL normalize: true
> > >>> LinkDb: URL filter: true
> > >>> LinkDb: adding segment:
> > >>> file:/data/nutch/crawl/crawldb/segments/20120807160035
> > >>> LinkDb: merging with existing linkdb: crawl/crawldb/linkdb
> > >>> LinkDb: finished at 2012-08-07 16:01:40, elapsed: 00:00:02
> > >>>
> > >>>
> > >>> But when seed.txt have www.test.com instead test.com second launch of
> > >>> crawler script found next segment for fetching.
> > >>>
> > >>> On Sat, Aug 4, 2012 at 7:33 PM, Mathijs Homminga <
> > >>> mathijs.homminga@kalooga.com> wrote:
> > >>>
> > >>>> What do you mean exactly with "it falls on fetch phase"?
> > >>>> Do  you get an error?
> > >>>> Does "test.com" exist?
> > >>>> Does it perhaps redirect to "www.test.com"?
> > >>>> ...
> > >>>>
> > >>>> Mathijs
> > >>>>
> > >>>>
> > >>>> On Aug 4, 2012, at 17:11 , Alexei Korolev <al...@gmail.com>
> > >>>> wrote:
> > >>>>
> > >>>>> yes
> > >>>>>
> > >>>>> On Sat, Aug 4, 2012 at 6:11 PM, Lewis John Mcgibbney <
> > >>>>> lewis.mcgibbney@gmail.com> wrote:
> > >>>>>
> > >>>>>> http://   ?
> > >>>>>>
> > >>>>>> hth
> > >>>>>>
> > >>>>>> On Fri, Aug 3, 2012 at 9:53 AM, Alexei Korolev <
> > >>>> alexei.korolev@gmail.com>
> > >>>>>> wrote:
> > >>>>>>> Hello,
> > >>>>>>>
> > >>>>>>> I have small script
> > >>>>>>>
> > >>>>>>> $NUTCH_PATH inject crawl/crawldb seed.txt
> > >>>>>>> $NUTCH_PATH generate crawl/crawldb crawl/crawldb/segments -adddays
> > 0
> > >>>>>>>
> > >>>>>>> s1=`ls -d crawl/crawldb/segments/* | tail -1`
> > >>>>>>> $NUTCH_PATH fetch $s1
> > >>>>>>> $NUTCH_PATH parse $s1
> > >>>>>>> $NUTCH_PATH updatedb crawl/crawldb $s1
> > >>>>>>>
> > >>>>>>> In seed.txt I have just one site, for example "test.com". When I
> > >> start
> > >>>>>>> script it falls on fetch phase.
> > >>>>>>> If I change test.com on www.test.com it works fine. Seems the
> > >> reason,
> > >>>>>> that
> > >>>>>>> outgoing link on test.com all have www. prefix.
> > >>>>>>> What I need to change in nutch config for work with test.com?
> > >>>>>>>
> > >>>>>>> Thank you in advance. I hope my explanation is clear :)
> > >>>>>>>
> > >>>>>>> --
> > >>>>>>> Alexei A. Korolev
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>> --
> > >>>>>> Lewis
> > >>>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> --
> > >>>>> Alexei A. Korolev
> > >>>>
> > >>>>
> > >>>
> > >>>
> > >>> --
> > >>> Alexei A. Korolev
> > >>
> > >>
> > >
> > >
> >
> >
> 
> 
> -- 
> Alexei A. Korolev
> 

Re: crawling site without www

Posted by Alexei Korolev <al...@gmail.com>.
Hi, Sebastian

Seems you are right. I have db.ignore.external.links set to true.
But how do I configure Nutch to process mobile365.ru and www.mobile365.ru as a
single site?

Thanks.

On Tue, Aug 7, 2012 at 10:58 PM, Sebastian Nagel <wastl.nagel@googlemail.com
> wrote:

> Hi Alexei,
>
> I tried a crawl with your script fragment and Nutch 1.5.1
> and the URLs http://mobile365.ru as seed. It worked,
> see annotated log below.
>
> Which version of Nutch do you use?
>
> Check the property db.ignore.external.links (default is false).
> If true the link from mobile365.ru to www.mobile365.ru
> is skipped.
>
> Look into your crawldb (bin/nutch readdb)
>
> Check your URL filters with
>  bin/nutch org.apache.nutch.net.URLFilterChecker
>
> Finally, send the nutch-site.xml and every configuration
> file you changed.
>
> Good luck,
> Sebastian
>
> % nutch inject crawl/crawldb seed.txt
> Injector: starting at 2012-08-07 20:31:00
> Injector: crawlDb: crawl/crawldb
> Injector: urlDir: seed.txt
> Injector: Converting injected urls to crawl db entries.
> Injector: Merging injected urls into crawl db.
> Injector: finished at 2012-08-07 20:31:15, elapsed: 00:00:15
>
> % nutch generate crawl/crawldb crawl/crawldb/segments -adddays 0
> Generator: starting at 2012-08-07 20:31:23
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: true
> Generator: normalizing: true
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls for politeness.
> Generator: segment: crawl/crawldb/segments/20120807203131
> Generator: finished at 2012-08-07 20:31:39, elapsed: 00:00:15
>
> # Note: personally, I would prefer not to place segments (also linkdb)
> #       in the crawldb/ folder.
>
> % s1=`ls -d crawl/crawldb/segments/* | tail -1`
>
> % nutch fetch $s1
> Fetcher: starting at 2012-08-07 20:32:00
> Fetcher: segment: crawl/crawldb/segments/20120807203131
> Using queue mode : byHost
> Fetcher: threads: 10
> Fetcher: time-out divisor: 2
> QueueFeeder finished: total 1 records + hit by time limit :0
> Using queue mode : byHost
> fetching http://mobile365.ru/
> Using queue mode : byHost
> Using queue mode : byHost
> -finishing thread FetcherThread, activeThreads=1
> Using queue mode : byHost
> -finishing thread FetcherThread, activeThreads=1
> Using queue mode : byHost
> -finishing thread FetcherThread, activeThreads=1
> Using queue mode : byHost
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=1
> Using queue mode : byHost
> -finishing thread FetcherThread, activeThreads=1
> Using queue mode : byHost
> -finishing thread FetcherThread, activeThreads=1
> Using queue mode : byHost
> Using queue mode : byHost
> Fetcher: throughput threshold: -1
> -finishing thread FetcherThread, activeThreads=1
> Fetcher: throughput threshold retries: 5
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=0
> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=0
> Fetcher: finished at 2012-08-07 20:32:08, elapsed: 00:00:07
>
> % nutch parse $s1
> ParseSegment: starting at 2012-08-07 20:32:12
> ParseSegment: segment: crawl/crawldb/segments/20120807203131
> Parsed (10ms):http://mobile365.ru/
> ParseSegment: finished at 2012-08-07 20:32:20, elapsed: 00:00:07
>
> % nutch updatedb crawl/crawldb/ $s1
> CrawlDb update: starting at 2012-08-07 20:32:24
> CrawlDb update: db: crawl/crawldb
> CrawlDb update: segments: [crawl/crawldb/segments/20120807203131]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: false
> CrawlDb update: URL filtering: false
> CrawlDb update: 404 purging: false
> CrawlDb update: Merging segment data into db.
> CrawlDb update: finished at 2012-08-07 20:32:38, elapsed: 00:00:13
>
> # see whether the outlink is now in crawldb:
> % nutch readdb crawl/crawldb/ -stats
> CrawlDb statistics start: crawl/crawldb/
> Statistics for CrawlDb: crawl/crawldb/
> TOTAL urls:     2
> retry 0:        2
> min score:      1.0
> avg score:      1.0
> max score:      1.0
> status 1 (db_unfetched):        1
> status 2 (db_fetched):  1
> CrawlDb statistics: done
> # => yes: http://mobile365.ru/ is fetched, outlink found
>
> %nutch generate crawl/crawldb crawl/crawldb/segments -adddays 0
> Generator: starting at 2012-08-07 20:32:58
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: true
> Generator: normalizing: true
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls for politeness.
> Generator: segment: crawl/crawldb/segments/20120807203307
> Generator: finished at 2012-08-07 20:33:14, elapsed: 00:00:15
>
> % s1=`ls -d crawl/crawldb/segments/* | tail -1`
>
> % nutch fetch $s1
> Fetcher: starting at 2012-08-07 20:33:34
> Fetcher: segment: crawl/crawldb/segments/20120807203307
> Using queue mode : byHost
> Fetcher: threads: 10
> Fetcher: time-out divisor: 2
> QueueFeeder finished: total 1 records + hit by time limit :0
> Using queue mode : byHost
> fetching http://www.mobile365.ru/test.html
> # got it
>
>
> On 08/07/2012 04:37 PM, Alexei Korolev wrote:
> > Hi,
> >
> > I made simple example
> >
> > Put in seed.txt
> > http://mobile365.ru
> >
> > It will produce error.
> >
> > Put in seed.txt
> > http://www.mobile365.ru
> >
> > and second launch of crawler script will work fine and fetch
> > http://www.mobile365.ru/test.html page.
> >
> > On Tue, Aug 7, 2012 at 6:23 PM, Mathijs Homminga <
> > mathijs.homminga@kalooga.com> wrote:
> >
> >> Hi,
> >>
> >> I read from your logs:
> >> - test.com is injected.
> >> - test.com is fetched and parsed successfully.
> >> - but when you run a generate again (second launch), no segment is
> created
> >> (because no url is selected) and your script tries to fetch and parse
> the
> >> first segment again. Hence the errors.
> >>
> >> So test.com is fetched successfully. Question remains: why is no url
> >> selected in the second generate?
> >> Many answers possible. Can you tell us what urls you have in your
> crawldb
> >> after the first cycle? Perhaps no outlinks have been found / added.
> >>
> >> Mathijs
> >>
> >>
> >>
> >>
> >> On Aug 7, 2012, at 16:02 , Alexei Korolev <al...@gmail.com>
> >> wrote:
> >>
> >>> Hello,
> >>>
> >>> Yes, test.com and www.test.com exist.
> >>> test.com does not redirect to www.test.com; it opens a page with outgoing
> >> links
> >>> with www. like www.test.com/page1 www.test.com/page2
> >>>
> >>> First launch of crawler script
> >>>
> >>> root@Ubuntu-1110-oneiric-64-minimal:/data/nutch# ./crawl.sh
> >>> Injector: starting at 2012-08-07 16:00:30
> >>> Injector: crawlDb: crawl/crawldb
> >>> Injector: urlDir: seed.txt
> >>> Injector: Converting injected urls to crawl db entries.
> >>> Injector: Merging injected urls into crawl db.
> >>> Injector: finished at 2012-08-07 16:00:32, elapsed: 00:00:02
> >>> Generator: starting at 2012-08-07 16:00:33
> >>> Generator: Selecting best-scoring urls due for fetch.
> >>> Generator: filtering: true
> >>> Generator: normalizing: true
> >>> Generator: jobtracker is 'local', generating exactly one partition.
> >>> Generator: Partitioning selected urls for politeness.
> >>> Generator: segment: crawl/crawldb/segments/20120807160035
> >>> Generator: finished at 2012-08-07 16:00:36, elapsed: 00:00:03
> >>> Fetcher: Your 'http.agent.name' value should be listed first in
> >>> 'http.robots.agents' property.
> >>> Fetcher: starting at 2012-08-07 16:00:37
> >>> Fetcher: segment: crawl/crawldb/segments/20120807160035
> >>> Using queue mode : byHost
> >>> Fetcher: threads: 10
> >>> Fetcher: time-out divisor: 2
> >>> QueueFeeder finished: total 1 records + hit by time limit :0
> >>> Using queue mode : byHost
> >>> Using queue mode : byHost
> >>> -finishing thread FetcherThread, activeThreads=1
> >>> Using queue mode : byHost
> >>> fetching http://test.com
> >>> -finishing thread FetcherThread, activeThreads=1
> >>> Using queue mode : byHost
> >>> -finishing thread FetcherThread, activeThreads=1
> >>> Using queue mode : byHost
> >>> -finishing thread FetcherThread, activeThreads=1
> >>> Using queue mode : byHost
> >>> -finishing thread FetcherThread, activeThreads=1
> >>> Using queue mode : byHost
> >>> -finishing thread FetcherThread, activeThreads=1
> >>> Using queue mode : byHost
> >>> -finishing thread FetcherThread, activeThreads=1
> >>> Using queue mode : byHost
> >>> -finishing thread FetcherThread, activeThreads=1
> >>> Using queue mode : byHost
> >>> -finishing thread FetcherThread, activeThreads=1
> >>> Fetcher: throughput threshold: -1
> >>> Fetcher: throughput threshold retries: 5
> >>> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> >>> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> >>> -finishing thread FetcherThread, activeThreads=0
> >>> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> >>> -activeThreads=0
> >>> Fetcher: finished at 2012-08-07 16:00:41, elapsed: 00:00:04
> >>> ParseSegment: starting at 2012-08-07 16:00:41
> >>> ParseSegment: segment: crawl/crawldb/segments/20120807160035
> >>> Parsing: http://test.com
> >>> ParseSegment: finished at 2012-08-07 16:00:44, elapsed: 00:00:02
> >>> CrawlDb update: starting at 2012-08-07 16:00:44
> >>> CrawlDb update: db: crawl/crawldb
> >>> CrawlDb update: segments: [crawl/crawldb/segments/20120807160035]
> >>> CrawlDb update: additions allowed: true
> >>> CrawlDb update: URL normalizing: false
> >>> CrawlDb update: URL filtering: false
> >>> CrawlDb update: 404 purging: false
> >>> CrawlDb update: Merging segment data into db.
> >>> CrawlDb update: finished at 2012-08-07 16:00:45, elapsed: 00:00:01
> >>> LinkDb: starting at 2012-08-07 16:00:46
> >>> LinkDb: linkdb: crawl/crawldb/linkdb
> >>> LinkDb: URL normalize: true
> >>> LinkDb: URL filter: true
> >>> LinkDb: adding segment:
> >>> file:/data/nutch/crawl/crawldb/segments/20120807160035
> >>> LinkDb: finished at 2012-08-07 16:00:47, elapsed: 00:00:01
> >>>
> >>> Second launch of script
> >>>
> >>> root@Ubuntu-1110-oneiric-64-minimal:/data/nutch# ./crawl.sh
> >>> Injector: starting at 2012-08-07 16:01:30
> >>> Injector: crawlDb: crawl/crawldb
> >>> Injector: urlDir: seed.txt
> >>> Injector: Converting injected urls to crawl db entries.
> >>> Injector: Merging injected urls into crawl db.
> >>> Injector: finished at 2012-08-07 16:01:32, elapsed: 00:00:02
> >>> Generator: starting at 2012-08-07 16:01:33
> >>> Generator: Selecting best-scoring urls due for fetch.
> >>> Generator: filtering: true
> >>> Generator: normalizing: true
> >>> Generator: jobtracker is 'local', generating exactly one partition.
> >>> Generator: 0 records selected for fetching, exiting ...
> >>> Fetcher: Your 'http.agent.name' value should be listed first in
> >>> 'http.robots.agents' property.
> >>> Fetcher: starting at 2012-08-07 16:01:35
> >>> Fetcher: segment: crawl/crawldb/segments/20120807160035
> >>> Fetcher: java.io.IOException: Segment already fetched!
> >>>    at
> >>>
> >>
> org.apache.nutch.fetcher.FetcherOutputFormat.checkOutputSpecs(FetcherOutputFormat.java:58)
> >>>    at
> >>>
> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:772)
> >>>    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
> >>>    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
> >>>    at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1204)
> >>>    at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1240)
> >>>    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >>>    at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1213)
> >>>
> >>> ParseSegment: starting at 2012-08-07 16:01:35
> >>> ParseSegment: segment: crawl/crawldb/segments/20120807160035
> >>> Exception in thread "main" java.io.IOException: Segment already parsed!
> >>>    at
> >>>
> >>
> org.apache.nutch.parse.ParseOutputFormat.checkOutputSpecs(ParseOutputFormat.java:87)
> >>>    at
> >>>
> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:772)
> >>>    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
> >>>    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
> >>>    at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:157)
> >>>    at org.apache.nutch.parse.ParseSegment.run(ParseSegment.java:178)
> >>>    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >>>    at org.apache.nutch.parse.ParseSegment.main(ParseSegment.java:164)
> >>> CrawlDb update: starting at 2012-08-07 16:01:36
> >>> CrawlDb update: db: crawl/crawldb
> >>> CrawlDb update: segments: [crawl/crawldb/segments/20120807160035]
> >>> CrawlDb update: additions allowed: true
> >>> CrawlDb update: URL normalizing: false
> >>> CrawlDb update: URL filtering: false
> >>> CrawlDb update: 404 purging: false
> >>> CrawlDb update: Merging segment data into db.
> >>> CrawlDb update: finished at 2012-08-07 16:01:37, elapsed: 00:00:01
> >>> LinkDb: starting at 2012-08-07 16:01:37
> >>> LinkDb: linkdb: crawl/crawldb/linkdb
> >>> LinkDb: URL normalize: true
> >>> LinkDb: URL filter: true
> >>> LinkDb: adding segment:
> >>> file:/data/nutch/crawl/crawldb/segments/20120807160035
> >>> LinkDb: merging with existing linkdb: crawl/crawldb/linkdb
> >>> LinkDb: finished at 2012-08-07 16:01:40, elapsed: 00:00:02
> >>>
> >>>
> >>> But when seed.txt have www.test.com instead test.com second launch of
> >>> crawler script found next segment for fetching.
> >>>
> >>> On Sat, Aug 4, 2012 at 7:33 PM, Mathijs Homminga <
> >>> mathijs.homminga@kalooga.com> wrote:
> >>>
> >>>> What do you mean exactly with "it falls on fetch phase"?
> >>>> Do  you get an error?
> >>>> Does "test.com" exist?
> >>>> Does it perhaps redirect to "www.test.com"?
> >>>> ...
> >>>>
> >>>> Mathijs
> >>>>
> >>>>
> >>>> On Aug 4, 2012, at 17:11 , Alexei Korolev <al...@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> yes
> >>>>>
> >>>>> On Sat, Aug 4, 2012 at 6:11 PM, Lewis John Mcgibbney <
> >>>>> lewis.mcgibbney@gmail.com> wrote:
> >>>>>
> >>>>>> http://   ?
> >>>>>>
> >>>>>> hth
> >>>>>>
> >>>>>> On Fri, Aug 3, 2012 at 9:53 AM, Alexei Korolev <
> >>>> alexei.korolev@gmail.com>
> >>>>>> wrote:
> >>>>>>> Hello,
> >>>>>>>
> >>>>>>> I have small script
> >>>>>>>
> >>>>>>> $NUTCH_PATH inject crawl/crawldb seed.txt
> >>>>>>> $NUTCH_PATH generate crawl/crawldb crawl/crawldb/segments -adddays
> 0
> >>>>>>>
> >>>>>>> s1=`ls -d crawl/crawldb/segments/* | tail -1`
> >>>>>>> $NUTCH_PATH fetch $s1
> >>>>>>> $NUTCH_PATH parse $s1
> >>>>>>> $NUTCH_PATH updatedb crawl/crawldb $s1
> >>>>>>>
> >>>>>>> In seed.txt I have just one site, for example "test.com". When I
> >> start
> >>>>>>> script it falls on fetch phase.
> >>>>>>> If I change test.com on www.test.com it works fine. Seems the
> >> reason,
> >>>>>> that
> >>>>>>> outgoing link on test.com all have www. prefix.
> >>>>>>> What I need to change in nutch config for work with test.com?
> >>>>>>>
> >>>>>>> Thank you in advance. I hope my explanation is clear :)
> >>>>>>>
> >>>>>>> --
> >>>>>>> Alexei A. Korolev
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> --
> >>>>>> Lewis
> >>>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> Alexei A. Korolev
> >>>>
> >>>>
> >>>
> >>>
> >>> --
> >>> Alexei A. Korolev
> >>
> >>
> >
> >
>
>


-- 
Alexei A. Korolev

Re: crawling site without www

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Alexei,

I tried a crawl with your script fragment and Nutch 1.5.1
and the URL http://mobile365.ru as seed. It worked;
see the annotated log below.

Which version of Nutch do you use?

Check the property db.ignore.external.links (default is false).
If true the link from mobile365.ru to www.mobile365.ru
is skipped.
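
If that turns out to be the cause, the override belongs in nutch-site.xml, e.g.:

<property>
  <name>db.ignore.external.links</name>
  <value>false</value>
</property>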

Look into your crawldb (bin/nutch readdb)

Check your URL filters with
 bin/nutch org.apache.nutch.net.URLFilterChecker
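
The checker reads URLs from stdin and prints a + or - for each; a typical call
could be (the -allCombined option is from memory, check the usage message of
your version):

 echo "http://mobile365.ru/" | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined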

Finally, send the nutch-site.xml and every configuration
file you changed.

Good luck,
Sebastian

% nutch inject crawl/crawldb seed.txt
Injector: starting at 2012-08-07 20:31:00
Injector: crawlDb: crawl/crawldb
Injector: urlDir: seed.txt
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2012-08-07 20:31:15, elapsed: 00:00:15

% nutch generate crawl/crawldb crawl/crawldb/segments -adddays 0
Generator: starting at 2012-08-07 20:31:23
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/crawldb/segments/20120807203131
Generator: finished at 2012-08-07 20:31:39, elapsed: 00:00:15

# Note: personally, I would prefer not to place segments (also linkdb)
#       in the crawldb/ folder.

% s1=`ls -d crawl/crawldb/segments/* | tail -1`

% nutch fetch $s1
Fetcher: starting at 2012-08-07 20:32:00
Fetcher: segment: crawl/crawldb/segments/20120807203131
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records + hit by time limit :0
Using queue mode : byHost
fetching http://mobile365.ru/
Using queue mode : byHost
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Using queue mode : byHost
Fetcher: throughput threshold: -1
-finishing thread FetcherThread, activeThreads=1
Fetcher: throughput threshold retries: 5
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2012-08-07 20:32:08, elapsed: 00:00:07

% nutch parse $s1
ParseSegment: starting at 2012-08-07 20:32:12
ParseSegment: segment: crawl/crawldb/segments/20120807203131
Parsed (10ms):http://mobile365.ru/
ParseSegment: finished at 2012-08-07 20:32:20, elapsed: 00:00:07

% nutch updatedb crawl/crawldb/ $s1
CrawlDb update: starting at 2012-08-07 20:32:24
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/crawldb/segments/20120807203131]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2012-08-07 20:32:38, elapsed: 00:00:13

# see whether the outlink is now in crawldb:
% nutch readdb crawl/crawldb/ -stats
CrawlDb statistics start: crawl/crawldb/
Statistics for CrawlDb: crawl/crawldb/
TOTAL urls:     2
retry 0:        2
min score:      1.0
avg score:      1.0
max score:      1.0
status 1 (db_unfetched):        1
status 2 (db_fetched):  1
CrawlDb statistics: done
# => yes: http://mobile365.ru/ is fetched, outlink found

%nutch generate crawl/crawldb crawl/crawldb/segments -adddays 0
Generator: starting at 2012-08-07 20:32:58
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/crawldb/segments/20120807203307
Generator: finished at 2012-08-07 20:33:14, elapsed: 00:00:15

% s1=`ls -d crawl/crawldb/segments/* | tail -1`

% nutch fetch $s1
Fetcher: starting at 2012-08-07 20:33:34
Fetcher: segment: crawl/crawldb/segments/20120807203307
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records + hit by time limit :0
Using queue mode : byHost
fetching http://www.mobile365.ru/test.html
# got it


On 08/07/2012 04:37 PM, Alexei Korolev wrote:
> Hi,
> 
> I made simple example
> 
> Put in seed.txt
> http://mobile365.ru
> 
> It will produce error.
>
> Put in seed.txt
> http://www.mobile365.ru
> 
> and second launch of crawler script will work fine and fetch
> http://www.mobile365.ru/test.html page.
> 
> On Tue, Aug 7, 2012 at 6:23 PM, Mathijs Homminga <
> mathijs.homminga@kalooga.com> wrote:
> 
>> Hi,
>>
>> I read from your logs:
>> - test.com is injected.
>> - test.com is fetched and parsed successfully.
>> - but when you run a generate again (second launch), no segment is created
>> (because no url is selected) and your script tries to fetch and parse the
>> first segment again. Hence the errors.
>>
>> So test.com is fetched successfully. Question remains: why is no url
>> selected in the second generate?
>> Many answers possible. Can you tell us what urls you have in your crawldb
>> after the first cycle? Perhaps no outlinks have been found / added.
>>
>> Mathijs
>>
>>
>>
>>
>> On Aug 7, 2012, at 16:02 , Alexei Korolev <al...@gmail.com>
>> wrote:
>>
>>> Hello,
>>>
>>> Yes, test.com and www.test.com exist.
>>> test.com does not redirect to www.test.com; it opens a page with outgoing
>> links
>>> with www. like www.test.com/page1 www.test.com/page2
>>>
>>> First launch of crawler script
>>>
>>> root@Ubuntu-1110-oneiric-64-minimal:/data/nutch# ./crawl.sh
>>> Injector: starting at 2012-08-07 16:00:30
>>> Injector: crawlDb: crawl/crawldb
>>> Injector: urlDir: seed.txt
>>> Injector: Converting injected urls to crawl db entries.
>>> Injector: Merging injected urls into crawl db.
>>> Injector: finished at 2012-08-07 16:00:32, elapsed: 00:00:02
>>> Generator: starting at 2012-08-07 16:00:33
>>> Generator: Selecting best-scoring urls due for fetch.
>>> Generator: filtering: true
>>> Generator: normalizing: true
>>> Generator: jobtracker is 'local', generating exactly one partition.
>>> Generator: Partitioning selected urls for politeness.
>>> Generator: segment: crawl/crawldb/segments/20120807160035
>>> Generator: finished at 2012-08-07 16:00:36, elapsed: 00:00:03
>>> Fetcher: Your 'http.agent.name' value should be listed first in
>>> 'http.robots.agents' property.
>>> Fetcher: starting at 2012-08-07 16:00:37
>>> Fetcher: segment: crawl/crawldb/segments/20120807160035
>>> Using queue mode : byHost
>>> Fetcher: threads: 10
>>> Fetcher: time-out divisor: 2
>>> QueueFeeder finished: total 1 records + hit by time limit :0
>>> Using queue mode : byHost
>>> Using queue mode : byHost
>>> -finishing thread FetcherThread, activeThreads=1
>>> Using queue mode : byHost
>>> fetching http://test.com
>>> -finishing thread FetcherThread, activeThreads=1
>>> Using queue mode : byHost
>>> -finishing thread FetcherThread, activeThreads=1
>>> Using queue mode : byHost
>>> -finishing thread FetcherThread, activeThreads=1
>>> Using queue mode : byHost
>>> -finishing thread FetcherThread, activeThreads=1
>>> Using queue mode : byHost
>>> -finishing thread FetcherThread, activeThreads=1
>>> Using queue mode : byHost
>>> -finishing thread FetcherThread, activeThreads=1
>>> Using queue mode : byHost
>>> -finishing thread FetcherThread, activeThreads=1
>>> Using queue mode : byHost
>>> -finishing thread FetcherThread, activeThreads=1
>>> Fetcher: throughput threshold: -1
>>> Fetcher: throughput threshold retries: 5
>>> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
>>> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
>>> -finishing thread FetcherThread, activeThreads=0
>>> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
>>> -activeThreads=0
>>> Fetcher: finished at 2012-08-07 16:00:41, elapsed: 00:00:04
>>> ParseSegment: starting at 2012-08-07 16:00:41
>>> ParseSegment: segment: crawl/crawldb/segments/20120807160035
>>> Parsing: http://test.com
>>> ParseSegment: finished at 2012-08-07 16:00:44, elapsed: 00:00:02
>>> CrawlDb update: starting at 2012-08-07 16:00:44
>>> CrawlDb update: db: crawl/crawldb
>>> CrawlDb update: segments: [crawl/crawldb/segments/20120807160035]
>>> CrawlDb update: additions allowed: true
>>> CrawlDb update: URL normalizing: false
>>> CrawlDb update: URL filtering: false
>>> CrawlDb update: 404 purging: false
>>> CrawlDb update: Merging segment data into db.
>>> CrawlDb update: finished at 2012-08-07 16:00:45, elapsed: 00:00:01
>>> LinkDb: starting at 2012-08-07 16:00:46
>>> LinkDb: linkdb: crawl/crawldb/linkdb
>>> LinkDb: URL normalize: true
>>> LinkDb: URL filter: true
>>> LinkDb: adding segment:
>>> file:/data/nutch/crawl/crawldb/segments/20120807160035
>>> LinkDb: finished at 2012-08-07 16:00:47, elapsed: 00:00:01
>>>
>>> Second launch of script
>>>
>>> root@Ubuntu-1110-oneiric-64-minimal:/data/nutch# ./crawl.sh
>>> Injector: starting at 2012-08-07 16:01:30
>>> Injector: crawlDb: crawl/crawldb
>>> Injector: urlDir: seed.txt
>>> Injector: Converting injected urls to crawl db entries.
>>> Injector: Merging injected urls into crawl db.
>>> Injector: finished at 2012-08-07 16:01:32, elapsed: 00:00:02
>>> Generator: starting at 2012-08-07 16:01:33
>>> Generator: Selecting best-scoring urls due for fetch.
>>> Generator: filtering: true
>>> Generator: normalizing: true
>>> Generator: jobtracker is 'local', generating exactly one partition.
>>> Generator: 0 records selected for fetching, exiting ...
>>> Fetcher: Your 'http.agent.name' value should be listed first in
>>> 'http.robots.agents' property.
>>> Fetcher: starting at 2012-08-07 16:01:35
>>> Fetcher: segment: crawl/crawldb/segments/20120807160035
>>> Fetcher: java.io.IOException: Segment already fetched!
>>>    at
>>>
>> org.apache.nutch.fetcher.FetcherOutputFormat.checkOutputSpecs(FetcherOutputFormat.java:58)
>>>    at
>>> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:772)
>>>    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>>>    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>>>    at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1204)
>>>    at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1240)
>>>    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>    at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1213)
>>>
>>> ParseSegment: starting at 2012-08-07 16:01:35
>>> ParseSegment: segment: crawl/crawldb/segments/20120807160035
>>> Exception in thread "main" java.io.IOException: Segment already parsed!
>>>    at
>>>
>> org.apache.nutch.parse.ParseOutputFormat.checkOutputSpecs(ParseOutputFormat.java:87)
>>>    at
>>> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:772)
>>>    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>>>    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>>>    at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:157)
>>>    at org.apache.nutch.parse.ParseSegment.run(ParseSegment.java:178)
>>>    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>    at org.apache.nutch.parse.ParseSegment.main(ParseSegment.java:164)
>>> CrawlDb update: starting at 2012-08-07 16:01:36
>>> CrawlDb update: db: crawl/crawldb
>>> CrawlDb update: segments: [crawl/crawldb/segments/20120807160035]
>>> CrawlDb update: additions allowed: true
>>> CrawlDb update: URL normalizing: false
>>> CrawlDb update: URL filtering: false
>>> CrawlDb update: 404 purging: false
>>> CrawlDb update: Merging segment data into db.
>>> CrawlDb update: finished at 2012-08-07 16:01:37, elapsed: 00:00:01
>>> LinkDb: starting at 2012-08-07 16:01:37
>>> LinkDb: linkdb: crawl/crawldb/linkdb
>>> LinkDb: URL normalize: true
>>> LinkDb: URL filter: true
>>> LinkDb: adding segment:
>>> file:/data/nutch/crawl/crawldb/segments/20120807160035
>>> LinkDb: merging with existing linkdb: crawl/crawldb/linkdb
>>> LinkDb: finished at 2012-08-07 16:01:40, elapsed: 00:00:02
>>>
>>>
>>> But when seed.txt have www.test.com instead test.com second launch of
>>> crawler script found next segment for fetching.
>>>
>>> On Sat, Aug 4, 2012 at 7:33 PM, Mathijs Homminga <
>>> mathijs.homminga@kalooga.com> wrote:
>>>
>>>> What do you mean exactly with "it falls on fetch phase"?
>>>> Do  you get an error?
>>>> Does "test.com" exist?
>>>> Does it perhaps redirect to "www.test.com"?
>>>> ...
>>>>
>>>> Mathijs
>>>>
>>>>
>>>> On Aug 4, 2012, at 17:11 , Alexei Korolev <al...@gmail.com>
>>>> wrote:
>>>>
>>>>> yes
>>>>>
>>>>> On Sat, Aug 4, 2012 at 6:11 PM, Lewis John Mcgibbney <
>>>>> lewis.mcgibbney@gmail.com> wrote:
>>>>>
>>>>>> http://   ?
>>>>>>
>>>>>> hth
>>>>>>
>>>>>> On Fri, Aug 3, 2012 at 9:53 AM, Alexei Korolev <
>>>> alexei.korolev@gmail.com>
>>>>>> wrote:
>>>>>>> Hello,
>>>>>>>
>>>>>>> I have small script
>>>>>>>
>>>>>>> $NUTCH_PATH inject crawl/crawldb seed.txt
>>>>>>> $NUTCH_PATH generate crawl/crawldb crawl/crawldb/segments -adddays 0
>>>>>>>
>>>>>>> s1=`ls -d crawl/crawldb/segments/* | tail -1`
>>>>>>> $NUTCH_PATH fetch $s1
>>>>>>> $NUTCH_PATH parse $s1
>>>>>>> $NUTCH_PATH updatedb crawl/crawldb $s1
>>>>>>>
>>>>>>> In seed.txt I have just one site, for example "test.com". When I
>> start
>>>>>>> script it falls on fetch phase.
>>>>>>> If I change test.com on www.test.com it works fine. Seems the
>> reason,
>>>>>> that
>>>>>>> outgoing link on test.com all have www. prefix.
>>>>>>> What I need to change in nutch config for work with test.com?
>>>>>>>
>>>>>>> Thank you in advance. I hope my explanation is clear :)
>>>>>>>
>>>>>>> --
>>>>>>> Alexei A. Korolev
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Lewis
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Alexei A. Korolev
>>>>
>>>>
>>>
>>>
>>> --
>>> Alexei A. Korolev
>>
>>
> 
> 


Re: crawling site without www

Posted by Alexei Korolev <al...@gmail.com>.
Hi,

I made a simple example

Put in seed.txt
http://mobile365.ru

It will produce error.

Put in seed.txt
http://www.mobile365.ru

and a second launch of the crawler script will work fine and fetch the
http://www.mobile365.ru/test.html page.

On Tue, Aug 7, 2012 at 6:23 PM, Mathijs Homminga <
mathijs.homminga@kalooga.com> wrote:

> Hi,
>
> I read from your logs:
> - test.com is injected.
> - test.com is fetched and parsed successfully.
> - but when you run a generate again (second launch), no segment is created
> (because no url is selected) and your script tries to fetch and parse the
> first segment again. Hence the errors.
>
> So test.com is fetched successfully. Question remains: why is no url
> selected in the second generate?
> Many answers possible. Can you tell us what urls you have in your crawldb
> after the first cycle? Perhaps no outlinks have been found / added.
>
> Mathijs
>
>
>
>
> On Aug 7, 2012, at 16:02 , Alexei Korolev <al...@gmail.com>
> wrote:
>
> > Hello,
> >
> > Yes, test.com and www.test.com exist.
> > test.com does not redirect to www.test.com; it opens a page with outgoing
> links
> > with www. like www.test.com/page1 www.test.com/page2
> >
> > First launch of crawler script
> >
> > root@Ubuntu-1110-oneiric-64-minimal:/data/nutch# ./crawl.sh
> > Injector: starting at 2012-08-07 16:00:30
> > Injector: crawlDb: crawl/crawldb
> > Injector: urlDir: seed.txt
> > Injector: Converting injected urls to crawl db entries.
> > Injector: Merging injected urls into crawl db.
> > Injector: finished at 2012-08-07 16:00:32, elapsed: 00:00:02
> > Generator: starting at 2012-08-07 16:00:33
> > Generator: Selecting best-scoring urls due for fetch.
> > Generator: filtering: true
> > Generator: normalizing: true
> > Generator: jobtracker is 'local', generating exactly one partition.
> > Generator: Partitioning selected urls for politeness.
> > Generator: segment: crawl/crawldb/segments/20120807160035
> > Generator: finished at 2012-08-07 16:00:36, elapsed: 00:00:03
> > Fetcher: Your 'http.agent.name' value should be listed first in
> > 'http.robots.agents' property.
> > Fetcher: starting at 2012-08-07 16:00:37
> > Fetcher: segment: crawl/crawldb/segments/20120807160035
> > Using queue mode : byHost
> > Fetcher: threads: 10
> > Fetcher: time-out divisor: 2
> > QueueFeeder finished: total 1 records + hit by time limit :0
> > Using queue mode : byHost
> > Using queue mode : byHost
> > -finishing thread FetcherThread, activeThreads=1
> > Using queue mode : byHost
> > fetching http://test.com
> > -finishing thread FetcherThread, activeThreads=1
> > Using queue mode : byHost
> > -finishing thread FetcherThread, activeThreads=1
> > Using queue mode : byHost
> > -finishing thread FetcherThread, activeThreads=1
> > Using queue mode : byHost
> > -finishing thread FetcherThread, activeThreads=1
> > Using queue mode : byHost
> > -finishing thread FetcherThread, activeThreads=1
> > Using queue mode : byHost
> > -finishing thread FetcherThread, activeThreads=1
> > Using queue mode : byHost
> > -finishing thread FetcherThread, activeThreads=1
> > Using queue mode : byHost
> > -finishing thread FetcherThread, activeThreads=1
> > Fetcher: throughput threshold: -1
> > Fetcher: throughput threshold retries: 5
> > -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> > -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> > -finishing thread FetcherThread, activeThreads=0
> > -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> > -activeThreads=0
> > Fetcher: finished at 2012-08-07 16:00:41, elapsed: 00:00:04
> > ParseSegment: starting at 2012-08-07 16:00:41
> > ParseSegment: segment: crawl/crawldb/segments/20120807160035
> > Parsing: http://test.com
> > ParseSegment: finished at 2012-08-07 16:00:44, elapsed: 00:00:02
> > CrawlDb update: starting at 2012-08-07 16:00:44
> > CrawlDb update: db: crawl/crawldb
> > CrawlDb update: segments: [crawl/crawldb/segments/20120807160035]
> > CrawlDb update: additions allowed: true
> > CrawlDb update: URL normalizing: false
> > CrawlDb update: URL filtering: false
> > CrawlDb update: 404 purging: false
> > CrawlDb update: Merging segment data into db.
> > CrawlDb update: finished at 2012-08-07 16:00:45, elapsed: 00:00:01
> > LinkDb: starting at 2012-08-07 16:00:46
> > LinkDb: linkdb: crawl/crawldb/linkdb
> > LinkDb: URL normalize: true
> > LinkDb: URL filter: true
> > LinkDb: adding segment:
> > file:/data/nutch/crawl/crawldb/segments/20120807160035
> > LinkDb: finished at 2012-08-07 16:00:47, elapsed: 00:00:01
> >
> > Second launch of srcipt
> >
> > root@Ubuntu-1110-oneiric-64-minimal:/data/nutch# ./crawl.sh
> > Injector: starting at 2012-08-07 16:01:30
> > Injector: crawlDb: crawl/crawldb
> > Injector: urlDir: seed.txt
> > Injector: Converting injected urls to crawl db entries.
> > Injector: Merging injected urls into crawl db.
> > Injector: finished at 2012-08-07 16:01:32, elapsed: 00:00:02
> > Generator: starting at 2012-08-07 16:01:33
> > Generator: Selecting best-scoring urls due for fetch.
> > Generator: filtering: true
> > Generator: normalizing: true
> > Generator: jobtracker is 'local', generating exactly one partition.
> > Generator: 0 records selected for fetching, exiting ...
> > Fetcher: Your 'http.agent.name' value should be listed first in
> > 'http.robots.agents' property.
> > Fetcher: starting at 2012-08-07 16:01:35
> > Fetcher: segment: crawl/crawldb/segments/20120807160035
> > Fetcher: java.io.IOException: Segment already fetched!
> >    at
> >
> org.apache.nutch.fetcher.FetcherOutputFormat.checkOutputSpecs(FetcherOutputFormat.java:58)
> >    at
> > org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:772)
> >    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
> >    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
> >    at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1204)
> >    at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1240)
> >    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >    at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1213)
> >
> > ParseSegment: starting at 2012-08-07 16:01:35
> > ParseSegment: segment: crawl/crawldb/segments/20120807160035
> > Exception in thread "main" java.io.IOException: Segment already parsed!
> >    at
> >
> org.apache.nutch.parse.ParseOutputFormat.checkOutputSpecs(ParseOutputFormat.java:87)
> >    at
> > org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:772)
> >    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
> >    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
> >    at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:157)
> >    at org.apache.nutch.parse.ParseSegment.run(ParseSegment.java:178)
> >    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >    at org.apache.nutch.parse.ParseSegment.main(ParseSegment.java:164)
> > CrawlDb update: starting at 2012-08-07 16:01:36
> > CrawlDb update: db: crawl/crawldb
> > CrawlDb update: segments: [crawl/crawldb/segments/20120807160035]
> > CrawlDb update: additions allowed: true
> > CrawlDb update: URL normalizing: false
> > CrawlDb update: URL filtering: false
> > CrawlDb update: 404 purging: false
> > CrawlDb update: Merging segment data into db.
> > CrawlDb update: finished at 2012-08-07 16:01:37, elapsed: 00:00:01
> > LinkDb: starting at 2012-08-07 16:01:37
> > LinkDb: linkdb: crawl/crawldb/linkdb
> > LinkDb: URL normalize: true
> > LinkDb: URL filter: true
> > LinkDb: adding segment:
> > file:/data/nutch/crawl/crawldb/segments/20120807160035
> > LinkDb: merging with existing linkdb: crawl/crawldb/linkdb
> > LinkDb: finished at 2012-08-07 16:01:40, elapsed: 00:00:02
> >
> >
> > But when seed.txt have www.test.com instead test.com second launch of
> > crawler script found next segment for fetching.
> >
> > On Sat, Aug 4, 2012 at 7:33 PM, Mathijs Homminga <
> > mathijs.homminga@kalooga.com> wrote:
> >
> >> What do you mean exactly with "it falls on fetch phase"?
> >> Do  you get an error?
> >> Does "test.com" exist?
> >> Does it perhaps redirect to "www.test.com"?
> >> ...
> >>
> >> Mathijs
> >>
> >>
> >> On Aug 4, 2012, at 17:11 , Alexei Korolev <al...@gmail.com>
> >> wrote:
> >>
> >>> yes
> >>>
> >>> On Sat, Aug 4, 2012 at 6:11 PM, Lewis John Mcgibbney <
> >>> lewis.mcgibbney@gmail.com> wrote:
> >>>
> >>>> http://   ?
> >>>>
> >>>> hth
> >>>>
> >>>> On Fri, Aug 3, 2012 at 9:53 AM, Alexei Korolev <
> >> alexei.korolev@gmail.com>
> >>>> wrote:
> >>>>> Hello,
> >>>>>
> >>>>> I have small script
> >>>>>
> >>>>> $NUTCH_PATH inject crawl/crawldb seed.txt
> >>>>> $NUTCH_PATH generate crawl/crawldb crawl/crawldb/segments -adddays 0
> >>>>>
> >>>>> s1=`ls -d crawl/crawldb/segments/* | tail -1`
> >>>>> $NUTCH_PATH fetch $s1
> >>>>> $NUTCH_PATH parse $s1
> >>>>> $NUTCH_PATH updatedb crawl/crawldb $s1
> >>>>>
> >>>>> In seed.txt I have just one site, for example "test.com". When I
> start
> >>>>> script it falls on fetch phase.
> >>>>> If I change test.com on www.test.com it works fine. Seems the
> reason,
> >>>> that
> >>>>> outgoing link on test.com all have www. prefix.
> >>>>> What I need to change in nutch config for work with test.com?
> >>>>>
> >>>>> Thank you in advance. I hope my explanation is clear :)
> >>>>>
> >>>>> --
> >>>>> Alexei A. Korolev
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> Lewis
> >>>>
> >>>
> >>>
> >>>
> >>> --
> >>> Alexei A. Korolev
> >>
> >>
> >
> >
> > --
> > Alexei A. Korolev
>
>


-- 
Alexei A. Korolev

Re: crawling site without www

Posted by Mathijs Homminga <ma...@kalooga.com>.
Hi,

I read from your logs: 
- test.com is injected.
- test.com is fetched and parsed successfully. 
- but when you run a generate again (second launch), no segment is created (because no url is selected) and your script tries to fetch and parse the first segment again. Hence the errors.

So test.com is fetched successfully. The question remains: why is no URL selected in the second generate?
Many answers are possible. Can you tell us which URLs you have in your crawldb after the first cycle? Perhaps no outlinks have been found / added.
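
For reference, one way to answer that is the readdb tool. A rough sketch, assuming the same crawl/crawldb path as in your script (option names can vary a bit between Nutch versions):

# per-status counts (db_unfetched, db_fetched, ...)
$NUTCH_PATH readdb crawl/crawldb -stats

# dump the crawldb as text and look at which URLs it contains
$NUTCH_PATH readdb crawl/crawldb -dump crawldb-dump
less crawldb-dump/part-00000

If only http://test.com/ shows up there after the first cycle, then no outlinks made it into the db.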

Mathijs




On Aug 7, 2012, at 16:02 , Alexei Korolev <al...@gmail.com> wrote:

> Hello,
> 
> Yes, test.com and www.test.com exist.
> test.com do not redirect on www.test.com, it opens page with ongoing link
> with www. like www.test.com/page1 www.test.com/page2
> 
> First launch of crawler script
> 
> root@Ubuntu-1110-oneiric-64-minimal:/data/nutch# ./crawl.sh
> Injector: starting at 2012-08-07 16:00:30
> Injector: crawlDb: crawl/crawldb
> Injector: urlDir: seed.txt
> Injector: Converting injected urls to crawl db entries.
> Injector: Merging injected urls into crawl db.
> Injector: finished at 2012-08-07 16:00:32, elapsed: 00:00:02
> Generator: starting at 2012-08-07 16:00:33
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: true
> Generator: normalizing: true
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls for politeness.
> Generator: segment: crawl/crawldb/segments/20120807160035
> Generator: finished at 2012-08-07 16:00:36, elapsed: 00:00:03
> Fetcher: Your 'http.agent.name' value should be listed first in
> 'http.robots.agents' property.
> Fetcher: starting at 2012-08-07 16:00:37
> Fetcher: segment: crawl/crawldb/segments/20120807160035
> Using queue mode : byHost
> Fetcher: threads: 10
> Fetcher: time-out divisor: 2
> QueueFeeder finished: total 1 records + hit by time limit :0
> Using queue mode : byHost
> Using queue mode : byHost
> -finishing thread FetcherThread, activeThreads=1
> Using queue mode : byHost
> fetching http://test.com
> -finishing thread FetcherThread, activeThreads=1
> Using queue mode : byHost
> -finishing thread FetcherThread, activeThreads=1
> Using queue mode : byHost
> -finishing thread FetcherThread, activeThreads=1
> Using queue mode : byHost
> -finishing thread FetcherThread, activeThreads=1
> Using queue mode : byHost
> -finishing thread FetcherThread, activeThreads=1
> Using queue mode : byHost
> -finishing thread FetcherThread, activeThreads=1
> Using queue mode : byHost
> -finishing thread FetcherThread, activeThreads=1
> Using queue mode : byHost
> -finishing thread FetcherThread, activeThreads=1
> Fetcher: throughput threshold: -1
> Fetcher: throughput threshold retries: 5
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -finishing thread FetcherThread, activeThreads=0
> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=0
> Fetcher: finished at 2012-08-07 16:00:41, elapsed: 00:00:04
> ParseSegment: starting at 2012-08-07 16:00:41
> ParseSegment: segment: crawl/crawldb/segments/20120807160035
> Parsing: http://test.com
> ParseSegment: finished at 2012-08-07 16:00:44, elapsed: 00:00:02
> CrawlDb update: starting at 2012-08-07 16:00:44
> CrawlDb update: db: crawl/crawldb
> CrawlDb update: segments: [crawl/crawldb/segments/20120807160035]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: false
> CrawlDb update: URL filtering: false
> CrawlDb update: 404 purging: false
> CrawlDb update: Merging segment data into db.
> CrawlDb update: finished at 2012-08-07 16:00:45, elapsed: 00:00:01
> LinkDb: starting at 2012-08-07 16:00:46
> LinkDb: linkdb: crawl/crawldb/linkdb
> LinkDb: URL normalize: true
> LinkDb: URL filter: true
> LinkDb: adding segment:
> file:/data/nutch/crawl/crawldb/segments/20120807160035
> LinkDb: finished at 2012-08-07 16:00:47, elapsed: 00:00:01
> 
> Second launch of srcipt
> 
> root@Ubuntu-1110-oneiric-64-minimal:/data/nutch# ./crawl.sh
> Injector: starting at 2012-08-07 16:01:30
> Injector: crawlDb: crawl/crawldb
> Injector: urlDir: seed.txt
> Injector: Converting injected urls to crawl db entries.
> Injector: Merging injected urls into crawl db.
> Injector: finished at 2012-08-07 16:01:32, elapsed: 00:00:02
> Generator: starting at 2012-08-07 16:01:33
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: true
> Generator: normalizing: true
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: 0 records selected for fetching, exiting ...
> Fetcher: Your 'http.agent.name' value should be listed first in
> 'http.robots.agents' property.
> Fetcher: starting at 2012-08-07 16:01:35
> Fetcher: segment: crawl/crawldb/segments/20120807160035
> Fetcher: java.io.IOException: Segment already fetched!
>    at
> org.apache.nutch.fetcher.FetcherOutputFormat.checkOutputSpecs(FetcherOutputFormat.java:58)
>    at
> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:772)
>    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>    at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1204)
>    at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1240)
>    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>    at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1213)
> 
> ParseSegment: starting at 2012-08-07 16:01:35
> ParseSegment: segment: crawl/crawldb/segments/20120807160035
> Exception in thread "main" java.io.IOException: Segment already parsed!
>    at
> org.apache.nutch.parse.ParseOutputFormat.checkOutputSpecs(ParseOutputFormat.java:87)
>    at
> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:772)
>    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>    at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:157)
>    at org.apache.nutch.parse.ParseSegment.run(ParseSegment.java:178)
>    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>    at org.apache.nutch.parse.ParseSegment.main(ParseSegment.java:164)
> CrawlDb update: starting at 2012-08-07 16:01:36
> CrawlDb update: db: crawl/crawldb
> CrawlDb update: segments: [crawl/crawldb/segments/20120807160035]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: false
> CrawlDb update: URL filtering: false
> CrawlDb update: 404 purging: false
> CrawlDb update: Merging segment data into db.
> CrawlDb update: finished at 2012-08-07 16:01:37, elapsed: 00:00:01
> LinkDb: starting at 2012-08-07 16:01:37
> LinkDb: linkdb: crawl/crawldb/linkdb
> LinkDb: URL normalize: true
> LinkDb: URL filter: true
> LinkDb: adding segment:
> file:/data/nutch/crawl/crawldb/segments/20120807160035
> LinkDb: merging with existing linkdb: crawl/crawldb/linkdb
> LinkDb: finished at 2012-08-07 16:01:40, elapsed: 00:00:02
> 
> 
> But when seed.txt have www.test.com instead test.com second launch of
> crawler script found next segment for fetching.
> 
> On Sat, Aug 4, 2012 at 7:33 PM, Mathijs Homminga <
> mathijs.homminga@kalooga.com> wrote:
> 
>> What do you mean exactly with "it falls on fetch phase"?
>> Do  you get an error?
>> Does "test.com" exist?
>> Does it perhaps redirect to "www.test.com"?
>> ...
>> 
>> Mathijs
>> 
>> 
>> On Aug 4, 2012, at 17:11 , Alexei Korolev <al...@gmail.com>
>> wrote:
>> 
>>> yes
>>> 
>>> On Sat, Aug 4, 2012 at 6:11 PM, Lewis John Mcgibbney <
>>> lewis.mcgibbney@gmail.com> wrote:
>>> 
>>>> http://   ?
>>>> 
>>>> hth
>>>> 
>>>> On Fri, Aug 3, 2012 at 9:53 AM, Alexei Korolev <
>> alexei.korolev@gmail.com>
>>>> wrote:
>>>>> Hello,
>>>>> 
>>>>> I have small script
>>>>> 
>>>>> $NUTCH_PATH inject crawl/crawldb seed.txt
>>>>> $NUTCH_PATH generate crawl/crawldb crawl/crawldb/segments -adddays 0
>>>>> 
>>>>> s1=`ls -d crawl/crawldb/segments/* | tail -1`
>>>>> $NUTCH_PATH fetch $s1
>>>>> $NUTCH_PATH parse $s1
>>>>> $NUTCH_PATH updatedb crawl/crawldb $s1
>>>>> 
>>>>> In seed.txt I have just one site, for example "test.com". When I start
>>>>> script it falls on fetch phase.
>>>>> If I change test.com on www.test.com it works fine. Seems the reason,
>>>> that
>>>>> outgoing link on test.com all have www. prefix.
>>>>> What I need to change in nutch config for work with test.com?
>>>>> 
>>>>> Thank you in advance. I hope my explanation is clear :)
>>>>> 
>>>>> --
>>>>> Alexei A. Korolev
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Lewis
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> Alexei A. Korolev
>> 
>> 
> 
> 
> -- 
> Alexei A. Korolev


Re: crawling site without www

Posted by Alexei Korolev <al...@gmail.com>.
Hello,

Yes, test.com and www.test.com both exist.
test.com does not redirect to www.test.com; it serves a page whose outgoing
links all have the www. prefix, like www.test.com/page1 and www.test.com/page2

First launch of the crawler script

root@Ubuntu-1110-oneiric-64-minimal:/data/nutch# ./crawl.sh
Injector: starting at 2012-08-07 16:00:30
Injector: crawlDb: crawl/crawldb
Injector: urlDir: seed.txt
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2012-08-07 16:00:32, elapsed: 00:00:02
Generator: starting at 2012-08-07 16:00:33
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/crawldb/segments/20120807160035
Generator: finished at 2012-08-07 16:00:36, elapsed: 00:00:03
Fetcher: Your 'http.agent.name' value should be listed first in
'http.robots.agents' property.
Fetcher: starting at 2012-08-07 16:00:37
Fetcher: segment: crawl/crawldb/segments/20120807160035
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
fetching http://test.com
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Fetcher: throughput threshold: -1
Fetcher: throughput threshold retries: 5
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2012-08-07 16:00:41, elapsed: 00:00:04
ParseSegment: starting at 2012-08-07 16:00:41
ParseSegment: segment: crawl/crawldb/segments/20120807160035
Parsing: http://test.com
ParseSegment: finished at 2012-08-07 16:00:44, elapsed: 00:00:02
CrawlDb update: starting at 2012-08-07 16:00:44
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/crawldb/segments/20120807160035]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2012-08-07 16:00:45, elapsed: 00:00:01
LinkDb: starting at 2012-08-07 16:00:46
LinkDb: linkdb: crawl/crawldb/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment:
file:/data/nutch/crawl/crawldb/segments/20120807160035
LinkDb: finished at 2012-08-07 16:00:47, elapsed: 00:00:01

Second launch of the crawler script

root@Ubuntu-1110-oneiric-64-minimal:/data/nutch# ./crawl.sh
Injector: starting at 2012-08-07 16:01:30
Injector: crawlDb: crawl/crawldb
Injector: urlDir: seed.txt
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2012-08-07 16:01:32, elapsed: 00:00:02
Generator: starting at 2012-08-07 16:01:33
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Fetcher: Your 'http.agent.name' value should be listed first in
'http.robots.agents' property.
Fetcher: starting at 2012-08-07 16:01:35
Fetcher: segment: crawl/crawldb/segments/20120807160035
Fetcher: java.io.IOException: Segment already fetched!
    at
org.apache.nutch.fetcher.FetcherOutputFormat.checkOutputSpecs(FetcherOutputFormat.java:58)
    at
org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:772)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
    at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1204)
    at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1240)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1213)

ParseSegment: starting at 2012-08-07 16:01:35
ParseSegment: segment: crawl/crawldb/segments/20120807160035
Exception in thread "main" java.io.IOException: Segment already parsed!
    at
org.apache.nutch.parse.ParseOutputFormat.checkOutputSpecs(ParseOutputFormat.java:87)
    at
org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:772)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
    at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:157)
    at org.apache.nutch.parse.ParseSegment.run(ParseSegment.java:178)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.parse.ParseSegment.main(ParseSegment.java:164)
CrawlDb update: starting at 2012-08-07 16:01:36
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/crawldb/segments/20120807160035]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2012-08-07 16:01:37, elapsed: 00:00:01
LinkDb: starting at 2012-08-07 16:01:37
LinkDb: linkdb: crawl/crawldb/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment:
file:/data/nutch/crawl/crawldb/segments/20120807160035
LinkDb: merging with existing linkdb: crawl/crawldb/linkdb
LinkDb: finished at 2012-08-07 16:01:40, elapsed: 00:00:02


But when seed.txt has www.test.com instead of test.com, the second launch of
the crawler script finds the next segment for fetching.
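
For what it's worth, the "Segment already fetched!" / "Segment already parsed!" traces above only appear because the script reuses the previous segment (the ls -d ... | tail -1 line) when generate selects nothing. A minimal guard, assuming the same layout as the script in this thread (just a sketch):

$NUTCH_PATH generate crawl/crawldb crawl/crawldb/segments -adddays 0
s1=`ls -d crawl/crawldb/segments/* 2>/dev/null | tail -1`
# a segment that already contains a crawl_fetch directory was fetched in an earlier cycle
if [ -z "$s1" ] || [ -d "$s1/crawl_fetch" ]; then
  echo "no new segment to fetch, stopping"
  exit 0
fi
$NUTCH_PATH fetch $s1
$NUTCH_PATH parse $s1
$NUTCH_PATH updatedb crawl/crawldb $s1

That does not explain why generate selects nothing, of course; the crawldb content should.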

On Sat, Aug 4, 2012 at 7:33 PM, Mathijs Homminga <
mathijs.homminga@kalooga.com> wrote:

> What do you mean exactly with "it falls on fetch phase"?
> Do  you get an error?
> Does "test.com" exist?
> Does it perhaps redirect to "www.test.com"?
> ...
>
> Mathijs
>
>
> On Aug 4, 2012, at 17:11 , Alexei Korolev <al...@gmail.com>
> wrote:
>
> > yes
> >
> > On Sat, Aug 4, 2012 at 6:11 PM, Lewis John Mcgibbney <
> > lewis.mcgibbney@gmail.com> wrote:
> >
> >> http://   ?
> >>
> >> hth
> >>
> >> On Fri, Aug 3, 2012 at 9:53 AM, Alexei Korolev <
> alexei.korolev@gmail.com>
> >> wrote:
> >>> Hello,
> >>>
> >>> I have small script
> >>>
> >>> $NUTCH_PATH inject crawl/crawldb seed.txt
> >>> $NUTCH_PATH generate crawl/crawldb crawl/crawldb/segments -adddays 0
> >>>
> >>> s1=`ls -d crawl/crawldb/segments/* | tail -1`
> >>> $NUTCH_PATH fetch $s1
> >>> $NUTCH_PATH parse $s1
> >>> $NUTCH_PATH updatedb crawl/crawldb $s1
> >>>
> >>> In seed.txt I have just one site, for example "test.com". When I start
> >>> script it falls on fetch phase.
> >>> If I change test.com on www.test.com it works fine. Seems the reason,
> >> that
> >>> outgoing link on test.com all have www. prefix.
> >>> What I need to change in nutch config for work with test.com?
> >>>
> >>> Thank you in advance. I hope my explanation is clear :)
> >>>
> >>> --
> >>> Alexei A. Korolev
> >>
> >>
> >>
> >> --
> >> Lewis
> >>
> >
> >
> >
> > --
> > Alexei A. Korolev
>
>


-- 
Alexei A. Korolev

Re: crawling site without www

Posted by Alexei Korolev <al...@gmail.com>.
Hello,

Thank you for the reply.

Here is my regex-urlfilter.txt

# The default url filter.
# Better for whole-internet crawling.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
+.


and prefix-urlfilter.txt

# config file for urlfilter-prefix plugin

http://
https://
ftp://
file://


It all looks fine to me. Right?
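
One quick way to double-check is to run a few URLs through the configured filters. Recent 1.x releases ship a checker class; a sketch (the exact class name and options depend on the version):

echo "http://test.com/" | $NUTCH_PATH org.apache.nutch.net.URLFilterChecker -allCombined
echo "http://www.test.com/page1" | $NUTCH_PATH org.apache.nutch.net.URLFilterChecker -allCombined

A leading "+" in the output means the URL passes all active filters; a leading "-" means some filter rejects it.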

On Sat, Aug 4, 2012 at 11:16 PM, Sebastian Nagel <wastl.nagel@googlemail.com
> wrote:

> Hi Alexei,
>
> Because users are lazy some browser automatically
> try to add the www (and other stuff) to escape from
> a "server not found" error, see
> http://www-archive.mozilla.org/docs/end-user/domain-guessing.html
>
> Nutch does no domain guessing. The urls have to be correct
> and the host name must be complete.
>
> Finally, even if test.com sends a HTTP redirect pointing
> to www.test.com : check your URL filters whether both
> hosts are accepted.
>
> Sebastian
>
> On 08/04/2012 05:33 PM, Mathijs Homminga wrote:
> > What do you mean exactly with "it falls on fetch phase"?
> > Do  you get an error?
> > Does "test.com" exist?
> > Does it perhaps redirect to "www.test.com"?
> > ...
> >
> > Mathijs
> >
> > On Aug 4, 2012, at 17:11 , Alexei Korolev <al...@gmail.com>
> wrote:
> >
> >> yes
> >>
> >> On Sat, Aug 4, 2012 at 6:11 PM, Lewis John Mcgibbney <
> >> lewis.mcgibbney@gmail.com> wrote:
> >>
> >>> http://   ?
> >>>
> >>> hth
> >>>
> >>> On Fri, Aug 3, 2012 at 9:53 AM, Alexei Korolev <
> alexei.korolev@gmail.com>
> >>> wrote:
> >>>> Hello,
> >>>>
> >>>> I have small script
> >>>>
> >>>> $NUTCH_PATH inject crawl/crawldb seed.txt
> >>>> $NUTCH_PATH generate crawl/crawldb crawl/crawldb/segments -adddays 0
> >>>>
> >>>> s1=`ls -d crawl/crawldb/segments/* | tail -1`
> >>>> $NUTCH_PATH fetch $s1
> >>>> $NUTCH_PATH parse $s1
> >>>> $NUTCH_PATH updatedb crawl/crawldb $s1
> >>>>
> >>>> In seed.txt I have just one site, for example "test.com". When I
> start
> >>>> script it falls on fetch phase.
> >>>> If I change test.com on www.test.com it works fine. Seems the reason,
> >>> that
> >>>> outgoing link on test.com all have www. prefix.
> >>>> What I need to change in nutch config for work with test.com?
> >>>>
> >>>> Thank you in advance. I hope my explanation is clear :)
> >>>>
> >>>> --
> >>>> Alexei A. Korolev
> >>>
> >>>
> >>>
> >>> --
> >>> Lewis
> >>>
> >>
> >>
> >>
> >> --
> >> Alexei A. Korolev
> >
>
>


-- 
Alexei A. Korolev

Re: crawling site without www

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Alexei,

Because users are lazy, some browsers automatically
try to add the www (and other stuff) to escape from
a "server not found" error; see
http://www-archive.mozilla.org/docs/end-user/domain-guessing.html

Nutch does no domain guessing. The urls have to be correct
and the host name must be complete.

Finally, even if test.com sends an HTTP redirect pointing
to www.test.com, check your URL filters to make sure both
hosts are accepted.
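
If the goal is to treat the two hosts as one, the urlnormalizer-regex plugin can also rewrite one form into the other. A sketch of a rule for regex-normalize.xml that drops a leading www. (it goes inside the existing <regex-normalize> root, applies to every host, and is untested here):

<regex>
  <pattern>^(https?://)www\.</pattern>
  <substitution>$1</substitution>
</regex>

With that in place both variants end up under the same host in the crawldb, and the filters only need to accept one of them.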

Sebastian

On 08/04/2012 05:33 PM, Mathijs Homminga wrote:
> What do you mean exactly with "it falls on fetch phase"?
> Do  you get an error? 
> Does "test.com" exist? 
> Does it perhaps redirect to "www.test.com"?
> ...
> 
> Mathijs
> 
> On Aug 4, 2012, at 17:11 , Alexei Korolev <al...@gmail.com> wrote:
> 
>> yes
>>
>> On Sat, Aug 4, 2012 at 6:11 PM, Lewis John Mcgibbney <
>> lewis.mcgibbney@gmail.com> wrote:
>>
>>> http://   ?
>>>
>>> hth
>>>
>>> On Fri, Aug 3, 2012 at 9:53 AM, Alexei Korolev <al...@gmail.com>
>>> wrote:
>>>> Hello,
>>>>
>>>> I have small script
>>>>
>>>> $NUTCH_PATH inject crawl/crawldb seed.txt
>>>> $NUTCH_PATH generate crawl/crawldb crawl/crawldb/segments -adddays 0
>>>>
>>>> s1=`ls -d crawl/crawldb/segments/* | tail -1`
>>>> $NUTCH_PATH fetch $s1
>>>> $NUTCH_PATH parse $s1
>>>> $NUTCH_PATH updatedb crawl/crawldb $s1
>>>>
>>>> In seed.txt I have just one site, for example "test.com". When I start
>>>> script it falls on fetch phase.
>>>> If I change test.com on www.test.com it works fine. Seems the reason,
>>> that
>>>> outgoing link on test.com all have www. prefix.
>>>> What I need to change in nutch config for work with test.com?
>>>>
>>>> Thank you in advance. I hope my explanation is clear :)
>>>>
>>>> --
>>>> Alexei A. Korolev
>>>
>>>
>>>
>>> --
>>> Lewis
>>>
>>
>>
>>
>> -- 
>> Alexei A. Korolev
> 


Re: crawling site without www

Posted by Mathijs Homminga <ma...@kalooga.com>.
What exactly do you mean by "it falls on fetch phase"?
Do you get an error?
Does "test.com" exist? 
Does it perhaps redirect to "www.test.com"?
...
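
For what it's worth, the last point is easy to check from a shell (test.com standing in for the real host, as elsewhere in this thread):

# show only the status line and any Location header
curl -sI http://test.com/ | egrep -i '^(HTTP|Location)'

A 200 means the page is served directly; a 301/302 with a Location: www.test.com header means the non-www host redirects.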

Mathijs


On Aug 4, 2012, at 17:11 , Alexei Korolev <al...@gmail.com> wrote:

> yes
> 
> On Sat, Aug 4, 2012 at 6:11 PM, Lewis John Mcgibbney <
> lewis.mcgibbney@gmail.com> wrote:
> 
>> http://   ?
>> 
>> hth
>> 
>> On Fri, Aug 3, 2012 at 9:53 AM, Alexei Korolev <al...@gmail.com>
>> wrote:
>>> Hello,
>>> 
>>> I have small script
>>> 
>>> $NUTCH_PATH inject crawl/crawldb seed.txt
>>> $NUTCH_PATH generate crawl/crawldb crawl/crawldb/segments -adddays 0
>>> 
>>> s1=`ls -d crawl/crawldb/segments/* | tail -1`
>>> $NUTCH_PATH fetch $s1
>>> $NUTCH_PATH parse $s1
>>> $NUTCH_PATH updatedb crawl/crawldb $s1
>>> 
>>> In seed.txt I have just one site, for example "test.com". When I start
>>> script it falls on fetch phase.
>>> If I change test.com on www.test.com it works fine. Seems the reason,
>> that
>>> outgoing link on test.com all have www. prefix.
>>> What I need to change in nutch config for work with test.com?
>>> 
>>> Thank you in advance. I hope my explanation is clear :)
>>> 
>>> --
>>> Alexei A. Korolev
>> 
>> 
>> 
>> --
>> Lewis
>> 
> 
> 
> 
> -- 
> Alexei A. Korolev


Re: crawling site without www

Posted by Alexei Korolev <al...@gmail.com>.
yes

On Sat, Aug 4, 2012 at 6:11 PM, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> http://   ?
>
> hth
>
> On Fri, Aug 3, 2012 at 9:53 AM, Alexei Korolev <al...@gmail.com>
> wrote:
> > Hello,
> >
> > I have small script
> >
> > $NUTCH_PATH inject crawl/crawldb seed.txt
> > $NUTCH_PATH generate crawl/crawldb crawl/crawldb/segments -adddays 0
> >
> > s1=`ls -d crawl/crawldb/segments/* | tail -1`
> > $NUTCH_PATH fetch $s1
> > $NUTCH_PATH parse $s1
> > $NUTCH_PATH updatedb crawl/crawldb $s1
> >
> > In seed.txt I have just one site, for example "test.com". When I start
> > script it falls on fetch phase.
> > If I change test.com on www.test.com it works fine. Seems the reason,
> that
> > outgoing link on test.com all have www. prefix.
> > What I need to change in nutch config for work with test.com?
> >
> > Thank you in advance. I hope my explanation is clear :)
> >
> > --
> > Alexei A. Korolev
>
>
>
> --
> Lewis
>



-- 
Alexei A. Korolev

Re: crawling site without www

Posted by Lewis John Mcgibbney <le...@gmail.com>.
http://   ?

hth

On Fri, Aug 3, 2012 at 9:53 AM, Alexei Korolev <al...@gmail.com> wrote:
> Hello,
>
> I have small script
>
> $NUTCH_PATH inject crawl/crawldb seed.txt
> $NUTCH_PATH generate crawl/crawldb crawl/crawldb/segments -adddays 0
>
> s1=`ls -d crawl/crawldb/segments/* | tail -1`
> $NUTCH_PATH fetch $s1
> $NUTCH_PATH parse $s1
> $NUTCH_PATH updatedb crawl/crawldb $s1
>
> In seed.txt I have just one site, for example "test.com". When I start
> script it falls on fetch phase.
> If I change test.com on www.test.com it works fine. Seems the reason, that
> outgoing link on test.com all have www. prefix.
> What I need to change in nutch config for work with test.com?
>
> Thank you in advance. I hope my explanation is clear :)
>
> --
> Alexei A. Korolev



-- 
Lewis