Posted to user@nutch.apache.org by Smith Norton <sm...@gmail.com> on 2007/09/07 09:53:02 UTC

Only one URL per site is selected from the URL file

I have listed around 53 URLs from the same site and 7 other URLs
from different sites in the seed-urls file 'urls/url'.

They look like this:

http://central/s1
http://central/s1/t
http://central/s1/topic1
http://central/s1/topic2
http://central/s1/topic3
and so on ....

I was expecting that when I begin the crawl, all of these URLs would
be fetched at depth 1. But I find that in the first depth, only
http://central/s1 was crawled from that site. The other 7 URLs, from
the distinct sites, were also crawled.

My first question:

It seems Nutch is selecting one URL per site for the first depth of
the crawl. Why is that? How can I change this behavior so that it
crawls all the URLs I list in the seed-urls file?

My second question:

It is not just the first depth: the other central URLs were never
fetched in any of the subsequent depths either. Why?

Re: UTF-16 problem

Posted by Vasja Ocvirk <va...@vizija.si>.
The problem is the charset. If we change the charset of the page from
UTF-16 to UTF-8, then Nutch fetches all URLs on that page. If the page
declares charset UTF-16, Nutch fetches the page itself but picks up
none of the URLs on it, unlike with the UTF-8 charset. Here are two
examples:

UTF-16 - no additional URLs for the next cycle:
<html>
<head>
   <meta http-equiv="Content-Type" content="text/html; charset=utf-16" />
</head>
<body>
   <a href="slo.php">sdfsdf</a>
   <a class="ai" href="info.aspx?docid=54046">test</a>
   <a href="http://www.something.com">something.com</a>
</body>
</html>

UTF-8 - the first two URLs are fetched for the next cycle, and this is OK:
<html>
<head>
   <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>
   <a href="slo.php">sdfsdf</a>
   <a class="ai" href="info.aspx?docid=54046">test</a>
   <a href="http://www.something.com">something.com</a>
</body>
</html>
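
For what it's worth, here is a minimal Java sketch of my guess at the
mechanism (illustrative only, not Nutch's actual parser code): in UTF-16
every ASCII character is padded with a zero byte, so a byte-level scan of
the raw document never sees the literal sequence "charset=".

import java.nio.charset.StandardCharsets;

public class MetaSniffDemo {
    public static void main(String[] args) {
        String meta = "<meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=utf-16\" />";
        // Encode the same tag the two ways a server might serve it.
        byte[] asUtf8 = meta.getBytes(StandardCharsets.UTF_8);
        byte[] asUtf16 = meta.getBytes(StandardCharsets.UTF_16LE);
        // Decoding the raw bytes as Latin-1 mimics a naive byte-level
        // scan performed before the real charset is known.
        System.out.println(new String(asUtf8, StandardCharsets.ISO_8859_1)
                .contains("charset="));  // true: declaration is visible
        System.out.println(new String(asUtf16, StandardCharsets.ISO_8859_1)
                .contains("charset="));  // false: bytes are c.h.a.r.s.e.t.=
    }
}

If that is the cause, serving the page with a UTF-16 byte order mark, or
declaring the charset in the HTTP Content-Type header, might work around it.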

Thanks!

Best regards,
Vasja

Doğacan Güney wrote:
> On 9/11/07, Vasja Ocvirk <va...@vizija.si> wrote:
>   
>> Does anyone know what to do if Nutch doesn't crawl and index web pages
>> in UTF-16? Has anyone had such a problem yet?
>>     
>
> Nutch should work with UTF-16. Can you describe your problem in more detail?
>
>   
>> Best regards,
>> Vasja
>>
>>     
>
>
>   

Re: UTF-16 problem

Posted by Doğacan Güney <do...@gmail.com>.
On 9/11/07, Vasja Ocvirk <va...@vizija.si> wrote:
> Does anyone know what to do if Nutch doesn't crawl and index web pages
> in UTF-16? Has anyone had such a problem yet?

Nutch should work with UTF-16. Can you describe your problem in more detail?

>
> Best regards,
> Vasja
>


-- 
Doğacan Güney

UTF-16 problem

Posted by Vasja Ocvirk <va...@vizija.si>.
Does anyone know what to do if Nutch doesn't crawl and index web pages 
in UTF-16? Has anyone had such a problem yet?

Best regards,
Vasja

Re: Only one URL per site is selected from the URL file

Posted by eyal edri <ey...@gmail.com>.
Try reviewing the settings in the nutch-default.xml file in the conf dir.
There are settings there regarding crawling internal links and external
links.

There are also settings for limiting and partitioning crawls by host or
by IP address.

See if that helps.
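
For example, these are the kinds of properties to look at. The names below
are from what I remember of the 0.9-era nutch-default.xml, so please verify
them against your version before overriding anything in nutch-site.xml:

<property>
  <name>generate.max.per.host</name>
  <value>-1</value>
  <description>Maximum number of URLs per host in a single fetchlist.
  -1 means no limit; a small positive value here would explain most of
  one host's seed URLs being skipped in each generate round.</description>
</property>

<property>
  <name>generate.max.per.host.by.ip</name>
  <value>false</value>
  <description>If true, the limit above is applied per IP address
  rather than per hostname.</description>
</property>

<property>
  <name>db.ignore.external.links</name>
  <value>false</value>
  <description>If true, outlinks leading to other hosts are ignored,
  which keeps the crawl inside the seed sites.</description>
</property>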


On 9/7/07, Smith Norton <sm...@gmail.com> wrote:
>
> I find this in the logs:
>
> 2007-09-06 17:13:54,707 INFO  crawl.Generator - Generator:
> Partitioning selected urls by host, for politeness.
>
> Is this why lots of URLs from the same host are being ignored? If it
> partitions, shouldn't it remember the unselected URLs so they can be
> crawled later?
>
> Can someone please read the two mails below and this one, and help me
> understand what's going on?
>
> On 9/7/07, Smith Norton <sm...@gmail.com> wrote:
> > The facts in the earlier mail are slightly wrong. It's not exactly one
> > URL per site, but not all the URLs mentioned in the URL file are
> > processed. For example, out of 53 URLs from the same site, only 3 or 4
> > were processed. Why?
> >
> > Is this a known bug or intended behavior of Nutch? Can this behavior
> > be changed?
> >
> > On 9/7/07, Smith Norton <sm...@gmail.com> wrote:
> > > I have listed around 53 URLs from the same site and 7 other URLs
> > > from different sites in the seed-urls file 'urls/url'.
> > >
> > > They look like this:
> > >
> > > http://central/s1
> > > http://central/s1/t
> > > http://central/s1/topic1
> > > http://central/s1/topic2
> > > http://central/s1/topic3
> > > and so on ....
> > >
> > > I was expecting that when I begin the crawl, all of these URLs would
> > > be fetched at depth 1. But I find that in the first depth, only
> > > http://central/s1 was crawled from that site. The other 7 URLs, from
> > > the distinct sites, were also crawled.
> > >
> > > My first question:
> > >
> > > It seems Nutch is selecting one URL per site for the first depth of
> > > the crawl. Why is that? How can I change this behavior so that it
> > > crawls all the URLs I list in the seed-urls file?
> > >
> > > My second question:
> > >
> > > It is not just the first depth: the other central URLs were never
> > > fetched in any of the subsequent depths either. Why?
> > >
> >
>



-- 
Eyal Edri

Re: Only one URL per site is selected from the URL file

Posted by Smith Norton <sm...@gmail.com>.
I find this in the logs:

2007-09-06 17:13:54,707 INFO  crawl.Generator - Generator:
Partitioning selected urls by host, for politeness.

Is this why lots of URLs from the same host are being ignored? If it
partitions, shouldn't it remember the unselected URLs so they can be
crawled later?

Can someone please read the two mails below and this one, and help me
understand what's going on?
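
For context, I am using the tutorial-style one-shot crawl command; the
invocation below is representative rather than my exact paths and numbers:

bin/nutch crawl urls -dir crawl -depth 3 -topN 50

I mention it because, as I understand it, -topN also caps how many URLs
are selected in each generate round, so perhaps that plays a part as well.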

On 9/7/07, Smith Norton <sm...@gmail.com> wrote:
> The facts in the earlier mail are slightly wrong. It's not exactly one
> URL per site, but not all the URLs mentioned in the URL file are
> processed. For example, out of 53 URLs from the same site, only 3 or 4
> were processed. Why?
>
> Is this a known bug or intended behavior of Nutch? Can this behavior be changed?
>
> On 9/7/07, Smith Norton <sm...@gmail.com> wrote:
> > I have listed around 53 URLs from the same site and 7 other URLs
> > from different sites in the seed-urls file 'urls/url'.
> >
> > They look like this:
> >
> > http://central/s1
> > http://central/s1/t
> > http://central/s1/topic1
> > http://central/s1/topic2
> > http://central/s1/topic3
> > and so on ....
> >
> > I was expecting that when I begin the crawl, all of these URLs would
> > be fetched at depth 1. But I find that in the first depth, only
> > http://central/s1 was crawled from that site. The other 7 URLs, from
> > the distinct sites, were also crawled.
> >
> > My first question:
> >
> > It seems Nutch is selecting one URL per site for the first depth of
> > the crawl. Why is that? How can I change this behavior so that it
> > crawls all the URLs I list in the seed-urls file?
> >
> > My second question:
> >
> > It is not just the first depth: the other central URLs were never
> > fetched in any of the subsequent depths either. Why?
> >
>

Re: Only one URL per site is selected from the URL file

Posted by Smith Norton <sm...@gmail.com>.
The facts in the earlier mail are slightly wrong. It's not exactly one
URL per site, but not all the URLs mentioned in the URL file are
processed. For example, out of 53 URLs from the same site, only 3 or 4
were processed. Why?

Is this a known bug or intended behavior of Nutch? Can this behavior be changed?

On 9/7/07, Smith Norton <sm...@gmail.com> wrote:
> I have listed around 53 URLs from the same site and 7 other URLs
> from different sites in the seed-urls file 'urls/url'.
>
> They look like this:
>
> http://central/s1
> http://central/s1/t
> http://central/s1/topic1
> http://central/s1/topic2
> http://central/s1/topic3
> and so on ....
>
> I was expecting that when I begin the crawl, all of these URLs would
> be fetched at depth 1. But I find that in the first depth, only
> http://central/s1 was crawled from that site. The other 7 URLs, from
> the distinct sites, were also crawled.
>
> My first question:
>
> It seems Nutch is selecting one URL per site for the first depth of
> the crawl. Why is that? How can I change this behavior so that it
> crawls all the URLs I list in the seed-urls file?
>
> My second question:
>
> It is not just the first depth: the other central URLs were never
> fetched in any of the subsequent depths either. Why?
>