Posted to user@nutch.apache.org by Adelaida Lejarazu <al...@gmail.com> on 2011/06/13 13:10:40 UTC

No Urls to fetch

Hello,

I'm new to Nutch and I'm running some tests to see how it works. I want to do
some crawling on a digital newspaper website. To do so, I put the URL I want
to crawl, http://elcorreo.com, in the urls directory where I keep my seed list.
The thing is, I don't want to crawl all the news on the site, only the
articles from the current day, so I put a filter in crawl-urlfilter.txt
(for the moment I'm using the crawl command). The filter I put is:

+^http://www.elcorreo.com/.*?/20110613/.*?.html

A correct URL would be for example,
http://www.elcorreo.com/vizcaya/20110613/mas-actualidad/politica/lopez-consta-pactado-bildu-201106131023.html

so I think the regular expression is correct, but Nutch doesn't crawl
anything. It says: "No URLs to fetch - check your seed list and URL filters."


Am I missing something?

Thanks,

Re: No Urls to fetch

Posted by MilleBii <mi...@gmail.com>.
You may want to escape the dots, at least:

+^http://www\.elcorreo\.com/.*?/20110613/.*?\.html
I'm assuming you have the other rule at the end of your file, the one that
filters everything else out:
-.*

Now this is a problem, because you cannot seed the crawl with
http://elcorreo.com: there is no matching rule, so that URL gets rejected
immediately.

Also, unless you have a sitemap page that points you to all the pages you are
looking for, you need other pages with links to the content you are
interested in. And those pages will get crawled/indexed etc.

Therefore I don't see how you can use the crawl URL filter for doing what you want.

You may want to write a special indexing plugin so that unnecessary pages get
dropped from the search index and don't pollute your search results.
But you need to keep those pages and their links for the crawler to work.
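You can sanity-check the escaped pattern outside Nutch before re-running the crawl. A quick sketch (Nutch's urlfilter-regex uses Java's java.util.regex, but Python's re behaves the same for these particular patterns), showing that the filter matches the article URL while rejecting the bare seed URL:

```python
import re

# The corrected filter, with the dots escaped.
pattern = re.compile(r"^http://www\.elcorreo\.com/.*?/20110613/.*?\.html")

seed = "http://elcorreo.com"
article = ("http://www.elcorreo.com/vizcaya/20110613/mas-actualidad/"
           "politica/lopez-consta-pactado-bildu-201106131023.html")

print(pattern.match(article) is not None)  # True  - article is accepted
print(pattern.match(seed) is not None)     # False - seed has no matching rule
```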






-- 
-MilleBii-

Fwd: No Urls to fetch

Posted by Hannes Carl Meyer <ha...@googlemail.com>.
Please also add +^http://www.elcorreo.com/$ to your filter.
Otherwise you will exclude the seed page.
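Putting this together with the earlier advice, the accept/reject section of crawl-urlfilter.txt might look like the sketch below. This is a hypothetical fragment: urlfilter-regex applies rules top-down and the first match wins, so order matters, and the seed rule has to cover the exact host form used in the seed list (here http://elcorreo.com has no www):

```
# accept the seed page itself (with or without www)
+^http://(www\.)?elcorreo\.com/?$
# accept only articles dated 2011-06-13
+^http://www\.elcorreo\.com/.*?/20110613/.*?\.html
# reject everything else
-.
```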



Re: No Urls to fetch

Posted by Adelaida Lejarazu <al...@gmail.com>.
Yes... it is my only filter.
>You should have at least a filter for the seed page you are accessing in
the very first step!
Sorry... but I don't understand what you mean. In my seed
list I only have http://elcorreo.com, and I have the filter for it.

Regards

Adelaida.


Re: No Urls to fetch

Posted by Hannes Carl Meyer <ha...@googlemail.com>.
Hi,

is this your only filter? You should have at least a filter for the seed
page you are accessing in the very first step!

Regards

Hannes



Hannes C. Meyer
www.informera.de

RE: No Urls to fetch

Posted by Abdulelah almubarak <al...@w.cn>.

Hi,

Just change this line in crawl-urlfilter.txt from

+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

to

+^http://([a-z0-9]*\.)*
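The loosened rule can be checked the same way as before (a Python re sketch standing in for Nutch's Java regex engine). It accepts any http URL, including the bare seed, so the day restriction would then have to come from a separate, earlier rule:

```python
import re

# The loosened rule suggested above: accept any http URL.
accept_all_http = re.compile(r"^http://([a-z0-9]*\.)*")

print(accept_all_http.match("http://elcorreo.com") is not None)               # True
print(accept_all_http.match("http://www.elcorreo.com/foo.html") is not None)  # True
```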



Re: No Urls to fetch

Posted by Adelaida Lejarazu <al...@gmail.com>.
Thanks for your quick response. I will try to answer all the questions:
- I am using Nutch 1.2.
- The rest of crawl-urlfilter.txt is the default... I haven't changed
anything else; I only added the
+^http://www.elcorreo.com/.*?/20110613/.*?.html filter.
- In nutch-site.xml I have the following:
<configuration>
    <property>
        <name>http.agent.name</name>
        <value>My Spider</value>
    </property>
    <property>
        <name>generate.max.per.host</name>
        <value>-1</value>
    </property>
    <property>
        <name>http.robots.agents</name>
        <value>My Spider,*</value>
        <description>The agent strings we'll look for in robots.txt files,
        comma-separated, in decreasing order of precedence. You should
        put the value of http.agent.name as the first agent name, and keep the
        default * at the end of the list. E.g.: BlurflDev,Blurfl,*
        </description>
    </property>
    <property>
        <name>plugin.includes</name>
        <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    </property>
</configuration>




Re: No Urls to fetch

Posted by lewis john mcgibbney <le...@gmail.com>.
Hi Adelaida,

Assuming that you have been able to successfully crawl the top-level domain
http://elcorreo.com, i.e. that you have been able to crawl and create an
index, at least we know that your configuration options are OK.

I assume that you are using 1.2... can you confirm?
What does the rest of your crawl-urlfilter.txt look like?
Have you set any properties in nutch-site.xml that might alter
Nutch's behaviour?

I am not certain of the syntax for creating filter rules in crawl-urlfilter...
can someone confirm that this is correct?




-- 
*Lewis*