Posted to user@nutch.apache.org by visava <vi...@hotmail.com> on 2007/01/14 05:49:00 UTC

crawling url list

Hi

I have created a file under the urls directory with a list of URLs.
I have changed crawl-urlfilter.txt so that all URLs in my list are
crawled, but it is not crawling anything:
+^http://([a-z0-9]*\.)*.com/

Only when I change the expression to something like
+^http://([a-z0-9]*\.)*yahoo.com/ does it crawl that particular domain.

Is the intranet crawler meant only for crawling one particular domain?

My command is:
bin/nutch crawl urls -dir crawl.test  -depth 3

Thanks
Harish

-- 
View this message in context: http://www.nabble.com/crawling-url-list-tf2983090.html#a8330415
Sent from the Nutch - User mailing list archive at Nabble.com.
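
The two filter lines from the post above can be tried outside Nutch. This is a minimal sketch, not Nutch's own code: it assumes Python's `re` behaves like the Java regex engine for these simple patterns, that the seed URL reaches the filter with a trailing slash (`http://www.yahoo.com/`), and that the filter does a substring search. The third pattern, labeled "hypothetical fix", is an illustration that is not from the thread; it only drops the extra `.` before `com/`, which otherwise has to consume one more character after the last `label.` group.

```python
import re

# Filter lines as posted; the leading "+" is the accept marker in
# crawl-urlfilter.txt and is not part of the regular expression itself.
FILTERS = {
    "posted, any .com": r"^http://([a-z0-9]*\.)*.com/",
    "posted, yahoo only": r"^http://([a-z0-9]*\.)*yahoo.com/",
    # Hypothetical fix (not from the thread): drop the stray "." before "com/".
    "hypothetical fix": r"^http://([a-z0-9]*\.)*com/",
}


def accepts(pattern: str, url: str) -> bool:
    """Return True if the pattern is found in the URL (search, not full match)."""
    return re.search(pattern, url) is not None


if __name__ == "__main__":
    url = "http://www.yahoo.com/"  # assumed form of the seed URL when filtered
    for label, pattern in FILTERS.items():
        print(f"{label:20} {accepts(pattern, url)}")
```

In this sketch the first pattern rejects the URL while the other two accept it, which is consistent with the behaviour described in the post.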


Re: crawling url list

Posted by conrelius <me...@homeofevil.com>.
kauu <babatu <at> gmail.com> writes:

> 
> Did you change nutch-site.xml?
> You must set the agent name.



Hi list,

I have the same problem.
Nutch isn't even trying to fetch pages.

What is going wrong here?



Re: crawling url list

Posted by kauu <ba...@gmail.com>.
Did you change nutch-site.xml?
You must set the agent name there (the http.agent.name property).
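
A quick way to check whether the agent name is set is to read the property out of the config file. This is only a sketch: it assumes the usual `<configuration>/<property>/<name>/<value>` layout of nutch-site.xml and the `http.agent.name` property; the helper name `agent_name` and the sample value `my-test-crawler` are made up for illustration.

```python
import xml.etree.ElementTree as ET


def agent_name(config_xml: str):
    """Return the value of http.agent.name, or None if it is missing or blank."""
    root = ET.fromstring(config_xml)
    for prop in root.iter("property"):
        if (prop.findtext("name") or "").strip() == "http.agent.name":
            value = (prop.findtext("value") or "").strip()
            return value or None
    return None


# Hypothetical nutch-site.xml content, used only to exercise the helper.
SAMPLE = """<configuration>
  <property>
    <name>http.agent.name</name>
    <value>my-test-crawler</value>
  </property>
</configuration>"""

if __name__ == "__main__":
    print(agent_name(SAMPLE))
```

To check a real installation, pass the text of conf/nutch-site.xml instead of `SAMPLE`; a result of `None` would suggest the property still needs a value.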



-- 
www.babatu.com

Re: crawling url list

Posted by visava <vi...@hotmail.com>.
I am using 0.8.1.
The contents of the urls file are:

http://www.yahoo.com

Harish


Shrinivas Patwardhan-2 wrote:
> 
> visava
>  can u paste the contents of the urls file ..
>  secondly which nutch version are u using ?
> 
> Shrinivas Patwardhan
> 
> 



Re: crawling url list

Posted by Shrinivas Patwardhan <sh...@krawlernetworks.com>.
visava,
can you paste the contents of the urls file?
Also, which Nutch version are you using?

Shrinivas Patwardhan

Re: crawling url list

Posted by Cornelius <me...@homeofevil.com>.
kauu <babatu <at> gmail.com> writes:

> 
> and did u write urls like this
> http://abc.abc.com
>
> > only when I change the expression to something like
> > +^http://([a-z0-9]*\.)*yahoo.com/ will it crawl that particular domain.
>


The problem is solved for me.

I changed the expression to

+^http://([a-z0-9]*\.)*yahoo\.com/
                            ^
                            ^

There was a \ missing before the final .com.
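
What the added backslash changes can be seen with a small test. This is a sketch under assumptions, not a statement about Nutch internals: it uses Python's `re` in place of the Java regex engine, and the second sample URL is invented. An unescaped `.` matches any single character, while `\.` matches only a literal dot.

```python
import re

UNESCAPED = r"^http://([a-z0-9]*\.)*yahoo.com/"   # as first posted
ESCAPED = r"^http://([a-z0-9]*\.)*yahoo\.com/"    # as corrected above


def accepts(pattern: str, url: str) -> bool:
    """Return True if the pattern is found in the URL."""
    return re.search(pattern, url) is not None


# The second URL is a made-up host with "x" where the dot should be.
SAMPLES = ["http://www.yahoo.com/", "http://www.yahooxcom/"]

if __name__ == "__main__":
    for url in SAMPLES:
        print(url, accepts(UNESCAPED, url), accepts(ESCAPED, url))
```

In this sketch both forms accept `http://www.yahoo.com/`, and only the unescaped form also accepts the made-up host. The escaped version is the stricter and safer one, though the missing backslash alone may not explain a crawl that fetches nothing.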



Re: crawling url list

Posted by kauu <ba...@gmail.com>.
Also, did you write the URLs like this?
http://abc.abc.com

If you wrote it like abc.abc.com, Nutch won't fetch anything.
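
Seed entries without a scheme can be spotted before running the crawl. A minimal sketch, assuming the seed file holds one URL per line; the helper name `seeds_without_scheme` is hypothetical. Since the filter patterns in this thread all begin with `^http://`, a bare host such as abc.abc.com could not match them.

```python
from urllib.parse import urlsplit


def seeds_without_scheme(lines):
    """Return the seed entries that do not start with an http or https scheme."""
    bad = []
    for line in lines:
        entry = line.strip()
        if not entry:
            continue  # ignore blank lines
        if urlsplit(entry).scheme not in ("http", "https"):
            bad.append(entry)
    return bad


if __name__ == "__main__":
    seeds = ["http://abc.abc.com", "abc.abc.com", ""]
    print(seeds_without_scheme(seeds))
```

Here only the bare `abc.abc.com` entry is reported.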



Re: crawling url list

Posted by kauu <ba...@gmail.com>.
But it works well on my PC.

On 1/15/07, visava <vi...@hotmail.com> wrote:
>
>
> I tried +^http://([a-z0-9]*\.)* but it does not work.
>
> Harish
>
>

Re: crawling url list

Posted by visava <vi...@hotmail.com>.
I tried +^http://([a-z0-9]*\.)* but it does not work.

Harish


kauu wrote:
> 
> try this
> in crawl-urlfilter.txt
> +^http://([a-z0-9]*\.)*
> 
> 


Re: crawling url list

Posted by kauu <ba...@gmail.com>.
Try this in crawl-urlfilter.txt:
+^http://([a-z0-9]*\.)*

