Posted to user@nutch.apache.org by visava <vi...@hotmail.com> on 2007/01/14 05:49:00 UTC
crawling url list
Hi,
I have created a file under the urls directory with a list of URLs.
I have changed crawl-urlfilter.txt so that all URLs in my list are
crawled, but it is not crawling anything:
+^http://([a-z0-9]*\.)*.com/
Only when I change the expression to something like
+^http://([a-z0-9]*\.)*yahoo.com/ will it crawl that particular domain.
Is the intranet crawler meant only for crawling one particular domain?
My command is:
bin/nutch crawl urls -dir crawl.test -depth 3
Thanks
Harish
--
View this message in context: http://www.nabble.com/crawling-url-list-tf2983090.html#a8330415
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: crawling url list
Posted by conrelius <me...@homeofevil.com>.
kauu <babatu <at> gmail.com> writes:
>
> did u change the nutch-site.xml
> u must sign the agent name
>
Hi list,
I have the same problem.
Nutch isn't even trying to fetch pages.
What is going wrong here?
Re: crawling url list
Posted by kauu <ba...@gmail.com>.
Did you change nutch-site.xml?
You must set the agent name (the http.agent.name property).
--
www.babatu.com
Re: crawling url list
Posted by visava <vi...@hotmail.com>.
I am using Nutch 0.8.1.
The contents of the urls file are like this:
http://www.yahoo.com
Harish
Re: crawling url list
Posted by Shrinivas Patwardhan <sh...@krawlernetworks.com>.
visava,
Can you paste the contents of the urls file?
Secondly, which Nutch version are you using?
Shrinivas Patwardhan
Re: crawling url list
Posted by Cornelius <me...@homeofevil.com>.
kauu <babatu <at> gmail.com> writes:
>
> and did u write urls like this
> http://abc.abc.com
>
The problem is solved for me.
I changed the expression to
+^http://([a-z0-9]*\.)*yahoo\.com/
                            ^
There was a \ missing before the last .com.
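For anyone comparing the rules posted in this thread, here is a minimal sketch in Python rather than in Nutch itself. It assumes the URL filter tests each rule with an unanchored regex search (the rules carry their own ^) and that the leading + only marks an include rule; both are assumptions about the filter, not something confirmed in the thread. The sample URLs and the last pattern are hypothetical. For these particular patterns Python's re syntax agrees with Java's.

```python
import re

# Rules from the thread, with the leading "+" (include marker) dropped.
# The last entry is a hypothetical alternative, not a rule from the thread.
PATTERNS = {
    "any-com-stray-dot": r"^http://([a-z0-9]*\.)*.com/",
    "yahoo-unescaped-dot": r"^http://([a-z0-9]*\.)*yahoo.com/",
    "yahoo-escaped-dot": r"^http://([a-z0-9]*\.)*yahoo\.com/",
    "any-com-hypothetical": r"^http://([a-z0-9-]+\.)+com/",
}


def check(url: str) -> dict:
    """Return, for each rule name, whether the rule matches the given URL."""
    return {name: re.search(pattern, url) is not None
            for name, pattern in PATTERNS.items()}


if __name__ == "__main__":
    # Hypothetical sample URLs chosen to show where the rules differ.
    for url in ("http://www.yahoo.com/",
                "http://www.yahoo.com",
                "http://www.yahooxcom/"):
        print(url, check(url))
```

Under those assumptions, the *.com/ rule rejects http://www.yahoo.com/ because the unescaped dot before com must consume one extra character, and the unescaped dot in yahoo.com also lets http://www.yahooxcom/ through, which the escaped form rejects. Every rule here requires a / after the host, so a seed written without the trailing slash only matches if it is normalized to include one before filtering.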
Re: crawling url list
Posted by kauu <ba...@gmail.com>.
And did you write the URLs like this?
http://abc.abc.com
If you wrote it like abc.abc.com, Nutch won't fetch anything.
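A quick way to check a seed list for this is sketched below. It is a standalone Python helper, not part of Nutch, and the function name seeds_missing_scheme is hypothetical. It only looks for the http:// prefix, since the filter rules in this thread all begin with ^http://.

```python
def seeds_missing_scheme(lines):
    """Return the non-empty seed entries that do not start with http://."""
    missing = []
    for line in lines:
        seed = line.strip()
        if not seed:
            continue  # skip blank lines in the seed file
        if not seed.lower().startswith("http://"):
            missing.append(seed)
    return missing


if __name__ == "__main__":
    # Sample seed entries taken from the thread.
    sample = ["http://abc.abc.com", "abc.abc.com", "http://www.yahoo.com"]
    print(seeds_missing_scheme(sample))
```

Any entry it reports would also fail a rule anchored with ^http://, whatever the rest of the expression says.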
Re: crawling url list
Posted by kauu <ba...@gmail.com>.
But it works well on my PC.
Re: crawling url list
Posted by visava <vi...@hotmail.com>.
I tried +^http://([a-z0-9]*\.)* but it does not work.
Harish
Re: crawling url list
Posted by kauu <ba...@gmail.com>.
Try this in crawl-urlfilter.txt:
+^http://([a-z0-9]*\.)*