Posted to user@nutch.apache.org by Raja Kulasekaran <cu...@gmail.com> on 2013/03/05 10:29:11 UTC

Robots.db instead of robots.txt

Hi

Instead of parsing the robots.txt file, why not ask the web host or web
administrator to publish the fully parsed rules in a db file format at
the robots.txt location itself?

Is there any standard protocol for this? It would be a better idea to stop
transferring that data through crawlers.

Please let me know your thoughts on this.

Raja

Re: Robots.db instead of robots.txt

Posted by Tejas Patil <te...@gmail.com>.
Nutch internally caches the robots rules (it uses a hash map) during every
round. It fetches the robots file for a particular host just once in a given
round. This model works out well. If you create a separate db for it,
then you have to ensure that it is updated frequently to take into account
any changes made on the server.
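
As a rough illustration of that per-host caching idea, here is a minimal Java
sketch. The class and method names are made up for this example and are not
Nutch's actual implementation:

import java.util.HashMap;
import java.util.Map;

// Hypothetical per-round cache of robots rules, keyed by "protocol://host:port".
// RobotRules is just a stand-in for whatever parsed representation a crawler uses.
public class RobotsCache {

    private final Map<String, RobotRules> cache = new HashMap<String, RobotRules>();

    public RobotRules getRules(String protocol, String host, int port) {
        String key = protocol + "://" + host + ":" + port;
        RobotRules rules = cache.get(key);
        if (rules == null) {
            // First URL seen for this host in the round: fetch and parse robots.txt once.
            rules = fetchAndParseRobotsTxt(key + "/robots.txt");
            cache.put(key, rules);
        }
        // Every later URL for the same host reuses the cached rules.
        return rules;
    }

    // Placeholder: a real crawler would download and parse robots.txt here.
    private RobotRules fetchAndParseRobotsTxt(String robotsUrl) {
        return new RobotRules();
    }

    // Minimal stand-in for a parsed rule set.
    public static class RobotRules {
    }
}

The point is simply that the rules live in memory for the duration of a round,
so robots.txt is fetched once per host rather than once per URL.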

On Tue, Mar 5, 2013 at 7:15 AM, Raja Kulasekaran <cu...@gmail.com> wrote:

> Hi,
>
> I meant moving the entire crawl process into the client environment, creating
> "robots.db", and fetching only robots.db as indexed data.
>
> Raja
>
> On Tue, Mar 5, 2013 at 8:27 PM, Tejas Patil <tejas.patil.cs@gmail.com>
> wrote:
>
> > robots.txt is a global standard accepted by everyone. Even Google and Bing
> > use it. I don't think there is any db file format maintained by web
> > servers for robots information.
> >
> >
> > On Tue, Mar 5, 2013 at 1:29 AM, Raja Kulasekaran <cu...@gmail.com>
> > wrote:
> >
> > > Hi
> > >
> > > Instead of parsing the robots.txt file, why not ask the web host or web
> > > administrator to publish the fully parsed rules in a db file format at
> > > the robots.txt location itself?
> > >
> > > Is there any standard protocol for this? It would be a better idea to stop
> > > transferring that data through crawlers.
> > >
> > > Please let me know your thoughts on this.
> > >
> > > Raja
> > >
> >
>

Re: Robots.db instead of robots.txt

Posted by Raja Kulasekaran <cu...@gmail.com>.
Hi,

I meant moving the entire crawl process into the client environment, creating
"robots.db", and fetching only robots.db as indexed data.

Raja

On Tue, Mar 5, 2013 at 8:27 PM, Tejas Patil <te...@gmail.com> wrote:

> robots.txt is a global standard accepted by everyone. Even Google and Bing
> use it. I don't think there is any db file format maintained by web
> servers for robots information.
>
>
> On Tue, Mar 5, 2013 at 1:29 AM, Raja Kulasekaran <cu...@gmail.com>
> wrote:
>
> > Hi
> >
> > Instead of parsing the robots.txt file, why not ask the web host or web
> > administrator to publish the fully parsed rules in a db file format at
> > the robots.txt location itself?
> >
> > Is there any standard protocol for this? It would be a better idea to stop
> > transferring that data through crawlers.
> >
> > Please let me know your thoughts on this.
> >
> > Raja
> >
>

Re: Robots.db instead of robots.txt

Posted by Tejas Patil <te...@gmail.com>.
robots.txt is a global standard accepted by everyone. Even Google and Bing
use it. I don't think there is any db file format maintained by web
servers for robots information.
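
For context, a typical robots.txt is just a few plain-text directives that any
standards-honouring crawler already knows how to parse. The paths and sitemap
URL below are made-up examples:

# Applies to all crawlers
User-agent: *
Disallow: /private/
Disallow: /tmp/
Allow: /

Sitemap: http://www.example.com/sitemap.xml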


On Tue, Mar 5, 2013 at 1:29 AM, Raja Kulasekaran <cu...@gmail.com> wrote:

> Hi
>
> Instead of parsing the robots.txt file, why not ask the web host or web
> administrator to publish the fully parsed rules in a db file format at
> the robots.txt location itself?
>
> Is there any standard protocol for this? It would be a better idea to stop
> transferring that data through crawlers.
>
> Please let me know your thoughts on this.
>
> Raja
>