Posted to dev@hc.apache.org by Henri Yandell <fl...@gmail.com> on 2004/11/01 20:37:39 UTC

robots.txt parser

Does HttpClient have anything to parse a robots.txt file?

If not, would anyone be interested in http://www.osjava.org/norbert/ ?

I'd like to put it in the sandbox and thought that it would be of a
lot of interest to the HttpClient project and users.

It would need adjusting to sit on top of HttpClient as it currently
uses the JDK to download the robots.txt file itself, but that
shouldn't be very hard. Equally, HttpClient might want to, by default,
refuse to download things if it's against the robots.txt rules and
make people configure HttpClient to ignore the robots.txt to get
around it.
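
To make the idea above concrete, here is a minimal sketch of the kind of
check a robots.txt parser performs before a download is attempted. This is
not norbert's actual API: the class name RobotsRulesSketch and its methods
parse and isAllowed are hypothetical. It assumes the robots.txt body has
already been fetched as a string (by HttpClient or anything else), handles
only User-agent and Disallow lines with plain prefix matching, and merges
the "*" group with any group that names the agent, which is a
simplification.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch of a robots.txt check; not the norbert API. */
final class RobotsRulesSketch {
    // Path prefixes that the given user agent must not fetch.
    private final List<String> disallowed = new ArrayList<>();

    /** Collects Disallow prefixes from groups that apply to userAgent or to "*". */
    static RobotsRulesSketch parse(String robotsTxt, String userAgent) {
        RobotsRulesSketch rules = new RobotsRulesSketch();
        String agent = userAgent.toLowerCase(Locale.ROOT);
        boolean applies = false;      // does the current group apply to this agent?
        boolean inAgentLines = false; // inside a run of consecutive User-agent lines?
        for (String raw : robotsTxt.split("\\r?\\n")) {
            // Strip comments, then split "field: value".
            int hash = raw.indexOf('#');
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank or malformed line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                boolean match = value.equals("*")
                        || (!value.isEmpty() && agent.contains(value.toLowerCase(Locale.ROOT)));
                // Consecutive User-agent lines share one group of rules.
                applies = inAgentLines ? (applies || match) : match;
                inAgentLines = true;
            } else {
                inAgentLines = false;
                // An empty Disallow value means "nothing is disallowed".
                if (field.equals("disallow") && applies && !value.isEmpty()) {
                    rules.disallowed.add(value);
                }
            }
        }
        return rules;
    }

    /** Returns false if the path starts with any collected Disallow prefix. */
    boolean isAllowed(String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String body = "User-agent: *\nDisallow: /private/\n";
        RobotsRulesSketch rules = RobotsRulesSketch.parse(body, "ExampleBot");
        System.out.println(rules.isAllowed("/private/report.html"));
        System.out.println(rules.isAllowed("/index.html"));
    }
}
```

A client that refuses disallowed downloads by default would run a check
like isAllowed before issuing each request, with a configuration switch to
skip it.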

Hen

---------------------------------------------------------------------
To unsubscribe, e-mail: httpclient-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: httpclient-dev-help@jakarta.apache.org


Re: robots.txt parser

Posted by Henri Yandell <fl...@gmail.com>.
On Mon, 01 Nov 2004 23:59:01 +0100, Oleg Kalnichevski <ol...@apache.org> wrote:
> On Mon, 2004-11-01 at 20:37, Henri Yandell wrote:
> >
> > If not, would anyone be interested in http://www.osjava.org/norbert/ ?
> >
> > I'd like to put it in the sandbox and thought that it would be of a
> > lot of interest to the HttpClient project and users.
> >
> 
> Can we keep it in the sandbox for a while? As soon as HttpClient 4.0 API
> starts shaping up, the robots.txt parser could be migrated to Jakarta
> HttpClient to lay a foundation for a web crawler subcomponent.

I'll go ahead and migrate it into the sandbox at some point soon. 

On the web crawler side, there's:

http://www.osjava.org/scraping-engine/

I need to migrate it to use commons-configuration, and it already sits
on top of HttpClient. Food for thought anyway, I hope. I use it
personally and it's used at my workplace, but I haven't really pushed it
outside of my own use yet.

Hen



Re: robots.txt parser

Posted by Ortwin Glück <or...@nose.ch>.

Oleg Kalnichevski wrote:
> Can we keep it in the sandbox for a while? As soon as HttpClient 4.0 API
> starts shaping up, the robots.txt parser could be migrated to Jakarta
> HttpClient to lay a foundation for a web crawler subcomponent.
> 
> Folks, what do you think?

I agree completely. It's definitely a nice contribution. For now we could
add it (or at least a link to it) to our contrib directory. When we
define the 4.0 API, we'll have to discuss at which level this would be
integrated with HttpClient.

-- 
  _________________________________________________________________
  NOSE applied intelligence ag

  ortwin glück                      [www]      http://www.nose.ch
  software engineer
  hardturmstrasse 171               [pgp id]           0x81CF3416
  8005 zürich                       [office]      +41-1-277 57 35
  switzerland                       [fax]         +41-1-277 57 12



Re: robots.txt parser

Posted by Michael Becke <be...@u.washington.edu>.
On Nov 1, 2004, at 5:59 PM, Oleg Kalnichevski wrote:

> Can we keep it in the sandbox for a while? As soon as HttpClient 4.0 API
> starts shaping up, the robots.txt parser could be migrated to Jakarta
> HttpClient to lay a foundation for a web crawler subcomponent.
>
> Folks, what do you think?

Sounds about right to me.  I'm definitely looking forward to it.

Mike




Re: robots.txt parser

Posted by Oleg Kalnichevski <ol...@apache.org>.
On Mon, 2004-11-01 at 20:37, Henri Yandell wrote:
> Does HttpClient have anything to parse a robots.txt file?

Hi Henri,

No, it does not. At the moment we are trying to keep HttpClient
completely content-agnostic. That said, as soon as HttpClient 3.0 goes
RC (or maybe even earlier) we'll embark on a long-planned API redesign.
One of the goals we have in mind is to expand the scope of the project
beyond the client side, break the monolithic HttpClient into smaller,
loosely coupled components, and eventually make HttpClient evolve into a
flexible toolset of HTTP components that can be used to rapidly
assemble HTTP agents, web crawlers, HTTP proxies, and lightweight embedded
HTTP servers. At that point a robots.txt parser would be a very welcome
contribution.

> 
> If not, would anyone be interested in http://www.osjava.org/norbert/ ?
> 
> I'd like to put it in the sandbox and thought that it would be of a
> lot of interest to the HttpClient project and users.
> 

Can we keep it in the sandbox for a while? As soon as the HttpClient 4.0
API starts shaping up, the robots.txt parser could be migrated to Jakarta
HttpClient to lay a foundation for a web crawler subcomponent.

Folks, what do you think?

Oleg

