Posted to dev@manifoldcf.apache.org by Julien Nioche <li...@gmail.com> on 2011/07/06 22:12:15 UTC

[ANN] Release crawler-commons 0.1

[Apologies for cross-posting]

The initial release of crawler-commons is available from:
http://code.google.com/p/crawler-commons/downloads/list

The purpose of this project is to develop a set of reusable Java components
that implement functionality common to any web crawler. These components
would benefit from collaboration among various existing web crawler
projects, and reduce duplication of effort.
The current version contains resources for:
- parsing robots.txt
- parsing sitemaps
- a URL analyzer that returns top-level domains
- a simple HttpFetcher
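To give a flavour of what "parsing robots.txt" involves, here is a minimal, self-contained sketch of prefix-based Disallow matching. Note this is not the crawler-commons API: the RobotsSketch class and isAllowed method are invented for illustration, and real robots.txt handling (grouped user-agents, Allow rules, wildcards) is considerably more involved.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal robots.txt allow/disallow check: collects the Disallow rules that
// apply to a given user-agent (or "*") and tests a path by prefix match.
// Simplification: each "User-agent" line starts a fresh group; multiple
// user-agent lines sharing one rule block are not handled.
public class RobotsSketch {
    public static boolean isAllowed(String robotsTxt, String agent, String path) {
        List<String> disallows = new ArrayList<>();
        boolean applies = false;
        for (String raw : robotsTxt.split("\n")) {
            String line = raw.split("#", 2)[0].trim();   // strip comments
            int colon = line.indexOf(':');
            if (colon < 0) continue;                      // not a directive
            String field = line.substring(0, colon).trim().toLowerCase();
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                applies = value.equals("*") || value.equalsIgnoreCase(agent);
            } else if (applies && field.equals("disallow") && !value.isEmpty()) {
                disallows.add(value);
            }
        }
        for (String rule : disallows) {
            if (path.startsWith(rule)) return false;      // blocked by prefix
        }
        return true;                                       // allowed by default
    }

    public static void main(String[] args) {
        String robots = "User-agent: *\nDisallow: /private/\nDisallow: /tmp\n";
        System.out.println(isAllowed(robots, "mybot", "/private/data")); // false
        System.out.println(isAllowed(robots, "mybot", "/public/page"));  // true
    }
}
```

A production parser, such as the one in this release, also has to cope with malformed directives, conflicting rules, and de-facto extensions, which is exactly why sharing one battle-tested implementation across crawler projects is attractive.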

This release is available on Sonatype's OSS Nexus repository
[https://oss.sonatype.org/content/repositories/releases/com/google/code/crawler-commons/]
and should be available on Maven Central soon.
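Once the artifacts reach Maven Central, a project could presumably declare the dependency along the following lines. The groupId is inferred from the repository path above; the artifactId is an assumption, so check the POM in the Nexus repository for the exact coordinates:

```xml
<dependency>
  <groupId>com.google.code.crawler-commons</groupId>
  <artifactId>crawler-commons</artifactId>
  <version>0.1</version>
</dependency>
```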

Please send your questions, comments or suggestions to
http://groups.google.com/group/crawler-commons

Best regards,

Julien

--

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

Re: [ANN] Release crawler-commons 0.1

Posted by Ken Krugler <kk...@transpac.com>.
On Jul 6, 2011, at 1:37pm, Julien Nioche wrote:

> [cc to crawler-commons list]
> 
> I wasn't part of the initial discussion so I don't know what the arguments
> for / against were.
> I suppose it depends partially on user adoption. The project has had a slow
> start but with this initial release it should gain a bit of traction. The
> license is already Apache 2.0. We'll see how it goes, but as long as it
> thrives I don't really mind where it lives.

See the "Hosting Options" section on this page:

http://wiki.apache.org/nutch/ApacheConUs2009MeetUp

-- Ken

> On 6 July 2011 21:15, Markus Jelsma <ma...@openindex.io> wrote:
> 
>> Impressive! Are you guys going for the ASF incubator?
>> 
>>> [Apologies for cross-posting]
>>> 
>>> The initial release of crawler-commons is available from :
>>> http://code.google.com/p/crawler-commons/downloads/list
>>> 
>>> The purpose of this project is to develop a set of reusable Java
>> components
>>> that implement functionality common to any web crawler. These components
>>> would benefit from collaboration among various existing web crawler
>>> projects, and reduce duplication of effort.
>>> The current version contains resources for :
>>> - parsing robots.txt
>>> - parsing sitemaps
>>> - URL analyzer which returns Top Level Domains
>>> - a simple HttpFetcher
>>> 
>>> This release is available on Sonatype's OSS Nexus repository [
>>> https://oss.sonatype.org/content/repositories/releases/com/google/code/crawler-commons/]
>>> and should be available on Maven Central soon.
>>> 
>>> Please send your questions, comments or suggestions to
>>> http://groups.google.com/group/crawler-commons
>>> 
>>> Best regards,
>>> 
>>> Julien

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom data mining solutions

Re: [ANN] Release crawler-commons 0.1

Posted by Julien Nioche <li...@gmail.com>.
[cc to crawler-commons list]

I wasn't part of the initial discussion so I don't know what the arguments
for / against were.
I suppose it depends partially on user adoption. The project has had a slow
start but with this initial release it should gain a bit of traction. The
license is already Apache 2.0. We'll see how it goes, but as long as it
thrives I don't really mind where it lives.

Julien

On 6 July 2011 21:15, Markus Jelsma <ma...@openindex.io> wrote:

> Impressive! Are you guys going for the ASF incubator?
>
> > [Apologies for cross-posting]
> >
> > The initial release of crawler-commons is available from :
> > http://code.google.com/p/crawler-commons/downloads/list
> >
> > The purpose of this project is to develop a set of reusable Java
> components
> > that implement functionality common to any web crawler. These components
> > would benefit from collaboration among various existing web crawler
> > projects, and reduce duplication of effort.
> > The current version contains resources for :
> > - parsing robots.txt
> > - parsing sitemaps
> > - URL analyzer which returns Top Level Domains
> > - a simple HttpFetcher
> >
> > This release is available on Sonatype's OSS Nexus repository [
> > https://oss.sonatype.org/content/repositories/releases/com/google/code/crawler-commons/]
> > and should be available on Maven Central soon.
> >
> > Please send your questions, comments or suggestions to
> > http://groups.google.com/group/crawler-commons
> >
> > Best regards,
> >
> > Julien
> >
> > --
> >
> > Open Source Solutions for Text Engineering
> >
> > http://digitalpebble.blogspot.com/
> > http://www.digitalpebble.com
>



--

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

Re: [ANN] Release crawler-commons 0.1

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
It would be great to go the Incubator route. I pointed this out to
Andrzej way back when this was starting, too, but it would be good to
think about things like Spring injection (a DI framework) for config,
phase-based crawler processing, etc. Check out some of the work
in the OODT catalog crawler [1]. It might help.

I read the wiki page Ken pointed to, and one of the options proposed
(at the time) was becoming a Lucene sub-project. I don't think that'll work
anymore, since the ASF doesn't want umbrella projects. So the goal would be
Incubator PMC sponsorship with a graduation path towards TLP.

But it's up to the guys doing what they want to do, and I'm on the
outside looking in on this one. *Except for* Nutch's concern :-) I care
very much about the way that Nutch consumes any of this code,
so I'm happy to chime in on that.

Cheers,
Chris

[1] http://oodt.apache.org/components/maven/crawler/


On Jul 6, 2011, at 1:15 PM, Markus Jelsma wrote:

> Impressive! Are you guys going for the ASF incubator?
> 
>> [Apologies for cross-posting]
>> 
>> The initial release of crawler-commons is available from :
>> http://code.google.com/p/crawler-commons/downloads/list
>> 
>> The purpose of this project is to develop a set of reusable Java components
>> that implement functionality common to any web crawler. These components
>> would benefit from collaboration among various existing web crawler
>> projects, and reduce duplication of effort.
>> The current version contains resources for :
>> - parsing robots.txt
>> - parsing sitemaps
>> - URL analyzer which returns Top Level Domains
>> - a simple HttpFetcher
>> 
>> This release is available on Sonatype's OSS Nexus repository [
>> https://oss.sonatype.org/content/repositories/releases/com/google/code/crawler-commons/]
>> and should be available on Maven Central soon.
>> 
>> Please send your questions, comments or suggestions to
>> http://groups.google.com/group/crawler-commons
>> 
>> Best regards,
>> 
>> Julien
>> 
>> --
>> 
>> Open Source Solutions for Text Engineering
>> 
>> http://digitalpebble.blogspot.com/
>> http://www.digitalpebble.com


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattmann@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


Re: [ANN] Release crawler-commons 0.1

Posted by Markus Jelsma <ma...@openindex.io>.
Impressive! Are you guys going for the ASF incubator?

> [Apologies for cross-posting]
> 
> The initial release of crawler-commons is available from :
> http://code.google.com/p/crawler-commons/downloads/list
> 
> The purpose of this project is to develop a set of reusable Java components
> that implement functionality common to any web crawler. These components
> would benefit from collaboration among various existing web crawler
> projects, and reduce duplication of effort.
> The current version contains resources for :
> - parsing robots.txt
> - parsing sitemaps
> - URL analyzer which returns Top Level Domains
> - a simple HttpFetcher
> 
> This release is available on Sonatype's OSS Nexus repository [
> https://oss.sonatype.org/content/repositories/releases/com/google/code/crawler-commons/]
> and should be available on Maven Central soon.
> 
> Please send your questions, comments or suggestions to
> http://groups.google.com/group/crawler-commons
> 
> Best regards,
> 
> Julien
> 
> --
> 
> Open Source Solutions for Text Engineering
> 
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com