Posted to dev@manifoldcf.apache.org by Julien Nioche <li...@gmail.com> on 2011/06/02 17:11:14 UTC

CrawlerCommons & ManifoldCF

Hi guys,

I'd just like to mention Crawler Commons, which is an effort between the
committers of various crawl-related projects (Nutch, Bixo or Heritrix) to
put some basic functionality in common. We currently have mostly a top
level domain finder and a sitemap parser, but are definitely planning to
have other things there as well, e.g. a robots.txt parser, a protocol handler,
etc.

Would you like to get involved? There are quite a few things that the
crawler in Manifold could reuse or contribute to.

Best,

Julien

-- 
*
*Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

RE: CrawlerCommons & ManifoldCF

Posted by Fuad Efendi <fu...@efendi.ca>.
Thanks Julien; I found it, strange...

Yes, I need to separate Robots Rules Parser, if BIXO agrees...


ManifoldCF current style:

1. Open socket
2. Load 500 kbits (in 2 milliseconds)
3. Sleep 998 milliseconds

Just because there is a user interface where we set the bandwidth limit to 500
kbps (probably 50 kbytes).

So it will be hard... I'd like to see HttpClient instead... or, if
crawler-commons includes a "fetcher", to see that... even better if the "fetcher"
is rich enough to support POST (there was some interest at Droids).

Existing code seems outdated: why should an external server allocate resources
(TCP and HTTP handler) which are not used 99.8% of the time?
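
To make the 99.8% figure concrete, here is a minimal Java sketch of the
arithmetic behind the burst-then-sleep pattern listed above (500 kbit loaded
in 2 ms, then a 998 ms sleep). The class and method names are hypothetical
and are not taken from ManifoldCF; the 250 Mbit/s link speed is only what
those two figures imply, not a measured value.

```java
import java.util.Locale;

/** Hypothetical helper for illustration only; not ManifoldCF code. */
final class BurstThrottleSketch {

    /** Fraction of each throttling window during which the open connection sits idle. */
    static double idleFraction(long burstMillis, long windowMillis) {
        if (windowMillis <= 0 || burstMillis < 0 || burstMillis > windowMillis) {
            throw new IllegalArgumentException("need window > 0 and 0 <= burst <= window");
        }
        return (double) (windowMillis - burstMillis) / windowMillis;
    }

    /** Milliseconds needed to move the given number of bits at the given link speed. */
    static double burstMillis(long bits, long linkBitsPerSecond) {
        if (linkBitsPerSecond <= 0) {
            throw new IllegalArgumentException("link speed must be positive");
        }
        return 1000.0 * bits / linkBitsPerSecond;
    }

    public static void main(String[] args) {
        // Figures from the message: a 2 ms burst inside a 1000 ms window.
        double idle = idleFraction(2, 1000);
        System.out.printf(Locale.ROOT, "idle share of window: %.1f%%%n", idle * 100.0);

        // Assumption: 500 kbit in 2 ms implies a link of roughly 250 Mbit/s.
        double burst = burstMillis(500_000L, 250_000_000L);
        System.out.printf(Locale.ROOT, "burst duration: %.1f ms%n", burst);
    }
}
```

The remote server keeps a socket and a handler open for the whole window even
though data only flows during the short burst, which is the cost being
questioned here.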

But reuse of Robots Rules is most important; Nutch has some problems
too...


Thanks




-----Original Message-----
From: Julien Nioche [mailto:lists.digitalpebble@gmail.com] 
Sent: June-03-11 7:01 AM
To: connectors-dev@incubator.apache.org; crawler-commons@googlegroups.com
Subject: Re: CrawlerCommons & ManifoldCF

There is a link to the discussion group on the main page; becoming a member
of the group is pretty straightforward.



Re: CrawlerCommons & ManifoldCF

Posted by Julien Nioche <li...@gmail.com>.
There is a link to the discussion group on the main page; becoming a member
of the group is pretty straightforward.

On 3 June 2011 00:36, Fuad Efendi <fu...@efendi.ca> wrote:

> I mean "join button" at http://code.google.com/p/crawler-commons/
> I am well familiar with BIXO and Droids; it will be hard to make minor
> changes in ManifoldCF... although it's possible (without "crawler" part,
> only "robots rules parser")...
> -Fuad

RE: CrawlerCommons & ManifoldCF

Posted by Fuad Efendi <fu...@efendi.ca>.
I mean "join button" at http://code.google.com/p/crawler-commons/
I am well familiar with BIXO and Droids; it will be hard to make minor
changes in ManifoldCF... although it's possible (without "crawler" part,
only "robots rules parser")...
-Fuad


-----Original Message-----
From: Fuad Efendi [mailto:fuad@efendi.ca] 
Sent: June-02-11 7:05 PM
To: connectors-dev@incubator.apache.org; crawler-commons@googlegroups.com
Subject: RE: CrawlerCommons & ManifoldCF

I'd like to join this project but can't find "join" button :) Thanks!

Fuad Efendi
+1 416-993-2060
http://www.linkedin.com/in/liferay

Tokenizer Inc.
http://www.tokenizer.ca/
Data Mining, Vertical Search



RE: CrawlerCommons & ManifoldCF

Posted by Fuad Efendi <fu...@efendi.ca>.
I'd like to join this project but can't find "join" button :)
Thanks!

Fuad Efendi
+1 416-993-2060
http://www.linkedin.com/in/liferay

Tokenizer Inc.
http://www.tokenizer.ca/
Data Mining, Vertical Search



Re: CrawlerCommons & ManifoldCF

Posted by Julien Nioche <li...@gmail.com>.
Hi,

We could reuse RobotsData indeed and refactor it a bit.

Ken, you said you'd be keen to contribute your code for robot parsing as
well - do you think it would be quicker than refactoring Manifold's code? Or
does it support additional features? What about Droids?

Julien

PS: Anyone attending BerlinBuzzwords next week?


On 2 June 2011 17:57, Karl Wright <da...@gmail.com> wrote:

> I don't think it would be hard to peel out the robots parser, although
> obviously it would need refactoring to live in a more standard library
> environment.  If you want to look at it, it is in:
>
>
> https://svn.apache.org/repos/asf/incubator/lcf/trunk/connectors/webcrawler/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/webcrawler/RobotsManager.java
>
> Look for the static class "RobotsData", around line 299.
>
> Karl

Re: CrawlerCommons & ManifoldCF

Posted by Karl Wright <da...@gmail.com>.
I don't think it would be hard to peel out the robots parser, although
obviously it would need refactoring to live in a more standard library
environment.  If you want to look at it, it is in:

https://svn.apache.org/repos/asf/incubator/lcf/trunk/connectors/webcrawler/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/webcrawler/RobotsManager.java

Look for the static class "RobotsData", around line 299.

Karl
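
For anyone wondering what a robots rules class might look like once it is
peeled out into a plain library, here is a rough Java sketch. It is not the
RobotsData class referenced above and not the crawler-commons API; the name
SimpleRobotsRules and its methods are made up for illustration. It assumes
plain prefix rules (no * or $ wildcards in paths), exact user-agent token
matching with a fallback to the * group, and longest-match-wins with ties
going to Allow.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical standalone rule set; NOT ManifoldCF's RobotsData or a crawler-commons class. */
final class SimpleRobotsRules {
    private final List<String> allow = new ArrayList<>();
    private final List<String> disallow = new ArrayList<>();

    /** Parses robots.txt text and returns the rules for agentToken, falling back to the "*" group. */
    static SimpleRobotsRules parse(String content, String agentToken) {
        SimpleRobotsRules specific = new SimpleRobotsRules();
        SimpleRobotsRules wildcard = new SimpleRobotsRules();
        String wanted = agentToken.toLowerCase(Locale.ROOT);
        boolean toSpecific = false;    // current group names our agent
        boolean toWildcard = false;    // current group names "*"
        boolean readingAgents = false; // inside a run of consecutive User-agent lines
        boolean specificSeen = false;  // some group named our agent explicitly

        for (String raw : content.split("\\r?\\n")) {
            int hash = raw.indexOf('#');
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank, comment-only, or malformed line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();

            if (field.equals("user-agent")) {
                if (!readingAgents) { // first User-agent line starts a new group
                    toSpecific = false;
                    toWildcard = false;
                }
                readingAgents = true;
                if (value.equals("*")) {
                    toWildcard = true;
                } else if (value.toLowerCase(Locale.ROOT).equals(wanted)) {
                    toSpecific = true;
                    specificSeen = true;
                }
            } else if (field.equals("allow") || field.equals("disallow")) {
                readingAgents = false;
                if (value.isEmpty()) {
                    continue; // an empty Disallow restricts nothing
                }
                if (toSpecific) {
                    specific.add(field, value);
                }
                if (toWildcard) {
                    wildcard.add(field, value);
                }
            }
        }
        return specificSeen ? specific : wildcard;
    }

    private void add(String field, String prefix) {
        (field.equals("allow") ? allow : disallow).add(prefix);
    }

    /** Longest matching prefix wins; a tie (or no match at all) means allowed. */
    boolean isAllowed(String path) {
        return longestMatch(allow, path) >= longestMatch(disallow, path);
    }

    private static int longestMatch(List<String> prefixes, String path) {
        int best = -1;
        for (String prefix : prefixes) {
            if (path.startsWith(prefix) && prefix.length() > best) {
                best = prefix.length();
            }
        }
        return best;
    }
}
```

Usage would be along the lines of
SimpleRobotsRules.parse(robotsTxt, "mybot").isAllowed("/some/path"), with no
dependency on connector or crawler state, which is roughly what "living in a
more standard library environment" would require.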



On Thu, Jun 2, 2011 at 11:35 AM, Julien Nioche
<li...@gmail.com> wrote:
> Hi Karl,
>
> Maybe a good start would be to identify which parts of your crawler could be
> shared and would not take too much effort to make generic. I haven't
> looked at the code of the crawler in great detail, but do you think the
> robots parser would be a good candidate?
>
> Julien

Re: CrawlerCommons & ManifoldCF

Posted by Julien Nioche <li...@gmail.com>.
Hi Karl,

Maybe a good start would be to identify which parts of your crawler could be
shared and would not take too much effort to make generic. I haven't
looked at the code of the crawler in great detail, but do you think the
robots parser would be a good candidate?

Julien

On 2 June 2011 16:23, Karl Wright <da...@gmail.com> wrote:

> Absolutely!
> We're a bit thin on active committers at the moment, which will
> probably limit our ability to take any highly active roles in your
> development process.  But we do have a pile of code which you might be
> able to leverage, and once there is common functionality available I
> think we'd all prefer to use that rather than home-grown code.
>
> How would you prefer that we proceed?
>
> Karl

Re: CrawlerCommons & ManifoldCF

Posted by Karl Wright <da...@gmail.com>.
Absolutely!
We're a bit thin on active committers at the moment, which will
probably limit our ability to take any highly active roles in your
development process.  But we do have a pile of code which you might be
able to leverage, and once there is common functionality available I
think we'd all prefer to use that rather than home-grown code.

How would you prefer that we proceed?

Karl

