Posted to dev@oodt.apache.org by Lewis John Mcgibbney <le...@gmail.com> on 2014/05/01 19:35:50 UTC

CAS Crawler Crawling Code

Hi Folks,
I'm jumping between ProductCrawler and StdIngester, trying to pinpoint
_exactly_ where product fetching actually happens.
I'm aware of the triple-headed nature of crawler workflows, e.g.
preIngestion, postIngestionSuccess and postIngestionFailure... I can see
the logic within the ProductCrawler code... what I cannot locate is where
HTTP/transport socket connections are created and used.
Can anyone please point this out?
Thanks
Lewis

Re: CAS Crawler Crawling Code

Posted by Lewis John Mcgibbney <le...@gmail.com>.

Nailed it.
Stepping through in Eclipse will help out a lot.
Have a great weekend folks :-)
On May 1, 2014 11:40 AM, "Chris Mattmann" <ch...@gmail.com> wrote:

> Hey Lewis,
>
> That's b/c Crawler doesn't do HTTP connections.
> PushPull is the component where that occurs. We
> specifically made Crawler handle only local data
> and refactored the protocol layer/functionality
> into PushPull. The two operate through a shared
> 'staging' directory structure and through
> Crawler preconditions and Actions.
>
> Scope out PushPull and then we can discuss.
>
> Thanks dude.
>
> Cheers,
> Chris
>
> ------------------------
> Chris Mattmann
> chris.mattmann@gmail.com
>
>
>
>
> -----Original Message-----
> From: Lewis John Mcgibbney <le...@gmail.com>
> Reply-To: <us...@oodt.apache.org>
> Date: Thursday, May 1, 2014 10:35 AM
> To: <us...@oodt.apache.org>
> Subject: CAS Crawler Crawling Code
>
> >Hi Folks,
> >I'm jumping between ProductCrawler and StdIngester, trying to pinpoint
> >_exactly_ where product fetching actually happens.
> >I'm aware of the triple-headed nature of crawler workflows, e.g.
> >preIngestion, postIngestionSuccess and postIngestionFailure... I can see
> >the logic within the ProductCrawler code... what I cannot locate is where
> >HTTP/transport socket connections are created and used.
> >
> >Can anyone please point this out?
> >Thanks
> >Lewis
>
>
>

Re: CAS Crawler Crawling Code

Posted by Chris Mattmann <ch...@gmail.com>.

Hey Lewis,

That's b/c Crawler doesn't do HTTP connections.
PushPull is the component where that occurs. We
specifically made Crawler handle only local data
and refactored the protocol layer/functionality
into PushPull. The two operate through a shared
'staging' directory structure and through
Crawler preconditions and Actions.
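
To make that split concrete, here is a rough, language-neutral sketch in
Python. It is not OODT code: fetch_to_staging, crawl_staging, and demo are
hypothetical stand-ins, and the "remote" side is faked with an in-memory
dict. The only names taken from this thread are the three phases
(preIngestion, postIngestionSuccess, postIngestionFailure). The point it
tries to show is that the crawl loop never opens a socket; it only sees
files that something else has already dropped into the staging directory.

```python
import tempfile
from pathlib import Path


def fetch_to_staging(remote_files, staging):
    # Stand-in for the PushPull role (hypothetical): the only place where
    # remote transfers would happen. Here the "remote" data is just a dict.
    for name, payload in remote_files.items():
        (staging / name).write_bytes(payload)


def crawl_staging(staging, preconditions, ingest, actions):
    # Stand-in for the Crawler role (hypothetical): reads local files only.
    results = []
    for path in sorted(p for p in staging.iterdir() if p.is_file()):
        # Preconditions gate whether a file is considered at all.
        if not all(check(path) for check in preconditions):
            results.append((path.name, "skipped"))
            continue
        for action in actions.get("preIngestion", []):
            action(path)
        try:
            ingest(path)
        except Exception:
            phase, status = "postIngestionFailure", "failed"
        else:
            phase, status = "postIngestionSuccess", "ingested"
        # Run whichever post-ingestion actions match the outcome.
        for action in actions.get(phase, []):
            action(path)
        results.append((path.name, status))
    return results


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        staging = Path(tmp)
        fetch_to_staging(
            {"a.dat": b"ok", "b.dat": b"", "c.dat": b"bad"}, staging
        )

        def non_empty(path):
            # Example precondition: ignore zero-byte files.
            return path.stat().st_size > 0

        def ingest(path):
            # Example ingest step that rejects one particular payload.
            if path.read_bytes() == b"bad":
                raise ValueError("rejected by catalog")

        # Example action: remove the staged file once it has been ingested.
        actions = {"postIngestionSuccess": [lambda p: p.unlink()]}
        return crawl_staging(staging, [non_empty], ingest, actions)
```

Calling demo() stages three files and crawls them: one is ingested and
cleaned up, one is skipped by the precondition, and one fails ingestion.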

Scope out PushPull and then we can discuss.

Thanks dude.

Cheers,
Chris

------------------------
Chris Mattmann
chris.mattmann@gmail.com




-----Original Message-----
From: Lewis John Mcgibbney <le...@gmail.com>
Reply-To: <us...@oodt.apache.org>
Date: Thursday, May 1, 2014 10:35 AM
To: <us...@oodt.apache.org>
Subject: CAS Crawler Crawling Code

>Hi Folks,
>I'm jumping between ProductCrawler and StdIngester, trying to pinpoint
>_exactly_ where product fetching actually happens.
>I'm aware of the triple-headed nature of crawler workflows, e.g.
>preIngestion, postIngestionSuccess and postIngestionFailure... I can see
>the logic within the ProductCrawler code... what I cannot locate is where
>HTTP/transport socket connections are created and used.
>
>Can anyone please point this out?
>Thanks
>Lewis
