Posted to user@manifoldcf.apache.org by Michael Kooloos <mk...@hotmail.com> on 2012/08/30 15:25:34 UTC

Web connector - Session-based access credentials


Hi,

Does anyone have a working example of session-based access credentials for the web connector? I have been following the end-user documentation as closely as possible, but still no luck :(

Thanks!

RE: Web connector - Session-based access credentials

Posted by Michael Kooloos <mk...@hotmail.com>.
Sorry, I meant "second last page" ;)

> Date: Mon, 3 Sep 2012 08:18:02 -0400
> Subject: Re: Web connector - Session-based access credentials
> From: daddywri@gmail.com
> To: user@manifoldcf.apache.org
> 
> What do you mean, "first last page"?
> The Web Connector needs to refetch the page that caused the
> redirection, because that is likely to be a content page based on the
> user's own description of the login sequence.  Otherwise pages would
> be missing from the crawl, whenever login needed to be redone.
> 
> Karl

Re: Web connector - Session-based access credentials

Posted by Karl Wright <da...@gmail.com>.
What do you mean, "first last page"?
The Web Connector needs to refetch the page that caused the
redirection, because that is likely to be a content page based on the
user's own description of the login sequence.  Otherwise pages would
be missing from the crawl, whenever login needed to be redone.

Karl
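
The point about pages going missing can be illustrated with a small simulation. This is only a sketch of the reasoning in this message, not ManifoldCF code: the `crawl` function, its parameters, and the idea of a session that expires after a fixed number of fetches are all assumptions made for the example.

```python
# Hypothetical model: the session expires after `session_lifetime` fetches,
# and the next fetch is redirected to the logon page.

def crawl(pages, session_lifetime, refetch_after_login):
    """Crawl pages in order and return the pages actually fetched.

    session_lifetime must be at least 1.
    """
    fetched = []
    remaining = 0  # fetches left in the current session; 0 means not logged in
    for page in pages:
        if remaining == 0:
            # This fetch hit the logon redirect, so run the login sequence.
            remaining = session_lifetime
            if not refetch_after_login:
                # Without a refetch, the page that triggered the login is lost.
                continue
        fetched.append(page)
        remaining -= 1
    return fetched


pages = ["a", "b", "c", "d", "e"]
print(crawl(pages, 2, refetch_after_login=True))
print(crawl(pages, 2, refetch_after_login=False))
```

With a session lifetime of two fetches, the first call returns all five pages, while the second returns only b, c and e: every page that happened to trigger a login is dropped.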

On Mon, Sep 3, 2012 at 7:43 AM, Michael Kooloos <mk...@hotmail.com> wrote:
> No, the same thing happens in the browser as well, so I need to find a different seed
> page that doesn't have this behaviour, but no luck there yet.
>
> Another way to solve this 'issue' would be for the connector to go back to the
> first last page after finishing the login sequence, instead of the last page
> (since the last page stays in a loop). That should be possible, right?
>
> Michael

RE: Web connector - Session-based access credentials

Posted by Michael Kooloos <mk...@hotmail.com>.
No, the same thing happens in the browser as well, so I need to find a different seed page that doesn't have this behaviour, but no luck there yet.

Another way to solve this 'issue' would be for the connector to go back to the first last page after finishing the login sequence, instead of the last page (since the last page stays in a loop). That should be possible, right?

Michael

> Date: Mon, 3 Sep 2012 07:15:20 -0400
> Subject: Re: Web connector - Session-based access credentials
> From: daddywri@gmail.com
> To: user@manifoldcf.apache.org
> 
> Ok - if the redirect is occurring in a browser whether or not you are
> logged in, then yes, you cannot use that page as a seed.  If this only
> seems to happen in the Web Connector, on the other hand, we should
> keep talking, because your login sequence is not actually succeeding
> in setting up the session cookies properly.
> 
> Thanks!
> Karl

Re: Web connector - Session-based access credentials

Posted by Karl Wright <da...@gmail.com>.
Ok - if the redirect is occurring in a browser whether or not you are
logged in, then yes, you cannot use that page as a seed.  If this only
seems to happen in the Web Connector, on the other hand, we should
keep talking, because your login sequence is not actually succeeding
in setting up the session cookies properly.

Thanks!
Karl

On Mon, Sep 3, 2012 at 6:14 AM, Michael Kooloos <mk...@hotmail.com> wrote:
> Hi Karl,
>
> Thanks. Found the issue: the seed document keeps redirecting to the logon
> page (even after login has occurred). This is an issue (protection?) of the
> website and it now makes sense to me why the connector stays in a loop.
> I haven't found a solution yet; I have to find a more appropriate seed document
> or a way to skip the redirect the second time it enters the loop.
>
> Many thanks for your support!

RE: Web connector - Session-based access credentials

Posted by Michael Kooloos <mk...@hotmail.com>.
Hi Karl,

Thanks. Found the issue: the seed document keeps redirecting to the logon page (even after login has occurred). This is an issue (protection?) of the website and it now makes sense to me why the connector stays in a loop. I haven't found a solution yet; I have to find a more appropriate seed document or a way to skip the redirect the second time it enters the loop.

Many thanks for your support!

> Date: Thu, 30 Aug 2012 11:52:01 -0400
> Subject: Re: Web connector - Session-based access credentials
> From: daddywri@gmail.com
> To: user@manifoldcf.apache.org
> 
> If I understand how you have it set up, what the ManifoldCF web
> connector will do is this:
> 
> (1) Fetch the seed document.
> (2) Take the redirection to the logon page, and thus enter the login sequence
> (3) Do the login sequence and establish the correct cookies
> (4) Refetch the seed document
> (5) Take the redirection to the logon page...
> 
> So, as you can see, your seed document must redirect ONLY if login has
> not yet occurred, or you will be stuck in a loop.  So either fix that,
> or choose a more appropriate seed document.
> 
> On a normal site, typically you get different results on most content
> pages when login has occurred vs. when login has not yet occurred.  It
> is up to you to define in the Web Connector what combination of pages
> and content constitute a logon request vs. normal content fetch.  And
> that's the whole problem, and why this is so complicated.
> 
> Thanks,
> Karl

Re: Web connector - Session-based access credentials

Posted by Karl Wright <da...@gmail.com>.
If I understand how you have it set up, what the ManifoldCF web
connector will do is this:

(1) Fetch the seed document.
(2) Take the redirection to the logon page, and thus enter the login sequence
(3) Do the login sequence and establish the correct cookies
(4) Refetch the seed document
(5) Take the redirection to the logon page...

So, as you can see, your seed document must redirect ONLY if login has
not yet occurred, or you will be stuck in a loop.  So either fix that,
or choose a more appropriate seed document.
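
The five steps above can be sketched as a small simulation. It only illustrates the control flow described in this message and is not ManifoldCF's actual code: the `Site` class, the `crawl_seed` function and the `max_logins` guard are hypothetical names, and the guard is there only so that the demonstration terminates.

```python
# Hypothetical model of the fetch / login / refetch cycle described above.
# None of these names come from ManifoldCF; they exist only for illustration.

class Site:
    """A toy site whose seed page redirects to the logon page."""

    def __init__(self, redirects_when_logged_in):
        # True models the broken case: the seed redirects even after login.
        self.redirects_when_logged_in = redirects_when_logged_in
        self.logged_in = False

    def fetch_seed(self):
        """Return 'content' or 'redirect-to-logon' for the seed page."""
        if not self.logged_in or self.redirects_when_logged_in:
            return "redirect-to-logon"
        return "content"

    def login(self):
        """Run the login sequence and establish the session cookies."""
        self.logged_in = True


def crawl_seed(site, max_logins=3):
    """Follow steps (1)-(5); give up after max_logins login sequences."""
    logins = 0
    while True:
        result = site.fetch_seed()      # steps (1) and (4)
        if result == "content":
            return "fetched after %d login(s)" % logins
        if logins == max_logins:        # guard so the demo terminates
            return "loop detected after %d login(s)" % logins
        site.login()                    # steps (2) and (3)
        logins += 1


print(crawl_seed(Site(redirects_when_logged_in=False)))
print(crawl_seed(Site(redirects_when_logged_in=True)))
```

A seed that redirects only before login is fetched after one login sequence. A seed that redirects regardless of login state never yields content, which is the loop described here.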

On a normal site, typically you get different results on most content
pages when login has occurred vs. when login has not yet occurred.  It
is up to you to define in the Web Connector what combination of pages
and content constitute a logon request vs. normal content fetch.  And
that's the whole problem, and why this is so complicated.

Thanks,
Karl
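
One part of that definition is a URL pattern for the pages that belong to the logon sequence. A quick offline check that none of the seed URLs match that pattern might look like the sketch below. The pattern and the `example.com` URLs are invented for the example, `seeds_in_login_sequence` is a hypothetical helper, and Python's `re` syntax is used here, so the connector's own regular-expression dialect may differ in details.

```python
import re

# Hypothetical login-page pattern and seed URLs, invented for this example.
LOGIN_PAGE_PATTERN = re.compile(r"^https?://www\.example\.com/login(\?.*)?$")


def seeds_in_login_sequence(seed_urls, login_pattern=LOGIN_PAGE_PATTERN):
    """Return the seeds that the login-page pattern would treat as logon pages."""
    return [url for url in seed_urls if login_pattern.search(url)]


seeds = [
    "http://www.example.com/reports/index.html",
    "http://www.example.com/login?next=/reports/index.html",
]

# Any URL printed here would be both a seed and a logon page,
# which is the situation that leads to a loop.
print(seeds_in_login_sequence(seeds))
```

A check like this only covers the URL patterns. It cannot detect a seed that redirects to the logon page regardless of login state, which has to be confirmed in a browser.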

On Thu, Aug 30, 2012 at 11:38 AM, Michael Kooloos <mk...@hotmail.com> wrote:
> Karl,
>
> My seed document is not a logon page, but the seed document URL
> automatically redirects to the logon page. So the first regex is that of the
> logon page, and the regex for the Login URL is the same (since it's the
> logon page), type = Form. Do I define any redirect after the logon form?
>
> Hope this makes a bit of sense.
>
> Didn't think it would be that hard to set up some access credentials.
>
>> Date: Thu, 30 Aug 2012 10:03:20 -0400
>
>> Subject: Re: Web connector - Session-based access credentials
>> From: daddywri@gmail.com
>> To: user@manifoldcf.apache.org
>>
>> It sounds like your regular expression(s) which describe what pages
>> belong to the logon sequence may be incorrect. After the logon
>> sequence exits, the crawler will attempt to refetch the page it was
>> working on before it entered the logon sequence. If that page is PART
>> of the logon sequence it will loop as you describe.
>>
>> Your seed documents should therefore NOT be logon pages or you will
>> never get anywhere...
>>
>> Karl
>>
>> On Thu, Aug 30, 2012 at 9:58 AM, Michael Kooloos <mk...@hotmail.com>
>> wrote:
>> > Karl,
>> >
>> > I've read through the similar problems/questions on the list (only found
>> > 3),
>> > but without any luck. In the Seed I've the page I want to crawl, but
>> > this on
>> > protected by security, so I setup a redirect to the login-page and a
>> > form
>> > for the login-page with the username/password parameters. When I look in
>> > the
>> > Simple History I see the fetch of the first page, the begin-logon,
>> > redirect
>> > to the login-page, the end-logon, but then it starts all over again and
>> > keeps in a loop. Any ideas? I think a working example will help me a
>> > lot..
>> >
>> > Michael
>> >
>> >> Date: Thu, 30 Aug 2012 09:29:08 -0400
>> >> Subject: Re: Web connector - Session-based access credentials
>> >> From: daddywri@gmail.com
>> >> To: user@manifoldcf.apache.org
>> >
>> >>
>> >> I set it up to crawl Angie's List at one point. It was developed to
>> >> crawl an oil-and-gas exploration subscription site. Others have
>> >> fielded fairly detailed questions and/or problems to this list, so I
>> >> know it has been used by many.
>> >>
>> >> Can you give a more thorough and detailed description of what your are
>> >> trying to crawl, and what is happening for you?
>> >>
>> >> Karl
>> >>
>> >> On Thu, Aug 30, 2012 at 9:25 AM, Michael Kooloos <mk...@hotmail.com>
>> >> wrote:
>> >> >
>> >> > Hi,
>> >> >
>> >> > Does anyone have a working example of the session-based access
>> >> > credentials
>> >> > for the web connector? Following the end-user-documentation as good
>> >> > as
>> >> > possible, but still no luck :(
>> >> >
>> >> > Thanks!

RE: Web connector - Session-based access credentials

Posted by Michael Kooloos <mk...@hotmail.com>.
Karl,

My seed document is not a logon page, but the seed document url automatically redirects to the logon pages. So the first regex is of the logon page, then the regex for the Login URL is the same (since it's the logon page), type = Form. Do I define any redirect after the logon form?

Hope this makes a bit of sense..

Didn't think it would be that hard to set up some access credentials..

> Date: Thu, 30 Aug 2012 10:03:20 -0400
> Subject: Re: Web connector - Session-based access credentials
> From: daddywri@gmail.com
> To: user@manifoldcf.apache.org
> 
> It sounds like your regular expression(s) which describe what pages
> belong to the logon sequence may be incorrect.  After the logon
> sequence exits, the crawler will attempt to refetch the page it was
> working on before it entered the logon sequence.  If that page is PART
> of the logon sequence it will loop as you describe.
> 
> Your seed documents should therefore NOT be logon pages or you will
> never get anywhere...
> 
> Karl
> 
> On Thu, Aug 30, 2012 at 9:58 AM, Michael Kooloos <mk...@hotmail.com> wrote:
> > Karl,
> >
> > I've read through the similar problems/questions on the list (only found 3),
> > but without any luck. In the Seed I've the page I want to crawl, but this on
> > protected by security, so I setup a redirect to the login-page and a form
> > for the login-page with the username/password parameters. When I look in the
> > Simple History I see the fetch of the first page, the begin-logon, redirect
> > to the login-page, the end-logon, but then it starts all over again and
> > keeps in a loop. Any ideas? I think a working example will help me a lot..
> >
> > Michael
> >
> >> Date: Thu, 30 Aug 2012 09:29:08 -0400
> >> Subject: Re: Web connector - Session-based access credentials
> >> From: daddywri@gmail.com
> >> To: user@manifoldcf.apache.org
> >
> >>
> >> I set it up to crawl Angie's List at one point. It was developed to
> >> crawl an oil-and-gas exploration subscription site. Others have
> >> fielded fairly detailed questions and/or problems to this list, so I
> >> know it has been used by many.
> >>
> >> Can you give a more thorough and detailed description of what your are
> >> trying to crawl, and what is happening for you?
> >>
> >> Karl
> >>
> >> On Thu, Aug 30, 2012 at 9:25 AM, Michael Kooloos <mk...@hotmail.com>
> >> wrote:
> >> >
> >> > Hi,
> >> >
> >> > Does anyone have a working example of the session-based access
> >> > credentials
> >> > for the web connector? Following the end-user-documentation as good as
> >> > possible, but still no luck :(
> >> >
> >> > Thanks!

Re: Web connector - Session-based access credentials

Posted by Karl Wright <da...@gmail.com>.
It sounds like your regular expression(s) which describe what pages
belong to the logon sequence may be incorrect.  After the logon
sequence exits, the crawler will attempt to refetch the page it was
working on before it entered the logon sequence.  If that page is PART
of the logon sequence it will loop as you describe.

Your seed documents should therefore NOT be logon pages or you will
never get anywhere...
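
A quick way to sanity-check this before running a job is to test the
seed URL against the same regular expressions used to describe the
logon sequence. The sketch below is an assumption-laden illustration,
not part of ManifoldCF: the function name, the example.com URLs, and
the pattern are all hypothetical.

```python
import re


def seed_in_login_sequence(seed_url, login_page_patterns):
    """Return True if the seed URL matches any logon-sequence regex."""
    return any(re.search(p, seed_url) for p in login_page_patterns)


# Hypothetical patterns and URLs, for illustration only.
patterns = [r"/login\.jsp"]
looping_seed = seed_in_login_sequence("http://example.com/login.jsp", patterns)
usable_seed = seed_in_login_sequence("http://example.com/reports/index.html", patterns)
print(looping_seed, usable_seed)
```

If the check returns True for a seed, that seed is itself part of the
logon sequence and the refetch after login would re-enter the sequence.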

Karl

On Thu, Aug 30, 2012 at 9:58 AM, Michael Kooloos <mk...@hotmail.com> wrote:
> Karl,
>
> I've read through the similar problems/questions on the list (only found 3),
> but without any luck. In the Seed I've the page I want to crawl, but this on
> protected by security, so I setup a redirect to the login-page and a form
> for the login-page with the username/password parameters. When I look in the
> Simple History I see the fetch of the first page, the begin-logon, redirect
> to the login-page, the end-logon, but then it starts all over again and
> keeps in a loop. Any ideas? I think a working example will help me a lot..
>
> Michael
>
>> Date: Thu, 30 Aug 2012 09:29:08 -0400
>> Subject: Re: Web connector - Session-based access credentials
>> From: daddywri@gmail.com
>> To: user@manifoldcf.apache.org
>
>>
>> I set it up to crawl Angie's List at one point. It was developed to
>> crawl an oil-and-gas exploration subscription site. Others have
>> fielded fairly detailed questions and/or problems to this list, so I
>> know it has been used by many.
>>
>> Can you give a more thorough and detailed description of what your are
>> trying to crawl, and what is happening for you?
>>
>> Karl
>>
>> On Thu, Aug 30, 2012 at 9:25 AM, Michael Kooloos <mk...@hotmail.com>
>> wrote:
>> >
>> > Hi,
>> >
>> > Does anyone have a working example of the session-based access
>> > credentials
>> > for the web connector? Following the end-user-documentation as good as
>> > possible, but still no luck :(
>> >
>> > Thanks!

RE: Web connector - Session-based access credentials

Posted by Michael Kooloos <mk...@hotmail.com>.
Karl,

I've read through the similar problems/questions on the list (only found 3), but without any luck. In the Seed I have the page I want to crawl, but this one is protected by security, so I set up a redirect to the login-page and a form for the login-page with the username/password parameters. When I look in the Simple History I see the fetch of the first page, the begin-logon, the redirect to the login-page, and the end-logon, but then it starts all over again and keeps looping. Any ideas? I think a working example will help me a lot..

Michael

> Date: Thu, 30 Aug 2012 09:29:08 -0400
> Subject: Re: Web connector - Session-based access credentials
> From: daddywri@gmail.com
> To: user@manifoldcf.apache.org
> 
> I set it up to crawl Angie's List at one point.  It was developed to
> crawl an oil-and-gas exploration subscription site.  Others have
> fielded fairly detailed questions and/or problems to this list, so I
> know it has been used by many.
> 
> Can you give a more thorough and detailed description of what your are
> trying to crawl, and what is happening for you?
> 
> Karl
> 
> On Thu, Aug 30, 2012 at 9:25 AM, Michael Kooloos <mk...@hotmail.com> wrote:
> >
> > Hi,
> >
> > Does anyone have a working example of the session-based access credentials
> > for the web connector? Following the end-user-documentation as good as
> > possible, but still no luck :(
> >
> > Thanks!

Re: Web connector - Session-based access credentials

Posted by Karl Wright <da...@gmail.com>.
I set it up to crawl Angie's List at one point.  It was developed to
crawl an oil-and-gas exploration subscription site.  Others have
fielded fairly detailed questions and/or problems to this list, so I
know it has been used by many.

Can you give a more thorough and detailed description of what you are
trying to crawl, and what is happening for you?

Karl

On Thu, Aug 30, 2012 at 9:25 AM, Michael Kooloos <mk...@hotmail.com> wrote:
>
> Hi,
>
> Does anyone have a working example of the session-based access credentials
> for the web connector? Following the end-user-documentation as good as
> possible, but still no luck :(
>
> Thanks!