Posted to dev@manifoldcf.apache.org by "Karl Wright (Created) (JIRA)" <ji...@apache.org> on 2011/10/16 03:00:21 UTC

[jira] [Created] (CONNECTORS-275) Clarify documentation as to how to set up session login for web connector

Clarify documentation as to how to set up session login for web connector
-------------------------------------------------------------------------

                 Key: CONNECTORS-275
                 URL: https://issues.apache.org/jira/browse/CONNECTORS-275
             Project: ManifoldCF
          Issue Type: Improvement
          Components: Documentation, Web connector
    Affects Versions: ManifoldCF 0.4
            Reporter: Karl Wright


A book reader has this comment, which basically implies that we need to improve the documentation for the web connector:

"I was excited to get the full version of the online book, but then disappointed when it referred back to the online doc for setting up logins for Web spidering. The online doc is very vague and only gives one example. I've used Ultraseek's and Google's spider, but I still find the Session login sequences non-obvious.

I've got a subscription request in to the user mailing list, but here are the parts that are not clear.

I generally understand about using regexes to define sites and sorting out content pages from login pages.

But it's not clear why there are TWO regexes per entry. There's a "Login URL" regex, and also a "Form name/link target" regex.

It's also not clear about the "page type" radio button choices.

For "redirection", am I saying "look for a redirect event", or am I saying "then DO a redirect to this page"?

And for "form name", what if my login page doesn't have a named form? In the case of the site I'm trying to spider, when your session expires, you manually go back to an https page and supply your username and password as CGI parameters. I know this sounds odd, but it's apparently how a number of the sites we're trying to spider work, some proprietary software.

Karl, I really think the book or Wiki or doc needs 3 or 4 different examples of login scenarios.

Here's the scenario I'm trying, if you'd like to use it:

Try to fetch: http://site.com/product?id=1234
If you get a redirect to: http://site.com/Main.asp
Note that there's no login form nor link on this page.
Then invoke this login URL: https://site.com/validate?username=me&password=that&otherArg=something
Note that you can't just visit this page and fill in a form; that gives an error. The credentials have to be passed in (I think as a GET).
Then record the session cookie and try for /product?id=1234 again.

I realize this is odd, I didn't design it. "
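
The scenario quoted above amounts to a small fetch loop. The following is only a minimal Python sketch of that flow, not ManifoldCF code: the `fetch` callable, the `Response` tuple, and the `fake_site` stand-in are hypothetical names invented for illustration, and the URLs are the placeholders from the reader's example.

```python
from typing import Callable, Dict, NamedTuple, Optional

class Response(NamedTuple):
    status: int
    location: Optional[str]      # redirect target, if any
    set_cookie: Dict[str, str]   # cookies the server asks the client to store
    body: str

# Hypothetical fetch signature: (url, cookies) -> Response
Fetch = Callable[[str, Dict[str, str]], Response]

CONTENT_URL = "http://site.com/product?id=1234"
TIMEOUT_URL = "http://site.com/Main.asp"
LOGIN_URL = "https://site.com/validate?username=me&password=that&otherArg=something"

def fetch_with_login(fetch: Fetch, url: str) -> Response:
    """Fetch url; if redirected to the timeout page, log in via GET and retry once."""
    cookies: Dict[str, str] = {}
    first = fetch(url, cookies)
    if first.status in (301, 302) and first.location == TIMEOUT_URL:
        login = fetch(LOGIN_URL, cookies)   # credentials travel in the query string
        cookies.update(login.set_cookie)    # record the session cookie
        return fetch(url, cookies)          # retry the original page
    return first

def fake_site(url: str, cookies: Dict[str, str]) -> Response:
    """Stand-in server that mimics the behaviour described in the scenario."""
    if url == LOGIN_URL:
        return Response(200, None, {"session": "abc"}, "ok")
    if cookies.get("session") == "abc":
        return Response(200, None, {}, "product 1234")
    return Response(302, TIMEOUT_URL, {}, "")

result = fetch_with_login(fake_site, CONTENT_URL)
print(result.status, result.body)
```

The point of the sketch is that the login URL never appears on any fetched page; it has to come from configuration.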



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (CONNECTORS-275) Clarify documentation as to how to set up session login for web connector

Posted by "Karl Wright (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CONNECTORS-275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wright updated CONNECTORS-275:
-----------------------------------

    Fix Version/s: ManifoldCF 0.4
         Assignee: Karl Wright

Documentation was actually updated, and there is agreement that we will open tickets for new features, so I'm going to resolve this ticket.

                
> Clarify documentation as to how to set up session login for web connector
> -------------------------------------------------------------------------
>
>                 Key: CONNECTORS-275
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-275
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Documentation, Web connector
>    Affects Versions: ManifoldCF 0.4
>            Reporter: Karl Wright
>            Assignee: Karl Wright
>             Fix For: ManifoldCF 0.4
>
>         Attachments: CONNECTORS-275.patch

[jira] [Commented] (CONNECTORS-275) Clarify documentation as to how to set up session login for web connector

Posted by "Karl Wright (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CONNECTORS-275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161139#comment-13161139 ] 

Karl Wright commented on CONNECTORS-275:
----------------------------------------

bq. I'm assuming Method A is closer to what you described:
bq.
bq. Method A:
bq.
bq.    * (while on mysite.com/session-timeout-message.html which has a link to login.cgi)
bq.      3: (same as above, matching timeout-msg.html) Tell MCF to Fetch: http://mysite.com/login.cgi
bq.      4: (new, matching login.cgi) Tell MCF that the form name is ^$, and that the parameters are username=me and password=hello.

Yes, exactly.

bq. The only issue here is that, since there is no form on login.cgi, there's no "method=GET" to inherit from.

So the link from the timeout page sends you to login.cgi without any parameters at all, and yet login.cgi requires parameters to perform the login?  Or (I've seen this done before) when you go to http://mysite.com/login.cgi, do you get the form at that time, which when submitted goes right back to login.cgi, but this time with the GET form data?  If the former, we'd need a new type of login page.  If the latter, we could make it work with the current software.

bq. If more code needs to be written, I wasn't necessarily bugging you to write it (though you'd be faster at it!)

Let's think it through first and then see.  Usually in cases like this I create a branch so that we can do multiple commits and not have to put everything in a single (massive) patch.  This also means we can both work on the code.

Adding a new login page type is not that challenging technically, just a bit of work in the UI mostly.  But how would that new login page actually work?  Should it match the URL regexp only, or should there be some other identifying characteristic on the page itself?  And, since there's no form to submit, and there are three different ways to submit a form in HTML, it seems to me that we'd want to basically specify a "virtual form", consisting of everything you might find on a normal form: the form type, the action URL, and a complete set of name/value pairs to be transmitted to the action target.  Does this sound right?
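
To make the "virtual form" idea concrete, here is a rough Python sketch of the data such a login page type would have to carry. It is not a design for the connector: `VirtualForm`, its attribute names, and `to_request` are hypothetical, and only the GET and url-encoded POST cases are shown (multipart is left out).

```python
from dataclasses import dataclass, field
from typing import List, Tuple
from urllib.parse import urlencode
from urllib.request import Request

@dataclass
class VirtualForm:
    method: str                      # "GET" or "POST"
    action: str                      # action URL the form would submit to
    params: List[Tuple[str, str]] = field(default_factory=list)  # name/value pairs

    def to_request(self) -> Request:
        """Build (but do not send) the HTTP request a browser would issue."""
        encoded = urlencode(self.params)
        if self.method.upper() == "GET":
            # GET: the name/value pairs go into the query string.
            return Request(f"{self.action}?{encoded}", method="GET")
        # POST: the same pairs go into a url-encoded request body.
        return Request(
            self.action,
            data=encoded.encode("ascii"),
            headers={"Content-Type": "application/x-www-form-urlencoded"},
            method="POST",
        )

form = VirtualForm("GET", "http://mysite.com/login.cgi",
                   [("username", "me"), ("password", "hello")])
req = form.to_request()
print(req.get_method(), req.full_url)
```

Switching `method` to "POST" keeps the same configuration but moves the pairs into the request body, which is the flexibility argued for above.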





                

[jira] [Commented] (CONNECTORS-275) Clarify documentation as to how to set up session login for web connector

Posted by "Karl Wright (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CONNECTORS-275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161249#comment-13161249 ] 

Karl Wright commented on CONNECTORS-275:
----------------------------------------

bq. So that'd be another problem - to tell MCF to ignore even severe errors (so that we can have the 2 step rule)

It would be better to just extend the link-type login page, so that there is a new checkbox meaning "don't chase the link that was specified but instead substitute the following virtual form."

bq. What if my parameters are long and have spaces and need URL encoding - then I'd have to encode them manually.

That's why I think it's a much better idea to have a notion of a "virtual form", where you enter the parameters and values and standard code does the rest.  It's more flexible too in the end in that it would handle GETs, POSTs, even multipart forms.
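
As a small illustration of the encoding point, and assuming nothing about how ManifoldCF would implement it, Python's standard library shows what "standard code does the rest" looks like when values contain spaces and reserved characters. The parameter names and values are made up.

```python
from urllib.parse import parse_qsl, urlencode

# Hypothetical values containing a space, an ampersand, and an equals sign.
params = [("username", "me"),
          ("password", "hello world&more"),
          ("otherArg", "a=b")]

# The library escapes each name and value; nothing is encoded by hand.
query = urlencode(params)
print(query)

# Decoding the query string gives back exactly what was entered.
decoded = parse_qsl(query)
print(decoded == params)
```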


                

[jira] [Commented] (CONNECTORS-275) Clarify documentation as to how to set up session login for web connector

Posted by "Karl Wright (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CONNECTORS-275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13128319#comment-13128319 ] 

Karl Wright commented on CONNECTORS-275:
----------------------------------------

It would be great to hear some clarification on why pages that obviously would be needed for a user to log into this site using a browser do not exist.  The Web Connector is designed to permit crawling only of sites that can be visited by a human being with a browser; it's not a generic HTTP API crawler by any stretch.

To answer specific questions about the connector itself:

bq. But it's not clear why there's TWO Regex's per entry. There's a "Login URL" regex, and also a "Form name/link target" regex.

The "Login URL" regexp lets you specify, by URL regular expression, which pages are part of the login page sequence.  The "Form name/link target" regexp, combined with the "page type" you mentioned, determines which fetches ManifoldCF regards as part of the login sequence and which it does not.  As it says in the end-user documentation: "You declare a page to be a login page by identifying it both by its URL, and by what the crawler finds on the page when it fetches it."

bq. For "redirection", am I saying "look for a redirect event", or am I saying "then DO a redirect to this page".

You are saying that a fetch whose URL matches the regexp and which results in a redirection will be considered part of the login sequence, and is thus not indexable content.
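
Taken together, the answers above say that a fetch counts as part of the login sequence only when the URL regexp matches and the page shows the expected feature for its page type. Here is a minimal sketch of that decision in Python. It assumes a fetched page can be summarized by its form names, link targets, and redirect target, that the second regexp is applied to the redirect target for the "redirection" type, and that patterns are searched rather than fully matched. `LoginRule`, `FetchedPage`, and `is_login_page` are hypothetical names, not ManifoldCF classes.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class FetchedPage:
    url: str
    form_names: List[str] = field(default_factory=list)    # "" for an unnamed form
    link_targets: List[str] = field(default_factory=list)
    redirect_to: Optional[str] = None

@dataclass
class LoginRule:
    url_regex: str       # the "Login URL" regexp
    page_type: str       # "form", "link", or "redirection"
    target_regex: str    # the "Form name/link target" regexp

def is_login_page(rule: LoginRule, page: FetchedPage) -> bool:
    """True when both the URL and what was found on the page match the rule."""
    if not re.search(rule.url_regex, page.url):
        return False
    if rule.page_type == "form":
        candidates = page.form_names
    elif rule.page_type == "link":
        candidates = page.link_targets
    else:  # "redirection": the fetch itself must have been redirected
        candidates = [page.redirect_to] if page.redirect_to else []
    return any(re.search(rule.target_regex, c) for c in candidates)

rule = LoginRule(r"^http://site\.com/product", "redirection", r"/Main\.asp$")
redirected = FetchedPage("http://site.com/product?id=1234",
                         redirect_to="http://site.com/Main.asp")
content = FetchedPage("http://site.com/product?id=1234")
print(is_login_page(rule, redirected), is_login_page(rule, content))
```

The same URL is treated as login-sequence traffic when it redirects and as indexable content when it does not, which is why a URL regexp alone is not enough.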

bq. And for "form name", what if my login page doesn't have a named form? In the case of the site I'm trying to spider, when your session expires, you manually go back to an https page and supply your username and password as CGI parameters. I know this sounds odd, but it's apparently how a number of the sites we're trying to spider work, some proprietary software.

You can match a missing or blank form name with an empty regexp, or even more specifically "^$", which ONLY matches the empty string.
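
The difference between the two patterns is easy to check. ManifoldCF is a Java project, but for these two patterns Python's `re` module behaves the same way, so the sketch below uses Python. It assumes the pattern is applied as a "find anywhere" search rather than a full match; `matching` and `form_names` are illustrative names.

```python
import re
from typing import List

def matching(pattern: str, names: List[str]) -> List[str]:
    """Return the names in which the pattern is found (search, not full match)."""
    return [name for name in names if re.search(pattern, name) is not None]

form_names = ["", "loginForm"]   # "" stands for a missing or blank form name

print(matching("", form_names))    # the empty regexp is found in every string
print(matching("^$", form_names))  # "^$" is found only in the empty string
```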

Hope this helps.

                

[jira] [Commented] (CONNECTORS-275) Clarify documentation as to how to set up session login for web connector

Posted by "Mark Bennett (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CONNECTORS-275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161170#comment-13161170 ] 

Mark Bennett commented on CONNECTORS-275:
-----------------------------------------

> So the link from the timeout page sends you to login.cgi without any parameters at all, and yet login.cgi requires parameters to perform the login?

I believe so, need to verify.  8 different sites = 8 slightly different behaviors.

> Or (I've seen this done before) when you go to http://mysite.com/login.cgi, do you get the form at that time...

I wish!  At least on some sites, no.

And worse(!) on some sites, if you just go to login.cgi with no parms, you get a nasty error, like maybe a 500.

So that'd be another problem - to tell MCF to ignore even severe errors (so that we can have the 2 step rule)

> But how would that new login page actually work? Should it match the URL regexp only, or should there be some other identifying characteristic on the page itself?

Not sure I'm directly answering this....

But this might be where my habits from other spiders differ enough from MCF's approach that there's an implicit "unlearn *that*!" in my near future.

I'd classify MCF as a reactive pattern matcher.  It can do almost anything based on what it gets back.

Whereas I was thinking more proactive "IF you see url-A THEN GOTO arbitrary-url-B", where the ONLY place literal url-B exists is in the config screen.  In that scenario, where I can inject arbitrary new URLs via configuration, then to me it looks "easy".

In that scenario (arbitrary config injection) we solve all the problems at once.  A URL with ? arg=value & arg=value IS a GET, so no config there.  And I get to specify the args inline, in the URL.
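
The proactive "IF you see url-A THEN GOTO url-B" model described above could be sketched as a small rule table. This is a hypothetical illustration in Python, not a ManifoldCF feature; `INJECTION_RULES` and `next_fetch` are invented names, and the URLs are the placeholder ones used elsewhere in this ticket.

```python
import re
from typing import List, Optional, Tuple

# Each rule: (regexp for the URL that was just seen, literal URL to fetch next).
# The target URL exists only in configuration, never on a crawled page.
INJECTION_RULES: List[Tuple[str, str]] = [
    (r"^http://site\.com/Main\.asp$",
     "https://site.com/validate?username=me&password=that&otherArg=something"),
]

def next_fetch(seen_url: str) -> Optional[str]:
    """Return the configured URL to fetch after seen_url, or None if no rule applies."""
    for pattern, target in INJECTION_RULES:
        if re.search(pattern, seen_url):
            return target
    return None

print(next_fetch("http://site.com/Main.asp"))
print(next_fetch("http://site.com/product?id=1234"))
```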

This is inelegant as a general solution.  I can enumerate a few problems right here: What if it needed to be a POST after all?  What if my parameters are long and have spaces and need URL encoding - then I'd have to encode them manually.  Editing 1.5k URLs in a 3 inch HTML web form is UGLY.  And what if I didn't know the exact URL, but I could calculate it based on some other state?

MCF's model handles all those other items in a much more general, re-usable way.  Whereas the special case of "I just need it to fetch this arbitrary 200 character URL" almost seems like a degenerate use case which coincidentally has an easy fix.  And my only response to that, arguing both sides of the coin here, is that this might be a much more common "edge case" than a software architect might assume.

Do the last few paragraphs make sense?  And did it answer your question?

BTW Karl, this is probably the most detailed (and to me interesting) conversation I've had with anybody about the minutiae of URLs and logins in a while.  Normally I'd corral an engineer in front of a whiteboard, but this is more like how they used to play chess, via US Mail, kinda fun!



                

[jira] [Updated] (CONNECTORS-275) Clarify documentation as to how to set up session login for web connector

Posted by "Mark Bennett (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CONNECTORS-275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Bennett updated CONNECTORS-275:
------------------------------------

    Status: Patch Available  (was: Open)

Adding table comparing page based and session based authentication.
                

[jira] [Issue Comment Edited] (CONNECTORS-275) Clarify documentation as to how to set up session login for web connector

Posted by "Mark Bennett (Issue Comment Edited) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CONNECTORS-275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13167231#comment-13167231 ] 

Mark Bennett edited comment on CONNECTORS-275 at 12/11/11 10:36 PM:
--------------------------------------------------------------------

Update to doc comparing page based and session based authentication.
                
      was (Author: mbennett):
    Updated to doc comparing page based and session based authentication.
                  

[jira] [Commented] (CONNECTORS-275) Clarify documentation as to how to set up session login for web connector

Posted by "Karl Wright (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CONNECTORS-275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13162485#comment-13162485 ] 

Karl Wright commented on CONNECTORS-275:
----------------------------------------

I'd be happy to have contributions on the doc!

The site is checked in under trunk/site, and is rendered by Apache Forrest.  You build it using:

ant clean doc javadoc

Then the built doc (under dist/doc) is imported (via svn import) into svn in a different place, where it is mirrored nightly.  The manual part of the process is not necessarily done on every small change, though, which is why you haven't seen the mirrored site be updated with the documentation improvements yet.

See https://cwiki.apache.org/confluence/display/CONNECTORS/Updating+the+Website for more details.

Submit patches to the doc by doing an svn diff >CONNECTORS-xxx.patch, and attaching the patch to the CONNECTORS-xxx Jira ticket.  Be sure you click the "Grant license to ASF" radio button though, or we can't use your patch.



                

        

[jira] [Commented] (CONNECTORS-275) Clarify documentation as to how to set up session login for web connector

Posted by "Mark Bennett (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CONNECTORS-275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13159906#comment-13159906 ] 

Mark Bennett commented on CONNECTORS-275:
-----------------------------------------

Hi Karl, this is Mark from the original book site post.  I appreciate your actions and will try to clarify your questions.

It's funny, once somebody understands how Manifold works, I'm sure rereading the doc would seem correct and "obvious" - this happens a lot with tech.  But I'm not there yet.

For myself, I've shifted from writing thorough doc to providing minimal doc and many examples.  Maybe I can contribute more to that when I get my sea legs with Manifold.

To your comments:
> It would be great to hear some clarification on why pages that obviously would be needed
> for a user to log into this site using a browser do not exist.

Rereading my post I'm not sure which part this referred to.  But I think it might be related to this:

I think Manifold assumes that, when another login is needed, a redirect will be issued to a login form.  And, given that, we give it regexes to tell which redirects are normal content vs. logins.

But in this system, when the session is expired, it doesn't do a redirect to that login form.  I think it gives some other warning page, maybe with a link, which might be a javascript link, not sure.

So, when we know a session has expired, we need to tell Manifold the literal URL to go back to.  I'm thinking now that Manifold just doesn't support that function at this time?  So maybe the answer to "how do I configure this" is "you don't!" (currently).  I've downloaded the code and am starting to poke around in Eclipse, to maybe extend it.

Although this may sound like an odd edge case, it appears to be quite common with the class of sites we're dealing with.

The rest of your comments made sense, and I'm incorporating them.

Here are some other specific thoughts on the doc, giving you the newb perspective.  I'm being specific not to badger, but to capture specific areas of confusion, hoping to provide more actionable items than just "needs better doc".

There were two "sets" of items that could use a bit more narration.  This is already started in the current doc, and you've expanded it above, but I'll enumerate it here.

In the UI there are three "Regex" labels, literally:
1: "URL regular expression"
2: "Login URL regular expression"
3: "Form name/link target regular expression"

These are similar enough to confuse us newbs, and again you've addressed some of this above.

Then there are three Page types:
1: "Form name"
2: "Link target"
3: "Redirection"

These are already mentioned in the doc, I realize that.  But exactly how the items from the two lists interact, and in which combinations, is murky to us newbs.

I had already asked about "what if there's no form", which you've now answered.  Confirming that, even if there's no form, it'll still use the name value pairs?

I'm also not sure how you'd tell the system to use a GET vs. a POST, when submitting a form.

A further complication is that, on many of the sites, some of the "redirects" and other actions are done with Javascript.  I haven't gone far enough into the doc to see if/how that's handled.  The previous instance of the app we're rewriting was using WebHarvest, which seemed to have a single "magic" boolean flag for handling some of the Javascript appropriately, though I don't know the mechanics of it.

Look forward to continuing the dialog, thanks again Karl!
Mark
                

        

[jira] [Updated] (CONNECTORS-275) Clarify documentation as to how to set up session login for web connector

Posted by "Mark Bennett (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CONNECTORS-275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Bennett updated CONNECTORS-275:
------------------------------------

    Comment: was deleted

(was: Adding table comparing page based and session based authentication.)
    

        

[jira] [Commented] (CONNECTORS-275) Clarify documentation as to how to set up session login for web connector

Posted by "Karl Wright (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CONNECTORS-275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161315#comment-13161315 ] 

Karl Wright commented on CONNECTORS-275:
----------------------------------------

r1209313 to improve web connector end user documentation.

                

        

[jira] [Updated] (CONNECTORS-275) Clarify documentation as to how to set up session login for web connector

Posted by "Mark Bennett (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CONNECTORS-275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Bennett updated CONNECTORS-275:
------------------------------------

    Attachment: CONNECTORS-275.patch

Updated to doc comparing page based and session based authentication.
                

        

[jira] [Updated] (CONNECTORS-275) Clarify documentation as to how to set up session login for web connector

Posted by "Karl Wright (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CONNECTORS-275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wright updated CONNECTORS-275:
-----------------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

Create new feature tickets as needed
                
> Clarify documentation as to how to set up session login for web connector
> -------------------------------------------------------------------------
>
>                 Key: CONNECTORS-275
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-275
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Documentation, Web connector
>    Affects Versions: ManifoldCF 0.4
>            Reporter: Karl Wright
>            Assignee: Karl Wright
>             Fix For: ManifoldCF 0.4
>
>         Attachments: CONNECTORS-275.patch
>
>

        

[jira] [Commented] (CONNECTORS-275) Clarify documentation as to how to set up session login for web connector

Posted by "Karl Wright (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CONNECTORS-275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13159945#comment-13159945 ] 

Karl Wright commented on CONNECTORS-275:
----------------------------------------

Ok, let's start with some basics.  First, the goals of all the setup you have to go through are as follows:

(1) Identify what site, or part of the site, has protected content;
(2) Identify which http/https fetches are not content, but are in fact part of a "login sequence", which a normal person has to go through to get the appropriate cookies.

One of the regexps you supply (the first one) basically describes the set of URLs for which the content is protected, and for which the right cookies have to be in place for you to get at the "real" content.  Once you've specified this, then for each protection zone (described by its URL regexp), you need to specify how ManifoldCF should identify whether a given fetch should be considered part of the login sequence or not.  It's not enough to just identify the URL of login pages, since (for instance) if your session has expired you may well have a redirection get fetched instead of the content you want.  So you specify each class of login page as one of three types, using not only the URL to identify the class (this is where you get the second regexp), but also something about what is on the page: whether it is a redirection to a URL (yes, again described by a URL regexp), whether it has a form with a specified name (described by a regexp), or whether it has a specific link on it (once again, described by a regexp).

You will note in all three cases above that there is an implicit flow through the login sequence that you can describe as part of specifying the login sequence.  For example, if upon session timeout you expect to see a redirection to a link, or family of links (remember, it's a regexp, so you can describe that easily), then as part of identifying the redirection as belonging to the login sequence, the web connector also now has a new link to fetch.  And this is what it does.  The same applies to forms; if the form name that was specified is found, then the web connector submits that form using values for the form elements that you specify, and using the submission type actually mentioned on the form page (GET, POST, or multi-part).  Any other elements of the form are left as whatever the HTML specified; no Javascript is evaluated.  So if you think a form element's value is being set by Javascript, you have to figure out what it is being set to and enter this value by hand as part of the specification for the "form" type of login page.  Usually this only amounts to a user name and password.
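
To make the matching rules described above a little more concrete, here is a minimal Python sketch of the idea.  It is not the web connector's code: the class names, field names, and the choice of substring-style regexp matching (re.search) are all assumptions made for illustration, and the example.com URLs are placeholders.

```python
import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Page:
    """What a crawler saw for one fetch (hypothetical structure)."""
    url: str
    redirect_to: Optional[str] = None          # redirect target, if the fetch was a redirection
    form_names: list = field(default_factory=list)
    link_targets: list = field(default_factory=list)


@dataclass
class LoginPageSpec:
    """One login page entry (hypothetical model of the UI fields)."""
    login_url_regexp: str                      # "Login URL regular expression"
    page_type: str                             # "redirection", "form", or "link"
    target_regexp: str                         # "Form name/link target regular expression"
    form_values: dict = field(default_factory=dict)


def next_login_step(page, spec):
    """Return the action this page triggers, or None if it is not a login page."""
    if not re.search(spec.login_url_regexp, page.url):
        return None
    if spec.page_type == "redirection":
        if page.redirect_to and re.search(spec.target_regexp, page.redirect_to):
            return ("fetch", page.redirect_to)
    elif spec.page_type == "form":
        for name in page.form_names:
            if re.search(spec.target_regexp, name):
                return ("submit", name, spec.form_values)
    elif spec.page_type == "link":
        for target in page.link_targets:
            if re.search(spec.target_regexp, target):
                return ("fetch", target)
    return None


# A redirection that matches the spec yields a new URL to fetch.
spec = LoginPageSpec(r"^http://example\.com/", "redirection", r"/login$")
page = Page("http://example.com/doc1", redirect_to="http://example.com/login")
print(next_login_step(page, spec))
```

The point of the sketch is only that each entry answers two questions at once: "is this fetch part of the login sequence?" and "if so, what does the crawler do next?".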

As far as your site, which redirects you to a page when the session has expired, you would need two specifications for login pages to cover the situation - one for the redirection itself, and one for the page that you get redirected to.  Usually in these situations the target page has at least a link on it that takes you back to the main login form, and that is what you'd use to identify it (it would be a 'link' type login page, where you'd specify the target URL of the link itself using a regexp).  As I said before, if there is no way at all to navigate back from a session expiration to the login form, and the user has to just type the login page URL into his browser again, then the web connector will need another type of login page to model this behavior.  It's not hard to add and I'm willing to do it, but first I really want to know if there are production sites out there that are so user-unfriendly. ;-)
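
As a rough illustration of the two specifications described above, here is a small Python sketch using the URLs from the reported scenario.  The dictionary layout, the helper names, and the regexps themselves are assumptions for illustration only; in particular, the 'link' entry assumes Main.asp carries a link to the validate URL, which the reporter says it does not.

```python
import re

# Hypothetical encoding of the two login page specifications.
REDIRECT_SPEC = {
    "login_url": r"^http://site\.com/product\?",
    "type": "redirection",
    "target": r"^http://site\.com/Main\.asp$",
}
LINK_SPEC = {
    "login_url": r"^http://site\.com/Main\.asp$",
    "type": "link",
    "target": r"^https://site\.com/validate",
}


def redirect_matches(spec, fetched_url, location):
    """True if a redirection from fetched_url to location belongs to the login sequence."""
    return bool(re.search(spec["login_url"], fetched_url)
                and re.search(spec["target"], location))


def first_matching_link(spec, page_url, hrefs):
    """Return the first link on the page that matches the spec, or None."""
    if not re.search(spec["login_url"], page_url):
        return None
    for href in hrefs:
        if re.search(spec["target"], href):
            return href
    return None


# The session-expired redirection is recognized ...
print(redirect_matches(REDIRECT_SPEC,
                       "http://site.com/product?id=1234",
                       "http://site.com/Main.asp"))
# ... but with no link on Main.asp there is nothing to follow.
print(first_matching_link(LINK_SPEC, "http://site.com/Main.asp", []))
```

The second call returning nothing is exactly the gap under discussion: with no link on the target page, the sequence has nowhere to go unless a new login page type is added.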

Now, to answer some of your specific questions:

bq. A further complication is that, on many of the sites, some of the "redirects" and other actions are done with Javascript. 

Javascript can only execute after the page is loaded, while a true redirection (which is done by a return code) precludes any Javascript execution.  If you believe that redirection is handled by Javascript on this site, that implies that the site's content pages actually load, and then Javascript decides there's been a session timeout, and redirects you away.  But the content is in fact all there, and there would be no need to log in at all to crawl the site.  That can't be right!  You'll want to research exactly what is happening; I recommend LiveHttpHeaders on Firefox to figure out what's happening in detail.
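
If a browser extension is not handy, the same check can be sketched in a few lines of Python.  This is only an illustrative probe, not part of ManifoldCF: the function names are made up here, and the list of status codes treated as redirections is an assumption.  It relies on the fact that http.client does not follow redirects, so the status code and Location header of the first response are visible.

```python
import http.client
from urllib.parse import urlsplit


def classify_response(status, location):
    """Say whether a response is a true HTTP redirection (done by return code)."""
    if status in (301, 302, 303, 307, 308) and location:
        return "redirect to " + location
    return "no HTTP redirect (status %d)" % status


def probe(url):
    """Fetch url once, without following redirects, and classify the response."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=10)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("GET", path)
        resp = conn.getresponse()
        return classify_response(resp.status, resp.getheader("Location"))
    finally:
        conn.close()
```

Calling probe() on a protected URL with an expired session would show whether the server answers with a redirection code, or with a normal page whose Javascript then moves the browser elsewhere.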

bq. The previous instance of the app we're rewriting was using WebHarvest, which seemed to have a single "magic" boolean flag for handling some of the Javascript appropriately, though I don't know the mechanics of it.

I'm afraid "magic" is above my pay grade at the moment.  If you learn what they are doing we can look into it further.

bq. I think it gives some other warning page, maybe with a link, which might be a javascript link, not sure.

If everything runs through javascript, that's a problem.  For 'link' type login pages, the web connector only looks for html that it recognizes as a link, e.g. <a href="...">..</a>.  While you could identify the "session expired" page using a link that executes a javascript function perhaps, you could not actually execute that javascript, so the web connector would not know where to go next.
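
The limitation can be illustrated with a short Python sketch.  This is not the web connector's parser; it is a stand-in built on the standard library's html.parser, with made-up names, that collects only plain anchor hrefs and sets aside javascript: pseudo-links, since nothing here can execute them.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect href values from <a> tags only; script-driven navigation is invisible."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def followable_links(html):
    """Return anchor targets that a non-Javascript crawler could actually fetch."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    return [h for h in parser.hrefs
            if not h.strip().lower().startswith("javascript:")]


sample = ('<a href="/login.asp">Log in</a>'
          '<a href="javascript:doLogin()">Log in (script)</a>'
          '<span onclick="go()">Log in (onclick)</span>')
print(followable_links(sample))
```

Only the first of the three "links" in the sample survives, which is why a session-expired page that navigates purely through Javascript gives the crawler no next URL.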

I suggest you do enough research, or point me at the site if it's public, in order to understand what the site is doing before going further.

                

       

[jira] [Commented] (CONNECTORS-275) Clarify documentation as to how to set up session login for web connector

Posted by "Mark Bennett (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CONNECTORS-275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13162233#comment-13162233 ] 

Mark Bennett commented on CONNECTORS-275:
-----------------------------------------

Hi Karl,

I went to take a look at what you had checked in, thanks.  Looks like you added content to the source XML doc at http://svn.apache.org/repos/asf/incubator/lcf/trunk/site/src/documentation/content/xdocs/end-user-documentation.xml, but when I go to look at it on http://incubator.apache.org/connectors/end-user-documentation.html#webrepository I'm not seeing the change.

I'm assuming there's some cron job that runs the XML->HTML conversion; I don't know much about the logistics.

If I had suggested updates to the doc, I imagine I could attach a patch for the XML file to this bug report.  But is there some tool you use to edit it that lets you switch back and forth between WYSIWYG/HTML/XML?

And I'm guessing the doc is in this format so that folks get a local copy?  And so that it's also version controlled under SVN, instead of having separate versions in Confluence?

I've had a chat with some cohorts here and I think I might be able to help with the doc.

Thanks again,
Mark 
                


        

[jira] [Commented] (CONNECTORS-275) Clarify documentation as to how to set up session login for web connector

Posted by "Karl Wright (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CONNECTORS-275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13167439#comment-13167439 ] 

Karl Wright commented on CONNECTORS-275:
----------------------------------------

Would you like to create a new ticket to cover changes to the Web Connector itself, as we've discussed above?  I think this ticket basically covers documentation only.  Once you've decided what you need then I think we'd want a new ticket that is specific for those code changes.

                
> Clarify documentation as to how to set up session login for web connector
> -------------------------------------------------------------------------
>
>                 Key: CONNECTORS-275
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-275
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Documentation, Web connector
>    Affects Versions: ManifoldCF 0.4
>            Reporter: Karl Wright
>         Attachments: CONNECTORS-275.patch


        

[jira] [Commented] (CONNECTORS-275) Clarify documentation as to how to set up session login for web connector

Posted by "Mark Bennett (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CONNECTORS-275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161116#comment-13161116 ] 

Mark Bennett commented on CONNECTORS-275:
-----------------------------------------

Hi Karl,

Agreed on pretty much every point; I'll need to do some research this weekend.  The additional explanations about regexes are actually very helpful.

In thinking more about this, I may have misspoken earlier about what happens when a login is needed.  Let's say a session has expired, and the entire site matches the first regex.

1: Attempted Fetch: http://mysite.com/page1.html, but my session has timed out.
2: Redirected to: http://mysite.com/session-timeout-message.html, which does NOT have a form, but DOES have a link to a login page, login.cgi
3: Using rules I could tell MCF to Fetch: http://mysite.com/login.cgi

Here's the issue:

On login.cgi there is no form.

In Step 3 above, what I'd really want to do is say:

Fetch: http://mysite.com/login.cgi?username=me&password=hello

From what you've said, I think I would either need to:

A: Keep step 3, and add a step 4 with the parameters
or
B: Modify step 3 to include arguments

I'm assuming Method A is closer to what you described:

Method A:

* (while on mysite.com/session-timeout-message.html which has a link to login.cgi)
3: (same as above, matching session-timeout-message.html) Tell MCF to Fetch: http://mysite.com/login.cgi
4: (new, matching login.cgi) Tell MCF that the form name is ^$, and that the parameters are username=me and password=hello.

The only issue here is that, since there is no form on login.cgi, there's no "method=GET" to inherit from.

Is this closer to what you were saying?  And as I said, I need to verify that this is exactly what's happening.
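
Method A can be sketched roughly in Python as follows. This is only a model for discussion, not MCF's configuration format or code: the LoginRule structure, its field names, the "link"/"form" labels, and the next_fetch helper are all made up for this example, and sending the parameters as a GET query string is an assumption, since there is no form to inherit a method from.

```python
import re
from dataclasses import dataclass, field
from urllib.parse import urlencode, urljoin


@dataclass
class LoginRule:
    url_regex: str      # which fetched URL this rule applies to
    page_type: str      # "link" or "form" (hypothetical labels)
    target_regex: str   # link target or form name to look for on that page
    parameters: dict = field(default_factory=dict)  # values to send (form rules)


RULES = [
    # Step 3: on the timeout page, follow the link to login.cgi.
    LoginRule(r"session-timeout-message\.html$", "link", r"login\.cgi$"),
    # Step 4: on login.cgi, match an unnamed form and supply the credentials.
    LoginRule(r"login\.cgi$", "form", r"^$", {"username": "me", "password": "hello"}),
]


def next_fetch(current_url, link_targets, form_names):
    """Return the URL to fetch next for the first applicable rule, or None."""
    for rule in RULES:
        if not re.search(rule.url_regex, current_url):
            continue
        if rule.page_type == "link":
            for target in link_targets:
                if re.search(rule.target_regex, target):
                    return urljoin(current_url, target)
        elif rule.page_type == "form":
            # "^$" matches only an empty name, i.e. a form with no name.
            if any(re.search(rule.target_regex, name) for name in form_names):
                return current_url + "?" + urlencode(rule.parameters)
    return None


timeout_page = "http://mysite.com/session-timeout-message.html"
step3 = next_fetch(timeout_page, ["login.cgi"], [])
step4 = next_fetch(step3, [], [""])    # login.cgi with one unnamed form
no_form = next_fetch(step3, [], [])    # login.cgi with no form at all
print(step3)
print(step4)
print(no_form)
```

In this model, step 4 only produces a URL when login.cgi actually contains a form for the ^$ pattern to match. With no form on the page, the rule finds nothing to submit, which is exactly the gap described above.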

WRT Coding:

If more code needs to be written, I wasn't necessarily bugging you to write it (though you'd be faster at it!)

Sadly the sites are under NDA (eCommerce stuff).  If I got completely stuck and couldn't code my way out of it, and you still had time to volunteer, then maybe we could talk about NDAs, but that's way over the line of what I'd expect from another volunteer developer.

The good news is that this coding (with Manifold) is on my own time, in frustration with the legacy code, so although I couldn't share the specific logins, the resulting code would be unencumbered.  WebHarvest is interesting, but seems like if you want any threading, persistence or job management you get to write it yourself, and thus MCF seems way more attractive. ;-)

                


        

[jira] [Commented] (CONNECTORS-275) Clarify documentation as to how to set up session login for web connector

Posted by "Karl Wright (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CONNECTORS-275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13167297#comment-13167297 ] 

Karl Wright commented on CONNECTORS-275:
----------------------------------------

Committed this patch.

                
