Posted to user@manifoldcf.apache.org by Fred Schmitt <fr...@web.de> on 2010/10/20 12:59:04 UTC

Problems while indexing Jira/ and an other problem

Hi all,

I'm trying to index a Jira system and have run into some problems.

I am using a Web Connector and my seed is the login page. The login process seems to work.
But when I fetch and index the next page, to which the login page redirects, I am no longer logged in.
Here is an extract of my job history.

Start Time               Activity     Identifier                  Result Code   Bytes   Time
10-12-2010 14:52:56.735  end logon    http://.../jira/login.jsp   OK            0       1
10-12-2010 14:52:56.720  fetch        http://.../jira/login.jsp   302           0       16
10-12-2010 14:52:56.704  begin logon  http://.../jira/login.jsp   OK            0       1
10-12-2010 14:52:56.674  fetch        http://.../jira/login.jsp   200           5702    15
10-12-2010 14:52:54.423  job start    1285328088067(jira)                       0       1

Jira login is based on cookies, but I haven't found a way to control cookies manually in MCF.
I have found out that MCF supports cookies.
So is it possible to control and set cookies, or how can I make the crawler stay logged in?

Some thoughts:
I have found out that Jira is also based on the Lucene core, like Solr, which my Output Connection points to.
Jira has its own index for its built-in search.
Do you know if it is possible to merge the indexes from Solr and Jira?


There is another problem I have while creating a job with the Windows Share Connector.
I selected the "Paths" window and created a new path.
After I add the path and click the "insert" button, I get this exception in Firebug:
"missing ; before statement http://localhost:8080/mcf-crawler-ui/execute.jsp Line 1"

Best regards,
Fred

Re: Problems while indexing Jira/ and an other problem

Posted by Karl Wright <da...@gmail.com>.
A couple of comments.
(1) The place where you start is called the "seed".  Let's use that
terminology so I don't get confused.
(2) Every session-protected area has a regular expression that is
supposed to match all protected pages in the area.  In the connection
edit page Access Credentials tab, for Session authentication, this
appears as a column called "URL regular expression".  It is this
expression I was referring to when I said that it seemed incorrect for
your task.  This has nothing to do with the seeds for the crawl.
(3) Even with a JavaScript login, it is usually possible to fill in the
form with information that performs the login.  The only exception is
when some complex logic, such as MD5 or string manipulation, is used
to fill in the form fields.  A good way to do it is to read the
JavaScript and figure out what it is trying to do, and then just fill
in the form fields with what the JavaScript would have done.
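
As a rough sketch of point (3): suppose, purely hypothetically, that a
login page's script lower-cases the user name into a hidden field before
the form is submitted. The same values can be worked out ahead of time and
entered as override form parameters. The field names username and password
come from this thread; normalized_user, the transformation, and the sample
values are invented for illustration and are not taken from Jira or MCF.

```python
from urllib.parse import urlencode

def login_form_fields(username: str, password: str) -> dict:
    """Reproduce by hand what the (hypothetical) page script would compute."""
    return {
        "username": username,
        "password": password,
        # Invented hidden field: the script is assumed to lower-case the name.
        "normalized_user": username.strip().lower(),
    }

fields = login_form_fields("Fred", "secret")
# These key/value pairs are what would be typed into the override table.
print(urlencode(fields))
```

If the script's logic depends on values that change on every page load,
this kind of precomputation no longer works, which is the exception
mentioned above.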

Hope this helps.  I'd actually try crawling a Jira instance myself to
be of further assistance, but I'm really pressed for time right now.

Karl

On Mon, Oct 25, 2010 at 3:14 PM, Fred Schmitt <fr...@web.de> wrote:
> Hi,
> Sorry, it doesn't work, or I misunderstood you. Here is what I tried.
>
> I started at "http://.../jira/". This site redirects me to "http://.../jira/secure/Dashboard.jspa". That doesn't work: I have to exclude many pages because I get content I don't want, and the login is not really working.
> "http://.../jira/secure/Dashboard.jspa" is another login page, but it needs JavaScript to log in, and that doesn't work.
> I have to be logged in at "jira/secure/IssueNavigator", because from there I can browse through my projects, the projects under "jira/browse/projectname", and their issues under "jira/browse/projectname-issuenumber".
>
> At the moment the only way is to log in for "http://.../jira/IssueNavigator.jspa" and start there. Then I have to write another login sequence for each project, "http://.../jira/browse/projectname"; after that I am logged in for all issues of that project, but each login sequence covers only one project.
>
> Best regards
> Fred

Re: Problems while indexing Jira/ and an other problem

Posted by Fred Schmitt <fr...@web.de>.
Hi,
Sorry, it doesn't work, or I misunderstood you. Here is what I tried.

I started at "http://.../jira/". This site redirects me to "http://.../jira/secure/Dashboard.jspa". That doesn't work: I have to exclude many pages because I get content I don't want, and the login is not really working.
"http://.../jira/secure/Dashboard.jspa" is another login page, but it needs JavaScript to log in, and that doesn't work.
I have to be logged in at "jira/secure/IssueNavigator", because from there I can browse through my projects, the projects under "jira/browse/projectname", and their issues under "jira/browse/projectname-issuenumber".

At the moment the only way is to log in for "http://.../jira/IssueNavigator.jspa" and start there. Then I have to write another login sequence for each project, "http://.../jira/browse/projectname"; after that I am logged in for all issues of that project, but each login sequence covers only one project.

Best regards
Fred

-----Original Message-----
From: "Karl Wright" <da...@gmail.com>
Sent: 25.10.2010 11:29:16
To: connectors-user@incubator.apache.org
Subject: Re: Problems while indexing Jira/ and an other problem

>Fred, did this answer help you?  Are you all set now?
>Karl

Re: Problems while indexing Jira/ and an other problem

Posted by Karl Wright <da...@gmail.com>.
Fred, did this answer help you?  Are you all set now?
Karl

On Sat, Oct 23, 2010 at 3:20 AM, Karl Wright <da...@gmail.com> wrote:
> The web connector will not send a secured site's cookies to a page
> that does not match the regular expression that defines the overall
> secured area.  Your secured area URL seems to be
> "http://.../jira/secure/IssueNavigator.jspa", which is obviously not
> sufficient.
>
> Karl

Re: Problems while indexing Jira/ and an other problem

Posted by Karl Wright <da...@gmail.com>.
The web connector will not send a secured site's cookies to a page
that does not match the regular expression that defines the overall
secured area.  Your secured area URL seems to be
"http://.../jira/secure/IssueNavigator.jspa", which is obviously not
sufficient.
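
To illustrate, here is a minimal Python sketch of why an issue page falls
outside that secured area. The host jira.example.com and the helper
in_secured_area are hypothetical stand-ins (the real host is elided in this
thread), and the broader expression is only one possible choice. Python's
re module is used purely for illustration; MCF is a Java application, and
its exact matching behavior may differ.

```python
import re

# Hypothetical host; the thread elides the real one as "http://.../".
BASE = "http://jira.example.com/jira"

def in_secured_area(url_regex: str, url: str) -> bool:
    """Return True if the URL matches the secured-area expression."""
    return re.search(url_regex, url) is not None

# The expression from the thread covers only the issue navigator page.
narrow = re.escape(BASE + "/secure/IssueNavigator.jspa")
# A broader expression (an assumption) covering everything under /jira/.
broad = re.escape(BASE + "/")

navigator_url = BASE + "/secure/IssueNavigator.jspa"
issue_url = BASE + "/browse/project-5"

print(in_secured_area(narrow, navigator_url))  # True
print(in_secured_area(narrow, issue_url))      # False
print(in_secured_area(broad, issue_url))       # True
```

With the narrow expression the issue URL is not part of the secured area,
so under the rule above it would be fetched without the session cookies.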

Karl

On Fri, Oct 22, 2010 at 4:21 AM, Fred Schmitt <fr...@web.de> wrote:
> Hi,
>
> Thanks for the quick answer.
> I have tested your suggestion: I started at the page http://..../jira/secure/IssueNavigator.jspa and wrote a login sequence.
> Here is an extract of the Access Credentials of my Web Connector.
>
> URL regular expression:
> http://.../jira/secure/IssueNavigator.jspa
>
> Login Pages:
> Login URL regular expression   Page type   Form name/link target regular expression
> (blank)                        link        http://.../jira/login.jsp
> http://.../jira/login.jsp      form        (blank)
>
> Override form parameters (for the "form" page):
> Parameter regular expression   Value   Password
> username                       fred
> password                               ******
>
> I am logged in and can browse through all issues, but when I fetch and index an issue, for example "http://.../jira/browse/project-5", I get the message that I am no longer logged in. It works when I write a login sequence for this project, but only for that project and its issues. That means I have to write a login sequence for each project.
> How can I solve this problem and log in for all pages without writing many login sequences?
> I have already tried to write a login sequence that included "http://.../jira/browse/*" or "http://.../jira/browse/", but neither worked.
>
> Best regards,
> Fred

Re: Problems while indexing Jira/ and an other problem

Posted by Fred Schmitt <fr...@web.de>.
Hi,

Thanks for the quick answer.
I have tested your suggestion: I started at the page http://..../jira/secure/IssueNavigator.jspa and wrote a login sequence.
Here is an extract of the Access Credentials of my Web Connector.

URL regular expression:
http://.../jira/secure/IssueNavigator.jspa

Login Pages:
Login URL regular expression   Page type   Form name/link target regular expression
(blank)                        link        http://.../jira/login.jsp
http://.../jira/login.jsp      form        (blank)

Override form parameters (for the "form" page):
Parameter regular expression   Value   Password
username                       fred
password                               ******

I am logged in and can browse through all issues, but when I fetch and index an issue, for example "http://.../jira/browse/project-5", I get the message that I am no longer logged in. It works when I write a login sequence for this project, but only for that project and its issues. That means I have to write a login sequence for each project.
How can I solve this problem and log in for all pages without writing many login sequences?
I have already tried to write a login sequence that included "http://.../jira/browse/*" or "http://.../jira/browse/", but neither worked.

Best regards,
Fred

-----Original Message-----
From: "Karl Wright" <da...@gmail.com>
Sent: 20.10.2010 13:11:23
To: connectors-user@incubator.apache.org
Subject: Re: Problems while indexing Jira/ and an other problem

>Hi,
>I think you should open a JIRA ticket for the Windows Share connector.
>It sounds like the JavaScript for handling the insert link might be
>broken in the UI.
>
>As for the web session login, the MCF crawler of course handles
>cookies - that is a major piece of session authentication.  The
>question is whether it is recording the cookie set that happens as a
>result of the login sequence.  What you want to be sure of is that all
>the parts of the login, including the final redirection back to the
>content page, are considered part of the login sequence.  You also
>want to be sure that you don't use as your seed URL the login page
>itself, because then there is no place to resume when the login is
>done.  Instead you want a seed which is the root or home page.  If
>login is mandatory, then presumably there would be a redirection that
>takes you to the login page.  That redirection should *also* be part
>of the login sequence.
>In short, the login sequence needs to cover every fetch that isn't
>actual indexable content.  The cookies that are set at the end of that
>sequence are what will be retained for all subsequent fetches from the
>protected area of the site that you specify with your URL regular
>expression.
>
>Hope this helps.
>
>Karl

Re: Problems while indexing Jira/ and an other problem

Posted by Karl Wright <da...@gmail.com>.
Hi,
I think you should open a JIRA ticket for the Windows Share connector.
It sounds like the JavaScript for handling the insert link might be
broken in the UI.

As for the web session login, the MCF crawler of course handles
cookies - that is a major piece of session authentication.  The
question is whether it is recording the cookie set that happens as a
result of the login sequence.  What you want to be sure of is that all
the parts of the login, including the final redirection back to the
content page, are considered part of the login sequence.  You also
want to be sure that you don't use as your seed URL the login page
itself, because then there is no place to resume when the login is
done.  Instead you want a seed which is the root or home page.  If
login is mandatory, then presumably there would be a redirection that
takes you to the login page.  That redirection should *also* be part
of the login sequence.
In short, the login sequence needs to cover every fetch that isn't
actual indexable content.  The cookies that are set at the end of that
sequence are what will be retained for all subsequent fetches from the
protected area of the site that you specify with your URL regular
expression.
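
As a toy model of that last point (not MCF's actual code), the Python
sketch below keeps the cookies present when a login sequence ends and
attaches them only to URLs that match the protected area's regular
expression. The class SessionArea, the host jira.example.com, and the
cookie name and value are all hypothetical.

```python
import re
from typing import Dict

class SessionArea:
    """Toy model: a protected-area URL regex plus cookies kept after login."""

    def __init__(self, url_regex: str) -> None:
        self.url_regex = re.compile(url_regex)
        self.cookies: Dict[str, str] = {}

    def finish_login(self, cookies: Dict[str, str]) -> None:
        # Keep whatever cookies are set when the login sequence ends.
        self.cookies = dict(cookies)

    def headers_for(self, url: str) -> Dict[str, str]:
        # Cookies are attached only to URLs inside the protected area.
        if self.cookies and self.url_regex.search(url):
            cookie = "; ".join(f"{k}={v}" for k, v in self.cookies.items())
            return {"Cookie": cookie}
        return {}

area = SessionArea(r"^http://jira\.example\.com/jira/")
area.finish_login({"JSESSIONID": "abc123"})  # made-up cookie for the demo
print(area.headers_for("http://jira.example.com/jira/browse/project-5"))
print(area.headers_for("http://other.example.com/"))
```

The first call returns a Cookie header and the second returns an empty
dictionary, which mirrors how a fetch outside the expression would arrive
at the server without a session.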

Hope this helps.

Karl


On Wed, Oct 20, 2010 at 6:59 AM, Fred Schmitt <fr...@web.de> wrote:
> Hi all,
>
> I'm trying to index a Jira system and have run into some problems.
>
> I am using a Web Connector and my seed is the login page. The login process seems to work.
> But when I fetch and index the next page, to which the login page redirects, I am no longer logged in.
> Here is an extract of my job history.
>
> Start Time               Activity     Identifier                  Result Code   Bytes   Time
> 10-12-2010 14:52:56.735  end logon    http://.../jira/login.jsp   OK            0       1
> 10-12-2010 14:52:56.720  fetch        http://.../jira/login.jsp   302           0       16
> 10-12-2010 14:52:56.704  begin logon  http://.../jira/login.jsp   OK            0       1
> 10-12-2010 14:52:56.674  fetch        http://.../jira/login.jsp   200           5702    15
> 10-12-2010 14:52:54.423  job start    1285328088067(jira)                       0       1
>
> Jira login is based on cookies, but I haven't found a way to control cookies manually in MCF.
> I have found out that MCF supports cookies.
> So is it possible to control and set cookies, or how can I make the crawler stay logged in?
>
> Some thoughts:
> I have found out that Jira is also based on the Lucene core, like Solr, which my Output Connection points to.
> Jira has its own index for its built-in search.
> Do you know if it is possible to merge the indexes from Solr and Jira?
>
>
> There is another problem I have while creating a job with the Windows Share Connector.
> I selected the "Paths" window and created a new path.
> After I add the path and click the "insert" button, I get this exception in Firebug:
> "missing ; before statement http://localhost:8080/mcf-crawler-ui/execute.jsp Line 1"
>
> Best regards,
> Fred