Posted to user@manifoldcf.apache.org by Fred Schmitt <fr...@web.de> on 2010/10/20 12:59:04 UTC
Problems while indexing Jira/ and an other problem
Hi all,
I'm trying to index a Jira system and have run into some problems.
I am using a Web Connector and my seed is the login page. It seems that the login process works.
But when I fetch and index the next page, to which the login page redirects, I'm no longer logged in.
Here is an extract of my job history.
Start Time               Activity     Identifier                 Result Code  Bytes  Time
10-12-2010 14:52:56.735  end logon    http://.../jira/login.jsp  OK           0      1
10-12-2010 14:52:56.720  fetch        http://.../jira/login.jsp  302          0      16
10-12-2010 14:52:56.704  begin logon  http://.../jira/login.jsp  OK           0      1
10-12-2010 14:52:56.674  fetch        http://.../jira/login.jsp  200          5702   15
10-12-2010 14:52:54.423  job start    1285328088067(jira)                     0      1
Jira login is based on cookies, but I haven't found a way to control cookies manually in MCF.
I have found out that MCF supports cookies.
So is it possible to control and set cookies, or how can I make it stay logged in?
Thoughts:
I have found out that Jira is also based on the Lucene core, like Solr, which my Output Connection points to.
Jira has its own index for its built-in search.
Do you know if it is possible to merge the indexes from Solr and Jira?
There is another problem I have while creating a job with the Windows Share Connector.
I selected the "Paths" window and created a new path.
After I added the path, when I click the "insert" button I get this exception in Firebug:
"missing ; before statement http://localhost:8080/mcf-crawler-ui/execute.jsp Line 1"
Best regards,
Fred
Re: Problems while indexing Jira/ and an other problem
Posted by Karl Wright <da...@gmail.com>.
A couple of comments.
(1) The place where you start is called the "seed". Let's use that
terminology so I don't get confused.
(2) Every session-protected area has a regular expression that is
supposed to match all protected pages in the area. In the connection
edit page Access Credentials tab, for Session authentication, this
appears as a column called "URL regular expression". It is this
expression I was referring to when I said that it seemed incorrect for
your task. This has nothing to do with the seeds for the crawl.
(3) Even with a JavaScript login, it is usually possible to fill in the
form with information that performs the login. The only exception is
when some complex logic, such as MD5 or string manipulation, is used
to fill in the form fields. A good way to do it is to read the
JavaScript and figure out what it is trying to do, and then just fill
in the form fields with what the JavaScript would have done.
Hope this helps. I'd actually try crawling a Jira instance myself to
be of further assistance, but I'm really pressed for time right now.
Karl
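Point (3) can be illustrated with a minimal sketch. It assumes a purely hypothetical login form whose script copies the MD5 hex digest of the password into a hidden field named "hash" before submitting; neither the field name nor the scheme comes from Jira or from this thread. When a script derives a field from fixed inputs like this, the value can be computed once by hand and entered as an override form parameter. That would not work if the script mixed in a per-request value sent by the server.

```python
import hashlib

def md5_hex(text):
    """Hex MD5 digest of a string, as a client-side script might compute it."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def override_parameters(username, password):
    """Form values a crawler would submit in place of running the script.

    The "hash" field is a made-up example, not a real Jira form field.
    """
    return {"username": username, "hash": md5_hex(password)}

# Placeholder credentials for illustration only.
params = override_parameters("fred", "secret")
print(params)
```

The printed "hash" value is what would be typed into the override parameter table for that field.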
On Mon, Oct 25, 2010 at 3:14 PM, Fred Schmitt <fr...@web.de> wrote:
> Hi,
> sorry, it doesn't work, or I misunderstood you. Here is what I tried.
>
> I started at "http://.../jira/". This site redirects me to "http://.../jira/secure/Dashboard.jspa". That doesn't work, and I have to exclude many pages because I get content I don't want and the login is not really working.
> "http://.../jira/secure/Dashboard.jspa" is another login page, but it needs JavaScript to log in, and that doesn't work.
> I have to be logged in at "jira/secure/IssueNavigator", because from there I can browse through my projects, the projects under "jira/browse/projectname", and their issues under "jira/browse/projectname-issuenumber".
>
> At the moment the only way is to log in for "http://.../jira/IssueNavigator.jspa" and start there. Then I have to write another login sequence for each project, "http://.../jira/browse/projectname", after which I am logged in for all issues of that project, but only one project per login sequence.
>
> Best regards,
> Fred
Re: Problems while indexing Jira/ and an other problem
Posted by Fred Schmitt <fr...@web.de>.
Hi,
sorry, it doesn't work, or I misunderstood you. Here is what I tried.
I started at "http://.../jira/". This site redirects me to "http://.../jira/secure/Dashboard.jspa". That doesn't work, and I have to exclude many pages because I get content I don't want and the login is not really working.
"http://.../jira/secure/Dashboard.jspa" is another login page, but it needs JavaScript to log in, and that doesn't work.
I have to be logged in at "jira/secure/IssueNavigator", because from there I can browse through my projects, the projects under "jira/browse/projectname", and their issues under "jira/browse/projectname-issuenumber".
At the moment the only way is to log in for "http://.../jira/IssueNavigator.jspa" and start there. Then I have to write another login sequence for each project, "http://.../jira/browse/projectname", after which I am logged in for all issues of that project, but only one project per login sequence.
Best regards,
Fred
-----Original Message-----
From: "Karl Wright" <da...@gmail.com>
Sent: 25.10.2010 11:29:16
To: connectors-user@incubator.apache.org
Subject: Re: Problems while indexing Jira/ and an other problem
>Fred, did this answer help you? Are you all set now?
>Karl
Re: Problems while indexing Jira/ and an other problem
Posted by Karl Wright <da...@gmail.com>.
Fred, did this answer help you? Are you all set now?
Karl
On Sat, Oct 23, 2010 at 3:20 AM, Karl Wright <da...@gmail.com> wrote:
> The web connector will not send a secured site's cookies to a page
> that does not match the regular expression that defines the overall
> secured area. Your secured area url seems to be
> "http://.../jira/secure/IssueNavigator.jspa", which is not sufficient
> obviously.
>
> Karl
Re: Problems while indexing Jira/ and an other problem
Posted by Karl Wright <da...@gmail.com>.
The web connector will not send a secured site's cookies to a page
that does not match the regular expression that defines the overall
secured area. Your secured area url seems to be
"http://.../jira/secure/IssueNavigator.jspa", which is not sufficient
obviously.
Karl
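The effect of the "URL regular expression" can be sketched as follows. This is an illustration only: the host jira.example.com is a placeholder (the thread elides the real one), Python's re module stands in for the connector's own regular expression engine, and treating a URL as covered when the pattern is found anywhere in it is an assumption.

```python
import re

# Hypothetical host; the thread elides the real one as "http://.../jira".
BASE = "http://jira.example.com/jira"

# The expression from the thread covers only the issue navigator page.
narrow = re.compile(re.escape(BASE) + r"/secure/IssueNavigator\.jspa")
# A broader expression that covers everything under /jira/.
broad = re.compile(re.escape(BASE) + r"/")

urls = [
    BASE + "/secure/IssueNavigator.jspa",
    BASE + "/browse/project-5",
    BASE + "/secure/Dashboard.jspa",
]

def covered(pattern, url):
    """Return True if the pattern is found in the URL."""
    return pattern.search(url) is not None

for url in urls:
    print(url, "narrow:", covered(narrow, url), "broad:", covered(broad, url))
```

Under these assumptions the issue page under /browse/ falls outside the narrow expression, so it would be fetched without the session cookies, while the broader expression covers all three URLs.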
On Fri, Oct 22, 2010 at 4:21 AM, Fred Schmitt <fr...@web.de> wrote:
> Hi,
>
> thanks for the quick answer.
> I have tested your suggestion and started at the page http://..../jira/secure/IssueNavigator.jspa and wrote a login sequence.
> Here is an extract of the Access Credentials of my Web Connector.
>
> URL regular expression
> http://.../jira/secure/IssueNavigator.jspa
>
> Login Pages
> Login URL regular expression Page type Form name/link target regular expression Override form parameters
> link http://.../jira/login.jsp
> http://.../jira/login.jsp form Parameter regular expression Value Password
> username fred
> password ******
>
> I am logged in and can browse through all issues, but when I fetch and index an issue, for example "http://.../jira/browse/project-5", I get the message that I am no longer logged in. It works when I write a login sequence for this project, but only for that project and its issues. That means that I have to write a login sequence for each project.
> How can I solve this problem and log in for all pages without writing many login sequences?
> I have already tried to write a login sequence that included "http://.../jira/browse/*" or "http://.../jira/browse/", but neither worked.
>
> Best regards,
> Fred
Re: Problems while indexing Jira/ and an other problem
Posted by Fred Schmitt <fr...@web.de>.
Hi,
thanks for the quick answer.
I have tested your suggestion and started at the page http://..../jira/secure/IssueNavigator.jspa and wrote a login sequence.
Here is an extract of the Access Credentials of my Web Connector.
URL regular expression
http://.../jira/secure/IssueNavigator.jspa
Login Pages
Login URL regular expression Page type Form name/link target regular expression Override form parameters
link http://.../jira/login.jsp
http://.../jira/login.jsp form Parameter regular expression Value Password
username fred
password ******
I am logged in and can browse through all issues, but when I fetch and index an issue, for example "http://.../jira/browse/project-5", I get the message that I am no longer logged in. It works when I write a login sequence for this project, but only for that project and its issues. That means that I have to write a login sequence for each project.
How can I solve this problem and log in for all pages without writing many login sequences?
I have already tried to write a login sequence that included "http://.../jira/browse/*" or "http://.../jira/browse/", but neither worked.
Best regards,
Fred
-----Original Message-----
From: "Karl Wright" <da...@gmail.com>
Sent: 20.10.2010 13:11:23
To: connectors-user@incubator.apache.org
Subject: Re: Problems while indexing Jira/ and an other problem
>Hi,
>I think you should open a JIRA ticket for the Windows Share connector.
> It sounds like the javascript for handling the insert link might be
>broken in the UI.
>
>As for the web session login, the MCF crawler of course handles
>cookies - that is a major piece of session authentication. The
>question is whether it is recording the cookie set that happens as a
>result of the login sequence. What you want to be sure of is that all
>the parts of the login, including the final redirection back to the
>content page, are considered part of the login sequence. You also
>want to be sure that you don't use as your seed URL the login page
>itself, because then there is no place to resume when the login is
>done. Instead you want a seed which is the root or home page. If
>login is mandatory, then presumably there would be a redirection that
>takes you to the login page. That redirection should *also* be part
>of the login sequence.
>In short, the login sequence needs to cover every fetch that isn't
>actual indexable content. The cookies that are set at the end of that
>sequence are what will be retained for all subsequent fetches from the
>protected area of the site that you specify with your url regular
>expression.
>
>Hope this helps.
>
>Karl
Re: Problems while indexing Jira/ and an other problem
Posted by Karl Wright <da...@gmail.com>.
Hi,
I think you should open a JIRA ticket for the Windows Share connector.
It sounds like the javascript for handling the insert link might be
broken in the UI.
As for the web session login, the MCF crawler of course handles
cookies - that is a major piece of session authentication. The
question is whether it is recording the cookie set that happens as a
result of the login sequence. What you want to be sure of is that all
the parts of the login, including the final redirection back to the
content page, are considered part of the login sequence. You also
want to be sure that you don't use as your seed URL the login page
itself, because then there is no place to resume when the login is
done. Instead you want a seed which is the root or home page. If
login is mandatory, then presumably there would be a redirection that
takes you to the login page. That redirection should *also* be part
of the login sequence.
In short, the login sequence needs to cover every fetch that isn't
actual indexable content. The cookies that are set at the end of that
sequence are what will be retained for all subsequent fetches from the
protected area of the site that you specify with your url regular
expression.
Hope this helps.
Karl
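The idea that the login sequence has to cover every fetch that is not indexable content can be pictured with a toy model. Everything in it is an assumption made for illustration: the host, the three login-sequence entries, and the "kind" labels attached to each fetch. It is not how the crawler is implemented.

```python
import re

# Hypothetical host; the thread elides the real one as "http://.../jira".
BASE = "http://jira.example.com/jira"

# Assumed login-sequence entries: (URL pattern, kind of response expected there).
LOGIN_PAGES = [
    (re.compile(r"/jira/$"), "redirection"),            # home page bounces to the login page
    (re.compile(r"/jira/login\.jsp$"), "form"),         # the login form itself
    (re.compile(r"/jira/login\.jsp$"), "redirection"),  # redirect back after the form post
]

def in_login_sequence(url, kind):
    """True if this fetch matches a login-sequence entry by URL and response kind."""
    return any(p.search(url) and k == kind for p, k in LOGIN_PAGES)

# Seeded at the home page rather than at login.jsp, so there is a content
# page to resume at once the session cookies have been set.
fetches = [
    (BASE + "/", "redirection"),
    (BASE + "/login.jsp", "form"),
    (BASE + "/login.jsp", "redirection"),
    (BASE + "/", "content"),
]

labels = ["login" if in_login_sequence(u, k) else "content" for u, k in fetches]
for (url, kind), label in zip(fetches, labels):
    print(label, kind, url)
```

In this model only the last fetch counts as content. Had the seed been login.jsp itself, there would be no such content page left to return to after the login.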
On Wed, Oct 20, 2010 at 6:59 AM, Fred Schmitt <fr...@web.de> wrote:
> Hi all,
>
> I'm trying to index a Jira system and have run into some problems.
>
> I am using a Web Connector and my seed is the login page. It seems that the login process works.
> But when I fetch and index the next page, to which the login page redirects, I'm no longer logged in.
> Here is an extract of my job history.
>
> Start Time               Activity     Identifier                 Result Code  Bytes  Time
> 10-12-2010 14:52:56.735  end logon    http://.../jira/login.jsp  OK           0      1
> 10-12-2010 14:52:56.720  fetch        http://.../jira/login.jsp  302          0      16
> 10-12-2010 14:52:56.704  begin logon  http://.../jira/login.jsp  OK           0      1
> 10-12-2010 14:52:56.674  fetch        http://.../jira/login.jsp  200          5702   15
> 10-12-2010 14:52:54.423  job start    1285328088067(jira)                     0      1
>
> Jira login is based on cookies, but I haven't found a way to control cookies manually in MCF.
> I have found out that MCF supports cookies.
> So is it possible to control and set cookies, or how can I make it stay logged in?
>
> Thoughts:
> I have found out that Jira is also based on the Lucene core, like Solr, which my Output Connection points to.
> Jira has its own index for its built-in search.
> Do you know if it is possible to merge the indexes from Solr and Jira?
>
>
> There is another problem I have while creating a job with the Windows Share Connector.
> I selected the "Paths" window and created a new path.
> After I added the path, when I click the "insert" button I get this exception in Firebug:
> "missing ; before statement http://localhost:8080/mcf-crawler-ui/execute.jsp Line 1"
>
> Best regards,
> Fred