Posted to user@manifoldcf.apache.org by ritika jain <ri...@gmail.com> on 2020/05/05 10:34:27 UTC
Illegal Seed URL
Hi All,
I am using ManifoldCF 2.14 with the Web crawler as the repository connector and
Elasticsearch as the output. I have specified a seed URL that is valid, since it
opens successfully in a browser, but it is reported as an illegal seed URL.
Say the URL is https://www.abc.com/societybusiness/entrepreneurship/?lang=en
<https://www.rug.nl/society-business/centre-for-entrepreneurship/?lang=en>,
which has a "?" query string in it.
Am I doing anything wrong here?
Thanks
Ritika
Re: Illegal Seed URL
Posted by Karl Wright <da...@gmail.com>.
The "?" in your url probably is being interpreted as a regular expression
"?" in your include list. You need to escape it properly there.
Karl
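To illustrate the point about escaping: in java.util.regex, an unescaped "?" makes the preceding character optional instead of matching a literal question mark. The following is a minimal sketch in plain Java, not ManifoldCF code. The class name EscapeSeedUrl, the helper matches, and the assumption that the include list entry was the seed URL pasted in verbatim are all hypothetical.

```java
import java.util.regex.Pattern;

public class EscapeSeedUrl {
    // Seed URL from the thread.
    static final String SEED =
        "https://www.abc.com/societybusiness/entrepreneurship/?lang=en";

    // ASSUMPTION: the include list entry is the seed URL pasted in unescaped.
    // Here "/?" means "an optional slash", so the literal "?" is never matched.
    static final String RAW = SEED;

    // The same entry with "." and "?" escaped by hand for the regex engine.
    static final String ESCAPED =
        "https://www\\.abc\\.com/societybusiness/entrepreneurship/\\?lang=en";

    // Mirrors the find()-based check quoted in the thread for a single pattern.
    static boolean matches(String regex, String url) {
        return Pattern.compile(regex).matcher(url).find();
    }

    public static void main(String[] args) {
        System.out.println(matches(RAW, SEED));                 // false
        System.out.println(matches(ESCAPED, SEED));             // true
        // Pattern.quote wraps the whole string as a literal.
        System.out.println(matches(Pattern.quote(SEED), SEED)); // true
    }
}
```

The doubled backslashes are only Java string-literal syntax; typed directly into an include list, the escaped form would use single backslashes (\? and \.).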
On Wed, May 6, 2020 at 2:54 AM ritika jain <ri...@gmail.com> wrote:
> Can anybody help me understand why the same code is behaving differently?
Re: Illegal Seed URL
Posted by ritika jain <ri...@gmail.com>.
Hi Michael,
Yes, I am testing this in debug mode, and I tested one more scenario.
Whenever the seed URL is something like
https://www.abc.com/societybusiness/entrepreneurship/?lang=en
<https://www.rug.nl/society-business/centre-for-entrepreneurship/?lang=en>,
our web connector Java code fails in the function below when m.find() is
executed, so the document identifier ends up null and we get the
"Illegal seed URL" error.

/** Check if the document identifier is legal.
*/
public boolean isDocumentLegal(String url)
{
  // First, verify that the url matches one of the patterns in the include list.
  int i = 0;
  while (i < includePatterns.size())
  {
    Pattern p = includePatterns.get(i);
    Matcher m = p.matcher(url);
    if (m.find())
      break;
    i++;
  }

Whereas when the seed URL is something like
https://www.abc.com/societybusiness/entrepreneurship/ , the same code passes
without failure.
Can anybody help me understand why the same code is behaving differently?
Thanks
Ritika
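For anyone wanting to reproduce this outside the connector, the include-list loop above can be sketched as a small standalone program. This is not the ManifoldCF implementation: the class IncludeListCheck and the method matchesIncludeList are hypothetical names, only the include-pattern loop from the excerpt is reproduced, and the include list contents are assumed (each seed URL pasted in as its own pattern), since the actual job configuration is not given in the thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IncludeListCheck {
    // Hypothetical stand-in for the connector's include list.
    private final List<Pattern> includePatterns = new ArrayList<>();

    public IncludeListCheck(String... regexes) {
        for (String regex : regexes) {
            includePatterns.add(Pattern.compile(regex));
        }
    }

    /** Returns true if any include pattern is found somewhere in the url. */
    public boolean matchesIncludeList(String url) {
        for (Pattern p : includePatterns) {
            Matcher m = p.matcher(url);
            // find() succeeds on a match anywhere in the url, not only on a full match.
            if (m.find()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String withQuery = "https://www.abc.com/societybusiness/entrepreneurship/?lang=en";
        String withoutQuery = "https://www.abc.com/societybusiness/entrepreneurship/";

        // ASSUMPTION: include list holds the query-string seed, unescaped.
        IncludeListCheck unescaped = new IncludeListCheck(withQuery);
        System.out.println(unescaped.matchesIncludeList(withQuery));  // false

        // ASSUMPTION: include list holds the seed without a query string.
        IncludeListCheck plain = new IncludeListCheck(withoutQuery);
        System.out.println(plain.matchesIncludeList(withoutQuery));   // true
        System.out.println(plain.matchesIncludeList(withQuery));      // true
    }
}
```

Under these assumptions the loop itself behaves the same for both seeds; the difference comes from how the pattern text is interpreted by the regex engine.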
On Tue, May 5, 2020 at 6:09 PM Michael Cizmar <mi...@mcplusa.com>
wrote:
> There are several reasons that you could get that. Have you started
> manifoldcf in debug mode? If so, what’s the output just before that
> statement in the logs?
Re: Illegal Seed URL
Posted by Michael Cizmar <mi...@mcplusa.com>.
Hi Ritika,
There are several reasons that you could get that. Have you started manifoldcf in debug mode? If so, what’s the output just before that statement in the logs?
--
Michael Cizmar