Posted to user@manifoldcf.apache.org by ritika jain <ri...@gmail.com> on 2020/05/05 10:34:27 UTC

Illegal Seed URL

Hi All,

I am using ManifoldCF 2.14 with the Web crawler as the repository connector
and Elasticsearch as the output. I have specified a seed URL that is valid,
since it opens successfully in a browser.
Say the URL is https://www.abc.com/societybusiness/entrepreneurship/?lang=en
<https://www.rug.nl/society-business/centre-for-entrepreneurship/?lang=en>.

The URL contains a "?" query string, and I get the "Illegal seed URL" error.
Am I doing anything wrong here?

Thanks
Ritika

Re: Illegal Seed URL

Posted by Karl Wright <da...@gmail.com>.
The "?" in your URL is probably being interpreted as a regular expression
"?" in your include list.  You need to escape it properly there.

Karl
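
To illustrate the point above, here is a minimal sketch using plain
java.util.regex. It assumes, hypothetically, that the include list contains
the seed URL itself as a pattern; the thread never shows the actual include
list. The class name SeedUrlPatternDemo and the helper included() are made up
for this example and are not part of ManifoldCF.

```java
import java.util.regex.Pattern;

class SeedUrlPatternDemo {
    // Returns true if the include pattern matches anywhere in the URL,
    // the same kind of m.find() check as in the code quoted in this thread.
    static boolean included(String includeRegex, String url) {
        return Pattern.compile(includeRegex).matcher(url).find();
    }

    public static void main(String[] args) {
        String url = "https://www.abc.com/societybusiness/entrepreneurship/?lang=en";

        // Assumed include pattern: the URL pasted in unescaped.
        // As a regex, "/?" means "an optional slash", not a literal '?'.
        String unescaped = "https://www.abc.com/societybusiness/entrepreneurship/?lang=en";

        // Escaped by hand: "\\?" in Java source is the regex \? (a literal '?').
        String escaped = "https://www\\.abc\\.com/societybusiness/entrepreneurship/\\?lang=en";

        // Pattern.quote makes every character of the string literal.
        String quoted = Pattern.quote(url);

        System.out.println(included(unescaped, url)); // false
        System.out.println(included(escaped, url));   // true
        System.out.println(included(quoted, url));    // true
    }
}
```

Under that assumption, the unescaped pattern can never match a URL that
contains a literal "?", which would be consistent with the behaviour
described in the quoted message. In the include list itself the pattern is
entered as a regular expression, so the escape would be written as \? there.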


On Wed, May 6, 2020 at 2:54 AM ritika jain <ri...@gmail.com> wrote:

> Hi Michael,
>
> Yes, I tested this in debug mode and also tried one more scenario.
> Whenever the seed URL is something like
> https://www.abc.com/societybusiness/entrepreneurship/?lang=en
> <https://www.rug.nl/society-business/centre-for-entrepreneurship/?lang=en>,
> our Web connector Java code returns null at this function, when m.find()
> is executed, which gives a null document identifier and therefore the
> "Illegal seed URL" error:
>
>     /** Check if the document identifier is legal.
>     */
>     public boolean isDocumentLegal(String url)
>     {
>       // First, verify that the url matches one of the patterns in the
>       // include list.
>       int i = 0;
>       while (i < includePatterns.size())
>       {
>         Pattern p = includePatterns.get(i);
>         Matcher m = p.matcher(url);
>         if (m.find())
>           break;
>         i++;
>       }
>       // (the rest of the method was not included in the message)
>
> Whereas when the seed URL is something like
> https://www.abc.com/societybusiness/entrepreneurship/, this code
> passes without failing.
> Can anybody help me understand why the same code behaves differently?
>
> Thanks
> Ritika

Re: Illegal Seed URL

Posted by ritika jain <ri...@gmail.com>.
Hi Michael,

Yes, I tested this in debug mode and also tried one more scenario.
Whenever the seed URL is something like
https://www.abc.com/societybusiness/entrepreneurship/?lang=en
<https://www.rug.nl/society-business/centre-for-entrepreneurship/?lang=en>,
our Web connector Java code returns null at this function, when m.find()
is executed, which gives a null document identifier and therefore the
"Illegal seed URL" error:

    /** Check if the document identifier is legal.
    */
    public boolean isDocumentLegal(String url)
    {
      // First, verify that the url matches one of the patterns in the
      // include list.
      int i = 0;
      while (i < includePatterns.size())
      {
        Pattern p = includePatterns.get(i);
        Matcher m = p.matcher(url);
        if (m.find())
          break;
        i++;
      }
      // (the rest of the method was not included in the message)

Whereas when the seed URL is something like
https://www.abc.com/societybusiness/entrepreneurship/, this code
passes without failing.
Can anybody help me understand why the same code behaves differently?

Thanks
Ritika

On Tue, May 5, 2020 at 6:09 PM Michael Cizmar <mi...@mcplusa.com>
wrote:

> Hi Ritika,
>
>
>
> There are several reasons you could get that error.  Have you started
> ManifoldCF in debug mode?  If so, what’s the output just before that
> statement in the logs?
>
>
>
> --
>
> Michael Cizmar

Re: Illegal Seed URL

Posted by Michael Cizmar <mi...@mcplusa.com>.
Hi Ritika,

There are several reasons you could get that error.  Have you started ManifoldCF in debug mode?  If so, what’s the output just before that statement in the logs?

--
Michael Cizmar

