Posted to user@manifoldcf.apache.org by Michael Kelleher <mj...@gmail.com> on 2011/12/06 19:52:03 UTC

WEB: Illegal seed URL

Here is my seed URL (minus the hostname):  
https://hostname.com/vwebv/search?searchArg=dvd&searchCode=SALL&searchType=1&recCount=100

I am using a Web Crawler connection that has been tested with the
NullOutputConnector, so I don't think the issue is there.


I am also using the Solr Output Connector. It had been throwing an
exception until I fixed the core name. This is the first time I have
used it, so maybe I don't have things configured correctly there.
However, there are no exceptions in the log. Also, I am not using
authentication at all on Solr.


I looked at the class:
connectors\webcrawler\connector\src\main\java\org\apache\manifoldcf\crawler\connectors\webcrawler\WebcrawlerConnector.java
and it was not obvious what the issue is.

Also, in logging.ini, I changed the logging level to DEBUG and
restarted before I tested the crawl, which only further obscured the
logic in WebcrawlerConnector.java for me.

Is there somewhere else I can set logging levels? I am not sure my
change to logging.ini is having any effect. Also, is there some other
test you might suggest?

thanks.

--mike

Re: WEB: Illegal seed URL

Posted by Michael Kelleher <mj...@gmail.com>.
The issue was my use of regexes in the inclusions list.  Oddly enough,
some regexes that I had verified via
http://myregexp.com/signedJar.html, and that should have worked, did not.

However, my crawl is now functioning properly and is visiting only the
appropriate documents.
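
For readers who hit the same symptom, here is a minimal sketch of how
candidate inclusion regexes can be tried against the seed URL outside
the crawler. The thread does not say which expressions failed, so this
only illustrates one common cause: an unescaped '?' in a URL is a regex
quantifier, not a literal character. The class name SeedRegexCheck and
the helper matchesSomewhere are hypothetical, and the use of find()
(match anywhere in the URL) is an assumption of this sketch rather than
a statement about how the Web connector evaluates its lists.

```java
import java.util.regex.Pattern;

public class SeedRegexCheck {
    // Hypothetical helper: true if the regex matches anywhere in the URL.
    public static boolean matchesSomewhere(String regex, String url) {
        return Pattern.compile(regex).matcher(url).find();
    }

    public static void main(String[] args) {
        String seed = "https://hostname.com/vwebv/search"
            + "?searchArg=dvd&searchCode=SALL&searchType=1&recCount=100";

        // Start wide, then narrow one step at a time.
        String[] candidates = {
            ".*",                                      // matches everything
            "^https://hostname\\.com/vwebv/",          // literal dot in the host
            "/vwebv/search?searchArg=",                // '?' makes the 'h' optional
            "/vwebv/search\\?searchArg=",              // escaped, literal '?'
            Pattern.quote("/vwebv/search?searchArg=")  // whole fragment taken literally
        };

        for (String regex : candidates) {
            System.out.println(matchesSomewhere(regex, seed) + "  " + regex);
        }
    }
}
```

Of the five candidates, only the third fails to match the seed, because
the regex engine reads "search?" as "searc" followed by an optional "h".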

--mike

On 12/06/2011 02:34 PM, Karl Wright wrote:
> On second thought, "illegal seed" can also mean that the seed is
> excluded from the crawl due to your inclusion/exclusion regexp lists.
> Might want to check that out too.
>
> Karl


Re: WEB: Illegal seed URL

Posted by Michael Kelleher <mj...@gmail.com>.
Yes, you are right.

I am making small incremental modifications, starting from an inclusion
of .* and working toward the expression I want to use to limit the crawl.

The issue is the regex I am using.

I will update the mailing list as soon as I have it 100% fixed.

thanks!

--mike

On 12/06/2011 02:34 PM, Karl Wright wrote:
> On second thought, "illegal seed" can also mean that the seed is
> excluded from the crawl due to your inclusion/exclusion regexp lists.
> Might want to check that out too.
>
> Karl


Re: WEB: Illegal seed URL

Posted by Karl Wright <da...@gmail.com>.
On second thought, "illegal seed" can also mean that the seed is
excluded from the crawl due to your inclusion/exclusion regexp lists.
Might want to check that out too.
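
The idea can be sketched as a small filter: a seed is usable only if it
matches at least one inclusion pattern and no exclusion pattern. This is
a simplified illustration, not the actual code in
WebcrawlerConnector.java; the class name SeedFilterSketch, the method
isAllowed, and the choice of find() over a full match are all
assumptions made for the example.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Pattern;

public class SeedFilterSketch {
    // Hypothetical check: the URL must match at least one inclusion
    // pattern and must not match any exclusion pattern.
    public static boolean isAllowed(String url,
                                    List<Pattern> includes,
                                    List<Pattern> excludes) {
        boolean included = false;
        for (Pattern p : includes) {
            if (p.matcher(url).find()) {
                included = true;
                break;
            }
        }
        if (!included) {
            return false; // nothing in the inclusion list matched
        }
        for (Pattern p : excludes) {
            if (p.matcher(url).find()) {
                return false; // explicitly excluded
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String seed = "https://hostname.com/vwebv/search"
            + "?searchArg=dvd&searchCode=SALL&searchType=1&recCount=100";

        List<Pattern> includeAll = Arrays.asList(Pattern.compile(".*"));
        List<Pattern> includeHtmlOnly = Arrays.asList(Pattern.compile("\\.html$"));
        List<Pattern> noExcludes = Collections.emptyList();

        // The wide inclusion accepts the seed; the narrow one rejects it,
        // which in this sketch is what would make the seed "illegal".
        System.out.println(isAllowed(seed, includeAll, noExcludes));
        System.out.println(isAllowed(seed, includeHtmlOnly, noExcludes));
    }
}
```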

Karl

On Tue, Dec 6, 2011 at 2:23 PM, Karl Wright <da...@gmail.com> wrote:
> The URL as stated is fine and is pretty standard.  I don't think
> there's a problem there, unless you inadvertently fixed something when
> you changed the hostname.
>
> Can you look at the log - there may well be a stack trace, especially
> if you have <property name="org.apache.manifoldcf.connectors"
> value="DEBUG"/> set.  I'd love to see what the trace is.
>
> Karl

Re: WEB: Illegal seed URL

Posted by Karl Wright <da...@gmail.com>.
The URL as stated is fine and is pretty standard.  I don't think
there's a problem there, unless you inadvertently fixed something when
you changed the hostname.

Can you look at the log - there may well be a stack trace, especially
if you have <property name="org.apache.manifoldcf.connectors"
value="DEBUG"/> set.  I'd love to see what the trace is.

Karl
