Posted to user@manifoldcf.apache.org by Michael Kelleher <mj...@gmail.com> on 2011/12/06 19:52:03 UTC
WEB: Illegal seed URL
Here is my seed URL (minus the hostname):
https://hostname.com/vwebv/search?searchArg=dvd&searchCode=SALL&searchType=1&recCount=100
I am using a Web Crawler connection that has been tested with the
NullOutputConnector - so I don't think the issue can be here.
I am also using the Solr Output Connector - this had been throwing an
Exception till I fixed the core name - this is the first time I have
used this. So, maybe I don't have things configured correctly here.
However, there are no exceptions in the log. Also, I am not using
authentication at all on Solr.
I looked at the class:
connectors\webcrawler\connector\src\main\java\org\apache\manifoldcf\crawler\connectors\webcrawler\WebcrawlerConnector.java
and it was not obvious what the issue is.
Also, in logging.ini, I changed the logging level to DEBUG and
restarted before I tested the crawl, but that did not make the logic in
WebcrawlerConnector.java any clearer to me.
Is there somewhere else I can set logging levels? I am not sure my
change to logging.ini is having any effect. Also, is there some other
test you might suggest?
thanks.
--mike
Re: WEB: Illegal seed URL
Posted by Michael Kelleher <mj...@gmail.com>.
The issue was my use of regexes in the inclusions list. Oddly enough,
some regexes that I used (and verified via
http://myregexp.com/signedJar.html) should have worked but did not.
However, my crawl is now functioning properly and is only visiting the
appropriate documents.
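
For anyone hitting the same symptom, here is a minimal Java sketch of how a seed URL can be checked against a candidate inclusion regex outside of ManifoldCF. This is not the connector's code. The class name, the method names and the sample patterns are hypothetical, and the sketch assumes an inclusion regex counts when it is found anywhere in the URL, which should be confirmed against the Web connector documentation. It also shows two ways a regex that looks right in an online tester can behave differently: whole-string versus find-anywhere matching, and an unescaped "?" in the query part of the URL.

```java
import java.util.regex.Pattern;

public class SeedInclusionCheck {

    // Assumed semantics: the regex counts if it is found anywhere in the URL.
    public static boolean foundAnywhere(String url, String regex) {
        return Pattern.compile(regex).matcher(url).find();
    }

    // Stricter semantics used by some testers: the regex must match the whole URL.
    public static boolean matchesWhole(String url, String regex) {
        return Pattern.compile(regex).matcher(url).matches();
    }

    public static void main(String[] args) {
        // Seed URL from the original post (hostname is a placeholder).
        String seed = "https://hostname.com/vwebv/search"
            + "?searchArg=dvd&searchCode=SALL&searchType=1&recCount=100";

        // Hypothetical inclusion patterns, for illustration only.
        String[] candidates = {
            ".*",                              // matches everything
            "/vwebv/search\\?searchArg=dvd",   // literal '?', escaped
            "/vwebv/search?searchArg=dvd"      // unescaped '?': makes the 'h' optional
        };

        for (String regex : candidates) {
            System.out.println(regex
                + "  find=" + foundAnywhere(seed, regex)
                + "  matches=" + matchesWhole(seed, regex));
        }
    }
}
```

Starting from .* and tightening one pattern at a time, as described elsewhere in this thread, makes it easy to see which change causes the seed to drop out.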
--mike
On 12/06/2011 02:34 PM, Karl Wright wrote:
> On second thought, "illegal seed" can also mean that the seed is
> excluded from the crawl due to your inclusion/exclusion regexp lists.
> Might want to check that out too.
>
> Karl
Re: WEB: Illegal seed URL
Posted by Michael Kelleher <mj...@gmail.com>.
Yes, you are right.
I am making slight incremental modifications, starting from including .*
and working toward what I want to use to limit the crawl.
The issue is the regex I am using.
I will update the mailing list as soon as I have it 100% fixed.
thanks!
--mike
On 12/06/2011 02:34 PM, Karl Wright wrote:
> On second thought, "illegal seed" can also mean that the seed is
> excluded from the crawl due to your inclusion/exclusion regexp lists.
> Might want to check that out too.
>
> Karl
Re: WEB: Illegal seed URL
Posted by Karl Wright <da...@gmail.com>.
On second thought, "illegal seed" can also mean that the seed is
excluded from the crawl due to your inclusion/exclusion regexp lists.
Might want to check that out too.
Karl
On Tue, Dec 6, 2011 at 2:23 PM, Karl Wright <da...@gmail.com> wrote:
> The URL as stated is fine and is pretty standard. I don't think
> there's a problem there, unless you inadvertently fixed something when
> you changed the hostname.
>
> Can you look at the log - there may well be a stack trace, especially
> if you have <property name="org.apache.manifoldcf.connectors"
> value="DEBUG"/> set. I'd love to see what the trace is.
>
> Karl
Re: WEB: Illegal seed URL
Posted by Karl Wright <da...@gmail.com>.
The URL as stated is fine and is pretty standard. I don't think
there's a problem there, unless you inadvertently fixed something when
you changed the hostname.
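
One quick way to rule out a purely syntactic problem with a seed is to parse it with the JDK, as in the sketch below. This only uses java.net.URI and is not how ManifoldCF itself validates seeds. The class and method names are hypothetical, and a URL that passes here can still be rejected by the crawler for other reasons.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class SeedUrlSyntaxCheck {

    // Returns true if the string parses as a URI with a scheme and a host.
    public static boolean looksWellFormed(String seed) {
        try {
            URI uri = new URI(seed);
            return uri.getScheme() != null && uri.getHost() != null;
        } catch (URISyntaxException e) {
            // For example, an unencoded space in the path ends up here.
            return false;
        }
    }

    public static void main(String[] args) throws URISyntaxException {
        // Seed URL from the original post (hostname is a placeholder).
        String seed = "https://hostname.com/vwebv/search"
            + "?searchArg=dvd&searchCode=SALL&searchType=1&recCount=100";

        System.out.println("well formed: " + looksWellFormed(seed));

        // Print the parts so a mistake made while editing the hostname is easy to spot.
        URI uri = new URI(seed);
        System.out.println("scheme: " + uri.getScheme());
        System.out.println("host:   " + uri.getHost());
        System.out.println("path:   " + uri.getPath());
        System.out.println("query:  " + uri.getQuery());
    }
}
```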
Can you look at the log? There may well be a stack trace, especially
if you have <property name="org.apache.manifoldcf.connectors"
value="DEBUG"/> set. I'd love to see what the trace is.
Karl