Posted to user@manifoldcf.apache.org by Shigeki Kobayashi <sh...@g.softbank.co.jp> on 2014/06/27 11:09:32 UTC
basic question of Web crawler setting of "Include in index"
Hello guys
I am having trouble setting up web crawling jobs.
I want to index only php sites so I set the "Include in index" option to
.*\.php.*
I set the option above because PHP sites' URLs can take parameters like
.php?a=b,
but MCF indexes only URLs that end with .php.
I need to index URLs with parameters. Is there anything wrong with my
regular expression?
Re: basic question of Web crawler setting of "Include in index"
Posted by Shigeki Kobayashi <sh...@g.softbank.co.jp>.
Hi Karl.
Thanks a lot for your help!
I now understand how the setting works and this solved the problem!
Again, thanks a lot.
Best regards.
Shigeki
2014-06-27 20:46 GMT+09:00 Karl Wright <da...@gmail.com>:
Re: basic question of Web crawler setting of "Include in index"
Posted by Karl Wright <da...@gmail.com>.
Hi Shigeki,
The code doesn't care about the query string. It uses "find()" anyway,
which means you don't have to have the leading ".*" and trailing ".*":
>>>>>>
// First, verify that the url matches one of the patterns in the
// include list.
int i = 0;
while (i < includeIndexPatterns.size())
{
  Pattern p = includeIndexPatterns.get(i);
  Matcher m = p.matcher(url);
  if (m.find())
    break;
  i++;
}
if (i == includeIndexPatterns.size())
{
  if (Logging.connectors.isDebugEnabled())
    Logging.connectors.debug("WEB: Url '"+url+"' is not indexable because no include patterns match it");
  return false;
}

// Now make sure it's not in the exclude list.
i = 0;
while (i < excludeIndexPatterns.size())
{
  Pattern p = excludeIndexPatterns.get(i);
  Matcher m = p.matcher(url);
  if (m.find())
  {
    if (Logging.connectors.isDebugEnabled())
      Logging.connectors.debug("WEB: Url '"+url+"' is not indexable because exclude pattern '"+p.toString()+"' matched it");
    return false;
  }
  i++;
}

return true;
<<<<<<
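As a standalone illustration (not part of the connector code), here is a small sketch of why find() makes the leading and trailing ".*" unnecessary: find() looks for the pattern anywhere in the string, while matches() requires the whole string to match. The URL used here is just a made-up example:

```java
import java.util.regex.Pattern;

public class FindVsMatches {
    public static void main(String[] args) {
        String url = "http://example.com/page.php?a=b";

        // find() succeeds if the pattern occurs anywhere in the URL,
        // so a bare "\.php" matches URLs with query strings too:
        Pattern unanchored = Pattern.compile("\\.php");
        System.out.println(unanchored.matcher(url).find());     // true

        // matches() requires the entire string to match, so the same
        // bare pattern fails there:
        System.out.println(unanchored.matcher(url).matches());  // false

        // Only with matches() would you need the ".*" wrappers:
        Pattern wrapped = Pattern.compile(".*\\.php.*");
        System.out.println(wrapped.matcher(url).matches());     // true
    }
}
```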
If you turn on connector debugging, you may see more reasons why the url is
being rejected in the log.
Thanks,
Karl