Posted to user@manifoldcf.apache.org by Shigeki Kobayashi <sh...@g.softbank.co.jp> on 2014/06/27 11:09:32 UTC

basic question of Web crawler setting of "Include in index"

Hello guys,


I am having trouble setting up web crawling jobs.

I want to index only PHP pages, so I set the "Include in index" option to
.*\.php.*


I set the option above because PHP URLs can take query parameters, like
.php?a=b, but MCF indexes only URLs that end with .php.

I need to index URLs with parameters. Is there anything wrong with my
regular expression?

Re: basic question of Web crawler setting of "Include in index"

Posted by Shigeki Kobayashi <sh...@g.softbank.co.jp>.
Hi Karl,

Thanks a lot for your help!

I now understand how the setting works and this solved the problem!


Again, thanks a lot.

Best regards.

Shigeki

Re: basic question of Web crawler setting of "Include in index"

Posted by Karl Wright <da...@gmail.com>.
Hi Shigeki,

The code doesn't care about the query string. It uses "find()", which looks
for the pattern anywhere in the url rather than matching the url in full, so
you don't need the leading ".*" and trailing ".*":

>>>>>>
      // First, verify that the url matches one of the patterns in the include list.
      int i = 0;
      while (i < includeIndexPatterns.size())
      {
        Pattern p = includeIndexPatterns.get(i);
        Matcher m = p.matcher(url);
        if (m.find())
          break;
        i++;
      }
      if (i == includeIndexPatterns.size())
      {
        if (Logging.connectors.isDebugEnabled())
          Logging.connectors.debug("WEB: Url '"+url+"' is not indexable because no include patterns match it");
        return false;
      }

      // Now make sure it's not in the exclude list.
      i = 0;
      while (i < excludeIndexPatterns.size())
      {
        Pattern p = excludeIndexPatterns.get(i);
        Matcher m = p.matcher(url);
        if (m.find())
        {
          if (Logging.connectors.isDebugEnabled())
            Logging.connectors.debug("WEB: Url '"+url+"' is not indexable because exclude pattern '"+p.toString()+"' matched it");
          return false;
        }
        i++;
      }

      return true;
<<<<<<
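
If it helps, here is a minimal standalone sketch of the find()/matches()
distinction, using plain java.util.regex and a made-up URL (this is not code
from the connector itself):

>>>>>>
import java.util.regex.Pattern;

public class FindVsMatches
{
  public static void main(String[] args)
  {
    // Hypothetical URL with a query string
    String url = "http://example.com/page.php?a=b";

    // find() succeeds if the pattern occurs anywhere in the url,
    // so a plain "\.php" accepts .php URLs with parameters too.
    Pattern plain = Pattern.compile("\\.php");
    System.out.println(plain.matcher(url).find());      // true

    // matches() requires the pattern to cover the entire url;
    // that is where a leading/trailing ".*" would be needed.
    System.out.println(plain.matcher(url).matches());   // false
    System.out.println(Pattern.compile(".*\\.php.*").matcher(url).matches()); // true
  }
}
<<<<<<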

If you turn on connector debugging, the log may show additional reasons why
the url is being rejected.
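
For the record, connector debug output is controlled through the logging
properties; if I remember right, adding something like this to properties.xml
turns it on (please verify against the documentation for your MCF version):

>>>>>>
  <!-- Enables debug logging for all connectors -->
  <property name="org.apache.manifoldcf.connectors" value="DEBUG"/>
<<<<<<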

Thanks,
Karl