Posted to user@nutch.apache.org by Volli <il...@web.de> on 2010/08/24 23:17:30 UTC

nutch crawler ignores query string url like "...a.php?b=com_x&c=y" - SOLVED

I think the cause is the query string exclusion rule in
conf/regex-urlfilter.txt or conf/crawl-urlfilter.txt:

FIND:
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

REPLACE:
# skip URLs containing certain characters as probable queries, etc.
# -[?*!@=]

OR CHANGE:
# -[?*!@=]
-[*!@]
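
To illustrate why the original rule drops the URL, here is a minimal Python sketch of first-match filtering. It is not Nutch's actual code. It assumes that rules are tried top to bottom, that the first pattern found anywhere in the URL decides, and that a URL matching no rule is rejected. The function name first_match_decision and the trailing catch-all rule "+." are assumptions made for the illustration.

```python
import re

def first_match_decision(rules, url):
    """Return True if the URL is accepted, False if it is rejected.

    Assumed semantics (an illustration, not Nutch's implementation):
    rules are tried in order, the first pattern found anywhere in the
    URL decides, and a URL that matches no rule is rejected.
    """
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False

URL = ("http://uc.princeton.edu/main/index.php"
       "?option=com_vodcast&view=feed&format=raw")

# Original rule followed by a hypothetical catch-all "+." rule.
original_rules = [("-", r"[?*!@=]"), ("+", r".")]
# Changed rule: question mark and equals sign removed from the class.
changed_rules = [("-", r"[*!@]"), ("+", r".")]

print(first_match_decision(original_rules, URL))  # False: "?" and "=" match
print(first_match_decision(changed_rules, URL))   # True: falls through to "+."
```

With the original character class, the "?" in the URL triggers the exclusion before any accept rule is reached. With the narrowed class, the URL contains none of the listed characters, so it reaches the accept rule.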


On 24.08.2010 02:50, Israel wrote:
> Hello Volli, please help me one more time. I want to crawl this page, but
> nothing is generated. Is that possible?
>
> http://uc.princeton.edu/main/index.php?option=com_vodcast&view=feed&format=raw
...


Re: nutch crawler ignores query string url like "...a.php?b=com_x&c=y" - SOLVED

Posted by Israel <we...@gmail.com>.
Thanks, Volli... you rule, hahaha

Re: nutch crawler ignores query string url like "...a.php?b=com_x&c=y" - SOLVED

Posted by Volli <il...@web.de>.
Because some characters were replaced by dots in my last post,
here is the "OR CHANGE:" step in words:
Remove the question mark and the equals sign from the character class.

I don't know whether the remaining characters are allowed in
a query string. Possibly a stupid solution.
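
On the open question of the remaining characters: RFC 3986 defines the query component as any sequence of pchar, "/" and "?", which includes the sub-delimiters "!" and "*" as well as "@". The short Python sketch below checks this against the RFC's character sets. The helper name allowed_in_query is made up for the illustration, and percent-encoded sequences are left out for brevity.

```python
import string

# RFC 3986: query = *( pchar / "/" / "?" )
#           pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
SUB_DELIMS = set("!$&'()*+,;=")
QUERY_LITERALS = UNRESERVED | SUB_DELIMS | set(":@/?")

def allowed_in_query(ch):
    """True if ch may appear literally in a query string per RFC 3986."""
    return ch in QUERY_LITERALS

for ch in "*!@":
    print(ch, allowed_in_query(ch))
```

All three characters are legal in a query string, so the narrowed rule can still skip some valid URLs. Whether that matters depends on the sites being crawled.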
