Posted to user@nutch.apache.org by Volli <il...@web.de> on 2010/08/24 23:17:30 UTC
nutch crawler ignores query string url like "...a.php?b=com_x&c=y" - SOLVED
I think the cause is the query string exclusion rule in
conf/regex-urlfilter.txt or conf/crawl-urlfilter.txt:
FIND:
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
REPLACE:
# skip URLs containing certain characters as probable queries, etc.
# -[?*!@=]
OR CHANGE:
# -[?*!@=]
-[*!@]
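
To illustrate why the original rule drops the URL from the question, here is a minimal Python sketch of first-match-wins filtering as described in the header of the filter file (the first matching pattern decides, and a URL that matches nothing is ignored). This is not Nutch's code: the names parse_rules and accepts are hypothetical, Python's re module stands in for Java regex, and the trailing "+." catch-all rule is an assumption about the rest of the file, which is not shown in this post.

```python
import re

def parse_rules(text):
    """Parse filter lines of the form '+regex' or '-regex'; skip blanks and comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(rules, url):
    """The first rule whose pattern is found anywhere in the URL decides."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # no rule matched: the URL is ignored

# Assumed file contents: the rule from this post plus a catch-all "+." line.
ORIGINAL = "-[?*!@=]\n+."
CHANGED = "# -[?*!@=]\n-[*!@]\n+."

URL = "http://uc.princeton.edu/main/index.php?option=com_vodcast&view=feed&format=raw"

print(accepts(parse_rules(ORIGINAL), URL))  # rejected: the URL contains '?' and '='
print(accepts(parse_rules(CHANGED), URL))   # accepted: none of '*', '!', '@' occur
```

With the changed rule, the URL falls through to the assumed catch-all and is kept. Whether it is actually fetched also depends on the other rules in the file, for example a domain restriction in conf/crawl-urlfilter.txt.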
On 24.08.2010 02:50, Israel wrote:
> Hello Volli, please help me one more time. I want to crawl this page, but
> nothing is generated. Is it possible?
>
> http://uc.princeton.edu/main/index.php?option=com_vodcast&view=feed&format=raw
...
Re: nutch crawler ignores query string url like "...a.php?b=com_x&c=y" - SOLVED
Posted by Israel <we...@gmail.com>.
thanks volley.........you rule jajaja
Re: nutch crawler ignores query string url like "...a.php?b=com_x&c=y" - SOLVED
Posted by Volli <il...@web.de>.
Because some characters were replaced by dots in my last post,
here is "OR CHANGE:" in words:
Remove the question mark and the equals sign from the character class.
I don't know whether the remaining characters are allowed in
a query string, so this may be a crude solution.
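
On the open question about the remaining characters: going by the URI grammar in RFC 3986, the query component may contain unreserved characters, sub-delimiters, and ":", "@", "/" and "?", so "*", "!" and "@" are all permitted there. The following is a small Python sketch of that check. The name allowed_in_query is hypothetical, and percent-encoded sequences are deliberately left out to keep it short.

```python
import string

# Character classes from the RFC 3986 grammar (percent-encoding omitted).
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
SUB_DELIMS = set("!$&'()*+,;=")
QUERY_CHARS = UNRESERVED | SUB_DELIMS | set(":@/?")

def allowed_in_query(ch):
    """Return True if a single character may appear literally in a query string."""
    return ch in QUERY_CHARS

for ch in "*!@":
    print(repr(ch), allowed_in_query(ch))
```

So the reduced rule -[*!@] can still reject some valid query URLs. Commenting the rule out entirely avoids that, at the cost of letting more URLs through.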
On 24.08.2010 23:17, Volli wrote:
> I think it's the query string exclusion in files
> conf/regex-urlfilter.txt or conf/crawl-urlfilter.txt:
>
> FIND:
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
>
> REPLACE:
> # skip URLs containing certain characters as probable queries, etc.
> # -[?*!@=]
>
> OR CHANGE:
> # -[?*!@=]
> -[*!@]
>
>
> On 24.08.2010 02:50, Israel wrote:
>> Hello Volli, please help me one more time. I want to
>> crawl this page, but
>> nothing is generated. Is it possible?
>>
>> http://uc.princeton.edu/main/index.php?option=com_vodcast&view=feed&format=raw
>>
> ...
>