Posted to user@nutch.apache.org by Lewis John Mcgibbney <le...@gmail.com> on 2015/03/05 21:17:24 UTC

need a little bit apache nutch ..

Hi,
Please see
http://wiki.apache.org/nutch/FAQ#Nutch_doesn.27t_crawl_relative_URLs.3F_Some_pages_are_not_indexed_but_my_regex_file_and_everything_else_is_okay_-_what_is_going_on.3F
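
For illustration only: a relative href is normally resolved against the URL of the page it was found on, so index.php?blabla should end up as the full forum URL. The sketch below uses Python's standard urllib.parse.urljoin to show that resolution. It is not Nutch's own code (Nutch is written in Java), and the variable names are made up for this example.

```python
from urllib.parse import urljoin

# Page on which the link was found (taken from the question in this thread).
base_url = "http://www.kadinlarkulubu.com/forum/index.php"

# Relative href as it appears in the page source.
relative_href = "index.php?blabla"

# Standard relative-reference resolution: the last path segment of the base
# is replaced by the relative path, and the query string is kept.
absolute_url = urljoin(base_url, relative_href)
print(absolute_url)
```

If the resolved URL still does not get crawled, the relative form is probably not the cause, and the URL filter is the next thing to check.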

Also, please ensure that your urlfilter permits '?' in URL entries.
HTH
Lewis

On Thursday, March 5, 2015, Gaplan <gaplan@gmail.com> wrote:

> Can you help me?
>
> I have to crawl the domain http://www.kadinlarkulubu.com/forum/index.php,
> but the links on it are always relative:
> a href="index.php?blabla", not
> a href="http://www.kadinlarkulubu.com/forum/index.php?blabla".
> How can I configure this?
> Thank you for your time.
> OSA
>


-- 
*Lewis*

Re: need a little bit apache nutch ..

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Please look at the URL filter you define within the plugin.includes
property in nutch-site.xml. If it is regex-urlfilter (which it is by
default), then you will need to edit the following line to remove the '?':

https://github.com/apache/nutch/blob/trunk/conf/regex-urlfilter.txt.template#L33
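
To make the effect of that line concrete, here is a rough Python simulation of how regex-urlfilter rules are evaluated: each rule is a '+' or '-' followed by a regular expression, the first rule whose pattern is found in the URL decides, and a URL that matches no rule is dropped. The rule lists are a simplified assumption (only the character-class rule and the final catch-all from the template), and the helper name accepts is hypothetical, not part of Nutch.

```python
import re

def accepts(rules, url):
    """Return True if the first rule whose pattern is found in url is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched, so the URL is dropped

# Simplified default: skip URLs containing characters typical of queries.
default_rules = [("-", r"[?*!@=]"), ("+", r".")]

# Same rule with '?' removed from the character class, as suggested above.
relaxed_rules = [("-", r"[*!@=]"), ("+", r".")]

url = "http://www.kadinlarkulubu.com/forum/index.php?blabla"
print(accepts(default_rules, url))  # rejected because of '?'
print(accepts(relaxed_rules, url))  # accepted by the catch-all rule
```

Note that with only '?' removed, a URL whose query string contains '=' would still be rejected by the same rule, so that character may need to come out of the class as well, depending on the forum's links.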

Hopefully this makes better sense.
Lewis

On Thursday, March 5, 2015, Gaplan <ga...@gmail.com> wrote:

> Thanks for the answer, Lewis.
> I don't understand this:
> "Also please ensure that your urlfilter permits '?' In URLS entries"
> How can I do that?

-- 
*Lewis*