Posted to user@nutch.apache.org by remi tassing <ta...@gmail.com> on 2011/12/09 13:32:41 UTC

"URLFilterChecker" documentation

Hello guys,

How do you use "org.apache.nutch.net.URLFilterChecker"? It's not documented,
and it always shows me "Checking combination of all URLFilters available"
and then gets stuck.

Remi

Re: "URLFilterChecker" documentation

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi,

Can anyone confirm if this is an issue?

If so I think we should log it before it goes unnoticed.

Thanks

Lewis

On Fri, Dec 9, 2011 at 3:21 PM, Lewis John Mcgibbney
<le...@gmail.com> wrote:
> If you look at the output I posted, even when I specified a particular
> filter, the checkAll() method is still getting called, as is indicated by
> the "Checking combination of all URLFilters available" log output. It's not
> a particularly complex class, so hopefully if we can confirm this is a bug
> we can fix it quickly.
>
> Finally, I must ask, Remi which URL filters have you included in your
> plugin.includes property in nutch-site.xml after building Nutch?



-- 
Lewis

Re: "URLFilterChecker" documentation

Posted by Lewis John Mcgibbney <le...@gmail.com>.
If you look at the output I posted, even when I specified a particular
filter, the checkAll() method is still getting called, as is indicated by
the "Checking combination of all URLFilters available" log output. It's not
a particularly complex class, so hopefully if we can confirm this is a bug
we can fix it quickly.

Finally, I must ask, Remi, which URL filters have you included in your
plugin.includes property in nutch-site.xml after building Nutch?

On Fri, Dec 9, 2011 at 3:11 PM, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> Hi Remi & Markus,
>
> Yeah, I can replicate this, good catch Remi.
>
> lewis@lewis-desktop:~/ASF/trunk/runtime/local$ bin/nutch org.apache.nutch.net.URLFilterChecker http://www.heraldscotland.com -filterName regex-urlfilter.txt
>
> Checking combination of all URLFilters available
> ^Z
> [2]+  Stopped                 bin/nutch org.apache.nutch.net.URLFilterChecker http://www.heraldscotland.com -filterName regex-urlfilter.txt
> lewis@lewis-desktop:~/ASF/trunk/runtime/local$ bin/nutch org.apache.nutch.net.URLFilterChecker http://www.heraldscotland.com -filterName regex-urlfilter
>
> Checking combination of all URLFilters available
>
> The first instance was hanging, so was the second. This needs some further
> investigation I think. Can someone else please confirm before we log this
> in Jira?
>
> Thanks for reporting


-- 
*Lewis*

Re: "URLFilterChecker" documentation

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Remi & Markus,

Yeah, I can replicate this, good catch Remi.

lewis@lewis-desktop:~/ASF/trunk/runtime/local$ bin/nutch org.apache.nutch.net.URLFilterChecker http://www.heraldscotland.com -filterName regex-urlfilter.txt
Checking combination of all URLFilters available
^Z
[2]+  Stopped                 bin/nutch org.apache.nutch.net.URLFilterChecker http://www.heraldscotland.com -filterName regex-urlfilter.txt
lewis@lewis-desktop:~/ASF/trunk/runtime/local$ bin/nutch org.apache.nutch.net.URLFilterChecker http://www.heraldscotland.com -filterName regex-urlfilter
Checking combination of all URLFilters available

Both the first and the second invocation hung. I think this needs some
further investigation. Can someone else please confirm before we log this
in Jira?

Thanks for reporting

On Fri, Dec 9, 2011 at 12:53 PM, remi tassing <ta...@gmail.com> wrote:

> I fed with URL but it didn't work:
>
> $ bin/nutch org.apache.nutch.net.URLFilterChecker http://www.google.com
> Checking combination of all URLFilters available
>
> Remi



-- 
*Lewis*

Re: "URLFilterChecker" documentation

Posted by remi tassing <ta...@gmail.com>.
I fed it a URL but it didn't work:

$ bin/nutch org.apache.nutch.net.URLFilterChecker http://www.google.com
Checking combination of all URLFilters available

Remi

On Fri, Dec 9, 2011 at 2:43 PM, Markus Jelsma <ma...@openindex.io> wrote:

> it reads from stdin so you can either type a url followed by enter or feed
> from stdin using pipes.
>
> On Friday 09 December 2011 13:32:41 remi tassing wrote:
> > Hello guys,
> >
> > how do you use "org.apache.nutch.net.URLFilterChecker"? It's not documented
> > and it always shows me this "Checking combination of all URLFilters
> > available" and then gets stuck.
> >
> > Remi
>
> --
> Markus Jelsma - CTO - Openindex
>



-- 
Remi Tassing

Re: "URLFilterChecker" documentation

Posted by Markus Jelsma <ma...@openindex.io>.
It reads from stdin, so you can either type a URL followed by Enter or feed
it URLs from stdin using a pipe.
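
A minimal Python sketch of that behaviour, not the Nutch code itself. It
assumes the checker simply loops over lines arriving on stdin and prints
each URL prefixed with "+" (accepted) or "-" (rejected); that output format
is an assumption here. The names check_urls and accept_http are hypothetical
and exist only for illustration, and io.StringIO stands in for stdin.

```python
import io
import re


def check_urls(stream, accept):
    """Read one URL per line from `stream`; yield '+url' or '-url'."""
    for line in stream:
        url = line.strip()
        if not url:
            continue  # ignore empty lines
        yield ("+" if accept(url) else "-") + url


def accept_http(url):
    """Hypothetical stand-in filter: accept only plain http URLs."""
    return re.match(r"^http://", url) is not None


# io.StringIO plays the role of stdin. With a real terminal the loop would
# block until a line is typed, which can look like the tool is stuck.
fake_stdin = io.StringIO("http://www.google.com\nftp://example.org\n")
results = list(check_urls(fake_stdin, accept_http))
for verdict in results:
    print(verdict)
```

Seen this way, a URL given as a command-line argument is never read by the
loop, which would match the "gets stuck" symptom: the process is waiting for
input rather than hanging.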

On Friday 09 December 2011 13:32:41 remi tassing wrote:
> Hello guys,
> 
> how do you use "org.apache.nutch.net.URLFilterChecker"? It's not documented
> and it always shows me this "Checking combination of all URLFilters
> available" and then gets stuck.
> 
> Remi

-- 
Markus Jelsma - CTO - Openindex

Re: "URLFilterChecker" documentation

Posted by remi tassing <ta...@gmail.com>.
It actually works fine!

I accidentally left a "+." at the beginning of regex-urlfilter.txt and only
put "-." at the end.

Thanks to Markus and Lewis!

Remi
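
The effect of a leading "+." can be sketched in Python under the assumption
that the rules in regex-urlfilter.txt are tried top to bottom and the first
pattern found in the URL decides the verdict. The helpers parse_rules and
accepts are hypothetical names, not Nutch APIs, and the example.org rule is
made up for illustration.

```python
import re


def parse_rules(text):
    """Parse '+regex' / '-regex' lines; skip blank lines and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return the verdict of the first rule whose pattern is found in url."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # assumption: a URL that matches no rule is rejected


# A stray "+." on top matches every URL, so later rules are never reached.
misordered = parse_rules("\n".join(["+.", r"+^http://example\.org/", "-."]))
corrected = parse_rules("\n".join([r"+^http://example\.org/", "-."]))

print(accepts(misordered, "http://www.google.com"))   # True
print(accepts(corrected, "http://www.google.com"))    # False
print(accepts(corrected, "http://example.org/page"))  # True
```

With the stray rule removed, the final "-." acts as the catch-all reject it
was meant to be.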

On Tuesday, December 13, 2011, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:
> I get it now ... Duh :0)
>
> Output is fine for me. What is wrong with your results Remi?

Re: "URLFilterChecker" documentation

Posted by Lewis John Mcgibbney <le...@gmail.com>.
I get it now ... Duh :0)

Output is fine for me. What is wrong with your results Remi?

On Tue, Dec 13, 2011 at 7:09 PM, remi tassing <ta...@gmail.com> wrote:
> Please check Markus's earlier email on the format. It seems to be working,
> but the output is still incorrect for me.

-- 
Lewis

Re: "URLFilterChecker" documentation

Posted by remi tassing <ta...@gmail.com>.
Please check Markus's earlier email on the format. It seems to be working,
but the output is still incorrect for me.


Re: "URLFilterChecker" documentation

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Here's my output from URLFilterChecker [1]

lewis@lewis-01:~/ASF/trunk/runtime/local$ bin/nutch org.apache.nutch.net.URLFilterChecker -filterName urlfilter-regex
Exception in thread "main" java.lang.RuntimeException: Filter urlfilter-regex not found.
	at org.apache.nutch.net.URLFilterChecker.checkOne(URLFilterChecker.java:66)
	at org.apache.nutch.net.URLFilterChecker.main(URLFilterChecker.java:126)
lewis@lewis-01:~/ASF/trunk/runtime/local$ bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined
Checking combination of all URLFilters available
^Z
[10]+  Stopped                 bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined
lewis@lewis-01:~/ASF/trunk/runtime/local$ bin/nutch org.apache.nutch.net.URLFilterChecker -filterName RegexURLFilter
Exception in thread "main" java.lang.RuntimeException: Filter RegexURLFilter not found.
	at org.apache.nutch.net.URLFilterChecker.checkOne(URLFilterChecker.java:66)
	at org.apache.nutch.net.URLFilterChecker.main(URLFilterChecker.java:126)

I'm noticing three things:

1) No reference to a single urlfilter seems to work when passed to
the -filterName parameter, e.g. regex-urlfilter, urlfilter-regex,
RegexURLFilter, regex-urlfilter.txt.
2) When no -filterName parameter is passed but a value is passed, e.g.
bin/nutch org.apache.nutch.net.URLFilterChecker regex-urlfilter, the log
output is as follows:
lewis@lewis-01:~/ASF/trunk/runtime/local$ bin/nutch org.apache.nutch.net.URLFilterChecker regex-urlfilter
Checking combination of all URLFilters available
Therefore it seems to incorrectly skip to the checkAll method and then hang!
3) If the -allCombined parameter is passed, the output indicates that
it does the same as 2) above...

Can you please check if you are getting the same behaviour, Markus? Thank you

[1] http://svn.apache.org/repos/asf/nutch/trunk/src/java/org/apache/nutch/net/URLFilterChecker.java

On Tue, Dec 13, 2011 at 5:06 PM, Markus Jelsma
<ma...@openindex.io> wrote:
> i see no log output mate :)
>
> On Tuesday 13 December 2011 17:58:36 you wrote:
>> Thanks Markus.
>>
>> Can you look at my log output and inform where I am going wrong
>> please? It seemed to be playing up for me.
>>
>> Thanks
>>
>> On Tue, Dec 13, 2011 at 4:53 PM, Markus Jelsma
>>
>> <ma...@openindex.io> wrote:
>> > I've never seen it hanging and use it weekly.
>> >
>> > On Tuesday 13 December 2011 17:45:54 you wrote:
>> >> Hi,
>> >>
>> >> Can anyone confirm if this is an issue?
>> >>
>> >> If so I think we should log it before it goes unnoticed.
>> >>
>> >> Thanks
>> >>
>> >> Lewis
>> >>
>> >> On Fri, Dec 9, 2011 at 3:21 PM, Lewis John Mcgibbney
>> >>
>> >> <le...@gmail.com> wrote:
>> >> > If you look at the output I posted, even when I specified a particular
>> >> > filter, the checkAll() method is still getting called, as is indicated
>> >> > by the "Checking combination of all URLFilters available" log output.
>> >> > It's not a particularly complex class, so hopefully if we can confirm
>> >> > this is a bug we can fix it quickly.
>> >> >
>> >> > Finally, I must ask, Remi which URL filters have you included in your
>> >> > plugin.includes property in nutch-site.xml after building Nutch?
>> >> >
>> >> > On Fri, Dec 9, 2011 at 3:11 PM, Lewis John Mcgibbney
>> >> >
>> >> > <le...@gmail.com> wrote:
>> >> >> Hi Remi & Markus,
>> >> >>
>> >> >> Yeah, I can replicate this, good catch Remi.
>> >> >>
>> >> >> lewis@lewis-desktop:~/ASF/trunk/runtime/local$ bin/nutch
>> >> >> org.apache.nutch.net.URLFilterChecker http://www.heraldscotland.com
>> >> >> -filterName regex-urlfilter.txt
>> >> >>
>> >> >> Checking combination of all URLFilters available
>> >> >> ^Z
>> >> >> [2]+  Stopped                 bin/nutch
>> >> >> org.apache.nutch.net.URLFilterChecker http://www.heraldscotland.com
>> >> >> -filterName regex-urlfilter.txt
>> >> >> lewis@lewis-desktop:~/ASF/trunk/runtime/local$ bin/nutch
>> >> >> org.apache.nutch.net.URLFilterChecker http://www.heraldscotland.com
>> >> >> -filterName regex-urlfilter
>> >> >>
>> >> >> Checking combination of all URLFilters available
>> >> >>
>> >> >> The first instance was hanging, so was the second. This needs some
>> >> >> further investigation I think. Can someone else please confirm before
>> >> >> we log this in Jira?
>> >> >>
>> >> >> Thanks for reporting
>> >> >>
>> >> >>
>> >> >> On Fri, Dec 9, 2011 at 12:53 PM, remi tassing <ta...@gmail.com>
>> >> >>
>> >> >> wrote:
>> >> >>> I fed with URL but it didn't work:
>> >> >>>
>> >> >>> $ bin/nutch org.apache.nutch.net.URLFilterChecker http://www.google.com
>> >> >>> Checking combination of all URLFilters available
>> >> >>>
>> >> >>> Remi
>> >> >>>
>> >> >>> On Fri, Dec 9, 2011 at 2:43 PM, Markus Jelsma
>> >> >>>
>> >> >>> <ma...@openindex.io>wrote:
>> >> >>> > it reads from stdin so you can either type a url followed by enter
>> >> >>> > or feed
>> >> >>> > from stdin using pipes.
>> >> >>> >
>> >> >>> > On Friday 09 December 2011 13:32:41 remi tassing wrote:
>> >> >>> > > Hello guys,
>> >> >>> > >
>> >> >>> > > how do you use "org.apache.nutch.net.URLFilterChecker"? It's not
>> >> >>> > > documented and it always shows me this "Checking combination of all
>> >> >>> > > URLFilters available" and then gets stuck.
>> >> >>> > >
>> >> >>> > > Remi
>> >> >>> >
>> >> >>> > --
>> >> >>> > Markus Jelsma - CTO - Openindex
>> >> >>>
>> >> >>> --
>> >> >>> Remi Tassing
>> >> >>
>> >> >> --
>> >> >> Lewis
>> >> >
>> >> > --
>> >> > Lewis
>> >
>> > --
>> > Markus Jelsma - CTO - Openindex
>
> --
> Markus Jelsma - CTO - Openindex



-- 
Lewis