Posted to user@nutch.apache.org by Hannes Carl Meyer <ha...@googlemail.com> on 2010/06/24 12:18:29 UTC

Question on normalizing urls / RegexURLNormalizer

Hi,

I'm trying to strip a parameter from URLs using the RegexURLNormalizer. I
added this to my nutch-site.xml:

    <property>
        <name>urlnormalizer.order</name>
        <value>org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer</value>
    </property>

    <property>
        <name>urlnormalizer.regex.file</name>
        <value>regex-normalize.xml</value>
    </property>

And defined this expression rule:

<regex>
  <pattern>(\?|&amp;)([;_]?((?i)l|j|bv_|ps_)?((?i)s|sid|IFLBSERVERID)=.*?)(\?|&amp;|#|$)</pattern>
  <substitution>$1$5</substitution>
</regex>

(to strip the parameter IFLBSERVERID from the URL)
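
For illustration, here is a minimal Java sketch of what this rule does when
applied with plain java.util.regex. It assumes the normalizer applies the
pattern and substitution in the same way as Matcher.replaceAll; the class name
StripSessionParam and the example.com URLs are made up for this example and do
not come from Nutch.

```java
import java.util.regex.Pattern;

public class StripSessionParam {
    // The rule from regex-normalize.xml, with &amp; unescaped to & as an
    // XML parser would deliver it.
    static final Pattern RULE = Pattern.compile(
        "(\\?|&)([;_]?((?i)l|j|bv_|ps_)?((?i)s|sid|IFLBSERVERID)=.*?)(\\?|&|#|$)");

    // Keeps the leading separator ($1) and the trailing delimiter ($5),
    // dropping the parameter name and value in between.
    static String normalize(String url) {
        return RULE.matcher(url).replaceAll("$1$5");
    }

    public static void main(String[] args) {
        // prints http://www.example.com/page?&x=1
        System.out.println(normalize("http://www.example.com/page?IFLBSERVERID=abc&x=1"));
        // prints http://www.example.com/page?x=1&
        System.out.println(normalize("http://www.example.com/page?x=1&IFLBSERVERID=abc"));
    }
}
```

Note that the rule on its own leaves a "?&" or a trailing "&" behind, so
further rules would be needed to tidy up the result.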

The indexed documents still contain the parameter, so it looks to me as if the
RegexURLNormalizer is not being applied. Could this be related to
https://issues.apache.org/jira/browse/NUTCH-706 ?

Thanks and regards

Hannes

-- 

https://www.xing.com/profile/HannesCarl_Meyer
http://de.linkedin.com/in/hannescarlmeyer
http://twitter.com/hannescarlmeyer

Re: Question on normalizing urls / RegexURLNormalizer

Posted by Hannes Carl Meyer <ha...@googlemail.com>.
Julien, thanks for all your help, but:

If I delete ./plugins AND ./build/plugins, it tries to load them from
the nutch-1.1.job and fails.
Maybe I'm just using a messed-up nutch-1.1 version; going to check on 1.0
now...

On Thu, Jun 24, 2010 at 5:45 PM, Julien Nioche <
lists.digitalpebble@gmail.com> wrote:

> The clue might be in: /~/apache-nutch-1.1-bin/plugins
> Regenerate the job, then delete this directory. Check in the log file
> where it gets the plugins from.
>
>
> On 24 June 2010 16:11, Hannes Carl Meyer <ha...@googlemail.com>wrote:
>
>> Nope, that changes nothing. Just checked out my log file:
>>
>> 2010-06-24 17:13:40,410 INFO  plugin.PluginRepository - Plugins: looking
>> in: /~/apache-nutch-1.1-bin/plugins
>> 2010-06-24 17:13:41,439 INFO  plugin.PluginRepository - Plugin
>> Auto-activation mode: [true]
>> 2010-06-24 17:13:41,439 INFO  plugin.PluginRepository - Registered
>> Plugins:
>> 2010-06-24 17:13:41,439 INFO  plugin.PluginRepository -         the nutch
>> core extension points (nutch-extensionpoints)
>> 2010-06-24 17:13:41,439 INFO  plugin.PluginRepository -         Basic
>> Query Filter (query-basic)
>> 2010-06-24 17:13:41,439 INFO  plugin.PluginRepository -         Basic
>> Indexing Filter (index-basic)
>> 2010-06-24 17:13:41,439 INFO  plugin.PluginRepository -         Html Parse
>> Plug-in (parse-html)
>> 2010-06-24 17:13:41,439 INFO  plugin.PluginRepository -         Site Query
>> Filter (query-site)
>> 2010-06-24 17:13:41,439 INFO  plugin.PluginRepository -         Http /
>> Https Protocol Plug-in (protocol-httpclient)
>> 2010-06-24 17:13:41,439 INFO  plugin.PluginRepository -         Basic
>> Summarizer Plug-in (summary-basic)
>> 2010-06-24 17:13:41,439 INFO  plugin.PluginRepository -         HTTP
>> Framework (lib-http)
>> 2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         Text Parse
>> Plug-in (parse-text)
>> 2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         Regex URL
>> Filter (urlfilter-regex)
>> 2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         Http
>> Protocol Plug-in (protocol-http)
>> 2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         XML
>> Response Writer Plug-in (response-xml)
>> 2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         OPIC
>> Scoring Plug-in (scoring-opic)
>> 2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         Tika
>> Parser Plug-in (parse-tika)
>> 2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         CyberNeko
>> HTML Parser (lib-nekohtml)
>> 2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         Anchor
>> Indexing Filter (index-anchor)
>> 2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         JavaScript
>> Parser (parse-js)
>> 2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         URL Query
>> Filter (query-url)
>> 2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         Regex URL
>> Filter Framework (lib-regex-filter)
>> 2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         JSON
>> Response Writer Plug-in (response-json)
>> 2010-06-24 17:13:41,450 INFO  plugin.PluginRepository - Registered
>> Extension-Points:
>> 2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         Nutch
>> Summarizer (org.apache.nutch.searcher.Summarizer)
>> 2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         Nutch
>> Protocol (org.apache.nutch.protocol.Protocol)
>> 2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         Nutch
>> Analysis (org.apache.nutch.analysis.NutchAnalyzer)
>> 2010-06-24 17:13:41,451 INFO  plugin.PluginRepository -         Nutch
>> Field Filter (org.apache.nutch.indexer.field.FieldFilter)
>> 2010-06-24 17:13:41,451 INFO  plugin.PluginRepository -         HTML Parse
>> Filter (org.apache.nutch.parse.HtmlParseFilter)
>> 2010-06-24 17:13:41,451 INFO  plugin.PluginRepository -         Nutch
>> Query Filter (org.apache.nutch.searcher.QueryFilter)
>> 2010-06-24 17:13:41,451 INFO  plugin.PluginRepository -         Nutch
>> Search Results Response Writer
>> (org.apache.nutch.searcher.response.ResponseWriter)
>> 2010-06-24 17:13:41,451 INFO  plugin.PluginRepository -         Nutch URL
>> Normalizer (org.apache.nutch.net.URLNormalizer)
>> 2010-06-24 17:13:41,451 INFO  plugin.PluginRepository -         Nutch URL
>> Filter (org.apache.nutch.net.URLFilter)
>> 2010-06-24 17:13:41,451 INFO  plugin.PluginRepository -         Nutch
>> Online Search Results Clustering Plugin
>> (org.apache.nutch.clustering.OnlineClusterer)
>> 2010-06-24 17:13:41,451 INFO  plugin.PluginRepository -         Nutch
>> Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
>> 2010-06-24 17:13:41,451 INFO  plugin.PluginRepository -         Nutch
>> Content Parser (org.apache.nutch.parse.Parser)
>> 2010-06-24 17:13:41,451 INFO  plugin.PluginRepository -         Nutch
>> Scoring (org.apache.nutch.scoring.ScoringFilter)
>> 2010-06-24 17:13:41,451 INFO  plugin.PluginRepository -         Ontology
>> Model Loader (org.apache.nutch.ontology.Ontology)
>>
>> No RegexURLNormalizer is being loaded...
>>
>>
>> On Thu, Jun 24, 2010 at 4:37 PM, Julien Nioche <
>> lists.digitalpebble@gmail.com> wrote:
>>
>>> OK. Since you are in distributed mode it should use the content of the
>>> job file. Try deleting ./build/plugins to see if this changes anything
>>>
>>>
>>> On 24 June 2010 15:30, Hannes Carl Meyer <ha...@googlemail.com>wrote:
>>>
>>>> Yep, that did not work, although it displays "URL normalizing: true" in
>>>> the crawl process...
>>>> Also, bin/nutch plugin ... does not work!
>>>>
>>>>
>>>> On Thu, Jun 24, 2010 at 3:06 PM, Julien Nioche <
>>>> lists.digitalpebble@gmail.com> wrote:
>>>>
>>>>> tried ant clean job?
>>>>>
>>>>>
>>>>>> I'm using Nutch 1.1 and starting an intranet crawl (jobtracker is
>>>>>> local).
>>>>>>
>>>>>> When executing bin/nutch plugin ... I'm getting a "Plugin
>>>>>> 'urlnormalizer-regex' not present or inactive." message.
>>>>>> conf/nutch-site.xml contains the property plugin.includes, and its
>>>>>> value includes urlnormalizer-regex.
>>>>>>
>>>>>> Starting the RegexURLNormalizer from within Eclipse is fine and it is
>>>>>> doing its job.
>>>>>>
>>>>>> Regards
>>>>>>
>>>>>> Hannes
>>>>>>
>>>>>>
>>>>>> On Thu, Jun 24, 2010 at 12:46 PM, Julien Nioche <
>>>>>> lists.digitalpebble@gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Have you tried using:
>>>>>>> ./nutch plugin urlnormalizer-regex
>>>>>>> org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer
>>>>>>> http://www.myinputurl.com
>>>>>>> That should help find where the problem is coming from.
>>>>>>>
>>>>>>> Are you running in distributed mode? Did you generate a new job file?
>>>>>>>
>>>>>>> J.
>>>>>>>
>>>>>>>




Re: Question on normalizing urls / RegexURLNormalizer

Posted by Hannes Carl Meyer <ha...@googlemail.com>.
Awesome... Thank you very very much :-)

On Thu, Jun 24, 2010 at 6:55 PM, reinhard schwab <re...@aon.at>wrote:

> hi hannes,
>
> i have identified your problem.
> your nutch-site.xml plugin.includes property contains a newline after
> urlnormalizer-(basic|pass|regex), which breaks pattern matching in
> PluginRepository.java.
>
>  <property>
>    <name>plugin.includes</name>
>
>  <value>protocol-http|protocol-httpclient|urlfilter-regex|parse-(text|html|js|tika)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(basic|pass|regex)
> </value>
>  </property>
>
> if i remove the newline before </value>, it is ok.
>
> regards
> reinhard
>

Re: Question on normalizing urls / RegexURLNormalizer

Posted by reinhard schwab <re...@aon.at>.
hi hannes,

i have identified your problem.
your nutch-site.xml plugin.includes property contains a newline after
urlnormalizer-(basic|pass|regex), which breaks pattern matching in
PluginRepository.java.

 <property>
    <name>plugin.includes</name>
 <value>protocol-http|protocol-httpclient|urlfilter-regex|parse-(text|html|js|tika)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(basic|pass|regex)
</value>
  </property>

if i remove the newline before </value>, it is ok.

regards
reinhard
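
The effect described above can be reproduced with plain java.util.regex. This
is only a sketch: it assumes that the plugin.includes value is compiled into a
Pattern as-is and that each plugin id must match it in full, which I have not
checked against PluginRepository.java. The class and method names are
hypothetical, and the property value is shortened.

```java
import java.util.regex.Pattern;

public class PluginIncludesNewline {
    // Hypothetical helper: true if the whole plugin id matches the
    // plugin.includes expression.
    static boolean isIncluded(String pluginIncludes, String pluginId) {
        return Pattern.compile(pluginIncludes).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        // Shortened plugin.includes value with the stray newline before </value>.
        String withNewline = "urlfilter-regex|urlnormalizer-(basic|pass|regex)\n";

        // The newline becomes part of the last alternative only, so earlier
        // alternatives still match while the last one never does.
        System.out.println(isIncluded(withNewline, "urlfilter-regex"));            // prints true
        System.out.println(isIncluded(withNewline, "urlnormalizer-regex"));        // prints false

        // Trimming the value (or removing the newline in the XML) fixes it.
        System.out.println(isIncluded(withNewline.trim(), "urlnormalizer-regex")); // prints true
    }
}
```

This would also explain why the log lists the other plugins but no URL
normalizer: only the last alternative in the expression is affected.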

Hannes Carl Meyer wrote:
> Just tried it in nutch-1.0 with the same kind of behavior:
>
> hc.meyer@server01:~/nutch-1.0> ./bin/nutch plugin urlnormalizer-regex
> org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer
> http://www.myinputurl.com
> Plugin 'urlnormalizer-regex' not present or inactive.
>
> (it is present and it is active through the plugin.includes property in
> nutch-site.xml)
>


Re: Question on normalizing urls / RegexURLNormalizer

Posted by Hannes Carl Meyer <ha...@googlemail.com>.
I just tried it with Nutch 1.0 and see the same behavior:

hc.meyer@server01:~/nutch-1.0> ./bin/nutch plugin urlnormalizer-regex \
    org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer \
    http://www.myinputurl.com
Plugin 'urlnormalizer-regex' not present or inactive.

(The plugin is present, and it is activated through the plugin.includes
property in nutch-site.xml.)
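
As a point of reference, here is a minimal sketch of a plugin.includes
property that enables the regex normalizer. The plugin list is an assumption
modelled on the plugin ids that appear in the log elsewhere in this thread,
not the configuration actually used here; the relevant part is the
urlnormalizer-(pass|regex|basic) alternative at the end. The property goes
inside the <configuration> element of conf/nutch-site.xml.

```xml
<!-- Sketch only: the plugin list is an assumption, adjust it to your setup. -->
<property>
  <name>plugin.includes</name>
  <!-- The value is a regular expression matched against plugin ids;
       urlnormalizer-(pass|regex|basic) covers urlnormalizer-regex. -->
  <value>protocol-http|urlfilter-regex|parse-(text|html|js|tika)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```

If the "Registered Plugins" list in the log still shows no URL normalizer
after a change like this, the configuration being read at runtime is probably
not the file that was edited.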

On Thu, Jun 24, 2010 at 5:45 PM, Julien Nioche <
lists.digitalpebble@gmail.com> wrote:

> the clue might be in : /~/apache-nutch-1.1-bin/plugins
> regenerate the job then delete this directory. Check where it gets the
> plugins from in the log file

Re: Question on normalizing urls / RegexURLNormalizer

Posted by Hannes Carl Meyer <ha...@googlemail.com>.
No, that changes nothing. I just checked my log file:

2010-06-24 17:13:40,410 INFO  plugin.PluginRepository - Plugins: looking in: /~/apache-nutch-1.1-bin/plugins
2010-06-24 17:13:41,439 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
2010-06-24 17:13:41,439 INFO  plugin.PluginRepository - Registered Plugins:
2010-06-24 17:13:41,439 INFO  plugin.PluginRepository -         the nutch core extension points (nutch-extensionpoints)
2010-06-24 17:13:41,439 INFO  plugin.PluginRepository -         Basic Query Filter (query-basic)
2010-06-24 17:13:41,439 INFO  plugin.PluginRepository -         Basic Indexing Filter (index-basic)
2010-06-24 17:13:41,439 INFO  plugin.PluginRepository -         Html Parse Plug-in (parse-html)
2010-06-24 17:13:41,439 INFO  plugin.PluginRepository -         Site Query Filter (query-site)
2010-06-24 17:13:41,439 INFO  plugin.PluginRepository -         Http / Https Protocol Plug-in (protocol-httpclient)
2010-06-24 17:13:41,439 INFO  plugin.PluginRepository -         Basic Summarizer Plug-in (summary-basic)
2010-06-24 17:13:41,439 INFO  plugin.PluginRepository -         HTTP Framework (lib-http)
2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         Text Parse Plug-in (parse-text)
2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         Regex URL Filter (urlfilter-regex)
2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         Http Protocol Plug-in (protocol-http)
2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         XML Response Writer Plug-in (response-xml)
2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         OPIC Scoring Plug-in (scoring-opic)
2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         Tika Parser Plug-in (parse-tika)
2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         CyberNeko HTML Parser (lib-nekohtml)
2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         Anchor Indexing Filter (index-anchor)
2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         JavaScript Parser (parse-js)
2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         URL Query Filter (query-url)
2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         Regex URL Filter Framework (lib-regex-filter)
2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         JSON Response Writer Plug-in (response-json)
2010-06-24 17:13:41,450 INFO  plugin.PluginRepository - Registered Extension-Points:
2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         Nutch Protocol (org.apache.nutch.protocol.Protocol)
2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2010-06-24 17:13:41,451 INFO  plugin.PluginRepository -         Nutch Field Filter (org.apache.nutch.indexer.field.FieldFilter)
2010-06-24 17:13:41,451 INFO  plugin.PluginRepository -         HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2010-06-24 17:13:41,451 INFO  plugin.PluginRepository -         Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
2010-06-24 17:13:41,451 INFO  plugin.PluginRepository -         Nutch Search Results Response Writer (org.apache.nutch.searcher.response.ResponseWriter)
2010-06-24 17:13:41,451 INFO  plugin.PluginRepository -         Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2010-06-24 17:13:41,451 INFO  plugin.PluginRepository -         Nutch URL Filter (org.apache.nutch.net.URLFilter)
2010-06-24 17:13:41,451 INFO  plugin.PluginRepository -         Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2010-06-24 17:13:41,451 INFO  plugin.PluginRepository -         Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2010-06-24 17:13:41,451 INFO  plugin.PluginRepository -         Nutch Content Parser (org.apache.nutch.parse.Parser)
2010-06-24 17:13:41,451 INFO  plugin.PluginRepository -         Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2010-06-24 17:13:41,451 INFO  plugin.PluginRepository -         Ontology Model Loader (org.apache.nutch.ontology.Ontology)

The RegexURLNormalizer is not being loaded: no urlnormalizer plugin appears
under "Registered Plugins".

On Thu, Jun 24, 2010 at 4:37 PM, Julien Nioche <
lists.digitalpebble@gmail.com> wrote:

> OK. Since you are in distributed mode it should use the content of the job
> file. Try deleting ./build/plugins to see if this changes anything



-- 

https://www.xing.com/profile/HannesCarl_Meyer
http://de.linkedin.com/in/hannescarlmeyer
http://twitter.com/hannescarlmeyer

Re: Question on normalizing urls / RegexURLNormalizer

Posted by Hannes Carl Meyer <ha...@googlemail.com>.
Yep, I did, and it did not work, although the crawl process displays
"URL normalizing: true".
Also, bin/nutch plugin ... does not work!

On Thu, Jun 24, 2010 at 3:06 PM, Julien Nioche <
lists.digitalpebble@gmail.com> wrote:

> tried ant clean job?
>
>
>> I'm using Nutch 1.1 and starting an intranet crawl (jobtracker is local).
>>
>> When executing bin/nutch plugin ... I'm getting a "Plugin
>> 'urlnormalizer-regex' not present or inactive.". conf/nutch-site.xml
>> contains the property plugin.includes including urlnormalizer-regex.
>>
>> Starting the RegexURLNormalizer from within Eclipse is fine and it is
>> doing its job.
>>
>> Regards
>>
>> Hannes



-- 

https://www.xing.com/profile/HannesCarl_Meyer
http://de.linkedin.com/in/hannescarlmeyer
http://twitter.com/hannescarlmeyer

Re: Question on normalizing urls / RegexURLNormalizer

Posted by Julien Nioche <li...@gmail.com>.
Hi,

Have you tried using:

./nutch plugin urlnormalizer-regex \
    org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer \
    http://www.myinputurl.com

That should help to find where the problem is coming from.

Are you running in distributed mode? Did you generate a new job file?

J.
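
For reference, a sketch of how a rule like the one quoted below sits in
regex-normalize.xml: every <regex> element belongs inside a single
<regex-normalize> root element. The pattern shown is a simplified,
hypothetical one that targets only IFLBSERVERID; it is not the rule from the
original mail. Note that & has to be written as &amp; inside the XML file.

```xml
<?xml version="1.0"?>
<!-- Sketch only: a simplified rule, assumed for illustration. -->
<regex-normalize>
  <regex>
    <!-- Match "?IFLBSERVERID=value" or "&IFLBSERVERID=value",
         where the value runs up to the next "&" or "#". -->
    <pattern>(\?|&amp;)IFLBSERVERID=[^&amp;#]*</pattern>
    <!-- Keep only the leading separator captured in group 1. -->
    <substitution>$1</substitution>
  </regex>
</regex-normalize>
```

A rule like this leaves the separator behind, so a further rule may be wanted
to tidy up a dangling "?" or "&".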

On 24 June 2010 11:18, Hannes Carl Meyer <ha...@googlemail.com> wrote:

> Hi,
>
> I'm trying to strip a parameter from URLs using the RegexURLNormalizer. I
> added this to my nutch-site.xml:
>
>    <property>
>        <name>urlnormalizer.order</name>
>
> <value>org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer</value>
>    </property>
>
>    <property>
>        <name>urlnormalizer.regex.file</name>
>        <value>regex-normalize.xml</value>
>    </property>
>
> And defined this expression rule:
>
> <regex>
>
>
> <pattern>(\?|&amp;)([;_]?((?i)l|j|bv_|ps_)?((?i)s|sid|IFLBSERVERID)=.*?)(\?|&amp;|#|$)</pattern>
>  <substitution>$1$5</substitution>
> </regex>
>
> (to strip the parameter IFLBSERVERID from the URL)
>
> The indexed documents are still containing the parameter and imho the
> RegexURLNormalizer does not work. Is it something with:
> https://issues.apache.org/jira/browse/NUTCH-706 ?
>
> Thanks and regards
>
> Hannes
>
> --
>
> https://www.xing.com/profile/HannesCarl_Meyer
> http://de.linkedin.com/in/hannescarlmeyer
> http://twitter.com/hannescarlmeyer
>



-- 
DigitalPebble Ltd

Open Source Solutions for Text Engineering
http://www.digitalpebble.com