Posted to user@nutch.apache.org by Hannes Carl Meyer <ha...@googlemail.com> on 2010/06/24 12:18:29 UTC
Question on normalizing urls / RegexURLNormalizer
Hi,
I'm trying to strip a parameter from URLs using the RegexURLNormalizer. I
added this to my nutch-site.xml:
<property>
  <name>urlnormalizer.order</name>
  <value>org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer</value>
</property>
<property>
  <name>urlnormalizer.regex.file</name>
  <value>regex-normalize.xml</value>
</property>
And defined this expression rule:
<regex>
  <pattern>(\?|&)([;_]?((?i)l|j|bv_|ps_)?((?i)s|sid|IFLBSERVERID)=.*?)(\?|&|#|$)</pattern>
  <substitution>$1$5</substitution>
</regex>
(to strip the parameter IFLBSERVERID from the URL)
The indexed documents still contain the parameter, so IMHO the
RegexURLNormalizer is not being applied. Is it related to:
https://issues.apache.org/jira/browse/NUTCH-706 ?
Thanks and regards
Hannes
--
https://www.xing.com/profile/HannesCarl_Meyer
http://de.linkedin.com/in/hannescarlmeyer
http://twitter.com/hannescarlmeyer
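
The rule from the question can be exercised outside Nutch with plain java.util.regex. The following is only a sketch: it assumes the normalizer applies each rule as a global regex replace, and the class name, method name and example.com URLs are made up for illustration. Note also that inside an XML file such as regex-normalize.xml a literal & normally has to be written as &amp;.

```java
import java.util.regex.Pattern;

public class StripParamRuleDemo {
    // Pattern and substitution copied from the rule in the question.
    static final Pattern RULE = Pattern.compile(
            "(\\?|&)([;_]?((?i)l|j|bv_|ps_)?((?i)s|sid|IFLBSERVERID)=.*?)(\\?|&|#|$)");
    static final String SUBSTITUTION = "$1$5";

    // Hypothetical helper: applies the rule as a global replace.
    static String apply(String url) {
        return RULE.matcher(url).replaceAll(SUBSTITUTION);
    }

    public static void main(String[] args) {
        // Parameter first: the leading '?' and the trailing '&' are both kept.
        System.out.println(apply("http://example.com/page?IFLBSERVERID=abc&x=1"));
        // prints http://example.com/page?&x=1

        // Parameter last: only the leading '&' is kept.
        System.out.println(apply("http://example.com/page?x=1&IFLBSERVERID=abc"));
        // prints http://example.com/page?x=1&
    }
}
```

So the rule itself does strip the parameter, but it leaves "?&" or a trailing "&" behind, which other rules would have to clean up. This matches the observation further down the thread that the normalizer works when run from Eclipse.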
Re: Question on normalizing urls / RegexURLNormalizer
Posted by Hannes Carl Meyer <ha...@googlemail.com>.
Julien, thanks for all your help, but:
if I delete ./plugins AND ./build/plugins, it tries to load them from
the nutch-1.1.job and fails.
Maybe I'm just using a broken nutch-1.1 build; going to check on 1.0
now...
On Thu, Jun 24, 2010 at 5:45 PM, Julien Nioche <
lists.digitalpebble@gmail.com> wrote:
> the clue might be in : /~/apache-nutch-1.1-bin/plugins
> regenerate the job then delete this directory. Check where it gets the
> plugins from in the log file
>
>
> On 24 June 2010 16:11, Hannes Carl Meyer <ha...@googlemail.com> wrote:
>
>> Nope, that changes nothing. Just checked out my log file:
>>
>> 2010-06-24 17:13:40,410 INFO plugin.PluginRepository - Plugins: looking in: /~/apache-nutch-1.1-bin/plugins
>> 2010-06-24 17:13:41,439 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
>> 2010-06-24 17:13:41,439 INFO plugin.PluginRepository - Registered Plugins:
>> 2010-06-24 17:13:41,439 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
>> 2010-06-24 17:13:41,439 INFO plugin.PluginRepository - Basic Query Filter (query-basic)
>> 2010-06-24 17:13:41,439 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic)
>> 2010-06-24 17:13:41,439 INFO plugin.PluginRepository - Html Parse Plug-in (parse-html)
>> 2010-06-24 17:13:41,439 INFO plugin.PluginRepository - Site Query Filter (query-site)
>> 2010-06-24 17:13:41,439 INFO plugin.PluginRepository - Http / Https Protocol Plug-in (protocol-httpclient)
>> 2010-06-24 17:13:41,439 INFO plugin.PluginRepository - Basic Summarizer Plug-in (summary-basic)
>> 2010-06-24 17:13:41,439 INFO plugin.PluginRepository - HTTP Framework (lib-http)
>> 2010-06-24 17:13:41,450 INFO plugin.PluginRepository - Text Parse Plug-in (parse-text)
>> 2010-06-24 17:13:41,450 INFO plugin.PluginRepository - Regex URL Filter (urlfilter-regex)
>> 2010-06-24 17:13:41,450 INFO plugin.PluginRepository - Http Protocol Plug-in (protocol-http)
>> 2010-06-24 17:13:41,450 INFO plugin.PluginRepository - XML Response Writer Plug-in (response-xml)
>> 2010-06-24 17:13:41,450 INFO plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic)
>> 2010-06-24 17:13:41,450 INFO plugin.PluginRepository - Tika Parser Plug-in (parse-tika)
>> 2010-06-24 17:13:41,450 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
>> 2010-06-24 17:13:41,450 INFO plugin.PluginRepository - Anchor Indexing Filter (index-anchor)
>> 2010-06-24 17:13:41,450 INFO plugin.PluginRepository - JavaScript Parser (parse-js)
>> 2010-06-24 17:13:41,450 INFO plugin.PluginRepository - URL Query Filter (query-url)
>> 2010-06-24 17:13:41,450 INFO plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter)
>> 2010-06-24 17:13:41,450 INFO plugin.PluginRepository - JSON Response Writer Plug-in (response-json)
>> 2010-06-24 17:13:41,450 INFO plugin.PluginRepository - Registered Extension-Points:
>> 2010-06-24 17:13:41,450 INFO plugin.PluginRepository - Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
>> 2010-06-24 17:13:41,450 INFO plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
>> 2010-06-24 17:13:41,450 INFO plugin.PluginRepository - Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
>> 2010-06-24 17:13:41,451 INFO plugin.PluginRepository - Nutch Field Filter (org.apache.nutch.indexer.field.FieldFilter)
>> 2010-06-24 17:13:41,451 INFO plugin.PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
>> 2010-06-24 17:13:41,451 INFO plugin.PluginRepository - Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
>> 2010-06-24 17:13:41,451 INFO plugin.PluginRepository - Nutch Search Results Response Writer (org.apache.nutch.searcher.response.ResponseWriter)
>> 2010-06-24 17:13:41,451 INFO plugin.PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
>> 2010-06-24 17:13:41,451 INFO plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
>> 2010-06-24 17:13:41,451 INFO plugin.PluginRepository - Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
>> 2010-06-24 17:13:41,451 INFO plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
>> 2010-06-24 17:13:41,451 INFO plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)
>> 2010-06-24 17:13:41,451 INFO plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
>> 2010-06-24 17:13:41,451 INFO plugin.PluginRepository - Ontology Model Loader (org.apache.nutch.ontology.Ontology)
>>
>> No RegexURLNormalizer (urlnormalizer-regex) is being loaded...
>>
>>
>> On Thu, Jun 24, 2010 at 4:37 PM, Julien Nioche <
>> lists.digitalpebble@gmail.com> wrote:
>>
>>> OK. Since you are in distributed mode it should use the content of the
>>> job file. Try deleting ./build/plugins to see if this changes anything
>>>
>>>
>>> On 24 June 2010 15:30, Hannes Carl Meyer <ha...@googlemail.com> wrote:
>>>
>>>> Yep, did not work, although it displays "URL normalizing: true" in the
>>>> crawl process...
>>>> Also, bin/nutch plugin ... does not work!
>>>>
>>>>
>>>> On Thu, Jun 24, 2010 at 3:06 PM, Julien Nioche <
>>>> lists.digitalpebble@gmail.com> wrote:
>>>>
>>>>> tried ant clean job?
>>>>>
>>>>>
>>>>>> I'm using Nutch 1.1 and starting an intranet crawl (jobtracker is
>>>>>> local).
>>>>>>
>>>>>> When executing bin/nutch plugin ... I'm getting a "Plugin
>>>>>> 'urlnormalizer-regex' not present or inactive.". conf/nutch-site.xml
>>>>>> contains the property plugin.includes including urlnormalizer-regex.
>>>>>>
>>>>>> Starting the RegexURLNormalizer from within Eclipse is fine and it is
>>>>>> doing its job.
>>>>>>
>>>>>> Regards
>>>>>>
>>>>>> Hannes
>>>>>>
>>>>>>
>>>>>> On Thu, Jun 24, 2010 at 12:46 PM, Julien Nioche <
>>>>>> lists.digitalpebble@gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Have you tried using:
>>>>>>> ./nutch plugin urlnormalizer-regex org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer http://www.myinputurl.com
>>>>>>> That should help find where the problem is coming from.
>>>>>>>
>>>>>>> Are you running in distributed mode? Did you generate a new job file?
>>>>>>>
>>>>>>> J.
--
https://www.xing.com/profile/HannesCarl_Meyer
http://de.linkedin.com/in/hannescarlmeyer
http://twitter.com/hannescarlmeyer
Re: Question on normalizing urls / RegexURLNormalizer
Posted by Hannes Carl Meyer <ha...@googlemail.com>.
Awesome... Thank you very very much :-)
On Thu, Jun 24, 2010 at 6:55 PM, reinhard schwab <re...@aon.at> wrote:
> hi hannes,
>
> i have identified your problem.
> your nutch-site.xml plugin.includes property contains a newline after
> urlnormalizer-(basic|pass|regex), which breaks pattern matching in
> PluginRepository.java.
>
> <property>
> <name>plugin.includes</name>
>
> <value>protocol-http|protocol-httpclient|urlfilter-regex|parse-(text|html|js|tika)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(basic|pass|regex)
> </value>
> </property>
>
> if i remove the newline before </value>, it is ok.
>
> regards
> reinhard
Re: Question on normalizing urls / RegexURLNormalizer
Posted by reinhard schwab <re...@aon.at>.
hi hannes,
i have identified your problem.
your nutch-site.xml plugin.includes property contains a newline after
urlnormalizer-(basic|pass|regex), which breaks pattern matching in
PluginRepository.java.
<property>
<name>plugin.includes</name>
<value>protocol-http|protocol-httpclient|urlfilter-regex|parse-(text|html|js|tika)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(basic|pass|regex)
</value>
</property>
if i remove the newline before </value>, it is ok.
regards
reinhard
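
The effect described above can be reproduced with java.util.regex alone. This is a minimal sketch, assuming (as the message reports) that the plugin.includes value is compiled as a regular expression and each plugin id must match it in full; the class and method names are hypothetical, and the includes list is shortened.

```java
import java.util.regex.Pattern;

public class PluginIncludesDemo {
    // Hypothetical helper: full-match check of a plugin id against the
    // plugin.includes value, used here as a regular expression.
    static boolean isIncluded(String includesValue, String pluginId) {
        return Pattern.compile(includesValue).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        String clean = "protocol-http|urlfilter-regex|urlnormalizer-(basic|pass|regex)";
        // Same value, but with the newline that sat before </value>.
        String withNewline = clean + "\n";

        System.out.println(isIncluded(clean, "urlnormalizer-regex"));       // true
        // The newline becomes part of the last alternative, so the id
        // would itself have to end in a newline to match.
        System.out.println(isIncluded(withNewline, "urlnormalizer-regex")); // false
        // Earlier alternatives are unaffected, so the other plugins still load.
        System.out.println(isIncluded(withNewline, "urlfilter-regex"));     // true
        // Trimming the value restores the match.
        System.out.println(isIncluded(withNewline.trim(), "urlnormalizer-regex")); // true
    }
}
```

Only the last alternative in the list is affected, which is consistent with the log earlier in the thread: every other plugin is registered and only urlnormalizer-regex is missing.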
Re: Question on normalizing urls / RegexURLNormalizer
Posted by Hannes Carl Meyer <ha...@googlemail.com>.
Just tried it in nutch-1.0 with the same kind of behavior:
hc.meyer@server01:~/nutch-1.0> ./bin/nutch plugin urlnormalizer-regex org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer http://www.myinputurl.com
Plugin 'urlnormalizer-regex' not present or inactive.
(it is present and it is active through the plugin.includes property in
nutch-site.xml)
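
When a plugin is listed in plugin.includes but still reported as inactive, it can help to look at the raw text of the property value rather than how it appears in an editor. The sketch below uses only the JDK's DOM parser; the class, the helper names and the inline XML snippet are illustrative assumptions, not part of Nutch.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class PropertyWhitespaceCheck {

    // Returns the untrimmed text of <value> for the named property, or null if absent.
    static String rawValue(String xml, String propertyName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList props = doc.getElementsByTagName("property");
            for (int i = 0; i < props.getLength(); i++) {
                Element p = (Element) props.item(i);
                String name = p.getElementsByTagName("name").item(0).getTextContent().trim();
                if (name.equals(propertyName)) {
                    return p.getElementsByTagName("value").item(0).getTextContent();
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse configuration", e);
        }
    }

    // True if the value has leading or trailing whitespace (spaces, tabs, newlines).
    static boolean hasStrayWhitespace(String value) {
        return !value.equals(value.trim());
    }

    public static void main(String[] args) {
        // Shortened stand-in for nutch-site.xml with a newline before </value>.
        String xml = "<configuration><property><name>plugin.includes</name>"
                + "<value>urlfilter-regex|urlnormalizer-(basic|pass|regex)\n</value>"
                + "</property></configuration>";
        String value = rawValue(xml, "plugin.includes");
        System.out.println(hasStrayWhitespace(value)); // true
    }
}
```

A check like this only flags surrounding whitespace; it says nothing about whether the pattern itself is what was intended.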
On Thu, Jun 24, 2010 at 5:45 PM, Julien Nioche <
lists.digitalpebble@gmail.com> wrote:
> the clue might be in : /~/apache-nutch-1.1-bin/plugins
> regenerate the job then delete this directory. Check where it gets the
> plugins from in the log file
Re: Question on normalizing urls / RegexURLNormalizer
Posted by Hannes Carl Meyer <ha...@googlemail.com>.
Nope, that changes nothing. Just checked out my log file:
2010-06-24 17:13:40,410 INFO plugin.PluginRepository - Plugins: looking in: /~/apache-nutch-1.1-bin/plugins
2010-06-24 17:13:41,439 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
2010-06-24 17:13:41,439 INFO plugin.PluginRepository - Registered Plugins:
2010-06-24 17:13:41,439 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
2010-06-24 17:13:41,439 INFO plugin.PluginRepository - Basic Query Filter (query-basic)
2010-06-24 17:13:41,439 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic)
2010-06-24 17:13:41,439 INFO plugin.PluginRepository - Html Parse Plug-in (parse-html)
2010-06-24 17:13:41,439 INFO plugin.PluginRepository - Site Query Filter (query-site)
2010-06-24 17:13:41,439 INFO plugin.PluginRepository - Http / Https Protocol Plug-in (protocol-httpclient)
2010-06-24 17:13:41,439 INFO plugin.PluginRepository - Basic Summarizer Plug-in (summary-basic)
2010-06-24 17:13:41,439 INFO plugin.PluginRepository - HTTP Framework (lib-http)
2010-06-24 17:13:41,450 INFO plugin.PluginRepository - Text Parse Plug-in (parse-text)
2010-06-24 17:13:41,450 INFO plugin.PluginRepository - Regex URL Filter (urlfilter-regex)
2010-06-24 17:13:41,450 INFO plugin.PluginRepository - Http Protocol Plug-in (protocol-http)
2010-06-24 17:13:41,450 INFO plugin.PluginRepository - XML Response Writer Plug-in (response-xml)
2010-06-24 17:13:41,450 INFO plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic)
2010-06-24 17:13:41,450 INFO plugin.PluginRepository - Tika Parser Plug-in (parse-tika)
2010-06-24 17:13:41,450 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
2010-06-24 17:13:41,450 INFO plugin.PluginRepository - Anchor Indexing Filter (index-anchor)
2010-06-24 17:13:41,450 INFO plugin.PluginRepository - JavaScript Parser (parse-js)
2010-06-24 17:13:41,450 INFO plugin.PluginRepository - URL Query Filter (query-url)
2010-06-24 17:13:41,450 INFO plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter)
2010-06-24 17:13:41,450 INFO plugin.PluginRepository - JSON Response Writer Plug-in (response-json)
2010-06-24 17:13:41,450 INFO plugin.PluginRepository - Registered Extension-Points:
2010-06-24 17:13:41,450 INFO plugin.PluginRepository - Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
2010-06-24 17:13:41,450 INFO plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
2010-06-24 17:13:41,450 INFO plugin.PluginRepository - Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2010-06-24 17:13:41,451 INFO plugin.PluginRepository - Nutch Field Filter (org.apache.nutch.indexer.field.FieldFilter)
2010-06-24 17:13:41,451 INFO plugin.PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2010-06-24 17:13:41,451 INFO plugin.PluginRepository - Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
2010-06-24 17:13:41,451 INFO plugin.PluginRepository - Nutch Search Results Response Writer (org.apache.nutch.searcher.response.ResponseWriter)
2010-06-24 17:13:41,451 INFO plugin.PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2010-06-24 17:13:41,451 INFO plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
2010-06-24 17:13:41,451 INFO plugin.PluginRepository - Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2010-06-24 17:13:41,451 INFO plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2010-06-24 17:13:41,451 INFO plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)
2010-06-24 17:13:41,451 INFO plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2010-06-24 17:13:41,451 INFO plugin.PluginRepository - Ontology Model Loader (org.apache.nutch.ontology.Ontology)
There is no RegexURLNormalizer being loaded...
On Thu, Jun 24, 2010 at 4:37 PM, Julien Nioche <
lists.digitalpebble@gmail.com> wrote:
> OK. Since you are in distributed mode it should use the content of the job
> file. Try deleting ./build/plugins to see if this changes anything
--
https://www.xing.com/profile/HannesCarl_Meyer
http://de.linkedin.com/in/hannescarlmeyer
http://twitter.com/hannescarlmeyer
Re: Question on normalizing urls / RegexURLNormalizer
Posted by Hannes Carl Meyer <ha...@googlemail.com>.
Yep, that did not work, although the crawl process displays "URL normalizing: true"...
Also, bin/nutch plugin ... does not work!
On Thu, Jun 24, 2010 at 3:06 PM, Julien Nioche <
lists.digitalpebble@gmail.com> wrote:
> tried ant clean job?
>
>
>> I'm using Nutch 1.1 and starting an intranet crawl (jobtracker is local).
>>
>> When executing bin/nutch plugin ... I'm getting a "Plugin
>> 'urlnormalizer-regex' not present or inactive.". conf/nutch-site.xml
>> contains the property plugin.includes including urlnormalizer-regex.
>>
>> Starting the RegexURLNormalizer from within Eclipse is fine and it is
>> doing its job.
>>
>> Regards
>>
>> Hannes
>>
--
https://www.xing.com/profile/HannesCarl_Meyer
http://de.linkedin.com/in/hannescarlmeyer
http://twitter.com/hannescarlmeyer
Re: Question on normalizing urls / RegexURLNormalizer
Posted by Julien Nioche <li...@gmail.com>.
Hi,
Have you tried using:
./nutch plugin urlnormalizer-regex \
    org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer \
    http://www.myinputurl.com
That should help find where the problem is coming from.
Are you running in distributed mode? Did you generate a new job file?
J.
On 24 June 2010 11:18, Hannes Carl Meyer <ha...@googlemail.com> wrote:
> Hi,
>
> I'm trying to strip a parameter from URLs using the RegexURLNormalizer. I
> added this to my nutch-site.xml:
>
> <property>
> <name>urlnormalizer.order</name>
>
> <value>org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer</value>
> </property>
>
> <property>
> <name>urlnormalizer.regex.file</name>
> <value>regex-normalize.xml</value>
> </property>
>
> And defined this expression rule:
>
> <regex>
>
>
> <pattern>(\?|&)([;_]?((?i)l|j|bv_|ps_)?((?i)s|sid|IFLBSERVERID)=.*?)(\?|&|#|$)</pattern>
> <substitution>$1$5</substitution>
> </regex>
>
> (to strip the parameter IFLBSERVERID from the URL)
>
> The indexed documents are still containing the parameter and imho the
> RegexURLNormalizer does not work. Is it something with:
> https://issues.apache.org/jira/browse/NUTCH-706 ?
>
> Thanks and regards
>
> Hannes
>
> --
>
> https://www.xing.com/profile/HannesCarl_Meyer
> http://de.linkedin.com/in/hannescarlmeyer
> http://twitter.com/hannescarlmeyer
>
--
DigitalPebble Ltd
Open Source Solutions for Text Engineering
http://www.digitalpebble.com
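To separate "the rule is wrong" from "the plugin is not loaded", the rule from the original post can be exercised on its own. The sketch below assumes the normalizer applies the pattern with java.util.regex replaceAll semantics; the class name and the example URLs are hypothetical. Also note that inside regex-normalize.xml a literal & has to be written as &amp; for the file to be well-formed XML; the plain-text archive may simply have unescaped it.

```java
import java.util.regex.Pattern;

public class RegexRuleCheck {
    // Pattern and substitution copied from the rule quoted in this thread.
    static final Pattern RULE = Pattern.compile(
        "(\\?|&)([;_]?((?i)l|j|bv_|ps_)?((?i)s|sid|IFLBSERVERID)=.*?)(\\?|&|#|$)");
    static final String SUBSTITUTION = "$1$5";

    // Applies the rule to every match in the URL, as replaceAll does.
    static String apply(String url) {
        return RULE.matcher(url).replaceAll(SUBSTITUTION);
    }

    public static void main(String[] args) {
        // Hypothetical URLs, for illustration only.
        System.out.println(apply("http://www.example.com/page?IFLBSERVERID=abc&x=1"));
        // prints http://www.example.com/page?&x=1
        System.out.println(apply("http://www.example.com/page?x=1"));
        // prints http://www.example.com/page?x=1 (no match, unchanged)
    }
}
```

With this input the parameter is removed but a "?&" is left behind, so a further cleanup rule would be needed to get a tidy URL. If the rule behaves like this standalone yet indexed URLs still carry the parameter, that points back at plugin loading rather than at the expression.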