Posted to user@nutch.apache.org by ".: Abhishek :." <ab...@gmail.com> on 2011/02/02 03:49:24 UTC

Custom HtmlParseFilter configurations

Hi all,

 I am writing a custom filter that implements HtmlParseFilter, and I am
using the ParserChecker to test it.

 From some System.out.println statements I added to the HtmlParseFilters
class, I can see that by default only org.apache.nutch.parse.js.JSParseFilter
is being used. If I want to use my custom filter, do I need to add some
configuration anywhere?

 Also note that when I add the following lines to nutch-site.xml,

<property>
  <name>plugin.includes</name>
  <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
  <description>Regular expression naming plugin id names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>

 I don't even see JSParseFilter being applied. The package that contains my
custom filter does not have any plugin configuration XML files. Do I have to
add some, or configure it elsewhere? I am using Nutch 1.2.

 I see my knowledge of Nutch growing considerably, thanks to all of you.

Cheers,
Abi

Re: Custom HtmlParseFilter configurations

Posted by ".: Abhishek :." <ab...@gmail.com>.
Hi Mike,

 Got it! Thanks. I had overlooked the detail that the filters listed there
were all HtmlParseFilter implementations.

Regards,
Abi


On Wed, Feb 2, 2011 at 11:07 PM, Mike Baranczak <mb...@gmail.com> wrote:

> HtmlParseFilter is only one type of plugin; there are several other types.
> In the configuration you have, it looks like JSParseFilter and
> TestPluginFilter are the only plugins that implement HtmlParseFilter, so the
> results make sense.
>
> -MB

Re: Custom HtmlParseFilter configurations

Posted by Mike Baranczak <mb...@gmail.com>.
HtmlParseFilter is only one type of plugin; there are several other types. In the configuration you have, it looks like JSParseFilter and TestPluginFilter are the only plugins that implement HtmlParseFilter, so the results make sense.

-MB
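
For anyone reading this thread later, here is a rough sketch of the point above. It is not Nutch code: it uses only the JDK and a made-up table of which plugin contributes to which extension point. Only the parse-js and test-plugin rows are backed by the output quoted in this thread; the other rows and the class name ExtensionPointDemo are assumptions for illustration. It keeps the plugin ids that match plugin.includes and are also registered for the HtmlParseFilter point.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

class ExtensionPointDemo {
    static final String HTML_PARSE_FILTER = "org.apache.nutch.parse.HtmlParseFilter";
    static final String PARSER = "org.apache.nutch.parse.Parser";

    // plugin.includes value quoted in the thread.
    static final String INCLUDES =
        "nutch-extensionpoints|protocol-http|urlfilter-regex|"
        + "parse-(text|html|js)|index-basic|query-(basic|site|url)|test-plugin";

    // Illustrative table: plugin id -> extension points it contributes to.
    // Assumption: rows other than parse-js and test-plugin are guesses.
    static Map<String, List<String>> samplePlugins() {
        Map<String, List<String>> plugins = new LinkedHashMap<>();
        plugins.put("protocol-http", Arrays.asList("org.apache.nutch.protocol.Protocol"));
        plugins.put("urlfilter-regex", Arrays.asList("org.apache.nutch.net.URLFilter"));
        plugins.put("parse-text", Arrays.asList(PARSER));
        plugins.put("parse-html", Arrays.asList(PARSER));
        plugins.put("parse-js", Arrays.asList(PARSER, HTML_PARSE_FILTER));
        plugins.put("index-basic", Arrays.asList("org.apache.nutch.indexer.IndexingFilter"));
        plugins.put("test-plugin", Arrays.asList(HTML_PARSE_FILTER));
        return plugins;
    }

    // Keep plugins whose id matches the includes expression AND that
    // contribute to the requested extension point.
    static List<String> pluginsFor(String includes, String point,
                                   Map<String, List<String>> plugins) {
        Pattern pattern = Pattern.compile(includes);
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : plugins.entrySet()) {
            boolean included = pattern.matcher(entry.getKey()).matches();
            if (included && entry.getValue().contains(point)) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [parse-js, test-plugin]
        System.out.println(pluginsFor(INCLUDES, HTML_PARSE_FILTER, samplePlugins()));
    }
}
```

The other ids in plugin.includes are still enabled; they just belong to different extension points, so they never show up in the HtmlParseFilter list.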


On Feb 2, 2011, at 12:09 AM, .: Abhishek :. wrote:

> Hi Mike et al.,
> 
> Yes, adding the plugin.xml made it work.
> 
> However, the outstanding question is this: even though my plugin.includes
> lists a lot of plugin names, why do I see only JSParseFilter and my own
> custom filter in HtmlParseFilters?
> 
> The following is my plugin.includes value:
> <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|test-plugin</value>
> 
> Here test-plugin is my custom plugin. When I add the following lines,
> 
> for (HtmlParseFilter filter : htmlParseFilters) {
>     System.out.println("Filter Name : " + filter.getClass().getName());
> }
> 
> below the last line of the constructor that takes the conf parameter, i.e.
> 
> this.htmlParseFilters = (HtmlParseFilter[]) objectCache.getObject(HtmlParseFilter.class.getName());
> 
> in HtmlParseFilters, I see only:
> 
> Filter Name : org.apache.nutch.parse.js.JSParseFilter
> Filter Name : com.test.nutch.TestPluginFilter
> 
> I am just wondering why this is. Shouldn't I be seeing all the plugins
> listed in the value tag of plugin.includes?


Re: Custom HtmlParseFilter configurations

Posted by ".: Abhishek :." <ab...@gmail.com>.
Hi Mike et al.,

 Yes, adding the plugin.xml made it work.

 However, the outstanding question is this: even though my plugin.includes
lists a lot of plugin names, why do I see only JSParseFilter and my own
custom filter in HtmlParseFilters?

 The following is my plugin.includes value:
<value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|test-plugin</value>

 Here test-plugin is my custom plugin. When I add the following lines,

for (HtmlParseFilter filter : htmlParseFilters) {
    System.out.println("Filter Name : " + filter.getClass().getName());
}

 below the last line of the constructor that takes the conf parameter, i.e.

this.htmlParseFilters = (HtmlParseFilter[]) objectCache.getObject(HtmlParseFilter.class.getName());

in HtmlParseFilters, I see only:

Filter Name : org.apache.nutch.parse.js.JSParseFilter
Filter Name : com.test.nutch.TestPluginFilter

 I am just wondering why this is. Shouldn't I be seeing all the plugins
listed in the value tag of plugin.includes?




On Wed, Feb 2, 2011 at 11:29 AM, Mike Baranczak <mb...@gmail.com> wrote:

> Yes, you do have to make a config file for your plugin to be seen by Nutch.
>
> If you built Nutch from source, you should have the directory
> build/plugins. That's where the compiled plugins are. The names of the
> directories under there are the names that get included in
> 'plugin.includes'. Take a look at the existing plugin.xml files; you should
> be able to figure it out by example.
>
> The standard way to package the plugin code is to put it in a jar in the
> corresponding plugin directory. This ensures that it won't get loaded if
> it's not used. (This is optional: if you KNOW that it's gonna get used every
> time, you can put your code anywhere on the classpath.)
>
> Note that I'm using 1.1 - I can't guarantee that this information is still
> current.
>
> -MB

Re: Custom HtmlParseFilter configurations

Posted by Mike Baranczak <mb...@gmail.com>.
Yes, you do have to make a config file for your plugin to be seen by Nutch. 

If you built Nutch from source, you should have the directory build/plugins. That's where the compiled plugins are. The names of the directories under there are the names that get included in 'plugin.includes'. Take a look at the existing plugin.xml files; you should be able to figure it out by example.

The standard way to package the plugin code is to put it in a jar in the corresponding plugin directory. This ensures that it won't get loaded if it's not used. (This is optional: if you KNOW that it's gonna get used every time, you can put your code anywhere on the classpath.)

Note that I'm using 1.1 - I can't guarantee that this information is still current.

-MB
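
A quick way to check whether a given plugin directory name would be picked up by a plugin.includes value is to try the expression against the id with plain java.util.regex. This is only a sketch under one assumption: that a plugin is included when its whole id matches the expression. The class name PluginIncludesDemo is made up for the example; the two values are the ones quoted in this thread.

```java
import java.util.regex.Pattern;

class PluginIncludesDemo {
    // Value from the first message: no parse-js and no test-plugin.
    static final String ORIGINAL =
        "nutch-extensionpoints|protocol-http|urlfilter-regex|"
        + "parse-(text|html)|index-basic|query-(basic|site|url)";

    // Value from the follow-up message, with js and test-plugin added.
    static final String UPDATED =
        "nutch-extensionpoints|protocol-http|urlfilter-regex|"
        + "parse-(text|html|js)|index-basic|query-(basic|site|url)|test-plugin";

    // Assumption: the id must match the expression in full, not just contain a match.
    static boolean isIncluded(String includes, String pluginId) {
        return Pattern.compile(includes).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        String[] ids = {"parse-html", "parse-js", "test-plugin"};
        for (String id : ids) {
            System.out.println(id
                + ": original=" + isIncluded(ORIGINAL, id)
                + ", updated=" + isIncluded(UPDATED, id));
        }
    }
}
```

Under that assumption, parse-js does not match the first value at all, which would be consistent with JSParseFilter disappearing once that value is put into nutch-site.xml.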



On Feb 1, 2011, at 9:49 PM, .: Abhishek :. wrote:

> Hi all,
> 
> I am writing a custom filter that implements HtmlParseFilter, and I am
> using the ParserChecker to test it.
> 
> From some System.out.println statements I added to the HtmlParseFilters
> class, I can see that by default only org.apache.nutch.parse.js.JSParseFilter
> is being used. If I want to use my custom filter, do I need to add some
> configuration anywhere?
> 
> Also note that when I add the following lines to nutch-site.xml,
> 
> <property>
>   <name>plugin.includes</name>
>   <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
>   <description>Regular expression naming plugin id names to
>   include.  Any plugin not matching this expression is excluded.
>   In any case you need at least include the nutch-extensionpoints plugin. By
>   default Nutch includes crawling just HTML and plain text via HTTP,
>   and basic indexing and search plugins.
>   </description>
> </property>
> 
> I don't even see JSParseFilter being applied. The package that contains my
> custom filter does not have any plugin configuration XML files. Do I have to
> add some, or configure it elsewhere? I am using Nutch 1.2.
> 
> I see my knowledge of Nutch growing considerably, thanks to all of you.
> 
> Cheers,
> Abi