Posted to dev@nutch.apache.org by zhao xiuwen <re...@gmail.com> on 2007/10/31 09:19:14 UTC
How to extract specified information from html?
Hi,
I have read http://wiki.apache.org/nutch/WritingPluginExample, but I
don't understand it clearly.
I need to extract specific information from specific web sites in Nutch.
First, I define a set of URLs.
Second, I check whether the current page's URL is contained in that set.
Last, I extract the information with a regular expression and save it.
For example, a.html:
<span class="title">behavioral<font color=red>disease</font>(N76.8)
</span>
Extraction result: DiseaseName: behavioral disease,ID=N76.8
How should I do this?
Thanks a lot.
Re: How to extract specified information from html?
Posted by Adam Lofts <ad...@gmail.com>.
Hi,
On 31/10/2007, zhao xiuwen <re...@gmail.com> wrote:
>
> Should I implement HtmlParseFilter?
Yes.
> If so, how do I invoke my method in
> filter() of HtmlParseFilter?
Load your plugin in the Nutch config and filter() will be called for every
HTML file that you crawl.
Best,
Adam
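Since filter() runs for every crawled HTML page, the plugin itself has to decide whether the current page belongs to the URL set from the original question. A minimal sketch of that check follows. TargetUrlSet is a hypothetical helper, not a Nutch class, and matching by URL prefix is an assumption, since the thread does not say how the set is defined. Inside filter(), the page URL would presumably come from the Content argument (content.getUrl()).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not a Nutch class): holds the URL set from step 1
// of the question and answers the membership check from step 2.
final class TargetUrlSet {
    private final List<String> prefixes;

    TargetUrlSet(List<String> prefixes) {
        // Defensive copy so later changes to the caller's list have no effect.
        this.prefixes = new ArrayList<>(prefixes);
    }

    // True if the page URL starts with any configured prefix (an assumed
    // matching rule; exact-match or host-based rules would work similarly).
    boolean contains(String url) {
        if (url == null) {
            return false;
        }
        for (String prefix : prefixes) {
            if (url.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // example.org is a placeholder site, not taken from the thread.
        TargetUrlSet targets = new TargetUrlSet(
                Arrays.asList("http://example.org/diseases/"));
        System.out.println(targets.contains("http://example.org/diseases/a.html")); // true
        System.out.println(targets.contains("http://example.org/about.html"));      // false
    }
}
```

A filter() implementation could return its input unchanged whenever contains(...) is false, so pages outside the set pass through untouched.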
Re: How to extract specified information from html?
Posted by jqq <re...@gmail.com>.
Thanks.
2007/11/3, qi wu <ch...@gmail.com>:
>
> Take a look at HtmlParser.java in the parse-html plugin. You can
> develop your own HtmlParser by modifying the implementation of this function:
>
> public Parse getParse(Content content) {
>   // Step 1: get the HTML source code from content.
>   String htmlCode = content.toString();
>
>   // Step 2: scan the HTML source with a regular expression to
>   // find the structured data you want.
>
>   // Step 3: store the extracted data in a database or anywhere else.
> }
Re: How to extract specified information from html?
Posted by qi wu <ch...@gmail.com>.
Take a look at HtmlParser.java in the parse-html plugin. You can develop your own HtmlParser by modifying the implementation of this function:
public Parse getParse(Content content) {
  // Step 1: get the HTML source code from content.
  String htmlCode = content.toString();
  // Step 2: scan the HTML source with a regular expression to find the structured data you want.
  // Step 3: store the extracted data in a database or anywhere else.
}
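Step 2 can be sketched in plain Java for the a.html example from the original question. This is an illustration under assumptions, not code from the parse-html plugin: DiseaseExtractor and Disease are hypothetical names, the two regular expressions are guessed from the single sample line, and the page is assumed to be available as a String already (when decoding the raw bytes, for instance from content.getContent(), the character set has to be chosen by the caller).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical extractor for the <span class="title"> example in this thread.
final class DiseaseExtractor {

    // Simple value holder for the two extracted fields.
    static final class Disease {
        final String name;
        final String id;

        Disease(String name, String id) {
            this.name = name;
            this.id = id;
        }
    }

    // Captures everything inside the first <span class="title">...</span>.
    private static final Pattern TITLE_SPAN = Pattern.compile(
            "<span\\s+class=\"title\">(.*?)</span>",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Splits the span body into "name markup" and the trailing "(ID)".
    private static final Pattern NAME_AND_ID = Pattern.compile(
            "^(.*)\\(([^()]+)\\)\\s*$", Pattern.DOTALL);

    // Returns the extracted record, or null if the page has no matching span.
    static Disease extract(String html) {
        Matcher span = TITLE_SPAN.matcher(html);
        if (!span.find()) {
            return null;
        }
        Matcher parts = NAME_AND_ID.matcher(span.group(1));
        if (!parts.matches()) {
            return null;
        }
        // Replace nested tags such as <font> with spaces, then collapse whitespace.
        String name = parts.group(1)
                .replaceAll("<[^>]*>", " ")
                .replaceAll("\\s+", " ")
                .trim();
        return new Disease(name, parts.group(2).trim());
    }

    public static void main(String[] args) {
        String html = "<span class=\"title\">behavioral"
                + "<font color=red>disease</font>(N76.8)\n</span>";
        Disease d = extract(html);
        // prints: DiseaseName: behavioral disease,ID=N76.8
        System.out.println("DiseaseName: " + d.name + ",ID=" + d.id);
    }
}
```

Regular expressions are brittle against markup changes, so a pattern like this is only reasonable for a small, known set of sites. Where the result is stored (step 3) is left open here.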
Re: How to extract specified information from html?
Posted by zhao xiuwen <re...@gmail.com>.
Should I implement HtmlParseFilter? If so, how do I invoke my method in
filter() of HtmlParseFilter?
Thanks.