Posted to dev@nutch.apache.org by qi wu <ch...@gmail.com> on 2007/11/03 14:56:42 UTC
Re: How to extract specified information from html?
Take a look at HtmlParser.java in the parse-html plugin. You can develop your own HtmlParser by modifying the implementation of this method:
public Parse getParse(Content content) {
  // Step 1: Get the HTML source code from content.
  String htmlCode = content.toString();
  // Step 2: Match the HTML source against a regular expression to find the structured data you want.
  // Step 3: Store the extracted data in a database or anywhere else.
}
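A minimal sketch of Step 2 in plain Java, applied to the a.html sample from the question quoted below. It is not taken from Nutch or from the original message: the class name DiseaseExtractor, the method extract(), and both patterns are hypothetical, and it assumes the page has already been read into a String.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Nutch.
final class DiseaseExtractor {
    // Captures the body of <span class="title">...</span>.
    // DOTALL lets '.' cross the line break before </span>.
    private static final Pattern TITLE_SPAN = Pattern.compile(
        "<span\\s+class=\"title\"\\s*>(.*?)</span>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Splits "name (ID)" into the name and the ID inside the parentheses.
    private static final Pattern NAME_AND_ID =
        Pattern.compile("^(.*?)\\s*\\(([^)]+)\\)\\s*$");

    /** Returns {name, id}, or null when the page has no matching span. */
    static String[] extract(String html) {
        Matcher span = TITLE_SPAN.matcher(html);
        if (!span.find()) {
            return null;
        }
        // Replace nested tags such as <font> with a space, then collapse whitespace.
        String text = span.group(1)
            .replaceAll("<[^>]+>", " ")
            .replaceAll("\\s+", " ")
            .trim();
        Matcher m = NAME_AND_ID.matcher(text);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1).trim(), m.group(2).trim() };
    }

    public static void main(String[] args) {
        String html = "<span class=\"title\">behavioral<font color=red>disease</font>(N76.8)\n</span>";
        String[] r = extract(html);
        // Prints: DiseaseName: behavioral disease, ID=N76.8
        System.out.println("DiseaseName: " + r[0] + ", ID=" + r[1]);
    }
}
```

Regular expressions are fragile against changes in the page markup, so a sketch like this probably only suits a small, known set of pages such as the one described in the question.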
----- Original Message -----
From: "zhao xiuwen" <re...@gmail.com>
To: <nu...@lucene.apache.org>
Sent: Thursday, November 01, 2007 12:12 AM
Subject: Re: How to extract specified information from html?
> Should I implement HtmlParseFilter? If so, how do I invoke my method in
> filter() of HtmlParseFilter?
>
> Thanks.
>
>
> 2007/10/31, zhao xiuwen <re...@gmail.com>:
>>
>> Hi,
>> I have read http://wiki.apache.org/nutch/WritingPluginExample, but
>> I don't understand it clearly.
>> I need to extract specific information from a specific web site in Nutch.
>> First, I define a set of URLs.
>> Second, I check whether the current page URL is contained in that URL
>> set.
>> Last, I extract the information with a regular expression and
>> save it.
>>
>> For example, a.html:
>> <span class="title">behavioral<font color=red>disease</font>(N76.8)
>> </span>
>> Extraction result: DiseaseName: behavioral disease, ID=N76.8
>>
>> How should I do this?
>>
>> Thanks a lot.
>>
>>
>
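The first two steps in the question (define a URL set, then check whether the current page belongs to it) could be sketched as below. This is an illustration under assumptions, not Nutch code: the class UrlGate and its accepts() method are hypothetical, and the "URL set" is interpreted here as a set of host names, which is only one possible reading of the question.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Nutch.
final class UrlGate {
    // Lower-cased host names of the sites to extract from (an assumed
    // interpretation of the "URL set" in the question).
    private final Set<String> hosts = new HashSet<>();

    UrlGate(String... allowedHosts) {
        for (String h : allowedHosts) {
            hosts.add(h.toLowerCase(Locale.ROOT));
        }
    }

    /** Returns true when the URL parses and its host is in the configured set. */
    boolean accepts(String url) {
        try {
            String host = new URI(url).getHost();
            return host != null && hosts.contains(host.toLowerCase(Locale.ROOT));
        } catch (URISyntaxException e) {
            // Malformed URLs are skipped rather than treated as errors.
            return false;
        }
    }
}
```

A parser or filter would call accepts() with the page URL first and run the regular-expression extraction only when it returns true. The host www.example.com used in any trial run is a placeholder.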
Re: How to extract specified information from html?
Posted by jqq <re...@gmail.com>.
Thanks.