Posted to dev@nutch.apache.org by qi wu <ch...@gmail.com> on 2007/11/03 14:56:42 UTC

Re: How to extract specified information from html?

Take a look at HtmlParser.java in the parse-html plugin. You can develop your own HtmlParser by modifying the implementation of this function:

public Parse getParse(Content content) {
  // Step 1: Get the HTML source code from content.
  String htmlCode = content.toString();

  // Step 2: Scan the HTML source with a regular expression, match by match,
  //         to find the structured data you want.

  // Step 3: Save the extracted data to a database or anywhere else.
}
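
The URL check from the original question and Steps 2 and 3 above can be sketched in plain Java, independent of Nutch. This is only a rough illustration under assumptions: the class name DiseaseExtractor, the TARGET_URLS set, and the example.com URL are made up for the example, and the pattern is written only for the sample <span class="title"> markup quoted below. Wiring it into getParse() or a filter, and saving the result, are not shown.

```java
import java.util.Optional;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DiseaseExtractor {
    // Hypothetical URL set; replace with the pages you actually want to process.
    public static final Set<String> TARGET_URLS = Set.of("http://example.com/a.html");

    // Captures the inner markup of the title span and the trailing "(ID)".
    private static final Pattern TITLE = Pattern.compile(
            "<span\\s+class=\"title\">(.*?)\\(([^)]+)\\)\\s*</span>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Matches any HTML tag so nested markup such as <font> can be dropped.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    /**
     * Returns "DiseaseName: name,ID=id", or empty when the URL is not in
     * the set or the page does not contain the expected span.
     */
    public static Optional<String> extract(String url, String html) {
        if (!TARGET_URLS.contains(url)) {
            return Optional.empty();
        }
        Matcher m = TITLE.matcher(html);
        if (!m.find()) {
            return Optional.empty();
        }
        // Replace nested tags with a space, then collapse runs of whitespace.
        String name = TAG.matcher(m.group(1)).replaceAll(" ")
                .replaceAll("\\s+", " ")
                .trim();
        return Optional.of("DiseaseName: " + name + ",ID=" + m.group(2).trim());
    }

    public static void main(String[] args) {
        String html = "<span class=\"title\">behavioral<font color=red>disease</font>(N76.8)\n</span>";
        // Prints: DiseaseName: behavioral disease,ID=N76.8
        System.out.println(extract("http://example.com/a.html", html).orElse("no match"));
    }
}
```

Regular expressions are fragile against real-world HTML, so for anything beyond a fixed page layout it may be safer to walk the parsed DOM instead.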

----- Original Message ----- 
From: "zhao xiuwen" <re...@gmail.com>
To: <nu...@lucene.apache.org>
Sent: Thursday, November 01, 2007 12:12 AM
Subject: Re: How to extract specified information from html?


> Should I implement HtmlParseFilter? If so, how do I invoke my method in
> filter() of HtmlParseFilter?
> 
> Thanks.
> 
> 
> 2007/10/31, zhao xiuwen <re...@gmail.com>:
>>
>> Hi,
>>     I have read http://wiki.apache.org/nutch/WritingPluginExample, but
>> I don't understand it clearly.
>>     I need to extract specific information from specific web sites in Nutch.
>>    First, I define a set of URLs.
>>    Second, I check whether the current page URL is contained in that URL
>> set.
>>    Last, I extract information with a regular expression and
>> save it.
>>
>> For example, a.html:
>>    <span class="title">behavioral<font color=red>disease</font>(N76.8)
>> </span>
>>    Extraction result: DiseaseName: behavioral disease, ID=N76.8
>>
>> How should I do this?
>>
>> Thanks a lot.
>>
>>
>

Re: How to extract specified information from html?

Posted by jqq <re...@gmail.com>.

Thanks.
