
Posted to dev@nutch.apache.org by zhao xiuwen <re...@gmail.com> on 2007/10/31 09:19:14 UTC

How to extract specified information from html?

Hi,
    I have read http://wiki.apache.org/nutch/WritingPluginExample, but I
do not fully understand it.
    I need to extract specific information from specific web sites in Nutch.
    First, I define a set of URLs.
    Second, I check whether the current page URL is in that set.
    Last, I extract information with a regular expression and save it.

For example, a.html contains:
   <span class="title">behavioral<font color=red>disease</font>(N76.8)
</span>
   Extraction result: DiseaseName: behavioral disease, ID=N76.8

How should I do this?

Thanks a lot.

Re: How to extract specified information from html?

Posted by Adam Lofts <ad...@gmail.com>.

Hi,

On 31/10/2007, zhao xiuwen <re...@gmail.com> wrote:
> Should I implement HtmlParseFilter?

Yes.

> If so, how do I invoke my method in
> filter() of HtmlParseFilter?

Load your plugin in the nutch config and filter() will be called for every
html file that you crawl.
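
Because filter() runs for every crawled HTML page, the check against the URL set from the original question would have to happen inside it. Below is a minimal sketch of that check in plain Java. It is not Nutch API: the class name TargetUrls, the method isTarget and the example.com URLs are all hypothetical, and a real plugin would probably read the URL set from configuration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of Nutch: decides whether a page URL
// belongs to the set of pages we want to extract data from.
final class TargetUrls {
    // Assumed, hard-coded URL set for illustration only.
    private static final Set<String> TARGETS = Collections.unmodifiableSet(
        new HashSet<>(Arrays.asList(
            "http://example.com/a.html",
            "http://example.com/b.html")));

    // Returns true only for an exact match against the URL set.
    static boolean isTarget(String pageUrl) {
        return pageUrl != null && TARGETS.contains(pageUrl);
    }

    public static void main(String[] args) {
        System.out.println(isTarget("http://example.com/a.html"));
        System.out.println(isTarget("http://example.com/other.html"));
    }
}
```

Inside filter() one would call something like isTarget() with the URL of the current page and return the parse unchanged when it is false.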

Best,
Adam

Re: How to extract specified information from html?

Posted by jqq <re...@gmail.com>.

Thanks.

2007/11/3, qi wu <ch...@gmail.com>:
>
> Take a look at HtmlParser.java in the parse-html plugin. You can
> develop your own HtmlParser by modifying the implementation of this
> function:
>
> public Parse getParse(Content content) {
>   // Step 1: get the HTML source code from content.
>   String htmlCode = content.toString();
>
>   // Step 2: scan the HTML source code with a regular expression to
>   // find the structured data you want.
>
>   // Step 3: store the extracted data in a database or anywhere else.
> }

Re: How to extract specified information from html?

Posted by qi wu <ch...@gmail.com>.

Take a look at HtmlParser.java in the parse-html plugin. You can develop your own HtmlParser by modifying the implementation of this function:

public Parse getParse(Content content) {
  // Step 1: get the HTML source code from content.
  String htmlCode = content.toString();

  // Step 2: scan the HTML source code with a regular expression to find
  // the structured data you want.

  // Step 3: store the extracted data in a database or anywhere else.
}
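
Step 2 can be sketched in plain Java for the a.html example from the original question. This is only an illustration under assumptions: the class name DiseaseExtractor and the method extract are hypothetical, no Nutch classes are used, and the patterns are tailored to that one sample span. Tags inside the span are replaced by a space so that "behavioral" and "disease" come out as two words. Regular expressions are brittle on real-world HTML, so treat this as a starting point.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: pulls the disease name and ID
// out of a <span class="title">...</span> element with regular expressions.
final class DiseaseExtractor {
    // Captures everything inside the title span; DOTALL lets '.' cross line breaks.
    private static final Pattern TITLE_SPAN = Pattern.compile(
        "<span\\s+class=\"title\"\\s*>(.*?)</span>",
        Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Matches "name (ID)" once the inner tags have been stripped.
    private static final Pattern NAME_AND_ID =
        Pattern.compile("^(.+?)\\s*\\(([^()]+)\\)$");

    // Returns {name, id}, or empty when the page has no matching span.
    static Optional<String[]> extract(String html) {
        Matcher span = TITLE_SPAN.matcher(html);
        if (!span.find()) {
            return Optional.empty();
        }
        String text = span.group(1)
            .replaceAll("<[^>]+>", " ")   // drop inner tags such as <font>
            .replaceAll("\\s+", " ")      // collapse runs of whitespace
            .trim();
        Matcher m = NAME_AND_ID.matcher(text);
        if (!m.matches()) {
            return Optional.empty();
        }
        return Optional.of(new String[] { m.group(1), m.group(2) });
    }

    public static void main(String[] args) {
        String html = "<span class=\"title\">behavioral"
            + "<font color=red>disease</font>(N76.8)\n</span>";
        extract(html).ifPresent(r ->
            System.out.println("DiseaseName: " + r[0] + ", ID=" + r[1]));
    }
}
```

For the sample span, main prints "DiseaseName: behavioral disease, ID=N76.8". Step 3 (saving the result) is left out because the thread does not say where the data should go.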

----- Original Message ----- 
From: "zhao xiuwen" <re...@gmail.com>
To: <nu...@lucene.apache.org>
Sent: Thursday, November 01, 2007 12:12 AM
Subject: Re: How to extract specified information from html?


> Should I implement HtmlParseFilter? If so, how do I invoke my method in
> filter() of HtmlParseFilter?
> 
> Thanks.

Re: How to extract specified information from html?

Posted by zhao xiuwen <re...@gmail.com>.

Should I implement HtmlParseFilter? If so, how do I invoke my method in
filter() of HtmlParseFilter?

Thanks.

