Posted to dev@nutch.apache.org by Jack Tang <hi...@gmail.com> on 2005/07/26 10:15:53 UTC

Re: Information extraction

Hi Cuong.

I am going to build a private book search engine, and I am facing the same
problem. Could you describe in more detail the information you want to
extract and the website?

Regards
/Jack

On 7/26/05, Cuong Hoang <cl...@gmail.com> wrote:
> Hi all,
>
> Does anyone have experience with designing web information extraction
> systems such as shopbots/pricebots? I'm currently doing research on this
> topic and want to integrate Nutch. A few guidelines from anyone who has
> designed this type of system would be really helpful to me.
>
> Regards,
>
> Cuong Hoang


-- 
Keep Discovering ... ...
http://www.jroller.com/page/jmars

Re: Information extraction

Posted by Chris Lu <ch...@gmail.com>.
My approach to structured information is to use DBSight, which
creates Lucene indexes from data retrieved from any database.

As Erik mentioned, scraping is highly fragile. By going directly to
the database, we get data that is more reliable, more up to date, and
more flexible to work with. On the other hand, you will need database
access, and this approach is quite different from Nutch.

Alternatively, Nutch/Lucene could provide a simple XML analyzer that
consumes one specific XML format, with a pluggable XSL stylesheet
transforming any other XML structure into that format.
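
To make the XSL idea concrete, here is a minimal sketch using only the
JDK's javax.xml.transform API. The <catalog>/<product> input, the
<docs>/<doc>/<field> target format, and all class and method names are
assumptions made up for illustration; none of this is an existing
Nutch or Lucene API.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XslToIndexFormat {
    // Hypothetical per-source stylesheet: maps one shop's <product> elements
    // onto an assumed common <doc><field name="..."> format.
    static final String SHOP_XSL =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='/catalog'>"
      + "<docs><xsl:apply-templates select='product'/></docs>"
      + "</xsl:template>"
      + "<xsl:template match='product'>"
      + "<doc>"
      + "<field name='title'><xsl:value-of select='name'/></field>"
      + "<field name='price'><xsl:value-of select='@price'/></field>"
      + "</doc>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the given stylesheet to the source XML and returns the result as text.
    static String toIndexFormat(String sourceXml, String xsl) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(xsl)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(sourceXml)),
                              new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String shopXml =
            "<catalog><product price='499'><name>Laptop</name></product></catalog>";
        // Prints the product rewritten as <doc>/<field> elements.
        System.out.println(toIndexFormat(shopXml, SHOP_XSL));
    }
}
```

With this split, supporting a new data source would only mean writing
another stylesheet; the indexing side keeps reading the one common
format.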

-- 
Chris Lu
---------------------
Full-Text Search on Any Database
http://www.dbsight.net


On 7/26/05, Erik Hatcher <er...@ehatchersolutions.com> wrote:
> Further on the information extraction idea, consider what the SIMILE
> team at MIT are doing... http://simile.mit.edu
> 
> The lower-case semantic web is gaining a lot of momentum these days,
> and I'm a strong proponent and student of it at the moment.  Scraping
> rich information from a site is certainly reasonably pragmatic, but
> it is also highly fragile.  SIMILE's Piggy Bank has a scraper
> facility.  In a more ideal world, computer shops, book stores,
> libraries, and anyone with data to share would publish it in a
> reusable and structured way (RDF seems to me to be the best way to do
> this).  Merging a full-text search engine with structured
> information, though, is yet another tricky thing that I am myself
> working with at the moment.
> 
> I'd love to have more discussions along these lines.
> 
>      Erik

Re: Information extraction

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
Further on the information extraction idea, consider what the SIMILE  
team at MIT are doing... http://simile.mit.edu

The lower-case semantic web is gaining a lot of momentum these days,  
and I'm a strong proponent and student of it at the moment.  Scraping  
rich information from a site is certainly reasonably pragmatic, but  
it is also highly fragile.  SIMILE's Piggy Bank has a scraper  
facility.  In a more ideal world, computer shops, book stores,
libraries, and anyone with data to share would publish it in a  
reusable and structured way (RDF seems to me to be the best way to do  
this).  Merging a full-text search engine with structured  
information, though, is yet another tricky thing that I am myself  
working with at the moment.
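
As a small illustration of why scraping is fragile, here is a sketch of
a pattern-based price extractor. The markup, the class name, and the
regular expression are all hypothetical; the point is only that a
pattern tied to one page layout silently stops matching after a
harmless-looking markup change.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FragileScraper {
    // Pattern tied to one shop's (hypothetical) markup:
    //   <span class="price">$499.00</span>
    static final Pattern PRICE =
        Pattern.compile("<span class=\"price\">\\$([0-9]+(?:\\.[0-9]{2})?)</span>");

    // Returns the numeric part of the first price found, if the pattern matches.
    static Optional<String> extractPrice(String html) {
        Matcher m = PRICE.matcher(html);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String before = "<div><span class=\"price\">$499.00</span></div>";
        // The shop adds a second CSS class; the visible page looks the same.
        String after = "<div><span class=\"price sale\">$499.00</span></div>";

        System.out.println(extractPrice(before)); // Optional[499.00]
        System.out.println(extractPrice(after));  // Optional.empty
    }
}
```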

I'd love to have more discussions along these lines.

     Erik


On Jul 26, 2005, at 5:50 AM, Cuong Hoang wrote:

> Hi Jack,
>
> I've been doing research for the last few days, and I think that once
> successfully implemented, an information extraction system should be
> able to extract information from various sources. I've started reading
> about patterns, context-free grammars, and ontologies, which I think
> will be the core of such a system. I intend to index computer shops.
>
> Regards,
>
> Cuong Hoang


RE: Information extraction

Posted by Cuong Hoang <cl...@gmail.com>.
Hi Jack,

I've been doing research for the last few days, and I think that once
successfully implemented, an information extraction system should be able
to extract information from various sources. I've started reading about
patterns, context-free grammars, and ontologies, which I think will be the
core of such a system. I intend to index computer shops.

Regards,

Cuong Hoang

-----Original Message-----
From: Jack Tang [mailto:himars@gmail.com] 
Sent: Tuesday, 26 July 2005 6:16 PM
To: nutch-user@lucene.apache.org; nutch-dev@lucene.apache.org
Subject: Re: Information extraction

Hi Cuong.

I am going to build a private book search engine, and I am facing the same
problem. Could you describe in more detail the information you want to
extract and the website?

Regards
/Jack
