Posted to dev@nutch.apache.org by Jack Tang <hi...@gmail.com> on 2005/07/26 10:15:53 UTC
Re: Information extraction
Hi Cuong.
I am going to build a private book search engine, and I am facing the same problem.
Could you describe more about the information you want to extract and
the website?
Regards
/Jack
On 7/26/05, Cuong Hoang <cl...@gmail.com> wrote:
> Hi all,
>
> Does anyone have experience with designing web information extraction
> systems such as shopbots/pricebots? I'm currently doing research on this
> topic and want to integrate Nutch. A few guidelines from anyone who has
> designed this type of system would be really helpful to me.
>
> Regards,
>
> Cuong Hoang
--
Keep Discovering ... ...
http://www.jroller.com/page/jmars
Re: Information extraction
Posted by Chris Lu <ch...@gmail.com>.
My approach to structured information is to use DBSight, which
creates Lucene indexes on data retrieved from any database.
As Erik mentioned, scraping is highly fragile. By going directly to the
database, we get data that is more reliable, more up to date, and more
flexible to work with.
On the other hand, you will need database access, and this approach is
quite different from Nutch.
Alternatively, Nutch/Lucene could provide a simple XML analyzer that
consumes one specific XML format, with a pluggable XSL filter converting
any other XML structure into that format.
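To make the XSL-filter idea concrete, here is a minimal sketch that uses
only the JDK's javax.xml.transform API. It is not a Nutch or Lucene API.
The class name XslFieldExtractor, the stylesheet, the <item> input layout,
and the <doc>/<field> target format are all assumptions invented for this
example.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class XslFieldExtractor {

    // Hypothetical per-source stylesheet: maps one shop's <item> layout onto a
    // neutral <doc><field name="...">value</field></doc> format.
    static final String SHOP_XSL =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:template match='/item'>"
      + "<doc>"
      + "<field name='title'><xsl:value-of select='name'/></field>"
      + "<field name='price'><xsl:value-of select='cost/@amount'/></field>"
      + "</doc>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the stylesheet to the source XML and collects every <field>
    // element of the result into a name -> value map.
    static Map<String, String> extractFields(String xml, String xsl) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xsl)));
            DOMResult result = new DOMResult();
            transformer.transform(new StreamSource(new StringReader(xml)), result);

            Document doc = (Document) result.getNode();
            NodeList nodes = doc.getElementsByTagName("field");
            Map<String, String> fields = new LinkedHashMap<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element field = (Element) nodes.item(i);
                fields.put(field.getAttribute("name"), field.getTextContent());
            }
            return fields;
        } catch (Exception e) {
            throw new IllegalStateException("XSL transform failed", e);
        }
    }

    public static void main(String[] args) {
        // Made-up source record in one shop's own XML layout.
        String xml = "<item><name>USB Drive</name>"
                   + "<cost amount='19.99' currency='USD'/></item>";
        System.out.println(extractFields(xml, SHOP_XSL));
    }
}
```

Under these assumptions, a second data source would need only its own
stylesheet; the resulting name/value pairs could then be handed to whatever
indexing code is in use.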
--
Chris Lu
---------------------
Full-Text Search on Any Database
http://www.dbsight.net
On 7/26/05, Erik Hatcher <er...@ehatchersolutions.com> wrote:
> Further on the information extraction idea, consider what the SIMILE
> team at MIT are doing... http://simile.mit.edu
>
> The lower-case semantic web is gaining a lot of momentum these days,
> and I'm a strong proponent and student of it at the moment. Scraping
> rich information from a site is certainly reasonably pragmatic, but
> it is also highly fragile. SIMILE's Piggy Bank has a scraper
> facility. In a more ideal world, computer shops, book stores,
> libraries, and anyone with data to share would publish it in a
> reusable and structured way (RDF seems to me to be the best way to do
> this). Merging a full-text search engine with structured
> information, though, is yet another tricky thing that I am myself
> working with at the moment.
>
> I'd love to have more discussions along these lines.
>
> Erik
Re: Information extraction
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
Further on the information extraction idea, consider what the SIMILE
team at MIT are doing... http://simile.mit.edu
The lower-case semantic web is gaining a lot of momentum these days,
and I'm a strong proponent and student of it at the moment. Scraping
rich information from a site is certainly reasonably pragmatic, but
it is also highly fragile. SIMILE's Piggy Bank has a scraper
facility. In a more ideal world, computer shops, book stores,
libraries, and anyone with data to share would publish it in a
reusable and structured way (RDF seems to me to be the best way to do
this). Merging a full-text search engine with structured
information, though, is yet another tricky thing that I am myself
working with at the moment.
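As a small illustration of why scraping is fragile, here is a sketch of a
pattern-based price scraper. The FragileScraper class and the page markup
are hypothetical and invented for this example; the point is only that a
cosmetic change to the site's HTML silently breaks the extraction.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FragileScraper {

    // Hypothetical pattern tied to one shop's exact markup:
    // <span class="price">$19.99</span>
    private static final Pattern PRICE =
        Pattern.compile("<span class=\"price\">\\$([0-9]+\\.[0-9]{2})</span>");

    // Returns the first price found in the page, or empty if the markup
    // does not match the pattern.
    static Optional<String> scrapePrice(String html) {
        Matcher m = PRICE.matcher(html);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String before = "<li><span class=\"price\">$19.99</span></li>";
        // The shop adds one CSS class; the price itself is unchanged.
        String after = "<li><span class=\"price sale\">$19.99</span></li>";

        System.out.println(scrapePrice(before)); // Optional[19.99]
        System.out.println(scrapePrice(after));  // Optional.empty
    }
}
```

Data published in a structured form would not depend on presentation
details like this.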
I'd love to have more discussions along these lines.
Erik
RE: Information extraction
Posted by Cuong Hoang <cl...@gmail.com>.
Hi Jack,
I've been doing research for the last few days, and I think that, once
successfully implemented, an information extraction system should be able to
extract information from various sources. I've started reading about
patterns, context-free grammars, and ontologies, which I think will be the
core of such a system. I intend to index computer shops.
Regards,
Cuong Hoang