
Posted to java-user@lucene.apache.org by Sebastian Ho <se...@bii.a-star.edu.sg> on 2004/04/13 03:28:00 UTC

suitability of lucene for project

hi all

I am investigating technologies to use for a project which basically
retrieves HTML pages on a regular basis (or whenever there are changes),
parses the HTML to extract specific information, and presents the results
as links in a webpage. Note that this is not a general search-engine
kind of project; we are extracting clinical information from various
websites and consolidating it.

Please advise me whether Lucene can do the above. In areas where it
cannot, suggestions for other solutions would be appreciated.

Thanks

Sebastian Ho
Bioinformatics Institute


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

Re: suitability of lucene for project

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

No, Lucene is not the right solution for this particular use.  It does
not include anything to retrieve HTML pages or to parse them.  However,
if you ever need full-text search, then Lucene is where it's at.

	Erik



Re: suitability of lucene for project

Posted by Sebastian Ho <se...@bii.a-star.edu.sg>.

I will be searching webpages (URLs given by the user) for keywords (in
clinical records). Would that count as structured or unstructured? The
records might be in a table, or in a list of URLs pointing to individual
record webpages.

thks

sebastian


On Tue, 2004-04-13 at 11:15, Stephane James Vaucher wrote:
> It could be part of your solution, but I don't think so. Let me explain:
> 
> I've done something similar to what you describe a few times. I often
> use HttpUnit to get the information. How you process it is up to you.
> If you want it to be indexed (searchable), you can use Lucene. If you
> want to extract structured (or semi-structured) information, use
> wrapper induction techniques (not Lucene).
> 
> cheers,
> sv

Re: suitability of lucene for project

Posted by Stephane James Vaucher <va...@cirano.qc.ca>.

It could be part of your solution, but I don't think so. Let me explain:

I've done something similar to what you describe a few times. I often
use HttpUnit to get the information. How you process it is up to you.
If you want it to be indexed (searchable), you can use Lucene. If you
want to extract structured (or semi-structured) information, use
wrapper induction techniques (not Lucene).
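
To make the extract-versus-index split concrete, here is a minimal
sketch in Python, standard library only. It is not HttpUnit, Lucene, or
real wrapper induction: the sample page, the class TableRecordParser,
and the functions extract_records and build_index are all hypothetical
names made up for illustration. The parser is a hand-written wrapper
(wrapper induction would learn such rules from examples instead), and
the dictionary-based index is only a toy stand-in for what a full-text
engine like Lucene does properly.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser

# Hypothetical sample page: a table of records, one link per data row.
SAMPLE_HTML = """
<table>
  <tr><th>Record</th><th>Condition</th></tr>
  <tr><td><a href="/records/101.html">Record 101</a></td><td>asthma</td></tr>
  <tr><td><a href="/records/102.html">Record 102</a></td><td>diabetes</td></tr>
</table>
"""


class TableRecordParser(HTMLParser):
    """Hand-written wrapper: keeps the <td> texts and first link of each <tr>."""

    def __init__(self):
        super().__init__()
        self.records = []      # one dict per data row
        self._row = None       # row being built, or None outside <tr>
        self._in_cell = False  # True while inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = {"cells": [], "url": None}
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row["cells"].append("")
        elif tag == "a" and self._in_cell and self._row["url"] is None:
            self._row["url"] = dict(attrs).get("href")

    def handle_data(self, data):
        if self._in_cell:
            self._row["cells"][-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self._in_cell = False
            if self._row["cells"]:  # rows with only <th> cells are skipped
                self._row["cells"] = [c.strip() for c in self._row["cells"]]
                self.records.append(self._row)
            self._row = None


def extract_records(html):
    """Structured path: return one {'cells': [...], 'url': ...} dict per row."""
    parser = TableRecordParser()
    parser.feed(html)
    return parser.records


def build_index(records):
    """Searchable path (toy): map each lowercase word to the URLs containing it."""
    index = defaultdict(set)
    for rec in records:
        for word in re.findall(r"\w+", " ".join(rec["cells"]).lower()):
            index[word].add(rec["url"])
    return index


if __name__ == "__main__":
    records = extract_records(SAMPLE_HTML)
    index = build_index(records)
    print(records)
    print(sorted(index["diabetes"]))
```

The first function gives you fields you can consolidate and present as
links; the second only tells you which pages mention a keyword. Fetching
the pages is a separate step that neither function covers.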

cheers,
sv
