Posted to java-user@lucene.apache.org by Michael Dodson <mg...@mac.com> on 2006/01/28 17:38:46 UTC

indexing URLs from parsed HTML

I'm new to Lucene and I'm trying to index an HTML file parsed with  
NekoHTML.

For text between HTML tags, it's easy enough to have an overloaded
getText() method which either recursively indexes all text, or which
accepts the name of a tag (like "title") and only finds text between
<title></title> tags.
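
For reference, here is a minimal sketch of the kind of getText() overloads I mean. It assumes the parser hands back a standard org.w3c.dom tree, which as far as I know is what NekoHTML's DOMParser produces. The class name TextExtractor and the parse() helper are made up for illustration, and the helper uses the JDK's own XML parser on well-formed markup rather than NekoHTML itself. Tag names are compared case-insensitively because I believe NekoHTML upper-cases element names by default.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class; not part of Lucene or NekoHTML.
class TextExtractor {

    /** Recursively collects all text found beneath the given node. */
    static String getText(Node node) {
        StringBuilder sb = new StringBuilder();
        collect(node, sb);
        return sb.toString().trim();
    }

    /** Collects only the text found inside elements named tagName. */
    static String getText(Node node, String tagName) {
        StringBuilder sb = new StringBuilder();
        collectInTag(node, tagName, sb);
        return sb.toString().trim();
    }

    private static void collect(Node node, StringBuilder sb) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            sb.append(node.getNodeValue()).append(' ');
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), sb);
        }
    }

    private static void collectInTag(Node node, String tagName, StringBuilder sb) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && node.getNodeName().equalsIgnoreCase(tagName)) {
            // Matching element: take all of its text and stop descending,
            // so nested matches are not counted twice.
            collect(node, sb);
            return;
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collectInTag(children.item(i), tagName, sb);
        }
    }

    /** Demo-only stand-in for the real parser; expects well-formed markup. */
    static Document parse(String markup) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(
            "<html><head><title>Hello</title></head><body><p>World</p></body></html>");
        System.out.println(getText(doc, "title")); // text inside <title> only
        System.out.println(getText(doc));          // all text in the document
    }
}
```

The returned strings can then be added to a Lucene Document as separate fields (for example one for the title and one for the body).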

Unfortunately, I'm trying to index URLs, image names, and ALT text,
all of which live in attributes inside the tag itself (href, src, alt)
rather than between tags, and I can't figure out how to access that
data.  I realize this is more of a NekoHTML question than a Lucene
question, but I know Lucene is often used for indexing web content
and was hoping someone on this list could help.
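
My best guess, untested against NekoHTML itself, is that the attributes are reachable through the standard DOM call Element.getAttribute(), roughly as in the sketch below. It assumes the parsed result is an ordinary org.w3c.dom tree. The class name AttributeExtractor and the parse() helper are hypothetical, and the helper again uses the JDK's XML parser on well-formed markup as a stand-in for the real HTML parser.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class; not part of Lucene or NekoHTML.
class AttributeExtractor {

    /**
     * Walks the tree below node and returns the value of attrName for every
     * element named tagName (compared case-insensitively) that carries it.
     */
    static List<String> getAttributeValues(Node node, String tagName, String attrName) {
        List<String> values = new ArrayList<>();
        collect(node, tagName, attrName, values);
        return values;
    }

    private static void collect(Node node, String tagName, String attrName,
                                List<String> values) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && node.getNodeName().equalsIgnoreCase(tagName)) {
            Element element = (Element) node;
            if (element.hasAttribute(attrName)) {
                values.add(element.getAttribute(attrName));
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), tagName, attrName, values);
        }
    }

    /** Demo-only stand-in for the real parser; expects well-formed markup. */
    static Document parse(String markup) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(
            "<html><body><a href=\"http://example.com/\">link</a>"
            + "<img src=\"logo.png\" alt=\"Company logo\"/></body></html>");
        System.out.println(getAttributeValues(doc, "a", "href"));  // link URLs
        System.out.println(getAttributeValues(doc, "img", "src")); // image names
        System.out.println(getAttributeValues(doc, "img", "alt")); // ALT text
    }
}
```

If that works, each returned value could go into its own Lucene field, so that URLs, image names, and ALT text can be searched separately.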

Cheers.
Mike
