Posted to java-user@lucene.apache.org by Michael Dodson <mg...@mac.com> on 2006/01/28 17:38:46 UTC
indexing URLs from parsed HTML
I'm new to Lucene and I'm trying to index an HTML file parsed with
NekoHTML.
With text between HTML tags, it's easy enough to have an overloaded
getText() method which either recursively indexes all text, or which
accepts the name of a tag (like "title") and only finds text between
<title></title> tags.
Unfortunately, I'm trying to index URLs, image names, and ALT text,
all of which are stored as attributes inside the tag itself, and I
can't figure out how to access that data. I realize this is more of a
NekoHTML question than a Lucene question, but I know Lucene is often
used for indexing web content and was hoping someone on this list
could help.
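
To make the question concrete, here is a rough sketch of the kind of
attribute access in question, written against the standard org.w3c.dom
interfaces. NekoHTML's DOMParser hands back an org.w3c.dom.Document, so
the same tree walk should apply to its output. Everything named here
(AttributeExtractor, collect, extract, parse) is hypothetical, and the
sample is parsed with the JDK's own XML parser from well-formed markup
only so that it runs without NekoHTML on the classpath. Tag names are
compared case-insensitively because NekoHTML upper-cases element names
by default.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of NekoHTML or Lucene.
class AttributeExtractor {

    // Walks the subtree under 'node' and collects the value of 'attrName'
    // from every element whose tag name matches 'tagName' (ignoring case).
    public static void collect(Node node, String tagName, String attrName, List<String> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element el = (Element) node;
            if (el.getTagName().equalsIgnoreCase(tagName) && el.hasAttribute(attrName)) {
                out.add(el.getAttribute(attrName));
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, tagName, attrName, out);
        }
    }

    // Convenience wrapper: returns all matching attribute values in document order.
    public static List<String> extract(Document doc, String tagName, String attrName) {
        List<String> out = new ArrayList<String>();
        collect(doc.getDocumentElement(), tagName, attrName, out);
        return out;
    }

    // Stand-in for the NekoHTML parse step (assumption: input is well-formed markup).
    public static Document parse(String markup) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse markup", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<html><body>"
                + "<a href=\"http://example.com/\">home</a>"
                + "<img src=\"logo.png\" alt=\"Company logo\"/>"
                + "</body></html>");
        System.out.println(extract(doc, "a", "href"));   // link URLs
        System.out.println(extract(doc, "img", "src"));  // image names
        System.out.println(extract(doc, "img", "alt"));  // ALT text
    }
}
```

The strings returned by extract() could then be added to the index as
separate fields alongside the body text.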
Cheers.
Mike