Posted to java-user@lucene.apache.org by Michael Wechner <mi...@wyona.org> on 2003/01/30 10:56:50 UTC
or
Hi
I am looking for an HTMLParser which skips text tagged by
<no-index> or something similar. This way I could exclude for
instance a "global navigation section" within the HTML
<no-index>
International<br>
Business<br>
Science<br>
...
</no-index>
It seems that the current demo/HTMLParser
(http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.indexing&toc=faq#q11)
is not capable of doing something like that.
Any pointers are very welcome.
Thanks a lot
Michael
---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org
Re: or
Posted by Michael Wechner <mi...@wyona.org>.
Erik Hatcher wrote:
> On Thursday, January 30, 2003, at 06:59 PM, Michael Wechner wrote:
<snip/>
>
>> 2) I got two Javadoc warnings, because @return was empty within
>> HtmlDocument (getDocument() and Document())
>
>
> picky picky! :) But thanks - I'll correct those too.
Sorry about that, but Ant (or rather javadoc) was picky :-)
>
>
> I'm not ready to commit my changes - I'll do so in a few weeks when I
> get some refactoring done on IndexTask.
No problem
Thanks
Michael
>
>
> Erik
>
>
Re: or
Posted by Erik Hatcher <li...@ehatchersolutions.com>.
On Thursday, January 30, 2003, at 06:59 PM, Michael Wechner wrote:
> Well, I haven't found out how to use JTidy to ignore such tags that
> have such a class.
You did it the way I envisioned. I did not expect JTidy to have a way
to ignore tags either; I expected you would have to do it the laborious
way and check the attribute values yourself, as you did.
> 1) I noticed that demo/HTMLDocument (resp. demo/html/HTMLParser) sets:
>
> contents= title + body
>
> and your class HtmlDocument
>
> contents=body
Yeah, I really should glue all fields into "contents" like that, thanks
for the enhancement. I'll roll that into my update. My original needs
were not to mirror the demo/HTMLDocument class so I didn't think of
making them compatible at the fields level. I just changed it so there
are now title, body, and contents fields.
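As a rough illustration of that field layout (not the actual HtmlDocument code): a plain Map stands in for a Lucene Document here, the class name ContentsField is invented for this sketch, and joining title and body with a single space is an assumption, since the thread only says "title + body".

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch; a Map stands in for a Lucene Document. */
final class ContentsField {

    /** Builds separate title and body fields plus a combined "contents" field. */
    static Map<String, String> fields(String title, String body) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("title", title);
        fields.put("body", body);
        // Assumption: a single space separates title and body.
        fields.put("contents", title + " " + body);
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(fields("Lucene news", "Indexing HTML pages").get("contents"));
    }
}
```

A query against "contents" would then match words from either the title or the body, while "title" and "body" stay available for more targeted searches.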
> 2) I got two Javadoc warnings, because @return was empty within
> HtmlDocument (getDocument() and Document())
picky picky! :) But thanks - I'll correct those too.
I'm not ready to commit my changes - I'll do so in a few weeks when I
get some refactoring done on IndexTask.
Erik
Re: or
Posted by Michael Wechner <mi...@wyona.org>.
Erik Hatcher wrote:
> If you look at the contributions/ant area of the Lucene sandbox in
> CVS you'll see my HtmlDocument class which uses JTidy.
>
> Rather than making up some invalid HTML tag, I'd recommend you
> separate your navigation section with a <div> or <span> with a
> special class="navigation" or something like that. Then use JTidy to
> ignore such tags that have that class. Then you get valid, clean
> HTML and the ability to filter it for indexing.
Well, I haven't found out how to make JTidy ignore tags that have such
a class, so I just added some code to your HtmlDocument class, within
the getBodyText method:
    // Skip the text of <span class="lucene-no-index"> elements.
    if (child.getNodeName().equals("span")) {
        org.w3c.dom.Attr attribute = ((Element) child).getAttributeNode("class");
        if (attribute != null && attribute.getValue().equals("lucene-no-index")) {
            System.out.println("HtmlDocument.getBodyText(): ignore span!");
            break;
        }
        System.out.println("HtmlDocument.getBodyText(): accept span!");
    }
This way text will be ignored within
<span class="lucene-no-index">...</span>
It's not "perfect", but it's working very well for the moment.
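For anyone who wants to try the idea outside HtmlDocument, here is a minimal, self-contained sketch of the same check. It is not the sandbox code: the class name NoIndexText and its methods are made up for illustration, and the JDK's XML parser stands in for JTidy, so the input has to be well-formed XHTML.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Hypothetical helper, not part of Lucene or JTidy. */
final class NoIndexText {

    /** Collects the text below node, skipping span elements marked lucene-no-index. */
    static String bodyText(Node node) {
        StringBuilder sb = new StringBuilder();
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                Element element = (Element) child;
                // getAttribute returns "" when the attribute is absent.
                boolean excluded = element.getNodeName().equals("span")
                        && element.getAttribute("class").equals("lucene-no-index");
                if (!excluded) {
                    sb.append(bodyText(child)).append(' ');
                }
            } else if (child.getNodeType() == Node.TEXT_NODE) {
                sb.append(child.getNodeValue());
            }
        }
        return sb.toString();
    }

    /** Parses well-formed XHTML with the JDK parser (a stand-in for JTidy). */
    static String textOf(String xhtml) {
        try {
            byte[] bytes = xhtml.getBytes(StandardCharsets.UTF_8);
            return bodyText(DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bytes)).getDocumentElement());
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XHTML", e);
        }
    }

    public static void main(String[] args) {
        String page = "<body><span class=\"lucene-no-index\">International Business</span>"
                + "<p>Article text</p></body>";
        System.out.println(textOf(page).trim()); // prints "Article text"
    }
}
```

The sketch relies on getAttribute returning an empty string for a missing class attribute, which removes the need for the null check on the attribute node.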
Two remarks:
1) I noticed that demo/HTMLDocument (or rather demo/html/HTMLParser) sets:
contents= title + body
and your class HtmlDocument
contents=body
2) I got two Javadoc warnings, because @return was empty within
HtmlDocument (getDocument() and Document())
Thanks very much for your help
Michael
>
>
> Erik
>
>
>
Re: or
Posted by Erik Hatcher <li...@ehatchersolutions.com>.
If you look at the contributions/ant area of the Lucene sandbox in CVS
you'll see my HtmlDocument class which uses JTidy.
Rather than making up some invalid HTML tag, I'd recommend you separate
your navigation section with a <div> or <span> with a special
class="navigation" or something like that. Then use JTidy to ignore
such tags that have that class. Then you get valid, clean HTML and the
ability to filter it for indexing.
Erik
Re: or
Posted by Michael Wechner <mi...@wyona.org>.
Erik Hatcher wrote:
> On Thursday, January 30, 2003, at 07:07 PM, Michael Wechner wrote:
>
>> Maybe Erik wants to include an "improved" version of my code snippet
>> into CVS.
>
>
> Only if it can be made generic somehow - but that might be a bit
> tricky to implement depending on how crazy we wanted to get with it.
> The HtmlDocument class is really meant to be just an example of how to
> use the Ant <index> task I wrote along with the
> FileExtensionDocumentHandler that uses it. So its original purpose
> was not to be a robust HTML document indexer, but an example piece of
> a larger puzzle.
sure, no problem. Actually I think it's good to have small demo code and
"larger" industrial strength code.
>
>
>> I guess I am not the only one wanting to exclude certain parts from
>> an HTML page ;-)
>
>
> I've seen this request come up in the recent past, in fact. And it's a
> perfectly reasonable one, especially if you are in charge of the HTML.
yeah, I am not sure if there is a standard way to do this. I just know
from an Atomz demo that they are using something like this.
It would be nice if there were a "standard tag" for this, or at least
if the Open Source search engine projects could agree on one. Making it
configurable would also be nice, of course, but I don't think it is
necessary to begin with.
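One way a configurable variant could look, as a rough sketch only: the excluded class names are passed in rather than hard-coded, any element (not just span) can carry them, and a multi-valued attribute such as class="sidebar navigation" is split on whitespace. The class ConfigurableExclusion and its method names are hypothetical, and the JDK XML parser stands in for JTidy, so the input has to be well-formed XHTML.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Hypothetical sketch of a text extractor with configurable exclusions. */
final class ConfigurableExclusion {
    private final Set<String> excludedClasses;

    ConfigurableExclusion(String... classNames) {
        excludedClasses = new HashSet<>(Arrays.asList(classNames));
    }

    /** True if any whitespace-separated class token of the element is excluded. */
    boolean isExcluded(Element element) {
        for (String token : element.getAttribute("class").trim().split("\\s+")) {
            if (excludedClasses.contains(token)) {
                return true;
            }
        }
        return false;
    }

    /** Collects the text below node, skipping excluded child elements entirely. */
    String text(Node node) {
        StringBuilder sb = new StringBuilder();
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                if (!isExcluded((Element) child)) {
                    sb.append(text(child)).append(' ');
                }
            } else if (child.getNodeType() == Node.TEXT_NODE) {
                sb.append(child.getNodeValue());
            }
        }
        return sb.toString();
    }

    /** Parses well-formed XHTML with the JDK parser (a stand-in for JTidy). */
    String textOf(String xhtml) {
        try {
            byte[] bytes = xhtml.getBytes(StandardCharsets.UTF_8);
            return text(DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bytes)).getDocumentElement());
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XHTML", e);
        }
    }

    public static void main(String[] args) {
        ConfigurableExclusion filter = new ConfigurableExclusion("navigation", "lucene-no-index");
        System.out.println(filter.textOf(
                "<body><div class=\"sidebar navigation\">International</div><p>Story</p></body>").trim());
    }
}
```

Keeping the class names in a set means a site can settle on its own marker without touching the traversal code.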
Thanks
Michael
>
>
> Erik
>
>
Re: or
Posted by Erik Hatcher <li...@ehatchersolutions.com>.
On Thursday, January 30, 2003, at 07:07 PM, Michael Wechner wrote:
> Maybe Erik wants to include an "improved" version of my code snippet
> into CVS.
Only if it can be made generic somehow - but that might be a bit tricky
to implement depending on how crazy we wanted to get with it. The
HtmlDocument class is really meant to be just an example of how to use
the Ant <index> task I wrote along with the
FileExtensionDocumentHandler that uses it. So its original purpose was
not to be a robust HTML document indexer, but an example piece of a
larger puzzle.
> I guess I am not the only one wanting to exclude certain parts from an
> HTML page ;-)
I've seen this request come up in the recent past, in fact. And it's a
perfectly reasonable one, especially if you are in charge of the HTML.
Erik
Re: or
Posted by Michael Wechner <mi...@wyona.org>.
Kelvin Tan wrote:
>My suggestion would be to modify HTMLParser to do the job. Don't think it's
>very difficult. I'm unaware of any existing HTML Parsers which support that
>functionality...
>
Maybe Erik wants to include an "improved" version of my code snippet
into CVS.
I guess I am not the only one wanting to exclude certain parts from an
HTML page ;-)
All the best
Michael
>
>
>Regards,
>Kelvin
>
>--------
>The book giving manifesto - http://how.to/sharethisbook
>
>
Re: or
Posted by Kelvin Tan <ke...@relevanz.com>.
My suggestion would be to modify HTMLParser to do the job. Don't think it's
very difficult. I'm unaware of any existing HTML Parsers which support that
functionality...
Regards,
Kelvin
--------
The book giving manifesto - http://how.to/sharethisbook