Posted to java-user@lucene.apache.org by Michael Wechner <mi...@wyona.org> on 2003/01/30 10:56:50 UTC

or

Hi

I am looking for an HTMLParser which skips text tagged by <no-index>
or something similar. This way I could exclude, for instance, a
"global navigation section" within the HTML:

<no-index>
International<br>
Business<br>
Science<br>
...
</no-index>

It seems that the current demo/HTMLParser 
(http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.indexing&toc=faq#q11)
is not capable of doing something like that.

Any pointers are very welcome.

Thanks a lot

Michael


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: or

Posted by Michael Wechner <mi...@wyona.org>.
Erik Hatcher wrote:

> On Thursday, January 30, 2003, at 06:59  PM, Michael Wechner wrote:

<snip/>

>
>> 2) I got two Javadoc warnings, because @return was empty within 
>> HtmlDocument (getDocument() and Document())
>
>
> picky picky!  :)  But thanks - I'll correct those too. 


Sorry about that, but Ant (or rather Javadoc) was picky :-)

>
>
> I'm not ready to commit my changes - I'll do so in a few weeks when I 
> get some refactoring done on IndexTask. 


No problem

Thanks

Michael

>
>
>     Erik


Re: or

Posted by Erik Hatcher <li...@ehatchersolutions.com>.
On Thursday, January 30, 2003, at 06:59  PM, Michael Wechner wrote:
> Well, I haven't  found out how to use JTidy to ignore such tags that 
> have such a class.

You did it the way I envisioned.  I did not expect JTidy to have a way 
to ignore tags either; I expected you would have to do it the laborious 
way and check the attribute values yourself, as you did.

> 1) I noticed that demo/HTMLDocument (resp. demo/html/HTMLParser) sets:
>
>      contents= title + body
>
>  and your class HtmlDocument
>
>     contents=body

Yeah, I really should glue all fields into "contents" like that, thanks 
for the enhancement.  I'll roll that into my update.  My original needs 
were not to mirror the demo/HTMLDocument class so I didn't think of 
making them compatible at the fields level.  I just changed it so there 
are now title, body, and contents fields.
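
[Editor's sketch] The field layout described above (title, body, and a combined contents field) might look roughly like the following. This is not Erik's HtmlDocument code: a plain Map stands in for a Lucene Document so the example runs without Lucene, the class name HtmlFields is hypothetical, and the single space between title and body is an assumption added so that the last word of the title and the first word of the body stay separate tokens.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class HtmlFields {
    // Builds the three field values discussed in the thread.
    // Assumption: a Map stands in for a Lucene Document here.
    static Map<String, String> build(String title, String body) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("title", title);
        fields.put("body", body);
        // "contents" is the catch-all search field: title and body glued together.
        fields.put("contents", (title + " " + body).trim());
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(build("Lucene FAQ", "How do I index HTML?"));
    }
}
```

Keeping title and body as separate fields still allows field-specific queries, while contents gives a single default field that is compatible with the demo's layout.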

> 2) I got two Javadoc warnings, because @return was empty within 
> HtmlDocument (getDocument() and Document())

picky picky!  :)  But thanks - I'll correct those too.

I'm not ready to commit my changes - I'll do so in a few weeks when I 
get some refactoring done on IndexTask.

	Erik




Re: or

Posted by Michael Wechner <mi...@wyona.org>.
Erik Hatcher wrote:

> If you look at the contributions/ant area of the Lucene sandbox in 
> CVS  you'll see my HtmlDocument class which uses JTidy.
>
> Rather than making up some invalid HTML tag, I'd recommend you 
> separate  your navigation section with a <div> or <span> with a 
> special  class="navigation" or something like that.  Then use JTidy to 
> ignore  such tags that have that class.  Then you get valid, clean 
> HTML and the  ability to filter it for indexing. 


Well, I haven't found out how to use JTidy to ignore such tags that 
have such a class. So I just added some code to your class 
HtmlDocument within the getBodyText method:

    if (child.getNodeName().equals("span")) {
        org.w3c.dom.Attr attribute = ((Element) child).getAttributeNode("class");
        if (attribute != null) {
            if (attribute.getValue().equals("lucene-no-index")) {
                System.out.println("HtmlDocument.getBodyText(): ignore span!");
                break;
            }
        }
        System.out.println("HtmlDocument.getBodyText(): accept span!");
    }

This way text will be ignored within <span class="lucene-no-index">...</span>
It's not "perfect", but it's working very well for the moment.
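
[Editor's sketch] A self-contained variant of the same idea, for readers who want to try it without the sandbox code. Assumptions: the JDK's built-in XML parser stands in for JTidy, so the input must already be well-formed XHTML; the class and method names (NoIndexText, isExcluded, bodyText, extract) are hypothetical and are not part of HtmlDocument. Unlike the snippet above, it skips any element carrying the marker class, not only <span>, and it treats the class attribute as a whitespace-separated list.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class NoIndexText {
    // Marker class name taken from the snippet in this message.
    static final String NO_INDEX_CLASS = "lucene-no-index";

    // True if the class attribute contains the marker as one of its
    // whitespace-separated tokens (a missing attribute yields "").
    static boolean isExcluded(Element element) {
        for (String token : element.getAttribute("class").trim().split("\\s+")) {
            if (token.equals(NO_INDEX_CLASS)) {
                return true;
            }
        }
        return false;
    }

    // Collects text depth-first and skips every excluded subtree.
    static String bodyText(Node node) {
        StringBuilder buffer = new StringBuilder();
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                buffer.append(child.getNodeValue());
            } else if (child.getNodeType() == Node.ELEMENT_NODE
                    && !isExcluded((Element) child)) {
                buffer.append(bodyText(child)).append(' ');
            }
        }
        return buffer.toString();
    }

    // Parses well-formed XHTML and returns its whitespace-normalized text.
    static String extract(String xhtml) {
        try {
            Node root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
            return bodyText(root).trim().replaceAll("\\s+", " ");
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(extract(
                "<body><span class=\"lucene-no-index\">International<br/>Business</span>"
                + "<p>Lucene indexes this sentence.</p></body>"));
    }
}
```

With real-world HTML, the DOM would come from JTidy instead of the JDK parser; the traversal itself only depends on the org.w3c.dom interfaces.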

Two remarks:

1) I noticed that demo/HTMLDocument (resp. demo/html/HTMLParser) sets:

      contents= title + body

  and your class HtmlDocument

     contents=body


2) I got two Javadoc warnings, because @return was empty within 
HtmlDocument (getDocument() and Document())


Thanks very much for your help

Michael





>
>
>     Erik
>
>
>
> On Thursday, January 30, 2003, at 04:56  AM, Michael Wechner wrote:
>
>> Hi
>>
>> I am looking for an HTMLParser which skips text tagged by
>>
>> <no-index>  or something similar. This way I could exclude for
>> instance a "global navigation section" within the HTML
>>
>> <no-index>
>> International<br>
>> Business<br>
>> Science<br>
>> ...
>> </no-index>
>>
>> It seems that the current demo/HTMLParser
>> (http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.indexing&toc=faq#q11)
>> is not capable of doing something like that.
>>
>> Any pointers are very welcome.
>>
>> Thanks a lot
>>
>> Michael


Re: or

Posted by Erik Hatcher <li...@ehatchersolutions.com>.
If you look at the contributions/ant area of the Lucene sandbox in CVS  
you'll see my HtmlDocument class which uses JTidy.

Rather than making up some invalid HTML tag, I'd recommend you separate  
your navigation section with a <div> or <span> with a special  
class="navigation" or something like that.  Then use JTidy to ignore  
such tags that have that class.  Then you get valid, clean HTML and the  
ability to filter it for indexing.

	Erik



On Thursday, January 30, 2003, at 04:56  AM, Michael Wechner wrote:

> Hi
>
> I am looking for an HTMLParser which skips text tagged by
>
> <no-index>  or something similar. This way I could exclude for
> instance a "global navigation section" within the HTML
>
> <no-index>
> International<br>
> Business<br>
> Science<br>
> ...
> </no-index>
>
> It seems that the current demo/HTMLParser
> (http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.indexing&toc=faq#q11)
> is not capable of doing something like that.
>
> Any pointers are very welcome.
>
> Thanks a lot
>
> Michael


Re: or

Posted by Michael Wechner <mi...@wyona.org>.
Erik Hatcher wrote:

> On Thursday, January 30, 2003, at 07:07  PM, Michael Wechner wrote:
>
>> Maybe Erik wants to include an "improved" version of my code snippet 
>> into CVS.
>
>
> Only if it can be made generic somehow - but that might be a bit 
> tricky to implement depending on how crazy we wanted to get with it.  
> The HtmlDocument class is really meant to be just an example of how to 
> use the Ant <index> task I wrote along with the 
> FileExtensionDocumentHandler that uses it.  So its original purpose 
> was not to be a robust HTML document indexer, but an example piece of 
> a larger puzzle. 


Sure, no problem. Actually, I think it's good to have both small demo 
code and "larger" industrial-strength code.

>
>
>> I guess I am not the only one wanting to exclude certain parts from 
>> an HTML page ;-)
>
>
> I've seen this request come up in the recent past, in fact.  And it's a 
> perfectly reasonable one, especially if you are in charge of the HTML. 


Yeah, I am not sure if there is a standard way to do this. I just know 
from an Atomz demo that they are using something like this.
It would be nice if there were a "standard tag" for this, or at least 
if the open source search engine projects could agree on one. Making it 
configurable would also be nice, of course, but I don't think that is 
necessary to begin with.

Thanks

Michael

>
>
>     Erik


Re: or

Posted by Erik Hatcher <li...@ehatchersolutions.com>.
On Thursday, January 30, 2003, at 07:07  PM, Michael Wechner wrote:
> Maybe Erik wants to include an "improved" version of my code snippet 
> into CVS.

Only if it can be made generic somehow - but that might be a bit tricky 
to implement depending on how crazy we wanted to get with it.  The 
HtmlDocument class is really meant to be just an example of how to use 
the Ant <index> task I wrote along with the 
FileExtensionDocumentHandler that uses it.  So its original purpose was 
not to be a robust HTML document indexer, but an example piece of a 
larger puzzle.
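
[Editor's sketch] One possible shape for a "generic" version is to move the exclusion rules out of the code and into a small configuration string. Everything here is an assumption for illustration: the "tag.class" rule syntax, the class name ConfigurableTextFilter, and the use of the JDK XML parser (which needs well-formed XHTML) in place of JTidy. It is not what was committed to the sandbox.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ConfigurableTextFilter {
    // One rule: an element name ("*" matches any) plus a required class token.
    private static final class Rule {
        final String tag;
        final String cssClass;

        Rule(String tag, String cssClass) {
            this.tag = tag;
            this.cssClass = cssClass;
        }

        boolean matches(Element e) {
            boolean tagOk = tag.equals("*") || tag.equalsIgnoreCase(e.getNodeName());
            List<String> tokens = Arrays.asList(e.getAttribute("class").trim().split("\\s+"));
            return tagOk && tokens.contains(cssClass);
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    // Hypothetical spec format: comma-separated "tag.class" entries,
    // for example "span.lucene-no-index,*.navigation".
    ConfigurableTextFilter(String spec) {
        for (String entry : spec.split(",")) {
            String[] parts = entry.trim().split("\\.", 2);
            if (parts.length != 2 || parts[0].isEmpty() || parts[1].isEmpty()) {
                throw new IllegalArgumentException("Expected tag.class, got: " + entry);
            }
            rules.add(new Rule(parts[0], parts[1]));
        }
    }

    private boolean excluded(Element e) {
        for (Rule rule : rules) {
            if (rule.matches(e)) {
                return true;
            }
        }
        return false;
    }

    // Appends the text of all non-excluded descendants of node to out.
    private void collect(Node node, StringBuilder out) {
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                out.append(child.getNodeValue()).append(' ');
            } else if (child.getNodeType() == Node.ELEMENT_NODE && !excluded((Element) child)) {
                collect(child, out);
            }
        }
    }

    // Parses well-formed XHTML and returns the whitespace-normalized text
    // that survives the configured rules.
    String text(String xhtml) {
        try {
            Node root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml))).getDocumentElement();
            StringBuilder out = new StringBuilder();
            collect(root, out);
            return out.toString().trim().replaceAll("\\s+", " ");
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }
    }
}
```

A caller would create the filter once, e.g. new ConfigurableTextFilter("span.lucene-no-index,*.navigation"), and reuse it for every page; the rule string could then come from a build property rather than being hard-coded.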

> I guess I am not the only one wanting to exclude certain parts from an 
> HTML page ;-)

I've seen this request come up in the recent past, in fact.  And it's a 
perfectly reasonable one, especially if you are in charge of the HTML.

	Erik




Re: or

Posted by Michael Wechner <mi...@wyona.org>.
Kelvin Tan wrote:

>My suggestion would be to modify HTMLParser to do the job. Don't think it's 
>very difficult. I'm unaware of any existing HTML Parsers which support that 
>functionality...
>

Maybe Erik wants to include an "improved" version of my code snippet 
into CVS.

I guess I am not the only one wanting to exclude certain parts from an 
HTML page ;-)

All the best

Michael

>
>
>Regards,
>Kelvin
>
>--------
>The book giving manifesto     - http://how.to/sharethisbook
>
>
>On Thu, 30 Jan 2003 10:56:50 +0100, Michael Wechner said:
>  
>
>>Hi
>>
>>I am looking for an HTMLParser which skips text tagged by
>>
>><no-index>  or something similar. This way I could exclude for
>>instance a "global navigation section" within the HTML
>>
>><no-index> International<br> Business<br> Science<br> ...
>></no-index>
>>
>>It seems that the current demo/HTMLParser
>>(http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.indexing&toc=faq#q11)
>>is not capable of doing something like that.
>>
>>Any pointers are very welcome.
>>
>>Thanks a lot
>>
>>Michael


Re: or

Posted by Kelvin Tan <ke...@relevanz.com>.
My suggestion would be to modify HTMLParser to do the job. Don't think it's 
very difficult. I'm unaware of any existing HTML Parsers which support that 
functionality...
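
[Editor's sketch] Short of modifying HTMLParser itself, a cruder route is to strip the marked regions from the raw HTML before handing it to whatever parser is in use. The following is only a preprocessing sketch under stated assumptions: the class name NoIndexStripper is hypothetical, the <no-index> tag is the made-up tag from the original question, nested <no-index> regions are not handled, and an unclosed <no-index> is left untouched.

```java
import java.util.regex.Pattern;

class NoIndexStripper {
    // Matches <no-index> ... </no-index> across line breaks, ignoring case.
    // The reluctant .*? stops at the first closing tag.
    private static final Pattern NO_INDEX = Pattern.compile(
            "<no-index\\s*>.*?</no-index\\s*>",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Replaces each marked region with a single space so the words on
    // either side do not run together.
    static String strip(String html) {
        return NO_INDEX.matcher(html).replaceAll(" ");
    }

    public static void main(String[] args) {
        String html = "<p>Intro</p><no-index>International<br>Business<br></no-index><p>Story</p>";
        System.out.println(strip(html));
    }
}
```

Because this works on the raw text, it does not care whether the remaining HTML is valid, but it also knows nothing about comments or script blocks that happen to contain the same tag text.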


Regards,
Kelvin

--------
The book giving manifesto     - http://how.to/sharethisbook


On Thu, 30 Jan 2003 10:56:50 +0100, Michael Wechner said:
>Hi
>
>I am looking for an HTMLParser which skips text tagged by
>
><no-index>  or something similar. This way I could exclude for
>instance a "global navigation section" within the HTML
>
><no-index> International<br> Business<br> Science<br> ...
></no-index>
>
>It seems that the current demo/HTMLParser
>(http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.indexing&toc=faq#q11)
>is not capable of doing something like that.
>
>Any pointers are very welcome.
>
>Thanks a lot
>
>Michael