Posted to java-user@lucene.apache.org by Terry Steichen <te...@net-frame.com> on 2002/10/18 17:39:50 UTC

Tags Screwing up Searches

Some content I'm indexing contains certain HTML tags, like <p>, <b>, <i>, etc.  What I find is that when a term I'm searching for touches one of these tags (which is fairly typical), the term isn't recognized and the search fails.  For example, <b>College Soccer</b> doesn't match on either "college" or "soccer".  I seem to recall someone else bringing up a similar problem with a word that ends a sentence (and is thus treated as if the period were part of the word), but I don't recall what the response was and I can't find that thread.

Does anyone have some ideas on what's the best way to handle this?  Filter out the tags in the process of creating the Document for indexing? Or through a modification to the Analyzer (I'm using the StandardAnalyzer)? Or something else?

TIA,

Terry

Re: Tags Screwing up Searches

Posted by Otis Gospodnetic <ot...@yahoo.com>.

Yes, with StandardAnalyzer, I believe, since neither < nor > is an
alphabetic character.

Otis

--- Terry Steichen <te...@net-frame.com> wrote:
> How should this be done (the translation, that is)?  If it were left
> as '<'
> and '>', would Lucene parse it properly?
> 
> Terry
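
To illustrate the point about non-alphabetic characters, here is a minimal sketch in plain Java. It is not Lucene's StandardTokenizer: the class name LetterSplitSketch and its tokens method are made up for this example, and the only rule it applies is "break on anything that is not a letter or digit, then lowercase". Under that assumption, raw brackets separate the tag name from the neighbouring words instead of gluing them together.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is NOT Lucene's tokenizer.
final class LetterSplitSketch {

    // Splits text into lowercase runs of letters/digits; every other
    // character (including '<', '>' and '/') acts as a separator.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                out.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            out.add(current.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [b, college, soccer, b]
        System.out.println(tokens("<b>College Soccer</b>"));
    }
}
```

With literal brackets the words come out as their own terms (along with a stray "b" for each tag), which is consistent with Otis's answer.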

--
To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>

Re: Tags Screwing up Searches

Posted by Terry Steichen <te...@net-frame.com>.

How should this be done (the translation, that is)?  If it were left as '<'
and '>', would Lucene parse it properly?

Terry


Re: Tags Screwing up Searches

Posted by Otis Gospodnetic <ot...@yahoo.com>.

Thanks for the update.
This all sounds right (no bugs).  The problem is the code that you have
that translates those < and > characters.

Otis


Re: Tags Screwing up Searches

Posted by Terry Steichen <te...@net-frame.com>.

Joshua,

To clarify: I require the capability to perform precise, structure-sensitive
searches - you can't do that very well with simple HTML, since a simple
full-text search won't suffice.  The content for the XML 'semantic tags' is
extracted from the original HTML with some complex, XPath-assisted logic.
In other words, that content isn't conveniently wrapped in tags in the
original HTML. I don't recall the exact number of elements in the resulting
XML structure, but it's around 30 (including some metadata that I add).
That's one of several reasons why the XML/DOM step is necessary.

Regards,

Terry

PS: The problem that caused me to ask my original question stems from the
fact that some of the extracted content (stored in a couple of the XML
sections) sometimes contains HTML tags (in the form of entities), and the
StandardTokenizer (which I'm using) doesn't ignore/remove them.
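
If the XML files are read back through an XML parser before indexing, the parser itself decodes the entities, so the field text arrives with literal brackets again. The sketch below uses only the JDK's DOM and XPath classes; the element names (article, headline) and the XmlFieldSketch class are invented for illustration and are not taken from Terry's actual schema.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical illustration; element names and class name are assumptions.
final class XmlFieldSketch {

    // Parses the XML and returns the string value of the first node
    // selected by the XPath expression (empty string if nothing matches).
    static String fieldText(String xml, String xpath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return XPathFactory.newInstance().newXPath().evaluate(xpath, doc);
        } catch (Exception e) {
            // Wrap checked parser/XPath exceptions to keep the sketch short.
            throw new IllegalStateException("could not read field " + xpath, e);
        }
    }

    public static void main(String[] args) {
        String xml = "<article><headline>"
                + "&lt;b&gt;College Soccer&lt;/b&gt;"
                + "</headline></article>";
        System.out.println(fieldText(xml, "/article/headline"));
    }
}
```

The returned string contains "<b>" and "</b>" as ordinary characters, so any tag stripping would have to happen on this decoded text before it is handed to the analyzer.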



Re: Tags Screwing up Searches

Posted by Joshua O'Madadhain <jm...@ics.uci.edu>.

On Mon, 21 Oct 2002, Terry Steichen wrote:

> Thanks for the comments - you might have something there.  What I do
> is clean up the HTML with JTidy and then parse it into a DOM.  Then I
> use selected parts to create a new DOM which I write out as an XML
> file.  I then use Lucene to index the XML files.  Upon retrieval, I
> once again parse the XML, format it and render it to a browser.
> 
> The conversion from brackets to entities is necessary in order for the
> browser (which will subsequently view it) to render it properly.
> 
> But maybe, in the indexing process, I could convert it back again (to
> brackets), but I'm not sure what to do with it then - in other words,
> how to bring an HTML parser into the picture.  If you have ideas on
> this, I'd very much appreciate hearing them.

Perhaps there is some reason for the conversion to XML that I'm not
understanding (and this isn't really within my area of expertise).  

But if your purpose is to index HTML files and then display them later in
response to a search, why not just use JTidy and then index the HTML
instead (skipping the DOM and XML stages entirely), and then return the
(cleaned-up) HTML later when asked for?  The basis of any 'semantic' tags
that you might be putting in the XML (perhaps to define Lucene fields)
must be there in the HTML anyway, so I'm not sure what the DOM and XML
representations get you.

Regards,

Joshua O'Madadhain

 jmadden@ics.uci.edu...Obscurium Per Obscurius...www.ics.uci.edu/~jmadden
  Joshua O'Madadhain: Information Scientist, Musician, Philosopher-At-Tall
 It's that moment of dawning comprehension that I live for--Bill Watterson
My opinions are too rational and insightful to be those of any organization.


Re: Tags Screwing up Searches

Posted by Terry Steichen <te...@net-frame.com>.

Joshua,

Thanks for the comments - you might have something there.  What I do is
clean up the HTML with JTidy and then parse it into a DOM.  Then I use
selected parts to create a new DOM which I write out as an XML file.  I then
use Lucene to index the XML files.  Upon retrieval, I once again parse the
XML, format it and render it to a browser.

The conversion from brackets to entities is necessary in order for the
browser (which will subsequently view it) to render it properly.

But maybe, in the indexing process, I could convert it back again (to
brackets), but I'm not sure what to do with it then - in other words, how to
bring an HTML parser into the picture.  If you have ideas on this, I'd very
much appreciate hearing them.
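
One way to apply this idea without a full HTML parser is to build a separate, tag-free copy of the text just for indexing and leave the stored XML untouched for display. The sketch below is only an illustration: TagStripSketch and textForIndexing are hypothetical names, only the '&lt;' and '&gt;' entities are handled, and a regular expression is a rough stand-in for a real HTML parser (it will not cope with comments, scripts, or '>' inside attribute values).

```java
import java.util.regex.Pattern;

// Hypothetical illustration; not a replacement for a real HTML parser.
final class TagStripSketch {

    // Matches simple opening/closing tags such as <b>, </b>, <p class="x">.
    private static final Pattern TAG = Pattern.compile("</?[A-Za-z][^>]*>");

    // Returns a copy of the stored text suitable for indexing:
    // entities become brackets again, tags are replaced by a space,
    // and runs of whitespace are collapsed.
    static String textForIndexing(String stored) {
        String decoded = stored.replace("&lt;", "<").replace("&gt;", ">");
        String noTags = TAG.matcher(decoded).replaceAll(" ");
        return noTags.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // prints College Soccer results
        System.out.println(
                textForIndexing("&lt;b&gt;College Soccer&lt;/b&gt; results"));
    }
}
```

Because the tags are replaced with a space rather than deleted outright, words on either side of a tag are not run together. Note that this version does not preserve character offsets.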

Regards,

Terry


Re: Tags Screwing up Searches

Posted by Joshua O'Madadhain <jm...@ics.uci.edu>.

On Mon, 21 Oct 2002, Terry Steichen wrote:

> I discovered that the actual text I was dealing with had already had
> the '<' converted to '&lt;', and so forth.  So the problem
> is that with something like '&lt;b&gt;College Soccer&lt;/b&gt;',
> Lucene recognizes the trailing semi-colon ';' as a word separator, so
> it can find the term 'college', but it does not see the ending of
> 'soccer'.  I did confirm that it *will* match on 'soccer&lt;' just
> fine.
> 
> I've proceeded to add a string substitution method which replaces
> '&lt;' with '    ' (four spaces, in order to hopefully keep the offsets
> straight). It appears to work, though I believe it slows down the
> indexing.
> 
> I don't know enough about the inner design of Lucene to figure this
> out, but it seems logical that there would be a much more efficient
> way to handle this than string operations.
> 
> PS: I've had no responses from the list, so perhaps this is a unique
> problem and doesn't justify a formal fix effort.

A few questions and comments; please pardon me if I am asking questions
answered in previous email:

(1) Are you using an analyzer that is designed to handle (a) HTML, or
(b) plain text?

(2) If (b), that's probably why you've been getting this kind of behavior,
and you may want to look at the HTMLParser sample code in the
distribution.  The StandardAnalyzer, I'm pretty sure, is not designed to
handle HTML.

(3) A quick and dirty solution for indexing HTML if you are running on
some flavor of Unix and don't want to figure out how to parse HTML
tags: the text web browser "lynx".  lynx can 'dump' the text from a web
page out as follows:

lynx -dump -nolist foo.html > foo.txt

This effectively strips the HTML tags out of foo.html and writes the text
of the page to the file foo.txt.

Once you've done this, of course, you can use the same analyzers that you
use for any unformatted text file.

Good luck--

Joshua O'Madadhain

 jmadden@ics.uci.edu...Obscurium Per Obscurius...www.ics.uci.edu/~jmadden
  Joshua O'Madadhain: Information Scientist, Musician, Philosopher-At-Tall
 It's that moment of dawning comprehension that I live for--Bill Watterson
My opinions are too rational and insightful to be those of any organization.


Re: Tags Screwing up Searches

Posted by Terry Steichen <te...@net-frame.com>.

Otis,

I discovered that the actual text I was dealing with had already had the '<'
converted to '&lt;', and so forth.  So the problem is that with
something like '&lt;b&gt;College Soccer&lt;/b&gt;', Lucene recognizes the
trailing semi-colon ';' as a word separator, so it can find the term
'college', but it does not see the ending of 'soccer'.  I did confirm that
it *will* match on 'soccer&lt;' just fine.
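
One possible explanation for this behaviour, offered as an assumption rather than something confirmed in the thread: the StandardTokenizer grammar of that era is understood to have included a rule for company names such as AT&T (letters, then '&' or '@', then letters). If so, 'Soccer&lt' would be kept together as a single term while ';' still ends it. The sketch below is plain Java, not Lucene code; CompanyRuleSketch and its tokens method are hypothetical and model only that one rule plus lowercasing.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of ONE tokenizer rule; this is not Lucene code.
final class CompanyRuleSketch {

    // Emits lowercase letter/digit runs, but keeps "run&run" or "run@run"
    // together as a single token (the assumed company-name rule).
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        int i = 0;
        int n = text.length();
        while (i < n) {
            if (!Character.isLetterOrDigit(text.charAt(i))) {
                i++; // separator: skip it
                continue;
            }
            int start = i;
            while (i < n && Character.isLetterOrDigit(text.charAt(i))) {
                i++;
            }
            // Assumed rule: letters, then '&' or '@', then more letters.
            boolean joiner = i + 1 < n
                    && (text.charAt(i) == '&' || text.charAt(i) == '@')
                    && Character.isLetterOrDigit(text.charAt(i + 1));
            if (joiner) {
                i++; // consume the '&' or '@'
                while (i < n && Character.isLetterOrDigit(text.charAt(i))) {
                    i++;
                }
            }
            out.add(text.substring(start, i).toLowerCase());
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [lt, b&gt, college, soccer&lt, b&gt]
        System.out.println(tokens("&lt;b&gt;College Soccer&lt;/b&gt;"));
    }
}
```

Under this model "college" is indexed on its own, while the last word is indexed as "soccer&lt", which would explain why 'soccer*' and 'soccer&lt;' match but plain 'soccer' does not.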

I've proceeded to add a string substitution method which replaces '&lt;'
with '    ' (four spaces, in order to hopefully keep the offsets straight).
It appears to work, though I believe it slows down the indexing.
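
The substitution described above can be sketched as follows. This is a guess at the approach, not Terry's actual code: EntityBlankerSketch and blankEntities are invented names, and handling '&gt;' alongside '&lt;' is an assumption. Overwriting characters in a single StringBuilder keeps the text length unchanged, so character offsets stay aligned, and it avoids creating a new String for every replacement.

```java
// Hypothetical illustration of same-length entity blanking.
final class EntityBlankerSketch {

    // Entities to blank out; the choice of these two is an assumption.
    private static final String[] ENTITIES = {"&lt;", "&gt;"};

    // Overwrites each entity with the same number of spaces, in place,
    // so the result has exactly the same length as the input.
    static String blankEntities(String text) {
        StringBuilder sb = new StringBuilder(text);
        for (String entity : ENTITIES) {
            int from = 0;
            int at;
            while ((at = sb.indexOf(entity, from)) >= 0) {
                for (int k = 0; k < entity.length(); k++) {
                    sb.setCharAt(at + k, ' ');
                }
                from = at + entity.length();
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String in = "&lt;b&gt;College Soccer&lt;/b&gt;";
        String out = blankEntities(in);
        System.out.println("[" + out + "]");
        System.out.println(in.length() == out.length()); // prints true
    }
}
```

The leftover tag names ("b", "/b") remain in the text as short separate words; whether that matters depends on how the index is queried.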

I don't know enough about the inner design of Lucene to figure this out, but
it seems logical that there would be a much more efficient way to handle
this than string operations.

Anyway, thought I'd bring you up to date.

Regards,

Terry

PS: I've had no responses from the list, so perhaps this is a unique problem
and doesn't justify a formal fix effort.


Re: Tags Screwing up Searches

Posted by Terry Steichen <te...@net-frame.com>.

Otis,

Doing some more testing, it turns out that it is the *trailing* tag that
screws things up.  Assume the text contains the phrase '<b>college
soccer</b>'.  This will match on 'college' or on 'soccer*', but not on
'soccer' or 'college soccer'.

I need to fix this quite soon.  In the absence of any better suggestions,
I'm just going to have to go in and either insert spaces or delete brackets
(ugh!).

Regards,

Terry

PS: Am I the only one that's having this problem?  If so, I must have
screwed up something.  If not, it could be a potentially serious bug.


Re: Tags Screwing up Searches

Posted by Terry Steichen <te...@net-frame.com>.

I tested against the phrase in my text, '<b>men's college soccer</b>',
matching successfully on 'college AND soccer*'.  However, I found no match
for 'college AND soccer', 'college AND soccer<*', 'college AND soccer<',
'college AND soccerb', 'college AND soccerb*', or 'college AND soccer/'.

Regards,

Terry


Re: Tags Screwing up Searches

Posted by Otis Gospodnetic <ot...@yahoo.com>.

Is it possible that the Analyzer is stripping <, >, and / characters
and leaving you with terms like: bCollege and Soccerb ?

Otis
