Posted to java-user@lucene.apache.org by Erik Hatcher <er...@ehatchersolutions.com> on 2005/02/18 21:31:05 UTC

Re: Lucene in the Humanities

And before too many replies happen on this thread, I've corrected the 
spelling mistake in the subject!  :O


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

Re: Analyzing Advise

Posted by Luke Shannon <ls...@futurebrand.com>.

This is exactly what I was looking for.

Thanks

----- Original Message ----- 
From: "Steven Rowe" <sa...@syr.edu>
To: "Lucene Users List" <lu...@jakarta.apache.org>
Sent: Friday, February 18, 2005 4:41 PM
Subject: Re: Analyzing Advise


> Luke Shannon wrote:
> > But now that I'm looking at the API I'm not sure I can specify a
> > different analyzer when creating a field.
>
> Is PerFieldAnalyzerWrapper what you're looking for?
>
>
> <URL:http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/analysis/PerFieldAnalyzerWrapper.html>
>
> Steve
>




Re: Analyzing Advise

Posted by Steven Rowe <sa...@syr.edu>.

Luke Shannon wrote:
> But now that I'm looking at the API I'm not sure I can specify a
> different analyzer when creating a field.

Is PerFieldAnalyzerWrapper what you're looking for?

<URL:http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/analysis/PerFieldAnalyzerWrapper.html>

Steve
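
As a rough illustration of the idea behind a per-field wrapper, here is a
plain-Java sketch that runs without Lucene on the classpath: keep one default
analysis function plus a map of per-field overrides, and look the field up at
analysis time. Everything in it (PerFieldDispatch, the lowercase-and-split
default, the keyword-style override, the field names "body" and "path") is a
hypothetical stand-in, not Lucene's API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: choose an analysis function by field name,
// falling back to a default for fields with no override.
final class PerFieldDispatch {
    private final Function<String, List<String>> defaultAnalyzer;
    private final Map<String, Function<String, List<String>>> perField = new HashMap<>();

    PerFieldDispatch(Function<String, List<String>> defaultAnalyzer) {
        this.defaultAnalyzer = defaultAnalyzer;
    }

    // Register an override for one field.
    void addAnalyzer(String field, Function<String, List<String>> analyzer) {
        perField.put(field, analyzer);
    }

    // Analyze text with the field's override if present, else the default.
    List<String> analyze(String field, String text) {
        return perField.getOrDefault(field, defaultAnalyzer).apply(text);
    }

    // Stand-in default: lowercase and split on whitespace.
    static List<String> lowercaseWords(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return Arrays.asList(trimmed.toLowerCase(Locale.ROOT).split("\\s+"));
    }

    // Stand-in override: keep the whole value as a single, unaltered token.
    static List<String> keyword(String text) {
        return Collections.singletonList(text);
    }

    public static void main(String[] args) {
        PerFieldDispatch dispatch = new PerFieldDispatch(PerFieldDispatch::lowercaseWords);
        dispatch.addAnalyzer("path", PerFieldDispatch::keyword);
        System.out.println(dispatch.analyze("body", "Some Free Text"));
        System.out.println(dispatch.analyze("path", "/Docs/Report.PDF"));
    }
}
```

The point of the design is that only the one special field needs an entry in
the map; every other field keeps the default behavior.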


Analyzing Advise

Posted by Luke Shannon <ls...@futurebrand.com>.

Hi;

I'm having a situation where my synonyms weren't working for a particular
field.

When I looked at the indexing I noticed it was a Keyword, thus not
tokenized.

The problem is that when I switched that field to Text (now tokenized with my
SynonymAnalyzer), a bunch of queries broke that were testing for values
starting with or ending with a specific string. My SynonymAnalyzer wraps
a StandardAnalyzer, which acts as I would like for all fields but this one. I
don't want to change the behavior for all tokenizing. Only this one field's
data must remain unaltered.

I was hoping to make an Analyzer that just applied the synonyms, which I
could use on the one field when I added it to the Document. But now
that I'm looking at the API, I'm not sure I can specify a different analyzer
when creating a field.

Any tips?

Thanks,

Luke




Re: Lucene in the Humanities

Posted by Paul Elschot <pa...@xs4all.nl>.

On Saturday 19 February 2005 11:02, Erik Hatcher wrote:
> 
> On Feb 19, 2005, at 3:52 AM, Paul Elschot wrote:
> >>> By lowercasing the querytext and searching in title_lc ?
> >>
> >> Well sure, but how about this query:
> >>
> >> 	title:Something AND anotherField:someOtherValue
> >>
> >> QueryParser, as-is, won't be able to do field-name swapping.  I could
> >> certainly apply that technique on all the structured queries that I
> >> build up with the API, but with QueryParser it is trickier.   I'm
> >> definitely open for suggestions on improving how case is handled.  The
> >
> > Overriding this (1.4.3 QueryParser.jj, line 286) might work:
> >
> > protected Query getFieldQuery(String field, String queryText)
> > throws ParseException { ... }
> >
> > It will be called by the parser for both parts of the query above, so one
> > could change the field depending on the requested type of search
> > and the field name in the query.
> 
> But that wouldn't work for any other type of query.... 
> title:somethingFuzzy~

To get that it would be necessary to override all query parser
methods that take a field argument.

> 
> Though now that I think more about it, a simple s/title:/title_orig:/ 
> before parsing would work, and of course make the default field 

In the overriding getFieldQuery method something like:

if (caseSensitiveSearch(field) && originalFieldIndexed(field)) {
  field = field + "_orig";
} else { //the other 3 cases
 ...
}
return super.getFieldQuery(field, queryText);

The if statement could be factored out for the other overriding methods.
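
A minimal sketch of that factoring, in plain Java with no Lucene dependency.
The class name FieldNameResolver and its two inputs (the set of fields that
have an "_orig" variant, and a case-sensitivity flag) are assumptions for
illustration; only the "_orig" branch from the snippet above is implemented,
and the other cases simply return the field unchanged here rather than
guessing at what was elided.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: one place that decides which indexed field a
// query-time field name should map to, so every overriding parser
// method can call resolve(field) before delegating to super.
final class FieldNameResolver {
    private final Set<String> fieldsWithOriginal;
    private final boolean caseSensitive;

    FieldNameResolver(Set<String> fieldsWithOriginal, boolean caseSensitive) {
        this.fieldsWithOriginal = fieldsWithOriginal;
        this.caseSensitive = caseSensitive;
    }

    String resolve(String field) {
        if (caseSensitive && fieldsWithOriginal.contains(field)) {
            return field + "_orig";
        }
        // The remaining cases are left open in the original message;
        // this sketch passes the field name through untouched.
        return field;
    }

    public static void main(String[] args) {
        Set<String> withOriginal = new HashSet<>(Arrays.asList("title"));
        FieldNameResolver sensitive = new FieldNameResolver(withOriginal, true);
        System.out.println(sensitive.resolve("title"));
        System.out.println(sensitive.resolve("anotherField"));
    }
}
```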

> dynamic.   I need to evaluate how many fields would need to be done 
> this way - it'd be several.  Thanks for the food for thought!
> 
> >> only drawback now is that I'm duplicating indexes, but that is only an
> >> issue in how long it takes to rebuild the index from scratch (currently
> >> about 20 minutes or so on a good day - when the machine isn't swamped).
> >
> > Once the users get the hang of this, you might end up having to quadruple
> > the index, or more.
> 
> Why would that be?   They want a case sensitive/insensitive switch.  
> How would it expand beyond that?

With an index for every combination of fields and case sensitivity for these
fields.

Regards,
Paul Elschot



Re: Lucene in the Humanities

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On Feb 19, 2005, at 3:52 AM, Paul Elschot wrote:
>>> By lowercasing the querytext and searching in title_lc ?
>>
>> Well sure, but how about this query:
>>
>> 	title:Something AND anotherField:someOtherValue
>>
>> QueryParser, as-is, won't be able to do field-name swapping.  I could
>> certainly apply that technique on all the structured queries that I
>> build up with the API, but with QueryParser it is trickier.   I'm
>> definitely open for suggestions on improving how case is handled.  The
>
> Overriding this (1.4.3 QueryParser.jj, line 286) might work:
>
> protected Query getFieldQuery(String field, String queryText)
> throws ParseException { ... }
>
> It will be called by the parser for both parts of the query above, so one
> could change the field depending on the requested type of search
> and the field name in the query.

But that wouldn't work for any other type of query.... 
title:somethingFuzzy~

Though now that I think more about it, a simple s/title:/title_orig:/ 
before parsing would work, and of course make the default field 
dynamic.   I need to evaluate how many fields would need to be done 
this way - it'd be several.  Thanks for the food for thought!
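
For what it's worth, the s/title:/title_orig:/ idea could look roughly like
this in plain Java. FieldPrefixRewriter and its parameters are hypothetical
names, and the word-boundary match is an assumption added so that a field
such as "subtitle:" is left alone. It is a blunt textual substitution, so it
would also rewrite "title:" inside a quoted phrase.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: rewrite a field prefix in the raw query string
// before handing it to the query parser.
final class FieldPrefixRewriter {

    // Replace every "field:" that starts at a word boundary with "field<suffix>:".
    static String rewrite(String query, String field, String suffix) {
        Pattern prefix = Pattern.compile("\\b" + Pattern.quote(field) + ":");
        String replacement = Matcher.quoteReplacement(field + suffix + ":");
        return prefix.matcher(query).replaceAll(replacement);
    }

    public static void main(String[] args) {
        System.out.println(rewrite("title:Something AND anotherField:someOtherValue", "title", "_orig"));
        System.out.println(rewrite("title:somethingFuzzy~", "title", "_orig"));
    }
}
```

Because the rewrite happens before parsing, it applies equally to term,
fuzzy, and other query types without overriding any parser methods.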

>> only drawback now is that I'm duplicating indexes, but that is only an
>> issue in how long it takes to rebuild the index from scratch (currently
>> about 20 minutes or so on a good day - when the machine isn't swamped).
>
> Once the users get the hang of this, you might end up having to quadruple
> the index, or more.

Why would that be?   They want a case sensitive/insensitive switch.  
How would it expand beyond that?

	Erik



Re: Lucene in the Humanities

Posted by Paul Elschot <pa...@xs4all.nl>.

Erik,

On Saturday 19 February 2005 01:33, Erik Hatcher wrote:
> 
> On Feb 18, 2005, at 6:37 PM, Paul Elschot wrote:
> 
> > On Friday 18 February 2005 21:55, Erik Hatcher wrote:
> >>
> >> On Feb 18, 2005, at 3:47 PM, Paul Elschot wrote:
> >>
> >>> Erik,
> >>>
> >>> Just curious: it would seem easier to use multiple fields for the
> >>> original case and lowercase searching. Is there any particular reason
> >>> you analyzed the documents to multiple indexes instead of multiple
> >>> fields?
> >>
> >> I considered that approach, however to expose QueryParser I'd have to
> >> get tricky.  If I have title_orig and title_lc fields, how would I
> >> allow freeform queries of title:something?
> >
> > By lowercasing the querytext and searching in title_lc ?
> 
> Well sure, but how about this query:
> 
> 	title:Something AND anotherField:someOtherValue
> 
> QueryParser, as-is, won't be able to do field-name swapping.  I could 
> certainly apply that technique on all the structured queries that I 
> build up with the API, but with QueryParser it is trickier.   I'm 
> definitely open for suggestions on improving how case is handled.  The 

Overriding this (1.4.3 QueryParser.jj, line 286) might work:

protected Query getFieldQuery(String field, String queryText)
throws ParseException { ... }

It will be called by the parser for both parts of the query above, so one
could change the field depending on the requested type of search
and the field name in the query.

> only drawback now is that I'm duplicating indexes, but that is only an 
> issue in how long it takes to rebuild the index from scratch (currently 
> about 20 minutes or so on a good day - when the machine isn't swamped).

Once the users get the hang of this, you might end up having to quadruple
the index, or more.

Regards,
Paul Elschot



Re: Lucene in the Humanities

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On Feb 22, 2005, at 8:50 PM, Chris Hostetter wrote:

>
> : >>> Just curious: it would seem easier to use multiple fields for the
> : >>> original case and lowercase searching. Is there any particular reason
> : >>> you analyzed the documents to multiple indexes instead of multiple
> : >>> fields?
> : >>
> : >> I considered that approach, however to expose QueryParser I'd have to
> : >> get tricky.  If I have title_orig and title_lc fields, how would I
> : >> allow freeform queries of title:something?
>
> Why have separate fields?
>
> Why not index the title into the "title" field twice, once with each term
> lowercased and once with the case left alone? (Using an analyzer that
> tokenizes "The Quick BrOwN fox" as "[the] [quick] [brown] [fox] [The]
> [Quick] [BrOwN] [fox]")
>
> Then at search time, depending on the value of the checkbox, construct
> your QueryParser using the appropriate Analyzer.

I assume you mean to stack the tokens in the same positions, so it'd be 
like this:

	[the]	[quick]	[brown]	[fox]
	[The]	[Quick]	[BrOwN]	[fox]

Otherwise, if you simply string the tokens together as you show, then 
the phrase "fox The Quick" matches, even though it is not in the original 
document.  Though putting in a large gap between the two runs would do 
the trick in your example.
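
To make the stacking concrete, here is a small plain-Java sketch. It does
not use Lucene: StackedCaseTokens, Tok, and the naive matchesPhrase check
are hypothetical stand-ins, and the explicit position number plays the role
that a position increment of 0 would play for the second token of each pair.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch: emit a lowercased and an original-case token
// at the same position for every word.
final class StackedCaseTokens {

    static final class Tok {
        final String term;
        final int position;

        Tok(String term, int position) {
            this.term = term;
            this.position = position;
        }

        @Override
        public String toString() {
            return term + "@" + position;
        }
    }

    // Split on whitespace; both case variants of a word share one position.
    static List<Tok> tokenize(String text) {
        List<Tok> out = new ArrayList<>();
        int position = 0;
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            out.add(new Tok(word.toLowerCase(Locale.ROOT), position));
            out.add(new Tok(word, position));
            position++;
        }
        return out;
    }

    // Naive phrase check: the terms must occur at consecutive positions.
    static boolean matchesPhrase(List<Tok> tokens, String... phrase) {
        if (phrase.length == 0) {
            return false;
        }
        for (Tok start : tokens) {
            if (!start.term.equals(phrase[0])) {
                continue;
            }
            boolean all = true;
            for (int i = 1; i < phrase.length && all; i++) {
                all = hasTermAt(tokens, phrase[i], start.position + i);
            }
            if (all) {
                return true;
            }
        }
        return false;
    }

    private static boolean hasTermAt(List<Tok> tokens, String term, int position) {
        for (Tok t : tokens) {
            if (t.position == position && t.term.equals(term)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Tok> tokens = tokenize("The Quick BrOwN fox");
        System.out.println(tokens);
        System.out.println(matchesPhrase(tokens, "The", "Quick"));
        System.out.println(matchesPhrase(tokens, "fox", "The", "Quick"));
    }
}
```

With the tokens stacked this way, "The Quick" and "the quick" both match as
phrases, while "fox The Quick" does not, because nothing follows "fox".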

There is a fiddly issue with this technique that I'm not quite seeing 
at the moment, but I'll brainstorm on it and hopefully remember it or 
perhaps be proven wrong.

I'm Lucene-brain-dead.... I just did a presentation to our local Unix 
Users Group.    I built a man page indexer/searcher with PyLucene 
(thank you Andi!).  I had to learn Python as well, which was a good 
exercise, and learned lots from Andi's helpful private e-mails coaching 
me through my learning curve.  Now that I've seen the beast known as 
Python, I'm yearning for a Ruby version based on GCJ/SWIG.  A local 
Ruby guru and I are planning on meeting for a few hours each week to 
take a stab at it.  I'll commit whatever we do directly to a /ruby 
directory in Subversion.

Here's an example of my PyLucene output:

$ mansearch.py interface section:5
remote - remote host description file
rtadvd.conf - config file for router advertisement daemon
ipnat - IP NAT file format
groff_out - groff intermediate output format
xinetd.conf - Extended Internet Services Daemon configuration file
plist - property list format
racoon.conf - configuration file for racoon
ssh_config - OpenSSH SSH client configuration files
sudoers - list of which users may execute what

Even with custom formatting:

$ mansearch.py --format=#filename interface section:5
/usr/share/man/man5/remote.5
/usr/share/man/man5/rtadvd.conf.5
/usr/share/man/man5/ipnat.5
/usr/share/man/man5/groff_out.5
/usr/share/man/man5/xinetd.conf.5
/usr/share/man/man5/plist.5
/usr/share/man/man5/racoon.conf.5
/usr/share/man/man5/ssh_config.5
/usr/share/man/man5/sudoers.5

suitable for xargs :)

	Erik

>
> The only problem I can think of would be inflated scores for terms that
> are naturally lowercased, because they would wind up getting added to the
> index twice, but based on what I've seen of the data you are working
> with, I imagine that if you used UPPERCASE instead of lowercase you
> could drastically reduce the likelihood of any problems with that.
>
>
>
> -Hoss
>
>

Re: Lucene in the Humanities

Posted by Chris Hostetter <ho...@fucit.org>.

: >>> Just curious: it would seem easier to use multiple fields for the
: >>> original case and lowercase searching. Is there any particular reason
: >>> you analyzed the documents to multiple indexes instead of multiple
: >>> fields?
: >>
: >> I considered that approach, however to expose QueryParser I'd have to
: >> get tricky.  If I have title_orig and title_lc fields, how would I
: >> allow freeform queries of title:something?

Why have separate fields?

Why not index the title into the "title" field twice, once with each term
lowercased and once with the case left alone? (Using an analyzer that
tokenizes "The Quick BrOwN fox" as "[the] [quick] [brown] [fox] [The]
[Quick] [BrOwN] [fox]")

Then at search time, depending on the value of the checkbox, construct
your QueryParser using the appropriate Analyzer.

The only problem I can think of would be inflated scores for terms that
are naturally lowercased, because they would wind up getting added to the
index twice, but based on what I've seen of the data you are working
with, I imagine that if you used UPPERCASE instead of lowercase you
could drastically reduce the likelihood of any problems with that.



-Hoss



Re: Lucene in the Humanities

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On Feb 18, 2005, at 6:37 PM, Paul Elschot wrote:

> On Friday 18 February 2005 21:55, Erik Hatcher wrote:
>>
>> On Feb 18, 2005, at 3:47 PM, Paul Elschot wrote:
>>
>>> Erik,
>>>
>>> Just curious: it would seem easier to use multiple fields for the
>>> original case and lowercase searching. Is there any particular reason
>>> you analyzed the documents to multiple indexes instead of multiple
>>> fields?
>>
>> I considered that approach, however to expose QueryParser I'd have to
>> get tricky.  If I have title_orig and title_lc fields, how would I
>> allow freeform queries of title:something?
>
> By lowercasing the querytext and searching in title_lc ?

Well sure, but how about this query:

	title:Something AND anotherField:someOtherValue

QueryParser, as-is, won't be able to do field-name swapping.  I could 
certainly apply that technique on all the structured queries that I 
build up with the API, but with QueryParser it is trickier.   I'm 
definitely open for suggestions on improving how case is handled.  The 
only drawback now is that I'm duplicating indexes, but that is only an 
issue in how long it takes to rebuild the index from scratch (currently 
about 20 minutes or so on a good day - when the machine isn't swamped).

	Erik


Re: Lucene in the Humanities

Posted by Paul Elschot <pa...@xs4all.nl>.

On Friday 18 February 2005 21:55, Erik Hatcher wrote:
> 
> On Feb 18, 2005, at 3:47 PM, Paul Elschot wrote:
> 
> > Erik,
> >
> > Just curious: it would seem easier to use multiple fields for the
> > original case and lowercase searching. Is there any particular reason
> > you analyzed the documents to multiple indexes instead of multiple 
> > fields?
> 
> I considered that approach, however to expose QueryParser I'd have to 
> get tricky.  If I have title_orig and title_lc fields, how would I 
> allow freeform queries of title:something?

By lowercasing the querytext and searching in title_lc ?

Regards,
Paul Elschot.



Re: Lucene in the Humanities

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On Feb 18, 2005, at 3:47 PM, Paul Elschot wrote:

> Erik,
>
> Just curious: it would seem easier to use multiple fields for the
> original case and lowercase searching. Is there any particular reason
> you analyzed the documents to multiple indexes instead of multiple 
> fields?

I considered that approach, however to expose QueryParser I'd have to 
get tricky.  If I have title_orig and title_lc fields, how would I 
allow freeform queries of title:something?

	Erik

p.s. It's fun to see the types of queries folks have already tried 
since I sent this e-mail (repeated queries are possibly someone 
paging):

INFO: Query = +title:dog +archivetype:rap : hits = 3
INFO: Query = +title:dog +archivetype:rap : hits = 3
INFO: Query = +title:dog +archivetype:rap : hits = 3
INFO: Query = rosetti : hits = 3
INFO: Query = +year:[0000 TO 1911] +(archivetype:radheader OR archivetype:rap) : hits = 2182
INFO: Query = advil : hits = 0
INFO: Query = test : hits = 24
INFO: Query = td : hits = 1
INFO: Query = td : hits = 1
INFO: Query = woman : hits = 363
INFO: Query = woman : hits = 363
INFO: Query = hello : hits = 0
INFO: Query = +rosetta +archivetype:rap : hits = 0
INFO: Query = +year:[0000 TO 1911] +(archivetype:radheader OR archivetype:rap) : hits = 2182
INFO: Query = poem : hits = 316
INFO: Query = crisis : hits = 7
INFO: Query = "crisis at every moment" : hits = 1
INFO: Query = toy : hits = 41
INFO: Query = title:echer : hits = 0
INFO: Query = senori : hits = 0
INFO: Query = +dear +sirs : hits = 11
INFO: Query = title:more : hits = 0
INFO: Query = more : hits = 365
INFO: Query = title:rossetti : hits = 329
INFO: Query = +blessed +damozel : hits = 103
INFO: Query = title:test : hits = 0
INFO: Query = +test +archivetype:radheader : hits = 3
INFO: Query = "crisis at every moment" : hits = 1
INFO: Query = rome : hits = 70
INFO: Query = fdshjkfjkhkfad : hits = 0
INFO: Query = stone : hits = 153
INFO: Query = +title:shakespeare +archivetype:radheader : hits = 1
INFO: Query = title:"xx i ll" : hits = 0
INFO: Query = +dog +cat : hits = 6
INFO: Query = +year:[1280 TO 1305] +archivetype:radheader : hits = 0
INFO: Query = guru : hits = 0
INFO: Query = philosophy : hits = 14
INFO: Query = title:install : hits = 0
INFO: Query = +title:install +archivetype:radheader : hits = 0
INFO: Query = "help freeform.html" : hits = 0
INFO: Query = "help freeform.html" : hits = 0
INFO: Query = install : hits = 1
INFO: Query = life : hits = 554
INFO: Query = life : hits = 554
INFO: Query = life : hits = 554




Re: Lucene in the Humanities

Posted by Paul Elschot <pa...@xs4all.nl>.

Erik,

Just curious: it would seem easier to use multiple fields for the
original case and lowercase searching. Is there any particular reason
you analyzed the documents to multiple indexes instead of multiple fields?

Regards,
Paul Elschot

