
Posted to java-user@lucene.apache.org by Fuad Efendi <fu...@efendi.ca> on 2010/02/05 22:48:45 UTC

RE: Wildcard searches????

Niclas,

I looked at your initial post: you are creating a document whose field value is literally "abc*".
That has nothing to do with a "wildcard query"!

Of course, the query [q=useragents:abcdefghijklm] will return no results, and neither will [q=useragents:abc], but [q=useragents:abc*] will return something.

text_rev is a SOLR field type meant for _leading_ wildcard queries; you don't need it (you don't need _leading_ wildcard queries).

At indexing time, instead of
<doc>
<useragents>
                Firefox*
                Mozilla/4.0*
</useragents>
</doc>


You should index
<doc>
<useragents>
	Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-2.1+Configuration/CLDC-1.1+JavaPlatform/JP8.4.1+UP.Link/6.3.1.20.0
</useragents>
</doc>

You also need to choose the SOLR field type properly; for instance, textTight or textgen, or even a non-tokenized string!


Then the query [q=useragents:moz*] will return this document (even if this field is non-tokenized).


-Fuad


P.S. Don't use * when you create a Lucene document; use it as part of the query.
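
The difference between a "*" stored in the index and a "*" used in the query can be illustrated with a small sketch. This is plain Python, not SOLR or Lucene code; the helper names (analyze, build_terms, exact_query, prefix_query) are hypothetical, and it assumes the field is analyzed as a single lowercased token. It also lowercases the query prefix, which a real query parser may not do.

```python
from bisect import bisect_left


def analyze(value):
    # Assumption: the field is analyzed as one lowercased token.
    return value.strip().lower()


def build_terms(values):
    # Sorted term list standing in for the field's term dictionary.
    return sorted(analyze(v) for v in values)


def exact_query(terms, text):
    # Exact term lookup, like [q=useragents:abc].
    return analyze(text) in terms


def prefix_query(terms, prefix):
    # Prefix lookup, like [q=useragents:abc*]: walk the sorted terms
    # starting at the first one that is >= the prefix.
    prefix = analyze(prefix)
    hits = []
    for term in terms[bisect_left(terms, prefix):]:
        if not term.startswith(prefix):
            break
        hits.append(term)
    return hits


FULL_UA = ("Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4"
           "+Profile/MIDP-2.1+Configuration/CLDC-1.1"
           "+JavaPlatform/JP8.4.1+UP.Link/6.3.1.20.0")

# Index built with a literal "*" inside the stored values.
star_terms = build_terms(["Firefox*", "Mozilla/4.0*"])
# Index built from the full user agent string.
full_terms = build_terms([FULL_UA])

# The stored "*" is just a character, so the full user agent finds nothing.
print(exact_query(star_terms, FULL_UA))
# A wildcard in the query matches the fully indexed user agent.
print(prefix_query(full_terms, "moz"))
```

In the sketch the stored "*" is an ordinary character of the term, which is why only the query-side prefix behaves like a wildcard.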




> -----Original Message-----
> From: Niclas Rothman [mailto:niro@lechill.com]
> Sent: February-05-10 4:44 PM
> To: general@lucene.apache.org
> Cc: java-user@lucene.apache.org
> Subject: RE: Wildcard searches????
> 
> Ted, I'm using SOLR, but I can't figure out which field type I should
> use to get a query like this to work:
> 
> 
> q=useragents: abcdefghijklm
> 
> 
> where I have in my index one document with value "abc" in field
> "useragents"
> 
> That query results in 0 hits.
> 
> If I issue this I get 1 hit, of course (exact match)
> 
> q=useragents: Mozilla
> 
> 
> My document definition in SOLR looks like:
> 
> <fields>
>     <field name="id" type="tint" indexed="true" stored="true"
> required="true" />
>     <field name="useragents" type="text_rev" indexed="true"
> stored="true" required="false" multiValued="true" />
> </fields>
> 
> Any clue?
> 
> Nic
> 
> 
> 
> 
> -----Original Message-----
> From: Ted Dunning [mailto:ted.dunning@gmail.com]
> Sent: 05 February 2010 21:18
> To: general@lucene.apache.org
> Cc: java-user@lucene.apache.org
> Subject: Re: Wildcard searches????
> 
> This is quite close.  You will have to break down the user agent that is
> your query into the same kinds of pieces as you did for your index.  Lucene
> will only do exact matching of terms during searching (wildcard queries are
> handled by exploding the term into all possible variants).
> 
> Regarding the field type, you will probably have to customize that a fair
> bit to make +'s be separators and such.  If you use SOLR to index and query
> your data, then it will make sure that your separation into tokens is
> compatible unless you are using shortened forms like you mention here.
> 
> On Fri, Feb 5, 2010 at 12:03 PM, Niclas Rothman <ni...@lechill.com>
> wrote:
> 
> > Hi again Ted and many thanks for your efforts.
> > Ok, just to be sure that we fully understand each other:
> >
> > In my index I will store partial useragents without any wildcards (*), e.g.
> >
> > Fire    (for Firefox)
> > Inte    (Internet Explorer)
> > Moz     (Mozilla)
> >
> >
> > At runtime I search my index for Media objects that are compatible
> > with a useragent, e.g.:
> >
> > "Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-2.1+Configuration/CLDC-1.1+JavaPlatform/JP-8.4.1+UP.Link/6.3.1.20.0"
> >
> > Hopefully Lucene / Solr will serve me all Media objects that partially
> > match my full user agent string, and perhaps also some mismatches. To be
> > absolutely sure that I only show Media objects that are compatible, I will
> > have to loop through the result set in my program to do a final test and
> > exclude any mismatches.
> >
> > Is this what you are saying, Ted: that I can't do the whole process in Solr /
> > Lucene, and that I need to do the final test in my program (C#)?
> >
> > Also, I'm using Solr 1.4; what field type would you recommend for the
> > useragent (tokenized)?
> >
> > Okay, let's see what you have to say about this.
> > Please bear with me, I'm all new to Lucene and Solr!!
> >
> > Regards
> > Niclas
> >
> >
> >
> >
> > -----Original Message-----
> > From: Ted Dunning [mailto:ted.dunning@gmail.com]
> > Sent: 05 February 2010 20:43
> > To: general@lucene.apache.org
> > Cc: java-user@lucene.apache.org
> > Subject: Re: Wildcard searches????
> >
> > Yes.  I think you have it.
> >
> > To explain in a bit more detail, I think that you should store a tokenized
> > form of the user agents and should query using a tokenized form of your user
> > agent.  This will retrieve documents that have partial matches to the user
> > agent of interest.  Many of these matches, however, may not meet the
> > requirements of the wildcard expression in the documents.  As such, you will
> > need to look at each retrieved document in turn, retrieve its wildcard
> > expression, and test whether the original (untokenized) query satisfies the
> > wildcard.
> >
> > If your wildcards are all of a positive nature, as your example is, then this
> > should work pretty well.
> >
> > On Fri, Feb 5, 2010 at 9:09 AM, Niclas Rothman <ni...@lechill.com>
> wrote:
> >
> > > Hi Ted and thanks for all your efforts.
> > > Listen, I'm a little bit lost here trying to understand what you are trying
> > > to tell me :-)
> > >
> > > 1. I store my useragents in a field that is tokenized.
> > > 2. Then when I search, you are saying that I should "scan" down the matches
> > > via a SOLR function, or what?
> > > Are you referring to these functions in SOLR?
> > >
> > > http://wiki.apache.org/solr/FunctionQuery
> > >
> > >
> > > Sorry for not grasping this immediately!
> > >
> > > Regards Niclas
> > >
> > > -----Original Message-----
> > > From: Ted Dunning [mailto:ted.dunning@gmail.com]
> > > Sent: 05 February 2010 17:44
> > > To: general@lucene.apache.org
> > > Cc: java-user@lucene.apache.org
> > > Subject: Re: Wildcard searches????
> > >
> > > Tokenize your user agent strings, then store the tokenized form separately
> > > from the wildcard.  At retrieval time, scan down the matches and apply the
> > > wildcard from each document to your original query.  The SOLR function query
> > > might be useful for this, as would a custom hit collector.
> > >
> > > On Fri, Feb 5, 2010 at 7:57 AM, Niclas Rothman <ni...@lechill.com>
> wrote:
> > >
> > > > Hi there, I am facing a problem and would like to ask the community for
> > > > some help.
> > > >
> > > > In my index I store browser useragent values as "wildcarded" / partial,
> > > > which means that an indexed document should only be shown to an end user
> > > > if his browser's useragent matches a wildcarded useragent in my document.
> > > >
> > > > So what I have is actually a "reversed" matching: the wildcards are in my
> > > > document and NOT in my actual query.
> > > > Does anyone know if this "setup" is possible, e.g. to execute a query in
> > > > the style of:
> > > >
> > > > useragents:
> > > > "Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-2.1+Configuration/CLDC-1.1+JavaPlatform/JP-8.4.1+UP.Link/6.3.1.20.0"
> > > >
> > > > In this example I would have a hit because Mozilla/4.0* matches the
> > > > useragent.
> > > >
> > > > <doc>
> > > > <useragents>
> > > >                Firefox*
> > > >                Mozilla/4.0*
> > > > </useragents>
> > > > </doc>
> > > >
> > > >
> > > > Regards
> > > > Niclas
> > > >
> > >
> > >
> > >
> > > --
> > > Ted Dunning, CTO
> > > DeepDyve
> > >
> >
> >
> >
> > --
> > Ted Dunning, CTO
> > DeepDyve
> >
> 
> 
> 
> --
> Ted Dunning, CTO
> DeepDyve
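
The two-stage approach Ted describes in the quoted thread (retrieve candidates by shared tokens, then apply each document's wildcard to the original user agent) could be sketched roughly as follows. This is plain Python rather than a SOLR hit collector; the document set, the tokenize rule (treating "+", "/", "*" and whitespace as separators) and all function names are assumptions made for illustration.

```python
import re
from collections import defaultdict
from fnmatch import fnmatchcase


def tokenize(text):
    # Assumption: "+", "/", "*" and whitespace act as separators.
    return [t for t in re.split(r"[+/*\s]+", text.lower()) if t]


# Hypothetical documents: id -> wildcard patterns stored next to the tokens.
DOCS = {
    1: ["Firefox*", "Mozilla/4.0*"],
    2: ["Mozilla/5.0*"],
    3: ["Opera*"],
}

# Inverted index from token to the ids of the documents containing it.
INDEX = defaultdict(set)
for doc_id, patterns in DOCS.items():
    for pattern in patterns:
        for token in tokenize(pattern):
            INDEX[token].add(doc_id)


def candidates(user_agent):
    # Stage 1: any document sharing at least one token with the query.
    found = set()
    for token in tokenize(user_agent):
        found |= INDEX.get(token, set())
    return found


def search(user_agent):
    # Stage 2: keep only candidates whose stored wildcard really matches
    # the original, untokenized user agent.
    return sorted(
        doc_id for doc_id in candidates(user_agent)
        if any(fnmatchcase(user_agent, p) for p in DOCS[doc_id])
    )


UA = ("Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4"
      "+Profile/MIDP-2.1+Configuration/CLDC-1.1"
      "+JavaPlatform/JP8.4.1+UP.Link/6.3.1.20.0")

print(candidates(UA))
print(search(UA))
```

Document 2 shares the token "mozilla" with the query, so it is a candidate, but its pattern Mozilla/5.0* fails the final check; that is the kind of mismatch the second stage is there to exclude.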



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

RE: Wildcard searches????

Posted by Fuad Efendi <fu...@efendi.ca>.

Hi Niclas,


"generalization" of the user agent "without including the versions numbers"...

How will you separate Mozilla/5.0 (Browser) from Mozilla/5.0 (Googlebot)?

And, going to the root of the problem... why do you use SOLR in such a way? Is it a search service showing different content depending on browser type (WAP vs. HTML)???

If it is, you are implementing the so-called "business use case" improperly...

Search Engine Results Pages (SERP) should not depend on the User-Agent HTTP request header.

The raw response output, however, may depend on it, but that is not the SOLR/Lucene layer; it is an upper layer... The Tomcat Servlet Container, for instance, may generate different output depending on whether the client is a mobile device (WAP) or a browser (Mozilla compatible)...

I don't know the specifics of your use case... as Ted mentioned, it's much better to post SOLR-specific questions to solr-user@lucene.apache.org...


-Fuad



> -----Original Message-----
> From: Niclas Rothman [mailto:niro@lechill.com]
> Sent: February-05-10 6:12 PM
> To: general@lucene.apache.org
> Cc: java-user@lucene.apache.org
> Subject: RE: Wildcard searches????
> 
> Hi Fuad and thanks for your reply!
> 
> I now know that my first post took the wrong approach; I should not have the
> wildcard included in my index.
> 
> However, I can't do as you suggest and have the full user agent in the
> index; avoiding that is the whole idea, actually.
> 
> The reason is this: device manufacturers are literally
> spitting out new devices and updates all the time, which generates new
> user agents that are very similar; perhaps only a small version number
> differs.
> So what I need is a "generalization" of the user agent in my
> index: only the start of the useragent, without the
> version numbers.
> This way my index is "up to date" all the time, even when users with new
> version numbers access my search service; the version number isn’t
> significant in my app but is instead causing my problems....
> 
> Example:
> 
> I have 2 indexed documents whose useragent fields are
> partial:
> <doc>
> 	<id>1</id>
> 	<useragents>
>       	Firefox
>             Mozilla/4.0+SonyEricsson
> 	</useragents>
> </doc>
> <doc>
> 	<id>2</id>
> 	<useragents>
>       	Firefox
>             Mozilla/4.0+SonyEricsson
> 	</useragents>
> </doc>
> 
> User A searches my app with a user agent of:
> 
> 	Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-2.1+Configuration/CLDC-1.1+JavaPlatform/JP8.4.1+UP.Link/6.3.1.20.0
> 
> The search app will display both documents 1 and 2, because his user
> agent starts exactly with the user agent pattern in my documents.
> 
> 
> User B searches my app with a user agent of (please note that this user
> agent differs from User A's near the end: JP9.5.1 instead of
> JP8.4.1):
> 
> 	Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-2.1+Configuration/CLDC-1.1+JavaPlatform/JP9.5.1+UP.Link/6.3.1.20.0
> 
> The search app will also display both documents 1 and 2, because his user
> agent starts exactly with the user agent pattern in my documents,
> even though the version number of the Java platform differs between users A
> and B.
> 
> If we instead had a different index with FULL user agents, only User A
> would get documents returned; none of the documents' user agents would match
> User B's user agent because of the "silly" version number!!
> 
> <doc>
> 	<id>1</id>
> 	<useragents>
>       	Firefox
>             Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-2.1+Configuration/CLDC-1.1+JavaPlatform/JP8.4.1+UP.Link/6.3.1.20.0
> 	</useragents>
> </doc>
> <doc>
> 	<id>2</id>
> 	<useragents>
>       	Firefox
>             Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-2.1+Configuration/CLDC-1.1+JavaPlatform/JP8.4.1+UP.Link/6.3.1.20.0
> 	</useragents>
> </doc>
> 
> Can you see my problem?
> So the basic question is whether I can somehow run a query saying that a match
> should take place if the user's useragent starts with the value of a
> document's useragent.
> 
> In theory, a startsWith function / logic is easy enough to
> implement in C# / T-SQL, but how on earth should I do this in Solr /
> Lucene?????
> 
> Regards
> 
> Niclas
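
One way to think about the startsWith matching Niclas asks for is to turn it around: expand the incoming user agent into all of its prefixes and look each one up as an exact term. The sketch below is plain Python and only illustrates that logic; the documents, the "Opera/9" value and the function names are hypothetical. Whether a given SOLR setup can express the same thing (for example as an OR of exact terms against a non-tokenized field) is an assumption that would need checking.

```python
from collections import defaultdict

# Hypothetical documents: id -> generalized user agent prefixes.
DOCS = {
    1: ["Firefox", "Mozilla/4.0+SonyEricsson"],
    2: ["Firefox", "Mozilla/4.0+SonyEricsson"],
    3: ["Opera/9"],
}

# Exact-match index: stored prefix -> ids of the documents holding it.
PREFIX_INDEX = defaultdict(set)
for doc_id, stored in DOCS.items():
    for value in stored:
        PREFIX_INDEX[value].add(doc_id)


def prefixes(user_agent):
    # Every leading substring of the user agent, shortest first.
    return [user_agent[:n] for n in range(1, len(user_agent) + 1)]


def matching_docs(user_agent):
    # A document matches if one of its stored values equals some prefix
    # of the user agent, i.e. the user agent starts with that value.
    hits = set()
    for prefix in prefixes(user_agent):
        hits |= PREFIX_INDEX.get(prefix, set())
    return sorted(hits)


USER_A = ("Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4"
          "+Profile/MIDP-2.1+Configuration/CLDC-1.1"
          "+JavaPlatform/JP8.4.1+UP.Link/6.3.1.20.0")
USER_B = USER_A.replace("JP8.4.1", "JP9.5.1")

print(matching_docs(USER_A))
print(matching_docs(USER_B))
```

Because only the stored prefix has to be equal, users A and B get the same documents even though their Java platform versions differ.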





RE: Wildcard searches????

Posted by Fuad Efendi <fu...@efendi.ca>.

I understand this:

> So what I need is to have a "generalization" of the user agent in  my
> index


So we may end up with 5 - 10 different tokens. The mapping has to be hard-coded, for instance via a synonym dictionary or something similar (this is very easy in SOLR): WAP, HTML, etc., covering the most important agent attributes. It doesn't matter whether it is IE or Mozilla; what plays a role is, for instance, screen resolution, character encoding support, gzip support, etc.; WAP vs. HTML is very important.

But why???

I think we are giving bad advice without knowing the source of the problem (the use case)...


Obviously:
Niclas is trying to map thousands of User-Agent strings to a few tokens, both at indexing time and at query time.
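
A rough sketch of such a mapping, in plain Python rather than a SOLR synonym configuration: the same function would be applied to the document values at indexing time and to the incoming User-Agent at query time. The rule table is purely an assumption for illustration (in particular, which substrings indicate WAP or HTML), as are the names RULES and classify.

```python
# Hypothetical rules: substring of the lowercased User-Agent -> coarse token.
RULES = [
    ("sonyericsson", "WAP"),
    ("midp", "WAP"),
    ("firefox", "HTML"),
    ("msie", "HTML"),
]


def classify(user_agent):
    # Collapse a full User-Agent string into a small, sorted set of tokens.
    ua = user_agent.lower()
    tokens = {token for needle, token in RULES if needle in ua}
    return sorted(tokens) or ["UNKNOWN"]


USER_A = ("Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4"
          "+Profile/MIDP-2.1+Configuration/CLDC-1.1"
          "+JavaPlatform/JP8.4.1+UP.Link/6.3.1.20.0")
USER_B = USER_A.replace("JP8.4.1", "JP9.5.1")

# Both user agents collapse to the same token despite the version change.
print(classify(USER_A), classify(USER_B))
```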

Question:
Why use a multivalued field? {"Mozilla", "Firefox"} - can't we have a simple encoded value such as "MF"? - we need the use case...


...

(better to post on the SOLR list; it is just configuration settings, without hard coding...)




> -----Original Message-----
> From: Ted Dunning [mailto:ted.dunning@gmail.com]
> Sent: February-05-10 6:45 PM
> To: general@lucene.apache.org
> Cc: java-user@lucene.apache.org
> Subject: Re: Wildcard searches????
> 
> Fuad,
> 
> I think that you took Niclas's requirements backwards.  He wants a reverse
> wild-card search where the wildcard is in the document and the search query
> is more specific.
> 
> You are correct that leading wildcard is critical here.
> 
> On Fri, Feb 5, 2010 at 2:25 PM, Digy <di...@gmail.com> wrote:
> 
> > http://en.wikipedia.org/wiki/Crossposting
> >
> /
> > > > partial,
> > > > > >  which should be understood that an indexed document
> > > > > > should only be shown to end users if his browsers useragent
> > > matches a
> > > > > > wildcared usereragent in my document.
> > > > > >
> > > > > > So what I have Is actually a "reversed" matching, the
> wildcards
> > > are in
> > > > my
> > > > > > document and NOT in my actual query.
> > > > > > Does anyone know if this "setup" Is possible, e.g. to execute
> a
> > > query
> > > > in
> > > > > > style with:
> > > > > >
> > > > > > useragents:
> > > > > >
> > > > >
> > > >
> "Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-
> > > 2.1+Configuration/CLDC-1.1+JavaPlatform/JP-8.4.1+UP.Link/6.3.1.20.0"
> > > > > >
> > > > > > In this example I would have a hit because Mozilla/4.0*
> matches
> > > the
> > > > > > useragent.
> > > > > >
> > > > > > <doc>
> > > > > > <useragents>
> > > > > >                Firefox*
> > > > > >                Mozilla/4.0*
> > > > > > </useragents>
> > > > > > </doc>
> > > > > >
> > > > > >
> > > > > > Regards
> > > > > > Niclas
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Ted Dunning, CTO
> > > > > DeepDyve
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Ted Dunning, CTO
> > > > DeepDyve
> > > >
> > >
> > >
> > >
> > > --
> > > Ted Dunning, CTO
> > > DeepDyve
> >
> >
> >
> >
> 
> 
> --
> Ted Dunning, CTO
> DeepDyve

RE: Wildcard searches????

Posted by Fuad Efendi <fu...@efendi.ca>.

I understand this:

> So what I need is to have a "generalization" of the user agent in  my
> index


So we may end up with 5-10 different tokens. The mapping has to be hard-coded, for instance via a synonym dictionary or something similar (this is very easy in SOLR): WAP, HTML, etc., i.e. the most important agent attributes. It doesn't matter whether it is IE or Mozilla; what matters is, for instance, screen resolution, character encoding support, gzip support, etc.; WAP versus HTML is very important.

But why???

I think we are giving bad advice without knowing the source of the problem (the use case)...


Obviously:
Niclas is trying to map thousands of User-Agent strings to a few tokens, both at indexing time and at query time.

Question:
Why use a multivalued field? {"Mozilla", "Firefox"} - can't we have a simple encoded value "MF"? - we need the use case...


...

(better to post this on the SOLR list; it is just configuration settings, without hard coding...)
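
To make the "few tokens" idea above concrete, here is a minimal sketch in plain Java. Everything in it is an assumption for illustration: the class name UserAgentGeneralizer, the token names, and the matching rules (for example treating MIDP/CLDC/UP.Link hints as a WAP-class device) are hypothetical and not taken from any real device database or from SOLR itself. The same function would be applied to the stored agents at indexing time and to the visitor's agent at query time.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper: reduces a raw User-Agent string to a few coarse tokens. */
final class UserAgentGeneralizer {

    /**
     * Maps a full User-Agent string to a small set of tokens.
     * The rules are illustrative assumptions only.
     */
    static Set<String> generalize(String userAgent) {
        String ua = userAgent.toLowerCase(Locale.ROOT);
        Set<String> tokens = new LinkedHashSet<>();

        // Assumed rule: mobile Java profile hints mean a WAP-class device.
        if (ua.contains("midp") || ua.contains("cldc") || ua.contains("up.link")) {
            tokens.add("WAP");
        } else {
            tokens.add("HTML");
        }

        // Assumed vendor and browser tokens; version numbers are ignored on purpose.
        if (ua.contains("sonyericsson")) {
            tokens.add("SONYERICSSON");
        }
        if (ua.contains("netfront")) {
            tokens.add("NETFRONT");
        }
        return tokens;
    }

    public static void main(String[] args) {
        String userA = "Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-2.1"
                + "+Configuration/CLDC-1.1+JavaPlatform/JP8.4.1+UP.Link/6.3.1.20.0";
        String userB = userA.replace("JP8.4.1", "JP9.5.1");
        System.out.println(generalize(userA));
        System.out.println(generalize(userB));
    }
}
```

Because the version numbers never reach the token set, the two agents from the thread that differ only in JP8.4.1 versus JP9.5.1 end up with identical tokens, so a plain exact-match query on the token field would treat them the same.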




> -----Original Message-----
> From: Ted Dunning [mailto:ted.dunning@gmail.com]
> Sent: February-05-10 6:45 PM
> To: general@lucene.apache.org
> Cc: java-user@lucene.apache.org
> Subject: Re: Wildcard searches????
> 
> Fuad,
> 
> I think that you took Niclas requirements backwards.  He wants a reverse
> wild-card search where the wildcard is in the document and the search
> query
> is more specific.
> 
> You are correct that leading wildcard is critical here.
> 
> On Fri, Feb 5, 2010 at 2:25 PM, Digy <di...@gmail.com> wrote:
> 
> > http://en.wikipedia.org/wiki/Crossposting
> >
> > -----Original Message-----
> > From: Niclas Rothman [mailto:niro@lechill.com]
> > Sent: Saturday, February 06, 2010 12:12 AM
> > To: general@lucene.apache.org
> > Cc: java-user@lucene.apache.org
> > Subject: RE: Wildcard searches????
> >
> > Hi Fuad and thanks for your reply!
> >
> > The first post I know now was a wrong approach, I should not have the
> > wildcard included in my index.
> >
> > However, I can't do as you suggest, to have the full user agent in the
> > index, that’s the whole idea actually.
> >
> > The reason can be explained like this, device manufactures are
> literally
> > spitting out new devices and updates all the time which generates new
> user
> > agents that are very similar, perhaps only a small version number
> differs.
> > So what I need is to have a "generalization" of the user agent in  my
> > index, to only have the start of the useragent without including the
> > versions numbers.
> > This way my index are all the time "up to date" even if users with new
> > version numbers access my search service, which in my app isn’t
> significant
> > but instead causing my problems....
> >
> > Example:
> >
> > I have 2 Indexed documents where the documents useragent field are
> partial:
> > <doc>
> >        <id>1</id>
> >        <useragents>
> >        Firefox
> >            Mozilla/4.0+SonyEricsson
> >        </useragents>
> > </doc>
> > <doc>
> >        <id>2</id>
> >        <useragents>
> >        Firefox
> >            Mozilla/4.0+SonyEricsson
> >        </useragents>
> > </doc>
> >
> > User A searches my app with an user agent as:
> >
> >
> >  Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-
> 2.1+Configuration/CLDC-1.1+JavaPlatform/JP8.4.1+UP.Link/6.3.1.20.0
> >
> > The search app will display both document 1 and 2, because his user
> agent
> > starts exactly has the user agent pattern in my document.
> >
> >
> > User B searches my app with an user agent as (Please note that this
> user
> > agent differs in the near end from Users A (JP9.5.1 instead of
> JP8.4.1)):
> >
> >
> >  Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-
> 2.1+Configuration/CLDC-1.1+JavaPlatform/JP9.5.1+UP.Link/6.3.1.20.0
> >
> > The search app will also display both document 1 and 2, because his
> user
> > agent starts exactly has the user agent pattern in my document.
> > Even if the version number of the java platform differs between user A
> and
> >  B.
> >
> > If we now have a different index with FULL user agents, only User A
> would
> > have documents returned, none of the documents user agents matched
> Users B
> > user agent because of the "silly" version number!!
> >
> > <doc>
> >        <id>1</id>
> >        <useragents>
> >        Firefox
> >
> >  Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-
> 2.1+Configuration/CLDC-1.1+JavaPlatform/JP8.4.1+UP.Link/6.3.1.20.0
> >        </useragents>
> > </doc>
> > <doc>
> >        <id>2</id>
> >        <useragents>
> >        Firefox
> >
> >  Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-
> 2.1+Configuration/CLDC-1.1+JavaPlatform/JP8.4.1+UP.Link/6.3.1.20.0
> >        </useragents>
> > </doc>
> >
> > Can you see my problem?
> > So the basic thing is if I somehow can do a query saying that at match
> > should take place if a document useragent starts with the value of the
> users
> > useragent.
> >
> > In theory, having a startsWith "function / locig are easy enough to
> > implement in C# / T-SQL,  but how on earth should I do this in SolR /
> > Lucene?????
> >
> > Regards
> >
> > Niclas
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> > -----Original Message-----
> > From: Fuad Efendi [mailto:fuad@efendi.ca]
> > Sent: 05 February 2010 22:49
> > To: general@lucene.apache.org
> > Cc: java-user@lucene.apache.org
> > Subject: RE: Wildcard searches????
> >
> > Niclas,
> >
> > I looked at your initial post, you are creating document with field
> "abc*"
> > - nothing related to "wildcard query"!
> >
> > Of course, query [useragents:abcdefghijklm] will return no results,
> and
> > [q=useragents:abc] no results, but [q=useragents:abc*] will return
> > something.
> >
> > text_nav is specific SOLR type for _leading_ wildcard queries; you
> don't
> > need it (you don't need _leading_ wildcard queries).
> >
> > On indexing time, instead of
> > <doc>
> > <useragents>
> >                Firefox*
> >                Mozilla/4.0*
> > </useragents>
> > </doc>
> >
> >
> > You should index
> > <doc>
> > <useragents>
> >
> >  Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-
> 2.1+Configuration/CLDC-1.1+JavaPlatform/JP8.4.1+UP.Link/6.3.1.20.0
> > </useragents>
> > </doc>
> >
> > And also, you need to choose properly SOLR type; for instance,
> textTight or
> > textgen, or even non-tokenized string!
> >
> >
> > And, query [q=useragents:moz*] will return this document (even if this
> > field is nontokenized).
> >
> >
> > -Fuad
> >
> >
> > P.S. Don't use * when you create Lucene document; use it as part of
> query.
> >
> >
> >
> >
> > > -----Original Message-----
> > > From: Niclas Rothman [mailto:niro@lechill.com]
> > > Sent: February-05-10 4:44 PM
> > > To: general@lucene.apache.org
> > > Cc: java-user@lucene.apache.org
> > > Subject: RE: Wildcard searches????
> > >
> > > Ted im using SOLR, but I cant figure out what type of fieldtype I
> should
> > > use to get a query like this to work:
> > >
> > >
> > > q=useragents: abcdefghijklm
> > >
> > >
> > > where I have in my index one document with value "abc" in field
> > > "useragents"
> > >
> > > That query results in 0 hits.
> > >
> > > If I issue this I get 1 hit of course (exact mathch)
> > >
> > > q=useragents: Mozilla
> > >
> > >
> > > My document definition in SOLR looks like:
> > >
> > > <fields>
> > >     <field name="id" type="tint" indexed="true" stored="true"
> > > required="true" />
> > >     <field name="useragents" type="text_rev" indexed="true"
> > > stored="true" required="false" multiValued="true" />
> > > </fields>
> > >
> > > Any clue?
> > >
> > > Nic
> > >
> > >
> > >
> > >
> > > -----Original Message-----
> > > From: Ted Dunning [mailto:ted.dunning@gmail.com]
> > > Sent: 05 February 2010 21:18
> > > To: general@lucene.apache.org
> > > Cc: java-user@lucene.apache.org
> > > Subject: Re: Wildcard searches????
> > >
> > > This is quite close.  You will have to break down the user agent
> that is
> > > your query into the same kinds of pieces as you did for your index.
> > > Lucene
> > > will only do exact matching of terms during searching (wildcard
> queries
> > > are
> > > handled by exploding the term into all possible variants).
> > >
> > > Regarding the field type, you will probably have to customize that a
> > > fair
> > > bit to make +'s be separators and such.  If you use SOLR to index
> and
> > > query
> > > your data, then it will make sure that your separation into tokens
> is
> > > compatible unless you are using shortened forms like you mention
> here.
> > >
> > > On Fri, Feb 5, 2010 at 12:03 PM, Niclas Rothman <ni...@lechill.com>
> > > wrote:
> > >
> > > > Hi again Ted and many thanks for your efforts.
> > > > Ok, just to be sure that we fully understand each other:
> > > >
> > > > In my index I will store partial useragents without any wildcards
> *,
> > > e.g.
> > > >
> > > > Fire    (for Firefox)
> > > > Inte    (Internet Explorer)
> > > > Moz     (Mozill)
> > > >
> > > >
> > > > When I during runtime search my index for Media objects that are
> > > compatible
> > > > with a useragent,
> > > > e.g:
> > > >
> > > >
> > > >
> > >
> "Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-
> > > 2.1+Configuration/CLDC-1.1+JavaPlatform/JP-8.4.1+UP.Link/6.3.1.20.0"
> > > >
> > > > Hopefully lucene / solr will serve me with all Media objects that
> > > partially
> > > > math my full user agent string and also perhaps some mismatches.
> To be
> > > > absolutely sure that I only show Media objects that are
> compatible, I
> > > will
> > > > have to loop through the resultset in my program to do a final
> test
> > > and
> > > > exclude any mismatches.
> > > >
> > > > Is this what you are saying Ted, that I cant do the whole process
> in
> > > Solr /
> > > > Lucene, that I need to do the final test in my program (C#)?
> > > >
> > > > Also, Im using Solr 1.4, what fieldtype would you recommend to use
> for
> > > the
> > > > useragent ( tokenized)
> > > >
> > > > Okey, lets see what you have to say about this.
> > > > Please bear with me, im all new to lucene and solr!!
> > > >
> > > > Regards
> > > > Niclas
> > > >
> > > >
> > > >
> > > >
> > > > -----Original Message-----
> > > > From: Ted Dunning [mailto:ted.dunning@gmail.com]
> > > > Sent: 05 February 2010 20:43
> > > > To: general@lucene.apache.org
> > > > Cc: java-user@lucene.apache.org
> > > > Subject: Re: Wildcard searches????
> > > >
> > > > Yes.  I think you have it.
> > > >
> > > > To explain in a bit more detail, I think that you should store a
> > > tokenized
> > > > form of the user agents and should query using a tokenized form of
> > > your
> > > > user
> > > > agent.  This will retrieve documents that have partial matches to
> the
> > > user
> > > > agent of interest.  Many of these matches, however, may not meet
> the
> > > > requirements of the wildcard expression in the documents.  As
> such,
> > > you
> > > > will
> > > > need to look at each retrieved document to retrieve the wild
> > > expression
> > > > from
> > > > each one in turn to test if the original (untokenized) query
> satisfies
> > > the
> > > > wildcard.
> > > >
> > > > If your wildcards are all of a positive nature as your example is,
> > > then
> > > > this
> > > > should work pretty well.
> > > >
> > > > On Fri, Feb 5, 2010 at 9:09 AM, Niclas Rothman <ni...@lechill.com>
> > > wrote:
> > > >
> > > > > Hi Ted and thanks for all your efforts.
> > > > > Listen im a little bit lost here trying to understand what you
> are
> > > trying
> > > > > to tell me :-)
> > > > >
> > > > > 1. I Store my useragents in a field that is tokenized.
> > > > > 2. Then when I search, you are saying that I should "scan" down
> the
> > > > matches
> > > > > via a SOLR function, or what?
> > > > > Are you referring to these functions in SOLR?
> > > > >
> > > > > http://wiki.apache.org/solr/FunctionQuery
> > > > >
> > > > >
> > > > > Sorry for not grasping immmediatley!
> > > > >
> > > > > Regards Niclas
> > > > >
> > > > > -----Original Message-----
> > > > > From: Ted Dunning [mailto:ted.dunning@gmail.com]
> > > > > Sent: 05 February 2010 17:44
> > > > > To: general@lucene.apache.org
> > > > > Cc: java-user@lucene.apache.org
> > > > > Subject: Re: Wildcard searches????
> > > > >
> > > > > Tokenize your user agent strings, then store the tokenized form
> > > > separately
> > > > > from the wild card.  At retrieval time, scan down the matches
> and
> > > apply
> > > > the
> > > > > wildcard from each document to your original query.  The SOLR
> > > function
> > > > > query
> > > > > might be useful for this as would be a custom hit collector.
> > > > >
> > > > > On Fri, Feb 5, 2010 at 7:57 AM, Niclas Rothman
> <ni...@lechill.com>
> > > wrote:
> > > > >
> > > > > > Hi there, i facing a problem and would like to ask the
> community
> > > for
> > > > some
> > > > > > help.
> > > > > >
> > > > > > In my index I store browser  useragent values as "wildcarded"
> /
> > > > partial,
> > > > > >  which should be understood that an indexed document
> > > > > > should only be shown to end users if his browsers useragent
> > > matches a
> > > > > > wildcared usereragent in my document.
> > > > > >
> > > > > > So what I have Is actually a "reversed" matching, the
> wildcards
> > > are in
> > > > my
> > > > > > document and NOT in my actual query.
> > > > > > Does anyone know if this "setup" Is possible, e.g. to execute
> a
> > > query
> > > > in
> > > > > > style with:
> > > > > >
> > > > > > useragents:
> > > > > >
> > > > >
> > > >
> "Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-
> > > 2.1+Configuration/CLDC-1.1+JavaPlatform/JP-8.4.1+UP.Link/6.3.1.20.0"
> > > > > >
> > > > > > In this example I would have a hit because Mozilla/4.0*
> matches
> > > the
> > > > > > useragent.
> > > > > >
> > > > > > <doc>
> > > > > > <useragents>
> > > > > >                Firefox*
> > > > > >                Mozilla/4.0*
> > > > > > </useragents>
> > > > > > </doc>
> > > > > >
> > > > > >
> > > > > > Regards
> > > > > > Niclas
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Ted Dunning, CTO
> > > > > DeepDyve
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Ted Dunning, CTO
> > > > DeepDyve
> > > >
> > >
> > >
> > >
> > > --
> > > Ted Dunning, CTO
> > > DeepDyve
> >
> >
> >
> >
> 
> 
> --
> Ted Dunning, CTO
> DeepDyve



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Wildcard searches????

Posted by Ted Dunning <te...@gmail.com>.

Fuad,

I think that you have Niclas's requirements backwards.  He wants a reverse
wildcard search, where the wildcard is in the document and the search query
is more specific.

You are correct that leading wildcard is critical here.
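
One way to picture the reverse match described above (pattern stored in the document, full string in the query) is to enumerate every prefix of the query string and look each one up as an exact term. The sketch below is plain Java only and makes no Lucene calls; the class ReversePrefixMatch and its method names are hypothetical. In Lucene, each generated prefix would presumably become one TermQuery inside a BooleanQuery with SHOULD clauses against a non-tokenized field, which is an assumption about how one might wire it up, not something tested here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of "reverse" prefix matching: the stored value is the prefix. */
final class ReversePrefixMatch {

    /** Returns all prefixes of the query that have at least minLen characters, shortest first. */
    static List<String> prefixesOf(String query, int minLen) {
        List<String> prefixes = new ArrayList<>();
        for (int end = Math.max(minLen, 1); end <= query.length(); end++) {
            prefixes.add(query.substring(0, end));
        }
        return prefixes;
    }

    /** Returns true if any stored pattern is a prefix of the full user agent. */
    static boolean matches(List<String> storedPrefixes, String userAgent) {
        for (String prefix : storedPrefixes) {
            if (userAgent.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> stored = Arrays.asList("Firefox", "Mozilla/4.0+SonyEricsson");
        String userAgent = "Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-2.1"
                + "+Configuration/CLDC-1.1+JavaPlatform/JP9.5.1+UP.Link/6.3.1.20.0";

        // Query side: one exact-match term per prefix of the visitor's user agent.
        System.out.println(prefixesOf(userAgent, 4).size() + " candidate terms");

        // Final check, equivalent to the startsWith test Niclas mentions.
        System.out.println(matches(stored, userAgent));
    }
}
```

A user agent of roughly 130 characters yields roughly 130 candidate terms, which stays well under the default BooleanQuery clause limit of 1024; whether that is fast enough for a given index is something to measure rather than assume.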

On Fri, Feb 5, 2010 at 2:25 PM, Digy <di...@gmail.com> wrote:

> http://en.wikipedia.org/wiki/Crossposting
>
> -----Original Message-----
> From: Niclas Rothman [mailto:niro@lechill.com]
> Sent: Saturday, February 06, 2010 12:12 AM
> To: general@lucene.apache.org
> Cc: java-user@lucene.apache.org
> Subject: RE: Wildcard searches????
>
> Hi Fuad and thanks for your reply!
>
> The first post I know now was a wrong approach, I should not have the
> wildcard included in my index.
>
> However, I can't do as you suggest, to have the full user agent in the
> index, that’s the whole idea actually.
>
> The reason can be explained like this, device manufactures are literally
> spitting out new devices and updates all the time which generates new user
> agents that are very similar, perhaps only a small version number differs.
> So what I need is to have a "generalization" of the user agent in  my
> index, to only have the start of the useragent without including the
> versions numbers.
> This way my index are all the time "up to date" even if users with new
> version numbers access my search service, which in my app isn’t significant
> but instead causing my problems....
>
> Example:
>
> I have 2 Indexed documents where the documents useragent field are partial:
> <doc>
>        <id>1</id>
>        <useragents>
>        Firefox
>            Mozilla/4.0+SonyEricsson
>        </useragents>
> </doc>
> <doc>
>        <id>2</id>
>        <useragents>
>        Firefox
>            Mozilla/4.0+SonyEricsson
>        </useragents>
> </doc>
>
> User A searches my app with an user agent as:
>
>
>  Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-2.1+Configuration/CLDC-1.1+JavaPlatform/JP8.4.1+UP.Link/6.3.1.20.0
>
> The search app will display both document 1 and 2, because his user agent
> starts exactly has the user agent pattern in my document.
>
>
> User B searches my app with an user agent as (Please note that this user
> agent differs in the near end from Users A (JP9.5.1 instead of JP8.4.1)):
>
>
>  Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-2.1+Configuration/CLDC-1.1+JavaPlatform/JP9.5.1+UP.Link/6.3.1.20.0
>
> The search app will also display both document 1 and 2, because his user
> agent starts exactly has the user agent pattern in my document.
> Even if the version number of the java platform differs between user A and
>  B.
>
> If we now have a different index with FULL user agents, only User A would
> have documents returned, none of the documents user agents matched Users B
> user agent because of the "silly" version number!!
>
> <doc>
>        <id>1</id>
>        <useragents>
>        Firefox
>
>  Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-2.1+Configuration/CLDC-1.1+JavaPlatform/JP8.4.1+UP.Link/6.3.1.20.0
>        </useragents>
> </doc>
> <doc>
>        <id>2</id>
>        <useragents>
>        Firefox
>
>  Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-2.1+Configuration/CLDC-1.1+JavaPlatform/JP8.4.1+UP.Link/6.3.1.20.0
>        </useragents>
> </doc>
>
> Can you see my problem?
> So the basic thing is if I somehow can do a query saying that at match
> should take place if a document useragent starts with the value of the users
> useragent.
>
> In theory, having a startsWith "function / locig are easy enough to
> implement in C# / T-SQL,  but how on earth should I do this in SolR /
> Lucene?????
>
> Regards
>
> Niclas
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> -----Original Message-----
> From: Fuad Efendi [mailto:fuad@efendi.ca]
> Sent: 05 February 2010 22:49
> To: general@lucene.apache.org
> Cc: java-user@lucene.apache.org
> Subject: RE: Wildcard searches????
>
> Niclas,
>
> I looked at your initial post, you are creating document with field "abc*"
> - nothing related to "wildcard query"!
>
> Of course, query [useragents:abcdefghijklm] will return no results, and
> [q=useragents:abc] no results, but [q=useragents:abc*] will return
> something.
>
> text_nav is specific SOLR type for _leading_ wildcard queries; you don't
> need it (you don't need _leading_ wildcard queries).
>
> On indexing time, instead of
> <doc>
> <useragents>
>                Firefox*
>                Mozilla/4.0*
> </useragents>
> </doc>
>
>
> You should index
> <doc>
> <useragents>
>
>  Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-2.1+Configuration/CLDC-1.1+JavaPlatform/JP8.4.1+UP.Link/6.3.1.20.0
> </useragents>
> </doc>
>
> And also, you need to choose properly SOLR type; for instance, textTight or
> textgen, or even non-tokenized string!
>
>
> And, query [q=useragents:moz*] will return this document (even if this
> field is nontokenized).
>
>
> -Fuad
>
>
> P.S. Don't use * when you create Lucene document; use it as part of query.
>
>
>
>
> > -----Original Message-----
> > From: Niclas Rothman [mailto:niro@lechill.com]
> > Sent: February-05-10 4:44 PM
> > To: general@lucene.apache.org
> > Cc: java-user@lucene.apache.org
> > Subject: RE: Wildcard searches????
> >
> > Ted im using SOLR, but I cant figure out what type of fieldtype I should
> > use to get a query like this to work:
> >
> >
> > q=useragents: abcdefghijklm
> >
> >
> > where I have in my index one document with value "abc" in field
> > "useragents"
> >
> > That query results in 0 hits.
> >
> > If I issue this I get 1 hit of course (exact mathch)
> >
> > q=useragents: Mozilla
> >
> >
> > My document definition in SOLR looks like:
> >
> > <fields>
> >     <field name="id" type="tint" indexed="true" stored="true"
> > required="true" />
> >     <field name="useragents" type="text_rev" indexed="true"
> > stored="true" required="false" multiValued="true" />
> > </fields>
> >
> > Any clue?
> >
> > Nic
> >
> >
> >
> >
> > -----Original Message-----
> > From: Ted Dunning [mailto:ted.dunning@gmail.com]
> > Sent: 05 February 2010 21:18
> > To: general@lucene.apache.org
> > Cc: java-user@lucene.apache.org
> > Subject: Re: Wildcard searches????
> >
> > This is quite close.  You will have to break down the user agent that is
> > your query into the same kinds of pieces as you did for your index.
> > Lucene
> > will only do exact matching of terms during searching (wildcard queries
> > are
> > handled by exploding the term into all possible variants).
> >
> > Regarding the field type, you will probably have to customize that a
> > fair
> > bit to make +'s be separators and such.  If you use SOLR to index and
> > query
> > your data, then it will make sure that your separation into tokens is
> > compatible unless you are using shortened forms like you mention here.
> >
> > On Fri, Feb 5, 2010 at 12:03 PM, Niclas Rothman <ni...@lechill.com>
> > wrote:
> >
> > > Hi again Ted and many thanks for your efforts.
> > > Ok, just to be sure that we fully understand each other:
> > >
> > > In my index I will store partial useragents without any wildcards *,
> > e.g.
> > >
> > > Fire    (for Firefox)
> > > Inte    (Internet Explorer)
> > > Moz     (Mozill)
> > >
> > >
> > > When I during runtime search my index for Media objects that are
> > compatible
> > > with a useragent,
> > > e.g:
> > >
> > >
> > >
> > "Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-
> > 2.1+Configuration/CLDC-1.1+JavaPlatform/JP-8.4.1+UP.Link/6.3.1.20.0"
> > >
> > > Hopefully lucene / solr will serve me with all Media objects that
> > partially
> > > math my full user agent string and also perhaps some mismatches. To be
> > > absolutely sure that I only show Media objects that are compatible, I
> > will
> > > have to loop through the resultset in my program to do a final test
> > and
> > > exclude any mismatches.
> > >
> > > Is this what you are saying Ted, that I cant do the whole process in
> > Solr /
> > > Lucene, that I need to do the final test in my program (C#)?
> > >
> > > Also, Im using Solr 1.4, what fieldtype would you recommend to use for
> > the
> > > useragent ( tokenized)
> > >
> > > Okey, lets see what you have to say about this.
> > > Please bear with me, im all new to lucene and solr!!
> > >
> > > Regards
> > > Niclas
> > >
> > >
> > >
> > >
> > > -----Original Message-----
> > > From: Ted Dunning [mailto:ted.dunning@gmail.com]
> > > Sent: 05 February 2010 20:43
> > > To: general@lucene.apache.org
> > > Cc: java-user@lucene.apache.org
> > > Subject: Re: Wildcard searches????
> > >
> > > Yes.  I think you have it.
> > >
> > > To explain in a bit more detail, I think that you should store a
> > tokenized
> > > form of the user agents and should query using a tokenized form of
> > your
> > > user
> > > agent.  This will retrieve documents that have partial matches to the
> > user
> > > agent of interest.  Many of these matches, however, may not meet the
> > > requirements of the wildcard expression in the documents.  As such,
> > you
> > > will
> > > need to look at each retrieved document to retrieve the wild
> > expression
> > > from
> > > each one in turn to test if the original (untokenized) query satisfies
> > the
> > > wildcard.
> > >
> > > If your wildcards are all of a positive nature as your example is,
> > then
> > > this
> > > should work pretty well.
> > >
> > > On Fri, Feb 5, 2010 at 9:09 AM, Niclas Rothman <ni...@lechill.com>
> > wrote:
> > >
> > > > Hi Ted and thanks for all your efforts.
> > > > Listen im a little bit lost here trying to understand what you are
> > trying
> > > > to tell me :-)
> > > >
> > > > 1. I Store my useragents in a field that is tokenized.
> > > > 2. Then when I search, you are saying that I should "scan" down the
> > > matches
> > > > via a SOLR function, or what?
> > > > Are you referring to these functions in SOLR?
> > > >
> > > > http://wiki.apache.org/solr/FunctionQuery
> > > >
> > > >
> > > > Sorry for not grasping immmediatley!
> > > >
> > > > Regards Niclas
> > > >
> > > > -----Original Message-----
> > > > From: Ted Dunning [mailto:ted.dunning@gmail.com]
> > > > Sent: 05 February 2010 17:44
> > > > To: general@lucene.apache.org
> > > > Cc: java-user@lucene.apache.org
> > > > Subject: Re: Wildcard searches????
> > > >
> > > > Tokenize your user agent strings, then store the tokenized form
> > > separately
> > > > from the wild card.  At retrieval time, scan down the matches and
> > apply
> > > the
> > > > wildcard from each document to your original query.  The SOLR
> > function
> > > > query
> > > > might be useful for this as would be a custom hit collector.
> > > >
> > > > On Fri, Feb 5, 2010 at 7:57 AM, Niclas Rothman <ni...@lechill.com>
> > wrote:
> > > >
> > > > > Hi there, i facing a problem and would like to ask the community
> > for
> > > some
> > > > > help.
> > > > >
> > > > > In my index I store browser  useragent values as "wildcarded" /
> > > partial,
> > > > >  which should be understood that an indexed document
> > > > > should only be shown to end users if his browsers useragent
> > matches a
> > > > > wildcared usereragent in my document.
> > > > >
> > > > > So what I have Is actually a "reversed" matching, the wildcards
> > are in
> > > my
> > > > > document and NOT in my actual query.
> > > > > Does anyone know if this "setup" Is possible, e.g. to execute a
> > query
> > > in
> > > > > style with:
> > > > >
> > > > > useragents:
> > > > >
> > > >
> > > "Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-
> > 2.1+Configuration/CLDC-1.1+JavaPlatform/JP-8.4.1+UP.Link/6.3.1.20.0"
> > > > >
> > > > > In this example I would have a hit because Mozilla/4.0* matches
> > the
> > > > > useragent.
> > > > >
> > > > > <doc>
> > > > > <useragents>
> > > > >                Firefox*
> > > > >                Mozilla/4.0*
> > > > > </useragents>
> > > > > </doc>
> > > > >
> > > > >
> > > > > Regards
> > > > > Niclas
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Ted Dunning, CTO
> > > > DeepDyve
> > > >
> > >
> > >
> > >
> > > --
> > > Ted Dunning, CTO
> > > DeepDyve
> > >
> >
> >
> >
> > --
> > Ted Dunning, CTO
> > DeepDyve
>
>
>
>


-- 
Ted Dunning, CTO
DeepDyve

RE: here a small test case problem:lucene did not delete old index file after optimize method called

Posted by lu...@sohu.com.

I see that there are three compound files:
0.cfs  6786kb
1.cfs  2044kb
2.cfs  8790kb (the optimized file)
I think that in this test case only 2.cfs (the optimized file) should be left. Is that right?

import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Fieldable;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.LockObtainFailedException;
import org.apache.lucene.util.Version;

public class ReopenTest {
    IndexWriter writer;

    public static void main(String[] args) throws CorruptIndexException,
            LockObtainFailedException, IOException {
        final IndexWriter writer = new IndexWriter(FSDirectory.open(new File("data")),
                new StandardAnalyzer(Version.LUCENE_CURRENT),
                IndexWriter.MaxFieldLength.LIMITED);
        writer.setMergeFactor(16);

        Thread writeThread = new Thread(new Runnable() {
            @Override
            public void run() {
                // write test data
                try {
                    for (int i = 0; i < 1000000; i++) {
                        Document doc = new Document();
                        Fieldable itemID = new Field("ItemID", String.valueOf(i),
                                Field.Store.YES, Field.Index.ANALYZED);

                        writer.addDocument(doc);
                    }

                    Thread.currentThread().sleep(10000);
                    System.out.println("optimize begin");
                    writer.optimize();
                    System.out.println("optimize end");
                } catch (CorruptIndexException e) {
                    e.printStackTrace();
                } catch (IOException e) {
                    e.printStackTrace();
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
            }
        });

        writeThread.start();

        Thread reopenThread = new Thread(new Runnable() {
            public void run() {
                try {
                    IndexReader reader = writer.getReader();
                    while (true) {
                        Thread.currentThread().sleep(1000);
                        IndexReader newReader = writer.getReader();
                        if (newReader != reader) {
                            reader.close();
                            reader = newReader;
                        }
                        System.out.println("132");
                    }
                } catch (InterruptedException e) {
                    e.printStackTrace();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        });

        reopenThread.start();
    }
}

RE: Wildcard searches????

Posted by Digy <di...@gmail.com>.

http://en.wikipedia.org/wiki/Crossposting

-----Original Message-----
From: Niclas Rothman [mailto:niro@lechill.com] 
Sent: Saturday, February 06, 2010 12:12 AM
To: general@lucene.apache.org
Cc: java-user@lucene.apache.org
Subject: RE: Wildcard searches????

Hi Fuad, and thanks for your reply!

I now know that my first post took the wrong approach: I should not include the wildcard in my index.

However, I can't do as you suggest and store the full user agent in the index; avoiding that is the whole idea.

The reason is this: device manufacturers are constantly releasing new devices and updates, which generates new user agents that are very similar to existing ones; perhaps only a minor version number differs.
So what I need is a "generalization" of the user agent in my index, storing only the start of the user agent without the version numbers.
This way my index stays up to date even when users with new version numbers access my search service. The version numbers are not significant in my app; they only cause me problems.

Example:

I have two indexed documents whose useragents field holds partial user agents:
<doc>
	<id>1</id>
	<useragents>
		Firefox
		Mozilla/4.0+SonyEricsson
	</useragents>
</doc>
<doc>
	<id>2</id>
	<useragents>
		Firefox
		Mozilla/4.0+SonyEricsson
	</useragents>
</doc>

User A searches my app with this user agent:

	Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-2.1+Configuration/CLDC-1.1+JavaPlatform/JP8.4.1+UP.Link/6.3.1.20.0

The search app will display both documents 1 and 2, because his user agent starts with exactly the user agent pattern stored in my documents.


User B searches my app with this user agent (note that it differs from User A's near the end: JP9.5.1 instead of JP8.4.1):

	Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-2.1+Configuration/CLDC-1.1+JavaPlatform/JP9.5.1+UP.Link/6.3.1.20.0

The search app will also display both documents 1 and 2, because his user agent starts with exactly the user agent pattern stored in my documents,
even though the Java platform version number differs between users A and B.

If we instead have an index with FULL user agents, only User A gets documents returned; none of the documents' user agents match User B's user agent, because of the "silly" version number!

<doc>
	<id>1</id>
	<useragents>
      	Firefox
            Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-2.1+Configuration/CLDC-1.1+JavaPlatform/JP8.4.1+UP.Link/6.3.1.20.0
	</useragents>
</doc>
<doc>
	<id>2</id>
	<useragents>
      	Firefox
            Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-2.1+Configuration/CLDC-1.1+JavaPlatform/JP8.4.1+UP.Link/6.3.1.20.0
	</useragents>
</doc>

Can you see my problem?
So the basic question is whether I can somehow write a query saying that a match should take place if the user's user agent starts with the value of a document's useragents field.

In theory, "startsWith" logic like this is easy enough to implement in C# / T-SQL, but how on earth should I do it in Solr / Lucene?
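
For reference, the matching rule described above is simple to express outside the index. The following is a minimal Java sketch, assuming the stored prefixes fit in memory. The class and method names are hypothetical; this is not a Solr or Lucene API, only a statement of the rule that a search solution would have to reproduce.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Solr or Lucene: a brute-force "reversed" prefix match.
class UserAgentPrefixMatcher {

    // Maps a document id to the partial user agents stored for that document.
    private final Map<Integer, List<String>> prefixesByDoc = new LinkedHashMap<>();

    void addDocument(int id, String... userAgentPrefixes) {
        prefixesByDoc.put(id, Arrays.asList(userAgentPrefixes));
    }

    // Returns the ids of all documents that store at least one prefix
    // with which the full user agent starts.
    List<Integer> match(String fullUserAgent) {
        List<Integer> hits = new ArrayList<>();
        for (Map.Entry<Integer, List<String>> entry : prefixesByDoc.entrySet()) {
            for (String prefix : entry.getValue()) {
                if (fullUserAgent.startsWith(prefix)) {
                    hits.add(entry.getKey());
                    break; // one matching prefix is enough for this document
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        UserAgentPrefixMatcher matcher = new UserAgentPrefixMatcher();
        matcher.addDocument(1, "Firefox", "Mozilla/4.0+SonyEricsson");
        matcher.addDocument(2, "Firefox", "Mozilla/4.0+SonyEricsson");
        // User B's user agent from the example: a newer JavaPlatform version still matches.
        String userB = "Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-2.1"
                + "+Configuration/CLDC-1.1+JavaPlatform/JP9.5.1+UP.Link/6.3.1.20.0";
        System.out.println(matcher.match(userB));
    }
}
```

This is a linear scan over every stored prefix, so it only illustrates the rule; it says nothing about how to make Solr or Lucene evaluate it efficiently.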

Regards

Niclas


-----Original Message-----
From: Fuad Efendi [mailto:fuad@efendi.ca] 
Sent: 05 February 2010 22:49
To: general@lucene.apache.org
Cc: java-user@lucene.apache.org
Subject: RE: Wildcard searches????

Niclas,

I looked at your initial post: you are creating a document with the field value "abc*",
which has nothing to do with a "wildcard query"!

Of course, the query [q=useragents:abcdefghijklm] will return no results, and [q=useragents:abc] will return no results either, but [q=useragents:abc*] will return something.

text_rev is a specific SOLR type for _leading_ wildcard queries; you don't need it (you don't need _leading_ wildcard queries).

At indexing time, instead of
<doc>
<useragents>
                Firefox*
                Mozilla/4.0*
</useragents>
</doc>


You should index
<doc>
<useragents>
	Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-2.1+Configuration/CLDC-1.1+JavaPlatform/JP8.4.1+UP.Link/6.3.1.20.0
</useragents>
</doc>

You also need to choose the SOLR type properly: for instance, textTight or textgen, or even the non-tokenized string type!


And the query [q=useragents:moz*] will return this document (even if this field is non-tokenized).


-Fuad


P.S. Don't use * when you create the Lucene document; use it as part of the query.
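
To illustrate why the * belongs in the query: as Ted notes further down the thread, a wildcard query is handled by expanding it into the indexed terms that match. Below is a minimal Java sketch of that expansion for a trailing wildcard, assuming the term dictionary is simply a sorted set of strings. The names are made up for illustration, and this is not Lucene's actual implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Illustration only: a toy term dictionary, not Lucene's real data structures.
class PrefixExpansion {

    // Returns every indexed term that starts with the given prefix,
    // i.e. the terms a query such as "moz*" would be expanded into.
    static List<String> expand(TreeSet<String> indexedTerms, String prefix) {
        List<String> matches = new ArrayList<>();
        // tailSet jumps to the first term >= prefix; terms sharing the prefix are contiguous.
        for (String term : indexedTerms.tailSet(prefix, true)) {
            if (!term.startsWith(prefix)) {
                break; // we have left the block of terms sharing the prefix
            }
            matches.add(term);
        }
        return matches;
    }

    public static void main(String[] args) {
        TreeSet<String> terms = new TreeSet<>();
        terms.add("firefox");
        terms.add("mozilla");
        terms.add("mozilla/4.0*"); // a '*' stored at index time is just another character
        terms.add("opera");
        // Expansion of the query moz*
        System.out.println(expand(terms, "moz"));
        // A full user agent used as the query is longer than any stored term
        System.out.println(expand(terms, "mozilla/4.0+sonyericsson"));
    }
}
```

The expansion only ever goes from a short query to longer indexed terms, which is why a long user agent in the query cannot match a shorter stored value such as "Mozilla/4.0*".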




> -----Original Message-----
> From: Niclas Rothman [mailto:niro@lechill.com]
> Sent: February-05-10 4:44 PM
> To: general@lucene.apache.org
> Cc: java-user@lucene.apache.org
> Subject: RE: Wildcard searches????
> 
> Ted, I'm using SOLR, but I can't figure out which field type I should
> use to get a query like this to work:
> 
> 
> q=useragents: abcdefghijklm
> 
> 
> where I have in my index one document with the value "abc" in the field
> "useragents".
> 
> That query results in 0 hits.
> 
> If I issue this I get 1 hit, of course (exact match):
> 
> q=useragents: Mozilla
> 
> 
> My document definition in SOLR looks like:
> 
> <fields>
>     <field name="id" type="tint" indexed="true" stored="true"
> required="true" />
>     <field name="useragents" type="text_rev" indexed="true"
> stored="true" required="false" multiValued="true" />
> </fields>
> 
> Any clue?
> 
> Nic
> 
> 
> 
> 
> -----Original Message-----
> From: Ted Dunning [mailto:ted.dunning@gmail.com]
> Sent: 05 February 2010 21:18
> To: general@lucene.apache.org
> Cc: java-user@lucene.apache.org
> Subject: Re: Wildcard searches????
> 
> This is quite close.  You will have to break down the user agent that is
> your query into the same kinds of pieces as you did for your index.  Lucene
> will only do exact matching of terms during searching (wildcard queries are
> handled by exploding the term into all possible variants).
> 
> Regarding the field type, you will probably have to customize that a fair
> bit to make +'s be separators and such.  If you use SOLR to index and query
> your data, then it will make sure that your separation into tokens is
> compatible unless you are using shortened forms like you mention here.
> 
> On Fri, Feb 5, 2010 at 12:03 PM, Niclas Rothman <ni...@lechill.com>
> wrote:
> 
> > Hi again Ted, and many thanks for your efforts.
> > OK, just to be sure that we fully understand each other:
> >
> > In my index I will store partial user agents without any wildcards (*), e.g.
> >
> > Fire    (for Firefox)
> > Inte    (Internet Explorer)
> > Moz     (Mozilla)
> >
> >
> > At runtime I search my index for Media objects that are compatible
> > with a user agent, e.g.:
> >
> > "Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-2.1+Configuration/CLDC-1.1+JavaPlatform/JP-8.4.1+UP.Link/6.3.1.20.0"
> >
> > Hopefully Lucene / Solr will serve me all Media objects that partially
> > match my full user agent string, and perhaps also some mismatches. To be
> > absolutely sure that I only show Media objects that are compatible, I will
> > have to loop through the result set in my program to do a final test and
> > exclude any mismatches.
> >
> > Is this what you are saying, Ted: that I can't do the whole process in Solr /
> > Lucene, and that I need to do the final test in my program (C#)?
> >
> > Also, I'm using Solr 1.4; which field type would you recommend for the
> > user agent (tokenized)?
> >
> > OK, let's see what you have to say about this.
> > Please bear with me, I'm all new to Lucene and Solr!
> >
> > Regards
> > Niclas
> >
> >
> >
> >
> > -----Original Message-----
> > From: Ted Dunning [mailto:ted.dunning@gmail.com]
> > Sent: 05 February 2010 20:43
> > To: general@lucene.apache.org
> > Cc: java-user@lucene.apache.org
> > Subject: Re: Wildcard searches????
> >
> > Yes.  I think you have it.
> >
> > To explain in a bit more detail, I think that you should store a tokenized
> > form of the user agents and should query using a tokenized form of your user
> > agent.  This will retrieve documents that have partial matches to the user
> > agent of interest.  Many of these matches, however, may not meet the
> > requirements of the wildcard expression in the documents.  As such, you will
> > need to look at each retrieved document to retrieve the wildcard expression
> > from each one in turn to test if the original (untokenized) query satisfies
> > the wildcard.
> >
> > If your wildcards are all of a positive nature as your example is, then this
> > should work pretty well.
> >
> > On Fri, Feb 5, 2010 at 9:09 AM, Niclas Rothman <ni...@lechill.com>
> wrote:
> >
> > > Hi Ted, and thanks for all your efforts.
> > > Listen, I'm a little bit lost here trying to understand what you are trying
> > > to tell me :-)
> > >
> > > 1. I store my user agents in a field that is tokenized.
> > > 2. Then when I search, you are saying that I should "scan" down the matches
> > > via a SOLR function, or what?
> > > Are you referring to these functions in SOLR?
> > >
> > > http://wiki.apache.org/solr/FunctionQuery
> > >
> > >
> > > Sorry for not grasping this immediately!
> > >
> > > Regards Niclas
> > >
> > > -----Original Message-----
> > > From: Ted Dunning [mailto:ted.dunning@gmail.com]
> > > Sent: 05 February 2010 17:44
> > > To: general@lucene.apache.org
> > > Cc: java-user@lucene.apache.org
> > > Subject: Re: Wildcard searches????
> > >
> > > Tokenize your user agent strings, then store the tokenized form separately
> > > from the wild card.  At retrieval time, scan down the matches and apply the
> > > wildcard from each document to your original query.  The SOLR function query
> > > might be useful for this as would be a custom hit collector.
> > >
> > > On Fri, Feb 5, 2010 at 7:57 AM, Niclas Rothman <ni...@lechill.com>
> wrote:
> > >
> > > > Hi there, I'm facing a problem and would like to ask the community for some
> > > > help.
> > > >
> > > > In my index I store browser user agent values in "wildcarded" / partial form,
> > > > meaning that an indexed document should only be shown to an end user
> > > > if his browser's user agent matches a wildcarded user agent in my document.
> > > >
> > > > So what I have is actually a "reversed" matching: the wildcards are in my
> > > > document and NOT in my actual query.
> > > > Does anyone know if this setup is possible, e.g. to execute a query like:
> > > >
> > > > useragents:
> > > > "Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-2.1+Configuration/CLDC-1.1+JavaPlatform/JP-8.4.1+UP.Link/6.3.1.20.0"
> > > >
> > > > In this example I would have a hit because Mozilla/4.0* matches the
> > > > user agent.
> > > >
> > > > <doc>
> > > > <useragents>
> > > >                Firefox*
> > > >                Mozilla/4.0*
> > > > </useragents>
> > > > </doc>
> > > >
> > > >
> > > > Regards
> > > > Niclas
> > > >
> > >
> > >
> > >
> > > --
> > > Ted Dunning, CTO
> > > DeepDyve
> > >
> >
> >
> >
> > --
> > Ted Dunning, CTO
> > DeepDyve
> >
> 
> 
> 
> --
> Ted Dunning, CTO
> DeepDyve
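
The two-step approach Ted describes in the quoted thread above (tokenize, retrieve candidates by exact term match, then apply each candidate's wildcard to the original query) can be sketched roughly as follows. This is an in-memory illustration under stated assumptions: tokens are split on '+', each document stores a single trailing-wildcard pattern, and all class and method names are invented for the example. It is not Solr or Lucene code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustration only: an in-memory stand-in for the index, not Solr or Lucene code.
class TwoPhaseUserAgentSearch {

    // token -> ids of documents whose pattern contains that token
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // document id -> original trailing-wildcard pattern, e.g. "Mozilla/4.0*"
    private final Map<Integer, String> patternByDoc = new HashMap<>();

    // Drops a trailing '*' if present.
    static String stripWildcard(String pattern) {
        return pattern.endsWith("*") ? pattern.substring(0, pattern.length() - 1) : pattern;
    }

    // Splits on '+' so that the index and the query are broken into the same kinds of pieces.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : stripWildcard(text).split("\\+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    void addDocument(int id, String pattern) {
        patternByDoc.put(id, pattern);
        for (String token : tokenize(pattern)) {
            postings.computeIfAbsent(token, k -> new LinkedHashSet<>()).add(id);
        }
    }

    List<Integer> search(String userAgent) {
        // Phase 1: exact term matching collects candidates, possibly with false positives.
        Set<Integer> candidates = new LinkedHashSet<>();
        for (String token : tokenize(userAgent)) {
            candidates.addAll(postings.getOrDefault(token, Collections.<Integer>emptySet()));
        }
        // Phase 2: test the untokenized query against each candidate's stored wildcard.
        List<Integer> hits = new ArrayList<>();
        for (int id : candidates) {
            if (userAgent.startsWith(stripWildcard(patternByDoc.get(id)))) {
                hits.add(id);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        TwoPhaseUserAgentSearch index = new TwoPhaseUserAgentSearch();
        index.addDocument(1, "Mozilla/4.0*");
        index.addDocument(2, "Firefox*");
        index.addDocument(3, "Opera/9.0+Profile/MIDP-2.1*"); // shares a token but not the prefix
        System.out.println(index.search(
                "Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-2.1"));
    }
}
```

Note the limitation Ted points out: a shortened pattern such as "Moz*" produces the token "Moz", which never equals the query token "Mozilla/4.0", so phase 1 would miss it. The sketch only works when a pattern contains at least one whole token.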

RE: Wildcard searches????

Posted by Digy <di...@gmail.com>.

http://en.wikipedia.org/wiki/Crossposting

-----Original Message-----
From: Niclas Rothman [mailto:niro@lechill.com] 
Sent: Saturday, February 06, 2010 12:12 AM
To: general@lucene.apache.org
Cc: java-user@lucene.apache.org
Subject: RE: Wildcard searches????

Hi Fuad and thanks for your reply!

The first post I know now was a wrong approach, I should not have the wildcard included in my index. 

However, I can't do as you suggest, to have the full user agent in the index, that’s the whole idea actually. 

The reason can be explained like this, device manufactures are literally spitting out new devices and updates all the time which generates new user agents that are very similar, perhaps only a small version number differs. 
So what I need is to have a "generalization" of the user agent in  my index, to only have the start of the useragent without including the versions numbers. 
This way my index are all the time "up to date" even if users with new version numbers access my search service, which in my app isn’t significant but instead causing my problems.... 

Example:

I have 2 Indexed documents where the documents useragent field are partial:
<doc>
	<id>1</id>
	<useragents>
      	Firefox
            Mozilla/4.0+SonyEricsson
	</useragents>
</doc>
<doc>
	<id>2</id>
	<useragents>
      	Firefox
            Mozilla/4.0+SonyEricsson
	</useragents>
</doc>

User A searches my app with an user agent as: 
	
	Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-2.1+Configuration/CLDC-1.1+JavaPlatform/JP8.4.1+UP.Link/6.3.1.20.0

The search app will display both document 1 and 2, because his user agent starts exactly has the user agent pattern in my document.


User B searches my app with an user agent as (Please note that this user agent differs in the near end from Users A (JP9.5.1 instead of JP8.4.1)): 
	
	Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-2.1+Configuration/CLDC-1.1+JavaPlatform/JP9.5.1+UP.Link/6.3.1.20.0

The search app will also display both document 1 and 2, because his user agent starts exactly has the user agent pattern in my document. 
Even if the version number of the java platform differs between user A and  B. 

If we now have a different index with FULL user agents, only User A would have documents returned, none of the documents user agents matched Users B user agent because of the "silly" version number!!

<doc>
	<id>1</id>
	<useragents>
      	Firefox
            Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-2.1+Configuration/CLDC-1.1+JavaPlatform/JP8.4.1+UP.Link/6.3.1.20.0
	</useragents>
</doc>
<doc>
	<id>2</id>
	<useragents>
      	Firefox
            Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-2.1+Configuration/CLDC-1.1+JavaPlatform/JP8.4.1+UP.Link/6.3.1.20.0
	</useragents>
</doc>

Can you see my problem?
So the basic thing is if I somehow can do a query saying that at match should take place if a document useragent starts with the value of the users useragent. 

In theory, having a startsWith "function / locig are easy enough to implement in C# / T-SQL,  but how on earth should I do this in SolR / Lucene?????

Regards

Niclas














-----Original Message-----
From: Fuad Efendi [mailto:fuad@efendi.ca] 
Sent: 05 February 2010 22:49
To: general@lucene.apache.org
Cc: java-user@lucene.apache.org
Subject: RE: Wildcard searches????

Niclas,

I looked at your initial post, you are creating document with field "abc*"
- nothing related to "wildcard query"!

Of course, query [useragents:abcdefghijklm] will return no results, and [q=useragents:abc] no results, but [q=useragents:abc*] will return something.

text_nav is specific SOLR type for _leading_ wildcard queries; you don't need it (you don't need _leading_ wildcard queries).

On indexing time, instead of
<doc>
<useragents>
                Firefox*
                Mozilla/4.0*
</useragents>
</doc>


You should index
<doc>
<useragents>
	Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-2.1+Configuration/CLDC-1.1+JavaPlatform/JP8.4.1+UP.Link/6.3.1.20.0
</useragents>
</doc>

And also, you need to choose properly SOLR type; for instance, textTight or textgen, or even non-tokenized string!


And, query [q=useragents:moz*] will return this document (even if this field is nontokenized).


-Fuad


P.S. Don't use * when you create Lucene document; use it as part of query.




> -----Original Message-----
> From: Niclas Rothman [mailto:niro@lechill.com]
> Sent: February-05-10 4:44 PM
> To: general@lucene.apache.org
> Cc: java-user@lucene.apache.org
> Subject: RE: Wildcard searches????
> 
> Ted im using SOLR, but I cant figure out what type of fieldtype I should
> use to get a query like this to work:
> 
> 
> q=useragents: abcdefghijklm
> 
> 
> where I have in my index one document with value "abc" in field
> "useragents"
> 
> That query results in 0 hits.
> 
> If I issue this I get 1 hit of course (exact mathch)
> 
> q=useragents: Mozilla
> 
> 
> My document definition in SOLR looks like:
> 
> <fields>
>     <field name="id" type="tint" indexed="true" stored="true"
> required="true" />
>     <field name="useragents" type="text_rev" indexed="true"
> stored="true" required="false" multiValued="true" />
> </fields>
> 
> Any clue?
> 
> Nic
> 
> 
> 
> 
> -----Original Message-----
> From: Ted Dunning [mailto:ted.dunning@gmail.com]
> Sent: 05 February 2010 21:18
> To: general@lucene.apache.org
> Cc: java-user@lucene.apache.org
> Subject: Re: Wildcard searches????
> 
> This is quite close.  You will have to break down the user agent that is
> your query into the same kinds of pieces as you did for your index.
> Lucene
> will only do exact matching of terms during searching (wildcard queries
> are
> handled by exploding the term into all possible variants).
> 
> Regarding the field type, you will probably have to customize that a
> fair
> bit to make +'s be separators and such.  If you use SOLR to index and
> query
> your data, then it will make sure that your separation into tokens is
> compatible unless you are using shortened forms like you mention here.
> 
> On Fri, Feb 5, 2010 at 12:03 PM, Niclas Rothman <ni...@lechill.com>
> wrote:
> 
> > Hi again Ted and many thanks for your efforts.
> > Ok, just to be sure that we fully understand each other:
> >
> > In my index I will store partial useragents without any wildcards *, e.g.
> >
> > Fire    (for Firefox)
> > Inte    (Internet Explorer)
> > Moz     (Mozilla)
> >
> >
> > When, at runtime, I search my index for Media objects that are compatible
> > with a useragent, e.g.:
> >
> > "Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-2.1+Configuration/CLDC-1.1+JavaPlatform/JP-8.4.1+UP.Link/6.3.1.20.0"
> >
> > Hopefully Lucene / Solr will serve me all Media objects that partially
> > match my full user agent string, and perhaps also some mismatches. To be
> > absolutely sure that I only show Media objects that are compatible, I will
> > have to loop through the result set in my program to do a final test and
> > exclude any mismatches.
> >
> > Is this what you are saying, Ted: that I can't do the whole process in
> > Solr / Lucene, and that I need to do the final test in my program (C#)?
> >
> > Also, I'm using Solr 1.4; what field type would you recommend for the
> > useragent (tokenized)?
> >
> > OK, let's see what you have to say about this.
> > Please bear with me, I'm all new to Lucene and Solr!!
> >
> > Regards
> > Niclas
> >
> >
> >
> >
> > -----Original Message-----
> > From: Ted Dunning [mailto:ted.dunning@gmail.com]
> > Sent: 05 February 2010 20:43
> > To: general@lucene.apache.org
> > Cc: java-user@lucene.apache.org
> > Subject: Re: Wildcard searches????
> >
> > Yes.  I think you have it.
> >
> > To explain in a bit more detail, I think that you should store a tokenized
> > form of the user agents and should query using a tokenized form of your
> > user agent.  This will retrieve documents that have partial matches to the
> > user agent of interest.  Many of these matches, however, may not meet the
> > requirements of the wildcard expression in the documents.  As such, you
> > will need to look at each retrieved document to retrieve the wildcard
> > expression from each one in turn to test if the original (untokenized)
> > query satisfies the wildcard.
> >
> > If your wildcards are all of a positive nature as your example is, then
> > this should work pretty well.
> >
> > On Fri, Feb 5, 2010 at 9:09 AM, Niclas Rothman <ni...@lechill.com>
> wrote:
> >
> > > Hi Ted and thanks for all your efforts.
> > > Listen, I'm a little bit lost here trying to understand what you are
> > > trying to tell me :-)
> > >
> > > 1. I store my useragents in a field that is tokenized.
> > > 2. Then when I search, you are saying that I should "scan" down the
> > > matches via a SOLR function, or what?
> > > Are you referring to these functions in SOLR?
> > >
> > > http://wiki.apache.org/solr/FunctionQuery
> > >
> > >
> > > Sorry for not grasping immediately!
> > >
> > > Regards Niclas
> > >
> > > -----Original Message-----
> > > From: Ted Dunning [mailto:ted.dunning@gmail.com]
> > > Sent: 05 February 2010 17:44
> > > To: general@lucene.apache.org
> > > Cc: java-user@lucene.apache.org
> > > Subject: Re: Wildcard searches????
> > >
> > > Tokenize your user agent strings, then store the tokenized form
> > > separately from the wild card.  At retrieval time, scan down the matches
> > > and apply the wildcard from each document to your original query.  The
> > > SOLR function query might be useful for this as would be a custom hit
> > > collector.
> > >
> > > On Fri, Feb 5, 2010 at 7:57 AM, Niclas Rothman <ni...@lechill.com>
> wrote:
> > >
> > > > Hi there, I am facing a problem and would like to ask the community for
> > > > some help.
> > > >
> > > > In my index I store browser useragent values as "wildcarded" / partial
> > > > values, meaning that an indexed document should only be shown to an end
> > > > user if his browser's useragent matches a wildcarded useragent in my
> > > > document.
> > > >
> > > > So what I have is actually a "reversed" matching: the wildcards are in my
> > > > document and NOT in my actual query.
> > > > Does anyone know if this "setup" is possible, e.g. to execute a query
> > > > in the style of:
> > > >
> > > > useragents:
> > > >
> > > > "Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-2.1+Configuration/CLDC-1.1+JavaPlatform/JP-8.4.1+UP.Link/6.3.1.20.0"
> > > >
> > > > In this example I would have a hit because Mozilla/4.0* matches the
> > > > useragent.
> > > >
> > > > <doc>
> > > > <useragents>
> > > >                Firefox*
> > > >                Mozilla/4.0*
> > > > </useragents>
> > > > </doc>
> > > >
> > > >
> > > > Regards
> > > > Niclas
> > > >
> > >
> > >
> > >
> > > --
> > > Ted Dunning, CTO
> > > DeepDyve
> > >
> >
> >
> >
> > --
> > Ted Dunning, CTO
> > DeepDyve
> >
> 
> 
> 
> --
> Ted Dunning, CTO
> DeepDyve
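
The two-phase approach Ted describes in the quoted thread (retrieve candidates by shared tokens, then apply each document's wildcard to the full user agent) might look roughly like the sketch below. It is plain in-memory Java, not Solr or Lucene code. The class name, the choice of '+' and '/' as token separators, and the rule that a trailing * means "starts with" are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch: patterns such as "Mozilla/4.0*" stand in for the
// wildcard values stored with each document.
final class TwoPhaseMatchSketch {

    // Assumed tokenization: lowercase, split on '+', '/' and the wildcard '*'.
    static Set<String> tokenize(String text) {
        Set<String> tokens = new LinkedHashSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[+/*]")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Phase 1: cheap candidate test, true if pattern and user agent share a token.
    static boolean sharesToken(String pattern, String userAgent) {
        Set<String> common = tokenize(pattern);
        common.retainAll(tokenize(userAgent));
        return !common.isEmpty();
    }

    // Phase 2: exact check of the stored wildcard against the untokenized user agent.
    static boolean satisfies(String pattern, String userAgent) {
        if (pattern.endsWith("*")) {
            return userAgent.startsWith(pattern.substring(0, pattern.length() - 1));
        }
        return userAgent.equals(pattern);
    }

    // Keep only the patterns that survive both phases.
    static List<String> match(List<String> patterns, String userAgent) {
        List<String> hits = new ArrayList<>();
        for (String p : patterns) {
            if (sharesToken(p, userAgent) && satisfies(p, userAgent)) {
                hits.add(p);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String ua = "Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4";
        List<String> patterns = Arrays.asList("Firefox*", "Mozilla/4.0*", "Mozilla/5.0*");
        System.out.println(match(patterns, ua)); // prints [Mozilla/4.0*]
    }
}
```

Here "Mozilla/5.0*" is a candidate after phase 1 because it shares the token "mozilla", and it is only rejected by the final wildcard test. This is the kind of mismatch the post-filtering step is meant to catch.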




---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

RE: Wildcard searches????

Posted by Fuad Efendi <fu...@efendi.ca>.

Hi Niclas,


"generalization" of the user agent "without including the versions numbers"...

How will you separate Mozilla/5.0 (Browser) from Mozilla/5.0 (Googlebot)?

And, going to the root of the problem... why do you use SOLR this way? Is it a search service that shows different content depending on browser type (WAP vs. HTML)?

If it is, you are implementing the so-called "business use case" improperly.

Search Engine Results Pages (SERP) should not depend on the User-Agent HTTP request header.

The raw output sent to the client may depend on it, but that is not the SOLR/Lucene layer; it is an upper layer. A Tomcat servlet container, for instance, may generate different output depending on whether the client is a mobile device (WAP) or a browser (Mozilla compatible).

I don't know your use case specifics... as Ted mentioned, it's much better to post SOLR-specific questions to solr-user@lucene.apache.org.


-Fuad




RE: Wildcard searches????

Posted by Niclas Rothman <ni...@lechill.com>.

Hi Fuad and thanks for your reply!

I now realize that the approach in my first post was wrong; I should not have included the wildcard in my index.

However, I can't do as you suggest and store the full user agent in the index; avoiding that is actually the whole idea.

The reason is this: device manufacturers are literally spitting out new devices and updates all the time, which generates new user agents that are very similar to existing ones; perhaps only a small version number differs.
So what I need is a "generalization" of the user agent in my index: only the start of the user agent, without the version numbers.
This way my index stays "up to date" even when users with new version numbers access my search service. The version numbers are not significant in my app; they only cause me problems.

Example:

I have 2 indexed documents whose useragents fields hold partial values:
<doc>
	<id>1</id>
	<useragents>
		Firefox
		Mozilla/4.0+SonyEricsson
	</useragents>
</doc>
<doc>
	<id>2</id>
	<useragents>
		Firefox
		Mozilla/4.0+SonyEricsson
	</useragents>
</doc>

User A searches my app with a user agent such as:
	
	Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-2.1+Configuration/CLDC-1.1+JavaPlatform/JP8.4.1+UP.Link/6.3.1.20.0

The search app will display both documents 1 and 2, because his user agent starts with exactly the user agent pattern in my documents.


User B searches my app with the following user agent (please note that it differs from User A's near the end: JP9.5.1 instead of JP8.4.1):
	
	Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-2.1+Configuration/CLDC-1.1+JavaPlatform/JP9.5.1+UP.Link/6.3.1.20.0

The search app will also display both documents 1 and 2, because his user agent starts with exactly the user agent pattern in my documents,
even though the version number of the Java platform differs between Users A and B.

If we now have a different index with FULL user agents, only User A would get documents returned; none of the documents' user agents match User B's user agent because of the "silly" version number!!

<doc>
	<id>1</id>
	<useragents>
		Firefox
		Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-2.1+Configuration/CLDC-1.1+JavaPlatform/JP8.4.1+UP.Link/6.3.1.20.0
	</useragents>
</doc>
<doc>
	<id>2</id>
	<useragents>
		Firefox
		Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-2.1+Configuration/CLDC-1.1+JavaPlatform/JP8.4.1+UP.Link/6.3.1.20.0
	</useragents>
</doc>

Can you see my problem?
So the basic question is whether I can somehow write a query saying that a match should take place if the user's user agent starts with the value of a document's useragent.

In theory, a startsWith function / logic is easy enough to implement in C# / T-SQL, but how on earth should I do this in Solr / Lucene?
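
To make the desired "reversed" startsWith semantics concrete, here is a minimal Java sketch. It is not a Solr or Lucene API. It assumes the indexed patterns are stored whole (untokenized) and replaces the startsWith test with exact lookups: every prefix of the incoming user agent is enumerated and checked against the set of stored patterns. The class and method names are hypothetical, and an in-memory Set stands in for the index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: a Set of strings stands in for an untokenized index field.
final class ReversePrefixSketch {

    // Every non-empty prefix of the user agent, shortest first.
    static List<String> prefixes(String userAgent) {
        List<String> out = new ArrayList<>();
        for (int end = 1; end <= userAgent.length(); end++) {
            out.add(userAgent.substring(0, end));
        }
        return out;
    }

    // A stored pattern matches when it is exactly equal to some prefix of the
    // user agent, which is the same as userAgent.startsWith(pattern).
    static List<String> matchingPatterns(Set<String> indexedPatterns, String userAgent) {
        List<String> hits = new ArrayList<>();
        for (String prefix : prefixes(userAgent)) {
            if (indexedPatterns.contains(prefix)) {
                hits.add(prefix);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Set<String> patterns = new HashSet<>(Arrays.asList("Firefox", "Mozilla/4.0+SonyEricsson"));
        String userA = "Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-2.1"
                + "+Configuration/CLDC-1.1+JavaPlatform/JP8.4.1+UP.Link/6.3.1.20.0";
        String userB = "Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-2.1"
                + "+Configuration/CLDC-1.1+JavaPlatform/JP9.5.1+UP.Link/6.3.1.20.0";
        System.out.println(matchingPatterns(patterns, userA)); // prints [Mozilla/4.0+SonyEricsson]
        System.out.println(matchingPatterns(patterns, userB)); // prints [Mozilla/4.0+SonyEricsson]
    }
}
```

Both users match the same stored pattern despite the differing JP version number. Because each lookup is an exact match, the same idea could in principle be expressed as an OR of exact terms against an untokenized field, though whether that is practical for long user agents is not something this sketch establishes.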

Regards

Niclas













