Posted to java-user@lucene.apache.org by "Kainth, Sachin" <Sa...@atkinsglobal.com> on 2007/02/08 14:45:01 UTC

'a', 's' and 't' don't index properly

> Hello,
> 
> I have a database of tracks, artists and albums and I'm indexing these
> 3 attributes plus the first letter of the track, thus (incidentally
> I'm using dotlucene but the implementation of dotlucene is similar to
> the Java one):
> 
>    Document Doc = new Document();
>    String Album = ...
>    String Artist = ...
>    String Track = ...
>    Doc.Add(Field.Text("album", Album));
>    Doc.Add(Field.Text("artist", Artist));
>    Doc.Add(Field.Text("track", Track));
>    Doc.Add(Field.Text("firstletter", Track.Substring(0,1)));
> 
> The problem is that I don't think certain first letters are being
> indexed properly, or at all; either that or there is some problem
> elsewhere.  I have noticed that the letters 'a', 's' and 't' (there
> may be others) cause problems.  When I search for the documents I
> sort on the firstletter field, but where the first letter was 'a',
> 's' or 't' the returned list does not contain those records in sorted
> order (all other records are sorted correctly).
> 
> Here is my search command:
> 
> Hits hits = searcher.Search(query, new Sort(new SortField[] { new
> SortField("firstletter", SortField.STRING)}));
> 
> What I don't know is whether the fault lies in the indexing or in this
> or other code.  Does anyone know what could have happened?
> 
> Thanks
> 
> Sachin


This email and any attached files are confidential and copyright protected. If you are not the addressee, any dissemination of this communication is strictly prohibited. Unless otherwise expressly agreed in writing, nothing stated in this communication shall be legally binding.

The ultimate parent company of the Atkins Group is WS Atkins plc.  Registered in England No. 1885586.  Registered Office Woodcote Grove, Ashley Road, Epsom, Surrey KT18 5BW.

Consider the environment. Please don't print this e-mail unless you really need to.

Re: 'a', 's' and 't' don't index properly

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On Feb 8, 2007, at 2:14 PM, Mike Klaas wrote:

> On 2/8/07, Kainth, Sachin <Sa...@atkinsglobal.com> wrote:
>
>> Is there a .NET version of Solr?
>
> Nope.

But, here's the beauty of Solr... if you're not afraid of a JVM
running Jetty, Tomcat, Resin, or many others, then fire up (Java) Solr
and use the .NET HTTP API to talk to Solr.  It's the best solution
available, and it leverages the performance gurus' tweaks that make it
into each new release of Java Lucene and Solr.
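
Since talking to Solr is just HTTP, any client only has to build a URL and
issue a GET. The sketch below shows the URL-building half in Java (the same
idea carries over to a .NET client). The class name SolrUrlSketch, the
selectUrl helper, the base URL http://localhost:8983/solr and the example
query are all assumptions for illustration; check the select handler path
and parameters against your own Solr installation.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class SolrUrlSketch {
    // Build a select URL for a Solr instance reachable at baseUrl.
    // The "/select" path, "q" and "rows" follow the usual Solr conventions;
    // the base URL itself is a placeholder, not taken from the thread.
    static String selectUrl(String baseUrl, String query, int rows) {
        // Query text must be URL-encoded before it goes into the query string.
        String encoded = URLEncoder.encode(query, StandardCharsets.UTF_8);
        return baseUrl + "/select?q=" + encoded + "&rows=" + rows;
    }

    public static void main(String[] args) {
        // Hypothetical host and field name, for illustration only.
        System.out.println(selectUrl("http://localhost:8983/solr", "artist:beatles", 10));
    }
}
```

The response can then be fetched with whatever HTTP client the calling
platform provides and parsed from there.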

	Erik

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: 'a', 's' and 't' don't index properly

Posted by Mike Klaas <mi...@gmail.com>.

On 2/8/07, Kainth, Sachin <Sa...@atkinsglobal.com> wrote:

> Is there a .NET version of Solr?

Nope.

-Mike


RE: 'a', 's' and 't' don't index properly

Posted by "Kainth, Sachin" <Sa...@atkinsglobal.com>.

Thanks Erik,

Is there a .NET version of Solr?

Cheers

Sachin 

Re: 'a', 's' and 't' don't index properly

Posted by Erick Erickson <er...@gmail.com>.

From the javadoc...

public final class SimpleAnalyzer extends Analyzer

An Analyzer that filters LetterTokenizer with LowerCaseFilter.
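
To make that javadoc line concrete, here is a rough Java sketch of what
"letter tokenizer plus lower-case filter" amounts to. It does not call
Lucene; the class LetterLowerCaseSketch and its analyze method are invented
for illustration and only approximate the behaviour, on the assumption that
tokens are maximal runs of letters and nothing is dropped as a stop word.

```java
import java.util.ArrayList;
import java.util.List;

public class LetterLowerCaseSketch {
    // Emit maximal runs of letters, lower-cased. No stop-word removal.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // A non-letter ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("A"));            // single letters survive
        System.out.println(analyze("Take It Easy")); // one token per word
    }
}
```

Under this scheme 'a', 's' and 't' are kept, but a first "letter" that is a
digit or punctuation mark would produce no token at all, so tracks starting
with such characters may need separate handling.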


On 2/8/07, Kainth, Sachin <Sa...@atkinsglobal.com> wrote:
>
> Thanks Erik,
>
> Do you know of an analyzer which doesn't remove the characters 'a', 's'
> and 't'.
>
> Sachin

RE: 'a', 's' and 't' don't index properly

Posted by "Kainth, Sachin" <Sa...@atkinsglobal.com>.

Thanks Erik,

Do you know of an analyzer which doesn't remove the characters 'a', 's'
and 't'?

Sachin 

Re: 'a', 's' and 't' don't index properly

Posted by Erick Erickson <er...@gmail.com>.

This really should be posted on the dotlucene list, but....

Your indexing analyzer is probably removing them. For instance,
StandardAnalyzer uses a default set of stop words, and a, s, and t are
definitely among them. You need to use a different analyzer than you are
using.

These will also be removed from queries if you use QueryParser with one of
several analyzers that remove stop words.
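
The effect described above can be reproduced without Lucene. The following
Java sketch is only a stand-in for a stop-word-removing analyzer: the class
StopWordSketch, its analyze method and the short stop list are made up for
illustration (the real default list is longer), and the tokenization is
deliberately simplified.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class StopWordSketch {
    // A few entries in the spirit of an English stop list; illustrative only.
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "an", "and", "s", "t", "the"));

    // Lower-case the text, split on non-letters, and drop stop words.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!raw.isEmpty() && !STOP_WORDS.contains(raw)) {
                tokens.add(raw);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Single-letter "firstletter" values: A, S and T vanish, B survives.
        for (String firstLetter : new String[] {"A", "B", "S", "T"}) {
            System.out.println(firstLetter + " -> " + analyze(firstLetter));
        }
    }
}
```

A field whose only token is a stop word ends up with no indexed term, which
would leave nothing to sort on for that document.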

StandardAnalyzer, for instance, also lower-cases tokens, removes most
punctuation, etc., so take some care to understand the analyzers and what
they do.

Oh, and get a copy of Luke if you haven't already. It'll let you examine
your index, see the results of using various analyzers etc.

Best
Erick
