You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by Andre Rubin <an...@gmail.com> on 2008/08/05 21:14:33 UTC

Sorting

Hi there!

I'm new to Lucene, so forgive any misconceptions on my part.

I created an Index and now I want to search on it based on a field.
The field is a String field and Field.Store.YES and
Field.Index.TOKENIZED. No problems with the search.

Now, I wanted to sort the results, and according to the Sort javadoc
the field "should not be tokenized". But I decided to try it anyway,
and it worked. However, the results showed that the tokens were
sorted, not the full string in the field.

Just to make myself more clear, here's an example. Let's say I have
these strings indexed:

"North Carolina"
"British Columbia"
"Canada"

Now I search (with sort) for the token 'c*'

The result I get is (sorted by the token found):

1) Canada
2) North Carolina
3) British Columbia

The result I wanted was (sorted by the whole String)"

1) British Columbia
2) Canada
3) North Carolina

Is there a way to do this?


Another option would be to sort the index itself, since this field is
the only field that we'd be searching on. But I'm just guessing here,
cause I have no idea if this is possible at all!

Thanks,


Andre

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Sorting

Posted by Erick Erickson <er...@gmail.com>.
Your sort object has no relation to your query in terms of fields. You can
search on anything you want, but construct your Sort object with your
untokenized field.

Best
Erick

On Tue, Aug 5, 2008 at 4:12 PM, Andre Rubin <an...@gmail.com> wrote:

> Sounds like a good alternative. But how do I perform the search on the
> tokenized filed and sort on the un_tokenized one?
>
> Thanks,
>
>
> Andre
>
> On Tue, Aug 5, 2008 at 12:51 PM,  <Ro...@ancept.com> wrote:
> > This is what I did and it works fine.  My untokenized fields where named:
> > "__AMSUNTOK__" + fieldName.
> > Where fieldName was the name of the tokenized field.
> >
> >
> > Bob Hastings
> > Ancept Inc.
> >
> >
> >
> >
> > Mark Miller <ma...@gmail.com>
> > 08/05/2008 02:38 PM
> > Please respond to
> > java-user@lucene.apache.org
> >
> >
> > To
> > java-user@lucene.apache.org
> > cc
> >
> > Subject
> > Re: Sorting
> >
> >
> >
> >
> >
> >
> > Hey Andre,
> >
> > The reason the javadoc says the field should not be tokenized stems from
> > the issue you point out. What you want to do is possible of course, but
> > making the Lucene code change would complicate a process that can be
> > quite memory and cpu intensive on large collections. Done right, it
> > might make a good patch though.
> >
> > A compromise that you can make outside of the Lucene code is to index a
> > separate field with the same contents but untokenized. Sorting on this
> > field instead, Lucene will treat "North Carolina" as one token and sort
> > as you'd expect. The downside to this approach is that you will have to
> > juggle the two fields in the future.
> >
> > - Mark
> >
> > Andre Rubin wrote:
> >> Hi there!
> >>
> >> I'm new to Lucene, so forgive any misconceptions on my part.
> >>
> >> I created an Index and now I want to search on it based on a field.
> >> The field is a String field and Field.Store.YES and
> >> Field.Index.TOKENIZED. No problems with the search.
> >>
> >> Now, I wanted to sort the results, and according to the Sort javadoc
> >> the field "should not be tokenized". But I decided to try it anyway,
> >> and it worked. However, the results showed that the tokens were
> >> sorted, not the full string in the field.
> >>
> >> Just to make myself more clear, here's an example. Let's say I have
> >> these strings indexed:
> >>
> >> "North Carolina"
> >> "British Columbia"
> >> "Canada"
> >>
> >> Now I search (with sort) for the token 'c*'
> >>
> >> The result I get is (sorted by the token found):
> >>
> >> 1) Canada
> >> 2) North Carolina
> >> 3) British Columbia
> >>
> >> The result I wanted was (sorted by the whole String)"
> >>
> >> 1) British Columbia
> >> 2) Canada
> >> 3) North Carolina
> >>
> >> Is there a way to do this?
> >>
> >>
> >> Another option would be to sort the index itself, since this field is
> >> the only field that we'd be searching on. But I'm just guessing here,
> >> cause I have no idea if this is possible at all!
> >>
> >> Thanks,
> >>
> >>
> >> Andre
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>
> >>
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Re: Sorting

Posted by Andre Rubin <an...@gmail.com>.
Sounds like a good alternative. But how do I perform the search on the
tokenized filed and sort on the un_tokenized one?

Thanks,


Andre

On Tue, Aug 5, 2008 at 12:51 PM,  <Ro...@ancept.com> wrote:
> This is what I did and it works fine.  My untokenized fields where named:
> "__AMSUNTOK__" + fieldName.
> Where fieldName was the name of the tokenized field.
>
>
> Bob Hastings
> Ancept Inc.
>
>
>
>
> Mark Miller <ma...@gmail.com>
> 08/05/2008 02:38 PM
> Please respond to
> java-user@lucene.apache.org
>
>
> To
> java-user@lucene.apache.org
> cc
>
> Subject
> Re: Sorting
>
>
>
>
>
>
> Hey Andre,
>
> The reason the javadoc says the field should not be tokenized stems from
> the issue you point out. What you want to do is possible of course, but
> making the Lucene code change would complicate a process that can be
> quite memory and cpu intensive on large collections. Done right, it
> might make a good patch though.
>
> A compromise that you can make outside of the Lucene code is to index a
> separate field with the same contents but untokenized. Sorting on this
> field instead, Lucene will treat "North Carolina" as one token and sort
> as you'd expect. The downside to this approach is that you will have to
> juggle the two fields in the future.
>
> - Mark
>
> Andre Rubin wrote:
>> Hi there!
>>
>> I'm new to Lucene, so forgive any misconceptions on my part.
>>
>> I created an Index and now I want to search on it based on a field.
>> The field is a String field and Field.Store.YES and
>> Field.Index.TOKENIZED. No problems with the search.
>>
>> Now, I wanted to sort the results, and according to the Sort javadoc
>> the field "should not be tokenized". But I decided to try it anyway,
>> and it worked. However, the results showed that the tokens were
>> sorted, not the full string in the field.
>>
>> Just to make myself more clear, here's an example. Let's say I have
>> these strings indexed:
>>
>> "North Carolina"
>> "British Columbia"
>> "Canada"
>>
>> Now I search (with sort) for the token 'c*'
>>
>> The result I get is (sorted by the token found):
>>
>> 1) Canada
>> 2) North Carolina
>> 3) British Columbia
>>
>> The result I wanted was (sorted by the whole String)"
>>
>> 1) British Columbia
>> 2) Canada
>> 3) North Carolina
>>
>> Is there a way to do this?
>>
>>
>> Another option would be to sort the index itself, since this field is
>> the only field that we'd be searching on. But I'm just guessing here,
>> cause I have no idea if this is possible at all!
>>
>> Thanks,
>>
>>
>> Andre
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Sorting

Posted by Ro...@ancept.com.
This is what I did and it works fine.  My untokenized fields where named: 
"__AMSUNTOK__" + fieldName.
Where fieldName was the name of the tokenized field.


Bob Hastings
Ancept Inc.




Mark Miller <ma...@gmail.com> 
08/05/2008 02:38 PM
Please respond to
java-user@lucene.apache.org


To
java-user@lucene.apache.org
cc

Subject
Re: Sorting






Hey Andre,

The reason the javadoc says the field should not be tokenized stems from 
the issue you point out. What you want to do is possible of course, but 
making the Lucene code change would complicate a process that can be 
quite memory and cpu intensive on large collections. Done right, it 
might make a good patch though.

A compromise that you can make outside of the Lucene code is to index a 
separate field with the same contents but untokenized. Sorting on this 
field instead, Lucene will treat "North Carolina" as one token and sort 
as you'd expect. The downside to this approach is that you will have to 
juggle the two fields in the future.

- Mark

Andre Rubin wrote:
> Hi there!
>
> I'm new to Lucene, so forgive any misconceptions on my part.
>
> I created an Index and now I want to search on it based on a field.
> The field is a String field and Field.Store.YES and
> Field.Index.TOKENIZED. No problems with the search.
>
> Now, I wanted to sort the results, and according to the Sort javadoc
> the field "should not be tokenized". But I decided to try it anyway,
> and it worked. However, the results showed that the tokens were
> sorted, not the full string in the field.
>
> Just to make myself more clear, here's an example. Let's say I have
> these strings indexed:
>
> "North Carolina"
> "British Columbia"
> "Canada"
>
> Now I search (with sort) for the token 'c*'
>
> The result I get is (sorted by the token found):
>
> 1) Canada
> 2) North Carolina
> 3) British Columbia
>
> The result I wanted was (sorted by the whole String)"
>
> 1) British Columbia
> 2) Canada
> 3) North Carolina
>
> Is there a way to do this?
>
>
> Another option would be to sort the index itself, since this field is
> the only field that we'd be searching on. But I'm just guessing here,
> cause I have no idea if this is possible at all!
>
> Thanks,
>
>
> Andre
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org



Re: Sorting

Posted by Mark Miller <ma...@gmail.com>.
Hey Andre,

The reason the javadoc says the field should not be tokenized stems from 
the issue you point out. What you want to do is possible of course, but 
making the Lucene code change would complicate a process that can be 
quite memory and cpu intensive on large collections. Done right, it 
might make a good patch though.

A compromise that you can make outside of the Lucene code is to index a 
separate field with the same contents but untokenized. Sorting on this 
field instead, Lucene will treat "North Carolina" as one token and sort 
as you'd expect. The downside to this approach is that you will have to 
juggle the two fields in the future.

- Mark

Andre Rubin wrote:
> Hi there!
>
> I'm new to Lucene, so forgive any misconceptions on my part.
>
> I created an Index and now I want to search on it based on a field.
> The field is a String field and Field.Store.YES and
> Field.Index.TOKENIZED. No problems with the search.
>
> Now, I wanted to sort the results, and according to the Sort javadoc
> the field "should not be tokenized". But I decided to try it anyway,
> and it worked. However, the results showed that the tokens were
> sorted, not the full string in the field.
>
> Just to make myself more clear, here's an example. Let's say I have
> these strings indexed:
>
> "North Carolina"
> "British Columbia"
> "Canada"
>
> Now I search (with sort) for the token 'c*'
>
> The result I get is (sorted by the token found):
>
> 1) Canada
> 2) North Carolina
> 3) British Columbia
>
> The result I wanted was (sorted by the whole String)"
>
> 1) British Columbia
> 2) Canada
> 3) North Carolina
>
> Is there a way to do this?
>
>
> Another option would be to sort the index itself, since this field is
> the only field that we'd be searching on. But I'm just guessing here,
> cause I have no idea if this is possible at all!
>
> Thanks,
>
>
> Andre
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>   


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org