You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@lucene.apache.org by Gora Mohanty <go...@mimirtech.com> on 2010/12/27 11:27:02 UTC

Re: how to sort multiple fields using solr

On Mon, Dec 27, 2010 at 3:51 PM, dhanesh vtc <dh...@vtc.com> wrote:
> Hi,
> How can I sort  a search result based on  multiple  fields.
> I've two fields id and packageId for sorting
> My code is like 'sort'=>'packageId desc','sort'=>'id desc'
> But its sorting works only one field.
> Any idea??
[...]

This question is better addressed to the Solr user mailing
list, not the dev list.

Sorting can work on multiple fields: Please see:
http://lucene.apache.org/solr/tutorial.html#Sorting for such
an example.

Regards,
Gora

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

RE: Improving String Distance calculation performance

Posted by Uwe Schindler <uw...@thetaphi.de>.

Sven:
Please start a new thread when asking a new question. From Hossman's apache
page:

When starting a new discussion on a mailing list, please do not reply to an
existing message, instead start a fresh email.  Even if you change the
subject line of your email, other mail headers still track which thread you
replied to and your question is "hidden" in that thread and gets less
attention.   It makes following discussions in the mailing list archives
particularly difficult.
See Also:  http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Biedermann,S.,Fa. Post Direkt [mailto:S.Biedermann@postdirekt.de]
> Sent: Monday, December 27, 2010 12:23 PM
> To: dev@lucene.apache.org
> Subject: Improving String Distance calculation performance
> 
> Hi,
> 
> this is my first post to this mailing list, so I first want to say hello
to all of you!
> 
> 	You are doing a great job
> 
> In org.apache.lucene.search.FuzzyTermEnum I found an optimised
> implementation of the Levenstein-Algorithms which makes use of the fact
> that the algorithm can be aborted if a given minimum similarity cannot be
> reached anymore. I isolated that algorithm into a subclass of
> org.apache.lucene.spell.StringDistance, since we usually can make use of
this
> optimisation.
> 
> With our current miminum similarity setting of 0.75 this algorithm needs
> against our test data only about 22% of run time compared to the original
> algorithm from the spell package.
> 
> With a further optimisation candidate (see below) the runtime can be
further
> reduced by another third to only 14% of original Levenstein.
> 
> So, my first question is: is it worth adding a further method to the
> StringDistance-Interface:
> 
> 	float getDistance(String left, String right, float
> minimumSimilarity)
> 
> so that applications can make use of possible optimisations
(StringDistance-
> Implementations without optimisations would just skip the minimSimilarity
> parameter)?
> 
> 
> The idea of the optimsation candidate is about calculating only those
fields in
> the "virtual" matrix that are near its diagonal.
> It is only a candidants since we have not prooven it to work. But with all
our
> test data (0.5 billion comparisons) there is no difference to the original
> algorithm.
> 
> Do you have any counter examples?
> Since this is my first post, is this the right mailing list?
> 
> Best Regards,
> 
> Sven
> 
> 
> 
> 
> 
> Here is the code taken from FuzzyTermEnum with some modfications  (p and
> d are initialised somewhere else):
> 
>     public float getDistance(final String left, final String right, float
> minimumSimilarity) {
>         final int m = right.length();
>         final int n = left.length();
>         final int maxLength = Math.max(m, n);
>         if (n == 0)  {
>           //we don't have anything to compare.  That means if we just add
>           //the letters for m we get the new word
>             return (m == 0) ? 1f : 0f;
>         }
>         if (m == 0) {
>           return 0f;
>         }
> 
>         // be patient with rounding errors (1.0000001f instead of 1f).
>         final int maxDistance = (int) ((1.0000001f-minimumSimilarity) *
> maxLength);
> 
>         if (maxDistance < Math.abs(m-n)) {
>           //just adding the characters of m to n or vice-versa results in
>           //too many edits
>           //for example "pre" length is 3 and "prefixes" length is 8.
> We can see that
>           //given this optimal circumstance, the edit distance cannot be
less than
> 5.
>           //which is 8-3 or more precisely Math.abs(3-8).
>           //if our maximum edit distance is 4, then we can discard this
word
>           //without looking at it.
>           return 0.0f;
>         }
> 
>         // if no edits are allowed, strings must be equal
>         if (maxDistance == 0)
>             return left.equals(right) ? 1f : 0f;
> 
>         // init matrix d
>         for (int i = 0; i<=n; i++) {
>           p[i] = i;
>         }
> 
>         // start computing edit distance
>         for (int j = 1; j<=m; j++) { // iterates through target
>           int bestPossibleEditDistance = m;
>           final char t_j = right.charAt(j-1); // jth character of t
>           d[0] = j;
> 
> 
> //-------> here is the optimisation candiates
> 
>           //only iterate through a maxDistance corridor
>           final int startAt  = Math.max(1, j - maxDistance );
>           final int finishAt = Math.min(n, maxDistance - 1 + j);
> 
>           for (int i=startAt; i<=finishAt; ++i) { // iterates through text
> //--------
>             // minimum of cell to the left+1, to the top+1, diagonally
left and up
> +(0|1)
>             final char t_i = left.charAt(i-1);
>             if (t_j != t_i) {
>               d[i] = Math.min(Math.min(d[i-1], p[i]),  p[i-1]) + 1;
>             } else {
>                 d[i] = Math.min(Math.min(d[i-1], p[i]) + 1,  p[i-1]);
>             }
>             bestPossibleEditDistance =
> Math.min(bestPossibleEditDistance, d[i]);
> 
>           }
> 
>           //After calculating row i, the best possible edit distance
>           //can be found by found by finding the smallest value in a given
column.
>           //If the bestPossibleEditDistance is greater than the max
distance,
> abort.
> 
>           if (j > maxDistance && bestPossibleEditDistance > maxDistance) {
> //equal is okay, but not greater
>             //the closest the target can be to the text is just too far
away.
>             //this target is leaving the party early.
>             return 0.0f;
>           }
> 
>           // copy current distance counts to 'previous row' distance
> counts: swap p and d
>           int _d[] = p;
>           p = d;
>           d = _d;
>         }
> 
>         // our last action in the above loop was to switch d and p, so p
now
>         // actually has the most recent cost counts
> 
>         // this will return less than 0.0 when the edit distance is
>         // greater than the number of characters in the shorter word.
>         // but this was the formula that was previously used in
FuzzyTermEnum,
>         // so it has not been changed (even though minimumSimilarity must
be
>         // greater than 0.0)
>         return 1.0f - ((float)p[n] / (float) (maxLength));
>       }
> 
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org For additional
> commands, e-mail: dev-help@lucene.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Improving String Distance calculation performance

Posted by "Biedermann,S.,Fa. Post Direkt" <S....@postdirekt.de>.

Hi,

this is my first post to this mailing list, so I first want to say hello
to all of you!

	You are doing a great job 

In org.apache.lucene.search.FuzzyTermEnum I found an optimised
implementation of the Levenstein-Algorithms which makes use of the fact
that the algorithm can be aborted if a given minimum similarity cannot
be reached anymore. I isolated that algorithm into a subclass of
org.apache.lucene.spell.StringDistance, since we usually can make use of
this optimisation.

With our current miminum similarity setting of 0.75 this algorithm needs
against our test data only about 22% of run time compared to the
original algorithm from the spell package.

With a further optimisation candidate (see below) the runtime can be
further reduced by another third to only 14% of original Levenstein.

So, my first question is: is it worth adding a further method to the
StringDistance-Interface:

	float getDistance(String left, String right, float
minimumSimilarity)

so that applications can make use of possible optimisations
(StringDistance-Implementations without optimisations would just skip
the minimSimilarity parameter)?


The idea of the optimsation candidate is about calculating only those
fields in the "virtual" matrix that are near its diagonal.
It is only a candidants since we have not prooven it to work. But with
all our test data (0.5 billion comparisons) there is no difference to
the original algorithm.

Do you have any counter examples?
Since this is my first post, is this the right mailing list?

Best Regards,

Sven





Here is the code taken from FuzzyTermEnum with some modfications  (p and
d are initialised somewhere else):

    public float getDistance(final String left, final String right,
float minimumSimilarity) {
        final int m = right.length();
        final int n = left.length();
        final int maxLength = Math.max(m, n);
        if (n == 0)  {
          //we don't have anything to compare.  That means if we just
add
          //the letters for m we get the new word
            return (m == 0) ? 1f : 0f;
        }
        if (m == 0) {
          return 0f;
        }

        // be patient with rounding errors (1.0000001f instead of 1f).
        final int maxDistance = (int) ((1.0000001f-minimumSimilarity) *
maxLength);

        if (maxDistance < Math.abs(m-n)) {
          //just adding the characters of m to n or vice-versa results
in
          //too many edits
          //for example "pre" length is 3 and "prefixes" length is 8.
We can see that
          //given this optimal circumstance, the edit distance cannot be
less than 5.
          //which is 8-3 or more precisely Math.abs(3-8).
          //if our maximum edit distance is 4, then we can discard this
word
          //without looking at it.
          return 0.0f;
        }
        
        // if no edits are allowed, strings must be equal 
        if (maxDistance == 0)
            return left.equals(right) ? 1f : 0f;

        // init matrix d
        for (int i = 0; i<=n; i++) {
          p[i] = i;
        }

        // start computing edit distance
        for (int j = 1; j<=m; j++) { // iterates through target
          int bestPossibleEditDistance = m;
          final char t_j = right.charAt(j-1); // jth character of t
          d[0] = j;


//-------> here is the optimisation candiates

          //only iterate through a maxDistance corridor
          final int startAt  = Math.max(1, j - maxDistance );
          final int finishAt = Math.min(n, maxDistance - 1 + j);
          
          for (int i=startAt; i<=finishAt; ++i) { // iterates through
text
//--------
            // minimum of cell to the left+1, to the top+1, diagonally
left and up +(0|1)
            final char t_i = left.charAt(i-1);  
            if (t_j != t_i) {
              d[i] = Math.min(Math.min(d[i-1], p[i]),  p[i-1]) + 1;
            } else {
                d[i] = Math.min(Math.min(d[i-1], p[i]) + 1,  p[i-1]);
            }
            bestPossibleEditDistance =
Math.min(bestPossibleEditDistance, d[i]);

          }

          //After calculating row i, the best possible edit distance
          //can be found by found by finding the smallest value in a
given column.
          //If the bestPossibleEditDistance is greater than the max
distance, abort.

          if (j > maxDistance && bestPossibleEditDistance > maxDistance)
{  //equal is okay, but not greater
            //the closest the target can be to the text is just too far
away.
            //this target is leaving the party early.
            return 0.0f;
          }

          // copy current distance counts to 'previous row' distance
counts: swap p and d
          int _d[] = p;
          p = d;
          d = _d;
        }

        // our last action in the above loop was to switch d and p, so p
now
        // actually has the most recent cost counts

        // this will return less than 0.0 when the edit distance is
        // greater than the number of characters in the shorter word.
        // but this was the formula that was previously used in
FuzzyTermEnum,
        // so it has not been changed (even though minimumSimilarity
must be
        // greater than 0.0)
        return 1.0f - ((float)p[n] / (float) (maxLength));
      }




---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

PS: Improving String Distance calculation performance

Posted by "Biedermann,S.,Fa. Post Direkt" <S....@postdirekt.de>.

Ups... I forgot to say, that the candiate only works if left.length() <= right.length() !

-----Ursprüngliche Nachricht-----
Von: Biedermann,S.,Fa. Post Direkt 
Gesendet: Montag, 27. Dezember 2010 12:23
An: 'dev@lucene.apache.org'
Betreff: Improving String Distance calculation performance

Hi,

this is my first post to this mailing list, so I first want to say hello to all of you!

	You are doing a great job 

In org.apache.lucene.search.FuzzyTermEnum I found an optimised implementation of the Levenstein-Algorithms which makes use of the fact that the algorithm can be aborted if a given minimum similarity cannot be reached anymore. I isolated that algorithm into a subclass of org.apache.lucene.spell.StringDistance, since we usually can make use of this optimisation.

With our current miminum similarity setting of 0.75 this algorithm needs against our test data only about 22% of run time compared to the original algorithm from the spell package.

With a further optimisation candidate (see below) the runtime can be further reduced by another third to only 14% of original Levenstein.

So, my first question is: is it worth adding a further method to the StringDistance-Interface:

	float getDistance(String left, String right, float minimumSimilarity)

so that applications can make use of possible optimisations (StringDistance-Implementations without optimisations would just skip the minimSimilarity parameter)?


The idea of the optimsation candidate is about calculating only those fields in the "virtual" matrix that are near its diagonal.
It is only a candidants since we have not prooven it to work. But with all our test data (0.5 billion comparisons) there is no difference to the original algorithm.

Do you have any counter examples?
Since this is my first post, is this the right mailing list?

Best Regards,

Sven





Here is the code taken from FuzzyTermEnum with some modfications  (p and d are initialised somewhere else):

    public float getDistance(final String left, final String right, float minimumSimilarity) {
        final int m = right.length();
        final int n = left.length();
        final int maxLength = Math.max(m, n);
        if (n == 0)  {
          //we don't have anything to compare.  That means if we just add
          //the letters for m we get the new word
            return (m == 0) ? 1f : 0f;
        }
        if (m == 0) {
          return 0f;
        }

        // be patient with rounding errors (1.0000001f instead of 1f).
        final int maxDistance = (int) ((1.0000001f-minimumSimilarity) * maxLength);

        if (maxDistance < Math.abs(m-n)) {
          //just adding the characters of m to n or vice-versa results in
          //too many edits
          //for example "pre" length is 3 and "prefixes" length is 8.  We can see that
          //given this optimal circumstance, the edit distance cannot be less than 5.
          //which is 8-3 or more precisely Math.abs(3-8).
          //if our maximum edit distance is 4, then we can discard this word
          //without looking at it.
          return 0.0f;
        }
        
        // if no edits are allowed, strings must be equal 
        if (maxDistance == 0)
            return left.equals(right) ? 1f : 0f;

        // init matrix d
        for (int i = 0; i<=n; i++) {
          p[i] = i;
        }

        // start computing edit distance
        for (int j = 1; j<=m; j++) { // iterates through target
          int bestPossibleEditDistance = m;
          final char t_j = right.charAt(j-1); // jth character of t
          d[0] = j;


//-------> here is the optimisation candiates

          //only iterate through a maxDistance corridor
          final int startAt  = Math.max(1, j - maxDistance );
          final int finishAt = Math.min(n, maxDistance - 1 + j);
          
          for (int i=startAt; i<=finishAt; ++i) { // iterates through text
//--------
            // minimum of cell to the left+1, to the top+1, diagonally left and up +(0|1)
            final char t_i = left.charAt(i-1);  
            if (t_j != t_i) {
              d[i] = Math.min(Math.min(d[i-1], p[i]),  p[i-1]) + 1;
            } else {
                d[i] = Math.min(Math.min(d[i-1], p[i]) + 1,  p[i-1]);
            }
            bestPossibleEditDistance = Math.min(bestPossibleEditDistance, d[i]);

          }

          //After calculating row i, the best possible edit distance
          //can be found by found by finding the smallest value in a given column.
          //If the bestPossibleEditDistance is greater than the max distance, abort.

          if (j > maxDistance && bestPossibleEditDistance > maxDistance) {  //equal is okay, but not greater
            //the closest the target can be to the text is just too far away.
            //this target is leaving the party early.
            return 0.0f;
          }

          // copy current distance counts to 'previous row' distance counts: swap p and d
          int _d[] = p;
          p = d;
          d = _d;
        }

        // our last action in the above loop was to switch d and p, so p now
        // actually has the most recent cost counts

        // this will return less than 0.0 when the edit distance is
        // greater than the number of characters in the shorter word.
        // but this was the formula that was previously used in FuzzyTermEnum,
        // so it has not been changed (even though minimumSimilarity must be
        // greater than 0.0)
        return 1.0f - ((float)p[n] / (float) (maxLength));
      }




---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org