You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by mark harwood <ma...@yahoo.co.uk> on 2009/10/19 12:29:52 UTC
Highlighting - catering for all query types

I've been putting together some code to support highlighting of opaque query clauses (cached filters, trie range, spatial etc etc) which shows some promise.

This is not intended as a replacement for the existing highlighter(s) which deal with free-text but is instead concentrating on the hard-to-highlight clauses and has the benefit of working in-line with the query process.
Summarisation is not a requirement here - I simply need to know if a given query clause matched on a result.

The approach I have come up with is to wrap query clauses with lightweight (processing and RAM-wise) instrumenting objects in order to record which clauses matched.
The recorded matches are encoded as a byte in the document score which unfortunately requires some loss of precision in the scores - more on this later.

The general approach for use looks like this:

        //Wrap *any* type of query object for highlight flagging and allocate a flag number between 1 and 8 for the clauses of interest....
        FlagRecordingQuery frqA=new FlagRecordingQuery(new TermQuery(new Term("statusField","published")),1);
        FlagRecordingQuery frqB=new FlagRecordingQuery(new XyzLtd3rdPartyQuery("imageDataField", "unknown magic to find 'sunset'")),2);

        BooleanQuery bq=new BooleanQuery();
        bq.add(new BooleanClause(frqA,Occur.SHOULD));
        bq.add(new BooleanClause(frqB,Occur.SHOULD));

        //Parent query must be a FlagCombiningQuery to encode child match info in the doc scores
        FlagCombiningQuery fcq=new FlagCombiningQuery(bq);

        //Run search
        TopDocs td = s.search(fcq,10);
        ScoreDoc[] sd = td.scoreDocs;
        for (ScoreDoc scoreDoc : sd)
        {
            float score=scoreDoc.score;

            //Check to see which flags are encoded in the score.
            if(FlagCombiningQuery.hasFlag(1, score))
            {
                System.out.println("woot! "+scoreDoc.doc+" matched clause 1 ");
            }
            if(FlagCombiningQuery.hasFlag(2, score))
            {
                System.out.println("woot! "+scoreDoc.doc+" matched clause 2 ");
            }
        }


The FlagRecordingQuery child clauses introduce themselves to the FlagCombiningQuery through a thread local at "rewrite" time.
The FlagCombiningQuery at the root adjusts the scores as follows:

        static final float DEFAULT_MULTIPLIER=1000f;
        float multiplier=DEFAULT_MULTIPLIER;
    ....
        public float score() throws IOException
        {
            float score = delegateScorer.score();
            byte flags=0;
            int d=doc();
            //encode all matched child clauses into a "flags" byte.
            for (FlagRecordingQuery frq : thisThreadsFlags)
            {
                if(frq.matched(d))
                {
                    byte mask=flagMasks[frq.flag-1];
                    flags=setFlag(flags, mask);
                }
            }

            //Multiply score to turn float into int with sufficient fractions in score.
            int shiftedI=(int) (score*multiplier);
            //Shift int to make space for byte holding flags
            int iPlusSpaceForByte=shiftedI<<8;
            //Add match flags
            int iCombinedScoreAndFlags=iPlusSpaceForByte|flags;
            System.out.println("combined score="+iCombinedScoreAndFlags+" for doc#"+doc());
            return iCombinedScoreAndFlags;
        }

The mechanism works but relies on original score values that :
a) Are not too big - i.e. do not lose significant digits when multiplied by "multiplier" and then shifted left 8 bits.
b) Are not too similar - i.e. only differ in very small fractions e.g. all scores occur in the range 0.1234 to 0.1235

To give an indication of restrictions this imposes here are the usable score ranges for various settings of "multiplier":

multiplier       max score   fraction precision
======   ========   =============
10           838860         0.x
100         83886              0.xx
1000       8388             0.xxx
10000     838               0.xxxx

I would imagine the majority of Lucene query results would still rank sensibly with a 1,000 or 10,000 multiplier.

However, all this potentially dangerous bit twiddling could of course be avoided if the Lucene search API was expanded to include docid, score AND a completely seperate field for recording match flags. 


Thoughts?


      

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org