Posted to solr-user@lucene.apache.org by mtraynham <mt...@digitalsmiths.com> on 2011/05/20 16:46:38 UTC

[Contribution] Multiword Inline-Prefix Autocomplete Idea

At my company, I've been spending some time figuring out the best approach
for inline prefix autocompletion.  Most of the support for autocompletion
is based solely on prefix matching, as it can jump to a certain term within
a field quickly and break the enumeration loop when the prefix no longer
matches (really nice and freaking quick).

Doing inline prefixing means you lose this functionality and have to check
each word specifically.

Some approaches I investigated:
- TermsComponent - test each term for inline prefixes
     - For large corpora, this is slow, really really slow...
- TST - currently only supports prefixing as well, but could manage to have
each node pointing back to the document and then perform document
intersection.
     - Probably the quickest solution, but rebuilding this tree every time a
commit happens could get really ugly on memory
- RAMDirectory - again a memory hog.

A quick and not so bad solution:
- New poly field type that splits a term value into prefix-able terms. 
- CopyField of dynamic type __s (string) to this fieldtype
- Ex. Jennifer Love Hewitt -> Jennifer Love Hewitt, Love Hewitt, Hewitt

How it looks indexed:
jennifer love hewitt<DELIM>Jennifer Love Hewitt
love hewitt<DELIM>Jennifer Love Hewitt
hewitt<DELIM>Jennifer Love Hewitt

As a user is typing name values we prefix match on the term they typed and
then return whatever is after the delimiter.  I also lower cased, so I could
get case insensitivity.
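
To make the lookup side concrete, here is a minimal sketch of the idea in
plain Java.  It is not Solr code: a TreeSet stands in for the sorted term
dictionary, and the class and method names (InlinePrefixLookup, add,
suggest) are made up for illustration.  It seeks to the first term at or
after the typed prefix, stops as soon as the prefix no longer matches, and
returns whatever follows the delimiter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

// Hypothetical stand-in for the indexed field; not part of Solr.
final class InlinePrefixLookup {
    // Assumed to be the same delimiter character the field type uses.
    static final char DELIMITER = '\u00ff';

    // A sorted set plays the role of the field's sorted term dictionary.
    private final TreeSet<String> terms = new TreeSet<String>();

    /** Index every word-aligned suffix, lower-cased, followed by the original value. */
    void add(String value) {
        String[] words = value.toLowerCase(Locale.ROOT).split(" ");
        String suffix = "";
        for (int i = words.length - 1; i >= 0; i--) {
            suffix = suffix.isEmpty() ? words[i] : words[i] + " " + suffix;
            terms.add(suffix + DELIMITER + value);
        }
    }

    /** Seek to the first term >= prefix and stop once the prefix stops matching. */
    List<String> suggest(String typed) {
        String prefix = typed.toLowerCase(Locale.ROOT);
        List<String> out = new ArrayList<String>();
        for (String term : terms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break; // sorted order: no later term can match either
            }
            // Everything after the delimiter is the original, displayable value.
            String display = term.substring(term.lastIndexOf(DELIMITER) + 1);
            if (!out.contains(display)) {
                out.add(display);
            }
        }
        return out;
    }
}
```

With "Jennifer Love Hewitt" added, typing "hew", "love h" or "JENN" all
lead back to the full name, while the loop only ever touches the terms that
share the typed prefix.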

Note: The only case that I'm not currently supporting is out-of-order
prefixing (e.g. user types Hewitt Jennifer).  This can be accomplished
with the same approach: you would index each poly term split separately
and maintain a map while your prefix algorithm is running.
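
One possible reading of that note, sketched in plain Java rather than Solr
code: each word is indexed on its own (word, delimiter, original value), each
typed token is prefix-matched separately, and a map counts how many typed
tokens hit each original value.  The names here (OutOfOrderPrefixLookup,
add, suggest) are hypothetical, and the TreeSet is only a stand-in for the
term dictionary.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of out-of-order prefix matching; not part of Solr.
final class OutOfOrderPrefixLookup {
    static final char DELIMITER = '\u00ff';

    // Sorted set standing in for the field's term dictionary.
    private final TreeSet<String> terms = new TreeSet<String>();

    /** Index each word separately: word + DELIMITER + original value. */
    void add(String value) {
        for (String word : value.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!word.isEmpty()) {
                terms.add(word + DELIMITER + value);
            }
        }
    }

    /** Return the values for which every typed token prefixes some word. */
    List<String> suggest(String typed) {
        String[] prefixes = typed.trim().toLowerCase(Locale.ROOT).split("\\s+");
        // The map kept while matching: original value -> number of typed tokens matched.
        Map<String, Integer> hits = new HashMap<String, Integer>();
        for (String prefix : prefixes) {
            Set<String> matched = new HashSet<String>();
            for (String term : terms.tailSet(prefix)) {
                if (!term.startsWith(prefix)) {
                    break; // past the range of terms sharing this prefix
                }
                matched.add(term.substring(term.lastIndexOf(DELIMITER) + 1));
            }
            for (String value : matched) {
                Integer seen = hits.get(value);
                hits.put(value, seen == null ? 1 : seen + 1);
            }
        }
        List<String> out = new ArrayList<String>();
        for (Map.Entry<String, Integer> e : hits.entrySet()) {
            if (e.getValue() == prefixes.length) {
                out.add(e.getKey());
            }
        }
        Collections.sort(out);
        return out;
    }
}
```

The trade-off is that word order is ignored entirely, so "Hewitt Jennifer"
and "Jennifer Hewitt" return the same suggestions.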

Thanks,
Matt

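import java.util.Arrays;
import java.util.LinkedList;

import org.apache.lucene.document.Fieldable;
import org.apache.solr.schema.SchemaField;
import org.apache.solr.schema.StrField;

public class AutocompleteStrField extends StrField {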

	private static final char DELIMITER = '\u00ff';
	
	@Override
	public boolean isPolyField(){
		return true;
	}

	/**
	 * Given a {@link org.apache.solr.schema.SchemaField}, create one or more
	 * {@link org.apache.lucene.document.Fieldable} instances
	 * @param field the {@link org.apache.solr.schema.SchemaField}
	 * @param externalVal The value to add to the field
	 * @param boost The boost to apply
	 * @return An array of {@link org.apache.lucene.document.Fieldable}
	 *
	 * @see #createField(SchemaField, String, float)
	 * @see #isPolyField()
	 */
	@Override 
	public Fieldable[] createFields(SchemaField field, String externalVal, float boost) {
		String[] st = externalVal.toLowerCase().split(" ");
		LinkedList<String> tokens = new LinkedList<String>(Arrays.asList(st));
		Fieldable[] f = new Fieldable[st.length];

		int count = 0;
		String value = "";
		while(!tokens.isEmpty()) {
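			// Prepend the next word; skip the separator on the first pass so
			// the last suffix is indexed as "hewitt", not "hewitt ".
			String token = tokens.pollLast();
			value = value.isEmpty() ? token : token + " " + value;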
			f[count] = createField(field, value + DELIMITER + externalVal, boost);
			count++;
		}
		return f;
	}

	/** Given an indexed term, return the human readable representation */
	@Override
	public String indexedToReadable(String indexedForm) {
		return indexedForm.substring(indexedForm.lastIndexOf(DELIMITER) + 1);
	}
}


--
View this message in context: http://lucene.472066.n3.nabble.com/Contribution-Multiword-Inline-Prefix-Autocomplete-Idea-tp2965854p2965854.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: [Contribution] Multiword Inline-Prefix Autocomplete Idea

Posted by mtraynham <mt...@digitalsmiths.com>.

Ahh yes, thanks for the suggestions!  I've implemented them.

I thought about your second point previously and had encountered that
issue.  Once it's tokenized, I don't believe there is a way to get the full
string back from the token stream.


Re: [Contribution] Multiword Inline-Prefix Autocomplete Idea

Posted by Mike Sokolov <so...@ifactory.com>.

Cool!  suggestion: you might want to replace

externalVal.toLowerCase().split(" ");

with

externalVal.toLowerCase().split("\\s+");

also I bet folks might have different ideas about what to do with 
hyphens, so maybe:

externalVal.toLowerCase().split("[-\\s]+");

In fact why not make it a configurable parameter?  Or - even better - 
use some other existing token analysis chain?  I'm not sure how to fit 
that into Solr's architecture: can you analyze a field value and still 
access the unanalyzed text?
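
A rough sketch of the configurable-separator idea, in plain Java and
independent of Solr: the separator regex is passed in, and the original
(unanalyzed) value is carried along after the delimiter.  The class name
SuffixTermSplitter and the '\u00ff' delimiter are assumptions made for
illustration, and how such a parameter would be wired into a schema field
type is left open.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper; shows only the splitting, not the Solr integration.
final class SuffixTermSplitter {
    static final char DELIMITER = '\u00ff';

    private final Pattern separator;

    /** separatorRegex is the configurable part, e.g. "\\s+" or "[-\\s]+". */
    SuffixTermSplitter(String separatorRegex) {
        this.separator = Pattern.compile(separatorRegex);
    }

    /** Build one term per word-aligned suffix, shortest suffix first. */
    List<String> terms(String value) {
        List<String> words = new ArrayList<String>();
        for (String w : separator.split(value.toLowerCase(Locale.ROOT))) {
            if (!w.isEmpty()) { // a leading separator yields an empty first piece
                words.add(w);
            }
        }
        List<String> out = new ArrayList<String>();
        String suffix = "";
        for (int i = words.size() - 1; i >= 0; i--) {
            suffix = suffix.isEmpty() ? words.get(i) : words.get(i) + " " + suffix;
            // The untouched original value rides along after the delimiter.
            out.add(suffix + DELIMITER + value);
        }
        return out;
    }
}
```

With "[-\\s]+" as the separator, "Mary-Kate Olsen" yields terms starting
with "olsen", "kate olsen" and "mary kate olsen"; with "\\s+" the hyphenated
name stays a single word.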

-Mike