Posted to solr-user@lucene.apache.org by Neal Richter <nr...@gmail.com> on 2008/11/25 08:10:48 UTC

Analyzing CSV phrase fields

Hey all,

Very basic question: I want to index fields of comma-separated values.

Example document:
id: 1
title: Football Teams
keywords: philadelphia eagles, cleveland browns, new york jets

id: 2
title: Baseball Teams
keywords: "philadelphia phillies", "new york yankees", "cleveland indians"

A query of 'new york' should return both documents, but a quoted
phrase query of "yankees cleveland" should return nothing... meaning that
a comma always breaks a phrase.

I've created a textCSV type in the schema.xml file and used the
PatternTokenizerFactory to split on commas, and from there analysis can
proceed as normal via StopFilterFactory, LowerCaseFilterFactory, and
RemoveDuplicatesTokenFilterFactory:

<tokenizer class="solr.PatternTokenizerFactory" pattern="\s*,\s*" group="-1"/>
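
For reference, a minimal sketch of how the textCSV type described above might sit in schema.xml. The filter order, the stopwords.txt file name, and the positionIncrementGap value are assumptions for illustration and are not taken from the original post.

```xml
<!-- Sketch only: filter order, stopwords.txt and positionIncrementGap are assumed -->
<fieldType name="textCSV" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- group="-1" treats the pattern as a delimiter, so the input is split on commas -->
    <tokenizer class="solr.PatternTokenizerFactory" pattern="\s*,\s*" group="-1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
```

Note that with this tokenizer each comma-separated entry (for example "new york jets") comes out as a single token, so a query on 'new york' alone would not be expected to match unless the entries are split further into words.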

Has anyone done this before?  Can I somehow use an existing Analyzer (or a
combination of them)?  It seems as though I need to create a
PhraseDelimiterFilter modeled on the WordDelimiterFilter, though I am sure
there is a way to make an existing analyzer break things up the way I want.

Thanks - Neal Richter

Re: Analyzing CSV phrase fields

Posted by Yonik Seeley <yo...@apache.org>.

The easiest solution would be to create the documents you send to Solr
with multiple keywords fields... the values will be separated by a
position increment gap (the field type's positionIncrementGap), so a
phrase query won't see yankees adjacent to cleveland.
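
A minimal sketch of that approach, assuming a multiValued keywords field. The type name "text_kw", the analyzer chain, and the gap value of 100 are illustrative choices, not something specified in this thread.

```xml
<!-- schema.xml fragment: "text_kw" and the gap of 100 are assumed for illustration -->
<fieldType name="text_kw" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<field name="keywords" type="text_kw" indexed="true" stored="true" multiValued="true"/>

<!-- update message: one keywords field per comma-separated entry -->
<add>
  <doc>
    <field name="id">2</field>
    <field name="title">Baseball Teams</field>
    <field name="keywords">philadelphia phillies</field>
    <field name="keywords">new york yankees</field>
    <field name="keywords">cleveland indians</field>
  </doc>
</add>
```

With a setup like this, "new york" still matches as a phrase inside a single value, while "yankees" and "cleveland" end up roughly 100 positions apart, so an exact phrase query for "yankees cleveland" should not match. A sloppy phrase query with a slop at least as large as the gap could still match across values, so the gap should exceed any slop you expect to use.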

If you can't do that, then perhaps patch PatternTokenizer to put a
larger positionIncrement between groups.  Then you would need to
follow it with another filter that tokenizes on whitespace or some other
regex (which we currently don't have).

-Yonik
