Posted to solr-user@lucene.apache.org by straup <st...@gmail.com> on 2010/02/01 17:26:13 UTC

Fwd: machine tags, copy fields and pattern tokenizers

Hi,

Just a quick note to mention that I finally figured (most of) this out.

The short version is that if there's an explicit "index" analyzer (as in 
type="index") but not a corresponding "query" analyzer then Solr appears 
to use the first for all cases.

I guess this makes sense but it's a bit confusing so if I get a few 
minutes I will update the wiki to make the distinction explicit.
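
One way to picture why a missing "query" analyzer matters, as a minimal Python sketch rather than Solr's actual code. It assumes the pattern tokenizer emits only the chosen capture group of each match, and nothing at all when the pattern does not match. The pattern is a simplified, Python-compatible stand-in for the namespace pattern in the schema, and pattern_tokenize and keyword_tokenize are hypothetical helper names.

```python
import re

# Simplified stand-in for the mt_namespace pattern (an assumption, not the
# exact Java regex from schema.xml).
NAMESPACE_PATTERN = re.compile(r"([a-zA-Z0-9]\w*):.+")


def pattern_tokenize(text, pattern=NAMESPACE_PATTERN, group=1):
    """Emit the chosen capture group for every match, nothing otherwise."""
    return [m.group(group) for m in pattern.finditer(text)]


def keyword_tokenize(text):
    """Hypothetical query-side analyzer: keep the whole input as one token."""
    return [text]


# Index time: the full machine tag is reduced to its namespace.
indexed_tokens = pattern_tokenize("foo:bar=example")

# Query time, reusing the index analyzer: "foo" has no colon, so the
# pattern does not match and no tokens are produced.
query_tokens_same_analyzer = pattern_tokenize("foo")

# Query time, with a separate analyzer that keeps the term whole.
query_tokens_own_analyzer = keyword_tokenize("foo")

print(indexed_tokens, query_tokens_same_analyzer, query_tokens_own_analyzer)
```

Under those assumptions, reusing the index-time pattern at query time leaves nothing to search for, while a query analyzer that keeps the term whole lines up with the indexed token.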

The longer version is over here, for anyone interested:

	http://github.com/straup/solr-machinetags

Beyond that, I have a couple more questions:

# All the questions assume the following schema.xml:
# http://github.com/straup/solr-machinetags/blob/master/conf/schema.xml

Because all the values for a given namespace/predicate field get indexed 
in the same multiValued bucket, the faceting doesn't necessarily behave 
the way you'd expect. For example, if you index the following...

solr.add([{ 'id' : int(time.time()), 'body' : 'float thing', 'tag' : 
'w,t,f', 'machinetag' : 'dc:number=12345' }])

solr.add([{ 'id' : int(time.time()), 'body' : 'decimal thing', 'tag' : 
'a,b,c', 'machinetag' : 'dc:number=123.23' }])

solr.add([{ 'id' : int(time.time()), 'body' : 'negative thing', 'tag' : 
'a,b,c', 'machinetag' : ['dc:number=-45.23', 'asc:test=rara'] }])

...and then facet on the predicates for ?q=ns:dc (basically to ask: show 
me all the predicates for the "dc:" namespace) you end up with...

   "facet_fields":{
     "ns":[
       "asc",1,
       "dc",1]},

...which seems right from a Solr perspective but isn't really a correct 
representation of the machine tags.

Can anyone offer any ideas on a better/different way to model this data?
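
One possible direction, sketched in plain Python under stated assumptions and not tested against Solr: keep the namespace attached to the predicate by indexing cumulative prefixes of each machine tag ("dc", "dc:number", "dc:number=12345") as terms of a single field, then count only the "namespace:predicate" terms that belong to the namespace being asked about. The helper names are hypothetical, and the documents are the three from the example above.

```python
from collections import Counter


def machinetag_paths(machinetag):
    """Return the cumulative prefixes: 'ns', 'ns:pred', 'ns:pred=value'."""
    namespace, rest = machinetag.split(":", 1)
    predicate, value = rest.split("=", 1)
    return [
        namespace,
        f"{namespace}:{predicate}",
        f"{namespace}:{predicate}={value}",
    ]


# The machine tags of the three example documents.
docs = [
    ["dc:number=12345"],
    ["dc:number=123.23"],
    ["dc:number=-45.23", "asc:test=rara"],
]


def facet_predicates(docs, namespace):
    """Count documents per 'ns:pred' term, restricted to one namespace."""
    counts = Counter()
    for tags in docs:
        seen = set()  # count each term once per document
        for tag in tags:
            paths = machinetag_paths(tag)
            if paths[0] == namespace:
                seen.add(paths[1])
        counts.update(seen)
    return counts


print(facet_predicates(docs, "dc"))
```

Because each term carries its namespace, a predicate from "asc:" can no longer show up when only "dc:" was asked for. The cost is a larger term dictionary, since every tag contributes three terms.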

Also, has anyone figured out how to match on double quotes inside a 
regular expression defined in an XML attribute?

As in:

<tokenizer class="solr.PatternTokenizerFactory" 
pattern="^(?:(?:[a-zA-Z]|\d)(?:\w+)?)\:(?:(?:[a-zA-Z]|\d)(?:\w+)?)=(.+)" 
group="1" />

Where that pattern should really end:

	=\"?(.+)\"?$
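
In an XML attribute a literal double quote can be written as the entity &quot;, which the XML parser turns back into " before the pattern ever reaches the regex engine. The sketch below checks that with Python's ElementTree and re modules instead of Solr, so it is only an approximation: Solr uses Java's regex engine, though this particular pattern sticks to syntax both engines share. It also assumes the value group should be lazy, (.+?), because a greedy (.+) would swallow the closing quote. extract_value is a hypothetical helper.

```python
import re
import xml.etree.ElementTree as ET

# Variant of the tokenizer element from the message (an assumption, not a
# tested Solr config): quotes are written as &quot; and the value group is lazy.
TOKENIZER_XML = (
    r'<tokenizer class="solr.PatternTokenizerFactory" '
    r'pattern="^(?:(?:[a-zA-Z]|\d)(?:\w+)?)\:(?:(?:[a-zA-Z]|\d)(?:\w+)?)'
    r'=&quot;?(.+?)&quot;?$" '
    r'group="1" />'
)

element = ET.fromstring(TOKENIZER_XML)
pattern = element.get("pattern")  # the parser has already decoded &quot;
group = int(element.get("group"))


def extract_value(machinetag):
    """Return the capture group selected by the group attribute, or None."""
    match = re.search(pattern, machinetag)
    return match.group(group) if match else None


print(extract_value('dc:title="hello world"'))
print(extract_value("dc:number=123.23"))
```

The same entity trick applies to any attribute value delimited by double quotes. Delimiting the attribute with single quotes instead would also allow a bare " inside it.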

Thanks,

-------- Original Message --------
Subject: machine tags, copy fields and pattern tokenizers
Date: Mon, 25 Jan 2010 16:20:58 -0800
From: straup <st...@gmail.com>
Reply-To: straup@gmail.com
To: solr-user@lucene.apache.org

Hi,

I am trying to work out how to store, query and facet machine tags [1]
in Solr using a combination of copy fields and pattern tokenizer factories.

I am still relatively new to Solr so despite feeling like I've gone over
the docs, and friends, it's entirely possible I've missed something
glaringly obvious.

The short version is: Faceting works. Yay! You can facet on the
individual parts of a machine tag (namespace, predicate, value) and it
does what you'd expect. For example:

?q=*:*&facet=true&facet.field=mt_namespace&rows=0

numFound:115
foo:65
dc:48
lastfm:2

The longer version is: Even though faceting seems to work I can't query
(as in ?q=) on the individual fields.

For example, if a single "machinetag" (foo:bar=example) field is copied
to "mt_namespace", "mt_predicate" and "mt_value" fields I still can't
query for "?q=mt_namespace:foo".

It appears as though the entire machine tag is being copied to
mt_namespace even though my reading of the docs is that if a "group"
attribute is present in a solr.PatternTokenizerFactory analyzer then only
the matching capture group will be stored.

Is that incorrect?

I've included the field/fieldType definitions I'm using below. [2] Any
help/suggestions would be appreciated.

Cheers,

[1] http://www.flickr.com/groups/api/discuss/72157594497877875/

[2]

<field name="machine_tags" type="machinetag" indexed="true"
stored="true" required="false" multiValued="true"/>

<field name="mt_namespace" type="mt_namespace" indexed="true"
stored="true" required="false" multiValued="true" />

<field name="mt_predicate" type="mt_predicate" indexed="true"
stored="true" required="false" multiValued="true" />

<field name="mt_value" type="mt_value" indexed="true" stored="true"
required="false" multiValued="true" />

<copyField source="machine_tags" dest="mt_namespace" />
<copyField source="machine_tags" dest="mt_predicate" />
<copyField source="machine_tags" dest="mt_value" />

<fieldType name="machinetag" class="solr.TextField" />

<fieldType name="mt_namespace" class="solr.TextField">
   <analyzer>
     <tokenizer class="solr.PatternTokenizerFactory"
pattern="([a-zA-Z[0-9]](?:\w+)?):.+" group="1" />
   </analyzer>
</fieldType>

<fieldType name="mt_predicate" class="solr.TextField">
   <analyzer>
     <tokenizer class="solr.PatternTokenizerFactory"
pattern="[a-zA-Z[0-9]](?:\w+)?:([a-zA-Z[0-9]](?:\w+)?)=.+" group="1" />
   </analyzer>
</fieldType>

<fieldType name="mt_value" class="solr.TextField">
   <analyzer>
     <tokenizer class="solr.PatternTokenizerFactory"
pattern="[a-zA-Z[0-9]](?:\w+)?:[a-zA-Z[0-9]](?:\w+)?=(.+)" group="1" />
   </analyzer>
</fieldType>
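
As a rough check of what the three patterns are meant to capture, here is a small Python sketch. It is not Solr: Python's re module does not support Java's nested character classes, so [a-zA-Z[0-9]] is rewritten as the equivalent [a-zA-Z0-9], and split_machinetag is a hypothetical helper that applies each pattern and keeps group 1.

```python
import re

# Python-compatible rewrites of the three schema patterns (assumption:
# Java's [a-zA-Z[0-9]] is the union [a-zA-Z0-9]).
PATTERNS = {
    "mt_namespace": re.compile(r"([a-zA-Z0-9](?:\w+)?):.+"),
    "mt_predicate": re.compile(r"[a-zA-Z0-9](?:\w+)?:([a-zA-Z0-9](?:\w+)?)=.+"),
    "mt_value": re.compile(r"[a-zA-Z0-9](?:\w+)?:[a-zA-Z0-9](?:\w+)?=(.+)"),
}


def split_machinetag(machinetag):
    """Apply each pattern and keep capture group 1, or None on no match."""
    parts = {}
    for field, pattern in PATTERNS.items():
        match = pattern.search(machinetag)
        parts[field] = match.group(1) if match else None
    return parts


print(split_machinetag("foo:bar=example"))
```

A bare term such as "foo" matches none of the three patterns, which is worth keeping in mind when the same analyzer is also applied to query text.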

Re: Fwd: machine tags, copy fields and pattern tokenizers

Posted by straup <st...@gmail.com>.
I'm not sure it's a 100% solution but the new path hierarchy tokenizer 
seems promising. I've only played with it a little bit, with a little too 
much booze and not enough sleep (in the sky), so apologies for the 
potty-mouth-ness of this blog post.

http://www.aaronland.info/weblog/2011/04/02/status/#sky

Cheers,

On 3/29/11 6:00 PM, sukhdev wrote:
> Hi,
>
> Was you able to solve machine tag problem in solr.  Actually I am
> also looking if machine tags can be stored as index in solr and
> search in efficient way.
>
> Regards
>
>
> -- View this message in context:
> http://lucene.472066.n3.nabble.com/Fwd-machine-tags-copy-fields-and-pattern-tokenizers-tp506491p2751745.html
>
>
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Fwd: machine tags, copy fields and pattern tokenizers

Posted by sukhdev <su...@gmail.com>.
Hi,

Were you able to solve the machine tag problem in Solr? I am also
looking into whether machine tags can be indexed in Solr and searched
in an efficient way.

Regards


--
View this message in context: http://lucene.472066.n3.nabble.com/Fwd-machine-tags-copy-fields-and-pattern-tokenizers-tp506491p2751745.html
Sent from the Solr - User mailing list archive at Nabble.com.