Posted to solr-user@lucene.apache.org by Memory Makers <me...@gmail.com> on 2011/10/25 16:27:12 UTC

Pointers to processing hashtags

Greetings,

I am trying to index hashtags from Twitter -- these are tokens that start
with a # symbol and can have any number of alphanumeric characters.

Examples:
1. #jane
2. #Jane
3. #Jane!

At a high level I'd like to be able to:
1. differentiate between say #jane and #jane!
2. differentiate between a hashtag such as #jane and a regular text token
jane
3. ask for variations of #jane -- by this I mean that #jane? #jane!!! #jane!?!??
are all variations of jane

I'd appreciate pointers on what I should consider when I attempt to
do the above.

Thanks,

MM.

RE: Pointers to processing hashtags

Posted by "Jaeger, Jay - DOT" <Ja...@dot.wi.gov>.
Sounds like a possible application of solr.PatternTokenizerFactory:

http://lucene.apache.org/solr/api/org/apache/solr/analysis/PatternTokenizerFactory.html
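
As a rough illustration of the pattern idea (plain Python, not a Solr configuration), the sketch below uses a regular expression to pull hashtags out of a tweet while keeping their case and any trailing ! or ? marks. The pattern and the name extract_hashtags are assumptions made up for this example; a similar expression might serve as the tokenizer's pattern, but that would need checking against Java regex syntax and the tokenizer's documentation.

```python
import re

# Assumed pattern: '#', one or more ASCII alphanumerics, then any run of '!' or '?'.
HASHTAG_PATTERN = re.compile(r"#[A-Za-z0-9]+[!?]*")

def extract_hashtags(text):
    """Return every hashtag in text, preserving case and trailing !/? marks."""
    return HASHTAG_PATTERN.findall(text)

if __name__ == "__main__":
    print(extract_hashtags("hi #jane and #Jane! but not jane, also #jane!?!??"))
```

Because case and punctuation survive, #jane, #Jane and #Jane! stay distinct tokens, and the bare word jane is never emitted as a hashtag.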

You could use copyField to copy the entire string to a separate field (or set of fields) that are processed by patterns.
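
To make the multi-field idea concrete, here is a minimal Python sketch that simulates what several pattern-processed copies of one tweet might contain. It is not Solr code: the field names hashtag_exact, hashtag_base and text, and the function build_fields, are hypothetical and chosen only for illustration.

```python
import re

# Assumed patterns: a hashtag body plus its trailing !/? marks, and plain words
# that are not part of a hashtag (not preceded by '#' or another alphanumeric).
TAG_WITH_MARKS = re.compile(r"(#[A-Za-z0-9]+)([!?]*)")
PLAIN_WORD = re.compile(r"(?<![#A-Za-z0-9])[A-Za-z0-9]+")

def build_fields(tweet):
    """Derive three hypothetical field values from a single tweet string."""
    tags = TAG_WITH_MARKS.findall(tweet)
    return {
        # Exact form: distinguishes #jane from #jane! and #Jane.
        "hashtag_exact": [body + marks for body, marks in tags],
        # Normalized form: lowercased, marks dropped, so variations match.
        "hashtag_base": [body.lower() for body, _ in tags],
        # Ordinary words only, so plain jane is kept apart from #jane.
        "text": [word.lower() for word in PLAIN_WORD.findall(tweet)],
    }

if __name__ == "__main__":
    print(build_fields("jane likes #Jane! and #jane"))
```

Querying the exact field would separate #jane from #jane!, while querying the normalized field would return every variation of the tag.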

JRJ
