Posted to solr-user@lucene.apache.org by Memory Makers <me...@gmail.com> on 2011/10/25 16:27:12 UTC
Pointers to processing hashtags
Greetings,
I am trying to index hashtags from Twitter -- tokens that start
with a # symbol and can have any number of alphanumeric characters.
Examples:
1. #jane
2. #Jane
3. #Jane!
At a high level I'd like to be able to:
1. differentiate between, say, #jane and #jane!
2. differentiate between a hashtag such as #jane and a regular text token
jane
3. ask for variations on #jane -- by this I mean #jane?, #jane!!! and
#jane!?!?? are all variations of jane
I'd appreciate pointers on what my considerations should be when I attempt to
do the above.
Thanks,
MM.
RE: Pointers to processing hashtags
Posted by "Jaeger, Jay - DOT" <Ja...@dot.wi.gov>.
Sounds like a possible application of solr.PatternTokenizerFactory:
http://lucene.apache.org/solr/api/org/apache/solr/analysis/PatternTokenizerFactory.html
You could use copyField to copy the entire string to a separate field (or set of fields) whose analyzers tokenize it with patterns.
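To make the idea concrete, here is a rough sketch in plain Python (not Solr configuration) of the token streams that three pattern-driven fields might produce from one copied string. The field names, the regular expression, and the choice to lowercase the base form are all assumptions made for illustration; an actual schema would express similar patterns through the tokenizer's own settings.

```python
import re

# Assumed hashtag shape: '#', one or more alphanumerics, then optional
# trailing '!' / '?' punctuation (as in #jane!?!??).
HASHTAG = re.compile(r"#([A-Za-z0-9]+)([!?]*)")
WORD = re.compile(r"[A-Za-z0-9]+")


def split_fields(text):
    """Split one string into three hypothetical field token lists.

    hashtags_exact: hashtags as written, so #jane and #jane! stay distinct.
    hashtags_base:  '#' plus the lowercased tag, so #jane?, #Jane!!! collapse.
    words:          ordinary tokens only, so plain 'jane' never matches '#jane'.
    """
    matches = list(HASHTAG.finditer(text))
    exact = [m.group(0) for m in matches]
    base = ["#" + m.group(1).lower() for m in matches]
    # Blank out the hashtags before extracting plain words.
    plain = HASHTAG.sub(" ", text)
    words = [w.lower() for w in WORD.findall(plain)]
    return {"hashtags_exact": exact, "hashtags_base": base, "words": words}


if __name__ == "__main__":
    print(split_fields("jane loves #Jane! and #jane?"))
```

Searching the exact field would separate #jane from #jane!, searching the base field would return every variation of #jane, and the words field would only match the plain token jane.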
JRJ