Posted to java-user@lucene.apache.org by Stéphane Nicoll <st...@gmail.com> on 2013/11/05 08:40:26 UTC

Twitter analyser

Hi,

I am building an application that indexes tweets and offers some basic
search facilities on them.

I am trying to find a combination where the following would work:

* foo matches the word foo, the mention (@foo) or the hashtag (#foo)
* @foo matches only the mention
* #foo matches only the hashtag

It should match complete words, so I used the WhitespaceAnalyzer for indexing.

Any recommendations for this use case?

Thanks!
S.

Sent from my iPhone

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Twitter analyser

Posted by Stephane Nicoll <st...@gmail.com>.

Replying to myself: silly me. I am obviously creating the string with the
wrong length.

    final String term = new String(buffer, 1, length);

should be replaced by

    final String term = new String(buffer, 1, length - 1);

and the silly trim can go away. I guess I need more coffee.
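
To illustrate the fix: the constructor String(char[] value, int offset, int count) takes a character count as its third argument, not an end index. The sketch below is not the code from the gist. It assumes a term buffer that is longer than the current token and still holds leftover characters past `length`; the class and method names are made up for the example.

```java
final class PrefixStripDemo {

    // Buggy variant: skips the prefix but still asks for `length` characters,
    // so it reads one character past the end of the token.
    static String stripPrefixBuggy(char[] buffer, int length) {
        return new String(buffer, 1, length);
    }

    // Fixed variant: skipping one character leaves length - 1 characters.
    static String stripPrefix(char[] buffer, int length) {
        return new String(buffer, 1, length - 1);
    }

    public static void main(String[] args) {
        // Hypothetical buffer: larger than the token, with a leftover space at index 8.
        char[] buffer = {'#', 'S', 'o', 'm', 'e', 'T', 'A', 'G', ' ', 'x'};
        int length = 8; // length of "#SomeTAG"

        System.out.println("[" + stripPrefixBuggy(buffer, length) + "]");
        System.out.println("[" + stripPrefix(buffer, length) + "]");
    }
}
```

With this buffer the buggy variant returns the word plus one stray trailing character, which is the kind of thing a trim() would hide in some cases but not in others.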

S.




On Sat, Nov 9, 2013 at 9:45 AM, Stephane Nicoll <st...@gmail.com> wrote:

> Hi,
>
> This is what I've tried:
> https://gist.github.com/anonymous/7383104
>
> So far so good, except that something is definitely wrong in my code: the
> synonym does not seem to be emitted as a valid token. This is how my
> indexing analyzer is built:
>
>     private static final class MyIndexAnalyzer extends ReusableAnalyzerBase {
>         @Override
>         protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
>             final Tokenizer tokenizer = new WhitespaceTokenizer(Version.LUCENE_36, reader);
>             final TwitterFilter twitterFilter = new TwitterFilter(tokenizer);
>             final LowerCaseFilter filter = new LowerCaseFilter(Version.LUCENE_36, twitterFilter);
>             return new TokenStreamComponents(tokenizer, filter);
>         }
>     }
>
> I expect the LowerCaseFilter to pick up the synonym in exactly the same
> way as the original token, but it does not. Given the tweet
> "Bla Bla #SomeTAG", "#sometag" matches but "sometag" does not. All other
> use cases not involving a case mismatch work as expected.
>
> Does anyone know what's wrong with my code?
>
> Thanks for the support!
>
> S.

Re: Twitter analyser

Posted by Stephane Nicoll <st...@gmail.com>.

Hi,

This is what I've tried:
https://gist.github.com/anonymous/7383104

So far so good, except that something is definitely wrong in my code: the
synonym does not seem to be emitted as a valid token. This is how my
indexing analyzer is built:

    private static final class MyIndexAnalyzer extends ReusableAnalyzerBase {
        @Override
        protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
            final Tokenizer tokenizer = new WhitespaceTokenizer(Version.LUCENE_36, reader);
            final TwitterFilter twitterFilter = new TwitterFilter(tokenizer);
            final LowerCaseFilter filter = new LowerCaseFilter(Version.LUCENE_36, twitterFilter);
            return new TokenStreamComponents(tokenizer, filter);
        }
    }

I expect the LowerCaseFilter to pick up the synonym in exactly the same way
as the original token, but it does not. Given the tweet "Bla Bla #SomeTAG",
"#sometag" matches but "sometag" does not. All other use cases not involving
a case mismatch work as expected.

Does anyone know what's wrong with my code?

Thanks for the support!

S.




Re: Twitter analyser

Posted by Erick Erickson <er...@gmail.com>.

You have to get the values _into_ the index with the special characters;
that's where the issue is. Depending on your analysis chain, the special
characters may or may not even be in your index to search on in the first
place.

So it's not so much a matter of how many different words follow the special
characters as of how many special characters there are. What I'm thinking is
that as you index documents, you detect #foo, #blah, #whatever and
index #foo, foo, #blah, blah, etc. If all you have to do is specially handle
tokens that start with one of just a few different characters, it's not very
hard.
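
As a rough illustration of the idea above, here is a minimal sketch in plain Java, with no Lucene classes involved. It splits on whitespace, lowercases, and emits the bare word next to any token that starts with @ or #. The class name TweetTokenExpander and its method are hypothetical and only show the logic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class TweetTokenExpander {

    // Splits on whitespace and, for tokens starting with '@' or '#',
    // emits both the prefixed form and the bare word.
    static List<String> expand(String tweet) {
        List<String> tokens = new ArrayList<>();
        for (String raw : tweet.trim().split("\\s+")) {
            if (raw.isEmpty()) {
                continue; // blank input yields a single empty string
            }
            String token = raw.toLowerCase(Locale.ROOT);
            tokens.add(token);
            char first = token.charAt(0);
            if ((first == '@' || first == '#') && token.length() > 1) {
                tokens.add(token.substring(1));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [bla, bla, #sometag, sometag, from, @foo, foo]
        System.out.println(expand("Bla Bla #SomeTAG from @Foo"));
    }
}
```

In an actual Lucene TokenFilter the extra token would normally be emitted with a position increment of 0, so that it behaves like a synonym of the prefixed token; that part is not shown here.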

FWIW,
Erick


On Tue, Nov 5, 2013 at 8:33 AM, Stephane Nicoll <st...@gmail.com> wrote:

> Hi,
>
> Thanks for the reply. It's an index of tweets, so really any word is a
> target for this. That would mean a significant increase in index size. My
> volumes are really small, so that shouldn't be a problem (but
> performance/scalability is a concern).
>
> I have control over the query. Another solution would be to translate a
> query on "foo" into "foo OR #foo OR @foo".
>
> WDYT?
>
> Thanks!
> S.

Re: Twitter analyser

Posted by Stephane Nicoll <st...@gmail.com>.

Hi,

Thanks for the reply. It's an index of tweets, so really any word is a
target for this. That would mean a significant increase in index size. My
volumes are really small, so that shouldn't be a problem (but
performance/scalability is a concern).

I have control over the query. Another solution would be to translate a
query on "foo" into "foo OR #foo OR @foo".
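
A minimal sketch of that query-side rewrite, in plain Java. The class name QueryExpander is hypothetical, and the sketch assumes the index keeps the @ and # prefixes (for example via whitespace tokenization). The resulting string still has to go through whatever query parser and analyzer the application uses.

```java
final class QueryExpander {

    // A bare word is rewritten to match the word, the hashtag and the mention.
    // Terms that already start with '@' or '#' are left untouched.
    static String expand(String term) {
        if (term.isEmpty() || term.charAt(0) == '@' || term.charAt(0) == '#') {
            return term;
        }
        return term + " OR #" + term + " OR @" + term;
    }

    public static void main(String[] args) {
        System.out.println(expand("foo"));  // foo OR #foo OR @foo
        System.out.println(expand("#foo")); // #foo
    }
}
```

This keeps the index small at the cost of a three-clause query for every bare word.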

WDYT?

Thanks!
S.




On Tue, Nov 5, 2013 at 2:17 PM, Erick Erickson <er...@gmail.com> wrote:

> If the universe of items you want to match this way is small,
> consider something akin to synonyms: your indexing process
> emits two tokens, one with and one without the @ or #, which should
> cover your situation.
>
> FWIW,
> Erick

Re: Twitter analyser

Posted by Erick Erickson <er...@gmail.com>.

If the universe of items you want to match this way is small,
consider something akin to synonyms: your indexing process
emits two tokens, one with and one without the @ or #, which should
cover your situation.

FWIW,
Erick



Re: Twitter analyser

Posted by Lance Norskog <go...@gmail.com>.

This is a part-of-speech analyzer for tweets. It would make your index
far more useful.

http://www.ark.cs.cmu.edu/TweetNLP/


Re: Twitter analyser

Posted by Jack Krupansky <ja...@basetechnology.com>.

You can specify custom character types with the word delimiter filter, so
you could define "@" and "#" as "digit" and set SPLIT_ON_NUMERICS. This
would cause "@foo" to tokenize as two adjacent terms, and likewise for "#foo".
Unfortunately, a user name or tag that starts with a digit would not
tokenize as desired, but that seems uncommon. "foo" would match all three,
since the "@" or "#" would tokenize as a separate term.

Use:

public WordDelimiterFilter(TokenStream in,
                           byte[] charTypeTable,
                           int configurationFlags,
                           CharArraySet protWords)

See:
http://lucene.apache.org/core/4_5_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/WordDelimiterFilter.html

-- Jack Krupansky