Posted to user@cassandra.apache.org by Sasha Dolgy <sd...@gmail.com> on 2012/03/18 15:54:30 UTC

design that mimics twitter tweet search

Hi All,

With Twitter, when I search for words like "cassandra is the bestest", 4
tweets appear, including one I just posted.  My understanding is that,
internally, Twitter indexes each word in a tweet, irrespective of the
presence of a # hash tag, and adds the tweet id to a row for that word.
What puzzles me, and I am hopeful that some smart people on here can shed
some light on it, is how this would work with Cassandra?

row [ cassandra ]: key -> tweetid  / timestamp
row [ bestest ]: key -> tweetid / timestamp

I had thought that I could simply pull the list of all column names from
each row (one row per word) and flag the tweet ids that occur in every
row ... however, these rows would get quite long over time.

Am I missing an easier way to get the list of all tweet ids that exist in
multiple rows?
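
A minimal in-memory sketch of the row-per-word layout described above. A plain
Python dict stands in for the column family; `rows`, `index_tweet` and the
whitespace tokenizer are hypothetical names chosen for illustration, not
Twitter's or Cassandra's API.

```python
from collections import defaultdict

# rows[word][tweet_id] = timestamp
# One "row" per word, one "column" per tweet that contains the word.
rows = defaultdict(dict)

def index_tweet(tweet_id, timestamp, text):
    """Add the tweet id to the row of every distinct word in the tweet."""
    for word in set(text.lower().split()):
        rows[word][tweet_id] = timestamp

index_tweet(1, 1332086070, "cassandra is the bestest")
index_tweet(2, 1332086100, "cassandra is fast")
index_tweet(3, 1332086130, "the bestest coffee")

# Column names (tweet ids) of two of the rows
print(sorted(rows["cassandra"]))  # [1, 2]
print(sorted(rows["bestest"]))    # [1, 3]
```

Every tweet adds one column to the row of each of its words, which is why the
rows for common words grow without bound.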

-- 
Sasha Dolgy
sasha.dolgy@gmail.com

Re: design that mimics twitter tweet search

Posted by Sasha Dolgy <sd...@gmail.com>.

most excellent ... thanks Chris!

On Mon, Mar 19, 2012 at 9:23 AM, Chris Goffinet <cg...@chrisgoffinet.com> wrote:

> We do not use Cassandra for search. We made modifications to Lucene.
>
> Here is a blog post on our engineering section that talks about what we
> did:
>
>
> http://engineering.twitter.com/2011/04/twitter-search-is-now-3x-faster_1656.html
>
>
>

Re: design that mimics twitter tweet search

Posted by Chris Goffinet <cg...@chrisgoffinet.com>.

We do not use Cassandra for search. We made modifications to Lucene.

Here is a blog post on our engineering section that talks about what we did:

http://engineering.twitter.com/2011/04/twitter-search-is-now-3x-faster_1656.html


On Sun, Mar 18, 2012 at 11:22 PM, Tharindu Mathew <mc...@gmail.com> wrote:

> Sasha,
>
> It depends on how you implement it, I guess... Maybe Twitter uses
> Solandra, which is very good at indexing text in different ways but has
> the power of Cassandra underneath...
>
> If you're doing your own implementation of indexing, be mindful that you
> can either break the sentence into four words and index each one, or index
> the whole sentence. The two approaches produce different results, as a
> phrase can mean something completely different from its individual words,
> depending on the context.
>
>
> --
> Regards,
>
> Tharindu
>
> blog: http://mackiemathew.com/
>
>

Re: design that mimics twitter tweet search

Posted by Tharindu Mathew <mc...@gmail.com>.

Sasha,

It depends on how you implement it, I guess... Maybe Twitter uses Solandra,
which is very good at indexing text in different ways but has the power of
Cassandra underneath...

If you're doing your own implementation of indexing, be mindful that you can
either break the sentence into four words and index each one, or index the
whole sentence. The two approaches produce different results, as a phrase
can mean something completely different from its individual words,
depending on the context.
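
To make the difference concrete, here is a small sketch that indexes the same
tweets both ways. The dicts, function names and sample tweets are assumptions
made up for illustration; a real indexer (Solandra, Lucene) does far more than
split on whitespace.

```python
def index_by_word(index, tweet_id, text):
    """One index entry per word in the tweet."""
    for word in text.lower().split():
        index.setdefault(word, set()).add(tweet_id)

def index_by_sentence(index, tweet_id, text):
    """One index entry for the whole tweet text."""
    index.setdefault(text.lower(), set()).add(tweet_id)

word_index, sentence_index = {}, {}
tweets = {
    1: "cassandra is the bestest",
    2: "the bestest thing is cassandra",
}
for tweet_id, text in tweets.items():
    index_by_word(word_index, tweet_id, text)
    index_by_sentence(sentence_index, tweet_id, text)

def search_words(query):
    """Tweets containing every word of the query, in any order."""
    sets = [word_index.get(w, set()) for w in query.lower().split()]
    return set.intersection(*sets) if sets else set()

def search_sentence(query):
    """Tweets whose full text equals the query."""
    return sentence_index.get(query.lower(), set())

print(search_words("cassandra is the bestest"))     # {1, 2}
print(search_sentence("cassandra is the bestest"))  # {1}
```

The word index matches both tweets because word order is lost, while the
sentence index matches only the exact text.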

On Mon, Mar 19, 2012 at 7:35 AM, Andrey V. Panov <pa...@gmail.com> wrote:

> Why do you suppose they do search on Cassandra?
>
>


-- 
Regards,

Tharindu

blog: http://mackiemathew.com/

Re: design that mimics twitter tweet search

Posted by "Andrey V. Panov" <pa...@gmail.com>.

Why do you suppose they do search on Cassandra?

On 19 March 2012 00:16, Sasha Dolgy <sd...@gmail.com> wrote:

> yes -- but given I have two keywords, and want to find all tweets that
> have both "cassandra" and "bestest" ... that means retrieving all columns +
> values in each row, iterating through both to see which tweet ids in one
> also exist in the other, and finishing up with a consolidated list of the
> tweet ids that exist in both.  just seems clunky to me ... ?
>
>
> --
> Sasha Dolgy
> sasha.dolgy@gmail.com
>

Re: design that mimics twitter tweet search

Posted by Sasha Dolgy <sd...@gmail.com>.

yes -- but given I have two keywords, and want to find all tweets that have
both "cassandra" and "bestest" ... that means retrieving all columns + values
in each row, iterating through both to see which tweet ids in one also exist
in the other, and finishing up with a consolidated list of the tweet ids that
exist in both.  just seems clunky to me ... ?
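
For reference, the intersection described here does not need both rows fully
loaded. The sketch below assumes the column names are the tweet ids and that
each row is read back in ascending column order, so the two rows can be walked
in step like a merge join. `intersect_sorted` is a hypothetical helper, not a
Cassandra client call.

```python
def intersect_sorted(a, b):
    """Yield the ids present in both ascending iterables, reading each once.

    Assumes neither input contains None and both are sorted ascending.
    """
    it_a, it_b = iter(a), iter(b)
    x, y = next(it_a, None), next(it_b, None)
    while x is not None and y is not None:
        if x == y:
            yield x
            x, y = next(it_a, None), next(it_b, None)
        elif x < y:
            x = next(it_a, None)   # advance the side that is behind
        else:
            y = next(it_b, None)

cassandra_row = [1, 4, 7, 9]    # tweet ids in the "cassandra" row
bestest_row = [2, 4, 9, 12]     # tweet ids in the "bestest" row
print(list(intersect_sorted(cassandra_row, bestest_row)))  # [4, 9]
```

Because it is a generator, the walk can stop as soon as a page of results is
full, which matters when one of the rows is very long.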

On Sun, Mar 18, 2012 at 4:12 PM, Benoit Perroud <be...@noisette.ch> wrote:

> The simplest model you could use is the keyword as the row key, a
> timestamp/time UUID as the column name and the tweetid as the value
>
> -> cf['keyword']['timestamp'] = tweetid
>
> then you do a range query to get all tweetids sorted by time (you may
> want them in reverse order) and you can limit the result to the number of
> tweets displayed on the page.
>
> As some rows can become large, you could partition the keys by
> concatenating, for instance, the keyword with the month and year.
>
>
> --
> sent from my Nokia 3210
>



-- 
Sasha Dolgy
sasha.dolgy@gmail.com

Re: design that mimics twitter tweet search

Posted by Benoit Perroud <be...@noisette.ch>.

The simplest model you could use is the keyword as the row key, a
timestamp/time UUID as the column name and the tweetid as the value

-> cf['keyword']['timestamp'] = tweetid

then you do a range query to get all tweetids sorted by time (you may
want them in reverse order) and you can limit the result to the number of
tweets displayed on the page.

As some rows can become large, you could partition the keys by
concatenating, for instance, the keyword with the month and year.
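
A rough in-memory sketch of this model, with a dict standing in for the column
family. The `keyword:YYYYMM` key format and the names `row_key`, `add` and
`latest` are assumptions for illustration only; a real client would issue a
reversed, limited column slice instead of sorting in the application.

```python
from datetime import datetime, timezone

cf = {}  # cf[row_key][timestamp] = tweetid

def row_key(keyword, ts):
    """Partition rows by concatenating the keyword with the year and month."""
    d = datetime.fromtimestamp(ts, tz=timezone.utc)
    return f"{keyword}:{d.year:04d}{d.month:02d}"

def add(keyword, ts, tweetid):
    # Plain second timestamps collide when two tweets share a second;
    # a time UUID column name avoids that.
    cf.setdefault(row_key(keyword, ts), {})[ts] = tweetid

def latest(keyword, ts, limit):
    """Newest-first tweetids from the bucket containing ts, capped at limit."""
    row = cf.get(row_key(keyword, ts), {})
    return [row[t] for t in sorted(row, reverse=True)[:limit]]

add("cassandra", 1330000000, "t0")   # February 2012
add("cassandra", 1332086070, "t1")   # March 2012
add("cassandra", 1332086100, "t2")   # March 2012

print(row_key("cassandra", 1332086070))    # cassandra:201203
print(latest("cassandra", 1332086070, 1))  # ['t2']
```

A page that needs more results than the current bucket holds would continue
with the previous month's key.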


2012/3/18 Sasha Dolgy <sd...@gmail.com>:
> Hi All,
>
> With Twitter, when I search for words like "cassandra is the bestest", 4
> tweets appear, including one I just posted.  My understanding is that,
> internally, Twitter indexes each word in a tweet, irrespective of the
> presence of a # hash tag, and adds the tweet id to a row for that word.
> What puzzles me, and I am hopeful that some smart people on here can shed
> some light on it, is how this would work with Cassandra?
>
> row [ cassandra ]: key -> tweetid  / timestamp
> row [ bestest ]: key -> tweetid / timestamp
>
> I had thought that I could simply pull the list of all column names from
> each row (one row per word) and flag the tweet ids that occur in every
> row ... however, these rows would get quite long over time.
>
> Am I missing an easier way to get the list of all tweet ids that exist in
> multiple rows?
>
> --
> Sasha Dolgy
> sasha.dolgy@gmail.com



-- 
sent from my Nokia 3210