You are viewing a plain text version of this content. The canonical link for it is here.

Posted to solr-user@lucene.apache.org by Mark <st...@gmail.com> on 2010/07/27 01:43:57 UTC

Design questions/Schema Help

We are thinking about using Cassandra to store our search logs. Can 
someone point me in the right direction/lend some guidance on design? I 
am new to Cassandra and I am having trouble wrapping my head around some 
of these new concepts. My brain keeps wanting to go back to a RDBMS design.

We will be storing the user query, # of hits returned and their session 
id. We would like to be able to answer the following questions.

- What is the n most popular queries and their counts within the last x 
(mins/hours/days/etc). Basically the most popular searches within a 
given time range.
- What is the most popular query within the last x where hits = 0. Same 
as above but with an extra "where" clause
- For session id x give me all their other queries
- What are all the session ids that searched for 'foos'

We accomplish the above functionality w/ MySQL using 2 tables. One for 
the raw search log information and the other to keep the 
aggregate/running counts of queries.

Would this sort of ad-hoc querying be better implemented using Hadoop + 
Hive? If so, should I be storing all this information in Cassandra then 
using Hadoop to retrieve it?

Thanks for your suggestions

Re: Design questions/Schema Help

Posted by Kiwi de coder <ki...@gmail.com>.

i think the search log will require a lot of storage which may make indexes
size unreasonable large if store in solr.

and the aggregration results may not really fixed in lucene index structure.
:)

kiwi
happy hacking !



On Tue, Jul 27, 2010 at 7:47 AM, Tommy Chheng <to...@gmail.com>wrote:

>  Alternatively, have you considered storing(or i should say indexing) the
> search logs with Solr?
>
> This lets you text search across your search queries. You can perform time
> range queries with solr as well.
>
> @tommychheng
> Programmer and UC Irvine Graduate Student
> Find a great grad school based on research interests:
> http://gradschoolnow.com
>
>
>
> On 7/26/10 4:43 PM, Mark wrote:
>
>> We are thinking about using Cassandra to store our search logs. Can
>> someone point me in the right direction/lend some guidance on design? I am
>> new to Cassandra and I am having trouble wrapping my head around some of
>> these new concepts. My brain keeps wanting to go back to a RDBMS design.
>>
>> We will be storing the user query, # of hits returned and their session
>> id. We would like to be able to answer the following questions.
>>
>> - What is the n most popular queries and their counts within the last x
>> (mins/hours/days/etc). Basically the most popular searches within a given
>> time range.
>> - What is the most popular query within the last x where hits = 0. Same as
>> above but with an extra "where" clause
>> - For session id x give me all their other queries
>> - What are all the session ids that searched for 'foos'
>>
>> We accomplish the above functionality w/ MySQL using 2 tables. One for the
>> raw search log information and the other to keep the aggregate/running
>> counts of queries.
>>
>> Would this sort of ad-hoc querying be better implemented using Hadoop +
>> Hive? If so, should I be storing all this information in Cassandra then
>> using Hadoop to retrieve it?
>>
>> Thanks for your suggestions
>>
>>

Re: Design questions/Schema Help

Posted by Tommy Chheng <to...@gmail.com>.

  Alternatively, have you considered storing(or i should say indexing) 
the search logs with Solr?

This lets you text search across your search queries. You can perform 
time range queries with solr as well.

@tommychheng
Programmer and UC Irvine Graduate Student
Find a great grad school based on research interests: http://gradschoolnow.com


On 7/26/10 4:43 PM, Mark wrote:
> We are thinking about using Cassandra to store our search logs. Can 
> someone point me in the right direction/lend some guidance on design? 
> I am new to Cassandra and I am having trouble wrapping my head around 
> some of these new concepts. My brain keeps wanting to go back to a 
> RDBMS design.
>
> We will be storing the user query, # of hits returned and their 
> session id. We would like to be able to answer the following questions.
>
> - What is the n most popular queries and their counts within the last 
> x (mins/hours/days/etc). Basically the most popular searches within a 
> given time range.
> - What is the most popular query within the last x where hits = 0. 
> Same as above but with an extra "where" clause
> - For session id x give me all their other queries
> - What are all the session ids that searched for 'foos'
>
> We accomplish the above functionality w/ MySQL using 2 tables. One for 
> the raw search log information and the other to keep the 
> aggregate/running counts of queries.
>
> Would this sort of ad-hoc querying be better implemented using Hadoop 
> + Hive? If so, should I be storing all this information in Cassandra 
> then using Hadoop to retrieve it?
>
> Thanks for your suggestions
>

Re: Design questions/Schema Help

Posted by Chantal Ackermann <ch...@btelligent.de>.

Hi,

IMHO you can do this with date range queries and (date) facets.
The DateMathParser will allow you to normalize dates on min/hours/days.
If you hit a limit there, then just add a field with an integer for
either min/hour/day. This way you'll loose the month information - which
is sometimes what you want.

You probably want the document entity to be a query with fields:
query
user (id? if you have that)
sessionid
date

the most popular query within a date range is the query that was logged
most times? Do a search on the date range:
q=date:[start TO end]
with facet on the query which gives you the count similar to "group by &
count" aggregation functionality in an RDBMS. You can do multiple facets
at the same time but be carefull what you are querying for - it will
impact the facet count. You can use functions to change the base of each
facet.

http://wiki.apache.org/solr/SimpleFacetParameters

Cheers,
Chantal

On Tue, 2010-07-27 at 01:43 +0200, Mark wrote:
> We are thinking about using Cassandra to store our search logs. Can 
> someone point me in the right direction/lend some guidance on design? I 
> am new to Cassandra and I am having trouble wrapping my head around some 
> of these new concepts. My brain keeps wanting to go back to a RDBMS design.
> 
> We will be storing the user query, # of hits returned and their session 
> id. We would like to be able to answer the following questions.
> 
> - What is the n most popular queries and their counts within the last x 
> (mins/hours/days/etc). Basically the most popular searches within a 
> given time range.
> - What is the most popular query within the last x where hits = 0. Same 
> as above but with an extra "where" clause
> - For session id x give me all their other queries
> - What are all the session ids that searched for 'foos'
> 
> We accomplish the above functionality w/ MySQL using 2 tables. One for 
> the raw search log information and the other to keep the 
> aggregate/running counts of queries.
> 
> Would this sort of ad-hoc querying be better implemented using Hadoop + 
> Hive? If so, should I be storing all this information in Cassandra then 
> using Hadoop to retrieve it?
> 
> Thanks for your suggestions

Re: Design questions/Schema Help

Posted by Mark <st...@gmail.com>.

On 7/26/10 4:43 PM, Mark wrote:
> We are thinking about using Cassandra to store our search logs. Can 
> someone point me in the right direction/lend some guidance on design? 
> I am new to Cassandra and I am having trouble wrapping my head around 
> some of these new concepts. My brain keeps wanting to go back to a 
> RDBMS design.
>
> We will be storing the user query, # of hits returned and their 
> session id. We would like to be able to answer the following questions.
>
> - What is the n most popular queries and their counts within the last 
> x (mins/hours/days/etc). Basically the most popular searches within a 
> given time range.
> - What is the most popular query within the last x where hits = 0. 
> Same as above but with an extra "where" clause
> - For session id x give me all their other queries
> - What are all the session ids that searched for 'foos'
>
> We accomplish the above functionality w/ MySQL using 2 tables. One for 
> the raw search log information and the other to keep the 
> aggregate/running counts of queries.
>
> Would this sort of ad-hoc querying be better implemented using Hadoop 
> + Hive? If so, should I be storing all this information in Cassandra 
> then using Hadoop to retrieve it?
>
> Thanks for your suggestions 
Whoops wrong forum