You are viewing a plain text version of this content. The canonical link for it is here.
Posted to solr-user@lucene.apache.org by Mark <st...@gmail.com> on 2010/07/27 01:43:57 UTC
Design questions/Schema Help
We are thinking about using Cassandra to store our search logs. Can
someone point me in the right direction/lend some guidance on design? I
am new to Cassandra and I am having trouble wrapping my head around some
of these new concepts. My brain keeps wanting to go back to a RDBMS design.
We will be storing the user query, # of hits returned and their session
id. We would like to be able to answer the following questions.
- What is the n most popular queries and their counts within the last x
(mins/hours/days/etc). Basically the most popular searches within a
given time range.
- What is the most popular query within the last x where hits = 0. Same
as above but with an extra "where" clause
- For session id x give me all their other queries
- What are all the session ids that searched for 'foos'
We accomplish the above functionality w/ MySQL using 2 tables. One for
the raw search log information and the other to keep the
aggregate/running counts of queries.
Would this sort of ad-hoc querying be better implemented using Hadoop +
Hive? If so, should I be storing all this information in Cassandra then
using Hadoop to retrieve it?
Thanks for your suggestions
Re: Design questions/Schema Help
Posted by Kiwi de coder <ki...@gmail.com>.
i think the search log will require a lot of storage which may make indexes
size unreasonable large if store in solr.
and the aggregration results may not really fixed in lucene index structure.
:)
kiwi
happy hacking !
On Tue, Jul 27, 2010 at 7:47 AM, Tommy Chheng <to...@gmail.com>wrote:
> Alternatively, have you considered storing(or i should say indexing) the
> search logs with Solr?
>
> This lets you text search across your search queries. You can perform time
> range queries with solr as well.
>
> @tommychheng
> Programmer and UC Irvine Graduate Student
> Find a great grad school based on research interests:
> http://gradschoolnow.com
>
>
>
> On 7/26/10 4:43 PM, Mark wrote:
>
>> We are thinking about using Cassandra to store our search logs. Can
>> someone point me in the right direction/lend some guidance on design? I am
>> new to Cassandra and I am having trouble wrapping my head around some of
>> these new concepts. My brain keeps wanting to go back to a RDBMS design.
>>
>> We will be storing the user query, # of hits returned and their session
>> id. We would like to be able to answer the following questions.
>>
>> - What is the n most popular queries and their counts within the last x
>> (mins/hours/days/etc). Basically the most popular searches within a given
>> time range.
>> - What is the most popular query within the last x where hits = 0. Same as
>> above but with an extra "where" clause
>> - For session id x give me all their other queries
>> - What are all the session ids that searched for 'foos'
>>
>> We accomplish the above functionality w/ MySQL using 2 tables. One for the
>> raw search log information and the other to keep the aggregate/running
>> counts of queries.
>>
>> Would this sort of ad-hoc querying be better implemented using Hadoop +
>> Hive? If so, should I be storing all this information in Cassandra then
>> using Hadoop to retrieve it?
>>
>> Thanks for your suggestions
>>
>>
Re: Design questions/Schema Help
Posted by Tommy Chheng <to...@gmail.com>.
Alternatively, have you considered storing(or i should say indexing)
the search logs with Solr?
This lets you text search across your search queries. You can perform
time range queries with solr as well.
@tommychheng
Programmer and UC Irvine Graduate Student
Find a great grad school based on research interests: http://gradschoolnow.com
On 7/26/10 4:43 PM, Mark wrote:
> We are thinking about using Cassandra to store our search logs. Can
> someone point me in the right direction/lend some guidance on design?
> I am new to Cassandra and I am having trouble wrapping my head around
> some of these new concepts. My brain keeps wanting to go back to a
> RDBMS design.
>
> We will be storing the user query, # of hits returned and their
> session id. We would like to be able to answer the following questions.
>
> - What is the n most popular queries and their counts within the last
> x (mins/hours/days/etc). Basically the most popular searches within a
> given time range.
> - What is the most popular query within the last x where hits = 0.
> Same as above but with an extra "where" clause
> - For session id x give me all their other queries
> - What are all the session ids that searched for 'foos'
>
> We accomplish the above functionality w/ MySQL using 2 tables. One for
> the raw search log information and the other to keep the
> aggregate/running counts of queries.
>
> Would this sort of ad-hoc querying be better implemented using Hadoop
> + Hive? If so, should I be storing all this information in Cassandra
> then using Hadoop to retrieve it?
>
> Thanks for your suggestions
>
Re: Design questions/Schema Help
Posted by Chantal Ackermann <ch...@btelligent.de>.
Hi,
IMHO you can do this with date range queries and (date) facets.
The DateMathParser will allow you to normalize dates on min/hours/days.
If you hit a limit there, then just add a field with an integer for
either min/hour/day. This way you'll loose the month information - which
is sometimes what you want.
You probably want the document entity to be a query with fields:
query
user (id? if you have that)
sessionid
date
the most popular query within a date range is the query that was logged
most times? Do a search on the date range:
q=date:[start TO end]
with facet on the query which gives you the count similar to "group by &
count" aggregation functionality in an RDBMS. You can do multiple facets
at the same time but be carefull what you are querying for - it will
impact the facet count. You can use functions to change the base of each
facet.
http://wiki.apache.org/solr/SimpleFacetParameters
Cheers,
Chantal
On Tue, 2010-07-27 at 01:43 +0200, Mark wrote:
> We are thinking about using Cassandra to store our search logs. Can
> someone point me in the right direction/lend some guidance on design? I
> am new to Cassandra and I am having trouble wrapping my head around some
> of these new concepts. My brain keeps wanting to go back to a RDBMS design.
>
> We will be storing the user query, # of hits returned and their session
> id. We would like to be able to answer the following questions.
>
> - What is the n most popular queries and their counts within the last x
> (mins/hours/days/etc). Basically the most popular searches within a
> given time range.
> - What is the most popular query within the last x where hits = 0. Same
> as above but with an extra "where" clause
> - For session id x give me all their other queries
> - What are all the session ids that searched for 'foos'
>
> We accomplish the above functionality w/ MySQL using 2 tables. One for
> the raw search log information and the other to keep the
> aggregate/running counts of queries.
>
> Would this sort of ad-hoc querying be better implemented using Hadoop +
> Hive? If so, should I be storing all this information in Cassandra then
> using Hadoop to retrieve it?
>
> Thanks for your suggestions
Re: Design questions/Schema Help
Posted by Mark <st...@gmail.com>.
On 7/26/10 4:43 PM, Mark wrote:
> We are thinking about using Cassandra to store our search logs. Can
> someone point me in the right direction/lend some guidance on design?
> I am new to Cassandra and I am having trouble wrapping my head around
> some of these new concepts. My brain keeps wanting to go back to a
> RDBMS design.
>
> We will be storing the user query, # of hits returned and their
> session id. We would like to be able to answer the following questions.
>
> - What is the n most popular queries and their counts within the last
> x (mins/hours/days/etc). Basically the most popular searches within a
> given time range.
> - What is the most popular query within the last x where hits = 0.
> Same as above but with an extra "where" clause
> - For session id x give me all their other queries
> - What are all the session ids that searched for 'foos'
>
> We accomplish the above functionality w/ MySQL using 2 tables. One for
> the raw search log information and the other to keep the
> aggregate/running counts of queries.
>
> Would this sort of ad-hoc querying be better implemented using Hadoop
> + Hive? If so, should I be storing all this information in Cassandra
> then using Hadoop to retrieve it?
>
> Thanks for your suggestions
Whoops wrong forum