Posted to java-user@lucene.apache.org by Lukas Vlcek <lu...@gmail.com> on 2007/08/10 09:28:17 UTC

How to keep user search history and how to turn it into information?

Hi,

I would like to keep user search history data and I am looking for
ideas, advice, and recommendations. In general I would like to talk about
methods of storing such data, its structure, and how to turn it into valuable
information.

As for the structure:
==============
For now I don't have an exact idea about what kind of information I should
keep. I know that this is application specific, but I believe there can be
some common general patterns. As of now, I think it can be useful to keep the
following (a sketch of one such log record follows the list):

1) system time (time of issuing the query) and userid
2) original user query in raw form (untokenized)
3) expanded user query (both tokenized and untokenized can be useful)
4) query execution time
5) # of objects retrieved from the index
6) total object count in the index (this can change over time)
7) and possibly whether the user clicked some result, and if so which one (the
hit number) and at what time
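For illustration, one such record could be a single tab-separated log line per
query; the field names and layout below are made up, not an established format:

// Hypothetical formatter for one search-history record; the field names are
// illustrative only.
public class QueryLogRecord {
    public static String format(long timeMillis, String userId, String rawQuery,
                                String expandedQuery, long execMillis,
                                int numHits, int totalDocs, int clickedRank) {
        StringBuilder sb = new StringBuilder();
        sb.append(timeMillis).append('\t')      // 1) system time of the query
          .append(userId).append('\t')          // 1) userid
          .append(rawQuery).append('\t')        // 2) raw, untokenized query
          .append(expandedQuery).append('\t')   // 3) expanded/rewritten query
          .append(execMillis).append('\t')      // 4) execution time in ms
          .append(numHits).append('\t')         // 5) # of hits returned
          .append(totalDocs).append('\t')       // 6) total docs in the index
          .append(clickedRank);                 // 7) clicked hit number, -1 if none
        return sb.toString();
    }
}

A flat line like this is trivial to append to a log file and easy to split
later in a Map-Reduce job.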

As for the information I can get from this:
=============================
Such minimal data collection could show, generally speaking, whether the search
engine serves users well or not. I should note that for the users in this case
the only other option is not to use the search engine at all (so the data
should not be biased by users resorting to an alternative search
method). I should be able to learn whether:

1) there are bottleneck queries (Prefix, Fuzzy, Proximity queries, ...)
2) users are finding what they want (they can find it fast, and results are
ordered by properly defined relevance [my model is well tuned in terms of
term weights], so the result they click is among the first hits)
3) users can formulate queries well (do they issue queries which return all
documents in the index, or can they issue queries which return just a couple of
documents)
4) ...?... etc...

As for the storage method:
===================
I was planning to keep such data in a database, but now it seems to me that it
will be better to keep it directly in an index (a Lucene index). It seems to me
that this approach would allow better fuzzy searches across the history and
more efficient extraction of relevant objects and their counts (with the
benefit of relevance-based search on top of the history search corpus).
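
To illustrate the Lucene-index variant, a rough sketch against the Lucene 2.x
field API could look like the following (the field names and index path are
only examples, not a fixed schema):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.DateTools;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class SearchHistoryIndexer {
    private final IndexWriter writer;

    public SearchHistoryIndexer(String path, boolean create) throws Exception {
        writer = new IndexWriter(path, new StandardAnalyzer(), create);
    }

    public void log(String userId, String rawQuery, String expandedQuery,
                    long execMillis, int numHits) throws Exception {
        Document doc = new Document();
        // exact-match fields are stored but not tokenized
        doc.add(new Field("user", userId,
                Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.add(new Field("time",
                DateTools.timeToString(System.currentTimeMillis(),
                        DateTools.Resolution.SECOND),
                Field.Store.YES, Field.Index.UN_TOKENIZED));
        // queries are tokenized so they can be searched (fuzzily) later
        doc.add(new Field("rawQuery", rawQuery,
                Field.Store.YES, Field.Index.TOKENIZED));
        doc.add(new Field("expandedQuery", expandedQuery,
                Field.Store.YES, Field.Index.TOKENIZED));
        doc.add(new Field("execMillis", String.valueOf(execMillis),
                Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.add(new Field("numHits", String.valueOf(numHits),
                Field.Store.YES, Field.Index.UN_TOKENIZED));
        writer.addDocument(doc);
    }

    public void close() throws Exception {
        writer.close();
    }
}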

I think a more scalable solution would be to keep such data in a plain flat
file and then periodically recreate the search history index (or several
indices) from it (for example by a Map-Reduce-like task). Even better, the flat
file could be stored in a distributed file system. However, for now I would
like to start with something simple.
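
For the simple start, a minimal sketch of a daily-rolling flat-file logger
might be enough (the path pattern below is just an assumption):

import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.text.SimpleDateFormat;
import java.util.Date;

public class RollingQueryLog {
    private final String dir;
    private String currentDay;
    private PrintWriter out;

    public RollingQueryLog(String dir) {
        this.dir = dir;
    }

    /** Append one log line, rolling to a new file when the date changes. */
    public synchronized void log(String line) throws IOException {
        String day = new SimpleDateFormat("yyyy-MM-dd").format(new Date());
        if (!day.equals(currentDay)) {
            if (out != null) out.close();
            // e.g. /var/log/search/query-2007-08-13.log
            out = new PrintWriter(new FileWriter(dir + "/query-" + day + ".log", true));
            currentDay = day;
        }
        out.println(line);
        out.flush();
    }
}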

I know this is a complex topic...

Regards,
Lukas

Re: How to keep user search history and how to turn it into information?

Posted by karl wettin <ka...@gmail.com>.
On 13 Aug 2007, at 12:49, Lukas Vlcek wrote:

>>> But I am looking for a more IR-oriented application of this information. I
>>> remember that once I read on the Lucene mailing list that somebody suggested
>>> utilization of previously issued user queries for suggestions of
>>> similar/other/related queries or for typo checking. While there are well
>>> discussed methods (moreLikeThis, did you mean, similarity, ... etc) in the
>>> Lucene community I am still wondering if one can use user search history
>>> data for such a purpose and if the answer is yes then how (practical
>>> examples are welcome).

I've implemented an adaptive didyoumean-thing that somewhat resembles
reinforcement learning: <http://issues.apache.org/jira/browse/LUCENE-626>

It sort of depends on old stuff, but should start up given you patch the
right trunk revision with the right version of the dependencies.


-- 
karl



Re: How to keep user search history and how to turn it into information?

Posted by Lukas Vlcek <lu...@gmail.com>.
Enis,
thanks for the excellent answer!
Lukas


Re: How to keep user search history and how to turn it into information?

Posted by Enis Soztutar <en...@gmail.com>.
Hi,

Lukas Vlcek wrote:
> [...] But I am looking for a more IR-oriented application of this
> information. I remember that once I read on the Lucene mailing list that
> somebody suggested utilization of previously issued user queries for
> suggestions of similar/other/related queries or for typo checking. While
> there are well discussed methods (moreLikeThis, did you mean, similarity,
> etc.) in the Lucene community I am still wondering if one can use user
> search history data for such a purpose, and if the answer is yes then how
> (practical examples are welcome).
Well, the logs of the search engine are used to improve the search engine
in several ways. All the major search engines, including Google, Yahoo
and MS, use the logs to deliver more relevant results to the user. How
they do this is a big research area. Although Google tends not to
publish its algorithms, Yahoo and MS do.

The logs can be used to improve various components of the search engine,
for example the suggesters (Google Suggest, Yahoo suggest) and the
spell checkers. Most of the spell checkers use the noisy channel error
model. In this model, the user means word w with probability p(w), but
due to the error in the channel (that is, misspelling) the user emits w'
instead. Spelling correction deals with correcting w' to w. The
calculation involves edit distance, prior probabilities of words,
probabilities of errors (the probability of writing "spel" instead of
"spell"), and a predefined lexicon. But in the world of the internet there is
no lexicon, so given a word w' you may not know exactly whether w' is
misspelled or not. The logs can be used for calculating the probabilities,
building the lexicon, and generating possible suggestions (in the case of a
misspelled word).
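
As a very rough illustration of the noisy channel idea (a simplified sketch,
not the method from any particular paper): score each candidate w by
log p(w) + log p(w'|w), estimate the prior p(w) from how often w occurs in the
query logs, and approximate the channel term with a crude edit-distance
penalty:

import java.util.Map;

public class NoisyChannelCorrector {

    /** Pick the candidate w maximizing log p(w) + log p(w'|w) for the observed
     *  word; logFreq maps word -> count in the query logs (the "lexicon"). */
    public static String correct(String observed, Map<String, Integer> logFreq) {
        long total = 0;
        for (int c : logFreq.values()) total += c;

        String best = observed;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Integer> e : logFreq.entrySet()) {
            String w = e.getKey();
            int d = editDistance(observed, w);
            if (d > 2) continue;                  // only near misses are plausible
            double prior = Math.log((double) e.getValue() / total); // log p(w)
            double channel = -2.0 * d;            // crude stand-in for log p(w'|w)
            double score = prior + channel;
            if (score > bestScore) { bestScore = score; best = w; }
        }
        return best;
    }

    /** Standard Levenshtein edit distance. */
    static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }
}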

For spell checking you can consult:

An Improved Error Model for Noisy Channel Spelling Correction, Brill et al.
Learning a Spelling Error Model from Search Query Logs, Ahmad et al.
Spelling Correction for Search Engine Queries, Martins et al.
Techniques for Automatically Correcting Words in Text, Kukich (an
excellent review of the topic)
Spelling Correction as an Iterative Process that Exploits the Collective
Knowledge of Web Users, Brill et al. (this MS paper is very good)

For query suggestion, Yahoo has a paper about using machine learning
methods on the log data. The basic idea behind it
is that the user submits a query and then, if not satisfied with the
results, refines the query and resubmits it. Looking at all the
collective data we can find possibly better suggestions for a submitted
query, classify them as useful, not useful or irrelevant, and then display
the useful ones. You can find the paper at the Yahoo research site.
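
A crude sketch of that refinement idea (the session handling below is an
assumption, not the algorithm from the paper): treat consecutive queries from
the same user as (query, refinement) pairs, count them over the whole log, and
suggest the most frequent refinements:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RefinementSuggester {

    /** query -> (refinement -> count), built from per-user query sequences. */
    private final Map<String, Map<String, Integer>> pairs =
            new HashMap<String, Map<String, Integer>>();

    /** Feed one user's queries in time order; consecutive, distinct queries
     *  count as a refinement pair. */
    public void addSession(List<String> queriesInOrder) {
        for (int i = 0; i + 1 < queriesInOrder.size(); i++) {
            String from = queriesInOrder.get(i);
            String to = queriesInOrder.get(i + 1);
            if (from.equals(to)) continue;
            Map<String, Integer> counts = pairs.get(from);
            if (counts == null) {
                counts = new HashMap<String, Integer>();
                pairs.put(from, counts);
            }
            Integer c = counts.get(to);
            counts.put(to, c == null ? 1 : c + 1);
        }
    }

    /** Most frequently observed refinement of the given query, or null. */
    public String suggest(String query) {
        Map<String, Integer> counts = pairs.get(query);
        if (counts == null) return null;
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }
}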

Using the web log for improving search engine quality is a much harder
problem. Unfortunately, I have not found time to read more about this
topic yet. I know that MS uses a method based on neural nets, called
RankNet, which ranks the search results. The net is trained with the
server logs. Below I list some papers, but I have not read all of them,
so I cannot say anything further about them:

Accurately Interpreting Clickthrough Data as Implicit Feedback
A Simulated Study of Implicit Feedback Models
Identifying "Best Bet" Web Search Results by Mining Past User Behavior
Improving Web Search Ranking by Incorporating User Behavior Information
Learning User Interaction Models for Predicting Web Search Result
Preferences
Optimizing Search Engines Using Clickthrough Data
Query Chains: Learning to Rank from Implicit Feedback





Re: How to keep user search history and how to turn it into information?

Posted by "Peter W." <pe...@marketingbrokers.com>.
Lukas,

One last thing: be sure to log only when a user clicks on a result;
in Hadoop, document_id will be a key in the map phase.

Lucene related steps are the same.

Best,

Peter W.


On Aug 14, 2007, at 1:28 PM, Peter W. wrote:

> When users perform
> a search, log the unique document_id, IP address and
> result position for the next step.
>
> Use Hadoop to simplify your logs by mapping the id
> and emitting IP's as intermediate values. Have
> reduce collect unique document_id[IP addresses].



Re: How to keep user search history and how to turn it into information?

Posted by "Peter W." <pe...@marketingbrokers.com>.
Hey Lukas,

You can get a basic demo of this working in Lucene
first, then make a more advanced and more efficient version.

First, give each document in your index a score field
using NumberTools so it's sortable. When users perform
a search, log the unique document_id, IP address and
result position for the next step.

Use Hadoop to simplify your logs by mapping the id
and emitting IPs as intermediate values. Have
reduce collect unique document_id[IP addresses].
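
A rough sketch of that map/reduce step, written against the classic
org.apache.hadoop.mapred API (exact interfaces vary between Hadoop versions,
and the log line layout "docId <TAB> ip <TAB> position" is an assumption):

import java.io.IOException;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class ClickAggregation {

    /** map: one click log line "docId \t ip \t position" -> (docId, ip) */
    public static class ClickMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable offset, Text line,
                        OutputCollector<Text, Text> out, Reporter reporter)
                throws IOException {
            String[] f = line.toString().split("\t");
            if (f.length >= 2) {
                out.collect(new Text(f[0]), new Text(f[1]));
            }
        }
    }

    /** reduce: (docId, [ips]) -> (docId, number of distinct IPs that clicked) */
    public static class ClickReducer extends MapReduceBase
            implements Reducer<Text, Text, Text, Text> {
        public void reduce(Text docId, Iterator<Text> ips,
                           OutputCollector<Text, Text> out, Reporter reporter)
                throws IOException {
            Set<String> unique = new HashSet<String>();
            while (ips.hasNext()) {
                unique.add(ips.next().toString());
            }
            out.collect(docId, new Text(Integer.toString(unique.size())));
        }
    }
}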

Read through the final output file, increment the score value for
each IP that clicked on the document_id, re-index with
Lucene and sort results (reverse order) by the score field.
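
A sketch of the score field and the sorted search against the Lucene 2.x API
(the field name "clickScore" is made up; NumberTools pads the value so it
sorts lexicographically):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumberTools;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;

public class ClickScoreRanking {

    /** While re-indexing, add the accumulated click count as a sortable field. */
    public static void addClickScore(Document doc, long clicks) {
        doc.add(new Field("clickScore", NumberTools.longToString(clicks),
                Field.Store.YES, Field.Index.UN_TOKENIZED));
    }

    /** Search with results ordered by click score, highest first. */
    public static Hits searchByClicks(IndexSearcher searcher, Query query)
            throws Exception {
        Sort byClicks = new Sort(new SortField("clickScore", SortField.STRING, true));
        return searcher.search(query, byClicks);
    }
}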

A more advanced version could store previous result
positions as Payloads but I don't understand this
new Lucene concept.

Regards,

Peter W.



Re: How to keep user search history and how to turn it into information?

Posted by Lukas Vlcek <lu...@gmail.com>.
Enis,

Thanks for your time.
I gave a quick glance at Pig and it seems good (it seems it is directly based
on Hadoop, which I am starting to play with :-). It is obvious that a huge
amount of data (like user queries or access logs) should be stored in flat
files, which makes it convenient for further analysis by Pig (or directly by
Hadoop-based tasks) or other tools. And I agree with you that the size of the
index can be tracked journal-style in a separate log rather than
with every single user query. That is the easier part of my original
question :-)

The true art starts with the mining tasks themselves. How to efficiently use
such data to better the user experience with the search engine... one option
is to use such data for search engine tuning, which is a more technically
oriented application (speeding up slow queries, reshaping the index, ...).
But I am looking for a more IR-oriented application of this information. I
remember that once I read on the Lucene mailing list that somebody suggested
utilization of previously issued user queries for suggestions of
similar/other/related queries or for typo checking. While there are well
discussed methods (moreLikeThis, did you mean, similarity, ... etc) in the
Lucene community I am still wondering if one can use user search history
data for such a purpose and if the answer is yes then how (practical examples
are welcome).

Lukas


Re: How to keep user search history and how to turn it into information?

Posted by Enis Soztutar <en...@gmail.com>.

Lukas Vlcek wrote:
> Hi Enis,
>   
Hi again,
> On 8/10/07, Enis Soztutar <en...@gmail.com> wrote:
>> [...]
>
> The problem is that all this information does change over time. The index is
> updated continuously, which means that the expanded queries and the total
> number of documents in the index change as well. But you are right that
> getting some of this info can cause extra performance expense (then it would
> be a question of later optimization of the architecture design).
Well, I think you can at least store the size of the index in another
file, and log changes in the index size there. The
motivation for this comes from storage efficiency. You may not want to
store the same index size over and over again across the n queries before the
index size changes, but store it once, with the time, per change.
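
A trivial sketch of that journal idea (the file name is just an example):

import java.io.FileWriter;
import java.io.PrintWriter;

public class IndexSizeJournal {
    private long lastSize = -1;

    /** Append a "time <TAB> size" line only when the index size has changed. */
    public synchronized void record(long currentSize) throws Exception {
        if (currentSize == lastSize) return;
        PrintWriter out = new PrintWriter(new FileWriter("index-size.log", true));
        out.println(System.currentTimeMillis() + "\t" + currentSize);
        out.close();
        lastSize = currentSize;
    }
}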
>
> It seems I have to do my homework and check CiteSeer for some papers :-)
> Is there any paper you can recommend? Some good one to start with?
> [...] I can either a) use some tool which is already available (open
> sourced) and directly fits my needs (I don't think there is any tool which I
> could use out of the box) or b) implement something new from scratch but
> with just very limited functionality.
You do not have to implement this from scratch. You just have to specify
your data mining tasks, then write scripts (in Pig Latin) or write
map-reduce programs (in Hadoop). Neither of these is that hard. I do
not think that there is any tool which would satisfy all your information
needs. So at the risk of repeating myself, I suggest you look at Pig
and write some scripts to mine the data.

Coming to literature, I can hardly suggest any specific paper, since I
am not very into the subject either. But I suggest you skip this
step, first build your data structures (log format), and then start
extracting some naive statistical information from the data. For
example, initially you may want to know:
1. average query execution time
2. average query execution time per query type (boolean, fuzzy, etc.)
3. histogram of query types (how many boolean queries, etc.)
4. average # of queries per user session
5. etc.

The list can go on and on depending on the data you have and the information
you want. These kinds of simple statistical analyses can be very easy to
extract and relatively easy to interpret.
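
As a trivial illustration of those first statistics, assuming one
tab-separated log line per query of the form
"time <TAB> user <TAB> queryType <TAB> rawQuery <TAB> execMillis":

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
import java.util.Map;

public class QueryLogStats {
    public static void main(String[] args) throws Exception {
        long totalMillis = 0;
        long lines = 0;
        // query type -> {count, total millis}
        Map<String, long[]> perType = new HashMap<String, long[]>();

        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        String line;
        while ((line = in.readLine()) != null) {
            String[] f = line.split("\t");
            if (f.length < 5) continue;
            String type = f[2];
            long millis = Long.parseLong(f[4]);
            totalMillis += millis;
            lines++;
            long[] agg = perType.get(type);
            if (agg == null) {
                agg = new long[2];
                perType.put(type, agg);
            }
            agg[0]++;           // histogram of query types
            agg[1] += millis;   // total execution time per type
        }
        in.close();

        if (lines > 0) {
            System.out.println("average execution time: "
                    + (totalMillis / (double) lines) + " ms over " + lines + " queries");
        }
        for (Map.Entry<String, long[]> e : perType.entrySet()) {
            long[] agg = e.getValue();
            System.out.println(e.getKey() + ": " + agg[0] + " queries, avg "
                    + (agg[1] / (double) agg[0]) + " ms");
        }
    }
}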


Re: How to keep user search history and how to turn it into information?

Posted by Lukas Vlcek <lu...@gmail.com>.
Hi Enis,

On 8/10/07, Enis Soztutar <en...@gmail.com> wrote:
>
> Remember that you may not want to store all the information available at
> runtime of the query, since it may result in a great performance burden.
> For example you may want to store the raw form of the query, but not the
> parsed form, since you can later parse the query anyway (unless you have
> some architecture change). Similarly, 6 seemed not a good choice to
> me (again, you can store the info externally). You can look at the common
> and extended log formats which are stored by the web server.


The problem is that all this information does change over time. The index is
updated continuously, which means that the expanded queries and the total
number of documents in the index change as well. But you are right that
getting some of this info can cause extra performance expense (then it would
be a question of later optimization of the architecture design).


> Web server log analysis is a very popular topic nowadays, and you can
> check the literature, especially clickthrough data analysis. All the
> major search engines have to interpret the data to improve their
> algorithms, and to learn from the latent "collective knowledge" hidden in
> web server logs.


It seems I have to do my homework and check CiteSeer for some papers :-)
Is there any paper you can recommend? Some good one to start with?
What I want to achieve is far beyond the scope of the project I am working
on right now, thus I cannot spend all my time on research (in spite of the
fact I would love to), so I can either a) use some tool which is already
available (open sourced) and directly fits my needs (I don't think there is
any tool which I could use out of the box) or b) implement something new from
scratch but with just very limited functionality.

> I would rather suggest you keep the logs in rolling flat files. Access
> to the database for each search will take a lot of time. Then you
> may want to flush those logs to the db once a day if you indeed want to
> store the data in a relational way.
>
> I infer that you want to mine the data, but you do not know what to
> mine, right? I suggest you look at Hadoop and Pig. Pig is designed
> especially for this purpose.


You've hit the nail on the head! I am very curious about how one can use
such data to improve the user experience with the search engine (given my
project's schedule constraints).

> I know this is a complex topic...
> >
> > Regards,
> > Lukas
> >
> >
>
Anyway, thanks for your reply!

BR
Lukas

Re: How to keep user search history and how to turn it into information?

Posted by Enis Soztutar <en...@gmail.com>.
Hi,

Lukas Vlcek wrote:
> Hi,
>
> I would like to keep user search history data and I am looking for some
> ideas/advices/recommendations. In general I would like to talk about methods
> of storing such data, its structure and how to turn it into valuable
> information.
>
> As for the structure:
> ==============
> For now I don't have exact idea about what kind of information I should
> keep. I know that this is application specific but I believe there can be
> some common general patterns. as of now I think can be useful to keep is the
> following:
>
> 1) system time (time of issuing the query) and userid
> 2) original user query in raw form (untokenized)
> 3) expanded user query (both tokenized and untokenized can be useful)
> 4) query execution time
> 5) # of objects retrieved from index
> 6) # of total object count in index (this can change during time)
> 7) and possibly if user clicked some result and if so then which one (the
> hit number) and system time
>
>   
Remember that you may not want to store all the information available at
query time, since doing so may result in a significant performance burden.
For example, you may want to store the raw form of the query, but not
the parsed form, since you can always parse the query again later (unless
the architecture changes). Similarly, 6 did not seem like a good choice to
me (again, you can obtain that information externally). You can look at the
common and extended log formats that web servers produce.
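If only the raw query string is logged, it can be turned back into a parsed
query during offline analysis. A rough sketch of that offline step, assuming
the application parsed against a "contents" field with StandardAnalyzer
(both are placeholders; use whatever the live search actually used):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

// Sketch: re-parse a logged raw query during offline log analysis.
public final class OfflineQueryParse {

    public static Query reparse(String rawQuery) throws ParseException {
        // Field name and analyzer must match what the application used at
        // search time, otherwise the offline parse will not be comparable.
        QueryParser parser = new QueryParser("contents", new StandardAnalyzer());
        return parser.parse(rawQuery);
    }
}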

> As for the information I can get from this:
> =============================
> Such minimal data collection could show if the search engine serves users
> well or not (generally said). I should note that for the users in this case
> the only other option is to not use the search engine at all (so the data
> should not be biased by the fact that users are using alternative search
> method). I should be able to learn if:
>
> 1) there are bottleneck queries (Prefix,Fuzzy,Proximity queries...)
> 2) users are finding what they want (they can find it fast and results are
> ordered by properly defined relevance [my model is well tuned in terms of
> term weights] so the result they click is among first hits)
> 3) user can formulate queries well (do they issue queries which return all
> index documents or they can issue queries which return just a couple of
> documents)
> 4) ...?... etc...
>
>   
Web server log analysis is a very popular topic nowadays, and you can
check the literature, especially on clickthrough data analysis. All the
major search engines have to interpret this data to improve their
algorithms and to learn from the latent "collective knowledge" hidden in
web server logs.

> As for the storage method:
> ===================
> I was planning to keep such data in database but now it seems to me that it
> will be better to keep it directly in index (Lucene index). It seems to me
> that this approach would allow me for better fuzzy searches across history
> and extracting relevant objects and their count more efficiently (with
> benefit of the relevance based search on top of history search corpus).
>
> I think that more scalable solution would be to keep such data in pure flat
> file and then periodically recreate search history index (or more indices)
> from it (for example by Map-Reduce like task). Event better the flat file
> could be stored in distributer file system. However, for now I would like to
> start with something simple.
>   
I would rather suggest that you keep the logs in rolling flat files. A
database access for each search will take a lot of time. Then you may want
to flush those logs to the DB once a day if you indeed want to store the
data in a relational way.

I infer that you want to mine the data, but you do not yet know what to
mine, right? I suggest you look at Hadoop and Pig; Pig is designed
especially for this kind of purpose.
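For the rolling flat file itself, nothing beyond the standard library is
needed; here is a minimal sketch (the file naming and the one-line-per-search
format are only assumptions):

import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.text.SimpleDateFormat;
import java.util.Date;

// Sketch: append one line per search to a flat file that rolls over daily.
public class RollingSearchLog {

    private String currentDay;
    private PrintWriter out;

    public synchronized void log(String line) throws IOException {
        String day = new SimpleDateFormat("yyyyMMdd").format(new Date());
        if (!day.equals(currentDay)) {
            if (out != null) {
                out.close();             // close yesterday's file
            }
            // append mode, auto-flush so a crash loses at most one line
            out = new PrintWriter(new FileWriter("search-" + day + ".log", true), true);
            currentDay = day;
        }
        out.println(line);               // e.g. the tab-separated line built at search time
    }
}

The daily files can then be bulk-loaded into the database once a day or fed
to a Hadoop/Pig job as they are.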

> I know this is a complex topic...
>
> Regards,
> Lukas
>
>   

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: How to keep user search history and how to turn it into information?

Posted by Lukas Vlcek <lu...@gmail.com>.
Dmitry,

My middle tier is Compass in this case (it is easily extensible, so
catching all events should be easy). It also allows the index to be stored
in a database.

The main question is *what* is important to catch and *how* to get the
maximum information out of it later.
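As a starting point for the *what*, one option is a single event object
passed through the middle tier, with whatever fields can be filled in
cheaply; the shape below is only a sketch and the field set is my
assumption:

// Sketch: one record per issued query, plus optional click feedback.
public class SearchEvent {
    public long   timestamp;        // system time when the query was issued
    public String userId;
    public String rawQuery;         // untokenized form, as typed
    public String expandedQuery;    // optional; can also be recomputed offline
    public long   executionMillis;
    public int    numFound;         // objects retrieved for this query
    public int    indexSize;        // total objects in the index at query time
    public int    clickedRank = -1; // -1 means no click recorded
    public long   clickTimestamp;

    // Tab-separated line, suitable for a rolling flat file or later bulk load.
    public String toLogLine() {
        return timestamp + "\t" + userId + "\t" + rawQuery + "\t"
                + expandedQuery + "\t" + executionMillis + "\t" + numFound
                + "\t" + indexSize + "\t" + clickedRank + "\t" + clickTimestamp;
    }
}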

Lukas

On 8/10/07, Dmitry <dm...@hotmail.com> wrote:
>
> Lukas,
> Probably one solution would be to use a database like MySQL and set up
> Lucene against MySQL - in this case you need to think less about the
> implementation of the content storage. You also need to create a middle
> tier to catch all events concerning user search / history / retrieval
> history / cache management...
> Thanks,
> dt
> www.ejinz.com
> Search News
>
> ----- Original Message -----
> From: "Lukas Vlcek" <lu...@gmail.com>
> To: <ja...@lucene.apache.org>
> Sent: Friday, August 10, 2007 2:28 AM
> Subject: How to keep user search history and how to turn it into
> information?
>
>
> > Hi,
> >
> > I would like to keep user search history data and I am looking for some
> > ideas/advices/recommendations. In general I would like to talk about
> > methods
> > of storing such data, its structure and how to turn it into valuable
> > information.
> >
> > As for the structure:
> > ==============
> > For now I don't have exact idea about what kind of information I should
> > keep. I know that this is application specific but I believe there can
> be
> > some common general patterns. as of now I think can be useful to keep is
> > the
> > following:
> >
> > 1) system time (time of issuing the query) and userid
> > 2) original user query in raw form (untokenized)
> > 3) expanded user query (both tokenized and untokenized can be useful)
> > 4) query execution time
> > 5) # of objects retrieved from index
> > 6) # of total object count in index (this can change during time)
> > 7) and possibly if user clicked some result and if so then which one
> (the
> > hit number) and system time
> >
> > As for the information I can get from this:
> > =============================
> > Such minimal data collection could show if the search engine serves
> users
> > well or not (generally said). I should note that for the users in this
> > case
> > the only other option is to not use the search engine at all (so the
> data
> > should not be biased by the fact that users are using alternative search
> > method). I should be able to learn if:
> >
> > 1) there are bottleneck queries (Prefix,Fuzzy,Proximity queries...)
> > 2) users are finding what they want (they can find it fast and results
> are
> > ordered by properly defined relevance [my model is well tuned in terms
> of
> > term weights] so the result they click is among first hits)
> > 3) user can formulate queries well (do they issue queries which return
> all
> > index documents or they can issue queries which return just a couple of
> > documents)
> > 4) ...?... etc...
> >
> > As for the storage method:
> > ===================
> > I was planning to keep such data in database but now it seems to me that
> > it
> > will be better to keep it directly in index (Lucene index). It seems to
> me
> > that this approach would allow me for better fuzzy searches across
> history
> > and extracting relevant objects and their count more efficiently (with
> > benefit of the relevance based search on top of history search corpus).
> >
> > I think that more scalable solution would be to keep such data in pure
> > flat
> > file and then periodically recreate search history index (or more
> indices)
> > from it (for example by Map-Reduce like task). Event better the flat
> file
> > could be stored in distributer file system. However, for now I would
> like
> > to
> > start with something simple.
> >
> > I know this is a complex topic...
> >
> > Regards,
> > Lukas
> >
>
>

Re: How to keep user search history and how to turn it into information?

Posted by Dmitry <dm...@hotmail.com>.
Lukas,
Probably one solution would be to use a database like MySQL and set up
Lucene against MySQL - in this case you need to think less about the
implementation of the content storage. You also need to create a middle
tier to catch all events concerning user search / history / retrieval
history / cache management...
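If the history ends up in MySQL, the write path can stay very simple; a
hedged sketch with plain JDBC (the table and column names here are made up
for illustration):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Sketch: insert one search-history row; the schema is hypothetical.
public final class SearchHistoryDao {

    public static void insert(Connection conn, long time, String userId,
                              String rawQuery, int numFound) throws SQLException {
        String sql = "INSERT INTO search_history (query_time, user_id, raw_query, num_found)"
                   + " VALUES (?, ?, ?, ?)";
        PreparedStatement stmt = conn.prepareStatement(sql);
        try {
            stmt.setLong(1, time);
            stmt.setString(2, userId);
            stmt.setString(3, rawQuery);
            stmt.setInt(4, numFound);
            stmt.executeUpdate();
        } finally {
            stmt.close();
        }
    }
}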
Thanks,
dt
www.ejinz.com
Search News

----- Original Message ----- 
From: "Lukas Vlcek" <lu...@gmail.com>
To: <ja...@lucene.apache.org>
Sent: Friday, August 10, 2007 2:28 AM
Subject: How to keep user search history and how to turn it into 
information?


> Hi,
>
> I would like to keep user search history data and I am looking for some
> ideas/advices/recommendations. In general I would like to talk about 
> methods
> of storing such data, its structure and how to turn it into valuable
> information.
>
> As for the structure:
> ==============
> For now I don't have exact idea about what kind of information I should
> keep. I know that this is application specific but I believe there can be
> some common general patterns. as of now I think can be useful to keep is 
> the
> following:
>
> 1) system time (time of issuing the query) and userid
> 2) original user query in raw form (untokenized)
> 3) expanded user query (both tokenized and untokenized can be useful)
> 4) query execution time
> 5) # of objects retrieved from index
> 6) # of total object count in index (this can change during time)
> 7) and possibly if user clicked some result and if so then which one (the
> hit number) and system time
>
> As for the information I can get from this:
> =============================
> Such minimal data collection could show if the search engine serves users
> well or not (generally said). I should note that for the users in this 
> case
> the only other option is to not use the search engine at all (so the data
> should not be biased by the fact that users are using alternative search
> method). I should be able to learn if:
>
> 1) there are bottleneck queries (Prefix,Fuzzy,Proximity queries...)
> 2) users are finding what they want (they can find it fast and results are
> ordered by properly defined relevance [my model is well tuned in terms of
> term weights] so the result they click is among first hits)
> 3) user can formulate queries well (do they issue queries which return all
> index documents or they can issue queries which return just a couple of
> documents)
> 4) ...?... etc...
>
> As for the storage method:
> ===================
> I was planning to keep such data in database but now it seems to me that 
> it
> will be better to keep it directly in index (Lucene index). It seems to me
> that this approach would allow me for better fuzzy searches across history
> and extracting relevant objects and their count more efficiently (with
> benefit of the relevance based search on top of history search corpus).
>
> I think that more scalable solution would be to keep such data in pure 
> flat
> file and then periodically recreate search history index (or more indices)
> from it (for example by Map-Reduce like task). Event better the flat file
> could be stored in distributer file system. However, for now I would like 
> to
> start with something simple.
>
> I know this is a complex topic...
>
> Regards,
> Lukas
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org