Posted to solr-user@lucene.apache.org by philmccarthy <ph...@gmail.com> on 2009/01/14 02:49:33 UTC

Indexing the same data in many records

Hi,

I'd like to use Solr to index some webserver logs, in order to allow easy
ad-hoc querying and analysis. Each Solr Document will represent a single
request to the webserver, with fields for time, request URL, referring URL,
etc.

I'm also planning to fetch the page source of each referring URL, and add
that as an indexed field in the Solr document. The aim is to allow queries
like "find hits to /xyz.html where the referring page contains the word
'foobar'".

Since hundreds or even thousands of hits may all come from the same
referring page, would this approach be horribly inefficient? (Note the page
source won't be stored in each Document, just indexed). Am I going to
dramatically increase the index size if I do this?

If so, is there a more elegant way to do what I want?

Many thanks,
Phil



-- 
View this message in context: http://www.nabble.com/Indexing-the-same-data-in-many-records-tp21448465p21448465.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Indexing the same data in many records

Posted by philmccarthy <ph...@gmail.com>.
Hi,

Adding the same document many times is actually the scenario I wanted to
test--indexing hits from Apache webserver logs with the source of the
referring page.

My expectation would be that the majority of hits on a given day would
originate from a small number of referrers, so each of these referring pages
would be indexed multiple times. I really wanted to check that this would
scale better than indexing the same number of different documents--your
explanation regarding term distribution explains why this is the case.

Many thanks,
Phil


Otis Gospodnetic wrote:
> 
> Phil,
> 
> Note that adding the same document multiple times and looking at the index
> size is not a very good approach.  You are adding a fixed number of
> distinct terms over and over.  In a real-life scenario you will have a much
> greater term distribution, and that will affect index size.
> 
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> 


Re: Indexing the same data in many records

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Phil,

Note that adding the same document multiple times and looking at the index size is not a very good approach.  You are adding a fixed number of distinct terms over and over.  In a real-life scenario you will have a much greater term distribution, and that will affect index size.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
> From: philmccarthy <ph...@gmail.com>
> To: solr-user@lucene.apache.org
> Sent: Wednesday, January 14, 2009 7:36:38 PM
> Subject: Re: Indexing the same data in many records
> 
> 
> Thanks Otis. I tweaked the Solr example app a little and then uploaded a
> ~55KB document to it a couple of thousand times (changing the ID each time).
> The solr/data directory was 72MB on disc after adding the document 2000
> times, so it seems that the index is growing by approximately 36KB for each
> document. That seems reasonable.
> 
> I guess I need to do some research into expected data volumes now, and
> limits on Lucene index size.
> 
> Cheers,
> Phil


Re: Indexing the same data in many records

Posted by philmccarthy <ph...@gmail.com>.
Thanks Otis. I tweaked the Solr example app a little and then uploaded a
~55KB document to it a couple of thousand times (changing the ID each time).
The solr/data directory was 72MB on disc after adding the document 2000
times, so it seems that the index is growing by approximately 36KB for each
document. That seems reasonable.

I guess I need to do some research into expected data volumes now, and
limits on Lucene index size.

Cheers,
Phil


Otis Gospodnetic wrote:
> 
> Phil,
> 
> From what you described so far, I don't see any red flags.  I would pay
> attention to reading those timestamps (covered on the Wiki and ML
> archives), that's all.
> 
> 
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


Re: Indexing the same data in many records

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Phil,

From what you described so far, I don't see any red flags.  I would pay attention to reading those timestamps (covered on the Wiki and ML archives), that's all.


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
> From: philmccarthy <ph...@gmail.com>
> To: solr-user@lucene.apache.org
> Sent: Tuesday, January 13, 2009 8:49:33 PM
> Subject: Indexing the same data in many records
> 
> 
> Hi,
> 
> I'd like to use Solr to index some webserver logs, in order to allow easy
> ad-hoc querying and analysis. Each Solr Document will represent a single
> request to the webserver, with fields for time, request URL, referring URL
> etc.
> 
> I'm also planning to fetch the page source of each referring URL, and add
> that as an indexed field in the Solr document. The aim is to allow queries
> like "find hits to /xyz.html where the referring page contains the word
> 'foobar'".
> 
> Since hundreds or even thousands of hits may all come from the same
> referring page, would this approach be horribly inefficient? (Note the page
> source won't be stored in each Document, just indexed). Am I going to
> dramatically increase the index size if I do this?
> 
> If so, is there a more elegant way to do what I want?
> 
> Many thanks,
> Phil


Re: faceted search returning multiple values for same field

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Shantanu,

It sounds like all you have to do is switch to a field type that doesn't tokenize your mfg field.  Try field type "string".  You'll need to reindex once you make this change.
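
As a rough sketch of that change, assuming the stock Solr example schema.xml (which already defines a "string" type backed by solr.StrField) and the field name mfg from the question, the declarations might look like the following. The attribute values are illustrative assumptions, not the poster's actual configuration.

```xml
<!-- schema.xml sketch, inside <types>: "string" maps to solr.StrField, which
     indexes the whole value as one untokenized term, so "Sony Ericsson"
     stays a single facet value. -->
<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>

<!-- Inside <fields>: assumed declaration for mfg; the indexed/stored
     settings here are guesses. -->
<field name="mfg" type="string" indexed="true" stored="true"/>
```

With a field like this, facet values should match the indexed values character for character, including case. Stemmed-looking terms such as "soni" and "blackberri" in the output suggest that some documents were indexed through an analyzer with stemming, which is why a full reindex is needed after the change.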

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
> From: "Deo, Shantanu" <sd...@att.com>
> To: solr-user@lucene.apache.org
> Sent: Tuesday, January 13, 2009 10:15:09 PM
> Subject: faceted search returning multiple values for same field
> 
> Hi,
>   I am using Solr to index some product data and want to use faceted
> search. My indexed field (mfg) sometimes contains two words, "Sony
> Ericsson" for example. When I get the facets on mfg, Solr returns
> "sony" and "ericsson" as separate hits. There are also some facets
> that show up rather mysteriously.
> 
> The unique list of mfg values that is indexed is as follows:
> AT&T
> BlackBerry?
> HTC
> LG
> Motorola
> Nokia
> Option
> Palm
> Pantech
> Samsung
> Sierra Wireless
> Sony Ericsson
> 
> 
> The facets returned are below:
> "facet_fields":{
>         "mfg":[
>          "ericsson",195,
>          "soni",156,
>          "samsung",155,
>          "nokia",90,
>          "Ericsson",78,
>          "Sony",78,
>          "Samsung",62,
>          "motorola",55,
>          "lg",50,
>          "sony",39,
>          "Nokia",36,
>          "pantech",25,
>          "Motorola",22,
>          "LG",20,
>          "berri",16,
>          "black",16,
>          "blackberri",16,
>          "Pantech",10,
>          "BlackBerry",8,
>          "blackberry",4,
>          "AT",0,
>          "HTC",0,
>          "Option",0,
>          "Palm",0,
>          "Sierra",0,
>          "T",0,
>          "Wireless",0,
>          "at",0,
>          "att",0,
>          "htc",0,
>          "option",0,
>          "palm",0,
>          "sierra",0,
>          "t",0,
>          "wireless",0]
> 
> 
> I have tried playing around with defining the fieldtype using the
> following analyzers:
> 
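> <fieldType name="mfgTextTight" class="solr.TextField"
> positionIncrementGap="100" >
>   <analyzer>
>     <tokenizer class="solr.LetterTokenizerFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>     <filter class="solr.KeepWordFilterFactory"
> words="manufacturer.txt"/>
>   </analyzer>
> </fieldType>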
> 
> 
> Any ideas whether it's possible to get the same facets as are in the data
> that's being indexed, or would I have to write my own Filter for this
> purpose?
> 
> Thanks
> Shantanu Deo
> AT&T eCommerce Web Hosting - Release Management
> Office: (425)288-6081
> email: sd189d@att.com


Re: faceted search returning multiple values for same field

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
On Wed, Jan 14, 2009 at 8:45 AM, Deo, Shantanu <sd...@att.com> wrote:

>
> I have tried playing around with defining the fieldtype using the
> following analyzers:
> <fieldType name="mfgTextTight" class="solr.TextField"
> positionIncrementGap="100" >
>  <analyzer>
>    <tokenizer class="solr.LetterTokenizerFactory"/>
>    <filter class="solr.LowerCaseFilterFactory"/>
>    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>    <filter class="solr.KeepWordFilterFactory"
> words="manufacturer.txt"/>
>  </analyzer>
> </fieldType>
>
>
> Any ideas if its possible to get the same facets as are in the data
> that's being indexed or would I have to write my own Filter for this
> purpose ?


Faceting works on the indexed terms. Therefore, you should make sure that
what you index is exactly what you stored. You probably need to facet on a
"string" type.
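
If the mfg field also needs to stay searchable by individual words, one common arrangement is to keep the analyzed field for searching and copy its raw value into a separate string field used only for faceting. A minimal sketch, assuming a hypothetical field name mfg_facet, the mfgTextTight type quoted above, and a "string" type defined as in the stock example schema:

```xml
<!-- Analyzed field, used for searching (type taken from the question). -->
<field name="mfg" type="mfgTextTight" indexed="true" stored="true"/>

<!-- Hypothetical untokenized copy, used only for faceting. -->
<field name="mfg_facet" type="string" indexed="true" stored="false"/>

<!-- copyField copies the original input value, before any analysis runs. -->
<copyField source="mfg" dest="mfg_facet"/>
```

Facet requests would then name the copy, for example facet=true&facet.field=mfg_facet.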


>
>
> Thanks
> Shantanu Deo
> AT&T eCommerce Web Hosting - Release Management
> Office: (425)288-6081
> email: sd189d@att.com
>



-- 
Regards,
Shalin Shekhar Mangar.

faceted search returning multiple values for same field

Posted by "Deo, Shantanu" <sd...@att.com>.
Hi,
  I am using Solr to index some product data and want to use faceted
search. My indexed field (mfg) sometimes contains two words, "Sony
Ericsson" for example. When I get the facets on mfg, Solr returns
"sony" and "ericsson" as separate hits. There are also some facets
that show up rather mysteriously.

The unique list of mfg values that is indexed is as follows:
AT&T
BlackBerry?
HTC
LG
Motorola
Nokia
Option
Palm
Pantech
Samsung
Sierra Wireless
Sony Ericsson


The facets returned are below:
"facet_fields":{
        "mfg":[
         "ericsson",195,
         "soni",156,
         "samsung",155,
         "nokia",90,
         "Ericsson",78,
         "Sony",78,
         "Samsung",62,
         "motorola",55,
         "lg",50,
         "sony",39,
         "Nokia",36,
         "pantech",25,
         "Motorola",22,
         "LG",20,
         "berri",16,
         "black",16,
         "blackberri",16,
         "Pantech",10,
         "BlackBerry",8,
         "blackberry",4,
         "AT",0,
         "HTC",0,
         "Option",0,
         "Palm",0,
         "Sierra",0,
         "T",0,
         "Wireless",0,
         "at",0,
         "att",0,
         "htc",0,
         "option",0,
         "palm",0,
         "sierra",0,
         "t",0,
         "wireless",0]


I have tried playing around with defining the fieldtype using the
following analyzers:
<fieldType name="mfgTextTight" class="solr.TextField"
positionIncrementGap="100" >
  <analyzer>
    <tokenizer class="solr.LetterTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.KeepWordFilterFactory"
words="manufacturer.txt"/>
  </analyzer>
</fieldType>
 

Any ideas whether it's possible to get the same facets as are in the data
that's being indexed, or would I have to write my own Filter for this
purpose?

Thanks
Shantanu Deo
AT&T eCommerce Web Hosting - Release Management
Office: (425)288-6081
email: sd189d@att.com