Posted to solr-user@lucene.apache.org by Salman Akram <sa...@northbaysolutions.net> on 2011/01/24 20:42:40 UTC

Highlighting with/without Term Vectors

Hi,

Does anyone have benchmarks showing how much Term Vectors speed up
highlighting (compared to highlighting without them)? E.g. if highlighting
20 documents takes 1 sec with Term Vectors, any idea how long it will take
without them?

I need to know because the index used for highlighting has a TVF file of
around 450GB (approx. 65% of the total index size), so I am trying to see
whether decreasing the index size by dropping the TVF would be more helpful
for performance (less RAM, and it should be good for I/O too, I guess), or
whether keeping it is still better.
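
As a rough sanity check on those figures, here is a back-of-the-envelope
sketch in Python. It only uses the two numbers quoted above (450GB and 65%),
nothing is measured, and it assumes the TVF file is the only thing that
would go away.

```python
# Figures quoted in the message above; everything else is derived from them.
tvf_gb = 450.0      # approximate size of the TVF file
tvf_share = 0.65    # TVF is roughly 65% of the total index size

# If 450GB is 65% of the index, the whole index is about 450 / 0.65.
total_gb = tvf_gb / tvf_share

# Estimated index size if the TVF file were dropped.
remaining_gb = total_gb - tvf_gb

print(f"estimated total index size: {total_gb:.0f} GB")
print(f"estimated size without the TVF file: {remaining_gb:.0f} GB")
```

So dropping the TVF would shrink the index to roughly a third of its current
size, which is the trade-off being weighed against slower highlighting.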

I know the best way is to try it out, but indexing takes a very long time,
so I am trying to see whether it's even worth it.

-- 
Regards,

Salman Akram

Re: Highlighting with/without Term Vectors

Posted by Salman Akram <sa...@northbaysolutions.net>.
Anyone?

-- 
Regards,

Salman Akram

Re: Highlighting with/without Term Vectors

Posted by Salman Akram <sa...@northbaysolutions.net>.
Just to add one thing, in case it makes a difference.

The maximum size of a document on which highlighting needs to be done is a
few hundred KB (on the file system). In the index it's compressed, so it
should be much smaller. There are more than 100 million documents in total.

-- 
Regards,

Salman Akram

Re: Highlighting with/without Term Vectors

Posted by Salman Akram <sa...@northbaysolutions.net>.
Yeah, I was going to reply to that thread but then it just slipped my
mind. :)

Actually we have two indexes: one that is used for searching and the other
for highlighting. Their structure is different too. The 1st one has all the
metadata + document contents indexed (just for searching). It has around
13 million rows. In the 2nd one we mainly have the document PAGE contents
indexed/stored with Term Vectors. It has around 130 million rows (since
each row is a page).

What we do is search on the 1st index (around 150GB) and get document IDs
based on the page size (20/50/100), and then search only for these document
IDs on the 2nd index (but on pages, as we need to show results based on page
numbers), with the text for highlighting as well.
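
To make the two-step flow concrete, here is a minimal sketch in Python of the
request parameters for each step. The field name doc_id, the sample query and
the sample IDs are hypothetical and only there for illustration; q, fq, fl,
rows, hl and hl.fl are standard Solr request parameters. It only builds the
parameters and does not contact any server.

```python
from urllib.parse import urlencode


def search_params(user_query, page_size):
    """Step 1: query the search index and fetch only the document IDs."""
    return {"q": user_query, "fl": "doc_id", "rows": page_size, "wt": "json"}


def highlight_params(user_query, doc_ids, rows):
    """Step 2: query the page index, restricted to the IDs from step 1,
    with highlighting switched on for the page text field."""
    id_filter = "doc_id:(" + " OR ".join(str(i) for i in doc_ids) + ")"
    return {
        "q": user_query,
        "fq": id_filter,          # limit the search to the selected documents
        "rows": rows,
        "hl": "true",
        "hl.fl": "Contents",      # field to highlight (assumed name)
        "wt": "json",
    }


# Hypothetical example: three document IDs came back from the first index.
step1 = search_params("contract termination", 20)
step2 = highlight_params("contract termination", [101, 102, 103], 20)
print(urlencode(step1))
print(urlencode(step2))
```

The point of the filter in step 2 is that the large index is only asked about
a handful of documents per request, however many rows it holds in total.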

The 2nd index is around 700GB (which has that 450GB TVF file I was talking
about), but since it's only consulted for a small number of documents, that
is mostly not an issue (in some queries that's slow too, but its size is the
main issue).

On average, more than 90% of the query time is taken by searching the 1st
index (and getting the total count as well).

The confusion that I had was about the 1st index, which didn't have Term
Vectors on any of the fields in the Solr schema file but still had a TVF
file. The reason in the end turned out to be Lucene indexing. Some of the
initial documents were indexed through Lucene, and there one of the fields
did have Term Vectors! Sorry for that...

*Keeping in mind the above description, are there any other ideas you would
like to suggest? Thanks!!*

-- 
Regards,

Salman Akram

Re: Highlighting with/without Term Vectors

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hi Salman,

Ah, so in the end you *did* have TV enabled on one of your fields! :) (I think 
this was a problem we were trying to solve a few weeks ago here)

How many docs you have in the index doesn't matter here. Only the N
docs/fields that you need to display on a page with N results need to be
reanalyzed for highlighting purposes. So follow Grant's advice: make a small
index without TV, and compare highlighting speed with and without TV.

Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



Re: Highlighting with/without Term Vectors

Posted by Salman Akram <sa...@northbaysolutions.net>.
Basically, Term Vectors are only on one main field, i.e. Contents. The
average size of each document would be a few KB, but there are around
130 million documents, so what do you suggest now?

-- 
Regards,

Salman Akram

Re: Highlighting with/without Term Vectors

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Salman,

It also depends on the size of your documents.  Re-analyzing 20 fields of 500 
bytes each will be a lot faster than re-analyzing 20 fields with 50 KB each.
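
To put rough numbers on that comparison, here is a small Python sketch. It
only counts the bytes of text that would have to be re-analyzed per results
page; it assumes 1 KB = 1024 bytes and says nothing about actual timings,
which depend on the analyzer chain.

```python
FIELDS_PER_PAGE = 20              # fields to re-analyze for one results page

small_field_bytes = 500           # the 500-byte case
large_field_bytes = 50 * 1024     # the 50 KB case, assuming 1 KB = 1024 bytes

# Total text that has to go through the analyzer for one page of results.
small_total = FIELDS_PER_PAGE * small_field_bytes
large_total = FIELDS_PER_PAGE * large_field_bytes

ratio = large_total / small_total
print(f"small fields: {small_total} bytes per page")
print(f"large fields: {large_total} bytes per page")
print(f"the large case has about {ratio:.0f}x more text to re-analyze")
```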

Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



Re: Highlighting with/without Term Vectors

Posted by Grant Ingersoll <gs...@apache.org>.
Try testing on a smaller set.  In general, you are saving the process of re-analyzing the content, so, to some extent it is going to be dependent on how fast your analyzer chain is.  At the size you are at, I don't know if storing TVs is worth it.
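
A test along those lines could be sketched in Python as below. This is only
an illustration under stated assumptions: two small cores built from the same
sample documents, one with term vectors on the highlighted field and one
without. The core URLs, the field name Contents as used here and the helper
names are hypothetical; hl and hl.fl are standard Solr highlighting
parameters. Timings measured this way include network and caching effects,
so treat them as a rough comparison only.

```python
import statistics
import time
from urllib.parse import urlencode
from urllib.request import urlopen


def highlight_url(core_url, query, rows=20, field="Contents"):
    """Build a select URL that asks Solr to highlight `field` for `query`."""
    params = {"q": query, "rows": rows, "hl": "true", "hl.fl": field, "wt": "json"}
    return core_url.rstrip("/") + "/select?" + urlencode(params)


def http_fetch(url):
    """Fetch a URL and return the raw response body."""
    with urlopen(url) as response:
        return response.read()


def median_seconds(fetch, urls, repeats=3):
    """Call fetch(url) `repeats` times per URL and return the median time."""
    timings = []
    for url in urls:
        for _ in range(repeats):
            start = time.perf_counter()
            fetch(url)
            timings.append(time.perf_counter() - start)
    return statistics.median(timings)


# Example usage against two hypothetical test cores (left commented out so
# the sketch does not need a running server):
#
# queries = ["contract termination", "liability clause"]
# for core in ("http://localhost:8983/solr/pages_with_tv",
#              "http://localhost:8983/solr/pages_without_tv"):
#     urls = [highlight_url(core, q) for q in queries]
#     print(core, median_seconds(http_fetch, urls))
```

Using queries and page sizes taken from real traffic, and documents of
realistic size, should make the comparison more representative.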