Posted to general@lucene.apache.org by sunnyfr <jo...@gmail.com> on 2009/03/31 15:51:32 UTC

commit often and lot of data cost too much?

Hi,

I have about 14M documents and my index is about 11G.
At the moment I update about 30,000 documents every 20 minutes.
Lucene is always merging data.  What would you recommend?
Replication costs too much for the slaves: they always pull back whole new
index directories instead of just the changed segments.

Is there a way to get around this issue?  What would you recommend to people
who need fresh updates on the slaves with a large amount of data?
Thanks a lot,


From http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Scaling-Lucene-and-Solr/ :
"When the time and bandwidth needed for replication is less of a concern,
and high query throughput is more important, it can be wise to abandon the
advantage of transferring changed segments and only replicate fully
optimized indexes. It costs a bit more in terms of resources, but the master
will eat the cost of optimizing (so that users don't see the standard
machine slowdown effect that performing an optimize brings), and the slaves
will always get a fully optimized index to issue queries against, allowing
for maximum query performance. Generally, bandwidth for replication is not
much of a concern now, but keep in mind that optimizing on a large index can
be quite time consuming, so this strategy is not for every situation."
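
For concreteness, a minimal sketch (not from the article; the index path and
class name are made up) of the master side of that strategy, using the
Lucene 3.x IndexWriter API:

import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class OptimizeBeforePublish {
    public static void main(String[] args) throws Exception {
        // hypothetical master index path
        Directory dir = FSDirectory.open(new File("/data/index/master"));
        IndexWriterConfig cfg = new IndexWriterConfig(
                Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36));
        IndexWriter writer = new IndexWriter(dir, cfg);

        // the master pays the merge cost, off the query path
        writer.optimize();   // renamed forceMerge(1) in Lucene 3.5+
        writer.commit();     // publish a single-segment commit point

        // only commit points like this one are replicated, so the slaves
        // always copy one fully merged index instead of a trickle of small
        // changed segments
        writer.close();
    }
}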


Re: commit often and lot of data cost too much?

Posted by Ted Dunning <te...@gmail.com>.
Funny you should ask.

We had a similar problem at Veoh.  I think that this kind of problem is
relatively common.

Taking video viewing as the poster child, the metadata about videos comes
in several flavors:

- The title, description, and publisher info
- A pointer to the actual video bits
- The view counts
- Rating data
- The history of who viewed the video (for recommendation systems and such)
- Stats about how people play the video (just 4 seconds?  All the way
  through?  From the primary interface?  From embedded references?)

We can categorize this data on a couple of different axes.  One is update
rate:

- Some of this data is very rarely updated (the video pointer and publisher
  info).
- Some is updated more commonly, but still pretty rarely (title and
  description).
- Some is updated fairly often (ratings).
- And some is updated ALL the time (view counts especially, but view history
  and view stats as well).

Another categorization is based on how you plan to search the data:

- Title, description, length, publisher, and date published (users
  searching via the search box and advanced search)
- Play history and ratings (recommendation systems doing off-line analysis)
- Not usually searched (encoding, number of audio tracks, size in bytes,
  and so on)

Usually, high-volume sites have to store data differently depending on size,
change rate, and purpose, and then abstract the different search and storage
decisions behind an access layer.  For what you are doing, you should put
into Lucene only those things which have a low change rate and which must be
searched.  High-mutation-rate data should go into something like memcache
with a persistent back-store.  Very large data items, such as the video
itself, should go into an entirely different kind of store (at Veoh we used
a very heavily hacked version of Danga's MogileFS).
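
To make that concrete, a rough sketch of the split (not Veoh's actual code;
the field names are assumptions, and a plain map stands in for memcache and
its persistent back-store), using the Lucene 3.x Field API:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class VideoMetadata {

    // high-mutation counters live outside Lucene entirely
    static final Map<String, Long> viewCounts = new ConcurrentHashMap<String, Long>();

    // only the low-change-rate, searched fields go into the Lucene document
    static Document searchableDoc(String id, String title,
                                  String description, String publisher) {
        Document doc = new Document();
        doc.add(new Field("id", id, Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("description", description, Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("publisher", publisher, Field.Store.YES, Field.Index.NOT_ANALYZED));
        // deliberately no view-count field: it changes on every play and
        // would force a reindex of the document each time
        return doc;
    }

    // a view only touches the key-value side: no commit, no merge, nothing
    // for the slaves to replicate
    static void recordView(String id) {
        Long current = viewCounts.get(id);
        viewCounts.put(id, current == null ? 1L : current + 1L);
        // a real memcache-style store would do an atomic increment here
    }
}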

Your two-phase update trick will work reasonably well in the short term,
but if your traffic is growing quickly it won't last very long, because the
full update will become so painful.

On Wed, Apr 1, 2009 at 1:06 AM, sunnyfr <jo...@gmail.com> wrote:

>
> Yep, but we won't change the system now :(
> Or maybe I can have two kinds of schema?
> One for the new videos during the day, so just the new data, and another
> one at night that updates all the characteristics of the videos?  A full
> update nightly and a light update of new videos during the day?
> What do you think?
> The other characteristics are not that important, but they are used for
> filters, most viewed, comments ...
>

Re: commit often and lot of data cost too much?

Posted by sunnyfr <jo...@gmail.com>.
Yep, but we won't change the system now :(
Or maybe I can have two kinds of schema?
One for the new videos during the day, so just the new data, and another
one at night that updates all the characteristics of the videos?  A full
update nightly and a light update of new videos during the day?
What do you think?
The other characteristics are not that important, but they are used for
filters, most viewed, comments ...
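
A minimal sketch of that split, against the plain Lucene IndexWriter API
(the "id" field and the method names are assumptions, not an actual design):

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class TwoPhaseUpdates {

    // daytime pass: only brand-new videos, so commits stay small and cheap
    static void addNewVideo(IndexWriter writer, Document minimalDoc)
            throws Exception {
        writer.addDocument(minimalDoc);
    }

    // nightly pass: refresh every characteristic (filters, most viewed,
    // comments ...).  updateDocument is a delete-by-term plus an add, so
    // the whole document is replaced with its full set of fields.
    static void nightlyFullUpdate(IndexWriter writer, String videoId,
                                  Document fullDoc) throws Exception {
        writer.updateDocument(new Term("id", videoId), fullDoc);
    }
}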

:)
Thanks Ted


Ted Dunning wrote:
> 
> What kind of updates are these?  New documents?  Small changes to existing
> documents?
> 
> Are the changing fields important for searching?
> 
> If the updates are not involved in searches, then it would be much better
> to
> put the non-searched characteristics onto an alternative storage system.
> That would drive down the update rate dramatically and leave you with a
> pretty simple system.
> 
> If the updates *are* involved in searches, then you might consider using a
> system more like Katta than Solr.  You can then create a new shard out of
> the update batch and broadcast a mass delete to all nodes just before
> adding the new shard to the system.  This has the benefit of very fast
> updates and good balancing, but has the defect that your deletes are not
> persisted until you do a full reindex.  Your search nodes could write the
> updated index back to the persistent store, but that is scary without
> something like Hadoop to handle failed updates.
> 
> On Tue, Mar 31, 2009 at 6:51 AM, sunnyfr <jo...@gmail.com> wrote:
> 
>>
>> I have about 14M documents and my index is about 11G.
>> At the moment I update about 30,000 documents every 20 minutes.
>> Lucene is always merging data.  What would you recommend?
>> Replication costs too much for the slaves: they always pull back whole
>> new index directories instead of just the changed segments.
>>
>> Is there a way to get around this issue?  What would you recommend to
>> people who need fresh updates on the slaves with a large amount of data?
>> Thanks a lot,
>>
>>
> 
> 



Re: commit often and lot of data cost too much?

Posted by Ted Dunning <te...@gmail.com>.
What kind of updates are these?  New documents?  Small changes to existing
documents?

Are the changing fields important for searching?

If the updates are not involved in searches, then it would be much better to
put the non-searched characteristics onto an alternative storage system.
That would drive down the update rate dramatically and leave you with a
pretty simple system.

If the updates *are* involved in searches, then you might consider using a
system more like Katta than Solr.  You can then create a new shard out of
the update batch and broadcast a mass delete to all nodes just before adding
the new shard to the system.  This has the benefit of very fast updates and
good balancing, but has the defect that your deletes are not persisted until
you do a full reindex.  Your search nodes could write the updated index back
to the persistent store, but that is scary without something like Hadoop to
handle failed updates.
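
A rough sketch of those two steps with plain Lucene writers (this is not
Katta's real API; the "id" field and the method names are assumptions):

import java.util.List;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class DeltaShardUpdate {

    // step 1: broadcast to every node serving an existing shard, deleting
    // the documents that are about to be replaced
    static void deleteUpdatedDocs(IndexWriter existingShardWriter,
                                  List<String> updatedIds) throws Exception {
        for (String id : updatedIds) {
            existingShardWriter.deleteDocuments(new Term("id", id));
        }
        existingShardWriter.commit();
        // these deletes live only in the shard's commit point, not in the
        // source data, hence the need for a periodic full reindex
    }

    // step 2: the update batch itself becomes a brand-new shard, announced
    // to the cluster only after the deletes above have been applied, so no
    // document is ever served twice
    static void buildDeltaShard(IndexWriter deltaShardWriter, List<Document> batch)
            throws Exception {
        for (Document doc : batch) {
            deltaShardWriter.addDocument(doc);
        }
        deltaShardWriter.commit();
    }
}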

On Tue, Mar 31, 2009 at 6:51 AM, sunnyfr <jo...@gmail.com> wrote:

>
> I have about 14M documents and my index is about 11G.
> At the moment I update about 30,000 documents every 20 minutes.
> Lucene is always merging data.  What would you recommend?
> Replication costs too much for the slaves: they always pull back whole new
> index directories instead of just the changed segments.
>
> Is there a way to get around this issue?  What would you recommend to
> people who need fresh updates on the slaves with a large amount of data?
> Thanks a lot,
>
>