Posted to solr-user@lucene.apache.org by Gili Nachum <gi...@gmail.com> on 2015/05/30 10:51:41 UTC

Optimal FS block size for "small" documents in Solr?

Hi, what would be an optimal FS block size to use?

Using Solr 4.7.2, I have a RAID-5 array of SSDs currently configured with
a 128KB block size.
Can I expect better indexing/query-time performance with a smaller block
size (say 8K), considering that my documents are almost always smaller
than 8K?
I assume all stored fields would fit into one block, which is good, but what
would Lucene prefer for reading a long posting list and other data
structures?

Any rules of thumb, or has anyone experimented with this?

Re: Optimal FS block size for "small" documents in Solr?

Posted by Upayavira <uv...@odoko.co.uk>.

On Sat, May 30, 2015, at 09:51 AM, Gili Nachum wrote:
> Hi, What would be an optimal FS block size to use?
> 
> Using Solr 4.7.2, I have an RAID-5 of SSD drives currently configured with
> a 128KB block size.
> Can I expect better indexing/query time performance with a smaller block
> size (say 8K)?
> Considering my documents are almost always smaller than 8K.
> I assume all stored fields would fit into one block which is good, but what
> will Lucene prefer for reading a long posting list and other data
> structures.
> 
> Any rules of thumb or anyone that had experimented on this?

I'm gonna start this response with the observation that I don't know
anything about the topic you are asking about.

So, with that out of the way: a Lucene index is "write once". That is,
when you do a commit, the documents making up that commit are written to
disk as a set of files, which together form a segment, and those files
are not modified afterwards.

Therefore, it isn't the size of a single document that matters so much
as the number and size of the documents making up a single commit.
There's a lot more to it too, e.g. whether fields are stored, how they
are analysed, etc.

You could do a simple experiment. Write a little app that pushes docs to
Solr and commits, then look at the file sizes on disk. Then repeat with
more documents and see what the impact on file sizes is. I suspect you
can answer your question relatively easily.
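
A minimal sketch of such an experiment, using only the Python standard
library. The Solr URL, the core name (collection1), the index directory
and the field names (id, text_t) are assumptions for illustration and
will need adjusting to the actual install and schema:

```python
import json
import os
import urllib.request

# Assumed values, not taken from the thread: adjust to your install.
SOLR_UPDATE_URL = "http://localhost:8983/solr/collection1/update?commit=true"
INDEX_DIR = "/var/solr/collection1/data/index"


def make_docs(start_id, count, body_chars):
    """Build `count` small documents, each with `body_chars` characters of text."""
    filler = "lorem " * (body_chars // 6 + 1)
    return [{"id": str(start_id + i), "text_t": filler[:body_chars]}
            for i in range(count)]


def post_and_commit(docs, url=SOLR_UPDATE_URL):
    """Send one batch as JSON; commit=true in the URL asks Solr to commit it."""
    req = urllib.request.Request(
        url,
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status


def index_file_sizes(index_dir=INDEX_DIR):
    """Map each regular file in the index directory to its size in bytes."""
    return {name: os.path.getsize(os.path.join(index_dir, name))
            for name in sorted(os.listdir(index_dir))
            if os.path.isfile(os.path.join(index_dir, name))}


def run_experiment(batches=(100, 1000, 10000), body_chars=4096):
    """Index growing batches, committing after each, and report file sizes."""
    next_id = 0
    for batch in batches:
        post_and_commit(make_docs(next_id, batch, body_chars))
        next_id += batch
        sizes = index_file_sizes()
        print(f"after {next_id} docs: {len(sizes)} files, "
              f"{sum(sizes.values())} bytes")
```

Calling run_experiment() against a throwaway core and comparing the
reported file sizes with the FS block size should give a first idea of
how many blocks each commit actually occupies.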

Upayavira

Re: Optimal FS block size for "small" documents in Solr?

Posted by Shawn Heisey <ap...@elyograg.org>.

On 5/30/2015 2:51 AM, Gili Nachum wrote:
> Hi, What would be an optimal FS block size to use?
> 
> Using Solr 4.7.2, I have an RAID-5 of SSD drives currently configured with
> a 128KB block size.
> Can I expect better indexing/query time performance with a smaller block
> size (say 8K)?
> Considering my documents are almost always smaller than 8K.
> I assume all stored fields would fit into one block which is good, but what
> will Lucene prefer for reading a long posting list and other data
> structures.

Generally speaking, RAID levels that use striping should have the
largest block size you can make, which for most modern RAID controllers
is 1MB or 2MB.  When you make the stripe size very small, reading and
writing even small files requires hitting all the disks.  With large
stripes, accessing data randomly is more likely to have one read hit one
disk while another read hits another disk.
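
The effect of stripe size on how many disks one read touches can be
sketched with a deliberately simplified model. It assumes data is laid
out round-robin in chunks of one stripe unit and ignores parity rotation
and controller caching; the function name and the example numbers are
illustrative only:

```python
def disks_touched(offset, length, stripe_unit, data_disks):
    """Count distinct data disks that one contiguous read touches.

    Simplified model: chunks of `stripe_unit` bytes are placed round-robin
    across `data_disks`; parity placement and caching are ignored.
    """
    first_chunk = offset // stripe_unit
    last_chunk = (offset + length - 1) // stripe_unit
    return min(last_chunk - first_chunk + 1, data_disks)


KB = 1024
for unit in (4 * KB, 128 * KB, 1024 * KB):
    # A 64 KB read starting 10 KB into the volume, on 4 data disks.
    print(unit // KB, "KB stripe unit ->",
          disks_touched(10 * KB, 64 * KB, unit, 4), "disk(s)")
```

In this model the 4 KB stripe unit makes the read touch all four disks,
while the 128 KB and 1 MB units keep it on a single disk, leaving the
other disks free to serve other reads.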

For Lucene/Solr, there might be benefits to smaller block sizes, but I
believe that they might cause more problems than they solve.

There are some additional things to think about:

If your server has its memory appropriately sized, then you will have
enough RAM to let your operating system cache your index entirely.  For
queries, you will only rarely be hitting the disk ... so disk speed and
layout don't matter much at all, and you will only need to be concerned
about *write* speed for indexing.
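
A rough way to check whether the index can be cached entirely is to
compare its size on disk with the RAM left over after the Java heap.
This is only a sketch: the helper names are made up for illustration,
and it assumes nothing else on the machine uses a significant amount of
memory:

```python
import os


def dir_size_bytes(path):
    """Total size of all regular files under `path`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def cache_headroom_bytes(index_bytes, ram_bytes, jvm_heap_bytes):
    """RAM left for the OS disk cache after the heap, minus the index size.

    A negative result suggests the index cannot be fully cached.
    Memory used by other processes is ignored (an assumption of this sketch).
    """
    return (ram_bytes - jvm_heap_bytes) - index_bytes
```

For example, with a 40 GB index, 64 GB of RAM and an 8 GB heap,
cache_headroom_bytes reports 16 GB of headroom; the index size itself
can be measured with dir_size_bytes on the index directory.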

RAID levels 3 through 6 (and any derivations like level 50) are
*horrible* if there is much write activity -- for a Solr install,
that means indexing, and to a slightly lesser extent, logging.

When you write to a RAID5 array, you slow *everything* down.  Even
*reads* that happen at the same time as writes are strongly affected by
those writes.  It is the nature of RAID5.  If your system is entirely
read-only, then RAID5 is awesome ... but RAID10 is better.  RAID10 *is*
initially more expensive than RAID5 ... but the performance and
reliability benefits are completely worth the additional expense.
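
The usual textbook explanation for the write cost is the small-write
penalty: updating one block on RAID5 takes four physical I/Os (read old
data, read old parity, write new data, write new parity), while RAID10
takes two (one write per mirror member). A back-of-the-envelope sketch
follows; the function names are illustrative, and real controllers with
write-back caches or full-stripe writes will behave differently:

```python
def backend_ios_per_small_write(raid_level):
    """Physical I/Os caused by one small random write (textbook figures)."""
    penalties = {
        "raid10": 2,  # write the block to both mirror members
        "raid5": 4,   # read old data + old parity, write new data + new parity
    }
    return penalties[raid_level]


def max_small_write_iops(raid_level, disks, iops_per_disk):
    """Rough ceiling on small random writes per second for the whole array."""
    return disks * iops_per_disk // backend_ios_per_small_write(raid_level)
```

With a hypothetical array of 6 disks at 10000 IOPS each, this estimate
gives 15000 small writes per second for RAID5 and 30000 for RAID10, and
every extra physical I/O on RAID5 also competes with concurrent reads.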

Additional reading material below.  I do highly recommend reading at
least the first link:

http://www.miracleas.com/BAARF/RAID5_versus_RAID10.txt
http://www.baarf.com/

The RAID10 stripe size should be at least 1MB if your controller
supports blocks that large.

Thanks,
Shawn