Posted to solr-user@lucene.apache.org by Greenhorn Techie <gr...@gmail.com> on 2018/06/07 12:41:32 UTC

Running Solr on HDFS - Disk space

Hi,

As HDFS has its own replication mechanism, with an HDFS replication
factor of 3 and a SolrCloud replication factor of 3, does that mean
each document will end up with around 9 copies stored in HDFS? If so,
is there a way to configure HDFS or Solr so that only three copies are
maintained overall?

Thanks

Re: Running Solr on HDFS - Disk space

Posted by Hendrik Haddorp <he...@gmx.net>.
The only option should be to configure Solr with a replication factor
of 1 or HDFS with no replication. I would go for the middle and
configure both with a factor of 2. That way a single failure in either
HDFS or Solr is not a problem, whereas with a 1/3 or 3/1 setup a single
server failure would bring the collection down.
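
As a rough sketch (the collection name, shard count and host are only
placeholders), the SolrCloud side of that could be set when creating
the collection:

  curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=2&replicationFactor=2'

The HDFS side is covered by dfs.replication, see below.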

Setting the HDFS replication factor is a bit tricky, as in some places
Solr takes the default replication factor set on HDFS and in others it
takes a default from the client side. HDFS allows you to set a
replication factor for every file individually.
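
For illustration (the /solr path and the value 2 are just examples),
the cluster-wide default goes into hdfs-site.xml:

  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>

and index files that already exist can be changed afterwards with:

  hdfs dfs -setrep 2 /solr

Since Solr sometimes picks up a client-side default, it may also be
worth setting dfs.replication in the Hadoop config directory that Solr
reads (solr.hdfs.confdir), but I have not verified every code path.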

regards,
Hendrik

On 07.06.2018 15:30, Shawn Heisey wrote:
> On 6/7/2018 6:41 AM, Greenhorn Techie wrote:
>> As HDFS has its own replication mechanism, with an HDFS replication
>> factor of 3 and a SolrCloud replication factor of 3, does that mean
>> each document will end up with around 9 copies stored in HDFS? If so,
>> is there a way to configure HDFS or Solr so that only three copies
>> are maintained overall?
>
> Yes, that is exactly what happens.
>
> SolrCloud replication assumes that each of its replicas is a 
> completely independent index.  I am not aware of anything in Solr's 
> HDFS support that can use one HDFS index directory for multiple 
> replicas.  At the most basic level, a Solr index is a Lucene index.  
> Lucene goes to great lengths to make sure that an index *CANNOT* be 
> used in more than one place.
>
> Perhaps somebody who is more familiar with HDFSDirectoryFactory can 
> offer you a solution.  But as far as I know, there isn't one.
>
> Thanks,
> Shawn
>


Re: Running Solr on HDFS - Disk space

Posted by Shawn Heisey <ap...@elyograg.org>.
On 6/7/2018 6:41 AM, Greenhorn Techie wrote:
> As HDFS has its own replication mechanism, with an HDFS replication
> factor of 3 and a SolrCloud replication factor of 3, does that mean
> each document will end up with around 9 copies stored in HDFS? If so,
> is there a way to configure HDFS or Solr so that only three copies are
> maintained overall?

Yes, that is exactly what happens.
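
Just to spell out the arithmetic (with the defaults you mention):

  total copies on disk = SolrCloud replicationFactor * HDFS dfs.replication
                       = 3 * 3 = 9

Dropping both to 2 would still leave 4 copies.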

SolrCloud replication assumes that each of its replicas is a completely 
independent index.  I am not aware of anything in Solr's HDFS support 
that can use one HDFS index directory for multiple replicas.  At the 
most basic level, a Solr index is a Lucene index.  Lucene goes to great 
lengths to make sure that an index *CANNOT* be used in more than one place.
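
For reference, the HDFS setup in solrconfig.xml looks roughly like this
(the hdfs:// URI and confdir are placeholders for your environment):

  <directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
    <str name="solr.hdfs.home">hdfs://namenode:8020/solr</str>
    <str name="solr.hdfs.confdir">/etc/hadoop/conf</str>
  </directoryFactory>

Every replica gets its own index directory under solr.hdfs.home, which
is why the copies multiply rather than being shared.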

Perhaps somebody who is more familiar with HDFSDirectoryFactory can 
offer you a solution.  But as far as I know, there isn't one.

Thanks,
Shawn