Posted to solr-user@lucene.apache.org by Shalom Ben-Zvi Kazaz <sh...@gmail.com> on 2013/10/30 20:49:25 UTC

ReplicationHandler - SnapPull failed to download a file completely.

We are continuously getting this exception during replication from
master to slave. Our index size is 9.27 G and we are trying to replicate
a slave from scratch.
It's a different file each time; sometimes we get to 60% replication
before it fails and sometimes only 10%. We have never managed a successful
replication.

30 Oct 2013 18:38:52,884 [explicit-fetchindex-cmd] ERROR
ReplicationHandler - SnapPull failed
:org.apache.solr.common.SolrException: Unable to download
_aa7_Lucene41_0.tim completely. Downloaded 0!=1054090
        at
org.apache.solr.handler.SnapPuller$DirectoryFileFetcher.cleanup(SnapPuller.java:1244)
        at
org.apache.solr.handler.SnapPuller$DirectoryFileFetcher.fetchFile(SnapPuller.java:1124)
        at
org.apache.solr.handler.SnapPuller.downloadIndexFiles(SnapPuller.java:719)
        at
org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:397)
        at
org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:317)
        at
org.apache.solr.handler.ReplicationHandler$1.run(ReplicationHandler.java:218)

I read in some thread that there was a related bug in Solr 4.1, but we
are using Solr 4.3 and have also tried 4.5.1.
It seems that DirectoryFileFetcher sometimes cannot download a file;
the file arrives on the slave with a size of zero.
We are running in a test environment where bandwidth is high.

this is the master setup:

<requestHandler name="/replication" class="solr.ReplicationHandler" >
   <lst name="master">
     <str name="replicateAfter">commit</str>
     <str name="replicateAfter">startup</str>
     <str name="confFiles">stopwords.txt,spellings.txt,synonyms.txt,protwords.txt,elevate.xml,currency.xml</str>
     <str name="commitReserveDuration">00:00:50</str>
   </lst>
</requestHandler>

and the slave setup:

<requestHandler name="/replication" class="solr.ReplicationHandler">
   <lst name="slave">
     <str name="masterUrl">http://solr-master.saltdev.sealdoc.com:8081/solr-master</str>
     <str name="httpConnTimeout">150000</str>
     <str name="httpReadTimeout">300000</str>
   </lst>
</requestHandler>


Re: ReplicationHandler - SnapPull failed to download a file completely.

Posted by Shawn Heisey <so...@elyograg.org>.
On 10/31/2013 7:26 AM, Shalom Ben-Zvii Kazaz wrote:
> Shawn, thank you for your answer.
> For the purpose of testing, we have a test environment where we are no
> longer indexing. We also disabled the DIH delta import, so as I understand
> it there shouldn't be any commits on the master.
> I also tried with
> <str name="commitReserveDuration">50:50:50</str>
> and got the same failure.

If it's in an environment where there are no commits, that's really
odd.  I would suspect underlying filesystem or network issues.  There's
one problem that's not well known, but is very common - problems with
NIC firmware, most commonly Broadcom NICs.  These problems result in
things working correctly almost all the time, but when there is a high
network load, things break in strange ways, and the resulting errors
rarely look like they are network-related.

Most embedded NICs are either Broadcom or Realtek, both of which are
famous for their firmware problems.  Broadcom NICs are very common on
Dell and HP servers.  Upgrading the firmware (which is not usually the
same thing as upgrading the driver) is the only fix.  NICs from other
manufacturers also have upgradable firmware, but don't usually have the
same kind of high-profile problems as Broadcom.

The NIC firmware might not have anything to do with this problem, but
it's the only thing left that I can think of.  I personally haven't used
replication since Solr 1.4.1, but a lot of people do.  I can't say that
there are no bugs, but so far I'm not seeing the kind of problem reports
that appear when a bug in a critical piece of the software exists.

Thanks,
Shawn


Re: ReplicationHandler - SnapPull failed to download a file completely.

Posted by Shalom Ben-Zvii Kazaz <sh...@gmail.com>.
Shawn, thank you for your answer.
For the purpose of testing, we have a test environment where we are no
longer indexing. We also disabled the DIH delta import, so as I understand
it there shouldn't be any commits on the master.
I also tried with
<str name="commitReserveDuration">50:50:50</str>
and got the same failure.

I tried changing and increasing various parameters on the master and slave,
but no luck yet.
The master is functioning OK; we do get search results, so I assume there
is no index corruption on the master side.
Just to mention, we have done this many times over the past few
years. The problem started only now, after we upgraded Solr from version 3.6
to version 4.3 and reindexed all documents.

This is holding up an upgrade of our production site and various
customers. If we have no solution soon, do you think we can copy the index
directory from the master to the slave and hope that future replication
will work?

Thank you again.

Shalom





On Wed, Oct 30, 2013 at 10:00 PM, Shawn Heisey <so...@elyograg.org> wrote:

> On 10/30/2013 1:49 PM, Shalom Ben-Zvi Kazaz wrote:
>
>> We are continuously getting this exception during replication from
>> master to slave. Our index size is 9.27 G and we are trying to replicate
>> a slave from scratch.
>> It's a different file each time; sometimes we get to 60% replication
>> before it fails and sometimes only 10%. We have never managed a successful
>> replication.
>>
>
> <snip>
>
>
>  this is the master setup:
>>
>> <requestHandler name="/replication" class="solr.ReplicationHandler" >
>>     <lst name="master">
>>       <str name="replicateAfter">commit</str>
>>       <str name="replicateAfter">startup</str>
>>       <str name="confFiles">stopwords.txt,spellings.txt,synonyms.txt,protwords.txt,elevate.xml,currency.xml</str>
>>       <str name="commitReserveDuration">00:00:50</str>
>>     </lst>
>> </requestHandler>
>>
>
> I assume that you're probably doing commits fairly often, resulting in a
> lot of merge activity that frequently deletes segments.  That
> "commitReserveDuration" parameter needs to be made larger.  I would imagine
> that it takes a lot more than 50 seconds to do the replication - even if
> you've got an extremely fast network, replicating 9.27GB probably takes
> several minutes.
>
> From the wiki page on replication:  "If your commits are very frequent and
> network is particularly slow, you can tweak an extra attribute <str
> name="commitReserveDuration">00:00:10</str>. This is roughly the time
> taken to download 5MB from master to slave. Default is 10 secs."
>
> http://wiki.apache.org/solr/SolrReplication#Master
>
> You've said that your network is not slow, but with that much data, all
> networks are slow.
>
> Thanks,
> Shawn
>
>

Re: ReplicationHandler - SnapPull failed to download a file completely.

Posted by Shawn Heisey <so...@elyograg.org>.
On 10/30/2013 1:49 PM, Shalom Ben-Zvi Kazaz wrote:
> We are continuously getting this exception during replication from
> master to slave. Our index size is 9.27 G and we are trying to replicate
> a slave from scratch.
> It's a different file each time; sometimes we get to 60% replication
> before it fails and sometimes only 10%. We have never managed a successful
> replication.

<snip>

> this is the master setup:
>
> <requestHandler name="/replication" class="solr.ReplicationHandler" >
>     <lst name="master">
>       <str name="replicateAfter">commit</str>
>       <str name="replicateAfter">startup</str>
>       <str name="confFiles">stopwords.txt,spellings.txt,synonyms.txt,protwords.txt,elevate.xml,currency.xml</str>
>       <str name="commitReserveDuration">00:00:50</str>
>     </lst>
> </requestHandler>

I assume that you're probably doing commits fairly often, resulting in a
lot of merge activity that frequently deletes segments.  That
"commitReserveDuration" parameter needs to be made larger.  I would
imagine that it takes a lot more than 50 seconds to do the replication -
even if you've got an extremely fast network, replicating 9.27GB probably
takes several minutes.

From the wiki page on replication:  "If your commits are very frequent
and network is particularly slow, you can tweak an extra attribute
<str name="commitReserveDuration">00:00:10</str>. This is roughly the
time taken to download 5MB from master to slave. Default is 10 secs."

http://wiki.apache.org/solr/SolrReplication#Master

You've said that your network is not slow, but with that much data, all 
networks are slow.
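
As a rough check of the "more than 50 seconds" estimate, here is a minimal
Python sketch that compares an idealized transfer time with the configured
commitReserveDuration. The link speeds are assumptions (the thread does not
state them), the 9.27 G figure is treated as GiB, and the helper names
transfer_seconds and parse_hhmmss are hypothetical. It only checks the
arithmetic; it does not show how Solr actually applies the reserve
duration during a fetch.

```python
def transfer_seconds(size_gib: float, link_mbit_per_s: float) -> float:
    """Ideal time to move size_gib GiB over a link, ignoring protocol overhead."""
    size_bits = size_gib * 1024**3 * 8
    return size_bits / (link_mbit_per_s * 1_000_000)


def parse_hhmmss(value: str) -> int:
    """Convert an HH:MM:SS string (the commitReserveDuration format) to seconds."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return hours * 3600 + minutes * 60 + seconds


INDEX_SIZE_GIB = 9.27          # index size from the original post
RESERVE = "00:00:50"           # commitReserveDuration from the master config
ASSUMED_LINKS = {"100 Mbit/s": 100, "1 Gbit/s": 1000}  # assumed, not from the thread

reserve_s = parse_hhmmss(RESERVE)
for name, mbit in ASSUMED_LINKS.items():
    needed_s = transfer_seconds(INDEX_SIZE_GIB, mbit)
    verdict = "exceeds" if needed_s > reserve_s else "fits within"
    print(f"{name}: about {needed_s:.0f} s, {verdict} the {reserve_s} s reserve")
```

Even on the assumed 1 Gbit/s link with no overhead at all, the full copy
takes longer than the 50 seconds configured on the master.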

Thanks,
Shawn