Posted to dev@lucene.apache.org by "Joel Bernstein (JIRA)" <ji...@apache.org> on 2016/04/05 03:31:25 UTC

[jira] [Commented] (SOLR-8709) Account for out-of-order version numbers in the TopicStream

    [ https://issues.apache.org/jira/browse/SOLR-8709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15225481#comment-15225481 ] 

Joel Bernstein commented on SOLR-8709:
--------------------------------------

I wanted to give an update on this ticket as Solr 6.0 is here and the TopicStream is part of the release.

I made a pretty serious attempt to devise a stress test that would cause the TopicStream to miss documents. In the test that I devised the TopicStream never missed documents.

Here is the outline of the test:

1) Start a multi-threaded client to index documents to Solr. I tested with 5, 8, 12, 16 and 20 indexing threads. Indexing rate was about 22,000 docs per second with this setup.
2) At the same time, start a TopicStream and have it run a *:* query, pulling all new documents and writing their version numbers to a file.
3) Compare the number of version numbers in the file to the number of docs in the index. Before comparing, I piped the file through sort | uniq to ensure that no version number was pulled twice.
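The check in step 3 can be sketched in a few lines of Python. This is only an illustration, not the script used in the test: the function name summarize_versions, the one-version-per-line file layout, and the index_doc_count argument (the document count reported by the collection) are all assumptions.

```python
from collections import Counter


def summarize_versions(lines, index_doc_count):
    """Summarize a dump of version numbers, one per line (assumed layout).

    Mirrors the `sort | uniq` check: report versions that were pulled
    more than once, then compare the unique count to the index count.
    """
    versions = [int(line) for line in lines if line.strip()]
    counts = Counter(versions)
    duplicates = sorted(v for v, n in counts.items() if n > 1)
    return {
        "unique": len(counts),
        "duplicates": duplicates,
        # Only a clean match if nothing was pulled twice and nothing is missing.
        "matches_index": not duplicates and len(counts) == index_doc_count,
    }
```

Passing an open file handle as lines works, since iterating a text file yields one line at a time.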

The outcome of this test was that the number of version numbers in the file *always* matched the record count in the Solr collection. The TopicStream never missed documents due to out-of-order version numbers.

I ran these tests over and over again for several hours. Each time the record counts matched up.

I'm still confused by this outcome because I expected to be able to cause the issue. In an offline chat with [~yonik@apache.org], he assured me that out-of-order version numbers could occur. A review of the code seems to show that it is possible for out-of-order version numbers to be added to the index.

But the fact remains that I was not able to break the TopicStream under a fairly rigorous test scenario.

It is possible that, because of the way flushes and commits are processed, out-of-order version numbers won't span commit boundaries. In order for the TopicStream to lose documents, the out-of-order version numbers must span a commit boundary. But a review of the code did not make it clear whether that can happen.
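To make the commit-boundary condition concrete, here is a toy model in Python. It is not the TopicStream code: it assumes a single checkpoint (the highest version delivered so far), one poll after every commit, and a hypothetical function name run_topic.

```python
def run_topic(commits):
    """Toy model of a checkpoint-based reader (an assumption, not Solr code).

    commits: list of lists; each inner list holds the version numbers that
    become visible with that commit. The reader polls once after each
    commit, fetches visible versions above its checkpoint in sorted order,
    and advances the checkpoint to the highest version it delivered.
    """
    checkpoint = 0
    visible = []
    delivered = []
    for batch in commits:
        visible.extend(batch)
        new = sorted(v for v in visible if v > checkpoint)
        delivered.extend(new)
        if new:
            checkpoint = new[-1]
    return delivered
```

With run_topic([[1, 3, 2], [4, 5]]) the out-of-order versions land inside one commit, the sort repairs the order, and all five versions are delivered. With run_topic([[1, 3], [2, 4]]) version 2 becomes visible only after the checkpoint has moved to 3, so the result is [1, 3, 4] and version 2 is lost.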

So until we're able to clear this up I'll consider this an open issue and I'll mention it in the TopicStream documentation.

If it does turn out that the TopicStream can lose documents due to out-of-order version numbers, the "retentionWindow" described in the comment above will eliminate the issue.
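The idea in the issue description (remember the last N version numbers sent, then rescan that window for anything not yet sent) might look roughly like the Python sketch below. This is one interpretation under stated assumptions, not the proposed implementation: the class name WindowedTopic, the window_size parameter, and the choice to keep the N highest versions sent are all hypothetical.

```python
import heapq


class WindowedTopic:
    """Toy reader that remembers the N highest versions it has sent."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.heap = []     # min-heap of remembered versions
        self.sent = set()  # same versions, for fast membership checks

    def poll(self, visible_versions):
        # Until the window fills, every sent version is still remembered,
        # so everything can be rescanned. Afterwards, rescan only from the
        # lowest remembered version upward.
        full = len(self.heap) >= self.window_size
        floor = self.heap[0] if full else 0
        new = sorted(v for v in visible_versions
                     if v >= floor and v not in self.sent)
        for v in new:
            heapq.heappush(self.heap, v)
            self.sent.add(v)
            if len(self.heap) > self.window_size:
                # Forget the lowest version; anything below it is no
                # longer rescanned, so it cannot be sent twice.
                self.sent.discard(heapq.heappop(self.heap))
        return new
```

With a window of 3, polling [1, 3] and then [1, 3, 2, 4] returns [1, 3] followed by [2, 4], so the late version 2 is recovered. With a window of 1 the second poll returns only [4]: a late version is recovered only if it still falls inside the remembered window, so the window has to be sized for the worst expected reordering.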


> Account for out-of-order version numbers in the TopicStream
> -----------------------------------------------------------
>
>                 Key: SOLR-8709
>                 URL: https://issues.apache.org/jira/browse/SOLR-8709
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Joel Bernstein
>
> Currently the TopicStream can miss documents if version numbers are received out-of-order. The TopicStream sorts on version number so it will only miss out-of-order versions that span commit boundaries.
> In order to resolve this issue we can adopt an approach that keeps a set of the last N version numbers sent for each Topic.  As the documents are scanned we can check for documents within this time window that do not appear in the sent set. These documents can then be sent.


