Posted to issues@nifi.apache.org by "Johannes Peter (JIRA)" <ji...@apache.org> on 2017/09/03 12:18:00 UTC

[jira] [Comment Edited] (NIFI-3248) GetSolr can miss recently updated documents

    [ https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16151784#comment-16151784 ] 

Johannes Peter edited comment on NIFI-3248 at 9/3/17 12:17 PM:
---------------------------------------------------------------

[~ijokarumawak], [~bbende]
I examined the current GetSolr implementation and found several issues that I would like to discuss:
(1) Currently, a date field must be included in the index schema and in the Solr documents being indexed. Although this is easy to achieve with Solr's TimestampUpdateProcessor, it would be better simply to use Solr's _version_ field to filter subsequent retrievals. This field is present in every well-configured Solr index because several Solr features require it. The processor could then also be used with indexes that were not created with NiFi in mind.
(2) The processor iterates through a result set only on its first run. This becomes a problem when the number of newly indexed documents in a trigger interval exceeds the configured batch size.
(3) Retrieving Solr documents in batches by successively increasing the start parameter has two problems in this context. First, it performs poorly on large collections. Second, if the index is updated during the iteration, documents will probably be duplicated or lost, because newly indexed documents or deletions shift the positions of the remaining documents. Instead of increasing the start parameter, cursor marks should be used, and the sort order should be fixed to ascending indexing time (the _version_ field). More details: https://lucene.apache.org/solr/guide/6_6/pagination-of-results.html
(4) Using the fq parameter instead of the q parameter should improve performance in some cases, because Solr can cache fq results. The q parameter should be fixed to "*:*".
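
To illustrate the second problem in (3), here is a minimal sketch that uses no Solr at all: a plain Python list stands in for the sorted result set, and the hypothetical helper page_by_offset mimics a start/rows request. It only shows how a change between two requests shifts positions; it is not how GetSolr is implemented.

```python
def page_by_offset(index, start, rows):
    """Mimic a start/rows request: a slice of the current sorted view."""
    return sorted(index)[start:start + rows]


def read_two_pages(index, mutate, rows=2):
    """Read page 1, let the index change, then read page 2 by offset."""
    seen = page_by_offset(index, 0, rows)
    mutate(index)  # an update that lands between the two requests
    seen += page_by_offset(index, rows, rows)
    return seen


# A deletion before the second request shifts everything left: 30 is skipped.
after_delete = read_two_pages([10, 20, 30, 40], lambda idx: idx.remove(10))

# An insertion ahead of the current position shifts everything right:
# 20 is returned twice.
after_insert = read_two_pages([10, 20, 30, 40], lambda idx: idx.append(5))

print(after_delete)  # [10, 20, 40]
print(after_insert)  # [10, 20, 20, 30]
```

A cursor avoids this because it resumes after the last sort value that was returned, not after a position.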

Consequently, I suggest redesigning the GetSolr processor so that it focuses mainly on retrieving documents reliably, which cursor marks and the _version_ field make easier. Additionally, users should not be able to change the sort and q parameters. The full query capabilities of Solr could be made available through an additional processor, e.g. "FetchSolr".
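
A rough sketch of the retrieval loop proposed above, under several assumptions: the uniqueKey field is called "id" (cursor marks require the uniqueKey in the sort), `search` is a hypothetical callable that sends the parameters to Solr and returns a simplified dict with "docs" and "nextCursorMark", and make_fake_search is only an in-memory stand-in so the loop can be run without a server. The parameter names q, fq, sort, rows and cursorMark are Solr's own; everything else is illustrative.

```python
def build_params(last_version, cursor_mark, rows):
    """Fixed q and sort; only the fq lower bound and the cursor change."""
    return {
        "q": "*:*",
        "fq": "_version_:{%d TO *]" % last_version,  # exclusive lower bound
        "sort": "_version_ asc,id asc",  # "id" is an assumed uniqueKey name
        "rows": str(rows),
        "cursorMark": cursor_mark,
    }


def fetch_all(search, last_version, rows=100):
    """Yield every document newer than last_version, one batch at a time."""
    cursor = "*"  # Solr's initial cursor mark
    while True:
        response = search(build_params(last_version, cursor, rows))
        yield from response["docs"]
        next_cursor = response["nextCursorMark"]
        if next_cursor == cursor:  # unchanged mark: nothing left to read
            break
        cursor = next_cursor


def make_fake_search(docs):
    """In-memory stand-in for Solr; the mark format here is made up."""
    def search(params):
        lower = int(params["fq"].split("{")[1].split(" TO")[0])
        keyed = sorted((d["_version_"], d["id"]) for d in docs
                       if d["_version_"] > lower)
        mark = params["cursorMark"]
        if mark != "*":
            version, doc_id = mark.split(":", 1)
            keyed = [k for k in keyed if k > (int(version), doc_id)]
        page = keyed[:int(params["rows"])]
        next_mark = "%d:%s" % page[-1] if page else mark
        return {"docs": [{"_version_": v, "id": i} for v, i in page],
                "nextCursorMark": next_mark}
    return search
```

With five documents whose versions are 1 to 5 and ids "a" to "e", `fetch_all(make_fake_search(docs), 1, rows=2)` yields b, c, d and e over three requests. The highest _version_ seen would then be stored as processor state for the next trigger.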



> GetSolr can miss recently updated documents
> -------------------------------------------
>
>                 Key: NIFI-3248
>                 URL: https://issues.apache.org/jira/browse/NIFI-3248
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Extensions
>    Affects Versions: 1.0.0, 0.5.0, 0.6.0, 0.5.1, 0.7.0, 0.6.1, 1.1.0, 0.7.1, 1.0.1
>            Reporter: Koji Kawamura
>            Assignee: Johannes Peter
>         Attachments: nifi-flow.png, query-result-with-curly-bracket.png, query-result-with-square-bracket.png
>
>
> GetSolr holds the last query timestamp so that it fetches only documents that have been added or updated since the last query.
> However, GetSolr misses some of those updated documents, and once a document's date field value becomes older than the last query timestamp, GetSolr can no longer retrieve that document.
> This JIRA tracks the investigation of this behavior and the related discussion.
> The following things can cause this behavior:
> |#|Short description|Should we address it?|
> |1|Timestamp range filter, curly or square bracket?|No|
> |2|Timezone difference between update and query|Additional docs might be helpful|
> |3|Lag comes from Near Real Time nature of Solr|Should be documented at least, add 'commit lag-time'?|
> h2. 1. Timestamp range filter, curly or square bracket?
> At first glance, mixing curly and square brackets looked strange ([source code|https://github.com/apache/nifi/blob/support/nifi-0.5.x/nifi-nar-bundles/nifi-solr-bundle/nifi-solr-processors/src/main/java/org/apache/nifi/processors/solr/GetSolr.java#L202]). But the difference is meaningful.
> A square bracket in a range query is inclusive and a curly bracket is exclusive. If both sides were inclusive and a document had a timestamp exactly on the boundary, it could be returned in two consecutive executions, and we want it in only one.
> This is intentional and should stay as it is.
> h2. 2. Timezone difference between update and query
> Solr treats date fields as a [UTC representation|https://cwiki.apache.org/confluence/display/solr/Working+with+Dates]. If the date field string of an updated document represents a time without a timezone, and NiFi runs in an environment that uses a timezone other than UTC, GetSolr can't perform the date range query as users expect.
> Let's say NiFi is running in JST (UTC+9). A process adds a document to Solr at 15:00 JST, but the date field carries no timezone, so Solr indexes it as 15:00 UTC. GetSolr then performs a range query at 15:10 JST, targeting any documents updated from 15:00 to 15:10 JST. GetSolr formats the dates in UTC, i.e. 6:00 to 6:10 UTC, so the updated document does not match the date range filter.
> To avoid this, updated documents must carry a proper timezone in the date field's string representation.
> If NiFi Expression Language is used to set the current timestamp in that date field, the following expression can be used:
> {code}
> ${now():format("yyyy-MM-dd'T'HH:mm:ss.SSSZ")}
> {code}
> It will produce a result like:
> {code}
> 2016-12-27T15:30:04.895+0900
> {code}
> Solr will then index it as UTC, and GetSolr will be able to query it as expected.
> h2. 3. Lag comes from Near Real Time nature of Solr
> Solr provides Near Real Time search: recently updated documents become queryable in near real time, but not in real time. This latency can be controlled either on the client side, which requests the update, by specifying the "commitWithin" parameter, or on the Solr server side with "autoCommit" and "autoSoftCommit" in [solrconfig.xml|https://cwiki.apache.org/confluence/display/solr/UpdateHandlers+in+SolrConfig#UpdateHandlersinSolrConfig-Commits].
> Since committing and updating the index can be costly, it's recommended to set this interval as long as the maximum tolerable latency allows.
> However, this can be problematic with GetSolr. For instance, as shown in the simple NiFi flow below, GetSolr can miss updated documents:
> {code}
> t1: GetSolr queried
> t2: GenerateFlowFile set date = t2
> t3: PutSolrContentStream stored new doc
> t4: GetSolr queried again, from t1 to t4, but the new doc hasn't been indexed
> t5: Solr completed index
> t6: GetSolr queried again, from t4 to t6, the doc didn't match query
> {code}
> This behavior should at least be documented.
> In addition, it would be helpful to add a new configuration property to GetSolr that specifies a commit lag-time, so that GetSolr queries an older timestamp range.
> {code}
> // with commit lag-time
> t1: GetSolr queried
> t2: GenerateFlowFile set date = t2
> t3: PutSolrContentStream stored new doc
> t4: GetSolr queried again, from (t1 - lag) to (t4 - lag), but the new doc hasn't been indexed
> t5: Solr completed index
> t6: GetSolr queried again, from (t4 - lag) to (t6 - lag), the doc can match query
> {code}
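
The lag-time idea quoted above can be sketched as follows. This is only an illustration under stated assumptions: the field name "created" and the helper names solr_date and lagged_range_filter are hypothetical, the range uses an exclusive lower and inclusive upper bound as discussed in section 1, and datetimes must be timezone-aware so that the UTC conversion from section 2 is explicit.

```python
from datetime import datetime, timedelta, timezone


def solr_date(moment):
    """Format a timezone-aware datetime as a UTC timestamp for Solr."""
    if moment.tzinfo is None:
        raise ValueError("timezone-aware datetime required")
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def lagged_range_filter(field, previous_end, now, lag):
    """Build the range filter for one run, shifted back by the commit lag.

    Returns the filter string and the new end, which the caller stores
    and passes as previous_end on the next run.
    """
    end = now - lag
    fq = "%s:{%s TO %s]" % (field, solr_date(previous_end), solr_date(end))
    return fq, end


JST = timezone(timedelta(hours=9))
previous_end = datetime(2016, 12, 27, 15, 0, tzinfo=JST)
now = datetime(2016, 12, 27, 15, 10, tzinfo=JST)
fq, new_end = lagged_range_filter("created", previous_end, now,
                                  timedelta(minutes=1))
print(fq)  # created:{2016-12-27T06:00:00Z TO 2016-12-27T06:09:00Z]
```

A document committed up to one minute after its timestamp still falls inside a later window, at the cost of every document being delivered at least one minute late.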


