Posted to dev@lucene.apache.org by "Gabriel Cooper (Updated) (JIRA)" <ji...@apache.org> on 2011/11/23 17:21:40 UTC

[jira] [Updated] (SOLR-2864) DataImportHandler has non-deterministic sort order for XML files

     [ https://issues.apache.org/jira/browse/SOLR-2864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gabriel Cooper updated SOLR-2864:
---------------------------------

    Attachment: lucene-2864.patch

Excellent feedback. I've changed the sort to be depth-first, sorting directories alphabetically, then traversing their files first by date, then by name. 

I've changed the first test to accommodate your feedback, and modified the recursion test to cover the new ordering.

Interestingly, the recursion was technically broken and the test never tested anything. It used the child directory as its base, then ran recursively through ... that one directory.
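
For readers without the patch at hand, the ordering described above can be sketched roughly as follows. This is not the code from the attached patch: the class name SortedFileWalker, its method names, and the use of Java 8 comparators are all assumptions made for illustration. It also assumes that "depth-first" means a directory's subdirectories are fully traversed before its own files are emitted.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper for illustration only; not the code from the attached patch.
class SortedFileWalker {

    // Files: oldest first, ties broken by name.
    static final Comparator<File> BY_DATE_THEN_NAME =
        Comparator.comparingLong(File::lastModified).thenComparing(File::getName);

    // Directories: alphabetical by name.
    static final Comparator<File> BY_NAME = Comparator.comparing(File::getName);

    // Returns the files under dir in a deterministic order.
    static List<File> listFiles(File dir, boolean recursive) {
        List<File> result = new ArrayList<>();
        collect(dir, recursive, result);
        return result;
    }

    private static void collect(File dir, boolean recursive, List<File> out) {
        File[] children = dir.listFiles(); // null if dir is not a readable directory
        if (children == null) {
            return;
        }
        List<File> subdirs = new ArrayList<>();
        List<File> files = new ArrayList<>();
        for (File child : children) {
            if (child.isDirectory()) {
                subdirs.add(child);
            } else {
                files.add(child);
            }
        }
        subdirs.sort(BY_NAME);
        files.sort(BY_DATE_THEN_NAME);

        // Assumption: subdirectory contents come before this directory's own files.
        if (recursive) {
            for (File sub : subdirs) {
                collect(sub, true, out);
            }
        }
        out.addAll(files);
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            System.out.println("usage: SortedFileWalker <directory>");
            return;
        }
        for (File f : listFiles(new File(args[0]), true)) {
            System.out.println(f.getPath());
        }
    }
}
```

Because listFiles(File, boolean) always starts from the directory it is given, a recursion test against it needs to start at the parent directory and include at least one populated child directory; otherwise it only ever exercises a single directory.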
                
> DataImportHandler has non-deterministic sort order for XML files
> ----------------------------------------------------------------
>
>                 Key: SOLR-2864
>                 URL: https://issues.apache.org/jira/browse/SOLR-2864
>             Project: Solr
>          Issue Type: Bug
>          Components: contrib - DataImportHandler
>    Affects Versions: 3.4
>            Reporter: Gabriel Cooper
>            Priority: Minor
>              Labels: dataimport, patch, xml
>             Fix For: 3.5
>
>         Attachments: lucene-2864.patch, lucene-2864.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> DataImportHandler's FileListEntityProcessor relies on Java's File.list() method to retrieve a list of files from the configured dataimport directory, but list() does not guarantee a sort order ^(1)^. This means that if you have two files that update the same record, the results are non-deterministic. Typically, list() does in fact return them lexicographically sorted, but this is not guaranteed ^(2)^.
> An example of how you can get into trouble is to imagine the following:
> xyz.xml -- Created one hour ago. Contains updates to records "Foo" and "Bar".
> abc.xml -- Created one minute ago. Contains updates to records "Bar" and "Baz".
> In this case, the newest file, abc.xml, would (likely, but not guaranteed) be run first, updating the "Bar" and "Baz" records. Next, the older file, xyz.xml, would update "Foo" and overwrite "Bar" with outdated changes.
>  (1) Per http://download.oracle.com/javase/1,5,0/docs/api/java/io/File.html#list%28%29
> "There is no guarantee that the name strings in the resulting array will appear in any specific order; they are not, in particular, guaranteed to appear in alphabetical order."
>  (2)  Even if it were guaranteed, lexicographical sorting would give you the following sort order:
>   1.xml
>   10.xml
>   2.xml
>   ...
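
The two orderings discussed in the description can be compared without touching the filesystem. The sketch below is illustrative only: ImportOrderDemo and its Entry type are hypothetical stand-ins for files on disk, and the timestamps are arbitrary values chosen so that xyz.xml is older than abc.xml.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical demo class; the names and timestamps are made up for illustration.
class ImportOrderDemo {

    // Stand-in for a file on disk: a name plus a last-modified time.
    static final class Entry {
        final String name;
        final long lastModified; // milliseconds; only the relative order matters here

        Entry(String name, long lastModified) {
            this.name = name;
            this.lastModified = lastModified;
        }
    }

    // Plain lexicographic order on the file name.
    static final Comparator<Entry> BY_NAME = Comparator.comparing((Entry e) -> e.name);

    // Oldest first, ties broken by name.
    static final Comparator<Entry> BY_DATE_THEN_NAME =
        Comparator.comparingLong((Entry e) -> e.lastModified).thenComparing(BY_NAME);

    // Returns the entry names in the order given by the comparator.
    static List<String> namesSortedBy(List<Entry> entries, Comparator<Entry> order) {
        List<Entry> copy = new ArrayList<>(entries);
        copy.sort(order);
        List<String> names = new ArrayList<>();
        for (Entry e : copy) {
            names.add(e.name);
        }
        return names;
    }

    public static void main(String[] args) {
        List<Entry> files = Arrays.asList(
            new Entry("xyz.xml", 1_000L),   // older file
            new Entry("abc.xml", 2_000L));  // newer file

        // Name order runs the newer file first: [abc.xml, xyz.xml]
        System.out.println(namesSortedBy(files, BY_NAME));
        // Date order runs the older file first: [xyz.xml, abc.xml]
        System.out.println(namesSortedBy(files, BY_DATE_THEN_NAME));
    }
}
```

With name order, the older xyz.xml is applied last and overwrites "Bar" with stale data. With date order, abc.xml is applied last, so the most recent update wins.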
