Posted to dev@lucenenet.apache.org by "Shad Storhaug (Jira)" <ji...@apache.org> on 2019/12/29 07:57:00 UTC

[jira] [Resolved] (LUCENENET-612) SERIOUS issues with PerFieldAnalyzerWrapper in 4.8

     [ https://issues.apache.org/jira/browse/LUCENENET-612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shad Storhaug resolved LUCENENET-612.
-------------------------------------
    Fix Version/s: Lucene.Net 4.8.0
       Resolution: Fixed

This has now been resolved in Lucene.NET 4.8.0-beta00007.

> SERIOUS issues with PerFieldAnalyzerWrapper in 4.8
> --------------------------------------------------
>
>                 Key: LUCENENET-612
>                 URL: https://issues.apache.org/jira/browse/LUCENENET-612
>             Project: Lucene.Net
>          Issue Type: Bug
>          Components: Lucene.Net.Analysis.Common
>    Affects Versions: Lucene.Net 4.8.0
>            Reporter: Shad Storhaug
>            Priority: Major
>             Fix For: Lucene.Net 4.8.0
>
>   Original Estimate: 16h
>  Remaining Estimate: 16h
>
> This came in on the user mailing list on 15-July-2019 and was originally reported by Bryan Rojo (BryanRojo@elliotelectric.com)
>  
> {quote}Not necessarily a bug, but for some people who use PerFieldAnalyzerWrapper like I do this might be worth noting.
> PerFieldAnalyzerWrapper has been "improved" in 4.8 and now uses a PER_FIELD_REUSE_STRATEGY, which means that the tokenized fields will be stored in a dictionary, so if you have multiple fields with the same name in your document, then you will only be able to index the very first one that makes it into that dictionary.
> So the problem with this is that you can potentially lose thousands of terms in your index, which could cause your searches to be of very low quality.
> BEWARE.
> {quote}
>  
> There are 2 issues that need to be resolved to address this:
> 1. The documentation for {{PerFieldAnalyzerWrapper}} should be updated to inform users that if they need to use multiple dictionary keys with the same name, they should use {{TreeDictionary<K, V>}}.
> 2. {{TreeDictionary<K, V>}} does not currently implement {{System.Collections.Generic.IDictionary<TKey, TValue>}}, as it was brought over from C5 as-is.
> Another thing of note is that C5 has added support for .NET Standard 1.0 since this was brought over.
> However, there still seem to be a few problems that make the C5 types incompatible with Lucene.Net, most notably the lack of support for {{System.Collections.Generic.IDictionary<TKey, TValue>}} in {{TreeDictionary}} and {{System.Collections.Generic.ISet<T>}} in {{TreeSet}} (the latter of which has already been patched in {{Lucene.Net.Support.TreeSet}}).
> I [reported|https://github.com/sestoft/C5/issues/53] the lack of support for {{ISet<T>}} on 6-Nov-2016, but although the maintainers agree this should be done, it still hasn't been. Perhaps a PR to the C5 project is the way to get this done, which would allow us to finally remove these collection copies from Lucene.Net.Support and add a package dependency on C5.
> Another option is to shop around to see if there are any other generic TreeSet/TreeDictionary implementations that have popped up since late 2016 that we can check for compatibility.
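
To make the reporter's description concrete, here is a toy model in plain Java of a cache keyed by field name in which the first entry stored under a name wins. It is only a sketch of the behavior as described in the quoted report. It is not Lucene.NET code, and it is not a confirmed account of how PER_FIELD_REUSE_STRATEGY works internally. The class name PerFieldCacheSketch and the method cacheByFieldName are invented for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PerFieldCacheSketch {
    // Hypothetical stand-in for a per-field cache: one slot per field name.
    static Map<String, String> cacheByFieldName(List<String[]> fields) {
        Map<String, String> cache = new LinkedHashMap<>();
        for (String[] field : fields) {
            // field[0] is the field name, field[1] is the field value.
            // putIfAbsent keeps the first value stored under a name and
            // ignores later fields that reuse the same name.
            cache.putIfAbsent(field[0], field[1]);
        }
        return cache;
    }

    public static void main(String[] args) {
        List<String[]> doc = List.of(
            new String[] {"tag", "alpha"},
            new String[] {"tag", "beta"},
            new String[] {"title", "hello"});
        // Prints {tag=alpha, title=hello}; the second "tag" value is dropped.
        System.out.println(cacheByFieldName(doc));
    }
}
```

In this model a document with two "tag" fields ends up with a single cached entry, which is the kind of silent loss the report warns about.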



--
This message was sent by Atlassian Jira
(v8.3.4#803005)