Posted to dev@lucenenet.apache.org by "Shad Storhaug (JIRA)" <ji...@apache.org> on 2017/08/31 12:31:00 UTC

[jira] [Commented] (LUCENENET-595) Wildcard search with special characters "#" not working

    [ https://issues.apache.org/jira/browse/LUCENENET-595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148916#comment-16148916 ] 

Shad Storhaug commented on LUCENENET-595:
-----------------------------------------

I suspect this is due to the use of {{StandardAnalyzer}} when you write your index. Per the {{StandardTokenizer}} docs (https://lucene.apache.org/core/3_0_3/api/core/org/apache/lucene/analysis/standard/StandardTokenizer.html):

{quote}
This should be a good tokenizer for most European-language documents:

* Splits words at punctuation characters, removing punctuation. However, a dot that's not followed by whitespace is considered part of a token.
* Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split.
* Recognizes email addresses and internet hostnames as one token.

Many applications have specific tokenizer needs. If this tokenizer does not suit your application, please consider copying this source code directory to your project and maintaining your own grammar-based tokenizer.{quote}

Although it doesn't specifically state it, I suspect that the # (or for that matter any other special character) is being removed from the analyzed data, and does not match your query because it does not exist in the index. I suggest trying another analyzer that doesn't meddle with special characters, such as {{WhitespaceAnalyzer}} (https://lucene.apache.org/core/3_0_3/api/core/org/apache/lucene/analysis/WhitespaceAnalyzer.html), or build a custom analyzer to meet your exact needs.
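
One way to check this is to print the tokens each analyzer produces for a course name. The following is a minimal sketch against the Lucene.Net 3.0.3 API, not something I have run against your data. The class names {{AnalyzerComparison}} and {{LowercaseWhitespaceAnalyzer}}, the field name, and the sample text are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.Tokenattributes;
using Version = Lucene.Net.Util.Version;

// Hypothetical custom analyzer: splits on whitespace only, then lowercases.
// Punctuation such as '#' is left inside the token.
public sealed class LowercaseWhitespaceAnalyzer : Analyzer
{
    public override TokenStream TokenStream(string fieldName, TextReader reader)
    {
        return new LowerCaseFilter(new WhitespaceTokenizer(reader));
    }
}

public static class AnalyzerComparison
{
    // Runs the text through the analyzer and collects the terms it emits.
    private static List<string> Tokens(Analyzer analyzer, string text)
    {
        var terms = new List<string>();
        using (TokenStream stream = analyzer.TokenStream("Title", new StringReader(text)))
        {
            ITermAttribute term = stream.AddAttribute<ITermAttribute>();
            while (stream.IncrementToken())
            {
                terms.Add(term.Term);
            }
            stream.End();
        }
        return terms;
    }

    public static void Main()
    {
        const string sample = "C# and C#.Net basics"; // illustrative course name

        var analyzers = new Dictionary<string, Analyzer>
        {
            { "StandardAnalyzer", new StandardAnalyzer(Version.LUCENE_30) },
            { "WhitespaceAnalyzer", new WhitespaceAnalyzer() },
            { "LowercaseWhitespaceAnalyzer", new LowercaseWhitespaceAnalyzer() }
        };

        // Print the terms that would actually be written to the index.
        foreach (KeyValuePair<string, Analyzer> entry in analyzers)
        {
            Console.WriteLine("{0}: [{1}]", entry.Key,
                string.Join("] [", Tokens(entry.Value, sample).ToArray()));
        }
    }
}
```

Note that {{WhitespaceAnalyzer}} does not lowercase, so matching becomes case-sensitive; the hypothetical custom analyzer above adds a {{LowerCaseFilter}} to avoid that. Whichever analyzer you pick, pass the same one to the query parser and re-index the existing documents, since terms already in the index were produced by the old analyzer.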

Keep in mind that the stored data isn't always the same as the analyzed data, and it is the analyzed data that is used during the search.
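
To see the difference between stored and analyzed data, here is a small sketch (again assuming Lucene.Net 3.0.3, with an in-memory index and a made-up class name {{StoredVersusIndexed}}) that indexes one title the same way as in the issue and then probes the index with exact terms, bypassing the query parser.

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Version = Lucene.Net.Util.Version;

public static class StoredVersusIndexed
{
    public static void Main()
    {
        var directory = new RAMDirectory(); // in-memory index, for illustration only
        var analyzer = new StandardAnalyzer(Version.LUCENE_30);

        // Index a single document; Store.YES keeps the original text,
        // Index.ANALYZED writes whatever terms the analyzer emits.
        using (var writer = new IndexWriter(directory, analyzer, IndexWriter.MaxFieldLength.UNLIMITED))
        {
            var doc = new Document();
            doc.Add(new Field("Title", "C#", Field.Store.YES, Field.Index.ANALYZED));
            writer.AddDocument(doc);
        }

        using (var searcher = new IndexSearcher(directory, true))
        {
            // TermQuery is not analyzed, so each probe must equal an indexed term exactly.
            foreach (string text in new[] { "C#", "c#", "c" })
            {
                TopDocs hits = searcher.Search(new TermQuery(new Term("Title", text)), 1);
                Console.WriteLine("term '{0}': {1} hit(s)", text, hits.TotalHits);

                foreach (ScoreDoc hit in hits.ScoreDocs)
                {
                    // The stored value is the original string, not the indexed term.
                    Console.WriteLine("  stored value: {0}", searcher.Doc(hit.Doc).Get("Title"));
                }
            }
        }
    }
}
```

If my suspicion is right, I would expect only one of the probes without the # to match, while the stored value still reads "C#". That would explain why the document looks correct when you read it back but is not found by the search.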

> Wildcard search with special characters "#" not working
> -------------------------------------------------------
>
>                 Key: LUCENENET-595
>                 URL: https://issues.apache.org/jira/browse/LUCENENET-595
>             Project: Lucene.Net
>          Issue Type: Bug
>          Components: Lucene.Net Core
>    Affects Versions: Lucene.Net 3.0.3
>            Reporter: Singaravelu
>            Priority: Blocker
>
> I'm using Lucene.Net version 3.0.3.0 in my website to search a list of courses.
> I have a few courses whose names contain the special character "#", such as C#, C#.Net, etc.
> But when I search for the term "C#", it shows 0 results.
> I'm using StandardAnalyzer and MultiFieldQueryParser, and I also allow wildcard search (AllowLeadingWildcard = true).
> Here is my code:
> var analyzer = new StandardAnalyzer(Version.LUCENE_30, stopWords);
> {
> BooleanQuery query = new BooleanQuery();
> var nameParser = new MultiFieldQueryParser(Version.LUCENE_30, new[] { "Column1", " Column2", " Column3" }, analyzer);
> if (!string.IsNullOrEmpty(searchCriteria.CourseName))
> {
> query.Add(parseQuery(GetTerms(searchCriteria.CourseName.ReplaceDiacritics()), nameParser), Occur.MUST);
> }
> ScoreDoc[] hits = searcher.Search(query, null, hits_limit, Sort.RELEVANCE).ScoreDocs;
> var results = _mapLuceneToDataList(hits, searcher);
> analyzer.Close();
> searcher.Dispose();
> return results;
> }
> For indexing:
> The word "C#" is indexed and stored correctly.
> doc.Add(new Field("Title", sampleData.CourseName, Field.Store.YES, Field.Index.ANALYZED));
> Kindly let me know what I have to do to retrieve results when I search for the term "C#".


