Posted to user@lucenenet.apache.org by Sergey Zaharov <se...@gmail.com> on 2017/06/16 07:16:04 UTC

Parse query with special characters

Hi all,

Could you please help me with the following issue: I have a problem parsing
strings that contain some special characters. An example is in the screenshot.
An error also occurs when I parse a string like "!!22", while the string "!22"
parses normally.
[image: Inline image 1]

So the question is: how can I parse ALL possible strings that come from a user?
Maybe I need another parser? Otherwise, should I strip certain
characters/combinations from the query string, and if so, where can I find that
list? Perhaps there are some utilities that could help normalize the query string.

Thank you in advance.

-- 
Best regards, Sergey.


RE: Parse query with special characters

Posted by Shad Storhaug <sh...@shadstorhaug.com>.
Sergey,

Exactly what do you expect the string "!!22" to do? According to the documentation (https://lucene.apache.org/core/4_8_0/queryparser/index.html), a single "!" is a logical NOT character, but a double "!!" is meaningless, so it throws an exception. 

I tested and in Java you also get an exception in this case:

Cannot parse 'a AND !!b': Encountered " <NOT> "! "" at line 1, column 7.
Was expecting one of:
    <BAREOPER> ...
    "(" ...
    "*" ...
    <QUOTED> ...
    <TERM> ...
    <PREFIXTERM> ...
    <WILDTERM> ...
    <REGEXPTERM> ...
    "[" ...
    "{" ...
    <NUMBER> ...
    <TERM> ...
    "*" ...
    
So, this is part of the design, not a bug. Of course, if you change the string to "a AND \!!b", it will work (and apply the NOT operator).
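
To treat the whole input literally instead, every special character can be backslash-escaped before parsing. The following is only a minimal, dependency-free sketch of that idea: the `QueryEscaper` class is a hypothetical name, and the character list is my reading of the special characters in the classic query parser syntax documentation. If I recall correctly, Lucene.NET ships an equivalent static method, `QueryParserBase.Escape(string)`, which would be the better choice in real code.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of Lucene.NET.
static class QueryEscaper
{
    // Characters assumed to be special in the classic query syntax:
    // \ + - ! ( ) : ^ [ ] " { } ~ * ? | & /
    private const string SpecialChars = "\\+-!():^[]\"{}~*?|&/";

    // Prefix every special character with a backslash so the parser
    // reads it as literal text rather than as an operator.
    public static string Escape(string input)
    {
        var sb = new StringBuilder(input.Length * 2);
        foreach (char c in input)
        {
            if (SpecialChars.IndexOf(c) >= 0)
            {
                sb.Append('\\');
            }
            sb.Append(c);
        }
        return sb.ToString();
    }

    static void Main()
    {
        Console.WriteLine(Escape("!!22"));   // prints \!\!22
        Console.WriteLine(Escape("*22/*"));  // prints \*22\/\*
    }
}
```

Note that escaping everything also turns off wildcards such as `*` and `?`. If users should still be able to use them, a variant of this sketch could leave those two characters out of the list.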

You basically have 3 options:

1. Catch the exception and use an alternate approach (possibly another query parser to give it a second pass).
2. Clean the input by escaping unwanted special characters as per the documentation: https://lucene.apache.org/core/4_8_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html. 
3. Use the SimpleQueryParser (https://lucene.apache.org/core/4_8_0/queryparser/org/apache/lucene/queryparser/simple/SimpleQueryParser.html) that does not make the "!" into a special character, and is specifically designed for user-entered input. Quote from documentation: "The main idea behind this parser is that a person should be able to type whatever they want to represent a query, and this parser will do its best to interpret what to search for no matter how poorly composed the request may be."
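
The options above could be sketched roughly as follows. This is an illustration under assumptions, not a tested implementation: it assumes Lucene.NET 4.8 with the Lucene.Net.QueryParser and Lucene.Net.Analysis.Common packages referenced, and the class and method names `SafeQueryParsing`, `ParseUserInput` and `ParseLeniently` are hypothetical. `ParseUserInput` combines options 1 and 2 (try the classic parser, then retry with the input escaped), and `ParseLeniently` shows option 3.

```csharp
using System;
using Lucene.Net.Analysis.Core;
using Lucene.Net.QueryParsers.Classic;
using Lucene.Net.QueryParsers.Simple;
using Lucene.Net.Search;
using Lucene.Net.Util;

// Hypothetical helper class for illustration only.
static class SafeQueryParsing
{
    // Options 1 and 2: parse with the classic syntax first; if that fails,
    // escape all special characters and parse the input as literal text.
    public static Query ParseUserInput(string field, string userInput)
    {
        // Blank input cannot be parsed even after escaping, so return
        // an empty BooleanQuery (which matches nothing) instead.
        if (string.IsNullOrWhiteSpace(userInput))
        {
            return new BooleanQuery();
        }

        var analyzer = new WhitespaceAnalyzer(LuceneVersion.LUCENE_48);
        var parser = new QueryParser(LuceneVersion.LUCENE_48, field, analyzer);
        try
        {
            return parser.Parse(userInput);
        }
        catch (ParseException)
        {
            // Second pass: operators such as "!" are now plain characters.
            return parser.Parse(QueryParserBase.Escape(userInput));
        }
    }

    // Option 3: SimpleQueryParser is designed not to throw on malformed input.
    public static Query ParseLeniently(string field, string userInput)
    {
        var analyzer = new WhitespaceAnalyzer(LuceneVersion.LUCENE_48);
        return new SimpleQueryParser(analyzer, field).Parse(userInput);
    }
}
```

With something like this, `SafeQueryParsing.ParseUserInput("postBody", "!!22")` should fall through to the escaped pass instead of surfacing the exception. Keep in mind that SimpleQueryParser has its own, smaller operator set, so queries written for the classic syntax may be interpreted differently.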


Thanks,
Shad Storhaug (NightOwl888)



-----Original Message-----
From: Sergey Zaharov [mailto:sergozaharov@gmail.com] 
Sent: Friday, June 23, 2017 7:08 PM
To: Prescott Nasser; user@lucenenet.apache.org
Subject: Re: Parse query with special characters

Hi all,

Are there any updates on this problem?

Best regards,
Sergey


Re: Parse query with special characters

Posted by Sergey Zaharov <se...@gmail.com>.
Hi all,

Are there any updates on this problem?

Best regards,
Sergey

2017-06-16 12:32 GMT+02:00 Sergey Zaharov <se...@gmail.com>:

> Hi,
>
> there is full code of console application

Re: Parse query with special characters

Posted by Sergey Zaharov <se...@gmail.com>.
Hi,

here is the full code of the console application:

using System;
using System.Collections.Generic;
using System.Linq;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers.Classic;
using Lucene.Net.QueryParsers.Flexible.Standard;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;

namespace Lucene4TestWSA
{
    class Program
    {
        private const string FIELD_BODY = "postBody";
        private const string FIELD_SECURITY = "Security";

        private static IndexWriter _writer;
        private static Directory _directory;
        private static WhitespaceAnalyzer _analyzer;
        private static IndexReader _indexReader;
        private static IndexSearcher _searcher;
        private static IndexWriterConfig _cfg;

        private static void AddNewItem(FullTextIndexItem item)
        {
            if (_writer == null) return;
            var doc = new Document();

            var objectText = (item.ObjectText ?? "");
            doc.Add(new TextField(FIELD_BODY, objectText, Field.Store.NO));

            var securCodes = (item.Access == null || item.Access.All(x => x.Key == 0))
                ? "?"
                : string.Join(" ", item.Access.Where(x => x.Key != 0).Select(x => x.Key.ToString() + x.Info).ToList());
            doc.Add(new TextField(FIELD_SECURITY, securCodes.ToLower(), Field.Store.YES));

            _writer.AddDocument(doc);
        }

        static void Main(string[] args)
        {

            var dir = @"c:\TestLuceneDir";
            if (System.IO.Directory.Exists(dir))
            {
                System.IO.Directory.Delete(dir, true);
            }

            var di = System.IO.Directory.CreateDirectory(dir);
            _directory = FSDirectory.Open(di);
            _analyzer = new WhitespaceAnalyzer(LuceneVersion.LUCENE_48);
            _cfg = new IndexWriterConfig(LuceneVersion.LUCENE_48, _analyzer);
            var writer = new IndexWriter(_directory, _cfg);
            writer.Commit();
            writer.Dispose();
            _cfg = null;

            _indexReader = DirectoryReader.Open(_directory);
            _searcher = new IndexSearcher(_indexReader);

            var analyzer = new WhitespaceAnalyzer(LuceneVersion.LUCENE_48);
            _cfg = new IndexWriterConfig(LuceneVersion.LUCENE_48, analyzer);
            _writer = new IndexWriter(_directory, _cfg);

            AddNewItem(new FullTextIndexItem
            {
                ObjectText = "111 !!222 333 qqq",

                Access = new List<FullTextIndexItemAccessInfo>()
                {
                    new FullTextIndexItemAccessInfo() { Key = 1037, Info = "PW???" },
                    new FullTextIndexItemAccessInfo() { Key = 1041, Info = "P????" }
                }
            });

            AddNewItem(new FullTextIndexItem
            {
                ObjectText = "aaa bbb ccc qqq",
                Access = new List<FullTextIndexItemAccessInfo>()
                {
                    new FullTextIndexItemAccessInfo() { Key = 1037, Info = "PW???" },
                    new FullTextIndexItemAccessInfo() { Key = 1042, Info = "PW??C" }
                }
            });

            _writer.Commit();
            _writer.Dispose();
            _writer = null;
            _cfg = null;
            _indexReader = DirectoryReader.Open(_directory);
            _searcher = new IndexSearcher(_indexReader);

            _analyzer = new WhitespaceAnalyzer(LuceneVersion.LUCENE_48);
            var boolQry = new BooleanQuery();

            var parser = new QueryParser(LuceneVersion.LUCENE_48, FIELD_BODY, _analyzer) { AllowLeadingWildcard = true };
            var textQry = parser.Parse("*22/*");
            boolQry.Add(textQry, Occur.MUST);
            var an = new WhitespaceAnalyzer(LuceneVersion.LUCENE_48);
            var localParser = new QueryParser(LuceneVersion.LUCENE_48, FIELD_SECURITY, an);

            var localQry = localParser.Parse("1037p????");

            boolQry.Add(localQry, Occur.MUST);

            var qryRes = _searcher.Search(boolQry, 1000);

            Console.WriteLine($"Result found {qryRes.TotalHits}");
            Console.ReadLine();
        }
    }

    public class FullTextIndexItemAccessInfo
    {
        public int Key { get; set; }
        public string Info { get; set; }
    }

    public class FullTextIndexItem
    {
        public string ObjectText { get; set; }
        public List<FullTextIndexItemAccessInfo> Access { get; set; }
    }
}

Hope that helps.

Many thanks,
Sergey

2017-06-16 9:39 GMT+02:00 Prescott Nasser <ge...@hotmail.com>:

> Adding Sergey who isn't subscribed to the mailing list..

RE: Parse query with special characters

Posted by Prescott Nasser <ge...@hotmail.com>.
Adding Sergey, who isn't subscribed to the mailing list.

------

Sergey,

Please provide the actual code, not a screenshot. Apparently, the mailing list server strips out images from the email, so it is impossible to help you without knowing what you are doing.

Thanks,
Shad Storhaug (NightOwl888)

-----Original Message-----
From: Sergey Zaharov [mailto:sergozaharov@gmail.com] 
Sent: Friday, June 16, 2017 2:16 PM
To: user@lucenenet.apache.org
Subject: Parse query with special characters

Hi all,

Could you please help me with the following issue: I have a problem parsing strings that contain some special characters. An example is in the screenshot. An error also occurs when I parse a string like "!!22", while the string "!22" parses normally.
[image: Inline image 1]

So the question is: how can I parse ALL possible strings that come from a user?
Maybe I need another parser? Otherwise, should I strip certain characters/combinations from the query string, and if so, where can I find that list?
Perhaps there are some utilities that could help normalize the query string.

Thank you in advance.

--
Best regards, Sergey.


RE: Parse query with special characters

Posted by Shad Storhaug <sh...@shadstorhaug.com>.
Sergey,

Please provide the actual code, not a screenshot. Apparently, the mailing list server strips out images from the email, so it is impossible to help you without knowing what you are doing.

Thanks,
Shad Storhaug (NightOwl888)
