Posted to user@lucenenet.apache.org by Jens Melgaard <Je...@Systematic.com> on 2017/12/21 10:16:03 UTC

Problems with Wildcard searches.

Hello

This is a bit of a shot in the dark, but while I try to see how I can investigate further, I thought I would see if we could be lucky enough to hit someone who has experienced an issue similar to the one we are facing right now.

First a little bit of background.
We use Lucene.Net 3.0.3 to index JSON documents. Each JSON field gets translated into a field name matching the path you would use to access it on the document, so { obj: { fieldName: "42kittens" } } would be translated into "obj.fieldName" = "42kittens" etc. Depending on the JSON data type, each field is indexed differently, but right now we can focus on "text fields", as that is where our issue is at the moment.

We use a StandardAnalyzer with an empty stop set, and the query parser is a slightly modified version of the MultiFieldQueryParser that allows "*" in range queries and has a dynamic field set depending on what has been indexed. (We automatically keep track of all possible fields in the system.)

We currently have about 500,000 documents in our index. Each document ranges from ~10 fields to thousands of fields (each field may be represented multiple times because of arrays), which results in an index of about 4 GB.

All in all everything seemed to work just fine, however yesterday we discovered that we had some issues using wildcards.

We have some documents which represent ports all over the world. These have what is called a locode, which is always 5 characters, e.g. DKAAR, VIFRD, ITPVT etc... The first 2 letters represent the country, so DKAAR is in Denmark, VI is the U.S. Virgin Islands, and IT is Italy. You can get more here: http://locode.info (it might not be an exhaustive list)...

Now if I search for "locode: MA*" I get:

- MA888
- MA6KN

However if I search for "locode: MAAGA" I get:

- MAAGA

But that should have been included in the search above it as MA* clearly should match MAAGA.

If I search for "locode: (MA* OR MAAGA)" I get:

- MA888
- MA6KN
- MAAGA

Now if I search for "locode: MAA*" I now get:

- MAAHU
- MAAZE
- MAANZ
- MAASI
- MAAGA

Which all should be part of the first result right?...

So I am thinking that there is something I am missing here...
Med venlig hilsen / Kind regards

Jens Melgaard
System Architect

Søren Frichs Vej 39, 8000 Aarhus C
Denmark

Mobile: +45 4196 5119
Jens.Melgaard@systematic.com
www.systematic.com

RE: Problems with Wildcard searches.

Posted by Jens Melgaard <Je...@Systematic.com>.
Cheers

That might be the right solution for us. For the time being we have adjusted the system to run under en-GB/en by setting it in the web.config file. (We can't set it to Invariant that way. en-GB is nearly the same AFAIK, and regardless, it appears to work with that culture as well.) It's running in virtual environments that only really run that solution (they also run Kudu for deployment, but that's behind the scenes, so it's not a big issue)...
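
For reference, a culture override of this kind in an ASP.NET web.config usually looks something like the fragment below. This is only a sketch: the thread says "en-gb/en" without showing the file, so the exact attribute values here are assumptions.

```xml
<configuration>
  <system.web>
    <!-- Assumed values: culture drives formatting and comparisons, uiCulture drives resource lookup. -->
    <globalization culture="en-GB" uiCulture="en" />
  </system.web>
</configuration>
```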

All in all, since we have an international audience, running under da-DK is odd anyway.

Again, thanks for the help... It is very much appreciated, we would never have solved it this quickly without it!... 


Med venlig hilsen / Kind regards

Jens Melgaard

-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: 22. december 2017 17:33
To: user@lucenenet.apache.org
Subject: RE: Problems with Wildcard searches.

Jens,

Setting CultureInfo.DefaultThreadCurrentCulture applies to all threads (which is probably not what you want especially if you are using ASP.NET).

There is a way that is less invasive. Since I can assume you are on .NET Framework (because that is all that Lucene.NET 3.0.3 supports):

System.Threading.Thread.CurrentThread.CurrentCulture = CultureInfo.InvariantCulture;

This only applies to the current thread. You can store the current culture in a variable before this operation and then restore it after the operation is complete. There is a standalone CultureContext class here (https://github.com/apache/lucenenet/blob/a3a12967b250e8e7e5f623f0ba7572ec64f479ac/src/Lucene.Net/Support/CultureContext.cs) that wraps this operation up so you can use a using block to ensure the culture is properly restored.

// Your application code...

using (var invariantContext = new CultureContext(CultureInfo.InvariantCulture))
{
    // Lucene.NET query...
    
    // Optional block to temporarily restore the original culture
    using (var originalContext = new CultureContext(invariantContext.OriginalCulture))
    {
        // Your application code...
    }
    
    // Lucene.NET query...
}

// Your application code...
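
If pulling in the linked class is not an option, the same save/switch/restore pattern can be sketched in a few lines. The class below is a hypothetical stand-in (the name CultureScope is an assumption, and this is not the actual CultureContext source); it only mirrors the OriginalCulture property used in the snippet above.

```csharp
using System;
using System.Globalization;
using System.Threading;

// Hypothetical stand-in for the linked CultureContext class (not its actual source).
// Switches the current thread's culture and restores the previous one on Dispose.
public sealed class CultureScope : IDisposable
{
    public CultureScope(CultureInfo culture)
    {
        if (culture == null) throw new ArgumentNullException(nameof(culture));
        OriginalCulture = Thread.CurrentThread.CurrentCulture; // remember what to restore
        Thread.CurrentThread.CurrentCulture = culture;         // affects only this thread
    }

    // The culture that was active before the scope was entered.
    public CultureInfo OriginalCulture { get; }

    public void Dispose()
    {
        // Assumes Dispose runs on the same thread that created the scope.
        Thread.CurrentThread.CurrentCulture = OriginalCulture;
    }
}
```

Used in a using block, the scope restores the original culture even if the query throws.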

Hope this helps.

Thanks,
Shad Storhaug (NightOwl888)


-----Original Message-----
From: Jens Melgaard [mailto:Jens.Melgaard@Systematic.com]
Sent: Friday, December 22, 2017 4:07 PM
To: user@lucenenet.apache.org
Subject: RE: Problems with Wildcard searches.

Hi Shad

Cheers for the input!...

With that input I think I am indeed able to reproduce the issue by forcing the culture to be DA-DK for an application like so:

CultureInfo.DefaultThreadCurrentCulture = CultureInfo.GetCultureInfo("DA-DK");

And then indexing: locode=

- MASFI
- MA888
- MA6KN
- MASUR
- MAANO
- MAAHR
- DKAAR
- DKKBH

Search: "locode: MA*"; 2 hits:
- MA888
- MA6KN

Search: "locode: MAA*"; 2 hits: 
- MAANO
- MAAHR

Etc... So that really seems to be the issue. With that knowledge I can rationalize why MA* does not yield the locodes that start with MAA, as AA in Danish is the old spelling of Å and would be ordered after Z.

I can't quite rationalize the missing MAS-- codes though, but that would just be for curiosity anyway.
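
The "AA" effect can be checked outside Lucene. The following is a minimal sketch, assuming a .NET runtime where da-DK culture data is available. It does not claim to show where Lucene.NET 3.0.3 makes a culture-sensitive call; it only shows how a culture-sensitive prefix test and an ordinal one can be compared side by side. The class and method names are made up for illustration, and the terms are lower-cased because StandardAnalyzer lower-cases what it indexes.

```csharp
using System;
using System.Globalization;

public static class PrefixCheck
{
    // Culture-sensitive prefix test: under da-DK, "aa" may be treated as a single letter (å).
    public static bool StartsWithCulture(string term, string prefix, string cultureName)
    {
        CultureInfo culture = CultureInfo.GetCultureInfo(cultureName);
        return term.StartsWith(prefix, false, culture);
    }

    // Ordinal prefix test: compares UTF-16 code units, independent of any culture.
    public static bool StartsWithOrdinal(string term, string prefix)
    {
        return term.StartsWith(prefix, StringComparison.Ordinal);
    }

    // Prints both results for a few of the locodes from the reproduction above.
    public static void Demo()
    {
        foreach (string term in new[] { "maaga", "ma888", "masfi" })
        {
            Console.WriteLine("{0}: da-DK={1}, ordinal={2}", term,
                StartsWithCulture(term, "ma", "da-DK"),
                StartsWithOrdinal(term, "ma"));
        }
    }
}
```

Calling PrefixCheck.Demo() under different runtimes is a quick way to see whether the two columns disagree for the terms in question.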

Anyway, besides changing the OS culture, setting CultureInfo.DefaultThreadCurrentCulture, or making a modified version of Lucene 3.0.3 where we explicitly set the culture in all places, are there any solutions that are less invasive?...

From your description and my own prior knowledge of Lucene.NET, my guess is no, but I wanted to make sure.

Anyways, thanks again!...

Med venlig hilsen / Kind regards

Jens Melgaard

-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com]
Sent: 21. december 2017 22:12
To: user@lucenenet.apache.org
Subject: RE: Problems with Wildcard searches.

Hi Jens,

This reminds me a little of some of the bugs I tracked down before in Lucene.NET 4.8.0.

One of the issues was due to the fact that the Java equivalents of SortedSet<string>/SortedDictionary<string, TValue> are culture-insensitive, so wherever a string was used as the key, the results came out sorted in a different order in .NET. So, all of the SortedSet<string> and SortedDictionary<string, TValue> instances were updated to use a StringComparer.Ordinal comparer to ensure the results are in the same order as in Java. Sometimes the result depends on the items being in the proper sequence, and if they are not, the results are cut short.
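
As a rough illustration of that change, the sketch below builds the same set twice: once with StringComparer.Ordinal and once with the default comparer, which follows the current thread's culture. The class and method names are hypothetical, and this is not code taken from Lucene.NET.

```csharp
using System;
using System.Collections.Generic;

public static class OrdinalOrderDemo
{
    // Returns the terms in ordinal (UTF-16 code unit) order, the same order
    // Java's String.compareTo produces, regardless of the thread's culture.
    public static List<string> SortOrdinal(IEnumerable<string> terms)
    {
        var set = new SortedSet<string>(terms, StringComparer.Ordinal);
        return new List<string>(set);
    }

    // Returns the terms in the order of the current thread's culture,
    // which is what SortedSet<string> uses when no comparer is supplied.
    public static List<string> SortCurrentCulture(IEnumerable<string> terms)
    {
        var set = new SortedSet<string>(terms);
        return new List<string>(set);
    }
}
```

Code that walks such a set and stops at the first non-matching entry is only correct if the set is ordered the way the code expects, which is why an explicit ordinal comparer matters.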

You might want to try doing the search in the invariant culture to see if you get better results. Might not be the issue, but it is a pretty quick theory to test.

Thanks,
Shad Storhaug (NightOwl888)

-----Original Message-----
From: Jens Melgaard [mailto:Jens.Melgaard@Systematic.com]
Sent: Friday, December 22, 2017 4:00 AM
To: user@lucenenet.apache.org
Subject: RE: Problems with Wildcard searches.

Hello Anders

At this point, I don't have any code to share (unfortunately)... That is because the current code is far too voluminous to share, so that would make no sense.
Currently my own investigation has me at a dead end. This is because when testing against an older database (with about 60% of the data) I haven't been able to reproduce the issue.

That obviously makes it a bit difficult to boil down the code to something that is meaningful to share at this time... So the next step will be to get a fresh backup of the data so I can try to reproduce it on that and then slowly trim down the code from there to a minimal example...

So for now, as I said, I was mainly hoping for a bit of luck in posting here, knowing that it was quite a shot in the dark until I have something more concrete. Thanks for your response so far anyway though.


Med venlig hilsen / Kind regards

Jens Melgaard

-----Original Message-----
From: Anders Lybecker [mailto:anders@lybecker.com]
Sent: 21. december 2017 12:40
To: user@lucenenet.apache.org
Subject: Re: Problems with Wildcard searches.

Hi Jens,

You are right. Something is wrong here.

Can you share some code? This seems odd.

Regards,
Anders Lybecker (a fellow Dane :-))

RE: Problems with Wildcard searches.

Posted by Shad Storhaug <sh...@shadstorhaug.com>.
Hi Jens,

This reminds me a little of some of the bugs I tracked down before in Lucene.NET 4.8.0.

One of the issues was due to the fact that the SortedSet<string>/SortedDictionary<string, TValue> in Java is culture-insensitive, so when they are using string as the key, the results were sorted in the wrong order in .NET. So, all of the SortedSet<string> and SortedDictionary<string, TValue> were updated to use a StringComparer.Ordinal comparer to ensure the results are in the same order as in Java. Sometimes the result is dependent upon the items being in the proper sequence, and if not, the results are cut short.

You might want to try doing the search in the invariant culture to see if you get better results. Might not be the issue, but it is a pretty quick theory to test.

Thanks,
Shad Storhaug (NightOwl888)

-----Original Message-----
From: Jens Melgaard [mailto:Jens.Melgaard@Systematic.com] 
Sent: Friday, December 22, 2017 4:00 AM
To: user@lucenenet.apache.org
Subject: RE: Problems with Wildcard searches.

Hello Anders

At this point, I don't have any code to share (Unfortunately)... That is because the current code is far too voluminous to share, so that would make no sense.
Currently my own investigation haven't has me ad a dead end, this is because when testing against an older database (with about 60% of the data) I haven't been able to reproduce the issue.

That obviously makes it a bit difficult to boil down the code to something that is meaningful to share at the time... So next step will be to get a fresh backup of the data so I can try to see if I can reproduce it on that and then slowly trim down the code from there to a minimal example...

So for now, as I said, I was mainly hoping for a bit of luck in posting here, knowing that it was quite a shoot in the blind until I have something more concrete. Thanks for your response so far anyways though.


Med venlig hilsen / Kind regards

Jens Melgaard
System Architect

Systematic A/S
Søren Frichs Vej 39
8000 Aarhus C
Denmark

Mobile: +45 4196 5119
Jens.Melgaard@systematic.com

Season's greetings from Systematic
-----Original Message-----
From: Anders Lybecker [mailto:anders@lybecker.com]
Sent: 21. december 2017 12:40
To: user@lucenenet.apache.org
Subject: Re: Problems with Wildcard searches.

Hi Jens,

You are right. Something is wrong here.

Can you share some code, as this seems odd.

Regards,
Anders Lybecker (a fellow dane :-))

On Thu, Dec 21, 2017 at 11:16 AM, Jens Melgaard < Jens.Melgaard@systematic.com> wrote:

> Hello
>
> This is a bit of a shoot in blind, but while I try to see how I can 
> investigate further, I thought that I would try to see if we could be 
> lucky to hit someone who had experienced a similar issue as we are 
> facing right now.
>
>
>
> First a little bit of back ground.
> We use Lucene.Net 3.0.3 to index json documents, each json field gets 
> translated into a fieldname as you would access that field on the 
> document, so { obj: { fieldName: “42kittens” } } would be translated 
> into “obj.fieldName” = “42kittens” etc. Depending on the datatype from 
> json, each field is indexed differently but right now we can focus on 
> “text fields” as that is where our issue is atm.
>
>
>
> We use a StandardAnalyzer with an empty stopset and the query parser 
> is a slightly modified version of the MultiFieldQueryParser allowing 
> for using “*” in range queries as well as having a dynamic fields set 
> depending on what has been indexed. (We automatically keep track of
> all possible fields in the system.)
>
>
>
> We currently have about 500,000 documents in our index; each document
> ranges from ~10 fields to thousands of fields (each field may be 
> represented multiple times because of arrays), this results in about a 
> 4GB index.
>
>
>
> All in all everything seemed to work just fine, however yesterday we 
> discovered that we had some issues using wildcards.
>
>
>
> We have some documents which represent ports all over the world;
> these have what is called a locode. A locode is always 5 characters,
> e.g. DKAAR, VIFRD, ITPVT etc. The first 2 letters represent the
> country, so DKAAR is in Denmark, VI is U.S. Virgin Islands, IT is Italy. You can get more here:
> http://locode.info (it might not be an exhaustive list).
>
> Now if I search for “locode: MA*” I get:
>
>
>
> -      MA888
>
> -      MA6KN
>
>
>
> However if I search for “locode: MAAGA” I get:
>
>
>
> -      MAAGA
>
>
>
> But that should have been included in the search above it as MA* 
> clearly should match MAAGA.
>
>
>
> If I search for “locode: (MA* OR MAAGA)” I get:
>
>
>
> -      MA888
>
> -      MA6KN
>
> -      MAAGA
>
>
> Now if I search for “locode: MAA*” I now get:
>
> -      MAAHU
>
> -      MAAZE
>
> -      MAANZ
>
> -      MAASI
>
> -      MAAGA
>
>
>
> Which all should be part of the first result right?...
>
>
>
> So I am thinking that there is something I am missing here…
>
> Med venlig hilsen / Kind regards
>
> Jens Melgaard
> System Architect
>
> Søren Frichs Vej 39
> 8000 Aarhus C
> Denmark
>
> Mobile: +45 4196 5119
> Jens.Melgaard@systematic.com
> www.systematic.com
>

Re: Problems with Wildcard searches.

Posted by Anders Lybecker <an...@lybecker.com>.
Hi Jens,

You are right. Something is wrong here.

Can you share some code? This seems odd.

Regards,
Anders Lybecker (a fellow Dane :-))

On Thu, Dec 21, 2017 at 11:16 AM, Jens Melgaard <Jens.Melgaard@systematic.com> wrote:

> Hello
>
> This is a bit of a shot in the dark, but while I try to see how I can
> investigate further, I thought that I would try to see if we could be lucky
> to hit someone who had experienced a similar issue as we are facing right
> now.
>
>
>
> First a little bit of background.
> We use Lucene.Net 3.0.3 to index json documents, each json field gets
> translated into a fieldname as you would access that field on the document,
> so { obj: { fieldName: “42kittens” } } would be translated into
> “obj.fieldName” = “42kittens” etc. Depending on the datatype from json,
> each field is indexed differently but right now we can focus on “text
> fields” as that is where our issue is atm.
>
>
>
> We use a StandardAnalyzer with an empty stopset and the query parser is a
> slightly modified version of the MultiFieldQueryParser allowing for using
> “*” in range queries as well as having a dynamic fields set depending on
> what has been indexed. (We automatically keep track of all possible fields
> in the system.)
>
>
>
> We currently have about 500,000 documents in our index; each document
> ranges from ~10 fields to thousands of fields (each field may be
> represented multiple times because of arrays), this results in about a 4GB
> index.
>
>
>
> All in all everything seemed to work just fine, however yesterday we
> discovered that we had some issues using wildcards.
>
>
>
> We have some documents which represent ports all over the world; these
> have what is called a locode. A locode is always 5 characters, e.g. DKAAR,
> VIFRD, ITPVT etc. The first 2 letters represent the country, so DKAAR is in
> Denmark, VI is U.S. Virgin Islands, IT is Italy. You can get more here:
> http://locode.info (it might not be an exhaustive list).
>
> Now if I search for “locode: MA*” I get:
>
>
>
> -      MA888
>
> -      MA6KN
>
>
>
> However if I search for “locode: MAAGA” I get:
>
>
>
> -      MAAGA
>
>
>
> But that should have been included in the search above it as MA* clearly
> should match MAAGA.
>
>
>
> If I search for “locode: (MA* OR MAAGA)” I get:
>
>
>
> -      MA888
>
> -      MA6KN
>
> -      MAAGA
>
>
> Now if I search for “locode: MAA*” I now get:
>
> -      MAAHU
>
> -      MAAZE
>
> -      MAANZ
>
> -      MAASI
>
> -      MAAGA
>
>
>
> Which all should be part of the first result right?...
>
>
>
> So I am thinking that there is something I am missing here…
>
> Med venlig hilsen / Kind regards
>
> Jens Melgaard
> System Architect
>
> Søren Frichs Vej 39
> 8000 Aarhus C
> Denmark
>
> Mobile: +45 4196 5119
> Jens.Melgaard@systematic.com
> www.systematic.com
>