Posted to user@lucenenet.apache.org by Shad Storhaug <sh...@shadstorhaug.com> on 2021/03/31 13:53:51 UTC

RE: Lusene.Net | .Net | Multi Language Support

Hi Hassan,

I will try to answer what I can:

1. AFAIK, no such list exists, but in general there is good support for the most popular written languages and less support for the others. You can get some idea from the 2-letter language abbreviation codes here: https://github.com/apache/lucenenet/tree/Lucene.Net_4_8_0_beta00014/src/Lucene.Net.Analysis.Common/Analysis
2. There is a StandardAnalyzer in the Lucene.Net.Analysis.Common package that is what most applications use. It is general-purpose for all languages, but there are other options that might work better if you are targeting a specific language or text. Or if nothing that is built-in meets your exact needs, it is possible to implement your own analyzer, tokenizer, and filters, and there is even a test framework to thoroughly test them to make sure they are implemented correctly. The Analysis documentation is fairly thorough on the subject: https://lucenenet.apache.org/docs/4.8.0-beta00014/api/core/Lucene.Net.Analysis.html.
3. Yes, weights can be added to fields both during the indexing stage and during a specific search so there is quite a bit of flexibility with optimizing your search.
4. Yes, there are specialized analyzers for many languages in the Lucene.Net.Analysis.Common package. There are also additional analysis packages specialized for Chinese, Japanese, Polish, and Ukrainian. If you require it, full Unicode 8.0 support is also available using the Lucene.Net.ICU package, which also has other generally useful features, such as normalization, segmentation of different scripts in the same text, and transliteration from one script to another.
5. The only options built in are file system storage (MMapDirectory, FSDirectory, FileSwitchDirectory, NIOFSDirectory) or in-memory storage (RAMDirectory or MemoryIndex), but Lucene.NET provides a lot of extensible APIs such as the Directory abstract class, which can be implemented to provide other storage mechanisms. There is one Azure storage directory that supports Lucene.NET 4.8.0-beta00011 here: https://github.com/tomlm/Lucene.Net.Store.Azure and should work for the latest version as well. There is also an AWS directory in Java that could probably be ported to .NET without too much trouble: https://github.com/albogdano/lucene-s3directory. Most cloud providers have an option for exposing their blob storage as a normal file system, which could be used with Lucene.NET directly without add-ons. However, I am not sure how that would perform compared to an actual file system.

Thanks,
Shad Storhaug (NightOwl888)
Project Chairperson - Apache Lucene.NET


-----Original Message-----
From: Hassan Iftikhar <Ha...@enghouse.com.INVALID> 
Sent: Wednesday, March 31, 2021 2:27 PM
To: user@lucenenet.apache.org
Subject: RE: Lusene.Net | .Net | Multi Language Support

Hi, 

Thanks for your feedback. It is great to see that Lucene.Net has an active developer community. Exploring more of Lucene.Net's features during our R&D has raised some new questions that I would like clarified:
1. Since Lucene.Net has multi-language support, is there a list of the languages it supports?
2. Can we use the same analyzer to index documents that contain text in multiple languages?
3. Is there a concept of weighted fields, i.e. if all fields are searchable, can we assign a weight/priority to some of them while searching?
4. Does Lucene.Net support Unicode characters?
5. Does Lucene.Net support any storage solution other than file storage?

Thanks in advance!

Regards,
Hassan Iftikhar
Software Engineer, R&D
m: +92 (0) 300 064 9845
w: www.enghouseinteractive.com
e: hassan.iftikhar@enghouse.com

As the world responds to the Covid-19 outbreak, Enghouse is committed to doing its part to support organisations' risk management efforts. 
We are providing temporary licences of our secure cloud-based communications platform at no cost to your organisation.



-----Original Message-----
From: Andy Pook <an...@gmail.com>
Sent: Thursday, March 25, 2021 4:01 PM
To: user@lucenenet.apache.org
Subject: Re: Lusene.Net | .Net | SQL Server | Databases

1. Add Change Tracking (or CDC) to your tables. Write a service to "listen" to the changes and apply those to your index.
2. If you have a "ChangedAt" timestamp column, "poll" the table where ChangedAt > last-seen-change. Foreach over those rows, update the index. How frequently you poll is part of how NRT the index will be.
3. If your system is based on event messaging, listen to "update" messages and update the index from those.
4. ...
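
As a rough illustration of option 2 above, here is a minimal Python sketch of the "poll where ChangedAt > last-seen-change" loop. It is not Lucene.NET code: an in-memory SQLite table stands in for the OLTP database and a plain dict stands in for the search index. The table name Products, its columns, the integer ChangedAt values and the helper poll_changes are all made up for the example.

```python
import sqlite3


def poll_changes(conn, index, last_seen):
    """Apply rows changed since `last_seen` to `index`; return the new high-water mark."""
    rows = conn.execute(
        "SELECT Id, Title, ChangedAt FROM Products "
        "WHERE ChangedAt > ? ORDER BY ChangedAt",
        (last_seen,),
    ).fetchall()
    for row_id, title, changed_at in rows:
        # Stand-in for "replace the indexed document that has this key".
        index[row_id] = {"id": row_id, "title": title}
        last_seen = max(last_seen, changed_at)
    return last_seen


# Hypothetical source table; ChangedAt is a simple integer clock here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Products (Id INTEGER PRIMARY KEY, Title TEXT, ChangedAt INTEGER)"
)
conn.executemany(
    "INSERT INTO Products VALUES (?, ?, ?)",
    [(1, "red kettle", 100), (2, "blue teapot", 105)],
)

index = {}
last_seen = poll_changes(conn, index, 0)  # first poll picks up both rows

# A change made directly in the database is picked up by the next poll,
# provided ChangedAt is bumped along with it (for example by a trigger).
conn.execute("UPDATE Products SET Title = 'green teapot', ChangedAt = 110 WHERE Id = 2")
last_seen = poll_changes(conn, index, last_seen)
```

Note that a plain ChangedAt poll like this never sees deleted rows, since they no longer appear in the SELECT. Handling deletes needs something extra, such as a soft-delete flag or the change-tracking approach from option 1.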

You can think of the index in the same way as a "data warehouse". It's just another data store with a different "schema" than your OLTP system (which is often an awesome normalized SQL thing). So this is "just" another ETL feature: Extract from one datastore, Transform it, Load it into the index.

off-topic...
Often the pattern is to include some primary key of the thing being indexed as a field. Then the system can "search" the index, get the id/key to query the original store to get the whole entity.
Several previous Lucene things I've had a hand in decided to also store the entity directly in the index (we serialized the object as json into a binary/blob field of the Document. You can use some other serializer/encoder with compression if you want to obsess over size :) ).
Obviously this makes the index files/folder much larger (depending on how big the entity is). But it does not have to be the entire entity, just the bits that are needed. Makes the index feel more like a document database.
The upside is that you don't have to do a two phase query to get at the entity. And that the index can be used in isolation without the origin data store.
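
To make the "store the entity in the index" idea concrete, here is a small Python sketch under stated assumptions. A plain dict stands in for a Lucene Document, and the field names ("id", "payload") and both helper functions are invented for the illustration. It serializes the entity as JSON, compresses it, and keeps it next to the searchable fields, so a search hit can be turned back into the entity without a second query.

```python
import json
import zlib


def make_document(entity, searchable_fields):
    """Build a document holding searchable fields, the key, and the serialized entity."""
    doc = {name: entity[name] for name in searchable_fields}
    doc["id"] = entity["id"]  # primary key, usable for a two-phase lookup
    # The whole entity as compressed JSON, as a stand-in for a binary/blob field.
    doc["payload"] = zlib.compress(json.dumps(entity).encode("utf-8"))
    return doc


def load_entity(doc):
    """Rebuild the entity straight from the document, with no trip to the origin store."""
    return json.loads(zlib.decompress(doc["payload"]).decode("utf-8"))


entity = {"id": 42, "title": "blue teapot", "price": 19.5, "tags": ["kitchen", "tea"]}
doc = make_document(entity, ["title"])
restored = load_entity(doc)
```

Storing only "id" and dropping "payload" gives the two-phase variant: smaller index, but every hit costs a lookup in the original store.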

It does mean that there is an "eventual consistency" aspect to consider.
But depending on the polling or ChangeTracking (or whatever ETL scheme you go with) and all the NRT stuff that Shad is talking about... you can get very close to "real time".
But do consider how near to realtime your system needs to be. I would suggest that some minutes to hours is more than good enough for most systems. Needing sub-second is rare (but possible).

Wow, all that just flowed out :) I hope some of it might be useful/relevant to your endeavours

On Thu, 25 Mar 2021 at 08:17, Hassan Iftikhar <Ha...@enghouse.com.invalid> wrote:

> Hi Shad,
>
> Lucene.Net is a general-purpose library that has nothing to do 
> with data sources like SQL Server, SQLite, etc. It only knows you have 
> a Lucene document that you want indexed. So when we load data into 
> Lucene.Net from a data source, how can we keep the Lucene.Net documents 
> up to date with the data in, for example, a SQL database? One way to 
> keep both stores (Lucene.Net and SQL) in sync is to continually 
> update the Lucene index during each database update. We also know that 
> someone could make manual changes to the SQL database. In that 
> scenario, how can we update the Lucene index?
>
> Thanks,
>
> Regards,
> Hassan Iftikhar
> Software Engineer, R&D
> m: +92 (0) 300 064 9845
> w: www.enghouseinteractive.com
> e: hassan.iftikhar@enghouse.com
>
>
> -----Original Message-----
> From: Shad Storhaug <sh...@shadstorhaug.com>
> Sent: Wednesday, March 24, 2021 1:21 AM
> To: user@lucenenet.apache.org
> Subject: RE: Lusene.Net | .Net | SQL Server | Databases
>
> Hello Hassan,
>
> While you might extend the Directory class to provide a "native"
> communication channel with another data storage medium other than the 
> file system, you will quickly find out how challenging such a task is, 
> and there will always be a high price to pay in terms of performance.
>
> Basically, there are a few different ways people deal with data from a 
> database but most of the time you have to accept that some data will 
> be duplicated between the database and Lucene, but the risk of that 
> duplication can be reduced/eliminated by applying automation. The 
> exact solution depends on how much data there is and how often it 
> needs to be refreshed.
>
> 1. Make indexing part of the deployment process, so the search index 
> is current as of the deployment date.
> 2. Design a custom job to update the index at specified intervals.
> 3. Use Lucene's near real-time search (NRT) feature to continually 
> update the Lucene index during each database update and keep a "live"
> view of the data in search.
>
> There might be other solutions, but these are generally the best 
> options for most scenarios.
>
> Thanks,
> Shad Storhaug (NightOwl888)
> Project Chairperson - Apache Lucene.NET
>
> -----Original Message-----
> From: Ron.Git <Ro...@GiftOasis.com>
> Sent: Tuesday, March 23, 2021 7:49 PM
> To: user@lucenenet.apache.org
> Subject: RE: Lusene.Net | .Net | SQL Server | Databases
>
> Lucene is a general purpose indexing and search library.  As such it 
> is not concerned with where the data comes from.  When that data comes 
> from a sql database, developers typically use their normal approach to 
> retrieve the data from the sql database and then use that data to 
> populate a Lucene document for indexing.  Lucene itself has no 
> knowledge of where the data came from.  It only knows you have a Lucene document that you want indexed.
>
> -Ron
>
>
>
> -----Original Message-----
> From: Hassan Iftikhar [mailto:Hassan.Iftikhar@enghouse.com.INVALID]
> Sent: Tuesday, March 23, 2021 5:07 AM
> To: user@lucenenet.apache.org
> Subject: RE: Lusene.Net | .Net | SQL Server | Databases
>
> Hi,
>
> Hope you guys are doing well. Thanks for the feedback for my last email.
> Now I have another query regarding data sources.
>
> I want to know how Lucene.Net communicates with data sources, e.g. with 
> an existing SQL Server database. As I mentioned earlier, we will be using 
> Lucene.Net in existing .Net server/client applications, so I am 
> interested to know how we can take advantage of Lucene.Net with an 
> existing SQL Server/SQLite database.
>
> Thanks in advance!
>
> Regards,
> Hassan Iftikhar
> Software Engineer, R&D
> m: +92 (0) 300 064 9845
> w: www.enghouseinteractive.com
> e: hassan.iftikhar@enghouse.com
>
>
> -----Original Message-----
> From: RonClabo@GiftOasis.com <Ro...@GiftOasis.com>
> Sent: Wednesday, March 17, 2021 7:29 PM
> To: user@lucenenet.apache.org
> Subject: RE: Lusene.Net | .Net | .Net Core | .Net5 | .Net framework
>
> Hi Hassan,
>
>
>
> Thanks for your interest in Lucene.Net.  I am a recent contributor to 
> the Lucene.Net 4.8 project and will do my best to answer your questions.
>
>
>
> 1)    Lucene.Net 3.0.3 is from some time ago so it makes sense that it
> doesn't specifically target a newer version of the .Net 4.x framework.
> However, if it's documented to support .Net Framework 4.0 it's very 
> likely it will work on 4.6.1 since .Net Frameworks are generally 
> backward compatible, especially within a major version.
>
> 2)    Lucene.Net 4.8, which is currently at release Beta 13, supports both
> the .Net Full Framework 4.5 and NetStandard2.0 and NetStandard2.1.
> As such it is compatible with .Net Core 2.0 or higher.
>
> 3)    Each Lucene.Net version attempts to be a faithful port of the
> functionality in the corresponding Java Lucene version.  This is 
> largely accomplished by porting the project from Java to C# on a line by line basis.
> However, occasionally a few features and bug fixes are pulled in from 
> later versions.  In general anything you read online about Java Lucene
> 3.0.3 will be accurate for Lucene.Net 3.0.3 and anything you read 
> online about Java Lucene 4.8 will be accurate about Lucene.Net 4.8.
> Also note that while Lucene.Net 4.8's version number may seem 
> far behind the current Java version, which is 8.8, the 
> reality is that the version 4.8 contains the _vast_ majority of 
> features found in 8.8 because the big change (and multi-year effort) 
> for Lucene came in version 4.0 when codecs were introduced.  Since 
> then the features added per release have been much more modest and the 
> releases much more frequent, hence the rapid escalation in version number.
>
> 4)    Lucene.Net 4.8 can be used on the latest Full Framework and on the latest
> .Net Core Framework.  More specifically, yes - Lucene.Net 4.8 is fully 
> compatible with .Net5.  I personally use it with .Net5.  Lucene.Net
> 4.8 is in beta but I believe some people do use it in production as it 
> has been extremely stable for a very long time and has a large number 
> of unit tests ported from Java Lucene which must pass before 
> commits are added to the project.  Some in the Lucene.Net developer 
> community have even stated that they think 4.8 is already more solid 
> than 3.0.3 given that the earlier version did not have the extensive 
> unit tests to ensure accuracy of the port.  Lucene.Net 4.8 is actively 
> being worked on with a goal of getting it to final release.  If you 
> have any time to donate we'd welcome your help in polishing this 
> version of Lucene.Net.
>
>
>
> Best,
>
>
>
> Ron Clabo
>
> rclabo on Github.
>
>
>
> From: Hassan Iftikhar [mailto:Hassan.Iftikhar@enghouse.com.INVALID]
> Sent: Wednesday, March 17, 2021 7:58 AM
> To: user@lucenenet.apache.org
> Subject: Lusene.Net | .Net | .Net Core | .Net5 | .Net framework
>
>
>
> Hi,
>
>
>
> I am new to Lucene.Net and am currently exploring it for use in our 
> products. I have some questions:
>
>
>
> 1.      Can we use Lucene.Net with .Net Framework 4.6.1? The stable
> version of Lucene.Net is 3.0.3, and from what I can see on your website 
> it supports up to .Net Framework 4.0.
>
> 2.      Can we use Lucene.Net with .Net Core? There is no
> information on your website about Lucene.Net support for .Net 
> Core.
>
> 3.      Does Lucene.Net provide the same set of features as
> Lucene for Java?
>
> 4.      Can we use Lucene.Net with .Net5, i.e. the latest .Net framework?
>
>
>
> Regards,
>
> Hassan Iftikhar
>
> Software Engineer, R&D
>
> m: +92 (0) 300 064 9845
>
> w: www.enghouseinteractive.com
>
> e: hassan.iftikhar@enghouse.com
>