You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucenenet.apache.org by GitBox <gi...@apache.org> on 2022/11/17 18:31:07 UTC

[GitHub] [lucenenet] tonyguo2022 opened a new issue, #763: Support async task

tonyguo2022 opened a new issue, #763:
URL: https://github.com/apache/lucenenet/issues/763

   We'd like to build Lucene Index over cloud storage. Most cloud storage operations are asynchronized. I am wondering if Lucene.NET library can add a group of API to Directory and IndexInput/IndexOutput classes to support async operations. e.g.:
   
   existing API in class DataInput:
   public virtual void ReadBytes(byte[] b, int offset, int len, bool useBuffer)
   
   New API:
   public virtual async Task ReadBytesAsync(byte[] b, int offset, int len, bool useBuffer, CancellationToken cancellationToken)
   
   There could be a default implementation to call the sync method, but people can also override it to call async cloud API to read from remote. 
   
   Thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@lucenenet.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [lucenenet] tonyguo2022 commented on issue #763: Support async task

Posted by GitBox <gi...@apache.org>.
tonyguo2022 commented on issue #763:
URL: https://github.com/apache/lucenenet/issues/763#issuecomment-1322516254

   @eladmarg Love these options. Thanks. We also planned to change current code as less as possible, similar to option A. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@lucenenet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [lucenenet] NightOwl888 closed issue #763: Support async task

Posted by "NightOwl888 (via GitHub)" <gi...@apache.org>.
NightOwl888 closed issue #763: Support async task
URL: https://github.com/apache/lucenenet/issues/763


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@lucenenet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [lucenenet] tonyguo2022 commented on issue #763: Support async task

Posted by GitBox <gi...@apache.org>.
tonyguo2022 commented on issue #763:
URL: https://github.com/apache/lucenenet/issues/763#issuecomment-1319371313

   After Merge() operation, we immediately collect the output files and store them to cloud storage.  Seems the Lucene Index is valid already and we can run query against it successfully. So, there should be no required post write operations.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@lucenenet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [lucenenet] jeme commented on issue #763: Support async task

Posted by GitBox <gi...@apache.org>.
jeme commented on issue #763:
URL: https://github.com/apache/lucenenet/issues/763#issuecomment-1323214681

   @NightOwl888 Sure I can try to submit a PR for that, I can't build the source as it stands right now due to missing .NET 4.6.2, AFAIK 4.6.2 is an implementation of .NET Standard 2.0, is there a specific reason that the project targets both?
   
   (https://learn.microsoft.com/en-us/dotnet/standard/net-standard?tabs=net-standard-2-0)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@lucenenet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [lucenenet] rclabo commented on issue #763: Support async task

Posted by GitBox <gi...@apache.org>.
rclabo commented on issue #763:
URL: https://github.com/apache/lucenenet/issues/763#issuecomment-1319313064

   > Async wait could free the calling thread to do other tasks instead being waiting/blocked. Please check reference: https://learn.microsoft.com/en-us/dotnet/csharp/programming-guide/concepts/async/task-asynchronous-programming-model
   
   @tonyguo2022  sure, and while that is technically true, let's not forget that we are talking about a background thread here.  So unless for your use case you expect to run out of threads or hit thread starvation, why bother?  And to be clear, there is no code path from `IndexWriter.Commit` that writes to the disk, so adding a method like  `IndexWriter.CommitAsync` isn't gonna help.  The writing to storage is not done by the thread that calls  `IndexWriter.Commit`.  The writing to storage is done asynchronously by a background thread that is signaled by the fact that  `IndexWriter.Commit` was called on another thread earlier in time.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@lucenenet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [lucenenet] Jeevananthan-23 commented on issue #763: Support async task

Posted by GitBox <gi...@apache.org>.
Jeevananthan-23 commented on issue #763:
URL: https://github.com/apache/lucenenet/issues/763#issuecomment-1368691729

   > For the record, there are already existing Lucene directory implementations for various cloud platforms and products. Some would need to be ported to .NET, others might already exist on .NET (some of which don't yet support 4.8.0).
   > 
   > [#631 (comment)](https://github.com/apache/lucenenet/issues/631#issuecomment-1086764962) https://github.com/azure-contrib/AzureDirectory https://github.com/tomlm/Lucene.Net.Store.Azure https://stackoverflow.com/a/59381272 https://github.com/albogdano/lucene-s3directory https://docs.jboss.org/author/display/ISPN50/Infinispan%20as%20a%20Directory%20for%20Lucene.html https://cwiki.apache.org/confluence/display/lucene/AvailableLockFactories
   > 
   > Lucene is highly optimized to use a local file system. Swapping the directory implementation is one way to extend Lucene, but it will come at a pretty significant performance penalty to write to a blob storage provider (as you can see by the notes on each implementation). There are existing cloud products that emulate a local file system that may or may not do the job better of moving your index off of the server it runs on.
   > 
   > However, rolling your own `Directory` implementation is a major job that will take a significant amount of effort to perform well and be stable enough to use.
   > 
   > Note that there is a [Lucene.Net.Replicator](https://lucenenet.apache.org/docs/4.8.0-beta00016/api/replicator/Lucene.Net.Replicator.html) module that is designed to synchronize an index across multiple servers by hosting a listener on a server and having each client request a copy periodically. I haven't tried, but I suspect there is a way to utilize it to safely copy the index to cloud storage at periodic intervals. I wouldn't recommend copying the index outside of a locking context, though, because Lucene.NET may lock the files at unpredictable times and it may not be practical to copy them without getting copy errors.
   
   So insightful conversion throughout this topic, I should also add my points here as @NightOwl888 mentioned it's great to implement **[nrt-replication](https://github.com/Yelp/nrtsearch#near-real-time-replication)**. **Yelp Engineers implemented this kind of project for their **NRT-Search engine** to fulfill the need for Large cluster management. But NRT-Replication seems a good alternative to document-based replication when it comes to costs associated with maintaining large clusters. Scaling document-based clusters up/down promptly could be slower due to data migration between nodes apart from paying the cost for reindexing on all nodes.**
   
   Hope it will help,
   Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@lucenenet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [lucenenet] NightOwl888 commented on issue #763: Support async task

Posted by GitBox <gi...@apache.org>.
NightOwl888 commented on issue #763:
URL: https://github.com/apache/lucenenet/issues/763#issuecomment-1323652154

   > @jeme
   > 
   > Looks like `BinaryReaderDataInput` was used only in the [constructor of `OfflineSorter.ByteSequencesReader`](https://github.com/apache/lucene/blob/releases/lucene-solr/4.8.0/lucene/core/src/java/org/apache/lucene/util/OfflineSorter.java#L501-L503) to replace the [`DataInputStream`](https://docs.oracle.com/javase/8/docs/api/java/io/DataInputStream.html) for .NET. But I see why you went down this rabbit hole. The constructor wraps `FileStream` in a `BinaryReader` when `BinaryReaderDataInput` actually could be swapped for `InputStreamDataInput` and use the `FileStream` directly. We can then delete the `BinaryReaderDataInput` from the codebase because it is the only place it is used.
   > 
   > Same goes for `BinaryWriterDataOutput`.
   > 
   
   
   Okay, I see where we went wrong for `ByteSequencesReader` and `ByteSequencesWriter`. Their constructors accept a reference to `java.io.DataInput` and `java.io.DataOutput` interfaces. During the port, we transposed these with Lucene's `DataInput` and `DataOutput` abstract classes. So, we are plugging this together wrong.
   
   Although we have ported over the `DataInputStream` and `DataOutputStream`, according to the J2N docs, it is better to use the `BinaryReader` and `BinaryWriter` if the modified UTF-8 format used in Java is not required. And indeed, we don't use that format here. `BinaryReader` and `BinaryWriter` don't have any interfaces, but both of them are not sealed and have virtual members (except for a few odd ones). So, I am not seeing too much of an issue with changing the constructors to accept `BinaryReader` and `BinaryWriter` instead of `DataInput` and `DataOutput`.
   
   
   > `OfflineSorter` also has another problem. It uses [FileSupport.CreateTempFile()](https://github.com/apache/lucenenet/blob/4e49612d6867194440694b77db95fd0ed756c9a9/src/Lucene.Net/Util/OfflineSorter.cs#L305) to get a `FileInfo` object. Then attempts to open the file afterward in `MergePartitions`. There is a non-zero chance on Windows that the operating system hasn't closed the handle on the file and we get a "file in use by another process" error.
   > 
   > I was considering using some low-level APIs to create a file handle and wrap it in a subclass of [System.IO.FileSystemInfo](https://learn.microsoft.com/en-us/dotnet/api/system.io.filesysteminfo?view=net-7.0) (which `FileInfo` subclasses) and returning that from `FileSupport.CreateTempFile()`. But now that I look again, maybe we should just change `FileSupport.CreateTempFile()` to allow passing in the flags for creating and returning the `FileStream` instance that is used to create the file. It looks like every place that is using it immediately opens a stream afterwards, anyway. `OfflineSorter` could be changed to accept streams instead of `FileInfo` objects in its `Sort` method.
   > 
   > Needs some more analysis. Looks like some places actually need the name of the file, which `FileStream` doesn't provide access to.
   
   Looks like I was wrong - `FileStream` does have a `Name` property, so we have access to the underlying filename. There are a couple of callers that seem to fit better with using the `FileInfo` and they are both in the test framework.  All of the other callers use it in conjunction with `OfflineSorter`. So, creating overloads named `CreateTempFileAsStream()` seem to be the way to go here. The existing overloads can easily call `CreateTempFileAsStream()` and return a `FileInfo` so we can reuse logic.
   
   It gets a bit mangled in `OfflineSorter` because it is set up to delete its files, but we can specify in the docs when calling `Sort()` that it should be passed a `FileStream` with the [FileOptions.DeleteOnClose](https://learn.microsoft.com/en-us/dotnet/api/system.io.fileoptions?view=net-7.0#fields) option set or else delete it explicitly. It will have to accept `FileStream` rather than `Stream`, but since it was only designed to work with files, I don't see that as an issue.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@lucenenet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [lucenenet] rclabo commented on issue #763: Support async task

Posted by GitBox <gi...@apache.org>.
rclabo commented on issue #763:
URL: https://github.com/apache/lucenenet/issues/763#issuecomment-1319345574

   @tonyguo2022 And that's great.  But that doesn't prove that the write to storage is done via that thread. And it's not.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@lucenenet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [lucenenet] NightOwl888 commented on issue #763: Support async task

Posted by GitBox <gi...@apache.org>.
NightOwl888 commented on issue #763:
URL: https://github.com/apache/lucenenet/issues/763#issuecomment-1319069514

   This is an exact duplicate of #485.
   
   While we do have a little better handle on locking contention since then, there are still known concurrency issues to track down in `Lucene.Net.Facet` and in the dictionary-based `BreakIterator` in [ICU4N](https://github.com/NightOwl888/ICU4N) (See #269).
   
   However, our goal of upgrading to a newer version of Lucene when the port is completed has not changed, and not only is this not part of that planned work but it is also something that could potentially slow down the upgrade process. Given the choice between async/await support and the latest Lucene features, I suspect most people would rather have the latter.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@lucenenet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [lucenenet] NightOwl888 commented on issue #763: Support async task

Posted by GitBox <gi...@apache.org>.
NightOwl888 commented on issue #763:
URL: https://github.com/apache/lucenenet/issues/763#issuecomment-1320806602

   I was just closing some browser tabs, and came across this topic that goes into detail about how to call synchronous methods asynchronously: https://learn.microsoft.com/en-us/dotnet/standard/asynchronous-programming-patterns/calling-synchronous-methods-asynchronously. 
   
   Just sayin', it seems like a simpler fix than punching async methods all the way through and expecting them to "just work". Lucene throws exceptions in background threads that are re-thrown on the calling thread and are expected to be a specific exception type to get caught in the right handler (which will fire a re-try). Any async code that is added would need to respect (and duplicate) that behavior.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@lucenenet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [lucenenet] tonyguo2022 commented on issue #763: Support async task

Posted by GitBox <gi...@apache.org>.
tonyguo2022 commented on issue #763:
URL: https://github.com/apache/lucenenet/issues/763#issuecomment-1319212411

   @rclabo  Async wait could free the calling thread to do other tasks instead being waiting/blocked.
   Please check reference: https://learn.microsoft.com/en-us/dotnet/csharp/programming-guide/concepts/async/task-asynchronous-programming-model


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@lucenenet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [lucenenet] NightOwl888 commented on issue #763: Support async task

Posted by GitBox <gi...@apache.org>.
NightOwl888 commented on issue #763:
URL: https://github.com/apache/lucenenet/issues/763#issuecomment-1319142053

   Well, being that we just started our 9th year porting Lucene 4.8.0, this isn't a priority.
   
   I still stand by what was said last year. The Lucene team is the place to submit this request. Let them work out at a high level how to make async/await work with the design. Then when we port the next version it will automatically be part of our TODO list, rather than something that is going to sidetrack us.
   
   It might help to be very specific about which APIs need to be async. If you can narrow the focus it will be more likely something the Lucene team would consider doing.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@lucenenet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [lucenenet] NightOwl888 commented on issue #763: Support async task

Posted by GitBox <gi...@apache.org>.
NightOwl888 commented on issue #763:
URL: https://github.com/apache/lucenenet/issues/763#issuecomment-1322839488

   > > I really don't see how you can do it without breaking all the components.
   > > even with partial classes, the async operations should seep from the ground basic operations (Lucene.Net.Store.DataOutput) all the way up to the public api.
   > 
   > As I said, making it all the way up to `IndexWriter.UpdateDocumentAsync(...)` I am having a hard time believing can be done without running into trouble as well, however for replication purposes, working with Lucene on the lover levels (DataInputs/DataOutputs) does actually not look to hard, In an OPTION-B scenario it might even be easier that I initially thought as it's actually only the DataInput/DataOutput that needs additions in partial classes for now to lift the async operations into the replication code.
   > 
   > Here I did run into some classes which raises a question from a efficiency perspective. E.g. InputStreamDataInput that takes a Stream in it's constructor and immediately wraps that stream in a BinaryReader but I can't see for what reason. If the concern is that the underlying stream is not buffered, then perhaps a BufferedStream would be more appropriate. The Java version takes a InputStream in the constructor and just uses that as is, so I am not sure why this differs. (At least for the Java version I currently have checked out)
   > 
   > For OPTION-A which I would point to for now (due to priorities) the Directory classes has to be mirrored as well to return these new implementations.
   
   @jeme - Looks like `OutputStreamDataOutput` has the same issue.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@lucenenet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [lucenenet] eladmarg commented on issue #763: Support async task

Posted by GitBox <gi...@apache.org>.
eladmarg commented on issue #763:
URL: https://github.com/apache/lucenenet/issues/763#issuecomment-1319683616

   @tonyguo2022 this is really appreciated, but at this limits we cannot go behind the library frame (in the repo)
   because the main goal of this project was port from the java version.
   
   async / await is only one piece in a big  puzzle.
   one option i can see is to optimize the port in another repository, but i doubt how many people will contribute, especially when this contribution for this port is very limited. (yes we're totally lucky we have @NightOwl888 who's willing to give his efforts and time.
   
   the downside that it won't be compatible to future upgrades.
   also, as we all know, lucene is very complex project. many of the design had big thoughts and the talented developers who want to get into it are very limited. 
   
   let's hope in the future the library will be upgraded to match the java version and it will perform better.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@lucenenet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [lucenenet] rclabo commented on issue #763: Support async task

Posted by GitBox <gi...@apache.org>.
rclabo commented on issue #763:
URL: https://github.com/apache/lucenenet/issues/763#issuecomment-1319152210

   I totally agree with @NightOwl888 .  Also, I'm note even sure anything would be gained by adding `async` file system support.  The disk segments are written by a background thread.  So any thread that is writing docs to the index isn't waiting on disk IO anyway. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@lucenenet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [lucenenet] NightOwl888 commented on issue #763: Support async task

Posted by GitBox <gi...@apache.org>.
NightOwl888 commented on issue #763:
URL: https://github.com/apache/lucenenet/issues/763#issuecomment-1323374601

   @jeme 
   
   Looks like `BinaryReaderDataInput` was used only in the [constructor of `OfflineSorter.ByteSequencesReader`](https://github.com/apache/lucene/blob/releases/lucene-solr/4.8.0/lucene/core/src/java/org/apache/lucene/util/OfflineSorter.java#L501-L503) to replace the [`DataInputStream`](https://docs.oracle.com/javase/8/docs/api/java/io/DataInputStream.html) for .NET. But I see why you went down this rabbit hole. The constructor wraps it in a `BinaryReader` when it actually could be swapped for `InputStreamDataInput`. We can then delete the `BinaryReaderDataInput` from the codebase because it is the only place it is used.
   
   Same goes for `BinaryWriterDataOutput`.
   
   `OfflineSorter` also has another problem. It uses [FileSupport.CreateTempFile()](https://github.com/apache/lucenenet/blob/4e49612d6867194440694b77db95fd0ed756c9a9/src/Lucene.Net/Util/OfflineSorter.cs#L305) to get a `FileInfo` object. Then attempts to open the file afterward in `MergePartitions`. There is a non-zero chance on Windows that the operating system hasn't closed the handle on the file and we get a "file in use by another process" error.
   
   I was considering using some low-level APIs to create a file handle and wrap it in a subclass of [System.IO.FileSystemInfo](https://learn.microsoft.com/en-us/dotnet/api/system.io.filesysteminfo?view=net-7.0) (which `FileInfo` subclasses) and returning that from `FileSupport.CreateTempFile()`. But now that I look again, maybe we should just change `FileSupport.CreateTempFile()` to allow passing in the flags for creating and returning the `FileStream` instance that is used to create the file. It looks like every place that is using it immediately opens a stream afterwards, anyway. `OfflineSorter` could be changed to accept streams instead of `FileInfo` objects in its `Sort` method.
   
   Needs some more analysis. Looks like some places actually need the name of the file, which `FileStream` doesn't provide access to.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@lucenenet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [lucenenet] NightOwl888 commented on issue #763: Support async task

Posted by GitBox <gi...@apache.org>.
NightOwl888 commented on issue #763:
URL: https://github.com/apache/lucenenet/issues/763#issuecomment-1322793010

   > Furthermore, now that I looked a bit around in the code which would be involved in the above, it might actually not be completely unfeasibly to introduce async into lucene at this layer for such scenarios as a "Addin" to Lucene.NET, that means we are talking coud outside of lucene itself that extends the appropriate classes (e.g. SimpleFSDirectory, NIOFSDirectory, MMapDirectory etc.) or making "Mirrors" of them and then providing the asynchronous overloads on those and their associated Input/Output classes. And then you can use those during replication allowing you to use async patterns when copying the files from the disk etc. From there it will require a bit of Type checking/casting, but other than that it should be doable... Lets mark that as **OPTION-A**.
   > 
   > It COULD possibly also be introduced into Lucene it self by marking the involved classes here as partial classes and then implement the Async behaviors there, that might allow us to introduce it without breaking with the principle of sticking close to the Java side as we could then add it into Lucene.Net specific async files that can clearly be marked as "Lucene.Net Specific" - that should in theory mean that all we have to do in future ports is again to open up these ported classes by marking them as partial. Ofc. Breaking changes can occur in the stack that means we have to do major rework on the Async parts, but it see it as more manageable. Lets mark that as **OPTION-B**.
   > 
   > While **OPTION-B** could be interesting to investigate as it might have to opportunity to introduce async/await into Lucene.NET gradually if anyone wishes to give that a shot, to begin with the use will be very limited and probably specifically only to scenarios outlined here, it would not affect the core itself which would still use the blocking API's and I can't say how far up the stack this could be done without hitting a wall where we can't go further without actually compromising on the principle of having a port as close to the java source as possible to make furture ports easier. I REALLY doubt that we could get all the way out to making a `IndexWriter.UpdateDocumentAsync(...)` without hitting that problem, but I can't say for sure.
   > 
   > So I am not sure **OPTION-B** is worth the effort considering that **OPTION-A** is probably feasible and it's only downside is the requirement of casts/type checks. And even IF we would find that worth it, I don't think anyone in the team sees it as a priority.
   > 
   > But feel free to share what you guys think.
   
   ## Partial Classes
   
   I am not sure I totally understand the difference between **OPTION-A** and **OPTION-B**. Partial classes are a compiler feature that makes it easier to organize code, but at the end of the day they get compiled into 1 class. Both partial classes have access to all of the private state of the class.
   
   And if we add code to the codebase that is not part of Lucene, it should **all** be in partial classes and put into the Support folder. For example, to extend `IndexWriter` with async code, the original `IndexWriter` class would be marked as a partial class. Then we would add a partial class named `IndexWriter` below the Support folder to extend it with non-Java ported code.
   
   ### File Location Lucene.Net/Index/IndexWriter.cs
   
   ```c#
   namespace Lucene.Net.Index
   {
        public partial class IndexWriter : IDisposable, ITwoPhaseCommit
        {
            // Implementation...
        }
   }
   ```
   
   ### File Location Lucene.Net/Support/Index/IndexWriter.cs
   
   ```c#
   namespace Lucene.Net.Index
   {
        public partial class IndexWriter // Leave off all of the interfaces/base classes
        {
            // Code that was not part of Lucene 4.8.0 (or our best interpretation of it)...
        }
   }
   ```
   
   ## Exceptions
   
   But, that is the easy part. The problem is that everyone involved in this conversation is thinking like a .NET developer: we only throw exceptions when something exceptional happens. But in the Java world, exceptions are frequently used as a signaling mechanism. Indeed, Lucene does this also. We don't have a detailed map of what every exception means or when an exception is thrown in one piece of code, who (if anybody) will catch it and attempt to handle it. It is easy to look at the primary path and lose sight of all of the exceptions that are flying over our head that are being handled in specialized ways.
   
   The `ThreadJob` class handles re-throwing exceptions that are thrown in a background thread. I never did figure out how that works in Java, but this is the behavior that we are seeing in the Lucene tests and clearly what is expected by code.
   
   So, if we complete the primary path through to make something async, we have only started the job. We need to deal with the exceptions that are thrown. Our parallel tasks run, some of them fail, and we end up with an `AggregateException` type. What do we do with it?
   
   In every case, the answer will probably be something a bit different. Lucene often expects the exception to fly up the stack to a handler to deal with it. Or, depending on the exception type, it may just let it fly up to the original caller, or perhaps swallow and ignore it. To make things more complicated, there may be a different handler in a different part of the call stack expecting it depending on which type of exception it is. And many parts are pluggable, so the set of handlers and how they are expected to deal with the exception may be different in each pluggable piece. Certain questions need to be answered for every case:
   
   1. Which upstream handlers are expected to catch the exception from this piece of code?
   2. Are there potential exceptions that are thrown from a downstream piece of code? If so, what exception types?
   3. Is the exception required to be caught somewhere for the task that launched it to successfully complete?
   4. Is something upstream keeping track of which tasks succeeded and which failed?
   5. Does an exception handler reschedule the task at a later time?
   6. Is the exception allowed to propagate back to the caller (outside of Lucene)?
   7. Is the exception swallowed?
   8. Is this or any upstream piece of code pluggable or virtual, and what happens with the exception handlers when alternative implementation is plugged in? 
   9. If this or another pluggable piece may be swapped by the end user with a custom implementation, how are they expected to handle exceptions in the custom code? 
   
   One thing seems obvious to me: The existing exception handlers might not be compatible with the async code and may need to either be *duplicated* or *redesigned* to function when running async. This is particularly true when we have an `AggregateException` type and are expected to fire the code in multiple handlers because we have multiple exception types. We can't exactly throw multiple exceptions from a synchronous piece of code to get to the existing handlers.
   
   It took a month of work just to figure out how to (mostly) [correctly map each exception in Java to its counterpart in .NET](https://github.com/apache/lucenenet/pull/476), including changing the exception handlers to ignore the difference in inheritance between the exception types and re-map it to how Java inherits exceptions. That said, we [don't really know for certain how some exceptions should be handled if they are thrown](https://github.com/apache/lucenenet/pull/476).
   
   TBH - I have no idea how the Lucene team keeps track of the complicated dance of exceptions that are thrown and caught in this application. They must have a technical document somewhere that details it, but if not, this is a document that would need to be created and maintained on our end before we even started on such a task - at least for the part of the picture that we would need to see to maintain the async code that is added. This document would need to detail the entire stack when the application is running to fully understand where exceptions of specific types are thrown and where they are handled.
   
   There is a [TestIndexWriterExceptions](https://github.com/apache/lucenenet/blob/a654eb1f84e0f1e2bcb21fc8cab78169485cd651/src/Lucene.Net.Tests/Index/TestIndexWriterExceptions.cs) test that goes through some of what is expected, but it would need to be analyzed in combination with [our exception mapping logic](https://github.com/apache/lucenenet/blob/a654eb1f84e0f1e2bcb21fc8cab78169485cd651/src/Lucene.Net/Support/ExceptionHandling/ExceptionExtensions.cs) for it to make sense. There are also several tests that [analyze the stack trace](https://github.com/apache/lucenenet/blob/a654eb1f84e0f1e2bcb21fc8cab78169485cd651/src/Lucene.Net.TestFramework/Support/StackTraceHelper.cs) to ensure a specific caller exists in the call stack. Further evidence that the Lucene team is working from the perspective of the call stack at a high level when dealing with these.
   
   This is why I maintain my original position: This is a job for the Lucene designers to figure out, since they have the inside knowledge of what each exception type means and who is supposed to handle it. And this is taken into consideration when new features are added to Lucene.
   
   If we add async code on our end and are expected to maintain it across releases, it will be a tremendous amount of work. If we just add async functionality without creating a plan on how to upgrade it (including any exception logic that changes in future versions of Lucene), we may have no choice but to remove it since it will be blocking the effort to upgrade. Then not only is it a huge amount of time spent adding it to 4.8.0, it is a huge amount of time wasted because it will be gone from all future versions. Fortunately, we can see into the future versions of Lucene, so we can at least mitigate some of this.
   
   But I wouldn't consider a contribution like this without a commitment from someone to help to maintain it over the long run (dealing with future bug reports, incompatibilities, upgrading, and creating detailed technical documentation so others can understand it). Otherwise, it will just be relegated to the same fate as [TaskMergeScheduler](https://github.com/apache/lucenenet/issues/354): fundamentally broken, deprecated and slated for deletion. And a total waste of everyone's time.
   
   If Lucene adds async support, then it will always be something that will be considered as part of the entire application and something that is more straightforward to port and debug. Most importantly, it is not something we have to document and maintain ourselves.
   
   Of course, the first step is to [start a conversation with the Lucene team](https://lucene.apache.org/core/discussion.html) to determine whether this is something they are willing to do. Or at least willing to accept as a contribution that they will then maintain.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@lucenenet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [lucenenet] Jeevananthan-23 commented on issue #763: Support async task

Posted by GitBox <gi...@apache.org>.
Jeevananthan-23 commented on issue #763:
URL: https://github.com/apache/lucenenet/issues/763#issuecomment-1368506671

   @tonyguo2022, Interesting use case and their are alot of solutions provided for your question but I need to understand it more.(i.e) your use case is just use the Lucene library to index your data and storage to cloud back?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@lucenenet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [lucenenet] tonyguo2022 commented on issue #763: Support async task

Posted by GitBox <gi...@apache.org>.
tonyguo2022 commented on issue #763:
URL: https://github.com/apache/lucenenet/issues/763#issuecomment-1319220619

   > It might help to be very specific about which APIs need to be async. If you can narrow the focus it will be more likely something the Lucene team would consider doing.
   
   @NightOwl888 , so far we need to update Lucene Index stored in the cloud. I think we need a few APIs(see below list)  to be async. We can clone this repo to another repo and invite you to review. Thanks.
   
   DataInput.ReadBytes
   DataInput.ReadByte
   IndexWriter.Commit
   IndexWriter.Merge
   
   There maybe other functions needed along with the call stack.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@lucenenet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [lucenenet] tonyguo2022 commented on issue #763: Support async task

Posted by GitBox <gi...@apache.org>.
tonyguo2022 commented on issue #763:
URL: https://github.com/apache/lucenenet/issues/763#issuecomment-1319318903

   @rclabo our debug shows indexWriter.Commit call ReadBytes in same thread:
   
   ```
   >	MetadataIndexFileInput.ReadBytes(byte[] b, int offset, int len) Line 47	C#
    	Lucene.Net.Store.DataOutput.CopyBytes(Lucene.Net.Store.DataInput input, long numBytes)	Unknown
    	Lucene.Net.Store.Directory.Copy(Lucene.Net.Store.Directory to, string src, string dest, Lucene.Net.Store.IOContext context)	Unknown
    	Lucene.Net.Store.TrackingDirectoryWrapper.Copy(Lucene.Net.Store.Directory to, string src, string dest, Lucene.Net.Store.IOContext context)	Unknown
    	Lucene.Net.Index.IndexWriter.CreateCompoundFile(Lucene.Net.Util.InfoStream infoStream, Lucene.Net.Store.Directory directory, Lucene.Net.Index.CheckAbort checkAbort, Lucene.Net.Index.SegmentInfo info, Lucene.Net.Store.IOContext context)	Unknown
    	Lucene.Net.Index.DocumentsWriterPerThread.SealFlushedSegment(Lucene.Net.Index.DocumentsWriterPerThread.FlushedSegment flushedSegment)	Unknown
    	Lucene.Net.Index.DocumentsWriterPerThread.Flush()	Unknown
    	Lucene.Net.Index.DocumentsWriter.DoFlush(Lucene.Net.Index.DocumentsWriterPerThread flushingDWPT)	Unknown
    	Lucene.Net.Index.DocumentsWriter.FlushAllThreads(Lucene.Net.Index.IndexWriter indexWriter)	Unknown
    	Lucene.Net.Index.IndexWriter.PrepareCommitInternal()	Unknown
    	Lucene.Net.Index.IndexWriter.CommitInternal()	Unknown
    	Lucene.Net.Index.IndexWriter.Commit()	Unknown
   
   
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@lucenenet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [lucenenet] Jeevananthan-23 commented on issue #763: Support async task

Posted by "Jeevananthan-23 (via GitHub)" <gi...@apache.org>.
Jeevananthan-23 commented on issue #763:
URL: https://github.com/apache/lucenenet/issues/763#issuecomment-1691442148

   Hi @mikemccand / @NightOwl888, supporting async for **Lucene** really great idea because in recent-day talks about scalability, how to write high-performance code matters, and the landscape for search engines(Vector search engines) gradually increased. I feel that Lucene's legacy code base had many improvements over the period of time but lags with web-scale capability meaning that Lucene is not used in search engines like Google Search(BigTable)/Bing Search (RocksDb) [_That's a separate rant 🤣_].
   
   Adding async/await in **Lucene(java)** by using _**Virtual Threads(in JDK21**_) is still in preview/beta and may be changed in the near future. Lucene(java) already implemented using the `ThreadPoolExector ` which is similar to `TaskScheduler` and uses ThreadPools to run the threads. 
   
   On the other hand, **Lucenenet(C#)** current code base lags behind compared to Lucene(java) V9.x.x. but `C#` is already very mature in the grade of supporting async/await from low-level APIs like **async I/O bound operations**. IMO we can implement and test the performance of **async operations in Lucenenet at first** and move to the Java version if virtual threads become stable. 
   
   _Note: Also will work on removing J2N overhead if possible_ 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@lucenenet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [lucenenet] NightOwl888 commented on issue #763: Support async task

Posted by "NightOwl888 (via GitHub)" <gi...@apache.org>.
NightOwl888 commented on issue #763:
URL: https://github.com/apache/lucenenet/issues/763#issuecomment-1685294114

   > Has anyone opened a Lucene java issue with the ideas here? I am no expert on async/await but I do think this is a gap in Lucene java. Search is often IO bound and it's crazy that Lucene java's design must stall a whole thread waiting from the IO reads, blocking getting the other IO requests shortly necessary to continue the search. Maybe the new virtual threads (in JDK 21) somehow helps here, not sure.
   
   There is no open issue that I am aware of. However, I agree it would be a great addition. ASP.NET Core and other common UI frameworks tend to use async/await support throughout and fitting Lucene into the picture requires blocking IO. Being able to do an async search would be a nicer fit for these UI frameworks. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@lucenenet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [lucenenet] jeme commented on issue #763: Support async task

Posted by GitBox <gi...@apache.org>.
jeme commented on issue #763:
URL: https://github.com/apache/lucenenet/issues/763#issuecomment-1322012531

   > Note that there is a [Lucene.Net.Replicator](https://lucenenet.apache.org/docs/4.8.0-beta00016/api/replicator/Lucene.Net.Replicator.html) module that is designed to synchronize an index across multiple servers by hosting a listener on a server and having each client request a copy periodically. I haven't tried, but I suspect there is a way to utilize it to safely copy the index to cloud storage at periodic intervals. I wouldn't recommend copying the index outside of a locking context, though, because Lucene.NET may lock the files at unpredictable times and it may not be practical to copy them without getting copy errors.
   
   The Replicator can probably do that but I don't think it's needed at all to do a feature like that, however the replicator holds the answer to how so it's worth looking at the code.
   
   But it should be as simple as using the SnapshotDeletionPolicy (Also required by the Replicator) then at intervals do a snapshot by calling the `Snapshot` method which returns a `IndexCommit` then copy all the files of that commit.
   
   We do that targeting the local file system to make dated snapshots of our index. Going from building snapshots targeting the local file system to putting them into cloud storage shouldn't be much of a challenge.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@lucenenet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [lucenenet] NightOwl888 commented on issue #763: Support async task

Posted by GitBox <gi...@apache.org>.
NightOwl888 commented on issue #763:
URL: https://github.com/apache/lucenenet/issues/763#issuecomment-1322829882

   > Here I did run into some classes which raises a question from a efficiency perspective. E.g. InputStreamDataInput that takes a Stream in it's constructor and immediately wraps that stream in a BinaryReader but I can't see for what reason. If the concern is that the underlying stream is not buffered, then perhaps a BufferedStream would be more appropriate. The Java version takes a InputStream in the constructor and just uses that as is, so I am not sure why this differs. (At least for the Java version I currently have checked out)
   
   Looks like you have spotted yet another weird thing that was done during the initial port. In Lucene, the passed-in stream was used without wrapping it in anything.
   
   It was changed like this in https://github.com/apache/lucenenet/commit/3bcd980f0244af667a885eb14e6144510af591e0#diff-5cbde6cbc3ee23bdc1d699d185ccae736530a7129bb2f2a844be24155fa9e9cd, and I don't see any valid reason for changing it like this.
   
   Care to submit a patch? Also, a null guard clause can be added to the constructor because clearly this class cannot function with a null stream.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@lucenenet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [lucenenet] jeme commented on issue #763: Support async task

Posted by GitBox <gi...@apache.org>.
jeme commented on issue #763:
URL: https://github.com/apache/lucenenet/issues/763#issuecomment-1323361243

   > I am not sure I totally understand the difference between **OPTION-A** and **OPTION-B**. Partial classes are a compiler feature that makes it easier to organize code, but at the end of the day they get compiled into 1 class. Both partial classes have access to all of the private state of the class.
   
   **OPTION-A** is something that the team does not have to get involved with directly, it can be a 3rd party contribution that lives elsewhere, the team COULD open up a few things to make it a bit easier to copy and extend the Directory and Input/Output classes, e.g. exposing internally used helper methods.
   
   But I think the confusion here is how far up the stack this goes, I am only focusing on the replication scenarios and similar scenarios that works on that level. As I said and tried to re-iterate is that I don't think we can get problems in an attempt to go all the way up to `IndexWriter.UpdateDocumentAsync(...)` - This also sets the two options very much apart because if we were even to Attempt to go that far, that had to be done within the Lucene Source, where for replication purposes etc. We don't need to go that far up the stack, and because of that, this can be done as a custom implementation.
   
   A dirty POC of **OPTION-A** can be seen here: https://github.com/jeme/LuceneAsyncInputsPoc/tree/main/src/LuceneAsyncInputsPoc/LuceneAsyncInputsPoc/CustomDirectory
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@lucenenet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [lucenenet] jeme commented on issue #763: Support async task

Posted by GitBox <gi...@apache.org>.
jeme commented on issue #763:
URL: https://github.com/apache/lucenenet/issues/763#issuecomment-1322066724

   Furthermore, now that I looked a bit around in the code which would be involved in the above, it might actually not be completely unfeasibly to introduce async into lucene at this layer for such scenarios as a "Addin" to Lucene.NET, that means we are talking coud outside of lucene itself that extends the appropriate classes (e.g. SimpleFSDirectory, NIOFSDirectory, MMapDirectory etc.) or making "Mirrors" of them and then providing the asynchronous overloads on those and their associated Input/Output classes. And then you can use those during replication allowing you to use async patterns when copying the files from the disk etc. From there it will require a bit of Type checking/casting, but other than that it should be doable... Lets mark that as **OPTION-A**.
   
   It COULD possibly also be introduced into Lucene it self by marking the involved classes here as partial classes and then implement the Async behaviors there, that might allow us to introduce it without breaking with the principle of sticking close to the Java side as we could then add it into Lucene.Net specific async files that can clearly be marked as "Lucene.Net Specific" - that should in theory mean that all we have to do in future ports is again to open up these ported classes by marking them as partial. Ofc. Breaking changes can occur in the stack that means we have to do major rework on the Async parts, but it see it as more manageable. Lets mark that as **OPTION-B**.
   
   While **OPTION-B** could be interesting to investigate as it might have to opportunity to introduce async/await into Lucene.NET gradually if anyone wishes to give that a shot, to begin with the use will be very limited and probably specifically only to scenarios outlined here, it would not affect the core itself which would still use the blocking API's and I can't say how far up the stack this could be done without hitting a wall where we can't go further without actually compromising on the principle of having a port as close to the java source as possible to make furture ports easier. I REALLY doubt that we could get all the way out to making a `IndexWriter.UpdateDocumentAsync(...)` without hitting that problem, but I can't say for sure.
   
   So I am not sure **OPTION-B** is worth the effort considering that **OPTION-A** is probably feasible and it's only downside is the requirement of casts/type checks. And even IF we would find that worth it, I don't think anyone in the team sees it as a priority.
   
   But feel free to share what you guys think.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@lucenenet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [lucenenet] tonyguo2022 commented on issue #763: Support async task

Posted by GitBox <gi...@apache.org>.
tonyguo2022 commented on issue #763:
URL: https://github.com/apache/lucenenet/issues/763#issuecomment-1322507945

   > So when you say "After Merge() operation, we immediately collect the output files and store them to cloud storage." That would only work well if there is no need to add additional documents to the index. Otherwise, without ongoing merging, the number of segments will grow very large and search times will become slower and slower due to the number of segments that must be searched.
   
   
   @rclabo 
   We actually do indexWriter.ForceMerge() and indexWriter.Commit() when the number of segments has exceeded a threshold.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@lucenenet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [lucenenet] tonyguo2022 commented on issue #763: Support async task

Posted by GitBox <gi...@apache.org>.
tonyguo2022 commented on issue #763:
URL: https://github.com/apache/lucenenet/issues/763#issuecomment-1322514788

   @NightOwl888 thanks for the research and references. We can definitely wrap asyn cloud API within sync Lucene.NET API as the most of the references have done. But if Lucene.net support async call natively, I think there could be benefits.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@lucenenet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [lucenenet] jeme commented on issue #763: Support async task

Posted by GitBox <gi...@apache.org>.
jeme commented on issue #763:
URL: https://github.com/apache/lucenenet/issues/763#issuecomment-1322649394

   > I really don't see how you can do it without breaking all the components.
   > even with partial classes, the async operations should seep from the ground basic operations (Lucene.Net.Store.DataOutput) all the way up to the public api.
   
   As I said, making it all the way up to `IndexWriter.UpdateDocumentAsync(...)` I am having a hard time believing can be done without running into trouble as well, however for replication purposes, working with Lucene on the lover levels (DataInputs/DataOutputs) does actually not look to hard, In an OPTION-B scenario it might even be easier that I initially thought as it's actually only the DataInput/DataOutput that needs additions in partial classes for now to lift the async operations into the replication code.
   
   Here I did run into some classes which raises a question from a efficiency perspective. E.g. InputStreamDataInput that takes a Stream in it's constructor and immediately wraps that stream in a BinaryReader but I can't see for what reason. If the concern is that the underlying stream is not buffered, then perhaps a BufferedStream would be more appropriate. The Java version takes a InputStream in the constructor and just uses that as is, so I am not sure why this differs. (At least for the Java version I currently have checked out)
   
   For OPTION-A which I would point to for now (due to priorities) the Directory classes has to be mirrored as well to return these new implementations.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@lucenenet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [lucenenet] rclabo commented on issue #763: Support async task

Posted by GitBox <gi...@apache.org>.
rclabo commented on issue #763:
URL: https://github.com/apache/lucenenet/issues/763#issuecomment-1319355153

   @tonyguo2022 I'm surprised by that stack trace.  That's not my recollection.  But facts are facts.  Thank you for that.  I always appreciate learning something new. 
    
   But even still, it just shows that the new segment written to storage happened on the same thread as IndexWriter.Commit().  Unless you are using the `SerialMergeScheduler`, the heavy lifting of the storage writing will still be done on a background thread when the merging of segments happens.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@lucenenet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [lucenenet] NightOwl888 commented on issue #763: Support async task

Posted by GitBox <gi...@apache.org>.
NightOwl888 commented on issue #763:
URL: https://github.com/apache/lucenenet/issues/763#issuecomment-1320374984

   For the record, there are already existing Lucene directory implementations for various cloud platforms and products. Some would need to be ported to .NET, others might already exist on .NET (some of which don't yet support 4.8.0).
   
   https://github.com/apache/lucenenet/issues/631#issuecomment-1086764962
   https://github.com/azure-contrib/AzureDirectory
   https://github.com/tomlm/Lucene.Net.Store.Azure
   https://stackoverflow.com/a/59381272
   https://github.com/albogdano/lucene-s3directory
   https://docs.jboss.org/author/display/ISPN50/Infinispan%20as%20a%20Directory%20for%20Lucene.html
   https://cwiki.apache.org/confluence/display/lucene/AvailableLockFactories
   
   Lucene is highly optimized to use a local file system. Swapping the directory implementation is one way to extend Lucene, but it will come at a pretty significant performance penalty to write to a blob storage provider (as you can see by the notes on each implementation). There are existing cloud products that emulate a local file system that may or may not do the job better of moving your index off of the server it runs on.
   
   However, rolling your own `Directory` implementation is a major job that will take a significant amount of effort to perform well and be stable enough to use.
   
   Note that there is a [Lucene.Net.Replicator](https://lucenenet.apache.org/docs/4.8.0-beta00009/api/replicator/Lucene.Net.Replicator.html) module that is designed to synchronize an index across multiple servers by hosting a listener on a server and having each client request a copy periodically. I haven't tried, but I suspect there is a way to utilize it to safely copy the index to cloud storage at periodic intervals. I wouldn't recommend copying the index outside of a locking context, though, because Lucene.NET may lock the files at unpredictable times and it may not be practical to copy them without getting copy errors.
   
   
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@lucenenet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [lucenenet] tonyguo2022 commented on issue #763: Support async task

Posted by GitBox <gi...@apache.org>.
tonyguo2022 commented on issue #763:
URL: https://github.com/apache/lucenenet/issues/763#issuecomment-1319107489

   I think we do need this feature. Not sure if this is rare. Maybe we can clone this repo and make change there. Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@lucenenet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [lucenenet] tonyguo2022 commented on issue #763: Support async task

Posted by GitBox <gi...@apache.org>.
tonyguo2022 commented on issue #763:
URL: https://github.com/apache/lucenenet/issues/763#issuecomment-1319350119

   @rclabo It is same for write. 
   
   ```
   >	MetadataIndexFileOutput.WriteByte(byte b) Line 55	C#
    	Lucene.Net.Store.DataOutput.WriteInt32(int i)	Unknown
    	Lucene.Net.Codecs.CodecUtil.WriteFooter(Lucene.Net.Store.IndexOutput out)	Unknown
    	Lucene.Net.Store.CompoundFileWriter.Dispose()	Unknown
    	Lucene.Net.Store.CompoundFileDirectory.Dispose(bool disposing)	Unknown
    	Lucene.Net.Store.Directory.Dispose()	Unknown
    	Lucene.Net.Util.IOUtils.Dispose(System.IDisposable[] objects)	Unknown
    	Lucene.Net.Index.IndexWriter.CreateCompoundFile(Lucene.Net.Util.InfoStream infoStream, Lucene.Net.Store.Directory directory, Lucene.Net.Index.CheckAbort checkAbort, Lucene.Net.Index.SegmentInfo info, Lucene.Net.Store.IOContext context)	Unknown
    	Lucene.Net.Index.DocumentsWriterPerThread.SealFlushedSegment(Lucene.Net.Index.DocumentsWriterPerThread.FlushedSegment flushedSegment)	Unknown
    	Lucene.Net.Index.DocumentsWriterPerThread.Flush()	Unknown
    	Lucene.Net.Index.DocumentsWriter.DoFlush(Lucene.Net.Index.DocumentsWriterPerThread flushingDWPT)	Unknown
    	Lucene.Net.Index.DocumentsWriter.FlushAllThreads(Lucene.Net.Index.IndexWriter indexWriter)	Unknown
    	Lucene.Net.Index.IndexWriter.PrepareCommitInternal()	Unknown
    	Lucene.Net.Index.IndexWriter.CommitInternal()	Unknown
    	Lucene.Net.Index.IndexWriter.Commit()	Unknown
   
   ```
   
   But so far we don't really need async WriteByte. we can write to memory and do async after commit.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@lucenenet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [lucenenet] rclabo commented on issue #763: Support async task

Posted by GitBox <gi...@apache.org>.
rclabo commented on issue #763:
URL: https://github.com/apache/lucenenet/issues/763#issuecomment-1319966984

   > After Merge() operation, we immediately collect the output files and store them to cloud storage. Seems the Lucene Index is valid already and we can run query against it successfully. So, there should be no required post write operations.
   
   I'm not sure I fully understand.  When you call `IndexWriter.Commit` that thread writes a new index segment to storage, handled at a top level by `DocumentsWriterPerThread` (which I had forgot was introduced in Lucene 4).  That index segment on disk will be fairly small given that it had to fit in RAM in it's entirety before being written to disk.  
   
   Then, in the default configuration, the [ConcurrentMergeScheduler](https://lucenenet.apache.org/docs/4.8.0-beta00016/api/core/Lucene.Net.Index.ConcurrentMergeScheduler.html) will run on background threads and merge that tiny segment with other segments to create a new larger segment.  This process repeats itself over and over as new documents are written to the index and committed.  
   
    [This video](https://www.youtube.com/watch?v=YW0bOvLp72E ) shows  the segment writing and merging that happens as Java Lucene indexes Wikipedia.  It creates a nice visual of the process we are talking about.  You can see all the small initial segments being written as various commits are called and then see the segments getting combined into larger segments by the background workers.  And eventually those larger segments get combined to create even larger segments and so on.
   
   So when you say "After Merge() operation, we immediately collect the output files and store them to cloud storage." That would only work well if there is no need to add additional documents to the index.  Otherwise, without ongoing merging, the number of segments will grow very large and search times will become slower and slower due to the number of segments that must be searched.
   
   I hope this information is helpful.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@lucenenet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [lucenenet] eladmarg commented on issue #763: Support async task

Posted by GitBox <gi...@apache.org>.
eladmarg commented on issue #763:
URL: https://github.com/apache/lucenenet/issues/763#issuecomment-1322497950

   no one said that async operation isn't needed. IO (network / disk) is async operation by definition.
   so yes, this is totally agreed that the correct way to access IO is async all the way.
   
   no need to say that this will impact much higher throughput to the system / index. even that lucene is "highly optimized to use a local file system".
   a local FS on the cloud is not always what you think in terms of FS, not always fast.
   but even if its local, it's still in order of magnitude slower than cpu/L1 cache.
   so while a thread waiting for IO, it can be freed and do another work or serve other queries.
   
   so to summarize, it's not a question if async should be supported. it's clearly as the sun.
   our problem is the scope and how and when can it be done.
   
   I really don't see how you can do it without breaking all the components. 
   even with partial classes, the async operations should seep from the ground basic operations (Lucene.Net.Store.DataOutput) all the way up to the public api.
   
   so in some way, i can truly understand this is disappointing because we aren't utilizing the modern technology and the tools we already have.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@lucenenet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [lucenenet] jeme commented on issue #763: Support async task

Posted by GitBox <gi...@apache.org>.
jeme commented on issue #763:
URL: https://github.com/apache/lucenenet/issues/763#issuecomment-1323219794

   I went down a rabbit how and checked what BinaryReaderDataInput was used for as I could not seem to find an equivalent in Java, so I am not sure why that exists. I ended up in the OfflineSorter class which seems to have quite a few differences between .NET and Java, I couldn't really make heads or tails of it and in the end why the BinaryReaderDataInput existed.
   
   Not sure if you have any insights here?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@lucenenet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [lucenenet] NightOwl888 commented on issue #763: Support async task

Posted by GitBox <gi...@apache.org>.
NightOwl888 commented on issue #763:
URL: https://github.com/apache/lucenenet/issues/763#issuecomment-1323440052

   > > I am not sure I totally understand the difference between **OPTION-A** and **OPTION-B**. Partial classes are a compiler feature that makes it easier to organize code, but at the end of the day they get compiled into 1 class. Both partial classes have access to all of the private state of the class.
   > 
   > **OPTION-A** is something that the team does not have to get involved with directly, it can be a 3rd party contribution that lives elsewhere, the team COULD open up a few things to make it a bit easier to copy and extend the Directory and Input/Output classes, e.g. exposing internally used helper methods if that turns out to be helpful as one go deeper.
   > 
   > But I think the confusion here is how far up the stack this goes, I am only focusing on the replication scenarios and similar scenarios that works on that level. As I said and tried to re-iterate is that I don't think we can get all the way up to `IndexWriter.UpdateDocumentAsync(...)` without problems - This also sets the two options very much apart because if we were even to Attempt to go that far, that had to be done within the Lucene Source (I think I can state that with a 100% certainty, if not then a 99.9999% certainty), where for replication purposes etc. We don't need to go that far up the stack, and because of that, this can be done as a custom implementation.
   > 
   > A SUPER DIRTY POC of **OPTION-A** can be seen here: https://github.com/jeme/LuceneAsyncInputsPoc/tree/main/src/LuceneAsyncInputsPoc/LuceneAsyncInputsPoc/CustomDirectory
   
   Gotcha.
   
   I don't think there is such an issue if we limit the scope like this. In fact, when upgrading the replicator tests, I left some [`LUCENENET TODO`s about making it async](https://github.com/apache/lucenenet/blob/4e49612d6867194440694b77db95fd0ed756c9a9/src/Lucene.Net.Tests.Replicator/Http/ReplicationServlet.cs#L86) so it fits together with ASP.NET Core better.
   
    Making stuff public that never was meant to be is probably not a good idea, so it would be preferable to fix this using partial classes (**OPTION-B**). But I agree that trying to fix `IndexWriter.UpdateDocument()` is probably not worth the effort.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@lucenenet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [lucenenet] mikemccand commented on issue #763: Support async task

Posted by "mikemccand (via GitHub)" <gi...@apache.org>.
mikemccand commented on issue #763:
URL: https://github.com/apache/lucenenet/issues/763#issuecomment-1684912186

   Has anyone opened a Lucene java issue with the ideas here?  I am no expert on async/await but I do think this is a gap in Lucene java.  Search is often IO bound and it's crazy that Lucene java's design must stall a whole thread waiting from the IO reads, blocking getting the other IO requests shortly necessary to continue the search.  Maybe the new virtual threads (in JDK 21) somehow helps here, not sure.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@lucenenet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org