You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucenenet.apache.org by Shad Storhaug <sh...@shadstorhaug.com> on 2017/03/20 20:59:47 UTC

RE: Lucene.net file I/O inefficiency and a question

Vincent,

Just an update on this.

> Proposal to speed up implementation of LowercaseFilter/charUtils.toLower

I have implemented the proposed fix. There was more to it than that, though because you apparently aren't running the tests - for backward compatibility support with Lucene 3.0, there was a "broken" Unicode implementation that needed to be used (which was taken care of by the original code). However, since that is an extreme edge case (made impossibly small by the fact there was no Lucene.Net 3.0 release), I have used your fix in all but that one case.

> CreateTempFile is not thread-safe (no, it really is not)

I have taken your suggestions into consideration for the current version: https://github.com/apache/lucenenet/blob/api-work/src/Lucene.Net.Core/Support/FileSupport.cs#L154. There are still no tests for that class, but I have confirmed that it is working right by stepping through.

> Lucene.net file I/O inefficiency

I haven't yet looked into these issues, but will take a look when I get a chance (unless you beat me to it with a pull request).

I have implemented your fix for the MMapDirectory (but do note it doesn't work on .NET core because the ReadArray() and WriteArray() methods are no longer supported). The current implementation of MMapDirectory is not an exact port from Lucene. These "alternative" versions that are not a true port from Lucene tend to be a major source of bugs. Someone before me made an incomplete MemoryMappedFileByteBuffer implementation (which there is no counterpart for in Java) as well as most of the MMapDirectory and it was a minor miracle I got them to stitch together and make the tests pass. Of course, the result is not stellar and I wouldn't mind if it were redesigned.

Please keep sending the issues (or fixes) our way.

Thanks,
Shad Storhaug (NightOwl888)


-----Original Message-----
From: Shad Storhaug 
Sent: Wednesday, February 1, 2017 11:23 AM
To: 'Vincent.VanDenBerghe@bvdinfo.com'
Cc: dev@lucenenet.apache.org
Subject: RE: Lucene.net file I/O inefficiency and a question

Vincent,

Right now, our priorities are:

1. Stabilize the API
2. Fix the tests that are causing NUnit to crash 3. Get a pre-release on NuGet 4. Fix remaining broken tests 5. Fix other bugs and make optimizations

So, these issues while important, are not the highest of priority right now. That said, now that I have the my current working branch (https://github.com/apache/lucenenet/tree/api-work) posted on the main repository, you may submit pull requests there. Please DO NOT use the master branch or #191 branch to make changes, because it will be difficult to merge them. The api-work branch is more than 1000 commits ahead of #191.

We do need to proceed carefully on these changes, though, because this I/O code has dependencies with very temperamental code. Two of the "issues" you brought up previously were to reverse 2 of the bug fixes that I put into place. We need to rely on the tests to verify that we have the correct behavior, and frankly, that is our best tool to determine if it is working correctly.

I agree that a thread-safe implementation of CreateTempFile is required. But, we also need it to release the file handle immediately before returning. When trying to open then close with a using block, some tests are failing (I think because there is a delay after the end of the using block before the file handle is closed). Also, some of the code does not function correctly if there is a BOM in the file that is created. Both of these issues were difficult to track down.

While it might be possible to change the design, keep in mind our safest option is to try to exactly mimic what is going on in Lucene/Java. Changing the design of this piece, means that the design of half a dozen dependent pieces needs to be changed to work with the new behavior. And being that we have new behavior, we are off the map as far as the tests are concerned. We can write new tests, but how do we know the new behavior is right? It also makes future porting efforts more difficult for changes to pieces that relied on the original behavior. I don't object, but we need to tread carefully to go into new waters like this.

If you want to help out, I would suggest start by first helping with the effort of fixing broken tests (currently we have some tests that are causing NUnit to crash), or if you want to help with getting the API into shape, let me know and I will set some of the tasks aside for you. Or, if you prefer, perhaps you could write some tests to check for thread-safety of CreateTempFile - at least then we have a target to shoot for. But let's at least try to get stable tests before we start changing parts of the design that could break tests that don't currently run.

Thanks,
Shad Storhaug (NightOwl888)


-----Original Message-----
From: Prescott Nasser [mailto:geobmx540@hotmail.com]
Sent: Tuesday, January 31, 2017 3:12 PM
To: dev@lucenenet.apache.org
Subject: RE: Lucene.net file I/O inefficiency and a question

Hey Vincent - 

We love any and all help. 

As for contributions, you can create issues in JIRA (https://issues.apache.org/jira/browse/LUCENENET/?selectedTab=com.atlassian.jira.jira-projects-plugin:issues-panel) - admittedly we aren't great at keeping this up to date or on track. You can also submit a pull request and state that the code is your original work and you license it under the Apache License v2 (http://www.apache.org/licenses/LICENSE-2.0)

Best,
~Prescott

-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com]
Sent: Monday, January 30, 2017 11:46 PM
To: dev@lucenenet.apache.org
Subject: Lucene.net file I/O inefficiency and a question

Hello everyone,

This message contains two subjects, but since the second one is more of a question, I'll use the first subject as a "hook", hoping to get an answer  to the next one.
(start of  first subject)
There is an inefficient implementation of file I/O in Lucene.net, most notably in FSDirectory.FSIndexOutput. The number of write calls can be reduced by a factor of 2.
First we see this, which seems to be a copy paste from the Java code:

            /// <summary>
            /// The maximum chunk size is 8192 bytes, because <seealso cref="RandomAccessFile"/> mallocs
            /// a native buffer outside of stack if the write buffer size is larger.
            /// </summary>
            internal const int CHUNK_SIZE = 8192;

And then further on:

            protected internal override void FlushBuffer(byte[] b, int offset, int size)
            {
                //Debug.Assert(IsOpen);
                while (size > 0)
                {
                    int toWrite = Math.Min(CHUNK_SIZE, size);
                    File.Write(b, offset, toWrite);
                    offset += toWrite;
                    size -= toWrite;
                }
                //Debug.Assert(size == 0);
            }


This is not needed: in .NET FileStream.Write delegates to the native Win32 file implementation and allocates nothing, regardless the size of the buffer.
Wouldn't it be better to write:

            protected internal override void FlushBuffer(byte[] b, int offset, int size)
            {
              //Debug.Assert(IsOpen);
              File.Write(b, offset, size);
            }

... and get rid of the CHUNK_SIZE?
The default buffer size (from the BufferedIndexOutput class) is 16384 bytes, so this will reduce the number of I/O calls by 2.
There is a similar modification that can be done for SimpleFSIndexInput.ReadInternal.
There may be other places where similar code is used, but I couldn't conclusively prove a similar modification would help.
(end of the first subject)

Here's my question:  This is the third suggestion I'm making, based of real-world usage of Lucene.net:

-          Proposal to speed up implementation of LowercaseFilter/charUtils.toLower

-          CreateTempFile is not thread-safe (no, it really is not)

-          Lucene.net file I/O inefficiency
I'd like to make contributions to the Lucene.net project, but several personal and external factors are preventing me to be a contributor (in the Apache sense). I also may not have anything else or significant to contribute after this: there is no way to know.
How can I make sure that these suggestions are actually considered for ending up in the code? I've seen contributors doing modifications on behalf of other people. I care about problems being solved, and do not care about who's name is on them. What's the best way to proceed? Would it be better to post these things on GitHub somewhere?


Vincent


RE: Lucene.net file I/O inefficiency and a question

Posted by "Van Den Berghe, Vincent" <Vi...@bvdinfo.com>.
Hello Shad,

Thank you for taking my issues into consideration, and for the feedback. It is much appreciated.
You are correct I'm not running the Lucene tests. I apologize for the trouble that has caused. I'll try to better my life in the future, but no promises.

I mentioned in my earlier posts that I have no intention of being a "real" contributor to Lucene.net (in the Apache sense), but your efforts, the imminence of the #203 merge and some other major improvements I made to Lucene.net in the meantime just might cause me to reconsider. I'll keep you posted.

Vincent



-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: Monday, March 20, 2017 10:00 PM
To: Van Den Berghe, Vincent <Vi...@bvdinfo.com>
Cc: dev@lucenenet.apache.org
Subject: RE: Lucene.net file I/O inefficiency and a question

Vincent,

Just an update on this.

> Proposal to speed up implementation of LowercaseFilter/charUtils.toLower

I have implemented the proposed fix. There was more to it than that, though because you apparently aren't running the tests - for backward compatibility support with Lucene 3.0, there was a "broken" Unicode implementation that needed to be used (which was taken care of by the original code). However, since that is an extreme edge case (made impossibly small by the fact there was no Lucene.Net 3.0 release), I have used your fix in all but that one case.

> CreateTempFile is not thread-safe (no, it really is not)

I have taken your suggestions into consideration for the current version: https://github.com/apache/lucenenet/blob/api-work/src/Lucene.Net.Core/Support/FileSupport.cs#L154. There are still no tests for that class, but I have confirmed that it is working right by stepping through.

> Lucene.net file I/O inefficiency

I haven't yet looked into these issues, but will take a look when I get a chance (unless you beat me to it with a pull request).

I have implemented your fix for the MMapDirectory (but do note it doesn't work on .NET core because the ReadArray() and WriteArray() methods are no longer supported). The current implementation of MMapDirectory is not an exact port from Lucene. These "alternative" versions that are not a true port from Lucene tend to be a major source of bugs. Someone before me made an incomplete MemoryMappedFileByteBuffer implementation (which there is no counterpart for in Java) as well as most of the MMapDirectory and it was a minor miracle I got them to stitch together and make the tests pass. Of course, the result is not stellar and I wouldn't mind if it were redesigned.

Please keep sending the issues (or fixes) our way.

Thanks,
Shad Storhaug (NightOwl888)


-----Original Message-----
From: Shad Storhaug 
Sent: Wednesday, February 1, 2017 11:23 AM
To: 'Vincent.VanDenBerghe@bvdinfo.com'
Cc: dev@lucenenet.apache.org
Subject: RE: Lucene.net file I/O inefficiency and a question

Vincent,

Right now, our priorities are:

1. Stabilize the API
2. Fix the tests that are causing NUnit to crash 3. Get a pre-release on NuGet 4. Fix remaining broken tests 5. Fix other bugs and make optimizations

So, these issues while important, are not the highest of priority right now. That said, now that I have the my current working branch (https://github.com/apache/lucenenet/tree/api-work) posted on the main repository, you may submit pull requests there. Please DO NOT use the master branch or #191 branch to make changes, because it will be difficult to merge them. The api-work branch is more than 1000 commits ahead of #191.

We do need to proceed carefully on these changes, though, because this I/O code has dependencies with very temperamental code. Two of the "issues" you brought up previously were to reverse 2 of the bug fixes that I put into place. We need to rely on the tests to verify that we have the correct behavior, and frankly, that is our best tool to determine if it is working correctly.

I agree that a thread-safe implementation of CreateTempFile is required. But, we also need it to release the file handle immediately before returning. When trying to open then close with a using block, some tests are failing (I think because there is a delay after the end of the using block before the file handle is closed). Also, some of the code does not function correctly if there is a BOM in the file that is created. Both of these issues were difficult to track down.

While it might be possible to change the design, keep in mind our safest option is to try to exactly mimic what is going on in Lucene/Java. Changing the design of this piece, means that the design of half a dozen dependent pieces needs to be changed to work with the new behavior. And being that we have new behavior, we are off the map as far as the tests are concerned. We can write new tests, but how do we know the new behavior is right? It also makes future porting efforts more difficult for changes to pieces that relied on the original behavior. I don't object, but we need to tread carefully to go into new waters like this.

If you want to help out, I would suggest start by first helping with the effort of fixing broken tests (currently we have some tests that are causing NUnit to crash), or if you want to help with getting the API into shape, let me know and I will set some of the tasks aside for you. Or, if you prefer, perhaps you could write some tests to check for thread-safety of CreateTempFile - at least then we have a target to shoot for. But let's at least try to get stable tests before we start changing parts of the design that could break tests that don't currently run.

Thanks,
Shad Storhaug (NightOwl888)


-----Original Message-----
From: Prescott Nasser [mailto:geobmx540@hotmail.com]
Sent: Tuesday, January 31, 2017 3:12 PM
To: dev@lucenenet.apache.org
Subject: RE: Lucene.net file I/O inefficiency and a question

Hey Vincent - 

We love any and all help. 

As for contributions, you can create issues in JIRA (https://issues.apache.org/jira/browse/LUCENENET/?selectedTab=com.atlassian.jira.jira-projects-plugin:issues-panel) - admittedly we aren't great at keeping this up to date or on track. You can also submit a pull request and state that the code is your original work and you license it under the Apache License v2 (http://www.apache.org/licenses/LICENSE-2.0)

Best,
~Prescott

-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com]
Sent: Monday, January 30, 2017 11:46 PM
To: dev@lucenenet.apache.org
Subject: Lucene.net file I/O inefficiency and a question

Hello everyone,

This message contains two subjects, but since the second one is more of a question, I'll use the first subject as a "hook", hoping to get an answer  to the next one.
(start of  first subject)
There is an inefficient implementation of file I/O in Lucene.net, most notably in FSDirectory.FSIndexOutput. The number of write calls can be reduced by a factor of 2.
First we see this, which seems to be a copy paste from the Java code:

            /// <summary>
            /// The maximum chunk size is 8192 bytes, because <seealso cref="RandomAccessFile"/> mallocs
            /// a native buffer outside of stack if the write buffer size is larger.
            /// </summary>
            internal const int CHUNK_SIZE = 8192;

And then further on:

            protected internal override void FlushBuffer(byte[] b, int offset, int size)
            {
                //Debug.Assert(IsOpen);
                while (size > 0)
                {
                    int toWrite = Math.Min(CHUNK_SIZE, size);
                    File.Write(b, offset, toWrite);
                    offset += toWrite;
                    size -= toWrite;
                }
                //Debug.Assert(size == 0);
            }


This is not needed: in .NET FileStream.Write delegates to the native Win32 file implementation and allocates nothing, regardless the size of the buffer.
Wouldn't it be better to write:

            protected internal override void FlushBuffer(byte[] b, int offset, int size)
            {
              //Debug.Assert(IsOpen);
              File.Write(b, offset, size);
            }

... and get rid of the CHUNK_SIZE?
The default buffer size (from the BufferedIndexOutput class) is 16384 bytes, so this will reduce the number of I/O calls by 2.
There is a similar modification that can be done for SimpleFSIndexInput.ReadInternal.
There may be other places where similar code is used, but I couldn't conclusively prove a similar modification would help.
(end of the first subject)

Here's my question:  This is the third suggestion I'm making, based of real-world usage of Lucene.net:

-          Proposal to speed up implementation of LowercaseFilter/charUtils.toLower

-          CreateTempFile is not thread-safe (no, it really is not)

-          Lucene.net file I/O inefficiency
I'd like to make contributions to the Lucene.net project, but several personal and external factors are preventing me to be a contributor (in the Apache sense). I also may not have anything else or significant to contribute after this: there is no way to know.
How can I make sure that these suggestions are actually considered for ending up in the code? I've seen contributors doing modifications on behalf of other people. I care about problems being solved, and do not care about who's name is on them. What's the best way to proceed? Would it be better to post these things on GitHub somewhere?


Vincent