You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2021/02/01 10:27:25 UTC

[GitHub] [lucene-solr] donnerpeter opened a new pull request #2277: LUCENE-9716: Hunspell: support flag usage before its format is even specified

donnerpeter opened a new pull request #2277:
URL: https://github.com/apache/lucene-solr/pull/2277

<!--
_(If you are a project committer then you may remove some/all of the following template.)_

Before creating a pull request, please file an issue in the ASF Jira system for Lucene or Solr:

* https://issues.apache.org/jira/projects/LUCENE
* https://issues.apache.org/jira/projects/SOLR

You will need to create an account in Jira in order to create an issue.

The title of the PR should reference the Jira issue number in the form:

* LUCENE-####: <short description of problem or changes>
* SOLR-####: <short description of problem or changes>

LUCENE and SOLR must be fully capitalized. A short description helps people scanning pull requests for items they can work on.

Properly referencing the issue in the title ensures that Jira is correctly updated with code review comments and commits. -->

# Description

e.g. `nl` dictionaries first have `KEEPCASE Kc` and only then `FLAG long`

# Solution

Like Hunspell/C++, read affix twice: first extract important info like encoding and flag format, next — the rest.

# Tests

Added

# Checklist

Please review the following and check all that apply:

- [x] I have reviewed the guidelines for [How to Contribute](https://wiki.apache.org/solr/HowToContribute) and my code conforms to the standards described there to the best of my ability.
- [x] I have created a Jira issue and added the issue ID to my pull request title.
- [x] I have given Solr maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended)
- [x] I have developed this patch against the `master` branch.
- [x] I have run `./gradlew check`.
- [x] I have added tests for my changes.
- [ ] I have added documentation for the [Ref Guide](https://github.com/apache/lucene-solr/tree/master/solr/solr-ref-guide) (for Solr changes only).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] dweiss commented on pull request #2277: LUCENE-9716: Hunspell: support flag usage before its format is even specified

Posted by GitBox <gi...@apache.org>.

dweiss commented on pull request #2277:
URL: https://github.com/apache/lucene-solr/pull/2277#issuecomment-773562354


   No, obviously not. But I think we can empirically check on existing data where those required flags actually are (how much buffer is realistically needed). If there is a file where a flag occurs at 37mb then I think it could just thrown an IOException with an appropriate message. If somebody *really* wants to, they can pass an InputStream supporting mark/reset which would support such a large scan window... I don't have all the details worked out. This is fine as it is now too, but... you know - why not try to make it better. ;)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] donnerpeter commented on pull request #2277: LUCENE-9716: Hunspell: support flag usage before its format is even specified

Posted by GitBox <gi...@apache.org>.

donnerpeter commented on pull request #2277:
URL: https://github.com/apache/lucene-solr/pull/2277#issuecomment-773883764


   It's already merged :)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] donnerpeter commented on pull request #2277: LUCENE-9716: Hunspell: support flag usage before its format is even specified

Posted by GitBox <gi...@apache.org>.

donnerpeter commented on pull request #2277:
URL: https://github.com/apache/lucene-solr/pull/2277#issuecomment-773471401


   As for this:
   
   ```
       // TODO: can these flags change throughout the file? If not then we can abort sooner. And
       // then we wouldn't even need to create a temp file for the affix stream - a large enough
       // leading buffer (BufferedInputStream) would be sufficient?
   ```
   
   Flags aren't supposed to be redefined, although Hunspell currently silently (and I believe erroneously) allows that. Aborting sooner is possible, but would need a bit more code.
   
   I'd very much like to avoid a temp file, but I don't understand how. I don't think we can fix the size of the temp buffer in advance and expect all aff files to obey that, some of them include large comments in the beginning. And unfortunately, the files themselves can be quite large: for Kinyarwanda it's 38MB.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] dweiss commented on pull request #2277: LUCENE-9716: Hunspell: support flag usage before its format is even specified

Posted by GitBox <gi...@apache.org>.

dweiss commented on pull request #2277:
URL: https://github.com/apache/lucene-solr/pull/2277#issuecomment-773888541


   Yup, yup. Saw that after I posted the comment.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] donnerpeter commented on pull request #2277: LUCENE-9716: Hunspell: support flag usage before its format is even specified

Posted by GitBox <gi...@apache.org>.

donnerpeter commented on pull request #2277:
URL: https://github.com/apache/lucene-solr/pull/2277#issuecomment-771711060


   Thank you, LGTM!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] dweiss commented on pull request #2277: LUCENE-9716: Hunspell: support flag usage before its format is even specified

Posted by GitBox <gi...@apache.org>.

dweiss commented on pull request #2277:
URL: https://github.com/apache/lucene-solr/pull/2277#issuecomment-773880886


   I'll commit this patch in and create a separate issue for improving it later.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] dweiss commented on pull request #2277: LUCENE-9716: Hunspell: support flag usage before its format is even specified

Posted by GitBox <gi...@apache.org>.

dweiss commented on pull request #2277:
URL: https://github.com/apache/lucene-solr/pull/2277#issuecomment-773476961






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] donnerpeter commented on pull request #2277: LUCENE-9716: Hunspell: support flag usage before its format is even specified

Posted by GitBox <gi...@apache.org>.

donnerpeter commented on pull request #2277:
URL: https://github.com/apache/lucene-solr/pull/2277#issuecomment-773471401






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] donnerpeter commented on a change in pull request #2277: LUCENE-9716: Hunspell: support flag usage before its format is even specified

Posted by GitBox <gi...@apache.org>.

donnerpeter commented on a change in pull request #2277:
URL: https://github.com/apache/lucene-solr/pull/2277#discussion_r568575767



##########
File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/Dictionary.java
##########
@@ -696,45 +690,25 @@ char affixData(int affixIndex, int offset) {
     return fstCompiler.compile();
   }
 
-  /** pattern accepts optional BOM + SET + any whitespace */
-  static final Pattern ENCODING_PATTERN = Pattern.compile("^(\u00EF\u00BB\u00BF)?SET\\s+");
+  /** Parses the encoding and flag format specified in the provided InputStream */
+  private void readConfig(InputStream affix) throws IOException, ParseException {
+    LineNumberReader reader = new LineNumberReader(new InputStreamReader(affix, DEFAULT_CHARSET));
+    while (true) {
+      String line = reader.readLine();
+      if (line == null) break;
 
-  /**
-   * Parses the encoding specified in the affix file readable through the provided InputStream
-   *
-   * @param affix InputStream for reading the affix file
-   * @return Encoding specified in the affix file
-   * @throws IOException Can be thrown while reading from the InputStream
-   */
-  static String getDictionaryEncoding(InputStream affix) throws IOException {
-    final StringBuilder encoding = new StringBuilder();
-    for (; ; ) {
-      encoding.setLength(0);
-      int ch;
-      while ((ch = affix.read()) >= 0) {
-        if (ch == '\n') {
-          break;
-        }
-        if (ch != '\r') {
-          encoding.append((char) ch);
-        }
-      }
-      if (encoding.length() == 0
-          || encoding.charAt(0) == '#'
-          ||
-          // this test only at the end as ineffective but would allow lines only containing spaces:
-          encoding.toString().trim().length() == 0) {
-        if (ch < 0) {
-          return DEFAULT_CHARSET.name();
-        }
-        continue;
+      line = line.trim();
+
+      while (line.startsWith("\u00EF") || line.startsWith("\u00BB") || line.startsWith("\u00BF")) {

Review comment:
       Most likely it's just on the first line, handling it this way was just easier. Pushback might indeed be more elegant, I'll try that, thanks!
   
   The situation with encoding is complicated indeed. AFAIU the encodings are either ASCII-based 8-bit, or UTF-8, so for the first time we read the file we can safely check Latin letters. At least that's what Hunspell appears to do as well.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] dweiss merged pull request #2277: LUCENE-9716: Hunspell: support flag usage before its format is even specified

Posted by GitBox <gi...@apache.org>.

dweiss merged pull request #2277:
URL: https://github.com/apache/lucene-solr/pull/2277


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] donnerpeter commented on pull request #2277: LUCENE-9716: Hunspell: support flag usage before its format is even specified

Posted by GitBox <gi...@apache.org>.

donnerpeter commented on pull request #2277:
URL: https://github.com/apache/lucene-solr/pull/2277#issuecomment-773501527


   Will this buffer help even if these flags occur at the 37th MB? :)
   
   The mentioned dictionary is in the wooorm collection: https://github.com/wooorm/dictionaries/tree/main/dictionaries/rw


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] dweiss merged pull request #2277: LUCENE-9716: Hunspell: support flag usage before its format is even specified

Posted by GitBox <gi...@apache.org>.

dweiss merged pull request #2277:
URL: https://github.com/apache/lucene-solr/pull/2277


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] dweiss commented on a change in pull request #2277: LUCENE-9716: Hunspell: support flag usage before its format is even specified

Posted by GitBox <gi...@apache.org>.

dweiss commented on a change in pull request #2277:
URL: https://github.com/apache/lucene-solr/pull/2277#discussion_r568569461



##########
File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/Dictionary.java
##########
@@ -696,45 +690,25 @@ char affixData(int affixIndex, int offset) {
     return fstCompiler.compile();
   }
 
-  /** pattern accepts optional BOM + SET + any whitespace */
-  static final Pattern ENCODING_PATTERN = Pattern.compile("^(\u00EF\u00BB\u00BF)?SET\\s+");
+  /** Parses the encoding and flag format specified in the provided InputStream */
+  private void readConfig(InputStream affix) throws IOException, ParseException {
+    LineNumberReader reader = new LineNumberReader(new InputStreamReader(affix, DEFAULT_CHARSET));
+    while (true) {
+      String line = reader.readLine();
+      if (line == null) break;
 
-  /**
-   * Parses the encoding specified in the affix file readable through the provided InputStream
-   *
-   * @param affix InputStream for reading the affix file
-   * @return Encoding specified in the affix file
-   * @throws IOException Can be thrown while reading from the InputStream
-   */
-  static String getDictionaryEncoding(InputStream affix) throws IOException {
-    final StringBuilder encoding = new StringBuilder();
-    for (; ; ) {
-      encoding.setLength(0);
-      int ch;
-      while ((ch = affix.read()) >= 0) {
-        if (ch == '\n') {
-          break;
-        }
-        if (ch != '\r') {
-          encoding.append((char) ch);
-        }
-      }
-      if (encoding.length() == 0
-          || encoding.charAt(0) == '#'
-          ||
-          // this test only at the end as ineffective but would allow lines only containing spaces:
-          encoding.toString().trim().length() == 0) {
-        if (ch < 0) {
-          return DEFAULT_CHARSET.name();
-        }
-        continue;
+      line = line.trim();
+
+      while (line.startsWith("\u00EF") || line.startsWith("\u00BB") || line.startsWith("\u00BF")) {

Review comment:
       Can the bom really be present on any line? Wouldn't a more elegant solution be to use a buffered input stream (or a pushback input stream) and just consume the bom if it's leading the file?
   
   It is a bit awkward that those files are parsed as ascii (well, iso8859-1) and at the same time have utf bom (not to mention that bit where you convert to utf8 from usi8859-1)... Is this encoding situation really so messed up in hunspell?
   

##########
File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/Dictionary.java
##########
@@ -696,45 +690,25 @@ char affixData(int affixIndex, int offset) {
     return fstCompiler.compile();
   }
 
-  /** pattern accepts optional BOM + SET + any whitespace */
-  static final Pattern ENCODING_PATTERN = Pattern.compile("^(\u00EF\u00BB\u00BF)?SET\\s+");
+  /** Parses the encoding and flag format specified in the provided InputStream */
+  private void readConfig(InputStream affix) throws IOException, ParseException {
+    LineNumberReader reader = new LineNumberReader(new InputStreamReader(affix, DEFAULT_CHARSET));
+    while (true) {
+      String line = reader.readLine();
+      if (line == null) break;
 
-  /**
-   * Parses the encoding specified in the affix file readable through the provided InputStream
-   *
-   * @param affix InputStream for reading the affix file
-   * @return Encoding specified in the affix file
-   * @throws IOException Can be thrown while reading from the InputStream
-   */
-  static String getDictionaryEncoding(InputStream affix) throws IOException {
-    final StringBuilder encoding = new StringBuilder();
-    for (; ; ) {
-      encoding.setLength(0);
-      int ch;
-      while ((ch = affix.read()) >= 0) {
-        if (ch == '\n') {
-          break;
-        }
-        if (ch != '\r') {
-          encoding.append((char) ch);
-        }
-      }
-      if (encoding.length() == 0
-          || encoding.charAt(0) == '#'
-          ||
-          // this test only at the end as ineffective but would allow lines only containing spaces:
-          encoding.toString().trim().length() == 0) {
-        if (ch < 0) {
-          return DEFAULT_CHARSET.name();
-        }
-        continue;
+      line = line.trim();
+
+      while (line.startsWith("\u00EF") || line.startsWith("\u00BB") || line.startsWith("\u00BF")) {

Review comment:
       Ok, so it's essentially an unknown byte stream with dynamic charset detection. Not fun. If it's restricted to a reasonable subset (like you said) then a preflight of the content could determine the actual encoding (at least until an explicit encoding declaration is found). Then things would be less messy down the road as you'd just have a Reader to read from... 
   
   Pushback is fine too. Either this or a BufferedInputStream and use mark/reset to adjust stream position after you detect the BOM (or not). As much as I like PushbackInputStream, it predates dinosaurs. :)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] dweiss commented on pull request #2277: LUCENE-9716: Hunspell: support flag usage before its format is even specified

Posted by GitBox <gi...@apache.org>.

dweiss commented on pull request #2277:
URL: https://github.com/apache/lucene-solr/pull/2277#issuecomment-773476961

Hi Peter. The file size itself doesn't matter if we can assume some kind of leader buffer in which these flags have to occur and which we can rewind. Implementing this is technically easy - for example via BufferedInputStream with a reasonably large internal buffer, then a mark on the zero-eth byte. Once you reach your flags, you reset the buffer.

The only problem I see here is a minor potential to have a the buffer limit fall on an UTF8 surrogate, for example, which could potentially trigger some kind of exception... but this can be worked around.

I'll try to do this, time permitting. It's not much of a problem - it can be done later too. I have to limit my time for Lucene to reasonable chunks though. :)

> for Kinyarwanda it's 38MB.

Is this one of the openoffice dictionaries? Once we have them all parse successfully it'd be a good baseline test.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] donnerpeter commented on pull request #2277: LUCENE-9716: Hunspell: support flag usage before its format is even specified

Posted by GitBox <gi...@apache.org>.

donnerpeter commented on pull request #2277:
URL: https://github.com/apache/lucene-solr/pull/2277#issuecomment-771711060


   Thank you, LGTM!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] donnerpeter commented on a change in pull request #2277: LUCENE-9716: Hunspell: support flag usage before its format is even specified

Posted by GitBox <gi...@apache.org>.

donnerpeter commented on a change in pull request #2277:
URL: https://github.com/apache/lucene-solr/pull/2277#discussion_r568575767



##########
File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/Dictionary.java
##########
@@ -696,45 +690,25 @@ char affixData(int affixIndex, int offset) {
     return fstCompiler.compile();
   }
 
-  /** pattern accepts optional BOM + SET + any whitespace */
-  static final Pattern ENCODING_PATTERN = Pattern.compile("^(\u00EF\u00BB\u00BF)?SET\\s+");
+  /** Parses the encoding and flag format specified in the provided InputStream */
+  private void readConfig(InputStream affix) throws IOException, ParseException {
+    LineNumberReader reader = new LineNumberReader(new InputStreamReader(affix, DEFAULT_CHARSET));
+    while (true) {
+      String line = reader.readLine();
+      if (line == null) break;
 
-  /**
-   * Parses the encoding specified in the affix file readable through the provided InputStream
-   *
-   * @param affix InputStream for reading the affix file
-   * @return Encoding specified in the affix file
-   * @throws IOException Can be thrown while reading from the InputStream
-   */
-  static String getDictionaryEncoding(InputStream affix) throws IOException {
-    final StringBuilder encoding = new StringBuilder();
-    for (; ; ) {
-      encoding.setLength(0);
-      int ch;
-      while ((ch = affix.read()) >= 0) {
-        if (ch == '\n') {
-          break;
-        }
-        if (ch != '\r') {
-          encoding.append((char) ch);
-        }
-      }
-      if (encoding.length() == 0
-          || encoding.charAt(0) == '#'
-          ||
-          // this test only at the end as ineffective but would allow lines only containing spaces:
-          encoding.toString().trim().length() == 0) {
-        if (ch < 0) {
-          return DEFAULT_CHARSET.name();
-        }
-        continue;
+      line = line.trim();
+
+      while (line.startsWith("\u00EF") || line.startsWith("\u00BB") || line.startsWith("\u00BF")) {

Review comment:
       Most likely it's just on the first line, handling it this way was just easier. Pushback might indeed be more elegant, I'll try that, thanks!
   
   The situation with encoding is complicated indeed. AFAIU the encodings are either ASCII-based 8-bit, or UTF-8, so for the first time we read the file we can safely check Latin letters. At least that's what Hunspell appears to do as well.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] dweiss commented on pull request #2277: LUCENE-9716: Hunspell: support flag usage before its format is even specified

Posted by GitBox <gi...@apache.org>.

dweiss commented on pull request #2277:
URL: https://github.com/apache/lucene-solr/pull/2277#issuecomment-771677497


   Hi Peter. I pushed a commit which replaces bom consumption with a small function so that either it's read atomically or nothing is consumed at all. Looking at the code also made me wonder whether a sufficiently large leading buffer could be used to just parse the needed stuff from the input stream (bypassing the need to create a temp file)... can be left for a later improvement though.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] dweiss commented on a change in pull request #2277: LUCENE-9716: Hunspell: support flag usage before its format is even specified

Posted by GitBox <gi...@apache.org>.

dweiss commented on a change in pull request #2277:
URL: https://github.com/apache/lucene-solr/pull/2277#discussion_r568569461



##########
File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/Dictionary.java
##########
@@ -696,45 +690,25 @@ char affixData(int affixIndex, int offset) {
     return fstCompiler.compile();
   }
 
-  /** pattern accepts optional BOM + SET + any whitespace */
-  static final Pattern ENCODING_PATTERN = Pattern.compile("^(\u00EF\u00BB\u00BF)?SET\\s+");
+  /** Parses the encoding and flag format specified in the provided InputStream */
+  private void readConfig(InputStream affix) throws IOException, ParseException {
+    LineNumberReader reader = new LineNumberReader(new InputStreamReader(affix, DEFAULT_CHARSET));
+    while (true) {
+      String line = reader.readLine();
+      if (line == null) break;
 
-  /**
-   * Parses the encoding specified in the affix file readable through the provided InputStream
-   *
-   * @param affix InputStream for reading the affix file
-   * @return Encoding specified in the affix file
-   * @throws IOException Can be thrown while reading from the InputStream
-   */
-  static String getDictionaryEncoding(InputStream affix) throws IOException {
-    final StringBuilder encoding = new StringBuilder();
-    for (; ; ) {
-      encoding.setLength(0);
-      int ch;
-      while ((ch = affix.read()) >= 0) {
-        if (ch == '\n') {
-          break;
-        }
-        if (ch != '\r') {
-          encoding.append((char) ch);
-        }
-      }
-      if (encoding.length() == 0
-          || encoding.charAt(0) == '#'
-          ||
-          // this test only at the end as ineffective but would allow lines only containing spaces:
-          encoding.toString().trim().length() == 0) {
-        if (ch < 0) {
-          return DEFAULT_CHARSET.name();
-        }
-        continue;
+      line = line.trim();
+
+      while (line.startsWith("\u00EF") || line.startsWith("\u00BB") || line.startsWith("\u00BF")) {

Review comment:
       Can the bom really be present on any line? Wouldn't a more elegant solution be to use a buffered input stream (or a pushback input stream) and just consume the bom if it's leading the file?
   
   It is a bit awkward that those files are parsed as ascii (well, iso8859-1) and at the same time have utf bom (not to mention that bit where you convert to utf8 from usi8859-1)... Is this encoding situation really so messed up in hunspell?
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] dweiss commented on pull request #2277: LUCENE-9716: Hunspell: support flag usage before its format is even specified

Posted by GitBox <gi...@apache.org>.

dweiss commented on pull request #2277:
URL: https://github.com/apache/lucene-solr/pull/2277#issuecomment-771677497


   Hi Peter. I pushed a commit which replaces bom consumption with a small function so that either it's read atomically or nothing is consumed at all. Looking at the code also made me wonder whether a sufficiently large leading buffer could be used to just parse the needed stuff from the input stream (bypassing the need to create a temp file)... can be left for a later improvement though.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] dweiss commented on a change in pull request #2277: LUCENE-9716: Hunspell: support flag usage before its format is even specified

Posted by GitBox <gi...@apache.org>.

dweiss commented on a change in pull request #2277:
URL: https://github.com/apache/lucene-solr/pull/2277#discussion_r568580925



##########
File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/Dictionary.java
##########
@@ -696,45 +690,25 @@ char affixData(int affixIndex, int offset) {
     return fstCompiler.compile();
   }
 
-  /** pattern accepts optional BOM + SET + any whitespace */
-  static final Pattern ENCODING_PATTERN = Pattern.compile("^(\u00EF\u00BB\u00BF)?SET\\s+");
+  /** Parses the encoding and flag format specified in the provided InputStream */
+  private void readConfig(InputStream affix) throws IOException, ParseException {
+    LineNumberReader reader = new LineNumberReader(new InputStreamReader(affix, DEFAULT_CHARSET));
+    while (true) {
+      String line = reader.readLine();
+      if (line == null) break;
 
-  /**
-   * Parses the encoding specified in the affix file readable through the provided InputStream
-   *
-   * @param affix InputStream for reading the affix file
-   * @return Encoding specified in the affix file
-   * @throws IOException Can be thrown while reading from the InputStream
-   */
-  static String getDictionaryEncoding(InputStream affix) throws IOException {
-    final StringBuilder encoding = new StringBuilder();
-    for (; ; ) {
-      encoding.setLength(0);
-      int ch;
-      while ((ch = affix.read()) >= 0) {
-        if (ch == '\n') {
-          break;
-        }
-        if (ch != '\r') {
-          encoding.append((char) ch);
-        }
-      }
-      if (encoding.length() == 0
-          || encoding.charAt(0) == '#'
-          ||
-          // this test only at the end as ineffective but would allow lines only containing spaces:
-          encoding.toString().trim().length() == 0) {
-        if (ch < 0) {
-          return DEFAULT_CHARSET.name();
-        }
-        continue;
+      line = line.trim();
+
+      while (line.startsWith("\u00EF") || line.startsWith("\u00BB") || line.startsWith("\u00BF")) {

Review comment:
       Ok, so it's essentially an unknown byte stream with dynamic charset detection. Not fun. If it's restricted to a reasonable subset (like you said) then a preflight of the content could determine the actual encoding (at least until an explicit encoding declaration is found). Then things would be less messy down the road as you'd just have a Reader to read from... 
   
   Pushback is fine too. Either this or a BufferedInputStream and use mark/reset to adjust stream position after you detect the BOM (or not). As much as I like PushbackInputStream, it predates dinosaurs. :)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] donnerpeter commented on pull request #2277: LUCENE-9716: Hunspell: support flag usage before its format is even specified

Posted by GitBox <gi...@apache.org>.

donnerpeter commented on pull request #2277:
URL: https://github.com/apache/lucene-solr/pull/2277#issuecomment-773580816


   Checking against available dictionaries looks like a good idea!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org