Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2021/02/02 12:39:19 UTC

[GitHub] [lucene-solr] dweiss commented on a change in pull request #2277: LUCENE-9716: Hunspell: support flag usage before its format is even specified

dweiss commented on a change in pull request #2277:
URL: https://github.com/apache/lucene-solr/pull/2277#discussion_r568569461



##########
File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/Dictionary.java
##########
@@ -696,45 +690,25 @@ char affixData(int affixIndex, int offset) {
     return fstCompiler.compile();
   }
 
-  /** pattern accepts optional BOM + SET + any whitespace */
-  static final Pattern ENCODING_PATTERN = Pattern.compile("^(\u00EF\u00BB\u00BF)?SET\\s+");
+  /** Parses the encoding and flag format specified in the provided InputStream */
+  private void readConfig(InputStream affix) throws IOException, ParseException {
+    LineNumberReader reader = new LineNumberReader(new InputStreamReader(affix, DEFAULT_CHARSET));
+    while (true) {
+      String line = reader.readLine();
+      if (line == null) break;
 
-  /**
-   * Parses the encoding specified in the affix file readable through the provided InputStream
-   *
-   * @param affix InputStream for reading the affix file
-   * @return Encoding specified in the affix file
-   * @throws IOException Can be thrown while reading from the InputStream
-   */
-  static String getDictionaryEncoding(InputStream affix) throws IOException {
-    final StringBuilder encoding = new StringBuilder();
-    for (; ; ) {
-      encoding.setLength(0);
-      int ch;
-      while ((ch = affix.read()) >= 0) {
-        if (ch == '\n') {
-          break;
-        }
-        if (ch != '\r') {
-          encoding.append((char) ch);
-        }
-      }
-      if (encoding.length() == 0
-          || encoding.charAt(0) == '#'
-          ||
-          // this test only at the end as ineffective but would allow lines only containing spaces:
-          encoding.toString().trim().length() == 0) {
-        if (ch < 0) {
-          return DEFAULT_CHARSET.name();
-        }
-        continue;
+      line = line.trim();
+
+      while (line.startsWith("\u00EF") || line.startsWith("\u00BB") || line.startsWith("\u00BF")) {

Review comment:
       Can the BOM really be present on any line? Wouldn't a more elegant solution be to use a buffered input stream (or a pushback input stream) and just consume the BOM if it leads the file?
   
   It is a bit awkward that those files are parsed as ASCII (well, ISO-8859-1) and at the same time have a UTF-8 BOM (not to mention that bit where you convert to UTF-8 from ISO-8859-1)... Is this encoding situation really so messed up in Hunspell?
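
   A minimal sketch of the pushback-stream idea, for illustration only. `BomSkipper` and `skipUtf8Bom` are hypothetical names, not part of this PR or of Lucene, and the sketch assumes only the three-byte UTF-8 BOM (EF BB BF) needs handling. It peeks at the first three bytes once and pushes them back if they are not a BOM, so no per-line check is needed:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

// Hypothetical helper, not part of Lucene: strips a leading UTF-8 BOM once, at the start of the stream.
final class BomSkipper {
  private static final int[] UTF8_BOM = {0xEF, 0xBB, 0xBF};

  /** Returns a stream positioned after a leading UTF-8 BOM, or at the very start if there is none. */
  static InputStream skipUtf8Bom(InputStream in) throws IOException {
    PushbackInputStream pushback = new PushbackInputStream(in, UTF8_BOM.length);
    byte[] head = new byte[UTF8_BOM.length];

    // read() may return fewer bytes than requested, so loop until 3 bytes or end of stream.
    int read = 0;
    while (read < head.length) {
      int n = pushback.read(head, read, head.length - read);
      if (n < 0) {
        break;
      }
      read += n;
    }

    boolean isBom = read == head.length;
    for (int i = 0; isBom && i < head.length; i++) {
      isBom = (head[i] & 0xFF) == UTF8_BOM[i];
    }

    // Not a BOM: put the peeked bytes back so the caller sees the stream unchanged.
    if (!isBom && read > 0) {
      pushback.unread(head, 0, read);
    }
    return pushback;
  }
}
```

   With something like this, the affix stream could be wrapped once before the `LineNumberReader` is created, and the per-line `startsWith` loop would no longer be needed. Whether that fits the rest of `readConfig` is for the PR author to judge.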
   