You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@pinot.apache.org by GitBox <gi...@apache.org> on 2021/09/07 19:51:15 UTC

[GitHub] [pinot] atris opened a new pull request #7405: Introduce Native Text Indices (Core Functionality)

atris opened a new pull request #7405:
URL: https://github.com/apache/pinot/pull/7405


   https://docs.google.com/document/d/1PMhoRy6WF46C4d4mw0LVe9b8Vjqes6vsXZkmxXzMYzw/edit?usp=sharing
   
   This PR implements Phase 1 (core functionality) of the given implementation plan.
   
   Part of #7395 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] codecov-commenter edited a comment on pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
codecov-commenter edited a comment on pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#issuecomment-917372847


   # [Codecov](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#7405](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (5c7032a) into [master](https://codecov.io/gh/apache/pinot/commit/fe14d60a5fd3405cc695f9a3a9d1df73a0dbbea3?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (fe14d60) will **decrease** coverage by `3.51%`.
   > The diff coverage is `47.09%`.
   
   > :exclamation: Current head 5c7032a differs from pull request most recent head 92d6879. Consider uploading reports for the commit 92d6879 to get more accurate results
   [![Impacted file tree graph](https://codecov.io/gh/apache/pinot/pull/7405/graphs/tree.svg?width=650&height=150&src=pr&token=4ibza2ugkz&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #7405      +/-   ##
   ============================================
   - Coverage     71.91%   68.39%   -3.52%     
   - Complexity     3348     3727     +379     
   ============================================
     Files          1517     1154     -363     
     Lines         75039    56046   -18993     
     Branches      10921     8602    -2319     
   ============================================
   - Hits          53961    38334   -15627     
   + Misses        17451    14976    -2475     
   + Partials       3627     2736     -891     
   ```
   
   | Flag | Coverage Δ | |
   |---|---|---|
   | integration1 | `?` | |
   | integration2 | `?` | |
   | unittests1 | `68.39% <47.09%> (-1.31%)` | :arrow_down: |
   | unittests2 | `?` | |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [.../pinot/common/function/scalar/StringFunctions.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtY29tbW9uL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9jb21tb24vZnVuY3Rpb24vc2NhbGFyL1N0cmluZ0Z1bmN0aW9ucy5qYXZh) | `69.64% <0.00%> (-1.27%)` | :arrow_down: |
   | [...pache/pinot/common/utils/grpc/GrpcQueryClient.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtY29tbW9uL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9jb21tb24vdXRpbHMvZ3JwYy9HcnBjUXVlcnlDbGllbnQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | :arrow_down: |
   | [.../java/org/apache/pinot/core/util/GroupByUtils.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtY29yZS9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3QvY29yZS91dGlsL0dyb3VwQnlVdGlscy5qYXZh) | `100.00% <ø> (ø)` | |
   | [...t/local/utils/nativefst/NativeFSTIndexCreator.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvTmF0aXZlRlNUSW5kZXhDcmVhdG9yLmphdmE=) | `0.00% <0.00%> (ø)` | |
   | [...nt/local/utils/nativefst/NativeFSTIndexReader.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvTmF0aXZlRlNUSW5kZXhSZWFkZXIuamF2YQ==) | `0.00% <0.00%> (ø)` | |
   | [...al/utils/nativefst/automaton/AutomatonMatcher.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL0F1dG9tYXRvbk1hdGNoZXIuamF2YQ==) | `0.00% <0.00%> (ø)` | |
   | [...ils/nativefst/automaton/CharacterRunAutomaton.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL0NoYXJhY3RlclJ1bkF1dG9tYXRvbi5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [.../local/utils/nativefst/automaton/RunAutomaton.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL1J1bkF1dG9tYXRvbi5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [...l/utils/nativefst/automaton/SpecialOperations.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL1NwZWNpYWxPcGVyYXRpb25zLmphdmE=) | `0.00% <0.00%> (ø)` | |
   | [...ent/local/utils/nativefst/automaton/StatePair.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL1N0YXRlUGFpci5qYXZh) | `0.00% <0.00%> (ø)` | |
   | ... and [665 more](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [fe14d60...92d6879](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] richardstartin commented on a change in pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
richardstartin commented on a change in pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#discussion_r710137760



##########
File path: pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/nativefst/ConstantArcSizeFST.java
##########
@@ -0,0 +1,159 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.utils.nativefst;
+
+import java.util.Collections;
+import java.util.Map;
+import java.util.Set;
+import org.apache.pinot.segment.local.utils.nativefst.builders.FSTBuilder;
+
+
+/**
+ * A FST with constant-size arc representation produced directly by
+ * {@link FSTBuilder}.
+ *
+ * @see FSTBuilder
+ */
+public final class ConstantArcSizeFST extends FST {
+  /** Size of the target address field (constant for the builder). */
+  public final static int TARGET_ADDRESS_SIZE = 4;
+
+  /** Size of the flags field (constant for the builder). */
+  public final static int FLAGS_SIZE = 1;
+
+  /** Size of the label field (constant for the builder). */
+  public final static int LABEL_SIZE = 1;
+
+  /**
+   * Size of a single arc structure.
+   */
+  public final static int ARC_SIZE = FLAGS_SIZE + LABEL_SIZE + TARGET_ADDRESS_SIZE;
+
+  /** Offset of the flags field inside an arc. */
+  public final static int FLAGS_OFFSET = 0;
+
+  /** Offset of the label field inside an arc. */
+  public final static int LABEL_OFFSET = FLAGS_SIZE;
+
+  /** Offset of the address field inside an arc. */
+  public final static int ADDRESS_OFFSET = LABEL_OFFSET + LABEL_SIZE;
+  /**
+   * An arc flag indicating the target node of an arc corresponds to a final
+   * state.
+   */
+  public final static int BIT_ARC_FINAL = 1 << 1;
+  /** An arc flag indicating the arc is last within its state. */
+  public final static int BIT_ARC_LAST = 1 << 0;
+  /** A dummy address of the terminal state. */
+  public final static int TERMINAL_STATE = 0;
+  /**
+   * An epsilon state. The first and only arc of this state points either to the
+   * root or to the terminal state, indicating an empty automaton.
+   */
+  private final int _epsilon;
+
+  /**
+   * FST data, serialized as a byte array.
+   */
+  private final byte[] _data;
+
+  private Map<Integer, Integer> _outputSymbols;
+
+  /**
+   * @param data
+   *          FST data. There must be no trailing bytes after the last state.
+   */
+  public ConstantArcSizeFST(byte[] data, int epsilon, Map<Integer, Integer> outputSymbols) {
+    assert epsilon == 0 : "Epsilon is not zero?";
+
+    this._epsilon = epsilon;
+    this._data = data;
+    this._outputSymbols = outputSymbols;
+  }
+
+  @Override
+  public int getRootNode() {
+    return getEndNode(getFirstArc(_epsilon));
+  }
+
+  @Override
+  public int getFirstArc(int node) {
+    return node;
+  }
+
+  @Override
+  public int getArc(int node, byte label) {
+    for (int arc = getFirstArc(node); arc != 0; arc = getNextArc(arc)) {
+      if (getArcLabel(arc) == label) {
+        return arc;
+      }
+    }
+    return 0;
+  }
+
+  @Override
+  public int getNextArc(int arc) {
+    if (isArcLast(arc)) {
+      return 0;
+    }
+    return arc + ARC_SIZE;
+  }
+
+  @Override
+  public byte getArcLabel(int arc) {
+    return _data[arc + LABEL_OFFSET];
+  }
+
+  @Override
+  public int getOutputSymbol(int arc) {
+    return _outputSymbols.get(arc);

Review comment:
       Can you add a comment please?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] richardstartin commented on a change in pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
richardstartin commented on a change in pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#discussion_r710143588



##########
File path: pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/nativefst/utils/RegexpMatcher.java
##########
@@ -0,0 +1,170 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.utils.nativefst.utils;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Set;
+import org.apache.pinot.segment.local.utils.nativefst.FST;
+import org.apache.pinot.segment.local.utils.nativefst.automaton.Automaton;
+import org.apache.pinot.segment.local.utils.nativefst.automaton.CharacterRunAutomaton;
+import org.apache.pinot.segment.local.utils.nativefst.automaton.RegExp;
+import org.apache.pinot.segment.local.utils.nativefst.automaton.State;
+import org.apache.pinot.segment.local.utils.nativefst.automaton.Transition;
+
+
+/**
+ * RegexpMatcher is a helper to retrieve matching values for a given regexp query.
+ * Regexp query is converted into an automaton and we run the matching algorithm on FST.
+ *
+ * Two main functions of this class are
+ *   regexMatchOnFST() Function runs matching on FST (See function comments for more details)
+ *   match(input) Function builds the automaton and matches given input.
+ */
+public class RegexpMatcher {
+  private final String _regexQuery;
+  private final FST _fst;
+  private final Automaton _automaton;
+
+  public RegexpMatcher(String regexQuery, FST fst) {
+    _regexQuery = regexQuery;
+    _fst = fst;
+
+    _automaton = new RegExp(_regexQuery).toAutomaton();
+  }
+
+  public static List<Long> regexMatch(String regexQuery, FST fst) {
+    RegexpMatcher matcher = new RegexpMatcher(regexQuery, fst);
+    return matcher.regexMatchOnFST();
+  }
+
+  // Matches "input" string with _regexQuery Automaton.
+  public boolean match(String input) {
+    CharacterRunAutomaton characterRunAutomaton = new CharacterRunAutomaton(_automaton);
+    return characterRunAutomaton.run(input);
+  }
+
+  /**
+   * This function runs matching on automaton built from regexQuery and the FST.
+   * FST stores key (string) to a value (Long). Both are state machines and state transition is based on
+   * a input character.
+   *
+   * This algorithm starts with Queue containing (Automaton Start Node, FST Start Node).
+   * Each step an entry is popped from the queue:
+   *    1) if the automaton state is accept and the FST Node is final (i.e. end node) then the value stored for that FST
+   *       is added to the set of result.
+   *    2) Else next set of transitions on automaton are gathered and for each transition target node for that character
+   *       is figured out in FST Node, resulting pair of (automaton state, fst node) are added to the queue.
+   *    3) This process is bound to complete since we are making progression on the FST (which is a DAG) towards final
+   *       nodes.
+   * @return
+   */
+  public List<Long> regexMatchOnFST() {
+    final List<Path> queue = new ArrayList<>();
+    final List<Long> endNodes = new ArrayList<>();
+
+    if (_automaton.getNumberOfStates() == 0) {
+      return Collections.emptyList();
+    }
+
+    // Automaton start state and FST start node is added to the queue.
+    queue.add(new Path(_automaton.getInitialState(), _fst.getRootNode(), 0, new ArrayList<>()));
+
+    Set<State> acceptStates = _automaton.getAcceptStates();
+    while (queue.size() != 0) {
+      final Path path = queue.remove(queue.size() - 1);
+
+      // If automaton is in accept state and the fstNode is final (i.e. end node) then add the entry to endNodes which
+      // contains the result set.
+      if (acceptStates.contains(path._state)) {
+        if (_fst.isArcFinal(path._fstArc)) {
+          endNodes.add((long) _fst.getOutputSymbol(path._fstArc));

Review comment:
       Why cast to `long` here?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] atris commented on a change in pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
atris commented on a change in pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#discussion_r716411483



##########
File path: pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/nativefst/ImmutableFST.java
##########
@@ -0,0 +1,406 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.utils.nativefst;
+
+import java.io.DataInputStream;
+import java.io.IOException;
+import java.io.InputStream;
+import java.util.Collections;
+import java.util.EnumSet;
+import java.util.Map;
+import java.util.Set;
+import org.apache.pinot.segment.local.io.readerwriter.PinotDataBufferMemoryManager;
+import org.apache.pinot.segment.local.realtime.impl.dictionary.OffHeapMutableBytesStore;
+import org.apache.pinot.spi.utils.Pair;
+
+
+/**
+ * FST binary format implementation
+ *
+ * <p>
+ * This version indicates the dictionary was built with these flags:
+ * {@link FSTFlags#FLEXIBLE}, {@link FSTFlags#STOPBIT} and
+ * {@link FSTFlags#NEXTBIT}. The internal representation of the FST must
+ * therefore follow this description (please note this format describes only a
+ * single transition (arc), not the entire dictionary file).
+ *
+ * <pre>
+ * ---- this node header present only if automaton was compiled with NUMBERS option.
+ * Byte
+ *        +-+-+-+-+-+-+-+-+\
+ *      0 | | | | | | | | | \  LSB
+ *        +-+-+-+-+-+-+-+-+  +
+ *      1 | | | | | | | | |  |      number of strings recognized
+ *        +-+-+-+-+-+-+-+-+  +----- by the automaton starting
+ *        : : : : : : : : :  |      from this node.
+ *        +-+-+-+-+-+-+-+-+  +
+ *  ctl-1 | | | | | | | | | /  MSB
+ *        +-+-+-+-+-+-+-+-+/
+ *
+ * ---- remaining part of the node
+ * Length of output symbols dictionary -- Integer
+ * <Arc ID, Output Symbol>
+ * <Arc ID, Output Symbol>
+ * <Arc ID, Output Symbol>
+ * .
+ * .
+ * .
+ * <Arc ID, Output Symbol> (Length)
+ *
+ * Byte
+ *       +-+-+-+-+-+-+-+-+\
+ *     0 | | | | | | | | | +------ label
+ *       +-+-+-+-+-+-+-+-+/
+ *
+ *                  +------------- node pointed to is next
+ *                  | +----------- the last arc of the node
+ *                  | | +--------- the arc is final
+ *                  | | |
+ *             +-----------+
+ *             |    | | |  |
+ *         ___+___  | | |  |
+ *        /       \ | | |  |
+ *       MSB           LSB |
+ *        7 6 5 4 3 2 1 0  |
+ *       +-+-+-+-+-+-+-+-+ |
+ *     1 | | | | | | | | | \ \
+ *       +-+-+-+-+-+-+-+-+  \ \  LSB
+ *       +-+-+-+-+-+-+-+-+     +
+ *     2 | | | | | | | | |     |
+ *       +-+-+-+-+-+-+-+-+     |
+ *     3 | | | | | | | | |     +----- target node address (in bytes)
+ *       +-+-+-+-+-+-+-+-+     |      (not present except for the byte
+ *       : : : : : : : : :     |       with flags if the node pointed to
+ *       +-+-+-+-+-+-+-+-+     +       is next)
+ *   gtl | | | | | | | | |    /  MSB
+ *       +-+-+-+-+-+-+-+-+   /
+ * gtl+1                           (gtl = gotoLength)
+ * </pre>
+ */
+public final class ImmutableFST extends FST {
+  /**
+   * Default filler byte.
+   */
+  public final static byte DEFAULT_FILLER = '_';
+
+  /**
+   * Default annotation byte.
+   */
+  public final static byte DEFAULT_ANNOTATION = '+';
+
+  /**
+   * Automaton version as in the file header.
+   */
+  public static final byte VERSION = 5;
+
+  /**
+   * Bit indicating that an arc corresponds to the last character of a sequence
+   * available when building the automaton.
+   */
+  public static final int BIT_FINAL_ARC = 1 << 0;
+
+  /**
+   * Bit indicating that an arc is the last one of the node's list and the
+   * following one belongs to another node.
+   */
+  public static final int BIT_LAST_ARC = 1 << 1;
+
+  /**
+   * Bit indicating that the target node of this arc follows it in the
+   * compressed automaton structure (no goto field).
+   */
+  public static final int BIT_TARGET_NEXT = 1 << 2;
+
+  /**
+   * An offset in the arc structure, where the address and flags field begins.
+   * In version 5 of FST automata, this value is constant (1, skip label).
+   */
+  public final static int ADDRESS_OFFSET = 1;
+
+  private static final int PER_BUFFER_SIZE = 16;
+
+  /**
+   * An array of bytes with the internal representation of the automaton. Please
+   * see the documentation of this class for more information on how this
+   * structure is organized.
+   */
+  public final OffHeapMutableBytesStore _mutableBytesStore;
+  /**
+   * The length of the node header structure (if the automaton was compiled with
+   * <code>NUMBERS</code> option). Otherwise zero.
+   */
+  public final int _nodeDataLength;
+  /**
+   * Number of bytes each address takes in full, expanded form (goto length).
+   */
+  public final int _gotoLength;
+  /** Filler character. */
+  public final byte _filler;
+  /** Annotation character. */
+  public final byte _annotation;
+  public Map<Integer, Integer> _outputSymbols;
+  /**
+   * Flags for this automaton version.
+   */
+  private Set<FSTFlags> _flags;
+
+  /**
+   * Read and wrap a binary automaton in FST version 5.
+   */
+  ImmutableFST(InputStream stream, boolean hasOutputSymbols, PinotDataBufferMemoryManager memoryManager)
+      throws IOException {
+    DataInputStream in = new DataInputStream(stream);
+
+    this._filler = in.readByte();
+    this._annotation = in.readByte();
+    final byte hgtl = in.readByte();
+
+    _mutableBytesStore = new OffHeapMutableBytesStore(memoryManager, "ImmutableFST");
+
+    /*
+     * Determine if the automaton was compiled with NUMBERS. If so, modify
+     * ctl and goto fields accordingly.
+     */
+    _flags = EnumSet.of(FSTFlags.FLEXIBLE, FSTFlags.STOPBIT, FSTFlags.NEXTBIT);
+    if ((hgtl & 0xf0) != 0) {
+      _flags.add(FSTFlags.NUMBERS);
+    }
+
+    _flags = Collections.unmodifiableSet(_flags);
+
+    this._nodeDataLength = (hgtl >>> 4) & 0x0f;
+    this._gotoLength = hgtl & 0x0f;
+
+    if (hasOutputSymbols) {
+      final int outputSymbolsLength = in.readInt();
+      byte[] outputSymbolsBuffer = readRemaining(in, outputSymbolsLength);
+
+      if (outputSymbolsBuffer.length > 0) {
+        String outputSymbolsSerialized = new String(outputSymbolsBuffer);
+
+        _outputSymbols = buildMap(outputSymbolsSerialized);
+      }
+    }
+
+    readRemaining(in);
+  }
+
+  protected final void readRemaining(InputStream in)
+      throws IOException {
+    byte[] buffer = new byte[PER_BUFFER_SIZE];
+    while ((in.read(buffer)) >= 0) {
+      _mutableBytesStore.add(buffer);
+    }
+  }
+
+  /**
+   * Returns the start node of this automaton.
+   */
+  @Override
+  public int getRootNode() {
+    // Skip dummy node marking terminating state.
+    final int epsilonNode = skipArc(getFirstArc(0));
+
+    // And follow the epsilon node's first (and only) arc.
+    return getDestinationNodeOffset(getFirstArc(epsilonNode));
+  }
+
+  /**
+   * {@inheritDoc}
+   */
+  @Override
+  public final int getFirstArc(int node) {
+    return _nodeDataLength + node;
+  }
+
+  /**
+   * {@inheritDoc}
+   */
+  @Override
+  public final int getNextArc(int arc) {
+    if (isArcLast(arc)) {
+      return 0;
+    } else {
+      return skipArc(arc);
+    }
+  }
+
+  /**
+   * {@inheritDoc}
+   */
+  @Override
+  public int getArc(int node, byte label) {
+    for (int arc = getFirstArc(node); arc != 0; arc = getNextArc(arc)) {
+      if (getArcLabel(arc) == label) {
+        return arc;
+      }
+    }
+
+    // An arc labeled with "label" not found.
+    return 0;
+  }
+
+  /**
+   * {@inheritDoc}
+   */
+  @Override
+  public int getEndNode(int arc) {
+    final int nodeOffset = getDestinationNodeOffset(arc);
+    return nodeOffset;
+  }
+
+  /**
+   * {@inheritDoc}
+   */
+  @Override
+  public byte getArcLabel(int arc) {
+    return getByte(arc, 0);
+  }
+
+  @Override
+  public int getOutputSymbol(int arc) {
+    return _outputSymbols.get(arc);
+  }
+
+  /**
+   * {@inheritDoc}
+   */
+  @Override
+  public boolean isArcFinal(int arc) {
+    return (getByte(arc, ADDRESS_OFFSET) & BIT_FINAL_ARC) != 0;
+  }
+
+  /**
+   * {@inheritDoc}
+   */
+  @Override
+  public boolean isArcTerminal(int arc) {
+    return (0 == getDestinationNodeOffset(arc));
+  }
+
+  /**
+   * Returns the number encoded at the given node. The number equals the count
+   * of the set of suffixes reachable from <code>node</code> (called its right
+   * language).
+   */
+  @Override
+  public int getRightLanguageCount(int node) {
+    assert getFlags().contains(FSTFlags.NUMBERS) : "This FST was not compiled with NUMBERS.";
+    return decodeFromBytes(node, _nodeDataLength);
+  }
+
+  /**
+   * {@inheritDoc}
+   *
+   * <p>
+   * For this automaton version, an additional {@link FSTFlags#NUMBERS} flag may
+   * be set to indicate the automaton contains extra fields for each node.
+   * </p>
+   */
+  @Override
+  public Set<FSTFlags> getFlags() {
+    return _flags;
+  }
+
+  /**
+   * Returns <code>true</code> if this arc has <code>NEXT</code> bit set.
+   *
+   * @see #BIT_LAST_ARC
+   * @param arc The node's arc identifier.
+   * @return Returns true if the argument is the last arc of a node.
+   */
+  public boolean isArcLast(int arc) {
+    return (getByte(arc, ADDRESS_OFFSET) & BIT_LAST_ARC) != 0;
+  }
+
+  /**
+   * @see #BIT_TARGET_NEXT
+   * @param arc The node's arc identifier.
+   * @return Returns true if {@link #BIT_TARGET_NEXT} is set for this arc.
+   */
+  public boolean isNextSet(int arc) {
+
+    return (getByte(arc, ADDRESS_OFFSET) & BIT_TARGET_NEXT) != 0;
+  }
+
+  /**
+   * Returns an n-byte integer encoded in byte-packed representation.
+   */
+  final int decodeFromBytes(final int start, final int n) {
+    int r = 0;
+
+    for (int i = n; --i >= 0; ) {
+      Pair<Integer, Integer> offheapOffsets = getOffheapOffsets(start + i);
+      byte[] inputData = _mutableBytesStore.get(offheapOffsets.getFirst());
+
+      r = r << 8 | (inputData[offheapOffsets.getSecond()] & 0xff);
+    }
+    return r;
+  }
+
+  /**
+   * Returns the address of the node pointed to by this arc.
+   */
+  final int getDestinationNodeOffset(int arc) {
+    if (isNextSet(arc)) {
+      /* The destination node follows this arc in the array. */
+      return skipArc(arc);
+    } else {
+      /*
+       * The destination node address has to be extracted from the arc's
+       * goto field.
+       */
+      return decodeFromBytes(arc + ADDRESS_OFFSET, _gotoLength) >>> 3;
+    }
+  }
+
+  /**
+   * Read the arc's layout and skip as many bytes, as needed.
+   */
+  private int skipArc(int offset) {
+    return offset + (isNextSet(offset) ? 1 + 1   /* label + flags */ : 1 + _gotoLength /* label + flags/address */);
+  }
+
+  private byte getByte(int seek, int offset) {
+    Pair<Integer, Integer> offheapOffsets = getOffheapOffsets(seek);
+
+    int fooArc = offheapOffsets.getFirst();
+    byte[] retVal = _mutableBytesStore.get((fooArc));
+
+    int barArc = offheapOffsets.getSecond();
+    int target = barArc + offset;
+
+    if (target >= PER_BUFFER_SIZE) {
+      retVal = _mutableBytesStore.get(fooArc + 1);
+      target = target - PER_BUFFER_SIZE;
+    }
+
+    return retVal[target];
+  }
+
+  private Pair<Integer, Integer> getOffheapOffsets(int seek) {
+    int fooArc = seek >= PER_BUFFER_SIZE ? seek / PER_BUFFER_SIZE : 0;
+    int barArc = seek >= PER_BUFFER_SIZE ? seek - ((fooArc) * PER_BUFFER_SIZE) : seek;
+
+    assert fooArc < _mutableBytesStore.getNumValues();
+    assert barArc < PER_BUFFER_SIZE;
+
+    return new Pair<>(fooArc, barArc);
+  }

Review comment:
       Since barArc depends on fooArc, is inlining possible here? 

##########
File path: pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/nativefst/ByteSequenceIterator.java
##########
@@ -0,0 +1,180 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.utils.nativefst;
+
+import java.nio.ByteBuffer;
+import java.util.Arrays;
+import java.util.Iterator;
+import java.util.NoSuchElementException;
+
+
+/**
+ * An iterator that traverses the right language of a given node (all sequences
+ * reachable from a given node).
+ */
+public final class ByteSequenceIterator implements Iterator<ByteBuffer> {
+  /**
+   * Default expected depth of the recursion stack (estimated longest sequence
+   * in the automaton). Buffers expand by the same value if exceeded.
+   */
+  private final static int EXPECTED_MAX_STATES = 15;
+
+  /** The FST to which this iterator belongs. */
+  private final FST _fst;
+
+  /** An internal cache for the next element in the FST */
+  private ByteBuffer _nextElement;
+
+  /**
+   * A buffer for the current sequence of bytes from the current node to the
+   * root.
+   */
+  private byte[] _buffer = new byte[EXPECTED_MAX_STATES];
+
+  /** Reusable byte buffer wrapper around {@link #_buffer}. */
+  private ByteBuffer _bufferWrapper = ByteBuffer.wrap(_buffer);
+
+  /** An arc stack for DFS when processing the automaton. */
+  private int[] _arcs = new int[EXPECTED_MAX_STATES];
+
+  /** Current processing depth in {@link #_arcs}. */
+  private int _position;
+
+  /**
+   * Create an instance of the iterator for a given node.
+   * @param fst The automaton to iterate over.
+   * @param node The starting node's identifier (can be the {@link FST#getRootNode()}).
+   */
+  public ByteSequenceIterator(FST fst, int node) {
+    this._fst = fst;
+
+    if (fst.getFirstArc(node) != 0) {
+      restartFrom(node);
+    }
+  }
+
+  /**
+   * Restart walking from <code>node</code>. Allows iterator reuse.
+   *
+   * @param node Restart the iterator from <code>node</code>.
+   * @return Returns <code>this</code> for call chaining.
+   */
+  public ByteSequenceIterator restartFrom(int node) {
+    _position = 0;
+    _bufferWrapper.clear();
+    _nextElement = null;
+
+    pushNode(node);
+    return this;
+  }
+
+  /** Returns <code>true</code> if there are still elements in this iterator. */
+  @Override
+  public boolean hasNext() {
+    if (_nextElement == null) {
+      _nextElement = advance();
+    }
+
+    return _nextElement != null;
+  }
+
+  /**
+   * @return Returns a {@link ByteBuffer} with the sequence corresponding to the
+   *         next final state in the automaton.
+   */
+  @Override
+  public ByteBuffer next() {
+    if (_nextElement != null) {
+      final ByteBuffer cache = _nextElement;
+      _nextElement = null;
+      return cache;
+    } else {
+      final ByteBuffer cache = advance();
+      if (cache == null) {
+        throw new NoSuchElementException();
+      }
+      return cache;
+    }
+  }
+
+  /**
+   * Advances to the next available final state.
+   */
+  private final ByteBuffer advance() {
+    if (_position == 0) {
+      return null;
+    }
+
+    while (_position > 0) {
+      final int lastIndex = _position - 1;
+      final int arc = _arcs[lastIndex];
+
+      if (arc == 0) {
+        // Remove the current node from the queue.
+        _position--;
+        continue;
+      }
+
+      // Go to the next arc, but leave it on the stack
+      // so that we keep the recursion depth level accurate.
+      _arcs[lastIndex] = _fst.getNextArc(arc);
+
+      // Expand buffer if needed.
+      final int bufferLength = this._buffer.length;
+      if (lastIndex >= bufferLength) {
+        this._buffer = Arrays.copyOf(_buffer, bufferLength + EXPECTED_MAX_STATES);
+        this._bufferWrapper = ByteBuffer.wrap(_buffer);
+      }
+      _buffer[lastIndex] = _fst.getArcLabel(arc);
+
+      if (!_fst.isArcTerminal(arc)) {
+        // Recursively descend into the arc's node.
+        pushNode(_fst.getEndNode(arc));
+      }
+
+      if (_fst.isArcFinal(arc)) {
+        _bufferWrapper.clear();
+        _bufferWrapper.limit(lastIndex + 1);
+        return _bufferWrapper;
+      }
+    }
+
+    return null;
+  }
+
+  /**
+   * Not implemented in this iterator.
+   */
+  @Override
+  public void remove() {
+    throw new UnsupportedOperationException("Read-only iterator.");
+  }
+
+  /**
+   * Descends to a given node, adds its arcs to the stack to be traversed.
+   */
+  private void pushNode(int node) {
+    // Expand buffers if needed.
+    if (_position == _arcs.length) {
+      _arcs = Arrays.copyOf(_arcs, _arcs.length + EXPECTED_MAX_STATES);
+    }
+
+    _arcs[_position++] = _fst.getFirstArc(node);
+  }
+}

Review comment:
       Strange. My IDE shows newlines at the end of all files, and I think Checkstyle checks the same as well?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] codecov-commenter edited a comment on pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
codecov-commenter edited a comment on pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#issuecomment-917372847


   # [Codecov](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#7405](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (b0f7138) into [master](https://codecov.io/gh/apache/pinot/commit/fe14d60a5fd3405cc695f9a3a9d1df73a0dbbea3?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (fe14d60) will **decrease** coverage by `40.81%`.
   > The diff coverage is `0.00%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/pinot/pull/7405/graphs/tree.svg?width=650&height=150&src=pr&token=4ibza2ugkz&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@              Coverage Diff              @@
   ##             master    #7405       +/-   ##
   =============================================
   - Coverage     71.91%   31.09%   -40.82%     
   =============================================
     Files          1517     1539       +22     
     Lines         75039    78040     +3001     
     Branches      10921    11559      +638     
   =============================================
   - Hits          53961    24264    -29697     
   - Misses        17451    51695    +34244     
   + Partials       3627     2081     -1546     
   ```
   
   | Flag | Coverage Δ | |
   |---|---|---|
   | integration1 | `29.52% <0.00%> (-1.11%)` | :arrow_down: |
   | integration2 | `28.01% <0.00%> (-1.10%)` | :arrow_down: |
   | unittests1 | `?` | |
   | unittests2 | `?` | |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [...nt/local/utils/nativefst/ByteSequenceIterator.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvQnl0ZVNlcXVlbmNlSXRlcmF0b3IuamF2YQ==) | `0.00% <0.00%> (ø)` | |
   | [...ment/local/utils/nativefst/ConstantArcSizeFST.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvQ29uc3RhbnRBcmNTaXplRlNULmphdmE=) | `0.00% <0.00%> (ø)` | |
   | [...pache/pinot/segment/local/utils/nativefst/FST.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvRlNULmphdmE=) | `0.00% <0.00%> (ø)` | |
   | [.../pinot/segment/local/utils/nativefst/FSTFlags.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvRlNURmxhZ3MuamF2YQ==) | `0.00% <0.00%> (ø)` | |
   | [...pinot/segment/local/utils/nativefst/FSTHeader.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvRlNUSGVhZGVyLmphdmE=) | `0.00% <0.00%> (ø)` | |
   | [...ot/segment/local/utils/nativefst/FSTTraversal.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvRlNUVHJhdmVyc2FsLmphdmE=) | `0.00% <0.00%> (ø)` | |
   | [...ot/segment/local/utils/nativefst/ImmutableFST.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvSW1tdXRhYmxlRlNULmphdmE=) | `0.00% <0.00%> (ø)` | |
   | [...not/segment/local/utils/nativefst/MatchResult.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvTWF0Y2hSZXN1bHQuamF2YQ==) | `0.00% <0.00%> (ø)` | |
   | [...t/local/utils/nativefst/NativeFSTIndexCreator.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvTmF0aXZlRlNUSW5kZXhDcmVhdG9yLmphdmE=) | `0.00% <0.00%> (ø)` | |
   | [...nt/local/utils/nativefst/NativeFSTIndexReader.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvTmF0aXZlRlNUSW5kZXhSZWFkZXIuamF2YQ==) | `0.00% <0.00%> (ø)` | |
   | ... and [1075 more](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [fe14d60...b0f7138](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] atris commented on a change in pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
atris commented on a change in pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#discussion_r716556231



##########
File path: pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/nativefst/automaton/TransitionComparator.java
##########
@@ -0,0 +1,80 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.pinot.segment.local.utils.nativefst.automaton;
+
+import java.io.Serializable;
+import java.util.Comparator;
+
+
+class TransitionComparator implements Comparator<Transition>, Serializable {

Review comment:
       Transition is designed to really be pluggable -- users should be able to write their own versions of Transition without having to worry about comparisons unless they want a custom order. This class exists more from an extensibility perspective




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] codecov-commenter edited a comment on pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
codecov-commenter edited a comment on pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#issuecomment-917372847


   # [Codecov](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#7405](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (64f5e76) into [master](https://codecov.io/gh/apache/pinot/commit/fe14d60a5fd3405cc695f9a3a9d1df73a0dbbea3?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (fe14d60) will **decrease** coverage by `7.59%`.
   > The diff coverage is `44.81%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/pinot/pull/7405/graphs/tree.svg?width=650&height=150&src=pr&token=4ibza2ugkz&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #7405      +/-   ##
   ============================================
   - Coverage     71.91%   64.31%   -7.60%     
   - Complexity     3348     3834     +486     
   ============================================
     Files          1517     1502      -15     
     Lines         75039    76500    +1461     
     Branches      10921    11392     +471     
   ============================================
   - Hits          53961    49201    -4760     
   - Misses        17451    23778    +6327     
   + Partials       3627     3521     -106     
   ```
   
   | Flag | Coverage Δ | |
   |---|---|---|
   | integration1 | `?` | |
   | integration2 | `?` | |
   | unittests1 | `68.46% <44.81%> (-1.24%)` | :arrow_down: |
   | unittests2 | `13.95% <0.00%> (-0.58%)` | :arrow_down: |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [...t/local/utils/nativefst/NativeFSTIndexCreator.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvTmF0aXZlRlNUSW5kZXhDcmVhdG9yLmphdmE=) | `0.00% <0.00%> (ø)` | |
   | [...nt/local/utils/nativefst/NativeFSTIndexReader.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvTmF0aXZlRlNUSW5kZXhSZWFkZXIuamF2YQ==) | `0.00% <0.00%> (ø)` | |
   | [...al/utils/nativefst/automaton/AutomatonMatcher.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL0F1dG9tYXRvbk1hdGNoZXIuamF2YQ==) | `0.00% <0.00%> (ø)` | |
   | [...ils/nativefst/automaton/CharacterRunAutomaton.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL0NoYXJhY3RlclJ1bkF1dG9tYXRvbi5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [.../local/utils/nativefst/automaton/RunAutomaton.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL1J1bkF1dG9tYXRvbi5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [...l/utils/nativefst/automaton/SpecialOperations.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL1NwZWNpYWxPcGVyYXRpb25zLmphdmE=) | `0.00% <0.00%> (ø)` | |
   | [...ent/local/utils/nativefst/automaton/StatePair.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL1N0YXRlUGFpci5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [...ils/nativefst/automaton/StringUnionOperations.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL1N0cmluZ1VuaW9uT3BlcmF0aW9ucy5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [...gment/local/utils/nativefst/builders/FSTUtils.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYnVpbGRlcnMvRlNUVXRpbHMuamF2YQ==) | `14.28% <14.28%> (ø)` | |
   | [...local/utils/nativefst/automaton/BasicAutomata.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL0Jhc2ljQXV0b21hdGEuamF2YQ==) | `20.00% <20.00%> (ø)` | |
   | ... and [460 more](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [fe14d60...64f5e76](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] richardstartin commented on pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
richardstartin commented on pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#issuecomment-920937262


   This is looking quite promising to me 👍🏻 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] atris commented on a change in pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
atris commented on a change in pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#discussion_r716556764



##########
File path: pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/nativefst/automaton/Automaton.java
##########
@@ -0,0 +1,653 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.pinot.segment.local.utils.nativefst.automaton;
+
+import java.io.Serializable;
+import java.util.Arrays;
+import java.util.BitSet;
+import java.util.Collection;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.LinkedList;
+import java.util.List;
+import java.util.Set;
+
+
+/**
+ * Finite-state automaton with regular expression operations.
+ * <p>
+ * Class invariants:
+ * <ul>
+ * <li> An automaton is either represented explicitly (with {@link State} and {@link Transition} objects)
+ *      or with a singleton string ({@link #expandSingleton()}) in case
+ *      the automaton is known to accept exactly one string.
+ *      (Implicitly, all states and transitions of an automaton are reachable from its initial state.)
+ * <li> Automata are always reduced (see {@link #reduce()}) 
+ *      and have no transitions to dead states (see {@link #removeDeadTransitions()}).
+ * <li> Automata provided as input to operations are generally assumed to be disjoint.
+ * </ul>
+ * <p>
+ */
+public class Automaton implements Serializable, Cloneable {
+
+  /**
+   * Minimize using Huffman's O(n<sup>2</sup>) algorithm.
+   * This is the standard text-book algorithm.
+   */
+  public static final int MINIMIZE_HUFFMAN = 0;
+  /**
+   * Minimize using Brzozowski's O(2<sup>n</sup>) algorithm.
+   * This algorithm uses the reverse-determinize-reverse-determinize trick, which has a bad
+   * worst-case behavior but often works very well in practice
+   * (even better than Hopcroft's!).
+   */
+  public static final int MINIMIZE_BRZOZOWSKI = 1;
+  /**
+   * Minimize using Hopcroft's O(n log n) algorithm.
+   */
+  public static final int MINIMIZE_HOPCROFT = 2;
+  /**
+   * Minimize using Valmari's O(n + m log m) algorithm.
+   */
+  public static final int MINIMIZE_VALMARI = 3;
+
+  /** Minimize always flag. */
+  public static boolean _minimizeAlways = false;
+
+  /** Selects whether operations may modify the input automata (default: <code>false</code>). */
+  public static boolean _allowMutation = false;
+
+  /** Selects minimization algorithm (default: <code>MINIMIZE_HOPCROFT</code>). */
+  public static int _minimization = MINIMIZE_HOPCROFT;
+
+  /** Initial state of this automaton. */
+  State _initial;
+
+  /** If true, then this automaton is definitely deterministic
+   (i.e., there are no choices for any run, but a run may crash). */
+  boolean _deterministic;
+
+  /** Hash code. Recomputed by {@link #minimize()}. */
+  int _hashCode;
+
+  /** Singleton string. Null if not applicable. */
+  String _singleton;
+
+  /**
+   * Constructs a new automaton that accepts the empty language.
+   * Using this constructor, automata can be constructed manually from
+   * {@link State} and {@link Transition} objects.
+   * @see #setInitialState(State)
+   * @see State
+   * @see Transition
+   */
+  public Automaton() {
+    _initial = new State();
+    _deterministic = true;
+    _singleton = null;
+  }
+
+  /**
+   * Sets or resets allow mutate flag.
+   * If this flag is set, then all automata operations may modify automata given as input;
+   * otherwise, operations will always leave input automata languages unmodified.
+   * By default, the flag is not set.
+   * @param flag if true, the flag is set
+   * @return previous value of the flag
+   */
+  static public boolean setAllowMutate(boolean flag) {
+    boolean b = _allowMutation;
+    _allowMutation = flag;
+    return b;
+  }
+
+  /**
+   * Assigns consecutive numbers to the given states.
+   */
+  static void setStateNumbers(Set<State> states) {
+    if (states.size() == Integer.MAX_VALUE) {
+      throw new IllegalArgumentException("number of states exceeded Integer.MAX_VALUE");
+    }
+    int number = 0;
+    for (State s : states) {
+      s._number = number++;
+    }
+  }
+
+  /**
+   * Returns a sorted array of transitions for each state (and sets state numbers).
+   */
+  static Transition[][] getSortedTransitions(Set<State> states) {
+    setStateNumbers(states);
+    Transition[][] transitions = new Transition[states.size()][];
+    for (State s : states) {
+      transitions[s._number] = s.getSortedTransitionArray(false);
+    }
+    return transitions;
+  }
+
+  /**
+   * See {@link MinimizationOperations#minimize(Automaton)}.
+   * Returns the automaton being given as argument.
+   */
+  public static Automaton minimize(Automaton a) {
+    a.minimize();
+    return a;
+  }
+
+  void checkMinimizeAlways() {
+    if (_minimizeAlways) {
+      minimize();
+    }
+  }
+
+  boolean isSingleton() {
+    return _singleton != null;
+  }
+
+  /**
+   * Gets initial state.
+   * @return state
+   */
+  public State getInitialState() {
+    expandSingleton();
+    return _initial;
+  }
+
+  /**
+   * Sets initial state.
+   * @param s state
+   */
+  public void setInitialState(State s) {
+    _initial = s;
+    _singleton = null;
+  }
+
+  /**
+   * Returns the set of states that are reachable from the initial state.
+   * @return set of {@link State} objects
+   */
+  public Set<State> getStates() {
+    expandSingleton();
+    Set<State> visited;
+
+    visited = new HashSet<>();
+    LinkedList<State> worklist = new LinkedList<State>();
+    worklist.add(_initial);
+    visited.add(_initial);
+    while (!worklist.isEmpty()) {
+      State s = worklist.removeFirst();
+      Collection<Transition> tr;
+
+      tr = s._transitionSet;
+      for (Transition t : tr) {
+        if (!visited.contains(t._to)) {
+          visited.add(t._to);
+          worklist.add(t._to);
+        }
+      }
+    }
+    return visited;
+  }
+
+  /**
+   * Returns the set of reachable accept states.
+   * @return set of {@link State} objects
+   */
+  public Set<State> getAcceptStates() {
+    expandSingleton();
+    HashSet<State> accepts = new HashSet<State>();
+    BitSet visited = new BitSet();
+    LinkedList<State> worklist = new LinkedList<State>();
+    worklist.add(_initial);
+    visited.set(_initial._id);

Review comment:
       _number is mutable -- it can changed post the reduction of an automaton.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] richardstartin commented on a change in pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
richardstartin commented on a change in pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#discussion_r710119517



##########
File path: pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/nativefst/NativeFSTIndexReader.java
##########
@@ -0,0 +1,88 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.utils.nativefst;
+
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.util.ArrayList;
+import java.util.List;
+import org.apache.avro.util.ByteBufferInputStream;
+import org.apache.pinot.segment.local.utils.nativefst.utils.RegexpMatcher;
+import org.apache.pinot.segment.spi.index.reader.TextIndexReader;
+import org.apache.pinot.segment.spi.memory.PinotDataBuffer;
+import org.roaringbitmap.buffer.ImmutableRoaringBitmap;
+import org.roaringbitmap.buffer.MutableRoaringBitmap;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+
+/**
+ * This class loads FST index from PinotDataBuffer and creates a FST reader which
+ * is used in finding matching results for regexp queries. Since FST index currently
+ * stores dict ids as values this class only implements getDictIds method.
+ *
+ * This class works on top of ImmutableFST.
+ *
+ */
+public class NativeFSTIndexReader implements TextIndexReader {
+  public static final Logger LOGGER =
+      LoggerFactory.getLogger(org.apache.pinot.segment.local.segment.index.readers.LuceneFSTIndexReader.class);
+
+  private final PinotDataBuffer _dataBuffer;
+
+  private final FST _readFST;
+
+  public NativeFSTIndexReader(PinotDataBuffer pinotDataBuffer)
+      throws IOException {
+    this._dataBuffer = pinotDataBuffer;
+
+    List<ByteBuffer> inputList = new ArrayList<>();
+
+    inputList.add(_dataBuffer.toDirectByteBuffer(0, (int) _dataBuffer.size()));
+
+    this._readFST =
+        FST.read(new ByteBufferInputStream(inputList), ImmutableFST.class, true);
+  }
+
+  @Override
+  public MutableRoaringBitmap getDocIds(String searchQuery) {
+    throw new RuntimeException("LuceneFSTIndexReader only supports getDictIds currently.");
+  }
+
+  @Override
+  public ImmutableRoaringBitmap getDictIds(String searchQuery) {
+    try {
+      MutableRoaringBitmap dictIds = new MutableRoaringBitmap();
+      List<Long> matchingIds = RegexpMatcher.regexMatch(searchQuery, this._readFST);
+      for (Long matchingId : matchingIds) {
+        dictIds.add(matchingId.intValue());
+      }
+      return dictIds.toImmutableRoaringBitmap();

Review comment:
       This is a no-op, by the way.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] richardstartin commented on a change in pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
richardstartin commented on a change in pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#discussion_r710168057



##########
File path: pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/nativefst/utils/RegexpMatcher.java
##########
@@ -0,0 +1,170 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.utils.nativefst.utils;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Set;
+import org.apache.pinot.segment.local.utils.nativefst.FST;
+import org.apache.pinot.segment.local.utils.nativefst.automaton.Automaton;
+import org.apache.pinot.segment.local.utils.nativefst.automaton.CharacterRunAutomaton;
+import org.apache.pinot.segment.local.utils.nativefst.automaton.RegExp;
+import org.apache.pinot.segment.local.utils.nativefst.automaton.State;
+import org.apache.pinot.segment.local.utils.nativefst.automaton.Transition;
+
+
+/**
+ * RegexpMatcher is a helper to retrieve matching values for a given regexp query.
+ * Regexp query is converted into an automaton and we run the matching algorithm on FST.
+ *
+ * Two main functions of this class are
+ *   regexMatchOnFST() Function runs matching on FST (See function comments for more details)
+ *   match(input) Function builds the automaton and matches given input.
+ */
+public class RegexpMatcher {
+  private final String _regexQuery;
+  private final FST _fst;
+  private final Automaton _automaton;
+
+  public RegexpMatcher(String regexQuery, FST fst) {
+    _regexQuery = regexQuery;
+    _fst = fst;
+
+    _automaton = new RegExp(_regexQuery).toAutomaton();
+  }
+
+  public static List<Long> regexMatch(String regexQuery, FST fst) {

Review comment:
       Simpler would be to take an `IntConsumer` so it could be directly appended to the resultant bitmap without this depending on RoaringBitmap. So something like:
   
   ```
   static void regexMatch(String regexQuery, FST fst, IntConsumer dest) {
       ...
      dest.accept(_fst.getOutputSymbol(path._fstArc));
   }
   ...
   RoaringBitmapWriter<MutableRoaringBitmap> writer = RoaringBitmapWriter.bufferWriter().get();
   regexMatch(regex, fst, writer::add);




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] codecov-commenter edited a comment on pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
codecov-commenter edited a comment on pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#issuecomment-917372847


   # [Codecov](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#7405](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (9bae7b5) into [master](https://codecov.io/gh/apache/pinot/commit/fe14d60a5fd3405cc695f9a3a9d1df73a0dbbea3?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (fe14d60) will **decrease** coverage by `7.89%`.
   > The diff coverage is `41.78%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/pinot/pull/7405/graphs/tree.svg?width=650&height=150&src=pr&token=4ibza2ugkz&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #7405      +/-   ##
   ============================================
   - Coverage     71.91%   64.01%   -7.90%     
   - Complexity     3348     3818     +470     
   ============================================
     Files          1517     1502      -15     
     Lines         75039    76429    +1390     
     Branches      10921    11383     +462     
   ============================================
   - Hits          53961    48928    -5033     
   - Misses        17451    24009    +6558     
   + Partials       3627     3492     -135     
   ```
   
   | Flag | Coverage Δ | |
   |---|---|---|
   | integration1 | `?` | |
   | integration2 | `?` | |
   | unittests1 | `68.07% <41.78%> (-1.63%)` | :arrow_down: |
   | unittests2 | `13.90% <0.00%> (-0.63%)` | :arrow_down: |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [...t/local/utils/nativefst/NativeFSTIndexCreator.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvTmF0aXZlRlNUSW5kZXhDcmVhdG9yLmphdmE=) | `0.00% <0.00%> (ø)` | |
   | [...nt/local/utils/nativefst/NativeFSTIndexReader.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvTmF0aXZlRlNUSW5kZXhSZWFkZXIuamF2YQ==) | `0.00% <0.00%> (ø)` | |
   | [...al/utils/nativefst/automaton/AutomatonMatcher.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL0F1dG9tYXRvbk1hdGNoZXIuamF2YQ==) | `0.00% <0.00%> (ø)` | |
   | [...ils/nativefst/automaton/CharacterRunAutomaton.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL0NoYXJhY3RlclJ1bkF1dG9tYXRvbi5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [.../local/utils/nativefst/automaton/RunAutomaton.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL1J1bkF1dG9tYXRvbi5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [...l/utils/nativefst/automaton/ShuffleOperations.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL1NodWZmbGVPcGVyYXRpb25zLmphdmE=) | `0.00% <0.00%> (ø)` | |
   | [...l/utils/nativefst/automaton/SpecialOperations.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL1NwZWNpYWxPcGVyYXRpb25zLmphdmE=) | `0.00% <0.00%> (ø)` | |
   | [...ent/local/utils/nativefst/automaton/StatePair.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL1N0YXRlUGFpci5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [...ils/nativefst/automaton/StringUnionOperations.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL1N0cmluZ1VuaW9uT3BlcmF0aW9ucy5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [...gment/local/utils/nativefst/builders/FSTUtils.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYnVpbGRlcnMvRlNUVXRpbHMuamF2YQ==) | `14.28% <14.28%> (ø)` | |
   | ... and [416 more](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [fe14d60...9bae7b5](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] amrishlal edited a comment on pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
amrishlal edited a comment on pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#issuecomment-917104588


   > > I would suggest creating a new pinot-fst module for the new FST implementation.
   > 
   > Not sure about this.. we don't create modules at root level for other indexing. Need to think carefully before creating modules at the root.. at some point, ideally, we should create plugin mechanisms for indexes and then create a module for each index.. we are not there yet
   
   It doesn't have to be a top-level module, but basically, I was looking for some way to sufficiently encapsulate this FST implementation using interfaces (in a separate jar if possible) and then use it within Pinot (?).
   
   > For e.g., in this PR, ImmutableFST uses the off heap bytes store in pinot-segment-local, thus creating a cyclic dependency.
    
   Sounds like we need interfaces?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] kishoreg commented on pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
kishoreg commented on pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#issuecomment-916384652


   I think he published the PR to help with the review of the design doc. As long as the PR does not get merged without the approval of the design, it should be ok. 
   
   @atris can you please update the PR description to indicate the same.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] codecov-commenter edited a comment on pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
codecov-commenter edited a comment on pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#issuecomment-917372847


   # [Codecov](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#7405](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (1ba6332) into [master](https://codecov.io/gh/apache/pinot/commit/fe14d60a5fd3405cc695f9a3a9d1df73a0dbbea3?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (fe14d60) will **decrease** coverage by `2.33%`.
   > The diff coverage is `44.90%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/pinot/pull/7405/graphs/tree.svg?width=650&height=150&src=pr&token=4ibza2ugkz&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #7405      +/-   ##
   ============================================
   - Coverage     71.91%   69.57%   -2.34%     
   - Complexity     3348     3902     +554     
   ============================================
     Files          1517     1548      +31     
     Lines         75039    78135    +3096     
     Branches      10921    11551     +630     
   ============================================
   + Hits          53961    54364     +403     
   - Misses        17451    20038    +2587     
   - Partials       3627     3733     +106     
   ```
   
   | Flag | Coverage Δ | |
   |---|---|---|
   | integration1 | `?` | |
   | integration2 | `27.88% <0.00%> (-1.24%)` | :arrow_down: |
   | unittests1 | `68.38% <44.90%> (-1.32%)` | :arrow_down: |
   | unittests2 | `13.95% <0.00%> (-0.58%)` | :arrow_down: |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [...t/local/utils/nativefst/NativeFSTIndexCreator.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvTmF0aXZlRlNUSW5kZXhDcmVhdG9yLmphdmE=) | `0.00% <0.00%> (ø)` | |
   | [...nt/local/utils/nativefst/NativeFSTIndexReader.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvTmF0aXZlRlNUSW5kZXhSZWFkZXIuamF2YQ==) | `0.00% <0.00%> (ø)` | |
   | [...al/utils/nativefst/automaton/AutomatonMatcher.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL0F1dG9tYXRvbk1hdGNoZXIuamF2YQ==) | `0.00% <0.00%> (ø)` | |
   | [...ils/nativefst/automaton/CharacterRunAutomaton.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL0NoYXJhY3RlclJ1bkF1dG9tYXRvbi5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [.../local/utils/nativefst/automaton/RunAutomaton.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL1J1bkF1dG9tYXRvbi5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [...l/utils/nativefst/automaton/SpecialOperations.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL1NwZWNpYWxPcGVyYXRpb25zLmphdmE=) | `0.00% <0.00%> (ø)` | |
   | [...ent/local/utils/nativefst/automaton/StatePair.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL1N0YXRlUGFpci5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [...ils/nativefst/automaton/StringUnionOperations.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL1N0cmluZ1VuaW9uT3BlcmF0aW9ucy5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [...gment/local/utils/nativefst/builders/FSTUtils.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYnVpbGRlcnMvRlNUVXRpbHMuamF2YQ==) | `14.28% <14.28%> (ø)` | |
   | [...local/utils/nativefst/automaton/BasicAutomata.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL0Jhc2ljQXV0b21hdGEuamF2YQ==) | `20.00% <20.00%> (ø)` | |
   | ... and [170 more](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [fe14d60...1ba6332](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] codecov-commenter edited a comment on pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
codecov-commenter edited a comment on pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#issuecomment-917372847


   # [Codecov](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#7405](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (b22c014) into [master](https://codecov.io/gh/apache/pinot/commit/fe14d60a5fd3405cc695f9a3a9d1df73a0dbbea3?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (fe14d60) will **decrease** coverage by `7.69%`.
   > The diff coverage is `44.99%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/pinot/pull/7405/graphs/tree.svg?width=650&height=150&src=pr&token=4ibza2ugkz&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #7405      +/-   ##
   ============================================
   - Coverage     71.91%   64.21%   -7.70%     
   - Complexity     3348     3817     +469     
   ============================================
     Files          1517     1501      -16     
     Lines         75039    76199    +1160     
     Branches      10921    11346     +425     
   ============================================
   - Hits          53961    48932    -5029     
   - Misses        17451    23773    +6322     
   + Partials       3627     3494     -133     
   ```
   
   | Flag | Coverage Δ | |
   |---|---|---|
   | integration1 | `?` | |
   | integration2 | `?` | |
   | unittests1 | `68.35% <44.99%> (-1.35%)` | :arrow_down: |
   | unittests2 | `13.94% <0.00%> (-0.59%)` | :arrow_down: |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [...t/local/utils/nativefst/NativeFSTIndexCreator.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvTmF0aXZlRlNUSW5kZXhDcmVhdG9yLmphdmE=) | `0.00% <0.00%> (ø)` | |
   | [...nt/local/utils/nativefst/NativeFSTIndexReader.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvTmF0aXZlRlNUSW5kZXhSZWFkZXIuamF2YQ==) | `0.00% <0.00%> (ø)` | |
   | [...al/utils/nativefst/automaton/AutomatonMatcher.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL0F1dG9tYXRvbk1hdGNoZXIuamF2YQ==) | `0.00% <0.00%> (ø)` | |
   | [...ils/nativefst/automaton/CharacterRunAutomaton.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL0NoYXJhY3RlclJ1bkF1dG9tYXRvbi5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [.../local/utils/nativefst/automaton/RunAutomaton.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL1J1bkF1dG9tYXRvbi5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [...l/utils/nativefst/automaton/SpecialOperations.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL1NwZWNpYWxPcGVyYXRpb25zLmphdmE=) | `0.00% <0.00%> (ø)` | |
   | [...ent/local/utils/nativefst/automaton/StatePair.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL1N0YXRlUGFpci5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [...ils/nativefst/automaton/StringUnionOperations.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL1N0cmluZ1VuaW9uT3BlcmF0aW9ucy5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [...gment/local/utils/nativefst/builders/FSTUtils.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYnVpbGRlcnMvRlNUVXRpbHMuamF2YQ==) | `14.28% <14.28%> (ø)` | |
   | [...local/utils/nativefst/automaton/BasicAutomata.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL0Jhc2ljQXV0b21hdGEuamF2YQ==) | `20.00% <20.00%> (ø)` | |
   | ... and [413 more](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [fe14d60...b22c014](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] richardstartin commented on a change in pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
richardstartin commented on a change in pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#discussion_r710160192



##########
File path: pinot-segment-local/src/test/java/org/apache/pinot/segment/local/utils/nativefst/FSTTestUtils.java
##########
@@ -0,0 +1,129 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.utils.nativefst;
+
+import java.nio.ByteBuffer;
+import java.nio.charset.Charset;
+import java.util.Arrays;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Random;
+import org.apache.pinot.segment.local.utils.nativefst.builders.FSTBuilder;
+import org.apache.pinot.segment.local.utils.nativefst.utils.RegexpMatcher;
+
+import static java.nio.charset.StandardCharsets.UTF_8;
+import static org.testng.Assert.assertEquals;
+import static org.testng.FileAssert.fail;
+
+
+/**
+ * Test utils class
+ */
+class FSTTestUtils {
+
+  private FSTTestUtils() {
+  }
+
+  /*
+   * Generate a sorted list of random sequences.
+   */
+  public static byte[][] generateRandom(int count, MinMax length, MinMax alphabet) {
+    final byte[][] input = new byte[count][];
+    final Random rnd = new Random();
+    for (int i = 0; i < count; i++) {
+      input[i] = randomByteSequence(rnd, length, alphabet);
+    }
+    Arrays.sort(input, FSTBuilder.LEXICAL_ORDERING);
+    return input;
+  }
+
+  /**
+   * Generate a random string.
+   */
+  private static byte[] randomByteSequence(Random rnd, MinMax length, MinMax alphabet) {
+    byte[] bytes = new byte[length._min + rnd.nextInt(length.range())];
+    for (int i = 0; i < bytes.length; i++) {
+      bytes[i] = (byte) (alphabet._min + rnd.nextInt(alphabet.range()));
+    }
+    return bytes;

Review comment:
       Sorry my mistake, I misread




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] atris commented on pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
atris commented on pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#issuecomment-920938627


   > This is looking quite promising to me 👍🏻
   
   Thank you for reviewing! I will raise an iteration today which fixes your comments


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] atris commented on pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
atris commented on pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#issuecomment-921708942


   > > Shall we reduce the testing/sample text file size? IMO keeping ~1000 words should be good enough. We don't want to increase the repo size too much because of these sample files
   > 
   > +1
   > 
   > Test files seem extremely huge and we should try to get coverage with considerably smaller datasets.
   
   Fixed, thanks


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] Jackie-Jiang commented on a change in pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
Jackie-Jiang commented on a change in pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#discussion_r714213516



##########
File path: pinot-segment-local/pom.xml
##########
@@ -156,5 +156,25 @@
       <type>test-jar</type>
       <scope>test</scope>
     </dependency>
+    <dependency>

Review comment:
       Shall we revert these extra dependencies?

##########
File path: pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/nativefst/ByteSequenceIterator.java
##########
@@ -0,0 +1,180 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.utils.nativefst;
+
+import java.nio.ByteBuffer;
+import java.util.Arrays;
+import java.util.Iterator;
+import java.util.NoSuchElementException;
+
+
+/**
+ * An iterator that traverses the right language of a given node (all sequences
+ * reachable from a given node).
+ */
+public final class ByteSequenceIterator implements Iterator<ByteBuffer> {
+  /**
+   * Default expected depth of the recursion stack (estimated longest sequence
+   * in the automaton). Buffers expand by the same value if exceeded.
+   */
+  private final static int EXPECTED_MAX_STATES = 15;
+
+  /** The FST to which this iterator belongs. */
+  private final FST _fst;
+
+  /** An internal cache for the next element in the FST */
+  private ByteBuffer _nextElement;
+
+  /**
+   * A buffer for the current sequence of bytes from the current node to the
+   * root.
+   */
+  private byte[] _buffer = new byte[EXPECTED_MAX_STATES];
+
+  /** Reusable byte buffer wrapper around {@link #_buffer}. */
+  private ByteBuffer _bufferWrapper = ByteBuffer.wrap(_buffer);
+
+  /** An arc stack for DFS when processing the automaton. */
+  private int[] _arcs = new int[EXPECTED_MAX_STATES];
+
+  /** Current processing depth in {@link #_arcs}. */
+  private int _position;
+
+  /**
+   * Create an instance of the iterator for a given node.
+   * @param fst The automaton to iterate over.
+   * @param node The starting node's identifier (can be the {@link FST#getRootNode()}).
+   */
+  public ByteSequenceIterator(FST fst, int node) {
+    this._fst = fst;
+
+    if (fst.getFirstArc(node) != 0) {
+      restartFrom(node);
+    }
+  }
+
+  /**
+   * Restart walking from <code>node</code>. Allows iterator reuse.
+   *
+   * @param node Restart the iterator from <code>node</code>.
+   * @return Returns <code>this</code> for call chaining.
+   */
+  public ByteSequenceIterator restartFrom(int node) {
+    _position = 0;
+    _bufferWrapper.clear();
+    _nextElement = null;
+
+    pushNode(node);
+    return this;
+  }
+
+  /** Returns <code>true</code> if there are still elements in this iterator. */
+  @Override
+  public boolean hasNext() {
+    if (_nextElement == null) {
+      _nextElement = advance();
+    }
+
+    return _nextElement != null;
+  }
+
+  /**
+   * @return Returns a {@link ByteBuffer} with the sequence corresponding to the
+   *         next final state in the automaton.
+   */
+  @Override
+  public ByteBuffer next() {
+    if (_nextElement != null) {
+      final ByteBuffer cache = _nextElement;
+      _nextElement = null;
+      return cache;
+    } else {
+      final ByteBuffer cache = advance();
+      if (cache == null) {
+        throw new NoSuchElementException();
+      }
+      return cache;
+    }
+  }
+
+  /**
+   * Advances to the next available final state.
+   */
+  private final ByteBuffer advance() {
+    if (_position == 0) {
+      return null;
+    }
+
+    while (_position > 0) {
+      final int lastIndex = _position - 1;
+      final int arc = _arcs[lastIndex];
+
+      if (arc == 0) {
+        // Remove the current node from the queue.
+        _position--;
+        continue;
+      }
+
+      // Go to the next arc, but leave it on the stack
+      // so that we keep the recursion depth level accurate.
+      _arcs[lastIndex] = _fst.getNextArc(arc);
+
+      // Expand buffer if needed.
+      final int bufferLength = this._buffer.length;
+      if (lastIndex >= bufferLength) {
+        this._buffer = Arrays.copyOf(_buffer, bufferLength + EXPECTED_MAX_STATES);
+        this._bufferWrapper = ByteBuffer.wrap(_buffer);
+      }
+      _buffer[lastIndex] = _fst.getArcLabel(arc);
+
+      if (!_fst.isArcTerminal(arc)) {
+        // Recursively descend into the arc's node.
+        pushNode(_fst.getEndNode(arc));
+      }
+
+      if (_fst.isArcFinal(arc)) {
+        _bufferWrapper.clear();
+        _bufferWrapper.limit(lastIndex + 1);
+        return _bufferWrapper;
+      }
+    }
+
+    return null;
+  }
+
+  /**
+   * Not implemented in this iterator.
+   */
+  @Override
+  public void remove() {
+    throw new UnsupportedOperationException("Read-only iterator.");
+  }
+
+  /**
+   * Descends to a given node, adds its arcs to the stack to be traversed.
+   */
+  private void pushNode(int node) {
+    // Expand buffers if needed.
+    if (_position == _arcs.length) {
+      _arcs = Arrays.copyOf(_arcs, _arcs.length + EXPECTED_MAX_STATES);
+    }
+
+    _arcs[_position++] = _fst.getFirstArc(node);
+  }
+}

Review comment:
       (nit) new line at the end, same for other files

##########
File path: pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/nativefst/ByteSequenceIterator.java
##########
@@ -0,0 +1,180 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.utils.nativefst;
+
+import java.nio.ByteBuffer;
+import java.util.Arrays;
+import java.util.Iterator;
+import java.util.NoSuchElementException;
+
+
+/**
+ * An iterator that traverses the right language of a given node (all sequences
+ * reachable from a given node).
+ */
+public final class ByteSequenceIterator implements Iterator<ByteBuffer> {
+  /**
+   * Default expected depth of the recursion stack (estimated longest sequence
+   * in the automaton). Buffers expand by the same value if exceeded.
+   */
+  private final static int EXPECTED_MAX_STATES = 15;
+
+  /** The FST to which this iterator belongs. */
+  private final FST _fst;
+
+  /** An internal cache for the next element in the FST */
+  private ByteBuffer _nextElement;
+
+  /**
+   * A buffer for the current sequence of bytes from the current node to the
+   * root.
+   */
+  private byte[] _buffer = new byte[EXPECTED_MAX_STATES];
+
+  /** Reusable byte buffer wrapper around {@link #_buffer}. */
+  private ByteBuffer _bufferWrapper = ByteBuffer.wrap(_buffer);
+
+  /** An arc stack for DFS when processing the automaton. */
+  private int[] _arcs = new int[EXPECTED_MAX_STATES];
+
+  /** Current processing depth in {@link #_arcs}. */
+  private int _position;
+
+  /**
+   * Create an instance of the iterator for a given node.
+   * @param fst The automaton to iterate over.
+   * @param node The starting node's identifier (can be the {@link FST#getRootNode()}).
+   */
+  public ByteSequenceIterator(FST fst, int node) {
+    this._fst = fst;

Review comment:
       (code format) remove `this.`, same for other places

##########
File path: pinot-segment-local/src/test/java/org/apache/pinot/segment/local/utils/fst/FSTBuilderTest.java
##########
@@ -74,7 +73,7 @@ public void testFSTBuilder()
 
     Outputs<Long> outputs = PositiveIntOutputs.getSingleton();
     File fstFile = new File(outputFile.getAbsolutePath());
-
+    

Review comment:
       (nit) Revert the changes in this file




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] richardstartin commented on a change in pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
richardstartin commented on a change in pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#discussion_r710124384



##########
File path: pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/nativefst/automaton/Automaton.java
##########
@@ -0,0 +1,652 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.pinot.segment.local.utils.nativefst.automaton;
+
+import java.io.Serializable;
+import java.util.Arrays;
+import java.util.Collection;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.LinkedList;
+import java.util.List;
+import java.util.Set;
+
+
+/**
+ * Finite-state automaton with regular expression operations.
+ * <p>
+ * Class invariants:
+ * <ul>
+ * <li> An automaton is either represented explicitly (with {@link State} and {@link Transition} objects)
+ *      or with a singleton string ({@link #expandSingleton()}) in case
+ *      the automaton is known to accept exactly one string.
+ *      (Implicitly, all states and transitions of an automaton are reachable from its initial state.)
+ * <li> Automata are always reduced (see {@link #reduce()}) 
+ *      and have no transitions to dead states (see {@link #removeDeadTransitions()}).
+ * <li> If an automaton is nondeterministic, then {@link #isDeterministic()} returns false (but
+ *      the converse is not required).
+ * <li> Automata provided as input to operations are generally assumed to be disjoint.
+ * </ul>
+ * <p>
+ */
+public class Automaton implements Serializable, Cloneable {
+
+  /**
+   * Minimize using Huffman's O(n<sup>2</sup>) algorithm.
+   * This is the standard text-book algorithm.
+   */
+  public static final int MINIMIZE_HUFFMAN = 0;
+  /**
+   * Minimize using Brzozowski's O(2<sup>n</sup>) algorithm.
+   * This algorithm uses the reverse-determinize-reverse-determinize trick, which has a bad
+   * worst-case behavior but often works very well in practice
+   * (even better than Hopcroft's!).
+   */
+  public static final int MINIMIZE_BRZOZOWSKI = 1;
+  /**
+   * Minimize using Hopcroft's O(n log n) algorithm.
+   */
+  public static final int MINIMIZE_HOPCROFT = 2;
+  /**
+   * Minimize using Valmari's O(n + m log m) algorithm.
+   */
+  public static final int MINIMIZE_VALMARI = 3;
+
+  /** Minimize always flag. */
+  public static boolean _minimizeAlways = false;
+
+  /** Selects whether operations may modify the input automata (default: <code>false</code>). */
+  public static boolean _allowMutation = false;
+
+  /** Selects minimization algorithm (default: <code>MINIMIZE_HOPCROFT</code>). */
+  public static int _minimization = MINIMIZE_HOPCROFT;
+
+  /** Initial state of this automaton. */
+  State _initial;
+
+  /** If true, then this automaton is definitely deterministic
+   (i.e., there are no choices for any run, but a run may crash). */
+  boolean _deterministic;
+
+  /** Hash code. Recomputed by {@link #minimize()}. */
+  int _hashCode;
+
+  /** Singleton string. Null if not applicable. */
+  String _singleton;
+
+  /**
+   * Constructs a new automaton that accepts the empty language.
+   * Using this constructor, automata can be constructed manually from
+   * {@link State} and {@link Transition} objects.
+   * @see #setInitialState(State)
+   * @see State
+   * @see Transition
+   */
+  public Automaton() {
+    _initial = new State();
+    _deterministic = true;
+    _singleton = null;
+  }
+
+  /**
+   * Sets or resets allow mutate flag.
+   * If this flag is set, then all automata operations may modify automata given as input;
+   * otherwise, operations will always leave input automata languages unmodified.
+   * By default, the flag is not set.
+   * @param flag if true, the flag is set
+   * @return previous value of the flag
+   */
+  static public boolean setAllowMutate(boolean flag) {
+    boolean b = _allowMutation;
+    _allowMutation = flag;
+    return b;
+  }
+
+  /**
+   * Assigns consecutive numbers to the given states.
+   */
+  static void setStateNumbers(Set<State> states) {
+    if (states.size() == Integer.MAX_VALUE) {
+      throw new IllegalArgumentException("number of states exceeded Integer.MAX_VALUE");
+    }
+    int number = 0;
+    for (State s : states) {
+      s._number = number++;
+    }
+  }
+
+  /**
+   * Returns a sorted array of transitions for each state (and sets state numbers).
+   */
+  static Transition[][] getSortedTransitions(Set<State> states) {
+    setStateNumbers(states);
+    Transition[][] transitions = new Transition[states.size()][];
+    for (State s : states) {
+      transitions[s._number] = s.getSortedTransitionArray(false);
+    }
+    return transitions;
+  }
+
+  /**
+   * See {@link MinimizationOperations#minimize(Automaton)}.
+   * Returns the automaton being given as argument.
+   */
+  public static Automaton minimize(Automaton a) {
+    a.minimize();
+    return a;
+  }
+
+  void checkMinimizeAlways() {
+    if (_minimizeAlways) {
+      minimize();
+    }
+  }
+
+  boolean isSingleton() {
+    return _singleton != null;
+  }
+
+  /**
+   * Gets initial state.
+   * @return state
+   */
+  public State getInitialState() {
+    expandSingleton();
+    return _initial;
+  }
+
+  /**
+   * Sets initial state.
+   * @param s state
+   */
+  public void setInitialState(State s) {
+    _initial = s;
+    _singleton = null;
+  }
+
+  /**
+   * Returns the set of states that are reachable from the initial state.
+   * @return set of {@link State} objects
+   */
+  public Set<State> getStates() {
+    expandSingleton();
+    Set<State> visited;
+
+    visited = new HashSet<>();
+    LinkedList<State> worklist = new LinkedList<State>();
+    worklist.add(_initial);
+    visited.add(_initial);
+    while (worklist.size() > 0) {
+      State s = worklist.removeFirst();
+      Collection<Transition> tr;
+
+      tr = s._transitionSet;
+      for (Transition t : tr) {
+        if (!visited.contains(t._to)) {
+          visited.add(t._to);
+          worklist.add(t._to);
+        }
+      }
+    }
+    return visited;
+  }
+
+  /**
+   * Returns the set of reachable accept states.
+   * @return set of {@link State} objects
+   */
+  public Set<State> getAcceptStates() {
+    expandSingleton();
+    HashSet<State> accepts = new HashSet<State>();
+    HashSet<State> visited = new HashSet<State>();
+    LinkedList<State> worklist = new LinkedList<State>();
+    worklist.add(_initial);
+    visited.add(_initial);
+    while (worklist.size() > 0) {
+      State s = worklist.removeFirst();
+      if (s._accept) {
+        accepts.add(s);
+      }
+      for (Transition t : s._transitionSet) {
+        if (!visited.contains(t._to)) {
+          visited.add(t._to);
+          worklist.add(t._to);
+        }
+      }
+    }
+    return accepts;

Review comment:
       Have you considered assigning each state an incremental identifier unique within the scope of an automaton, then you can do a lot of this logic with bitsets?
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] richardstartin commented on a change in pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
richardstartin commented on a change in pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#discussion_r710105525



##########
File path: pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/nativefst/ConstantArcSizeFST.java
##########
@@ -0,0 +1,159 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.utils.nativefst;
+
+import java.util.Collections;
+import java.util.Map;
+import java.util.Set;
+import org.apache.pinot.segment.local.utils.nativefst.builders.FSTBuilder;
+
+
+/**
+ * A FST with constant-size arc representation produced directly by
+ * {@link FSTBuilder}.
+ *
+ * @see FSTBuilder
+ */
+public final class ConstantArcSizeFST extends FST {
+  /** Size of the target address field (constant for the builder). */
+  public final static int TARGET_ADDRESS_SIZE = 4;
+
+  /** Size of the flags field (constant for the builder). */
+  public final static int FLAGS_SIZE = 1;
+
+  /** Size of the label field (constant for the builder). */
+  public final static int LABEL_SIZE = 1;
+
+  /**
+   * Size of a single arc structure.
+   */
+  public final static int ARC_SIZE = FLAGS_SIZE + LABEL_SIZE + TARGET_ADDRESS_SIZE;
+
+  /** Offset of the flags field inside an arc. */
+  public final static int FLAGS_OFFSET = 0;
+
+  /** Offset of the label field inside an arc. */
+  public final static int LABEL_OFFSET = FLAGS_SIZE;
+
+  /** Offset of the address field inside an arc. */
+  public final static int ADDRESS_OFFSET = LABEL_OFFSET + LABEL_SIZE;
+  /**
+   * An arc flag indicating the target node of an arc corresponds to a final
+   * state.
+   */
+  public final static int BIT_ARC_FINAL = 1 << 1;
+  /** An arc flag indicating the arc is last within its state. */
+  public final static int BIT_ARC_LAST = 1 << 0;
+  /** A dummy address of the terminal state. */
+  public final static int TERMINAL_STATE = 0;
+  /**
+   * An epsilon state. The first and only arc of this state points either to the
+   * root or to the terminal state, indicating an empty automaton.
+   */
+  private final int _epsilon;
+
+  /**
+   * FST data, serialized as a byte array.
+   */
+  private final byte[] _data;
+
+  private Map<Integer, Integer> _outputSymbols;
+
+  /**
+   * @param data
+   *          FST data. There must be no trailing bytes after the last state.
+   */
+  public ConstantArcSizeFST(byte[] data, int epsilon, Map<Integer, Integer> outputSymbols) {
+    assert epsilon == 0 : "Epsilon is not zero?";

Review comment:
       Why store it at all if it can only be zero and can't be mutated?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] atris commented on a change in pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
atris commented on a change in pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#discussion_r714144026



##########
File path: pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/nativefst/ConstantArcSizeFST.java
##########
@@ -0,0 +1,159 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.utils.nativefst;
+
+import java.util.Collections;
+import java.util.Map;
+import java.util.Set;
+import org.apache.pinot.segment.local.utils.nativefst.builders.FSTBuilder;
+
+
+/**
+ * A FST with constant-size arc representation produced directly by
+ * {@link FSTBuilder}.
+ *
+ * @see FSTBuilder
+ */
+public final class ConstantArcSizeFST extends FST {
+  /** Size of the target address field (constant for the builder). */
+  public final static int TARGET_ADDRESS_SIZE = 4;
+
+  /** Size of the flags field (constant for the builder). */
+  public final static int FLAGS_SIZE = 1;
+
+  /** Size of the label field (constant for the builder). */
+  public final static int LABEL_SIZE = 1;
+
+  /**
+   * Size of a single arc structure.
+   */
+  public final static int ARC_SIZE = FLAGS_SIZE + LABEL_SIZE + TARGET_ADDRESS_SIZE;
+
+  /** Offset of the flags field inside an arc. */
+  public final static int FLAGS_OFFSET = 0;
+
+  /** Offset of the label field inside an arc. */
+  public final static int LABEL_OFFSET = FLAGS_SIZE;
+
+  /** Offset of the address field inside an arc. */
+  public final static int ADDRESS_OFFSET = LABEL_OFFSET + LABEL_SIZE;
+  /**
+   * An arc flag indicating the target node of an arc corresponds to a final
+   * state.
+   */
+  public final static int BIT_ARC_FINAL = 1 << 1;
+  /** An arc flag indicating the arc is last within its state. */
+  public final static int BIT_ARC_LAST = 1 << 0;
+  /** A dummy address of the terminal state. */
+  public final static int TERMINAL_STATE = 0;
+  /**
+   * An epsilon state. The first and only arc of this state points either to the
+   * root or to the terminal state, indicating an empty automaton.
+   */
+  private final int _epsilon;
+
+  /**
+   * FST data, serialized as a byte array.
+   */
+  private final byte[] _data;
+
+  private Map<Integer, Integer> _outputSymbols;
+
+  /**
+   * @param data
+   *          FST data. There must be no trailing bytes after the last state.
+   */
+  public ConstantArcSizeFST(byte[] data, int epsilon, Map<Integer, Integer> outputSymbols) {
+    assert epsilon == 0 : "Epsilon is not zero?";

Review comment:
       This makes the code more readable since epsilon state is the representation of the start state




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] codecov-commenter edited a comment on pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
codecov-commenter edited a comment on pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#issuecomment-917372847


   # [Codecov](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#7405](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (366ae53) into [master](https://codecov.io/gh/apache/pinot/commit/fe14d60a5fd3405cc695f9a3a9d1df73a0dbbea3?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (fe14d60) will **decrease** coverage by `3.39%`.
   > The diff coverage is `45.44%`.
   
   > :exclamation: Current head 366ae53 differs from pull request most recent head 11dc57c. Consider uploading reports for the commit 11dc57c to get more accurate results
   [![Impacted file tree graph](https://codecov.io/gh/apache/pinot/pull/7405/graphs/tree.svg?width=650&height=150&src=pr&token=4ibza2ugkz&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #7405      +/-   ##
   ============================================
   - Coverage     71.91%   68.51%   -3.40%     
   - Complexity     3348     3800     +452     
   ============================================
     Files          1517     1158     -359     
     Lines         75039    56411   -18628     
     Branches      10921     8653    -2268     
   ============================================
   - Hits          53961    38648   -15313     
   + Misses        17451    15001    -2450     
   + Partials       3627     2762     -865     
   ```
   
   | Flag | Coverage Δ | |
   |---|---|---|
   | integration1 | `?` | |
   | integration2 | `?` | |
   | unittests1 | `68.51% <45.44%> (-1.19%)` | :arrow_down: |
   | unittests2 | `?` | |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [...n/src/main/java/org/apache/pinot/common/Utils.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtY29tbW9uL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9jb21tb24vVXRpbHMuamF2YQ==) | `30.23% <0.00%> (-8.90%)` | :arrow_down: |
   | [...ache/pinot/common/lineage/SegmentLineageUtils.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtY29tbW9uL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9jb21tb24vbGluZWFnZS9TZWdtZW50TGluZWFnZVV0aWxzLmphdmE=) | `11.11% <ø> (-88.89%)` | :arrow_down: |
   | [...ta/segment/SegmentZKMetadataCustomMapModifier.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtY29tbW9uL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9jb21tb24vbWV0YWRhdGEvc2VnbWVudC9TZWdtZW50WktNZXRhZGF0YUN1c3RvbU1hcE1vZGlmaWVyLmphdmE=) | `0.00% <0.00%> (-90.91%)` | :arrow_down: |
   | [...g/apache/pinot/common/metrics/ControllerMeter.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtY29tbW9uL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9jb21tb24vbWV0cmljcy9Db250cm9sbGVyTWV0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | :arrow_down: |
   | [...e/pinot/common/utils/FileUploadDownloadClient.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtY29tbW9uL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9jb21tb24vdXRpbHMvRmlsZVVwbG9hZERvd25sb2FkQ2xpZW50LmphdmE=) | `19.04% <0.00%> (-43.12%)` | :arrow_down: |
   | [.../apache/pinot/common/utils/NamedThreadFactory.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtY29tbW9uL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9jb21tb24vdXRpbHMvTmFtZWRUaHJlYWRGYWN0b3J5LmphdmE=) | `50.00% <0.00%> (ø)` | |
   | [...a/org/apache/pinot/common/utils/ServiceStatus.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtY29tbW9uL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9jb21tb24vdXRpbHMvU2VydmljZVN0YXR1cy5qYXZh) | `54.59% <0.00%> (-12.79%)` | :arrow_down: |
   | [...pache/pinot/common/utils/grpc/GrpcQueryClient.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtY29tbW9uL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9jb21tb24vdXRpbHMvZ3JwYy9HcnBjUXVlcnlDbGllbnQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | :arrow_down: |
   | [...apache/pinot/pql/parsers/pql2/ast/BaseAstNode.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtY29tbW9uL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9wcWwvcGFyc2Vycy9wcWwyL2FzdC9CYXNlQXN0Tm9kZS5qYXZh) | `50.00% <0.00%> (ø)` | |
   | [...t/pql/parsers/pql2/ast/BooleanOperatorAstNode.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtY29tbW9uL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9wcWwvcGFyc2Vycy9wcWwyL2FzdC9Cb29sZWFuT3BlcmF0b3JBc3ROb2RlLmphdmE=) | `22.22% <0.00%> (ø)` | |
   | ... and [728 more](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [fe14d60...11dc57c](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] richardstartin commented on a change in pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
richardstartin commented on a change in pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#discussion_r716512876



##########
File path: pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/nativefst/automaton/TransitionComparator.java
##########
@@ -0,0 +1,80 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.pinot.segment.local.utils.nativefst.automaton;
+
+import java.io.Serializable;
+import java.util.Comparator;
+
+
+class TransitionComparator implements Comparator<Transition>, Serializable {

Review comment:
       Is there more than one sort order for `Transition`? If not, I suggest making `Transition implement Comparable<Transition>`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] atris commented on pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
atris commented on pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#issuecomment-933589556


   > Are there any remaining items to be taken care of in this PR @atris @siddharthteotia @richardstartin @Jackie-Jiang? IMHO, since this does not impact existing functionality, perhaps we can document the TODOs here and follow up?
   
   @richardstartin kindly reviewed the PR and his suggestions have been addressed. At this point, I have no open action items on the PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] richardstartin commented on a change in pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
richardstartin commented on a change in pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#discussion_r710149185



##########
File path: pinot-segment-local/src/test/java/org/apache/pinot/segment/local/utils/nativefst/FSTTestUtils.java
##########
@@ -0,0 +1,129 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.utils.nativefst;
+
+import java.nio.ByteBuffer;
+import java.nio.charset.Charset;
+import java.util.Arrays;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Random;
+import org.apache.pinot.segment.local.utils.nativefst.builders.FSTBuilder;
+import org.apache.pinot.segment.local.utils.nativefst.utils.RegexpMatcher;
+
+import static java.nio.charset.StandardCharsets.UTF_8;
+import static org.testng.Assert.assertEquals;
+import static org.testng.FileAssert.fail;
+
+
+/**
+ * Test utils class
+ */
+class FSTTestUtils {
+
+  private FSTTestUtils() {
+  }
+
+  /*
+   * Generate a sorted list of random sequences.
+   */
+  public static byte[][] generateRandom(int count, MinMax length, MinMax alphabet) {
+    final byte[][] input = new byte[count][];
+    final Random rnd = new Random();

Review comment:
       For results to be comparable from run to run, this needs to seeded.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] codecov-commenter edited a comment on pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
codecov-commenter edited a comment on pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#issuecomment-917372847


   # [Codecov](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#7405](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (17eb7e6) into [master](https://codecov.io/gh/apache/pinot/commit/fe14d60a5fd3405cc695f9a3a9d1df73a0dbbea3?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (fe14d60) will **decrease** coverage by `42.45%`.
   > The diff coverage is `9.06%`.
   
   > :exclamation: Current head 17eb7e6 differs from pull request most recent head 995ec32. Consider uploading reports for the commit 995ec32 to get more accurate results
   [![Impacted file tree graph](https://codecov.io/gh/apache/pinot/pull/7405/graphs/tree.svg?width=650&height=150&src=pr&token=4ibza2ugkz&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@              Coverage Diff              @@
   ##             master    #7405       +/-   ##
   =============================================
   - Coverage     71.91%   29.45%   -42.46%     
   =============================================
     Files          1517     1539       +22     
     Lines         75039    78040     +3001     
     Branches      10921    11559      +638     
   =============================================
   - Hits          53961    22989    -30972     
   - Misses        17451    53006    +35555     
   + Partials       3627     2045     -1582     
   ```
   
   | Flag | Coverage Δ | |
   |---|---|---|
   | integration1 | `29.45% <9.06%> (-1.18%)` | :arrow_down: |
   | integration2 | `?` | |
   | unittests1 | `?` | |
   | unittests2 | `?` | |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [...quota/HelixExternalViewBasedQueryQuotaManager.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtYnJva2VyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9icm9rZXIvcXVlcnlxdW90YS9IZWxpeEV4dGVybmFsVmlld0Jhc2VkUXVlcnlRdW90YU1hbmFnZXIuamF2YQ==) | `43.04% <0.00%> (-26.15%)` | :arrow_down: |
   | [...che/pinot/broker/queryquota/MaxHitRateTracker.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtYnJva2VyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9icm9rZXIvcXVlcnlxdW90YS9NYXhIaXRSYXRlVHJhY2tlci5qYXZh) | `30.00% <0.00%> (-65.00%)` | :arrow_down: |
   | [...n/java/org/apache/pinot/client/BrokerResponse.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtY2xpZW50cy9waW5vdC1qYXZhLWNsaWVudC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3QvY2xpZW50L0Jyb2tlclJlc3BvbnNlLmphdmE=) | `83.33% <0.00%> (-16.67%)` | :arrow_down: |
   | [.../pinot/common/function/scalar/StringFunctions.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtY29tbW9uL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9jb21tb24vZnVuY3Rpb24vc2NhbGFyL1N0cmluZ0Z1bmN0aW9ucy5qYXZh) | `0.00% <0.00%> (-70.91%)` | :arrow_down: |
   | [...ache/pinot/common/lineage/SegmentLineageUtils.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtY29tbW9uL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9jb21tb24vbGluZWFnZS9TZWdtZW50TGluZWFnZVV0aWxzLmphdmE=) | `100.00% <ø> (ø)` | |
   | [...ta/segment/SegmentZKMetadataCustomMapModifier.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtY29tbW9uL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9jb21tb24vbWV0YWRhdGEvc2VnbWVudC9TZWdtZW50WktNZXRhZGF0YUN1c3RvbU1hcE1vZGlmaWVyLmphdmE=) | `84.84% <0.00%> (-6.07%)` | :arrow_down: |
   | [.../apache/pinot/common/utils/NamedThreadFactory.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtY29tbW9uL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9jb21tb24vdXRpbHMvTmFtZWRUaHJlYWRGYWN0b3J5LmphdmE=) | `50.00% <0.00%> (ø)` | |
   | [...org/apache/pinot/common/utils/PinotAppConfigs.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtY29tbW9uL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9jb21tb24vdXRpbHMvUGlub3RBcHBDb25maWdzLmphdmE=) | `0.00% <0.00%> (-64.71%)` | :arrow_down: |
   | [...a/org/apache/pinot/common/utils/ServiceStatus.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtY29tbW9uL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9jb21tb24vdXRpbHMvU2VydmljZVN0YXR1cy5qYXZh) | `41.32% <0.00%> (-26.06%)` | :arrow_down: |
   | [...pache/pinot/common/utils/grpc/GrpcQueryClient.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtY29tbW9uL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9jb21tb24vdXRpbHMvZ3JwYy9HcnBjUXVlcnlDbGllbnQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | :arrow_down: |
   | ... and [1223 more](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [fe14d60...995ec32](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] amrishlal commented on pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
amrishlal commented on pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#issuecomment-916454905


   I would suggest creating a new pinot-fst module for the new FST implementation.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] atris commented on a change in pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
atris commented on a change in pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#discussion_r705901413



##########
File path: pinot-plugins/pinot-input-format/pinot-protobuf/src/test/java/org/apache/pinot/plugin/inputformat/protobuf/ComplexTypes.java
##########
@@ -719,8 +719,12 @@ public int getNestedIntField() {
       @java.lang.Override
       public final boolean isInitialized() {
         byte isInitialized = memoizedIsInitialized;
-        if (isInitialized == 1) return true;
-        if (isInitialized == 0) return false;
+        if (isInitialized == 1) {
+          return true;
+        }
+        if (isInitialized == 0) {
+          return false;
+        }

Review comment:
       Yeah, I think my global format script just fixed all checkstyle violations :). Reverted, thanks




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] mayankshriv commented on pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
mayankshriv commented on pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#issuecomment-917129393


   > > I would suggest creating a new pinot-fst module for the new FST implementation.
   > 
   > Not sure about this.. we don't create modules at root level for other indexing. Need to think carefully before creating modules at the root.. at some point, ideally, we should create plugin mechanisms for indexes and then create a module for each index.. we are not there yet
   
   +1 to Kishore, we don't have (or create) new modules at root level for other indexing either.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] richardstartin commented on a change in pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
richardstartin commented on a change in pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#discussion_r716505530



##########
File path: pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/nativefst/automaton/Automaton.java
##########
@@ -0,0 +1,653 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.pinot.segment.local.utils.nativefst.automaton;
+
+import java.io.Serializable;
+import java.util.Arrays;
+import java.util.BitSet;
+import java.util.Collection;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.LinkedList;
+import java.util.List;
+import java.util.Set;
+
+
+/**
+ * Finite-state automaton with regular expression operations.
+ * <p>
+ * Class invariants:
+ * <ul>
+ * <li> An automaton is either represented explicitly (with {@link State} and {@link Transition} objects)
+ *      or with a singleton string ({@link #expandSingleton()}) in case
+ *      the automaton is known to accept exactly one string.
+ *      (Implicitly, all states and transitions of an automaton are reachable from its initial state.)
+ * <li> Automata are always reduced (see {@link #reduce()}) 
+ *      and have no transitions to dead states (see {@link #removeDeadTransitions()}).
+ * <li> Automata provided as input to operations are generally assumed to be disjoint.
+ * </ul>
+ * <p>
+ */
+public class Automaton implements Serializable, Cloneable {
+
+  /**
+   * Minimize using Huffman's O(n<sup>2</sup>) algorithm.
+   * This is the standard text-book algorithm.
+   */
+  public static final int MINIMIZE_HUFFMAN = 0;
+  /**
+   * Minimize using Brzozowski's O(2<sup>n</sup>) algorithm.
+   * This algorithm uses the reverse-determinize-reverse-determinize trick, which has a bad
+   * worst-case behavior but often works very well in practice
+   * (even better than Hopcroft's!).
+   */
+  public static final int MINIMIZE_BRZOZOWSKI = 1;
+  /**
+   * Minimize using Hopcroft's O(n log n) algorithm.
+   */
+  public static final int MINIMIZE_HOPCROFT = 2;
+  /**
+   * Minimize using Valmari's O(n + m log m) algorithm.
+   */
+  public static final int MINIMIZE_VALMARI = 3;
+
+  /** Minimize always flag. */
+  public static boolean _minimizeAlways = false;
+
+  /** Selects whether operations may modify the input automata (default: <code>false</code>). */
+  public static boolean _allowMutation = false;
+
+  /** Selects minimization algorithm (default: <code>MINIMIZE_HOPCROFT</code>). */
+  public static int _minimization = MINIMIZE_HOPCROFT;
+
+  /** Initial state of this automaton. */
+  State _initial;
+
+  /** If true, then this automaton is definitely deterministic
+   (i.e., there are no choices for any run, but a run may crash). */
+  boolean _deterministic;
+
+  /** Hash code. Recomputed by {@link #minimize()}. */
+  int _hashCode;
+
+  /** Singleton string. Null if not applicable. */
+  String _singleton;
+
+  /**
+   * Constructs a new automaton that accepts the empty language.
+   * Using this constructor, automata can be constructed manually from
+   * {@link State} and {@link Transition} objects.
+   * @see #setInitialState(State)
+   * @see State
+   * @see Transition
+   */
+  public Automaton() {
+    _initial = new State();
+    _deterministic = true;
+    _singleton = null;
+  }
+
+  /**
+   * Sets or resets allow mutate flag.
+   * If this flag is set, then all automata operations may modify automata given as input;
+   * otherwise, operations will always leave input automata languages unmodified.
+   * By default, the flag is not set.
+   * @param flag if true, the flag is set
+   * @return previous value of the flag
+   */
+  static public boolean setAllowMutate(boolean flag) {
+    boolean b = _allowMutation;
+    _allowMutation = flag;
+    return b;
+  }
+
+  /**
+   * Assigns consecutive numbers to the given states.
+   */
+  static void setStateNumbers(Set<State> states) {
+    if (states.size() == Integer.MAX_VALUE) {
+      throw new IllegalArgumentException("number of states exceeded Integer.MAX_VALUE");
+    }
+    int number = 0;
+    for (State s : states) {
+      s._number = number++;
+    }
+  }
+
+  /**
+   * Returns a sorted array of transitions for each state (and sets state numbers).
+   */
+  static Transition[][] getSortedTransitions(Set<State> states) {
+    setStateNumbers(states);
+    Transition[][] transitions = new Transition[states.size()][];
+    for (State s : states) {
+      transitions[s._number] = s.getSortedTransitionArray(false);
+    }
+    return transitions;
+  }
+
+  /**
+   * See {@link MinimizationOperations#minimize(Automaton)}.
+   * Returns the automaton being given as argument.
+   */
+  public static Automaton minimize(Automaton a) {
+    a.minimize();
+    return a;
+  }
+
+  void checkMinimizeAlways() {
+    if (_minimizeAlways) {
+      minimize();
+    }
+  }
+
+  boolean isSingleton() {
+    return _singleton != null;
+  }
+
+  /**
+   * Gets initial state.
+   * @return state
+   */
+  public State getInitialState() {
+    expandSingleton();
+    return _initial;
+  }
+
+  /**
+   * Sets initial state.
+   * @param s state
+   */
+  public void setInitialState(State s) {
+    _initial = s;
+    _singleton = null;
+  }
+
+  /**
+   * Returns the set of states that are reachable from the initial state.
+   * @return set of {@link State} objects
+   */
+  public Set<State> getStates() {
+    expandSingleton();
+    Set<State> visited;
+
+    visited = new HashSet<>();
+    LinkedList<State> worklist = new LinkedList<State>();
+    worklist.add(_initial);
+    visited.add(_initial);
+    while (!worklist.isEmpty()) {
+      State s = worklist.removeFirst();
+      Collection<Transition> tr;
+
+      tr = s._transitionSet;
+      for (Transition t : tr) {
+        if (!visited.contains(t._to)) {
+          visited.add(t._to);
+          worklist.add(t._to);
+        }
+      }
+    }
+    return visited;
+  }
+
+  /**
+   * Returns the set of reachable accept states.
+   * @return set of {@link State} objects
+   */
+  public Set<State> getAcceptStates() {
+    expandSingleton();
+    HashSet<State> accepts = new HashSet<State>();
+    BitSet visited = new BitSet();
+    LinkedList<State> worklist = new LinkedList<State>();
+    worklist.add(_initial);
+    visited.set(_initial._id);

Review comment:
       Why `_id` and not `_number`? `_id` is global and `_number` is assigned locally to the automaton?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] codecov-commenter edited a comment on pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
codecov-commenter edited a comment on pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#issuecomment-917372847


   # [Codecov](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#7405](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (b0f7138) into [master](https://codecov.io/gh/apache/pinot/commit/fe14d60a5fd3405cc695f9a3a9d1df73a0dbbea3?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (fe14d60) will **decrease** coverage by `43.89%`.
   > The diff coverage is `0.00%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/pinot/pull/7405/graphs/tree.svg?width=650&height=150&src=pr&token=4ibza2ugkz&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@              Coverage Diff              @@
   ##             master    #7405       +/-   ##
   =============================================
   - Coverage     71.91%   28.01%   -43.90%     
   =============================================
     Files          1517     1539       +22     
     Lines         75039    78040     +3001     
     Branches      10921    11559      +638     
   =============================================
   - Hits          53961    21866    -32095     
   - Misses        17451    54196    +36745     
   + Partials       3627     1978     -1649     
   ```
   
   | Flag | Coverage Δ | |
   |---|---|---|
   | integration1 | `?` | |
   | integration2 | `28.01% <0.00%> (-1.10%)` | :arrow_down: |
   | unittests1 | `?` | |
   | unittests2 | `?` | |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [...nt/local/utils/nativefst/ByteSequenceIterator.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvQnl0ZVNlcXVlbmNlSXRlcmF0b3IuamF2YQ==) | `0.00% <0.00%> (ø)` | |
   | [...ment/local/utils/nativefst/ConstantArcSizeFST.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvQ29uc3RhbnRBcmNTaXplRlNULmphdmE=) | `0.00% <0.00%> (ø)` | |
   | [...pache/pinot/segment/local/utils/nativefst/FST.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvRlNULmphdmE=) | `0.00% <0.00%> (ø)` | |
   | [.../pinot/segment/local/utils/nativefst/FSTFlags.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvRlNURmxhZ3MuamF2YQ==) | `0.00% <0.00%> (ø)` | |
   | [...pinot/segment/local/utils/nativefst/FSTHeader.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvRlNUSGVhZGVyLmphdmE=) | `0.00% <0.00%> (ø)` | |
   | [...ot/segment/local/utils/nativefst/FSTTraversal.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvRlNUVHJhdmVyc2FsLmphdmE=) | `0.00% <0.00%> (ø)` | |
   | [...ot/segment/local/utils/nativefst/ImmutableFST.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvSW1tdXRhYmxlRlNULmphdmE=) | `0.00% <0.00%> (ø)` | |
   | [...not/segment/local/utils/nativefst/MatchResult.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvTWF0Y2hSZXN1bHQuamF2YQ==) | `0.00% <0.00%> (ø)` | |
   | [...t/local/utils/nativefst/NativeFSTIndexCreator.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvTmF0aXZlRlNUSW5kZXhDcmVhdG9yLmphdmE=) | `0.00% <0.00%> (ø)` | |
   | [...nt/local/utils/nativefst/NativeFSTIndexReader.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvTmF0aXZlRlNUSW5kZXhSZWFkZXIuamF2YQ==) | `0.00% <0.00%> (ø)` | |
   | ... and [1136 more](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [fe14d60...b0f7138](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] siddharthteotia commented on pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
siddharthteotia commented on pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#issuecomment-921511027


   > Shall we reduce the testing/sample text file size? IMO keeping ~1000 words should be good enough. We don't want to increase the repo size too much because of these sample files
   
   + 1. Test files seem extremely huge and we should try to get coverage with considerably smaller datasets. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] codecov-commenter edited a comment on pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
codecov-commenter edited a comment on pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#issuecomment-917372847


   # [Codecov](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#7405](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (96001bf) into [master](https://codecov.io/gh/apache/pinot/commit/fe14d60a5fd3405cc695f9a3a9d1df73a0dbbea3?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (fe14d60) will **decrease** coverage by `1.07%`.
   > The diff coverage is `44.88%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/pinot/pull/7405/graphs/tree.svg?width=650&height=150&src=pr&token=4ibza2ugkz&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #7405      +/-   ##
   ============================================
   - Coverage     71.91%   70.83%   -1.08%     
   - Complexity     3348     3816     +468     
   ============================================
     Files          1517     1546      +29     
     Lines         75039    78137    +3098     
     Branches      10921    11560     +639     
   ============================================
   + Hits          53961    55351    +1390     
   - Misses        17451    19033    +1582     
   - Partials       3627     3753     +126     
   ```
   
   | Flag | Coverage Δ | |
   |---|---|---|
   | integration1 | `29.22% <0.00%> (-1.41%)` | :arrow_down: |
   | integration2 | `27.98% <0.00%> (-1.13%)` | :arrow_down: |
   | unittests1 | `68.37% <44.88%> (-1.33%)` | :arrow_down: |
   | unittests2 | `13.95% <0.00%> (-0.58%)` | :arrow_down: |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [...t/local/utils/nativefst/NativeFSTIndexCreator.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvTmF0aXZlRlNUSW5kZXhDcmVhdG9yLmphdmE=) | `0.00% <0.00%> (ø)` | |
   | [...nt/local/utils/nativefst/NativeFSTIndexReader.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvTmF0aXZlRlNUSW5kZXhSZWFkZXIuamF2YQ==) | `0.00% <0.00%> (ø)` | |
   | [...al/utils/nativefst/automaton/AutomatonMatcher.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL0F1dG9tYXRvbk1hdGNoZXIuamF2YQ==) | `0.00% <0.00%> (ø)` | |
   | [...ils/nativefst/automaton/CharacterRunAutomaton.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL0NoYXJhY3RlclJ1bkF1dG9tYXRvbi5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [.../local/utils/nativefst/automaton/RunAutomaton.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL1J1bkF1dG9tYXRvbi5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [...l/utils/nativefst/automaton/SpecialOperations.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL1NwZWNpYWxPcGVyYXRpb25zLmphdmE=) | `0.00% <0.00%> (ø)` | |
   | [...ent/local/utils/nativefst/automaton/StatePair.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL1N0YXRlUGFpci5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [...ils/nativefst/automaton/StringUnionOperations.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL1N0cmluZ1VuaW9uT3BlcmF0aW9ucy5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [...gment/local/utils/nativefst/builders/FSTUtils.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYnVpbGRlcnMvRlNUVXRpbHMuamF2YQ==) | `14.28% <14.28%> (ø)` | |
   | [...local/utils/nativefst/automaton/BasicAutomata.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL0Jhc2ljQXV0b21hdGEuamF2YQ==) | `20.00% <20.00%> (ø)` | |
   | ... and [90 more](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [fe14d60...96001bf](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] richardstartin commented on a change in pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
richardstartin commented on a change in pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#discussion_r710101805



##########
File path: pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/nativefst/ConstantArcSizeFST.java
##########
@@ -0,0 +1,159 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.utils.nativefst;
+
+import java.util.Collections;
+import java.util.Map;
+import java.util.Set;
+import org.apache.pinot.segment.local.utils.nativefst.builders.FSTBuilder;
+
+
+/**
+ * A FST with constant-size arc representation produced directly by
+ * {@link FSTBuilder}.
+ *
+ * @see FSTBuilder
+ */
+public final class ConstantArcSizeFST extends FST {
+  /** Size of the target address field (constant for the builder). */
+  public final static int TARGET_ADDRESS_SIZE = 4;
+
+  /** Size of the flags field (constant for the builder). */
+  public final static int FLAGS_SIZE = 1;
+
+  /** Size of the label field (constant for the builder). */
+  public final static int LABEL_SIZE = 1;
+
+  /**
+   * Size of a single arc structure.
+   */
+  public final static int ARC_SIZE = FLAGS_SIZE + LABEL_SIZE + TARGET_ADDRESS_SIZE;
+
+  /** Offset of the flags field inside an arc. */
+  public final static int FLAGS_OFFSET = 0;
+
+  /** Offset of the label field inside an arc. */
+  public final static int LABEL_OFFSET = FLAGS_SIZE;
+
+  /** Offset of the address field inside an arc. */
+  public final static int ADDRESS_OFFSET = LABEL_OFFSET + LABEL_SIZE;
+  /**
+   * An arc flag indicating the target node of an arc corresponds to a final
+   * state.
+   */
+  public final static int BIT_ARC_FINAL = 1 << 1;
+  /** An arc flag indicating the arc is last within its state. */
+  public final static int BIT_ARC_LAST = 1 << 0;
+  /** A dummy address of the terminal state. */
+  public final static int TERMINAL_STATE = 0;
+  /**
+   * An epsilon state. The first and only arc of this state points either to the
+   * root or to the terminal state, indicating an empty automaton.
+   */
+  private final int _epsilon;
+
+  /**
+   * FST data, serialized as a byte array.
+   */
+  private final byte[] _data;
+
+  private Map<Integer, Integer> _outputSymbols;
+
+  /**
+   * @param data
+   *          FST data. There must be no trailing bytes after the last state.
+   */
+  public ConstantArcSizeFST(byte[] data, int epsilon, Map<Integer, Integer> outputSymbols) {
+    assert epsilon == 0 : "Epsilon is not zero?";
+
+    this._epsilon = epsilon;
+    this._data = data;
+    this._outputSymbols = outputSymbols;
+  }
+
+  @Override
+  public int getRootNode() {
+    return getEndNode(getFirstArc(_epsilon));
+  }
+
+  @Override
+  public int getFirstArc(int node) {
+    return node;
+  }
+
+  @Override
+  public int getArc(int node, byte label) {
+    for (int arc = getFirstArc(node); arc != 0; arc = getNextArc(arc)) {
+      if (getArcLabel(arc) == label) {
+        return arc;
+      }
+    }
+    return 0;
+  }
+
+  @Override
+  public int getNextArc(int arc) {
+    if (isArcLast(arc)) {
+      return 0;
+    }
+    return arc + ARC_SIZE;
+  }
+
+  @Override
+  public byte getArcLabel(int arc) {
+    return _data[arc + LABEL_OFFSET];
+  }
+
+  @Override
+  public int getOutputSymbol(int arc) {
+    return _outputSymbols.get(arc);

Review comment:
       NPE here when `arc` is not in the map.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] codecov-commenter edited a comment on pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
codecov-commenter edited a comment on pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#issuecomment-917372847


   # [Codecov](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#7405](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (9bae7b5) into [master](https://codecov.io/gh/apache/pinot/commit/fe14d60a5fd3405cc695f9a3a9d1df73a0dbbea3?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (fe14d60) will **decrease** coverage by `1.32%`.
   > The diff coverage is `41.78%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/pinot/pull/7405/graphs/tree.svg?width=650&height=150&src=pr&token=4ibza2ugkz&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #7405      +/-   ##
   ============================================
   - Coverage     71.91%   70.58%   -1.33%     
   - Complexity     3348     3818     +470     
   ============================================
     Files          1517     1548      +31     
     Lines         75039    78354    +3315     
     Branches      10921    11589     +668     
   ============================================
   + Hits          53961    55309    +1348     
   - Misses        17451    19299    +1848     
   - Partials       3627     3746     +119     
   ```
   
   | Flag | Coverage Δ | |
   |---|---|---|
   | integration1 | `29.05% <0.00%> (-1.59%)` | :arrow_down: |
   | integration2 | `27.89% <0.00%> (-1.22%)` | :arrow_down: |
   | unittests1 | `68.07% <41.78%> (-1.63%)` | :arrow_down: |
   | unittests2 | `13.90% <0.00%> (-0.63%)` | :arrow_down: |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [...t/local/utils/nativefst/NativeFSTIndexCreator.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvTmF0aXZlRlNUSW5kZXhDcmVhdG9yLmphdmE=) | `0.00% <0.00%> (ø)` | |
   | [...nt/local/utils/nativefst/NativeFSTIndexReader.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvTmF0aXZlRlNUSW5kZXhSZWFkZXIuamF2YQ==) | `0.00% <0.00%> (ø)` | |
   | [...al/utils/nativefst/automaton/AutomatonMatcher.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL0F1dG9tYXRvbk1hdGNoZXIuamF2YQ==) | `0.00% <0.00%> (ø)` | |
   | [...ils/nativefst/automaton/CharacterRunAutomaton.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL0NoYXJhY3RlclJ1bkF1dG9tYXRvbi5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [.../local/utils/nativefst/automaton/RunAutomaton.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL1J1bkF1dG9tYXRvbi5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [...l/utils/nativefst/automaton/ShuffleOperations.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL1NodWZmbGVPcGVyYXRpb25zLmphdmE=) | `0.00% <0.00%> (ø)` | |
   | [...l/utils/nativefst/automaton/SpecialOperations.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL1NwZWNpYWxPcGVyYXRpb25zLmphdmE=) | `0.00% <0.00%> (ø)` | |
   | [...ent/local/utils/nativefst/automaton/StatePair.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL1N0YXRlUGFpci5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [...ils/nativefst/automaton/StringUnionOperations.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL1N0cmluZ1VuaW9uT3BlcmF0aW9ucy5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [...gment/local/utils/nativefst/builders/FSTUtils.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYnVpbGRlcnMvRlNUVXRpbHMuamF2YQ==) | `14.28% <14.28%> (ø)` | |
   | ... and [74 more](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [fe14d60...9bae7b5](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] richardstartin commented on a change in pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
richardstartin commented on a change in pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#discussion_r716552959



##########
File path: pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/nativefst/builders/FSTBuilder.java
##########
@@ -0,0 +1,565 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.utils.nativefst.builders;
+
+import java.util.Arrays;
+import java.util.Comparator;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.SortedMap;
+import java.util.TreeMap;
+import org.apache.pinot.segment.local.utils.nativefst.ConstantArcSizeFST;
+import org.apache.pinot.segment.local.utils.nativefst.FST;
+
+
+/**
+ * Fast, memory-conservative finite state transducer builder, returning an
+ * in-memory {@link FST} that is a tradeoff between construction speed and
+ * memory consumption. Use serializers to compress the returned automaton into
+ * more compact form.
+ *
+ * @see FSTSerializer
+ */
+public final class FSTBuilder {
+  /**
+   * A comparator comparing full byte arrays. Unsigned byte comparisons ('C'-locale).
+   */
+  public static final Comparator<byte[]> LEXICAL_ORDERING = new Comparator<byte[]>() {
+    public int compare(byte[] o1, byte[] o2) {
+      return FSTBuilder.compare(o1, 0, o1.length, o2, 0, o2.length);
+    }
+  };
+  /** A megabyte. */
+  private final static int MB = 1024 * 1024;
+
+  /**
+   * Internal serialized FST buffer expand ratio.
+   */
+  private final static int BUFFER_GROWTH_SIZE = 5 * MB;
+
+  /**
+   * Maximum number of labels from a single state.
+   */
+  private final static int MAX_LABELS = 256;
+  /**
+   * Internal serialized FST buffer expand ratio.
+   */
+  private final int _bufferGrowthSize;
+  private byte[] _serialized = new byte[0];
+  private Map<Integer, Integer> _outputSymbols = new HashMap<>();
+
+  /**
+   * Number of bytes already taken in {@link #_serialized}. Start from 1 to keep
+   * 0 a sentinel value (for the hash set and final state).
+   */
+  private int _size;
+  /**
+   * States on the "active path" (still mutable). Values are addresses of each
+   * state's first arc.
+   */
+  private int[] _activePath = new int[0];
+  /**
+   * Current length of the active path.
+   */
+  private int _activePathLen;
+  /**
+   * The next offset at which an arc will be added to the given state on
+   * {@link #_activePath}.
+   */
+  private int[] _nextArcOffset = new int[0];
+  /**
+   * Root state. If negative, the automaton has been built already and cannot be
+   * extended.
+   */
+  private int _root;
+  /**
+   * An epsilon state. The first and only arc of this state points either to the
+   * root or to the terminal state, indicating an empty automaton.
+   */
+  private int _epsilon;
+  /**
+   * Hash set of state addresses in {@link #_serialized}, hashed by
+   * {@link #hash(int, int)}. Zero reserved for an unoccupied slot.
+   */
+  private int[] _hashSet = new int[2];
+  /**
+   * Number of entries currently stored in {@link #_hashSet}.
+   */
+  private int _hashSize = 0;
+  /**
+   * Previous sequence added to the automaton in {@link #add(byte[], int, int, int)}.
+   * Used in assertions only.
+   */
+  private byte[] _previous;
+  /**
+   * Information about the automaton and its compilation.
+   */
+  private TreeMap<InfoEntry, Object> _info;
+  /**
+   * {@link #_previous} sequence's length, used in assertions only.
+   */
+  private int _previousLength;
+  /** Number of serialization buffer reallocations. */
+  private int _serializationBufferReallocations;
+
+  /** */
+  public FSTBuilder() {
+    this(BUFFER_GROWTH_SIZE);
+  }
+
+  /**
+   * @param bufferGrowthSize Buffer growth size (in bytes) when constructing the automaton.
+   */
+  public FSTBuilder(int bufferGrowthSize) {
+    _bufferGrowthSize = Math.max(bufferGrowthSize, ConstantArcSizeFST.ARC_SIZE * MAX_LABELS);
+
+    // Allocate epsilon state.
+    _epsilon = allocateState(1);
+    _serialized[_epsilon + ConstantArcSizeFST.FLAGS_OFFSET] |= ConstantArcSizeFST.BIT_ARC_LAST;
+
+    // Allocate root, with an initial empty set of output arcs.
+    expandActivePath(1);
+    _root = _activePath[0];
+  }
+
+  public static FST buildFST(SortedMap<String, Integer> input) {
+
+    FSTBuilder fstbuilder = new FSTBuilder();
+
+    for (Map.Entry<String, Integer> entry : input.entrySet()) {
+      fstbuilder.add(entry.getKey().getBytes(), 0, entry.getKey().length(), entry.getValue().intValue());
+    }
+
+    return fstbuilder.complete();
+  }
+
+  /**
+   * Build a minimal, deterministic automaton from a sorted list of byte
+   * sequences.
+   *
+   * @param input Input sequences to build automaton from.
+   * @return Returns the automaton encoding all input sequences.
+   */
+  public static FST build(byte[][] input, int[] outputSymbols) {
+    final FSTBuilder builder = new FSTBuilder();
+
+    int i = 0;
+    for (byte[] chs : input) {
+      builder.add(chs, 0, chs.length, i < outputSymbols.length ? outputSymbols[i] : -1);
+      ++i;
+    }
+
+    return builder.complete();
+  }
+
+  /**
+   * Build a minimal, deterministic automaton from an iterable list of byte
+   * sequences.
+   *
+   * @param input Input sequences to build automaton from.
+   * @return Returns the automaton encoding all input sequences.
+   */
+  public static FST build(Iterable<byte[]> input, int[] outputSymbols) {
+    final FSTBuilder builder = new FSTBuilder();
+
+    int i = 0;
+
+    for (byte[] chs : input) {
+      builder.add(chs, 0, chs.length, i < outputSymbols.length ? outputSymbols[i] : -1);
+      ++i;
+    }
+
+    return builder.complete();
+  }
+
+  /**
+   * Lexicographic order of input sequences. By default, consistent with the "C"
+   * sort (absolute value of bytes, 0-255).
+   */
+  private static int compare(byte[] s1, int start1, int lens1, byte[] s2, int start2, int lens2) {
+    final int max = Math.min(lens1, lens2);
+
+    for (int i = 0; i < max; i++) {
+      final byte c1 = s1[start1++];
+      final byte c2 = s2[start2++];
+      if (c1 != c2) {
+        return (c1 & 0xff) - (c2 & 0xff);
+      }
+    }
+
+    return lens1 - lens2;
+  }

Review comment:
       I will add a `MethodHandle` based implementation in pinot because this is done in several places. Keep it the way it is and we can update it after merging, or on the branch if my change merges first.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] atris commented on pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
atris commented on pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#issuecomment-916388421


   Done, thanks 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] codecov-commenter edited a comment on pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
codecov-commenter edited a comment on pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#issuecomment-917372847


   # [Codecov](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#7405](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (995ec32) into [master](https://codecov.io/gh/apache/pinot/commit/fe14d60a5fd3405cc695f9a3a9d1df73a0dbbea3?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (fe14d60) will **decrease** coverage by `10.20%`.
   > The diff coverage is `44.80%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/pinot/pull/7405/graphs/tree.svg?width=650&height=150&src=pr&token=4ibza2ugkz&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@              Coverage Diff              @@
   ##             master    #7405       +/-   ##
   =============================================
   - Coverage     71.91%   61.70%   -10.21%     
   - Complexity     3348     3756      +408     
   =============================================
     Files          1517     1539       +22     
     Lines         75039    78040     +3001     
     Branches      10921    11559      +638     
   =============================================
   - Hits          53961    48157     -5804     
   - Misses        17451    26384     +8933     
   + Partials       3627     3499      -128     
   ```
   
   | Flag | Coverage Δ | |
   |---|---|---|
   | integration1 | `?` | |
   | integration2 | `27.94% <0.00%> (-1.17%)` | :arrow_down: |
   | unittests1 | `68.48% <44.80%> (-1.22%)` | :arrow_down: |
   | unittests2 | `?` | |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [...t/local/utils/nativefst/NativeFSTIndexCreator.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvTmF0aXZlRlNUSW5kZXhDcmVhdG9yLmphdmE=) | `0.00% <0.00%> (ø)` | |
   | [...nt/local/utils/nativefst/NativeFSTIndexReader.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvTmF0aXZlRlNUSW5kZXhSZWFkZXIuamF2YQ==) | `0.00% <0.00%> (ø)` | |
   | [...al/utils/nativefst/automaton/AutomatonMatcher.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL0F1dG9tYXRvbk1hdGNoZXIuamF2YQ==) | `0.00% <0.00%> (ø)` | |
   | [...ils/nativefst/automaton/CharacterRunAutomaton.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL0NoYXJhY3RlclJ1bkF1dG9tYXRvbi5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [.../local/utils/nativefst/automaton/RunAutomaton.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL1J1bkF1dG9tYXRvbi5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [...l/utils/nativefst/automaton/SpecialOperations.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL1NwZWNpYWxPcGVyYXRpb25zLmphdmE=) | `0.00% <0.00%> (ø)` | |
   | [...ent/local/utils/nativefst/automaton/StatePair.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL1N0YXRlUGFpci5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [...ils/nativefst/automaton/StringUnionOperations.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL1N0cmluZ1VuaW9uT3BlcmF0aW9ucy5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [...gment/local/utils/nativefst/builders/FSTUtils.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYnVpbGRlcnMvRlNUVXRpbHMuamF2YQ==) | `14.28% <14.28%> (ø)` | |
   | [...local/utils/nativefst/automaton/BasicAutomata.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL0Jhc2ljQXV0b21hdGEuamF2YQ==) | `20.00% <20.00%> (ø)` | |
   | ... and [392 more](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [fe14d60...995ec32](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] codecov-commenter edited a comment on pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
codecov-commenter edited a comment on pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#issuecomment-917372847


   # [Codecov](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#7405](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (415fed3) into [master](https://codecov.io/gh/apache/pinot/commit/fe14d60a5fd3405cc695f9a3a9d1df73a0dbbea3?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (fe14d60) will **decrease** coverage by `43.98%`.
   > The diff coverage is `8.07%`.
   
   > :exclamation: Current head 415fed3 differs from pull request most recent head 64f5e76. Consider uploading reports for the commit 64f5e76 to get more accurate results
   [![Impacted file tree graph](https://codecov.io/gh/apache/pinot/pull/7405/graphs/tree.svg?width=650&height=150&src=pr&token=4ibza2ugkz&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@              Coverage Diff              @@
   ##             master    #7405       +/-   ##
   =============================================
   - Coverage     71.91%   27.92%   -43.99%     
   =============================================
     Files          1517     1539       +22     
     Lines         75039    78002     +2963     
     Branches      10921    11554      +633     
   =============================================
   - Hits          53961    21782    -32179     
   - Misses        17451    54247    +36796     
   + Partials       3627     1973     -1654     
   ```
   
   | Flag | Coverage Δ | |
   |---|---|---|
   | integration1 | `?` | |
   | integration2 | `27.92% <8.07%> (-1.19%)` | :arrow_down: |
   | unittests1 | `?` | |
   | unittests2 | `?` | |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [...che/pinot/broker/queryquota/MaxHitRateTracker.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtYnJva2VyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9icm9rZXIvcXVlcnlxdW90YS9NYXhIaXRSYXRlVHJhY2tlci5qYXZh) | `30.00% <0.00%> (-65.00%)` | :arrow_down: |
   | [.../pinot/common/function/scalar/StringFunctions.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtY29tbW9uL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9jb21tb24vZnVuY3Rpb24vc2NhbGFyL1N0cmluZ0Z1bmN0aW9ucy5qYXZh) | `4.76% <0.00%> (-66.15%)` | :arrow_down: |
   | [...e/pinot/common/utils/FileUploadDownloadClient.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtY29tbW9uL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9jb21tb24vdXRpbHMvRmlsZVVwbG9hZERvd25sb2FkQ2xpZW50LmphdmE=) | `47.18% <0.00%> (-14.98%)` | :arrow_down: |
   | [...org/apache/pinot/common/utils/PinotAppConfigs.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtY29tbW9uL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9jb21tb24vdXRpbHMvUGlub3RBcHBDb25maWdzLmphdmE=) | `0.00% <0.00%> (-64.71%)` | :arrow_down: |
   | [...a/org/apache/pinot/common/utils/ServiceStatus.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtY29tbW9uL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9jb21tb24vdXRpbHMvU2VydmljZVN0YXR1cy5qYXZh) | `40.81% <0.00%> (-26.57%)` | :arrow_down: |
   | [...apache/pinot/pql/parsers/pql2/ast/BaseAstNode.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtY29tbW9uL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9wcWwvcGFyc2Vycy9wcWwyL2FzdC9CYXNlQXN0Tm9kZS5qYXZh) | `38.23% <0.00%> (-11.77%)` | :arrow_down: |
   | [...t/pql/parsers/pql2/ast/BooleanOperatorAstNode.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtY29tbW9uL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9wcWwvcGFyc2Vycy9wcWwyL2FzdC9Cb29sZWFuT3BlcmF0b3JBc3ROb2RlLmphdmE=) | `22.22% <0.00%> (ø)` | |
   | [...r/api/resources/PinotIngestionRestletResource.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtY29udHJvbGxlci9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3QvY29udHJvbGxlci9hcGkvcmVzb3VyY2VzL1Bpbm90SW5nZXN0aW9uUmVzdGxldFJlc291cmNlLmphdmE=) | `0.00% <0.00%> (ø)` | |
   | [...ntroller/helix/core/rebalance/TableRebalancer.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtY29udHJvbGxlci9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3QvY29udHJvbGxlci9oZWxpeC9jb3JlL3JlYmFsYW5jZS9UYWJsZVJlYmFsYW5jZXIuamF2YQ==) | `0.00% <0.00%> (-70.93%)` | :arrow_down: |
   | [...che/pinot/controller/util/FileIngestionHelper.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtY29udHJvbGxlci9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3QvY29udHJvbGxlci91dGlsL0ZpbGVJbmdlc3Rpb25IZWxwZXIuamF2YQ==) | `0.00% <0.00%> (ø)` | |
   | ... and [1200 more](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [fe14d60...64f5e76](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] richardstartin commented on a change in pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
richardstartin commented on a change in pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#discussion_r717349854



##########
File path: pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/nativefst/builders/FSTBuilder.java
##########
@@ -0,0 +1,565 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.utils.nativefst.builders;
+
+import java.util.Arrays;
+import java.util.Comparator;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.SortedMap;
+import java.util.TreeMap;
+import org.apache.pinot.segment.local.utils.nativefst.ConstantArcSizeFST;
+import org.apache.pinot.segment.local.utils.nativefst.FST;
+
+
+/**
+ * Fast, memory-conservative finite state transducer builder, returning an
+ * in-memory {@link FST} that is a tradeoff between construction speed and
+ * memory consumption. Use serializers to compress the returned automaton into
+ * more compact form.
+ *
+ * @see FSTSerializer
+ */
+public final class FSTBuilder {
+  /**
+   * A comparator comparing full byte arrays. Unsigned byte comparisons ('C'-locale).
+   */
+  public static final Comparator<byte[]> LEXICAL_ORDERING = new Comparator<byte[]>() {
+    public int compare(byte[] o1, byte[] o2) {
+      return FSTBuilder.compare(o1, 0, o1.length, o2, 0, o2.length);
+    }
+  };
+  /** A megabyte. */
+  private final static int MB = 1024 * 1024;
+
+  /**
+   * Internal serialized FST buffer expand ratio.
+   */
+  private final static int BUFFER_GROWTH_SIZE = 5 * MB;
+
+  /**
+   * Maximum number of labels from a single state.
+   */
+  private final static int MAX_LABELS = 256;
+  /**
+   * Internal serialized FST buffer expand ratio.
+   */
+  private final int _bufferGrowthSize;
+  private byte[] _serialized = new byte[0];
+  private Map<Integer, Integer> _outputSymbols = new HashMap<>();
+
+  /**
+   * Number of bytes already taken in {@link #_serialized}. Start from 1 to keep
+   * 0 a sentinel value (for the hash set and final state).
+   */
+  private int _size;
+  /**
+   * States on the "active path" (still mutable). Values are addresses of each
+   * state's first arc.
+   */
+  private int[] _activePath = new int[0];
+  /**
+   * Current length of the active path.
+   */
+  private int _activePathLen;
+  /**
+   * The next offset at which an arc will be added to the given state on
+   * {@link #_activePath}.
+   */
+  private int[] _nextArcOffset = new int[0];
+  /**
+   * Root state. If negative, the automaton has been built already and cannot be
+   * extended.
+   */
+  private int _root;
+  /**
+   * An epsilon state. The first and only arc of this state points either to the
+   * root or to the terminal state, indicating an empty automaton.
+   */
+  private int _epsilon;
+  /**
+   * Hash set of state addresses in {@link #_serialized}, hashed by
+   * {@link #hash(int, int)}. Zero reserved for an unoccupied slot.
+   */
+  private int[] _hashSet = new int[2];
+  /**
+   * Number of entries currently stored in {@link #_hashSet}.
+   */
+  private int _hashSize = 0;
+  /**
+   * Previous sequence added to the automaton in {@link #add(byte[], int, int, int)}.
+   * Used in assertions only.
+   */
+  private byte[] _previous;
+  /**
+   * Information about the automaton and its compilation.
+   */
+  private TreeMap<InfoEntry, Object> _info;
+  /**
+   * {@link #_previous} sequence's length, used in assertions only.
+   */
+  private int _previousLength;
+  /** Number of serialization buffer reallocations. */
+  private int _serializationBufferReallocations;
+
+  /** */
+  public FSTBuilder() {
+    this(BUFFER_GROWTH_SIZE);
+  }
+
+  /**
+   * @param bufferGrowthSize Buffer growth size (in bytes) when constructing the automaton.
+   */
+  public FSTBuilder(int bufferGrowthSize) {
+    _bufferGrowthSize = Math.max(bufferGrowthSize, ConstantArcSizeFST.ARC_SIZE * MAX_LABELS);
+
+    // Allocate epsilon state.
+    _epsilon = allocateState(1);
+    _serialized[_epsilon + ConstantArcSizeFST.FLAGS_OFFSET] |= ConstantArcSizeFST.BIT_ARC_LAST;
+
+    // Allocate root, with an initial empty set of output arcs.
+    expandActivePath(1);
+    _root = _activePath[0];
+  }
+
+  public static FST buildFST(SortedMap<String, Integer> input) {
+
+    FSTBuilder fstbuilder = new FSTBuilder();
+
+    for (Map.Entry<String, Integer> entry : input.entrySet()) {
+      fstbuilder.add(entry.getKey().getBytes(), 0, entry.getKey().length(), entry.getValue().intValue());
+    }
+
+    return fstbuilder.complete();
+  }
+
+  /**
+   * Build a minimal, deterministic automaton from a sorted list of byte
+   * sequences.
+   *
+   * @param input Input sequences to build automaton from.
+   * @return Returns the automaton encoding all input sequences.
+   */
+  public static FST build(byte[][] input, int[] outputSymbols) {
+    final FSTBuilder builder = new FSTBuilder();
+
+    int i = 0;
+    for (byte[] chs : input) {
+      builder.add(chs, 0, chs.length, i < outputSymbols.length ? outputSymbols[i] : -1);
+      ++i;
+    }
+
+    return builder.complete();
+  }
+
+  /**
+   * Build a minimal, deterministic automaton from an iterable list of byte
+   * sequences.
+   *
+   * @param input Input sequences to build automaton from.
+   * @return Returns the automaton encoding all input sequences.
+   */
+  public static FST build(Iterable<byte[]> input, int[] outputSymbols) {
+    final FSTBuilder builder = new FSTBuilder();
+
+    int i = 0;
+
+    for (byte[] chs : input) {
+      builder.add(chs, 0, chs.length, i < outputSymbols.length ? outputSymbols[i] : -1);
+      ++i;
+    }
+
+    return builder.complete();
+  }
+
+  /**
+   * Lexicographic order of input sequences. By default, consistent with the "C"
+   * sort (absolute value of bytes, 0-255).
+   */
+  private static int compare(byte[] s1, int start1, int lens1, byte[] s2, int start2, int lens2) {
+    final int max = Math.min(lens1, lens2);
+
+    for (int i = 0; i < max; i++) {
+      final byte c1 = s1[start1++];
+      final byte c2 = s2[start2++];
+      if (c1 != c2) {
+        return (c1 & 0xff) - (c2 & 0xff);
+      }
+    }
+
+    return lens1 - lens2;
+  }

Review comment:
       If you rebase on master you can replace this with `ByteArray.compare`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] codecov-commenter edited a comment on pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
codecov-commenter edited a comment on pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#issuecomment-917372847


   # [Codecov](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#7405](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (64f5e76) into [master](https://codecov.io/gh/apache/pinot/commit/fe14d60a5fd3405cc695f9a3a9d1df73a0dbbea3?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (fe14d60) will **decrease** coverage by `3.44%`.
   > The diff coverage is `44.81%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/pinot/pull/7405/graphs/tree.svg?width=650&height=150&src=pr&token=4ibza2ugkz&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #7405      +/-   ##
   ============================================
   - Coverage     71.91%   68.46%   -3.45%     
   - Complexity     3348     3754     +406     
   ============================================
     Files          1517     1154     -363     
     Lines         75039    56268   -18771     
     Branches      10921     8634    -2287     
   ============================================
   - Hits          53961    38526   -15435     
   + Misses        17451    14984    -2467     
   + Partials       3627     2758     -869     
   ```
   
   | Flag | Coverage Δ | |
   |---|---|---|
   | integration1 | `?` | |
   | integration2 | `?` | |
   | unittests1 | `68.46% <44.81%> (-1.24%)` | :arrow_down: |
   | unittests2 | `?` | |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [...t/local/utils/nativefst/NativeFSTIndexCreator.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvTmF0aXZlRlNUSW5kZXhDcmVhdG9yLmphdmE=) | `0.00% <0.00%> (ø)` | |
   | [...nt/local/utils/nativefst/NativeFSTIndexReader.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvTmF0aXZlRlNUSW5kZXhSZWFkZXIuamF2YQ==) | `0.00% <0.00%> (ø)` | |
   | [...al/utils/nativefst/automaton/AutomatonMatcher.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL0F1dG9tYXRvbk1hdGNoZXIuamF2YQ==) | `0.00% <0.00%> (ø)` | |
   | [...ils/nativefst/automaton/CharacterRunAutomaton.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL0NoYXJhY3RlclJ1bkF1dG9tYXRvbi5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [.../local/utils/nativefst/automaton/RunAutomaton.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL1J1bkF1dG9tYXRvbi5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [...l/utils/nativefst/automaton/SpecialOperations.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL1NwZWNpYWxPcGVyYXRpb25zLmphdmE=) | `0.00% <0.00%> (ø)` | |
   | [...ent/local/utils/nativefst/automaton/StatePair.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL1N0YXRlUGFpci5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [...ils/nativefst/automaton/StringUnionOperations.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL1N0cmluZ1VuaW9uT3BlcmF0aW9ucy5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [...gment/local/utils/nativefst/builders/FSTUtils.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYnVpbGRlcnMvRlNUVXRpbHMuamF2YQ==) | `14.28% <14.28%> (ø)` | |
   | [...local/utils/nativefst/automaton/BasicAutomata.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL0Jhc2ljQXV0b21hdGEuamF2YQ==) | `20.00% <20.00%> (ø)` | |
   | ... and [698 more](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [fe14d60...64f5e76](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] atris commented on a change in pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
atris commented on a change in pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#discussion_r716568669



##########
File path: pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/nativefst/ConstantArcSizeFST.java
##########
@@ -0,0 +1,159 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.utils.nativefst;
+
+import java.util.Collections;
+import java.util.Map;
+import java.util.Set;
+import org.apache.pinot.segment.local.utils.nativefst.builders.FSTBuilder;
+
+
+/**
+ * A FST with constant-size arc representation produced directly by
+ * {@link FSTBuilder}.
+ *
+ * @see FSTBuilder
+ */
+public final class ConstantArcSizeFST extends FST {
+  /** Size of the target address field (constant for the builder). */
+  public final static int TARGET_ADDRESS_SIZE = 4;
+
+  /** Size of the flags field (constant for the builder). */
+  public final static int FLAGS_SIZE = 1;
+
+  /** Size of the label field (constant for the builder). */
+  public final static int LABEL_SIZE = 1;
+
+  /**
+   * Size of a single arc structure.
+   */
+  public final static int ARC_SIZE = FLAGS_SIZE + LABEL_SIZE + TARGET_ADDRESS_SIZE;
+
+  /** Offset of the flags field inside an arc. */
+  public final static int FLAGS_OFFSET = 0;
+
+  /** Offset of the label field inside an arc. */
+  public final static int LABEL_OFFSET = FLAGS_SIZE;
+
+  /** Offset of the address field inside an arc. */
+  public final static int ADDRESS_OFFSET = LABEL_OFFSET + LABEL_SIZE;
+  /**
+   * An arc flag indicating the target node of an arc corresponds to a final
+   * state.
+   */
+  public final static int BIT_ARC_FINAL = 1 << 1;
+  /** An arc flag indicating the arc is last within its state. */
+  public final static int BIT_ARC_LAST = 1 << 0;
+  /** A dummy address of the terminal state. */
+  public final static int TERMINAL_STATE = 0;
+  /**
+   * An epsilon state. The first and only arc of this state points either to the
+   * root or to the terminal state, indicating an empty automaton.
+   */
+  private final int _epsilon;
+
+  /**
+   * FST data, serialized as a byte array.
+   */
+  private final byte[] _data;
+
+  private Map<Integer, Integer> _outputSymbols;
+
+  /**
+   * @param data
+   *          FST data. There must be no trailing bytes after the last state.
+   */
+  public ConstantArcSizeFST(byte[] data, int epsilon, Map<Integer, Integer> outputSymbols) {
+    assert epsilon == 0 : "Epsilon is not zero?";

Review comment:
       Actually, thinking more about this, there is no reason why the start state (epsilon) needs to be always 0. Although it is true in the current context, but that does not necessitate a need for epsilon to always be marked as 0. Removed the assert.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] codecov-commenter commented on pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#issuecomment-917372847


   # [Codecov](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#7405](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (9bae7b5) into [master](https://codecov.io/gh/apache/pinot/commit/fe14d60a5fd3405cc695f9a3a9d1df73a0dbbea3?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (fe14d60) will **decrease** coverage by `3.83%`.
   > The diff coverage is `41.78%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/pinot/pull/7405/graphs/tree.svg?width=650&height=150&src=pr&token=4ibza2ugkz&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #7405      +/-   ##
   ============================================
   - Coverage     71.91%   68.07%   -3.84%     
   - Complexity     3348     3726     +378     
   ============================================
     Files          1517     1154     -363     
     Lines         75039    56263   -18776     
     Branches      10921     8638    -2283     
   ============================================
   - Hits          53961    38303   -15658     
   + Misses        17451    15226    -2225     
   + Partials       3627     2734     -893     
   ```
   
   | Flag | Coverage Δ | |
   |---|---|---|
   | integration1 | `?` | |
   | integration2 | `?` | |
   | unittests1 | `68.07% <41.78%> (-1.63%)` | :arrow_down: |
   | unittests2 | `?` | |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [...t/local/utils/nativefst/NativeFSTIndexCreator.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvTmF0aXZlRlNUSW5kZXhDcmVhdG9yLmphdmE=) | `0.00% <0.00%> (ø)` | |
   | [...nt/local/utils/nativefst/NativeFSTIndexReader.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvTmF0aXZlRlNUSW5kZXhSZWFkZXIuamF2YQ==) | `0.00% <0.00%> (ø)` | |
   | [...al/utils/nativefst/automaton/AutomatonMatcher.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL0F1dG9tYXRvbk1hdGNoZXIuamF2YQ==) | `0.00% <0.00%> (ø)` | |
   | [...ils/nativefst/automaton/CharacterRunAutomaton.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL0NoYXJhY3RlclJ1bkF1dG9tYXRvbi5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [.../local/utils/nativefst/automaton/RunAutomaton.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL1J1bkF1dG9tYXRvbi5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [...l/utils/nativefst/automaton/ShuffleOperations.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL1NodWZmbGVPcGVyYXRpb25zLmphdmE=) | `0.00% <0.00%> (ø)` | |
   | [...l/utils/nativefst/automaton/SpecialOperations.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL1NwZWNpYWxPcGVyYXRpb25zLmphdmE=) | `0.00% <0.00%> (ø)` | |
   | [...ent/local/utils/nativefst/automaton/StatePair.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL1N0YXRlUGFpci5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [...ils/nativefst/automaton/StringUnionOperations.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL1N0cmluZ1VuaW9uT3BlcmF0aW9ucy5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [...gment/local/utils/nativefst/builders/FSTUtils.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYnVpbGRlcnMvRlNUVXRpbHMuamF2YQ==) | `14.28% <14.28%> (ø)` | |
   | ... and [658 more](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [fe14d60...9bae7b5](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] codecov-commenter edited a comment on pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
codecov-commenter edited a comment on pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#issuecomment-917372847


   # [Codecov](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#7405](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (96001bf) into [master](https://codecov.io/gh/apache/pinot/commit/fe14d60a5fd3405cc695f9a3a9d1df73a0dbbea3?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (fe14d60) will **decrease** coverage by `7.68%`.
   > The diff coverage is `44.88%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/pinot/pull/7405/graphs/tree.svg?width=650&height=150&src=pr&token=4ibza2ugkz&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #7405      +/-   ##
   ============================================
   - Coverage     71.91%   64.23%   -7.69%     
   - Complexity     3348     3816     +468     
   ============================================
     Files          1517     1500      -17     
     Lines         75039    76216    +1177     
     Branches      10921    11354     +433     
   ============================================
   - Hits          53961    48954    -5007     
   - Misses        17451    23759    +6308     
   + Partials       3627     3503     -124     
   ```
   
   | Flag | Coverage Δ | |
   |---|---|---|
   | integration1 | `?` | |
   | integration2 | `?` | |
   | unittests1 | `68.37% <44.88%> (-1.33%)` | :arrow_down: |
   | unittests2 | `13.95% <0.00%> (-0.58%)` | :arrow_down: |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [...t/local/utils/nativefst/NativeFSTIndexCreator.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvTmF0aXZlRlNUSW5kZXhDcmVhdG9yLmphdmE=) | `0.00% <0.00%> (ø)` | |
   | [...nt/local/utils/nativefst/NativeFSTIndexReader.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvTmF0aXZlRlNUSW5kZXhSZWFkZXIuamF2YQ==) | `0.00% <0.00%> (ø)` | |
   | [...al/utils/nativefst/automaton/AutomatonMatcher.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL0F1dG9tYXRvbk1hdGNoZXIuamF2YQ==) | `0.00% <0.00%> (ø)` | |
   | [...ils/nativefst/automaton/CharacterRunAutomaton.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL0NoYXJhY3RlclJ1bkF1dG9tYXRvbi5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [.../local/utils/nativefst/automaton/RunAutomaton.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL1J1bkF1dG9tYXRvbi5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [...l/utils/nativefst/automaton/SpecialOperations.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL1NwZWNpYWxPcGVyYXRpb25zLmphdmE=) | `0.00% <0.00%> (ø)` | |
   | [...ent/local/utils/nativefst/automaton/StatePair.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL1N0YXRlUGFpci5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [...ils/nativefst/automaton/StringUnionOperations.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL1N0cmluZ1VuaW9uT3BlcmF0aW9ucy5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [...gment/local/utils/nativefst/builders/FSTUtils.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYnVpbGRlcnMvRlNUVXRpbHMuamF2YQ==) | `14.28% <14.28%> (ø)` | |
   | [...local/utils/nativefst/automaton/BasicAutomata.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL0Jhc2ljQXV0b21hdGEuamF2YQ==) | `20.00% <20.00%> (ø)` | |
   | ... and [428 more](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [fe14d60...96001bf](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] richardstartin commented on a change in pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
richardstartin commented on a change in pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#discussion_r710121411



##########
File path: pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/nativefst/automaton/Automaton.java
##########
@@ -0,0 +1,652 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.pinot.segment.local.utils.nativefst.automaton;
+
+import java.io.Serializable;
+import java.util.Arrays;
+import java.util.Collection;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.LinkedList;
+import java.util.List;
+import java.util.Set;
+
+
+/**
+ * Finite-state automaton with regular expression operations.
+ * <p>
+ * Class invariants:
+ * <ul>
+ * <li> An automaton is either represented explicitly (with {@link State} and {@link Transition} objects)
+ *      or with a singleton string ({@link #expandSingleton()}) in case
+ *      the automaton is known to accept exactly one string.
+ *      (Implicitly, all states and transitions of an automaton are reachable from its initial state.)
+ * <li> Automata are always reduced (see {@link #reduce()}) 
+ *      and have no transitions to dead states (see {@link #removeDeadTransitions()}).
+ * <li> If an automaton is nondeterministic, then {@link #isDeterministic()} returns false (but
+ *      the converse is not required).
+ * <li> Automata provided as input to operations are generally assumed to be disjoint.
+ * </ul>
+ * <p>
+ */
+public class Automaton implements Serializable, Cloneable {
+
+  /**
+   * Minimize using Huffman's O(n<sup>2</sup>) algorithm.
+   * This is the standard text-book algorithm.
+   */
+  public static final int MINIMIZE_HUFFMAN = 0;
+  /**
+   * Minimize using Brzozowski's O(2<sup>n</sup>) algorithm.
+   * This algorithm uses the reverse-determinize-reverse-determinize trick, which has a bad
+   * worst-case behavior but often works very well in practice
+   * (even better than Hopcroft's!).
+   */
+  public static final int MINIMIZE_BRZOZOWSKI = 1;
+  /**
+   * Minimize using Hopcroft's O(n log n) algorithm.
+   */
+  public static final int MINIMIZE_HOPCROFT = 2;
+  /**
+   * Minimize using Valmari's O(n + m log m) algorithm.
+   */
+  public static final int MINIMIZE_VALMARI = 3;
+
+  /** Minimize always flag. */
+  public static boolean _minimizeAlways = false;
+
+  /** Selects whether operations may modify the input automata (default: <code>false</code>). */
+  public static boolean _allowMutation = false;
+
+  /** Selects minimization algorithm (default: <code>MINIMIZE_HOPCROFT</code>). */
+  public static int _minimization = MINIMIZE_HOPCROFT;
+
+  /** Initial state of this automaton. */
+  State _initial;
+
+  /** If true, then this automaton is definitely deterministic
+   (i.e., there are no choices for any run, but a run may crash). */
+  boolean _deterministic;
+
+  /** Hash code. Recomputed by {@link #minimize()}. */
+  int _hashCode;
+
+  /** Singleton string. Null if not applicable. */
+  String _singleton;
+
+  /**
+   * Constructs a new automaton that accepts the empty language.
+   * Using this constructor, automata can be constructed manually from
+   * {@link State} and {@link Transition} objects.
+   * @see #setInitialState(State)
+   * @see State
+   * @see Transition
+   */
+  public Automaton() {
+    _initial = new State();
+    _deterministic = true;
+    _singleton = null;
+  }
+
+  /**
+   * Sets or resets allow mutate flag.
+   * If this flag is set, then all automata operations may modify automata given as input;
+   * otherwise, operations will always leave input automata languages unmodified.
+   * By default, the flag is not set.
+   * @param flag if true, the flag is set
+   * @return previous value of the flag
+   */
+  static public boolean setAllowMutate(boolean flag) {
+    boolean b = _allowMutation;
+    _allowMutation = flag;
+    return b;
+  }
+
+  /**
+   * Assigns consecutive numbers to the given states.
+   */
+  static void setStateNumbers(Set<State> states) {
+    if (states.size() == Integer.MAX_VALUE) {
+      throw new IllegalArgumentException("number of states exceeded Integer.MAX_VALUE");
+    }
+    int number = 0;
+    for (State s : states) {
+      s._number = number++;
+    }
+  }
+
+  /**
+   * Returns a sorted array of transitions for each state (and sets state numbers).
+   */
+  static Transition[][] getSortedTransitions(Set<State> states) {
+    setStateNumbers(states);
+    Transition[][] transitions = new Transition[states.size()][];
+    for (State s : states) {
+      transitions[s._number] = s.getSortedTransitionArray(false);
+    }
+    return transitions;
+  }
+
+  /**
+   * See {@link MinimizationOperations#minimize(Automaton)}.
+   * Returns the automaton being given as argument.
+   */
+  public static Automaton minimize(Automaton a) {
+    a.minimize();
+    return a;
+  }
+
+  void checkMinimizeAlways() {
+    if (_minimizeAlways) {
+      minimize();
+    }
+  }
+
+  boolean isSingleton() {
+    return _singleton != null;
+  }
+
+  /**
+   * Gets initial state.
+   * @return state
+   */
+  public State getInitialState() {
+    expandSingleton();
+    return _initial;
+  }
+
+  /**
+   * Sets initial state.
+   * @param s state
+   */
+  public void setInitialState(State s) {
+    _initial = s;
+    _singleton = null;
+  }
+
+  /**
+   * Returns the set of states that are reachable from the initial state.
+   * @return set of {@link State} objects
+   */
+  public Set<State> getStates() {
+    expandSingleton();
+    Set<State> visited;
+
+    visited = new HashSet<>();
+    LinkedList<State> worklist = new LinkedList<State>();
+    worklist.add(_initial);
+    visited.add(_initial);
+    while (worklist.size() > 0) {

Review comment:
       `while (!worklist.isEmpty())` to avoid quadratic loop




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] richardstartin commented on a change in pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
richardstartin commented on a change in pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#discussion_r710147522



##########
File path: pinot-segment-local/src/test/java/org/apache/pinot/segment/local/utils/nativefst/FSTBenchmarkTest.java
##########
@@ -0,0 +1,220 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.pinot.segment.local.utils.nativefst;
+
+import java.io.BufferedReader;
+import java.io.File;
+import java.io.FileInputStream;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.InputStreamReader;
+import java.nio.charset.StandardCharsets;
+import java.util.List;
+import java.util.SortedMap;
+import java.util.TreeMap;
+import org.apache.pinot.segment.local.utils.nativefst.builders.FSTBuilder;
+import org.apache.pinot.segment.local.utils.nativefst.utils.RegexpMatcher;
+import org.openjdk.jmh.annotations.Benchmark;
+import org.openjdk.jmh.annotations.BenchmarkMode;
+import org.openjdk.jmh.annotations.Fork;
+import org.openjdk.jmh.annotations.Measurement;
+import org.openjdk.jmh.annotations.Mode;
+import org.openjdk.jmh.annotations.Scope;
+import org.openjdk.jmh.annotations.State;
+import org.openjdk.jmh.annotations.Warmup;
+import org.openjdk.jmh.infra.Blackhole;
+
+
+/**
+ * This benchmark uses COCACorpus which constitutes of 51 million words and 1.5 million unique
+ * words. The benchmark runs a set of queries on Lucene FST and native FST and publishes numbers.
+ */
+public class FSTBenchmarkTest {

Review comment:
       This benchmark would be better if you modelled the regexes as a parameter:
   ```java
   @Param({"q.[aeiou]c.*", ".*a", ...})
   String regex;
   ```
   
   For a few reasons: 
   1. You have less benchmark code this way
   2. This prevents constant folding interfering with benchmark execution (this one is a big deal and why `@Param` exists)
   3. You can see the regex in the JMH output which makes the results easier to interpret for others




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] amrishlal commented on a change in pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
amrishlal commented on a change in pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#discussion_r705714655



##########
File path: pinot-plugins/pinot-input-format/pinot-protobuf/src/test/java/org/apache/pinot/plugin/inputformat/protobuf/ComplexTypes.java
##########
@@ -719,8 +719,12 @@ public int getNestedIntField() {
       @java.lang.Override
       public final boolean isInitialized() {
         byte isInitialized = memoizedIsInitialized;
-        if (isInitialized == 1) return true;
-        if (isInitialized == 0) return false;
+        if (isInitialized == 1) {
+          return true;
+        }
+        if (isInitialized == 0) {
+          return false;
+        }

Review comment:
       Please avoid unrelated formatting changes specially in a large PR like this one. I would suggest reverting all changes that are strictly not part of this PR.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] codecov-commenter edited a comment on pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
codecov-commenter edited a comment on pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#issuecomment-917372847


   # [Codecov](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#7405](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (9bae7b5) into [master](https://codecov.io/gh/apache/pinot/commit/fe14d60a5fd3405cc695f9a3a9d1df73a0dbbea3?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (fe14d60) will **decrease** coverage by `2.49%`.
   > The diff coverage is `41.78%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/pinot/pull/7405/graphs/tree.svg?width=650&height=150&src=pr&token=4ibza2ugkz&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #7405      +/-   ##
   ============================================
   - Coverage     71.91%   69.41%   -2.50%     
   - Complexity     3348     3818     +470     
   ============================================
     Files          1517     1548      +31     
     Lines         75039    78354    +3315     
     Branches      10921    11589     +668     
   ============================================
   + Hits          53961    54388     +427     
   - Misses        17451    20226    +2775     
   - Partials       3627     3740     +113     
   ```
   
   | Flag | Coverage Δ | |
   |---|---|---|
   | integration1 | `?` | |
   | integration2 | `27.89% <0.00%> (-1.22%)` | :arrow_down: |
   | unittests1 | `68.07% <41.78%> (-1.63%)` | :arrow_down: |
   | unittests2 | `13.90% <0.00%> (-0.63%)` | :arrow_down: |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [...t/local/utils/nativefst/NativeFSTIndexCreator.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvTmF0aXZlRlNUSW5kZXhDcmVhdG9yLmphdmE=) | `0.00% <0.00%> (ø)` | |
   | [...nt/local/utils/nativefst/NativeFSTIndexReader.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvTmF0aXZlRlNUSW5kZXhSZWFkZXIuamF2YQ==) | `0.00% <0.00%> (ø)` | |
   | [...al/utils/nativefst/automaton/AutomatonMatcher.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL0F1dG9tYXRvbk1hdGNoZXIuamF2YQ==) | `0.00% <0.00%> (ø)` | |
   | [...ils/nativefst/automaton/CharacterRunAutomaton.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL0NoYXJhY3RlclJ1bkF1dG9tYXRvbi5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [.../local/utils/nativefst/automaton/RunAutomaton.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL1J1bkF1dG9tYXRvbi5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [...l/utils/nativefst/automaton/ShuffleOperations.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL1NodWZmbGVPcGVyYXRpb25zLmphdmE=) | `0.00% <0.00%> (ø)` | |
   | [...l/utils/nativefst/automaton/SpecialOperations.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL1NwZWNpYWxPcGVyYXRpb25zLmphdmE=) | `0.00% <0.00%> (ø)` | |
   | [...ent/local/utils/nativefst/automaton/StatePair.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL1N0YXRlUGFpci5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [...ils/nativefst/automaton/StringUnionOperations.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL1N0cmluZ1VuaW9uT3BlcmF0aW9ucy5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [...gment/local/utils/nativefst/builders/FSTUtils.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYnVpbGRlcnMvRlNUVXRpbHMuamF2YQ==) | `14.28% <14.28%> (ø)` | |
   | ... and [162 more](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [fe14d60...9bae7b5](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] richardstartin commented on a change in pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
richardstartin commented on a change in pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#discussion_r710116243



##########
File path: pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/nativefst/ImmutableFST.java
##########
@@ -0,0 +1,406 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.utils.nativefst;
+
+import java.io.DataInputStream;
+import java.io.IOException;
+import java.io.InputStream;
+import java.util.Collections;
+import java.util.EnumSet;
+import java.util.Map;
+import java.util.Set;
+import org.apache.pinot.segment.local.io.readerwriter.PinotDataBufferMemoryManager;
+import org.apache.pinot.segment.local.realtime.impl.dictionary.OffHeapMutableBytesStore;
+import org.apache.pinot.spi.utils.Pair;
+
+
+/**
+ * FST binary format implementation
+ *
+ * <p>
+ * This version indicates the dictionary was built with these flags:
+ * {@link FSTFlags#FLEXIBLE}, {@link FSTFlags#STOPBIT} and
+ * {@link FSTFlags#NEXTBIT}. The internal representation of the FST must
+ * therefore follow this description (please note this format describes only a
+ * single transition (arc), not the entire dictionary file).
+ *
+ * <pre>
+ * ---- this node header present only if automaton was compiled with NUMBERS option.
+ * Byte
+ *        +-+-+-+-+-+-+-+-+\
+ *      0 | | | | | | | | | \  LSB
+ *        +-+-+-+-+-+-+-+-+  +
+ *      1 | | | | | | | | |  |      number of strings recognized
+ *        +-+-+-+-+-+-+-+-+  +----- by the automaton starting
+ *        : : : : : : : : :  |      from this node.
+ *        +-+-+-+-+-+-+-+-+  +
+ *  ctl-1 | | | | | | | | | /  MSB
+ *        +-+-+-+-+-+-+-+-+/
+ *
+ * ---- remaining part of the node
+ * Length of output symbols dictionary -- Integer
+ * <Arc ID, Output Symbol>
+ * <Arc ID, Output Symbol>
+ * <Arc ID, Output Symbol>
+ * .
+ * .
+ * .
+ * <Arc ID, Output Symbol> (Length)
+ *
+ * Byte
+ *       +-+-+-+-+-+-+-+-+\
+ *     0 | | | | | | | | | +------ label
+ *       +-+-+-+-+-+-+-+-+/
+ *
+ *                  +------------- node pointed to is next
+ *                  | +----------- the last arc of the node
+ *                  | | +--------- the arc is final
+ *                  | | |
+ *             +-----------+
+ *             |    | | |  |
+ *         ___+___  | | |  |
+ *        /       \ | | |  |
+ *       MSB           LSB |
+ *        7 6 5 4 3 2 1 0  |
+ *       +-+-+-+-+-+-+-+-+ |
+ *     1 | | | | | | | | | \ \
+ *       +-+-+-+-+-+-+-+-+  \ \  LSB
+ *       +-+-+-+-+-+-+-+-+     +
+ *     2 | | | | | | | | |     |
+ *       +-+-+-+-+-+-+-+-+     |
+ *     3 | | | | | | | | |     +----- target node address (in bytes)
+ *       +-+-+-+-+-+-+-+-+     |      (not present except for the byte
+ *       : : : : : : : : :     |       with flags if the node pointed to
+ *       +-+-+-+-+-+-+-+-+     +       is next)
+ *   gtl | | | | | | | | |    /  MSB
+ *       +-+-+-+-+-+-+-+-+   /
+ * gtl+1                           (gtl = gotoLength)
+ * </pre>
+ */
+public final class ImmutableFST extends FST {
+  /**
+   * Default filler byte.
+   */
+  public final static byte DEFAULT_FILLER = '_';
+
+  /**
+   * Default annotation byte.
+   */
+  public final static byte DEFAULT_ANNOTATION = '+';
+
+  /**
+   * Automaton version as in the file header.
+   */
+  public static final byte VERSION = 5;
+
+  /**
+   * Bit indicating that an arc corresponds to the last character of a sequence
+   * available when building the automaton.
+   */
+  public static final int BIT_FINAL_ARC = 1 << 0;
+
+  /**
+   * Bit indicating that an arc is the last one of the node's list and the
+   * following one belongs to another node.
+   */
+  public static final int BIT_LAST_ARC = 1 << 1;
+
+  /**
+   * Bit indicating that the target node of this arc follows it in the
+   * compressed automaton structure (no goto field).
+   */
+  public static final int BIT_TARGET_NEXT = 1 << 2;
+
+  /**
+   * An offset in the arc structure, where the address and flags field begins.
+   * In version 5 of FST automata, this value is constant (1, skip label).
+   */
+  public final static int ADDRESS_OFFSET = 1;
+
+  private static final int PER_BUFFER_SIZE = 16;
+
+  /**
+   * An array of bytes with the internal representation of the automaton. Please
+   * see the documentation of this class for more information on how this
+   * structure is organized.
+   */
+  public final OffHeapMutableBytesStore _mutableBytesStore;
+  /**
+   * The length of the node header structure (if the automaton was compiled with
+   * <code>NUMBERS</code> option). Otherwise zero.
+   */
+  public final int _nodeDataLength;
+  /**
+   * Number of bytes each address takes in full, expanded form (goto length).
+   */
+  public final int _gotoLength;
+  /** Filler character. */
+  public final byte _filler;
+  /** Annotation character. */
+  public final byte _annotation;
+  public Map<Integer, Integer> _outputSymbols;
+  /**
+   * Flags for this automaton version.
+   */
+  private Set<FSTFlags> _flags;
+
+  /**
+   * Read and wrap a binary automaton in FST version 5.
+   */
+  ImmutableFST(InputStream stream, boolean hasOutputSymbols, PinotDataBufferMemoryManager memoryManager)
+      throws IOException {
+    DataInputStream in = new DataInputStream(stream);
+
+    this._filler = in.readByte();
+    this._annotation = in.readByte();
+    final byte hgtl = in.readByte();
+
+    _mutableBytesStore = new OffHeapMutableBytesStore(memoryManager, "ImmutableFST");
+
+    /*
+     * Determine if the automaton was compiled with NUMBERS. If so, modify
+     * ctl and goto fields accordingly.
+     */
+    _flags = EnumSet.of(FSTFlags.FLEXIBLE, FSTFlags.STOPBIT, FSTFlags.NEXTBIT);
+    if ((hgtl & 0xf0) != 0) {
+      _flags.add(FSTFlags.NUMBERS);
+    }
+
+    _flags = Collections.unmodifiableSet(_flags);
+
+    this._nodeDataLength = (hgtl >>> 4) & 0x0f;
+    this._gotoLength = hgtl & 0x0f;
+
+    if (hasOutputSymbols) {
+      final int outputSymbolsLength = in.readInt();
+      byte[] outputSymbolsBuffer = readRemaining(in, outputSymbolsLength);
+
+      if (outputSymbolsBuffer.length > 0) {
+        String outputSymbolsSerialized = new String(outputSymbolsBuffer);
+
+        _outputSymbols = buildMap(outputSymbolsSerialized);
+      }
+    }
+
+    readRemaining(in);
+  }
+
+  protected final void readRemaining(InputStream in)
+      throws IOException {
+    byte[] buffer = new byte[PER_BUFFER_SIZE];
+    while ((in.read(buffer)) >= 0) {
+      _mutableBytesStore.add(buffer);
+    }
+  }
+
+  /**
+   * Returns the start node of this automaton.
+   */
+  @Override
+  public int getRootNode() {
+    // Skip dummy node marking terminating state.
+    final int epsilonNode = skipArc(getFirstArc(0));
+
+    // And follow the epsilon node's first (and only) arc.
+    return getDestinationNodeOffset(getFirstArc(epsilonNode));
+  }
+
+  /**
+   * {@inheritDoc}
+   */
+  @Override
+  public final int getFirstArc(int node) {
+    return _nodeDataLength + node;
+  }
+
+  /**
+   * {@inheritDoc}
+   */
+  @Override
+  public final int getNextArc(int arc) {
+    if (isArcLast(arc)) {
+      return 0;
+    } else {
+      return skipArc(arc);
+    }
+  }
+
+  /**
+   * {@inheritDoc}
+   */
+  @Override
+  public int getArc(int node, byte label) {
+    for (int arc = getFirstArc(node); arc != 0; arc = getNextArc(arc)) {
+      if (getArcLabel(arc) == label) {
+        return arc;
+      }
+    }
+
+    // An arc labeled with "label" not found.
+    return 0;
+  }
+
+  /**
+   * {@inheritDoc}
+   */
+  @Override
+  public int getEndNode(int arc) {
+    final int nodeOffset = getDestinationNodeOffset(arc);
+    return nodeOffset;
+  }
+
+  /**
+   * {@inheritDoc}
+   */
+  @Override
+  public byte getArcLabel(int arc) {
+    return getByte(arc, 0);
+  }
+
+  @Override
+  public int getOutputSymbol(int arc) {
+    return _outputSymbols.get(arc);
+  }
+
+  /**
+   * {@inheritDoc}
+   */
+  @Override
+  public boolean isArcFinal(int arc) {
+    return (getByte(arc, ADDRESS_OFFSET) & BIT_FINAL_ARC) != 0;
+  }
+
+  /**
+   * {@inheritDoc}
+   */
+  @Override
+  public boolean isArcTerminal(int arc) {
+    return (0 == getDestinationNodeOffset(arc));
+  }
+
+  /**
+   * Returns the number encoded at the given node. The number equals the count
+   * of the set of suffixes reachable from <code>node</code> (called its right
+   * language).
+   */
+  @Override
+  public int getRightLanguageCount(int node) {
+    assert getFlags().contains(FSTFlags.NUMBERS) : "This FST was not compiled with NUMBERS.";
+    return decodeFromBytes(node, _nodeDataLength);
+  }
+
+  /**
+   * {@inheritDoc}
+   *
+   * <p>
+   * For this automaton version, an additional {@link FSTFlags#NUMBERS} flag may
+   * be set to indicate the automaton contains extra fields for each node.
+   * </p>
+   */
+  @Override
+  public Set<FSTFlags> getFlags() {
+    return _flags;
+  }
+
+  /**
+   * Returns <code>true</code> if this arc has <code>NEXT</code> bit set.
+   *
+   * @see #BIT_LAST_ARC
+   * @param arc The node's arc identifier.
+   * @return Returns true if the argument is the last arc of a node.
+   */
+  public boolean isArcLast(int arc) {
+    return (getByte(arc, ADDRESS_OFFSET) & BIT_LAST_ARC) != 0;
+  }
+
+  /**
+   * @see #BIT_TARGET_NEXT
+   * @param arc The node's arc identifier.
+   * @return Returns true if {@link #BIT_TARGET_NEXT} is set for this arc.
+   */
+  public boolean isNextSet(int arc) {
+
+    return (getByte(arc, ADDRESS_OFFSET) & BIT_TARGET_NEXT) != 0;
+  }
+
+  /**
+   * Returns an n-byte integer encoded in byte-packed representation.
+   */
+  final int decodeFromBytes(final int start, final int n) {
+    int r = 0;
+
+    for (int i = n; --i >= 0; ) {
+      Pair<Integer, Integer> offheapOffsets = getOffheapOffsets(start + i);
+      byte[] inputData = _mutableBytesStore.get(offheapOffsets.getFirst());
+
+      r = r << 8 | (inputData[offheapOffsets.getSecond()] & 0xff);
+    }
+    return r;
+  }
+
+  /**
+   * Returns the address of the node pointed to by this arc.
+   */
+  final int getDestinationNodeOffset(int arc) {
+    if (isNextSet(arc)) {
+      /* The destination node follows this arc in the array. */
+      return skipArc(arc);
+    } else {
+      /*
+       * The destination node address has to be extracted from the arc's
+       * goto field.
+       */
+      return decodeFromBytes(arc + ADDRESS_OFFSET, _gotoLength) >>> 3;
+    }
+  }
+
+  /**
+   * Read the arc's layout and skip as many bytes, as needed.
+   */
+  private int skipArc(int offset) {
+    return offset + (isNextSet(offset) ? 1 + 1   /* label + flags */ : 1 + _gotoLength /* label + flags/address */);
+  }
+
+  private byte getByte(int seek, int offset) {
+    Pair<Integer, Integer> offheapOffsets = getOffheapOffsets(seek);
+
+    int fooArc = offheapOffsets.getFirst();
+    byte[] retVal = _mutableBytesStore.get((fooArc));
+
+    int barArc = offheapOffsets.getSecond();
+    int target = barArc + offset;
+
+    if (target >= PER_BUFFER_SIZE) {
+      retVal = _mutableBytesStore.get(fooArc + 1);
+      target = target - PER_BUFFER_SIZE;
+    }
+
+    return retVal[target];
+  }
+
+  private Pair<Integer, Integer> getOffheapOffsets(int seek) {
+    int fooArc = seek >= PER_BUFFER_SIZE ? seek / PER_BUFFER_SIZE : 0;
+    int barArc = seek >= PER_BUFFER_SIZE ? seek - ((fooArc) * PER_BUFFER_SIZE) : seek;
+
+    assert fooArc < _mutableBytesStore.getNumValues();
+    assert barArc < PER_BUFFER_SIZE;
+
+    return new Pair<>(fooArc, barArc);
+  }

Review comment:
       I would be tempted to inline this, it's only used twice (but measure the impact of doing so)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] Jackie-Jiang commented on a change in pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
Jackie-Jiang commented on a change in pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#discussion_r710556941



##########
File path: pinot-segment-local/pom.xml
##########
@@ -156,5 +156,30 @@
       <type>test-jar</type>
       <scope>test</scope>
     </dependency>
+    <dependency>
+      <groupId>com.carrotsearch</groupId>
+      <artifactId>hppc</artifactId>
+      <version>0.7.2</version>
+    </dependency>
+    <dependency>
+      <groupId>org.openjdk.jmh</groupId>

Review comment:
       We keep all the micro-benchmark in the `pinot-perf` package. Let's not introducing jmh in the production code package but move the perf class into the `pinot-perf`

##########
File path: pinot-segment-local/pom.xml
##########
@@ -156,5 +156,30 @@
       <type>test-jar</type>
       <scope>test</scope>
     </dependency>
+    <dependency>
+      <groupId>com.carrotsearch</groupId>
+      <artifactId>hppc</artifactId>

Review comment:
       We use `it.unimi.dsi.fastutil` for the primitive type data structures (e.g. `it.unimi.dsi.fastutil.ints.Int2IntOpenHashMap`). Let's try to reduce the external dependency unless there is some clear benefits of introducing this new library




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] amrishlal commented on pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
amrishlal commented on pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#issuecomment-917104588


   > > I would suggest creating a new pinot-fst module for the new FST implementation.
   > 
   > Not sure about this.. we don't create modules at root level for other indexing. Need to think carefully before creating modules at the root.. at some point, ideally, we should create plugin mechanisms for indexes and then create a module for each index.. we are not there yet
   
   It doesn't have to be a top-level module, but basically, I was looking for some way to sufficiently encapsulate this FST implementation using interfaces (in a separate jar if possible) and then use it within Pinot (?).
   
   > For e.g., in this PR, ImmutableFST uses the off heap bytes store in pinot-segment-local, thus creating a cyclic dependency. 
   Sounds like we need interfaces?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] codecov-commenter edited a comment on pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
codecov-commenter edited a comment on pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#issuecomment-917372847


   # [Codecov](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#7405](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (96001bf) into [master](https://codecov.io/gh/apache/pinot/commit/fe14d60a5fd3405cc695f9a3a9d1df73a0dbbea3?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (fe14d60) will **decrease** coverage by `2.27%`.
   > The diff coverage is `44.88%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/pinot/pull/7405/graphs/tree.svg?width=650&height=150&src=pr&token=4ibza2ugkz&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #7405      +/-   ##
   ============================================
   - Coverage     71.91%   69.64%   -2.28%     
   - Complexity     3348     3816     +468     
   ============================================
     Files          1517     1546      +29     
     Lines         75039    78137    +3098     
     Branches      10921    11560     +639     
   ============================================
   + Hits          53961    54415     +454     
   - Misses        17451    19975    +2524     
   - Partials       3627     3747     +120     
   ```
   
   | Flag | Coverage Δ | |
   |---|---|---|
   | integration1 | `?` | |
   | integration2 | `27.98% <0.00%> (-1.13%)` | :arrow_down: |
   | unittests1 | `68.37% <44.88%> (-1.33%)` | :arrow_down: |
   | unittests2 | `13.95% <0.00%> (-0.58%)` | :arrow_down: |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [...t/local/utils/nativefst/NativeFSTIndexCreator.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvTmF0aXZlRlNUSW5kZXhDcmVhdG9yLmphdmE=) | `0.00% <0.00%> (ø)` | |
   | [...nt/local/utils/nativefst/NativeFSTIndexReader.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvTmF0aXZlRlNUSW5kZXhSZWFkZXIuamF2YQ==) | `0.00% <0.00%> (ø)` | |
   | [...al/utils/nativefst/automaton/AutomatonMatcher.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL0F1dG9tYXRvbk1hdGNoZXIuamF2YQ==) | `0.00% <0.00%> (ø)` | |
   | [...ils/nativefst/automaton/CharacterRunAutomaton.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL0NoYXJhY3RlclJ1bkF1dG9tYXRvbi5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [.../local/utils/nativefst/automaton/RunAutomaton.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL1J1bkF1dG9tYXRvbi5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [...l/utils/nativefst/automaton/SpecialOperations.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL1NwZWNpYWxPcGVyYXRpb25zLmphdmE=) | `0.00% <0.00%> (ø)` | |
   | [...ent/local/utils/nativefst/automaton/StatePair.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL1N0YXRlUGFpci5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [...ils/nativefst/automaton/StringUnionOperations.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL1N0cmluZ1VuaW9uT3BlcmF0aW9ucy5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [...gment/local/utils/nativefst/builders/FSTUtils.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYnVpbGRlcnMvRlNUVXRpbHMuamF2YQ==) | `14.28% <14.28%> (ø)` | |
   | [...local/utils/nativefst/automaton/BasicAutomata.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL0Jhc2ljQXV0b21hdGEuamF2YQ==) | `20.00% <20.00%> (ø)` | |
   | ... and [175 more](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [fe14d60...96001bf](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] richardstartin commented on a change in pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
richardstartin commented on a change in pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#discussion_r716510805



##########
File path: pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/nativefst/automaton/Automaton.java
##########
@@ -0,0 +1,653 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.pinot.segment.local.utils.nativefst.automaton;
+
+import java.io.Serializable;
+import java.util.Arrays;
+import java.util.BitSet;
+import java.util.Collection;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.LinkedList;
+import java.util.List;
+import java.util.Set;
+
+
+/**
+ * Finite-state automaton with regular expression operations.
+ * <p>
+ * Class invariants:
+ * <ul>
+ * <li> An automaton is either represented explicitly (with {@link State} and {@link Transition} objects)
+ *      or with a singleton string ({@link #expandSingleton()}) in case
+ *      the automaton is known to accept exactly one string.
+ *      (Implicitly, all states and transitions of an automaton are reachable from its initial state.)
+ * <li> Automata are always reduced (see {@link #reduce()}) 
+ *      and have no transitions to dead states (see {@link #removeDeadTransitions()}).
+ * <li> Automata provided as input to operations are generally assumed to be disjoint.
+ * </ul>
+ * <p>
+ */
+public class Automaton implements Serializable, Cloneable {
+
+  /**
+   * Minimize using Huffman's O(n<sup>2</sup>) algorithm.
+   * This is the standard text-book algorithm.
+   */
+  public static final int MINIMIZE_HUFFMAN = 0;
+  /**
+   * Minimize using Brzozowski's O(2<sup>n</sup>) algorithm.
+   * This algorithm uses the reverse-determinize-reverse-determinize trick, which has a bad
+   * worst-case behavior but often works very well in practice
+   * (even better than Hopcroft's!).
+   */
+  public static final int MINIMIZE_BRZOZOWSKI = 1;
+  /**
+   * Minimize using Hopcroft's O(n log n) algorithm.
+   */
+  public static final int MINIMIZE_HOPCROFT = 2;
+  /**
+   * Minimize using Valmari's O(n + m log m) algorithm.
+   */
+  public static final int MINIMIZE_VALMARI = 3;
+
+  /** Minimize always flag. */
+  public static boolean _minimizeAlways = false;
+
+  /** Selects whether operations may modify the input automata (default: <code>false</code>). */
+  public static boolean _allowMutation = false;
+
+  /** Selects minimization algorithm (default: <code>MINIMIZE_HOPCROFT</code>). */
+  public static int _minimization = MINIMIZE_HOPCROFT;
+
+  /** Initial state of this automaton. */
+  State _initial;
+
+  /** If true, then this automaton is definitely deterministic
+   (i.e., there are no choices for any run, but a run may crash). */
+  boolean _deterministic;
+
+  /** Hash code. Recomputed by {@link #minimize()}. */
+  int _hashCode;
+
+  /** Singleton string. Null if not applicable. */
+  String _singleton;
+
+  /**
+   * Constructs a new automaton that accepts the empty language.
+   * Using this constructor, automata can be constructed manually from
+   * {@link State} and {@link Transition} objects.
+   * @see #setInitialState(State)
+   * @see State
+   * @see Transition
+   */
+  public Automaton() {
+    _initial = new State();
+    _deterministic = true;
+    _singleton = null;
+  }
+
+  /**
+   * Sets or resets allow mutate flag.
+   * If this flag is set, then all automata operations may modify automata given as input;
+   * otherwise, operations will always leave input automata languages unmodified.
+   * By default, the flag is not set.
+   * @param flag if true, the flag is set
+   * @return previous value of the flag
+   */
+  static public boolean setAllowMutate(boolean flag) {
+    boolean b = _allowMutation;
+    _allowMutation = flag;
+    return b;
+  }
+
+  /**
+   * Assigns consecutive numbers to the given states.
+   */
+  static void setStateNumbers(Set<State> states) {
+    if (states.size() == Integer.MAX_VALUE) {
+      throw new IllegalArgumentException("number of states exceeded Integer.MAX_VALUE");
+    }
+    int number = 0;
+    for (State s : states) {
+      s._number = number++;
+    }
+  }
+
+  /**
+   * Returns a sorted array of transitions for each state (and sets state numbers).
+   */
+  static Transition[][] getSortedTransitions(Set<State> states) {
+    setStateNumbers(states);
+    Transition[][] transitions = new Transition[states.size()][];
+    for (State s : states) {
+      transitions[s._number] = s.getSortedTransitionArray(false);
+    }
+    return transitions;
+  }
+
+  /**
+   * See {@link MinimizationOperations#minimize(Automaton)}.
+   * Returns the automaton being given as argument.
+   */
+  public static Automaton minimize(Automaton a) {
+    a.minimize();
+    return a;
+  }
+
+  void checkMinimizeAlways() {
+    if (_minimizeAlways) {
+      minimize();
+    }
+  }
+
+  boolean isSingleton() {
+    return _singleton != null;
+  }
+
+  /**
+   * Gets initial state.
+   * @return state
+   */
+  public State getInitialState() {
+    expandSingleton();
+    return _initial;
+  }
+
+  /**
+   * Sets initial state.
+   * @param s state
+   */
+  public void setInitialState(State s) {
+    _initial = s;
+    _singleton = null;
+  }
+
+  /**
+   * Returns the set of states that are reachable from the initial state.
+   * @return set of {@link State} objects
+   */
+  public Set<State> getStates() {
+    expandSingleton();
+    Set<State> visited;
+
+    visited = new HashSet<>();
+    LinkedList<State> worklist = new LinkedList<State>();
+    worklist.add(_initial);
+    visited.add(_initial);
+    while (!worklist.isEmpty()) {
+      State s = worklist.removeFirst();
+      Collection<Transition> tr;
+
+      tr = s._transitionSet;
+      for (Transition t : tr) {
+        if (!visited.contains(t._to)) {
+          visited.add(t._to);
+          worklist.add(t._to);
+        }
+      }
+    }
+    return visited;
+  }
+
+  /**
+   * Returns the set of reachable accept states.
+   * @return set of {@link State} objects
+   */
+  public Set<State> getAcceptStates() {
+    expandSingleton();
+    HashSet<State> accepts = new HashSet<State>();
+    BitSet visited = new BitSet();
+    LinkedList<State> worklist = new LinkedList<State>();
+    worklist.add(_initial);
+    visited.set(_initial._id);
+    while (!worklist.isEmpty()) {
+      State s = worklist.removeFirst();
+      if (s._accept) {
+        accepts.add(s);
+      }
+      for (Transition t : s._transitionSet) {
+        if (!visited.get(t._to._id)) {
+          visited.set(t._to._id);
+          worklist.add(t._to);
+        }
+      }
+    }
+    return accepts;
+  }
+
+  /**
+   * Adds transitions to explicit crash state to ensure that transition function is total.
+   */
+  void totalize() {
+    State s = new State();
+    s._transitionSet.add(new Transition(Character.MIN_VALUE, Character.MAX_VALUE, s));
+    for (State p : getStates()) {
+      int maxi = Character.MIN_VALUE;
+      for (Transition t : p.getSortedTransitions(false)) {
+        if (t._min > maxi) {
+          p._transitionSet.add(new Transition((char) maxi, (char) (t._min - 1), s));
+        }
+        if (t._max + 1 > maxi) {
+          maxi = t._max + 1;
+        }
+      }
+      if (maxi <= Character.MAX_VALUE) {
+        p._transitionSet.add(new Transition((char) maxi, Character.MAX_VALUE, s));
+      }
+    }
+  }
+
+  /**
+   * Reduces this automaton.
+   * An automaton is "reduced" by combining overlapping and adjacent edge intervals with same destination.
+   */
+  public void reduce() {
+    if (isSingleton()) {
+      return;
+    }
+    Set<State> states = getStates();
+    setStateNumbers(states);
+    for (State s : states) {
+      List<Transition> st = s.getSortedTransitions(true);
+      s.resetTransitions();
+      State p = null;
+      int min = -1;
+      int max = -1;
+      for (Transition t : st) {
+        if (p == t._to) {
+          if (t._min <= max + 1) {
+            if (t._max > max) {
+              max = t._max;
+            }
+          } else {
+            if (p != null) {
+              s._transitionSet.add(new Transition((char) min, (char) max, p));
+            }
+            min = t._min;
+            max = t._max;
+          }
+        } else {
+          if (p != null) {
+            s._transitionSet.add(new Transition((char) min, (char) max, p));
+          }
+          p = t._to;
+          min = t._min;
+          max = t._max;
+        }
+      }
+      if (p != null) {
+        s._transitionSet.add(new Transition((char) min, (char) max, p));
+      }
+    }
+    clearHashCode();
+  }
+
+  /**
+   * Returns sorted array of all interval start points.
+   */
+  char[] getStartPoints() {
+    //TODO: move to bitsets
+    Set<Character> pointset = new HashSet<Character>();
+    pointset.add(Character.MIN_VALUE);
+    for (State s : getStates()) {
+      for (Transition t : s._transitionSet) {
+        pointset.add(t._min);
+        if (t._max < Character.MAX_VALUE) {
+          pointset.add((char) (t._max + 1));
+        }
+      }
+    }
+    char[] points = new char[pointset.size()];
+    int n = 0;
+    for (Character m : pointset) {
+      points[n++] = m;
+    }
+    // Remove once move to bitsets
+    Arrays.sort(points);
+    return points;
+  }
+
+  private Set<State> getLiveStates(Set<State> states) {
+    HashMap<State, Set<State>> map = new HashMap<State, Set<State>>();
+    for (State s : states) {
+      map.put(s, new HashSet<>());
+    }
+    for (State s : states) {
+      for (Transition t : s._transitionSet) {
+        map.get(t._to).add(s);
+      }
+    }
+    Set<State> live = new HashSet<State>(getAcceptStates());
+    LinkedList<State> worklist = new LinkedList<State>(live);
+    while (worklist.size() > 0) {

Review comment:
       This has been commented on previously, this should use `!worklist.isEmpty()` everywhere, because `LinkedList.size()` traverses the list each time. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] richardstartin commented on a change in pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
richardstartin commented on a change in pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#discussion_r716511999



##########
File path: pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/nativefst/automaton/State.java
##########
@@ -0,0 +1,180 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.pinot.segment.local.utils.nativefst.automaton;
+
+import java.io.Serializable;
+import java.util.Arrays;
+import java.util.Collection;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Set;
+import java.util.concurrent.atomic.AtomicInteger;
+
+
+/**
+ * <tt>Automaton</tt> state.
+ */
+public class State implements Serializable, Comparable<State> {
+
+  static AtomicInteger _nextId = new AtomicInteger();

Review comment:
       Needs to be `static final`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] codecov-commenter edited a comment on pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
codecov-commenter edited a comment on pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#issuecomment-917372847


   # [Codecov](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#7405](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (995ec32) into [master](https://codecov.io/gh/apache/pinot/commit/fe14d60a5fd3405cc695f9a3a9d1df73a0dbbea3?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (fe14d60) will **decrease** coverage by `2.22%`.
   > The diff coverage is `44.80%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/pinot/pull/7405/graphs/tree.svg?width=650&height=150&src=pr&token=4ibza2ugkz&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #7405      +/-   ##
   ============================================
   - Coverage     71.91%   69.68%   -2.23%     
   - Complexity     3348     3836     +488     
   ============================================
     Files          1517     1548      +31     
     Lines         75039    78388    +3349     
     Branches      10921    11597     +676     
   ============================================
   + Hits          53961    54627     +666     
   - Misses        17451    19999    +2548     
   - Partials       3627     3762     +135     
   ```
   
   | Flag | Coverage Δ | |
   |---|---|---|
   | integration1 | `?` | |
   | integration2 | `27.94% <0.00%> (-1.17%)` | :arrow_down: |
   | unittests1 | `68.48% <44.80%> (-1.22%)` | :arrow_down: |
   | unittests2 | `13.94% <0.00%> (-0.59%)` | :arrow_down: |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [...t/local/utils/nativefst/NativeFSTIndexCreator.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvTmF0aXZlRlNUSW5kZXhDcmVhdG9yLmphdmE=) | `0.00% <0.00%> (ø)` | |
   | [...nt/local/utils/nativefst/NativeFSTIndexReader.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvTmF0aXZlRlNUSW5kZXhSZWFkZXIuamF2YQ==) | `0.00% <0.00%> (ø)` | |
   | [...al/utils/nativefst/automaton/AutomatonMatcher.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL0F1dG9tYXRvbk1hdGNoZXIuamF2YQ==) | `0.00% <0.00%> (ø)` | |
   | [...ils/nativefst/automaton/CharacterRunAutomaton.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL0NoYXJhY3RlclJ1bkF1dG9tYXRvbi5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [.../local/utils/nativefst/automaton/RunAutomaton.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL1J1bkF1dG9tYXRvbi5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [...l/utils/nativefst/automaton/SpecialOperations.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL1NwZWNpYWxPcGVyYXRpb25zLmphdmE=) | `0.00% <0.00%> (ø)` | |
   | [...ent/local/utils/nativefst/automaton/StatePair.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL1N0YXRlUGFpci5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [...ils/nativefst/automaton/StringUnionOperations.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL1N0cmluZ1VuaW9uT3BlcmF0aW9ucy5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [...gment/local/utils/nativefst/builders/FSTUtils.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYnVpbGRlcnMvRlNUVXRpbHMuamF2YQ==) | `14.28% <14.28%> (ø)` | |
   | [...local/utils/nativefst/automaton/BasicAutomata.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL0Jhc2ljQXV0b21hdGEuamF2YQ==) | `20.00% <20.00%> (ø)` | |
   | ... and [225 more](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [fe14d60...995ec32](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] richardstartin commented on a change in pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
richardstartin commented on a change in pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#discussion_r710121795



##########
File path: pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/nativefst/automaton/Automaton.java
##########
@@ -0,0 +1,652 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.pinot.segment.local.utils.nativefst.automaton;
+
+import java.io.Serializable;
+import java.util.Arrays;
+import java.util.Collection;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.LinkedList;
+import java.util.List;
+import java.util.Set;
+
+
+/**
+ * Finite-state automaton with regular expression operations.
+ * <p>
+ * Class invariants:
+ * <ul>
+ * <li> An automaton is either represented explicitly (with {@link State} and {@link Transition} objects)
+ *      or with a singleton string ({@link #expandSingleton()}) in case
+ *      the automaton is known to accept exactly one string.
+ *      (Implicitly, all states and transitions of an automaton are reachable from its initial state.)
+ * <li> Automata are always reduced (see {@link #reduce()}) 
+ *      and have no transitions to dead states (see {@link #removeDeadTransitions()}).
+ * <li> If an automaton is nondeterministic, then {@link #isDeterministic()} returns false (but
+ *      the converse is not required).
+ * <li> Automata provided as input to operations are generally assumed to be disjoint.
+ * </ul>
+ * <p>
+ */
+public class Automaton implements Serializable, Cloneable {
+
+  /**
+   * Minimize using Huffman's O(n<sup>2</sup>) algorithm.
+   * This is the standard text-book algorithm.
+   */
+  public static final int MINIMIZE_HUFFMAN = 0;
+  /**
+   * Minimize using Brzozowski's O(2<sup>n</sup>) algorithm.
+   * This algorithm uses the reverse-determinize-reverse-determinize trick, which has a bad
+   * worst-case behavior but often works very well in practice
+   * (even better than Hopcroft's!).
+   */
+  public static final int MINIMIZE_BRZOZOWSKI = 1;
+  /**
+   * Minimize using Hopcroft's O(n log n) algorithm.
+   */
+  public static final int MINIMIZE_HOPCROFT = 2;
+  /**
+   * Minimize using Valmari's O(n + m log m) algorithm.
+   */
+  public static final int MINIMIZE_VALMARI = 3;
+
+  /** Minimize always flag. */
+  public static boolean _minimizeAlways = false;
+
+  /** Selects whether operations may modify the input automata (default: <code>false</code>). */
+  public static boolean _allowMutation = false;
+
+  /** Selects minimization algorithm (default: <code>MINIMIZE_HOPCROFT</code>). */
+  public static int _minimization = MINIMIZE_HOPCROFT;
+
+  /** Initial state of this automaton. */
+  State _initial;
+
+  /** If true, then this automaton is definitely deterministic
+   (i.e., there are no choices for any run, but a run may crash). */
+  boolean _deterministic;
+
+  /** Hash code. Recomputed by {@link #minimize()}. */
+  int _hashCode;
+
+  /** Singleton string. Null if not applicable. */
+  String _singleton;
+
+  /**
+   * Constructs a new automaton that accepts the empty language.
+   * Using this constructor, automata can be constructed manually from
+   * {@link State} and {@link Transition} objects.
+   * @see #setInitialState(State)
+   * @see State
+   * @see Transition
+   */
+  public Automaton() {
+    _initial = new State();
+    _deterministic = true;
+    _singleton = null;
+  }
+
+  /**
+   * Sets or resets allow mutate flag.
+   * If this flag is set, then all automata operations may modify automata given as input;
+   * otherwise, operations will always leave input automata languages unmodified.
+   * By default, the flag is not set.
+   * @param flag if true, the flag is set
+   * @return previous value of the flag
+   */
+  static public boolean setAllowMutate(boolean flag) {
+    boolean b = _allowMutation;
+    _allowMutation = flag;
+    return b;
+  }
+
+  /**
+   * Assigns consecutive numbers to the given states.
+   */
+  static void setStateNumbers(Set<State> states) {
+    if (states.size() == Integer.MAX_VALUE) {
+      throw new IllegalArgumentException("number of states exceeded Integer.MAX_VALUE");
+    }
+    int number = 0;
+    for (State s : states) {
+      s._number = number++;
+    }
+  }
+
+  /**
+   * Returns a sorted array of transitions for each state (and sets state numbers).
+   */
+  static Transition[][] getSortedTransitions(Set<State> states) {
+    setStateNumbers(states);
+    Transition[][] transitions = new Transition[states.size()][];
+    for (State s : states) {
+      transitions[s._number] = s.getSortedTransitionArray(false);
+    }
+    return transitions;
+  }
+
+  /**
+   * See {@link MinimizationOperations#minimize(Automaton)}.
+   * Returns the automaton being given as argument.
+   */
+  public static Automaton minimize(Automaton a) {
+    a.minimize();
+    return a;
+  }
+
+  void checkMinimizeAlways() {
+    if (_minimizeAlways) {
+      minimize();
+    }
+  }
+
+  boolean isSingleton() {
+    return _singleton != null;
+  }
+
+  /**
+   * Gets initial state.
+   * @return state
+   */
+  public State getInitialState() {
+    expandSingleton();
+    return _initial;
+  }
+
+  /**
+   * Sets initial state.
+   * @param s state
+   */
+  public void setInitialState(State s) {
+    _initial = s;
+    _singleton = null;
+  }
+
+  /**
+   * Returns the set of states that are reachable from the initial state.
+   * @return set of {@link State} objects
+   */
+  public Set<State> getStates() {
+    expandSingleton();
+    Set<State> visited;
+
+    visited = new HashSet<>();
+    LinkedList<State> worklist = new LinkedList<State>();
+    worklist.add(_initial);
+    visited.add(_initial);
+    while (worklist.size() > 0) {
+      State s = worklist.removeFirst();
+      Collection<Transition> tr;
+
+      tr = s._transitionSet;
+      for (Transition t : tr) {
+        if (!visited.contains(t._to)) {
+          visited.add(t._to);
+          worklist.add(t._to);
+        }
+      }
+    }
+    return visited;
+  }
+
+  /**
+   * Returns the set of reachable accept states.
+   * @return set of {@link State} objects
+   */
+  public Set<State> getAcceptStates() {
+    expandSingleton();
+    HashSet<State> accepts = new HashSet<State>();
+    HashSet<State> visited = new HashSet<State>();
+    LinkedList<State> worklist = new LinkedList<State>();
+    worklist.add(_initial);
+    visited.add(_initial);
+    while (worklist.size() > 0) {

Review comment:
       `while (!worklist.isEmpty())`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] richardstartin commented on a change in pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
richardstartin commented on a change in pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#discussion_r710140553



##########
File path: pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/nativefst/utils/RegexpMatcher.java
##########
@@ -0,0 +1,170 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.utils.nativefst.utils;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Set;
+import org.apache.pinot.segment.local.utils.nativefst.FST;
+import org.apache.pinot.segment.local.utils.nativefst.automaton.Automaton;
+import org.apache.pinot.segment.local.utils.nativefst.automaton.CharacterRunAutomaton;
+import org.apache.pinot.segment.local.utils.nativefst.automaton.RegExp;
+import org.apache.pinot.segment.local.utils.nativefst.automaton.State;
+import org.apache.pinot.segment.local.utils.nativefst.automaton.Transition;
+
+
+/**
+ * RegexpMatcher is a helper to retrieve matching values for a given regexp query.
+ * Regexp query is converted into an automaton and we run the matching algorithm on FST.
+ *
+ * Two main functions of this class are
+ *   regexMatchOnFST() Function runs matching on FST (See function comments for more details)
+ *   match(input) Function builds the automaton and matches given input.
+ */
+public class RegexpMatcher {
+  private final String _regexQuery;
+  private final FST _fst;
+  private final Automaton _automaton;
+
+  public RegexpMatcher(String regexQuery, FST fst) {
+    _regexQuery = regexQuery;
+    _fst = fst;
+
+    _automaton = new RegExp(_regexQuery).toAutomaton();
+  }
+
+  public static List<Long> regexMatch(String regexQuery, FST fst) {

Review comment:
       Does this need to produce `List<Long>` - can it avoid boxing and yield an iterator instead?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] atris commented on a change in pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
atris commented on a change in pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#discussion_r710142729



##########
File path: pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/nativefst/utils/RegexpMatcher.java
##########
@@ -0,0 +1,170 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.utils.nativefst.utils;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Set;
+import org.apache.pinot.segment.local.utils.nativefst.FST;
+import org.apache.pinot.segment.local.utils.nativefst.automaton.Automaton;
+import org.apache.pinot.segment.local.utils.nativefst.automaton.CharacterRunAutomaton;
+import org.apache.pinot.segment.local.utils.nativefst.automaton.RegExp;
+import org.apache.pinot.segment.local.utils.nativefst.automaton.State;
+import org.apache.pinot.segment.local.utils.nativefst.automaton.Transition;
+
+
+/**
+ * RegexpMatcher is a helper to retrieve matching values for a given regexp query.
+ * Regexp query is converted into an automaton and we run the matching algorithm on FST.
+ *
+ * Two main functions of this class are
+ *   regexMatchOnFST() Function runs matching on FST (See function comments for more details)
+ *   match(input) Function builds the automaton and matches given input.
+ */
+public class RegexpMatcher {
+  private final String _regexQuery;
+  private final FST _fst;
+  private final Automaton _automaton;
+
+  public RegexpMatcher(String regexQuery, FST fst) {
+    _regexQuery = regexQuery;
+    _fst = fst;
+
+    _automaton = new RegExp(_regexQuery).toAutomaton();
+  }
+
+  public static List<Long> regexMatch(String regexQuery, FST fst) {

Review comment:
       Returning an iterator would still need building a list internally, right?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] atris commented on a change in pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
atris commented on a change in pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#discussion_r713604426



##########
File path: pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/nativefst/ConstantArcSizeFST.java
##########
@@ -0,0 +1,159 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.utils.nativefst;
+
+import java.util.Collections;
+import java.util.Map;
+import java.util.Set;
+import org.apache.pinot.segment.local.utils.nativefst.builders.FSTBuilder;
+
+
+/**
+ * A FST with constant-size arc representation produced directly by
+ * {@link FSTBuilder}.
+ *
+ * @see FSTBuilder
+ */
+public final class ConstantArcSizeFST extends FST {
+  /** Size of the target address field (constant for the builder). */
+  public final static int TARGET_ADDRESS_SIZE = 4;
+
+  /** Size of the flags field (constant for the builder). */
+  public final static int FLAGS_SIZE = 1;
+
+  /** Size of the label field (constant for the builder). */
+  public final static int LABEL_SIZE = 1;
+
+  /**
+   * Size of a single arc structure.
+   */
+  public final static int ARC_SIZE = FLAGS_SIZE + LABEL_SIZE + TARGET_ADDRESS_SIZE;
+
+  /** Offset of the flags field inside an arc. */
+  public final static int FLAGS_OFFSET = 0;
+
+  /** Offset of the label field inside an arc. */
+  public final static int LABEL_OFFSET = FLAGS_SIZE;
+
+  /** Offset of the address field inside an arc. */
+  public final static int ADDRESS_OFFSET = LABEL_OFFSET + LABEL_SIZE;
+  /**
+   * An arc flag indicating the target node of an arc corresponds to a final
+   * state.
+   */
+  public final static int BIT_ARC_FINAL = 1 << 1;
+  /** An arc flag indicating the arc is last within its state. */
+  public final static int BIT_ARC_LAST = 1 << 0;
+  /** A dummy address of the terminal state. */
+  public final static int TERMINAL_STATE = 0;
+  /**
+   * An epsilon state. The first and only arc of this state points either to the
+   * root or to the terminal state, indicating an empty automaton.
+   */
+  private final int _epsilon;
+
+  /**
+   * FST data, serialized as a byte array.
+   */
+  private final byte[] _data;
+
+  private Map<Integer, Integer> _outputSymbols;
+
+  /**
+   * @param data
+   *          FST data. There must be no trailing bytes after the last state.
+   */
+  public ConstantArcSizeFST(byte[] data, int epsilon, Map<Integer, Integer> outputSymbols) {
+    assert epsilon == 0 : "Epsilon is not zero?";
+
+    this._epsilon = epsilon;
+    this._data = data;
+    this._outputSymbols = outputSymbols;
+  }
+
+  @Override
+  public int getRootNode() {
+    return getEndNode(getFirstArc(_epsilon));
+  }
+
+  @Override
+  public int getFirstArc(int node) {
+    return node;
+  }
+
+  @Override
+  public int getArc(int node, byte label) {
+    for (int arc = getFirstArc(node); arc != 0; arc = getNextArc(arc)) {
+      if (getArcLabel(arc) == label) {
+        return arc;
+      }
+    }
+    return 0;
+  }
+
+  @Override
+  public int getNextArc(int arc) {
+    if (isArcLast(arc)) {
+      return 0;
+    }
+    return arc + ARC_SIZE;
+  }
+
+  @Override
+  public byte getArcLabel(int arc) {
+    return _data[arc + LABEL_OFFSET];
+  }
+
+  @Override
+  public int getOutputSymbol(int arc) {
+    return _outputSymbols.get(arc);

Review comment:
       @richardstartin The base method has a callout where it says that a null will be returned if the output symbol is not present.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] codecov-commenter edited a comment on pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
codecov-commenter edited a comment on pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#issuecomment-917372847


   # [Codecov](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#7405](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (96001bf) into [master](https://codecov.io/gh/apache/pinot/commit/fe14d60a5fd3405cc695f9a3a9d1df73a0dbbea3?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (fe14d60) will **decrease** coverage by `3.53%`.
   > The diff coverage is `44.88%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/pinot/pull/7405/graphs/tree.svg?width=650&height=150&src=pr&token=4ibza2ugkz&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #7405      +/-   ##
   ============================================
   - Coverage     71.91%   68.37%   -3.54%     
   - Complexity     3348     3724     +376     
   ============================================
     Files          1517     1152     -365     
     Lines         75039    56045   -18994     
     Branches      10921     8609    -2312     
   ============================================
   - Hits          53961    38321   -15640     
   + Misses        17451    14979    -2472     
   + Partials       3627     2745     -882     
   ```
   
   | Flag | Coverage Δ | |
   |---|---|---|
   | integration1 | `?` | |
   | integration2 | `?` | |
   | unittests1 | `68.37% <44.88%> (-1.33%)` | :arrow_down: |
   | unittests2 | `?` | |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [...t/local/utils/nativefst/NativeFSTIndexCreator.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvTmF0aXZlRlNUSW5kZXhDcmVhdG9yLmphdmE=) | `0.00% <0.00%> (ø)` | |
   | [...nt/local/utils/nativefst/NativeFSTIndexReader.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvTmF0aXZlRlNUSW5kZXhSZWFkZXIuamF2YQ==) | `0.00% <0.00%> (ø)` | |
   | [...al/utils/nativefst/automaton/AutomatonMatcher.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL0F1dG9tYXRvbk1hdGNoZXIuamF2YQ==) | `0.00% <0.00%> (ø)` | |
   | [...ils/nativefst/automaton/CharacterRunAutomaton.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL0NoYXJhY3RlclJ1bkF1dG9tYXRvbi5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [.../local/utils/nativefst/automaton/RunAutomaton.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL1J1bkF1dG9tYXRvbi5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [...l/utils/nativefst/automaton/SpecialOperations.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL1NwZWNpYWxPcGVyYXRpb25zLmphdmE=) | `0.00% <0.00%> (ø)` | |
   | [...ent/local/utils/nativefst/automaton/StatePair.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL1N0YXRlUGFpci5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [...ils/nativefst/automaton/StringUnionOperations.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL1N0cmluZ1VuaW9uT3BlcmF0aW9ucy5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [...gment/local/utils/nativefst/builders/FSTUtils.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYnVpbGRlcnMvRlNUVXRpbHMuamF2YQ==) | `14.28% <14.28%> (ø)` | |
   | [...local/utils/nativefst/automaton/BasicAutomata.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL0Jhc2ljQXV0b21hdGEuamF2YQ==) | `20.00% <20.00%> (ø)` | |
   | ... and [670 more](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [fe14d60...96001bf](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] atris commented on a change in pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
atris commented on a change in pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#discussion_r716571107



##########
File path: pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/nativefst/ImmutableFST.java
##########
@@ -0,0 +1,406 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.utils.nativefst;
+
+import java.io.DataInputStream;
+import java.io.IOException;
+import java.io.InputStream;
+import java.util.Collections;
+import java.util.EnumSet;
+import java.util.Map;
+import java.util.Set;
+import org.apache.pinot.segment.local.io.readerwriter.PinotDataBufferMemoryManager;
+import org.apache.pinot.segment.local.realtime.impl.dictionary.OffHeapMutableBytesStore;
+import org.apache.pinot.spi.utils.Pair;
+
+
+/**
+ * FST binary format implementation
+ *
+ * <p>
+ * This version indicates the dictionary was built with these flags:
+ * {@link FSTFlags#FLEXIBLE}, {@link FSTFlags#STOPBIT} and
+ * {@link FSTFlags#NEXTBIT}. The internal representation of the FST must
+ * therefore follow this description (please note this format describes only a
+ * single transition (arc), not the entire dictionary file).
+ *
+ * <pre>
+ * ---- this node header present only if automaton was compiled with NUMBERS option.
+ * Byte
+ *        +-+-+-+-+-+-+-+-+\
+ *      0 | | | | | | | | | \  LSB
+ *        +-+-+-+-+-+-+-+-+  +
+ *      1 | | | | | | | | |  |      number of strings recognized
+ *        +-+-+-+-+-+-+-+-+  +----- by the automaton starting
+ *        : : : : : : : : :  |      from this node.
+ *        +-+-+-+-+-+-+-+-+  +
+ *  ctl-1 | | | | | | | | | /  MSB
+ *        +-+-+-+-+-+-+-+-+/
+ *
+ * ---- remaining part of the node
+ * Length of output symbols dictionary -- Integer
+ * <Arc ID, Output Symbol>
+ * <Arc ID, Output Symbol>
+ * <Arc ID, Output Symbol>
+ * .
+ * .
+ * .
+ * <Arc ID, Output Symbol> (Length)
+ *
+ * Byte
+ *       +-+-+-+-+-+-+-+-+\
+ *     0 | | | | | | | | | +------ label
+ *       +-+-+-+-+-+-+-+-+/
+ *
+ *                  +------------- node pointed to is next
+ *                  | +----------- the last arc of the node
+ *                  | | +--------- the arc is final
+ *                  | | |
+ *             +-----------+
+ *             |    | | |  |
+ *         ___+___  | | |  |
+ *        /       \ | | |  |
+ *       MSB           LSB |
+ *        7 6 5 4 3 2 1 0  |
+ *       +-+-+-+-+-+-+-+-+ |
+ *     1 | | | | | | | | | \ \
+ *       +-+-+-+-+-+-+-+-+  \ \  LSB
+ *       +-+-+-+-+-+-+-+-+     +
+ *     2 | | | | | | | | |     |
+ *       +-+-+-+-+-+-+-+-+     |
+ *     3 | | | | | | | | |     +----- target node address (in bytes)
+ *       +-+-+-+-+-+-+-+-+     |      (not present except for the byte
+ *       : : : : : : : : :     |       with flags if the node pointed to
+ *       +-+-+-+-+-+-+-+-+     +       is next)
+ *   gtl | | | | | | | | |    /  MSB
+ *       +-+-+-+-+-+-+-+-+   /
+ * gtl+1                           (gtl = gotoLength)
+ * </pre>
+ */
+public final class ImmutableFST extends FST {
+  /**
+   * Default filler byte.
+   */
+  public final static byte DEFAULT_FILLER = '_';
+
+  /**
+   * Default annotation byte.
+   */
+  public final static byte DEFAULT_ANNOTATION = '+';
+
+  /**
+   * Automaton version as in the file header.
+   */
+  public static final byte VERSION = 5;
+
+  /**
+   * Bit indicating that an arc corresponds to the last character of a sequence
+   * available when building the automaton.
+   */
+  public static final int BIT_FINAL_ARC = 1 << 0;
+
+  /**
+   * Bit indicating that an arc is the last one of the node's list and the
+   * following one belongs to another node.
+   */
+  public static final int BIT_LAST_ARC = 1 << 1;
+
+  /**
+   * Bit indicating that the target node of this arc follows it in the
+   * compressed automaton structure (no goto field).
+   */
+  public static final int BIT_TARGET_NEXT = 1 << 2;
+
+  /**
+   * An offset in the arc structure, where the address and flags field begins.
+   * In version 5 of FST automata, this value is constant (1, skip label).
+   */
+  public final static int ADDRESS_OFFSET = 1;
+
+  private static final int PER_BUFFER_SIZE = 16;
+
+  /**
+   * An array of bytes with the internal representation of the automaton. Please
+   * see the documentation of this class for more information on how this
+   * structure is organized.
+   */
+  public final OffHeapMutableBytesStore _mutableBytesStore;
+  /**
+   * The length of the node header structure (if the automaton was compiled with
+   * <code>NUMBERS</code> option). Otherwise zero.
+   */
+  public final int _nodeDataLength;
+  /**
+   * Number of bytes each address takes in full, expanded form (goto length).
+   */
+  public final int _gotoLength;
+  /** Filler character. */
+  public final byte _filler;
+  /** Annotation character. */
+  public final byte _annotation;
+  public Map<Integer, Integer> _outputSymbols;
+  /**
+   * Flags for this automaton version.
+   */
+  private Set<FSTFlags> _flags;
+
+  /**
+   * Read and wrap a binary automaton in FST version 5.
+   */
+  ImmutableFST(InputStream stream, boolean hasOutputSymbols, PinotDataBufferMemoryManager memoryManager)
+      throws IOException {
+    DataInputStream in = new DataInputStream(stream);
+
+    this._filler = in.readByte();
+    this._annotation = in.readByte();
+    final byte hgtl = in.readByte();
+
+    _mutableBytesStore = new OffHeapMutableBytesStore(memoryManager, "ImmutableFST");
+
+    /*
+     * Determine if the automaton was compiled with NUMBERS. If so, modify
+     * ctl and goto fields accordingly.
+     */
+    _flags = EnumSet.of(FSTFlags.FLEXIBLE, FSTFlags.STOPBIT, FSTFlags.NEXTBIT);
+    if ((hgtl & 0xf0) != 0) {
+      _flags.add(FSTFlags.NUMBERS);
+    }
+
+    _flags = Collections.unmodifiableSet(_flags);
+
+    this._nodeDataLength = (hgtl >>> 4) & 0x0f;
+    this._gotoLength = hgtl & 0x0f;
+
+    if (hasOutputSymbols) {
+      final int outputSymbolsLength = in.readInt();
+      byte[] outputSymbolsBuffer = readRemaining(in, outputSymbolsLength);
+
+      if (outputSymbolsBuffer.length > 0) {
+        String outputSymbolsSerialized = new String(outputSymbolsBuffer);
+
+        _outputSymbols = buildMap(outputSymbolsSerialized);
+      }
+    }
+
+    readRemaining(in);
+  }
+
+  protected final void readRemaining(InputStream in)
+      throws IOException {
+    byte[] buffer = new byte[PER_BUFFER_SIZE];
+    while ((in.read(buffer)) >= 0) {
+      _mutableBytesStore.add(buffer);
+    }
+  }
+
+  /**
+   * Returns the start node of this automaton.
+   */
+  @Override
+  public int getRootNode() {
+    // Skip dummy node marking terminating state.
+    final int epsilonNode = skipArc(getFirstArc(0));
+
+    // And follow the epsilon node's first (and only) arc.
+    return getDestinationNodeOffset(getFirstArc(epsilonNode));
+  }
+
+  /**
+   * {@inheritDoc}
+   */
+  @Override
+  public final int getFirstArc(int node) {
+    return _nodeDataLength + node;
+  }
+
+  /**
+   * {@inheritDoc}
+   */
+  @Override
+  public final int getNextArc(int arc) {
+    if (isArcLast(arc)) {
+      return 0;
+    } else {
+      return skipArc(arc);
+    }
+  }
+
+  /**
+   * {@inheritDoc}
+   */
+  @Override
+  public int getArc(int node, byte label) {
+    for (int arc = getFirstArc(node); arc != 0; arc = getNextArc(arc)) {
+      if (getArcLabel(arc) == label) {
+        return arc;
+      }
+    }
+
+    // An arc labeled with "label" not found.
+    return 0;
+  }
+
+  /**
+   * {@inheritDoc}
+   */
+  @Override
+  public int getEndNode(int arc) {
+    final int nodeOffset = getDestinationNodeOffset(arc);
+    return nodeOffset;
+  }
+
+  /**
+   * {@inheritDoc}
+   */
+  @Override
+  public byte getArcLabel(int arc) {
+    return getByte(arc, 0);
+  }
+
+  @Override
+  public int getOutputSymbol(int arc) {
+    return _outputSymbols.get(arc);
+  }
+
+  /**
+   * {@inheritDoc}
+   */
+  @Override
+  public boolean isArcFinal(int arc) {
+    return (getByte(arc, ADDRESS_OFFSET) & BIT_FINAL_ARC) != 0;
+  }
+
+  /**
+   * {@inheritDoc}
+   */
+  @Override
+  public boolean isArcTerminal(int arc) {
+    return (0 == getDestinationNodeOffset(arc));
+  }
+
+  /**
+   * Returns the number encoded at the given node. The number equals the count
+   * of the set of suffixes reachable from <code>node</code> (called its right
+   * language).
+   */
+  @Override
+  public int getRightLanguageCount(int node) {
+    assert getFlags().contains(FSTFlags.NUMBERS) : "This FST was not compiled with NUMBERS.";
+    return decodeFromBytes(node, _nodeDataLength);
+  }
+
+  /**
+   * {@inheritDoc}
+   *
+   * <p>
+   * For this automaton version, an additional {@link FSTFlags#NUMBERS} flag may
+   * be set to indicate the automaton contains extra fields for each node.
+   * </p>
+   */
+  @Override
+  public Set<FSTFlags> getFlags() {
+    return _flags;
+  }
+
+  /**
+   * Returns <code>true</code> if this arc has <code>NEXT</code> bit set.
+   *
+   * @see #BIT_LAST_ARC
+   * @param arc The node's arc identifier.
+   * @return Returns true if the argument is the last arc of a node.
+   */
+  public boolean isArcLast(int arc) {
+    return (getByte(arc, ADDRESS_OFFSET) & BIT_LAST_ARC) != 0;
+  }
+
+  /**
+   * @see #BIT_TARGET_NEXT
+   * @param arc The node's arc identifier.
+   * @return Returns true if {@link #BIT_TARGET_NEXT} is set for this arc.
+   */
+  public boolean isNextSet(int arc) {
+
+    return (getByte(arc, ADDRESS_OFFSET) & BIT_TARGET_NEXT) != 0;
+  }
+
+  /**
+   * Returns an n-byte integer encoded in byte-packed representation.
+   */
+  final int decodeFromBytes(final int start, final int n) {
+    int r = 0;
+
+    for (int i = n; --i >= 0; ) {
+      Pair<Integer, Integer> offheapOffsets = getOffheapOffsets(start + i);
+      byte[] inputData = _mutableBytesStore.get(offheapOffsets.getFirst());
+
+      r = r << 8 | (inputData[offheapOffsets.getSecond()] & 0xff);
+    }
+    return r;
+  }
+
+  /**
+   * Returns the address of the node pointed to by this arc.
+   */
+  final int getDestinationNodeOffset(int arc) {
+    if (isNextSet(arc)) {
+      /* The destination node follows this arc in the array. */
+      return skipArc(arc);
+    } else {
+      /*
+       * The destination node address has to be extracted from the arc's
+       * goto field.
+       */
+      return decodeFromBytes(arc + ADDRESS_OFFSET, _gotoLength) >>> 3;
+    }
+  }
+
+  /**
+   * Read the arc's layout and skip as many bytes, as needed.
+   */
+  private int skipArc(int offset) {
+    return offset + (isNextSet(offset) ? 1 + 1   /* label + flags */ : 1 + _gotoLength /* label + flags/address */);
+  }
+
+  private byte getByte(int seek, int offset) {
+    Pair<Integer, Integer> offheapOffsets = getOffheapOffsets(seek);
+
+    int fooArc = offheapOffsets.getFirst();
+    byte[] retVal = _mutableBytesStore.get((fooArc));
+
+    int barArc = offheapOffsets.getSecond();
+    int target = barArc + offset;
+
+    if (target >= PER_BUFFER_SIZE) {
+      retVal = _mutableBytesStore.get(fooArc + 1);
+      target = target - PER_BUFFER_SIZE;
+    }
+
+    return retVal[target];
+  }
+
+  private Pair<Integer, Integer> getOffheapOffsets(int seek) {
+    int fooArc = seek >= PER_BUFFER_SIZE ? seek / PER_BUFFER_SIZE : 0;
+    int barArc = seek >= PER_BUFFER_SIZE ? seek - ((fooArc) * PER_BUFFER_SIZE) : seek;
+
+    assert fooArc < _mutableBytesStore.getNumValues();
+    assert barArc < PER_BUFFER_SIZE;
+
+    return new Pair<>(fooArc, barArc);
+  }

Review comment:
       Got it, thanks for the explanation. I was under the impression that we were trying to avoid storing the value of the offsets in actual integers (to avoid boxing). Fixed, thanks




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] atris commented on pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
atris commented on pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#issuecomment-917119089


   It is already encapsulated within a package in Pinot-segment and has
   interfaces for interaction at each layer - please refer to the code
   
   On Fri, 10 Sep 2021, 23:35 Amrish Lal, ***@***.***> wrote:
   
   > I would suggest creating a new pinot-fst module for the new FST
   > implementation.
   >
   > Not sure about this.. we don't create modules at root level for other
   > indexing. Need to think carefully before creating modules at the root.. at
   > some point, ideally, we should create plugin mechanisms for indexes and
   > then create a module for each index.. we are not there yet
   >
   > It doesn't have to be a top-level module, but basically, I was looking for
   > some way to sufficiently encapsulate this FST implementation using
   > interfaces (in a separate jar if possible) and then use it within Pinot (?).
   >
   > For e.g., in this PR, ImmutableFST uses the off heap bytes store in
   > pinot-segment-local, thus creating a cyclic dependency.
   > Sounds like we need interfaces?
   >
   > —
   > You are receiving this because you were mentioned.
   > Reply to this email directly, view it on GitHub
   > <https://github.com/apache/pinot/pull/7405#issuecomment-917104588>, or
   > unsubscribe
   > <https://github.com/notifications/unsubscribe-auth/AANE5Y2H7WNUCTP3VX5OJIDUBJCE3ANCNFSM5DTCLYZQ>
   > .
   >
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] kishoreg commented on pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
kishoreg commented on pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#issuecomment-916651307


   > I would suggest creating a new pinot-fst module for the new FST implementation.
   
   Not sure about this.. we don't create modules at root level for other indexing. Need to think carefully before creating modules at the root.. at some point, ideally, we should create plugin mechanisms for indexes and then create a module for each index.. we are not there yet


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] atris commented on pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
atris commented on pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#issuecomment-916701211


   > > I would suggest creating a new pinot-fst module for the new FST implementation.
   > 
   > Not sure about this.. we don't create modules at root level for other indexing. Need to think carefully before creating modules at the root.. at some point, ideally, we should create plugin mechanisms for indexes and then create a module for each index.. we are not there yet
   
   Agreed. To add to Kishore's point, introducing new modules at root level increases the scope of jar conflicts. For e.g., in this PR, ImmutableFST uses the off heap bytes store in pinot-segment-local, thus creating a cyclic dependency. Until we do not have the plugin model, I would prefer this model.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] richardstartin commented on a change in pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
richardstartin commented on a change in pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#discussion_r710162848



##########
File path: pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/nativefst/utils/RegexpMatcher.java
##########
@@ -0,0 +1,170 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.utils.nativefst.utils;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Set;
+import org.apache.pinot.segment.local.utils.nativefst.FST;
+import org.apache.pinot.segment.local.utils.nativefst.automaton.Automaton;
+import org.apache.pinot.segment.local.utils.nativefst.automaton.CharacterRunAutomaton;
+import org.apache.pinot.segment.local.utils.nativefst.automaton.RegExp;
+import org.apache.pinot.segment.local.utils.nativefst.automaton.State;
+import org.apache.pinot.segment.local.utils.nativefst.automaton.Transition;
+
+
+/**
+ * RegexpMatcher is a helper to retrieve matching values for a given regexp query.
+ * Regexp query is converted into an automaton and we run the matching algorithm on FST.
+ *
+ * Two main functions of this class are
+ *   regexMatchOnFST() Function runs matching on FST (See function comments for more details)
+ *   match(input) Function builds the automaton and matches given input.
+ */
+public class RegexpMatcher {
+  private final String _regexQuery;
+  private final FST _fst;
+  private final Automaton _automaton;
+
+  public RegexpMatcher(String regexQuery, FST fst) {
+    _regexQuery = regexQuery;
+    _fst = fst;
+
+    _automaton = new RegExp(_regexQuery).toAutomaton();
+  }
+
+  public static List<Long> regexMatch(String regexQuery, FST fst) {

Review comment:
       I believe `regexMatchOnFST` could be implemented as an iterator, yielding whenever `endNodes` is added to. But my main concern is the boxing, not the eagerness.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] atris commented on a change in pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
atris commented on a change in pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#discussion_r710134949



##########
File path: pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/nativefst/ConstantArcSizeFST.java
##########
@@ -0,0 +1,159 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.utils.nativefst;
+
+import java.util.Collections;
+import java.util.Map;
+import java.util.Set;
+import org.apache.pinot.segment.local.utils.nativefst.builders.FSTBuilder;
+
+
+/**
+ * A FST with constant-size arc representation produced directly by
+ * {@link FSTBuilder}.
+ *
+ * @see FSTBuilder
+ */
+public final class ConstantArcSizeFST extends FST {
+  /** Size of the target address field (constant for the builder). */
+  public final static int TARGET_ADDRESS_SIZE = 4;
+
+  /** Size of the flags field (constant for the builder). */
+  public final static int FLAGS_SIZE = 1;
+
+  /** Size of the label field (constant for the builder). */
+  public final static int LABEL_SIZE = 1;
+
+  /**
+   * Size of a single arc structure.
+   */
+  public final static int ARC_SIZE = FLAGS_SIZE + LABEL_SIZE + TARGET_ADDRESS_SIZE;
+
+  /** Offset of the flags field inside an arc. */
+  public final static int FLAGS_OFFSET = 0;
+
+  /** Offset of the label field inside an arc. */
+  public final static int LABEL_OFFSET = FLAGS_SIZE;
+
+  /** Offset of the address field inside an arc. */
+  public final static int ADDRESS_OFFSET = LABEL_OFFSET + LABEL_SIZE;
+  /**
+   * An arc flag indicating the target node of an arc corresponds to a final
+   * state.
+   */
+  public final static int BIT_ARC_FINAL = 1 << 1;
+  /** An arc flag indicating the arc is last within its state. */
+  public final static int BIT_ARC_LAST = 1 << 0;
+  /** A dummy address of the terminal state. */
+  public final static int TERMINAL_STATE = 0;
+  /**
+   * An epsilon state. The first and only arc of this state points either to the
+   * root or to the terminal state, indicating an empty automaton.
+   */
+  private final int _epsilon;
+
+  /**
+   * FST data, serialized as a byte array.
+   */
+  private final byte[] _data;
+
+  private Map<Integer, Integer> _outputSymbols;
+
+  /**
+   * @param data
+   *          FST data. There must be no trailing bytes after the last state.
+   */
+  public ConstantArcSizeFST(byte[] data, int epsilon, Map<Integer, Integer> outputSymbols) {
+    assert epsilon == 0 : "Epsilon is not zero?";
+
+    this._epsilon = epsilon;
+    this._data = data;
+    this._outputSymbols = outputSymbols;
+  }
+
+  @Override
+  public int getRootNode() {
+    return getEndNode(getFirstArc(_epsilon));
+  }
+
+  @Override
+  public int getFirstArc(int node) {
+    return node;
+  }
+
+  @Override
+  public int getArc(int node, byte label) {
+    for (int arc = getFirstArc(node); arc != 0; arc = getNextArc(arc)) {
+      if (getArcLabel(arc) == label) {
+        return arc;
+      }
+    }
+    return 0;
+  }
+
+  @Override
+  public int getNextArc(int arc) {
+    if (isArcLast(arc)) {
+      return 0;
+    }
+    return arc + ARC_SIZE;
+  }
+
+  @Override
+  public byte getArcLabel(int arc) {
+    return _data[arc + LABEL_OFFSET];
+  }
+
+  @Override
+  public int getOutputSymbol(int arc) {
+    return _outputSymbols.get(arc);

Review comment:
       Actually, I did that intentionally. If the arc is not present, then it is a bug.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] richardstartin commented on a change in pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
richardstartin commented on a change in pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#discussion_r716500969



##########
File path: pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/nativefst/ImmutableFST.java
##########
@@ -0,0 +1,406 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.utils.nativefst;
+
+import java.io.DataInputStream;
+import java.io.IOException;
+import java.io.InputStream;
+import java.util.Collections;
+import java.util.EnumSet;
+import java.util.Map;
+import java.util.Set;
+import org.apache.pinot.segment.local.io.readerwriter.PinotDataBufferMemoryManager;
+import org.apache.pinot.segment.local.realtime.impl.dictionary.OffHeapMutableBytesStore;
+import org.apache.pinot.spi.utils.Pair;
+
+
+/**
+ * FST binary format implementation
+ *
+ * <p>
+ * This version indicates the dictionary was built with these flags:
+ * {@link FSTFlags#FLEXIBLE}, {@link FSTFlags#STOPBIT} and
+ * {@link FSTFlags#NEXTBIT}. The internal representation of the FST must
+ * therefore follow this description (please note this format describes only a
+ * single transition (arc), not the entire dictionary file).
+ *
+ * <pre>
+ * ---- this node header present only if automaton was compiled with NUMBERS option.
+ * Byte
+ *        +-+-+-+-+-+-+-+-+\
+ *      0 | | | | | | | | | \  LSB
+ *        +-+-+-+-+-+-+-+-+  +
+ *      1 | | | | | | | | |  |      number of strings recognized
+ *        +-+-+-+-+-+-+-+-+  +----- by the automaton starting
+ *        : : : : : : : : :  |      from this node.
+ *        +-+-+-+-+-+-+-+-+  +
+ *  ctl-1 | | | | | | | | | /  MSB
+ *        +-+-+-+-+-+-+-+-+/
+ *
+ * ---- remaining part of the node
+ * Length of output symbols dictionary -- Integer
+ * <Arc ID, Output Symbol>
+ * <Arc ID, Output Symbol>
+ * <Arc ID, Output Symbol>
+ * .
+ * .
+ * .
+ * <Arc ID, Output Symbol> (Length)
+ *
+ * Byte
+ *       +-+-+-+-+-+-+-+-+\
+ *     0 | | | | | | | | | +------ label
+ *       +-+-+-+-+-+-+-+-+/
+ *
+ *                  +------------- node pointed to is next
+ *                  | +----------- the last arc of the node
+ *                  | | +--------- the arc is final
+ *                  | | |
+ *             +-----------+
+ *             |    | | |  |
+ *         ___+___  | | |  |
+ *        /       \ | | |  |
+ *       MSB           LSB |
+ *        7 6 5 4 3 2 1 0  |
+ *       +-+-+-+-+-+-+-+-+ |
+ *     1 | | | | | | | | | \ \
+ *       +-+-+-+-+-+-+-+-+  \ \  LSB
+ *       +-+-+-+-+-+-+-+-+     +
+ *     2 | | | | | | | | |     |
+ *       +-+-+-+-+-+-+-+-+     |
+ *     3 | | | | | | | | |     +----- target node address (in bytes)
+ *       +-+-+-+-+-+-+-+-+     |      (not present except for the byte
+ *       : : : : : : : : :     |       with flags if the node pointed to
+ *       +-+-+-+-+-+-+-+-+     +       is next)
+ *   gtl | | | | | | | | |    /  MSB
+ *       +-+-+-+-+-+-+-+-+   /
+ * gtl+1                           (gtl = gotoLength)
+ * </pre>
+ */
+public final class ImmutableFST extends FST {
+  /**
+   * Default filler byte.
+   */
+  public final static byte DEFAULT_FILLER = '_';
+
+  /**
+   * Default annotation byte.
+   */
+  public final static byte DEFAULT_ANNOTATION = '+';
+
+  /**
+   * Automaton version as in the file header.
+   */
+  public static final byte VERSION = 5;
+
+  /**
+   * Bit indicating that an arc corresponds to the last character of a sequence
+   * available when building the automaton.
+   */
+  public static final int BIT_FINAL_ARC = 1 << 0;
+
+  /**
+   * Bit indicating that an arc is the last one of the node's list and the
+   * following one belongs to another node.
+   */
+  public static final int BIT_LAST_ARC = 1 << 1;
+
+  /**
+   * Bit indicating that the target node of this arc follows it in the
+   * compressed automaton structure (no goto field).
+   */
+  public static final int BIT_TARGET_NEXT = 1 << 2;
+
+  /**
+   * An offset in the arc structure, where the address and flags field begins.
+   * In version 5 of FST automata, this value is constant (1, skip label).
+   */
+  public final static int ADDRESS_OFFSET = 1;
+
+  private static final int PER_BUFFER_SIZE = 16;
+
+  /**
+   * An array of bytes with the internal representation of the automaton. Please
+   * see the documentation of this class for more information on how this
+   * structure is organized.
+   */
+  public final OffHeapMutableBytesStore _mutableBytesStore;
+  /**
+   * The length of the node header structure (if the automaton was compiled with
+   * <code>NUMBERS</code> option). Otherwise zero.
+   */
+  public final int _nodeDataLength;
+  /**
+   * Number of bytes each address takes in full, expanded form (goto length).
+   */
+  public final int _gotoLength;
+  /** Filler character. */
+  public final byte _filler;
+  /** Annotation character. */
+  public final byte _annotation;
+  public Map<Integer, Integer> _outputSymbols;
+  /**
+   * Flags for this automaton version.
+   */
+  private Set<FSTFlags> _flags;
+
+  /**
+   * Read and wrap a binary automaton in FST version 5.
+   */
+  ImmutableFST(InputStream stream, boolean hasOutputSymbols, PinotDataBufferMemoryManager memoryManager)
+      throws IOException {
+    DataInputStream in = new DataInputStream(stream);
+
+    this._filler = in.readByte();
+    this._annotation = in.readByte();
+    final byte hgtl = in.readByte();
+
+    _mutableBytesStore = new OffHeapMutableBytesStore(memoryManager, "ImmutableFST");
+
+    /*
+     * Determine if the automaton was compiled with NUMBERS. If so, modify
+     * ctl and goto fields accordingly.
+     */
+    _flags = EnumSet.of(FSTFlags.FLEXIBLE, FSTFlags.STOPBIT, FSTFlags.NEXTBIT);
+    if ((hgtl & 0xf0) != 0) {
+      _flags.add(FSTFlags.NUMBERS);
+    }
+
+    _flags = Collections.unmodifiableSet(_flags);
+
+    this._nodeDataLength = (hgtl >>> 4) & 0x0f;
+    this._gotoLength = hgtl & 0x0f;
+
+    if (hasOutputSymbols) {
+      final int outputSymbolsLength = in.readInt();
+      byte[] outputSymbolsBuffer = readRemaining(in, outputSymbolsLength);
+
+      if (outputSymbolsBuffer.length > 0) {
+        String outputSymbolsSerialized = new String(outputSymbolsBuffer);
+
+        _outputSymbols = buildMap(outputSymbolsSerialized);
+      }
+    }
+
+    readRemaining(in);
+  }
+
+  protected final void readRemaining(InputStream in)
+      throws IOException {
+    byte[] buffer = new byte[PER_BUFFER_SIZE];
+    while ((in.read(buffer)) >= 0) {
+      _mutableBytesStore.add(buffer);
+    }
+  }
+
+  /**
+   * Returns the start node of this automaton.
+   */
+  @Override
+  public int getRootNode() {
+    // Skip dummy node marking terminating state.
+    final int epsilonNode = skipArc(getFirstArc(0));
+
+    // And follow the epsilon node's first (and only) arc.
+    return getDestinationNodeOffset(getFirstArc(epsilonNode));
+  }
+
+  /**
+   * {@inheritDoc}
+   */
+  @Override
+  public final int getFirstArc(int node) {
+    return _nodeDataLength + node;
+  }
+
+  /**
+   * {@inheritDoc}
+   */
+  @Override
+  public final int getNextArc(int arc) {
+    if (isArcLast(arc)) {
+      return 0;
+    } else {
+      return skipArc(arc);
+    }
+  }
+
+  /**
+   * {@inheritDoc}
+   */
+  @Override
+  public int getArc(int node, byte label) {
+    for (int arc = getFirstArc(node); arc != 0; arc = getNextArc(arc)) {
+      if (getArcLabel(arc) == label) {
+        return arc;
+      }
+    }
+
+    // An arc labeled with "label" not found.
+    return 0;
+  }
+
+  /**
+   * {@inheritDoc}
+   */
+  @Override
+  public int getEndNode(int arc) {
+    final int nodeOffset = getDestinationNodeOffset(arc);
+    return nodeOffset;
+  }
+
+  /**
+   * {@inheritDoc}
+   */
+  @Override
+  public byte getArcLabel(int arc) {
+    return getByte(arc, 0);
+  }
+
+  @Override
+  public int getOutputSymbol(int arc) {
+    return _outputSymbols.get(arc);
+  }
+
+  /**
+   * {@inheritDoc}
+   */
+  @Override
+  public boolean isArcFinal(int arc) {
+    return (getByte(arc, ADDRESS_OFFSET) & BIT_FINAL_ARC) != 0;
+  }
+
+  /**
+   * {@inheritDoc}
+   */
+  @Override
+  public boolean isArcTerminal(int arc) {
+    return (0 == getDestinationNodeOffset(arc));
+  }
+
+  /**
+   * Returns the number encoded at the given node. The number equals the count
+   * of the set of suffixes reachable from <code>node</code> (called its right
+   * language).
+   */
+  @Override
+  public int getRightLanguageCount(int node) {
+    assert getFlags().contains(FSTFlags.NUMBERS) : "This FST was not compiled with NUMBERS.";
+    return decodeFromBytes(node, _nodeDataLength);
+  }
+
+  /**
+   * {@inheritDoc}
+   *
+   * <p>
+   * For this automaton version, an additional {@link FSTFlags#NUMBERS} flag may
+   * be set to indicate the automaton contains extra fields for each node.
+   * </p>
+   */
+  @Override
+  public Set<FSTFlags> getFlags() {
+    return _flags;
+  }
+
+  /**
+   * Returns <code>true</code> if this arc has <code>NEXT</code> bit set.
+   *
+   * @see #BIT_LAST_ARC
+   * @param arc The node's arc identifier.
+   * @return Returns true if the argument is the last arc of a node.
+   */
+  public boolean isArcLast(int arc) {
+    return (getByte(arc, ADDRESS_OFFSET) & BIT_LAST_ARC) != 0;
+  }
+
+  /**
+   * @see #BIT_TARGET_NEXT
+   * @param arc The node's arc identifier.
+   * @return Returns true if {@link #BIT_TARGET_NEXT} is set for this arc.
+   */
+  public boolean isNextSet(int arc) {
+
+    return (getByte(arc, ADDRESS_OFFSET) & BIT_TARGET_NEXT) != 0;
+  }
+
+  /**
+   * Returns an n-byte integer encoded in byte-packed representation.
+   */
+  final int decodeFromBytes(final int start, final int n) {
+    int r = 0;
+
+    for (int i = n; --i >= 0; ) {
+      Pair<Integer, Integer> offheapOffsets = getOffheapOffsets(start + i);
+      byte[] inputData = _mutableBytesStore.get(offheapOffsets.getFirst());
+
+      r = r << 8 | (inputData[offheapOffsets.getSecond()] & 0xff);
+    }
+    return r;
+  }
+
+  /**
+   * Returns the address of the node pointed to by this arc.
+   */
+  final int getDestinationNodeOffset(int arc) {
+    if (isNextSet(arc)) {
+      /* The destination node follows this arc in the array. */
+      return skipArc(arc);
+    } else {
+      /*
+       * The destination node address has to be extracted from the arc's
+       * goto field.
+       */
+      return decodeFromBytes(arc + ADDRESS_OFFSET, _gotoLength) >>> 3;
+    }
+  }
+
+  /**
+   * Read the arc's layout and skip as many bytes, as needed.
+   */
+  private int skipArc(int offset) {
+    return offset + (isNextSet(offset) ? 1 + 1   /* label + flags */ : 1 + _gotoLength /* label + flags/address */);
+  }
+
+  private byte getByte(int seek, int offset) {
+    Pair<Integer, Integer> offheapOffsets = getOffheapOffsets(seek);
+
+    int fooArc = offheapOffsets.getFirst();
+    byte[] retVal = _mutableBytesStore.get((fooArc));
+
+    int barArc = offheapOffsets.getSecond();
+    int target = barArc + offset;
+
+    if (target >= PER_BUFFER_SIZE) {
+      retVal = _mutableBytesStore.get(fooArc + 1);
+      target = target - PER_BUFFER_SIZE;
+    }
+
+    return retVal[target];
+  }
+
+  private Pair<Integer, Integer> getOffheapOffsets(int seek) {
+    int fooArc = seek >= PER_BUFFER_SIZE ? seek / PER_BUFFER_SIZE : 0;
+    int barArc = seek >= PER_BUFFER_SIZE ? seek - ((fooArc) * PER_BUFFER_SIZE) : seek;
+
+    assert fooArc < _mutableBytesStore.getNumValues();
+    assert barArc < PER_BUFFER_SIZE;
+
+    return new Pair<>(fooArc, barArc);
+  }

Review comment:
       What I mean is copy and paste the lines 
   ```java
       int fooArc = seek >= PER_BUFFER_SIZE ? seek / PER_BUFFER_SIZE : 0;
       int barArc = seek >= PER_BUFFER_SIZE ? seek - ((fooArc) * PER_BUFFER_SIZE) : seek;
   ```
   
   to where you need them (2 places) instead of factoring this in to a method, at the cost of boxing `fooArc` and `barArc` and wrapping them in a `Pair` so they can be returned.
   
   So, for example, in `decodeFromBytes` you would write
   
   ```java
     final int decodeFromBytes(final int start, final int n) {
        int r = 0;
   
        for (int i = n; --i >= 0; ) {
          int seek = start + i;
          int fooArc = seek >= PER_BUFFER_SIZE ? seek / PER_BUFFER_SIZE : 0;
          int barArc = seek >= PER_BUFFER_SIZE ? seek - ((fooArc) * PER_BUFFER_SIZE) : seek;
   
          byte[] inputData = _mutableBytesStore.get(fooArc);
   
          r = r << 8 | (inputData[barArc] & 0xff);
        }
        return r;
      }
   ```
   
   And similar in the other place this method is called.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] richardstartin commented on a change in pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
richardstartin commented on a change in pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#discussion_r716517816



##########
File path: pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/nativefst/builders/FSTBuilder.java
##########
@@ -0,0 +1,565 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.utils.nativefst.builders;
+
+import java.util.Arrays;
+import java.util.Comparator;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.SortedMap;
+import java.util.TreeMap;
+import org.apache.pinot.segment.local.utils.nativefst.ConstantArcSizeFST;
+import org.apache.pinot.segment.local.utils.nativefst.FST;
+
+
+/**
+ * Fast, memory-conservative finite state transducer builder, returning an
+ * in-memory {@link FST} that is a tradeoff between construction speed and
+ * memory consumption. Use serializers to compress the returned automaton into
+ * more compact form.
+ *
+ * @see FSTSerializer
+ */
+public final class FSTBuilder {
+  /**
+   * A comparator comparing full byte arrays. Unsigned byte comparisons ('C'-locale).
+   */
+  public static final Comparator<byte[]> LEXICAL_ORDERING = new Comparator<byte[]>() {
+    public int compare(byte[] o1, byte[] o2) {
+      return FSTBuilder.compare(o1, 0, o1.length, o2, 0, o2.length);
+    }
+  };
+  /** A megabyte. */
+  private final static int MB = 1024 * 1024;
+
+  /**
+   * Internal serialized FST buffer expand ratio.
+   */
+  private final static int BUFFER_GROWTH_SIZE = 5 * MB;
+
+  /**
+   * Maximum number of labels from a single state.
+   */
+  private final static int MAX_LABELS = 256;
+  /**
+   * Internal serialized FST buffer expand ratio.
+   */
+  private final int _bufferGrowthSize;
+  private byte[] _serialized = new byte[0];
+  private Map<Integer, Integer> _outputSymbols = new HashMap<>();
+
+  /**
+   * Number of bytes already taken in {@link #_serialized}. Start from 1 to keep
+   * 0 a sentinel value (for the hash set and final state).
+   */
+  private int _size;
+  /**
+   * States on the "active path" (still mutable). Values are addresses of each
+   * state's first arc.
+   */
+  private int[] _activePath = new int[0];
+  /**
+   * Current length of the active path.
+   */
+  private int _activePathLen;
+  /**
+   * The next offset at which an arc will be added to the given state on
+   * {@link #_activePath}.
+   */
+  private int[] _nextArcOffset = new int[0];
+  /**
+   * Root state. If negative, the automaton has been built already and cannot be
+   * extended.
+   */
+  private int _root;
+  /**
+   * An epsilon state. The first and only arc of this state points either to the
+   * root or to the terminal state, indicating an empty automaton.
+   */
+  private int _epsilon;
+  /**
+   * Hash set of state addresses in {@link #_serialized}, hashed by
+   * {@link #hash(int, int)}. Zero reserved for an unoccupied slot.
+   */
+  private int[] _hashSet = new int[2];
+  /**
+   * Number of entries currently stored in {@link #_hashSet}.
+   */
+  private int _hashSize = 0;
+  /**
+   * Previous sequence added to the automaton in {@link #add(byte[], int, int, int)}.
+   * Used in assertions only.
+   */
+  private byte[] _previous;
+  /**
+   * Information about the automaton and its compilation.
+   */
+  private TreeMap<InfoEntry, Object> _info;
+  /**
+   * {@link #_previous} sequence's length, used in assertions only.
+   */
+  private int _previousLength;
+  /** Number of serialization buffer reallocations. */
+  private int _serializationBufferReallocations;
+
+  /** */
+  public FSTBuilder() {
+    this(BUFFER_GROWTH_SIZE);
+  }
+
+  /**
+   * @param bufferGrowthSize Buffer growth size (in bytes) when constructing the automaton.
+   */
+  public FSTBuilder(int bufferGrowthSize) {
+    _bufferGrowthSize = Math.max(bufferGrowthSize, ConstantArcSizeFST.ARC_SIZE * MAX_LABELS);
+
+    // Allocate epsilon state.
+    _epsilon = allocateState(1);
+    _serialized[_epsilon + ConstantArcSizeFST.FLAGS_OFFSET] |= ConstantArcSizeFST.BIT_ARC_LAST;
+
+    // Allocate root, with an initial empty set of output arcs.
+    expandActivePath(1);
+    _root = _activePath[0];
+  }
+
+  public static FST buildFST(SortedMap<String, Integer> input) {
+
+    FSTBuilder fstbuilder = new FSTBuilder();
+
+    for (Map.Entry<String, Integer> entry : input.entrySet()) {
+      fstbuilder.add(entry.getKey().getBytes(), 0, entry.getKey().length(), entry.getValue().intValue());
+    }
+
+    return fstbuilder.complete();
+  }
+
+  /**
+   * Build a minimal, deterministic automaton from a sorted list of byte
+   * sequences.
+   *
+   * @param input Input sequences to build automaton from.
+   * @return Returns the automaton encoding all input sequences.
+   */
+  public static FST build(byte[][] input, int[] outputSymbols) {
+    final FSTBuilder builder = new FSTBuilder();
+
+    int i = 0;
+    for (byte[] chs : input) {
+      builder.add(chs, 0, chs.length, i < outputSymbols.length ? outputSymbols[i] : -1);
+      ++i;
+    }
+
+    return builder.complete();
+  }
+
+  /**
+   * Build a minimal, deterministic automaton from an iterable list of byte
+   * sequences.
+   *
+   * @param input Input sequences to build automaton from.
+   * @return Returns the automaton encoding all input sequences.
+   */
+  public static FST build(Iterable<byte[]> input, int[] outputSymbols) {
+    final FSTBuilder builder = new FSTBuilder();
+
+    int i = 0;
+
+    for (byte[] chs : input) {
+      builder.add(chs, 0, chs.length, i < outputSymbols.length ? outputSymbols[i] : -1);
+      ++i;
+    }
+
+    return builder.complete();
+  }
+
+  /**
+   * Lexicographic order of input sequences. By default, consistent with the "C"
+   * sort (absolute value of bytes, 0-255).
+   */
+  private static int compare(byte[] s1, int start1, int lens1, byte[] s2, int start2, int lens2) {
+    final int max = Math.min(lens1, lens2);
+
+    for (int i = 0; i < max; i++) {
+      final byte c1 = s1[start1++];
+      final byte c2 = s2[start2++];
+      if (c1 != c2) {
+        return (c1 & 0xff) - (c2 & 0xff);
+      }
+    }
+
+    return lens1 - lens2;
+  }

Review comment:
       In JDK9+ there is [Arrays.mismatch](https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/Arrays.html#mismatch(byte%5B%5D,byte%5B%5D)) which vectorises this operation (returns the index of the first mismatch so you can implement the comparison at that index). It would be handy to be able to exploit this sort of thing despite the existing JDK8 bias of the codebase. The RoaringBitmap library is Multi-Release, so some classes load better implementations on better platforms, and `ArraysShim.mismatch` will load the vectorised implementation on JDK9+ and the scalar version on JDK8.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] mayankshriv commented on pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
mayankshriv commented on pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#issuecomment-932387358


   Are there any remaining items to be taken care of in this PR @atris @siddharthteotia @richardstartin @Jackie-Jiang?
   IMHO, since this does not impact existing functionality, perhaps we can document the TODOs here and followup?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] siddharthteotia commented on pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
siddharthteotia commented on pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#issuecomment-916370107


   Please give some time for design doc review. I had commented on the issue a couple of days ago to request time for review. Not sure if this has already been reviewed and approved
   
   I would encourage to follow these guidelines since these were recently discussed and approved by PMC - https://docs.pinot.apache.org/developers/developers-and-contributors/contribution-guidelines#pinot-enhancement-proposal-workflow 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] mayankshriv commented on pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
mayankshriv commented on pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#issuecomment-917132482


   > I think he published the PR to help with the review of the design doc. As long as the PR does not get merged without the approval of the design, it should be ok.
   > 
   > @atris can you please update the PR description to indicate the same.
   
   Yes, I believe, the intention of the PR was primarily to help with design doc review. Because some aspects do require review beyond what a design doc is good at capturing. Let's get consensus on the design doc, followed by PR review.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] atris commented on a change in pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
atris commented on a change in pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#discussion_r710155892



##########
File path: pinot-segment-local/src/test/java/org/apache/pinot/segment/local/utils/nativefst/FSTTestUtils.java
##########
@@ -0,0 +1,129 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.utils.nativefst;
+
+import java.nio.ByteBuffer;
+import java.nio.charset.Charset;
+import java.util.Arrays;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Random;
+import org.apache.pinot.segment.local.utils.nativefst.builders.FSTBuilder;
+import org.apache.pinot.segment.local.utils.nativefst.utils.RegexpMatcher;
+
+import static java.nio.charset.StandardCharsets.UTF_8;
+import static org.testng.Assert.assertEquals;
+import static org.testng.FileAssert.fail;
+
+
+/**
+ * Test utils class
+ */
+class FSTTestUtils {
+
+  private FSTTestUtils() {
+  }
+
+  /*
+   * Generate a sorted list of random sequences.
+   */
+  public static byte[][] generateRandom(int count, MinMax length, MinMax alphabet) {
+    final byte[][] input = new byte[count][];
+    final Random rnd = new Random();
+    for (int i = 0; i < count; i++) {
+      input[i] = randomByteSequence(rnd, length, alphabet);
+    }
+    Arrays.sort(input, FSTBuilder.LEXICAL_ORDERING);
+    return input;
+  }
+
+  /**
+   * Generate a random string.
+   */
+  private static byte[] randomByteSequence(Random rnd, MinMax length, MinMax alphabet) {
+    byte[] bytes = new byte[length._min + rnd.nextInt(length.range())];
+    for (int i = 0; i < bytes.length; i++) {
+      bytes[i] = (byte) (alphabet._min + rnd.nextInt(alphabet.range()));
+    }
+    return bytes;

Review comment:
       This is not used for benchmarking -- strictly for functional tests.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] siddharthteotia edited a comment on pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
siddharthteotia edited a comment on pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#issuecomment-921511027






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] codecov-commenter edited a comment on pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
codecov-commenter edited a comment on pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#issuecomment-917372847


   # [Codecov](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#7405](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (b22c014) into [master](https://codecov.io/gh/apache/pinot/commit/fe14d60a5fd3405cc695f9a3a9d1df73a0dbbea3?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (fe14d60) will **decrease** coverage by `2.25%`.
   > The diff coverage is `44.99%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/pinot/pull/7405/graphs/tree.svg?width=650&height=150&src=pr&token=4ibza2ugkz&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #7405      +/-   ##
   ============================================
   - Coverage     71.91%   69.65%   -2.26%     
   - Complexity     3348     3817     +469     
   ============================================
     Files          1517     1547      +30     
     Lines         75039    78124    +3085     
     Branches      10921    11552     +631     
   ============================================
   + Hits          53961    54415     +454     
   - Misses        17451    19963    +2512     
   - Partials       3627     3746     +119     
   ```
   
   | Flag | Coverage Δ | |
   |---|---|---|
   | integration1 | `?` | |
   | integration2 | `28.00% <0.00%> (-1.11%)` | :arrow_down: |
   | unittests1 | `68.35% <44.99%> (-1.35%)` | :arrow_down: |
   | unittests2 | `13.94% <0.00%> (-0.59%)` | :arrow_down: |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [...t/local/utils/nativefst/NativeFSTIndexCreator.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvTmF0aXZlRlNUSW5kZXhDcmVhdG9yLmphdmE=) | `0.00% <0.00%> (ø)` | |
   | [...nt/local/utils/nativefst/NativeFSTIndexReader.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvTmF0aXZlRlNUSW5kZXhSZWFkZXIuamF2YQ==) | `0.00% <0.00%> (ø)` | |
   | [...al/utils/nativefst/automaton/AutomatonMatcher.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL0F1dG9tYXRvbk1hdGNoZXIuamF2YQ==) | `0.00% <0.00%> (ø)` | |
   | [...ils/nativefst/automaton/CharacterRunAutomaton.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL0NoYXJhY3RlclJ1bkF1dG9tYXRvbi5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [.../local/utils/nativefst/automaton/RunAutomaton.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL1J1bkF1dG9tYXRvbi5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [...l/utils/nativefst/automaton/SpecialOperations.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL1NwZWNpYWxPcGVyYXRpb25zLmphdmE=) | `0.00% <0.00%> (ø)` | |
   | [...ent/local/utils/nativefst/automaton/StatePair.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL1N0YXRlUGFpci5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [...ils/nativefst/automaton/StringUnionOperations.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL1N0cmluZ1VuaW9uT3BlcmF0aW9ucy5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [...gment/local/utils/nativefst/builders/FSTUtils.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYnVpbGRlcnMvRlNUVXRpbHMuamF2YQ==) | `14.28% <14.28%> (ø)` | |
   | [...local/utils/nativefst/automaton/BasicAutomata.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL0Jhc2ljQXV0b21hdGEuamF2YQ==) | `20.00% <20.00%> (ø)` | |
   | ... and [157 more](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [fe14d60...b22c014](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] codecov-commenter edited a comment on pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
codecov-commenter edited a comment on pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#issuecomment-917372847


   # [Codecov](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#7405](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (b22c014) into [master](https://codecov.io/gh/apache/pinot/commit/fe14d60a5fd3405cc695f9a3a9d1df73a0dbbea3?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (fe14d60) will **decrease** coverage by `1.11%`.
   > The diff coverage is `44.99%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/pinot/pull/7405/graphs/tree.svg?width=650&height=150&src=pr&token=4ibza2ugkz&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #7405      +/-   ##
   ============================================
   - Coverage     71.91%   70.79%   -1.12%     
   - Complexity     3348     3817     +469     
   ============================================
     Files          1517     1547      +30     
     Lines         75039    78124    +3085     
     Branches      10921    11552     +631     
   ============================================
   + Hits          53961    55310    +1349     
   - Misses        17451    19068    +1617     
   - Partials       3627     3746     +119     
   ```
   
   | Flag | Coverage Δ | |
   |---|---|---|
   | integration1 | `29.12% <0.00%> (-1.51%)` | :arrow_down: |
   | integration2 | `28.00% <0.00%> (-1.11%)` | :arrow_down: |
   | unittests1 | `68.35% <44.99%> (-1.35%)` | :arrow_down: |
   | unittests2 | `13.94% <0.00%> (-0.59%)` | :arrow_down: |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [...t/local/utils/nativefst/NativeFSTIndexCreator.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvTmF0aXZlRlNUSW5kZXhDcmVhdG9yLmphdmE=) | `0.00% <0.00%> (ø)` | |
   | [...nt/local/utils/nativefst/NativeFSTIndexReader.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvTmF0aXZlRlNUSW5kZXhSZWFkZXIuamF2YQ==) | `0.00% <0.00%> (ø)` | |
   | [...al/utils/nativefst/automaton/AutomatonMatcher.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL0F1dG9tYXRvbk1hdGNoZXIuamF2YQ==) | `0.00% <0.00%> (ø)` | |
   | [...ils/nativefst/automaton/CharacterRunAutomaton.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL0NoYXJhY3RlclJ1bkF1dG9tYXRvbi5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [.../local/utils/nativefst/automaton/RunAutomaton.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL1J1bkF1dG9tYXRvbi5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [...l/utils/nativefst/automaton/SpecialOperations.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL1NwZWNpYWxPcGVyYXRpb25zLmphdmE=) | `0.00% <0.00%> (ø)` | |
   | [...ent/local/utils/nativefst/automaton/StatePair.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL1N0YXRlUGFpci5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [...ils/nativefst/automaton/StringUnionOperations.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL1N0cmluZ1VuaW9uT3BlcmF0aW9ucy5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [...gment/local/utils/nativefst/builders/FSTUtils.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYnVpbGRlcnMvRlNUVXRpbHMuamF2YQ==) | `14.28% <14.28%> (ø)` | |
   | [...local/utils/nativefst/automaton/BasicAutomata.java](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9uYXRpdmVmc3QvYXV0b21hdG9uL0Jhc2ljQXV0b21hdGEuamF2YQ==) | `20.00% <20.00%> (ø)` | |
   | ... and [72 more](https://codecov.io/gh/apache/pinot/pull/7405/diff?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [fe14d60...b22c014](https://codecov.io/gh/apache/pinot/pull/7405?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] richardstartin commented on a change in pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
richardstartin commented on a change in pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#discussion_r710152910



##########
File path: pinot-segment-local/src/test/java/org/apache/pinot/segment/local/utils/nativefst/FSTTestUtils.java
##########
@@ -0,0 +1,129 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.utils.nativefst;
+
+import java.nio.ByteBuffer;
+import java.nio.charset.Charset;
+import java.util.Arrays;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Random;
+import org.apache.pinot.segment.local.utils.nativefst.builders.FSTBuilder;
+import org.apache.pinot.segment.local.utils.nativefst.utils.RegexpMatcher;
+
+import static java.nio.charset.StandardCharsets.UTF_8;
+import static org.testng.Assert.assertEquals;
+import static org.testng.FileAssert.fail;
+
+
+/**
+ * Test utils class
+ */
+class FSTTestUtils {
+
+  private FSTTestUtils() {
+  }
+
+  /*
+   * Generate a sorted list of random sequences.
+   */
+  public static byte[][] generateRandom(int count, MinMax length, MinMax alphabet) {
+    final byte[][] input = new byte[count][];
+    final Random rnd = new Random();
+    for (int i = 0; i < count; i++) {
+      input[i] = randomByteSequence(rnd, length, alphabet);
+    }
+    Arrays.sort(input, FSTBuilder.LEXICAL_ORDERING);
+    return input;
+  }
+
+  /**
+   * Generate a random string.
+   */
+  private static byte[] randomByteSequence(Random rnd, MinMax length, MinMax alphabet) {
+    byte[] bytes = new byte[length._min + rnd.nextInt(length.range())];
+    for (int i = 0; i < bytes.length; i++) {
+      bytes[i] = (byte) (alphabet._min + rnd.nextInt(alphabet.range()));
+    }
+    return bytes;

Review comment:
       I have experienced wildly different results benchmarking string search algorithms on uniformly random input and inputs generated by Markov processes with transition probabilities inferred from large texts: [I wrote about it here](https://richardstartin.github.io/posts/heuristics-for-substring-search#markov-chain-generated-english-and-german). I would suggest sourcing some natural language text to measure this.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] richardstartin commented on a change in pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
richardstartin commented on a change in pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#discussion_r710168057



##########
File path: pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/nativefst/utils/RegexpMatcher.java
##########
@@ -0,0 +1,170 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.utils.nativefst.utils;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Set;
+import org.apache.pinot.segment.local.utils.nativefst.FST;
+import org.apache.pinot.segment.local.utils.nativefst.automaton.Automaton;
+import org.apache.pinot.segment.local.utils.nativefst.automaton.CharacterRunAutomaton;
+import org.apache.pinot.segment.local.utils.nativefst.automaton.RegExp;
+import org.apache.pinot.segment.local.utils.nativefst.automaton.State;
+import org.apache.pinot.segment.local.utils.nativefst.automaton.Transition;
+
+
+/**
+ * RegexpMatcher is a helper to retrieve matching values for a given regexp query.
+ * Regexp query is converted into an automaton and we run the matching algorithm on FST.
+ *
+ * Two main functions of this class are
+ *   regexMatchOnFST() Function runs matching on FST (See function comments for more details)
+ *   match(input) Function builds the automaton and matches given input.
+ */
+public class RegexpMatcher {
+  private final String _regexQuery;
+  private final FST _fst;
+  private final Automaton _automaton;
+
+  public RegexpMatcher(String regexQuery, FST fst) {
+    _regexQuery = regexQuery;
+    _fst = fst;
+
+    _automaton = new RegExp(_regexQuery).toAutomaton();
+  }
+
+  public static List<Long> regexMatch(String regexQuery, FST fst) {

Review comment:
       Simpler would be to take an `IntConsumer` so it could be directly appended to the resultant bitmap without this depending on RoaringBitmap. So something like:
   
   ```java
   static void regexMatch(String regexQuery, FST fst, IntConsumer dest) {
       ...
      dest.accept(_fst.getOutputSymbol(path._fstArc));
   }
   ...
   RoaringBitmapWriter<MutableRoaringBitmap> writer = RoaringBitmapWriter.bufferWriter().get();
   regexMatch(regex, fst, writer::add);




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] richardstartin commented on a change in pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
richardstartin commented on a change in pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#discussion_r716505530



##########
File path: pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/nativefst/automaton/Automaton.java
##########
@@ -0,0 +1,653 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.pinot.segment.local.utils.nativefst.automaton;
+
+import java.io.Serializable;
+import java.util.Arrays;
+import java.util.BitSet;
+import java.util.Collection;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.LinkedList;
+import java.util.List;
+import java.util.Set;
+
+
+/**
+ * Finite-state automaton with regular expression operations.
+ * <p>
+ * Class invariants:
+ * <ul>
+ * <li> An automaton is either represented explicitly (with {@link State} and {@link Transition} objects)
+ *      or with a singleton string ({@link #expandSingleton()}) in case
+ *      the automaton is known to accept exactly one string.
+ *      (Implicitly, all states and transitions of an automaton are reachable from its initial state.)
+ * <li> Automata are always reduced (see {@link #reduce()}) 
+ *      and have no transitions to dead states (see {@link #removeDeadTransitions()}).
+ * <li> Automata provided as input to operations are generally assumed to be disjoint.
+ * </ul>
+ * <p>
+ */
+public class Automaton implements Serializable, Cloneable {
+
+  /**
+   * Minimize using Huffman's O(n<sup>2</sup>) algorithm.
+   * This is the standard text-book algorithm.
+   */
+  public static final int MINIMIZE_HUFFMAN = 0;
+  /**
+   * Minimize using Brzozowski's O(2<sup>n</sup>) algorithm.
+   * This algorithm uses the reverse-determinize-reverse-determinize trick, which has a bad
+   * worst-case behavior but often works very well in practice
+   * (even better than Hopcroft's!).
+   */
+  public static final int MINIMIZE_BRZOZOWSKI = 1;
+  /**
+   * Minimize using Hopcroft's O(n log n) algorithm.
+   */
+  public static final int MINIMIZE_HOPCROFT = 2;
+  /**
+   * Minimize using Valmari's O(n + m log m) algorithm.
+   */
+  public static final int MINIMIZE_VALMARI = 3;
+
+  /** Minimize always flag. */
+  public static boolean _minimizeAlways = false;
+
+  /** Selects whether operations may modify the input automata (default: <code>false</code>). */
+  public static boolean _allowMutation = false;
+
+  /** Selects minimization algorithm (default: <code>MINIMIZE_HOPCROFT</code>). */
+  public static int _minimization = MINIMIZE_HOPCROFT;
+
+  /** Initial state of this automaton. */
+  State _initial;
+
+  /** If true, then this automaton is definitely deterministic
+   (i.e., there are no choices for any run, but a run may crash). */
+  boolean _deterministic;
+
+  /** Hash code. Recomputed by {@link #minimize()}. */
+  int _hashCode;
+
+  /** Singleton string. Null if not applicable. */
+  String _singleton;
+
+  /**
+   * Constructs a new automaton that accepts the empty language.
+   * Using this constructor, automata can be constructed manually from
+   * {@link State} and {@link Transition} objects.
+   * @see #setInitialState(State)
+   * @see State
+   * @see Transition
+   */
+  public Automaton() {
+    _initial = new State();
+    _deterministic = true;
+    _singleton = null;
+  }
+
+  /**
+   * Sets or resets allow mutate flag.
+   * If this flag is set, then all automata operations may modify automata given as input;
+   * otherwise, operations will always leave input automata languages unmodified.
+   * By default, the flag is not set.
+   * @param flag if true, the flag is set
+   * @return previous value of the flag
+   */
+  static public boolean setAllowMutate(boolean flag) {
+    boolean b = _allowMutation;
+    _allowMutation = flag;
+    return b;
+  }
+
+  /**
+   * Assigns consecutive numbers to the given states.
+   */
+  static void setStateNumbers(Set<State> states) {
+    if (states.size() == Integer.MAX_VALUE) {
+      throw new IllegalArgumentException("number of states exceeded Integer.MAX_VALUE");
+    }
+    int number = 0;
+    for (State s : states) {
+      s._number = number++;
+    }
+  }
+
+  /**
+   * Returns a sorted array of transitions for each state (and sets state numbers).
+   */
+  static Transition[][] getSortedTransitions(Set<State> states) {
+    setStateNumbers(states);
+    Transition[][] transitions = new Transition[states.size()][];
+    for (State s : states) {
+      transitions[s._number] = s.getSortedTransitionArray(false);
+    }
+    return transitions;
+  }
+
+  /**
+   * See {@link MinimizationOperations#minimize(Automaton)}.
+   * Returns the automaton being given as argument.
+   */
+  public static Automaton minimize(Automaton a) {
+    a.minimize();
+    return a;
+  }
+
+  void checkMinimizeAlways() {
+    if (_minimizeAlways) {
+      minimize();
+    }
+  }
+
+  boolean isSingleton() {
+    return _singleton != null;
+  }
+
+  /**
+   * Gets initial state.
+   * @return state
+   */
+  public State getInitialState() {
+    expandSingleton();
+    return _initial;
+  }
+
+  /**
+   * Sets initial state.
+   * @param s state
+   */
+  public void setInitialState(State s) {
+    _initial = s;
+    _singleton = null;
+  }
+
+  /**
+   * Returns the set of states that are reachable from the initial state.
+   * @return set of {@link State} objects
+   */
+  public Set<State> getStates() {
+    expandSingleton();
+    Set<State> visited;
+
+    visited = new HashSet<>();
+    LinkedList<State> worklist = new LinkedList<State>();
+    worklist.add(_initial);
+    visited.add(_initial);
+    while (!worklist.isEmpty()) {
+      State s = worklist.removeFirst();
+      Collection<Transition> tr;
+
+      tr = s._transitionSet;
+      for (Transition t : tr) {
+        if (!visited.contains(t._to)) {
+          visited.add(t._to);
+          worklist.add(t._to);
+        }
+      }
+    }
+    return visited;
+  }
+
+  /**
+   * Returns the set of reachable accept states.
+   * @return set of {@link State} objects
+   */
+  public Set<State> getAcceptStates() {
+    expandSingleton();
+    HashSet<State> accepts = new HashSet<State>();
+    BitSet visited = new BitSet();
+    LinkedList<State> worklist = new LinkedList<State>();
+    worklist.add(_initial);
+    visited.set(_initial._id);

Review comment:
       Why `_id` and not `_number`? `_id` is global and `_number` is assigned locally within the scope of the automaton?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] richardstartin commented on a change in pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
richardstartin commented on a change in pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#discussion_r717349854



##########
File path: pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/nativefst/builders/FSTBuilder.java
##########
@@ -0,0 +1,565 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.utils.nativefst.builders;
+
+import java.util.Arrays;
+import java.util.Comparator;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.SortedMap;
+import java.util.TreeMap;
+import org.apache.pinot.segment.local.utils.nativefst.ConstantArcSizeFST;
+import org.apache.pinot.segment.local.utils.nativefst.FST;
+
+
+/**
+ * Fast, memory-conservative finite state transducer builder, returning an
+ * in-memory {@link FST} that is a tradeoff between construction speed and
+ * memory consumption. Use serializers to compress the returned automaton into
+ * more compact form.
+ *
+ * @see FSTSerializer
+ */
+public final class FSTBuilder {
+  /**
+   * A comparator comparing full byte arrays. Unsigned byte comparisons ('C'-locale).
+   */
+  public static final Comparator<byte[]> LEXICAL_ORDERING = new Comparator<byte[]>() {
+    public int compare(byte[] o1, byte[] o2) {
+      return FSTBuilder.compare(o1, 0, o1.length, o2, 0, o2.length);
+    }
+  };
+  /** A megabyte. */
+  private final static int MB = 1024 * 1024;
+
+  /**
+   * Internal serialized FST buffer expand ratio.
+   */
+  private final static int BUFFER_GROWTH_SIZE = 5 * MB;
+
+  /**
+   * Maximum number of labels from a single state.
+   */
+  private final static int MAX_LABELS = 256;
+  /**
+   * Internal serialized FST buffer expand ratio.
+   */
+  private final int _bufferGrowthSize;
+  private byte[] _serialized = new byte[0];
+  private Map<Integer, Integer> _outputSymbols = new HashMap<>();
+
+  /**
+   * Number of bytes already taken in {@link #_serialized}. Start from 1 to keep
+   * 0 a sentinel value (for the hash set and final state).
+   */
+  private int _size;
+  /**
+   * States on the "active path" (still mutable). Values are addresses of each
+   * state's first arc.
+   */
+  private int[] _activePath = new int[0];
+  /**
+   * Current length of the active path.
+   */
+  private int _activePathLen;
+  /**
+   * The next offset at which an arc will be added to the given state on
+   * {@link #_activePath}.
+   */
+  private int[] _nextArcOffset = new int[0];
+  /**
+   * Root state. If negative, the automaton has been built already and cannot be
+   * extended.
+   */
+  private int _root;
+  /**
+   * An epsilon state. The first and only arc of this state points either to the
+   * root or to the terminal state, indicating an empty automaton.
+   */
+  private int _epsilon;
+  /**
+   * Hash set of state addresses in {@link #_serialized}, hashed by
+   * {@link #hash(int, int)}. Zero reserved for an unoccupied slot.
+   */
+  private int[] _hashSet = new int[2];
+  /**
+   * Number of entries currently stored in {@link #_hashSet}.
+   */
+  private int _hashSize = 0;
+  /**
+   * Previous sequence added to the automaton in {@link #add(byte[], int, int, int)}.
+   * Used in assertions only.
+   */
+  private byte[] _previous;
+  /**
+   * Information about the automaton and its compilation.
+   */
+  private TreeMap<InfoEntry, Object> _info;
+  /**
+   * {@link #_previous} sequence's length, used in assertions only.
+   */
+  private int _previousLength;
+  /** Number of serialization buffer reallocations. */
+  private int _serializationBufferReallocations;
+
+  /** */
+  public FSTBuilder() {
+    this(BUFFER_GROWTH_SIZE);
+  }
+
+  /**
+   * @param bufferGrowthSize Buffer growth size (in bytes) when constructing the automaton.
+   */
+  public FSTBuilder(int bufferGrowthSize) {
+    _bufferGrowthSize = Math.max(bufferGrowthSize, ConstantArcSizeFST.ARC_SIZE * MAX_LABELS);
+
+    // Allocate epsilon state.
+    _epsilon = allocateState(1);
+    _serialized[_epsilon + ConstantArcSizeFST.FLAGS_OFFSET] |= ConstantArcSizeFST.BIT_ARC_LAST;
+
+    // Allocate root, with an initial empty set of output arcs.
+    expandActivePath(1);
+    _root = _activePath[0];
+  }
+
+  public static FST buildFST(SortedMap<String, Integer> input) {
+
+    FSTBuilder fstbuilder = new FSTBuilder();
+
+    for (Map.Entry<String, Integer> entry : input.entrySet()) {
+      fstbuilder.add(entry.getKey().getBytes(), 0, entry.getKey().length(), entry.getValue().intValue());
+    }
+
+    return fstbuilder.complete();
+  }
+
+  /**
+   * Build a minimal, deterministic automaton from a sorted list of byte
+   * sequences.
+   *
+   * @param input Input sequences to build automaton from.
+   * @return Returns the automaton encoding all input sequences.
+   */
+  public static FST build(byte[][] input, int[] outputSymbols) {
+    final FSTBuilder builder = new FSTBuilder();
+
+    int i = 0;
+    for (byte[] chs : input) {
+      builder.add(chs, 0, chs.length, i < outputSymbols.length ? outputSymbols[i] : -1);
+      ++i;
+    }
+
+    return builder.complete();
+  }
+
+  /**
+   * Build a minimal, deterministic automaton from an iterable list of byte
+   * sequences.
+   *
+   * @param input Input sequences to build automaton from.
+   * @return Returns the automaton encoding all input sequences.
+   */
+  public static FST build(Iterable<byte[]> input, int[] outputSymbols) {
+    final FSTBuilder builder = new FSTBuilder();
+
+    int i = 0;
+
+    for (byte[] chs : input) {
+      builder.add(chs, 0, chs.length, i < outputSymbols.length ? outputSymbols[i] : -1);
+      ++i;
+    }
+
+    return builder.complete();
+  }
+
+  /**
+   * Lexicographic order of input sequences. By default, consistent with the "C"
+   * sort (absolute value of bytes, 0-255).
+   */
+  private static int compare(byte[] s1, int start1, int lens1, byte[] s2, int start2, int lens2) {
+    final int max = Math.min(lens1, lens2);
+
+    for (int i = 0; i < max; i++) {
+      final byte c1 = s1[start1++];
+      final byte c2 = s2[start2++];
+      if (c1 != c2) {
+        return (c1 & 0xff) - (c2 & 0xff);
+      }
+    }
+
+    return lens1 - lens2;
+  }

Review comment:
       If you rebase on master you can replace this with `ByteArray.compare`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] richardstartin commented on a change in pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
richardstartin commented on a change in pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#discussion_r716503637



##########
File path: pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/nativefst/automaton/Automaton.java
##########
@@ -0,0 +1,652 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.pinot.segment.local.utils.nativefst.automaton;
+
+import java.io.Serializable;
+import java.util.Arrays;
+import java.util.Collection;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.LinkedList;
+import java.util.List;
+import java.util.Set;
+
+
+/**
+ * Finite-state automaton with regular expression operations.
+ * <p>
+ * Class invariants:
+ * <ul>
+ * <li> An automaton is either represented explicitly (with {@link State} and {@link Transition} objects)
+ *      or with a singleton string ({@link #expandSingleton()}) in case
+ *      the automaton is known to accept exactly one string.
+ *      (Implicitly, all states and transitions of an automaton are reachable from its initial state.)
+ * <li> Automata are always reduced (see {@link #reduce()}) 
+ *      and have no transitions to dead states (see {@link #removeDeadTransitions()}).
+ * <li> If an automaton is nondeterministic, then {@link #isDeterministic()} returns false (but
+ *      the converse is not required).
+ * <li> Automata provided as input to operations are generally assumed to be disjoint.
+ * </ul>
+ * <p>
+ */
+public class Automaton implements Serializable, Cloneable {
+
+  /**
+   * Minimize using Huffman's O(n<sup>2</sup>) algorithm.
+   * This is the standard text-book algorithm.
+   */
+  public static final int MINIMIZE_HUFFMAN = 0;
+  /**
+   * Minimize using Brzozowski's O(2<sup>n</sup>) algorithm.
+   * This algorithm uses the reverse-determinize-reverse-determinize trick, which has a bad
+   * worst-case behavior but often works very well in practice
+   * (even better than Hopcroft's!).
+   */
+  public static final int MINIMIZE_BRZOZOWSKI = 1;
+  /**
+   * Minimize using Hopcroft's O(n log n) algorithm.
+   */
+  public static final int MINIMIZE_HOPCROFT = 2;
+  /**
+   * Minimize using Valmari's O(n + m log m) algorithm.
+   */
+  public static final int MINIMIZE_VALMARI = 3;
+
+  /** Minimize always flag. */
+  public static boolean _minimizeAlways = false;
+
+  /** Selects whether operations may modify the input automata (default: <code>false</code>). */
+  public static boolean _allowMutation = false;
+
+  /** Selects minimization algorithm (default: <code>MINIMIZE_HOPCROFT</code>). */
+  public static int _minimization = MINIMIZE_HOPCROFT;
+
+  /** Initial state of this automaton. */
+  State _initial;
+
+  /** If true, then this automaton is definitely deterministic
+   (i.e., there are no choices for any run, but a run may crash). */
+  boolean _deterministic;
+
+  /** Hash code. Recomputed by {@link #minimize()}. */
+  int _hashCode;
+
+  /** Singleton string. Null if not applicable. */
+  String _singleton;
+
+  /**
+   * Constructs a new automaton that accepts the empty language.
+   * Using this constructor, automata can be constructed manually from
+   * {@link State} and {@link Transition} objects.
+   * @see #setInitialState(State)
+   * @see State
+   * @see Transition
+   */
+  public Automaton() {
+    _initial = new State();
+    _deterministic = true;
+    _singleton = null;
+  }
+
+  /**
+   * Sets or resets allow mutate flag.
+   * If this flag is set, then all automata operations may modify automata given as input;
+   * otherwise, operations will always leave input automata languages unmodified.
+   * By default, the flag is not set.
+   * @param flag if true, the flag is set
+   * @return previous value of the flag
+   */
+  static public boolean setAllowMutate(boolean flag) {
+    boolean b = _allowMutation;
+    _allowMutation = flag;
+    return b;
+  }
+
+  /**
+   * Assigns consecutive numbers to the given states.
+   */
+  static void setStateNumbers(Set<State> states) {
+    if (states.size() == Integer.MAX_VALUE) {
+      throw new IllegalArgumentException("number of states exceeded Integer.MAX_VALUE");
+    }
+    int number = 0;
+    for (State s : states) {
+      s._number = number++;
+    }
+  }
+
+  /**
+   * Returns a sorted array of transitions for each state (and sets state numbers).
+   */
+  static Transition[][] getSortedTransitions(Set<State> states) {
+    setStateNumbers(states);
+    Transition[][] transitions = new Transition[states.size()][];
+    for (State s : states) {
+      transitions[s._number] = s.getSortedTransitionArray(false);
+    }
+    return transitions;
+  }
+
+  /**
+   * See {@link MinimizationOperations#minimize(Automaton)}.
+   * Returns the automaton being given as argument.
+   */
+  public static Automaton minimize(Automaton a) {
+    a.minimize();
+    return a;
+  }
+
+  void checkMinimizeAlways() {
+    if (_minimizeAlways) {
+      minimize();
+    }
+  }
+
+  boolean isSingleton() {
+    return _singleton != null;
+  }
+
+  /**
+   * Gets initial state.
+   * @return state
+   */
+  public State getInitialState() {
+    expandSingleton();
+    return _initial;
+  }
+
+  /**
+   * Sets initial state.
+   * @param s state
+   */
+  public void setInitialState(State s) {
+    _initial = s;
+    _singleton = null;
+  }
+
+  /**
+   * Returns the set of states that are reachable from the initial state.
+   * @return set of {@link State} objects
+   */
+  public Set<State> getStates() {
+    expandSingleton();
+    Set<State> visited;
+
+    visited = new HashSet<>();
+    LinkedList<State> worklist = new LinkedList<State>();
+    worklist.add(_initial);
+    visited.add(_initial);
+    while (worklist.size() > 0) {

Review comment:
       Has this comment been addressed?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] richardstartin commented on a change in pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
richardstartin commented on a change in pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#discussion_r716508950



##########
File path: pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/nativefst/NativeFSTIndexReader.java
##########
@@ -0,0 +1,95 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.utils.nativefst;
+
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.util.ArrayList;
+import java.util.List;
+import org.apache.avro.util.ByteBufferInputStream;
+import org.apache.pinot.segment.local.utils.nativefst.utils.RegexpMatcher;
+import org.apache.pinot.segment.spi.index.reader.TextIndexReader;
+import org.apache.pinot.segment.spi.memory.PinotDataBuffer;
+import org.roaringbitmap.RoaringBitmapWriter;
+import org.roaringbitmap.buffer.ImmutableRoaringBitmap;
+import org.roaringbitmap.buffer.MutableRoaringBitmap;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+
+/**
+ * This class loads FST index from PinotDataBuffer and creates a FST reader which
+ * is used in finding matching results for regexp queries. Since FST index currently
+ * stores dict ids as values this class only implements getDictIds method.
+ *
+ * This class works on top of ImmutableFST.
+ *
+ */
+public class NativeFSTIndexReader implements TextIndexReader {
+  public static final Logger LOGGER =
+      LoggerFactory.getLogger(org.apache.pinot.segment.local.segment.index.readers.LuceneFSTIndexReader.class);
+
+  private final PinotDataBuffer _dataBuffer;
+
+  private final FST _readFST;
+
+  public NativeFSTIndexReader(PinotDataBuffer pinotDataBuffer)
+      throws IOException {
+    _dataBuffer = pinotDataBuffer;
+
+    List<ByteBuffer> inputList = new ArrayList<>();
+
+    inputList.add(_dataBuffer.toDirectByteBuffer(0, (int) _dataBuffer.size()));
+
+    _readFST =
+        FST.read(new ByteBufferInputStream(inputList), ImmutableFST.class, true);
+  }
+
+  @Override
+  public MutableRoaringBitmap getDocIds(String searchQuery) {
+    throw new RuntimeException("LuceneFSTIndexReader only supports getDictIds currently.");
+  }
+
+  @Override
+  public ImmutableRoaringBitmap getDictIds(String searchQuery) {
+    try {
+      RoaringBitmapWriter<MutableRoaringBitmap> dictIds = RoaringBitmapWriter
+          .bufferWriter().get();
+
+      RoaringBitmapWriter<MutableRoaringBitmap> writer = RoaringBitmapWriter.bufferWriter().get();
+      RegexpMatcher.regexMatch(searchQuery, _readFST, writer::add);
+
+      MutableRoaringBitmap matchingIds = writer.get();
+
+      for (Integer matchingId : matchingIds) {
+        dictIds.add(matchingId.intValue());
+      }
+      return dictIds.get();

Review comment:
       just return `matchingIds` - this decompresses and boxes every id in `matchingIds`, so that it can be unboxed and compressed in a new bitmap.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] richardstartin commented on a change in pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
richardstartin commented on a change in pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#discussion_r716497562



##########
File path: pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/nativefst/ConstantArcSizeFST.java
##########
@@ -0,0 +1,159 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.utils.nativefst;
+
+import java.util.Collections;
+import java.util.Map;
+import java.util.Set;
+import org.apache.pinot.segment.local.utils.nativefst.builders.FSTBuilder;
+
+
+/**
+ * A FST with constant-size arc representation produced directly by
+ * {@link FSTBuilder}.
+ *
+ * @see FSTBuilder
+ */
+public final class ConstantArcSizeFST extends FST {
+  /** Size of the target address field (constant for the builder). */
+  public final static int TARGET_ADDRESS_SIZE = 4;
+
+  /** Size of the flags field (constant for the builder). */
+  public final static int FLAGS_SIZE = 1;
+
+  /** Size of the label field (constant for the builder). */
+  public final static int LABEL_SIZE = 1;
+
+  /**
+   * Size of a single arc structure.
+   */
+  public final static int ARC_SIZE = FLAGS_SIZE + LABEL_SIZE + TARGET_ADDRESS_SIZE;
+
+  /** Offset of the flags field inside an arc. */
+  public final static int FLAGS_OFFSET = 0;
+
+  /** Offset of the label field inside an arc. */
+  public final static int LABEL_OFFSET = FLAGS_SIZE;
+
+  /** Offset of the address field inside an arc. */
+  public final static int ADDRESS_OFFSET = LABEL_OFFSET + LABEL_SIZE;
+  /**
+   * An arc flag indicating the target node of an arc corresponds to a final
+   * state.
+   */
+  public final static int BIT_ARC_FINAL = 1 << 1;
+  /** An arc flag indicating the arc is last within its state. */
+  public final static int BIT_ARC_LAST = 1 << 0;
+  /** A dummy address of the terminal state. */
+  public final static int TERMINAL_STATE = 0;
+  /**
+   * An epsilon state. The first and only arc of this state points either to the
+   * root or to the terminal state, indicating an empty automaton.
+   */
+  private final int _epsilon;
+
+  /**
+   * FST data, serialized as a byte array.
+   */
+  private final byte[] _data;
+
+  private Map<Integer, Integer> _outputSymbols;
+
+  /**
+   * @param data
+   *          FST data. There must be no trailing bytes after the last state.
+   */
+  public ConstantArcSizeFST(byte[] data, int epsilon, Map<Integer, Integer> outputSymbols) {
+    assert epsilon == 0 : "Epsilon is not zero?";

Review comment:
       You have a final field `_epsilon` which can only take the value zero. This does not make the code more readable in my opinion. Instead, remove `_epsilon` and make a `private static final int EPSILON = 0` and use that instead, that way there can be no confusion that epsilon is zero.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] richardstartin commented on a change in pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
richardstartin commented on a change in pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#discussion_r710128168



##########
File path: pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/nativefst/automaton/Automaton.java
##########
@@ -0,0 +1,652 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.pinot.segment.local.utils.nativefst.automaton;
+
+import java.io.Serializable;
+import java.util.Arrays;
+import java.util.Collection;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.LinkedList;
+import java.util.List;
+import java.util.Set;
+
+
+/**
+ * Finite-state automaton with regular expression operations.
+ * <p>
+ * Class invariants:
+ * <ul>
+ * <li> An automaton is either represented explicitly (with {@link State} and {@link Transition} objects)
+ *      or with a singleton string ({@link #expandSingleton()}) in case
+ *      the automaton is known to accept exactly one string.
+ *      (Implicitly, all states and transitions of an automaton are reachable from its initial state.)
+ * <li> Automata are always reduced (see {@link #reduce()}) 
+ *      and have no transitions to dead states (see {@link #removeDeadTransitions()}).
+ * <li> If an automaton is nondeterministic, then {@link #isDeterministic()} returns false (but
+ *      the converse is not required).
+ * <li> Automata provided as input to operations are generally assumed to be disjoint.
+ * </ul>
+ * <p>
+ */
+public class Automaton implements Serializable, Cloneable {
+
+  /**
+   * Minimize using Huffman's O(n<sup>2</sup>) algorithm.
+   * This is the standard text-book algorithm.
+   */
+  public static final int MINIMIZE_HUFFMAN = 0;
+  /**
+   * Minimize using Brzozowski's O(2<sup>n</sup>) algorithm.
+   * This algorithm uses the reverse-determinize-reverse-determinize trick, which has a bad
+   * worst-case behavior but often works very well in practice
+   * (even better than Hopcroft's!).
+   */
+  public static final int MINIMIZE_BRZOZOWSKI = 1;
+  /**
+   * Minimize using Hopcroft's O(n log n) algorithm.
+   */
+  public static final int MINIMIZE_HOPCROFT = 2;
+  /**
+   * Minimize using Valmari's O(n + m log m) algorithm.
+   */
+  public static final int MINIMIZE_VALMARI = 3;
+
+  /** Minimize always flag. */
+  public static boolean _minimizeAlways = false;
+
+  /** Selects whether operations may modify the input automata (default: <code>false</code>). */
+  public static boolean _allowMutation = false;
+
+  /** Selects minimization algorithm (default: <code>MINIMIZE_HOPCROFT</code>). */
+  public static int _minimization = MINIMIZE_HOPCROFT;
+
+  /** Initial state of this automaton. */
+  State _initial;
+
+  /** If true, then this automaton is definitely deterministic
+   (i.e., there are no choices for any run, but a run may crash). */
+  boolean _deterministic;
+
+  /** Hash code. Recomputed by {@link #minimize()}. */
+  int _hashCode;
+
+  /** Singleton string. Null if not applicable. */
+  String _singleton;
+
+  /**
+   * Constructs a new automaton that accepts the empty language.
+   * Using this constructor, automata can be constructed manually from
+   * {@link State} and {@link Transition} objects.
+   * @see #setInitialState(State)
+   * @see State
+   * @see Transition
+   */
+  public Automaton() {
+    _initial = new State();
+    _deterministic = true;
+    _singleton = null;
+  }
+
+  /**
+   * Sets or resets allow mutate flag.
+   * If this flag is set, then all automata operations may modify automata given as input;
+   * otherwise, operations will always leave input automata languages unmodified.
+   * By default, the flag is not set.
+   * @param flag if true, the flag is set
+   * @return previous value of the flag
+   */
+  static public boolean setAllowMutate(boolean flag) {
+    boolean b = _allowMutation;
+    _allowMutation = flag;
+    return b;
+  }
+
+  /**
+   * Assigns consecutive numbers to the given states.
+   */
+  static void setStateNumbers(Set<State> states) {
+    if (states.size() == Integer.MAX_VALUE) {
+      throw new IllegalArgumentException("number of states exceeded Integer.MAX_VALUE");
+    }
+    int number = 0;
+    for (State s : states) {
+      s._number = number++;
+    }
+  }
+
+  /**
+   * Returns a sorted array of transitions for each state (and sets state numbers).
+   */
+  static Transition[][] getSortedTransitions(Set<State> states) {
+    setStateNumbers(states);
+    Transition[][] transitions = new Transition[states.size()][];
+    for (State s : states) {
+      transitions[s._number] = s.getSortedTransitionArray(false);
+    }
+    return transitions;
+  }
+
+  /**
+   * See {@link MinimizationOperations#minimize(Automaton)}.
+   * Returns the automaton being given as argument.
+   */
+  public static Automaton minimize(Automaton a) {
+    a.minimize();
+    return a;
+  }
+
+  void checkMinimizeAlways() {
+    if (_minimizeAlways) {
+      minimize();
+    }
+  }
+
+  boolean isSingleton() {
+    return _singleton != null;
+  }
+
+  /**
+   * Gets initial state.
+   * @return state
+   */
+  public State getInitialState() {
+    expandSingleton();
+    return _initial;
+  }
+
+  /**
+   * Sets initial state.
+   * @param s state
+   */
+  public void setInitialState(State s) {
+    _initial = s;
+    _singleton = null;
+  }
+
+  /**
+   * Returns the set of states that are reachable from the initial state.
+   * @return set of {@link State} objects
+   */
+  public Set<State> getStates() {
+    expandSingleton();
+    Set<State> visited;
+
+    visited = new HashSet<>();
+    LinkedList<State> worklist = new LinkedList<State>();
+    worklist.add(_initial);
+    visited.add(_initial);
+    while (worklist.size() > 0) {
+      State s = worklist.removeFirst();
+      Collection<Transition> tr;
+
+      tr = s._transitionSet;
+      for (Transition t : tr) {
+        if (!visited.contains(t._to)) {
+          visited.add(t._to);
+          worklist.add(t._to);
+        }
+      }
+    }
+    return visited;
+  }
+
+  /**
+   * Returns the set of reachable accept states.
+   * @return set of {@link State} objects
+   */
+  public Set<State> getAcceptStates() {
+    expandSingleton();
+    HashSet<State> accepts = new HashSet<State>();
+    HashSet<State> visited = new HashSet<State>();
+    LinkedList<State> worklist = new LinkedList<State>();
+    worklist.add(_initial);
+    visited.add(_initial);
+    while (worklist.size() > 0) {
+      State s = worklist.removeFirst();
+      if (s._accept) {
+        accepts.add(s);
+      }
+      for (Transition t : s._transitionSet) {
+        if (!visited.contains(t._to)) {
+          visited.add(t._to);
+          worklist.add(t._to);
+        }
+      }
+    }
+    return accepts;

Review comment:
       In fact you have this already: `_number`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] richardstartin commented on a change in pull request #7405: Introduce Native Text Indices (Core Functionality)

Posted by GitBox <gi...@apache.org>.
richardstartin commented on a change in pull request #7405:
URL: https://github.com/apache/pinot/pull/7405#discussion_r710119079



##########
File path: pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/nativefst/NativeFSTIndexReader.java
##########
@@ -0,0 +1,88 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.utils.nativefst;
+
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.util.ArrayList;
+import java.util.List;
+import org.apache.avro.util.ByteBufferInputStream;
+import org.apache.pinot.segment.local.utils.nativefst.utils.RegexpMatcher;
+import org.apache.pinot.segment.spi.index.reader.TextIndexReader;
+import org.apache.pinot.segment.spi.memory.PinotDataBuffer;
+import org.roaringbitmap.buffer.ImmutableRoaringBitmap;
+import org.roaringbitmap.buffer.MutableRoaringBitmap;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+
+/**
+ * This class loads FST index from PinotDataBuffer and creates a FST reader which
+ * is used in finding matching results for regexp queries. Since FST index currently
+ * stores dict ids as values this class only implements getDictIds method.
+ *
+ * This class works on top of ImmutableFST.
+ *
+ */
+public class NativeFSTIndexReader implements TextIndexReader {
+  public static final Logger LOGGER =
+      LoggerFactory.getLogger(org.apache.pinot.segment.local.segment.index.readers.LuceneFSTIndexReader.class);
+
+  private final PinotDataBuffer _dataBuffer;
+
+  private final FST _readFST;
+
+  public NativeFSTIndexReader(PinotDataBuffer pinotDataBuffer)
+      throws IOException {
+    this._dataBuffer = pinotDataBuffer;
+
+    List<ByteBuffer> inputList = new ArrayList<>();
+
+    inputList.add(_dataBuffer.toDirectByteBuffer(0, (int) _dataBuffer.size()));
+
+    this._readFST =
+        FST.read(new ByteBufferInputStream(inputList), ImmutableFST.class, true);
+  }
+
+  @Override
+  public MutableRoaringBitmap getDocIds(String searchQuery) {
+    throw new RuntimeException("LuceneFSTIndexReader only supports getDictIds currently.");
+  }
+
+  @Override
+  public ImmutableRoaringBitmap getDictIds(String searchQuery) {
+    try {
+      MutableRoaringBitmap dictIds = new MutableRoaringBitmap();
+      List<Long> matchingIds = RegexpMatcher.regexMatch(searchQuery, this._readFST);
+      for (Long matchingId : matchingIds) {
+        dictIds.add(matchingId.intValue());
+      }

Review comment:
       If `matchingIds` is expected to be ordered and large, it would be better to use a `RoaringBitmapWriter` here which provides faster appends.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org