Posted to commits@asterixdb.apache.org by wa...@apache.org on 2016/12/10 08:14:59 UTC
[2/3] asterixdb git commit: Full-text implementation step 1
http://git-wip-us.apache.org/repos/asf/asterixdb/blob/44cef249/asterixdb/asterix-doc/src/site/markdown/aql/fulltext.md
----------------------------------------------------------------------
diff --git a/asterixdb/asterix-doc/src/site/markdown/aql/fulltext.md b/asterixdb/asterix-doc/src/site/markdown/aql/fulltext.md
new file mode 100644
index 0000000..921f0b3
--- /dev/null
+++ b/asterixdb/asterix-doc/src/site/markdown/aql/fulltext.md
@@ -0,0 +1,99 @@
+<!--
+ ! Licensed to the Apache Software Foundation (ASF) under one
+ ! or more contributor license agreements. See the NOTICE file
+ ! distributed with this work for additional information
+ ! regarding copyright ownership. The ASF licenses this file
+ ! to you under the Apache License, Version 2.0 (the
+ ! "License"); you may not use this file except in compliance
+ ! with the License. You may obtain a copy of the License at
+ !
+ ! http://www.apache.org/licenses/LICENSE-2.0
+ !
+ ! Unless required by applicable law or agreed to in writing,
+ ! software distributed under the License is distributed on an
+ ! "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ ! KIND, either express or implied. See the License for the
+ ! specific language governing permissions and limitations
+ ! under the License.
+ !-->
+
+# AsterixDB Support of Full-text search queries #
+
+## <a id="toc">Table of Contents</a> ##
+
+* [Motivation](#Motivation)
+* [Syntax](#Syntax)
+* [Creating and utilizing a Full-text index](#FulltextIndex)
+
+## <a id="Motivation">Motivation</a> <font size="4"><a href="#toc">[Back to TOC]</a></font> ##
+
+Full-Text Search (FTS) queries are widely used in applications where users need to find records that satisfy
+an FTS predicate, i.e., where simple string-based matching is not sufficient. These queries are important when
+finding documents that contain a certain keyword is crucial. FTS queries differ from substring-matching
+queries in that an FTS query matches its predicate against whole words in the given string, rather than
+treating the predicate as a sequence of characters. For example, an FTS query for “rain” returns
+a document only when it contains “rain” as a word. A substring-matching query, however, returns a document
+whenever it contains “rain” as a substring, so a document containing “brain” or “training” would be
+returned as well.
+
+## <a id="Syntax">Syntax</a> <font size="4"><a href="#toc">[Back to TOC]</a></font> ##
+
+The syntax of AsterixDB FTS follows a portion of the XQuery FullText Search syntax.
+A basic form is as follows:
+
+ ftcontains(Expression1, Expression2, {FullTextOption})
+
+For example, we can execute the following query to find tweet messages where the `message-text` field includes
+“voice” as a word. Please note that an FTS query is case-insensitive.
+Thus, "Voice" and "voice" are treated as the same word.
+
+ use dataverse TinySocial;
+
+ for $msg in dataset TweetMessages
+ where ftcontains($msg.message-text, "voice", {"mode":"any"})
+ return {"id": $msg.id}
+
+The DDL and DML of TinySocial can be found in [ADM: Modeling Semistructed Data in AsterixDB](primer.html#ADM:_Modeling_Semistructed_Data_in_AsterixDB).
+
+`Expression1` is an expression that must evaluate to a string at runtime, as in the above example
+where `$msg.message-text` is a string field. `Expression2` can be a string, an (un)ordered list
+of string value(s), or an expression. In the last case, the given expression must evaluate
+to one of the first two types, i.e., to a string value or an (un)ordered list of string value(s).
+
+The following examples are all valid expressions.
+
+ ... where ftcontains($msg.message-text, "sound", {"mode":"any"})
+ ... where ftcontains($msg.message-text, ["sound", "system"], {"mode":"any"})
+ ... where ftcontains($msg.message-text, {{"speed", "stand", "customization"}}, {"mode":"all"})
+ ... where ftcontains($msg.message-text, let $keyword_list := ["voice", "system"] return $keyword_list, {"mode":"all"})
+ ... where ftcontains($msg.message-text, $keyword_list, {"mode":"any"})
+
+In the last example above, `$keyword_list` should evaluate to a string or an (un)ordered list of string value(s).
+
+The last `FullTextOption` parameter clarifies the given FTS request. Currently, we only have one option named `mode`.
+As we extend the FTS feature, more options will be added. Please note that the format of `FullTextOption`
+is a record, so you need to put the option(s) in a record `{}`.
+The `mode` option indicates whether the given FTS query is a conjunctive (AND) or disjunctive (OR) search request.
+This option can be either `"any"` or `"all"`. If one specifies `"any"`, a disjunctive search will be conducted.
+For example, the following query will find documents whose `message-text` field contains “sound” or “system”,
+so a document will be returned if it contains “sound”, “system”, or both keywords.
+
+ ... where ftcontains($msg.message-text, ["sound", "system"], {"mode":"any"})
+
+The other option value, `"all"`, specifies a conjunctive search. The following example will find the documents whose
+`message-text` field contains both “sound” and “system”. If a document contains only “sound” or “system” but
+not both, it will not be returned.
+
+ ... where ftcontains($msg.message-text, ["sound", "system"], {"mode":"all"})
+
+Currently AsterixDB doesn’t (yet) support phrase searches, so the following query will not work.
+
+ ... where ftcontains($msg.message-text, "sound system", {"mode":"any"})
+
+As a workaround, the following query can be used to achieve a roughly similar goal. The difference is that
+the following query will find documents where `$msg.message-text` contains both “sound” and “system”, but the order
+and adjacency of “sound” and “system” are not checked, unlike in a phrase search. As a result, the query below would
+also return documents with “sound system can be installed.”, “system sound is perfect.”,
+or “sound is not clear. You may need to install a new system.”
+
+ ... where ftcontains($msg.message-text, ["sound", "system"], {"mode":"all"})
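
The word-level matching and the `any`/`all` modes described in fulltext.md can be illustrated with a minimal Java sketch. This is not AsterixDB code: the class `FtContainsSketch`, its method names, and the rule that a word is a maximal run of letters or digits are assumptions made for illustration only, and the real tokenizer may split text differently.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for the documented ftcontains() semantics; not AsterixDB code.
final class FtContainsSketch {

    // Assumption: a "word" is a maximal run of letters or digits.
    // The actual AsterixDB tokenizer may use a different separator set.
    static Set<String> tokenize(String text) {
        Set<String> tokens = new HashSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // "all": every distinct keyword must occur as a word in the text.
    // Any other mode value is treated as "any" here: one matching keyword is enough.
    static boolean ftcontains(String text, List<String> keywords, String mode) {
        Set<String> query = new HashSet<>();
        for (String k : keywords) {
            query.add(k.toLowerCase(Locale.ROOT));
        }
        int threshold = "all".equals(mode) ? query.size() : 1;
        int found = 0;
        // The document tokens are distinct, so each matching keyword is counted once.
        for (String token : tokenize(text)) {
            if (query.contains(token) && ++found >= threshold) {
                return true;
            }
        }
        return false;
    }
}
```

Under these assumptions, a search for "rain" in the text "brain training" does not match, because neither word equals the keyword, while "all" mode with the keywords "sound" and "system" matches "system sound is perfect." regardless of word order.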
http://git-wip-us.apache.org/repos/asf/asterixdb/blob/44cef249/asterixdb/asterix-om/src/main/java/org/apache/asterix/dataflow/data/common/IBinaryTokenizerFactoryProvider.java
----------------------------------------------------------------------
diff --git a/asterixdb/asterix-om/src/main/java/org/apache/asterix/dataflow/data/common/IBinaryTokenizerFactoryProvider.java b/asterixdb/asterix-om/src/main/java/org/apache/asterix/dataflow/data/common/IBinaryTokenizerFactoryProvider.java
index 159c74d..c57dd89 100644
--- a/asterixdb/asterix-om/src/main/java/org/apache/asterix/dataflow/data/common/IBinaryTokenizerFactoryProvider.java
+++ b/asterixdb/asterix-om/src/main/java/org/apache/asterix/dataflow/data/common/IBinaryTokenizerFactoryProvider.java
@@ -22,7 +22,8 @@ import org.apache.asterix.om.types.ATypeTag;
import org.apache.hyracks.storage.am.lsm.invertedindex.tokenizers.IBinaryTokenizerFactory;
public interface IBinaryTokenizerFactoryProvider {
- public IBinaryTokenizerFactory getWordTokenizerFactory(ATypeTag typeTag, boolean hashedTokens);
+ public IBinaryTokenizerFactory getWordTokenizerFactory(ATypeTag typeTag, boolean hashedTokens,
+ boolean typeTagAlreadyRemoved);
public IBinaryTokenizerFactory getNGramTokenizerFactory(ATypeTag typeTag, int gramLength, boolean usePrePost,
boolean hashedTokens);
http://git-wip-us.apache.org/repos/asf/asterixdb/blob/44cef249/asterixdb/asterix-om/src/main/java/org/apache/asterix/formats/nontagged/BinaryComparatorFactoryProvider.java
----------------------------------------------------------------------
diff --git a/asterixdb/asterix-om/src/main/java/org/apache/asterix/formats/nontagged/BinaryComparatorFactoryProvider.java b/asterixdb/asterix-om/src/main/java/org/apache/asterix/formats/nontagged/BinaryComparatorFactoryProvider.java
index 4e0e210..677c004 100644
--- a/asterixdb/asterix-om/src/main/java/org/apache/asterix/formats/nontagged/BinaryComparatorFactoryProvider.java
+++ b/asterixdb/asterix-om/src/main/java/org/apache/asterix/formats/nontagged/BinaryComparatorFactoryProvider.java
@@ -18,6 +18,8 @@
*/
package org.apache.asterix.formats.nontagged;
+import java.io.Serializable;
+
import org.apache.asterix.dataflow.data.nontagged.comparators.ABinaryComparator;
import org.apache.asterix.dataflow.data.nontagged.comparators.ACirclePartialBinaryComparatorFactory;
import org.apache.asterix.dataflow.data.nontagged.comparators.ADurationPartialBinaryComparatorFactory;
@@ -48,10 +50,9 @@ import org.apache.hyracks.data.std.primitive.IntegerPointable;
import org.apache.hyracks.data.std.primitive.LongPointable;
import org.apache.hyracks.data.std.primitive.ShortPointable;
import org.apache.hyracks.data.std.primitive.UTF8StringLowercasePointable;
+import org.apache.hyracks.data.std.primitive.UTF8StringLowercaseTokenPointable;
import org.apache.hyracks.data.std.primitive.UTF8StringPointable;
-import java.io.Serializable;
-
public class BinaryComparatorFactoryProvider implements IBinaryComparatorFactoryProvider, Serializable {
private static final long serialVersionUID = 1L;
@@ -74,6 +75,10 @@ public class BinaryComparatorFactoryProvider implements IBinaryComparatorFactory
// case-insensitive comparisons.
public static final PointableBinaryComparatorFactory UTF8STRING_LOWERCASE_POINTABLE_INSTANCE =
new PointableBinaryComparatorFactory(UTF8StringLowercasePointable.FACTORY);
+ // Equivalent to UTF8STRING_LOWERCASE_POINTABLE_INSTANCE but the length information is kept separately,
+ // rather than keeping them in the beginning of a string. It is especially useful for the string tokens
+ public static final PointableBinaryComparatorFactory UTF8STRING_LOWERCASE_TOKEN_POINTABLE_INSTANCE =
+ new PointableBinaryComparatorFactory(UTF8StringLowercaseTokenPointable.FACTORY);
public static final PointableBinaryComparatorFactory BINARY_POINTABLE_INSTANCE =
new PointableBinaryComparatorFactory(ByteArrayPointable.FACTORY);
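
The comment on `UTF8STRING_LOWERCASE_TOKEN_POINTABLE_INSTANCE` says the token length is kept separately instead of at the start of the string. A rough Java sketch of what such a comparison could look like follows. It is not the Hyracks `UTF8StringLowercaseTokenPointable` implementation: the class name `LowercaseTokenCompareSketch` is hypothetical, and the sketch assumes ASCII-only tokens, whereas a real comparator has to decode multi-byte UTF-8 characters.

```java
// Hypothetical sketch; not the Hyracks UTF8StringLowercaseTokenPointable implementation.
final class LowercaseTokenCompareSketch {

    // Lowercases an ASCII letter; every other byte value is returned unchanged.
    private static int lower(byte b) {
        int c = b & 0xFF;
        return (c >= 'A' && c <= 'Z') ? c + ('a' - 'A') : c;
    }

    // Compares two tokens passed as (array, offset, length) triples, ignoring ASCII case.
    // No length prefix is read from the arrays: the caller supplies each length.
    static int compare(byte[] a, int aOff, int aLen, byte[] b, int bOff, int bLen) {
        int n = Math.min(aLen, bLen);
        for (int i = 0; i < n; i++) {
            int ca = lower(a[aOff + i]);
            int cb = lower(b[bOff + i]);
            if (ca != cb) {
                return ca - cb;
            }
        }
        // Equal common prefix: the shorter token sorts first.
        return aLen - bLen;
    }
}
```

Passing the offset and length explicitly lets a token be compared in place inside a larger byte array, for example a word inside a document field, without copying it out or prepending a length.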
http://git-wip-us.apache.org/repos/asf/asterixdb/blob/44cef249/asterixdb/asterix-om/src/main/java/org/apache/asterix/formats/nontagged/BinaryTokenizerFactoryProvider.java
----------------------------------------------------------------------
diff --git a/asterixdb/asterix-om/src/main/java/org/apache/asterix/formats/nontagged/BinaryTokenizerFactoryProvider.java b/asterixdb/asterix-om/src/main/java/org/apache/asterix/formats/nontagged/BinaryTokenizerFactoryProvider.java
index 58740ee..084a811 100644
--- a/asterixdb/asterix-om/src/main/java/org/apache/asterix/formats/nontagged/BinaryTokenizerFactoryProvider.java
+++ b/asterixdb/asterix-om/src/main/java/org/apache/asterix/formats/nontagged/BinaryTokenizerFactoryProvider.java
@@ -38,9 +38,13 @@ public class BinaryTokenizerFactoryProvider implements IBinaryTokenizerFactoryPr
new DelimitedUTF8StringBinaryTokenizerFactory(true, true,
new UTF8WordTokenFactory(ATypeTag.SERIALIZED_STRING_TYPE_TAG, ATypeTag.SERIALIZED_INT32_TYPE_TAG));
+ private static final IBinaryTokenizerFactory aqlStringNoTypeTagTokenizer =
+ new DelimitedUTF8StringBinaryTokenizerFactory(true, false,
+ new UTF8WordTokenFactory(ATypeTag.STRING.serialize(), ATypeTag.INT32.serialize()));
+
private static final IBinaryTokenizerFactory aqlHashingStringTokenizer =
- new DelimitedUTF8StringBinaryTokenizerFactory(true, true,
- new HashedUTF8WordTokenFactory(ATypeTag.SERIALIZED_INT32_TYPE_TAG, ATypeTag.SERIALIZED_INT32_TYPE_TAG));
+ new DelimitedUTF8StringBinaryTokenizerFactory(true, true, new HashedUTF8WordTokenFactory(
+ ATypeTag.SERIALIZED_INT32_TYPE_TAG, ATypeTag.SERIALIZED_INT32_TYPE_TAG));
private static final IBinaryTokenizerFactory orderedListTokenizer = new AOrderedListBinaryTokenizerFactory(
new AListElementTokenFactory());
@@ -49,10 +53,17 @@ public class BinaryTokenizerFactoryProvider implements IBinaryTokenizerFactoryPr
new AListElementTokenFactory());
@Override
- public IBinaryTokenizerFactory getWordTokenizerFactory(ATypeTag typeTag, boolean hashedTokens) {
+ public IBinaryTokenizerFactory getWordTokenizerFactory(ATypeTag typeTag, boolean hashedTokens,
+ boolean typeTageAlreadyRemoved) {
switch (typeTag) {
case STRING:
- return hashedTokens ? aqlHashingStringTokenizer : aqlStringTokenizer;
+ if (hashedTokens) {
+ return aqlHashingStringTokenizer;
+ } else if (!typeTageAlreadyRemoved) {
+ return aqlStringTokenizer;
+ } else {
+ return aqlStringNoTypeTagTokenizer;
+ }
case ORDEREDLIST:
return orderedListTokenizer;
case UNORDEREDLIST:
http://git-wip-us.apache.org/repos/asf/asterixdb/blob/44cef249/asterixdb/asterix-om/src/main/java/org/apache/asterix/om/functions/AsterixBuiltinFunctions.java
----------------------------------------------------------------------
diff --git a/asterixdb/asterix-om/src/main/java/org/apache/asterix/om/functions/AsterixBuiltinFunctions.java b/asterixdb/asterix-om/src/main/java/org/apache/asterix/om/functions/AsterixBuiltinFunctions.java
index 29a693c..f4c0c38 100644
--- a/asterixdb/asterix-om/src/main/java/org/apache/asterix/om/functions/AsterixBuiltinFunctions.java
+++ b/asterixdb/asterix-om/src/main/java/org/apache/asterix/om/functions/AsterixBuiltinFunctions.java
@@ -486,6 +486,10 @@ public class AsterixBuiltinFunctions {
public static final FunctionIdentifier EDIT_DISTANCE_CONTAINS = new FunctionIdentifier(FunctionConstants.ASTERIX_NS,
"edit-distance-contains", 3);
+ // full-text
+ public static final FunctionIdentifier FULLTEXT_CONTAINS = new FunctionIdentifier(FunctionConstants.ASTERIX_NS,
+ "ftcontains", 3);
+
// tokenizers:
public static final FunctionIdentifier WORD_TOKENS = new FunctionIdentifier(FunctionConstants.ASTERIX_NS,
"word-tokens", 1);
@@ -1027,6 +1031,9 @@ public class AsterixBuiltinFunctions {
addPrivateFunction(SIMILARITY_JACCARD_PREFIX, AFloatTypeComputer.INSTANCE, true);
addPrivateFunction(SIMILARITY_JACCARD_PREFIX_CHECK, OrderedListOfAnyTypeComputer.INSTANCE, true);
+ // Full-text function
+ addFunction(FULLTEXT_CONTAINS, ABooleanTypeComputer.INSTANCE, true);
+
// Spatial functions
addFunction(SPATIAL_AREA, ADoubleTypeComputer.INSTANCE, true);
addFunction(SPATIAL_CELL, ARectangleTypeComputer.INSTANCE, true);
http://git-wip-us.apache.org/repos/asf/asterixdb/blob/44cef249/asterixdb/asterix-om/src/main/java/org/apache/asterix/om/util/ConstantExpressionUtil.java
----------------------------------------------------------------------
diff --git a/asterixdb/asterix-om/src/main/java/org/apache/asterix/om/util/ConstantExpressionUtil.java b/asterixdb/asterix-om/src/main/java/org/apache/asterix/om/util/ConstantExpressionUtil.java
index c67030a..e627d95 100644
--- a/asterixdb/asterix-om/src/main/java/org/apache/asterix/om/util/ConstantExpressionUtil.java
+++ b/asterixdb/asterix-om/src/main/java/org/apache/asterix/om/util/ConstantExpressionUtil.java
@@ -45,7 +45,16 @@ public class ConstantExpressionUtil {
return null;
}
final IAObject iaObject = ((AsterixConstantValue) acv).getObject();
- return iaObject.getType().getTypeTag() == typeTag ? iaObject : null;
+ if (typeTag != null) {
+ return iaObject.getType().getTypeTag() == typeTag ? iaObject : null;
+ } else {
+ return iaObject;
+ }
+ }
+
+ public static ATypeTag getConstantIaObjectType(ILogicalExpression expr) {
+ IAObject iaObject = getConstantIaObject(expr, null);
+ return iaObject.getType().getTypeTag();
}
public static Long getLongConstant(ILogicalExpression expr) {
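
In the hunk above, a `null` type tag acts as a wildcard in `getConstantIaObject`, and `getConstantIaObjectType` dereferences the result directly. Since the visible context shows a `return null;` path, the new helper appears to assume that callers only pass constant expressions. The following Java sketch illustrates the wildcard pattern together with one possible null-safe variant. All types here (`TypeTag`, `Constant`, `ConstantLookupSketch`) are hypothetical stand-ins, not AsterixDB classes, and the `Optional` return type is a design choice of this sketch rather than anything the commit does.

```java
import java.util.Optional;

// Hypothetical stand-ins for illustration; not AsterixDB classes.
final class ConstantLookupSketch {

    enum TypeTag { STRING, INT64 }

    static final class Constant {
        final TypeTag tag;
        final Object value;

        Constant(TypeTag tag, Object value) {
            this.tag = tag;
            this.value = value;
        }
    }

    // Returns the constant when expr is a Constant and, if expected is non-null,
    // its tag equals expected. A null expected tag accepts any type.
    static Constant getConstant(Object expr, TypeTag expected) {
        if (!(expr instanceof Constant)) {
            return null;
        }
        Constant c = (Constant) expr;
        return (expected == null || c.tag == expected) ? c : null;
    }

    // Null-safe type lookup: empty when expr is not a constant at all.
    static Optional<TypeTag> getConstantType(Object expr) {
        Constant c = getConstant(expr, null);
        return c == null ? Optional.empty() : Optional.of(c.tag);
    }
}
```

A caller can then branch on `getConstantType(expr).isPresent()` instead of relying on the expression having been checked beforehand.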
http://git-wip-us.apache.org/repos/asf/asterixdb/blob/44cef249/asterixdb/asterix-om/src/main/java/org/apache/asterix/om/util/NonTaggedFormatUtil.java
----------------------------------------------------------------------
diff --git a/asterixdb/asterix-om/src/main/java/org/apache/asterix/om/util/NonTaggedFormatUtil.java b/asterixdb/asterix-om/src/main/java/org/apache/asterix/om/util/NonTaggedFormatUtil.java
index f46e7da..0608b79 100644
--- a/asterixdb/asterix-om/src/main/java/org/apache/asterix/om/util/NonTaggedFormatUtil.java
+++ b/asterixdb/asterix-om/src/main/java/org/apache/asterix/om/util/NonTaggedFormatUtil.java
@@ -232,7 +232,7 @@ public final class NonTaggedFormatUtil {
switch (indexType) {
case SINGLE_PARTITION_WORD_INVIX:
case LENGTH_PARTITIONED_WORD_INVIX: {
- return BinaryTokenizerFactoryProvider.INSTANCE.getWordTokenizerFactory(keyType, false);
+ return BinaryTokenizerFactoryProvider.INSTANCE.getWordTokenizerFactory(keyType, false, false);
}
case SINGLE_PARTITION_NGRAM_INVIX:
case LENGTH_PARTITIONED_NGRAM_INVIX: {
http://git-wip-us.apache.org/repos/asf/asterixdb/blob/44cef249/asterixdb/asterix-runtime/src/main/java/org/apache/asterix/runtime/evaluators/common/FullTextContainsEvaluator.java
----------------------------------------------------------------------
diff --git a/asterixdb/asterix-runtime/src/main/java/org/apache/asterix/runtime/evaluators/common/FullTextContainsEvaluator.java b/asterixdb/asterix-runtime/src/main/java/org/apache/asterix/runtime/evaluators/common/FullTextContainsEvaluator.java
new file mode 100644
index 0000000..471b209
--- /dev/null
+++ b/asterixdb/asterix-runtime/src/main/java/org/apache/asterix/runtime/evaluators/common/FullTextContainsEvaluator.java
@@ -0,0 +1,399 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.asterix.runtime.evaluators.common;
+
+import java.io.DataOutput;
+import java.util.Arrays;
+
+import org.apache.asterix.formats.nontagged.BinaryComparatorFactoryProvider;
+import org.apache.asterix.formats.nontagged.BinaryTokenizerFactoryProvider;
+import org.apache.asterix.formats.nontagged.SerializerDeserializerProvider;
+import org.apache.asterix.om.base.ABoolean;
+import org.apache.asterix.om.base.ANull;
+import org.apache.asterix.om.types.ATypeTag;
+import org.apache.asterix.om.types.BuiltinType;
+import org.apache.asterix.om.types.EnumDeserializer;
+import org.apache.asterix.om.types.hierachy.ATypeHierarchy;
+import org.apache.asterix.runtime.evaluators.functions.FullTextContainsDescriptor;
+import org.apache.hyracks.algebricks.runtime.base.IScalarEvaluator;
+import org.apache.hyracks.algebricks.runtime.base.IScalarEvaluatorFactory;
+import org.apache.hyracks.api.context.IHyracksTaskContext;
+import org.apache.hyracks.api.dataflow.value.IBinaryComparator;
+import org.apache.hyracks.api.dataflow.value.IBinaryHashFunction;
+import org.apache.hyracks.api.dataflow.value.ISerializerDeserializer;
+import org.apache.hyracks.api.exceptions.HyracksDataException;
+import org.apache.hyracks.data.std.accessors.PointableBinaryHashFunctionFactory;
+import org.apache.hyracks.data.std.api.IPointable;
+import org.apache.hyracks.data.std.primitive.TaggedValuePointable;
+import org.apache.hyracks.data.std.primitive.UTF8StringLowercaseTokenPointable;
+import org.apache.hyracks.data.std.primitive.VoidPointable;
+import org.apache.hyracks.data.std.util.ArrayBackedValueStorage;
+import org.apache.hyracks.data.std.util.BinaryEntry;
+import org.apache.hyracks.data.std.util.BinaryHashSet;
+import org.apache.hyracks.dataflow.common.data.accessors.IFrameTupleReference;
+import org.apache.hyracks.storage.am.lsm.invertedindex.tokenizers.DelimitedUTF8StringBinaryTokenizer;
+import org.apache.hyracks.storage.am.lsm.invertedindex.tokenizers.IBinaryTokenizer;
+import org.apache.hyracks.util.string.UTF8StringUtil;
+
+public class FullTextContainsEvaluator implements IScalarEvaluator {
+
+ // assuming type indicator in serde format
+ protected static final int TYPE_INDICATOR_SIZE = 1;
+
+ protected final ArrayBackedValueStorage resultStorage = new ArrayBackedValueStorage();
+ protected final DataOutput out = resultStorage.getDataOutput();
+ protected final TaggedValuePointable argLeft = (TaggedValuePointable) TaggedValuePointable.FACTORY
+ .createPointable();
+ protected final TaggedValuePointable argRight = (TaggedValuePointable) TaggedValuePointable.FACTORY
+ .createPointable();
+ protected TaggedValuePointable[] argOptions;
+ protected final IScalarEvaluator evalLeft;
+ protected final IScalarEvaluator evalRight;
+ protected IScalarEvaluator[] evalOptions;
+ protected IPointable outLeft = VoidPointable.FACTORY.createPointable();
+ protected IPointable outRight = VoidPointable.FACTORY.createPointable();
+ protected IPointable[] outOptions;
+ protected int optionArgsLength;
+
+ // To conduct a full-text search, we convert all strings to the lower case.
+ // In addition, since each token does not include the length information (2 bytes) in the beginning,
+ // We need to have a different binary comparator that is different from a standard string comparator.
+ // i.e. A token comparator that receives the length of a token as a parameter.
+ private final IBinaryComparator strLowerCaseTokenCmp =
+ BinaryComparatorFactoryProvider.UTF8STRING_LOWERCASE_TOKEN_POINTABLE_INSTANCE.createBinaryComparator();
+ private final IBinaryComparator strLowerCaseCmp =
+ BinaryComparatorFactoryProvider.UTF8STRING_LOWERCASE_POINTABLE_INSTANCE.createBinaryComparator();
+ private IBinaryTokenizer tokenizerForLeftArray = null;
+ private IBinaryTokenizer tokenizerForRightArray = null;
+
+ // Case insensitive hash for full-text search
+ private IBinaryHashFunction hashFunc = null;
+
+ // keyEntry used in the hash-set
+ private BinaryEntry keyEntry = null;
+
+ // Parameter: number of bucket, frame size, hashFunction, Comparator, byte
+ // array that contains the key
+ private BinaryHashSet rightHashSet = null;
+
+ // Checks whether the query array has been changed
+ private byte[] queryArray = null;
+
+ // If the following is 1, then we will do a disjunctive search.
+ // Else if it is equal to the number of tokens, then we will do a conjunctive search.
+ private int occurrenceThreshold = 1;
+
+ static final int HASH_SET_SLOT_SIZE = 101;
+ static final int HASH_SET_FRAME_SIZE = 32768;
+
+ @SuppressWarnings("unchecked")
+ protected ISerializerDeserializer<ABoolean> serde =
+ SerializerDeserializerProvider.INSTANCE.getSerializerDeserializer(BuiltinType.ABOOLEAN);
+ @SuppressWarnings("unchecked")
+ protected ISerializerDeserializer<ANull> nullSerde =
+ SerializerDeserializerProvider.INSTANCE.getSerializerDeserializer(BuiltinType.ANULL);
+
+ public FullTextContainsEvaluator(IScalarEvaluatorFactory[] args, IHyracksTaskContext context)
+ throws HyracksDataException {
+ evalLeft = args[0].createScalarEvaluator(context);
+ evalRight = args[1].createScalarEvaluator(context);
+ optionArgsLength = args.length - 2;
+ this.evalOptions = new IScalarEvaluator[optionArgsLength];
+ this.outOptions = new IPointable[optionArgsLength];
+ this.argOptions = new TaggedValuePointable[optionArgsLength];
+ // Full-text search options
+ for (int i = 0; i < optionArgsLength; i++) {
+ this.evalOptions[i] = args[i + 2].createScalarEvaluator(context);
+ this.outOptions[i] = VoidPointable.FACTORY.createPointable();
+ this.argOptions[i] = (TaggedValuePointable) TaggedValuePointable.FACTORY.createPointable();
+ }
+ }
+
+ @Override
+ public void evaluate(IFrameTupleReference tuple, IPointable result) throws HyracksDataException {
+ resultStorage.reset();
+
+ evalLeft.evaluate(tuple, argLeft);
+ argLeft.getValue(outLeft);
+ evalRight.evaluate(tuple, argRight);
+ argRight.getValue(outRight);
+
+ for (int i = 0; i < optionArgsLength; i++) {
+ evalOptions[i].evaluate(tuple, argOptions[i]);
+ argOptions[i].getValue(outOptions[i]);
+ }
+
+ ATypeTag typeTag1 = EnumDeserializer.ATYPETAGDESERIALIZER.deserialize(argLeft.getTag());
+ ATypeTag typeTag2 = EnumDeserializer.ATYPETAGDESERIALIZER.deserialize(argRight.getTag());
+
+ // Checks whether two appropriate types are provided or not. If not, null will be written.
+ if (!checkArgTypes(typeTag1, typeTag2)) {
+ try {
+ nullSerde.serialize(ANull.NULL, out);
+ } catch (HyracksDataException e) {
+ throw new HyracksDataException(e);
+ }
+ result.set(resultStorage);
+ return;
+ }
+
+ try {
+ ABoolean b = fullTextContainsWithArg(typeTag2, argLeft, argRight) ? ABoolean.TRUE : ABoolean.FALSE;
+ serde.serialize(b, out);
+ } catch (HyracksDataException e1) {
+ throw new HyracksDataException(e1);
+ }
+ result.set(resultStorage);
+ }
+
+ /**
+ * Conducts a full-text search. The basic logic is as follows.
+ * 1) Tokenizes the given query predicate(s). Puts them into a hash set.
+ * 2) Tokenizes the given field. For each token, checks whether the hash set contains it.
+ * If so, increase foundCount for a newly found token.
+ * 3) As soon as the foundCount becomes the given threshold, stops the search and returns true.
+ * After traversing all tokens and still the foundCount is less than the given threshold, then returns false.
+ */
+ private boolean fullTextContainsWithArg(ATypeTag typeTag2, IPointable arg1, IPointable arg2)
+ throws HyracksDataException {
+ // Since a fulltext search form is "X contains text Y",
+ // X (document) is the left side and Y (query predicate) is the right side.
+
+ // Initialize variables that are required to conduct full-text search. (e.g., hash-set, tokenizer ...)
+ initializeFullTextContains(typeTag2);
+
+ // Type tag checking is already done in the previous steps.
+ // So we directly conduct the full-text search process.
+ // The right side contains the query predicates
+ byte[] arg2Array = arg2.getByteArray();
+
+ // Checks whether a new query predicate is introduced.
+ // If not, we can re-use the query predicate array we have already created.
+ if (!Arrays.equals(queryArray, arg2Array)) {
+ resetQueryArrayAndRight(arg2Array, typeTag2, arg2);
+ } else {
+ // The query predicate remains the same. However, the count of each token should be reset to zero.
+ // Here, we visit all elements to clear the count.
+ rightHashSet.clearFoundCount();
+ }
+
+ return readLeftAndConductSearch(arg1);
+ }
+
+ private void initializeFullTextContains(ATypeTag predicateTypeTag) {
+ // We use a hash set to store tokens from the right side (query predicate).
+ // Initialize necessary variables.
+ if (rightHashSet == null) {
+ hashFunc = new PointableBinaryHashFunctionFactory(UTF8StringLowercaseTokenPointable.FACTORY)
+ .createBinaryHashFunction();
+ keyEntry = new BinaryEntry();
+ // Parameter: number of bucket, frame size, hashFunction, Comparator, byte
+ // array that contains the key (this array will be set later.)
+ rightHashSet = new BinaryHashSet(HASH_SET_SLOT_SIZE, HASH_SET_FRAME_SIZE, hashFunc, strLowerCaseTokenCmp,
+ null);
+ tokenizerForLeftArray = BinaryTokenizerFactoryProvider.INSTANCE
+ .getWordTokenizerFactory(ATypeTag.STRING, false, true).createTokenizer();
+ }
+
+ // If the right side is an (un)ordered list, we need to apply the (un)ordered list tokenizer.
+ switch (predicateTypeTag) {
+ case ORDEREDLIST:
+ tokenizerForRightArray = BinaryTokenizerFactoryProvider.INSTANCE
+ .getWordTokenizerFactory(ATypeTag.ORDEREDLIST, false, true).createTokenizer();
+ break;
+ case UNORDEREDLIST:
+ tokenizerForRightArray = BinaryTokenizerFactoryProvider.INSTANCE
+ .getWordTokenizerFactory(ATypeTag.UNORDEREDLIST, false, true).createTokenizer();
+ break;
+ case STRING:
+ tokenizerForRightArray = BinaryTokenizerFactoryProvider.INSTANCE
+ .getWordTokenizerFactory(ATypeTag.STRING, false, true).createTokenizer();
+ break;
+ default:
+ break;
+ }
+ }
+
+ void resetQueryArrayAndRight(byte[] arg2Array, ATypeTag typeTag2, IPointable arg2) throws HyracksDataException {
+ queryArray = new byte[arg2Array.length];
+ System.arraycopy(arg2Array, 0, queryArray, 0, arg2Array.length);
+
+ // Clear hash set for the search predicates.
+ rightHashSet.clear();
+ rightHashSet.setRefArray(queryArray);
+
+ // Token count in this query
+ int queryTokenCount = 0;
+ int uniqueQueryTokenCount = 0;
+
+ int startOffset = arg2.getStartOffset();
+ int length = arg2.getLength();
+
+ // Reset the tokenizer for the given keywords in the given query
+ tokenizerForRightArray.reset(queryArray, startOffset, length);
+
+ // Create tokens from the given query predicate
+ while (tokenizerForRightArray.hasNext()) {
+ tokenizerForRightArray.next();
+ queryTokenCount++;
+
+ // Insert the starting position and the length of the current token into the hash set.
+ // We don't store the actual value of this token since we can access it via offset and length.
+ int tokenOffset = tokenizerForRightArray.getToken().getStartOffset();
+ int tokenLength = tokenizerForRightArray.getToken().getTokenLength();
+ int numBytesToStoreLength;
+
+ // If a token comes from a string tokenizer, each token doesn't have the length data
+ // in the beginning. Instead, if a token comes from an (un)ordered list, each token has
+ // the length data in the beginning. Since KeyEntry keeps the length data
+ // as a parameter, we need to adjust token offset and length in this case.
+ // e.g., 8database <--- we only need to store the offset of 'd' and length 8.
+ if (typeTag2 == ATypeTag.ORDEREDLIST || typeTag2 == ATypeTag.UNORDEREDLIST) {
+ // How many bytes are required to store the length of the given token?
+ numBytesToStoreLength = UTF8StringUtil.getNumBytesToStoreLength(
+ UTF8StringUtil.getUTFLength(tokenizerForRightArray.getToken().getData(),
+ tokenizerForRightArray.getToken().getStartOffset()));
+ tokenOffset = tokenOffset + numBytesToStoreLength;
+ tokenLength = tokenLength - numBytesToStoreLength;
+ }
+ keyEntry.set(tokenOffset, tokenLength);
+
+ // Check whether the given token is a phrase.
+ // Currently, for the full-text search, we don't support a phrase search yet.
+ // So, each query predicate should have only one token.
+ // The same logic should be applied in AbstractTOccurrenceSearcher() class.
+ checkWhetherFullTextPredicateIsPhrase(typeTag2, queryArray, tokenOffset, tokenLength, queryTokenCount);
+
+ // Count the number of tokens in the given query. We only count the unique tokens.
+ // We only care about the first insertion of the token into the hash set
+ // since we apply the set semantics.
+ // e.g., if a query predicate is ["database","system","database"],
+ // then "database" should be counted only once.
+ // Thus, when we find the current token (we don't increase the count in this case),
+ // it should not exist.
+ if (rightHashSet.find(keyEntry, queryArray, false) == -1) {
+ rightHashSet.put(keyEntry);
+ uniqueQueryTokenCount++;
+ }
+
+ }
+
+ // Apply the full-text search option here
+ // Based on the search mode option - "any" or "all", set the occurrence threshold of tokens.
+ setFullTextOption(argOptions, uniqueQueryTokenCount);
+ }
+
+ private void checkWhetherFullTextPredicateIsPhrase(ATypeTag typeTag, byte[] refArray, int tokenOffset,
+ int tokenLength, int queryTokenCount) throws HyracksDataException {
+ switch (typeTag) {
+ case STRING:
+ if (queryTokenCount > 1) {
+ throw new HyracksDataException(
+ "Phrase in Full-text search is not supported. An expression should include only one word.");
+ }
+ break;
+ case ORDEREDLIST:
+ case UNORDEREDLIST:
+ for (int j = 0; j < tokenLength; j++) {
+ if (DelimitedUTF8StringBinaryTokenizer.isSeparator((char) refArray[tokenOffset + j])) {
+ throw new HyracksDataException(
+ "Phrase in Full-text is not supported. An expression should include only one word."
+ + (char) refArray[tokenOffset + j] + " " + refArray[tokenOffset + j]);
+ }
+ }
+ break;
+ default:
+ throw new HyracksDataException("Full-text search can be only executed on STRING or (UN)ORDERED LIST.");
+ }
+ }
+
+ /**
+ * Sets the full-text options. Options come in pairs: argOptions[i] is an option name and argOptions[i + 1] is its argument.
+ */
+ private void setFullTextOption(IPointable[] argOptions, int uniqueQueryTokenCount) throws HyracksDataException {
+ for (int i = 0; i < optionArgsLength; i = i + 2) {
+ // mode option
+ if (compareStrInByteArrayAndPointable(FullTextContainsDescriptor.getSearchModeOptionArray(), argOptions[i],
+ true) == 0) {
+ if (compareStrInByteArrayAndPointable(FullTextContainsDescriptor.getDisjunctiveFTSearchOptionArray(),
+ argOptions[i + 1], true) == 0) {
+ // ANY
+ occurrenceThreshold = 1;
+ } else if (compareStrInByteArrayAndPointable(
+ FullTextContainsDescriptor.getConjunctiveFTSearchOptionArray(), argOptions[i + 1], true) == 0) {
+ // ALL
+ occurrenceThreshold = uniqueQueryTokenCount;
+ }
+ }
+ }
+ }
+
+ boolean readLeftAndConductSearch(IPointable arg1) throws HyracksDataException {
+ // Now, we traverse the left side (document field), tokenize the array, and check whether each token
+ // exists in the hash set. If it's the first time we find it, we increase foundCount.
+ // As soon as foundCount reaches occurrenceThreshold, we return true and stop.
+ int foundCount = 0;
+
+ // The left side: field (document)
+ // Reset the tokenizer for the given keywords in a document.
+ tokenizerForLeftArray.reset(arg1.getByteArray(), arg1.getStartOffset(), arg1.getLength());
+
+ // Create tokens from a field in the left side (document)
+ while (tokenizerForLeftArray.hasNext()) {
+ tokenizerForLeftArray.next();
+
+ // Record the starting position and the length of the current token.
+ keyEntry.set(tokenizerForLeftArray.getToken().getStartOffset(),
+ tokenizerForLeftArray.getToken().getTokenLength());
+
+ // Check whether this token exists in the query hash set.
+ // Multiple occurrences of a token are not counted:
+ // finding the same query token a second time does not increase foundCount.
+ if (rightHashSet.find(keyEntry, arg1.getByteArray(), true) == 1) {
+ foundCount++;
+ if (foundCount >= occurrenceThreshold) {
+ return true;
+ }
+ }
+ }
+
+ // Traversed all tokens, but the count never reached the threshold.
+ return false;
+ }
+
+ private int compareStrInByteArrayAndPointable(byte[] left, IPointable right, boolean rightTypeTagIncluded)
+ throws HyracksDataException {
+ int rightTypeTagLength = rightTypeTagIncluded ? 1 : 0;
+
+ return strLowerCaseCmp.compare(left, 0, left.length, right.getByteArray(),
+ right.getStartOffset() + rightTypeTagLength, right.getLength() - rightTypeTagLength);
+ }
+
+ /**
+ * Checks the argument types: argument 1 should be a string; argument 2 should be a string or an (un)ordered list.
+ */
+ protected boolean checkArgTypes(ATypeTag typeTag1, ATypeTag typeTag2) throws HyracksDataException {
+ if ((typeTag1 != ATypeTag.STRING) || (typeTag2 != ATypeTag.ORDEREDLIST && typeTag2 != ATypeTag.UNORDEREDLIST
+ && !ATypeHierarchy.isCompatible(typeTag1, typeTag2))) {
+ return false;
+ }
+ return true;
+ }
+
+}
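
Note on the evaluator above: the "any"/"all" matching it performs can be sketched with plain java.util collections. This is only an illustration under stated assumptions, not the committed code: the class and method names are hypothetical, whitespace splitting stands in for the delimited tokenizer, String.toLowerCase stands in for the lowercase binary comparator, and rejecting an unknown mode is a choice made for the sketch (the evaluator simply leaves the threshold unchanged in that case).

```java
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch; not part of the commit.
public class FullTextModeSketch {

    // Returns true if the document satisfies the query under the given mode.
    static boolean ftContains(String document, List<String> queryTokens, String mode) {
        // Build phase: collect the unique, lowercased query tokens (set semantics).
        Set<String> query = new HashSet<>();
        for (String token : queryTokens) {
            query.add(token.toLowerCase(Locale.ROOT));
        }

        // "any" needs one distinct match; "all" needs every unique query token.
        int threshold;
        if ("any".equalsIgnoreCase(mode)) {
            threshold = 1;
        } else if ("all".equalsIgnoreCase(mode)) {
            threshold = query.size();
        } else {
            throw new IllegalArgumentException("Unknown mode: " + mode);
        }

        // Probe phase: a query token counts only the first time it is found.
        Set<String> found = new HashSet<>();
        for (String token : document.split("\\s+")) {
            String key = token.toLowerCase(Locale.ROOT);
            if (query.contains(key) && found.add(key) && found.size() >= threshold) {
                return true;
            }
        }
        return false;
    }
}
```

With the query ["database", "system", "database"], the threshold under "all" is 2, because "database" is counted once.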
http://git-wip-us.apache.org/repos/asf/asterixdb/blob/44cef249/asterixdb/asterix-runtime/src/main/java/org/apache/asterix/runtime/evaluators/common/SimilarityJaccardCheckEvaluator.java
----------------------------------------------------------------------
diff --git a/asterixdb/asterix-runtime/src/main/java/org/apache/asterix/runtime/evaluators/common/SimilarityJaccardCheckEvaluator.java b/asterixdb/asterix-runtime/src/main/java/org/apache/asterix/runtime/evaluators/common/SimilarityJaccardCheckEvaluator.java
index 7c1ef63..4f7a30f 100644
--- a/asterixdb/asterix-runtime/src/main/java/org/apache/asterix/runtime/evaluators/common/SimilarityJaccardCheckEvaluator.java
+++ b/asterixdb/asterix-runtime/src/main/java/org/apache/asterix/runtime/evaluators/common/SimilarityJaccardCheckEvaluator.java
@@ -27,7 +27,6 @@ import org.apache.asterix.om.base.ABoolean;
import org.apache.asterix.om.types.AOrderedListType;
import org.apache.asterix.om.types.BuiltinType;
import org.apache.asterix.om.types.EnumDeserializer;
-import org.apache.asterix.runtime.evaluators.functions.BinaryHashMap.BinaryEntry;
import org.apache.hyracks.algebricks.runtime.base.IScalarEvaluator;
import org.apache.hyracks.algebricks.runtime.base.IScalarEvaluatorFactory;
import org.apache.hyracks.api.context.IHyracksTaskContext;
@@ -37,6 +36,7 @@ import org.apache.hyracks.data.std.api.IPointable;
import org.apache.hyracks.data.std.primitive.IntegerPointable;
import org.apache.hyracks.data.std.primitive.VoidPointable;
import org.apache.hyracks.data.std.util.ArrayBackedValueStorage;
+import org.apache.hyracks.data.std.util.BinaryEntry;
import org.apache.hyracks.dataflow.common.data.accessors.IFrameTupleReference;
public class SimilarityJaccardCheckEvaluator extends SimilarityJaccardEvaluator {
@@ -120,18 +120,18 @@ public class SimilarityJaccardCheckEvaluator extends SimilarityJaccardEvaluator
BinaryEntry entry = hashMap.get(keyEntry);
if (entry != null) {
// Increment second value.
- int firstValInt = IntegerPointable.getInteger(entry.buf, entry.off);
+ int firstValInt = IntegerPointable.getInteger(entry.getBuf(), entry.getOffset());
// Irrelevant for the intersection size.
if (firstValInt == 0) {
continue;
}
- int secondValInt = IntegerPointable.getInteger(entry.buf, entry.off + 4);
+ int secondValInt = IntegerPointable.getInteger(entry.getBuf(), entry.getOffset() + 4);
// Subtract old min value.
intersectionSize -= (firstValInt < secondValInt) ? firstValInt : secondValInt;
secondValInt++;
// Add new min value.
intersectionSize += (firstValInt < secondValInt) ? firstValInt : secondValInt;
- IntegerPointable.setInteger(entry.buf, entry.off + 4, secondValInt);
+ IntegerPointable.setInteger(entry.getBuf(), entry.getOffset() + 4, secondValInt);
} else {
// Could not find element in other set. Increase min union size by 1.
minUnionSize++;
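
Note on the hunk above: the "subtract old min value, add new min value" update keeps a running multiset intersection size while the second list is probed. A minimal sketch of that bookkeeping follows, assuming string items and a java.util.HashMap in place of BinaryHashMap; the class and method names are hypothetical, and the min-union-size tracking of the check evaluator is left out.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not part of the commit.
public class MultisetIntersectionSketch {

    static int intersectionSize(List<String> build, List<String> probe) {
        // Each value is a pair of counters: [occurrences in build, occurrences in probe].
        Map<String, int[]> counts = new HashMap<>();
        for (String item : build) {
            counts.computeIfAbsent(item, k -> new int[2])[0]++;
        }

        int intersectionSize = 0;
        for (String item : probe) {
            int[] pair = counts.get(item);
            if (pair == null) {
                continue; // Not in the build list: irrelevant for the intersection.
            }
            // Subtract the old min value, bump the probe counter, add the new min value.
            intersectionSize -= Math.min(pair[0], pair[1]);
            pair[1]++;
            intersectionSize += Math.min(pair[0], pair[1]);
        }
        return intersectionSize;
    }
}
```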
http://git-wip-us.apache.org/repos/asf/asterixdb/blob/44cef249/asterixdb/asterix-runtime/src/main/java/org/apache/asterix/runtime/evaluators/common/SimilarityJaccardEvaluator.java
----------------------------------------------------------------------
diff --git a/asterixdb/asterix-runtime/src/main/java/org/apache/asterix/runtime/evaluators/common/SimilarityJaccardEvaluator.java b/asterixdb/asterix-runtime/src/main/java/org/apache/asterix/runtime/evaluators/common/SimilarityJaccardEvaluator.java
index f08073c..2bad468 100644
--- a/asterixdb/asterix-runtime/src/main/java/org/apache/asterix/runtime/evaluators/common/SimilarityJaccardEvaluator.java
+++ b/asterixdb/asterix-runtime/src/main/java/org/apache/asterix/runtime/evaluators/common/SimilarityJaccardEvaluator.java
@@ -27,13 +27,12 @@ import org.apache.asterix.dataflow.data.nontagged.hash.ListItemBinaryHashFunctio
import org.apache.asterix.formats.nontagged.SerializerDeserializerProvider;
import org.apache.asterix.om.base.AFloat;
import org.apache.asterix.om.base.AMutableFloat;
-import org.apache.asterix.runtime.exceptions.TypeMismatchException;
import org.apache.asterix.om.functions.AsterixBuiltinFunctions;
import org.apache.asterix.om.types.ATypeTag;
import org.apache.asterix.om.types.BuiltinType;
import org.apache.asterix.om.types.EnumDeserializer;
import org.apache.asterix.runtime.evaluators.functions.BinaryHashMap;
-import org.apache.asterix.runtime.evaluators.functions.BinaryHashMap.BinaryEntry;
+import org.apache.asterix.runtime.exceptions.TypeMismatchException;
import org.apache.hyracks.algebricks.runtime.base.IScalarEvaluator;
import org.apache.hyracks.algebricks.runtime.base.IScalarEvaluatorFactory;
import org.apache.hyracks.api.context.IHyracksTaskContext;
@@ -45,6 +44,7 @@ import org.apache.hyracks.data.std.api.IPointable;
import org.apache.hyracks.data.std.primitive.IntegerPointable;
import org.apache.hyracks.data.std.primitive.VoidPointable;
import org.apache.hyracks.data.std.util.ArrayBackedValueStorage;
+import org.apache.hyracks.data.std.util.BinaryEntry;
import org.apache.hyracks.dataflow.common.data.accessors.IFrameTupleReference;
public class SimilarityJaccardEvaluator implements IScalarEvaluator {
@@ -171,7 +171,7 @@ public class SimilarityJaccardEvaluator implements IScalarEvaluator {
protected void buildHashMap(AbstractAsterixListIterator buildIter) throws HyracksDataException {
// Build phase: Add items into hash map, starting with first list.
// Value in map is a pair of integers. Set first integer to 1.
- IntegerPointable.setInteger(valEntry.buf, 0, 1);
+ IntegerPointable.setInteger(valEntry.getBuf(), 0, 1);
while (buildIter.hasNext()) {
byte[] buf = buildIter.getData();
int off = buildIter.getPos();
@@ -180,8 +180,8 @@ public class SimilarityJaccardEvaluator implements IScalarEvaluator {
BinaryEntry entry = hashMap.put(keyEntry, valEntry);
if (entry != null) {
// Increment value.
- int firstValInt = IntegerPointable.getInteger(entry.buf, entry.off);
- IntegerPointable.setInteger(entry.buf, entry.off, firstValInt + 1);
+ int firstValInt = IntegerPointable.getInteger(entry.getBuf(), entry.getOffset());
+ IntegerPointable.setInteger(entry.getBuf(), entry.getOffset(), firstValInt + 1);
}
buildIter.next();
}
@@ -199,18 +199,18 @@ public class SimilarityJaccardEvaluator implements IScalarEvaluator {
BinaryEntry entry = hashMap.get(keyEntry);
if (entry != null) {
// Increment second value.
- int firstValInt = IntegerPointable.getInteger(entry.buf, entry.off);
+ int firstValInt = IntegerPointable.getInteger(entry.getBuf(), entry.getOffset());
// Irrelevant for the intersection size.
if (firstValInt == 0) {
continue;
}
- int secondValInt = IntegerPointable.getInteger(entry.buf, entry.off + 4);
+ int secondValInt = IntegerPointable.getInteger(entry.getBuf(), entry.getOffset() + 4);
// Subtract old min value.
intersectionSize -= (firstValInt < secondValInt) ? firstValInt : secondValInt;
secondValInt++;
// Add new min value.
intersectionSize += (firstValInt < secondValInt) ? firstValInt : secondValInt;
- IntegerPointable.setInteger(entry.buf, entry.off + 4, secondValInt);
+ IntegerPointable.setInteger(entry.getBuf(), entry.getOffset() + 4, secondValInt);
}
probeIter.next();
}
http://git-wip-us.apache.org/repos/asf/asterixdb/blob/44cef249/asterixdb/asterix-runtime/src/main/java/org/apache/asterix/runtime/evaluators/functions/BinaryHashMap.java
----------------------------------------------------------------------
diff --git a/asterixdb/asterix-runtime/src/main/java/org/apache/asterix/runtime/evaluators/functions/BinaryHashMap.java b/asterixdb/asterix-runtime/src/main/java/org/apache/asterix/runtime/evaluators/functions/BinaryHashMap.java
index d89a63e..2864473 100644
--- a/asterixdb/asterix-runtime/src/main/java/org/apache/asterix/runtime/evaluators/functions/BinaryHashMap.java
+++ b/asterixdb/asterix-runtime/src/main/java/org/apache/asterix/runtime/evaluators/functions/BinaryHashMap.java
@@ -18,9 +18,6 @@
*/
package org.apache.asterix.runtime.evaluators.functions;
-import java.io.ByteArrayInputStream;
-import java.io.DataInput;
-import java.io.DataInputStream;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
@@ -30,8 +27,8 @@ import java.util.List;
import org.apache.hyracks.algebricks.common.utils.Pair;
import org.apache.hyracks.api.dataflow.value.IBinaryComparator;
import org.apache.hyracks.api.dataflow.value.IBinaryHashFunction;
-import org.apache.hyracks.api.dataflow.value.ISerializerDeserializer;
import org.apache.hyracks.api.exceptions.HyracksDataException;
+import org.apache.hyracks.data.std.util.BinaryEntry;
/**
* The most simple implementation of a static hashtable you could imagine.
@@ -60,26 +57,6 @@ public class BinaryHashMap {
private int nextOff;
private int size;
- // Can be used for key or value.
- public static class BinaryEntry {
- public byte[] buf;
- public int off;
- public int len;
-
- public void set(byte[] buf, int off, int len) {
- this.buf = buf;
- this.off = off;
- this.len = len;
- }
-
- // Inefficient. Just for debugging.
- @SuppressWarnings("rawtypes")
- public String print(ISerializerDeserializer serde) throws HyracksDataException {
- ByteArrayInputStream inStream = new ByteArrayInputStream(buf, off, len);
- DataInput dataIn = new DataInputStream(inStream);
- return serde.deserialize(dataIn).toString();
- }
- }
public BinaryHashMap(int tableSize, int frameSize, IBinaryHashFunction putHashFunc,
IBinaryHashFunction getHashFunc, IBinaryComparator cmp) {
@@ -119,9 +96,9 @@ public class BinaryHashMap {
private BinaryEntry getPutInternal(BinaryEntry key, BinaryEntry value, boolean put) throws HyracksDataException {
int bucket;
if (put) {
- bucket = Math.abs(putHashFunc.hash(key.buf, key.off, key.len) % listHeads.length);
+ bucket = Math.abs(putHashFunc.hash(key.getBuf(), key.getOffset(), key.getLength()) % listHeads.length);
} else {
- bucket = Math.abs(getHashFunc.hash(key.buf, key.off, key.len) % listHeads.length);
+ bucket = Math.abs(getHashFunc.hash(key.getBuf(), key.getOffset(), key.getLength()) % listHeads.length);
}
long headPtr = listHeads[bucket];
if (headPtr == NULL_PTR) {
@@ -140,7 +117,8 @@ public class BinaryHashMap {
frame = frames.get(frameIndex);
int entryKeyOff = frameOff + ENTRY_HEADER_SIZE;
int entryKeyLen = frame.getShort(frameOff);
- if (cmp.compare(frame.array(), entryKeyOff, entryKeyLen, key.buf, key.off, key.len) == 0) {
+ if (cmp.compare(frame.array(), entryKeyOff, entryKeyLen, key.getBuf(), key.getOffset(),
+ key.getLength()) == 0) {
// Key found, set values and return.
int entryValOff = frameOff + ENTRY_HEADER_SIZE + entryKeyLen;
int entryValLen = frame.getShort(frameOff + SLOT_SIZE);
@@ -160,7 +138,7 @@ public class BinaryHashMap {
public long appendEntry(BinaryEntry key, BinaryEntry value) {
ByteBuffer frame = frames.get(currFrameIndex);
- int requiredSpace = key.len + value.len + ENTRY_HEADER_SIZE;
+ int requiredSpace = key.getLength() + value.getLength() + ENTRY_HEADER_SIZE;
if (nextOff + requiredSpace >= frameSize) {
// Entry doesn't fit on frame, allocate a new one.
if (requiredSpace > frameSize) {
@@ -171,9 +149,10 @@ public class BinaryHashMap {
nextOff = 0;
frame = frames.get(currFrameIndex);
}
- writeEntryHeader(frame, nextOff, key.len, value.len, NULL_PTR);
- System.arraycopy(key.buf, key.off, frame.array(), nextOff + ENTRY_HEADER_SIZE, key.len);
- System.arraycopy(value.buf, value.off, frame.array(), nextOff + ENTRY_HEADER_SIZE + key.len, value.len);
+ writeEntryHeader(frame, nextOff, key.getLength(), value.getLength(), NULL_PTR);
+ System.arraycopy(key.getBuf(), key.getOffset(), frame.array(), nextOff + ENTRY_HEADER_SIZE, key.getLength());
+ System.arraycopy(value.getBuf(), value.getOffset(), frame.array(),
+ nextOff + ENTRY_HEADER_SIZE + key.getLength(), value.getLength());
long entryPtr = getEntryPtr(currFrameIndex, nextOff);
nextOff += requiredSpace;
size++;
http://git-wip-us.apache.org/repos/asf/asterixdb/blob/44cef249/asterixdb/asterix-runtime/src/main/java/org/apache/asterix/runtime/evaluators/functions/FullTextContainsDescriptor.java
----------------------------------------------------------------------
diff --git a/asterixdb/asterix-runtime/src/main/java/org/apache/asterix/runtime/evaluators/functions/FullTextContainsDescriptor.java b/asterixdb/asterix-runtime/src/main/java/org/apache/asterix/runtime/evaluators/functions/FullTextContainsDescriptor.java
new file mode 100644
index 0000000..082e0cf
--- /dev/null
+++ b/asterixdb/asterix-runtime/src/main/java/org/apache/asterix/runtime/evaluators/functions/FullTextContainsDescriptor.java
@@ -0,0 +1,105 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.asterix.runtime.evaluators.functions;
+
+import java.util.LinkedHashMap;
+import java.util.Map;
+
+import org.apache.asterix.om.functions.AsterixBuiltinFunctions;
+import org.apache.asterix.om.functions.IFunctionDescriptor;
+import org.apache.asterix.om.functions.IFunctionDescriptorFactory;
+import org.apache.asterix.om.types.ATypeTag;
+import org.apache.asterix.runtime.evaluators.base.AbstractScalarFunctionDynamicDescriptor;
+import org.apache.asterix.runtime.evaluators.common.FullTextContainsEvaluator;
+import org.apache.hyracks.algebricks.common.exceptions.AlgebricksException;
+import org.apache.hyracks.algebricks.core.algebra.functions.FunctionIdentifier;
+import org.apache.hyracks.algebricks.runtime.base.IScalarEvaluator;
+import org.apache.hyracks.algebricks.runtime.base.IScalarEvaluatorFactory;
+import org.apache.hyracks.api.context.IHyracksTaskContext;
+import org.apache.hyracks.api.exceptions.HyracksDataException;
+import org.apache.hyracks.util.string.UTF8StringUtil;
+
+public class FullTextContainsDescriptor extends AbstractScalarFunctionDynamicDescriptor {
+ private static final long serialVersionUID = 1L;
+
+ // Parameter names and their types. Parameters will be re-arranged based on their order in this map.
+ private static final Map<String, ATypeTag> paramTypeMap = new LinkedHashMap<>();
+
+ public static final String SEARCH_MODE_OPTION = "mode";
+ public static final String DISJUNCTIVE_SEARCH_MODE_OPTION = "any";
+ public static final String CONJUNCTIVE_SEARCH_MODE_OPTION = "all";
+
+ private static final byte[] SEARCH_MODE_OPTION_ARRAY = UTF8StringUtil.writeStringToBytes(SEARCH_MODE_OPTION);
+ private static final byte[] DISJUNCTIVE_SEARCH_MODE_OPTION_ARRAY = UTF8StringUtil
+ .writeStringToBytes(DISJUNCTIVE_SEARCH_MODE_OPTION);
+ private static final byte[] CONJUNCTIVE_SEARCH_MODE_OPTION_ARRAY = UTF8StringUtil
+ .writeStringToBytes(CONJUNCTIVE_SEARCH_MODE_OPTION);
+
+ static {
+ paramTypeMap.put(SEARCH_MODE_OPTION, ATypeTag.STRING);
+ }
+
+ public static final IFunctionDescriptorFactory FACTORY = new IFunctionDescriptorFactory() {
+ @Override
+ public IFunctionDescriptor createFunctionDescriptor() {
+ return new FullTextContainsDescriptor();
+ }
+ };
+
+ /**
+ * Creates the full-text search evaluator. The arguments are:
+ * arg0: Expression1 - search field
+ * arg1: Expression2 - search predicate
+ * arg2 and so on: Full-text search option
+ */
+ @Override
+ public IScalarEvaluatorFactory createEvaluatorFactory(final IScalarEvaluatorFactory[] args)
+ throws AlgebricksException {
+ return new IScalarEvaluatorFactory() {
+ private static final long serialVersionUID = 1L;
+
+ @Override
+ public IScalarEvaluator createScalarEvaluator(IHyracksTaskContext ctx) throws HyracksDataException {
+ return new FullTextContainsEvaluator(args, ctx);
+ }
+ };
+ }
+
+ @Override
+ public FunctionIdentifier getIdentifier() {
+ return AsterixBuiltinFunctions.FULLTEXT_CONTAINS;
+ }
+
+ public static byte[] getSearchModeOptionArray() {
+ return SEARCH_MODE_OPTION_ARRAY;
+ }
+
+ public static byte[] getDisjunctiveFTSearchOptionArray() {
+ return DISJUNCTIVE_SEARCH_MODE_OPTION_ARRAY;
+ }
+
+ public static byte[] getConjunctiveFTSearchOptionArray() {
+ return CONJUNCTIVE_SEARCH_MODE_OPTION_ARRAY;
+ }
+
+ public static Map<String, ATypeTag> getParamTypeMap() {
+ return paramTypeMap;
+ }
+}
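
Note on the descriptor above: the "mode", "any" and "all" constants are consumed by the evaluator as name/value pairs. A minimal sketch of that pairing follows, assuming the options arrive as a flat String array; the class name, method name and the fallback parameter are hypothetical and only make the "unrecognized value" case visible.

```java
// Hypothetical sketch; not part of the commit.
public class FullTextOptionSketch {

    // options[i] is an option name and options[i + 1] is its value.
    static int occurrenceThreshold(String[] options, int uniqueQueryTokenCount, int fallback) {
        int threshold = fallback;
        for (int i = 0; i + 1 < options.length; i += 2) {
            // Names and values are compared case-insensitively.
            if ("mode".equalsIgnoreCase(options[i])) {
                if ("any".equalsIgnoreCase(options[i + 1])) {
                    threshold = 1; // one matching token is enough
                } else if ("all".equalsIgnoreCase(options[i + 1])) {
                    threshold = uniqueQueryTokenCount; // every unique token must match
                }
            }
        }
        return threshold;
    }
}
```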
http://git-wip-us.apache.org/repos/asf/asterixdb/blob/44cef249/asterixdb/asterix-runtime/src/main/java/org/apache/asterix/runtime/evaluators/functions/records/RecordAddFieldsDescriptor.java
----------------------------------------------------------------------
diff --git a/asterixdb/asterix-runtime/src/main/java/org/apache/asterix/runtime/evaluators/functions/records/RecordAddFieldsDescriptor.java b/asterixdb/asterix-runtime/src/main/java/org/apache/asterix/runtime/evaluators/functions/records/RecordAddFieldsDescriptor.java
index e8a2c42..b8908dd 100644
--- a/asterixdb/asterix-runtime/src/main/java/org/apache/asterix/runtime/evaluators/functions/records/RecordAddFieldsDescriptor.java
+++ b/asterixdb/asterix-runtime/src/main/java/org/apache/asterix/runtime/evaluators/functions/records/RecordAddFieldsDescriptor.java
@@ -57,6 +57,7 @@ import org.apache.hyracks.api.exceptions.HyracksDataException;
import org.apache.hyracks.data.std.api.IPointable;
import org.apache.hyracks.data.std.primitive.VoidPointable;
import org.apache.hyracks.data.std.util.ArrayBackedValueStorage;
+import org.apache.hyracks.data.std.util.BinaryEntry;
import org.apache.hyracks.dataflow.common.data.accessors.IFrameTupleReference;
public class RecordAddFieldsDescriptor extends AbstractScalarFunctionDynamicDescriptor {
@@ -120,8 +121,8 @@ public class RecordAddFieldsDescriptor extends AbstractScalarFunctionDynamicDesc
.createBinaryHashFunction();
private final IBinaryHashFunction getHashFunc = ListItemBinaryHashFunctionFactory.INSTANCE
.createBinaryHashFunction();
- private final BinaryHashMap.BinaryEntry keyEntry = new BinaryHashMap.BinaryEntry();
- private final BinaryHashMap.BinaryEntry valEntry = new BinaryHashMap.BinaryEntry();
+ private final BinaryEntry keyEntry = new BinaryEntry();
+ private final BinaryEntry valEntry = new BinaryEntry();
private final IVisitablePointable tempValReference = allocator.allocateEmpty();
private final IBinaryComparator cmp = ListItemBinaryComparatorFactory.INSTANCE
.createBinaryComparator();
@@ -234,9 +235,9 @@ public class RecordAddFieldsDescriptor extends AbstractScalarFunctionDynamicDesc
keyEntry.set(namePointable.getByteArray(), namePointable.getStartOffset(),
namePointable.getLength());
// Check if already in our built record
- BinaryHashMap.BinaryEntry entry = hashMap.get(keyEntry);
+ BinaryEntry entry = hashMap.get(keyEntry);
if (entry != null) {
- tempValReference.set(entry.buf, entry.off, entry.len);
+ tempValReference.set(entry.getBuf(), entry.getOffset(), entry.getLength());
// If value is not equal throw conflicting duplicate field, otherwise ignore
if (!PointableHelper.byteArrayEqual(valuePointable, tempValReference)) {
throw new RuntimeDataException(ErrorCode.ERROR_DUPLICATE_FIELD_NAME,
http://git-wip-us.apache.org/repos/asf/asterixdb/blob/44cef249/asterixdb/asterix-runtime/src/main/java/org/apache/asterix/runtime/evaluators/visitors/DeepEqualityVisitorHelper.java
----------------------------------------------------------------------
diff --git a/asterixdb/asterix-runtime/src/main/java/org/apache/asterix/runtime/evaluators/visitors/DeepEqualityVisitorHelper.java b/asterixdb/asterix-runtime/src/main/java/org/apache/asterix/runtime/evaluators/visitors/DeepEqualityVisitorHelper.java
index 0e1f342..000425e 100644
--- a/asterixdb/asterix-runtime/src/main/java/org/apache/asterix/runtime/evaluators/visitors/DeepEqualityVisitorHelper.java
+++ b/asterixdb/asterix-runtime/src/main/java/org/apache/asterix/runtime/evaluators/visitors/DeepEqualityVisitorHelper.java
@@ -25,6 +25,7 @@ import org.apache.asterix.dataflow.data.nontagged.hash.ListItemBinaryHashFunctio
import org.apache.asterix.runtime.evaluators.functions.BinaryHashMap;
import org.apache.hyracks.api.dataflow.value.IBinaryComparator;
import org.apache.hyracks.api.dataflow.value.IBinaryHashFunction;
+import org.apache.hyracks.data.std.util.BinaryEntry;
public class DeepEqualityVisitorHelper {
// Default values
@@ -39,11 +40,11 @@ public class DeepEqualityVisitorHelper {
private IBinaryComparator cmp = listItemBinaryComparatorFactory.createBinaryComparator();
private BinaryHashMap hashMap = null;
- public BinaryHashMap initializeHashMap(BinaryHashMap.BinaryEntry valEntry) {
+ public BinaryHashMap initializeHashMap(BinaryEntry valEntry) {
return initializeHashMap(0, 0, valEntry);
}
- public BinaryHashMap initializeHashMap(int tableSize, int tableFrameSize, BinaryHashMap.BinaryEntry valEntry) {
+ public BinaryHashMap initializeHashMap(int tableSize, int tableFrameSize, BinaryEntry valEntry) {
if (tableFrameSize != 0 && tableSize != 0) {
hashMap = new BinaryHashMap(tableSize, tableFrameSize, putHashFunc, getHashFunc, cmp);
} else {
http://git-wip-us.apache.org/repos/asf/asterixdb/blob/44cef249/asterixdb/asterix-runtime/src/main/java/org/apache/asterix/runtime/evaluators/visitors/ListDeepEqualityChecker.java
----------------------------------------------------------------------
diff --git a/asterixdb/asterix-runtime/src/main/java/org/apache/asterix/runtime/evaluators/visitors/ListDeepEqualityChecker.java b/asterixdb/asterix-runtime/src/main/java/org/apache/asterix/runtime/evaluators/visitors/ListDeepEqualityChecker.java
index 6d5513d..df4847e 100644
--- a/asterixdb/asterix-runtime/src/main/java/org/apache/asterix/runtime/evaluators/visitors/ListDeepEqualityChecker.java
+++ b/asterixdb/asterix-runtime/src/main/java/org/apache/asterix/runtime/evaluators/visitors/ListDeepEqualityChecker.java
@@ -20,16 +20,17 @@ package org.apache.asterix.runtime.evaluators.visitors;
import java.io.IOException;
import java.util.List;
+
import org.apache.asterix.common.exceptions.AsterixException;
import org.apache.asterix.om.pointables.AListVisitablePointable;
import org.apache.asterix.om.pointables.base.IVisitablePointable;
import org.apache.asterix.om.types.ATypeTag;
import org.apache.asterix.runtime.evaluators.functions.BinaryHashMap;
-import org.apache.asterix.runtime.evaluators.functions.BinaryHashMap.BinaryEntry;
import org.apache.asterix.runtime.evaluators.functions.PointableHelper;
import org.apache.hyracks.algebricks.common.utils.Pair;
import org.apache.hyracks.api.exceptions.HyracksDataException;
import org.apache.hyracks.data.std.primitive.IntegerPointable;
+import org.apache.hyracks.data.std.util.BinaryEntry;
class ListDeepEqualityChecker {
private DeepEqualityVisitor visitor;
@@ -100,7 +101,7 @@ class ListDeepEqualityChecker {
int off = item.getStartOffset();
int len = item.getLength();
keyEntry.set(buf, off, len);
- IntegerPointable.setInteger(valEntry.buf, 0, i);
+ IntegerPointable.setInteger(valEntry.getBuf(), 0, i);
hashMap.put(keyEntry, valEntry);
}
@@ -125,7 +126,7 @@ class ListDeepEqualityChecker {
return false;
}
- int indexLeft = IntegerPointable.getInteger(entry.buf, entry.off);
+ int indexLeft = IntegerPointable.getInteger(entry.getBuf(), entry.getOffset());
ATypeTag fieldTypeLeft = PointableHelper.getTypeTag(itemTagTypesLeft.get(indexLeft));
if(fieldTypeLeft.isDerivedType() && fieldTypeLeft != PointableHelper.getTypeTag(itemTagTypesRight.get(indexRight))) {
return false;
http://git-wip-us.apache.org/repos/asf/asterixdb/blob/44cef249/asterixdb/asterix-runtime/src/main/java/org/apache/asterix/runtime/evaluators/visitors/RecordDeepEqualityChecker.java
----------------------------------------------------------------------
diff --git a/asterixdb/asterix-runtime/src/main/java/org/apache/asterix/runtime/evaluators/visitors/RecordDeepEqualityChecker.java b/asterixdb/asterix-runtime/src/main/java/org/apache/asterix/runtime/evaluators/visitors/RecordDeepEqualityChecker.java
index 84e9cf6..40af09a 100644
--- a/asterixdb/asterix-runtime/src/main/java/org/apache/asterix/runtime/evaluators/visitors/RecordDeepEqualityChecker.java
+++ b/asterixdb/asterix-runtime/src/main/java/org/apache/asterix/runtime/evaluators/visitors/RecordDeepEqualityChecker.java
@@ -30,14 +30,15 @@ import org.apache.asterix.runtime.evaluators.functions.PointableHelper;
import org.apache.hyracks.algebricks.common.utils.Pair;
import org.apache.hyracks.api.exceptions.HyracksDataException;
import org.apache.hyracks.data.std.primitive.IntegerPointable;
+import org.apache.hyracks.data.std.util.BinaryEntry;
class RecordDeepEqualityChecker {
private final Pair<IVisitablePointable, Boolean> nestedVisitorArg = new Pair<IVisitablePointable, Boolean>(null,
false);
private final DeepEqualityVisitorHelper deepEqualityVisitorHelper = new DeepEqualityVisitorHelper();
private DeepEqualityVisitor visitor;
- private BinaryHashMap.BinaryEntry keyEntry = new BinaryHashMap.BinaryEntry();
- private BinaryHashMap.BinaryEntry valEntry = new BinaryHashMap.BinaryEntry();
+ private BinaryEntry keyEntry = new BinaryEntry();
+ private BinaryEntry valEntry = new BinaryEntry();
private BinaryHashMap hashMap;
public RecordDeepEqualityChecker(int tableSize, int tableFrameSize) {
@@ -75,7 +76,7 @@ class RecordDeepEqualityChecker {
for (int i = 0; i < sizeLeft; i++) {
IVisitablePointable fieldName = fieldNamesLeft.get(i);
keyEntry.set(fieldName.getByteArray(), fieldName.getStartOffset(), fieldName.getLength());
- IntegerPointable.setInteger(valEntry.buf, 0, i);
+ IntegerPointable.setInteger(valEntry.getBuf(), 0, i);
hashMap.put(keyEntry, valEntry);
}
@@ -91,12 +92,12 @@ class RecordDeepEqualityChecker {
for (int i = 0; i < fieldNamesRight.size(); i++) {
IVisitablePointable fieldName = fieldNamesRight.get(i);
keyEntry.set(fieldName.getByteArray(), fieldName.getStartOffset(), fieldName.getLength());
- BinaryHashMap.BinaryEntry entry = hashMap.get(keyEntry);
+ BinaryEntry entry = hashMap.get(keyEntry);
if (entry == null) {
return false;
}
- int fieldIdLeft = AInt32SerializerDeserializer.getInt(entry.buf, entry.off);
+ int fieldIdLeft = AInt32SerializerDeserializer.getInt(entry.getBuf(), entry.getOffset());
ATypeTag fieldTypeLeft = PointableHelper.getTypeTag(fieldTypesLeft.get(fieldIdLeft));
if (fieldTypeLeft.isDerivedType() && fieldTypeLeft != PointableHelper.getTypeTag(fieldTypesRight.get(i))) {
return false;
http://git-wip-us.apache.org/repos/asf/asterixdb/blob/44cef249/hyracks-fullstack/hyracks/hyracks-data/hyracks-data-std/src/main/java/org/apache/hyracks/data/std/primitive/UTF8StringLowercaseTokenPointable.java
----------------------------------------------------------------------
diff --git a/hyracks-fullstack/hyracks/hyracks-data/hyracks-data-std/src/main/java/org/apache/hyracks/data/std/primitive/UTF8StringLowercaseTokenPointable.java b/hyracks-fullstack/hyracks/hyracks-data/hyracks-data-std/src/main/java/org/apache/hyracks/data/std/primitive/UTF8StringLowercaseTokenPointable.java
new file mode 100644
index 0000000..66c1ab9
--- /dev/null
+++ b/hyracks-fullstack/hyracks/hyracks-data/hyracks-data-std/src/main/java/org/apache/hyracks/data/std/primitive/UTF8StringLowercaseTokenPointable.java
@@ -0,0 +1,80 @@
+/*
+ * Copyright 2009-2013 by The Regents of the University of California
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * you may obtain a copy of the License from
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hyracks.data.std.primitive;
+
+import org.apache.hyracks.api.dataflow.value.ITypeTraits;
+import org.apache.hyracks.data.std.api.AbstractPointable;
+import org.apache.hyracks.data.std.api.IComparable;
+import org.apache.hyracks.data.std.api.IHashable;
+import org.apache.hyracks.data.std.api.IPointable;
+import org.apache.hyracks.data.std.api.IPointableFactory;
+import org.apache.hyracks.util.string.UTF8StringUtil;
+
+/**
+ * This lowercase string token pointable is for a UTF8 string that has no length bytes at the beginning.
+ * It exists to represent a single string token.
+ * When a string is tokenized, the length of each token is kept as a separate value rather than
+ * being encoded in front of the token bytes, so the length is supplied through setLength().
+ */
+public final class UTF8StringLowercaseTokenPointable extends AbstractPointable implements IHashable, IComparable {
+ public static final ITypeTraits TYPE_TRAITS = new ITypeTraits() {
+ private static final long serialVersionUID = 1L;
+
+ @Override
+ public boolean isFixedLength() {
+ return false;
+ }
+
+ @Override
+ public int getFixedLength() {
+ return 0;
+ }
+ };
+
+ public static final IPointableFactory FACTORY = new IPointableFactory() {
+ private static final long serialVersionUID = 1L;
+
+ @Override
+ public IPointable createPointable() {
+ return new UTF8StringLowercaseTokenPointable();
+ }
+
+ @Override
+ public ITypeTraits getTypeTraits() {
+ return TYPE_TRAITS;
+ }
+ };
+
+ // Set the length of this pointable
+ public void setLength(int length) {
+ this.length = length;
+ }
+
+ @Override
+ public int compareTo(IPointable pointer) {
+ return compareTo(pointer.getByteArray(), pointer.getStartOffset(), pointer.getLength());
+ }
+
+ @Override
+ public int compareTo(byte[] bytes, int start, int length) {
+ return UTF8StringUtil.lowerCaseCompareTo(this.bytes, this.start, this.length, bytes, start, length);
+ }
+
+ @Override
+ public int hash() {
+ return UTF8StringUtil.lowerCaseHash(bytes, start, length);
+ }
+
+}
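The pointable above delegates to UTF8StringUtil.lowerCaseCompareTo and UTF8StringUtil.lowerCaseHash over a (bytes, start, length) range. A minimal sketch of that idea follows, assuming standard UTF-8 bytes and per-character Character.toLowerCase; the class LowercaseTokenSketch and its methods are hypothetical and are not the Hyracks implementation. The point it illustrates is that the comparison and the hash must agree: two tokens that compare as equal must also hash to the same value.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not the Hyracks UTF8StringUtil code.
final class LowercaseTokenSketch {

    // Compares two byte ranges as lowercase strings. The length is passed in
    // explicitly because the token bytes carry no length prefix.
    static int lowerCaseCompare(byte[] a, int aStart, int aLen, byte[] b, int bStart, int bLen) {
        String left = new String(a, aStart, aLen, StandardCharsets.UTF_8);
        String right = new String(b, bStart, bLen, StandardCharsets.UTF_8);
        int n = Math.min(left.length(), right.length());
        for (int i = 0; i < n; i++) {
            int diff = Character.toLowerCase(left.charAt(i)) - Character.toLowerCase(right.charAt(i));
            if (diff != 0) {
                return diff;
            }
        }
        return left.length() - right.length();
    }

    // Hashes the lowercase form, so tokens that compare as equal hash equally.
    static int lowerCaseHash(byte[] bytes, int start, int len) {
        String s = new String(bytes, start, len, StandardCharsets.UTF_8);
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + Character.toLowerCase(s.charAt(i));
        }
        return h;
    }
}
```

With the bytes of "Hello World", the range (0, 5) compares as equal to the bytes of "hello" and produces the same hash.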
http://git-wip-us.apache.org/repos/asf/asterixdb/blob/44cef249/hyracks-fullstack/hyracks/hyracks-data/hyracks-data-std/src/main/java/org/apache/hyracks/data/std/util/BinaryEntry.java
----------------------------------------------------------------------
diff --git a/hyracks-fullstack/hyracks/hyracks-data/hyracks-data-std/src/main/java/org/apache/hyracks/data/std/util/BinaryEntry.java b/hyracks-fullstack/hyracks/hyracks-data/hyracks-data-std/src/main/java/org/apache/hyracks/data/std/util/BinaryEntry.java
new file mode 100644
index 0000000..7336dca
--- /dev/null
+++ b/hyracks-fullstack/hyracks/hyracks-data/hyracks-data-std/src/main/java/org/apache/hyracks/data/std/util/BinaryEntry.java
@@ -0,0 +1,79 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.hyracks.data.std.util;
+
+import java.io.ByteArrayInputStream;
+import java.io.DataInput;
+import java.io.DataInputStream;
+
+import org.apache.hyracks.api.dataflow.value.ISerializerDeserializer;
+import org.apache.hyracks.api.exceptions.HyracksDataException;
+
+/**
+ * A class that stores the metadata (buf, offset, length) of an entry for BinaryHashMap and BinaryHashSet.
+ */
+public class BinaryEntry {
+ private int off;
+ private int len;
+ private byte[] buf;
+
+ public void set(int offset, int length) {
+ this.buf = null;
+ this.off = offset;
+ this.len = length;
+ }
+
+ public void set(byte[] buf, int off, int len) {
+ this.buf = buf;
+ this.off = off;
+ this.len = len;
+ }
+
+ public void setOffset(int off) {
+ this.off = off;
+ }
+
+ public int getOffset() {
+ return off;
+ }
+
+ public void setLength(int len) {
+ this.len = len;
+ }
+
+ public int getLength() {
+ return len;
+ }
+
+ public void setBuf(byte[] buf) {
+ this.buf = buf;
+ }
+
+ public byte[] getBuf() {
+ return buf;
+ }
+
+ // Inefficient. Just for debugging.
+ @SuppressWarnings("rawtypes")
+ public String print(ISerializerDeserializer serde) throws HyracksDataException {
+ ByteArrayInputStream inStream = new ByteArrayInputStream(buf, off, len);
+ DataInput dataIn = new DataInputStream(inStream);
+ return serde.deserialize(dataIn).toString();
+ }
+}
\ No newline at end of file
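BinaryEntry describes a value as a slice (buf, off, len) of a larger byte array, and its print() method decodes that slice through a DataInput. A minimal sketch of the same pattern follows, assuming the values were written with DataOutputStream.writeUTF; the class EntrySliceSketch and its methods are hypothetical, and the real code uses an ISerializerDeserializer instead of readUTF.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical illustration of decoding one (offset, length) slice of a shared buffer.
final class EntrySliceSketch {

    // Writes every value with writeUTF (2 length bytes followed by the encoded characters).
    static byte[] serialize(String... values) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            DataOutputStream data = new DataOutputStream(out);
            for (String value : values) {
                data.writeUTF(value);
            }
            data.flush();
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Decodes the single value stored in buf[off .. off + len).
    static String print(byte[] buf, int off, int len) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(buf, off, len));
            return in.readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For serialize("full", "text"), each value occupies 6 bytes, so the second entry is the slice (6, 6).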
http://git-wip-us.apache.org/repos/asf/asterixdb/blob/44cef249/hyracks-fullstack/hyracks/hyracks-data/hyracks-data-std/src/main/java/org/apache/hyracks/data/std/util/BinaryHashSet.java
----------------------------------------------------------------------
diff --git a/hyracks-fullstack/hyracks/hyracks-data/hyracks-data-std/src/main/java/org/apache/hyracks/data/std/util/BinaryHashSet.java b/hyracks-fullstack/hyracks/hyracks-data/hyracks-data-std/src/main/java/org/apache/hyracks/data/std/util/BinaryHashSet.java
new file mode 100644
index 0000000..c3e36da
--- /dev/null
+++ b/hyracks-fullstack/hyracks/hyracks-data/hyracks-data-std/src/main/java/org/apache/hyracks/data/std/util/BinaryHashSet.java
@@ -0,0 +1,299 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.hyracks.data.std.util;
+
+import java.nio.ByteBuffer;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+
+import org.apache.hyracks.api.dataflow.value.IBinaryComparator;
+import org.apache.hyracks.api.dataflow.value.IBinaryHashFunction;
+import org.apache.hyracks.api.exceptions.HyracksDataException;
+
+/**
+ * The simplest implementation of a static hash-set you could imagine.
+ * Intended to work with binary data and to hold arbitrary key types,
+ * given that they have implementations of
+ * IBinaryHashFunction and IBinaryComparator.
+ * Each key in the hash table: the offset (2 bytes) and the length (2 bytes) of an entry.
+ * The real key value is not stored in the set since it can be found using the reference array.
+ * Additionally, each key has a count (1 byte) and a pointer to the next entry (4 bytes).
+ * Hash value: calculated from the entry value.
+ * This class is NOT thread safe. - For single thread access only
+ * Limitation - a frame size can't be greater than 64K because we use 2 bytes to store the offset.
+ * Can't have more than 64K frames.
+ */
+public class BinaryHashSet {
+ // Special value to indicate an empty "bucket" in the header array.
+ static final int NULL_PTR = -1;
+ private static final int PTR_SIZE = 4; // 2 bytes - frameIdx, 2 bytes - frameOffset
+ static final int SLOT_SIZE = 2;
+
+ // This hash-set also stores the count of the real key.
+ // It's not part of the key and can be used to indicate whether this key exists in a different array or not.
+ static final int COUNT_SIZE = 1; // max value: Byte.MAX_VALUE (2^7 - 1)
+ private static final int ENTRY_HEADER_SIZE = 2 * SLOT_SIZE + PTR_SIZE + COUNT_SIZE;
+ // We are using 2 bytes. Therefore, the limit is 64K.
+ private static final int NO_OF_FRAME_LIMIT = 65535;
+ private static final int ONE_FRAME_SIZE_LIMIT = 65535;
+ private final IBinaryHashFunction hashFunc;
+ private final IBinaryComparator cmp;
+
+ private final int[] listHeads;
+ private final int frameSize;
+ private final List<ByteBuffer> frames = new ArrayList<>();
+ private int currFrameIndex;
+ private int nextOff;
+ private int size;
+
+ // Byte array that holds the real data for this hashset
+ private byte[] refArray;
+
+ // Initialize a hash-set. It will contain one frame by default.
+ public BinaryHashSet(int tableSize, int frameSize, IBinaryHashFunction hashFunc, IBinaryComparator cmp,
+ byte[] refArray) {
+ listHeads = new int[tableSize];
+ if (frameSize > ONE_FRAME_SIZE_LIMIT) {
+ throw new IllegalStateException(
+ "A frame size can't be greater than " + ONE_FRAME_SIZE_LIMIT + ". Can't continue.");
+ }
+ this.frameSize = frameSize;
+ this.hashFunc = hashFunc;
+ this.cmp = cmp;
+ frames.add(ByteBuffer.allocate(frameSize));
+ clear();
+ this.refArray = refArray;
+ }
+
+ /**
+ * Set the byte array that the keys in this hash-set refer to.
+ *
+ * @param refArray
+ */
+ public void setRefArray(byte[] refArray) {
+ this.refArray = refArray;
+ }
+
+ /**
+ * Inserts a key (off, len) into the hash set.
+ * The count of the key will not be changed.
+ *
+ * @param key
+ * @return the current count of the key: when the key is already in the set.
+ * 0: when the key is newly inserted.
+ * @throws HyracksDataException
+ */
+ public int put(BinaryEntry key) throws HyracksDataException {
+ return putFindInternal(key, true, null, false);
+ }
+
+ /**
+ * Find whether the given key from an array exists in the hash set.
+ * If the key exists and increaseFoundCount is true, the count is increased by 1 (up to Byte.MAX_VALUE).
+ *
+ * @param key
+ * @param keyArray
+ * @param increaseFoundCount
+ * @return the current count of the key: when a given key exists.
+ * -1: when the given key doesn't exist.
+ * @throws HyracksDataException
+ */
+ public int find(BinaryEntry key, byte[] keyArray, boolean increaseFoundCount) throws HyracksDataException {
+ return putFindInternal(key, false, keyArray, increaseFoundCount);
+ }
+
+
+ // Put an entry or find an entry
+ private int putFindInternal(BinaryEntry key, boolean isInsert, byte[] keyArray, boolean increaseFoundCount)
+ throws HyracksDataException {
+ int bucket;
+ bucket = isInsert ? Math.abs(hashFunc.hash(this.refArray, key.getOffset(), key.getLength()) % listHeads.length)
+ : Math.abs(hashFunc.hash(keyArray, key.getOffset(), key.getLength()) % listHeads.length);
+
+ int headPtr = listHeads[bucket];
+ if (headPtr == NULL_PTR) {
+ // Key definitely doesn't exist yet.
+ if (isInsert) {
+ // Key is being inserted.
+ listHeads[bucket] = appendEntry(key);
+ return 0;
+ } else {
+ // find case - the bucket is empty: return -1 since there is no element in the hash-set
+ return -1;
+ }
+
+ }
+ // if headPtr is not null,
+ // follow the chain in the bucket until we found an entry matching the given key.
+ int frameNum;
+ int frameOff;
+ int entryKeyOff;
+ int entryKeyLen;
+ int entryCount;
+ ByteBuffer frame;
+ do {
+ // Get frame num and frame offset from the ptr
+ frameNum = getFrameIndex(headPtr);
+ frameOff = getFrameOffset(headPtr);
+ frame = frames.get(frameNum);
+
+ // Get entry offset
+ entryKeyOff = (int) frame.getChar(frameOff);
+ entryKeyLen = (int) frame.getChar(frameOff + SLOT_SIZE);
+
+ // Check the key length. If they don't match, we don't even need to compare two entries.
+ if (entryKeyLen == key.getLength()) {
+ if (isInsert) {
+ if (cmp.compare(this.refArray, entryKeyOff, entryKeyLen, this.refArray, key.getOffset(),
+ key.getLength()) == 0) {
+ // put - Key found, return its current count since the key is already in the hash-set.
+ entryCount = (int) frame.get(frameOff + 2 * SLOT_SIZE);
+ return entryCount;
+ }
+ } else if (cmp.compare(this.refArray, entryKeyOff, entryKeyLen, keyArray, key.getOffset(),
+ key.getLength()) == 0) {
+ // Find case - the key is found, increase the count when increaseFoundCount is set to true.
+ // Return the count. The maximum count is Byte.MAX_VALUE.
+ entryCount = (int) frame.get(frameOff + 2 * SLOT_SIZE);
+ if (increaseFoundCount && entryCount < Byte.MAX_VALUE) {
+ entryCount++;
+ }
+ frame.put(frameOff + 2 * SLOT_SIZE, (byte) entryCount);
+ return entryCount;
+ }
+ }
+ // Get next key position
+ headPtr = frame.getInt(frameOff + 2 * SLOT_SIZE + COUNT_SIZE);
+ } while (headPtr != NULL_PTR);
+
+ // We've followed the chain to its end, and didn't find the key.
+ if (isInsert) {
+ // Append the new entry, and set a pointer to it in the last entry we've checked.
+ // put case - success
+ int newPtr = appendEntry(key);
+ frame.putInt(frameOff + 2 * SLOT_SIZE + COUNT_SIZE, newPtr);
+ return 0;
+ } else {
+ // find case - fail
+ return -1;
+ }
+ }
+
+ public int appendEntry(BinaryEntry key) {
+ ByteBuffer frame = frames.get(currFrameIndex);
+ int requiredSpace = ENTRY_HEADER_SIZE;
+ if (nextOff + requiredSpace >= frameSize) {
+ // Entry doesn't fit on the current frame, allocate a new one.
+ if (requiredSpace > frameSize) {
+ throw new IllegalStateException(
+ "A hash key is greater than the framesize: " + frameSize + ". Can't continue.");
+ } else if (frames.size() > NO_OF_FRAME_LIMIT) {
+ throw new IllegalStateException(
+ "There can't be more than " + NO_OF_FRAME_LIMIT + " frames. Can't continue.");
+ }
+ frames.add(ByteBuffer.allocate(frameSize));
+ currFrameIndex++;
+ nextOff = 0;
+ frame = frames.get(currFrameIndex);
+ }
+ writeEntryHeader(frame, nextOff, key.getOffset(), key.getLength(), 0, NULL_PTR);
+ int entryPtr = getEntryPtr(currFrameIndex, nextOff);
+ nextOff += requiredSpace;
+ size++;
+ return entryPtr;
+ }
+
+ private void writeEntryHeader(ByteBuffer frame, int targetOff, int keyOff, int keyLen, int keyCount, int ptr) {
+ // [2 byte key offset] [2 byte key length] [1 byte key count] [2 byte the frame num] [2 byte the frame offset]
+ frame.putChar(targetOff, (char) keyOff);
+ frame.putChar(targetOff + SLOT_SIZE, (char) keyLen);
+ frame.put(targetOff + 2 * SLOT_SIZE, (byte) keyCount);
+ frame.putInt(targetOff + 2 * SLOT_SIZE + COUNT_SIZE, ptr);
+ }
+
+ private int getEntryPtr(int frameIndex, int frameOff) {
+ return (frameIndex << 16) + frameOff;
+ }
+
+ private int getFrameIndex(int ptr) {
+ return ptr >>> 16;
+ }
+
+ private int getFrameOffset(int ptr) {
+ return (int) (ptr & 0xffff);
+ }
+
+ public int size() {
+ return size;
+ }
+
+ public boolean isEmpty() {
+ return size == 0;
+ }
+
+ public void clear() {
+ // Initialize all entries to point to nothing.
+ Arrays.fill(listHeads, NULL_PTR);
+ currFrameIndex = 0;
+ nextOff = 0;
+ size = 0;
+ this.refArray = null;
+ }
+
+ /**
+ * Iterate all key entries and reset the foundCount of each key to zero.
+ */
+ public void clearFoundCount() {
+ int currentListHeadIndex = 0;
+ ByteBuffer frame;
+ int frameNum;
+ int frameOff;
+ int headPtr;
+
+ while (true) {
+ // Position to the next non-null list-head pointer.
+ while (currentListHeadIndex < listHeads.length && listHeads[currentListHeadIndex] == NULL_PTR) {
+ currentListHeadIndex++;
+ }
+ if (currentListHeadIndex == listHeads.length) {
+ // no more slots to read - we stop here.
+ break;
+ }
+
+ headPtr = listHeads[currentListHeadIndex];
+ do {
+ // Get frame num and frame offset from the ptr
+ frameNum = getFrameIndex(headPtr);
+ frameOff = getFrameOffset(headPtr);
+ frame = frames.get(frameNum);
+
+ // Set the count as zero
+ frame.put(frameOff + 2 * SLOT_SIZE, (byte) 0);
+
+ // Get next key position
+ headPtr = frame.getInt(frameOff + 2 * SLOT_SIZE + COUNT_SIZE);
+ } while (headPtr != NULL_PTR);
+
+ // Move on to the next bucket.
+ currentListHeadIndex++;
+ }
+ }
+
+}
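BinaryHashSet stores each entry as a fixed 9-byte header ([2-byte key offset] [2-byte key length] [1-byte count] [4-byte next pointer]) and packs a frame index and a frame offset into the 4-byte pointer. A minimal sketch of that layout and of the pointer round trip follows; the class EntryPointerSketch and its method names are hypothetical. It assumes an unsigned shift when unpacking, so that frame indexes above 32767 (which set the sign bit of the packed int) are recovered as non-negative values.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration of the entry header layout and pointer packing.
final class EntryPointerSketch {
    static final int SLOT_SIZE = 2;   // key offset and key length, 2 bytes each
    static final int COUNT_SIZE = 1;  // found count
    static final int PTR_SIZE = 4;    // 2 bytes frame index, 2 bytes frame offset
    static final int ENTRY_SIZE = 2 * SLOT_SIZE + COUNT_SIZE + PTR_SIZE; // 9 bytes

    // Packs a frame index (upper 16 bits) and a frame offset (lower 16 bits).
    static int pack(int frameIndex, int frameOff) {
        return (frameIndex << 16) | (frameOff & 0xffff);
    }

    // Unsigned shift: the upper 16 bits are read back without sign extension.
    static int frameIndex(int ptr) {
        return ptr >>> 16;
    }

    static int frameOffset(int ptr) {
        return ptr & 0xffff;
    }

    // Writes one entry header at position 'at' using absolute puts.
    static void writeEntry(ByteBuffer frame, int at, int keyOff, int keyLen, int count, int nextPtr) {
        frame.putChar(at, (char) keyOff);
        frame.putChar(at + SLOT_SIZE, (char) keyLen);
        frame.put(at + 2 * SLOT_SIZE, (byte) count);
        frame.putInt(at + 2 * SLOT_SIZE + COUNT_SIZE, nextPtr);
    }

    public static void main(String[] args) {
        ByteBuffer frame = ByteBuffer.allocate(64);
        writeEntry(frame, 0, 300, 5, 0, pack(40000, 12));
        int next = frame.getInt(2 * SLOT_SIZE + COUNT_SIZE);
        System.out.println(frameIndex(next) + " " + frameOffset(next)); // prints "40000 12"
    }
}
```

Reading the key offset and length back with getChar keeps them in the unsigned range 0..65535, which is why the original code uses putChar and getChar for the two 2-byte slots.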
http://git-wip-us.apache.org/repos/asf/asterixdb/blob/44cef249/hyracks-fullstack/hyracks/hyracks-storage-am-lsm-invertedindex/src/main/java/org/apache/hyracks/storage/am/lsm/invertedindex/tokenizers/DelimitedUTF8StringBinaryTokenizer.java
----------------------------------------------------------------------
diff --git a/hyracks-fullstack/hyracks/hyracks-storage-am-lsm-invertedindex/src/main/java/org/apache/hyracks/storage/am/lsm/invertedindex/tokenizers/DelimitedUTF8StringBinaryTokenizer.java b/hyracks-fullstack/hyracks/hyracks-storage-am-lsm-invertedindex/src/main/java/org/apache/hyracks/storage/am/lsm/invertedindex/tokenizers/DelimitedUTF8StringBinaryTokenizer.java
index 28fa2be..32e930d 100644
--- a/hyracks-fullstack/hyracks/hyracks-storage-am-lsm-invertedindex/src/main/java/org/apache/hyracks/storage/am/lsm/invertedindex/tokenizers/DelimitedUTF8StringBinaryTokenizer.java
+++ b/hyracks-fullstack/hyracks/hyracks-storage-am-lsm-invertedindex/src/main/java/org/apache/hyracks/storage/am/lsm/invertedindex/tokenizers/DelimitedUTF8StringBinaryTokenizer.java
@@ -50,7 +50,7 @@ public class DelimitedUTF8StringBinaryTokenizer extends AbstractUTF8StringBinary
return byteIndex < sentenceEndOffset;
}
- private static boolean isSeparator(char c) {
+ public static boolean isSeparator(char c) {
return !(Character.isLetterOrDigit(c) || Character.getType(c) == Character.OTHER_LETTER
|| Character.getType(c) == Character.OTHER_NUMBER);
}
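Making isSeparator public lets other code split text with the same rule the tokenizer uses. A minimal sketch of such a split over a java.lang.String follows; the class SeparatorTokenizerSketch and its tokenize method are hypothetical, and only the isSeparator body is copied from the diff above. The real tokenizer works on UTF-8 bytes rather than on a String.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of splitting text with the isSeparator predicate.
final class SeparatorTokenizerSketch {

    // Same predicate as DelimitedUTF8StringBinaryTokenizer.isSeparator in the diff.
    static boolean isSeparator(char c) {
        return !(Character.isLetterOrDigit(c) || Character.getType(c) == Character.OTHER_LETTER
                || Character.getType(c) == Character.OTHER_NUMBER);
    }

    // Returns the maximal runs of non-separator characters, in order.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int start = -1; // start of the current token, or -1 when between tokens
        for (int i = 0; i < text.length(); i++) {
            if (isSeparator(text.charAt(i))) {
                if (start >= 0) {
                    tokens.add(text.substring(start, i));
                    start = -1;
                }
            } else if (start < 0) {
                start = i;
            }
        }
        if (start >= 0) {
            tokens.add(text.substring(start));
        }
        return tokens;
    }
}
```

For the input "Hello, full-text search!" this yields the tokens Hello, full, text and search, since the comma, hyphen, space and exclamation mark are all separators.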