Posted to commits@hive.apache.org by jv...@apache.org on 2010/08/19 00:59:33 UTC

svn commit: r986973 [1/2] - in /hadoop/hive/trunk: ./ data/files/ ql/src/java/org/apache/hadoop/hive/ql/exec/ ql/src/java/org/apache/hadoop/hive/ql/udf/generic/ ql/src/test/queries/clientpositive/ ql/src/test/results/clientpositive/

Author: jvs
Date: Wed Aug 18 22:59:32 2010
New Revision: 986973

URL: http://svn.apache.org/viewvc?rev=986973&view=rev
Log:
HIVE-1518. context_ngrams() UDAF for estimating top-k contextual
n-grams
(Mayank Lahiri via jvs)
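
The log entry is terse, so here is a minimal, self-contained Java sketch of the context-matching idea behind context_ngrams(): a null entry in the context array marks a "blank" to be filled from the token stream, while non-null entries must match the tokens literally. The class and method names below are illustrative only; this is not the committed Hive code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the context-matching step behind context_ngrams().
public class ContextMatchSketch {
    // Returns every filler sequence where the non-null context words line up
    // with the tokens; null positions capture the tokens found there.
    public static List<List<String>> match(List<String> tokens, List<String> context) {
        List<List<String>> fillers = new ArrayList<>();
        for (int start = 0; start + context.size() <= tokens.size(); start++) {
            List<String> filled = new ArrayList<>();
            boolean ok = true;
            for (int j = 0; j < context.size(); j++) {
                String want = context.get(j);
                String got = tokens.get(start + j);
                if (want == null) {
                    filled.add(got);          // blank: capture the token
                } else if (!want.equals(got)) {
                    ok = false;               // fixed context word mismatch
                    break;
                }
            }
            if (ok) {
                fillers.add(filled);
            }
        }
        return fillers;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("i", "love", "pizza", "and", "i", "love", "sushi");
        List<String> context = Arrays.asList("i", "love", null);
        System.out.println(match(tokens, context)); // [[pizza], [sushi]]
    }
}
```

In the real UDAF the captured fillers are then counted and the top-k most frequent ones are estimated; the sketch only shows the matching step.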


Added:
    hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFContextNGrams.java
    hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/NGramEstimator.java
    hadoop/hive/trunk/ql/src/test/queries/clientpositive/udaf_context_ngrams.q
    hadoop/hive/trunk/ql/src/test/results/clientpositive/udaf_context_ngrams.q.out
Modified:
    hadoop/hive/trunk/CHANGES.txt
    hadoop/hive/trunk/data/files/text-en.txt
    hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java
    hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFHistogramNumeric.java
    hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFPercentileApprox.java
    hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFnGrams.java
    hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/NumericHistogram.java
    hadoop/hive/trunk/ql/src/test/queries/clientpositive/udaf_ngrams.q
    hadoop/hive/trunk/ql/src/test/results/clientpositive/show_functions.q.out
    hadoop/hive/trunk/ql/src/test/results/clientpositive/udaf_ngrams.q.out

Modified: hadoop/hive/trunk/CHANGES.txt
URL: http://svn.apache.org/viewvc/hadoop/hive/trunk/CHANGES.txt?rev=986973&r1=986972&r2=986973&view=diff
==============================================================================
--- hadoop/hive/trunk/CHANGES.txt (original)
+++ hadoop/hive/trunk/CHANGES.txt Wed Aug 18 22:59:32 2010
@@ -46,6 +46,10 @@ Trunk -  Unreleased
     and covar_samp
     (Pierre Huyn via jvs)
 
+    HIVE-1518. context_ngrams() UDAF for estimating top-k contextual
+    n-grams
+    (Mayank Lahiri via jvs)
+
   IMPROVEMENTS
 
     HIVE-1394. Do not update transient_lastDdlTime if the partition is modified by a housekeeping

Modified: hadoop/hive/trunk/data/files/text-en.txt
URL: http://svn.apache.org/viewvc/hadoop/hive/trunk/data/files/text-en.txt?rev=986973&r1=986972&r2=986973&view=diff
==============================================================================
--- hadoop/hive/trunk/data/files/text-en.txt (original)
+++ hadoop/hive/trunk/data/files/text-en.txt Wed Aug 18 22:59:32 2010
@@ -76,3 +76,20 @@ would have put in his report about Grego
 service Gregor had never once yet been ill.  His boss would certainly come round with the doctor from the medical insurance company, accuse his parents of having a lazy son, and accept the doctor's recommendation not to make any claim as the doctor believed
 that no-one was ever ill but that many were workshy.  And what's more, would he have been entirely wrong in this case? Gregor did in fact, apart from excessive sleepiness after sleeping for so long,
 feel completely well and even felt much hungrier than usual.
+One morning, as Gregor Samsa was waking up from anxious dreams, he discovered that in his bed he had been changed into a monstrous verminous bug. He lay on his armour-hard back and saw, as he lifted his head up a little, his brown, arched abdomen divided up into rigid bow-like sections. From this height the blanket, just about ready to slide off completely, could hardly stay in place. His numerous legs, pitifully thin in comparison to the rest of his circumference, flickered helplessly before his eyes.
+“What’s happened to me,” he thought. It was no dream. His room, a proper room for a human being, only somewhat too small, lay quietly between the four well-known walls. Above the table, on which an unpacked collection of sample cloth goods was spread out—Samsa was a travelling salesman—hung the picture which he had cut out of an illustrated magazine a little while ago and set in a pretty gilt frame. It was a picture of a woman with a fur hat and a fur boa. She sat erect there, lifting up in the direction of the viewer a solid fur muff into which her entire forearm had disappeared.
+“O God,” he thought, “what a demanding job I’ve chosen! Day in, day out, on the road. The stresses of selling are much greater than the actual work going on at head office, and, in addition to that, I still have to cope with the problems of travelling, the worries about train connections, irregular bad food, temporary and constantly changing human relationships, which never come from  the heart. To hell with it all!” He felt a slight itching on the top of his abdomen. He slowly pushed himself on his back closer to the bed post so that he could lift his head more easily, found the itchy part, which was entirely covered with small white spots—he did not know what to make of them and wanted to feel the place with a leg. But he retracted it immediately, for the contact felt like a cold shower all over him.
+He slid back again into his earlier position. “This getting up early,” he thought, “makes a man quite idiotic. A man must have his sleep. Other travelling salesmen live like harem women. For instance, when I come back to the inn during the course of the morning to write up the necessary orders, these gentlemen are just sitting down to breakfast. If I were to try that with my boss, I’d be thrown out on the spot. Still, who knows whether that mightn’t be really good for me. If I didn’t hold back for my parents’ sake, I’d have quit ages ago. I would’ve gone to the boss and told him just what I think from the bottom of my heart. He would’ve fallen right off his desk! How weird it is to sit up at that desk and talk down to the employee from way up there. What’s more, the boss has trouble hearing, so the employee has to step up quite close to him. Anyway, I haven’t completely given up that hope yet. Once I’ve got tog
 ether the money to pay off my parents’ debt to him—that should take another five or six years—I’ll do it for sure. Then I’ll make the big break. In any case, right now I have to get up. My train leaves at five o’clock.”
+He looked over at the alarm clock ticking away by the chest of drawers. “Good God!” he thought. It was half past six, and the hands were going quietly on. It was even past the half hour, already nearly quarter to. Could the alarm have failed to ring? One saw from the bed that it was properly set for four o’clock. Certainly it had rung. Yes, but was it possible to sleep peacefully through that noise which made the furniture shake? Now, it is true he had not slept peacefully, but evidently he had slept all the more deeply. Still, what should he do now? The next train left at seven o’clock. To catch that one, he would have to go in a mad rush. The sample collection was not packed up yet, and he really did not feel particularly fresh and active. And even if he caught the train, there was no avoiding a blow-up with the boss, because the firm’s errand boy would have waited for the five o’clock train and reported the news of his absence long ago. He wa
 s the boss’s minion, without backbone and intelligence. Well then, what if he reported in sick? But that would be extremely embarrassing and suspicious, because during his five years’ service Gregor had not been sick even once. The boss would certainly come with the doctor from the health insurance company and would reproach his parents for their lazy son and cut short all objections with the insurance doctor’s comments; for him everyone was completely healthy but really lazy about work. And besides, would the doctor in this case be totally wrong? Apart from a really excessive drowsiness after the long sleep, Gregor, in fact, felt quite well and even had a really strong appetite.
+As he was thinking all this over in the greatest haste, without being able to make the decision to get out of bed—the alarm clock was indicating exactly quarter to seven—there was a cautious knock on the door by the head of the bed. “Gregor,” a voice called—it was his mother—“it’s quarter to seven. Don’t you want to be on your way?” The soft voice! Gregor was startled when he heard his voice answering. It was clearly and unmistakably his earlier voice, but in it was intermingled, as if from below, an irrepressible, painful squeaking, which left the words positively distinct only in the first moment and distorted them in the reverberation, so that one did not know if one had heard correctly. Gregor wanted to answer in detail and explain everything, but in these circumstances he confined himself to saying, “Yes, yes, thank you mother. I’m getting up right away.” Because of the wooden door the change in Gregor’
 s voice was not really noticeable outside, so his mother calmed down with this explanation and shuffled off. However, as a result of the short conversation, the other family members became aware that Gregor was unexpectedly still at home, and already his father was knocking on one side door, weakly but with his fist. “Gregor, Gregor,” he called out, “what’s going on?” And, after a short while, he urged him on again in a deeper voice: “Gregor! Gregor!” At the other side door, however, his sister knocked lightly. “Gregor? Are you all right? Do you need anything?” Gregor directed answers in both directions, “I’ll be ready right away.” He made an effort with the most careful articulation and inserted long pauses between the individual words to remove everything remarkable from his voice. His father turned back to his breakfast. However, the sister whispered, “Gregor, open the door—I beg you.” Gregor had
  no intention of opening the door, but congratulated himself on his precaution, acquired from travelling, of locking all doors during the night, even at home.
+First he wanted to stand up quietly and undisturbed, get dressed, above all have breakfast, and only then consider further action, for—he noticed this clearly—by thinking things over in bed he would not reach a reasonable conclusion. He remembered that he had already often felt some light pain or other in bed, perhaps the result of an awkward lying position, which later, once he stood up, turned out to be purely imaginary, and he was eager to see how his present fantasies would gradually dissipate. That the change in his voice was nothing other than the onset of a real chill, an occupational illness of commercial travellers, of that he had not the slightest doubt.
+It was very easy to throw aside the blanket. He needed only to push himself up a little, and it fell by itself. But to continue was difficult, particularly because he was so unusually wide. He needed arms and hands to push himself upright. Instead of these, however, he had only many small limbs, which were incessantly moving with very different motions and which, in addition, he was unable to control. If he wanted to bend one of them, then it was the first to extend itself, and if he finally succeeded doing what he wanted with this limb, in the meantime all the others, as if left free, moved around in an excessively painful agitation. “But I must not stay in bed uselessly,” said Gregor to himself.
+At first he wanted to get out of bed with the lower part of his body, but this lower part—which, by the way, he had not yet looked at and which he also could not picture clearly—proved itself too difficult to move. The attempt went so slowly. When, having become almost frantic, he finally hurled himself forward with all his force and without thinking, he chose his direction incorrectly, and he hit the lower bedpost hard. The violent pain he felt revealed to him that the lower part of his body was at the moment probably the most sensitive.
+Thus, he tried to get his upper body out of the bed first and turned his head carefully toward the edge of the bed. He managed to do this easily, and in spite of its width and weight his body mass at last slowly followed the turning of his head. But as he finally raised his head outside the bed in the open air, he became anxious about moving forward any further in this manner, for if he allowed himself eventually to fall by this process, it would really take a miracle to prevent his head from getting injured. And at all costs he must not lose consciousness right now. He preferred to remain in bed.
+However, after a similar effort, while he lay there again, sighing as before, and once again saw his small limbs fighting one another, if anything even worse than earlier, and did not see any chance of imposing quiet and order on this arbitrary movement, he told himself again that he could not possibly remain in bed and that it might be the most reasonable thing to sacrifice everything if there was even the slightest hope of getting himself out of bed in the process. At the same moment, however, he did not forget to remind himself from time to time of the fact that calm—indeed the calmest—reflection might be much better than confused decisions. At such moments, he directed his gaze as precisely as he could toward the window, but unfortunately there was little confident cheer to be had from a glance at the morning mist, which concealed even the other side of the narrow street. “It’s already seven o’clock,” he told himself at the latest sounds fro
 m the alarm clock, “already seven o’clock and still such a fog.” And for a little while longer he lay quietly with weak breathing, as if perhaps waiting for normal and natural conditions to re-emerge out of the complete stillness.
+But then he said to himself, “Before it strikes a quarter past seven, whatever happens I must be completely out of bed. Besides, by then someone from the office will arrive to inquire about me, because the office will open before seven o’clock.” And he made an effort then to rock his entire body length out of the bed with a uniform motion. If he let himself fall out of the bed in this way, his head, which in the course of the fall he intended to lift up sharply, would probably remain uninjured. His back seemed to be hard; nothing would really happen to that as a result of the fall onto the carpet. His greatest reservation was a worry about the loud noise which the fall must create and which presumably would arouse, if not fright, then at least concern on the other side of all the doors. However, he had to take that chance.
+As Gregor was already in the process of lifting himself half out of bed—the new method was more of a game than an effort; he needed only to rock with a series of jerks—it struck him how easy all this would be if someone were to come to his aid. Two strong people—he thought of his father and the servant girl—would have been quite sufficient. They would only have had to push their arms under his arched back to get him out of the bed, to bend down with their load, and then merely to exercise patience so that he could complete the flip onto the floor, where his diminutive legs would then, he hoped, acquire a purpose. Now, quite apart from the fact that the doors were locked, should he really call out for help? In spite of all his distress, he was unable to suppress a smile at this idea.
+He had already got to the point where, by rocking more strongly, he maintained his equilibrium with difficulty, and very soon he would finally have to make a final decision, for in five minutes it would be a quarter past seven. Then there was a ring at the door of the apartment. “That’s someone from the office,” he told himself, and he almost froze, while his small limbs only danced around all the faster. For one moment everything remained still. “They aren’t opening,” Gregor said to himself, caught up in some absurd hope. But of course then, as usual, the servant girl with her firm tread went to the door and opened it. Gregor needed to hear only the first word of the visitor’s greeting to recognize immediately who it was, the manager himself. Why was Gregor the only one condemned to work in a firm where, at the slightest lapse, someone at once attracted the greatest suspicion? Were all the employees then collectively, one and all, scoundre
 ls? Among them was there then no truly devoted person who, if he failed to use just a couple of hours in the morning for office work, would become abnormal from pangs of conscience and really be in no state to get out of bed? Was it really not enough to let an apprentice make inquiries, if such questioning was even generally  necessary? Must the manager himself come, and in the process must it be demonstrated to the entire innocent family that the investigation of this suspicious circumstance could be entrusted only to the intelligence of the manager? And more as a consequence of the excited state in which this idea put Gregor than as a result of an actual decision, he swung himself with all his might out of the bed. There was a loud thud, but not a real crash. The fall was absorbed somewhat by the carpet and, in addition, his back was more elastic than Gregor had thought. For that reason the dull noise was not quite so conspicuous. But he had not held his head up with suffi
 cient care and had hit it. He turned his head, irritated and in pain, and rubbed it on the carpet.
+“Something has fallen in there,” said the manager in the next room on the left. Gregor tried to imagine to himself whether anything similar to what was happening to him today could have also happened at some point to the manager. At least one had to concede the possibility of such a thing. However, as if to give a rough answer to this question, the manager now, with a squeak of his polished boots, took a few determined steps in the next room. From the neighbouring room on the right the sister was whispering to inform Gregor: “Gregor, the manager is here.” “I know,” said Gregor to himself. But he did not dare make his voice loud enough so that his sister could hear.
+“Gregor,” his father now said from the neighbouring room on the left, “Mr. Manager has come and is asking why you have not left on the early train. We don’t know what we should tell him. Besides, he also wants to speak to you personally. So please open the door. He will be good enough to forgive the mess in your room.” In the middle of all this, the manager called out in a friendly way, “Good morning, Mr. Samsa.” “He is not well,” said his mother to the manager, while his father was still talking at the door, “He is not well, believe me, Mr. Manager. Otherwise how would Gregor miss a train? The young man has nothing in his head except business. I’m almost angry that he never goes out in the evening. Right now he’s been in the city eight days, but he’s been at home every evening. He sits here with us at the table and reads the newspaper quietly or studies his travel schedules. It’s a quite a diversion for h
 im to busy himself with fretwork. For instance, he cut out a small frame over the course of two or three evenings. You’d be amazed how pretty it is. It’s hanging right inside the room. You’ll see it immediately, as soon as Gregor opens the door. Anyway, I’m happy that you’re here, Mr. Manager. By ourselves, we would never have made Gregor open the door. He’s so stubborn, and he’s certainly not well, although he denied that this morning.” “I’m coming right away,” said Gregor slowly and deliberately and didn’t move, so as not to lose one word of the conversation. “My dear lady, I cannot explain it to myself in any other way,” said the manager; “I hope it is nothing serious. On the other hand, I must also say that we business people, luckily or unluckily, however one looks at it, very often simply have to overcome a slight indisposition for business reasons.” “So can Mr. Manager come in to see 
 you now?” asked his father impatiently and knocked once again on the door. “No,” said Gregor. In the neighbouring room on the left an awkward stillness descended. In the neighbouring room on the right the sister began to sob.
+Why did his sister not go to the others? She had probably just got up out of bed now and had not even started to get dressed yet. Then why was she crying? Because he was not getting up and letting the manager in, because he was in danger of losing his position, and because then his boss would badger his parents once again with the old demands? Those were probably unnecessary worries right now. Gregor was still here and was not thinking at all about abandoning his family. At the moment he was lying right there on the carpet, and no one who knew about his condition would have seriously demanded that he let the manager in. But Gregor would not be casually dismissed right way because of this small discourtesy, for which he would find an easy and suitable excuse later on. It seemed to Gregor that it might be far more reasonable to leave him in peace at the moment, instead of disturbing him with crying and conversation. But it was the very uncertainty which distressed the others a
 nd excused their behaviour.

Modified: hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java
URL: http://svn.apache.org/viewvc/hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java?rev=986973&r1=986972&r2=986973&view=diff
==============================================================================
--- hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java (original)
+++ hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java Wed Aug 18 22:59:32 2010
@@ -133,6 +133,7 @@ import org.apache.hadoop.hive.ql.udf.UDF
 import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFAverage;
 import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFBridge;
 import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFCollectSet;
+import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFContextNGrams;
 import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFCount;
 import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFCovariance;
 import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFCovarianceSample;
@@ -376,6 +377,7 @@ public final class FunctionRegistry {
     registerGenericUDAF("collect_set", new GenericUDAFCollectSet());
 
     registerGenericUDAF("ngrams", new GenericUDAFnGrams());
+    registerGenericUDAF("context_ngrams", new GenericUDAFContextNGrams());
 
     registerUDAF("percentile", UDAFPercentile.class);
 

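
Before the new evaluator source below, it may help to sketch the other half of the feature: the constant-space top-k heuristic. As the class comment and @Description state, the number of tracked n-grams is bounded by the precision factor 'pf', so a larger pf buys accuracy at the cost of memory. The sketch below illustrates that idea with exact counting plus periodic trimming of the rarest entries; the names are hypothetical and this is not the committed NGramEstimator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a bounded-memory top-k frequency heuristic:
// the count map is capped at k * pf entries, evicting the rarest items
// whenever it grows past the cap.
public class TopKSketch {
    public static List<String> topK(List<String> items, int k, int pf) {
        Map<String, Integer> counts = new HashMap<>();
        for (String item : items) {
            counts.merge(item, 1, Integer::sum);
            if (counts.size() > k * pf) {
                trim(counts, k * pf);   // evict the rarest entries
            }
        }
        trim(counts, k);                // keep only the final top-k
        List<String> result = new ArrayList<>(counts.keySet());
        result.sort((a, b) -> counts.get(b) - counts.get(a));
        return result;
    }

    // Keeps the 'limit' most frequent entries, dropping the rest.
    private static void trim(Map<String, Integer> counts, int limit) {
        if (counts.size() <= limit) {
            return;
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> b.getValue() - a.getValue());
        Map<String, Integer> kept = new HashMap<>();
        for (Map.Entry<String, Integer> e : entries.subList(0, limit)) {
            kept.put(e.getKey(), e.getValue());
        }
        counts.clear();
        counts.putAll(kept);
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("a", "b", "a", "c", "a", "b", "d", "a", "b", "c");
        System.out.println(topK(words, 2, 2)); // two most frequent words: [a, b]
    }
}
```

The trade-off the sketch makes visible: with a small pf, an item whose count is still low when the map fills up can be evicted and its earlier occurrences forgotten, which is exactly why the @Description says larger pf values yield better accuracy but use more memory.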
Added: hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFContextNGrams.java
URL: http://svn.apache.org/viewvc/hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFContextNGrams.java?rev=986973&view=auto
==============================================================================
--- hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFContextNGrams.java (added)
+++ hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFContextNGrams.java Wed Aug 18 22:59:32 2010
@@ -0,0 +1,422 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.ql.udf.generic;
+
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Iterator;
+import java.util.Set;
+import java.util.Map;
+import java.util.Collections;
+
+import org.apache.commons.logging.Log;
+import org.apache.commons.logging.LogFactory;
+import org.apache.hadoop.hive.ql.exec.Description;
+import org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+import org.apache.hadoop.hive.ql.parse.SemanticException;
+import org.apache.hadoop.hive.serde2.io.DoubleWritable;
+import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
+import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
+import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
+import org.apache.hadoop.hive.serde2.objectinspector.StandardMapObjectInspector;
+import org.apache.hadoop.hive.serde2.objectinspector.StandardListObjectInspector;
+import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
+import org.apache.hadoop.hive.serde2.objectinspector.StructField;
+import org.apache.hadoop.hive.serde2.objectinspector.primitive.DoubleObjectInspector;
+import org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableDoubleObjectInspector;
+import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
+import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorUtils;
+import org.apache.hadoop.hive.serde2.typeinfo.PrimitiveTypeInfo;
+import org.apache.hadoop.hive.serde2.typeinfo.ListTypeInfo;
+import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;
+import org.apache.hadoop.util.StringUtils;
+import org.apache.hadoop.io.Text;
+
+/**
+ * Estimates the top-k contextual n-grams in arbitrary sequential data using a heuristic.
+ */
+@Description(name = "context_ngrams",
+    value = "_FUNC_(expr, array<string1, string2, ...>, k, pf) estimates the top-k most " +
+      "frequent n-grams that fit into the specified context. The second parameter is an " +
+      "array of words that specifies the positions of the n-gram elements, with a null " +
+      "value standing in for a 'blank' that must be filled by an n-gram element.",
+    extended = "The primary expression must be an array of strings, or an array of arrays of " +
+      "strings, such as the return type of the sentences() UDF. The second parameter specifies " +
+      "the context -- for example, array(\"i\", \"love\", null) -- which would estimate the top " +
+      "'k' words that follow the phrase \"i love\" in the primary expression. The optional " +
+      "fourth parameter 'pf' controls the memory used by the heuristic. Larger values will " +
+      "yield better accuracy, but use more memory. Example usage:\n" +
+      "  SELECT context_ngrams(sentences(lower(review)), array(\"i\", \"love\", null, null), 10)" +
+      " FROM movies\n" +
+      "would attempt to determine the 10 most common two-word phrases that follow \"i love\" " +
+      "in a database of free-form natural language movie reviews.")
+public class GenericUDAFContextNGrams implements GenericUDAFResolver {
+  static final Log LOG = LogFactory.getLog(GenericUDAFContextNGrams.class.getName());
+
+  @Override
+  public GenericUDAFEvaluator getEvaluator(TypeInfo[] parameters) throws SemanticException {
+    if (parameters.length != 3 && parameters.length != 4) {
+      throw new UDFArgumentTypeException(parameters.length-1,
+          "Please specify either three or four arguments.");
+    }
+    
+    // Validate the first parameter, which is the expression to compute over. This should be an
+    // array of strings type, or an array of arrays of strings.
+    PrimitiveTypeInfo pti;
+    if (parameters[0].getCategory() != ObjectInspector.Category.LIST) {
+      throw new UDFArgumentTypeException(0,
+          "Only list type arguments are accepted but "
+          + parameters[0].getTypeName() + " was passed as parameter 1.");
+    }
+    switch (((ListTypeInfo) parameters[0]).getListElementTypeInfo().getCategory()) {
+    case PRIMITIVE:
+      // Parameter 1 was an array of primitives, so make sure the primitives are strings.
+      pti = (PrimitiveTypeInfo) ((ListTypeInfo) parameters[0]).getListElementTypeInfo();
+      break;
+
+    case LIST:
+      // Parameter 1 was an array of arrays, so make sure that the inner arrays contain
+      // primitive strings.
+      ListTypeInfo lti = (ListTypeInfo)
+                         ((ListTypeInfo) parameters[0]).getListElementTypeInfo();
+      pti = (PrimitiveTypeInfo) lti.getListElementTypeInfo();
+      break;
+
+    default:
+      throw new UDFArgumentTypeException(0,
+          "Only arrays of strings or arrays of arrays of strings are accepted but "
+          + parameters[0].getTypeName() + " was passed as parameter 1.");
+    }
+    if(pti.getPrimitiveCategory() != PrimitiveObjectInspector.PrimitiveCategory.STRING) {
+      throw new UDFArgumentTypeException(0,
+          "Only array<string> or array<array<string>> is allowed, but " 
+          + parameters[0].getTypeName() + " was passed as parameter 1.");
+    }
+
+    // Validate the second parameter, which should be an array of strings
+    if(parameters[1].getCategory() != ObjectInspector.Category.LIST ||
+       ((ListTypeInfo) parameters[1]).getListElementTypeInfo().getCategory() !=
+         ObjectInspector.Category.PRIMITIVE) {
+      throw new UDFArgumentTypeException(1, "Only arrays of strings are accepted but "
+          + parameters[1].getTypeName() + " was passed as parameter 2.");
+    } 
+    if(((PrimitiveTypeInfo) ((ListTypeInfo)parameters[1]).getListElementTypeInfo()).
+        getPrimitiveCategory() != PrimitiveObjectInspector.PrimitiveCategory.STRING) {
+      throw new UDFArgumentTypeException(1, "Only arrays of strings are accepted but "
+          + parameters[1].getTypeName() + " was passed as parameter 2.");
+    }
+
+    // Validate the third parameter, which should be an integer to represent 'k'
+    if(parameters[2].getCategory() != ObjectInspector.Category.PRIMITIVE) {
+      throw new UDFArgumentTypeException(2, "Only integers are accepted but "
+            + parameters[2].getTypeName() + " was passed as parameter 3.");
+    } 
+    switch(((PrimitiveTypeInfo) parameters[2]).getPrimitiveCategory()) {
+    case BYTE:
+    case SHORT:
+    case INT:
+    case LONG:
+      break;
+
+    default:
+      throw new UDFArgumentTypeException(2, "Only integers are accepted but "
+            + parameters[2].getTypeName() + " was passed as parameter 3.");
+    }
+
+    // If the fourth parameter -- precision factor 'pf' -- has been specified, make sure it's
+    // an integer.
+    if(parameters.length == 4) {
+      if(parameters[3].getCategory() != ObjectInspector.Category.PRIMITIVE) {
+        throw new UDFArgumentTypeException(3, "Only integers are accepted but "
+            + parameters[3].getTypeName() + " was passed as parameter 4.");
+      } 
+      switch(((PrimitiveTypeInfo) parameters[3]).getPrimitiveCategory()) {
+      case BYTE:
+      case SHORT:
+      case INT:
+      case LONG:
+        break;
+
+      default:
+        throw new UDFArgumentTypeException(3, "Only integers are accepted but "
+            + parameters[3].getTypeName() + " was passed as parameter 4.");
+      }
+    }
+
+    return new GenericUDAFContextNGramEvaluator();
+  }
+
+  /**
+   * A constant-space heuristic to estimate the top-k contextual n-grams.
+   */
+  public static class GenericUDAFContextNGramEvaluator extends GenericUDAFEvaluator {
+    // For PARTIAL1 and COMPLETE: ObjectInspectors for original data
+    private StandardListObjectInspector outerInputOI;
+    private StandardListObjectInspector innerInputOI;
+    private StandardListObjectInspector contextListOI;
+    private PrimitiveObjectInspector contextOI;
+    private PrimitiveObjectInspector inputOI;
+    private PrimitiveObjectInspector kOI;
+    private PrimitiveObjectInspector pOI;
+
+    // For PARTIAL2 and FINAL: ObjectInspectors for partial aggregations 
+    private StandardListObjectInspector loi;
+
+    @Override
+    public ObjectInspector init(Mode m, ObjectInspector[] parameters) throws HiveException {
+      super.init(m, parameters);
+
+      // Init input object inspectors
+      if (m == Mode.PARTIAL1 || m == Mode.COMPLETE) {
+        outerInputOI = (StandardListObjectInspector) parameters[0];
+        if(outerInputOI.getListElementObjectInspector().getCategory() ==
+            ObjectInspector.Category.LIST) {
+          // We're dealing with input that is an array of arrays of strings
+          innerInputOI = (StandardListObjectInspector) outerInputOI.getListElementObjectInspector();
+          inputOI = (PrimitiveObjectInspector) innerInputOI.getListElementObjectInspector();
+        } else {
+          // We're dealing with input that is an array of strings
+          inputOI = (PrimitiveObjectInspector) outerInputOI.getListElementObjectInspector();
+          innerInputOI = null;
+        }
+        contextListOI = (StandardListObjectInspector) parameters[1];
+        contextOI = (PrimitiveObjectInspector) contextListOI.getListElementObjectInspector();
+        kOI = (PrimitiveObjectInspector) parameters[2];
+        if(parameters.length == 4) {
+          pOI = (PrimitiveObjectInspector) parameters[3];
+        } else {
+          pOI = null;
+        }
+      } else {
+          // Init the list object inspector for handling partial aggregations
+          loi = (StandardListObjectInspector) parameters[0];
+      }
+
+      // Init output object inspectors.
+      //
+      // The return type for a partial aggregation is still a list of strings.
+      // 
+      // The return type for FINAL and COMPLETE is a full aggregation result, which is 
+      // an array of structures containing the n-gram and its estimated frequency.
+      if (m == Mode.PARTIAL1 || m == Mode.PARTIAL2) {
+        return ObjectInspectorFactory.getStandardListObjectInspector(
+            PrimitiveObjectInspectorFactory.writableStringObjectInspector);
+      } else {
+        // Final return type that goes back to Hive: a list of structs with n-grams and their
+        // estimated frequencies.
+        ArrayList<ObjectInspector> foi = new ArrayList<ObjectInspector>();
+        foi.add(ObjectInspectorFactory.getStandardListObjectInspector(
+                  PrimitiveObjectInspectorFactory.writableStringObjectInspector));
+        foi.add(PrimitiveObjectInspectorFactory.writableDoubleObjectInspector);
+        ArrayList<String> fname = new ArrayList<String>();
+        fname.add("ngram");
+        fname.add("estfrequency");               
+        return ObjectInspectorFactory.getStandardListObjectInspector(
+                 ObjectInspectorFactory.getStandardStructObjectInspector(fname, foi) );
+      }
+    }
+
+    @Override
+    public void merge(AggregationBuffer agg, Object obj) throws HiveException {
+      if(obj == null) { 
+        return;
+      }
+      NGramAggBuf myagg = (NGramAggBuf) agg;
+      List<Text> partial = (List<Text>) loi.getList(obj);
+
+      // remove the context words from the end of the list
+      int contextSize = Integer.parseInt( ((Text)partial.get(partial.size()-1)).toString() );
+      partial.remove(partial.size()-1);
+      if(myagg.context.size() > 0)  {
+        if(contextSize != myagg.context.size()) {
+          throw new HiveException(getClass().getSimpleName() + ": found a mismatch in the" +
+              " context string lengths. This is usually caused by passing a non-constant" +
+              " expression for the context.");
+        }
+      } else {
+        for(int i = partial.size()-contextSize; i < partial.size(); i++) {
+          String word = partial.get(i).toString();
+          if(word.equals("")) {
+            myagg.context.add( null );
+          } else {
+            myagg.context.add( word );
+          }
+        }
+      }
+      // strip the context words off the end of the partial, then merge the serialized
+      // n-gram estimation whether or not the context was already known
+      partial.subList(partial.size()-contextSize, partial.size()).clear();
+      myagg.nge.merge(partial);
+    }
+
+    @Override
+    public Object terminatePartial(AggregationBuffer agg) throws HiveException {
+      NGramAggBuf myagg = (NGramAggBuf) agg;
+      ArrayList<Text> result = myagg.nge.serialize();
+
+      // push the context on to the end of the serialized n-gram estimation
+      for(int i = 0; i < myagg.context.size(); i++) {
+        if(myagg.context.get(i) == null) {
+          result.add(new Text(""));
+        } else {
+          result.add(new Text(myagg.context.get(i)));
+        }
+      }
+      result.add(new Text(Integer.toString(myagg.context.size())));
+
+      return result;
+    }
+
+    // Finds all contextual n-grams in a sequence of words, and passes the n-grams to the
+    // n-gram estimator object
+    private void processNgrams(NGramAggBuf agg, ArrayList<String> seq) throws HiveException {
+      // generate n-grams wherever the context matches
+      assert(agg.context.size() > 0);
+      ArrayList<String> ng = new ArrayList<String>();
+      for(int i = seq.size() - agg.context.size(); i >= 0; i--) {
+        // check if the context matches
+        boolean contextMatches = true;
+        ng.clear();
+        for(int j = 0; j < agg.context.size(); j++) {
+          String contextWord = agg.context.get(j);
+          if(contextWord == null) {
+            ng.add(seq.get(i+j));
+          } else {
+            if(!contextWord.equals(seq.get(i+j))) {
+              contextMatches = false;
+              break;
+            }
+          }
+        }
+
+        // add to n-gram estimation only if the context matches
+        if(contextMatches) {
+          agg.nge.add(ng);
+          ng = new ArrayList<String>();
+        }
+      }
+    }
+
+    @Override
+    public void iterate(AggregationBuffer agg, Object[] parameters) throws HiveException {
+      assert (parameters.length == 3 || parameters.length == 4);
+      if(parameters[0] == null || parameters[1] == null || parameters[2] == null) {
+        return;
+      }
+      NGramAggBuf myagg = (NGramAggBuf) agg;
+    
+      // Parse out the context and 'k' if we haven't already done so, and while we're at it,
+      // also parse out the precision factor 'pf' if the user has supplied one.
+      if(!myagg.nge.isInitialized()) {
+        int k = PrimitiveObjectInspectorUtils.getInt(parameters[2], kOI);
+        int pf = 0;
+        if(k < 1) {
+          throw new HiveException(getClass().getSimpleName() + " needs 'k' to be at least 1, "
+                                  + "but you supplied " + k);
+        }
+        if(parameters.length == 4) {
+          pf = PrimitiveObjectInspectorUtils.getInt(parameters[3], pOI);
+          if(pf < 1) {
+            throw new HiveException(getClass().getSimpleName() + " needs 'pf' to be at least 1, "
+                + "but you supplied " + pf);
+          }
+        } else {
+          pf = 1; // placeholder; minimum pf value is enforced in NGramEstimator
+        }
+
+        // Parse out the context and make sure it isn't empty
+        myagg.context.clear();
+        List<Text> context = (List<Text>) contextListOI.getList(parameters[1]);
+        int contextNulls = 0;
+        for(int i = 0; i < context.size(); i++) {
+          String word = PrimitiveObjectInspectorUtils.getString(context.get(i), contextOI);
+          if(word == null) {
+            contextNulls++;
+          }
+          myagg.context.add(word);
+        }
+        if(context.size() == 0) {
+          throw new HiveException(getClass().getSimpleName() + " needs a context array " +
+            "with at least one element.");
+        }
+        if(contextNulls == 0) {
+          throw new HiveException(getClass().getSimpleName() + ": the context array needs to " +
+            "contain at least one 'null' value to indicate what should be counted.");
+        }
+
+        // Set parameters in the n-gram estimator object
+        myagg.nge.initialize(k, pf, contextNulls);
+      }
+
+      // get the input expression
+      List<Text> outer = (List<Text>) outerInputOI.getList(parameters[0]);
+      if(innerInputOI != null) {
+        // we're dealing with an array of arrays of strings
+        for(int i = 0; i < outer.size(); i++) {
+          List<Text> inner = (List<Text>) innerInputOI.getList(outer.get(i));
+          ArrayList<String> words = new ArrayList<String>();
+          for(int j = 0; j < inner.size(); j++) {
+            String word = PrimitiveObjectInspectorUtils.getString(inner.get(j), inputOI);
+            words.add(word);
+          }
+
+          // parse out n-grams, update frequency counts
+          processNgrams(myagg, words);
+        } 
+      } else {
+        // we're dealing with an array of strings
+        ArrayList<String> words = new ArrayList<String>();
+        for(int i = 0; i < outer.size(); i++) {
+          String word = PrimitiveObjectInspectorUtils.getString(outer.get(i), inputOI);
+          words.add(word);
+        }
+
+        // parse out n-grams, update frequency counts
+        processNgrams(myagg, words);
+      }
+    }
+
+    @Override
+    public Object terminate(AggregationBuffer agg) throws HiveException {
+      NGramAggBuf myagg = (NGramAggBuf) agg;
+      return myagg.nge.getNGrams();
+    }
+
+
+    // Aggregation buffer methods. 
+    static class NGramAggBuf implements AggregationBuffer {
+      ArrayList<String> context;
+      NGramEstimator nge;
+    };
+
+    @Override
+    public AggregationBuffer getNewAggregationBuffer() throws HiveException {
+      NGramAggBuf result = new NGramAggBuf();
+      result.nge = new NGramEstimator();
+      result.context = new ArrayList<String>();
+      reset(result);
+      return result;
+    }
+
+    @Override
+    public void reset(AggregationBuffer agg) throws HiveException {
+      NGramAggBuf result = (NGramAggBuf) agg;
+      result.context.clear();
+      result.nge.reset();
+    }
+  }
+}
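[Editorial note: the context-matching step in processNgrams() above can be sketched in isolation. The following is a minimal, hypothetical example -- ContextMatchSketch is not part of this patch -- assuming, as in the UDAF, that the context is a list of fixed words with null entries acting as wildcard slots whose words are collected into n-grams.]

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Standalone sketch of contextual n-gram matching: slide a window over the
// sentence; wherever every non-null context word lines up, emit the words
// found at the null (wildcard) positions as one n-gram.
public class ContextMatchSketch {
    public static List<List<String>> match(List<String> seq, List<String> context) {
        List<List<String>> out = new ArrayList<>();
        for (int i = 0; i + context.size() <= seq.size(); i++) {
            List<String> ng = new ArrayList<>();
            boolean matches = true;
            for (int j = 0; j < context.size(); j++) {
                String c = context.get(j);
                if (c == null) {
                    ng.add(seq.get(i + j));       // wildcard slot: collect the word
                } else if (!c.equals(seq.get(i + j))) {
                    matches = false;              // fixed slot mismatch: no n-gram here
                    break;
                }
            }
            if (matches) {
                out.add(ng);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> sentence = Arrays.asList("i", "love", "food", "i", "love", "hive");
        // context ["i", "love", null]: count the word that follows "i love"
        List<String> context = Arrays.asList("i", "love", null);
        System.out.println(match(sentence, context)); // [[food], [hive]]
    }
}
```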

Modified: hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFHistogramNumeric.java
URL: http://svn.apache.org/viewvc/hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFHistogramNumeric.java?rev=986973&r1=986972&r2=986973&view=diff
==============================================================================
--- hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFHistogramNumeric.java (original)
+++ hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFHistogramNumeric.java Wed Aug 18 22:59:32 2010
@@ -18,7 +18,7 @@
 package org.apache.hadoop.hive.ql.udf.generic;
 
 import java.util.ArrayList;
-import java.util.Arrays;
+import java.util.List;
 import java.util.Random;
 
 import org.apache.commons.logging.Log;
@@ -60,8 +60,8 @@ import org.apache.hadoop.util.StringUtil
              + "histogram bins appear to work well, with more bins being required for skewed or "
              + "smaller datasets. Note that this function creates a histogram with non-uniform "
              + "bin widths. It offers no guarantees in terms of the mean-squared-error of the "
-             + "histogram, but in practice is comparable to the histograms produced by the R/S-Plus "
-             + "statistical computing packages." )
+             + "histogram, but in practice is comparable to the histograms produced by the "
+             + "R/S-Plus statistical computing packages.")
 public class GenericUDAFHistogramNumeric extends AbstractGenericUDAFResolver {
   // class static variables
   static final Log LOG = LogFactory.getLog(GenericUDAFHistogramNumeric.class.getName());
@@ -199,7 +199,7 @@ public class GenericUDAFHistogramNumeric
       if(partial == null) {
         return;
       }
-      ArrayList partialHistogram = (ArrayList) loi.getList(partial);
+      List<DoubleWritable> partialHistogram = (List<DoubleWritable>) loi.getList(partial);
       StdAgg myagg = (StdAgg) agg;
       myagg.histogram.merge(partialHistogram);
     }

Modified: hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFPercentileApprox.java
URL: http://svn.apache.org/viewvc/hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFPercentileApprox.java?rev=986973&r1=986972&r2=986973&view=diff
==============================================================================
--- hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFPercentileApprox.java (original)
+++ hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFPercentileApprox.java Wed Aug 18 22:59:32 2010
@@ -18,6 +18,7 @@
 package org.apache.hadoop.hive.ql.udf.generic;
 
 import java.util.ArrayList;
+import java.util.List;
 
 import org.apache.commons.logging.Log;
 import org.apache.commons.logging.LogFactory;
@@ -283,7 +284,7 @@ public class GenericUDAFPercentileApprox
         return;
       }
       PercentileAggBuf myagg = (PercentileAggBuf) agg;
-      ArrayList<DoubleWritable> partialHistogram = (ArrayList<DoubleWritable>) loi.getList(partial);
+      List<DoubleWritable> partialHistogram = (List<DoubleWritable>) loi.getList(partial);
 
       // remove requested quantiles from the head of the list
       int nquantiles = (int) partialHistogram.get(0).get();

Modified: hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFnGrams.java
URL: http://svn.apache.org/viewvc/hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFnGrams.java?rev=986973&r1=986972&r2=986973&view=diff
==============================================================================
--- hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFnGrams.java (original)
+++ hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFnGrams.java Wed Aug 18 22:59:32 2010
@@ -18,12 +18,7 @@
 package org.apache.hadoop.hive.ql.udf.generic;
 
 import java.util.ArrayList;
-import java.util.HashMap;
-import java.util.Iterator;
-import java.util.Set;
-import java.util.Map;
-import java.util.Collections;
-import java.util.Comparator;
+import java.util.List;
 
 import org.apache.commons.logging.Log;
 import org.apache.commons.logging.LogFactory;
@@ -238,111 +233,39 @@ public class GenericUDAFnGrams implement
         return;
       }
       NGramAggBuf myagg = (NGramAggBuf) agg;
-
-      ArrayList partialNGrams = (ArrayList) loi.getList(partial);
-      int k = Integer.parseInt(((Text)partialNGrams.get(0)).toString());
-      int n = Integer.parseInt(((Text)partialNGrams.get(1)).toString());
-      int pf = Integer.parseInt(((Text)partialNGrams.get(2)).toString());
-      if(myagg.k > 0 && myagg.k != k) {
-        throw new HiveException(getClass().getSimpleName() + ": mismatch in value for 'k'" 
-            + ", which usually is caused by a non-constant expression. Found '"+k+"' and '"
-            + myagg.k + "'.");
-      }
+      List<Text> partialNGrams = (List<Text>) loi.getList(partial);
+      int n = Integer.parseInt(partialNGrams.get(partialNGrams.size()-1).toString());
       if(myagg.n > 0 && myagg.n != n) {
         throw new HiveException(getClass().getSimpleName() + ": mismatch in value for 'n'" 
             + ", which usually is caused by a non-constant expression. Found '"+n+"' and '"
             + myagg.n + "'.");
       }
-      if(myagg.pf > 0 && myagg.pf != pf) {
-        throw new HiveException(getClass().getSimpleName() + ": mismatch in value for 'pf'" 
-            + ", which usually is caused by a non-constant expression. Found '"+pf+"' and '"
-            + myagg.pf + "'.");
-      }
-      myagg.k = k;
       myagg.n = n;
-      myagg.pf = pf;
-
-      for(int i = 3; i < partialNGrams.size(); i++) {
-        ArrayList<String> key = new ArrayList<String>();
-        for(int j = 0; j < n; j++) {
-          key.add(((Text)partialNGrams.get(i+j)).toString());
-        }
-        i += n;
-        double val = Double.parseDouble( ((Text)partialNGrams.get(i)).toString() );
-        Double myval = (Double)myagg.ngrams.get(key);
-        if(myval == null) {
-          myval = new Double(val);
-        } else {
-          myval += val;
-        }
-        myagg.ngrams.put(key, myval);
-      }
-      trim(myagg, myagg.k*myagg.pf);
+      partialNGrams.remove(partialNGrams.size()-1);
+      myagg.nge.merge(partialNGrams);
     }
 
     @Override
     public Object terminatePartial(AggregationBuffer agg) throws HiveException {
       NGramAggBuf myagg = (NGramAggBuf) agg;
-
-      ArrayList<Text> result = new ArrayList<Text>();
-      result.add(new Text(Integer.toString(myagg.k)));
+      ArrayList<Text> result = myagg.nge.serialize();
       result.add(new Text(Integer.toString(myagg.n)));
-      result.add(new Text(Integer.toString(myagg.pf)));
-      for(Iterator<ArrayList<String> > it = myagg.ngrams.keySet().iterator(); it.hasNext(); ) {
-        ArrayList<String> mykey = it.next();
-        for(int i = 0; i < mykey.size(); i++) {
-          result.add(new Text(mykey.get(i)));
-        }
-        Double myval = (Double) myagg.ngrams.get(mykey);
-        result.add(new Text(myval.toString()));
-      }
-
       return result;
     }
 
-    private void trim(NGramAggBuf agg, int N) {
-      ArrayList list = new ArrayList(agg.ngrams.entrySet());
-      if(list.size() <= N) {
-        return;
-      }
-      Collections.sort(list, new Comparator() {
-          public int compare(Object o1, Object o2) {
-          return ((Double)((Map.Entry)o1).getValue()).compareTo(
-            ((Double)((Map.Entry)o2).getValue()) );
-          }
-          });
-      for(int i = 0; i < list.size() - N; i++) {
-        agg.ngrams.remove( ((Map.Entry)list.get(i)).getKey() );
-      }
-    }
-
-    private void processNgrams(NGramAggBuf agg, ArrayList<String> seq) {
+    private void processNgrams(NGramAggBuf agg, ArrayList<String> seq) throws HiveException {
       for(int i = seq.size()-agg.n; i >= 0; i--) {
         ArrayList<String> ngram = new ArrayList<String>();
         for(int j = 0; j < agg.n; j++)  {
           ngram.add(seq.get(i+j));
         }
-        Double curVal = (Double) agg.ngrams.get(ngram);
-        if(curVal == null) {
-          // new n-gram
-          curVal = new Double(1);
-        } else {
-          // existing n-gram, just increment count
-          curVal++;
-        }
-        agg.ngrams.put(ngram, curVal);
-      }
-
-      // do we have too many ngrams? 
-      if(agg.ngrams.size() > agg.k * agg.pf) {
-        // delete low-support n-grams
-        trim(agg, agg.k * agg.pf);
+        agg.nge.add(ngram);
       }
     }
 
     @Override
     public void iterate(AggregationBuffer agg, Object[] parameters) throws HiveException {
-      assert (parameters.length == 3);
+      assert (parameters.length == 3 || parameters.length == 4);
       if(parameters[0] == null || parameters[1] == null || parameters[2] == null) {
         return;
       }
@@ -350,37 +273,39 @@ public class GenericUDAFnGrams implement
     
       // Parse out 'n' and 'k' if we haven't already done so, and while we're at it,
       // also parse out the precision factor 'pf' if the user has supplied one.
-      if(myagg.n == 0 || myagg.k == 0) {
-        myagg.n = PrimitiveObjectInspectorUtils.getInt(parameters[1], nOI);
-        myagg.k = PrimitiveObjectInspectorUtils.getInt(parameters[2], kOI);
-        if(myagg.n < 1) {
+      if(!myagg.nge.isInitialized()) {
+        int n = PrimitiveObjectInspectorUtils.getInt(parameters[1], nOI);
+        int k = PrimitiveObjectInspectorUtils.getInt(parameters[2], kOI);
+        int pf = 0;
+        if(n < 1) {
           throw new HiveException(getClass().getSimpleName() + " needs 'n' to be at least 1, "
-                                  + "but you supplied " + myagg.n);
+                                  + "but you supplied " + n);
         }
-        if(myagg.k < 1) {
+        if(k < 1) {
           throw new HiveException(getClass().getSimpleName() + " needs 'k' to be at least 1, "
-                                  + "but you supplied " + myagg.k);
+                                  + "but you supplied " + k);
         }
         if(parameters.length == 4) {
-          myagg.pf = PrimitiveObjectInspectorUtils.getInt(parameters[3], pOI);
-          if(myagg.pf < 1) {
+          pf = PrimitiveObjectInspectorUtils.getInt(parameters[3], pOI);
+          if(pf < 1) {
             throw new HiveException(getClass().getSimpleName() + " needs 'pf' to be at least 1, "
-                + "but you supplied " + myagg.pf);
+                + "but you supplied " + pf);
           }
+        } else {
+          pf = 1; // placeholder; minimum pf value is enforced in NGramEstimator
         }
 
-        // Enforce a minimum n-gram buffer size
-        if(myagg.pf*myagg.k < 1000) {
-          myagg.pf = 1000 / myagg.k;
-        }
+        // Set the parameters
+        myagg.n = n;
+        myagg.nge.initialize(k, pf, n);
       }
 
       // get the input expression
-      ArrayList outer = (ArrayList) outerInputOI.getList(parameters[0]);
+      List<Text> outer = (List<Text>) outerInputOI.getList(parameters[0]);
       if(innerInputOI != null) {
         // we're dealing with an array of arrays of strings
         for(int i = 0; i < outer.size(); i++) {
-          ArrayList inner = (ArrayList) innerInputOI.getList(outer.get(i));
+          List<Text> inner = (List<Text>) innerInputOI.getList(outer.get(i));
           ArrayList<String> words = new ArrayList<String>();
           for(int j = 0; j < inner.size(); j++) {
             String word = PrimitiveObjectInspectorUtils.getString(inner.get(j), inputOI);
@@ -406,48 +331,19 @@ public class GenericUDAFnGrams implement
     @Override
     public Object terminate(AggregationBuffer agg) throws HiveException {
       NGramAggBuf myagg = (NGramAggBuf) agg;
-      if (myagg.ngrams.size() < 1) { // SQL standard - return null for zero elements
-        return null;
-      } 
-
-      ArrayList<Object[]> result = new ArrayList<Object[]>();
-
-      ArrayList list = new ArrayList(myagg.ngrams.entrySet());
-      Collections.sort(list, new Comparator() {
-          public int compare(Object o1, Object o2) {
-          return ((Double)((Map.Entry)o2).getValue()).compareTo(
-            ((Double)((Map.Entry)o1).getValue()) );
-          }
-          });
-
-      for(int i = 0; i < list.size() && i < myagg.k; i++) {
-        ArrayList<String> key = (ArrayList<String>)((Map.Entry)list.get(i)).getKey();
-        Double val = (Double)((Map.Entry)list.get(i)).getValue();
-
-        Object[] ngram = new Object[2];
-        ngram[0] = new ArrayList<Text>();
-        for(int j = 0; j < key.size(); j++) {
-          ((ArrayList<Text>)ngram[0]).add(new Text(key.get(j)));
-        }
-        ngram[1] = new DoubleWritable(val.doubleValue());
-        result.add(ngram);
-      }
-
-      return result;
+      return myagg.nge.getNGrams();
     }
 
-
     // Aggregation buffer methods. 
     static class NGramAggBuf implements AggregationBuffer {
-      HashMap ngrams;
+      NGramEstimator nge;
       int n;
-      int k;
-      int pf;
     };
 
     @Override
     public AggregationBuffer getNewAggregationBuffer() throws HiveException {
       NGramAggBuf result = new NGramAggBuf();
+      result.nge = new NGramEstimator();
       reset(result);
       return result;
     }
@@ -455,8 +351,8 @@ public class GenericUDAFnGrams implement
     @Override
     public void reset(AggregationBuffer agg) throws HiveException {
       NGramAggBuf result = (NGramAggBuf) agg;
-      result.ngrams = new HashMap();
-      result.n = result.k = result.pf = 0;
+      result.nge.reset();
+      result.n = 0;
     }
   }
 }
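[Editorial note: the refactored merge() above consumes the serialized layout produced by NGramEstimator.serialize() -- a header of k, n, pf followed by groups of n words and a count, with GenericUDAFnGrams appending its own 'n' at the end. A hedged, standalone sketch of decoding the estimator portion (PartialLayoutSketch is hypothetical, not patch code):]

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Decodes a serialized partial aggregation shaped like
// [k, n, pf, w1..wn, count, w1..wn, count, ...] back into a frequency map.
public class PartialLayoutSketch {
    public static Map<List<String>, Double> parse(List<String> partial) {
        int n = Integer.parseInt(partial.get(1));          // header order: k, n, pf
        Map<List<String>, Double> out = new HashMap<>();
        for (int i = 3; i < partial.size(); ) {
            List<String> gram = new ArrayList<>(partial.subList(i, i + n));
            i += n;                                        // skip past the n words
            out.merge(gram, Double.parseDouble(partial.get(i)), Double::sum);
            i++;                                           // skip past the count
        }
        return out;
    }

    public static void main(String[] args) {
        // k=2, n=2, pf=1, then two bigrams with their counts
        List<String> partial = Arrays.asList("2", "2", "1",
                "hello", "world", "3.0", "good", "night", "1.0");
        System.out.println(parse(partial));
    }
}
```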

Added: hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/NGramEstimator.java
URL: http://svn.apache.org/viewvc/hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/NGramEstimator.java?rev=986973&view=auto
==============================================================================
--- hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/NGramEstimator.java (added)
+++ hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/NGramEstimator.java Wed Aug 18 22:59:32 2010
@@ -0,0 +1,276 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.ql.udf.generic;
+
+import java.util.List;
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.Collections;
+import java.util.Iterator;
+import java.util.Comparator;
+import org.apache.hadoop.hive.serde2.io.DoubleWritable;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+import org.apache.commons.logging.Log;
+import org.apache.commons.logging.LogFactory;
+
+/**
+ * A generic, re-usable n-gram estimation class that supports partial aggregations.
+ * The algorithm is based on the heuristic from the following paper:
+ * Yael Ben-Haim and Elad Tom-Tov, "A streaming parallel decision tree algorithm",
+ * J. Machine Learning Research 11 (2010), pp. 849--872. 
+ *
+ * In particular, it is guaranteed that frequencies will be under-counted. With large
+ * data and a reasonable precision factor, this undercounting appears to be on the order
+ * of 5%.
+ */
+public class NGramEstimator {
+  /* Class private variables */
+  private int k;
+  private int pf;
+  private int n;
+  private HashMap<ArrayList<String>, Double> ngrams;
+  
+
+  /**
+   * Creates a new n-gram estimator object. The 'n' for n-grams is computed dynamically
+   * when data is fed to the object. 
+   */
+  public NGramEstimator() {
+    k  = 0;
+    pf = 0;
+    n  = 0;
+    ngrams = new HashMap<ArrayList<String>, Double>();
+  }
+
+  /**
+   * Returns true if the 'k' and 'pf' parameters have been set.
+   */
+  public boolean isInitialized() {
+    return (k != 0);
+  }
+
+  /**
+   * Sets the 'k', 'pf', and 'n' parameters.
+   */
+  public void initialize(int pk, int ppf, int pn) throws HiveException {
+    assert(pk > 0 && ppf > 0 && pn > 0);
+    k = pk;
+    pf = ppf;
+    n = pn;
+
+    // enforce a minimum precision factor
+    if(k * pf < 1000) {
+      pf = 1000 / k;
+    }
+  }
+
+  /**
+   * Resets an n-gram estimator object to its initial state. 
+   */
+  public void reset() {
+    ngrams.clear();
+    n = pf = k = 0;
+  }
+
+  /**
+   * Returns the final top-k n-grams in a format suitable for returning to Hive.
+   */
+  public ArrayList<Object[]> getNGrams() throws HiveException {
+    trim(true);
+    if(ngrams.size() < 1) { // SQL standard - return null for zero elements
+      return null;
+    } 
+
+    // Sort the n-gram list by frequencies in descending order
+    ArrayList<Object[]> result = new ArrayList<Object[]>();
+    ArrayList<Map.Entry<ArrayList<String>, Double>> list =
+        new ArrayList<Map.Entry<ArrayList<String>, Double>>(ngrams.entrySet());
+    Collections.sort(list, new Comparator<Map.Entry<ArrayList<String>, Double>>() {
+      public int compare(Map.Entry<ArrayList<String>, Double> o1, 
+                         Map.Entry<ArrayList<String>, Double> o2) {
+        return o2.getValue().compareTo(o1.getValue());
+      }
+    });
+
+    // Convert the n-gram list to a format suitable for Hive
+    for(int i = 0; i < list.size(); i++) {
+      ArrayList<String> key = list.get(i).getKey();
+      Double val = list.get(i).getValue();
+
+      Object[] curGram = new Object[2];
+      ArrayList<Text> ng = new ArrayList<Text>();
+      for(int j = 0; j < key.size(); j++) {
+        ng.add(new Text(key.get(j)));
+      }
+      curGram[0] = ng;
+      curGram[1] = new DoubleWritable(val.doubleValue());
+      result.add(curGram);
+    }
+
+    return result;    
+  }
+
+  /**
+   * Returns the number of n-grams in our buffer.
+   */
+  public int size() {
+    return ngrams.size();
+  }
+
+  /**
+   * Adds a new n-gram to the estimation.
+   *
+   * @param ng The n-gram to add to the estimation
+   */
+  public void add(ArrayList<String> ng) throws HiveException {
+    assert(ng != null && ng.size() > 0 && ng.get(0) != null);
+    Double curFreq = ngrams.get(ng);
+    if(curFreq == null) {
+      // new n-gram
+      curFreq = new Double(1.0);
+    } else {
+      // existing n-gram, just increment count
+      curFreq++;
+    }
+    ngrams.put(ng, curFreq);
+
+    // set 'n' if we haven't done so before
+    if(n == 0) {
+      n = ng.size();
+    } else {
+      if(n != ng.size()) {
+        throw new HiveException(getClass().getSimpleName() + ": mismatch in value for 'n'" 
+            + ", which usually is caused by a non-constant expression. Found '"+n+"' and '"
+            + ng.size() + "'.");
+      }
+    }
+
+    // Trim down the total number of n-grams if we've exceeded the maximum amount of memory allowed
+    // 
+    // NOTE: Although 'k'*'pf' specifies the size of the estimation buffer, we don't want to keep
+    //       performing N.log(N) trim operations each time the maximum hashmap size is exceeded.
+    //       To handle this, we *actually* maintain an estimation buffer of size 2*'k'*'pf', and
+    //       trim down to 'k'*'pf' whenever the hashmap size exceeds 2*'k'*'pf'. This really has
+    //       a significant effect when 'k'*'pf' is very high.
+    if(ngrams.size() > k * pf * 2) {
+      trim(false);
+    }
+  }
+
+  /**
+   * Trims an n-gram estimation down to either 'pf' * 'k' n-grams, or 'k' n-grams if 
+   * finalTrim is true.
+   */
+  private void trim(boolean finalTrim) throws HiveException {
+    ArrayList<Map.Entry<ArrayList<String>,Double>> list =
+        new ArrayList<Map.Entry<ArrayList<String>,Double>>(ngrams.entrySet());
+    Collections.sort(list, new Comparator<Map.Entry<ArrayList<String>,Double>>() {
+      public int compare(Map.Entry<ArrayList<String>,Double> o1, 
+                         Map.Entry<ArrayList<String>,Double> o2) {
+        return o1.getValue().compareTo(o2.getValue());
+      }
+    });
+    for(int i = 0; i < list.size() - (finalTrim ? k : pf*k); i++) {
+      ngrams.remove( list.get(i).getKey() );
+    }
+  }
+
+  /**
+   * Takes a serialized n-gram estimator object created by the serialize() method and merges
+   * it with the current n-gram object.
+   *
+   * @param other A serialized n-gram object created by the serialize() method
+   * @see #serialize()
+   */
+  public void merge(List<Text> other) throws HiveException {
+    if(other == null) {
+      return;
+    }
+
+    // Get estimation parameters
+    int otherK = Integer.parseInt(other.get(0).toString());
+    int otherN = Integer.parseInt(other.get(1).toString());
+    int otherPF = Integer.parseInt(other.get(2).toString());
+    if(k > 0 && k != otherK) {
+      throw new HiveException(getClass().getSimpleName() + ": mismatch in value for 'k'" 
+          + ", which usually is caused by a non-constant expression. Found '"+k+"' and '"
+          + otherK + "'.");
+    }
+    if(n > 0 && otherN != n) {
+      throw new HiveException(getClass().getSimpleName() + ": mismatch in value for 'n'" 
+          + ", which usually is caused by a non-constant expression. Found '"+n+"' and '"
+          + otherN + "'.");
+    }
+    if(pf > 0 && otherPF != pf) {
+      throw new HiveException(getClass().getSimpleName() + ": mismatch in value for 'pf'" 
+          + ", which usually is caused by a non-constant expression. Found '"+pf+"' and '"
+          + otherPF + "'.");
+    }
+    k = otherK;
+    pf = otherPF;
+    n = otherN;
+
+    // Merge the other estimation into the current one
+    for(int i = 3; i < other.size(); i++) {
+      ArrayList<String> key = new ArrayList<String>();
+      for(int j = 0; j < n; j++) {
+        Text word = other.get(i+j);
+        key.add(word.toString());
+      }
+      i += n;
+      double val = Double.parseDouble( other.get(i).toString() );
+      Double myval = ngrams.get(key);
+      if(myval == null) {
+        myval = new Double(val);
+      } else {
+        myval += val;
+      }
+      ngrams.put(key, myval);      
+    }
+
+    trim(false);
+  }
+
+  /**
+   * In preparation for a Hive merge() call, serializes the current n-gram estimator object into an
+   * ArrayList of Text objects. This list is deserialized and merged by the 
+   * merge method.
+   *
+   * @return An ArrayList of Hadoop Text objects that represents the current
+   * n-gram estimation.
+   * @see #merge(List)
+   */
+  public ArrayList<Text> serialize() throws HiveException {
+    ArrayList<Text> result = new ArrayList<Text>();    
+    result.add(new Text(Integer.toString(k)));
+    result.add(new Text(Integer.toString(n)));
+    result.add(new Text(Integer.toString(pf)));
+    for(Map.Entry<ArrayList<String>, Double> entry : ngrams.entrySet()) {
+      ArrayList<String> mykey = entry.getKey();
+      assert(mykey.size() > 0);
+      for(int i = 0; i < mykey.size(); i++) {
+        result.add(new Text(mykey.get(i)));
+      }
+      result.add(new Text(entry.getValue().toString()));
+    }
+
+
+    return result;
+  }
+}
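[Editor's note, outside the patch: the buffer-trimming heuristic documented in the comments above — grow the frequency map up to 2*'k'*'pf' entries, then drop the lowest-frequency candidates back down to 'k'*'pf' — can be sketched in isolation. The class and method names below (NGramSketch, add, topK) are illustrative only and are not part of the committed NGramEstimator API.]

```java
import java.util.*;

// Minimal sketch of the top-k n-gram estimation buffer: counts accumulate in
// a HashMap until it exceeds 2*k*pf entries, at which point the entries with
// the lowest frequencies are trimmed away down to k*pf candidates.
class NGramSketch {
    private final int k;   // number of top n-grams ultimately requested
    private final int pf;  // precision factor: buffer keeps k*pf candidates
    private final Map<List<String>, Double> ngrams = new HashMap<>();

    NGramSketch(int k, int pf) {
        this.k = k;
        this.pf = pf;
    }

    void add(List<String> ngram) {
        ngrams.merge(ngram, 1.0, Double::sum);
        if (ngrams.size() > 2 * k * pf) {
            trim(k * pf);  // periodic trim keeps memory usage bounded
        }
    }

    // Drop the lowest-frequency entries until only 'limit' remain.
    private void trim(int limit) {
        List<Map.Entry<List<String>, Double>> entries =
            new ArrayList<>(ngrams.entrySet());
        entries.sort(Map.Entry.comparingByValue());
        for (int i = 0; i < entries.size() - limit; i++) {
            ngrams.remove(entries.get(i).getKey());
        }
    }

    // Final answer: the (at most) k highest-frequency n-grams seen.
    List<List<String>> topK() {
        trim(k);
        List<Map.Entry<List<String>, Double>> entries =
            new ArrayList<>(ngrams.entrySet());
        entries.sort(Map.Entry.<List<String>, Double>comparingByValue().reversed());
        List<List<String>> out = new ArrayList<>();
        for (int i = 0; i < Math.min(k, entries.size()); i++) {
            out.add(entries.get(i).getKey());
        }
        return out;
    }
}
```

Trimming only when the buffer doubles (rather than on every insert) amortizes the O(m log m) sort cost, which matters when 'k'*'pf' is large.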

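[Editor's note, outside the patch: the serialize()/merge() pair above exchanges a flat list of strings between partial aggregations, laid out as [k, n, pf, word_1..word_n, freq, word_1..word_n, freq, ...]. A standalone sketch of that round trip, using plain Strings instead of Hadoop Text and hypothetical names (NGramWireFormat, mergeInto):]

```java
import java.util.*;

// Sketch of the flat wire format used between partial aggregations:
// the three estimation parameters first, then each n-gram as its n words
// followed by its estimated frequency, all encoded as strings.
class NGramWireFormat {
    static List<String> serialize(int k, int n, int pf,
                                  Map<List<String>, Double> ngrams) {
        List<String> out = new ArrayList<>();
        out.add(Integer.toString(k));
        out.add(Integer.toString(n));
        out.add(Integer.toString(pf));
        for (Map.Entry<List<String>, Double> e : ngrams.entrySet()) {
            out.addAll(e.getKey());            // the n words of the n-gram
            out.add(e.getValue().toString());  // its estimated frequency
        }
        return out;
    }

    // Merge a serialized list into an existing frequency map, summing
    // frequencies for n-grams present on both sides. Each record occupies
    // n+1 slots, so the cursor advances by n+1 per iteration.
    static void mergeInto(Map<List<String>, Double> ngrams, List<String> other) {
        int n = Integer.parseInt(other.get(1));
        for (int i = 3; i < other.size(); i += n + 1) {
            List<String> key = new ArrayList<>(other.subList(i, i + n));
            double freq = Double.parseDouble(other.get(i + n));
            ngrams.merge(key, freq, Double::sum);
        }
    }
}
```

The parameter check in the real merge() (rejecting mismatched 'k', 'n', 'pf') is omitted here for brevity; it exists because a non-constant expression for those arguments would otherwise silently corrupt the merged estimate.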
Modified: hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/NumericHistogram.java
URL: http://svn.apache.org/viewvc/hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/NumericHistogram.java?rev=986973&r1=986972&r2=986973&view=diff
==============================================================================
--- hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/NumericHistogram.java (original)
+++ hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/NumericHistogram.java Wed Aug 18 22:59:32 2010
@@ -18,6 +18,7 @@
 package org.apache.hadoop.hive.ql.udf.generic;
 
 import java.util.ArrayList;
+import java.util.List;
 import java.util.Arrays;
 import java.util.Random;
 import org.apache.hadoop.hive.serde2.io.DoubleWritable;
@@ -124,7 +125,7 @@ public class NumericHistogram {
    * @param other A serialized histogram created by the serialize() method
    * @see merge
    */
-  public void merge(ArrayList<DoubleWritable> other) {
+  public void merge(List<DoubleWritable> other) {
     if(other == null) {
       return;
     }
@@ -132,7 +133,7 @@ public class NumericHistogram {
     if(nbins == 0 || nusedbins == 0)  {
       // Our aggregation buffer has nothing in it, so just copy over 'other' 
       // by deserializing the ArrayList of (x,y) pairs into an array of Coord objects
-      nbins = (int) (other.get(0).get());
+      nbins = (int) other.get(0).get();
       nusedbins = (other.size()-1)/2; 
       bins = new Coord[nbins+1]; // +1 to hold a temporary bin for insert()
       for(int i = 1; i < other.size(); i+=2) {

Added: hadoop/hive/trunk/ql/src/test/queries/clientpositive/udaf_context_ngrams.q
URL: http://svn.apache.org/viewvc/hadoop/hive/trunk/ql/src/test/queries/clientpositive/udaf_context_ngrams.q?rev=986973&view=auto
==============================================================================
--- hadoop/hive/trunk/ql/src/test/queries/clientpositive/udaf_context_ngrams.q (added)
+++ hadoop/hive/trunk/ql/src/test/queries/clientpositive/udaf_context_ngrams.q Wed Aug 18 22:59:32 2010
@@ -0,0 +1,12 @@
+CREATE TABLE kafka (contents STRING);
+LOAD DATA LOCAL INPATH '../data/files/text-en.txt' INTO TABLE kafka;
+set mapred.reduce.tasks=1;
+set hive.exec.reducers.max=1;
+
+SELECT context_ngrams(sentences(lower(contents)), array(null), 100, 1000).estfrequency FROM kafka;
+SELECT context_ngrams(sentences(lower(contents)), array("he",null), 100, 1000) FROM kafka;
+SELECT context_ngrams(sentences(lower(contents)), array(null,"salesmen"), 100, 1000) FROM kafka;
+SELECT context_ngrams(sentences(lower(contents)), array("what","i",null), 100, 1000) FROM kafka;
+SELECT context_ngrams(sentences(lower(contents)), array(null,null), 100, 1000).estfrequency FROM kafka;
+
+DROP TABLE kafka;

Modified: hadoop/hive/trunk/ql/src/test/queries/clientpositive/udaf_ngrams.q
URL: http://svn.apache.org/viewvc/hadoop/hive/trunk/ql/src/test/queries/clientpositive/udaf_ngrams.q?rev=986973&r1=986972&r2=986973&view=diff
==============================================================================
--- hadoop/hive/trunk/ql/src/test/queries/clientpositive/udaf_ngrams.q (original)
+++ hadoop/hive/trunk/ql/src/test/queries/clientpositive/udaf_ngrams.q Wed Aug 18 22:59:32 2010
@@ -1,10 +1,12 @@
 CREATE TABLE kafka (contents STRING);
 LOAD DATA LOCAL INPATH '../data/files/text-en.txt' INTO TABLE kafka;
+set mapred.reduce.tasks=1;
+set hive.exec.reducers.max=1;
 
-SELECT ngrams(sentences(lower(contents)), 2, 100, 1000) FROM kafka;
-SELECT ngrams(sentences(lower(contents)), 1, 100, 1000) FROM kafka;
-SELECT ngrams(sentences(lower(contents)), 3, 100, 1000) FROM kafka;
-SELECT ngrams(sentences(lower(contents)), 4, 100, 1000) FROM kafka;
-SELECT ngrams(sentences(lower(contents)), 5, 100, 1000) FROM kafka;
+SELECT ngrams(sentences(lower(contents)), 1, 100, 1000).estfrequency FROM kafka;
+SELECT ngrams(sentences(lower(contents)), 2, 100, 1000).estfrequency FROM kafka;
+SELECT ngrams(sentences(lower(contents)), 3, 100, 1000).estfrequency FROM kafka;
+SELECT ngrams(sentences(lower(contents)), 4, 100, 1000).estfrequency FROM kafka;
+SELECT ngrams(sentences(lower(contents)), 5, 100, 1000).estfrequency FROM kafka;
 
 DROP TABLE kafka;

Modified: hadoop/hive/trunk/ql/src/test/results/clientpositive/show_functions.q.out
URL: http://svn.apache.org/viewvc/hadoop/hive/trunk/ql/src/test/results/clientpositive/show_functions.q.out?rev=986973&r1=986972&r2=986973&view=diff
==============================================================================
--- hadoop/hive/trunk/ql/src/test/results/clientpositive/show_functions.q.out (original)
+++ hadoop/hive/trunk/ql/src/test/results/clientpositive/show_functions.q.out Wed Aug 18 22:59:32 2010
@@ -37,6 +37,7 @@ coalesce
 collect_set
 concat
 concat_ws
+context_ngrams
 conv
 cos
 count
@@ -165,6 +166,7 @@ coalesce
 collect_set
 concat
 concat_ws
+context_ngrams
 conv
 cos
 count

Added: hadoop/hive/trunk/ql/src/test/results/clientpositive/udaf_context_ngrams.q.out
URL: http://svn.apache.org/viewvc/hadoop/hive/trunk/ql/src/test/results/clientpositive/udaf_context_ngrams.q.out?rev=986973&view=auto
==============================================================================
--- hadoop/hive/trunk/ql/src/test/results/clientpositive/udaf_context_ngrams.q.out (added)
+++ hadoop/hive/trunk/ql/src/test/results/clientpositive/udaf_context_ngrams.q.out Wed Aug 18 22:59:32 2010
@@ -0,0 +1,54 @@
+PREHOOK: query: CREATE TABLE kafka (contents STRING)
+PREHOOK: type: CREATETABLE
+POSTHOOK: query: CREATE TABLE kafka (contents STRING)
+POSTHOOK: type: CREATETABLE
+POSTHOOK: Output: default@kafka
+PREHOOK: query: LOAD DATA LOCAL INPATH '../data/files/text-en.txt' INTO TABLE kafka
+PREHOOK: type: LOAD
+POSTHOOK: query: LOAD DATA LOCAL INPATH '../data/files/text-en.txt' INTO TABLE kafka
+POSTHOOK: type: LOAD
+POSTHOOK: Output: default@kafka
+PREHOOK: query: SELECT context_ngrams(sentences(lower(contents)), array("he",null), 100, 1000) FROM kafka
+PREHOOK: type: QUERY
+PREHOOK: Input: default@kafka
+PREHOOK: Output: file:/var/folders/7i/7iCDbWRkGHOcgJgX0zscimPXXts/-Tmp-/mlahiri/hive_2010-08-18_10-27-03_343_1002290107772020180/-mr-10000
+POSTHOOK: query: SELECT context_ngrams(sentences(lower(contents)), array("he",null), 100, 1000) FROM kafka
+POSTHOOK: type: QUERY
+POSTHOOK: Input: default@kafka
+POSTHOOK: Output: file:/var/folders/7i/7iCDbWRkGHOcgJgX0zscimPXXts/-Tmp-/mlahiri/hive_2010-08-18_10-27-03_343_1002290107772020180/-mr-10000
+[{"ngram":["was"],"estfrequency":17.0},{"ngram":["had"],"estfrequency":16.0},{"ngram":["thought"],"estfrequency":13.0},{"ngram":["could"],"estfrequency":9.0},{"ngram":["would"],"estfrequency":7.0},{"ngram":["lay"],"estfrequency":5.0},{"ngram":["looked"],"estfrequency":4.0},{"ngram":["s"],"estfrequency":4.0},{"ngram":["wanted"],"estfrequency":4.0},{"ngram":["did"],"estfrequency":4.0},{"ngram":["felt"],"estfrequency":4.0},{"ngram":["needed"],"estfrequency":3.0},{"ngram":["must"],"estfrequency":3.0},{"ngram":["told"],"estfrequency":3.0},{"ngram":["lifted"],"estfrequency":3.0},{"ngram":["tried"],"estfrequency":3.0},{"ngram":["finally"],"estfrequency":3.0},{"ngram":["slid"],"estfrequency":3.0},{"ngram":["reported"],"estfrequency":2.0},{"ngram":["drew"],"estfrequency":2.0},{"ngram":["is"],"estfrequency":2.0},{"ngram":["wouldn't"],"estfrequency":2.0},{"ngram":["always"],"estfrequency":2.0},{"ngram":["really"],"estfrequency":2.0},{"ngram":["let"],"estfrequency":2.0},{"ngram":["threw"],"estfrequency":2.0},{"ngram":["found"],"estfrequency":2.0},{"ngram":["also"],"estfrequency":2.0},{"ngram":["made"],"estfrequency":2.0},{"ngram":["didn't"],"estfrequency":2.0},{"ngram":["touched"],"estfrequency":2.0},{"ngram":["do"],"estfrequency":2.0},{"ngram":["began"],"estfrequency":2.0},{"ngram":["preferred"],"estfrequency":1.0},{"ngram":["maintained"],"estfrequency":1.0},{"ngram":["managed"],"estfrequency":1.0},{"ngram":["urged"],"estfrequency":1.0},{"ngram":["will"],"estfrequency":1.0},{"ngram":["failed"],"estfrequency":1.0},{"ngram":["have"],"estfrequency":1.0},{"ngram":["heard"],"estfrequency":1.0},{"ngram":["were"],"estfrequency":1.0},{"ngram":["caught"],"estfrequency":1.0},{"ngram":["hit"],"estfrequency":1.0},{"ngram":["turned"],"estfrequency":1.0},{"ngram":["slowly"],"estfrequency":1.0},{"ngram":["stood"],"estfrequency":1.0},{"ngram":["chose"],"estfrequency":1.0},{"ngram":["swung"],"estfrequency":1.0},{"ngram":["denied"],"estfrequency":1.0},{"ngram":["intended"],"estfrequency":1.0},{"ngram":["became"],"estfrequency":1.0},{"ngram":["sits"],"estfrequency":1.0},{"ngram":["discovered"],"estfrequency":1.0},{"ngram":["called"],"estfrequency":1.0},{"ngram":["never"],"estfrequency":1.0},{"ngram":["cut"],"estfrequency":1.0},{"ngram":["directed"],"estfrequency":1.0},{"ngram":["hoped"],"estfrequency":1.0},{"ngram":["remembered"],"estfrequency":1.0},{"ngram":["said"],"estfrequency":1.0},{"ngram":["allowed"],"estfrequency":1.0},{"ngram":["confined"],"estfrequency":1.0},{"ngram":["almost"],"estfrequency":1.0},{"ngram":["retracted"],"estfrequency":1.0}]
+PREHOOK: query: SELECT context_ngrams(sentences(lower(contents)), array(null,"salesmen"), 100, 1000) FROM kafka
+PREHOOK: type: QUERY
+PREHOOK: Input: default@kafka
+PREHOOK: Output: file:/var/folders/7i/7iCDbWRkGHOcgJgX0zscimPXXts/-Tmp-/mlahiri/hive_2010-08-18_10-27-09_129_3462618851006508964/-mr-10000
+POSTHOOK: query: SELECT context_ngrams(sentences(lower(contents)), array(null,"salesmen"), 100, 1000) FROM kafka
+POSTHOOK: type: QUERY
+POSTHOOK: Input: default@kafka
+POSTHOOK: Output: file:/var/folders/7i/7iCDbWRkGHOcgJgX0zscimPXXts/-Tmp-/mlahiri/hive_2010-08-18_10-27-09_129_3462618851006508964/-mr-10000
+[{"ngram":["travelling"],"estfrequency":3.0}]
+PREHOOK: query: SELECT context_ngrams(sentences(lower(contents)), array("what","i",null), 100, 1000) FROM kafka
+PREHOOK: type: QUERY
+PREHOOK: Input: default@kafka
+PREHOOK: Output: file:/var/folders/7i/7iCDbWRkGHOcgJgX0zscimPXXts/-Tmp-/mlahiri/hive_2010-08-18_10-27-14_440_1498206600885141414/-mr-10000
+POSTHOOK: query: SELECT context_ngrams(sentences(lower(contents)), array("what","i",null), 100, 1000) FROM kafka
+POSTHOOK: type: QUERY
+POSTHOOK: Input: default@kafka
+POSTHOOK: Output: file:/var/folders/7i/7iCDbWRkGHOcgJgX0zscimPXXts/-Tmp-/mlahiri/hive_2010-08-18_10-27-14_440_1498206600885141414/-mr-10000
+[{"ngram":["think"],"estfrequency":3.0},{"ngram":["feel"],"estfrequency":2.0}]
+PREHOOK: query: SELECT context_ngrams(sentences(lower(contents)), array(null,null), 100, 1000).estfrequency FROM kafka
+PREHOOK: type: QUERY
+PREHOOK: Input: default@kafka
+PREHOOK: Output: file:/var/folders/7i/7iCDbWRkGHOcgJgX0zscimPXXts/-Tmp-/mlahiri/hive_2010-08-18_10-27-19_700_7323220829543312705/-mr-10000
+POSTHOOK: query: SELECT context_ngrams(sentences(lower(contents)), array(null,null), 100, 1000).estfrequency FROM kafka
+POSTHOOK: type: QUERY
+POSTHOOK: Input: default@kafka
+POSTHOOK: Output: file:/var/folders/7i/7iCDbWRkGHOcgJgX0zscimPXXts/-Tmp-/mlahiri/hive_2010-08-18_10-27-19_700_7323220829543312705/-mr-10000
+[23.0,20.0,18.0,17.0,17.0,16.0,16.0,16.0,16.0,15.0,14.0,13.0,12.0,12.0,12.0,11.0,11.0,11.0,10.0,10.0,10.0,10.0,10.0,10.0,9.0,9.0,9.0,8.0,8.0,8.0,8.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0]
+PREHOOK: query: DROP TABLE kafka
+PREHOOK: type: DROPTABLE
+PREHOOK: Input: default@kafka
+PREHOOK: Output: default@kafka
+POSTHOOK: query: DROP TABLE kafka
+POSTHOOK: type: DROPTABLE
+POSTHOOK: Input: default@kafka
+POSTHOOK: Output: default@kafka