Posted to dev@lucene.apache.org by Jack Krupansky <ja...@gmail.com> on 2016/04/20 18:04:33 UTC

Term vs. token

Looking at the Lucene Similarity Javadoc, I see some references to tokens,
but I am wondering if that is intentional or whether those should really be
references to terms.

For example:

 *        <li><b>lengthNorm</b> - computed
 *        when the document is added to the index in accordance with the number of tokens
 *        of this field in the document, so that shorter fields contribute more to the score.

I think that should be terms, not tokens.

See:
https://github.com/apache/lucene-solr/blob/releases/lucene-solr/5.5.0/lucene/core/src/java/org/apache/lucene/search/similarities/TFIDFSimilarity.java#L466
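
For reference, here is roughly what the default TF-IDF implementation already does for that norm (a paraphrase from memory of DefaultSimilarity in 5.x, so treat it as a sketch rather than the exact source). Note that even here the count feeding the norm is named numTerms, while the javadoc above calls it tokens:

  import org.apache.lucene.index.FieldInvertState;
  import org.apache.lucene.search.similarities.DefaultSimilarity;

  // Paraphrase of DefaultSimilarity#lengthNorm (5.x), not the exact source:
  // the per-field count that drives the norm is called "numTerms".
  public class LengthNormSketch extends DefaultSimilarity {
    @Override
    public float lengthNorm(FieldInvertState state) {
      final int numTerms = discountOverlaps
          ? state.getLength() - state.getNumOverlap() // skip tokens stacked at the same position
          : state.getLength();                        // total token positions seen for the field
      return state.getBoost() * (float) (1.0 / Math.sqrt(numTerms));
    }
  }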

And this:

   * Returns the total number of tokens in the field.
   * @see Terms#getSumTotalTermFreq()
   */
  public long getNumberOfFieldTokens() {
    return numberOfFieldTokens;

I think that should be terms as well:

See:
https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/similarities/BasicStats.java#L65

And... this:

      numberOfFieldTokens = sumTotalTermFreq;

Here it clearly starts from a term statistic (sumTotalTermFreq) and treats it
as a token count; as in the previous example, I think that should be terms as
well.

See:
https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/similarities/SimilarityBase.java#L128
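
To make the ambiguity concrete, here is a small standalone sketch against the 5.x API (the field name, sample text, and class name are made up for illustration). sumTotalTermFreq counts every token occurrence that was indexed, while the number of distinct terms is a different number entirely:

  import java.io.IOException;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.document.TextField;
  import org.apache.lucene.index.DirectoryReader;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.index.IndexWriterConfig;
  import org.apache.lucene.index.MultiFields;
  import org.apache.lucene.index.Terms;
  import org.apache.lucene.store.RAMDirectory;

  public class TokenCountDemo {
    public static void main(String[] args) throws IOException {
      RAMDirectory dir = new RAMDirectory();
      IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));
      Document doc = new Document();
      // One field value analyzed into three token occurrences of two distinct terms.
      doc.add(new TextField("body", "apple apple banana", Field.Store.NO));
      writer.addDocument(doc);
      writer.close();

      DirectoryReader reader = DirectoryReader.open(dir);
      Terms terms = MultiFields.getTerms(reader, "body");
      // Sum of per-term frequencies, i.e. every token occurrence indexed: prints 3.
      System.out.println("sumTotalTermFreq = " + terms.getSumTotalTermFreq());
      // Number of distinct terms in the field: prints 2
      // (can legitimately be -1 on multi-segment readers).
      System.out.println("distinct terms   = " + terms.size());
      reader.close();
      dir.close();
    }
  }

So when that value is stored as numberOfFieldTokens, it is counting occurrences, not distinct terms, which is exactly where the naming gets slippery.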

One last example:

   * Compute any collection-level weight (e.g. IDF, average document length, etc) needed for scoring a query.
   *
   * @param collectionStats collection-level statistics, such as the number of tokens in the collection.
   * @param termStats term-level statistics, such as the document frequency of a term across the collection.
   * @return SimWeight object with the information this Similarity needs to score a query.
   */
  public abstract SimWeight computeWeight(CollectionStatistics collectionStats, TermStatistics... termStats);

See:
https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/similarities/Similarity.java#L161

In fact, CollectionStatistics uses term, not token:

  /** returns the total number of tokens for this field
   * @see Terms#getSumTotalTermFreq() */
  public final long sumTotalTermFreq() {
    return sumTotalTermFreq;

Oops... it uses both, emphasizing my point about the confusion.
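
And just to show how those statistics get consumed downstream, here is a rough paraphrase (not the exact source; the helper class and method names are mine) of the kind of bookkeeping SimilarityBase does with a CollectionStatistics. The stored "token" count is literally the summed term frequencies, and the average field length falls out of it:

  import org.apache.lucene.search.CollectionStatistics;
  import org.apache.lucene.search.similarities.BasicStats;

  // Rough paraphrase of SimilarityBase-style bookkeeping, not the exact source.
  final class StatsSketch {
    static void fill(BasicStats stats, CollectionStatistics collectionStats) {
      long tokenCount = collectionStats.sumTotalTermFreq();    // every indexed token occurrence in the field
      long docCount = collectionStats.docCount() == -1
          ? collectionStats.maxDoc()                           // fall back when docCount isn't recorded
          : collectionStats.docCount();
      stats.setNumberOfDocuments(docCount);
      stats.setNumberOfFieldTokens(tokenCount);                // "tokens" filled from a "TermFreq" sum
      stats.setAvgFieldLength((float) tokenCount / docCount);  // average tokens per document
    }
  }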

There are other examples as well.

My understanding is that tokens are merely a transient stage between the
original raw source text of a field and the final terms to be indexed (or the
query terms produced by parsing and analyzing a query). Yes, within a
TokenStream or analyzer we speak of tokens, and the intermediate string values
are referred to as tokens, but once the final string value is retrieved from
the TokenStream (analyzer), it's a term.
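
To put that in code, here is a minimal sketch (5.x API; the field name, sample text, and class name are made up). While iterating the stream we talk about tokens, yet the string we pull from each one is exactly what the index will record, per field, as a term - the attribute is even called CharTermAttribute at that point:

  import java.io.IOException;
  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

  public class TokenToTermDemo {
    public static void main(String[] args) throws IOException {
      Analyzer analyzer = new StandardAnalyzer();
      // Analysis produces a stream of tokens for the (hypothetical) "body" field...
      try (TokenStream stream = analyzer.tokenStream("body", "The Quick Brown Fox")) {
        CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
        stream.reset();
        while (stream.incrementToken()) {
          // ...and each token's final string value is what gets indexed as a term.
          System.out.println(termAtt.toString());
        }
        stream.end();
      }
      analyzer.close();
    }
  }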

In any case, is there some distinction in any of these cited examples (or
other examples in this or related code) where "token" is an important
distinction to be made and "term" is not the proper... term... to be used?

Unless the Lucene project fully intends that "token" and "term" be absolutely
synonymous, a clear distinction should be drawn... I think. Or at least the
two should be used consistently, which my last example highlights.

Thanks.

-- Jack Krupansky

Re: Term vs. token

Posted by Jack Krupansky <ja...@gmail.com>.
I gather that "term" is the proper technical term within the Vector Space
Model (TDIFS) and BM25 similarity, so it may simply be a question of where
the boundary is in Lucene between VSM processing and other stuff, like the
source for documents and queries.

-- Jack Krupansky

Re: Term vs. token

Posted by Ryan Josal <rj...@gmail.com>.
My understanding is that a Term is composed of a "token" plus a field.  On
that reading the documentation makes sense to me - return the count of tokens
in a field, for example.  But a couple of the references you cited don't match
that definition, like the number of tokens in a collection.  Then again, maybe
a Term doesn't carry a whole token, since what about token attributes like
payloads?  I guess I've convinced myself I'm not entirely clear about it
either, but I do feel good about the concept that tokens don't have fields.
You can tokenize a string without thinking about fields, and they become terms
with fields when you query.
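
For example, just as a sketch (the field and value here are made up):

  import org.apache.lucene.index.Term;

  public class TermSketch {
    public static void main(String[] args) {
      // A token is just a string (plus attributes) coming out of analysis;
      // a Term pins that string to a named field.
      Term t = new Term("title", "lucene");
      System.out.println(t.field() + ":" + t.text());  // prints title:lucene
    }
  }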

Ryan
