Posted to solr-user@lucene.apache.org by "Jürgen Wagner (DVT)" <ju...@devoteam.com> on 2014/09/05 09:44:29 UTC

FAST-like document vector data structures in Solr?

Hello all,
  as the migration from FAST to Solr is a relevant topic for several of
our customers, there is one issue that does not seem to be addressed by
Lucene/Solr: document vectors FAST-style. These document vectors are
used to form metrics of similarity, i.e., they may be used as a
"semantic fingerprint" of documents to define similarity relations. I
can think of several ways of approximating a mapping of this mechanism
to Solr, but there are always drawbacks - mostly performance-wise.

Has anybody else encountered and possibly approached this challenge so far?

Is there anything in the roadmap of Solr that has not revealed itself to
me, addressing this issue?

Your input is greatly appreciated!

Cheers,
--Jürgen

Re: FAST-like document vector data structures in Solr?

Posted by Bernd Fehling <be...@uni-bielefeld.de>.

Some further details from memory:
- it is a stream based feature
- IDF estimates get updated and refined as more and more documents pass through
- it is actually IDF weighting with stopwords and boosting
-- stopwords should be ignored and not get vectorized
-- boosting should give some boost to vectors
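
To illustrate the stream-based idea in the list above, here is a minimal Python sketch. The class name StreamingIdf and the stopword set are made up for illustration; this is not how FAST implements it, only a way to picture IDF estimates being refined as documents pass through.

```python
import math
from collections import Counter

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"the", "a", "of", "and"}


class StreamingIdf:
    """Keeps running document-frequency counts so IDF can be estimated at any time."""

    def __init__(self):
        self.docs_seen = 0
        self.doc_freq = Counter()

    def observe(self, tokens):
        """Register one document; each distinct non-stopword term counts once."""
        self.docs_seen += 1
        self.doc_freq.update(set(tokens) - STOPWORDS)

    def idf(self, term):
        """Current estimate ln(N / df); 0.0 for stopwords and unseen terms."""
        df = self.doc_freq.get(term, 0)
        if df == 0:
            return 0.0
        return math.log(self.docs_seen / df)


est = StreamingIdf()
est.observe(["the", "solr", "index"])
est.observe(["the", "fast", "index"])
est.observe(["solr", "docvector"])
```

Every call to observe() shifts the estimates a little, so early documents are weighted with rougher statistics than later ones.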

There are some further configuration parameters.
nmin - minimum number of occurrences
type (of IDF weighting) - flat, linear, logarithmic
- flat, gives IDF the value 0 if the number of occurrences of the string in
        the document is less than nmin, else 1.
- linear, interpolates linearly between 0 and 1,
          returns 0 if the number of occurrences is below nmin, otherwise
          returns (1 - (# of docs with string found / # of docs passed through))
- logarithmic, uses natural logarithm, weights rarity more heavily,
          returns 0 if the number of occurrences is below nmin, otherwise
          returns exponential_log(# of docs passed through / # of docs with string found)

I think logarithmic was default (as far as I can remember).
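
The three weighting types could be sketched in Python roughly as follows. The function and parameter names are hypothetical, and I am reading the logarithmic variant as the plain natural log of the ratio, which is my interpretation rather than something confirmed from the FAST documentation.

```python
import math


def idf_weight(kind, occurrences, docs_with_term, docs_seen, nmin=1):
    """Return the weight for one term under the 'flat', 'linear' or 'logarithmic' type.

    occurrences    -- how often the string occurs in the current document
    docs_with_term -- documents seen so far that contain the string (assumed >= 1)
    docs_seen      -- documents passed through so far
    """
    if occurrences < nmin:
        return 0.0
    if kind == "flat":
        return 1.0
    if kind == "linear":
        return 1.0 - docs_with_term / docs_seen
    if kind == "logarithmic":
        # Assumption: natural log of the inverse document ratio.
        return math.log(docs_seen / docs_with_term)
    raise ValueError(f"unknown weighting type: {kind}")
```

With 100 documents seen and 25 of them containing the string, linear gives 0.75 and logarithmic gives ln(4), provided the string occurs at least nmin times in the document.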


A question that came up while thinking about this feature: is it possible with
Solr/Lucene to access the IDF of strings in the index while processing new documents?
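
One possible direction, offered as an untested sketch: Solr's function queries docfreq(), idf() and numdocs() can be requested as pseudo-fields in fl, so an indexing client could look up statistics for a term before sending a new document. The base URL, field name and helper name below are assumptions, and the term is assumed to contain no quote characters.

```python
from urllib.parse import urlencode


def idf_lookup_url(base_url, field, term):
    """Build a select URL whose fl asks Solr for docfreq, idf and numdocs of one term."""
    # Assumption: term is a simple token without quotes or backslashes.
    quoted = f"'{term}'"
    params = {
        "q": "*:*",
        "rows": 1,
        "wt": "json",
        "fl": f"df:docfreq({field},{quoted}),idf:idf({field},{quoted}),n:numdocs()",
    }
    return f"{base_url}/select?{urlencode(params)}"


url = idf_lookup_url("http://localhost:8983/solr/collection1", "text", "solr")
```

This costs one round trip per term, so it would probably only be practical with batching or caching on the client side.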


-- Bernd


-- 
*************************************************************
Bernd Fehling                    Bielefeld University Library
Dipl.-Inform. (FH)                LibTec - Library Technology
Universitätsstr. 25                  and Knowledge Management
33615 Bielefeld
Tel. +49 521 106-4060       bernd.fehling(at)uni-bielefeld.de

BASE - Bielefeld Academic Search Engine - www.base-search.net
*************************************************************

Re: FAST-like document vector data structures in Solr?

Posted by Jack Krupansky <ja...@basetechnology.com>.

Sounds like a great feature to add to Solr, especially if it would facilitate
more automatic relevancy enhancement. LucidWorks Search has a feature called
"unsupervised feedback" that does that, but something like a docvector might
make it a more realistic default.

-- Jack Krupansky


Re: FAST-like document vector data structures in Solr?

Posted by "Jürgen Wagner (DVT)" <ju...@devoteam.com>.

Thanks for posting this. I was just about to send off a message of
similar content :-)

Important to add:

- In FAST ESP, you could have more than one such docvector associated
with a document, in order to reflect different metrics.

- Term weights in docvectors are document-relative, not absolute.

- Processing is done in the search processor (close to the index), not
in the QR server (providing transformations on the result list).

This docvector could be used for unsupervised clustering,
related-to/similarity search, tag clouds or more weird stuff like
identifying experts on topics contained in a particular document.
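
As an illustration of the related-to/similarity use case, two docvectors can be compared as sparse term-to-weight mappings. The sketch below uses cosine similarity, which is my choice for the example and not necessarily the measure FAST applies; the sample vectors are made up.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse term->weight mappings."""
    shared = vec_a.keys() & vec_b.keys()
    dot = sum(vec_a[t] * vec_b[t] for t in shared)
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


doc_a = {"solr": 1.0, "docvector": 0.5}
doc_b = {"solr": 1.0, "lucene": 0.5}
similarity = cosine_similarity(doc_a, doc_b)
```

Because only the shared terms contribute to the dot product, short docvectors keep such comparisons cheap.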

With Solr, it seems I have to handcraft the term vectors to reflect the
right weights, to approximate the effect of FAST docvectors, e.g., by
normalizing them to [0...10000). Processing performance would still be
different from the classical FAST docvectors. The space consumption may
become ugly for a shard in the 200+ GB range; then again, FAST has also
been quite generous with disk space.
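
A rough sketch of the normalization step, assuming non-negative, document-relative float weights and a hypothetical helper name. How the resulting integers are then stored in Solr term vectors is left open here.

```python
def scale_weights(weights, upper=10000):
    """Map document-relative float weights onto integers in [0, upper)."""
    top = max(weights.values(), default=0.0)
    if top <= 0.0:
        return {term: 0 for term in weights}
    # The strongest term maps to upper - 1 so the range stays half-open.
    return {term: int(w / top * (upper - 1)) for term, w in weights.items()}


scaled = scale_weights({"solr": 0.8, "fast": 0.4, "the": 0.0})
```

Scaling against the strongest term of each document keeps the weights document-relative rather than absolute.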

So, the interesting question is whether there is a more canonical way of
handling this in Solr/Lucene, or whether something like it is planned for 5.0+.

Best regards,
--Jürgen


Re: FAST-like document vector data structures in Solr?

Posted by Jack Krupansky <ja...@basetechnology.com>.

For reference:

“Item Similarity Vector Reference

This property represents a similarity reference when searching for similar items. This is a similarity vector representation that is returned for each item in the query result in the docvector managed property.

The value is a string formatted according to the following format:

[string1,weight1][string2,weight2]...[stringN,weightN]

When performing a find similar query, the SimilarTo element should contain a string parameter with the value of the docvector managed property of the item that is to be used as the similarity reference. The similarity vector consists of a set of "term,weight" expressions, indicating the most important terms or concepts in the item and the corresponding perceived importance (weight). Terms can be single words or phrases.

The weight is a float value between 0 and 1, where 1 indicates the highest relevance.

The similarity vector is created during item processing and indicates the most important terms or concepts in the item and the corresponding weight.”

See:
http://msdn.microsoft.com/en-us/library/office/ff521597(v=office.14).aspx
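
For anyone who wants to experiment with that string format, here is a small Python sketch that parses and re-serializes it. The function names are made up, and the regular expression assumes terms contain no square brackets.

```python
import re

# One "[term,weight]" entry; the term may be a phrase but must not contain brackets.
_ENTRY = re.compile(r"\[([^\[\]]*),([^,\[\]]*)\]")


def parse_docvector(value):
    """Parse '[string1,weight1][string2,weight2]...' into a list of (term, weight)."""
    return [(term.strip(), float(weight)) for term, weight in _ENTRY.findall(value)]


def format_docvector(entries):
    """Serialize (term, weight) pairs back into the bracketed string format."""
    return "".join(f"[{term},{weight}]" for term, weight in entries)


entries = parse_docvector("[search engine,0.9][solr,0.75]")
```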

-- Jack Krupansky


Re: FAST-like document vector data structures in Solr?

Posted by "Jürgen Wagner (DVT)" <ju...@devoteam.com>.

Hello Jim,
  yes, I am aware of the TermVector and MoreLikeThis components. I am
presently mapping docvectors to these mechanisms and creating the term
vectors myself from third-party text mining components.

However, it's not quite like the FAST docvectors. In particular, the
performance of MoreLikeThis queries based on TermVectors is suboptimal
on large document sets, so more efficient support for such retrievals
in the Lucene kernel would be preferred.
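
For context, the kind of MoreLikeThis request meant here might look like the sketch below. It assumes a MoreLikeThisHandler registered at /mlt, a unique key field named id and a field named text with term vectors enabled; all of these names are placeholders, not taken from an actual setup.

```python
from urllib.parse import urlencode


def mlt_url(base_url, doc_id, field, rows=10):
    """Build a MoreLikeThis request for documents similar to doc_id."""
    params = {
        "q": f"id:{doc_id}",   # the reference document
        "mlt.fl": field,       # field(s) used to find interesting terms
        "mlt.mintf": 1,        # minimum term frequency in the reference document
        "mlt.mindf": 1,        # minimum document frequency in the index
        "rows": rows,
        "wt": "json",
    }
    return f"{base_url}/mlt?{urlencode(params)}"


url = mlt_url("http://localhost:8983/solr/collection1", "doc42", "text")
```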

Cheers,
--Jürgen



-- 

Mit freundlichen Grüßen/Kind regards/Cordialement vôtre/Atentamente/С уважением
i.A. Jürgen Wagner
Head of Competence Center "Intelligence"
& Senior Cloud Consultant

Devoteam GmbH, Industriestr. 3, 70565 Stuttgart, Germany
Phone: +49 6151 868-8725, Fax: +49 711 13353-53, Mobile: +49 171 864 1543
E-Mail: juergen.wagner@devoteam.com, URL: www.devoteam.de

------------------------------------------------------------------------
Managing Board: Jürgen Hatzipantelis (CEO)
Address of Record: 64331 Weiterstadt, Germany; Commercial Register:
Amtsgericht Darmstadt HRB 6450; Tax Number: DE 172 993 071

Re: FAST-like document vector data structures in Solr?

Posted by jim ferenczi <ji...@gmail.com>.

Hi,
Something like this?
https://cwiki.apache.org/confluence/display/solr/The+Term+Vector+Component
And just to show off the impressive search functionality of the wiki ;)
https://cwiki.apache.org/confluence/dosearchsite.action?where=solr&spaceSearch=true&queryString=document+vectors
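
A request to the Term Vector Component could be built roughly like this. The sketch assumes the /tvrh handler from the example solrconfig.xml, a unique key field named id and a field named text indexed with termVectors="true"; the base URL and helper name are placeholders.

```python
from urllib.parse import urlencode


def term_vector_url(base_url, doc_id, field):
    """Build a Term Vector Component request returning tf, df and tf-idf per term."""
    params = {
        "q": f"id:{doc_id}",
        "tv": "true",
        "tv.fl": field,
        "tv.tf": "true",
        "tv.df": "true",
        "tv.tf_idf": "true",
        "wt": "json",
    }
    return f"{base_url}/tvrh?{urlencode(params)}"


url = term_vector_url("http://localhost:8983/solr/collection1", "doc42", "text")
```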

Cheers,
Jim



Re: FAST-like document vector data structures in Solr?

Posted by Mikhail Khludnev <mk...@griddynamics.com>.

Jürgen,

I don't quite follow. Can you tell me more about this feature or point me to the docs?
Thanks




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
<mk...@griddynamics.com>