Posted to solr-user@lucene.apache.org by Jon Baer <jo...@gmail.com> on 2008/10/31 22:20:34 UTC

TermVectorComponent for tag generation?

Hi,

I'm looking to either use this or build a component that does
what I need.  I'd like to figure out whether it's possible to use a
single doc to generate tags based on the matches within that
document, for example:

1 News Doc -> contains 5 Players and 8 Teams (show them as possible  
tags for this article)

In this case Players and Teams are also docs.  It's almost like I want
to use MoreLikeThis w/ a different filter query than the one I'm using.

Is there any easy hack to get this going?

Thanks.

- Jon

Re: TermVectorComponent for tag generation?

Posted by "Vaijanath N. Rao" <va...@gmail.com>.

Hi Jon,

Isn't this similar to what Grant just suggested: the topmost terms (after
removing the stop words)?

You would need to get the terms that are there and their
frequencies, and mark any term whose frequency is beyond a certain
threshold as a member of the tag set.
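
A minimal sketch of that threshold idea, in Python rather than Solr code.
The stop-word list, the function name candidate_tags and the default
threshold are all made up for illustration, and the sketch assumes the
threshold applies to relative frequency (a term's share of the non-stop-word
tokens):

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "as", "at", "be", "for", "in", "is", "of", "on", "the", "to"}


def candidate_tags(text, min_relative_freq=0.1):
    """Return terms whose share of the non-stop-word tokens meets the threshold."""
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]
    if not tokens:
        return []
    counts = Counter(tokens)
    total = len(tokens)
    # Keep only the terms frequent enough to be proposed as tags.
    return sorted(term for term, n in counts.items() if n / total >= min_relative_freq)
```

A frequency cutoff like this knows nothing about Players or Teams; it only
surfaces terms that dominate the one document.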

One can also build a set of related entities or terms that
follow the current term, and then decide which of them should become
part of the tag set.

Is that the requirement, or am I missing something here?

-- Thanks and Regards
Vaijanath N. Rao

Jon Baer wrote:
> Well, for example, in any given text (which is a field on a document):
>
> "While suitable for any application which requires full text indexing 
> and searching capability, Lucene has been widely recognized for its 
> utility in the implementation of Internet search engines and local, 
> single-site searching.
>
> At the core of Lucene's logical architecture is the idea of a document 
> containing fields of text. This flexibility allows Lucene's API to be 
> independent of file format. Text from PDFs, HTML, Microsoft Word 
> documents, as well as many others can all be indexed so long as their 
> textual information can be extracted."
>
> I'd like to be able to say the tags for this article should be [Lucene,
> PDF, HTML, Microsoft Word] because they are in field values from other
> documents.  Basically, how do I generate tags from just a single document
> based on other documents' field values?
>
> - Jon

Re: TermVectorComponent for tag generation?

Posted by Grant Ingersoll <gs...@apache.org>.

On Nov 1, 2008, at 3:04 PM, Jon Baer wrote:

>
> On Nov 1, 2008, at 1:16 PM, Grant Ingersoll wrote:
>
>> How do you propose to distinguish those words from the other ones?
>
> ** They are field values from other documents

But so are many other words from that document. What separates out
[Lucene, PDF, HTML, Microsoft Word] from the rest?  Your brain made
the distinction, but what info exists in that document such that a
computer can?  (This is a leading question. I have some ideas, but I
think hearing it from you will help me better understand what you are
trying to do.)

>
>
>> The problem you are addressing is often called keyword extraction.   
>> In general, it's a difficult problem, but you may have domain
>> knowledge that can help.
>
> ** I'm finding it hard to believe Lucene can do an amazing job at search
> and yet has nothing to tell me whether a generated list of content is
> present in a resulting document.

I think it can. The thing I'm missing is where the generated
list comes from.  Given the list, I think it's just another search,
right?

So, I suppose you could get the TV (term vector) for your current
document, along with the DF (doc freq) of each term, to find out which
terms occur in other documents. Then you could go get those documents
by searching for each of those terms.
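
To make that concrete, here is a rough sketch in plain Python, not Solr or
Lucene code. An in-memory postings map stands in for the index, the size of
each postings set plays the role of the DF, and all function names are
invented for the example:

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_postings(docs):
    """Map each term to the set of doc ids containing it (DF = size of the set)."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return postings


def shared_terms(doc_id, docs, postings):
    """For each term of doc_id that also occurs elsewhere, list the other doc ids."""
    result = {}
    for term in set(tokenize(docs[doc_id])):
        others = postings[term] - {doc_id}
        if others:
            result[term] = sorted(others)
    return result
```

With a real index, the per-term lookup would be one search per term, which
is the "just another search" step.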

However, I still suspect I'm missing something, so I'd say give it a  
try!  Maybe trying it out in code would be the best way to articulate  
it.

-Grant

Re: TermVectorComponent for tag generation?

Posted by Jon Baer <jo...@gmail.com>.

On Nov 1, 2008, at 1:16 PM, Grant Ingersoll wrote:

> How do you propose to distinguish those words from the other ones?

** They are field values from other documents

>  The problem you are addressing is often called keyword extraction.   
> In general, it's a difficult problem, but you may have domain
> knowledge that can help.

** I'm finding it hard to believe Lucene can do an amazing job at search
and yet has nothing to tell me whether a generated list of content is
present in a resulting document.  The other options of TVC are what
piqued my interest in the beginning ...

Other Options
     * tv.fl - List of fields to get TV information from. Optional. If
       not specified, the fl parameter is used.
     * tv.docIds - List of Lucene document ids (not the Solr Unique
       Key) to get term vectors for.

I'm pretty sure that might work for what I need.
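
For illustration, a request using those two options could be assembled
like this. It is only a sketch: the base URL and handler path are
placeholders, and tv, tv.tf and tv.df are not quoted above but are
parameter names as I understand them from the TermVectorComponent docs, so
check them against your Solr version:

```python
from urllib.parse import urlencode


def term_vector_url(base_url, doc_ids, fields):
    """Build a query URL asking for term vectors of specific Lucene doc ids."""
    params = {
        "q": "*:*",
        "tv": "true",      # assumed switch that turns the component on
        "tv.tf": "true",   # assumed: include term frequencies
        "tv.df": "true",   # assumed: include document frequencies
        "tv.fl": ",".join(fields),
        "tv.docIds": ",".join(str(d) for d in doc_ids),
    }
    return base_url + "?" + urlencode(params)


# Hypothetical host and handler name.
url = term_vector_url("http://localhost:8983/solr/tvrh", [12, 15], ["body"])
```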

- Jon

Re: TermVectorComponent for tag generation?

Posted by Grant Ingersoll <gs...@apache.org>.

How do you propose to distinguish those words from the other ones?
The problem you are addressing is often called keyword extraction.  In
general, it's a difficult problem, but you may have domain knowledge
that can help.


On Oct 31, 2008, at 6:35 PM, Jon Baer wrote:

> Well, for example, in any given text (which is a field on a document):
>
> "While suitable for any application which requires full text  
> indexing and searching capability, Lucene has been widely recognized  
> for its utility in the implementation of Internet search engines and  
> local, single-site searching.
>
> At the core of Lucene's logical architecture is the idea of a  
> document containing fields of text. This flexibility allows Lucene's  
> API to be independent of file format. Text from PDFs, HTML,  
> Microsoft Word documents, as well as many others can all be indexed  
> so long as their textual information can be extracted."
>
> I'd like to be able to say the tags for this article should be
> [Lucene, PDF, HTML, Microsoft Word] because they are in field values
> from other documents.  Basically, how do I generate tags from just a
> single document based on other documents' field values?
>
> - Jon

--------------------------
Grant Ingersoll
Lucene Boot Camp Training Nov. 3-4, 2008, ApacheCon US New Orleans.
http://www.lucenebootcamp.com


Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ

Re: TermVectorComponent for tag generation?

Posted by Jon Baer <jo...@gmail.com>.

Well, for example, in any given text (which is a field on a document):

"While suitable for any application which requires full text indexing  
and searching capability, Lucene has been widely recognized for its  
utility in the implementation of Internet search engines and local,  
single-site searching.

At the core of Lucene's logical architecture is the idea of a document  
containing fields of text. This flexibility allows Lucene's API to be  
independent of file format. Text from PDFs, HTML, Microsoft Word  
documents, as well as many others can all be indexed so long as their  
textual information can be extracted."

I'd like to be able to say the tags for this article should be [Lucene,
PDF, HTML, Microsoft Word] because they are in field values from other
documents.  Basically, how do I generate tags from just a single document
based on other documents' field values?
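
Roughly, in Python rather than Solr terms, what I have in mind looks like
the sketch below. The list of known values stands in for whatever field
values the other documents hold, the function name is made up, and the
optional trailing "s" is just a crude plural rule so that "PDFs" matches
"PDF":

```python
import re


def tags_from_known_values(text, known_values):
    """Return the known values that appear in text, in the order given."""
    found = []
    for value in known_values:
        # Whole-word, case-insensitive match with an optional plural "s".
        pattern = r"\b" + re.escape(value) + r"s?\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(value)
    return found


article = "Text from PDFs, HTML, Microsoft Word documents can be indexed by Lucene."
known = ["Lucene", "PDF", "HTML", "Microsoft Word", "Solr"]
tags = tags_from_known_values(article, known)
```

Here tags comes out as Lucene, PDF, HTML and Microsoft Word, with Solr left
out because it never appears in the article.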

- Jon

On Oct 31, 2008, at 6:17 PM, Grant Ingersoll wrote:

> Hey Jon,
>
> Not following how the TVC (TermVectorComp) would help here.  I
> suppose you could use the "most important" terms, as defined by
> TF-IDF, as suggested tags.  The MLT (MoreLikeThis) uses this to
> generate query terms.
>
> However, I'm not following the different filter query piece.  Can
> you provide a bit more detail?
>
> One thing you did make me think of, though, is that it might be
> interesting to extend TermVectorMapper so that it can output a
> NamedList and then allow people to implement their own
> SolrTermVectorMapper and have it customize the TV output...
>
> Thanks,
> Grant

Re: TermVectorComponent for tag generation?

Posted by Grant Ingersoll <gs...@apache.org>.

Hey Jon,

Not following how the TVC (TermVectorComp) would help here.  I
suppose you could use the "most important" terms, as defined by
TF-IDF, as suggested tags.  The MLT (MoreLikeThis) uses this to generate
query terms.
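
As a rough illustration of ranking terms by TF-IDF, here is a small Python
sketch. It uses a textbook weighting (tf times log(N/df) + 1), which is not
necessarily the exact formula Lucene or MLT applies, and the function names
are invented for the example:

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def top_tfidf_terms(doc, corpus, k=3):
    """Rank the terms of doc by tf * (log(N / df) + 1), computed over corpus."""
    n_docs = len(corpus)
    df = Counter()
    for other in corpus:
        df.update(set(tokenize(other)))  # count each term once per document
    tf = Counter(tokenize(doc))
    # Terms that never occur in the corpus have no df and are skipped.
    scores = {t: tf[t] * (math.log(n_docs / df[t]) + 1.0) for t in tf if df[t] > 0}
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]
```

Terms that are frequent in the document but rare across the corpus float to
the top, which is what makes them plausible tag suggestions.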

However, I'm not following the different filter query piece.  Can you
provide a bit more detail?

One thing you did make me think of, though, is that it might be
interesting to extend TermVectorMapper so that it can output a NamedList
and then allow people to implement their own SolrTermVectorMapper and
have it customize the TV output...

Thanks,
Grant

On Oct 31, 2008, at 5:20 PM, Jon Baer wrote:

> Hi,
>
> I'm looking to either use this or build a component that does
> what I need.  I'd like to figure out whether it's possible to use a
> single doc to generate tags based on the matches within that
> document, for example:
>
> 1 News Doc -> contains 5 Players and 8 Teams (show them as possible  
> tags for this article)
>
> In this case Players and Teams are also docs.  It's almost like I
> want to use MoreLikeThis w/ a different filter query than the one I'm
> using.
>
> Is there any easy hack to get this going?
>
> Thanks.
>
> - Jon
