Posted to solr-user@lucene.apache.org by Jon Baer <jo...@gmail.com> on 2008/10/31 22:20:34 UTC
TermVectorComponent for tag generation?
Hi,
So I'm looking to either use this or build a component which might do
what I'm looking for. I'd like to figure out if it's possible to use a
single doc to get tag generation based on the matches within that
document, for example:
1 News Doc -> contains 5 Players and 8 Teams (show them as possible
tags for this article)
In this case Players and Teams are also docs. It's almost like I want
to use MoreLikeThis w/ a different filter query than what I'm using.
Is there any easy hack to get this going?
Thanks.
- Jon
Re: TermVectorComponent for tag generation?
Posted by "Vaijanath N. Rao" <va...@gmail.com>.
Hi Jon,
Isn't this similar to what Grant just said: take the topmost terms (after
removing the stop words)?
You would need to get how many terms there are and their relative
frequencies, and any term whose frequency is beyond a certain threshold
you would mark as a member of the tag set.
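
The thresholding idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop-word list, the tokenizer regex, and the `candidate_tags` name are all made up for the example and are not part of Solr or Lucene.

```python
import re
from collections import Counter

# Illustrative stop-word list (an assumption); a real analyzer would use a
# much fuller one.
STOP_WORDS = {"a", "an", "and", "as", "be", "can", "for", "has", "in", "is",
              "its", "of", "the", "to", "which"}

def candidate_tags(text, min_count=2):
    """Return terms whose raw frequency in `text` is at least `min_count`."""
    # Lowercase and split on anything that is not a letter or digit.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    # Keep only the terms at or above the threshold.
    return {term: n for term, n in counts.items() if n >= min_count}
```

Calling `candidate_tags(text, min_count=2)` on a field value returns a dict of term to count; the right threshold depends on document length and would need tuning.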
One can also build a set of related entities or terms which
follow the current term, and then decide which of them can become
part of the tag set.
Is that the requirement, or am I missing something here?
-- Thanks and Regards
Vaijanath N. Rao
Jon Baer wrote:
> Well, for example, in any given text (which is a field on a document):
>
> "While suitable for any application which requires full text indexing
> and searching capability, Lucene has been widely recognized for its
> utility in the implementation of Internet search engines and local,
> single-site searching.
>
> At the core of Lucene's logical architecture is the idea of a document
> containing fields of text. This flexibility allows Lucene's API to be
> independent of file format. Text from PDFs, HTML, Microsoft Word
> documents, as well as many others can all be indexed so long as their
> textual information can be extracted."
>
> I'd like to be able to say the tags for this article should be [Lucene,
> PDF, HTML, Microsoft Word] because they are in field values from other
> documents. Basically, how to generate tags from just a single document
> based on other documents' field values.
>
> - Jon
>
>
> On Oct 31, 2008, at 6:17 PM, Grant Ingersoll wrote:
>
>> Hey Jon,
>>
>> Not following how the TVC (TermVectorComp) would help here. I
>> suppose you could use the "most important" terms, as defined by
>> TF-IDF, as suggested tags. The MLT (MoreLikeThis) uses this to
>> generate query terms.
>>
>> However, I'm not following the different filter query piece. Can you
>> provide a bit more details?
>>
>> One thing you did make me think, though, is it might be interesting
>> to extend TermVectorMapper so that it can output a NamedList and then
>> allow people to implement their own SolrTermVectorMapper and have it
>> customize the TV output...
>>
>> Thanks,
>> Grant
>> --------------------------
>> Grant Ingersoll
>> Lucene Boot Camp Training Nov. 3-4, 2008, ApacheCon US New Orleans.
>> http://www.lucenebootcamp.com
>>
>>
>> Lucene Helpful Hints:
>> http://wiki.apache.org/lucene-java/BasicsOfPerformance
>> http://wiki.apache.org/lucene-java/LuceneFAQ
Re: TermVectorComponent for tag generation?
Posted by Grant Ingersoll <gs...@apache.org>.
On Nov 1, 2008, at 3:04 PM, Jon Baer wrote:
>
> On Nov 1, 2008, at 1:16 PM, Grant Ingersoll wrote:
>
>> How do you propose to distinguish those words from the other ones?
>
> ** They are field values from other documents
But so are many other words from that document. What separates out
[Lucene, PDF, HTML, Microsoft Word] from the rest? Your brain made
the distinction, but what info exists in that document such that a
computer can? (This is a leading question. I have some ideas, but I
think hearing it from you will help me better understand what you are
trying to do.)
>
>
>> The problem you are addressing is often called keyword extraction.
>> In general, it's a difficult problem, but you may have domain
>> knowledge that can help.
>
> ** I'm finding it hard to believe Lucene can do an amazing job at search
> yet has nothing to tell me whether a generated list of content is
> present in a resulting document.
I think it can. The thing I'm missing is where the generated
list comes from. Given the list, I think it's just another search,
right?
So, I suppose you could get the TV for your current document, along
with the DF (doc freq) and know which terms occur in other documents,
then you could go get those documents by searching for each of those
terms.
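
A rough sketch of that idea, assuming the term vector data has already been fetched and parsed into a plain dict. The `term_stats` shape, the `min_df` cutoff, and the `text` field name are hypothetical choices for the example; nothing here calls Solr.

```python
def shared_terms(term_stats, min_df=2):
    """Keep terms whose document frequency shows they occur in other documents.

    `term_stats` maps term -> {"tf": int, "df": int}. This is a hypothetical
    shape for already-parsed term vector output, not a Solr data structure.
    """
    # df >= 2 means at least one document besides the current one has the term.
    return sorted(t for t, s in term_stats.items() if s["df"] >= min_df)

def term_queries(terms, field="text"):
    """Build one simple field:term query string per term (field name assumed).

    No query-syntax escaping is done here; real terms may need it.
    """
    return [f"{field}:{term}" for term in terms]
```

Each resulting query string would then be run as an ordinary search to fetch the other documents containing that term.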
However, I still suspect I'm missing something, so I'd say give it a
try! Maybe trying it out in code would be the best way to articulate
it.
-Grant
Re: TermVectorComponent for tag generation?
Posted by Jon Baer <jo...@gmail.com>.
On Nov 1, 2008, at 1:16 PM, Grant Ingersoll wrote:
> How do you propose to distinguish those words from the other ones?
** They are field values from other documents
> The problem you are addressing is often called keyword extraction.
> In general, it's a difficult problem, but you may have domain
> knowledge that can help.
** I'm finding it hard to believe Lucene can do an amazing job at search
yet has nothing to tell me whether a generated list of content is present
in a resulting document. The other options of TVC are what piqued my
interest in the beginning ...
Other Options
* tv.fl - List of fields to get TV information from. Optional. If
  not specified, the fl parameter is used.
* tv.docIds - List of Lucene document ids (not the Solr Unique
  Key) to get term vectors for.
I'm pretty sure that might work for what I need it for.
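
For illustration, a request using those two options might be assembled like this. The base URL, the placeholder `q` value, and the `term_vector_url` helper are assumptions for the sketch; `tv`, `tv.tf` and `tv.df` are other TermVectorComponent parameters, and the handler path depends on how the component is registered in solrconfig.xml.

```python
from urllib.parse import urlencode

def term_vector_url(base_url, doc_ids, fields):
    """Build a term vector request URL for the given Lucene doc ids and fields.

    `base_url` is whatever handler has the TermVectorComponent configured;
    that path is an assumption and differs between installations.
    """
    params = {
        "q": "*:*",          # placeholder query (assumption)
        "tv": "true",        # turn the component on
        "tv.tf": "true",     # include term frequency
        "tv.df": "true",     # include document frequency
        "tv.fl": ",".join(fields),
        # Lucene internal document ids, not the Solr unique key
        "tv.docIds": ",".join(str(d) for d in doc_ids),
    }
    return base_url + "?" + urlencode(params)
```

For example, `term_vector_url("http://localhost:8983/solr/select", [3, 7], ["title", "body"])` yields a URL whose query string carries the comma-separated ids and fields in percent-encoded form.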
- Jon
Re: TermVectorComponent for tag generation?
Posted by Grant Ingersoll <gs...@apache.org>.
How do you propose to distinguish those words from the other ones?
The problem you are addressing is often called keyword extraction. In
general, it's a difficult problem, but you may have domain knowledge
that can help.
On Oct 31, 2008, at 6:35 PM, Jon Baer wrote:
> Well, for example, in any given text (which is a field on a document):
>
> "While suitable for any application which requires full text
> indexing and searching capability, Lucene has been widely recognized
> for its utility in the implementation of Internet search engines and
> local, single-site searching.
>
> At the core of Lucene's logical architecture is the idea of a
> document containing fields of text. This flexibility allows Lucene's
> API to be independent of file format. Text from PDFs, HTML,
> Microsoft Word documents, as well as many others can all be indexed
> so long as their textual information can be extracted."
>
> I'd like to be able to say the tags for this article should be
> [Lucene, PDF, HTML, Microsoft Word] because they are in field values
> from other documents. Basically, how to generate tags from just a
> single document based on other documents' field values.
>
> - Jon
Re: TermVectorComponent for tag generation?
Posted by Jon Baer <jo...@gmail.com>.
Well, for example, in any given text (which is a field on a document):
"While suitable for any application which requires full text indexing
and searching capability, Lucene has been widely recognized for its
utility in the implementation of Internet search engines and local,
single-site searching.
At the core of Lucene's logical architecture is the idea of a document
containing fields of text. This flexibility allows Lucene's API to be
independent of file format. Text from PDFs, HTML, Microsoft Word
documents, as well as many others can all be indexed so long as their
textual information can be extracted."
I'd like to be able to say the tags for this article should be [Lucene,
PDF, HTML, Microsoft Word] because they are in field values from other
documents. Basically, how to generate tags from just a single document
based on other documents' field values.
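
One way to picture this request is as a dictionary lookup: collect the field values from the other documents into a list, then check which of them appear in the article text. The sketch below does that with plain regular expressions; the `match_known_values` name and the trailing-"s" plural handling are assumptions for the example, and it does not use Solr or Lucene at all.

```python
import re

def match_known_values(text, known_values):
    """Return the known values that occur in `text` as whole words or phrases."""
    found = []
    for value in known_values:
        # \b anchors keep "PDF" from matching inside a longer token; the
        # optional "s" lets the plural "PDFs" still count as a hit.
        pattern = r"\b" + re.escape(value) + r"s?\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(value)
    return found
```

With the known values `["Lucene", "PDF", "HTML", "Microsoft Word"]` and the passage quoted above, every one of them is found. A linear scan like this would not scale to a large set of values, which is where an index-based approach would come in.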
- Jon
On Oct 31, 2008, at 6:17 PM, Grant Ingersoll wrote:
> Hey Jon,
>
> Not following how the TVC (TermVectorComp) would help here. I
> suppose you could use the "most important" terms, as defined by
> TF-IDF, as suggested tags. The MLT (MoreLikeThis) uses this to
> generate query terms.
>
> However, I'm not following the different filter query piece. Can
> you provide a bit more details?
>
> One thing you did make me think, though, is it might be interesting
> to extend TermVectorMapper so that it can output a NamedList and
> then allow people to implement their own SolrTermVectorMapper and
> have it customize the TV output...
>
> Thanks,
> Grant
Re: TermVectorComponent for tag generation?
Posted by Grant Ingersoll <gs...@apache.org>.
Hey Jon,
Not following how the TVC (TermVectorComp) would help here. I
suppose you could use the "most important" terms, as defined by
TF-IDF, as suggested tags. The MLT (MoreLikeThis) uses this to generate
query terms.
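
As a rough illustration of ranking terms by TF-IDF, here is a small self-contained sketch. It uses a smoothed textbook formula, not Lucene's actual scoring or what MoreLikeThis computes internally, and the `top_tfidf_terms` name and list-of-token-lists corpus are assumptions for the example.

```python
import math
from collections import Counter

def top_tfidf_terms(doc_tokens, corpus, k=3):
    """Rank the terms of one document by tf * idf against a small corpus.

    `corpus` is a list of token lists. The idf is smoothed so a term that is
    missing from the corpus does not cause a division by zero.
    """
    n_docs = len(corpus)
    doc_freq = Counter()
    for tokens in corpus:
        doc_freq.update(set(tokens))  # count each term once per document
    tf = Counter(doc_tokens)
    scores = {
        t: tf[t] * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
        for t in tf
    }
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]
```

Terms that are frequent in the document but rare across the corpus come out on top, which is the property that makes them plausible tag suggestions.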
However, I'm not following the different filter query piece. Can you
provide a bit more details?
One thing you did make me think, though, is it might be interesting to
extend TermVectorMapper so that it can output a NamedList and then
allow people to implement their own SolrTermVectorMapper and have it
customize the TV output...
Thanks,
Grant
On Oct 31, 2008, at 5:20 PM, Jon Baer wrote:
> Hi,
>
> So I'm looking to either use this or build a component which might do
> what I'm looking for. I'd like to figure out if it's possible to use a
> single doc to get tag generation based on the matches within that
> document, for example:
>
> 1 News Doc -> contains 5 Players and 8 Teams (show them as possible
> tags for this article)
>
> In this case Players and Teams are also docs. It's almost like I
> want to use MoreLikeThis w/ a different filter query than what I'm
> using.
>
> Is there any easy hack to get this going?
>
> Thanks.
>
> - Jon