Posted to java-user@lucene.apache.org by Phil Rosen <pr...@optaros.com> on 2006/11/15 17:11:32 UTC

term vectors

Hello,

Thanks in advance for your help; I am really stumped.

I am building an application that requires me to index a set of documents on
the scale of hundreds of thousands.

A document can have a varying number of attribute fields with an unknown
set of potential values. I realize that just indexing a blob of fields
would be much faster; however, I need to bin the search results based on
common attributes. Because different types of attributes could potentially
have overlapping values, a single blob for all attributes won't work.

My question is this: is there a way to get term frequencies for a set of
documents or hits without calling getTermFreqVector() on each document and
each attribute field? As I could have hundreds of results, each with
dozens of attribute fields, looping over getTermFreqVector() would be very
slow. If there isn't something built into Lucene, has anyone seen an
extension that could accomplish this?

Thanks!

Re: term vectors

Posted by "Michael D. Curtin" <mi...@curtin.com>.

Phil Rosen wrote:

> I would like to get the sum of frequency counts for each term in the fields 
> I specify across the search results. I can just iterate through the 
> documents and use getTermFreqVector() for each desired field on each 
> document, then sum that; but this seems slow to me.

It seems that you really do want a custom statistical analysis of the hit 
documents, an analysis that is specific not just to the hits, but to the query 
as well.  I don't think there's anything built into Lucene to do this for you.

Have you run getTermFreqVector(), with representative data and query, to see 
what its performance actually turns out to be?
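
One rough way to take that measurement is sketched below in plain Java. It is only an illustration: LoopTimer, timeNanos and the per-hit callback are hypothetical names, and the callback body is where the real per-document, per-field term vector lookups would go. A warm-up pass runs first so that JIT compilation and cold caches do not dominate the timed pass.

```java
import java.util.function.IntConsumer;

/** Hypothetical helper: times a per-hit loop after one warm-up pass. */
final class LoopTimer {

    /**
     * Calls perHitWork once per hit index as a warm-up, then once more per
     * hit index while timing, and returns the elapsed nanoseconds of the
     * timed pass only.
     */
    static long timeNanos(int hitCount, IntConsumer perHitWork) {
        for (int i = 0; i < hitCount; i++) {
            perHitWork.accept(i); // warm-up pass, not timed
        }
        long start = System.nanoTime();
        for (int i = 0; i < hitCount; i++) {
            perHitWork.accept(i); // timed pass
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        // Placeholder work; a real test would fetch the term vectors for
        // each desired field of hit i here (assumption, not shown).
        long[] sink = {0};
        long nanos = timeNanos(500, i -> sink[0] += Integer.toString(i).hashCode());
        System.out.println("timed pass took " + (nanos / 1_000_000.0) + " ms");
    }
}
```

Running this against a representative index and query several times, and looking at the spread, gives a better picture than a single run.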

--MDC

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: term vectors

Posted by Grant Ingersoll <gs...@apache.org>.

Can this be done as an offline task?

On Nov 15, 2006, at 1:07 PM, Phil Rosen wrote:

> Thanks for your help!
>
> Here is an example: I have 100 items, each with a set of potentially unique
> attributes. Attributes could be color, size, length, density, etc. So an
> example document could be:
>
> Id: 1
> ItemType: foo
> Blob-field: all sorts of text handled normally
> Outer-Color: Red
> Size: Large
> Temperature: hot
> Etc:
> Etc:
>
> I would like to get the sum of frequency counts for each term in the fields
> I specify across the search results. I can just iterate through the
> documents and use getTermFreqVector() for each desired field on each
> document, then sum that; but this seems slow to me.

--------------------------
Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
335 Hinds Hall
Syracuse, NY 13244
http://www.cnlp.org

RE: term vectors

Posted by Phil Rosen <pr...@optaros.com>.

Thanks for your help!

Here is an example: I have 100 items, each with a set of potentially unique
attributes. Attributes could be color, size, length, density, etc. So an
example document could be:

Id: 1
ItemType: foo
Blob-field: all sorts of text handled normally
Outer-Color: Red
Size: Large
Temperature: hot
Etc:
Etc:

I would like to get the sum of frequency counts for each term in the fields 
I specify across the search results. I can just iterate through the 
documents and use getTermFreqVector() for each desired field on each 
document, then sum that; but this seems slow to me.
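
The summation described above can be sketched in plain Java as follows. This is a minimal illustration, not Lucene code: TermVectorSource is a hypothetical stand-in for whatever returns one document's term frequencies for one field (in Lucene, the per-document getTermFreqVector() call), and the toy index in main is made-up data modelled on the example document.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Sketch: sums per-field term frequencies over a list of hit document ids. */
final class HitTermFrequencySum {

    /** Hypothetical stand-in for a per-document, per-field term vector lookup. */
    interface TermVectorSource {
        /** Returns term -> frequency for one field of one document, or null if the field is absent. */
        Map<String, Integer> termFrequencies(int docId, String field);
    }

    /** Returns field -> (term -> frequency summed across all hits). */
    static Map<String, Map<String, Integer>> sum(TermVectorSource source,
                                                 List<Integer> hitDocIds,
                                                 List<String> fields) {
        Map<String, Map<String, Integer>> totals = new LinkedHashMap<>();
        for (String field : fields) {
            Map<String, Integer> perTerm = new TreeMap<>();
            for (int docId : hitDocIds) {
                Map<String, Integer> vector = source.termFrequencies(docId, field);
                if (vector == null) {
                    continue; // this document has no such attribute field
                }
                for (Map.Entry<String, Integer> e : vector.entrySet()) {
                    perTerm.merge(e.getKey(), e.getValue(), Integer::sum);
                }
            }
            totals.put(field, perTerm);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Toy in-memory "index": docId -> field -> term -> frequency.
        Map<Integer, Map<String, Map<String, Integer>>> index = Map.of(
            1, Map.of("Outer-Color", Map.of("red", 1), "Size", Map.of("large", 1)),
            2, Map.of("Outer-Color", Map.of("red", 1), "Temperature", Map.of("hot", 1)),
            3, Map.of("Size", Map.of("small", 1)));
        TermVectorSource source = (docId, field) -> index.get(docId).get(field);
        System.out.println(sum(source, List.of(1, 2, 3), List.of("Outer-Color", "Size")));
    }
}
```

The cost is one lookup per hit per field, which is exactly the loop in question; the sketch only makes the shape of the work explicit.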

Re: term vectors

Posted by "Michael D. Curtin" <mi...@curtin.com>.

Phil Rosen wrote:

> I am building an application that requires me to index a set of documents on
> the scale of hundreds of thousands.
>
> A document can have a varying number of attribute fields with an unknown
> set of potential values. I realize that just indexing a blob of fields
> would be much faster; however, I need to bin the search results based on
> common attributes. Because different types of attributes could potentially
> have overlapping values, a single blob for all attributes won't work.
>
> My question is this: is there a way to get term frequencies for a set of
> documents or hits without calling getTermFreqVector() on each document and
> each attribute field? As I could have hundreds of results, each with
> dozens of attribute fields, looping over getTermFreqVector() would be very
> slow. If there isn't something built into Lucene, has anyone seen an
> extension that could accomplish this?

Could you give an example of what you're starting with, what a search looks 
like, and what you want out?  It sounds almost like you're looking for a 
custom statistical analysis of hits, which I doubt Lucene is going to have for 
you, out of the box ...

--MDC

Re: term vectors

Posted by Erick Erickson <er...@gmail.com>.

Why do you think you need term frequencies in the first place? What is it
that you're trying to do that just searching wouldn't accomplish?

I've often jumped into the middle of something and made it waaaaay too
complex, so I'm asking to see if you're doing something similar <G>.

Lucene does not require that each document have identical fields. So you
could simply index a varying number of fields with each document,
corresponding to the attributes of your files. After a search, you could
determine what you need from which docs have which attributes (fields). It
seems to me that if you form your query appropriately, the search results
*are* the results you want and this kind of analysis wouldn't be necessary,
since your search would only return documents that have the fields you
specify and the values you want in those fields (attributes).

But that may just reflect that I don't understand the problem you're trying
to solve very well <G>...
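
To make the per-field idea concrete, here is a small plain-Java sketch of binning hits by attribute. It assumes each hit has already been loaded as a map of field name to a single value; AttributeBins and bin are hypothetical names, no Lucene API is used, and the sample hits are invented.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Sketch: counts how many hits carry each value of each attribute field. */
final class AttributeBins {

    /** Returns field -> (value -> number of hit documents with that value). */
    static Map<String, Map<String, Integer>> bin(List<Map<String, String>> hits) {
        Map<String, Map<String, Integer>> bins = new TreeMap<>();
        for (Map<String, String> doc : hits) {
            // Documents may carry different sets of fields; only the
            // fields actually present on a document are counted.
            for (Map.Entry<String, String> attr : doc.entrySet()) {
                bins.computeIfAbsent(attr.getKey(), k -> new TreeMap<>())
                    .merge(attr.getValue(), 1, Integer::sum);
            }
        }
        return bins;
    }

    public static void main(String[] args) {
        List<Map<String, String>> hits = List.of(
            Map.of("Outer-Color", "Red", "Size", "Large"),
            Map.of("Outer-Color", "Red", "Temperature", "hot"),
            Map.of("Size", "Small"));
        System.out.println(bin(hits));
    }
}
```

Because values are keyed by field, the same value appearing under two different attributes lands in two separate bins, which is the case a single blob field cannot distinguish.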

Best
Erick
