Posted to java-user@lucene.apache.org by Stephane Vaucher <va...@cirano.qc.ca> on 2003/04/28 20:22:34 UTC

How to store document meta information

Hello everyone,

I've got a document that I run through an information extraction engine
that returns a list of concepts associated with a document, each with an
appropriate relevancy factor (for example, with a news article, it might
return sport=100%, literature=84% and politics=10%).

I would like to index these concepts with an indication of their relevancy 
levels. Is there a recommended way of doing this? Searching the FAQs, I 
found none, but from my knowledge of lucene, I gather I could do it the 
following ways:

1) If all concepts were to be stored in a single field (as I would
prefer), I don't think I can use field boosting, so I would probably
have to hold multiple instances of each concept (e.g. I could have 100
"sport", 84 "literature" and 10 "politics") in my field.

2) I could use multiple fields with varying boost factors. But I would be
forced to determine ahead of time how many concepts I'll have, in order
to perform searches on all of the appropriate fields. This could probably
affect the performance of the app (I say this with no numbers, simple
intuition, so correct me if I'm wrong).
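
To make option 1 concrete, here is a minimal sketch in plain Java of how
the text for such a single field could be built. The class and method
names (ConceptFieldBuilder, buildFieldText) are made up for illustration,
and the sketch assumes integer weights from 0 to 100 and concept names
that are single tokens. The resulting string would then be added to the
document as one tokenized field.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene.
class ConceptFieldBuilder {

    // Repeats each concept once per percentage point of its weight and
    // joins everything with spaces, so the term frequency in the field
    // mirrors the weight. Assumes weights are integers in 0..100.
    public static String buildFieldText(Map<String, Integer> weights) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Integer> e : weights.entrySet()) {
            for (int i = 0; i < e.getValue(); i++) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(e.getKey());
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, Integer> weights = new LinkedHashMap<>();
        weights.put("sport", 100);
        weights.put("literature", 84);
        weights.put("politics", 10);
        String text = buildFieldText(weights);
        System.out.println(text.split(" ").length + " tokens in the field");
    }
}
```

One caveat with this approach: as far as I know, Lucene's default
scoring dampens term frequency (it uses the square root of the count),
so 100 repetitions would not count ten times as much as 10.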

Any ideas, pointers or links are appreciated,
sv



---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: How to store document meta information

Posted by Stephane Vaucher <va...@cirano.qc.ca>.
On Mon, 28 Apr 2003, Joshua O'Madadhain wrote:

> On Mon, 28 Apr 2003, Stephane Vaucher wrote:
> 
> > I've got a document that I run through an information extraction engine
> > that returns a list of concepts associated with a document, each with an
> > appropriate relevancy factor (for example, with a news article, it might
> > return sport=100%, literature=84% and politics=10%).
> 
> It's unclear what the semantics of your relevancy measure are.  Is it
> something like a fuzzy set measure ('this article is 100% in the set of
> documents about sports, 84% in ... literature, and 10% in politics')?

My IR server extracts entities such as geographic locations and
people names, and gives each one a weight, using some combination of a
statistical and a linguistic model, that should reflect the importance
of the term. It's been a while since my AI course, so I hope I'm not
talking nonsense, but it probably is fuzzy.

example:
If the input is a news article, it can return, for example:
"United States" type=geo location weight=98 frequency=4
"Bush"          type=person name  weight=78 frequency=2


> 
> > I would like to index these concepts with an indication of their relevancy
> > levels. Is there a recommended way of doing this? Searching the FAQs, I
> > found none, but from my knowledge of lucene, I gather I could do it the
> > following ways:
> >
> > 1) If all concepts were to be stored in a single field (as I would
> > prefer), I don't think I can use field boosting, so I would probably
> > have to hold multiple instances of each concept (e.g. I could have 100
> > "sport", 84 "literature" and 10 "politics") in my field.
> >
> > 2) I could use multiple fields with varying boost factors. But I would be
> > forced to determine ahead of time how many concepts I'll have, in order
> > to perform searches on all of the appropriate fields. This could probably
> > affect the performance of the app (I say this with no numbers, simple
> > intuition, so correct me if I'm wrong).
> 
> How do you intend to use these concepts in the search process?  That is,
> how will these concepts be used by (a) the user in specifying a query, (b)
> the indexer in storing the associated documents, (c) the searcher in
> retrieving documents, and (d) the presentation of the results to the user?
> Without knowing these things, it's hard to answer your question (at least
> for me).

I hope I answer your questions correctly:

a) A query could include extracted geographical locations. A user who
wants to search for American news might specify "United States", and
articles with a high weight for the U.S. should show up first.
b) I'll have to store these concepts in one (or multiple) fields so I can
search on them. I need to find a way to represent the weight though...
c) I'll have fields with concepts. How I use them will depend on the way I
index the docs.
d) I would like to show the concepts found in the result set, so I can
help a user refine his search. E.g. a search for "manchester united"
returns docs with the concept 'sports'. I would allow the user to add the
concept 'sports' to his query.

My two points described the ways I thought I could index concept fields.
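
To illustrate what I mean in (d), here is a rough sketch in plain Java,
independent of how the concepts end up being indexed. It assumes each hit
comes back with a map from concept name to weight; ConceptSuggester and
suggest are names I made up for the example. It sums the weights over the
hits and returns the strongest concepts as refinement suggestions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene.
class ConceptSuggester {

    // Sums each concept's weight over all hits and returns up to 'max'
    // concept names, highest total first (ties broken alphabetically).
    public static List<String> suggest(List<Map<String, Integer>> hits, int max) {
        Map<String, Integer> totals = new HashMap<>();
        for (Map<String, Integer> hit : hits) {
            for (Map.Entry<String, Integer> e : hit.entrySet()) {
                totals.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        List<String> names = new ArrayList<>(totals.keySet());
        names.sort((a, b) -> {
            int byWeight = Integer.compare(totals.get(b), totals.get(a));
            return byWeight != 0 ? byWeight : a.compareTo(b);
        });
        return new ArrayList<>(names.subList(0, Math.min(max, names.size())));
    }

    public static void main(String[] args) {
        Map<String, Integer> first = new HashMap<>();
        first.put("sports", 90);
        first.put("business", 20);
        Map<String, Integer> second = new HashMap<>();
        second.put("sports", 80);
        second.put("politics", 30);
        List<Map<String, Integer>> hits = new ArrayList<>();
        hits.add(first);
        hits.add(second);
        System.out.println(suggest(hits, 2));
    }
}
```

The user could then click one of the suggested concepts to add it to the
query.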

cheers,
sv

> Regards,
> 
> Joshua O'Madadhain
> 
>  jmadden@ics.uci.edu...Obscurium Per Obscurius...www.ics.uci.edu/~jmadden
>   Joshua O'Madadhain: Information Scientist, Musician, Philosopher-At-Tall
>  It's that moment of dawning comprehension that I live for--Bill Watterson
> My opinions are too rational and insightful to be those of any organization.




Re: How to store document meta information

Posted by Joshua O'Madadhain <jm...@ics.uci.edu>.
On Mon, 28 Apr 2003, Stephane Vaucher wrote:

> I've got a document that I run through an information extraction engine
> that returns a list of concepts associated with a document, each with an
> appropriate relevancy factor (for example, with a news article, it might
> return sport=100%, literature=84% and politics=10%).

It's unclear what the semantics of your relevancy measure are.  Is it
something like a fuzzy set measure ('this article is 100% in the set of
documents about sports, 84% in ... literature, and 10% in politics')?

> I would like to index these concepts with an indication of their relevancy
> levels. Is there a recommended way of doing this? Searching the FAQs, I
> found none, but from my knowledge of lucene, I gather I could do it the
> following ways:
>
> 1) If all concepts were to be stored in a single field (as I would
> prefer), I don't think I can use field boosting, so I would probably
> have to hold multiple instances of each concept (e.g. I could have 100
> "sport", 84 "literature" and 10 "politics") in my field.
>
> 2) I could use multiple fields with varying boost factors. But I would be
> forced to determine ahead of time how many concepts I'll have, in order
> to perform searches on all of the appropriate fields. This could probably
> affect the performance of the app (I say this with no numbers, simple
> intuition, so correct me if I'm wrong).

How do you intend to use these concepts in the search process?  That is,
how will these concepts be used by (a) the user in specifying a query, (b)
the indexer in storing the associated documents, (c) the searcher in
retrieving documents, and (d) the presentation of the results to the user?
Without knowing these things, it's hard to answer your question (at least
for me).

Regards,

Joshua O'Madadhain

 jmadden@ics.uci.edu...Obscurium Per Obscurius...www.ics.uci.edu/~jmadden
  Joshua O'Madadhain: Information Scientist, Musician, Philosopher-At-Tall
 It's that moment of dawning comprehension that I live for--Bill Watterson
My opinions are too rational and insightful to be those of any organization.
