Posted to java-user@lucene.apache.org by Stephane Vaucher <va...@cirano.qc.ca> on 2003/04/28 20:22:34 UTC
How to store document meta information
Hello everyone,
I've got a document that I run through an information extraction engine
that returns a list of concepts associated to a document with an
appropriate relevancy factor (for example, with a news article, it might
return sport=100%, literature=84% and politics=10%).
I would like to index these concepts with an indication of their relevancy
levels. Is there a recommended way of doing this? Searching the FAQs, I
found none, but from my knowledge of lucene, I gather I could do it the
following ways:
1) If all concepts were to be stored in a single field (as I would
prefer), I don't think I can use field boosting, so I would probably
have to hold multiple instances of each concept (e.g. I could have 100
"sport", 84 "literature" and 10 "politics") in my field.
2) I could use multiple fields with varying boost factors. But I would be
forced to determine ahead of time how many concepts I'll have, so that I
can perform searches on all of the appropriate fields. This could probably
affect the performance of the app (I say this with no numbers, simple
intuition, so correct me if I'm wrong).
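A minimal sketch of option 1, in plain Java with no Lucene calls: it builds the text of a single concept field by repeating each concept in proportion to its weight. The class name WeightedConceptField, the method build and the scale parameter are hypothetical names chosen for illustration, and weights are assumed to be on a 0-100 scale as in the example above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper: not part of Lucene.
final class WeightedConceptField {

    // Returns whitespace-separated text in which each concept appears
    // round(weight / scale) times, and at least once.
    // 'scale' is the number of weight points that map to one repetition.
    static String build(Map<String, Integer> weights, int scale) {
        if (scale <= 0) {
            throw new IllegalArgumentException("scale must be positive");
        }
        StringJoiner text = new StringJoiner(" ");
        for (Map.Entry<String, Integer> entry : weights.entrySet()) {
            int repeats = Math.max(1, Math.round(entry.getValue() / (float) scale));
            for (int i = 0; i < repeats; i++) {
                text.add(entry.getKey());
            }
        }
        return text.toString();
    }

    public static void main(String[] args) {
        Map<String, Integer> weights = new LinkedHashMap<>();
        weights.put("sport", 100);
        weights.put("literature", 84);
        weights.put("politics", 10);
        // With scale 10: "sport" x10, "literature" x8, "politics" x1.
        System.out.println(build(weights, 10));
    }
}
```

The resulting string would be indexed as one tokenized field. Two caveats, both worth checking against the Lucene version in use: multi-word concepts such as "United States" would need to reach the index as a single token, and, as far as I know, the default similarity dampens term frequency rather than using it linearly, so the score will probably not grow in direct proportion to the number of repetitions.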
Any ideas, pointers or links are appreciated,
sv
---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org
Re: How to store document meta information
Posted by Stephane Vaucher <va...@cirano.qc.ca>.
On Mon, 28 Apr 2003, Joshua O'Madadhain wrote:
> On Mon, 28 Apr 2003, Stephane Vaucher wrote:
>
> > I've got a document that I run through an information extraction engine
> > that returns a list of concepts associated to a document with an
> > appropriate relevancy factor (for example, with a news article, it might
> > return sport=100%, literature=84% and politics=10%).
>
> It's unclear what the semantics of your relevancy measure are. Is it
> something like a fuzzy set measure ('this article is 100% in the set of
> documents about sports, 84% in ... literature, and 10% in politics')?
My IR server extracts entities such as geographic locations and
people names and assigns each one a weight, using some combination of a
statistical and a linguistic model, that should determine the importance
of the term. It's been a while since my AI course, so I hope I'm not
talking nonsense, but it probably is fuzzy.
example:
If an input is a news article, it can return for example:
"United States" type=geo location weight=98 frequency=4
"Bush" type=person name weight=78 frequency=2
>
> > I would like to index these concepts with an indication of their relevancy
> > levels. Is there a recommended way of doing this? Searching the FAQs, I
> > found none, but from my knowledge of lucene, I gather I could do it the
> > following ways:
> >
> > 1) If all concepts were to be stored in a single field (as I would
> > prefer), I don't think I can use field boosting, so I would have to
> > probably hold multiple instances of my concept (e.g. I could have 100
> > "sport", 84 "literature" and 10 "politics") in my field.
> >
> > 2) I could use multiple fields with varying boost factors. But I would be
> > forced to determine ahead of time how many concepts I'll have to perform
> > searches on all of the appropriate fields. This could probably affect the
> > performance of the app (I say this with no numbers, simple intuition, so
> > correct me if I'm wrong).
>
> How do you intend to use these concepts in the search process? That is,
> how will these concepts be used by (a) the user in specifying a query, (b)
> the indexer in storing the associated documents, (c) the searcher in
> retrieving documents, and (d) the presentation of the results to the user?
> Without knowing these things, it's hard to answer your question (at least
> for me).
I hope I answer your questions correctly:
a) A query could include extracted geographical locations: a user who
wants American news might specify "United States", and articles with a
high weight for the U.S. should show up first.
b) I'll have to store these concepts in one (or multiple) fields so I can
search on them. I need to find a way to represent the weight though...
c) I'll have fields with concepts. How I use them will depend on the way I
index the docs.
d) I would like to show the concepts found in the result set, so I can help
a user refine his search. E.g. a search for "manchester united" returns docs
with the concept 'sports'. I would allow the user to add the concept
'sports' to his query.
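Point (d) could be sketched roughly as follows, again in plain Java rather than against the Lucene API. It assumes the concepts and weights of each hit have already been read back from the index into a map; the class name ConceptRefinement and the method suggest are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: not part of Lucene.
final class ConceptRefinement {

    // Sums each concept's weight over all hits and returns the 'limit'
    // concepts with the highest totals (ties broken alphabetically).
    static List<String> suggest(List<Map<String, Integer>> hits, int limit) {
        Map<String, Integer> totals = new HashMap<>();
        for (Map<String, Integer> hit : hits) {
            for (Map.Entry<String, Integer> entry : hit.entrySet()) {
                totals.merge(entry.getKey(), entry.getValue(), Integer::sum);
            }
        }
        List<String> concepts = new ArrayList<>(totals.keySet());
        concepts.sort((a, b) -> {
            int byWeight = Integer.compare(totals.get(b), totals.get(a));
            return byWeight != 0 ? byWeight : a.compareTo(b);
        });
        int end = Math.max(0, Math.min(limit, concepts.size()));
        return new ArrayList<>(concepts.subList(0, end));
    }

    public static void main(String[] args) {
        Map<String, Integer> first = new HashMap<>();
        first.put("sports", 90);
        first.put("politics", 10);
        Map<String, Integer> second = new HashMap<>();
        second.put("sports", 80);
        second.put("geo", 40);
        List<Map<String, Integer>> hits = new ArrayList<>();
        hits.add(first);
        hits.add(second);
        // Concepts offered to the user as possible additions to the query.
        System.out.println(suggest(hits, 2));
    }
}
```

Summing the weights is only one possible ranking; counting how many hits mention a concept would be another.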
My two points described the ways I thought I could index concept fields.
cheers,
sv
> Regards,
>
> Joshua O'Madadhain
>
> jmadden@ics.uci.edu...Obscurium Per Obscurius...www.ics.uci.edu/~jmadden
> Joshua O'Madadhain: Information Scientist, Musician, Philosopher-At-Tall
> It's that moment of dawning comprehension that I live for--Bill Watterson
> My opinions are too rational and insightful to be those of any organization.
>
Re: How to store document meta information
Posted by Joshua O'Madadhain <jm...@ics.uci.edu>.
On Mon, 28 Apr 2003, Stephane Vaucher wrote:
> I've got a document that I run through an information extraction engine
> that returns a list of concepts associated to a document with an
> appropriate relevancy factor (for example, with a news article, it might
> return sport=100%, literature=84% and politics=10%).
It's unclear what the semantics of your relevancy measure are. Is it
something like a fuzzy set measure ('this article is 100% in the set of
documents about sports, 84% in ... literature, and 10% in politics')?
> I would like to index these concepts with an indication of their relevancy
> levels. Is there a recommended way of doing this? Searching the FAQs, I
> found none, but from my knowledge of lucene, I gather I could do it the
> following ways:
>
> 1) If all concepts were to be stored in a single field (as I would
> prefer), I don't think I can use field boosting, so I would have to
> probably hold multiple instances of my concept (e.g. I could have 100
> "sport", 84 "literature" and 10 "politics") in my field.
>
> 2) I could use multiple fields with varying boost factors. But I would be
> forced to determine ahead of time how many concepts I'll have to perform
> searches on all of the appropriate fields. This could probably affect the
> performance of the app (I say this with no numbers, simple intuition, so
> correct me if I'm wrong).
How do you intend to use these concepts in the search process? That is,
how will these concepts be used by (a) the user in specifying a query, (b)
the indexer in storing the associated documents, (c) the searcher in
retrieving documents, and (d) the presentation of the results to the user?
Without knowing these things, it's hard to answer your question (at least
for me).
Regards,
Joshua O'Madadhain
jmadden@ics.uci.edu...Obscurium Per Obscurius...www.ics.uci.edu/~jmadden
Joshua O'Madadhain: Information Scientist, Musician, Philosopher-At-Tall
It's that moment of dawning comprehension that I live for--Bill Watterson
My opinions are too rational and insightful to be those of any organization.