
Posted to java-user@lucene.apache.org by "Zhao, Xin" <xz...@jhmi.edu> on 2006/08/24 17:15:25 UTC

controlled library

Hi,
I have a design question. Here is what we are trying to do for indexing:
We designed an indexing tool that generates standard MeSH terms from medical citations, and we then use Lucene to save the terms and citations for future search. The information we need to save is:
a) the exact MeSH terms (top 10)
b) the score for each term
so the code looks like this:
----------------------------------- 
for each of the top 10 MeSH terms
 Field myField = Field.Keyword("mesh", mesh.toLowerCase());
 myField.setBoost(score);
 doc.add(myField);
end for
------------------------------------
As you can see, we add all the terms under the same field name, "mesh". If I understand correctly, all fields with the same name are eventually merged into one field, with all the scores combined into a single normalized field boost. In that case we cannot store a separate score for each term, so the information is lost. Am I correct? Is there any way to change this? I understand Lucene is designed for keyword search, whereas what we are trying to do is controlled-vocabulary search. Is there another tool we could use?

Thank you,
Xin
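
A note on the question above. As far as the Lucene documentation of that era describes it, the boosts of all same-named fields in a document are multiplied together and folded into one norm value, so the individual scores cannot be read back. The sketch below only models that multiplication in plain Java; the class name, method name, and score values are made up for illustration, and the real norm also involves length normalization and lossy byte encoding.

```java
import java.util.Arrays;
import java.util.List;

public class BoostCollapseSketch {
    // Assumed model: boosts of same-named fields collapse into one product.
    public static float combinedBoost(List<Float> perTermBoosts) {
        float product = 1.0f;
        for (float boost : perTermBoosts) {
            product *= boost;
        }
        return product;
    }

    public static void main(String[] args) {
        // Two hypothetical citations with different per-term scores.
        List<Float> first = Arrays.asList(2.0f, 0.5f, 3.0f);
        List<Float> second = Arrays.asList(1.0f, 1.0f, 3.0f);
        // Both collapse to the same single value, so the per-term scores are gone.
        System.out.println(combinedBoost(first));
        System.out.println(combinedBoost(second));
    }
}
```

Under this model, two quite different sets of term scores become indistinguishable once indexed, which is the information loss the question describes.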

Re: controlled vocabulary

Posted by "Zhao, Xin" <xz...@jhmi.edu>.

Hi, Rupinder,
Our algorithm is a little different from what PubMed does. We compute a 
score for each MeSH term, and that score affects the search results.
What do you think the difference would be between these two:
document.add(Field.Keyword("mesh", "xxxx"));
and
document.add(new Field("mesh", "xxxx", Field.Store.YES,
    Field.Index.TOKENIZED));

Thank you,
Xin
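
For readers comparing the two calls above: Field.Keyword indexes the whole value as a single term, while Field.Index.TOKENIZED passes the value through an analyzer first. The sketch below imitates that difference in plain Java without Lucene. The tokenized method is only a simplified stand-in for an analyzer (lowercase, split on non-alphanumerics), and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class KeywordVsTokenizedSketch {
    // Untokenized (keyword-style): the whole value is one indexed term.
    public static List<String> untokenized(String value) {
        List<String> terms = new ArrayList<>();
        terms.add(value);
        return terms;
    }

    // Tokenized: a crude stand-in for an analyzer, not Lucene's actual one.
    public static List<String> tokenized(String value) {
        List<String> terms = new ArrayList<>();
        for (String token : value.toLowerCase().split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(untokenized("heart failure")); // one term
        System.out.println(tokenized("heart failure"));   // two terms
    }
}
```

With a multi-word MeSH heading, the keyword form only matches a query for the exact heading, whereas the tokenized form also matches queries on the individual words.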



----- Original Message ----- 
From: "Rupinder Singh Mazara" <rm...@masterfile.com>
To: <ja...@lucene.apache.org>
Sent: Friday, August 25, 2006 11:27 AM
Subject: Re: controlled vocabulary




---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: controlled vocabulary

Posted by Rupinder Singh Mazara <rm...@masterfile.com>.

Hi Xin

   then perhaps you can change it to Field.Index.TOKENIZED, but I was 
not aware that PubMed boosts MeSH terms. They broadly classify terms as 
major and minor; if you plan to use this simple system of classification, 
consider adding the major terms twice to the document?
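
The "add major terms twice" idea can be sketched as follows, without Lucene. Duplicating a term raises its frequency within the field, which a frequency-based scorer then rewards. The class and method names and the example terms are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MajorTermsTwiceSketch {
    // Build the values to add under the "mesh" field: major terms appear twice.
    public static List<String> fieldValues(List<String> major, List<String> minor) {
        List<String> values = new ArrayList<>();
        for (String term : major) {
            values.add(term);
            values.add(term);
        }
        values.addAll(minor);
        return values;
    }

    // Count how often each term would occur in the field.
    public static Map<String, Integer> termFrequencies(List<String> values) {
        Map<String, Integer> freq = new LinkedHashMap<>();
        for (String value : values) {
            freq.merge(value, 1, Integer::sum);
        }
        return freq;
    }

    public static void main(String[] args) {
        List<String> values = fieldValues(
                Arrays.asList("neoplasms"), Arrays.asList("humans"));
        System.out.println(termFrequencies(values));
    }
}
```

If the scorer weights a term by the square root of its frequency, as Lucene's default similarity is usually described, a doubled term gains a factor of roughly 1.41 rather than 2. This gives only two levels of weight, so it does not preserve a continuous per-term score.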


Re: controlled vocabulary

Posted by "Zhao, Xin" <xz...@jhmi.edu>.

Hi, Rupinder,
My understanding is that Field.Index.NO_NORMS disables both index-time 
boosting and field length normalization. But I do need index-time 
boosting to store the score of each MeSH term. Have I missed anything?
Thank you very much for your help,
Xin

----- Original Message ----- 
From: "Rupinder Singh Mazara" <rm...@masterfile.com>
To: <ja...@lucene.apache.org>
Sent: Friday, August 25, 2006 10:49 AM
Subject: Re: controlled vocabulary



Re: controlled vocabulary

Posted by Rupinder Singh Mazara <rm...@masterfile.com>.

hi Xin

  take a look at this; you can add multiple fields with the name 
mesh:
for (int i = 0; i < meshList.size(); i++) {
    meshTerm = meshList.get(i);
    // one "mesh" field per term, indexed without norms
    document.add(new Field("mesh", meshTerm.semanticWebConceptId,
        Field.Store.YES, Field.Index.NO_NORMS));
}

  when querying this index, create an analyzer that takes the query text 
and generates the IDs that correspond to the MeSH terms in the 
semantic web
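
The query-side half of this suggestion, mapping query text to concept IDs before searching the "mesh" field, might look roughly like the plain-Java sketch below. It is a simple lookup table rather than a real Lucene analyzer, and the class name, the comma-separated query format, and the IDs "C0001" and "C0002" are all placeholders, not real MeSH or semantic web identifiers.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ConceptIdLookupSketch {
    private final Map<String, String> termToId = new HashMap<>();

    // Register a term and the (placeholder) concept id it maps to.
    public void register(String termText, String conceptId) {
        termToId.put(termText.toLowerCase(), conceptId);
    }

    // Translate a comma-separated query into concept ids; unknown phrases are skipped.
    public List<String> toConceptIds(String query) {
        List<String> ids = new ArrayList<>();
        for (String phrase : query.split(",")) {
            String id = termToId.get(phrase.trim().toLowerCase());
            if (id != null) {
                ids.add(id);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        ConceptIdLookupSketch lookup = new ConceptIdLookupSketch();
        lookup.register("Heart Failure", "C0001");
        lookup.register("Hypertension", "C0002");
        System.out.println(lookup.toConceptIds("hypertension, heart failure"));
    }
}
```

The resulting IDs would then be searched as exact terms against the untokenized "mesh" field.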

 
 

Re: controlled vocabulary

Posted by Dedian Guo <gd...@gmail.com>.

Hi, Xin, in my understanding, a document in Lucene is a collection of
fields, while a field is a pair of keyword (name) and value, though it can
be indexed, stored, or both. That is a flat structure. If you want to
index a deep tree structure, such as complex objects, and keep the
relationships inside it, I guess we need some tricks. So in the solution I
mentioned, I do something with the keywords of a document (here, a document
represents an object). The score problem you mentioned in your question is
similar: the score is actually an attribute of the mesh object, so you want
to index information that has a tree-like structure. (I met a similar
problem when indexing XML-based pages, especially those with lots of deeply
nested element nodes; deep indexing is needed for deep searching.)

Correct me if I am wrong or if there are better solutions...
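
One possible reading of the "do something with the keywords" trick is to flatten the tree into dotted field names, so that each leaf becomes one flat name/value pair that could be added to a document. The sketch below shows only the flattening, in plain Java; the naming scheme and the example data are assumptions, not something stated in this thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FlattenToFieldsSketch {
    // Recursively turn nested maps into flat "parent.child" -> value pairs.
    @SuppressWarnings("unchecked")
    public static Map<String, String> flatten(String prefix, Map<String, Object> node) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : node.entrySet()) {
            String name = prefix.isEmpty() ? entry.getKey() : prefix + "." + entry.getKey();
            if (entry.getValue() instanceof Map) {
                fields.putAll(flatten(name, (Map<String, Object>) entry.getValue()));
            } else {
                fields.put(name, String.valueOf(entry.getValue()));
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        // Hypothetical object: a citation with one scored mesh term.
        Map<String, Object> mesh = new LinkedHashMap<>();
        mesh.put("term", "neoplasms");
        mesh.put("score", 0.9);
        Map<String, Object> citation = new LinkedHashMap<>();
        citation.put("title", "example title");
        citation.put("mesh", mesh);
        System.out.println(flatten("", citation));
    }
}
```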

On 8/25/06, Zhao, Xin <xz...@jhmi.edu> wrote:

Re: controlled vocabulary

Posted by "Zhao, Xin" <xz...@jhmi.edu>.

Now I have second thoughts about one MeSH term per document. The scoring 
formula (and Hits too) is based on documents, right? Does that mean we 
shouldn't have more than one document for each object indexed?
For example, I am trying to index a publication. Some of the information, 
like the title and abstract, I would like to store and index using the 
default similarity, while for the other information I would like to use a 
customized similarity. I should probably use a separate index directory and 
writer instead of two documents in the same index, right?
Thank you for helping me. As you can see, I am at an early learning 
stage.
xin



----- Original Message ----- 
From: "Zhao, Xin" <xz...@jhmi.edu>
To: <ja...@lucene.apache.org>
Sent: Friday, August 25, 2006 10:21 AM
Subject: Re: controlled vocabulary



Re: controlled vocabulary

Posted by "Zhao, Xin" <xz...@jhmi.edu>.

Hi,
Thank you for your reply. I had thought about the first two solutions 
before. If we use one doc for each MeSH term, that would be 26 docs for each 
item digested (we actually need the top 25 MeSH terms generated). Would it 
be a problem to have that many documents? If we use field names like 
"mesh_1", "mesh_2", ..., then at search time we will have to loop over the 
fields for every single query term (there will be more than 20-30 terms on 
average, since we are using the semantic web to implement concept search). 
Do you think that would hurt performance badly?
Regards,
Xin
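
To put a number on the concern above: with numbered fields, every query term has to be tried against every field, so the clause count is the product of the two. The sketch below builds such clauses as query-syntax strings in plain Java; the class and method names are hypothetical and nothing here calls Lucene.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NumberedFieldQuerySketch {
    // One clause per (query term, numbered field) pair, e.g. mesh_3:"term".
    public static List<String> clauses(List<String> queryTerms, int fieldCount) {
        List<String> out = new ArrayList<>();
        for (String term : queryTerms) {
            for (int i = 1; i <= fieldCount; i++) {
                out.add("mesh_" + i + ":\"" + term + "\"");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("hypertension", "heart failure");
        // 2 terms x 25 fields = 50 clauses; 30 terms would give 750.
        System.out.println(clauses(terms, 25).size());
    }
}
```

At 30 query terms and 25 fields this reaches 750 clauses, which is worth comparing against the maximum clause count of a boolean query (1024 by default in Lucene, as far as I recall).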


----- Original Message ----- 
From: "Dedian Guo" <gd...@gmail.com>
To: <ja...@lucene.apache.org>; "Zhao, Xin" <xz...@jhu.edu>
Sent: Thursday, August 24, 2006 4:22 PM
Subject: Re: controlled library



Re: controlled library

Posted by Dedian Guo <gd...@gmail.com>.

In my solution, you can use one doc for each MeSH term, or use different
keywords such as "mesh_1" ... "mesh_10" for your top 10 terms. Or you can
group your MeSH terms into one string and add that to a field, which
requires a simple string parser for the grouped string when you want to
read the terms back.

not sure if that works or answers your question...
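
The third option, grouping the terms into one string with a small parser to read them back, could be sketched like this. The "term|score;term|score" layout is an assumption chosen for illustration (MeSH headings can contain commas, so a comma is avoided as the separator), and terms are assumed not to contain ';' or '|'.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class GroupedMeshFieldSketch {
    // Encode terms and scores as "term|score;term|score".
    public static String encode(Map<String, Float> termScores) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Float> entry : termScores.entrySet()) {
            if (sb.length() > 0) {
                sb.append(';');
            }
            sb.append(entry.getKey()).append('|').append(entry.getValue());
        }
        return sb.toString();
    }

    // Parse the grouped string back into terms and scores, keeping their order.
    public static Map<String, Float> decode(String grouped) {
        Map<String, Float> termScores = new LinkedHashMap<>();
        if (grouped.isEmpty()) {
            return termScores;
        }
        for (String pair : grouped.split(";")) {
            int bar = pair.lastIndexOf('|');
            termScores.put(pair.substring(0, bar),
                    Float.parseFloat(pair.substring(bar + 1)));
        }
        return termScores;
    }

    public static void main(String[] args) {
        Map<String, Float> scores = new LinkedHashMap<>();
        scores.put("heart failure", 0.75f);
        scores.put("hypertension", 0.5f);
        String grouped = encode(scores);
        System.out.println(grouped);
        System.out.println(decode(grouped));
    }
}
```

A string like this would be stored but not searched on; it keeps the exact per-term scores retrievable, while a separate indexed field would still be needed to match the terms themselves.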
