Posted to java-user@lucene.apache.org by Terry Steichen <te...@net-frame.com> on 2002/11/17 20:00:58 UTC

Enumerating Concatenated Fields

I have a collection of XML documents, each of which contains a 'codes' section that in turn contains zero or more 'code' sections.  When I index the documents, I concatenate all the non-empty 'code' sections into a single 'codes' index field to facilitate boolean searching.

Given my structure, is there a way that I could get a list of all the defined 'code' values in the entire set of documents?  If not (as I suspect), is there a way that I could change the indexing scheme to add this functionality?

Regards,

Terry




Re: Enumerating Concatenated Fields

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Yes, you will have to parse the 'codes' document field yourself, of
course.  You concatenated all those values manually, so you'll have to
split them manually, too.
The process that I described before can still be used; you just have to
add one step: split the value of each 'codes' field on the comma
delimiter and store each piece in a Set.  Since a Set doesn't allow
duplicates, at the end of the loop over all documents you will have the
unique set of code values used in your index.

Otis
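
The split-and-collect step described above could be sketched roughly as
follows. This is only an illustration in plain Java, not code from the
thread: the class name CodeCollector and the method collect are
hypothetical, and it assumes the 'codes' values have already been read
out of the index as strings and that no code value itself contains a
comma.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper: gathers the distinct codes from 'codes' field values. */
final class CodeCollector {

    /**
     * Splits each field value on commas, trims surrounding whitespace,
     * skips empty pieces, and collects the rest in a sorted Set, so each
     * code appears only once.
     */
    static Set<String> collect(List<String> codesFieldValues) {
        Set<String> unique = new TreeSet<>();
        for (String value : codesFieldValues) {
            if (value == null) {
                continue; // this document had no 'codes' field
            }
            for (String piece : value.split(",")) {
                String code = piece.trim();
                if (!code.isEmpty()) {
                    unique.add(code);
                }
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList(
                "value_of_code1, value_of_code2, value_of_code3",
                "value_of_code2",
                "");
        // prints [value_of_code1, value_of_code2, value_of_code3]
        System.out.println(collect(values));
    }
}
```

A TreeSet is used here only so the result comes back sorted; a HashSet
would remove duplicates just as well.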


--
To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>


Re: Enumerating Concatenated Fields

Posted by Terry Steichen <te...@net-frame.com>.
Otis,

Thanks for your response, but I don't think I was particularly clear in my
original message.  Here's an expanded description.

For each Lucene Document in the index there will be a 'codes' field which
will contain a comma-delimited set of codes (this is the result of my
concatenation at index-time of the individual 'code' sections from each of
the corresponding XML documents).

In other words, assume the original XML document contains something like
this:
.....
<codes>
    <code>value_of_code1</code>
    <code>value_of_code2</code>
    <code>value_of_code3</code>
</codes>
....

When I index each such XML document, I create a Lucene Document that has
a field called 'codes', which has the value: "value_of_code1,
value_of_code2, value_of_code3". (I do this so I can do boolean searches on
this field, to see which documents have value_of_code1 AND
value_of_code2 AND NOT value_of_code3, for example.)

Consider that each 'value_of_codexx' is a keyword.  Each XML document may
have zero or more such keywords (aka code sections).  I'm trying to figure
out a way to get a list of all the keywords used by the XML documents that
have been indexed.    It seems to me, the index itself (even though I do
store this concatenated result in it) won't really know how to parse the
string of comma-delimited code values that comprise each 'codes' field
value.

Does that make more sense?

Regards,

Terry

--- Original Message -----
From: "Otis Gospodnetic" <ot...@yahoo.com>
To: "Lucene Users List" <lu...@jakarta.apache.org>
Sent: Sunday, November 17, 2002 4:24 PM
Subject: Re: Enumerating Concatenated Fields


> If I understand what you want: open the index with IndexReader, get the
> number of documents in it from the IndexReader, loop through all the
> documents, fetching each one by its ID, and for each of them get the
> 'codes' field out of it.
>
> Otis


Re: Enumerating Concatenated Fields

Posted by Otis Gospodnetic <ot...@yahoo.com>.
If I understand what you want: open the index with IndexReader, get the
number of documents in it from the IndexReader, loop through all the
documents, fetching each one by its ID, and for each of them get the
'codes' field out of it.

Otis
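
The loop outlined above might be sketched as follows. To keep the example
self-contained, the reader is modelled by a small hypothetical interface,
StoredDocs, which is not a Lucene type; with Lucene releases of that era
its three methods would, as far as I know, correspond to
IndexReader.maxDoc(), IndexReader.isDeleted(int) and
IndexReader.document(int) followed by Document.get("codes"). The class
name CodesFieldReader is likewise an assumption, and the sketch only works
if the 'codes' field was stored in the index.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the index reader; not part of Lucene. */
interface StoredDocs {
    int maxDoc();                          // one greater than the largest document ID
    boolean isDeleted(int docId);          // true if this ID belongs to a deleted document
    String field(int docId, String name);  // stored field value, or null if absent
}

final class CodesFieldReader {

    /** Visits every live document by ID and returns its value for the named field. */
    static List<String> readAll(StoredDocs docs, String fieldName) {
        List<String> values = new ArrayList<>();
        for (int id = 0; id < docs.maxDoc(); id++) {
            if (docs.isDeleted(id)) {
                continue; // skip slots left behind by deleted documents
            }
            String value = docs.field(id, fieldName);
            if (value != null) {
                values.add(value);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        // In-memory stand-in for three indexed documents; the second has no 'codes' field.
        final String[] stored = {"value_of_code1, value_of_code2", null, "value_of_code3"};
        StoredDocs docs = new StoredDocs() {
            public int maxDoc() { return stored.length; }
            public boolean isDeleted(int docId) { return false; }
            public String field(int docId, String name) {
                return "codes".equals(name) ? stored[docId] : null;
            }
        };
        System.out.println(readAll(docs, "codes"));
    }
}
```

Looping up to maxDoc() and skipping deleted IDs, rather than counting
only live documents, avoids missing documents when the index has gaps in
its ID sequence.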

