Posted to java-user@lucene.apache.org by Alice Wong <ai...@gmail.com> on 2013/10/03 01:12:27 UTC

Associated values for a field and its value

Hello,

We would like to index some documents. Each field of a document may have
multiple values. And for each (field,value) pair there are some associated
values. These associated values are just for retrieving, not searching.

For example, a document D could have a field named A. This field has two
values a1 and a2.

It is easy to index D by adding the terms a1 and a2 to field A, so that either
query "A=a1" or "A=a2" returns D.

Suppose D also has other values associated with (A,a1) and (A,a2). We
would like to retrieve these associated values depending on whether "A=a1"
or "A=a2" is queried.

For example, if query "A=a1" returns D, we would like to return values 1
and 2. And if query "A=a2" returns D, we want to return values 3 and 10.

Is it possible to do this with Lucene? Initially we wanted to hack the
postings to return the associated values, but this seems quite complex.

Thanks!

Re: Associated values for a field and its value

Posted by Alice Wong <ai...@gmail.com>.
Okay, it makes complete sense. Thanks.


On Fri, Oct 4, 2013 at 5:15 AM, Michael Sokolov <
msokolov@safaribooksonline.com> wrote:

>  On 10/3/13 6:04 PM, Alice Wong wrote:
>
>  Mike,
>
>  That's an interesting idea. The only drawback is we have to re-parse the
> doc and find where it matches and what the associated values are. It could
> be a performance issue if the doc becomes bigger and more complex.
>
> It's true there is some overhead for document-oriented processing.  Lux
> ameliorates this by storing a predigested binary xml form that can be
> traversed efficiently without the need for xml parsing.  However,
>
>
>  I am wondering if there is a way to index a value a1 for a field A and
> store a different value "1,2" associated with a1 in Lucene. Or there might
> be a hack for this?
>
> If you want to use only low-level Lucene constructs, I think payloads
> and/or complicated field values are the way to go.  You could, for example,
> index for document D, a field called "extra" with values like "a1:1,2",
> "a2:2,3".  I think that's what Aditya suggested. You still have to parse
> these though, so why not use a prebuilt flexible parsing infrastructure?
>
>
>  Thanks.
>
>

Re: Associated values for a field and its value

Posted by Michael Sokolov <ms...@safaribooksonline.com>.
On 10/3/13 6:04 PM, Alice Wong wrote:
> Mike,
>
> That's an interesting idea. The only drawback is we have to re-parse 
> the doc and find where it matches and what the associated values are. 
> It could be a performance issue if the doc becomes bigger and more 
> complex.
It's true there is some overhead for document-oriented processing. Lux 
ameliorates this by storing a predigested binary xml form that can be 
traversed efficiently without the need for xml parsing. However,
>
> I am wondering if there is a way to index a value a1 for a field A and 
> store a different value "1,2" associated with a1 in Lucene. Or there 
> might be a hack for this?
If you want to use only low-level Lucene constructs, I think payloads 
and/or complicated field values are the way to go.  You could, for 
example, index for document D, a field called "extra" with values like 
"a1:1,2", "a2:2,3".  I think that's what Aditya suggested. You still 
have to parse these though, so why not use a prebuilt flexible parsing 
infrastructure?
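
The parsing step mentioned above can be sketched in plain Java. This is only an illustration under assumptions: the class name ExtraFieldParser and its parse method are hypothetical, the "term:v1,v2" layout is taken from the "extra" example values, and the strings are assumed to have already been read back from the stored field (no Lucene calls are shown).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: decodes stored values such as "a1:1,2" after retrieval.
final class ExtraFieldParser {

    // Maps each term (text before ':') to its associated values (comma-separated text after ':').
    static Map<String, List<String>> parse(List<String> storedValues) {
        Map<String, List<String>> byTerm = new LinkedHashMap<>();
        for (String entry : storedValues) {
            int sep = entry.indexOf(':');
            if (sep < 0) {
                continue; // skip entries that do not follow the assumed layout
            }
            List<String> values = new ArrayList<>();
            for (String v : entry.substring(sep + 1).split(",")) {
                if (!v.isEmpty()) {
                    values.add(v);
                }
            }
            byTerm.put(entry.substring(0, sep), values);
        }
        return byTerm;
    }

    public static void main(String[] args) {
        Map<String, List<String>> extra = parse(Arrays.asList("a1:1,2", "a2:2,3"));
        // Look up the values for whichever term the query matched.
        System.out.println(extra.get("a1")); // prints [1, 2]
    }
}
```

The caller still has to know which term matched (a1 or a2) in order to pick the right entry, which is the same bookkeeping any of the stored-field approaches needs.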
>
> Thanks.
>
>


Re: Associated values for a field and its value

Posted by Alice Wong <ai...@gmail.com>.
Mike,

That's an interesting idea. The only drawback is we have to re-parse the
doc and find where it matches and what the associated values are. It could
be a performance issue if the doc becomes bigger and more complex.

I am wondering if there is a way to index a value a1 for a field A and
store a different value "1,2" associated with a1 in Lucene. Or there might
be a hack for this?

Thanks.



Re: Associated values for a field and its value

Posted by Michael Sokolov <ms...@safaribooksonline.com>.
Why not store a (nonindexed) text field with some internal structure
(XML, JSON, CSV) that you can parse after retrieving it? For example:

<D>
   <A>
      <value>a1</value>
      <associated-values>
        ... whatever you want ...
      </associated-values>
   </A>
</D>

If you use Lux (luxdb.org), which is XML query support on top of Lucene, 
you can do this all automatically, and retrieve the results with a 
simple query like:

/D/A[value = 'a1']/associated-values

plus if you want to pull out the values and manipulate them, you have 
XQuery to do it with.

-Mike
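
For readers who want to try the stored-XML idea without Lux, here is a minimal sketch that uses only the XPath support bundled with the JDK. It is not Lux and not a Lucene API: the class name StoredXmlLookup is hypothetical, the XML string stands in for a stored field value that has already been retrieved, and repeating one <A> element per value is an assumption. The path goes through A because associated-values is nested inside A in the sample document.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper: pulls associated values out of a stored XML field value.
final class StoredXmlLookup {

    // Returns the text of <associated-values> under the <A> whose <value> equals term.
    static String associatedValues(String storedXml, String term) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Bind the term as the XPath variable $term instead of splicing it into the expression.
        xpath.setXPathVariableResolver(name -> "term".equals(name.getLocalPart()) ? term : null);
        try {
            return xpath.evaluate("/D/A[value = $term]/associated-values",
                    new InputSource(new StringReader(storedXml))).trim();
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("Could not evaluate XPath on stored field", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<D>"
                + "<A><value>a1</value><associated-values>1,2</associated-values></A>"
                + "<A><value>a2</value><associated-values>3,10</associated-values></A>"
                + "</D>";
        System.out.println(associatedValues(xml, "a2")); // prints 3,10
    }
}
```

This re-parses the XML on every lookup, which is the overhead raised elsewhere in the thread; it is meant to show the shape of the query, not an efficient implementation.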



Re: Associated values for a field and its value

Posted by Aditya <fi...@gmail.com>.
Hi

You need to expand the field as below. Store each field value together with
its associated values as a separate document.

Document  Field-A  Stored-Field
D1        a1       1,2
D2        a2       3,10
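
The expansion in the table can be sketched in plain Java as follows. This only models the flattening step, not the indexing itself: the names ExpandedDocs, Row and expand are hypothetical, and the parentId column is an added assumption so that each flattened row can be traced back to the original document D.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the expansion: one row per (field value, associated values) pair.
final class ExpandedDocs {

    // One flattened row, i.e. the unit that would be indexed as its own document.
    static final class Row {
        final String parentId; // assumed back-reference to the logical document
        final String fieldA;   // searchable value, e.g. "a1"
        final String stored;   // retrieve-only associated values, e.g. "1,2"

        Row(String parentId, String fieldA, String stored) {
            this.parentId = parentId;
            this.fieldA = fieldA;
            this.stored = stored;
        }

        @Override
        public String toString() {
            return parentId + " " + fieldA + " " + stored;
        }
    }

    // Expands one logical document into one row per value of field A.
    static List<Row> expand(String parentId, Map<String, String> associatedByValue) {
        List<Row> rows = new ArrayList<>();
        for (Map.Entry<String, String> e : associatedByValue.entrySet()) {
            rows.add(new Row(parentId, e.getKey(), e.getValue()));
        }
        return rows;
    }

    public static void main(String[] args) {
        Map<String, String> values = new LinkedHashMap<>();
        values.put("a1", "1,2");
        values.put("a2", "3,10");
        for (Row row : expand("D", values)) {
            System.out.println(row);
        }
    }
}
```

With this layout a query on A=a2 matches only the second row, so the stored value read from the hit is already the right one and no per-term parsing is needed.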

An alternative approach is to store these values outside Lucene, for example
in a database or key-value store, and fetch them on demand.
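
A rough sketch of the external-store alternative, using an in-memory map as a stand-in for a real database or key-value store. The class AssociatedValueStore and the (document id, field, value) key layout are assumptions made for illustration; the document id would come from whatever identifier is stored with each Lucene hit.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an external store keyed by (document id, field, value).
final class AssociatedValueStore {

    private final Map<List<String>, List<Integer>> store = new HashMap<>();

    // A list of strings has value-based equals/hashCode, so it can serve as a composite key.
    private static List<String> key(String docId, String field, String value) {
        return Arrays.asList(docId, field, value);
    }

    void put(String docId, String field, String value, Integer... associated) {
        store.put(key(docId, field, value), Arrays.asList(associated));
    }

    // Returns the associated values, or an empty list if nothing was recorded for the key.
    List<Integer> get(String docId, String field, String value) {
        return store.getOrDefault(key(docId, field, value), Collections.<Integer>emptyList());
    }

    public static void main(String[] args) {
        AssociatedValueStore values = new AssociatedValueStore();
        values.put("D", "A", "a1", 1, 2);
        values.put("D", "A", "a2", 3, 10);
        // After a query on A=a2 returns document D, fetch the matching values on demand.
        System.out.println(values.get("D", "A", "a2")); // prints [3, 10]
    }
}
```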

Regards
Aditya
www.findbestopensource.com -- we have collection of more than 1 million
open source projects


