You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@uima.apache.org by Mario Juric <mj...@unsilo.ai> on 2019/09/04 19:45:53 UTC

How can I do efficient FSIndex lookup?

Hi,

I created a custom FSIndex for an annotation type in the hope of speeding up lookup based on one of it’s fields, but after some profiling I found to my surprise that this doesn’t appear to be what I get. I specified the index to be sorted according to two fields where the first is a key and the next is a value field. After creating a filtered iterator with the key field as one of the constraints I thought it would do a quick lookup to the first element in the list that matches the key constraint, after all it’s sorted according to that field, so I assume at least binary search is possible, but to my surprise that is not what happens. It seems to simply iterate through all elements and skips those that don’t match the constraint. There doesn’t seem to be other ways I can do a more efficient jump to the first element in the index and then stop iterating when the key no longer matches.

I am somewhat baffled by this, and it appears to me I could have achieved the same using a normal select with some simple filtering, which kinda makes the FSIndex redundant. There is another way to obtain an iterator, which takes a FeatureStructure, but I am not sure if that is more efficient, and does this mean that you create FeatureStructures for the sole purpose of lookup into the index? I would appreciate if someone could explain this to me, thanks! :)

Cheers,
Mario

Re: How can I do efficient FSIndex lookup?

Posted by Mario Juric <mj...@unsilo.ai>.

Great!

Looking forward to try it out as soon we have the time. We will need a testing period before we put this into production, but the first step is to get it to build and work end to end.

Cheers
Mario

> On 8 Sep 2019, at 19:40 , Richard Eckart de Castilho <re...@apache.org> wrote:
> 
> Hi Mario,
> 
>> On 6. Sep 2019, at 08:50, Mario Juric <mj...@unsilo.ai> wrote:
>> 
>> We are still on 2.x since we are awaiting the next major DKPro release with UIMA 3 because of dependencies.
> 
> DKPro Core 2.0.0 is now available :)
> 
> Cheers,
> 
> -- Richard

Re: How can I do efficient FSIndex lookup?

Posted by Richard Eckart de Castilho <re...@apache.org>.

Hi Mario,

> On 6. Sep 2019, at 08:50, Mario Juric <mj...@unsilo.ai> wrote:
> 
> We are still on 2.x since we are awaiting the next major DKPro release with UIMA 3 because of dependencies.

DKPro Core 2.0.0 is now available :)

Cheers,

-- Richard

Re: How can I do efficient FSIndex lookup?

Posted by Mario Juric <mj...@unsilo.ai>.

Thanks, Marshall.

This is definitely a possibility, although it means a tighter coupling of our upstream components with the downstream parts in the pipeline that produce the different outputs depending on configuration, so we need to think about how to do this properly.

Cheers,
Mario













> On 9 Sep 2019, at 17:51 , Marshall Schor <ms...@schor.com> wrote:
> 
> ok.
> 
> Re: getting the top 5 or 10 items: Here's a technique you may find of use:
> 
> Put the items into a Java PriorityQueue.  Keep a piece of data which is the
> bottom item, and in your insert-into-the-queue code, check if the item
> to-be-inserted is below that, and if so, skip it.
> 
> This gives a very efficient way to get the top 5 or 10 items.
> 
> HTH. -Marshall
> 
> On 9/9/2019 4:09 AM, Mario Juric wrote:
>> Hi,
>> 
>> Once again thanks for the response. It is really appreciated :)
>> 
>> I tried the moveTo(fs) instead of just using an iterator constructed from the FS, and this appeared to give me all items of the specified type when I didn’t set any values on it, which was an accidental experiment, but when I set the key property to what I was searching for then I got zero items back. Not sure what I might be doing wrong here, but I have learned something maybe more importantly to our use case in the mean time: The cost of indexing exceeds by far the benefits of any expected lookup speed in our case.
>> 
>> We are annotating a number of items with a lot of extracted feature information, and the hope was to be able to quickly get top 5 or 10 or whatever of the items with this or that key, which is why it was sorted by key first in natural sort order and then by the value in reverse order, meaning higher value is better, so that we could quickly get to the first item with the right key and then start pulling the top most items until we have those that we need.
>> 
>> So even if I could get this to work optimally it would in our case not be beneficial given the cost of indexing. It seems we really need many of those queries before it pays of, since the amount of feature information is much larger than the items they are associated with, so I reached to the preliminary conclusion to not have features in any index at all and just using plain FS record structures instead. It appears in our case much cheaper to run through all target items, which there are comparatively less of, to find what we need than to index all associated features and find the relevant target items through feature look up.
>> 
>> Cheers,
>> Mario
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>>> On 6 Sep 2019, at 16:50 , Marshall Schor <ms...@schor.com> wrote:
>>> 
>>> Please don't add to the indexes, the FS you're temporarily using as the argument
>>> for the moveTo operation.  (and of course, if you don't add it, you won't need
>>> to remove it...)
>>> 
>>> If you describe your use case in a bit more detail, I can perhaps comment on
>>> this more.
>>> 
>>> -Marshall
>>> 
>>> On 9/6/2019 2:50 AM, Mario Juric wrote:
>>>> Hi,
>>>> 
>>>> Thanks for responding.
>>>> 
>>>> I tried with a temporary FS where the key value was set, but I got every annotation from the index, so that didn’t appear to change anything, and it also broke my unit tests immediately. I also  stepped through the iterator implementation and found construction of the iterator quite a bit complex with an FS, so that went over my head without spending time to get a deeper understanding of the underlying index implementation. Therefore I tried with an indexed FS and this seemed to return the correct items, but it would be awkward having to add some FS to the index in order to retrieve something else and then having to remove the FS from the index again. I am now also in doubt about the insertion costs, but I haven’t measured that yet.
>>>> 
>>>> I am not sure how many use custom FSIndex, but currently the API doesn’t really support very well the type of use cases that we are working with, so this is a disappointment for us. Does UIMA 3 improve on this? We are still on 2.x since we are awaiting the next major DKPro release with UIMA 3 because of dependencies.
>>>> 
>>>> Thanks a lot and cheers,
>>>> Mario
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>>> On 5 Sep 2019, at 23:42 , Richard Eckart de Castilho <re...@apache.org> wrote:
>>>>> 
>>>>> On 5. Sep 2019, at 23:40, Marshall Schor <ms...@schor.com> wrote:
>>>>>> The normal way to get the "binary search" kind of behavior is to get a plain
>>>>>> iterator over the sorted index, and then use the moveTo method, specifying a
>>>>>> target FS as the one to move to.  The target FS can be a "temporary" FS, one
>>>>>> that is never added to the indexes, itself; it is just used to supply values
>>>>>> used in the comparison.
>>>>> Is there a way to do this using a "temporary" FS which does not take up CAS heap
>>>>> space in UIMAv2?
>>>>> 
>>>>> -- Richard
>>

Re: How can I do efficient FSIndex lookup?

Posted by Marshall Schor <ms...@schor.com>.

ok.

Re: getting the top 5 or 10 items: Here's a technique you may find of use:

Put the items into a Java PriorityQueue.  Keep a piece of data which is the
bottom item, and in your insert-into-the-queue code, check if the item
to-be-inserted is below that, and if so, skip it.

This gives a very efficient way to get the top 5 or 10 items.

HTH. -Marshall

On 9/9/2019 4:09 AM, Mario Juric wrote:
> Hi,
>
> Once again thanks for the response. It is really appreciated :)
>
> I tried the moveTo(fs) instead of just using an iterator constructed from the FS, and this appeared to give me all items of the specified type when I didn’t set any values on it, which was an accidental experiment, but when I set the key property to what I was searching for then I got zero items back. Not sure what I might be doing wrong here, but I have learned something maybe more importantly to our use case in the mean time: The cost of indexing exceeds by far the benefits of any expected lookup speed in our case.
>
> We are annotating a number of items with a lot of extracted feature information, and the hope was to be able to quickly get top 5 or 10 or whatever of the items with this or that key, which is why it was sorted by key first in natural sort order and then by the value in reverse order, meaning higher value is better, so that we could quickly get to the first item with the right key and then start pulling the top most items until we have those that we need.
>
> So even if I could get this to work optimally it would in our case not be beneficial given the cost of indexing. It seems we really need many of those queries before it pays of, since the amount of feature information is much larger than the items they are associated with, so I reached to the preliminary conclusion to not have features in any index at all and just using plain FS record structures instead. It appears in our case much cheaper to run through all target items, which there are comparatively less of, to find what we need than to index all associated features and find the relevant target items through feature look up.
>
> Cheers,
> Mario
>
>
>
>
>
>
>
>
>
>
>
>
>> On 6 Sep 2019, at 16:50 , Marshall Schor <ms...@schor.com> wrote:
>>
>> Please don't add to the indexes, the FS you're temporarily using as the argument
>> for the moveTo operation.  (and of course, if you don't add it, you won't need
>> to remove it...)
>>
>> If you describe your use case in a bit more detail, I can perhaps comment on
>> this more.
>>
>> -Marshall
>>
>> On 9/6/2019 2:50 AM, Mario Juric wrote:
>>> Hi,
>>>
>>> Thanks for responding.
>>>
>>> I tried with a temporary FS where the key value was set, but I got every annotation from the index, so that didn’t appear to change anything, and it also broke my unit tests immediately. I also  stepped through the iterator implementation and found construction of the iterator quite a bit complex with an FS, so that went over my head without spending time to get a deeper understanding of the underlying index implementation. Therefore I tried with an indexed FS and this seemed to return the correct items, but it would be awkward having to add some FS to the index in order to retrieve something else and then having to remove the FS from the index again. I am now also in doubt about the insertion costs, but I haven’t measured that yet.
>>>
>>> I am not sure how many use custom FSIndex, but currently the API doesn’t really support very well the type of use cases that we are working with, so this is a disappointment for us. Does UIMA 3 improve on this? We are still on 2.x since we are awaiting the next major DKPro release with UIMA 3 because of dependencies.
>>>
>>> Thanks a lot and cheers,
>>> Mario
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>> On 5 Sep 2019, at 23:42 , Richard Eckart de Castilho <re...@apache.org> wrote:
>>>>
>>>> On 5. Sep 2019, at 23:40, Marshall Schor <ms...@schor.com> wrote:
>>>>> The normal way to get the "binary search" kind of behavior is to get a plain
>>>>> iterator over the sorted index, and then use the moveTo method, specifying a
>>>>> target FS as the one to move to.  The target FS can be a "temporary" FS, one
>>>>> that is never added to the indexes, itself; it is just used to supply values
>>>>> used in the comparison.
>>>> Is there a way to do this using a "temporary" FS which does not take up CAS heap
>>>> space in UIMAv2?
>>>>
>>>> -- Richard
>

Re: How can I do efficient FSIndex lookup?

Posted by Mario Juric <mj...@unsilo.ai>.

Hi,

Once again thanks for the response. It is really appreciated :)

I tried the moveTo(fs) instead of just using an iterator constructed from the FS, and this appeared to give me all items of the specified type when I didn’t set any values on it, which was an accidental experiment, but when I set the key property to what I was searching for then I got zero items back. Not sure what I might be doing wrong here, but I have learned something maybe more importantly to our use case in the mean time: The cost of indexing exceeds by far the benefits of any expected lookup speed in our case.

We are annotating a number of items with a lot of extracted feature information, and the hope was to be able to quickly get top 5 or 10 or whatever of the items with this or that key, which is why it was sorted by key first in natural sort order and then by the value in reverse order, meaning higher value is better, so that we could quickly get to the first item with the right key and then start pulling the top most items until we have those that we need.

So even if I could get this to work optimally it would in our case not be beneficial given the cost of indexing. It seems we really need many of those queries before it pays of, since the amount of feature information is much larger than the items they are associated with, so I reached to the preliminary conclusion to not have features in any index at all and just using plain FS record structures instead. It appears in our case much cheaper to run through all target items, which there are comparatively less of, to find what we need than to index all associated features and find the relevant target items through feature look up.

Cheers,
Mario

> On 6 Sep 2019, at 16:50 , Marshall Schor <ms...@schor.com> wrote:
> 
> Please don't add to the indexes, the FS you're temporarily using as the argument
> for the moveTo operation.  (and of course, if you don't add it, you won't need
> to remove it...)
> 
> If you describe your use case in a bit more detail, I can perhaps comment on
> this more.
> 
> -Marshall
> 
> On 9/6/2019 2:50 AM, Mario Juric wrote:
>> Hi,
>> 
>> Thanks for responding.
>> 
>> I tried with a temporary FS where the key value was set, but I got every annotation from the index, so that didn’t appear to change anything, and it also broke my unit tests immediately. I also  stepped through the iterator implementation and found construction of the iterator quite a bit complex with an FS, so that went over my head without spending time to get a deeper understanding of the underlying index implementation. Therefore I tried with an indexed FS and this seemed to return the correct items, but it would be awkward having to add some FS to the index in order to retrieve something else and then having to remove the FS from the index again. I am now also in doubt about the insertion costs, but I haven’t measured that yet.
>> 
>> I am not sure how many use custom FSIndex, but currently the API doesn’t really support very well the type of use cases that we are working with, so this is a disappointment for us. Does UIMA 3 improve on this? We are still on 2.x since we are awaiting the next major DKPro release with UIMA 3 because of dependencies.
>> 
>> Thanks a lot and cheers,
>> Mario
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>>> On 5 Sep 2019, at 23:42 , Richard Eckart de Castilho <re...@apache.org> wrote:
>>> 
>>> On 5. Sep 2019, at 23:40, Marshall Schor <ms...@schor.com> wrote:
>>>> The normal way to get the "binary search" kind of behavior is to get a plain
>>>> iterator over the sorted index, and then use the moveTo method, specifying a
>>>> target FS as the one to move to.  The target FS can be a "temporary" FS, one
>>>> that is never added to the indexes, itself; it is just used to supply values
>>>> used in the comparison.
>>> Is there a way to do this using a "temporary" FS which does not take up CAS heap
>>> space in UIMAv2?
>>> 
>>> -- Richard
>>

Re: How can I do efficient FSIndex lookup?

Posted by Marshall Schor <ms...@schor.com>.

Please don't add to the indexes, the FS you're temporarily using as the argument
for the moveTo operation.  (and of course, if you don't add it, you won't need
to remove it...)

If you describe your use case in a bit more detail, I can perhaps comment on
this more.

-Marshall

On 9/6/2019 2:50 AM, Mario Juric wrote:
> Hi,
>
> Thanks for responding.
>
> I tried with a temporary FS where the key value was set, but I got every annotation from the index, so that didn’t appear to change anything, and it also broke my unit tests immediately. I also  stepped through the iterator implementation and found construction of the iterator quite a bit complex with an FS, so that went over my head without spending time to get a deeper understanding of the underlying index implementation. Therefore I tried with an indexed FS and this seemed to return the correct items, but it would be awkward having to add some FS to the index in order to retrieve something else and then having to remove the FS from the index again. I am now also in doubt about the insertion costs, but I haven’t measured that yet.
>
> I am not sure how many use custom FSIndex, but currently the API doesn’t really support very well the type of use cases that we are working with, so this is a disappointment for us. Does UIMA 3 improve on this? We are still on 2.x since we are awaiting the next major DKPro release with UIMA 3 because of dependencies.
>
> Thanks a lot and cheers,
> Mario
>
>
>
>
>
>
>
>
>
>
>
>
>> On 5 Sep 2019, at 23:42 , Richard Eckart de Castilho <re...@apache.org> wrote:
>>
>> On 5. Sep 2019, at 23:40, Marshall Schor <ms...@schor.com> wrote:
>>> The normal way to get the "binary search" kind of behavior is to get a plain
>>> iterator over the sorted index, and then use the moveTo method, specifying a
>>> target FS as the one to move to.  The target FS can be a "temporary" FS, one
>>> that is never added to the indexes, itself; it is just used to supply values
>>> used in the comparison.
>> Is there a way to do this using a "temporary" FS which does not take up CAS heap
>> space in UIMAv2?
>>
>> -- Richard
>

Re: How can I do efficient FSIndex lookup?

Posted by Mario Juric <mj...@unsilo.ai>.

Hi,

Thanks for responding.

I tried with a temporary FS where the key value was set, but I got every annotation from the index, so that didn’t appear to change anything, and it also broke my unit tests immediately. I also  stepped through the iterator implementation and found construction of the iterator quite a bit complex with an FS, so that went over my head without spending time to get a deeper understanding of the underlying index implementation. Therefore I tried with an indexed FS and this seemed to return the correct items, but it would be awkward having to add some FS to the index in order to retrieve something else and then having to remove the FS from the index again. I am now also in doubt about the insertion costs, but I haven’t measured that yet.

I am not sure how many use custom FSIndex, but currently the API doesn’t really support very well the type of use cases that we are working with, so this is a disappointment for us. Does UIMA 3 improve on this? We are still on 2.x since we are awaiting the next major DKPro release with UIMA 3 because of dependencies.

Thanks a lot and cheers,
Mario

> On 5 Sep 2019, at 23:42 , Richard Eckart de Castilho <re...@apache.org> wrote:
> 
> On 5. Sep 2019, at 23:40, Marshall Schor <ms...@schor.com> wrote:
>> 
>> The normal way to get the "binary search" kind of behavior is to get a plain
>> iterator over the sorted index, and then use the moveTo method, specifying a
>> target FS as the one to move to.  The target FS can be a "temporary" FS, one
>> that is never added to the indexes, itself; it is just used to supply values
>> used in the comparison.
> 
> Is there a way to do this using a "temporary" FS which does not take up CAS heap
> space in UIMAv2?
> 
> -- Richard

Re: How can I do efficient FSIndex lookup?

Posted by Marshall Schor <ms...@schor.com>.

Sorry, no. But you can make one "key" per CAS instance, and reuse it (you'll
need to keep some kind of a reference to it).

-Marshall

On 9/5/2019 5:42 PM, Richard Eckart de Castilho wrote:
> On 5. Sep 2019, at 23:40, Marshall Schor <ms...@schor.com> wrote:
>> The normal way to get the "binary search" kind of behavior is to get a plain
>> iterator over the sorted index, and then use the moveTo method, specifying a
>> target FS as the one to move to.  The target FS can be a "temporary" FS, one
>> that is never added to the indexes, itself; it is just used to supply values
>> used in the comparison.
> Is there a way to do this using a "temporary" FS which does not take up CAS heap
> space in UIMAv2?
>
> -- Richard

Re: How can I do efficient FSIndex lookup?

Posted by Richard Eckart de Castilho <re...@apache.org>.

On 5. Sep 2019, at 23:40, Marshall Schor <ms...@schor.com> wrote:
> 
> The normal way to get the "binary search" kind of behavior is to get a plain
> iterator over the sorted index, and then use the moveTo method, specifying a
> target FS as the one to move to.  The target FS can be a "temporary" FS, one
> that is never added to the indexes, itself; it is just used to supply values
> used in the comparison.

Is there a way to do this using a "temporary" FS which does not take up CAS heap
space in UIMAv2?

-- Richard

Re: How can I do efficient FSIndex lookup?

Posted by Marshall Schor <ms...@schor.com>.

Perhaps the use of a filtered iterator went in the wrong direction.

The normal way to get the "binary search" kind of behavior is to get a plain
iterator over the sorted index, and then use the moveTo method, specifying a
target FS as the one to move to.  The target FS can be a "temporary" FS, one
that is never added to the indexes, itself; it is just used to supply values
used in the comparison.

With this, you can "jump to" the nearest element (see the javadocs for the exact
definition of this).

Does this help?

When using uima version 3, the moveto method can be made to ignore type
priorities in the ordering, which is what is wanted in many use cases.  See
http://uima.apache.org/d/uimaj-current/version_3_users_guide.html#uv3.select

-Marshall

On 9/4/2019 3:45 PM, Mario Juric wrote:
> Hi,
>
> I created a custom FSIndex for an annotation type in the hope of speeding up lookup based on one of it’s fields, but after some profiling I found to my surprise that this doesn’t appear to be what I get. I specified the index to be sorted according to two fields where the first is a key and the next is a value field. After creating a filtered iterator with the key field as one of the constraints I thought it would do a quick lookup to the first element in the list that matches the key constraint, after all it’s sorted according to that field, so I assume at least binary search is possible, but to my surprise that is not what happens. It seems to simply iterate through all elements and skips those that don’t match the constraint. There doesn’t seem to be other ways I can do a more efficient jump to the first element in the index and then stop iterating when the key no longer matches.
>
> I am somewhat baffled by this, and it appears to me I could have achieved the same using a normal select with some simple filtering, which kinda makes the FSIndex redundant. There is another way to obtain an iterator, which takes a FeatureStructure, but I am not sure if that is more efficient, and does this mean that you create FeatureStructures for the sole purpose of lookup into the index? I would appreciate if someone could explain this to me, thanks! :)
>
> Cheers,
> Mario
>
>
>
>
>
>
>
>
>
>
>
>
>
>