Posted to user@uima.apache.org by Peter Klügl <pe...@averbis.com> on 2017/07/28 08:15:20 UTC

Re: New dictionary annotator

Hi,


I included our own simple trie-based dictionary annotator in the
benchmark. It is roughly as fast as your annotator.
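
For readers unfamiliar with the approach, here is a minimal sketch of
what trie-based dictionary matching over tokens can look like. It is
only an illustration in Python: the names TrieNode, build_trie and
find_matches are made up for this example, whitespace tokenization is
assumed, and it is not the code of either annotator in the benchmark.

```python
from typing import Dict, List, Optional, Tuple


class TrieNode:
    """One node of a token-level trie (hypothetical helper for this sketch)."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        # Set to the full dictionary entry when an entry ends at this node.
        self.entry: Optional[str] = None


def build_trie(entries: List[str]) -> TrieNode:
    """Insert each entry token by token (assumption: whitespace tokenization)."""
    root = TrieNode()
    for entry in entries:
        node = root
        for token in entry.split():
            node = node.children.setdefault(token, TrieNode())
        node.entry = entry
    return root


def find_matches(tokens: List[str], root: TrieNode) -> List[Tuple[int, int, str]]:
    """Return (begin, end, entry) token spans, longest match per start position."""
    matches: List[Tuple[int, int, str]] = []
    for start in range(len(tokens)):
        node: Optional[TrieNode] = root
        longest: Optional[Tuple[int, int, str]] = None
        for end in range(start, len(tokens)):
            node = node.children.get(tokens[end])
            if node is None:
                break  # no dictionary entry continues with this token
            if node.entry is not None:
                longest = (start, end + 1, node.entry)
        if longest is not None:
            matches.append(longest)
    return matches


if __name__ == "__main__":
    trie = build_trie(["New York", "New York City", "Boston"])
    print(find_matches("I flew from Boston to New York City".split(), trie))
```

With multi-token entries such as the ones above, the lookup has to walk
several tokens per start position, which is exactly the kind of work a
dictionary without multi-token entries never triggers.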


However, the measured time includes the OpenNLP tokenizer, and the
dictionary is quite minimal (four entries, no multi-token entries).
Thus, for this benchmark, it hardly matters whether there is a
dictionary at all, and the results provide hardly any evidence, at
least for my use cases.


Best,


Peter


On 14.03.2017 at 08:51, Peter Klügl wrote:
> Hi,
>
>
> It's now March and I have not yet found the time to compare the different
> annotators in your benchmark.
>
>
> I just wanted to mention that I have not forgotten about this and that it
> is still on my todo list. However, it could easily be April before I
> find the time.
>
>
> Best,
>
>
> Peter
>
>
> On 08.12.2016 at 10:43, Donatas Remeika wrote:
>> Hi,
>>
>> Peter, I ran a benchmark on the 20 Newsgroups texts. The results can be
>> found here: https://github.com/tokenmill/dictionary-annotator
>> I didn't measure memory usage; I just compared how fast the different
>> annotators do the job.
>>
>> Best regards,
>> Donatas
>>
>> On Mon, Dec 5, 2016 at 2:35 PM Peter Klügl <pe...@averbis.com> wrote:
>>
>>> Hi,
>>>
>>>
>>> For the UIMA Ruta paper, I used the Enron email dataset [1], but it is
>>> probably not optimal here.
>>>
>>>
>>> I think we can find a standard scenario (data + terminology), maybe
>>> something like GENIA with MeSH or Wikipedia with GeoNames. Just a quick
>>> guess. I can help set something up, but probably not before February.
>>>
>>>
>>> Best,
>>>
>>>
>>> Peter
>>>
>>>
>>> [1] https://www.cs.cmu.edu/~enron/
>>>
>>> On 05.12.2016 at 12:56, Donatas Remeika wrote:
>>>> Hi,
>>>>
>>>> Thanks for the feedback.
>>>> Yes, it would be interesting to see benchmark results. Do you know where
>>>> I could find examples and data for doing benchmarks in UIMA?
>>>>
>>>> Best regards,
>>>> Donatas
>>>>
>>>>
>>>> On Mon, Dec 5, 2016 at 10:52 AM Peter Klügl <pe...@averbis.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>>
>>>>> a very nice annotator, thank you.
>>>>>
>>>>>
>>>>> Do you have figures on how the annotator compares to the others with
>>>>> respect to speed and memory usage?
>>>>>
>>>>> Storing the complete tokens may pose challenges in parallelized
>>>>> scenarios if the dictionary is not shared between annotators.
>>>>>
>>>>> Would you be interested in setting up a benchmark?
>>>>>
>>>>>
>>>>> Because of the limitations of the dictionaries in Ruta, I also created a
>>>>> new simple dictionary annotator, but for now it lives in our own components
>>>>> repository. Maybe I'll contribute it to Ruta sometime, since it provides
>>>>> exactly the functionality the Ruta dictionaries lack.
>>>>>
>>>>>
>>>>> Best,
>>>>>
>>>>>
>>>>> Peter
>>>>>
>>>>>
>>>>> On 30.11.2016 at 15:38, Donatas Remeika wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Just wanted to let you know that we created a new (yet another)
>>>>>> dictionary annotator.
>>>>>>
>>>>>> The reasons for creating it were:
>>>>>>  - Quite often we used Ruta in our pipelines only because of its MARKTABLE
>>>>>>    action, which can set several features on an annotation
>>>>>>  - Sometimes dictionaries contain duplicate entries with different features,
>>>>>>    and we need to create an annotation for each entry
>>>>>>  - The possibility to use a custom tokenizer for dictionary entries (the
>>>>>>    default is a whitespace tokenizer)
>>>>>>
>>>>>> It was inspired by both the DKPro dictionary-annotator and Ruta MARKTABLE.
>>>>>> Big thanks to their developers!
>>>>>>
>>>>>> Code with examples can be found at
>>>>>> https://github.com/tokenmill/dictionary-annotator
>>>>>>
>>>>>> BTW, does anyone know of a Concept Mapper alternative that is more
>>>>>> uimaFIT friendly?
>>>>>>
>>>>>> Best regards,
>>>>>> Donatas
>>>>>>