Posted to user@uima.apache.org by Peter Klügl <pe...@averbis.com> on 2017/03/14 07:51:49 UTC

Re: New dictionary annotator

Hi,


it's now March and I have not yet found the time to compare the different
annotators in your benchmark.


I just wanted to mention that I have not forgotten about this and that it
is still on my todo list. However, it could easily be April before I
find the time.


Best,


Peter


On 08.12.2016 at 10:43, Donatas Remeika wrote:
> Hi,
>
> Peter, I did some benchmark on 20 newsgroups texts. The results can be
> found here: https://github.com/tokenmill/dictionary-annotator
> I didn't measure memory usage, just compared how fast different annotators
> do the job.
>
> Best regards,
> Donatas
>
> On Mon, Dec 5, 2016 at 2:35 PM Peter Klügl <pe...@averbis.com> wrote:
>
>> Hi,
>>
>>
>> for the UIMA Ruta paper, I used the enron email dataset [1], but it is
>> probably not optimal here.
>>
>>
>> I think we can find a standard scenario (data+terminology), maybe
>> something like Genia with MeSH or wikipedia with geonames. Just a quick
>> guess. I can help setting something up, but probably not before February.
>>
>>
>> Best,
>>
>>
>> Peter
>>
>>
>> [1] https://www.cs.cmu.edu/~enron/
>>
>> On 05.12.2016 at 12:56, Donatas Remeika wrote:
>>> Hi,
>>>
>>> Thanks for feedback.
>>> Yes, it would be interesting to see benchmark results. Maybe you know
>> where
>>> I could find examples and data for doing benchmarks in UIMA?
>>>
>>> Best regards,
>>> Donatas
>>>
>>>
>>> On Mon, Dec 5, 2016 at 10:52 AM Peter Klügl <pe...@averbis.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>>
>>>> a very nice annotator, thank you.
>>>>
>>>>
>>>> Do you have figures how the annotator compares to the others with
>>>> respect to speed and memory usage?
>>>>
>>>> Storing the complete tokens will maybe provide challenges in scenarios
>>>> with parallelization if the dictionary is not shared between annotators.
>>>>
>>>> Would you be interested to set up a benchmark?
>>>>
>>>>
>>>> Because of the limitations of the dictionaries in ruta, I also created a
>>>> new simple dictionary annotator, but it lives now in our own components
>>>> repository. Maybe I'll contribute it sometimes to ruta since it provides
>>>> exactly the functionality the ruta dictionaries miss.
>>>>
>>>>
>>>> Best,
>>>>
>>>>
>>>> Peter
>>>>
>>>>
>>>> On 30.11.2016 at 15:38, Donatas Remeika wrote:
>>>>> Hi,
>>>>>
>>>>> Just wanted to let you know that we created a new (probably one more)
>>>>> dictionary annotator.
>>>>>
>>>>> Reasons for creating it was:
>>>>>  - Quite often we used Ruta in our pipelines only because of its
>>>> MARKTABLE
>>>>> action which is able to set several features on annotation
>>>>>  - Sometimes dictionaries contain duplicate entries with different
>>>> features
>>>>> and we need to create annotations for each entry
>>>>>  - Possibility to use custom dictionary entries tokenizer (default is
>>>>> whitespace tokenizer)
>>>>>
>>>>> It was inspired by both DKPro dictionary-annotator and Ruta MARKTABLE.
>>>> Big
>>>>> thanks to their developers!
>>>>>
>>>>> Code with examples can be found
>>>>> https://github.com/tokenmill/dictionary-annotator
>>>>>
>>>>> BTW, maybe someone knows Concept Mapper alternative, which is more
>>>> uimaFIT
>>>>> friendly?
>>>>>
>>>>> Best regards,
>>>>> Donatas
>>>>>
>>


Re: New dictionary annotator

Posted by Donatas Remeika <do...@gmail.com>.
Hi Daniel,

The dictionary annotator is definitely faster than Concept Mapper, but has much
less functionality. It supports only a first-match strategy.
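
To make the difference concrete, here is a rough sketch in plain Java (toy data, not the annotator's actual code) contrasting one plausible reading of a first-match strategy, where the longest entry starting at a position wins and matching continues after it, with the match-all behaviour Daniel asks about below:

import java.util.*;

public class MatchStrategies {

    // Toy dictionary of multi-token entries.
    static final Set<List<String>> DICT = Set.of(
            List.of("a", "b", "c"), List.of("a", "b"), List.of("b", "c"));

    // First-match: take the longest entry starting at each position,
    // then continue after it, so overlapping entries are dropped.
    static List<List<String>> firstMatch(List<String> tokens) {
        List<List<String>> hits = new ArrayList<>();
        for (int i = 0; i < tokens.size(); ) {
            List<String> best = null;
            for (int j = tokens.size(); j > i; j--) {
                if (DICT.contains(tokens.subList(i, j))) { best = tokens.subList(i, j); break; }
            }
            if (best != null) { hits.add(best); i += best.size(); } else { i++; }
        }
        return hits;
    }

    // Match-all: report every dictionary entry at every start position,
    // overlaps included (the behaviour Daniel asks for).
    static List<List<String>> matchAll(List<String> tokens) {
        List<List<String>> hits = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            for (int j = i + 1; j <= tokens.size(); j++) {
                if (DICT.contains(tokens.subList(i, j))) hits.add(tokens.subList(i, j));
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> input = List.of("a", "b", "c");
        System.out.println(firstMatch(input)); // [[a, b, c]]
        System.out.println(matchAll(input));   // [[a, b], [a, b, c], [b, c]]
    }
}

With match-all, the overlapping entries "a b" and "b c" are both reported; with first-match, only "a b c" survives.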

Regards,
Donatas

On Wed, May 10, 2017 at 12:19 AM Daniel Heinze <dh...@gnoetics.com> wrote:

> Hi... I just pulled and compiled the dictionaryannotator and am looking
> through the code.  I'm looking for something that is faster than UIMA
> Concept-Mapper.  I don't need all the functionality of Concept-Mapper, but
> do need the following:
> * match all, e.g. if dict entries are "a b c", "a b" and "b c" and input
> is "a b c" , I need to match "a b c", "a b"  and "b c"
> * skip tokens, e.g. if dict entry is  "a c d", it should match on input "a
> b c d"
> Can someone familiar with the new dictionary annotator save me some time
> and say if it supports these matching strategies?
> Also, any sense of how the system scales?
> Thanks / Dan
>
> -----Original Message-----
> From: Peter Klügl [mailto:peter.kluegl@averbis.com]
> Sent: Tuesday, March 14, 2017 12:52 AM
> To: user@uima.apache.org
> Subject: Re: New dictionary annotator
>
> Hi,
>
>
> it's now March and I did not yet find the time to compare the different
> annotators in your benchmark.
>
>
> I just wanted to mention that I did not forget about this and that this is
> still on my todo list. However, it could easily be April before I find the
> time.
>
>
> Best,
>
>
> Peter
>
>
> On 08.12.2016 at 10:43, Donatas Remeika wrote:
> > Hi,
> >
> > Peter, I did some benchmark on 20 newsgroups texts. The results can be
> > found here: https://github.com/tokenmill/dictionary-annotator
> > I didn't measure memory usage, just compared how fast different
> > annotators do the job.
> >
> > Best regards,
> > Donatas
> >
> > On Mon, Dec 5, 2016 at 2:35 PM Peter Klügl <pe...@averbis.com>
> wrote:
> >
> >> Hi,
> >>
> >>
> >> for the UIMA Ruta paper, I used the enron email dataset [1], but it
> >> is probably not optimal here.
> >>
> >>
> >> I think we can find a standard scenario (data+terminology), maybe
> >> something like Genia with MeSH or wikipedia with geonames. Just a
> >> quick guess. I can help setting something up, but probably not before
> February.
> >>
> >>
> >> Best,
> >>
> >>
> >> Peter
> >>
> >>
> >> [1] https://www.cs.cmu.edu/~enron/
> >>
> >> On 05.12.2016 at 12:56, Donatas Remeika wrote:
> >>> Hi,
> >>>
> >>> Thanks for feedback.
> >>> Yes, it would be interesting to see benchmark results. Maybe you
> >>> know
> >> where
> >>> I could find examples and data for doing benchmarks in UIMA?
> >>>
> >>> Best regards,
> >>> Donatas
> >>>
> >>>
> >>> On Mon, Dec 5, 2016 at 10:52 AM Peter Klügl
> >>> <pe...@averbis.com>
> >>> wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>>
> >>>> a very nice annotator, thank you.
> >>>>
> >>>>
> >>>> Do you have figures how the annotator compares to the others with
> >>>> respect to speed and memory usage?
> >>>>
> >>>> Storing the complete tokens will maybe provide challenges in
> >>>> scenarios with parallelization if the dictionary is not shared
> between annotators.
> >>>>
> >>>> Would you be interested to set up a benchmark?
> >>>>
> >>>>
> >>>> Because of the limitations of the dictionaries in ruta, I also
> >>>> created a new simple dictionary annotator, but it lives now in our
> >>>> own components repository. Maybe I'll contribute it sometimes to
> >>>> ruta since it provides exactly the functionality the ruta
> dictionaries miss.
> >>>>
> >>>>
> >>>> Best,
> >>>>
> >>>>
> >>>> Peter
> >>>>
> >>>>
> >>>> On 30.11.2016 at 15:38, Donatas Remeika wrote:
> >>>>> Hi,
> >>>>>
> >>>>> Just wanted to let you know that we created a new (probably one
> >>>>> more) dictionary annotator.
> >>>>>
> >>>>> Reasons for creating it was:
> >>>>>  - Quite often we used Ruta in our pipelines only because of its
> >>>> MARKTABLE
> >>>>> action which is able to set several features on annotation
> >>>>>  - Sometimes dictionaries contain duplicate entries with different
> >>>> features
> >>>>> and we need to create annotations for each entry
> >>>>>  - Possibility to use custom dictionary entries tokenizer (default
> >>>>> is whitespace tokenizer)
> >>>>>
> >>>>> It was inspired by both DKPro dictionary-annotator and Ruta
> MARKTABLE.
> >>>> Big
> >>>>> thanks to their developers!
> >>>>>
> >>>>> Code with examples can be found
> >>>>> https://github.com/tokenmill/dictionary-annotator
> >>>>>
> >>>>> BTW, maybe someone knows Concept Mapper alternative, which is more
> >>>> uimaFIT
> >>>>> friendly?
> >>>>>
> >>>>> Best regards,
> >>>>> Donatas
> >>>>>
> >>
>
>

RE: New dictionary annotator

Posted by Daniel Heinze <dh...@gnoetics.com>.
Hi... I just pulled and compiled the dictionary annotator and am looking through the code. I'm looking for something that is faster than UIMA Concept Mapper. I don't need all of Concept Mapper's functionality, but I do need the following:
* match all, e.g. if the dict entries are "a b c", "a b" and "b c" and the input is "a b c", I need to match "a b c", "a b" and "b c"
* skip tokens, e.g. if the dict entry is "a c d", it should match on the input "a b c d"
Can someone familiar with the new dictionary annotator save me some time and say if it supports these matching strategies?
Also, any sense of how the system scales? 
Thanks / Dan
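
As a concrete reading of the skip-token requirement above, here is a rough sketch in plain Java (toy names, not taken from any of the annotators discussed) that matches a dictionary entry in order while allowing a bounded number of skipped input tokens between consecutive entry tokens:

import java.util.List;

public class SkipTokenMatch {

    // True if 'entry' occurs in 'tokens' in order, starting at 'start', with at
    // most 'maxSkip' non-entry tokens skipped between consecutive entry tokens.
    static boolean matchesAt(List<String> tokens, int start, List<String> entry, int maxSkip) {
        if (!tokens.get(start).equals(entry.get(0))) return false;
        int pos = start;
        for (int e = 1; e < entry.size(); e++) {
            int next = -1;
            for (int step = 1; step <= maxSkip + 1 && pos + step < tokens.size(); step++) {
                if (tokens.get(pos + step).equals(entry.get(e))) { next = pos + step; break; }
            }
            if (next < 0) return false;
            pos = next;
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> input = List.of("a", "b", "c", "d");
        List<String> entry = List.of("a", "c", "d");
        System.out.println(matchesAt(input, 0, entry, 1)); // true: "b" is skipped
        System.out.println(matchesAt(input, 0, entry, 0)); // false: no skips allowed
    }
}

Whether the new dictionary annotator supports gaps like this is exactly the open question; the sketch only pins down the intended semantics.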
 
-----Original Message-----
From: Peter Klügl [mailto:peter.kluegl@averbis.com] 
Sent: Tuesday, March 14, 2017 12:52 AM
To: user@uima.apache.org
Subject: Re: New dictionary annotator

Hi,


it's now March and I did not yet find the time to compare the different annotators in your benchmark.


I just wanted to mention that I did not forget about this and that this is still on my todo list. However, it could easily be April before I find the time.


Best,


Peter


On 08.12.2016 at 10:43, Donatas Remeika wrote:
> Hi,
>
> Peter, I did some benchmark on 20 newsgroups texts. The results can be 
> found here: https://github.com/tokenmill/dictionary-annotator
> I didn't measure memory usage, just compared how fast different 
> annotators do the job.
>
> Best regards,
> Donatas
>
> On Mon, Dec 5, 2016 at 2:35 PM Peter Klügl <pe...@averbis.com> wrote:
>
>> Hi,
>>
>>
>> for the UIMA Ruta paper, I used the enron email dataset [1], but it 
>> is probably not optimal here.
>>
>>
>> I think we can find a standard scenario (data+terminology), maybe 
>> something like Genia with MeSH or wikipedia with geonames. Just a 
>> quick guess. I can help setting something up, but probably not before February.
>>
>>
>> Best,
>>
>>
>> Peter
>>
>>
>> [1] https://www.cs.cmu.edu/~enron/
>>
>> On 05.12.2016 at 12:56, Donatas Remeika wrote:
>>> Hi,
>>>
>>> Thanks for feedback.
>>> Yes, it would be interesting to see benchmark results. Maybe you 
>>> know
>> where
>>> I could find examples and data for doing benchmarks in UIMA?
>>>
>>> Best regards,
>>> Donatas
>>>
>>>
>>> On Mon, Dec 5, 2016 at 10:52 AM Peter Klügl 
>>> <pe...@averbis.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>>
>>>> a very nice annotator, thank you.
>>>>
>>>>
>>>> Do you have figures how the annotator compares to the others with 
>>>> respect to speed and memory usage?
>>>>
>>>> Storing the complete tokens will maybe provide challenges in 
>>>> scenarios with parallelization if the dictionary is not shared between annotators.
>>>>
>>>> Would you be interested to set up a benchmark?
>>>>
>>>>
>>>> Because of the limitations of the dictionaries in ruta, I also 
>>>> created a new simple dictionary annotator, but it lives now in our 
>>>> own components repository. Maybe I'll contribute it sometimes to 
>>>> ruta since it provides exactly the functionality the ruta dictionaries miss.
>>>>
>>>>
>>>> Best,
>>>>
>>>>
>>>> Peter
>>>>
>>>>
>>>> On 30.11.2016 at 15:38, Donatas Remeika wrote:
>>>>> Hi,
>>>>>
>>>>> Just wanted to let you know that we created a new (probably one 
>>>>> more) dictionary annotator.
>>>>>
>>>>> Reasons for creating it was:
>>>>>  - Quite often we used Ruta in our pipelines only because of its
>>>> MARKTABLE
>>>>> action which is able to set several features on annotation
>>>>>  - Sometimes dictionaries contain duplicate entries with different
>>>> features
>>>>> and we need to create annotations for each entry
>>>>>  - Possibility to use custom dictionary entries tokenizer (default 
>>>>> is whitespace tokenizer)
>>>>>
>>>>> It was inspired by both DKPro dictionary-annotator and Ruta MARKTABLE.
>>>> Big
>>>>> thanks to their developers!
>>>>>
>>>>> Code with examples can be found
>>>>> https://github.com/tokenmill/dictionary-annotator
>>>>>
>>>>> BTW, maybe someone knows Concept Mapper alternative, which is more
>>>> uimaFIT
>>>>> friendly?
>>>>>
>>>>> Best regards,
>>>>> Donatas
>>>>>
>>


Re: New dictionary annotator

Posted by Peter Klügl <pe...@averbis.com>.
Hi,


I included our own simple, trie-based dictionary annotator in the
benchmark. It's roughly as fast as your annotator.


However, the benchmark includes the time of the OpenNLP tokenizer, and the
dictionary and its entries are quite minimal (four entries, no multi-token
entries). Thus, for the benchmark, it hardly matters whether there is a
dictionary at all, and the results provide little evidence, at least for my
use cases.
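
For readers unfamiliar with the approach, here is a minimal, generic sketch of a token-level trie (not the actual Averbis implementation): entries are token sequences stored in nested maps, and matching walks the trie from a start offset, so lookup cost grows with the match length rather than with the dictionary size.

import java.util.*;

public class TokenTrie {

    private final Map<String, TokenTrie> children = new HashMap<>();
    private String entryType; // non-null if a dictionary entry ends at this node

    void add(List<String> entryTokens, String type) {
        TokenTrie node = this;
        for (String token : entryTokens) {
            node = node.children.computeIfAbsent(token, k -> new TokenTrie());
        }
        node.entryType = type;
    }

    // Length (in tokens) of the longest entry starting at 'start', or -1 if none.
    int longestMatch(List<String> tokens, int start) {
        TokenTrie node = this;
        int best = -1;
        for (int i = start; i < tokens.size(); i++) {
            node = node.children.get(tokens.get(i));
            if (node == null) break;
            if (node.entryType != null) best = i - start + 1;
        }
        return best;
    }

    public static void main(String[] args) {
        TokenTrie trie = new TokenTrie();
        trie.add(List.of("New", "York"), "City");
        trie.add(List.of("New", "York", "Times"), "Organization");
        System.out.println(trie.longestMatch(List.of("New", "York", "Times", "reported"), 0)); // 3
    }
}

With only four entries and no multi-token entries, as noted above, the tokenizer dominates the measured time, so this kind of lookup is barely exercised by the benchmark.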


Best,


Peter


On 14.03.2017 at 08:51, Peter Klügl wrote:
> Hi,
>
>
> it's now March and I did not yet find the time to compare the different
> annotators in your benchmark.
>
>
> I just wanted to mention that I did not forget about this and that this
> is still on my todo list. However, it could easily be April before I
> find the time.
>
>
> Best,
>
>
> Peter
>
>
> On 08.12.2016 at 10:43, Donatas Remeika wrote:
>> Hi,
>>
>> Peter, I did some benchmark on 20 newsgroups texts. The results can be
>> found here: https://github.com/tokenmill/dictionary-annotator
>> I didn't measure memory usage, just compared how fast different annotators
>> do the job.
>>
>> Best regards,
>> Donatas
>>
>> On Mon, Dec 5, 2016 at 2:35 PM Peter Klügl <pe...@averbis.com> wrote:
>>
>>> Hi,
>>>
>>>
>>> for the UIMA Ruta paper, I used the enron email dataset [1], but it is
>>> probably not optimal here.
>>>
>>>
>>> I think we can find a standard scenario (data+terminology), maybe
>>> something like Genia with MeSH or wikipedia with geonames. Just a quick
>>> guess. I can help setting something up, but probably not before February.
>>>
>>>
>>> Best,
>>>
>>>
>>> Peter
>>>
>>>
>>> [1] https://www.cs.cmu.edu/~enron/
>>>
>>> On 05.12.2016 at 12:56, Donatas Remeika wrote:
>>>> Hi,
>>>>
>>>> Thanks for feedback.
>>>> Yes, it would be interesting to see benchmark results. Maybe you know
>>> where
>>>> I could find examples and data for doing benchmarks in UIMA?
>>>>
>>>> Best regards,
>>>> Donatas
>>>>
>>>>
>>>> On Mon, Dec 5, 2016 at 10:52 AM Peter Klügl <pe...@averbis.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>>
>>>>> a very nice annotator, thank you.
>>>>>
>>>>>
>>>>> Do you have figures how the annotator compares to the others with
>>>>> respect to speed and memory usage?
>>>>>
>>>>> Storing the complete tokens will maybe provide challenges in scenarios
>>>>> with parallelization if the dictionary is not shared between annotators.
>>>>>
>>>>> Would you be interested to set up a benchmark?
>>>>>
>>>>>
>>>>> Because of the limitations of the dictionaries in ruta, I also created a
>>>>> new simple dictionary annotator, but it lives now in our own components
>>>>> repository. Maybe I'll contribute it sometimes to ruta since it provides
>>>>> exactly the functionality the ruta dictionaries miss.
>>>>>
>>>>>
>>>>> Best,
>>>>>
>>>>>
>>>>> Peter
>>>>>
>>>>>
>>>>> On 30.11.2016 at 15:38, Donatas Remeika wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Just wanted to let you know that we created a new (probably one more)
>>>>>> dictionary annotator.
>>>>>>
>>>>>> Reasons for creating it was:
>>>>>>  - Quite often we used Ruta in our pipelines only because of its
>>>>> MARKTABLE
>>>>>> action which is able to set several features on annotation
>>>>>>  - Sometimes dictionaries contain duplicate entries with different
>>>>> features
>>>>>> and we need to create annotations for each entry
>>>>>>  - Possibility to use custom dictionary entries tokenizer (default is
>>>>>> whitespace tokenizer)
>>>>>>
>>>>>> It was inspired by both DKPro dictionary-annotator and Ruta MARKTABLE.
>>>>> Big
>>>>>> thanks to their developers!
>>>>>>
>>>>>> Code with examples can be found
>>>>>> https://github.com/tokenmill/dictionary-annotator
>>>>>>
>>>>>> BTW, maybe someone knows Concept Mapper alternative, which is more
>>>>> uimaFIT
>>>>>> friendly?
>>>>>>
>>>>>> Best regards,
>>>>>> Donatas
>>>>>>