Posted to user@uima.apache.org by Donatas Remeika <do...@gmail.com> on 2016/11/30 14:38:04 UTC

New dictionary annotator

Hi,

Just wanted to let you know that we have created a new (probably yet another)
dictionary annotator.

Reasons for creating it were:
 - Quite often we used Ruta in our pipelines only because of its MARKTABLE
action, which is able to set several features on an annotation
 - Sometimes dictionaries contain duplicate entries with different features
and we need to create annotations for each entry
 - Possibility to use a custom tokenizer for dictionary entries (default is
whitespace tokenizer)

It was inspired by both DKPro dictionary-annotator and Ruta MARKTABLE. Big
thanks to their developers!

Code with examples can be found at
https://github.com/tokenmill/dictionary-annotator

BTW, does anyone know of a Concept Mapper alternative that is more uimaFIT
friendly?

Best regards,
Donatas
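
The three requirements listed above (several features per annotation, one
annotation per duplicate entry, a pluggable tokenizer) can be illustrated
with a minimal Python sketch. It is not the code from the tokenmill
repository; the names whitespace_tokenizer, load_dictionary and annotate,
and the sample rows, are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def whitespace_tokenizer(text):
    """Split on whitespace and return (token, start, end) triples."""
    tokens, pos = [], 0
    for tok in text.split():
        start = text.index(tok, pos)
        end = start + len(tok)
        tokens.append((tok, start, end))
        pos = end
    return tokens


def load_dictionary(rows, tokenizer=whitespace_tokenizer):
    """Map each tokenized phrase to ALL feature dicts that share it."""
    entries = defaultdict(list)
    for phrase, features in rows:
        key = tuple(tok for tok, _, _ in tokenizer(phrase))
        entries[key].append(features)  # duplicates are kept, not overwritten
    return entries


def annotate(text, entries, tokenizer=whitespace_tokenizer):
    """Return (begin, end, features) for every entry found in the text."""
    tokens = tokenizer(text)
    words = [t[0] for t in tokens]
    max_len = max((len(k) for k in entries), default=0)
    annotations = []
    for i in range(len(tokens)):
        for n in range(1, min(max_len, len(tokens) - i) + 1):
            # One annotation per feature set, so duplicate entries all fire.
            for features in entries.get(tuple(words[i:i + n]), []):
                annotations.append((tokens[i][1], tokens[i + n - 1][2], features))
    return annotations


# Hypothetical dictionary: "Paris" appears twice with different features.
rows = [
    ("Paris", {"type": "city", "country": "FR"}),
    ("Paris", {"type": "person", "gender": "m"}),
    ("New York", {"type": "city", "country": "US"}),
]
entries = load_dictionary(rows)
result = annotate("Paris visited New York", entries)
```

Here result holds two annotations over "Paris" (characters 0 to 5), one per
duplicate entry, and one over "New York" (characters 14 to 22). Passing a
different tokenizer function to both load_dictionary and annotate changes
how entries and text are split.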

Re: New dictionary annotator

Posted by Hugues de Mazancourt <hu...@mazancourt.com>.
Great. Keep me informed if you need a beta-tester!

— Hugues


> On 2 Dec 2016, at 10:37, Donatas Remeika <do...@gmail.com> wrote:
> 
> During the next week :)
> 
> Donatas
> 
> On Fri, Dec 2, 2016 at 11:32 AM Hugues de Mazancourt <hu...@mazancourt.com>
> wrote:
> 
>> Cool !
>> Any idea of how far that near future is ?
>> ;-)
>> 
>> — Hugues
>> 
>> 
>> 
>>> On 2 Dec 2016, at 10:26, Donatas Remeika <do...@gmail.com> wrote:
>>> 
>>> Hi Hugues,
>>> 
>>> Thanks for feedback. Indeed accent-insensitive matching is a needed
>>> feature. Will implement it in a near future.
>>> 
>>> Best regards,
>>> Donatas Remeika
>>> 
>>> On Fri, Dec 2, 2016 at 11:02 AM Hugues de Mazancourt <
>> hugues@mazancourt.com>
>>> wrote:
>>> 
>>>> Thanks for this contribution.
>>>> 
>>>> Do you have any plan to make the lookup accent-insensitive ? Or any
>>>> knowledge of a component that would do the job ?
>>>> I’m currently using ConceptMapper outside of Ruta and MARKTABLE from
>>>> within Ruta but neither performs correctly on accents (btw,
>> conceptMapper
>>>> is *very* slow on resource loading, which can be a problem).
>>>> 
>>>> My point is : I have lists containing elements like « événement » and I
>>>> would like text like « EVENEMENT » or even « évènement » to match that
>>>> list. Lowercasing texts is not a solution, as « é » is mapped to
>> uppercase
>>>> « É » in French locale, which has nothing to do with « e ». I guess you
>>>> have the same problem with latvian.
>>>> 
>>>> Best,
>>>> 
>>>> 
>>>> Hugues de Mazancourt
>>>> http://about.me/mazancourt
>>>> 
>>>> 
>>>> 
>>>> 
>>>>> On 30 Nov 2016, at 15:38, Donatas Remeika <do...@gmail.com> wrote:
>>>>> 
>>>>> Hi,
>>>>> 
>>>>> Just wanted to let you know that we created a new (probably one more)
>>>>> dictionary annotator.
>>>>> 
>>>>> Reasons for creating it was:
>>>>> - Quite often we used Ruta in our pipelines only because of its
>> MARKTABLE
>>>>> action which is able to set several features on annotation
>>>>> - Sometimes dictionaries contain duplicate entries with different
>>>> features
>>>>> and we need to create annotations for each entry
>>>>> - Possibility to use custom dictionary entries tokenizer (default is
>>>>> whitespace tokenizer)
>>>>> 
>>>>> It was inspired by both DKPro dictionary-annotator and Ruta MARKTABLE.
>>>> Big
>>>>> thanks to their developers!
>>>>> 
>>>>> Code with examples can be found
>>>>> https://github.com/tokenmill/dictionary-annotator
>>>>> 
>>>>> BTW, maybe someone knows Concept Mapper alternative, which is more
>>>> uimaFIT
>>>>> friendly?
>>>>> 
>>>>> Best regards,
>>>>> Donatas
>>>> 
>>>> 
>> 
>> 


Re: New dictionary annotator

Posted by Donatas Remeika <do...@gmail.com>.
During the next week :)

Donatas

On Fri, Dec 2, 2016 at 11:32 AM Hugues de Mazancourt <hu...@mazancourt.com>
wrote:

> Cool !
> Any idea of how far that near future is ?
> ;-)
>
> — Hugues
>
>
>
> > On 2 Dec 2016, at 10:26, Donatas Remeika <do...@gmail.com> wrote:
> >
> > Hi Hugues,
> >
> > Thanks for feedback. Indeed accent-insensitive matching is a needed
> > feature. Will implement it in a near future.
> >
> > Best regards,
> > Donatas Remeika
> >
> > On Fri, Dec 2, 2016 at 11:02 AM Hugues de Mazancourt <
> hugues@mazancourt.com>
> > wrote:
> >
> >> Thanks for this contribution.
> >>
> >> Do you have any plan to make the lookup accent-insensitive ? Or any
> >> knowledge of a component that would do the job ?
> >> I’m currently using ConceptMapper outside of Ruta and MARKTABLE from
> >> within Ruta but neither performs correctly on accents (btw,
> conceptMapper
> >> is *very* slow on resource loading, which can be a problem).
> >>
> >> My point is : I have lists containing elements like « événement » and I
> >> would like text like « EVENEMENT » or even « évènement » to match that
> >> list. Lowercasing texts is not a solution, as « é » is mapped to
> uppercase
> >> « É » in French locale, which has nothing to do with « e ». I guess you
> >> have the same problem with latvian.
> >>
> >> Best,
> >>
> >>
> >> Hugues de Mazancourt
> >> http://about.me/mazancourt
> >>
> >>
> >>
> >>
> >>> On 30 Nov 2016, at 15:38, Donatas Remeika <do...@gmail.com> wrote:
> >>>
> >>> Hi,
> >>>
> >>> Just wanted to let you know that we created a new (probably one more)
> >>> dictionary annotator.
> >>>
> >>> Reasons for creating it was:
> >>> - Quite often we used Ruta in our pipelines only because of its
> MARKTABLE
> >>> action which is able to set several features on annotation
> >>> - Sometimes dictionaries contain duplicate entries with different
> >> features
> >>> and we need to create annotations for each entry
> >>> - Possibility to use custom dictionary entries tokenizer (default is
> >>> whitespace tokenizer)
> >>>
> >>> It was inspired by both DKPro dictionary-annotator and Ruta MARKTABLE.
> >> Big
> >>> thanks to their developers!
> >>>
> >>> Code with examples can be found
> >>> https://github.com/tokenmill/dictionary-annotator
> >>>
> >>> BTW, maybe someone knows Concept Mapper alternative, which is more
> >> uimaFIT
> >>> friendly?
> >>>
> >>> Best regards,
> >>> Donatas
> >>
> >>
>
>

Re: New dictionary annotator

Posted by Hugues de Mazancourt <hu...@mazancourt.com>.
Cool!
Any idea of how far off that near future is?
;-)

— Hugues



> On 2 Dec 2016, at 10:26, Donatas Remeika <do...@gmail.com> wrote:
> 
> Hi Hugues,
> 
> Thanks for feedback. Indeed accent-insensitive matching is a needed
> feature. Will implement it in a near future.
> 
> Best regards,
> Donatas Remeika
> 
> On Fri, Dec 2, 2016 at 11:02 AM Hugues de Mazancourt <hu...@mazancourt.com>
> wrote:
> 
>> Thanks for this contribution.
>> 
>> Do you have any plan to make the lookup accent-insensitive ? Or any
>> knowledge of a component that would do the job ?
>> I’m currently using ConceptMapper outside of Ruta and MARKTABLE from
>> within Ruta but neither performs correctly on accents (btw, conceptMapper
>> is *very* slow on resource loading, which can be a problem).
>> 
>> My point is : I have lists containing elements like « événement » and I
>> would like text like « EVENEMENT » or even « évènement » to match that
>> list. Lowercasing texts is not a solution, as « é » is mapped to uppercase
>> « É » in French locale, which has nothing to do with « e ». I guess you
>> have the same problem with latvian.
>> 
>> Best,
>> 
>> 
>> Hugues de Mazancourt
>> http://about.me/mazancourt
>> 
>> 
>> 
>> 
>>> On 30 Nov 2016, at 15:38, Donatas Remeika <do...@gmail.com> wrote:
>>> 
>>> Hi,
>>> 
>>> Just wanted to let you know that we created a new (probably one more)
>>> dictionary annotator.
>>> 
>>> Reasons for creating it was:
>>> - Quite often we used Ruta in our pipelines only because of its MARKTABLE
>>> action which is able to set several features on annotation
>>> - Sometimes dictionaries contain duplicate entries with different
>> features
>>> and we need to create annotations for each entry
>>> - Possibility to use custom dictionary entries tokenizer (default is
>>> whitespace tokenizer)
>>> 
>>> It was inspired by both DKPro dictionary-annotator and Ruta MARKTABLE.
>> Big
>>> thanks to their developers!
>>> 
>>> Code with examples can be found
>>> https://github.com/tokenmill/dictionary-annotator
>>> 
>>> BTW, maybe someone knows Concept Mapper alternative, which is more
>> uimaFIT
>>> friendly?
>>> 
>>> Best regards,
>>> Donatas
>> 
>> 


Re: New dictionary annotator

Posted by Donatas Remeika <do...@gmail.com>.
Hi Hugues,

Thanks for the feedback. Indeed, accent-insensitive matching is a needed
feature. I will implement it in the near future.

Best regards,
Donatas Remeika

On Fri, Dec 2, 2016 at 11:02 AM Hugues de Mazancourt <hu...@mazancourt.com>
wrote:

> Thanks for this contribution.
>
> Do you have any plan to make the lookup accent-insensitive ? Or any
> knowledge of a component that would do the job ?
> I’m currently using ConceptMapper outside of Ruta and MARKTABLE from
> within Ruta but neither performs correctly on accents (btw, conceptMapper
> is *very* slow on resource loading, which can be a problem).
>
> My point is : I have lists containing elements like « événement » and I
> would like text like « EVENEMENT » or even « évènement » to match that
> list. Lowercasing texts is not a solution, as « é » is mapped to uppercase
> « É » in French locale, which has nothing to do with « e ». I guess you
> have the same problem with latvian.
>
> Best,
>
>
> Hugues de Mazancourt
> http://about.me/mazancourt
>
>
>
>
> > On 30 Nov 2016, at 15:38, Donatas Remeika <do...@gmail.com> wrote:
> >
> > Hi,
> >
> > Just wanted to let you know that we created a new (probably one more)
> > dictionary annotator.
> >
> > Reasons for creating it was:
> > - Quite often we used Ruta in our pipelines only because of its MARKTABLE
> > action which is able to set several features on annotation
> > - Sometimes dictionaries contain duplicate entries with different
> features
> > and we need to create annotations for each entry
> > - Possibility to use custom dictionary entries tokenizer (default is
> > whitespace tokenizer)
> >
> > It was inspired by both DKPro dictionary-annotator and Ruta MARKTABLE.
> Big
> > thanks to their developers!
> >
> > Code with examples can be found
> > https://github.com/tokenmill/dictionary-annotator
> >
> > BTW, maybe someone knows Concept Mapper alternative, which is more
> uimaFIT
> > friendly?
> >
> > Best regards,
> > Donatas
>
>

Re: New dictionary annotator

Posted by Hugues de Mazancourt <hu...@mazancourt.com>.
Thanks for this contribution.

Do you have any plans to make the lookup accent-insensitive? Or do you know of a component that would do the job?
I’m currently using ConceptMapper outside of Ruta and MARKTABLE from within Ruta, but neither handles accents correctly (btw, ConceptMapper is *very* slow on resource loading, which can be a problem).

My point is: I have lists containing elements like « événement » and I would like text like « EVENEMENT » or even « évènement » to match that list. Lowercasing the text is not a solution, since « é » corresponds to uppercase « É » in the French locale, which has nothing to do with « e ». I guess you have the same problem with Latvian.

Best,


Hugues de Mazancourt
http://about.me/mazancourt
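
The accent-insensitive lookup requested above is commonly approached by
comparing folded keys instead of raw strings: decompose each string with
Unicode NFD normalization, drop the combining marks, then case-fold. The
Python sketch below shows the idea under that assumption; the helper name
fold is hypothetical, and this is not how ConceptMapper, Ruta or the new
annotator work internally. In Java the same decomposition is available
through java.text.Normalizer.

```python
import unicodedata


def fold(text):
    """Return a case- and accent-insensitive matching key for `text`."""
    # NFD splits "é" into "e" followed by a combining acute accent.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks, keeping only the base characters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # casefold() is a more aggressive lower() intended for comparisons.
    return stripped.casefold()


# The dictionary is indexed by folded key; the text is folded the same way.
keys = {fold("événement")}
candidates = ["EVENEMENT", "évènement", "Événement", "avènement"]
matches = [w for w in candidates if fold(w) in keys]
```

With this folding, « EVENEMENT », « évènement » and « Événement » all map to
the same key as « événement », while « avènement » does not. The folded key
is only used for comparison; annotation offsets should still come from the
original text. Letters without a Unicode decomposition, such as « œ » or
« ø », are left unchanged and would need an explicit mapping.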




> On 30 Nov 2016, at 15:38, Donatas Remeika <do...@gmail.com> wrote:
> 
> Hi,
> 
> Just wanted to let you know that we created a new (probably one more)
> dictionary annotator.
> 
> Reasons for creating it was:
> - Quite often we used Ruta in our pipelines only because of its MARKTABLE
> action which is able to set several features on annotation
> - Sometimes dictionaries contain duplicate entries with different features
> and we need to create annotations for each entry
> - Possibility to use custom dictionary entries tokenizer (default is
> whitespace tokenizer)
> 
> It was inspired by both DKPro dictionary-annotator and Ruta MARKTABLE. Big
> thanks to their developers!
> 
> Code with examples can be found
> https://github.com/tokenmill/dictionary-annotator
> 
> BTW, maybe someone knows Concept Mapper alternative, which is more uimaFIT
> friendly?
> 
> Best regards,
> Donatas


Re: New dictionary annotator

Posted by Donatas Remeika <do...@gmail.com>.
Hi Daniel,

The dictionary annotator is definitely faster than Concept Mapper, but has
much less functionality. It supports only the first matching strategy.

Regards,
Donatas

On Wed, May 10, 2017 at 12:19 AM Daniel Heinze <dh...@gnoetics.com> wrote:

> Hi... I just pulled and compiled the dictionaryannotator and am looking
> through the code.  I'm looking for something that is faster than UIMA
> Concept-Mapper.  I don't need all the functionality of Concept-Mapper, but
> do need the following:
> * match all, e.g. if dict entries are "a b c", "a b" and "b c" and input
> is "a b c" , I need to match "a b c", "a b"  and "b c"
> * skip tokens, e.g. if dict entry is  "a c d", it should match on input "a
> b c d"
> Can someone familiar with the new dictionary annotator save me some time
> and say if it supports these matching strategies?
> Also, any sense of how the system scales?
> Thanks / Dan
>
> -----Original Message-----
> From: Peter Klügl [mailto:peter.kluegl@averbis.com]
> Sent: Tuesday, March 14, 2017 12:52 AM
> To: user@uima.apache.org
> Subject: Re: New dictionary annotator
>
> Hi,
>
>
> it's now March and I did not yet find the time to compare the different
> annotators in your benchmark.
>
>
> I just wanted to mention that I did not forget about this and that this is
> still on my todo list. However, it could easily be April before I find the
> time.
>
>
> Best,
>
>
> Peter
>
>
> On 8 Dec 2016 at 10:43, Donatas Remeika wrote:
> > Hi,
> >
> > Peter, I did some benchmark on 20 newsgroups texts. The results can be
> > found here: https://github.com/tokenmill/dictionary-annotator
> > I didn't measure memory usage, just compared how fast different
> > annotators do the job.
> >
> > Best regards,
> > Donatas
> >
> > On Mon, Dec 5, 2016 at 2:35 PM Peter Klügl <pe...@averbis.com>
> wrote:
> >
> >> Hi,
> >>
> >>
> >> for the UIMA Ruta paper, I used the enron email dataset [1], but it
> >> is probably not optimal here.
> >>
> >>
> >> I think we can find a standard scenario (data+terminology), maybe
> >> something like Genia with MeSH or wikipedia with geonames. Just a
> >> quick guess. I can help setting something up, but probably not before
> February.
> >>
> >>
> >> Best,
> >>
> >>
> >> Peter
> >>
> >>
> >> [1] https://www.cs.cmu.edu/~enron/
> >>
> >> On 5 Dec 2016 at 12:56, Donatas Remeika wrote:
> >>> Hi,
> >>>
> >>> Thanks for feedback.
> >>> Yes, it would be interesting to see benchmark results. Maybe you
> >>> know
> >> where
> >>> I could find examples and data for doing benchmarks in UIMA?
> >>>
> >>> Best regards,
> >>> Donatas
> >>>
> >>>
> >>> On Mon, Dec 5, 2016 at 10:52 AM Peter Klügl
> >>> <pe...@averbis.com>
> >>> wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>>
> >>>> a very nice annotator, thank you.
> >>>>
> >>>>
> >>>> Do you have figures how the annotator compares to the others with
> >>>> respect to speed and memory usage?
> >>>>
> >>>> Storing the complete tokens will maybe provide challenges in
> >>>> scenarios with parallelization if the dictionary is not shared
> between annotators.
> >>>>
> >>>> Would you be interested to set up a benchmark?
> >>>>
> >>>>
> >>>> Because of the limitations of the dictionaries in ruta, I also
> >>>> created a new simple dictionary annotator, but it lives now in our
> >>>> own components repository. Maybe I'll contribute it sometimes to
> >>>> ruta since it provides exactly the functionality the ruta
> dictionaries miss.
> >>>>
> >>>>
> >>>> Best,
> >>>>
> >>>>
> >>>> Peter
> >>>>
> >>>>
> >>>> On 30 Nov 2016 at 15:38, Donatas Remeika wrote:
> >>>>> Hi,
> >>>>>
> >>>>> Just wanted to let you know that we created a new (probably one
> >>>>> more) dictionary annotator.
> >>>>>
> >>>>> Reasons for creating it was:
> >>>>>  - Quite often we used Ruta in our pipelines only because of its
> >>>> MARKTABLE
> >>>>> action which is able to set several features on annotation
> >>>>>  - Sometimes dictionaries contain duplicate entries with different
> >>>> features
> >>>>> and we need to create annotations for each entry
> >>>>>  - Possibility to use custom dictionary entries tokenizer (default
> >>>>> is whitespace tokenizer)
> >>>>>
> >>>>> It was inspired by both DKPro dictionary-annotator and Ruta
> MARKTABLE.
> >>>> Big
> >>>>> thanks to their developers!
> >>>>>
> >>>>> Code with examples can be found
> >>>>> https://github.com/tokenmill/dictionary-annotator
> >>>>>
> >>>>> BTW, maybe someone knows Concept Mapper alternative, which is more
> >>>> uimaFIT
> >>>>> friendly?
> >>>>>
> >>>>> Best regards,
> >>>>> Donatas
> >>>>>
> >>
>
>

RE: New dictionary annotator

Posted by Daniel Heinze <dh...@gnoetics.com>.
Hi... I just pulled and compiled the dictionaryannotator and am looking through the code.  I'm looking for something that is faster than UIMA Concept-Mapper.  I don't need all the functionality of Concept-Mapper, but do need the following:
* match all, e.g. if dict entries are "a b c", "a b" and "b c" and input is "a b c", I need to match "a b c", "a b" and "b c"
* skip tokens, e.g. if dict entry is "a c d", it should match on input "a b c d"
Can someone familiar with the new dictionary annotator save me some time and say if it supports these matching strategies?
Also, any sense of how the system scales? 
Thanks / Dan
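
The two strategies described above can be made concrete with a small Python
sketch. It only illustrates what "match all" and "skip tokens" mean on the
examples given; the function names match_all and match_with_skips and the
max_skip parameter are hypothetical, and nothing here reflects the actual
behaviour of Concept Mapper or the new dictionary annotator.

```python
def match_all(tokens, entries):
    """Return every (start, end) token span whose tokens form an entry."""
    entry_set = {tuple(e.split()) for e in entries}
    max_len = max((len(e) for e in entry_set), default=0)
    spans = []
    for i in range(len(tokens)):
        for n in range(1, min(max_len, len(tokens) - i) + 1):
            if tuple(tokens[i:i + n]) in entry_set:
                spans.append((i, i + n))  # overlapping spans are all kept
    return spans


def match_with_skips(tokens, entry, max_skip=1):
    """Return index lists where `entry` matches in order, allowing up to
    `max_skip` ignored input tokens between consecutive entry tokens."""
    entry_tokens = entry.split()
    results = []

    def extend(positions, k):
        if k == len(entry_tokens):
            results.append(positions)
            return
        last = positions[-1]
        # Next entry token may sit 0..max_skip tokens after the last match.
        for j in range(last + 1, min(last + 2 + max_skip, len(tokens))):
            if tokens[j] == entry_tokens[k]:
                extend(positions + [j], k + 1)

    for i, tok in enumerate(tokens):
        if entry_tokens and tok == entry_tokens[0]:
            extend([i], 1)
    return results


spans = match_all("a b c".split(), ["a b c", "a b", "b c"])
skipped = match_with_skips("a b c d".split(), "a c d", max_skip=1)
strict = match_with_skips("a b c d".split(), "a c d", max_skip=0)
```

On the first example, spans contains the token ranges for "a b", "a b c"
and "b c". On the second, skipped contains the token indexes 0, 2 and 3
(skipping "b"), while strict is empty because no skipping is allowed.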
 
-----Original Message-----
From: Peter Klügl [mailto:peter.kluegl@averbis.com] 
Sent: Tuesday, March 14, 2017 12:52 AM
To: user@uima.apache.org
Subject: Re: New dictionary annotator

Hi,


it's now March and I did not yet find the time to compare the different annotators in your benchmark.


I just wanted to mention that I did not forget about this and that this is still on my todo list. However, it could easily be April before I find the time.


Best,


Peter


On 8 Dec 2016 at 10:43, Donatas Remeika wrote:
> Hi,
>
> Peter, I did some benchmark on 20 newsgroups texts. The results can be 
> found here: https://github.com/tokenmill/dictionary-annotator
> I didn't measure memory usage, just compared how fast different 
> annotators do the job.
>
> Best regards,
> Donatas
>
> On Mon, Dec 5, 2016 at 2:35 PM Peter Klügl <pe...@averbis.com> wrote:
>
>> Hi,
>>
>>
>> for the UIMA Ruta paper, I used the enron email dataset [1], but it 
>> is probably not optimal here.
>>
>>
>> I think we can find a standard scenario (data+terminology), maybe 
>> something like Genia with MeSH or wikipedia with geonames. Just a 
>> quick guess. I can help setting something up, but probably not before February.
>>
>>
>> Best,
>>
>>
>> Peter
>>
>>
>> [1] https://www.cs.cmu.edu/~enron/
>>
>> On 5 Dec 2016 at 12:56, Donatas Remeika wrote:
>>> Hi,
>>>
>>> Thanks for feedback.
>>> Yes, it would be interesting to see benchmark results. Maybe you 
>>> know
>> where
>>> I could find examples and data for doing benchmarks in UIMA?
>>>
>>> Best regards,
>>> Donatas
>>>
>>>
>>> On Mon, Dec 5, 2016 at 10:52 AM Peter Klügl 
>>> <pe...@averbis.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>>
>>>> a very nice annotator, thank you.
>>>>
>>>>
>>>> Do you have figures how the annotator compares to the others with 
>>>> respect to speed and memory usage?
>>>>
>>>> Storing the complete tokens will maybe provide challenges in 
>>>> scenarios with parallelization if the dictionary is not shared between annotators.
>>>>
>>>> Would you be interested to set up a benchmark?
>>>>
>>>>
>>>> Because of the limitations of the dictionaries in ruta, I also 
>>>> created a new simple dictionary annotator, but it lives now in our 
>>>> own components repository. Maybe I'll contribute it sometimes to 
>>>> ruta since it provides exactly the functionality the ruta dictionaries miss.
>>>>
>>>>
>>>> Best,
>>>>
>>>>
>>>> Peter
>>>>
>>>>
>>>> On 30 Nov 2016 at 15:38, Donatas Remeika wrote:
>>>>> Hi,
>>>>>
>>>>> Just wanted to let you know that we created a new (probably one 
>>>>> more) dictionary annotator.
>>>>>
>>>>> Reasons for creating it was:
>>>>>  - Quite often we used Ruta in our pipelines only because of its
>>>> MARKTABLE
>>>>> action which is able to set several features on annotation
>>>>>  - Sometimes dictionaries contain duplicate entries with different
>>>> features
>>>>> and we need to create annotations for each entry
>>>>>  - Possibility to use custom dictionary entries tokenizer (default 
>>>>> is whitespace tokenizer)
>>>>>
>>>>> It was inspired by both DKPro dictionary-annotator and Ruta MARKTABLE.
>>>> Big
>>>>> thanks to their developers!
>>>>>
>>>>> Code with examples can be found
>>>>> https://github.com/tokenmill/dictionary-annotator
>>>>>
>>>>> BTW, maybe someone knows Concept Mapper alternative, which is more
>>>> uimaFIT
>>>>> friendly?
>>>>>
>>>>> Best regards,
>>>>> Donatas
>>>>>
>>


Re: New dictionary annotator

Posted by Peter Klügl <pe...@averbis.com>.
Hi,


I included an (our) simple dictionary annotator (trie-based) in the
benchmark. It's roughly as fast as your annotator.


However, the benchmark includes the time of the OpenNLP tokenizer, and
the dictionary and its entries are quite minimal (four entries, no
multi-token entries). Thus, for the benchmark, it does not really matter
whether there is a dictionary at all, and the benchmark provides hardly
any evidence, at least for my use cases.


Best,


Peter
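
For readers unfamiliar with the trie-based approach mentioned above, here
is a minimal Python sketch of a token trie with longest-match lookup. It is
not the annotator referred to in this message; the class TokenTrie, the
function find_entries and the sample entries are hypothetical. The point it
illustrates is that lookup cost at each position depends on the length of
the matching phrase rather than on the number of dictionary entries, which
is why a four-entry dictionary says little about real workloads.

```python
_END = object()  # sentinel key marking that an entry ends at this node


class TokenTrie:
    def __init__(self):
        self.root = {}

    def add(self, phrase, value):
        """Insert a whitespace-tokenized phrase with an attached value."""
        node = self.root
        for token in phrase.split():
            node = node.setdefault(token, {})
        node.setdefault(_END, []).append(value)

    def longest_match(self, tokens, start):
        """Return (end, values) for the longest entry starting at `start`."""
        node, best = self.root, None
        for i in range(start, len(tokens)):
            node = node.get(tokens[i])
            if node is None:
                break
            if _END in node:
                best = (i + 1, node[_END])
        return best


def find_entries(tokens, trie):
    """Scan left to right, taking the longest match at each position."""
    hits, i = [], 0
    while i < len(tokens):
        match = trie.longest_match(tokens, i)
        if match:
            end, values = match
            hits.append((i, end, values))
            i = end  # continue after the match (no overlapping hits)
        else:
            i += 1
    return hits


trie = TokenTrie()
trie.add("new york", "city")
trie.add("new york times", "newspaper")
hits = find_entries("the new york times reported".split(), trie)
```

In this run hits contains a single match covering tokens 1 to 4 with the
value "newspaper", because the longer entry wins over "new york".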


On 14 Mar 2017 at 08:51, Peter Klügl wrote:
> Hi,
>
>
> it's now March and I did not yet find the time to compare the different
> annotators in your benchmark.
>
>
> I just wanted to mention that I did not forget about this and that this
> is still on my todo list. However, it could easily be April before I
> find the time.
>
>
> Best,
>
>
> Peter
>
>
> On 8 Dec 2016 at 10:43, Donatas Remeika wrote:
>> Hi,
>>
>> Peter, I did some benchmark on 20 newsgroups texts. The results can be
>> found here: https://github.com/tokenmill/dictionary-annotator
>> I didn't measure memory usage, just compared how fast different annotators
>> do the job.
>>
>> Best regards,
>> Donatas
>>
>> On Mon, Dec 5, 2016 at 2:35 PM Peter Klügl <pe...@averbis.com> wrote:
>>
>>> Hi,
>>>
>>>
>>> for the UIMA Ruta paper, I used the enron email dataset [1], but it is
>>> probably not optimal here.
>>>
>>>
>>> I think we can find a standard scenario (data+terminology), maybe
>>> something like Genia with MeSH or wikipedia with geonames. Just a quick
>>> guess. I can help setting something up, but probably not before February.
>>>
>>>
>>> Best,
>>>
>>>
>>> Peter
>>>
>>>
>>> [1] https://www.cs.cmu.edu/~enron/
>>>
>>> On 5 Dec 2016 at 12:56, Donatas Remeika wrote:
>>>> Hi,
>>>>
>>>> Thanks for feedback.
>>>> Yes, it would be interesting to see benchmark results. Maybe you know
>>> where
>>>> I could find examples and data for doing benchmarks in UIMA?
>>>>
>>>> Best regards,
>>>> Donatas
>>>>
>>>>
>>>> On Mon, Dec 5, 2016 at 10:52 AM Peter Klügl <pe...@averbis.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>>
>>>>> a very nice annotator, thank you.
>>>>>
>>>>>
>>>>> Do you have figures how the annotator compares to the others with
>>>>> respect to speed and memory usage?
>>>>>
>>>>> Storing the complete tokens will maybe provide challenges in scenarios
>>>>> with parallelization if the dictionary is not shared between annotators.
>>>>>
>>>>> Would you be interested to set up a benchmark?
>>>>>
>>>>>
>>>>> Because of the limitations of the dictionaries in ruta, I also created a
>>>>> new simple dictionary annotator, but it lives now in our own components
>>>>> repository. Maybe I'll contribute it sometimes to ruta since it provides
>>>>> exactly the functionality the ruta dictionaries miss.
>>>>>
>>>>>
>>>>> Best,
>>>>>
>>>>>
>>>>> Peter
>>>>>
>>>>>
>>>>> On 30 Nov 2016 at 15:38, Donatas Remeika wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Just wanted to let you know that we created a new (probably one more)
>>>>>> dictionary annotator.
>>>>>>
>>>>>> Reasons for creating it was:
>>>>>>  - Quite often we used Ruta in our pipelines only because of its
>>>>> MARKTABLE
>>>>>> action which is able to set several features on annotation
>>>>>>  - Sometimes dictionaries contain duplicate entries with different
>>>>> features
>>>>>> and we need to create annotations for each entry
>>>>>>  - Possibility to use custom dictionary entries tokenizer (default is
>>>>>> whitespace tokenizer)
>>>>>>
>>>>>> It was inspired by both DKPro dictionary-annotator and Ruta MARKTABLE.
>>>>> Big
>>>>>> thanks to their developers!
>>>>>>
>>>>>> Code with examples can be found
>>>>>> https://github.com/tokenmill/dictionary-annotator
>>>>>>
>>>>>> BTW, maybe someone knows Concept Mapper alternative, which is more
>>>>> uimaFIT
>>>>>> friendly?
>>>>>>
>>>>>> Best regards,
>>>>>> Donatas
>>>>>>


Re: New dictionary annotator

Posted by Peter Klügl <pe...@averbis.com>.
Hi,


It's now March and I have not yet found the time to compare the different
annotators in your benchmark.


I just wanted to mention that I have not forgotten about this and that it
is still on my to-do list. However, it could easily be April before I
find the time.


Best,


Peter


On 8 Dec 2016 at 10:43, Donatas Remeika wrote:
> Hi,
>
> Peter, I did some benchmark on 20 newsgroups texts. The results can be
> found here: https://github.com/tokenmill/dictionary-annotator
> I didn't measure memory usage, just compared how fast different annotators
> do the job.
>
> Best regards,
> Donatas
>
> On Mon, Dec 5, 2016 at 2:35 PM Peter Klügl <pe...@averbis.com> wrote:
>
>> Hi,
>>
>>
>> for the UIMA Ruta paper, I used the enron email dataset [1], but it is
>> probably not optimal here.
>>
>>
>> I think we can find a standard scenario (data+terminology), maybe
>> something like Genia with MeSH or wikipedia with geonames. Just a quick
>> guess. I can help setting something up, but probably not before February.
>>
>>
>> Best,
>>
>>
>> Peter
>>
>>
>> [1] https://www.cs.cmu.edu/~enron/
>>
>> On 5 Dec 2016 at 12:56, Donatas Remeika wrote:
>>> Hi,
>>>
>>> Thanks for feedback.
>>> Yes, it would be interesting to see benchmark results. Maybe you know
>> where
>>> I could find examples and data for doing benchmarks in UIMA?
>>>
>>> Best regards,
>>> Donatas
>>>
>>>
>>> On Mon, Dec 5, 2016 at 10:52 AM Peter Klügl <pe...@averbis.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>>
>>>> a very nice annotator, thank you.
>>>>
>>>>
>>>> Do you have figures how the annotator compares to the others with
>>>> respect to speed and memory usage?
>>>>
>>>> Storing the complete tokens will maybe provide challenges in scenarios
>>>> with parallelization if the dictionary is not shared between annotators.
>>>>
>>>> Would you be interested to set up a benchmark?
>>>>
>>>>
>>>> Because of the limitations of the dictionaries in ruta, I also created a
>>>> new simple dictionary annotator, but it lives now in our own components
>>>> repository. Maybe I'll contribute it sometimes to ruta since it provides
>>>> exactly the functionality the ruta dictionaries miss.
>>>>
>>>>
>>>> Best,
>>>>
>>>>
>>>> Peter
>>>>
>>>>
>>>> On 30 Nov 2016 at 15:38, Donatas Remeika wrote:
>>>>> Hi,
>>>>>
>>>>> Just wanted to let you know that we created a new (probably one more)
>>>>> dictionary annotator.
>>>>>
>>>>> Reasons for creating it was:
>>>>>  - Quite often we used Ruta in our pipelines only because of its
>>>> MARKTABLE
>>>>> action which is able to set several features on annotation
>>>>>  - Sometimes dictionaries contain duplicate entries with different
>>>> features
>>>>> and we need to create annotations for each entry
>>>>>  - Possibility to use custom dictionary entries tokenizer (default is
>>>>> whitespace tokenizer)
>>>>>
>>>>> It was inspired by both DKPro dictionary-annotator and Ruta MARKTABLE.
>>>> Big
>>>>> thanks to their developers!
>>>>>
>>>>> Code with examples can be found
>>>>> https://github.com/tokenmill/dictionary-annotator
>>>>>
>>>>> BTW, maybe someone knows Concept Mapper alternative, which is more
>>>> uimaFIT
>>>>> friendly?
>>>>>
>>>>> Best regards,
>>>>> Donatas
>>>>>
>>


Re: New dictionary annotator

Posted by Donatas Remeika <do...@gmail.com>.
Hi,

Peter, I ran a benchmark on the 20 Newsgroups texts. The results can be
found here: https://github.com/tokenmill/dictionary-annotator
I didn't measure memory usage; I just compared how fast the different
annotators do the job.

Best regards,
Donatas

On Mon, Dec 5, 2016 at 2:35 PM Peter Klügl <pe...@averbis.com> wrote:

> Hi,
>
>
> for the UIMA Ruta paper, I used the enron email dataset [1], but it is
> probably not optimal here.
>
>
> I think we can find a standard scenario (data+terminology), maybe
> something like Genia with MeSH or wikipedia with geonames. Just a quick
> guess. I can help setting something up, but probably not before February.
>
>
> Best,
>
>
> Peter
>
>
> [1] https://www.cs.cmu.edu/~enron/
>
> On 5 Dec 2016 at 12:56, Donatas Remeika wrote:
> > Hi,
> >
> > Thanks for feedback.
> > Yes, it would be interesting to see benchmark results. Maybe you know
> where
> > I could find examples and data for doing benchmarks in UIMA?
> >
> > Best regards,
> > Donatas
> >
> >
> > On Mon, Dec 5, 2016 at 10:52 AM Peter Klügl <pe...@averbis.com>
> > wrote:
> >
> >> Hi,
> >>
> >>
> >> a very nice annotator, thank you.
> >>
> >>
> >> Do you have figures how the annotator compares to the others with
> >> respect to speed and memory usage?
> >>
> >> Storing the complete tokens will maybe provide challenges in scenarios
> >> with parallelization if the dictionary is not shared between annotators.
> >>
> >> Would you be interested to set up a benchmark?
> >>
> >>
> >> Because of the limitations of the dictionaries in ruta, I also created a
> >> new simple dictionary annotator, but it lives now in our own components
> >> repository. Maybe I'll contribute it sometimes to ruta since it provides
> >> exactly the functionality the ruta dictionaries miss.
> >>
> >>
> >> Best,
> >>
> >>
> >> Peter
> >>
> >>
> >> On 30 Nov 2016 at 15:38, Donatas Remeika wrote:
> >>> Hi,
> >>>
> >>> Just wanted to let you know that we created a new (probably one more)
> >>> dictionary annotator.
> >>>
> >>> Reasons for creating it was:
> >>>  - Quite often we used Ruta in our pipelines only because of its
> >> MARKTABLE
> >>> action which is able to set several features on annotation
> >>>  - Sometimes dictionaries contain duplicate entries with different
> >> features
> >>> and we need to create annotations for each entry
> >>>  - Possibility to use custom dictionary entries tokenizer (default is
> >>> whitespace tokenizer)
> >>>
> >>> It was inspired by both DKPro dictionary-annotator and Ruta MARKTABLE.
> >> Big
> >>> thanks to their developers!
> >>>
> >>> Code with examples can be found
> >>> https://github.com/tokenmill/dictionary-annotator
> >>>
> >>> BTW, maybe someone knows Concept Mapper alternative, which is more
> >> uimaFIT
> >>> friendly?
> >>>
> >>> Best regards,
> >>> Donatas
> >>>
> >>
>
>

Re: New dictionary annotator

Posted by Peter Klügl <pe...@averbis.com>.
Hi,


For the UIMA Ruta paper, I used the Enron email dataset [1], but it is
probably not optimal here.


I think we can find a standard scenario (data + terminology), maybe
something like GENIA with MeSH or Wikipedia with GeoNames. Just a quick
guess. I can help set something up, but probably not before February.


Best,


Peter


[1] https://www.cs.cmu.edu/~enron/

On 5 Dec 2016 at 12:56, Donatas Remeika wrote:
> Hi,
>
> Thanks for feedback.
> Yes, it would be interesting to see benchmark results. Maybe you know where
> I could find examples and data for doing benchmarks in UIMA?
>
> Best regards,
> Donatas
>
>
> On Mon, Dec 5, 2016 at 10:52 AM Peter Klügl <pe...@averbis.com>
> wrote:
>
>> Hi,
>>
>>
>> A very nice annotator, thank you.
>>
>>
>> Do you have figures on how the annotator compares to the others with
>> respect to speed and memory usage?
>>
>> Storing the complete tokens may pose challenges in parallelized scenarios
>> if the dictionary is not shared between annotators.
>>
>> Would you be interested in setting up a benchmark?
>>
>>
>> Because of the limitations of the dictionaries in Ruta, I also created a
>> new simple dictionary annotator, but it currently lives in our own components
>> repository. Maybe I'll contribute it to Ruta at some point, since it provides
>> exactly the functionality the Ruta dictionaries lack.
>>
>>
>> Best,
>>
>>
>> Peter
>>
>>


Re: New dictionary annotator

Posted by Donatas Remeika <do...@gmail.com>.
Hi,

Thanks for the feedback.
Yes, it would be interesting to see benchmark results. Do you know where
I could find examples and data for running benchmarks in UIMA?

Best regards,
Donatas


On Mon, Dec 5, 2016 at 10:52 AM Peter Klügl <pe...@averbis.com>
wrote:

> Hi,
>
>
> A very nice annotator, thank you.
>
>
> Do you have figures on how the annotator compares to the others with
> respect to speed and memory usage?
>
> Storing the complete tokens may pose challenges in parallelized scenarios
> if the dictionary is not shared between annotators.
>
> Would you be interested in setting up a benchmark?
>
>
> Because of the limitations of the dictionaries in Ruta, I also created a
> new simple dictionary annotator, but it currently lives in our own components
> repository. Maybe I'll contribute it to Ruta at some point, since it provides
> exactly the functionality the Ruta dictionaries lack.
>
>
> Best,
>
>
> Peter
>
>

Re: New dictionary annotator

Posted by Peter Klügl <pe...@averbis.com>.
Hi,


A very nice annotator, thank you.


Do you have figures on how the annotator compares to the others with
respect to speed and memory usage?

Storing the complete tokens may pose challenges in parallelized scenarios
if the dictionary is not shared between annotators.
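
As a rough illustration of the sharing concern (a sketch only, not code from
the announced annotator or from Ruta): if every annotator instance builds its
own token dictionary, memory grows with the number of parallel instances,
whereas a dictionary built once can be handed to all of them. The names
`shared_dictionary`, `build_dictionary`, and the sample rows are hypothetical.

```python
import threading
from types import MappingProxyType

_cache = {}
_cache_lock = threading.Lock()


def tokenize(phrase):
    # Whitespace tokenizer, the default mentioned in the announcement.
    return tuple(phrase.split())


def build_dictionary(rows):
    """Build a lookup from token tuples to a list of feature sets.

    Duplicate phrases keep every feature set, one per dictionary entry.
    """
    entries = {}
    for phrase, features in rows:
        entries.setdefault(tokenize(phrase), []).append(dict(features))
    # Read-only view of the top-level mapping; callers cannot add or drop keys.
    return MappingProxyType(entries)


def shared_dictionary(name, rows_factory):
    """Return the single dictionary registered under `name`.

    The dictionary is built on first use; later callers get the same object.
    """
    with _cache_lock:
        if name not in _cache:
            _cache[name] = build_dictionary(rows_factory())
        return _cache[name]
```

Each annotator instance would call `shared_dictionary("geo", load_rows)` during
initialization instead of loading its own copy, so the token storage is paid
for once per process rather than once per instance.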

Would you be interested in setting up a benchmark?


Because of the limitations of the dictionaries in Ruta, I also created a
new simple dictionary annotator, but it currently lives in our own components
repository. Maybe I'll contribute it to Ruta at some point, since it provides
exactly the functionality the Ruta dictionaries lack.


Best,


Peter

