Posted to dev@uima.apache.org by "Pablo Duboue (Jira)" <de...@uima.apache.org> on 2024/03/14 22:35:00 UTC

[jira] [Updated] (UIMA-6487) Support Aggregate Engines in Apache UIMACPP

     [ https://issues.apache.org/jira/browse/UIMA-6487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pablo Duboue updated UIMA-6487:
-------------------------------
    Description: 
UIMA is a framework for unstructured information management, built around the idea of heavy annotators interoperating using a common exchange format.

It has been in production use for about two decades.

The framework is mostly written in Java. It has a C++ counterpart that implements a subset of the framework.

The challenge for this GSOC is to work together with the mentor to implement the full framework.

More details on GitHub: [https://github.com/apache/uima-uimacpp/issues/6]
h2. Benefits to the community

Users have discussed this as one of the main roadblocks to using the C++ version of the framework: [https://lists.apache.org/thread/f1r3sghgn2oqhvzz27y26zg6j3olv8qq]

From a broader perspective, there is the question of why we need NLP frameworks in 2024. The field has moved to approaches where source text is consumed in a destructive tokenization process that generates subtoken indices over a fixed vocabulary. These indices are then fed as input to a deep/transformer neural network.

Now, when training such networks, particularly when building Large Language Models (LLMs), gargantuan amounts of text are quickly tokenized and fed into the model being trained. Additional computational effort at indexing time can help improve the data quality, privacy, and terms-of-use compliance of the text. A high-performance UIMACPP can be the missing piece for quality input data to LLMs.
h2. Technical Skills

Working on this problem requires intermediate knowledge of the C++ programming language.

A solution will most probably exercise the following skills, which can be learned along the way in parallel with the project (mentoring on these topics is not part of the project):
 * Linux command-line and build systems
 * XML parsing
 * Docker (image creation, deployment, debugging)

h2. About the mentor

Dr. Duboue has more than 25 years of experience in AI. He has a Ph.D. in Computer Science from Columbia University and was a member of the IBM Watson team that beat the Jeopardy! champions.

Aside from his consulting work, he has taught in three different countries and done joint research with more than fifty co-authors.

He has years of experience mentoring both students and employees.

 

 

  was:
UIMA is a framework for unstructured information management, built around the idea of heavy annotators interoperating using a common exchange format.

It has been in production use for about two decades.

The framework is mostly written in Java. It has a C++ counterpart that implements a subset of the framework.

The challenge for this GSOC is to work together with the mentor to implement the full framework.

More details on GitHub: [https://github.com/apache/uima-uimacpp/issues/6]

 

Benefits to the community

Users have discussed this as one of the main roadblocks to using the C++ version of the framework: [https://lists.apache.org/thread/f1r3sghgn2oqhvzz27y26zg6j3olv8qq]

 

About the mentor

Dr. Duboue has more than 25 years of experience in AI. He has a Ph.D. in Computer Science from Columbia University and was a member of the IBM Watson team that beat the Jeopardy! champions.

Aside from his consulting work, he has taught in three different countries and done joint research with more than fifty co-authors.

He has years of experience mentoring both students and employees.

 

 


> Support Aggregate Engines in Apache UIMACPP
> -------------------------------------------
>
>                 Key: UIMA-6487
>                 URL: https://issues.apache.org/jira/browse/UIMA-6487
>             Project: UIMA
>          Issue Type: New Feature
>            Reporter: Pablo Duboue
>            Priority: Major
>              Labels: NLP, full-time, gsoc2024, mentor, uima


