Posted to user@uima.apache.org by Mario Gazzo <ma...@gmail.com> on 2015/10/20 17:26:53 UTC

UIMA Ruta not capturing some XML markup with attributes?

Hi Peter,

RUTA doesn’t seem to capture some XML markup with attributes. Here are some examples:

<xref ref-type="bibr" rid="b35-ehp0113-000220”>
<sec sec-type="methods”>

The markup examples above are completely missing from the TokenSeed annotations. I wonder whether this is related to the dash in the attribute names, since other markup without dashes appears to be captured.

Can you confirm that the dash could cause the problem?

Cheers
Mario

Re: UIMA Ruta not capturing some XML markup with attributes?

Posted by Mario Gazzo <ma...@gmail.com>.
Thanks Peter,

The quotes are just normal quotes in the original source, but the mail software must have changed them. Sorry about the misunderstanding.

Cheers
Mario 

> On 21/10/2015, at 16.03, Peter Klügl <pe...@averbis.com> wrote:
> 
> Hi,
> 
> I extended the pattern to support dashes, but not the other quotes. This
> can get arbitrarily complex (and slow) if every combination of Unicode
> characters that look like quotes has to be supported. I still think that
> this is not valid XML. Can you give me a link to the standard?
> 
> It may be better to handle this in the specific use case, before applying
> the seeder.
> 
> Best,
> 
> Peter
> 

Re: UIMA Ruta not capturing some XML markup with attributes?

Posted by Peter Klügl <pe...@averbis.com>.
Hi,

I extended the pattern to support dashes, but not the other quotes. This
can get arbitrarily complex (and slow) if every combination of Unicode
characters that look like quotes has to be supported. I still think that
this is not valid XML. Can you give me a link to the standard?

It may be better to handle this in the specific use case, before applying
the seeder.
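
One way to handle this before the seeder runs, as a minimal sketch: replace the typographic double quotes with the ASCII quote in the document text. The class and method names are hypothetical and not part of UIMA Ruta. The sketch assumes that curly quotes in the input are artifacts; applied to the whole text it also changes legitimate typographic quotes in the prose, so it may need to be restricted to the tags.

```java
class QuoteNormalizerSketch {

    // Replaces the typographic double quotes (left and right) with the ASCII
    // double quote. Each replacement is one char for one char, so the text
    // length stays the same and existing annotation offsets remain valid.
    static String normalizeQuotes(String text) {
        return text
            .replace('\u201C', '"')
            .replace('\u201D', '"');
    }

    public static void main(String[] args) {
        // One of the tags from this thread, with the closing quote damaged.
        String damaged = "<sec sec-type=\"methods\u201D>";
        String repaired = normalizeQuotes(damaged);
        System.out.println(repaired);
        System.out.println(damaged.length() == repaired.length());
    }
}
```

Because the replacement keeps the length unchanged, annotations created on the normalized text still line up with the original document.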

Best,

Peter

On 20.10.2015 at 19:22, Mario Gazzo wrote:
> I believe it should be extended, since I think a RUTA user would expect the MARKUP annotation to capture at least XML and HTML markup properly. The examples are from a PubMed Central XML file that follows the NISO JATS specification, so I assume it is properly formatted XML, without knowing all the details of the spec.
>
> We have managed to implement a crude workaround for now but let us know when an improved version becomes available.
>
> Cheers
> Mario
>


Re: UIMA Ruta not capturing some XML markup with attributes?

Posted by Mario Gazzo <ma...@gmail.com>.
I believe it should be extended, since I think a RUTA user would expect the MARKUP annotation to capture at least XML and HTML markup properly. The examples are from a PubMed Central XML file that follows the NISO JATS specification, so I assume it is properly formatted XML, without knowing all the details of the spec.

We have managed to implement a crude workaround for now but let us know when an improved version becomes available.

Cheers
Mario

> On 20 Oct 2015, at 17:56 , Peter Klügl <pe...@averbis.com> wrote:
> 
> Hi Mario,
> 
> Yes, and the different quotes also cause problems (are they valid?).
> 
> The MARKUP annotation is not created by JFlex like the other annotations,
> but by a postprocessing step using a regular expression. This expression
> does not cover these cases (markupPattern in DefaultSeeder.java).
> 
> Should we extend it?
> 
> Best,
> 
> Peter
> 


Re: UIMA Ruta not capturing some XML markup with attributes?

Posted by Peter Klügl <pe...@averbis.com>.
Hi Mario,

Yes, and the different quotes also cause problems (are they valid?).

The MARKUP annotation is not created by JFlex like the other annotations,
but by a postprocessing step using a regular expression. This expression
does not cover these cases (markupPattern in DefaultSeeder.java).
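
To illustrate how a regex-based markup step can miss such tags, here is a minimal sketch. Both patterns are hypothetical and written only for this example; they are not the actual markupPattern from DefaultSeeder.java, which is not shown in this thread. The narrow pattern allows only word characters in names, so it cannot match an attribute like sec-type; the extended one also accepts '-', '.' and ':' after the first character.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MarkupPatternSketch {

    // Hypothetical narrow pattern: element and attribute names are limited
    // to word characters, so a dash inside a name prevents a match.
    static final Pattern NARROW = Pattern.compile(
        "</?\\w+(\\s+\\w+\\s*=\\s*\"[^\"]*\")*\\s*/?>");

    // Hypothetical extended pattern: names may also contain '-', '.' and ':'
    // after the first character. Only ASCII double quotes delimit values.
    static final Pattern EXTENDED = Pattern.compile(
        "</?[\\w:][\\w:.-]*(\\s+[\\w:][\\w:.-]*\\s*=\\s*\"[^\"]*\")*\\s*/?>");

    // Collects every substring of text that the given pattern matches.
    static List<String> findMarkup(Pattern pattern, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "<sec sec-type=\"methods\"><p>Text</p></sec>";
        System.out.println("narrow:   " + findMarkup(NARROW, text));
        System.out.println("extended: " + findMarkup(EXTENDED, text));
    }
}
```

On this input the narrow pattern finds the p tags and the closing sec tag but skips the opening sec tag, while the extended pattern finds all four. Neither pattern accepts a typographic quote as an attribute delimiter.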

Should we extend it?

Best,

Peter

On 20.10.2015 at 17:26, Mario Gazzo wrote:
> Hi Peter,
>
> RUTA doesn’t seem to capture some XML markup with attributes. Here are some examples:
>
> <xref ref-type="bibr" rid="b35-ehp0113-000220”>
> <sec sec-type="methods”>
>
> The markup examples above are completely missing from the TokenSeed annotations. I wonder whether this is related to the dash in the attribute names, since other markup without dashes appears to be captured.
>
> Can you confirm that the dash could cause the problem?
>
> Cheers
> Mario