Posted to user@uima.apache.org by Chengmin Ding <ch...@gmail.com> on 2007/08/24 18:34:34 UTC

A question about HTML reader component

Hi, Folks,

We have been using UIMA to mine data points from documents in plain
text format, and our AEs worked fine. Recently, however, those documents have
been delivered in HTML format (i.e. with a bunch of HTML tags mixed in), and
our AEs can no longer mine the data correctly. Is there an HTML Collection
Reader component or library already available, so that we do not need to
reinvent the wheel?

We tried an HTMLCommon collection reader, but it looks like it cannot parse
tables correctly. It often adds many blank lines between table cells/rows,
which confuses our AEs.

Any help is highly appreciated.

Thanks

-Chengmin

Re: A question about HTML reader component

Posted by Chengmin Ding <ch...@gmail.com>.
Thank you, Pablo, for the prompt reply. I will check out the w3
community project and possibly participate in it. I think this HTML
detagging function is a very useful one and deserves more participation.

-Chengmin

Re: A question about HTML reader component

Posted by Pablo Duboue <pa...@gmail.com>.
Hi Chengmin,

The blank lines you refer to are easy to remove and are there by
design. The detagger has a list of "non-paragraph separating tags";
any other tag is supposed to delimit chunks of text, hence the added
blank lines. But there is no reason that behavior can't be
parameterized.
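
To make the design above concrete, here is a minimal Python sketch. It is not the detagger's actual code, which is not shown in this thread: the class name, the default tag list, and the `inline_tags` and `separator` parameters are all hypothetical. It only illustrates the idea that every tag outside a "non-paragraph separating" list ends the current chunk of text, and that the separator written between chunks can be made a parameter.

```python
from html.parser import HTMLParser

# Hypothetical default list; the real detagger's tag list is not given here.
DEFAULT_INLINE_TAGS = frozenset({"a", "b", "i", "em", "strong", "span"})


class Detagger(HTMLParser):
    """Strip tags; any tag not listed as inline ends the current text chunk."""

    def __init__(self, inline_tags=DEFAULT_INLINE_TAGS, separator="\n\n"):
        super().__init__()
        self.inline_tags = frozenset(inline_tags)
        self.separator = separator
        self.chunks = [[]]  # each chunk is a list of raw text pieces

    def _break_if_needed(self, tag):
        # Start a new chunk only if the current one already holds text,
        # so nested tags such as <tr><td> do not pile up empty chunks.
        if tag not in self.inline_tags and self.chunks[-1]:
            self.chunks.append([])

    def handle_starttag(self, tag, attrs):
        self._break_if_needed(tag)

    def handle_endtag(self, tag):
        self._break_if_needed(tag)

    def handle_data(self, data):
        self.chunks[-1].append(data)

    def text(self):
        # Normalize whitespace inside each chunk and drop empty chunks.
        cleaned = (" ".join("".join(pieces).split()) for pieces in self.chunks)
        return self.separator.join(chunk for chunk in cleaned if chunk)


def detag(html, **options):
    """Return the text of `html` with chunks joined by the chosen separator."""
    parser = Detagger(**options)
    parser.feed(html)
    parser.close()
    return parser.text()


table = "<table><tr><td>Name</td><td>Age</td></tr></table>"
print(repr(detag(table)))                  # blank line between cells
print(repr(detag(table, separator="\n")))  # one cell per line, no blank lines
```

With the default separator each table cell ends up in its own paragraph, which is the blank-line behavior described above; passing `separator="\n"` keeps one cell per line without the blank lines.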

If you want to join the (IBM internal) project, please stop by the
Community Source w3 site.

Best regards,

Pablo
