Posted to users@opennlp.apache.org by Carlos Scheidecker <na...@gmail.com> on 2014/07/22 12:34:01 UTC

Noisy text detection via POS tagging in OpenNLP

Hello all,

I am tokenizing and tagging noisy text.

From that I can identify a few things. Looking at tokens which are not
words ("...................") and at their POS tags would help.

The idea is to identify pieces of text that are noisy and others that are
potentially not noisy.

So if I have a noisy sentence, I can say that the sentence is noisy, or if I
have a sequence of tokens, I can say that these tokens are noisy text and
disregard them.

I have a few questions regarding the POS tag and its dictionary for the
PosTagger to retrieve the proper POS tag for a token (word or not).

I have seen OpenNLP assign tags to punctuation and to other tokens it does
not understand.

I am trying to list all of those cases.

I started reading the code until I found the file used to generate the tag
dictionary. It is
under trunk/opennlp-tools/lang/en/postag/en-tagdict.xml

For instance, a dot token (.) is tagged as (.) and is shown as (. .) in the
parser output.

But according to the Penn Treebank, the dot tag marks sentence-final
punctuation, and the tokens that can be annotated with it are . ! and ?,
giving (. !), (. .) and (. ?)

A comma is tagged as comma, so (, ,)

A # is tagged as # and a $ as $. A tag like $ can also have many tokens,
so that US$ or A$ is tagged as $, and so on per the file.

You also have the SYM tag, which covers +, %, &, etc. But & for instance
can also be a CC.

You can also have more than one tag for a token in the XML, such as

<entry tags="CD LS NNP"> <token>2</token>

which means that a number such as 2 can be tagged CD, LS or NNP, for
instance.
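
To make the dictionary easier to inspect, something like the following could
load it into a token-to-tags map. This is only a rough sketch using the JDK's
DOM parser: it assumes every entry looks like the one quoted above (an entry
element with a tags attribute and token children), and the class name
TagDictSketch and the dictionary wrapper element in the example are my own
invention, not something taken from OpenNLP.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of OpenNLP.
final class TagDictSketch {

    // Maps each token to the set of tags listed for it in the XML.
    static Map<String, Set<String>> parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, Set<String>> tagsByToken = new LinkedHashMap<>();
            NodeList entries = doc.getElementsByTagName("entry");
            for (int i = 0; i < entries.getLength(); i++) {
                Element entry = (Element) entries.item(i);
                // The tags attribute is a whitespace-separated list, e.g. "CD LS NNP".
                String[] tags = entry.getAttribute("tags").trim().split("\\s+");
                NodeList tokens = entry.getElementsByTagName("token");
                for (int j = 0; j < tokens.getLength(); j++) {
                    String token = tokens.item(j).getTextContent().trim();
                    tagsByToken.computeIfAbsent(token, k -> new LinkedHashSet<>())
                               .addAll(Arrays.asList(tags));
                }
            }
            return tagsByToken;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse tag dictionary XML", e);
        }
    }

    public static void main(String[] args) {
        // Assumed wrapper element; only the entry itself comes from the file.
        String xml = "<dictionary><entry tags=\"CD LS NNP\"><token>2</token></entry></dictionary>";
        System.out.println(parse(xml)); // prints {2=[CD, LS, NNP]}
    }
}
```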

I have also seen the ":" tag used for tokens that are NOT a colon. A token
from noisy text such as "............................." is POS tagged as ":"
and shows in the parser as (: .............................)

Looking at the dictionary I can see that the tag ":" can be associated with
the following tokens:
(;)  (:) (-) (--) (...)

Of course, there is a tag dictionary for each language.

What I need to do is the following:

For each token/tag pair I need to determine whether the token is a word or
not, and whether the tag properly corresponds to the token, that is, whether
the token is valid for the given tag.
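
The tag/token check I have in mind could look roughly like this. It is a
minimal sketch, not anything OpenNLP provides: the class name TagTokenCheck is
made up, and the table only contains the punctuation tags and tokens I listed
above by hand, not the full dictionary. Tags that are not in the table are
simply accepted.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of OpenNLP.
final class TagTokenCheck {

    // Hand-copied from the examples in this message; an assumption, not the full dictionary.
    static final Map<String, Set<String>> ALLOWED = Map.of(
        ".", Set.of(".", "!", "?"),
        ",", Set.of(","),
        ":", Set.of(";", ":", "-", "--", "..."));

    // True when the tag is not one of the punctuation tags above,
    // or when the token is one of the tokens listed for that tag.
    static boolean isPlausible(String token, String tag) {
        Set<String> allowed = ALLOWED.get(tag);
        return allowed == null || allowed.contains(token);
    }

    public static void main(String[] args) {
        System.out.println(isPlausible("...", ":"));                           // prints true
        System.out.println(isPlausible(".............................", ":")); // prints false
    }
}
```

With this, (: ...) passes but (: .............................) is flagged as a
token that does not belong to its tag.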

Sometimes you get numbers; those are valid as well.

In order to determine a word I can check for alphabetic characters and
numeric ones as well, so that something like a sequence of dots is not a
valid word.
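
As a rough illustration of that check, assuming "word-like" simply means the
token contains at least one letter or digit (the class and method names are
hypothetical, and the rule itself is my own simplification):

```java
// Hypothetical helper, not part of OpenNLP.
final class TokenShape {

    // True when the token contains at least one letter or digit.
    // A run of dots or dashes therefore does not count as a word.
    static boolean isWordLike(String token) {
        if (token == null || token.isEmpty()) {
            return false;
        }
        return token.codePoints().anyMatch(Character::isLetterOrDigit);
    }

    public static void main(String[] args) {
        System.out.println(isWordLike("house"));               // prints true
        System.out.println(isWordLike("117"));                 // prints true
        System.out.println(isWordLike("..................."));  // prints false
    }
}
```

Note that under this rule a token such as US$ still counts as word-like, since
it contains letters.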

I can also check whether the token is a valid word and then check whether
its POS tag is consistent with that.

Now, is there something that could help as an extra parameter, such as the
probability of the POS tag returned, or is there a tag that marks an unknown
token? I couldn't find one in the code or in the dictionary.
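
To show what I would do with such a probability if it were available: the
sketch below assumes the tagger could hand back one probability per token as a
double array, which is exactly the part I have not confirmed. The class name
and the 0.5 threshold are arbitrary choices of mine.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; the per-token probabilities are an assumed input.
final class LowConfidence {

    // Returns the positions of tokens whose tag probability is below the threshold.
    static List<Integer> lowConfidenceIndexes(double[] probs, double threshold) {
        List<Integer> indexes = new ArrayList<>();
        for (int i = 0; i < probs.length; i++) {
            if (probs[i] < threshold) {
                indexes.add(i);
            }
        }
        return indexes;
    }

    public static void main(String[] args) {
        double[] probs = {0.98, 0.31, 0.87}; // made-up values for illustration
        System.out.println(lowConfidenceIndexes(probs, 0.5)); // prints [1]
    }
}
```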

Ideally, I want to determine whether a chunk of text or a sentence is noisy
or not.
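
At the sentence level, one simple option would be to count how many tokens
look like noise and compare that share against a threshold. The sketch below
is only an assumption about how this could work: it treats a token as noise
when it has no letters or digits and is longer than three characters (so
ordinary punctuation such as "." or "..." is not counted), and both that rule
and the 0.5 threshold are guesses that would need tuning. The names are
hypothetical.

```java
import java.util.List;

// Hypothetical helper, not part of OpenNLP.
final class SentenceNoise {

    // A token counts as noise when it has no letter or digit and is longer
    // than 3 characters, so "." "--" and "..." are left alone.
    static boolean isNoiseToken(String token) {
        return token.length() > 3
            && token.codePoints().noneMatch(Character::isLetterOrDigit);
    }

    // Share of noise tokens in the sentence, between 0.0 and 1.0.
    static double noiseRatio(List<String> tokens) {
        if (tokens.isEmpty()) {
            return 0.0;
        }
        long noisy = tokens.stream().filter(SentenceNoise::isNoiseToken).count();
        return (double) noisy / tokens.size();
    }

    // The sentence is called noisy when the share reaches the threshold.
    static boolean isNoisy(List<String> tokens, double threshold) {
        return noiseRatio(tokens) >= threshold;
    }

    public static void main(String[] args) {
        System.out.println(isNoisy(List.of("The", "cat", "sat", "."), 0.5));  // prints false
        System.out.println(isNoisy(List.of("........", "-----", "ok"), 0.5)); // prints true
    }
}
```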

Thanks in advance.