Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2017/02/17 16:24:41 UTC

[jira] [Resolved] (TIKA-2267) Add common tokens files for tika-eval

     [ https://issues.apache.org/jira/browse/TIKA-2267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison resolved TIKA-2267.
-------------------------------
       Resolution: Fixed
    Fix Version/s: 1.15
                   2.0

I took the top 20k tokens by document frequency from the Wikipedia dumps for each language.
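
The counting itself is a per-token document-frequency tally. A minimal sketch for reference (illustrative, not the actual tika-eval code; the real run tokenized with the Lucene chain shown at the bottom of this comment, and addDocument() here takes the deduplicated token set of one article):
{noformat}
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

public class DocFreqCounter {
    // token -> number of documents containing the token at least once
    private final Map<String, Integer> docFreq = new HashMap<>();

    // tokens: the deduplicated token set of a single article
    public void addDocument(Set<String> tokens) {
        for (String t : tokens) {
            docFreq.merge(t, 1, Integer::sum);
        }
    }

    // top-k tokens in descending document frequency
    public List<Map.Entry<String, Integer>> top(int k) {
        return docFreq.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .limit(k)
                .collect(Collectors.toList());
    }
}
{noformat}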

I ignored Wikipedia pages whose Optimaize language id conflicted with the wiki's language (e.g. if I was processing ptwiki and Optimaize identified a page as "es", I ignored that page).
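
The language gate amounts to something like the following, assuming the Optimaize language-detector 0.6 API (the detector classes and methods are from that library; the LangGate wrapper itself is illustrative):
{noformat}
import java.io.IOException;

import com.google.common.base.Optional;
import com.optimaize.langdetect.LanguageDetector;
import com.optimaize.langdetect.LanguageDetectorBuilder;
import com.optimaize.langdetect.i18n.LdLocale;
import com.optimaize.langdetect.ngram.NgramExtractors;
import com.optimaize.langdetect.profiles.LanguageProfileReader;
import com.optimaize.langdetect.text.CommonTextObjectFactories;
import com.optimaize.langdetect.text.TextObjectFactory;

public class LangGate {
    private final LanguageDetector detector;
    private final TextObjectFactory textFactory =
            CommonTextObjectFactories.forDetectingOnLargeText();

    public LangGate() throws IOException {
        detector = LanguageDetectorBuilder.create(NgramExtractors.standard())
                .withProfiles(new LanguageProfileReader().readAllBuiltIn())
                .build();
    }

    // keep a page only if detection agrees with the wiki's language code,
    // e.g. expectedLang = "pt" when processing ptwiki
    public boolean accept(String pageText, String expectedLang) {
        Optional<LdLocale> detected = detector.detect(textFactory.forText(pageText));
        return detected.isPresent()
                && detected.get().getLanguage().equals(expectedLang);
    }
}
{noformat}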

I used some heuristics to try to ignore pages that were link/reference articles or other non-content articles.

I attempted to randomly sample 500k articles.  For English, I pulled only the first 10 bzip2 files; for the other languages, I pulled all of them.
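
As an aside, reservoir sampling is one standard way to draw a fixed-size uniform sample from an article stream of unknown length. A minimal sketch for reference (illustrative, not necessarily the sampling used here):
{noformat}
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class Reservoir<T> {
    private final List<T> sample = new ArrayList<>();
    private final int capacity;   // e.g. 500_000 articles
    private final Random random = new Random();
    private long seen = 0;

    public Reservoir(int capacity) {
        this.capacity = capacity;
    }

    // Algorithm R: after n items have been offered, each one
    // is in the sample with probability capacity / n
    public void offer(T item) {
        seen++;
        if (sample.size() < capacity) {
            sample.add(item);
        } else {
            long j = (long) (random.nextDouble() * seen);
            if (j < capacity) {
                sample.set((int) j, item);
            }
        }
    }

    public List<T> sample() {
        return sample;
    }
}
{noformat}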

I removed common HTML markup tokens (e.g. body, html, script; full list below).  If we allowed those and HTML extraction failed, the resulting raw markup would incorrectly inflate the "common tokens" count.

I removed terms that were fewer than 4 characters long, except for CJK.

I added ___url___ and ___email___ so that those tokens would exist in every language model.

If we change the underlying Lucene analysis chain, we'll have to reprocess the wiki dumps.

The files are sorted by descending document frequency.  It is clear that the wiki markup stripper wasn't perfect (words for links/references show up frequently), but this seems like a reasonable start.

For posterity, I used this analysis chain ("oala" is shorthand for org.apache.lucene.analysis):

{noformat}
      "tokenizer": {
        "factory": "oala.standard.UAX29URLEmailTokenizerFactory",
        "params": {}
      },
      "tokenfilters": [
        {
          "factory": "oala.icu.ICUFoldingFilterFactory",
          "params": {}
        },
        {
          "factory": "org.apache.tika.eval.tokens.AlphaIdeographFilterFactory",
          "params": {}
        },
        {
          "factory": "oala.pattern.PatternReplaceFilterFactory",
          "params": {
            "pattern": "^[\\w+\\.]{1,30}@(?:\\w+\\.){1,10}\\w+$",
            "replacement": "___email___",
            "replace": "all"
          }
        },
        {
          "factory": "oala.pattern.PatternReplaceFilterFactory",
          "params": {
            "pattern": "^(?:(?:ftp|https?):\\/\\/)?(?:\\w+\\.){1,10}\\w+$",
            "replacement": "___url___",
            "replace": "all"
          }
        },
        {
          "factory": "oala.cjk.CJKBigramFilterFactory",
          "params": {
            "outputUnigrams": "false"
          }
        },
        {
          "factory": "org.apache.tika.eval.tokens.CJKBigramAwareLengthFilterFactory",
          "params": {
            "min": 4,
            "max": 20
          }
        }
      ]
    }

{noformat}
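
For anyone who wants to rebuild the chain programmatically, Lucene's CustomAnalyzer can wire up the same factories. A sketch, assuming Lucene 6.x and the tika-eval factories on the classpath (the Tika factories are assumed to follow the standard factory contract; the regex escaping is simplified from the JSON above but matches the same strings):
{noformat}
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.cjk.CJKBigramFilterFactory;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.apache.lucene.analysis.icu.ICUFoldingFilterFactory;
import org.apache.lucene.analysis.pattern.PatternReplaceFilterFactory;
import org.apache.lucene.analysis.standard.UAX29URLEmailTokenizerFactory;
import org.apache.tika.eval.tokens.AlphaIdeographFilterFactory;
import org.apache.tika.eval.tokens.CJKBigramAwareLengthFilterFactory;

public class CommonTokensAnalyzer {
    public static Analyzer build() throws IOException {
        return CustomAnalyzer.builder()
                .withTokenizer(UAX29URLEmailTokenizerFactory.class)
                .addTokenFilter(ICUFoldingFilterFactory.class)
                .addTokenFilter(AlphaIdeographFilterFactory.class)
                // normalize email addresses to a single token
                .addTokenFilter(PatternReplaceFilterFactory.class,
                        "pattern", "^[\\w+\\.]{1,30}@(?:\\w+\\.){1,10}\\w+$",
                        "replacement", "___email___",
                        "replace", "all")
                // normalize URLs to a single token
                .addTokenFilter(PatternReplaceFilterFactory.class,
                        "pattern", "^(?:(?:ftp|https?)://)?(?:\\w+\\.){1,10}\\w+$",
                        "replacement", "___url___",
                        "replace", "all")
                // emit CJK bigrams only, no unigrams
                .addTokenFilter(CJKBigramFilterFactory.class,
                        "outputUnigrams", "false")
                // drop tokens outside [4, 20] chars, CJK bigrams exempted
                .addTokenFilter(CJKBigramAwareLengthFilterFactory.class,
                        "min", "4", "max", "20")
                .build();
    }
}
{noformat}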

Full list of markup words removed:
{noformat}
span
table
href
head
title
body
html
tagname
lang
style
script
strong
blockquote
form
iframe
section
colspan
{noformat}
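
Dropping those from the ranked list is a simple set-membership check, e.g.:
{noformat}
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class MarkupTokens {
    // the markup tokens listed above
    private static final Set<String> MARKUP = new HashSet<>(Arrays.asList(
            "span", "table", "href", "head", "title", "body", "html",
            "tagname", "lang", "style", "script", "strong", "blockquote",
            "form", "iframe", "section", "colspan"));

    public static boolean isMarkup(String token) {
        return MARKUP.contains(token);
    }
}
{noformat}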

> Add common tokens files for tika-eval
> -------------------------------------
>
>                 Key: TIKA-2267
>                 URL: https://issues.apache.org/jira/browse/TIKA-2267
>             Project: Tika
>          Issue Type: Improvement
>          Components: tika-eval
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>            Priority: Minor
>             Fix For: 2.0, 1.15
>
>
> We should add some common tokens files for popular languages for tika-eval so that users don't have to generate their own.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)