Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2019/01/28 16:49:00 UTC

[jira] [Commented] (TIKA-2822) Update common tokens files for tika-eval

    [ https://issues.apache.org/jira/browse/TIKA-2822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16754132#comment-16754132 ] 

Tim Allison commented on TIKA-2822:
-----------------------------------

Last time I did this, IIRC, there were separate {{zh-tw}} and {{zh-cn}} wiki dumps.  Wikimedia has since unified these into a single {{zh}} dump and instead runs mapping code at presentation time.  The character/term/word mappings are available here:  https://phab.wmfusercontent.org/file/data/ycg62tzo5qyv5txmiamh/PHID-FILE-66gf4k72tgxhksd5j36x/ZhConversion.php
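A rough sketch of how a conversion table of this kind is applied (this is illustrative, not MediaWiki's or Tika's actual code; the sample entries and the {{convert}} helper are assumptions, and the real ZhConversion.php tables are far larger):

```python
# Hypothetical excerpt of a Traditional -> Simplified mapping table,
# in the spirit of ZhConversion.php. Entries here are illustrative only.
ZH2HANS = {
    "乾燥": "干燥",  # multi-character term mapping
    "乾": "干",      # single-character fallback
    "發": "发",
}

def convert(text: str, table: dict) -> str:
    """Apply the mapping greedily, longest key first, so that
    multi-character term mappings win over single-character ones."""
    out = []
    i = 0
    keys = sorted(table, key=len, reverse=True)  # longest match first
    while i < len(text):
        for k in keys:
            if text.startswith(k, i):
                out.append(table[k])
                i += len(k)
                break
        else:
            out.append(text[i])  # no mapping: pass the character through
            i += 1
    return "".join(out)

print(convert("乾燥的乾", ZH2HANS))  # 干燥的干
```

The longest-match-first ordering matters: mapping "乾" to "干" character by character would mangle terms like "乾燥" whose correct simplified form is defined as a unit.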


> Update common tokens files for tika-eval
> ----------------------------------------
>
>                 Key: TIKA-2822
>                 URL: https://issues.apache.org/jira/browse/TIKA-2822
>             Project: Tika
>          Issue Type: Improvement
>          Components: tika-eval
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>            Priority: Trivial
>
> We initially created the common tokens files (top 20k tokens by document frequency) from Wikipedia dumps with Lucene 6.x.  We should rerun that code with an updated Lucene on the off chance that there are slight changes in tokenization.
> While doing this work, I found a trivial bug in filtering common tokens that we should fix as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)