Posted to dev@poi.apache.org by "Allison, Timothy B." <ta...@mitre.org> on 2017/02/17 16:58:41 UTC

tika-eval

All,

  I finally got around to adding tika-eval[1] to Apache Tika.  If you have any interest in comparing the output of different tools/versions/parameters on text extraction, give it a try.  You don't need to use Tika or format the output in a specific format; plain UTF-8 text will work.

  Tilman, I generalized your common word count methodology.  The code now runs language id on the text and then counts the common words for that language.
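
  (A minimal sketch of the idea described above, not tika-eval's actual code. The tiny word lists, the function names, and the overlap-based language guess are all assumptions made for illustration; a real setup would use a proper language identifier and much larger per-language common-word lists.)

```python
import re

# Hypothetical, very small common-word lists (assumption for illustration only).
COMMON_WORDS = {
    "en": {"the", "and", "of", "to", "in", "is", "that", "for", "it", "with"},
    "de": {"der", "die", "und", "das", "ist", "nicht", "ein", "zu", "mit", "von"},
}

# Runs of Unicode letters (no digits, no underscores).
TOKEN_RE = re.compile(r"[^\W\d_]+")


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return [t.lower() for t in TOKEN_RE.findall(text)]


def detect_language(tokens):
    """Toy stand-in for language id: pick the language whose list matches most tokens."""
    hits = {
        lang: sum(1 for t in tokens if t in words)
        for lang, words in COMMON_WORDS.items()
    }
    # On a tie (including no hits at all), the first language in the dict wins.
    return max(hits, key=hits.get)


def common_word_count(text):
    """Return (language, number of common-word tokens, total number of tokens)."""
    tokens = tokenize(text)
    lang = detect_language(tokens)
    count = sum(1 for t in tokens if t in COMMON_WORDS[lang])
    return lang, count, len(tokens)


if __name__ == "__main__":
    # Two hypothetical extractions of the same sentence from different tools.
    clean = "The quick brown fox is in the garden with the dog"
    garbled = "Th e qu ick br own f ox i s i n th e gar den wi th th e d og"
    print(common_word_count(clean))
    print(common_word_count(garbled))
```

  The intuition: an extraction that breaks words apart or mangles the encoding tends to yield fewer common words for the detected language, so comparing the counts for two extractions of the same file gives a rough signal of which one is better.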

  Lots more work remains.  Thank you, all, for contributing to the methodologies!

         Cheers,

                      Tim


[1] https://wiki.apache.org/tika/TikaEval

RE: tika-eval

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Ha.  I hadn't realized the video was available until this post.  Thank you!

> And here is the talk about it Tim gave at ApacheCon
>
> https://youtu.be/vRPTPMwI53k?list=PLbzoR-pLrL6pLDCyPxByWQwYTL-JrF5Rp
>
> I've enjoyed it (the video). 

So did I!

Tilman



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org For additional commands, e-mail: dev-help@pdfbox.apache.org




Re: tika-eval

Posted by Tilman Hausherr <TH...@t-online.de>.
On 21.05.2017 at 18:20, Andreas Lehmkuehler wrote:
> On 17.02.2017 at 17:58, Allison, Timothy B. wrote:
>> All,
>>
>>    I finally got around to adding tika-eval[1] to Apache Tika. If you 
>> have any interest in comparing the output of different 
>> tools/versions/parameters on text extraction, give it a try. You 
>> don't need to use Tika or format the output in a specific format; 
>> plain UTF-8 text will work.
>>
>>    Tilman, I generalized your common word count methodology. The code 
>> now runs language id on the text and then counts the common words for 
>> that language.
>>
>>    Lots more work remains.  Thank you, all, for contributing to the 
>> methodologies!
> And here is the talk about it Tim gave at ApacheCon
>
> https://youtu.be/vRPTPMwI53k?list=PLbzoR-pLrL6pLDCyPxByWQwYTL-JrF5Rp
>
> I've enjoyed it (the video). 

So did I!

Tilman





Re: tika-eval

Posted by Andreas Lehmkuehler <an...@lehmi.de>.
On 17.02.2017 at 17:58, Allison, Timothy B. wrote:
> All,
> 
>    I finally got around to adding tika-eval[1] to Apache Tika.  If you have any interest in comparing the output of different tools/versions/parameters on text extraction, give it a try.  You don't need to use Tika or format the output in a specific format; plain UTF-8 text will work.
> 
>    Tilman, I generalized your common word count methodology.  The code now runs language id on the text and then counts the common words for that language.
> 
>    Lots more work remains.  Thank you, all, for contributing to the methodologies!
And here is the talk about it Tim gave at ApacheCon

https://youtu.be/vRPTPMwI53k?list=PLbzoR-pLrL6pLDCyPxByWQwYTL-JrF5Rp

I've enjoyed it (the video).

Andreas
> 
>           Cheers,
> 
>                        Tim
> 
> 
> [1] https://wiki.apache.org/tika/TikaEval
> 

