Posted to user@tika.apache.org by Ken Krugler <kk...@transpac.com> on 2013/07/11 22:50:52 UTC

Blog post on extracting text features using Tika

Hi all,

I just posted part 1 of a series on extracting text features for machine learning…

http://www.scaleunlimited.com/2013/07/10/text-feature-selection-for-machine-learning-part-1/

It uses a modified version of the Tika RFC822 parser to process mbox files.

I decided it was time to try to share some of what I'd learned over the years in processing text for classification, clustering and other related ML tasks.

It undoubtedly has some things that are unclear or even incorrect, so please comment :)

Thanks,

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr






Re: Blog post on extracting text features using Tika

Posted by "Mattmann, Chris A (398J)" <ch...@jpl.nasa.gov>.
Thank you Ken, this is great!

I've created a link to your blog post on the Tika wiki:

https://wiki.apache.org/tika/TikaResources

Thank you again!

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++





