Posted to dev@tika.apache.org by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov> on 2012/02/07 16:13:40 UTC

TF-IDF parser and ContentHandler?

Hey Guys,

I've been toying with the idea of writing a simple Tika Parser Decorator
that extends the Text Parser but also generates TF-IDF metadata: perhaps a
summarized top-word count and a term/frequency map. I was also thinking of
then writing a similar ContentHandler so it could be piped together with
the other handlers (e.g., LinkContentHandler).
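
To make the handler half of the idea concrete, here is a rough sketch built
only on plain SAX (org.xml.sax.helpers.DefaultHandler), not on any existing
Tika class. The class name TermFrequencyHandler, its accessors, and the
tokenization rule are all hypothetical, and the IDF step assumes that the
document counts come from some external corpus index that is not shown:

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: collects character events and counts terms.
// Not an existing Tika class.
class TermFrequencyHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private final Map<String, Integer> counts = new HashMap<>();
    private int totalTerms = 0;

    @Override
    public void characters(char[] ch, int start, int length) {
        // Buffer the text; a word may be split across several events.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Assumption: element boundaries separate words.
        text.append(' ');
    }

    @Override
    public void endDocument() {
        // Naive tokenization: lowercase, split on non-letter/non-digit runs.
        String[] tokens = text.toString().toLowerCase(Locale.ROOT)
                .split("[^\\p{L}\\p{Nd}]+");
        for (String token : tokens) {
            if (token.isEmpty()) {
                continue;
            }
            counts.merge(token, 1, Integer::sum);
            totalTerms++;
        }
        text.setLength(0);
    }

    // Raw count per term for the document just handled.
    Map<String, Integer> getTermCounts() {
        return counts;
    }

    // Term frequency: occurrences of the term divided by total terms.
    double termFrequency(String term) {
        if (totalTerms == 0) {
            return 0.0;
        }
        return counts.getOrDefault(term, 0) / (double) totalTerms;
    }

    // One common TF-IDF variant: tf * ln(N / df). The corpus size and
    // document frequency must be supplied by the caller.
    static double tfIdf(double tf, int numDocs, int docFreq) {
        if (numDocs <= 0 || docFreq <= 0) {
            return 0.0;
        }
        return tf * Math.log((double) numDocs / docFreq);
    }
}
```

Because it is just a SAX ContentHandler, something like this could in
principle be passed to a parser or chained with other handlers; how the
counts would be written into Tika metadata is left open here.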

Would others find this useful?

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattmann@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++