Posted to dev@mahout.apache.org by "Wolfgang Buchner (JIRA)" <ji...@apache.org> on 2014/07/25 13:29:38 UTC

[jira] [Created] (MAHOUT-1598) extend seq2sparse to handle multiple text blocks of same document

Wolfgang Buchner created MAHOUT-1598:
----------------------------------------

             Summary: extend seq2sparse to handle multiple text blocks of same document
                 Key: MAHOUT-1598
                 URL: https://issues.apache.org/jira/browse/MAHOUT-1598
             Project: Mahout
          Issue Type: Improvement
    Affects Versions: 0.9, 1.0
            Reporter: Wolfgang Buchner


Currently seq2sparse, or more specifically org.apache.mahout.vectorizer.DictionaryVectorizer, requires exactly one text block per document as input.

I stumbled on this because I have a use case where one document represents a ticket, which can contain several text blocks in different languages.

My idea is that org.apache.mahout.vectorizer.DocumentProcessor should tokenize each text block separately, so that I can use language-specific features of our Lucene Analyzer. A rough sketch of what I have in mind follows.
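
To make the idea concrete, here is a rough, standalone sketch of a mapper that tokenizes one text block per record with a language-specific Lucene Analyzer and emits the tokens under the shared document id, so that all blocks of the same ticket reach the same reducer. The class name, the "language<TAB>text" record layout and pickAnalyzer() are assumptions for illustration only, not existing Mahout or seq2sparse API:

import java.io.IOException;
import java.io.StringReader;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.mahout.common.StringTuple;

// Tokenizes one text block per input record with an analyzer chosen for that
// block's language and emits the tokens under the shared document id.
public class PerBlockTokenizerMapper extends Mapper<Text, Text, Text, StringTuple> {

  private Analyzer defaultAnalyzer; // would be created in setup(), omitted here

  @Override
  protected void map(Text docId, Text block, Context context)
      throws IOException, InterruptedException {
    // Assumed record layout: value = "<language>\t<text of one block>"
    String[] parts = block.toString().split("\t", 2);
    Analyzer analyzer = pickAnalyzer(parts[0]);
    StringTuple tokens = new StringTuple();
    try (TokenStream stream = analyzer.tokenStream("text", new StringReader(parts[1]))) {
      CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
      stream.reset();
      while (stream.incrementToken()) {
        tokens.add(term.toString());
      }
      stream.end();
    }
    context.write(docId, tokens);
  }

  private Analyzer pickAnalyzer(String language) {
    // Hypothetical: map the language code to a language-specific Lucene
    // Analyzer (e.g. a German analyzer for "de"); fall back to a default.
    return defaultAnalyzer;
  }
}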

Unfortunately the current implementation doesn't support this, but it can be made possible with only minor changes.

The only thing that has to change is org.apache.mahout.vectorizer.term.TFPartialVectorReducer: it needs to handle all values of the Iterable, not just the first one. A sketch of the change follows.
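
Roughly, the reduce() could look like the sketch below. The dictionary and dimension fields stand in for whatever the real TFPartialVectorReducer prepares in setup(); this is not the verbatim class, just the shape of the change (loop over all blocks instead of taking only the first):

import java.io.IOException;
import java.util.Map;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.mahout.common.StringTuple;
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

// Builds one term-frequency vector per document from ALL of its tokenized
// text blocks instead of only the first StringTuple in the Iterable.
public class MultiBlockTFPartialVectorReducer
    extends Reducer<Text, StringTuple, Text, VectorWritable> {

  private Map<String, Integer> dictionary; // term -> id, loaded in setup() (omitted)
  private int dimension;                   // dictionary size, also set in setup()

  @Override
  protected void reduce(Text key, Iterable<StringTuple> values, Context context)
      throws IOException, InterruptedException {
    Vector vector = new RandomAccessSparseVector(dimension, 10);
    for (StringTuple block : values) {          // the key change: loop over all blocks
      for (String term : block.getEntries()) {
        Integer termId = dictionary.get(term);  // terms pruned from the dictionary are skipped
        if (termId != null) {
          vector.setQuick(termId, vector.getQuick(termId) + 1);
        }
      }
    }
    if (vector.getNumNondefaultElements() > 0) {
      context.write(key, new VectorWritable(new NamedVector(vector, key.toString())));
    }
  }
}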

An alternative would be to change this Reducer into a Mapper; I don't understand why it was implemented as a reducer in the first place. Is there any benefit to that?

I will provide a PR via GitHub.

Please have a look at this and tell me if any of my assumptions are wrong.



--
This message was sent by Atlassian JIRA
(v6.2#6252)