Posted to user@mahout.apache.org by Svetlomir Kasabov <sk...@smail.inf.fh-brs.de> on 2011/06/11 23:13:51 UTC

CsvRecordFactory usage recommendation

Hello,

I have a question:

I have seen that some of the Mahout examples use the class
CsvRecordFactory.java for parsing training and test examples. Would you
also recommend this class for actual production use? This would
mean that I should create a CSV file from my real data (in my case, it
is in a relational database) and then use the CSV file to
train my (online logistic regression) model. This approach would have
the advantage of keeping the 'extracted' data as CSV, which can be used
for quick re-training without DB access...

Or should I omit the intermediate step with the CSV file and train my
(online logistic regression) model directly with the data from the
relational database? Which of the two approaches would be better?

Thank you!

Svetlomir.

Re: CsvRecordFactory usage recommendation

Posted by Ted Dunning <te...@gmail.com>.

I am not very happy with the CsvRecordFactory design (my fault, I
know).  Some of the ideas are useful, but the final outcome was not
general enough.

My own tendency is to build custom vector encoding, if only for
performance.  Of course, if you are reading from a database,
performance is clearly not a priority.
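
[Editorial illustration, not part of the original message.] To make
"custom vector encoding" concrete, here is a minimal sketch of one
common approach, feature hashing, written in plain Java. It is not
Mahout's API and not necessarily what the poster had in mind: the class
HashedRowEncoder, its method names, and the use of a plain double[] as
the feature vector are all assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;

/**
 * Hypothetical hashed feature encoder (illustration only, not a Mahout class).
 * Each field of a record is hashed into a fixed-size vector, so no
 * dictionary of feature names has to be built before training.
 */
public class HashedRowEncoder {
    private final int numFeatures;

    public HashedRowEncoder(int numFeatures) {
        if (numFeatures <= 0) {
            throw new IllegalArgumentException("numFeatures must be positive");
        }
        this.numFeatures = numFeatures;
    }

    /** 32-bit FNV-1a hash of the UTF-8 bytes of the key. */
    static int hash(String key) {
        int h = 0x811c9dc5;
        for (byte b : key.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x01000193;
        }
        return h;
    }

    /** Maps a key to a slot in [0, numFeatures). */
    int index(String key) {
        return Math.floorMod(hash(key), numFeatures);
    }

    /** Allocates an all-zero vector of the configured size. */
    public double[] newVector() {
        return new double[numFeatures];
    }

    /** Categorical field: adds 1.0 at the slot for "name=value". */
    public void addCategorical(String name, String value, double[] vector) {
        vector[index(name + "=" + value)] += 1.0;
    }

    /** Continuous field: adds the numeric value at the slot for the field name. */
    public void addContinuous(String name, double value, double[] vector) {
        vector[index(name)] += value;
    }

    public static void main(String[] args) {
        HashedRowEncoder encoder = new HashedRowEncoder(32);
        double[] v = encoder.newVector();
        // Field names and values here are made up for the example.
        encoder.addCategorical("country", "DE", v);
        encoder.addContinuous("age", 42.0, v);
        System.out.println(java.util.Arrays.toString(v));
    }
}
```

The same encoder could be fed from a CSV line or from a database row
(for example, one call per column while iterating a JDBC ResultSet), so
the choice of data source stays independent of the encoding step. The
resulting array would still need to be wrapped in whatever vector type
the learner expects.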
