Posted to user@mahout.apache.org by Svetlomir Kasabov <sk...@smail.inf.fh-brs.de> on 2011/06/11 23:13:51 UTC
CsvRecordFactory usage recommendation
Hello,
I have a question:
I have seen that some of the Mahout examples use the class
CsvRecordFactory.java for parsing training and test examples. Would you
recommend this class for production use as well? This would
mean that I should create a CSV file from my real data (in my case, it
is in a relational database) and then use the CSV file to
train my (online logistic regression) model. This approach would have
the advantage of keeping the 'extracted' data as CSV, which can be used
for quick re-training without DB access...
Or should I omit the intermediate step with the CSV file and train my
(online logistic regression) model directly with the data from the
relational database? Which of the two approaches would be better?
Thank you!
Svetlomir.
Re: CsvRecordFactory usage recommendation
Posted by Ted Dunning <te...@gmail.com>.
I am not very happy with the CsvRecordFactory design (my fault, I
know). Some of the ideas are useful, but the final outcome was not
general enough.
My own tendency is to build a custom vector encoding, if only for
performance. Of course, if you are reading from a database,
performance is clearly not a priority.
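
For readers wondering what a "custom vector encoding" might look like, here is a
minimal sketch in plain Java. It is an illustration only, not the code Ted had in
mind and not Mahout's API: the class name HashedRecordEncoder, the "(intercept)"
key, and the Map-based record type are all assumptions made for this example.
It uses the hashing trick to turn one record into a fixed-size feature array,
so the same encoder can be fed from a CSV row or a database row.

```java
import java.util.Map;

/**
 * Hypothetical helper (not part of Mahout): encodes one record into a
 * fixed-size feature array using the hashing trick.
 */
final class HashedRecordEncoder {
    private final int numFeatures;

    HashedRecordEncoder(int numFeatures) {
        if (numFeatures <= 0) {
            throw new IllegalArgumentException("numFeatures must be positive");
        }
        this.numFeatures = numFeatures;
    }

    /** Maps a string key to a slot in [0, numFeatures). */
    private int slot(String key) {
        return Math.floorMod(key.hashCode(), numFeatures);
    }

    /**
     * Numeric fields add their value at the slot for the field name.
     * All other fields are treated as categorical and add 1.0 at the
     * slot for "name=value". Colliding slots simply accumulate.
     */
    double[] encode(Map<String, Object> record) {
        double[] features = new double[numFeatures];
        features[slot("(intercept)")] += 1.0; // constant bias term
        for (Map.Entry<String, Object> field : record.entrySet()) {
            Object value = field.getValue();
            if (value == null) {
                continue; // missing values contribute nothing
            }
            if (value instanceof Number) {
                features[slot(field.getKey())] += ((Number) value).doubleValue();
            } else {
                features[slot(field.getKey() + "=" + value)] += 1.0;
            }
        }
        return features;
    }
}
```

To use it, build a Map of field names to values for each row (from a CSV line or
a JDBC result row), call encode, and pass the resulting array to whatever vector
type the learner expects. The array size trades memory against hash collisions,
so it would need to be tuned for real data.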