Posted to dev@mahout.apache.org by Xiaobo Gu <gu...@gmail.com> on 2011/08/01 15:56:56 UTC

Re: What about a universal input data handling mechanism for Mahout?

When using SequenceFilesFromCsvFilter, what is the answer to the first
question: should we retain the headers?

Regards

On Thu, Jul 28, 2011 at 11:22 PM, Xiaobo Gu <gu...@gmail.com> wrote:
> I don't know how CSVVectorIterator is used, but for
> SequenceFilesFromCsvFilter I have a few questions:
> 1. Should the CSV files be without headers?
> 2. I think the protected void process(FileStatus fst, Path current)
> throws IOException function of SequenceFilesFromCsvFilter is the
> point we can revise to make a CSV-to-SequenceFile converter. The idea
> is as follows:
>  a. We must create a SequenceFile<Text, VectorWritable> file object
> and pass its writer to SequenceFilesFromCsvFilter as a constructor
> parameter; the corresponding SequenceFile is our destination.
>  b. For each line, extract the label value and an encoder vector, then
> call writer.append(label, new VectorWritable(vector)). Which column is
> the label and which columns contribute to the vector can be passed
> through command line arguments.
>
>
> Regards,
>
> Xiaobo Gu
>
> On Tue, Jul 26, 2011 at 5:50 PM, Grant Ingersoll <gs...@apache.org> wrote:
>> We do have:
>> SequenceFilesFromCsvFilter, although it is somewhat basic
>> CSVVectorIterator, which takes a CSV file and produces a dense vector
>>
>>
>> On Jul 26, 2011, at 3:58 AM, Ted Dunning wrote:
>>
>>> The critical design step here is to decide how to express the schema of the
>>> CSV file.  There is a beginning of this in the CsvRecordFactory, but I was
>>> never happy with the (lack of) speed.
>>>
>>> On Tue, Jul 26, 2011 at 12:10 AM, Sebastian Schelter <ss...@apache.org> wrote:
>>>
>>>>> 2. SequenceFile is not a file format that command line users can
>>>>> prepare; is there a tool for converting CSV files into SequenceFiles?
>>>>>
>>>>
>>>> I don't think we have that yet, but it would be very useful imho.
>>>>
>>
>> --------------------------
>> Grant Ingersoll
>>
>>
>>
>>
>

Re: What about a universal input data handling mechanism for Mahout?

Posted by Ted Dunning <te...@gmail.com>.
Yes.  CSV files should have header lines.

And there needs to be something else that specifies the field type.
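
To make the two points above concrete, here is a minimal sketch of how a
header line plus a separate field-type specification could drive the
conversion discussed in this thread. Everything in it is an assumption
made for illustration: the class CsvSchemaSketch, the FieldType enum and
the LabeledRow holder are hypothetical names, not part of Mahout, and the
parsing is a naive split on commas with no support for quoted fields.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch, not Mahout code: header line + declared field types. */
final class CsvSchemaSketch {

    /** Assumed field types; a real schema would likely need more (text, categorical, ...). */
    public enum FieldType { LABEL, NUMERIC, IGNORE }

    /** One converted row: the label value plus the numeric feature values. */
    public static final class LabeledRow {
        public final String label;
        public final double[] features;

        LabeledRow(String label, double[] features) {
            this.label = label;
            this.features = features;
        }
    }

    /**
     * Converts CSV lines into labeled rows. The first line must be the header;
     * every header name must have a declared type, and exactly one column
     * must be declared as LABEL.
     */
    public static List<LabeledRow> convert(List<String> lines, Map<String, FieldType> types) {
        if (lines.isEmpty()) {
            throw new IllegalArgumentException("missing header line");
        }
        String[] header = lines.get(0).split(",", -1);
        FieldType[] columnTypes = new FieldType[header.length];
        int labelIndex = -1;
        int featureCount = 0;
        for (int i = 0; i < header.length; i++) {
            String name = header[i].trim();
            FieldType type = types.get(name);
            if (type == null) {
                throw new IllegalArgumentException("no type declared for column: " + name);
            }
            columnTypes[i] = type;
            if (type == FieldType.LABEL) {
                if (labelIndex >= 0) {
                    throw new IllegalArgumentException("more than one label column");
                }
                labelIndex = i;
            } else if (type == FieldType.NUMERIC) {
                featureCount++;
            }
        }
        if (labelIndex < 0) {
            throw new IllegalArgumentException("no label column declared");
        }

        List<LabeledRow> rows = new ArrayList<>();
        for (String line : lines.subList(1, lines.size())) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            String[] cells = line.split(",", -1);
            if (cells.length != header.length) {
                throw new IllegalArgumentException("wrong number of fields: " + line);
            }
            double[] features = new double[featureCount];
            int next = 0;
            for (int i = 0; i < cells.length; i++) {
                if (columnTypes[i] == FieldType.NUMERIC) {
                    features[next++] = Double.parseDouble(cells[i].trim());
                }
            }
            rows.add(new LabeledRow(cells[labelIndex].trim(), features));
        }
        return rows;
    }
}
```

With rows in this shape, the writing step proposed earlier in the thread
would presumably become one call per row, along the lines of
writer.append(new Text(row.label), new VectorWritable(new DenseVector(row.features))),
assuming a Hadoop SequenceFile.Writer opened for Text keys and
VectorWritable values. Keeping the type specification separate from the
header means the CSV file itself stays readable by other tools.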
