Posted to user@mahout.apache.org by Lee S <sl...@gmail.com> on 2014/11/04 09:28:37 UTC

Why do most algorithms use sequencefile as input and output?

Hi all,
  I'm wondering why the input and output of most algorithms, such as k-means
and naive Bayes, are all SequenceFiles. An extra conversion step is needed
before the algorithm can run, and I think that step is time-consuming because
it is also a MapReduce job.
  Is the reason to handle small files and to compress data to save disk space?

Re: Why do most algorithms use sequencefile as input and output?

Posted by Serega Sheypak <se...@gmail.com>.
Also, it's the easiest way to serialize and deserialize (SerDe) complex
objects and to get splitting plus block compression, since SequenceFiles are
splittable and support compression out of the box. See the code: it passes
some really complex structures between jobs.
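
To make "splittable plus block compression" concrete, here is a minimal sketch in Python. It is not Hadoop's SequenceFile format or API; the marker value, the record layout, and the names write_blocks and read_from are all assumptions made up for this illustration. The idea it shows: records are grouped into blocks, each block is compressed as a unit, and a sync marker in front of every block lets a reader that starts at an arbitrary byte offset find the next block boundary.

```python
import struct
import zlib

# Hypothetical fixed marker for this sketch only. A real container format
# would typically use a random per-file marker to avoid accidental matches
# inside compressed data.
SYNC = b"\xffSYNC-MARKER\xff"


def write_blocks(records, per_block=2):
    """Pack (key_bytes, value_bytes) pairs into compressed blocks.

    Layout per block (assumed): SYNC, int32 compressed length, zlib payload.
    Inside the payload each record is: int32 key length, int32 value length,
    key bytes, value bytes.
    """
    out = bytearray()
    for i in range(0, len(records), per_block):
        raw = bytearray()
        for key, value in records[i:i + per_block]:
            raw += struct.pack(">ii", len(key), len(value)) + key + value
        payload = zlib.compress(bytes(raw))
        out += SYNC + struct.pack(">i", len(payload)) + payload
    return bytes(out)


def read_from(data, offset):
    """Read every record in the blocks that start at or after `offset`.

    This is what a worker handling one split would do: skip forward to the
    next sync marker, then decode whole blocks from there.
    """
    records = []
    pos = data.find(SYNC, offset)
    while pos != -1 and pos < len(data):
        pos += len(SYNC)
        (size,) = struct.unpack_from(">i", data, pos)
        pos += 4
        raw = zlib.decompress(data[pos:pos + size])
        pos += size  # blocks are back to back, so this is the next marker or the end
        j = 0
        while j < len(raw):
            klen, vlen = struct.unpack_from(">ii", raw, j)
            j += 8
            records.append((raw[j:j + klen], raw[j + klen:j + klen + vlen]))
            j += klen + vlen
    return records
```

Reading from offset 0 returns every record, while reading from an offset in the middle of the first block skips ahead and returns only the records of the following blocks. A plain gzip-compressed text file does not offer that property, because it has no block boundaries a reader can resynchronize on.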

Re: Why do most algorithms use sequencefile as input and output?

Posted by Bertrand Dechoux <de...@gmail.com>.
SequenceFile is/was also the standard for binary data on Hadoop. The
question is rather: what else would you expect? Surely not a text format?

Bertrand

Re: Why do most algorithms use sequencefile as input and output?

Posted by Lee S <sl...@gmail.com>.
Are there any other reasons, or can you give a more thorough analysis?

Re: Why do most algorithms use sequencefile as input and output?

Posted by Ted Dunning <te...@gmail.com>.
Yes, type conversion is a reason.  

Re: Why do most algorithms use sequencefile as input and output?

Posted by Lee S <sl...@gmail.com>.
For example, k-means input:
1,2,3,4  // text file
k-means output:
point1, point2, point3  (text file of center points)

I just thought of one reason: the input data should be stored in vector
(dense or sparse) format, so a conversion step needs to be done before the
algorithms can process the data. Is that right?
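
A minimal sketch of that conversion step, in Python and using only the standard library. It does not use Mahout's vector classes or its on-disk encoding; the function names and the binary layout (an int32 length followed by float64 values) are assumptions chosen for illustration. It shows a text line like the one above being parsed once into a typed vector, optionally reduced to a sparse form, and written as binary so later jobs do not have to re-parse text.

```python
import struct


def parse_dense(line, sep=","):
    """Parse one text line such as '1,2,3,4' into a list of floats."""
    return [float(tok) for tok in line.split(sep) if tok.strip()]


def to_sparse(dense):
    """Keep only the non-zero entries as {index: value}."""
    return {i: v for i, v in enumerate(dense) if v != 0.0}


def encode_dense(vec):
    """Binary layout assumed for this sketch: big-endian int32 length,
    then that many float64 values."""
    return struct.pack(">i%dd" % len(vec), len(vec), *vec)


def decode_dense(buf):
    """Inverse of encode_dense: read the length, then the values."""
    (n,) = struct.unpack_from(">i", buf, 0)
    return list(struct.unpack_from(">%dd" % n, buf, 4))
```

For example, decode_dense(encode_dense(parse_dense("1,2,3,4"))) gives back the four values as floats, and to_sparse keeps only the indices whose values are non-zero.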

Re: Why do most algorithms use sequencefile as input and output?

Posted by Ted Dunning <te...@gmail.com>.
What should the input be?


