Posted to user@mahout.apache.org by Angelo Immediata <an...@gmail.com> on 2013/12/02 10:31:15 UTC

Write SequenceFile from custom data

Hi

I'm quite new to machine learning and, above all, to Apache Mahout, so
please pardon my basic questions.

I need to do some cluster analysis on some data. At the beginning the data
set will not be very large, but over time it can become really huge (by my
calculation, after 1 year it could be around 37 billion records). Because of
this volume, I decided to do the cluster analysis with Mahout on top of
Apache Hadoop and its HDFS. To store this large amount of data I decided to
use Apache HBase, also on top of Apache Hadoop HDFS.

Now I need to do this cluster analysis taking some environment variables
into account. These variables may be the following:

   - *recordId* = id of the record
   - *arcId* = id of the arc between 2 points of my "street graph"
   - *mediumVelocity* = average velocity on the considered arc in the
   specified
   - *vehiclesNumber* = number of vehicles monitored to obtain that
   velocity
   - *meteo* = weather condition (a numeric code indicating sun, rain,
   etc.)
   - *manifestation* = a numeric code indicating whether there is any kind
   of public event (a sports event or other)
   - *day of the week*
   - *month of the year*
   - *hour of the day*
   - *vacation* = a numeric code indicating whether it is a vacation day or
   a working day

So my data is formatted like this (raw representation):

recordId  arcId  mediumVelocity  vehiclesNumber  meteo  manifestation  weekDay  yearMonth  dayHour  vacation
1         1      34.5            20              1      3              4        2011       10       3
2         156    66.5            3               2      5              1        2008       6        2

As far as I know, to do cluster analysis in Mahout I need to convert my data
to Mahout's input format (that is, a SequenceFile). The question is: how can
I convert data shaped like the table above into a SequenceFile? I tried to
find something but was not able to find any good sample. Any suggestion
would be really appreciated.

Thank you, Angelo
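
Editorial sketch of the parsing step asked about above. Mahout's clustering jobs generally read SequenceFiles whose values are VectorWritable, so the first step is to turn each text record into a numeric array. The code below covers only that step, in plain Java, so it runs without Hadoop or Mahout on the classpath. The names RecordParser, ParsedRecord and parse are hypothetical, and the sketch assumes records are whitespace-separated with exactly the ten columns listed in the post. Wrapping the array in a Mahout DenseVector and VectorWritable and appending it with a Hadoop SequenceFile.Writer is not shown here.

```java
/** Hypothetical helper: turns one raw text record into a key plus a numeric feature array. */
final class RecordParser {

    /** Number of whitespace-separated columns in the raw representation (assumed from the post). */
    static final int COLUMN_COUNT = 10;

    /** Parsed form of one record: the recordId and the remaining nine columns as doubles. */
    static final class ParsedRecord {
        final long recordId;
        final double[] features;

        ParsedRecord(long recordId, double[] features) {
            this.recordId = recordId;
            this.features = features;
        }
    }

    /** Splits a line on whitespace; the first column becomes the key, the rest become features. */
    static ParsedRecord parse(String line) {
        String[] fields = line.trim().split("\\s+");
        if (fields.length != COLUMN_COUNT) {
            throw new IllegalArgumentException(
                "Expected " + COLUMN_COUNT + " columns but found " + fields.length);
        }
        long recordId = Long.parseLong(fields[0]);
        double[] features = new double[COLUMN_COUNT - 1];
        for (int i = 1; i < COLUMN_COUNT; i++) {
            features[i - 1] = Double.parseDouble(fields[i]);
        }
        return new ParsedRecord(recordId, features);
    }
}
```

Calling RecordParser.parse on the first sample row gives recordId 1 and a nine-element array starting with arcId. Which of those columns should actually feed the clustering is a separate decision that depends on the similarity measure.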

Re: Write SequenceFile from custom data

Posted by Angelo Immediata <an...@gmail.com>.
I was thinking of using org.apache.hadoop.mapred.join.TupleWritable to
implement my clustering. In your opinion, is this the right choice?
Otherwise, how might I implement my scenario?

Thank you
Angelo



Re: Write SequenceFile from custom data

Posted by Angelo Immediata <an...@gmail.com>.
Well, the similarity between data points should be calculated from the
following variables: meteo, manifestation, day of the week, month of the
year and vacation.
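
Editorial sketch related to the five variables named above. They are all categorical codes, so a distance computed on the raw numbers would treat code 1 as closer to code 2 than to code 5, which has no meaning for weather or event types. One common workaround is one-hot encoding: each variable becomes a block of 0/1 slots with a single 1 at the position of its code. The class name OneHotEncoder is hypothetical, the codes are assumed to run from 1 to the cardinality of each variable, and the cardinalities themselves are not given in the thread, so they are passed in as parameters.

```java
/** Hypothetical helper: one-hot encodes categorical codes into a numeric vector. */
final class OneHotEncoder {

    /**
     * Encodes each code as a block of zeros with a single 1.0.
     * Assumption: codes[i] lies in the range 1..cardinalities[i].
     */
    static double[] encode(int[] codes, int[] cardinalities) {
        if (codes.length != cardinalities.length) {
            throw new IllegalArgumentException("codes and cardinalities differ in length");
        }
        int total = 0;
        for (int c : cardinalities) {
            total += c;
        }
        double[] out = new double[total];
        int offset = 0; // start of the block belonging to variable i
        for (int i = 0; i < codes.length; i++) {
            int code = codes[i];
            if (code < 1 || code > cardinalities[i]) {
                throw new IllegalArgumentException(
                    "Code " + code + " out of range for variable " + i);
            }
            out[offset + code - 1] = 1.0;
            offset += cardinalities[i];
        }
        return out;
    }
}
```

With this encoding, the squared Euclidean distance between two records equals twice the number of variables on which they differ, so every mismatch counts the same regardless of the numeric codes involved.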



Re: Write SequenceFile from custom data

Posted by Ted Dunning <te...@gmail.com>.
The key first question is how you plan to compute similarity between data
points.  It isn't clear how you should do this with your data.
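
Editorial sketch of one possible answer to the similarity question, assuming the comparison uses only categorical attributes. Simple matching distance (a normalized Hamming distance) is the fraction of attributes on which two records disagree: 0.0 means identical, 1.0 means they differ everywhere. This is an illustration and not something proposed in the thread; the class name MatchingDistance is hypothetical. In Mahout, a measure like this would typically be supplied through an implementation of the DistanceMeasure interface, and that wiring is not shown.

```java
/** Hypothetical helper: simple matching distance over categorical codes. */
final class MatchingDistance {

    /** Returns the fraction of positions at which the two code arrays differ. */
    static double distance(int[] a, int[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Records have different numbers of attributes");
        }
        if (a.length == 0) {
            throw new IllegalArgumentException("Records have no attributes");
        }
        int mismatches = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] != b[i]) {
                mismatches++;
            }
        }
        return (double) mismatches / a.length;
    }
}
```

The measure ignores how far apart two codes are numerically, which suits unordered categories. Ordered or cyclic attributes such as hour of the day may need a different treatment.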



