Posted to user@mahout.apache.org by Angelo Immediata <an...@gmail.com> on 2013/11/29 16:45:21 UTC

Re: Information

Hi there,
I am reopening this old topic because I now have more information, having
been able to talk with my customer.
Basically, my customer wants the following:
using historical data, we have to cluster the records by means of
cluster analysis over some environment variables; for each cluster we have
to find the average velocity of its records. When a user wants to know the
velocity on a street in a given period (e.g. 24 January 2014), I should
be able to find which cluster the new data belongs to and propose to
the user the average velocity of that cluster.
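
A minimal sketch of the lookup step described above, assuming the clustering
has already been run and that each cluster is stored as a centroid plus the
average velocity of its records. The `clusters` values, the two-dimensional
feature space, and the name `predict_velocity` are all hypothetical and only
illustrate the idea; this is not Mahout code.

```python
import math

# Hypothetical output of an earlier clustering run: one centroid per cluster
# plus the average velocity (km/h) of the records assigned to that cluster.
clusters = [
    {"centroid": [0.0, 0.0], "mean_velocity_kmh": 50.0},  # e.g. no rain, no event
    {"centroid": [1.0, 0.0], "mean_velocity_kmh": 35.0},  # e.g. rain, no event
    {"centroid": [0.0, 1.0], "mean_velocity_kmh": 20.0},  # e.g. no rain, event
]


def predict_velocity(features, clusters):
    """Return the average velocity of the cluster with the nearest centroid."""
    nearest = min(clusters, key=lambda c: math.dist(features, c["centroid"]))
    return nearest["mean_velocity_kmh"]


# A new query described by the same (assumed) two features.
print(predict_velocity([0.1, 0.9], clusters))
```

Euclidean distance is used here only because it is the simplest choice; the
distance measure should match the one used when the clusters were built.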
Now I have to consider the following environment variables:

   - arcId=id of the arc between 2 points of my "street graph"
   - startTime=start time of the pre-clustering measurement
   - endTime=end time of the pre-clustering measurement
   - mediumVelocity=average velocity on the considered arc in the specified
   time range
   - vehiclesNumber=number of vehicles monitored to obtain that
   velocity in that time range
   - meteo=weather condition (a numeric code for sun, rain
   etc...)
   - manifestation=a numeric code indicating whether there is any kind of
   public event (a sports event or other)
   - day of the week
   - month of the year
   - hour of the day
   - vacation=a numeric code indicating whether it's a vacation day or a
   working day


And maybe some other variables.
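
One possible way to turn the variables listed above into a numeric vector
for clustering is sketched below. Everything here is an assumption made for
illustration: the `Measurement` class, the snake_case field names, the
choice to leave the velocity out of the vector (so it can be averaged per
cluster afterwards), and the sine/cosine encoding of hour, weekday and
month so that, for example, 23:00 ends up close to 00:00.

```python
import math
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Measurement:
    """Hypothetical record type mirroring the variables in the list above."""
    arc_id: int
    start_time: datetime
    end_time: datetime
    medium_velocity: float  # km/h, averaged per cluster, not clustered on
    vehicles_number: int
    meteo: int              # numeric weather code
    manifestation: int      # numeric event code
    vacation: int           # 1 = vacation day, 0 = working day


def cyclic(value, period):
    """Map a periodic value onto the unit circle as [sin, cos]."""
    angle = 2 * math.pi * value / period
    return [math.sin(angle), math.cos(angle)]


def to_features(m):
    """Build a 9-element feature vector from one measurement."""
    t = m.start_time
    return (
        cyclic(t.hour, 24)
        + cyclic(t.weekday(), 7)
        + cyclic(t.month - 1, 12)
        + [float(m.meteo), float(m.manifestation), float(m.vacation)]
    )
```

If the weather and event codes are categories rather than ordered
quantities, a one-hot encoding may suit them better than the raw number.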

Now my idea was to use Mahout to do the cluster analysis with
k-means and canopy; moreover, the data I should use in the cluster analysis
can be pretty huge (in one year it can reach around 37 billion
records in one table), so I decided to run Mahout on top of a Hadoop cluster
and to use HBase to store and read the data.
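
To make the canopy plus k-means combination concrete, here is a small
single-machine sketch. It is not Mahout's implementation and does not use
its API: the function names, the threshold `t2` and the fixed iteration
count are assumptions, and the canopy step is simplified to the part that
picks centers (the looser T1 threshold of the full algorithm is omitted).

```python
import math


def canopy_centers(points, t2):
    """Pick canopy centers: take a point, then drop every point within t2."""
    remaining = list(points)
    centers = []
    while remaining:
        center = remaining[0]
        centers.append(center)
        # The center itself has distance 0, so it is always removed here.
        remaining = [p for p in remaining if math.dist(p, center) > t2]
    return centers


def kmeans(points, centers, iterations=10):
    """Plain k-means (Lloyd's iterations) starting from the given centers."""
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            idx = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            groups[idx].append(p)
        # Recompute each center as the mean of its group; keep it if empty.
        centers = [
            [sum(col) / len(g) for col in zip(*g)] if g else c
            for g, c in zip(groups, centers)
        ]
    return centers


points = [[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0]]
initial = canopy_centers(points, t2=1.0)
print(kmeans(points, initial))
```

The point of the canopy pass is that it is cheap and yields both the number
of clusters and their starting centers, which k-means then refines.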

So what I would like to know is whether my solution makes sense (it seems
good to me, but as I said I'm a newbie to these technologies and, on the
other side, I need performance too).
If this solution is OK, how should I Map/Reduce my historical data in
order to pass it to Mahout for the cluster analysis?
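
One possible shape for such a job, sketched in plain Python rather than
with the Hadoop API: a map step that keys each record by arc, weekday and
hour, and a reduce step that computes the vehicle-weighted average velocity
per key. This only illustrates a pre-aggregation that would shrink the raw
table before clustering; the grouping key, the field names `dayOfWeek` and
`hour`, and the function names are assumptions, and this is not the input
format Mahout itself expects.

```python
from collections import defaultdict


def map_record(record):
    """Emit (key, (weighted velocity sum, vehicle count)) for one record."""
    key = (record["arcId"], record["dayOfWeek"], record["hour"])
    n = record["vehiclesNumber"]
    return key, (record["mediumVelocity"] * n, n)


def reduce_values(values):
    """Combine all values of one key into a vehicle-weighted average."""
    total = sum(v for v, _ in values)
    count = sum(n for _, n in values)
    return total / count  # assumes at least one vehicle per key


def run(records):
    """Simulate the shuffle phase in memory, then reduce each key."""
    shuffled = defaultdict(list)
    for r in records:
        key, value = map_record(r)
        shuffled[key].append(value)
    return {key: reduce_values(values) for key, values in shuffled.items()}


records = [
    {"arcId": 7, "dayOfWeek": 0, "hour": 9, "vehiclesNumber": 2, "mediumVelocity": 50.0},
    {"arcId": 7, "dayOfWeek": 0, "hour": 9, "vehiclesNumber": 1, "mediumVelocity": 20.0},
]
print(run(records))
```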

I hope I didn't make too many mistakes :)

Thank you
Angelo


2013/10/16 Bertrand Dechoux <de...@gmail.com>

> That's why I was asking a bit more about the problem. It looks to me that
> what will bring more value at the beginning is to find the shortest path,
> which is a classical graph algorithm. Then the results could be improved by
> changing the speed of each route according to additional information. As a
> client, if it's raining, I only want to know if I should turn left or
> right. Estimating the speed of each route with a good enough accuracy is
> more complex and is relevant only if there is a single long enough route.
>
> If you are dealing with large volume of data, there are also graph
> solutions for Hadoop like Giraph or Hama.
>
> IMHO, YMMV...
>
> Bertrand
>
>
>
>
> On Tue, Oct 15, 2013 at 10:01 PM, Angelo Immediata <angeloimm@gmail.com>
> wrote:
>
> > hi All
> >
> > First of all thank you for the great suggestions you gave me; you are
> > simply great :)
> > Anyway, returning to my problem, I'll try to be as clear as
> > possible... As far as I know (but we are still collecting requirements and
> > understanding which kind of data we will have) we should have a situation
> > of this type:
> > on street XYZ in Spring without any events (an event can be a
> > manifestation, parade etc...) the average velocity is 50 Km/h
> > on street XYZ in Spring with an event the average velocity is 20 Km/h
> > on street XYZ in Autumn without any events (an event can be a
> > manifestation, parade etc...) the average velocity is 40 Km/h
> > on street XYZ in Autumn with an event the average velocity is 15 Km/h
> >
> > and so on for all the streets of interest (basically using the Open Street
> > Map data); note that we are not interested in the worst case, that is the
> > case with an accident (at least as far as I know).
> >
> > Now my customer would like to offer this kind of functionality to the
> > clients:
> > a client connects to the site (or downloads an app) and wants to go
> > by car to the restaurant W; he/she would like to know if it's a good idea
> > to take that street or to look for a different one; so by knowing the
> > period of time (Spring, Autumn, Summer or Winter) and by knowing if there
> > are some events (manifestations, parades etc...) I should tell him/her:
> > if you go on street XYZ you will probably travel at 50 Km/h or 20 Km/h
> > (the best would be if I could suggest a different way... but this is
> > another topic :) )
> >
> > So, since I should use old data in order to suggest to the client the
> > velocity he/she may have on street XYZ, I was thinking of using
> > Mahout... but maybe I was wrong (sadly I'm really new to this kind of
> > world... though I'm finding it amazing)
> >
> >
> > Now by using the "old" data (the one I listed previously)
> >
> >
> >
> > 2013/10/15 Andrew Butkus <an...@butkus.co.uk>
> >
> > >
> > > After giving some more thought, you could do something like this:
> > >
> > > Store:
> > >
> > > route
> > > {
> > >         road
> > >         {
> > >                 timestamp,
> > >                 time_to_run_road,
> > >         }
> > > }
> > >
> > > then build up a bigger model, which extracts the timestamp from the road
> > > on the route and the time it takes to run that road, and calculates an
> > > average on a per-day basis. For example, if you travel this route every
> > > Monday at 9am, then extract the timestamps which match every Monday at
> > > 9am and average the time_to_run_road data you have collected on a Monday
> > > for that road. If you want to see how long it takes to run a road every
> > > Monday at 9am in January, then you extract all timestamps that match
> > > that road for January at 9am on Monday.
> > >
> > > Not entirely sure where Mahout fits in here, but this could be a
> > > potential way forward for you (assuming you can collect/have data about
> > > the road)
> > >
> > > Hope that helps
> > >
> > > Andy
> > >
> > > On 15 Oct 2013, at 13:09, Andrew Butkus <an...@butkus.co.uk> wrote:
> > >
> > > > Also to add to this you probably wouldn't want to do it by route, but
> > > > maybe break it down by road, this gives more coverage and greater
> > > > granularity
> > > >
> > > > Sent from my Windows Phone From: Andrew Butkus
> > > > Sent: 15/10/2013 13:07
> > > > To: Bertrand Dechoux; user@mahout.apache.org
> > > > Subject: RE: Information
> > > > I'm not sure, I think the last 2 can be predicted; for example, in
> > > > January in the UK we get bad weather which causes delays, and on
> > > > average it will take longer to run a route in this month because of
> > > > that.
> > > >
> > > > To consider weather as a variable is probably not scalable; recording
> > > > the time to run a route with a timestamp should be good enough.
> > > >
> > > > Also consider that once a year there is a festival in Reading, so over
> > > > this weekend routes through Reading will always take longer.
> > > >
> > > > I'm not sure where Mahout can fit this problem, but if you can
> > > > train route time and add a timestamp this would give you something
> > > > scalable. Then figure out on average how long it takes to run a route
> > > > at a similar timestamp, for example, minute, hour, week, month, year.
> > > >
> > > > Sent from my Windows Phone From: Bertrand Dechoux
> > > > Sent: 15/10/2013 08:33
> > > > To: user@mahout.apache.org
> > > > Subject: Re: Information
> > > > The biggest point is what data do you have and what exactly is your
> > > > problem.
> > > >
> > > > The maximum speed of the route can be easily known and in the best
> > > > case that would be your speed. From a very broad point of view, there
> > > > are three reasons for a slowdown.
> > > > 1) traffic jam
> > > > 2) accident
> > > > 3) bad weather
> > > >
> > > > But without up to date observations, those three points are non
> > > > trivial to predict (especially the last two). Doing simple statistics
> > > > (like average) can be a good start to see the variations and
> > > > understand what factors should be taken into account.
> > > >
> > > > At the end, you want to do a regression but classification and
> > > > clustering might help before that. Hard to say more without knowing
> > > > why the average speed is important, for which area, at which time...
> > > >
> > > > Bertrand
> > > >
> > > > On Tue, Oct 15, 2013 at 9:14 AM, Pavan K Narayanan <
> > > > pavan.narayanan@gmail.com> wrote:
> > > >
> > > > >> Based on the information you have provided, street routing is
> > > > >> potentially a Vehicle Routing Problem, which is based on TSPs. You
> > > > >> can check out the link below:
> > > > >> https://cwiki.apache.org/confluence/display/MAHOUT/Traveling+Salesman
> > > > >> Secondly, if you want to use Mahout for Forecasting, it is not
> > > > >> possible yet, as the solution methodology for Forecasting (LWR) is
> > > > >> still an open problem.
> > > > >> https://cwiki.apache.org/confluence/display/MAHOUT/Algorithms
> > > > >>
> > > > >> Bottom line: IMHO, you cannot use Mahout for forecasting at the
> > > > >> moment; good luck with your project.
> > > > >>
> > > > >> Also, you can explore parallel computing paradigms if you have
> > > > >> relatively high volumes of data.
> > > >>
> > > >>
> > > > >> On 15 October 2013 12:19, Angelo Immediata <an...@gmail.com> wrote:
> > > >>
> > > >>> Hi there
> > > >>>
> > > > >>> I'm pretty new to machine learning and Apache Mahout as well, so
> > > > >>> pardon me if this question is not entirely correct :)
> > > > >>>
> > > > >>> I'm in a street routing project where, besides other
> > > > >>> functionalities, we have to make forecasts. Precisely, we should be
> > > > >>> able to forecast the average speed on a street in a well-known
> > > > >>> period or season (e.g. we should be able to answer this kind of
> > > > >>> question: on the American Route 66, what will the average speed be
> > > > >>> in spring 2015?)
> > > > >>> As far as I know, in order to offer this functionality we should
> > > > >>> use some machine learning; this is the reason I'm checking Mahout
> > > > >>> (moreover, we need to guarantee high performance, and since Mahout
> > > > >>> is based on Apache Hadoop and uses Map/Reduce, it seems really
> > > > >>> amazing to me)
> > > > >>> The first question I'd love to ask is: can I use Apache Mahout in
> > > > >>> order to implement the functionality described above?
> > > > >>> If I can use it, I'll surely need some data in order to "train"
> > > > >>> Mahout... can I train Mahout at a different time from when I need
> > > > >>> the forecast? I mean: can I run the training, let's say, every week
> > > > >>> at 10pm and then offer the forecasting functionality only when a
> > > > >>> user is interested in it? Should I store the training result in
> > > > >>> some way?
> > > > >>> And last but not least :), assuming I can use Mahout... which
> > > > >>> algorithm should I use in order to implement my scenario?
> > > > >>>
> > > > >>> Thank you for the help and pardon me if I was not too correct
> > > >>>
> > > >>
> > >
> > >
> >
>
>
>
> --
> Bertrand Dechoux
>