Posted to user@hadoop.apache.org by dexter morgan <de...@gmail.com> on 2012/08/27 22:15:32 UTC

best way to join?

Dear list,

Let's say I have a file like this:
id \t lat,lng <-- structure

1\t40.123,-50.432
2\t41.431,-43.32
...
...
Let's call it 'points.txt'.
I'm trying to build a map-reduce job that runs over this BIG points file
and outputs a file that looks like this:
id[lat,lng] \t [list of points as a standard JSON array] <--- structure

1[40.123,-50.432]\t[[41.431,-43.32],[...,...],...,[...]]
2[41.431,-43.32]\t[[40.123,-50.432],...[,]]
...

Basically the job should run on the file ITSELF and grab, for each point, the
N CLOSEST points (N will be an argument to the job, and the mappers should
calculate the distances).

The distributed cache is not an option, so what else is there? I'm not sure
whether to classify this as a map-side join, a reduce-side join, or both (see
the rough sketch below).
Would you do this in HIVE somehow?
Is it feasible in a single job?

Would LOVE to hear your suggestions, code (if you think it's not that hard),
or whatnot.
BTW, I'm using CDH3 - rev 3 (20.23).
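
For concreteness, here is a rough sketch of the brute-force shape I was
imagining: a block nested-loop self-join that hashes every point into one of B
groups, sends each point to every reducer that owns a (home group, probe group)
cell, emits a partial top-N list per home point, and leaves a second job to
merge the partial lists by id. The class names, configuration keys, and the
plain Euclidean distance below are only placeholders, and the code is untested:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch only: block nested-loop self-join producing partial top-N lists.
// Input lines are "id<TAB>lat,lng"; knn.groups and knn.n are read from the
// job configuration; a follow-up job merges the partial lists per id.
public class KnnSelfJoinSketch {

  public static class PairMapper extends Mapper<LongWritable, Text, Text, Text> {
    private int groups;

    protected void setup(Context ctx) {
      groups = ctx.getConfiguration().getInt("knn.groups", 10);
    }

    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String id = line.toString().split("\t")[0];
      int g = (id.hashCode() & Integer.MAX_VALUE) % groups;
      for (int j = 0; j < groups; j++) {
        ctx.write(new Text(g + ":" + j), new Text("H\t" + line)); // home role
        ctx.write(new Text(j + ":" + g), new Text("P\t" + line)); // probe role
      }
    }
  }

  public static class PairReducer extends Reducer<Text, Text, Text, Text> {
    private int n;

    protected void setup(Context ctx) {
      n = ctx.getConfiguration().getInt("knn.n", 10);
    }

    protected void reduce(Text cell, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      List<String[]> homes = new ArrayList<String[]>();   // [id, "lat,lng"]
      List<double[]> probes = new ArrayList<double[]>();
      for (Text v : values) {                    // buffer both sides of the cell
        String[] f = v.toString().split("\t");   // f = [tag, id, "lat,lng"]
        if ("H".equals(f[0])) homes.add(new String[] { f[1], f[2] });
        else probes.add(parse(f[2]));
      }
      for (String[] h : homes) {
        double[] p = parse(h[1]);
        // max-heap on distance keeps only the n closest probes seen so far
        PriorityQueue<double[]> best = new PriorityQueue<double[]>(n + 1,
            (a, b) -> Double.compare(b[0], a[0]));
        for (double[] q : probes) {
          double d = Math.hypot(p[0] - q[0], p[1] - q[1]); // flat-earth stand-in
          if (d == 0.0) continue;        // skip the point itself (and exact twins)
          best.add(new double[] { d, q[0], q[1] });
          if (best.size() > n) best.poll();
        }
        for (double[] b : best) {        // partial list; the merge job keeps
          ctx.write(new Text(h[0]),      // the N globally closest per id
              new Text(b[0] + "\t" + b[1] + "," + b[2]));
        }
      }
    }

    private static double[] parse(String latLng) {
      String[] c = latLng.split(",");
      return new double[] { Double.parseDouble(c[0]), Double.parseDouble(c[1]) };
    }
  }
}

Each point is replicated 2*B times and each reducer buffers roughly 2*N/B
points, so B trades shuffle volume against reducer memory; it's still the
brute-force O(n^2) route.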

Thanks!

Re: best way to join?

Posted by Ted Dunning <td...@maprtech.com>.
On Sun, Sep 2, 2012 at 12:26 PM, dexter morgan <de...@gmail.com> wrote:

> ... Either way, any clustering process requires calculating the distance
> of all points (not between all the points, but of all of them to some
> relative point). Because i'll need a clustering MR job, ill probably use
> it, despite as you said, it has high probability to be correct (not 100%)...
>

This is probably right as stated, but I think that there is confusion here.

Many people assume that each point in the training data has to have
distance computed to all centroids in the clustering.  Even this is not
true.

It is true that you have to compute the distance to at least one centroid, but
not necessarily to all of the clusters.
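
To make that concrete, the standard triangle-inequality bound (the one used in
Elkan-style k-means) lets you discard a centroid without ever measuring its
distance to the point, once any candidate centroid is known. A small
illustrative sketch, not Mahout's actual code:

// Assign a point to its nearest centroid without measuring every centroid:
// if d(best, c) >= 2 * d(x, best), the triangle inequality guarantees that
// centroid c cannot be closer to x than the current best, so d(x, c) is
// never computed.  centroidDist holds precomputed pairwise centroid distances.
final class CentroidAssignment {

  static int nearestCentroid(double[] x, double[][] centroids, double[][] centroidDist) {
    int best = 0;
    double bestDist = euclidean(x, centroids[0]);
    for (int c = 1; c < centroids.length; c++) {
      if (centroidDist[best][c] >= 2 * bestDist) continue; // pruned, no distance taken
      double d = euclidean(x, centroids[c]);
      if (d < bestDist) { bestDist = d; best = c; }
    }
    return best;
  }

  static double euclidean(double[] a, double[] b) {
    double s = 0;
    for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
    return Math.sqrt(s);
  }
}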

Re: best way to join?

Posted by dexter morgan <de...@gmail.com>.
Thanks for the deeper explanation. Now I understand what you meant.
Either way, any clustering process requires calculating distances for all
points (not between every pair of points, but from each point to some
reference point). Since I'll need a clustering MR job anyway, I'll probably
use it, even though, as you said, the result is only correct with high
probability (not 100%)...

I'll post another question about joining a file with itself (whether to do it
as a mapper-side or reducer-side join), without using Hive / Pig.


On Fri, Aug 31, 2012 at 6:41 PM, Ted Dunning <td...@maprtech.com> wrote:

> Yes.  I think you mis-understood.
>
> My suggestion is that the few clusters near a point are very likely to
> contain the nearest points.  If you scan these clusters computing the
> distance to your original point, you should be able to find a list of
> points that overlaps with your desired result.  Keep in mind that I am
> suggesting that you use a LOT of clusters here, much more than is typical.
>  Typically, I recommend sqrt(N)/m clusters for this kind of work where m is
> a small constant 1 <= m <= 30.
>
> The basic idea for the method is that there are many points that are
> obviously far from your query.  You don't have to compute the distance to
> these data points to your query since you can eliminate them from
> consideration.  Many of them can be absolutely eliminated by appeal to the
> triangle inequality, but many more can be eliminated safely with reasonably
> high probability.  It is a reasonable heuristic that the farthest clusters
> contain points that can be eliminated this way.  Another way to state this
> is to say that you should only search through the points that are in
> clusters that are near your original cluster.
>
>
> On Fri, Aug 31, 2012 at 9:03 AM, dexter morgan <de...@gmail.com>wrote:
>
>> Hi Ted,
>>
>> First of all, i'd like to know how to make a map/reduce job that does
>> join on the input-file it self.
>> Second, maybe your clustering approach be usefull, i still think it's not
>> correct.
>> Reason:
>> Lets say i want to find the 10 closest points for a given point. Point:
>> [120,90] for example.
>> Clustering approach: which cluster has [120,90] as a node? answer: the
>> cluster at [300,200]
>> Now, if i understood you, i should get the 10 nearest neighbors of
>> [300,200] (again, you didn't elaborate much on this or i didn't understand
>> it)
>>
>> But i require the 10 nearest to [120,90] , not to [300,200]. Even if i
>> know the distances from [120,90] to [300,200] and to the 10 nearest points
>> to [300,200] it won't help me, because maybe the 10 nearest points to
>> [120,90] are actually starting from the 5000th nearest points to [300,200].
>>
>> In the end my goal is to pre-process (as i wrote at the begining) this
>> list of N nearest points for every point in the file. Where N is a
>> parameter given to the job. Let's say 10 points. That's it.
>> No calculation after-wards, only querying that list.
>>
>> Thank you
>>
>> On Thu, Aug 30, 2012 at 11:05 PM, Ted Dunning <td...@maprtech.com>wrote:
>>
>>> I don't know off-hand.  I don't understand the importance of your
>>> constraint either.
>>>
>>>
>>> On Thu, Aug 30, 2012 at 5:21 AM, dexter morgan <dextermorgan4u@gmail.com
>>> > wrote:
>>>
>>>> Ok, but as i said before, how do i achieve the same result with out
>>>> clustering , just linear. Join on the same data-set basically?
>>>>
>>>> and calculating the distance as i go
>>>>
>>>> On Tue, Aug 28, 2012 at 11:07 PM, Ted Dunning <td...@maprtech.com>wrote:
>>>>
>>>>> I don't mean that.
>>>>>
>>>>> I mean that a k-means clustering with pretty large clusters is a
>>>>> useful auxiliary data structure for finding nearest neighbors.  The basic
>>>>> outline is that you find the nearest clusters and search those for near
>>>>> neighbors.  The first riff is that you use a clever data structure for
>>>>> finding the nearest clusters so that you can do that faster than linear
>>>>> search.  The second riff is when you use another clever data structure to
>>>>> search each cluster quickly.
>>>>>
>>>>> There are fancier data structures available as well.
>>>>>
>>>>>
>>>>> On Tue, Aug 28, 2012 at 12:04 PM, dexter morgan <
>>>>> dextermorgan4u@gmail.com> wrote:
>>>>>
>>>>>> Right, but if i understood your sugesstion, you look at the end goal
>>>>>> , which is:
>>>>>> 1[40.123,-50.432]\t[[41.431,-43.32],[...,...],...,[...]]
>>>>>>
>>>>>> for example, and you say: here we see a cluster basically, that
>>>>>> cluster is represented by the point:  [40.123,-50.432]
>>>>>> which points does this cluster contains?  [[41.431,-
>>>>>> 43.32],[...,...],...,[...]]
>>>>>> meaning: that for every point i have in the dataset, you create a
>>>>>> cluster.
>>>>>> If you don't mean that, but you do mean to create clusters based on
>>>>>> some random-seed points or what not, that would mean
>>>>>>  that i'll have points (talking about the "end goal") that won't have
>>>>>> enough points in their list.
>>>>>>
>>>>>> one of the criterions for a clustering is that for any clusters: C_i
>>>>>> and C_j (where i != j), C_i intersect C_j is empty
>>>>>>
>>>>>> and again, how can i accomplish my task with out running mahout / knn
>>>>>> algo? just by calculating distance between points?
>>>>>> join of a file with it self.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> On Tue, Aug 28, 2012 at 6:32 PM, Ted Dunning <td...@maprtech.com>wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Aug 28, 2012 at 9:48 AM, dexter morgan <
>>>>>>> dextermorgan4u@gmail.com> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> I understand your solution ( i think) , didn't think of that, in
>>>>>>>> that particular way.
>>>>>>>> I think that lets say i have 1M data-points, and running knn , that
>>>>>>>> the k=1M and n=10 (each point is a cluster that requires up to 10 points)
>>>>>>>> is an overkill.
>>>>>>>>
>>>>>>>
>>>>>>> I am not sure I understand you.  n = number of points.  k = number
>>>>>>> of clusters.  For searching 1 million points, I would recommend thousands
>>>>>>> of clusters.
>>>>>>>
>>>>>>>
>>>>>>>> How can i achieve the same result WITHOUT using mahout, just
>>>>>>>> running on the dataset , i even think it'll be in the same complexity
>>>>>>>> (o(n^2))
>>>>>>>>
>>>>>>>
>>>>>>> Running with a good knn package will give you roughly O(n log n)
>>>>>>> complexity.
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: best way to join?

Posted by Ted Dunning <td...@maprtech.com>.
Yes.  I think you misunderstood.

My suggestion is that the few clusters near a point are very likely to
contain the nearest points.  If you scan these clusters computing the
distance to your original point, you should be able to find a list of
points that overlaps with your desired result.  Keep in mind that I am
suggesting that you use a LOT of clusters here, many more than is typical.
Typically, I recommend sqrt(N)/m clusters for this kind of work, where m is
a small constant, 1 <= m <= 30.

The basic idea for the method is that there are many points that are
obviously far from your query.  You don't have to compute the distance from
these data points to your query, since you can eliminate them from
consideration.  Many of them can be absolutely eliminated by appeal to the
triangle inequality, but many more can be eliminated safely with reasonably
high probability.  It is a reasonable heuristic that the farthest clusters
contain points that can be eliminated this way.  Another way to state this
is to say that you should only search through the points that are in
clusters that are near your original cluster.
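
A minimal in-memory sketch of that search, assuming the clustering step has
already produced centroids and per-cluster member lists (the names and shapes
here are made up for illustration; a MapReduce version would key the work by
cluster id and ship each query point to its few nearest clusters):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

// Search only the clusters whose centroids are closest to the query, then
// brute-force the points inside those clusters.  Illustrative sketch only.
final class ClusterPrunedSearch {

  // returns up to n [distance, lat, lng] triples, closest first
  static List<double[]> nearest(double[] query, double[][] centroids,
                                List<List<double[]>> members,
                                int clustersToSearch, int n) {
    // rank centroids by distance to the query
    Integer[] order = new Integer[centroids.length];
    for (int i = 0; i < order.length; i++) order[i] = i;
    Arrays.sort(order, (a, b) -> Double.compare(
        dist(query, centroids[a]), dist(query, centroids[b])));

    // max-heap on distance keeps only the n closest points seen so far
    PriorityQueue<double[]> best = new PriorityQueue<>(n + 1,
        (a, b) -> Double.compare(b[0], a[0]));
    for (int k = 0; k < Math.min(clustersToSearch, order.length); k++) {
      for (double[] p : members.get(order[k])) {
        double d = dist(query, p);
        best.add(new double[] { d, p[0], p[1] });
        if (best.size() > n) best.poll();
      }
    }
    List<double[]> result = new ArrayList<>(best);
    result.sort((a, b) -> Double.compare(a[0], b[0]));
    return result;
  }

  static double dist(double[] a, double[] b) {
    return Math.hypot(a[0] - b[0], a[1] - b[1]);
  }
}

With roughly sqrt(N)/m clusters, scanning a handful of the nearest ones touches
only a small slice of the data instead of all N points, which is where the
savings over a full linear scan come from.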

On Fri, Aug 31, 2012 at 9:03 AM, dexter morgan <de...@gmail.com> wrote:

> Hi Ted,
>
> First of all, i'd like to know how to make a map/reduce job that does join
> on the input-file it self.
> Second, maybe your clustering approach be usefull, i still think it's not
> correct.
> Reason:
> Lets say i want to find the 10 closest points for a given point. Point:
> [120,90] for example.
> Clustering approach: which cluster has [120,90] as a node? answer: the
> cluster at [300,200]
> Now, if i understood you, i should get the 10 nearest neighbors of
> [300,200] (again, you didn't elaborate much on this or i didn't understand
> it)
>
> But i require the 10 nearest to [120,90] , not to [300,200]. Even if i
> know the distances from [120,90] to [300,200] and to the 10 nearest points
> to [300,200] it won't help me, because maybe the 10 nearest points to
> [120,90] are actually starting from the 5000th nearest points to [300,200].
>
> In the end my goal is to pre-process (as i wrote at the begining) this
> list of N nearest points for every point in the file. Where N is a
> parameter given to the job. Let's say 10 points. That's it.
> No calculation after-wards, only querying that list.
>
> Thank you
>
> On Thu, Aug 30, 2012 at 11:05 PM, Ted Dunning <td...@maprtech.com>wrote:
>
>> I don't know off-hand.  I don't understand the importance of your
>> constraint either.
>>
>>
>> On Thu, Aug 30, 2012 at 5:21 AM, dexter morgan <de...@gmail.com>wrote:
>>
>>> Ok, but as i said before, how do i achieve the same result with out
>>> clustering , just linear. Join on the same data-set basically?
>>>
>>> and calculating the distance as i go
>>>
>>> On Tue, Aug 28, 2012 at 11:07 PM, Ted Dunning <td...@maprtech.com>wrote:
>>>
>>>> I don't mean that.
>>>>
>>>> I mean that a k-means clustering with pretty large clusters is a useful
>>>> auxiliary data structure for finding nearest neighbors.  The basic outline
>>>> is that you find the nearest clusters and search those for near neighbors.
>>>>  The first riff is that you use a clever data structure for finding the
>>>> nearest clusters so that you can do that faster than linear search.  The
>>>> second riff is when you use another clever data structure to search each
>>>> cluster quickly.
>>>>
>>>> There are fancier data structures available as well.
>>>>
>>>>
>>>> On Tue, Aug 28, 2012 at 12:04 PM, dexter morgan <
>>>> dextermorgan4u@gmail.com> wrote:
>>>>
>>>>> Right, but if i understood your sugesstion, you look at the end goal ,
>>>>> which is:
>>>>> 1[40.123,-50.432]\t[[41.431,-43.32],[...,...],...,[...]]
>>>>>
>>>>> for example, and you say: here we see a cluster basically, that
>>>>> cluster is represented by the point:  [40.123,-50.432]
>>>>> which points does this cluster contains?  [[41.431,-
>>>>> 43.32],[...,...],...,[...]]
>>>>> meaning: that for every point i have in the dataset, you create a
>>>>> cluster.
>>>>> If you don't mean that, but you do mean to create clusters based on
>>>>> some random-seed points or what not, that would mean
>>>>>  that i'll have points (talking about the "end goal") that won't have
>>>>> enough points in their list.
>>>>>
>>>>> one of the criterions for a clustering is that for any clusters: C_i
>>>>> and C_j (where i != j), C_i intersect C_j is empty
>>>>>
>>>>> and again, how can i accomplish my task with out running mahout / knn
>>>>> algo? just by calculating distance between points?
>>>>> join of a file with it self.
>>>>>
>>>>> Thanks
>>>>>
>>>>> On Tue, Aug 28, 2012 at 6:32 PM, Ted Dunning <td...@maprtech.com>wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Aug 28, 2012 at 9:48 AM, dexter morgan <
>>>>>> dextermorgan4u@gmail.com> wrote:
>>>>>>
>>>>>>>
>>>>>>> I understand your solution ( i think) , didn't think of that, in
>>>>>>> that particular way.
>>>>>>> I think that lets say i have 1M data-points, and running knn , that
>>>>>>> the k=1M and n=10 (each point is a cluster that requires up to 10 points)
>>>>>>> is an overkill.
>>>>>>>
>>>>>>
>>>>>> I am not sure I understand you.  n = number of points.  k = number of
>>>>>> clusters.  For searching 1 million points, I would recommend thousands of
>>>>>> clusters.
>>>>>>
>>>>>>
>>>>>>> How can i achieve the same result WITHOUT using mahout, just running
>>>>>>> on the dataset , i even think it'll be in the same complexity (o(n^2))
>>>>>>>
>>>>>>
>>>>>> Running with a good knn package will give you roughly O(n log n)
>>>>>> complexity.
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: best way to join?

Posted by dexter morgan <de...@gmail.com>.
Hi Ted,

First of all, I'd like to know how to make a map/reduce job that joins
the input file with itself.
Second, maybe your clustering approach is useful, but I still think it's not
correct.
Reason:
Let's say I want to find the 10 closest points for a given point, say
[120,90].
Clustering approach: which cluster has [120,90] as a member? Answer: the
cluster at [300,200].
Now, if I understood you, I should get the 10 nearest neighbors of
[300,200] (again, you didn't elaborate much on this, or I didn't understand
it).

But I require the 10 nearest to [120,90], not to [300,200]. Even if I know
the distances from [120,90] to [300,200] and to the 10 nearest points to
[300,200], it won't help me, because maybe the 10 nearest points to [120,90]
only start at the 5000th nearest point to [300,200].

In the end my goal is to pre-process (as I wrote at the beginning) the list
of the N nearest points for every point in the file, where N is a parameter
given to the job. Let's say 10 points. That's it.
No calculation afterwards, only querying that list.

Thank you

On Thu, Aug 30, 2012 at 11:05 PM, Ted Dunning <td...@maprtech.com> wrote:

> I don't know off-hand.  I don't understand the importance of your
> constraint either.
>
>
> On Thu, Aug 30, 2012 at 5:21 AM, dexter morgan <de...@gmail.com>wrote:
>
>> Ok, but as i said before, how do i achieve the same result with out
>> clustering , just linear. Join on the same data-set basically?
>>
>> and calculating the distance as i go
>>
>> On Tue, Aug 28, 2012 at 11:07 PM, Ted Dunning <td...@maprtech.com>wrote:
>>
>>> I don't mean that.
>>>
>>> I mean that a k-means clustering with pretty large clusters is a useful
>>> auxiliary data structure for finding nearest neighbors.  The basic outline
>>> is that you find the nearest clusters and search those for near neighbors.
>>>  The first riff is that you use a clever data structure for finding the
>>> nearest clusters so that you can do that faster than linear search.  The
>>> second riff is when you use another clever data structure to search each
>>> cluster quickly.
>>>
>>> There are fancier data structures available as well.
>>>
>>>
>>> On Tue, Aug 28, 2012 at 12:04 PM, dexter morgan <
>>> dextermorgan4u@gmail.com> wrote:
>>>
>>>> Right, but if i understood your sugesstion, you look at the end goal ,
>>>> which is:
>>>> 1[40.123,-50.432]\t[[41.431,-43.32],[...,...],...,[...]]
>>>>
>>>> for example, and you say: here we see a cluster basically, that cluster
>>>> is represented by the point:  [40.123,-50.432]
>>>> which points does this cluster contains?  [[41.431,-
>>>> 43.32],[...,...],...,[...]]
>>>> meaning: that for every point i have in the dataset, you create a
>>>> cluster.
>>>> If you don't mean that, but you do mean to create clusters based on
>>>> some random-seed points or what not, that would mean
>>>>  that i'll have points (talking about the "end goal") that won't have
>>>> enough points in their list.
>>>>
>>>> one of the criterions for a clustering is that for any clusters: C_i
>>>> and C_j (where i != j), C_i intersect C_j is empty
>>>>
>>>> and again, how can i accomplish my task with out running mahout / knn
>>>> algo? just by calculating distance between points?
>>>> join of a file with it self.
>>>>
>>>> Thanks
>>>>
>>>> On Tue, Aug 28, 2012 at 6:32 PM, Ted Dunning <td...@maprtech.com>wrote:
>>>>
>>>>>
>>>>>
>>>>> On Tue, Aug 28, 2012 at 9:48 AM, dexter morgan <
>>>>> dextermorgan4u@gmail.com> wrote:
>>>>>
>>>>>>
>>>>>> I understand your solution ( i think) , didn't think of that, in that
>>>>>> particular way.
>>>>>> I think that lets say i have 1M data-points, and running knn , that
>>>>>> the k=1M and n=10 (each point is a cluster that requires up to 10 points)
>>>>>> is an overkill.
>>>>>>
>>>>>
>>>>> I am not sure I understand you.  n = number of points.  k = number of
>>>>> clusters.  For searching 1 million points, I would recommend thousands of
>>>>> clusters.
>>>>>
>>>>>
>>>>>> How can i achieve the same result WITHOUT using mahout, just running
>>>>>> on the dataset , i even think it'll be in the same complexity (o(n^2))
>>>>>>
>>>>>
>>>>> Running with a good knn package will give you roughly O(n log n)
>>>>> complexity.
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: best way to join?

Posted by Ted Dunning <td...@maprtech.com>.
I don't know off-hand.  I don't understand the importance of your
constraint either.

On Thu, Aug 30, 2012 at 5:21 AM, dexter morgan <de...@gmail.com>wrote:

> Ok, but as i said before, how do i achieve the same result with out
> clustering , just linear. Join on the same data-set basically?
>
> and calculating the distance as i go
>
> On Tue, Aug 28, 2012 at 11:07 PM, Ted Dunning <td...@maprtech.com>wrote:
>
>> I don't mean that.
>>
>> I mean that a k-means clustering with pretty large clusters is a useful
>> auxiliary data structure for finding nearest neighbors.  The basic outline
>> is that you find the nearest clusters and search those for near neighbors.
>>  The first riff is that you use a clever data structure for finding the
>> nearest clusters so that you can do that faster than linear search.  The
>> second riff is when you use another clever data structure to search each
>> cluster quickly.
>>
>> There are fancier data structures available as well.
>>
>>
>> On Tue, Aug 28, 2012 at 12:04 PM, dexter morgan <dextermorgan4u@gmail.com
>> > wrote:
>>
>>> Right, but if i understood your sugesstion, you look at the end goal ,
>>> which is:
>>> 1[40.123,-50.432]\t[[41.431,-43.32],[...,...],...,[...]]
>>>
>>> for example, and you say: here we see a cluster basically, that cluster
>>> is represented by the point:  [40.123,-50.432]
>>> which points does this cluster contains?  [[41.431,-
>>> 43.32],[...,...],...,[...]]
>>> meaning: that for every point i have in the dataset, you create a
>>> cluster.
>>> If you don't mean that, but you do mean to create clusters based on some
>>> random-seed points or what not, that would mean
>>>  that i'll have points (talking about the "end goal") that won't have
>>> enough points in their list.
>>>
>>> one of the criterions for a clustering is that for any clusters: C_i and
>>> C_j (where i != j), C_i intersect C_j is empty
>>>
>>> and again, how can i accomplish my task with out running mahout / knn
>>> algo? just by calculating distance between points?
>>> join of a file with it self.
>>>
>>> Thanks
>>>
>>> On Tue, Aug 28, 2012 at 6:32 PM, Ted Dunning <td...@maprtech.com>wrote:
>>>
>>>>
>>>>
>>>> On Tue, Aug 28, 2012 at 9:48 AM, dexter morgan <
>>>> dextermorgan4u@gmail.com> wrote:
>>>>
>>>>>
>>>>> I understand your solution ( i think) , didn't think of that, in that
>>>>> particular way.
>>>>> I think that lets say i have 1M data-points, and running knn , that
>>>>> the k=1M and n=10 (each point is a cluster that requires up to 10 points)
>>>>> is an overkill.
>>>>>
>>>>
>>>> I am not sure I understand you.  n = number of points.  k = number of
>>>> clusters.  For searching 1 million points, I would recommend thousands of
>>>> clusters.
>>>>
>>>>
>>>>> How can i achieve the same result WITHOUT using mahout, just running
>>>>> on the dataset , i even think it'll be in the same complexity (o(n^2))
>>>>>
>>>>
>>>> Running with a good knn package will give you roughly O(n log n)
>>>> complexity.
>>>>
>>>>
>>>
>>
>

Re: best way to join?

Posted by dexter morgan <de...@gmail.com>.
Ok, but as I said before, how do I achieve the same result without
clustering, just linearly? A join of the data-set with itself, basically,
calculating the distances as I go.
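
Something along these lines is what I have in mind, as an untested sketch
(all the class names and the "knn.blocks" / "knn.n" settings are made up by
me; in a real project each class goes in its own file): hash the ids into B
blocks, send every point to each block pair that involves its own block, so
one reducer only has to buffer two blocks and compare them, keeping a
bounded "worst first" queue of the N closest candidates per point. A second,
trivial job that groups the partial lists by point id then gives the final
N nearest per point.

  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.Comparator;
  import java.util.List;
  import java.util.PriorityQueue;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;

  // Emit every input line "id\tlat,lng" to each block pair (i,j) that
  // contains the point's own block, so one reducer only ever has to
  // buffer two blocks (roughly 2/B of the data).
  public class BlockPairMapper extends Mapper<LongWritable, Text, Text, Text> {
    private int blocks;

    protected void setup(Context ctx) {
      blocks = ctx.getConfiguration().getInt("knn.blocks", 20);
    }

    protected void map(LongWritable off, Text line, Context ctx)
        throws IOException, InterruptedException {
      String id = line.toString().split("\t")[0];
      int home = (id.hashCode() & Integer.MAX_VALUE) % blocks;
      for (int b = 0; b < blocks; b++) {
        String pair = Math.min(home, b) + "," + Math.max(home, b);
        ctx.write(new Text(pair), line);
      }
    }
  }

  // For each point of the pair, keep the N closest points of the other
  // block (or of the same block when i == j) and emit a partial list;
  // a follow-up job merges the partial lists per point id.
  public class BlockPairReducer extends Reducer<Text, Text, Text, Text> {
    private int n;
    private int blocks;

    protected void setup(Context ctx) {
      n = ctx.getConfiguration().getInt("knn.n", 10);
      blocks = ctx.getConfiguration().getInt("knn.blocks", 20);
    }

    protected void reduce(Text pair, Iterable<Text> lines, Context ctx)
        throws IOException, InterruptedException {
      String[] ij = pair.toString().split(",");
      boolean sameBlock = ij[0].equals(ij[1]);
      List<String[]> pts = new ArrayList<String[]>();   // [id, lat, lng]
      for (Text t : lines) {
        pts.add(t.toString().split("[\t,]"));
      }
      for (String[] p : pts) {
        int pHome = (p[0].hashCode() & Integer.MAX_VALUE) % blocks;
        // head of the queue is the worst of the current best N
        PriorityQueue<double[]> best = new PriorityQueue<double[]>(n,
            new Comparator<double[]>() {
              public int compare(double[] a, double[] b) {
                return Double.compare(b[0], a[0]);
              }
            });
        for (String[] q : pts) {
          if (p[0].equals(q[0])) continue;              // not the point itself
          int qHome = (q[0].hashCode() & Integer.MAX_VALUE) % blocks;
          // same-block pairs are already handled by the (h,h) key
          if (!sameBlock && pHome == qHome) continue;
          double d = dist(p, q);
          if (best.size() < n) {
            best.add(new double[] { d, Double.parseDouble(q[1]),
                Double.parseDouble(q[2]) });
          } else if (d < best.peek()[0]) {
            best.poll();
            best.add(new double[] { d, Double.parseDouble(q[1]),
                Double.parseDouble(q[2]) });
          }
        }
        // note: PriorityQueue iteration order is arbitrary; sort here if
        // the final list has to come out ordered by distance
        StringBuilder json = new StringBuilder("[");
        for (double[] nb : best) {
          if (json.length() > 1) json.append(",");
          json.append("[").append(nb[1]).append(",").append(nb[2]).append("]");
        }
        json.append("]");
        ctx.write(new Text(p[0] + "[" + p[1] + "," + p[2] + "]"),
            new Text(json.toString()));
      }
    }

    // plain Euclidean distance on the raw lat/lng; fine for ranking nearby
    // points, swap in a haversine formula for real geographic distances
    private static double dist(String[] a, String[] b) {
      double dLat = Double.parseDouble(a[1]) - Double.parseDouble(b[1]);
      double dLng = Double.parseDouble(a[2]) - Double.parseDouble(b[2]);
      return Math.sqrt(dLat * dLat + dLng * dLng);
    }
  }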

On Tue, Aug 28, 2012 at 11:07 PM, Ted Dunning <td...@maprtech.com> wrote:

> I don't mean that.
>
> I mean that a k-means clustering with pretty large clusters is a useful
> auxiliary data structure for finding nearest neighbors.  The basic outline
> is that you find the nearest clusters and search those for near neighbors.
>  The first riff is that you use a clever data structure for finding the
> nearest clusters so that you can do that faster than linear search.  The
> second riff is when you use another clever data structure to search each
> cluster quickly.
>
> There are fancier data structures available as well.
>
>
> On Tue, Aug 28, 2012 at 12:04 PM, dexter morgan <de...@gmail.com>wrote:
>
>> Right, but if i understood your sugesstion, you look at the end goal ,
>> which is:
>> 1[40.123,-50.432]\t[[41.431,-43.32],[...,...],...,[...]]
>>
>> for example, and you say: here we see a cluster basically, that cluster
>> is represented by the point:  [40.123,-50.432]
>> which points does this cluster contains?  [[41.431,-
>> 43.32],[...,...],...,[...]]
>> meaning: that for every point i have in the dataset, you create a cluster.
>> If you don't mean that, but you do mean to create clusters based on some
>> random-seed points or what not, that would mean
>>  that i'll have points (talking about the "end goal") that won't have
>> enough points in their list.
>>
>> one of the criterions for a clustering is that for any clusters: C_i and
>> C_j (where i != j), C_i intersect C_j is empty
>>
>> and again, how can i accomplish my task with out running mahout / knn
>> algo? just by calculating distance between points?
>> join of a file with it self.
>>
>> Thanks
>>
>> On Tue, Aug 28, 2012 at 6:32 PM, Ted Dunning <td...@maprtech.com>wrote:
>>
>>>
>>>
>>> On Tue, Aug 28, 2012 at 9:48 AM, dexter morgan <dextermorgan4u@gmail.com
>>> > wrote:
>>>
>>>>
>>>> I understand your solution ( i think) , didn't think of that, in that
>>>> particular way.
>>>> I think that lets say i have 1M data-points, and running knn , that the
>>>> k=1M and n=10 (each point is a cluster that requires up to 10 points)
>>>> is an overkill.
>>>>
>>>
>>> I am not sure I understand you.  n = number of points.  k = number of
>>> clusters.  For searching 1 million points, I would recommend thousands of
>>> clusters.
>>>
>>>
>>>> How can i achieve the same result WITHOUT using mahout, just running on
>>>> the dataset , i even think it'll be in the same complexity (o(n^2))
>>>>
>>>
>>> Running with a good knn package will give you roughly O(n log n)
>>> complexity.
>>>
>>>
>>
>

Re: best way to join?

Posted by Ted Dunning <td...@maprtech.com>.
I don't mean that.

I mean that a k-means clustering with pretty large clusters is a useful
auxiliary data structure for finding nearest neighbors.  The basic outline
is that you find the nearest clusters and search those for near neighbors.
The first riff is to use a clever data structure for finding the nearest
clusters, so that you can do that faster than linear search.  The second
riff is to use another clever data structure to search each cluster quickly.

There are fancier data structures available as well.
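
In code the search side is roughly this (an untested sketch, names made up
by me; the centroid array and the per-cluster member lists are what the
clustering job and the assignment pass produce):

  import java.util.ArrayList;
  import java.util.Arrays;
  import java.util.Collections;
  import java.util.Comparator;
  import java.util.List;
  import java.util.Map;

  class CentroidIndex {
    private final double[][] centroids;                 // k cluster centres
    private final Map<Integer, List<double[]>> members; // cluster -> points

    CentroidIndex(double[][] centroids,
                  Map<Integer, List<double[]>> members) {
      this.centroids = centroids;
      this.members = members;
    }

    // Approximate N nearest neighbours of q: rank the k centroids (cheap,
    // since k is only in the thousands), then scan just the points of the
    // closest 'probes' clusters. More probes buys better recall at the
    // cost of a slower search.
    List<double[]> nearest(final double[] q, int n, int probes) {
      Integer[] order = new Integer[centroids.length];
      for (int i = 0; i < order.length; i++) order[i] = i;
      Arrays.sort(order, new Comparator<Integer>() {
        public int compare(Integer a, Integer b) {
          return Double.compare(dist(q, centroids[a]),
                                dist(q, centroids[b]));
        }
      });
      List<double[]> candidates = new ArrayList<double[]>();
      for (int i = 0; i < Math.min(probes, order.length); i++) {
        candidates.addAll(members.get(order[i]));
      }
      Collections.sort(candidates, new Comparator<double[]>() {
        public int compare(double[] a, double[] b) {
          return Double.compare(dist(q, a), dist(q, b));
        }
      });
      return candidates.subList(0, Math.min(n, candidates.size()));
    }

    private static double dist(double[] a, double[] b) {
      double dx = a[0] - b[0], dy = a[1] - b[1];
      return dx * dx + dy * dy;   // squared distance ranks the same
    }
  }

The same structure works in the map-only pass over the points: load the
centroids into memory and, for each point, search only the few clusters
nearest to it instead of the whole file.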

On Tue, Aug 28, 2012 at 12:04 PM, dexter morgan <de...@gmail.com>wrote:

> Right, but if i understood your sugesstion, you look at the end goal ,
> which is:
> 1[40.123,-50.432]\t[[41.431,-43.32],[...,...],...,[...]]
>
> for example, and you say: here we see a cluster basically, that cluster is
> represented by the point:  [40.123,-50.432]
> which points does this cluster contains?  [[41.431,-
> 43.32],[...,...],...,[...]]
> meaning: that for every point i have in the dataset, you create a cluster.
> If you don't mean that, but you do mean to create clusters based on some
> random-seed points or what not, that would mean
>  that i'll have points (talking about the "end goal") that won't have
> enough points in their list.
>
> one of the criterions for a clustering is that for any clusters: C_i and
> C_j (where i != j), C_i intersect C_j is empty
>
> and again, how can i accomplish my task with out running mahout / knn
> algo? just by calculating distance between points?
> join of a file with it self.
>
> Thanks
>
> On Tue, Aug 28, 2012 at 6:32 PM, Ted Dunning <td...@maprtech.com>wrote:
>
>>
>>
>> On Tue, Aug 28, 2012 at 9:48 AM, dexter morgan <de...@gmail.com>wrote:
>>
>>>
>>> I understand your solution ( i think) , didn't think of that, in that
>>> particular way.
>>> I think that lets say i have 1M data-points, and running knn , that the
>>> k=1M and n=10 (each point is a cluster that requires up to 10 points)
>>> is an overkill.
>>>
>>
>> I am not sure I understand you.  n = number of points.  k = number of
>> clusters.  For searching 1 million points, I would recommend thousands of
>> clusters.
>>
>>
>>> How can i achieve the same result WITHOUT using mahout, just running on
>>> the dataset , i even think it'll be in the same complexity (o(n^2))
>>>
>>
>> Running with a good knn package will give you roughly O(n log n)
>> complexity.
>>
>>
>

Re: best way to join?

Posted by dexter morgan <de...@gmail.com>.
Right, but if I understood your suggestion, you look at the end goal,
which is:
1[40.123,-50.432]\t[[41.431,-43.32],[...,...],...,[...]]

for example, and you say: here we see a cluster, basically; that cluster is
represented by the point [40.123,-50.432].
Which points does this cluster contain?  [[41.431,-
43.32],[...,...],...,[...]]
Meaning: for every point I have in the dataset, you create a cluster.
If you don't mean that, but rather mean to create clusters based on some
random seed points or whatnot, that would mean
that I'll have points (talking about the "end goal") that won't have enough
points in their list.

One of the criteria for a clustering is that for any two clusters C_i and
C_j (where i != j), C_i intersect C_j is empty.

And again, how can I accomplish my task without running mahout / a knn
algo, just by calculating distances between points?
A join of a file with itself.

Thanks

On Tue, Aug 28, 2012 at 6:32 PM, Ted Dunning <td...@maprtech.com> wrote:

>
>
> On Tue, Aug 28, 2012 at 9:48 AM, dexter morgan <de...@gmail.com>wrote:
>
>>
>> I understand your solution ( i think) , didn't think of that, in that
>> particular way.
>> I think that lets say i have 1M data-points, and running knn , that the
>> k=1M and n=10 (each point is a cluster that requires up to 10 points)
>> is an overkill.
>>
>
> I am not sure I understand you.  n = number of points.  k = number of
> clusters.  For searching 1 million points, I would recommend thousands of
> clusters.
>
>
>> How can i achieve the same result WITHOUT using mahout, just running on
>> the dataset , i even think it'll be in the same complexity (o(n^2))
>>
>
> Running with a good knn package will give you roughly O(n log n)
> complexity.
>
>

Re: best way to join?

Posted by Ted Dunning <td...@maprtech.com>.
On Tue, Aug 28, 2012 at 9:48 AM, dexter morgan <de...@gmail.com>wrote:

>
> I understand your solution ( i think) , didn't think of that, in that
> particular way.
> I think that lets say i have 1M data-points, and running knn , that the
> k=1M and n=10 (each point is a cluster that requires up to 10 points)
> is an overkill.
>

I am not sure I understand you.  n = number of points.  k = number of
clusters.  For searching 1 million points, I would recommend thousands of
clusters.


> How can i achieve the same result WITHOUT using mahout, just running on
> the dataset , i even think it'll be in the same complexity (o(n^2))
>

Running with a good knn package will give you roughly O(n log n)
complexity.

Re: best way to join?

Posted by Ted Dunning <td...@maprtech.com>.
On Tue, Aug 28, 2012 at 9:48 AM, dexter morgan <de...@gmail.com>wrote:

>
> I understand your solution ( i think) , didn't think of that, in that
> particular way.
> I think that lets say i have 1M data-points, and running knn , that the
> k=1M and n=10 (each point is a cluster that requires up to 10 points)
> is an overkill.
>

I am not sure I understand you.  n = number of points.  k = number of
clusters.  For searching 1 million points, I would recommend thousands of
clusters.


> How can i achieve the same result WITHOUT using mahout, just running on
> the dataset , i even think it'll be in the same complexity (o(n^2))
>

Running with a good knn package will give you roughly O(n log n)
complexity.

Re: best way to join?

Posted by Ted Dunning <td...@maprtech.com>.
On Tue, Aug 28, 2012 at 9:48 AM, dexter morgan <de...@gmail.com>wrote:

>
> I understand your solution ( i think) , didn't think of that, in that
> particular way.
> I think that lets say i have 1M data-points, and running knn , that the
> k=1M and n=10 (each point is a cluster that requires up to 10 points)
> is an overkill.
>

I am not sure I understand you.  n = number of points.  k = number of
clusters.  For searching 1 million points, I would recommend thousands of
clusters.


> How can i achieve the same result WITHOUT using mahout, just running on
> the dataset , i even think it'll be in the same complexity (o(n^2))
>

Running with a good knn package will give you roughly O(n log n)
complexity.

Re: best way to join?

Posted by Ted Dunning <td...@maprtech.com>.
On Tue, Aug 28, 2012 at 9:48 AM, dexter morgan <de...@gmail.com>wrote:

>
> I understand your solution ( i think) , didn't think of that, in that
> particular way.
> I think that lets say i have 1M data-points, and running knn , that the
> k=1M and n=10 (each point is a cluster that requires up to 10 points)
> is an overkill.
>

I am not sure I understand you.  n = number of points.  k = number of
clusters.  For searching 1 million points, I would recommend thousands of
clusters.


> How can i achieve the same result WITHOUT using mahout, just running on
> the dataset , i even think it'll be in the same complexity (o(n^2))
>

Running with a good knn package will give you roughly O(n log n)
complexity.

Re: best way to join?

Posted by dexter morgan <de...@gmail.com>.
Dear Ted,

I understand your solution (I think); I just hadn't thought of it in that
particular way.
My concern is that if I have, say, 1M data points, then running knn with
k=1M and n=10 (each point being a cluster that keeps up to 10 points)
is overkill.

How can I achieve the same result WITHOUT using Mahout, just running over the
dataset and calculating the distance between each pair of distinct points?
I even think it'll be the same complexity (O(n^2)).

and maybe the reducer would just sort them in DESC order for each point.

Thank you!
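
For what it's worth, a minimal, untested sketch of what the reduce side of
that brute-force approach could look like.  It assumes an earlier map/shuffle
step has already emitted, for every point, a key shaped like "id[lat,lng]" and
candidate values shaped like "lat,lng,distance"; that record layout, the class
name and the "knn.n" property are illustrative assumptions, not something from
Mahout or from this thread:

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Keeps, for every point, only the N closest candidates and writes them out
// in the id[lat,lng] \t [[lat,lng],...] shape described in the original mail.
public class TopNNeighborsReducer extends Reducer<Text, Text, Text, Text> {

  private int n;

  @Override
  protected void setup(Context context) {
    // N is passed to the job as a configuration parameter
    n = context.getConfiguration().getInt("knn.n", 10);
  }

  @Override
  protected void reduce(Text pointKey, Iterable<Text> candidates, Context context)
      throws IOException, InterruptedException {
    List<double[]> all = new ArrayList<double[]>();
    for (Text c : candidates) {
      String[] p = c.toString().split(",");             // lat, lng, distance
      all.add(new double[] { Double.parseDouble(p[0]),
                             Double.parseDouble(p[1]),
                             Double.parseDouble(p[2]) });
    }
    // sort by distance, closest first, then keep only the first n
    Collections.sort(all, new Comparator<double[]>() {
      public int compare(double[] a, double[] b) {
        return Double.compare(a[2], b[2]);
      }
    });
    StringBuilder json = new StringBuilder("[");
    for (int i = 0; i < Math.min(n, all.size()); i++) {
      if (i > 0) json.append(',');
      json.append('[').append(all.get(i)[0]).append(',')
          .append(all.get(i)[1]).append(']');
    }
    json.append(']');
    context.write(pointKey, new Text(json.toString()));
  }
}

The expensive part is producing those candidate pairs in the first place; with
a plain self-comparison that really is O(n^2), which is exactly what Ted's
clustering suggestion is meant to avoid.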

On Tue, Aug 28, 2012 at 12:52 AM, Ted Dunning <td...@maprtech.com> wrote:

> Mahout is getting some very fast knn code in version 0.8.
>
> The basic work flow is that you would first do a large-scale clustering of
> the data.  Then you would make a second pass using the clustering to
> facilitate fast search for nearby points.
>
> The clustering will require two map-reduce jobs, one to find the cluster
> definitions and the second map-only to assign points to clusters in a form
> to be used by the second pass.  The second pass is a map-only process as
> well.
>
> In order to make this as efficient as possible, it is desirable to use a
> distribution of Hadoop that allows you to directly map the cluster data
> structures into shared memory.  IF you have NFS access to your cluster,
> this is easy.  Otherwise, it is considerably trickier.
>
>
> On Mon, Aug 27, 2012 at 4:15 PM, dexter morgan <de...@gmail.com>wrote:
>
>> Dear list,
>>
>> Lets say i have a file, like this:
>> id \t at,tlng <-- structure
>>
>> 1\t40.123,-50.432
>> 2\t41.431,-43.32
>> ...
>> ...
>> lets call it: 'points.txt'
>> I'm trying to build a map-reduce job that runs over this BIG points file
>> and it should output
>> a file, that will look like:
>> id[lat,lng] \t [list of points in JSON standard] <--- structure
>>
>> 1[40.123,-50.432]\t[[41.431,-43.32],[...,...],...,[...]]
>> 2[41.431,-43.32]\t[[40.123,-50.432],...[,]]
>> ...
>>
>> Basically it should run on ITSELF, and grab for each point the N (it will
>> be an argument for the job) CLOSEST points (the mappers should calculate
>> the distance)..
>>
>> Distributed cache is not an option, what else?  not sure if to classify
>> it as a map-join , reduce-join or both?
>> Would you do this in HIVE some how?
>> Is it feasible in a single job?
>>
>> Would LOVE to hear your suggestions, code (if you think its not that
>> hard) or what not.
>> BTW using CDH3 - rev 3 (20.23)
>>
>> Thanks!
>>
>
>

Re: best way to join?

Posted by Ted Dunning <td...@maprtech.com>.
Mahout is getting some very fast knn code in version 0.8.

The basic work flow is that you would first do a large-scale clustering of
the data.  Then you would make a second pass using the clustering to
facilitate fast search for nearby points.

The clustering will require two map-reduce jobs, one to find the cluster
definitions and the second map-only to assign points to clusters in a form
to be used by the second pass.  The second pass is a map-only process as
well.

In order to make this as efficient as possible, it is desirable to use a
distribution of Hadoop that allows you to directly map the cluster data
structures into shared memory.  IF you have NFS access to your cluster,
this is easy.  Otherwise, it is considerably trickier.
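
To make the shape of that second, map-only pass concrete, here is a rough,
untested sketch of a mapper that loads the cluster centroids once in setup()
and tags every input point with its nearest cluster.  The "centroids.path"
property and the plain id \t lat,lng centroid file format are assumptions for
illustration only, not Mahout's actual data structures:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-only pass: every point is emitted under the id of its nearest centroid,
// so the following search pass only compares points that share a cluster.
public class AssignToClusterMapper extends Mapper<LongWritable, Text, Text, Text> {

  private final List<double[]> centroids = new ArrayList<double[]>();

  @Override
  protected void setup(Context context) throws IOException {
    Path path = new Path(context.getConfiguration().get("centroids.path"));
    FileSystem fs = path.getFileSystem(context.getConfiguration());
    BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(path)));
    String line;
    while ((line = reader.readLine()) != null) {
      String[] latLng = line.split("\t")[1].split(",");   // id \t lat,lng
      centroids.add(new double[] { Double.parseDouble(latLng[0]),
                                   Double.parseDouble(latLng[1]) });
    }
    reader.close();
  }

  @Override
  protected void map(LongWritable offset, Text value, Context context)
      throws IOException, InterruptedException {
    String[] latLng = value.toString().split("\t")[1].split(",");
    double lat = Double.parseDouble(latLng[0]);
    double lng = Double.parseDouble(latLng[1]);
    int best = 0;
    double bestDist = Double.MAX_VALUE;
    for (int i = 0; i < centroids.size(); i++) {
      double dLat = lat - centroids.get(i)[0];
      double dLng = lng - centroids.get(i)[1];
      double d = dLat * dLat + dLng * dLng;   // squared distance is enough for the argmin
      if (d < bestDist) {
        bestDist = d;
        best = i;
      }
    }
    context.write(new Text(Integer.toString(best)), value);
  }
}

Restricting the later neighbor search to points that landed in the same (or a
neighboring) cluster is what keeps the total work far below the all-pairs
comparison.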

On Mon, Aug 27, 2012 at 4:15 PM, dexter morgan <de...@gmail.com>wrote:

> Dear list,
>
> Lets say i have a file, like this:
> id \t at,tlng <-- structure
>
> 1\t40.123,-50.432
> 2\t41.431,-43.32
> ...
> ...
> lets call it: 'points.txt'
> I'm trying to build a map-reduce job that runs over this BIG points file
> and it should output
> a file, that will look like:
> id[lat,lng] \t [list of points in JSON standard] <--- structure
>
> 1[40.123,-50.432]\t[[41.431,-43.32],[...,...],...,[...]]
> 2[41.431,-43.32]\t[[40.123,-50.432],...[,]]
> ...
>
> Basically it should run on ITSELF, and grab for each point the N (it will
> be an argument for the job) CLOSEST points (the mappers should calculate
> the distance)..
>
> Distributed cache is not an option, what else?  not sure if to classify it
> as a map-join , reduce-join or both?
> Would you do this in HIVE some how?
> Is it feasible in a single job?
>
> Would LOVE to hear your suggestions, code (if you think its not that hard)
> or what not.
> BTW using CDH3 - rev 3 (20.23)
>
> Thanks!
>

Re: best way to join?

Posted by dexter morgan <de...@gmail.com>.
Dear list,
>
> Lets say i have a file, like this:
> id \t at,tlng <-- structure
>
> 1\t40.123,-50.432
> 2\t41.431,-43.32
> ...
> ...
> lets call it: 'points.txt'
> I'm trying to build a map-reduce job that runs over this BIG points file
> and it should output
> a file, that will look like:
> id[lat,lng] \t [list of points in JSON standard] <--- structure
>
> 1[40.123,-50.432]\t[[41.431,-43.32],[...,...],...,[...]]
> 2[41.431,-43.32]\t[[40.123,-50.432],...[,]]
> ...
>
> Basically it should run on ITSELF, and grab for each point the N (it will
> be an argument for the job) CLOSEST points (the mappers should calculate
> the distance)..
>
> Distributed cache is not an option, what else?  not sure if to classify it
> as a map-join , reduce-join or both?
> Would you do this in HIVE some how?
> Is it feasible in a single job?
>
> Would LOVE to hear your suggestions, code (if you think its not that hard)
> or what not.
> BTW using CDH3 - rev 3 (20.23)
>
> Thanks!
>

Re: best way to join?

Posted by Mirko Kämpf <mi...@gmail.com>.
Hi Dexter,

I am not sure if I understood your requirements right,
so I will repeat them to define a starting point.

1.) You have a (static) list of points (the points.txt file)

2.) Now you want to calculate the nearest points to a set of given points.
Are the points which have to be considered in a different data set, or do
you look for the closest points "within" your big list (in the points.txt file)?

Let's assume the latter is what you want:

I suggest Mahout to do this. This could be helpful: "
https://cwiki.apache.org/MAHOUT/algorithms.html"

"Vector Similarity

Mahout contains implementations that allow one to compare one or more
vectors with another set of vectors. This can be useful if one is, for
instance, trying to calculate the pairwise similarity between all documents
(or a subset of docs) in a corpus.

   - RowSimilarityJob – Builds an inverted index and then computes
   distances between items that have co-occurrences. This is a fully
   distributed calculation.
   - VectorDistanceJob – Does a map side join between a set of "seed"
   vectors and all of the input vectors.

"

If you look for pairs within just one data set you use this as seed vectors
as well as input vectors, otherwise you use different files or portions of
it.
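
As a hand-rolled illustration of that map-side join idea (this is not
VectorDistanceJob's actual API, just a sketch that reuses the id \t lat,lng
layout from the original mail; the "seeds.path" property is made up), each
mapper loads the seed points once and then emits the distance from every
input point to every seed:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-side join: the small seed file is read into memory by every mapper,
// the big points file is streamed through map(); one record per point/seed pair.
public class SeedDistanceMapper extends Mapper<LongWritable, Text, Text, Text> {

  private final List<double[]> seeds = new ArrayList<double[]>();

  @Override
  protected void setup(Context context) throws IOException {
    Path path = new Path(context.getConfiguration().get("seeds.path"));
    FileSystem fs = path.getFileSystem(context.getConfiguration());
    BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(path)));
    String line;
    while ((line = reader.readLine()) != null) {
      String[] latLng = line.split("\t")[1].split(",");   // id \t lat,lng
      seeds.add(new double[] { Double.parseDouble(latLng[0]),
                               Double.parseDouble(latLng[1]) });
    }
    reader.close();
  }

  @Override
  protected void map(LongWritable offset, Text value, Context context)
      throws IOException, InterruptedException {
    String[] fields = value.toString().split("\t");       // id \t lat,lng
    String[] latLng = fields[1].split(",");
    double lat = Double.parseDouble(latLng[0]);
    double lng = Double.parseDouble(latLng[1]);
    for (double[] seed : seeds) {
      double dLat = lat - seed[0];
      double dLng = lng - seed[1];
      double dist = Math.sqrt(dLat * dLat + dLng * dLng);
      // key: the input point, value: the seed's coordinates plus the distance
      context.write(new Text(fields[0] + "[" + fields[1] + "]"),
                    new Text(seed[0] + "," + seed[1] + "," + dist));
    }
  }
}

If the seed file is the points file itself this degenerates into the all-pairs
comparison discussed earlier, so in practice the seeds would be a much smaller
set, for example the cluster centroids.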

I hope that helps.

Best wishes

Mirko




2012/9/9 dexter morgan <de...@gmail.com>

> Elmar,
>
> Right, thanks a lot for your help, If you'll read what Ted suggested its
> basically this. I'm interesting in knowing how to do this using JOIN
> (map-join + reducer-join i suppose) as well... though i'll go with the
> mahout approach
>
> Best,
> Dex
>
> On Tue, Sep 4, 2012 at 11:17 AM, Björn-Elmar Macek <macek@cs.uni-kassel.de
> > wrote:
>
>> Hi Dexter,
>>
>> i think, what you want is a clustering of points based on the euclidian
>> distance or density based clustering ( http://en.wikipedia.org/wiki/Cluster_analysis ). I
>> bet there are some implemented quite well in Mahout already: afaik this is
>> the datamining framework based on Hadoop.
>>
>> Best luck!
>> Elmar
>>
>>
>> Am 27.08.2012 22:15, schrieb dexter morgan:
>>
>>  Dear list,
>>>
>>> Lets say i have a file, like this:
>>> id \t at,tlng <-- structure
>>>
>>> 1\t40.123,-50.432
>>> 2\t41.431,-43.32
>>> ...
>>> ...
>>> lets call it: 'points.txt'
>>> I'm trying to build a map-reduce job that runs over this BIG points file
>>> and it should output
>>> a file, that will look like:
>>> id[lat,lng] \t [list of points in JSON standard] <--- structure
>>>
>>> 1[40.123,-50.432]\t[[41.431,-43.32],[...,...],...,[...]]
>>> 2[41.431,-43.32]\t[[40.123,-50.432],...[,]]
>>> ...
>>>
>>> Basically it should run on ITSELF, and grab for each point the N (it
>>> will be an argument for the job) CLOSEST points (the mappers should
>>> calculate the distance)..
>>>
>>> Distributed cache is not an option, what else?  not sure if to classify
>>> it as a map-join , reduce-join or both?
>>> Would you do this in HIVE some how?
>>> Is it feasible in a single job?
>>>
>>> Would LOVE to hear your suggestions, code (if you think its not that
>>> hard) or what not.
>>> BTW using CDH3 - rev 3 (20.23)
>>>
>>> Thanks!
>>>
>>
>>
>

Re: best way to join?

Posted by dexter morgan <de...@gmail.com>.
Elmar,

Right, thanks a lot for your help. If you read what Ted suggested, it's
basically this. I'm interested in knowing how to do this using a JOIN
(map-join + reduce-join, I suppose) as well... though I'll go with the
Mahout approach.

Best,
Dex

On Tue, Sep 4, 2012 at 11:17 AM, Björn-Elmar Macek
<ma...@cs.uni-kassel.de>wrote:

> Hi Dexter,
>
> i think, what you want is a clustering of points based on the euclidian
> distance or density based clustering ( http://en.wikipedia.org/wiki/Cluster_analysis ). I bet
> there are some implemented quite well in Mahout already: afaik this is the
> datamining framework based on Hadoop.
>
> Best luck!
> Elmar
>
>
> Am 27.08.2012 22:15, schrieb dexter morgan:
>
>  Dear list,
>>
>> Lets say i have a file, like this:
>> id \t at,tlng <-- structure
>>
>> 1\t40.123,-50.432
>> 2\t41.431,-43.32
>> ...
>> ...
>> lets call it: 'points.txt'
>> I'm trying to build a map-reduce job that runs over this BIG points file
>> and it should output
>> a file, that will look like:
>> id[lat,lng] \t [list of points in JSON standard] <--- structure
>>
>> 1[40.123,-50.432]\t[[41.431,-43.32],[...,...],...,[...]]
>> 2[41.431,-43.32]\t[[40.123,-50.432],...[,]]
>> ...
>>
>> Basically it should run on ITSELF, and grab for each point the N (it will
>> be an argument for the job) CLOSEST points (the mappers should calculate
>> the distance)..
>>
>> Distributed cache is not an option, what else?  not sure if to classify
>> it as a map-join , reduce-join or both?
>> Would you do this in HIVE some how?
>> Is it feasible in a single job?
>>
>> Would LOVE to hear your suggestions, code (if you think its not that
>> hard) or what not.
>> BTW using CDH3 - rev 3 (20.23)
>>
>> Thanks!
>>
>
>

Re: best way to join?

Posted by Björn-Elmar Macek <ma...@cs.uni-kassel.de>.
Hi Dexter,

I think what you want is a clustering of points based on the Euclidean
distance, or density-based clustering (
http://en.wikipedia.org/wiki/Cluster_analysis ). I bet there are some
implemented quite well in Mahout already: afaik this is the data-mining
framework based on Hadoop.

Best luck!
Elmar


Am 27.08.2012 22:15, schrieb dexter morgan:
> Dear list,
>
> Lets say i have a file, like this:
> id \t at,tlng <-- structure
>
> 1\t40.123,-50.432
> 2\t41.431,-43.32
> ...
> ...
> lets call it: 'points.txt'
> I'm trying to build a map-reduce job that runs over this BIG points 
> file and it should output
> a file, that will look like:
> id[lat,lng] \t [list of points in JSON standard] <--- structure
>
> 1[40.123,-50.432]\t[[41.431,-43.32],[...,...],...,[...]]
> 2[41.431,-43.32]\t[[40.123,-50.432],...[,]]
> ...
>
> Basically it should run on ITSELF, and grab for each point the N (it 
> will be an argument for the job) CLOSEST points (the mappers should 
> calculate the distance)..
>
> Distributed cache is not an option, what else?  not sure if to 
> classify it as a map-join , reduce-join or both?
> Would you do this in HIVE some how?
> Is it feasible in a single job?
>
> Would LOVE to hear your suggestions, code (if you think its not that 
> hard) or what not.
> BTW using CDH3 - rev 3 (20.23)
>
> Thanks!

