Posted to user@mahout.apache.org by Jonathan Seale <jo...@gmail.com> on 2015/05/14 05:15:19 UTC

Row Similarity

Scientists,

I have an astrophysical application for Mahout that I need help with.

I have 1-dimensional stellar spectra for many, many stars. Each spectrum
consists of a series of intensity values, one per wavelength of light. I
need to be able to find the cosine similarity between ALL pairs of stars.
Seems to me this is simply a user-user similarity problem where I have
stars instead of users, wavelengths instead of items, and intensities
instead of ratings/clicks.

But I'm having difficulty using Mahout's row similarity package (I'm new to
this, and these days astronomers code pretty exclusively in Python). I know
that I have to 1) create a sparse matrix where each row is a star,
columns are wavelengths, and the values are intensities, and 2) implement row
similarity. But I'm just not sure how to do it. Does anyone have a good resource,
or is anyone willing to help? I could probably offer some compensation to anyone
who would be willing to provide a little focussed, personalized assistance.

Thanks,
Jonathan

Re: Row Similarity

Posted by Suneel Marthi <sm...@apache.org>.
There used to be an online page on mahout.apache.org that Pat Ferrel had
put together a few years ago.
Not sure if it's still around, Pat?

If not, I can write up more detailed steps later today and send them your way.


Re: Row Similarity

Posted by Jonathan Seale <jo...@gmail.com>.
Thanks, guys. Can you recommend any resources that show an example of these
steps? A google search returns very little information. Now I know what to
do, but I can't find anything that tells me how to do it.



Re: Row Similarity

Posted by Suneel Marthi <sm...@apache.org>.
Hi Jonathan,

Here's what you need to do to run RowSimilarity on your CSV-formatted data.  You
would have to use the MapReduce version since the Spark version only
supports LLR.

1. Convert CSV to Vectors - use CSVIterator and store the vectors as
SequenceFiles.
2. Run RowIDJob on the SequenceFile output of (1). This should generate a
Matrix of <IntWritable, VectorWritable> and a docIndex of <IntWritable,
Text>.
3. Run RowSimilarityJob on the matrix output from (2), specifying
CosineDistance and a cutoff threshold. This should generate a matrix of
Rows -> Most similar rows with distances.





Re: Row Similarity

Posted by Jonathan Seale <jo...@gmail.com>.
Thanks, Charlie,

The data has been through lots of processing, but in an attempt to make it
more Mahout-friendly, I've converted it into a single csv table with
columns: star_id, wavelength, intensity. My motivation was to make it like
a user_id, item_id, rating table you might see in other Mahout uses.
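
For anyone following along in Python, here is a minimal sketch of how a long-form
table like this could be pivoted back into one row per star, which is the shape
the row-similarity and matrix approaches discussed in this thread expect. The
header names come from the columns listed above; the function name
pivot_spectra, the sample values, and filling missing wavelengths with 0.0 are
assumptions made for illustration only.

```python
import csv
import io
from collections import defaultdict

def pivot_spectra(csv_text):
    """Turn (star_id, wavelength, intensity) rows into one intensity row per star."""
    by_star = defaultdict(dict)
    for row in csv.DictReader(io.StringIO(csv_text)):
        by_star[row["star_id"]][float(row["wavelength"])] = float(row["intensity"])
    # Use the sorted union of all wavelengths as the column order.
    wavelengths = sorted({w for spectrum in by_star.values() for w in spectrum})
    star_ids = sorted(by_star)
    # Assumption: a missing (star, wavelength) pair is filled with 0.0.
    matrix = [[by_star[s].get(w, 0.0) for w in wavelengths] for s in star_ids]
    return star_ids, wavelengths, matrix

# Hypothetical sample data with a header line.
sample = """star_id,wavelength,intensity
s1,400.0,1.0
s1,500.0,2.0
s2,400.0,3.0
s2,500.0,4.0
"""
star_ids, wavelengths, matrix = pivot_spectra(sample)
```

Row i of matrix is then the spectrum of star_ids[i], with columns ordered as in
wavelengths.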

As opposed to using my local machine, I've set up an instance on Amazon with
hopes of turning this into a remote service. So the install is whatever
comes with Amazon's default Mahout installation.

Jonathan




Re: Row Similarity

Posted by Charlie Hack <ch...@gmail.com>.
Hi Jonathan, how do you have the data stored? The more info about your setup, the better.


Charlie

Re: Row Similarity

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Agree with Ted, the rowsimilarity code relies on a good deal of downsampling to execute in O(n). It sounds like your data doesn’t lend itself to downsampling. You could disable downsampling, but that would give you an O(n^2) execution time.

The tools referenced seem to apply more directly to your problem, but any Lucene-based search engine will give you cosine kNN. You’d have to encode your spectra into wavelength buckets (sounds like this is what you have), then encode them as Lucene vectors with weights equal to the spectra bucket magnitude for a given star. So the “document” would be a star-id followed by weighted spectra-bucket-ids. Once you index this, the query is the vector for a given star and the result is the kNN, spectra-wise, using cosine. You also want to make sure TF-IDF weighting is disabled.



Re: Row Similarity

Posted by Ted Dunning <te...@gmail.com>.
Actually, this is probably done more easily using a simple matrix
multiplication.  The reason for not using recommendation code for this is
that your problem is entirely dense.

How exactly you should go about this is a different question.  Up to tens
of thousands of stars, you can probably do this on a single machine using
pretty standard tools like R or matlab.
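
Since the original poster works in Python, here is a small sketch of the
single-machine version of this idea. It assumes each spectrum is first scaled
to unit length, so that the plain dot products in A A' become cosine
similarities. The function names and the tiny example matrix are made up for
illustration, and pure Python is used only to keep the sketch self-contained;
for real data a numerical library would be the sensible choice.

```python
import math

def normalize_rows(matrix):
    """Scale every row to unit Euclidean length (all-zero rows are left as is)."""
    out = []
    for row in matrix:
        norm = math.sqrt(sum(x * x for x in row))
        out.append([x / norm for x in row] if norm > 0 else list(row))
    return out

def cosine_matrix(matrix):
    """Return A_hat A_hat', whose (i, j) entry is the cosine between rows i and j."""
    unit = normalize_rows(matrix)
    return [[sum(a * b for a, b in zip(u, v)) for v in unit] for u in unit]

# Hypothetical spectra: three stars, two wavelength bins each.
spectra = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
sims = cosine_matrix(spectra)
```

The result is symmetric with ones on the diagonal, so only the upper triangle
actually needs to be computed or stored.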

For larger problems, you will need to parallelize the problem.  Essentially,
if A contains your data this turns into either A A' (if stars are rows) or A'
A (if stars are columns).  The real problem is that your output is going to
be as big as the number of stars, squared.  This will probably limit the
feasibility of this computation.  A million stars will result in something
like 10TB of output.
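
A quick back-of-the-envelope check of that output size, as a sketch. It assumes
the full square matrix is stored with one fixed-width value per pair of stars;
the function name and the choice of 8-byte and 4-byte values are assumptions,
not something stated in the thread.

```python
def similarity_output_bytes(n_stars, bytes_per_value=8):
    """Size of a full n x n similarity matrix stored as fixed-width values."""
    return n_stars * n_stars * bytes_per_value

as_doubles = similarity_output_bytes(1_000_000)      # 8-byte doubles
as_floats = similarity_output_bytes(1_000_000, 4)    # 4-byte floats
print(as_doubles / 1e12, "TB as 8-byte values")
print(as_floats / 1e12, "TB as 4-byte values")
```

Either way the result is on the order of terabytes for a million stars, in line
with the rough figure above.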

Assuming you have a million stars and each spectrum contains a few thousand
observations, the way I would go about this computation would be to store
each spectrum as a row and divide your data file into batches of rows.
Call the full matrix A and each batch of rows A_1 ... A_n.  Each batch
should have however many rows it takes to get a matrix product A_i A_j' to
take 30-100 seconds.

Now, all you have to do is schedule the multiplication of every pair of A_i
and A_j.  How you do that and how you store the data won't matter very much
because the computation costs will outweigh the scheduling and I/O costs.
The output will consist of matrices B_ij that each contain the dot products
between all of the stars in A_i and all of the stars in A_j.   To find the
dot product of two arbitrary stars, you first have to find which batches
they are in, and then you need to find their product in the corresponding
B_ij file.  You should probably check out some of the efficient math
packages for doing the local multiplications.
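
The batching scheme above can be sketched in a few lines of Python. This is only
an in-memory illustration under simplifying assumptions: the blocks are kept in
a dict rather than written to B_ij files, the pairs are computed in a loop
rather than scheduled across machines, and all function names are hypothetical.

```python
def split_batches(rows, batch_size):
    """Split the rows of A into consecutive batches A_1 ... A_n."""
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]

def block_product(batch_i, batch_j):
    """B_ij = A_i A_j': dot products of every row in A_i with every row in A_j."""
    return [[sum(a * b for a, b in zip(u, v)) for v in batch_j] for u in batch_i]

def all_blocks(rows, batch_size):
    batches = split_batches(rows, batch_size)
    # Each (i, j) pair is independent work that a scheduler could farm out.
    return {(i, j): block_product(bi, bj)
            for i, bi in enumerate(batches)
            for j, bj in enumerate(batches)}

def lookup(blocks, batch_size, star_a, star_b):
    """Find the dot product of two stars given their global row numbers."""
    i, r = divmod(star_a, batch_size)   # batch index and row within the batch
    j, c = divmod(star_b, batch_size)
    return blocks[(i, j)][r][c]

# Hypothetical data: five stars, two wavelength bins, batches of two rows.
rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0], [0.0, 3.0]]
blocks = all_blocks(rows, batch_size=2)
```

Because B_ji is just the transpose of B_ij, a real run would only need the
pairs with i <= j.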

My guess is that this is very much not what you really want to be doing.

It is much more likely that you want to have an efficient nearest neighbor
search engine so that you can quickly find the, say, thousand most similar
stars given any query star.  That can be done with packages like FLANN [1]
or others [2].  Mahout will not help you with this given the dense nature
of your data.

[1] http://www.cs.ubc.ca/research/flann/
[2] https://www.cs.umd.edu/~mount/ANN/
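
To show the shape of the nearest-neighbor query described above, here is a
brute-force Python sketch. It is an exact linear scan, not the approximate
indexing that FLANN or ANN perform, so it only stands in for those packages on
small data. The function names, the star ids, and the example spectra are all
hypothetical.

```python
import heapq
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def nearest_stars(spectra, query_id, k):
    """Return the k star ids most similar to query_id, best first."""
    query = spectra[query_id]
    scored = ((cosine(query, s), sid) for sid, s in spectra.items() if sid != query_id)
    return [sid for _, sid in heapq.nlargest(k, scored)]

# Hypothetical spectra keyed by star id.
spectra = {
    "a": [1.0, 0.0, 0.0],
    "b": [0.9, 0.1, 0.0],
    "c": [0.0, 1.0, 0.0],
    "d": [0.5, 0.5, 0.0],
}
neighbors = nearest_stars(spectra, "a", k=2)
```

Each query costs one pass over all stars, which is the part a proper
nearest-neighbor index is meant to avoid.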


