Posted to dev@mahout.apache.org by Andrew Palumbo <ap...@outlook.com> on 2015/02/04 16:47:59 UTC

TF-IDF, seq2sparse and DataFrame support

Just copied over the relevant last few messages to keep the other thread 
on topic...


On 02/03/2015 08:22 PM, Dmitriy Lyubimov wrote:
> I'd suggest to consider this: remember all this talk about
> language-integrated spark ql being basically dataframe manipulation DSL?
>
> so now Spark devs are noticing this generality as well and are actually
> proposing to rename SchemaRDD into DataFrame and make it mainstream data
> structure. (my "told you so" moment of sorts.)
>
> What i am getting at, i'd suggest to make DRM and Spark's newly renamed
> DataFrame our two major structures. In particular, standardize on using
> DataFrame for things that may include non-numerical data and require more
> grace about column naming and manipulation. Maybe relevant to TF-IDF work
> when it deals with non-matrix content.
Sounds like a worthy effort to me.  We'd basically be implementing an 
API at the math-scala level for SchemaRDD/DataFrame data structures, correct?

  On Tue, Feb 3, 2015 at 5:01 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:
>> Seems like seq2sparse would be really easy to replace since it takes 
>> text
>> files to start with, then the whole pipeline could be kept in rdds. The
>> dictionaries and counts could be either in-memory maps or rdds for 
>> use with
>> joins? This would get rid of sequence files completely from the 
>> pipeline.
>> Item similarity uses in-memory maps but the plan is to make it more
>> scalable using joins as an alternative with the same API allowing the 
>> user
>> to trade-off footprint for speed.

I think you're right- it should be relatively easy.  I've been looking at 
porting seq2sparse to the DSL for a bit now, and the stopper at the DSL 
level is that we don't have a distributed data structure for 
strings.  Seems like getting a DataFrame implemented, as Dmitriy mentioned 
above, would take care of this problem.
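
As a rough sketch of the RDD-based pipeline Pat describes above (purely
illustrative; the names docs, sketchVectorize and the naive tokenizer are
placeholders, not existing Mahout code), the tokenizing and dictionary-building
steps might look something like this:

import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// (docId, text) pairs in; sparse term-count maps plus an in-memory dictionary out
def sketchVectorize(docs: RDD[(String, String)])
    : (RDD[(String, Map[Int, Double])], Map[String, Int]) = {

  // naive whitespace/punctuation tokenization; a real port would plug in a Lucene analyzer here
  val tokenized = docs.mapValues(_.toLowerCase.split("\\W+").filter(_.nonEmpty))

  // dictionary: term -> column index, collected to the driver and broadcast
  val dictionary = tokenized.flatMap(_._2).distinct().collect().zipWithIndex.toMap
  val dictBc = docs.sparkContext.broadcast(dictionary)

  // per-document term frequencies keyed by dictionary index
  val tfVectors = tokenized.mapValues { tokens =>
    tokens.groupBy(identity).map { case (t, ts) => dictBc.value(t) -> ts.length.toDouble }
  }

  (tfVectors, dictionary)
}

The dictionary here is an in-memory map; swapping it for an RDD and a join, as
suggested above, would be the scalable variant.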

The other issue I'm a little fuzzy on is the distributed collocation 
mapping- it's a part of the seq2sparse code that I've not spent too 
much time in.

I think that this would be a very worthy effort as well- I believe 
seq2sparse is a particularly strong Mahout feature.

I'll start another thread since we're now way off topic from the 
refactoring proposal.

My use for TF-IDF is for row similarity and would take a DRM (actually
IndexedDataset) and calculate row/doc similarities. It works now but only
using LLR. This is OK when thinking of the items as tags or metadata but
for text tokens something like cosine may be better.

I’d imagine a downsampling phase that would precede TF-IDF using LLR a lot
like how CF preferences are downsampled. This would produce a sparsified
all-docs DRM. Then (if the counts were saved) TF-IDF would re-weight the
terms before row similarity uses cosine. This is not so good for search but
should produce much better similarities than Solr’s “moreLikeThis” and does
it for all pairs rather than one at a time.

In any case it can be used to create a personalized content-based
recommender or to augment a CF recommender with one more indicator type.
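
For what it's worth, a hedged sketch of the all-pairs cosine step on a TF-IDF
weighted DRM (assuming drmTfIdf is a DrmLike[Int]; collecting the gram matrix
in-core like this is only reasonable for a modest number of documents):

import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.scalabindings.RLikeOps._

// gram matrix of row (document) dot products: S = A A'
val gram = (drmTfIdf %*% drmTfIdf.t).collect

// divide entry (i, j) by ||row i|| * ||row j|| to turn dot products into cosines
val norms = (0 until gram.nrow).map(i => math.sqrt(gram(i, i)))
for (i <- 0 until gram.nrow; j <- 0 until gram.ncol)
  gram(i, j) = gram(i, j) / (norms(i) * norms(j))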

On Feb 3, 2015, at 3:37 PM, Andrew Palumbo <ap...@outlook.com> wrote:


On 02/03/2015 12:44 PM, Andrew Palumbo wrote:
> On 02/03/2015 12:22 PM, Pat Ferrel wrote:
>> Some issues WRT lower level Spark integration:
>> 1) interoperability with Spark data. TF-IDF is one example I actually
looked at. There may be other things we can pick up from their committers
since they have an abundance.
>> 2) wider acceptance of Mahout DSL. The DSL’s power was illustrated to
me when someone on the Spark list asked about matrix transpose and an MLlib
committer’s answer was something like “why would you want to do that?”.
Usually you don’t actually execute the transpose but they don’t even
support A’A, AA’, or A’B, which are core to what I work on. At present you
pretty much have to choose between MLlib or Mahout for sparse matrix stuff.
Maybe a half-way measure is some implicit conversions (ugh, I know). If the
DSL could interchange datasets with MLlib, people would be pointed to the
DSL for all of a bunch of “why would you want to do that?” features. MLlib
seems to be algorithms, not math.
>> 3) integration of Streaming. DStreams support most of the RDD
interface. Doing a batch recalc on a moving time window would nearly fall
out of DStream backed DRMs. This isn’t the same as incremental updates on
streaming but it’s a start.
>> Last year we were looking at Hadoop Mapreduce vs Spark, H2O, Flink
faster compute engines. So we jumped. Now the need is for streaming and
especially incrementally updated streaming. Seems like we need to address
this.
>> Andrew, regardless of the above having TF-IDF would be super
helpful—row similarity for content/text would benefit greatly.
>    I will put a PR up soon.
Just to clarify, I'll be porting the (very simple) TF and TFIDF classes
and the Weight interface over from mr-legacy to math-scala. They're available
now in spark-shell but won't be after this refactoring.  These still
require a dictionary and a frequency-count map to vectorize incoming text-
so they're more for use with the old MR seq2sparse, and I don't think they
can be used with Spark's HashingTF and IDF.  I'll put them up soon.
Hopefully they'll be of some use.
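
To make the dictionary/frequency-count dependency concrete, here is a minimal
sketch (not the actual classes in the PR; the smoothed-log IDF below is just
one reasonable choice) of vectorizing a single new document:

import org.apache.mahout.math.{RandomAccessSparseVector, Vector}

// dictionary: term -> column index; dfMap: column index -> document frequency
def vectorizeDoc(tokens: Seq[String],
                 dictionary: Map[String, Int],
                 dfMap: Map[Int, Long],
                 numDocs: Long): Vector = {
  val v: Vector = new RandomAccessSparseVector(dictionary.size)
  tokens.groupBy(identity).foreach { case (term, occurrences) =>
    dictionary.get(term).foreach { idx =>
      val tf = occurrences.size.toDouble
      val idf = math.log(numDocs.toDouble / (dfMap.getOrElse(idx, 0L) + 1.0)) + 1.0
      v.setQuick(idx, tf * idf)
    }
  }
  v
}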

Re: TF-IDF, seq2sparse and DataFrame support

Posted by Andrew Palumbo <ap...@outlook.com>.
We should get a JIRA going for this and try to get this in for 0.10.1.

On 03/24/2015 04:32 PM, Gokhan Capan wrote:
> Andrew,
>
> Maybe making class tag evident in mapBlock calls?, i.e:
> val tfIdfMatrix = tfMatrix.mapBlock(..){
>                      ...idf transformation, etc...
>                    }(drmMetadata.keyClassTag.asInstanceOf[ClassTag[Any]])
>
> Best,
> Gokhan
>
> On Tue, Mar 17, 2015 at 6:06 PM, Andrew Palumbo <ap...@outlook.com> wrote:
>
>> This (last commit on this branch) should be the beginning of a workaround
>> for the problem of reading and returning a Generic-Writable keyed Drm:
>>
>> https://github.com/gcapan/mahout/commit/cd737cf1b7672d3d73fe206c7bad30
>> aae3f37e14
>>
>> However the keyClassTag of the DrmLike returned by the  mapBlock() calls
>> and finally by the method itself is somehow converted to object.  I'm not
>> exactly sure why this is happening.  I think that the implicit evidence is
>> being dropped in the mapBlock call on a [Object]casted CheckPointedDrm.
>> Maybe by calling it out of the scope of this method (breaking down the
>> method would fix it.)
>
>> val tfMatrix = drmMetadata.keyClassTag match {
>>
>>   case ct if ct == ClassTag.Int => {
>>     (drmWrap(rdd = tfVectors, ncol = numCols, cacheHint = CacheHint.NONE)
>>       (keyClassTag.asInstanceOf[ClassTag[Any]])).asInstanceOf[CheckpointedDrmSpark[Int]]
>>   }
>>   case ct if ct == ClassTag(classOf[String]) => {
>>     (drmWrap(rdd = tfVectors, ncol = numCols, cacheHint = CacheHint.NONE)
>>       (keyClassTag.asInstanceOf[ClassTag[Any]])).asInstanceOf[CheckpointedDrmSpark[String]]
>>   }
>>   case ct if ct == ClassTag.Long => {
>>     (drmWrap(rdd = tfVectors, ncol = numCols, cacheHint = CacheHint.NONE)
>>       (keyClassTag.asInstanceOf[ClassTag[Any]])).asInstanceOf[CheckpointedDrmSpark[Long]]
>>   }
>>   case _ => {
>>     (drmWrap(rdd = tfVectors, ncol = numCols, cacheHint = CacheHint.NONE)
>>       (keyClassTag.asInstanceOf[ClassTag[Any]])).asInstanceOf[CheckpointedDrmSpark[Int]]
>>   }
>> }
>>
>> tfMatrix.checkpoint()
>>
>> // make sure that the classtag of the tf matrix matches the metadata
>> keyClasstag
>> assert(tfMatrix.keyClassTag == drmMetadata.keyClassTag)  <-- Passes here
>> with eg. String keys
>>
>> val tfIdfMatrix = tfMatrix.mapBlock(..){
>>                      ...idf transformation, etc...
>>                    }
>>
>> assert(tfIdfMatrix.keyClassTag == drmMetadata.keyClassTag)  <-- Fails here
>> for all with tfIdfMatrix.keyClassTag
>>                                                                  as an
>> Object.
>>
>>
>> I'll keep looking at it a bit.  If anybody has any ideas please let me
>> know.
>>
>>
>>
>>
>>
>>
>>
>> On 03/09/2015 02:12 PM, Gokhan Capan wrote:
>>
>>> So, here is a sketch of a Spark implementation of seq2sparse, returning a
>>> (matrix:DrmLike, dictionary:Map):
>>>
>>> https://github.com/gcapan/mahout/tree/seq2sparse
>>>
>>> Although it should be possible, I couldn't manage to make it process
>>> non-integer document ids. Any fix would be appreciated. There is a simple
>>> test attached, but I think there is more to do in terms of handling all
>>> parameters of the original seq2sparse implementation.
>>>
>>> I put it directly to the SparkEngine ---not that I think of this object is
>>> the most appropriate placeholder, it just seemed convenient to me.
>>>
>>> Best
>>>
>>>
>>> Gokhan
>>>
>>> On Thu, Feb 5, 2015 at 3:48 AM, Pat Ferrel<pa...@occamsmachete.com>  wrote:
>>>
>>>   IndexedDataset might suffice until real DataFrames come along.
>>>> On Feb 4, 2015, at 3:42 PM, Dmitriy Lyubimov<dl...@gmail.com>  wrote:
>>>>
>>>> Dealing with dictionaries is inevitably DataFrame for seq2sparse. It is a
>>>> byproduct of it IIRC. matrix definitely not a structure to hold those.
>>>>
>>>> On Wed, Feb 4, 2015 at 9:16 AM, Andrew Palumbo<ap...@outlook.com>
>>>> wrote:
>>>>
>>>>   On 02/04/2015 11:13 AM, Pat Ferrel wrote:
>>>>>   Andrew, not sure what you mean about storing strings. If you mean
>>>>>> something like a DRM of tokens, that is a DataFrame with row=doc column
>>>>>>
>>>>> =
>>>>> token. A one row DataFrame is a slightly heavy weight string/document. A
>>>>>> DataFrame with token counts would be perfect for input TF-IDF, no? It
>>>>>>
>>>>> would
>>>>> be a vector that maintains the tokens as ids for the counts, right?
>>>>>>   Yes- dataframes will be perfect for this.  The problem that i was
>>>>> referring to was that we dont have a DSL Data Structure to to do the
>>>>> initial distributed tokenizing of the documents[1] line:257, [2] . For
>>>>>
>>>> this
>>>>
>>>>> I believe we would need something like a Distributed vector of Strings
>>>>>
>>>> that
>>>>
>>>>> could be broadcast to a mapBlock closure and then tokenized from there.
>>>>> Even there, MapBlock may not be perfect for this, but some of the new
>>>>> Distributed functions that Gockhan is working on may.
>>>>>
>>>>>   I agree seq2sparse type input is a strong feature. Text files into an
>>>>>> all-documents DataFrame basically. Colocation?
>>>>>>
>>>>>>   as far as collocations i believe that the n-gram are computed and
>>>>> counted
>>>>> in the CollocDriver [3] (i might be wrong her...its been a while since i
>>>>> looked at the code...) either way, I dont think I ever looked too
>>>>> closely
>>>>> and i was a bit fuzzy on this...
>>>>>
>>>>> These were just some thoughts that I had when briefly looking at porting
>>>>> seq2sparse to the DSL before.. Obviously we don't have to follow this
>>>>> algorithm but its a nice starting point.
>>>>>
>>>>> [1]https://github.com/apache/mahout/blob/master/mrlegacy/
>>>>> src/main/java/org/apache/mahout/vectorizer/
>>>>> SparseVectorsFromSequenceFiles
>>>>> .java
>>>>> [2]https://github.com/apache/mahout/blob/master/mrlegacy/
>>>>> src/main/java/org/apache/mahout/vectorizer/DocumentProcessor.java
>>>>> [3]https://github.com/apache/mahout/blob/master/mrlegacy/
>>>>> src/main/java/org/apache/mahout/vectorizer/
>>>>> collocations/llr/CollocDriver.
>>>>> java
>>>>>
>>>>>
>>>>>
>>>>>   On Feb 4, 2015, at 7:47 AM, Andrew Palumbo<ap...@outlook.com>  wrote:
>>>>>> Just copied over the relevant last few messages to keep the other
>>>>>> thread
>>>>>> on topic...
>>>>>>
>>>>>>
>>>>>> On 02/03/2015 08:22 PM, Dmitriy Lyubimov wrote:
>>>>>>
>>>>>>   I'd suggest to consider this: remember all this talk about
>>>>>>> language-integrated spark ql being basically dataframe manipulation
>>>>>>>
>>>>>> DSL?
>>>>> so now Spark devs are noticing this generality as well and are actually
>>>>>>> proposing to rename SchemaRDD into DataFrame and make it mainstream
>>>>>>>
>>>>>> data
>>>>> structure. (my "told you so" moment of sorts
>>>>>>> What i am getting at, i'd suggest to make DRM and Spark's newly
>>>>>>> renamed
>>>>>>> DataFrame our two major structures. In particular, standardize on
>>>>>>> using
>>>>>>> DataFrame for things that may include non-numerical data and require
>>>>>>>
>>>>>> more
>>>>> grace about column naming and manipulation. Maybe relevant to TF-IDF
>>>>>> work
>>>>> when it deals with non-matrix content.
>>>>>>>   Sounds like a worthy effort to me.  We'd be basically implementing an
>>>>> API
>>>>> at the math-scala level for SchemaRDD/Dataframe datastructures correct?
>>>>>> On Tue, Feb 3, 2015 at 5:01 PM, Pat Ferrel<pa...@occamsmachete.com>
>>>>>>
>>>>> wrote:
>>>>> Seems like seq2sparse would be really easy to replace since it takes
>>>>>> text
>>>>> files to start with, then the whole pipeline could be kept in rdds.
>>>>>>> The
>>>>> dictionaries and counts could be either in-memory maps or rdds for use
>>>>>>>> with
>>>>>>>> joins? This would get rid of sequence files completely from the
>>>>>>>> pipeline.
>>>>>>>> Item similarity uses in-memory maps but the plan is to make it more
>>>>>>>> scalable using joins as an alternative with the same API allowing the
>>>>>>>> user
>>>>>>>> to trade-off footprint for speed.
>>>>>>>>
>>>>>>>>   I think you're right- should be relatively easy.  I've been looking
>>>>>>> at
>>>>>>>
>>>>>> porting seq2sparse  to the DSL for bit now and the stopper at the DSL
>>>>>>
>>>>> level
>>>>> is that we don't have a distributed data structure for strings..Seems
>>>>> like
>>>>> getting a DataFrame implemented as Dmitriy mentioned above would take
>>>>> care
>>>>> of this problem.
>>>>>> The other issue i'm a little fuzzy on  is the distributed collocation
>>>>>> mapping-  it's a part of the seq2sparse code that I've not spent too
>>>>>>
>>>>> much
>>>>> time in.
>>>>>> I think that this would be very worthy effort as well-  I believe
>>>>>> seq2sparse is a particular strong mahout feature.
>>>>>>
>>>>>> I'll start another thread since we're now way off topic from the
>>>>>> refactoring proposal.
>>>>>>
>>>>>> My use for TF-IDF is for row similarity and would take a DRM (actually
>>>>>> IndexedDataset) and calculate row/doc similarities. It works now but
>>>>>>
>>>>> only
>>>>> using LLR. This is OK when thinking of the items as tags or metadata but
>>>>>> for text tokens something like cosine may be better.
>>>>>>
>>>>>> I’d imagine a downsampling phase that would precede TF-IDF using LLR a
>>>>>>
>>>>> lot
>>>>> like how CF preferences are downsampled. This would produce an
>>>>> sparsified
>>>>> all-docs DRM. Then (if the counts were saved) TF-IDF would re-weight the
>>>>>> terms before row similarity uses cosine. This is not so good for search
>>>>>> but
>>>>>> should produce much better similarities than Solr’s “moreLikeThis” and
>>>>>> does
>>>>>> it for all pairs rather than one at a time.
>>>>>>
>>>>>> In any case it can be used to do a create a personalized content-based
>>>>>> recommender or augment a CF recommender with one more indicator type.
>>>>>>
>>>>>> On Feb 3, 2015, at 3:37 PM, Andrew Palumbo<ap...@outlook.com>  wrote:
>>>>>>
>>>>>>
>>>>>> On 02/03/2015 12:44 PM, Andrew Palumbo wrote:
>>>>>>
>>>>>>   On 02/03/2015 12:22 PM, Pat Ferrel wrote:
>>>>>>>   Some issues WRT lower level Spark integration:
>>>>>>>> 1) interoperability with Spark data. TF-IDF is one example I actually
>>>>>>>>
>>>>>>>>   looked at. There may be other things we can pick up from their
>>>>>> committers
>>>>> since they have an abundance.
>>>>>>   2) wider acceptance of Mahout DSL. The DSL’s power was illustrated to
>>>>>>> me when someone on the Spark list asked about matrix transpose and an
>>>>>>>
>>>>>> MLlib
>>>>>> committer’s answer was something like “why would you want to do that?”.
>>>>>> Usually you don’t actually execute the transpose but they don’t even
>>>>>> support A’A, AA’, or A’B, which are core to what I work on. At present
>>>>>>
>>>>> you
>>>>> pretty much have to choose between MLlib or Mahout for sparse matrix
>>>>>> stuff.
>>>>>> Maybe a half-way measure is some implicit conversions (ugh, I know). If
>>>>>> the
>>>>>> DSL could interchange datasets with MLlib, people would be pointed to
>>>>>>
>>>>> the
>>>>> DSL for all of a bunch of “why would you want to do that?” features.
>>>>> MLlib
>>>>> seems to be algorithms, not math.
>>>>>>   3) integration of Streaming. DStreams support most of the RDD
>>>>>>> interface. Doing a batch recalc on a moving time window would nearly
>>>>>>>
>>>>>> fall
>>>>> out of DStream backed DRMs. This isn’t the same as incremental updates
>>>>> on
>>>>> streaming but it’s a start.
>>>>>>   Last year we were looking at Hadoop Mapreduce vs Spark, H2O, Flink
>>>>>>> faster compute engines. So we jumped. Now the need is for streaming
>>>>>>> and
>>>>>>>
>>>>>> especially incrementally updated streaming. Seems like we need to
>>>>>>
>>>>> address
>>>>> this.
>>>>>>   Andrew, regardless of the above having TF-IDF would be super
>>>>>>> helpful—row similarity for content/text would benefit greatly.
>>>>>>>     I will put a PR up soon.
>>>>>>>
>>>>>>>   Just to clarify, I'll be porting over the (very simple) TF, TFIDF
>>>>> classes
>>>>> and Weight interface over from mr-legacy to math-scala. They're
>>>>> available
>>>>> now in spark-shell but won't be after this refactoring.  These still
>>>>>> require dictionary and a frequency count maps to vectorize incoming
>>>>>>
>>>>> text-
>>>>> so they're more for use with the old MR seq2sparse and I don't think
>>>>> they
>>>>> can be used with Spark's HashingTF and IDF.  I'll put them up soon.
>>>>>> Hopefully they'll be of some use.
>>>>>>
>>>>>>
>>>>>>


Re: TF-IDF, seq2sparse and DataFrame support

Posted by Gokhan Capan <gk...@gmail.com>.
Andrew,

Maybe making class tag evident in mapBlock calls?, i.e:
val tfIdfMatrix = tfMatrix.mapBlock(..){
                    ...idf transformation, etc...
                  }(drmMetadata.keyClassTag.asInstanceOf[ClassTag[Any]])
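
Filling that in a bit (hedged: tfMatrix, drmMetadata, idfVector and the
in-block math are stand-ins rather than the actual branch code), the idea would
be roughly:

import scala.reflect.ClassTag
import org.apache.mahout.math.Vector
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._

val keyTag = drmMetadata.keyClassTag.asInstanceOf[ClassTag[Any]]

val tfIdfMatrix = tfMatrix.mapBlock() { case (keys, block) =>
  // scale every row of the in-core block by the per-term IDF weights
  for (r <- 0 until block.nrow) block(r, ::) := block(r, ::) * idfVector
  keys -> block
}(keyTag)  // pass the key ClassTag explicitly so the result is not keyed as Object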

Best,
Gokhan

On Tue, Mar 17, 2015 at 6:06 PM, Andrew Palumbo <ap...@outlook.com> wrote:

>
> This (last commit on this branch) should be the beginning of a workaround
> for the problem of reading and returning a Generic-Writable keyed Drm:
>
> https://github.com/gcapan/mahout/commit/cd737cf1b7672d3d73fe206c7bad30
> aae3f37e14
>
> However the keyClassTag of the DrmLike returned by the  mapBlock() calls
> and finally by the method itself is somehow converted to object.  I'm not
> exactly sure why this is happening.  I think that the implicit evidence is
> being dropped in the mapBlock call on a [Object]casted CheckPointedDrm.
> Maybe by calling it out of the scope of this method (breaking down the
> method would fix it.)


> val tfMatrix = drmMetadata.keyClassTag match {
>
>   case ct if ct == ClassTag.Int => {
>     (drmWrap(rdd = tfVectors, ncol = numCols, cacheHint = CacheHint.NONE)
>       (keyClassTag.asInstanceOf[ClassTag[Any]])).asInstanceOf[CheckpointedDrmSpark[Int]]
>   }
>   case ct if ct == ClassTag(classOf[String]) => {
>     (drmWrap(rdd = tfVectors, ncol = numCols, cacheHint = CacheHint.NONE)
>       (keyClassTag.asInstanceOf[ClassTag[Any]])).asInstanceOf[CheckpointedDrmSpark[String]]
>   }
>   case ct if ct == ClassTag.Long => {
>     (drmWrap(rdd = tfVectors, ncol = numCols, cacheHint = CacheHint.NONE)
>       (keyClassTag.asInstanceOf[ClassTag[Any]])).asInstanceOf[CheckpointedDrmSpark[Long]]
>   }
>   case _ => {
>     (drmWrap(rdd = tfVectors, ncol = numCols, cacheHint = CacheHint.NONE)
>       (keyClassTag.asInstanceOf[ClassTag[Any]])).asInstanceOf[CheckpointedDrmSpark[Int]]
>   }
> }
>
> tfMatrix.checkpoint()
>
> // make sure that the classtag of the tf matrix matches the metadata
> keyClasstag
> assert(tfMatrix.keyClassTag == drmMetadata.keyClassTag)  <-- Passes here
> with eg. String keys
>
> val tfIdfMatrix = tfMatrix.mapBlock(..){
>                     ...idf transformation, etc...
>                   }
>
> assert(tfIdfMatrix.keyClassTag == drmMetadata.keyClassTag)  <-- Fails here
> for all with tfIdfMatrix.keyClassTag
>                                                                 as an
> Object.
>
>
> I'll keep looking at it a bit.  If anybody has any ideas please let me
> know.
>
>
>
>
>
>
>
> On 03/09/2015 02:12 PM, Gokhan Capan wrote:
>
>> So, here is a sketch of a Spark implementation of seq2sparse, returning a
>> (matrix:DrmLike, dictionary:Map):
>>
>> https://github.com/gcapan/mahout/tree/seq2sparse
>>
>> Although it should be possible, I couldn't manage to make it process
>> non-integer document ids. Any fix would be appreciated. There is a simple
>> test attached, but I think there is more to do in terms of handling all
>> parameters of the original seq2sparse implementation.
>>
>> I put it directly to the SparkEngine ---not that I think of this object is
>> the most appropriate placeholder, it just seemed convenient to me.
>>
>> Best
>>
>>
>> Gokhan
>>
>> On Thu, Feb 5, 2015 at 3:48 AM, Pat Ferrel<pa...@occamsmachete.com>  wrote:
>>
>>  IndexedDataset might suffice until real DataFrames come along.
>>>
>>> On Feb 4, 2015, at 3:42 PM, Dmitriy Lyubimov<dl...@gmail.com>  wrote:
>>>
>>> Dealing with dictionaries is inevitably DataFrame for seq2sparse. It is a
>>> byproduct of it IIRC. matrix definitely not a structure to hold those.
>>>
>>> On Wed, Feb 4, 2015 at 9:16 AM, Andrew Palumbo<ap...@outlook.com>
>>> wrote:
>>>
>>>  On 02/04/2015 11:13 AM, Pat Ferrel wrote:
>>>>
>>>>  Andrew, not sure what you mean about storing strings. If you mean
>>>>> something like a DRM of tokens, that is a DataFrame with row=doc column
>>>>>
>>>> =
>>>
>>>> token. A one row DataFrame is a slightly heavy weight string/document. A
>>>>> DataFrame with token counts would be perfect for input TF-IDF, no? It
>>>>>
>>>> would
>>>
>>>> be a vector that maintains the tokens as ids for the counts, right?
>>>>>
>>>>>  Yes- dataframes will be perfect for this.  The problem that i was
>>>> referring to was that we dont have a DSL Data Structure to to do the
>>>> initial distributed tokenizing of the documents[1] line:257, [2] . For
>>>>
>>> this
>>>
>>>> I believe we would need something like a Distributed vector of Strings
>>>>
>>> that
>>>
>>>> could be broadcast to a mapBlock closure and then tokenized from there.
>>>> Even there, MapBlock may not be perfect for this, but some of the new
>>>> Distributed functions that Gockhan is working on may.
>>>>
>>>>  I agree seq2sparse type input is a strong feature. Text files into an
>>>>> all-documents DataFrame basically. Colocation?
>>>>>
>>>>>  as far as collocations i believe that the n-gram are computed and
>>>> counted
>>>> in the CollocDriver [3] (i might be wrong her...its been a while since i
>>>> looked at the code...) either way, I dont think I ever looked too
>>>> closely
>>>> and i was a bit fuzzy on this...
>>>>
>>>> These were just some thoughts that I had when briefly looking at porting
>>>> seq2sparse to the DSL before.. Obviously we don't have to follow this
>>>> algorithm but its a nice starting point.
>>>>
>>>> [1]https://github.com/apache/mahout/blob/master/mrlegacy/
>>>> src/main/java/org/apache/mahout/vectorizer/
>>>> SparseVectorsFromSequenceFiles
>>>> .java
>>>> [2]https://github.com/apache/mahout/blob/master/mrlegacy/
>>>> src/main/java/org/apache/mahout/vectorizer/DocumentProcessor.java
>>>> [3]https://github.com/apache/mahout/blob/master/mrlegacy/
>>>> src/main/java/org/apache/mahout/vectorizer/
>>>> collocations/llr/CollocDriver.
>>>> java
>>>>
>>>>
>>>>
>>>>  On Feb 4, 2015, at 7:47 AM, Andrew Palumbo<ap...@outlook.com>  wrote:
>>>>>
>>>>> Just copied over the relevant last few messages to keep the other
>>>>> thread
>>>>> on topic...
>>>>>
>>>>>
>>>>> On 02/03/2015 08:22 PM, Dmitriy Lyubimov wrote:
>>>>>
>>>>>  I'd suggest to consider this: remember all this talk about
>>>>>> language-integrated spark ql being basically dataframe manipulation
>>>>>>
>>>>> DSL?
>>>
>>>> so now Spark devs are noticing this generality as well and are actually
>>>>>> proposing to rename SchemaRDD into DataFrame and make it mainstream
>>>>>>
>>>>> data
>>>
>>>> structure. (my "told you so" moment of sorts
>>>>>>
>>>>>> What i am getting at, i'd suggest to make DRM and Spark's newly
>>>>>> renamed
>>>>>> DataFrame our two major structures. In particular, standardize on
>>>>>> using
>>>>>> DataFrame for things that may include non-numerical data and require
>>>>>>
>>>>> more
>>>
>>>> grace about column naming and manipulation. Maybe relevant to TF-IDF
>>>>>>
>>>>> work
>>>
>>>> when it deals with non-matrix content.
>>>>>>
>>>>>>  Sounds like a worthy effort to me.  We'd be basically implementing an
>>>>>
>>>> API
>>>
>>>> at the math-scala level for SchemaRDD/Dataframe datastructures correct?
>>>>>
>>>>> On Tue, Feb 3, 2015 at 5:01 PM, Pat Ferrel<pa...@occamsmachete.com>
>>>>>
>>>> wrote:
>>>
>>>> Seems like seq2sparse would be really easy to replace since it takes
>>>>>>
>>>>> text
>>>
>>>> files to start with, then the whole pipeline could be kept in rdds.
>>>>>>>
>>>>>> The
>>>
>>>> dictionaries and counts could be either in-memory maps or rdds for use
>>>>>>> with
>>>>>>> joins? This would get rid of sequence files completely from the
>>>>>>> pipeline.
>>>>>>> Item similarity uses in-memory maps but the plan is to make it more
>>>>>>> scalable using joins as an alternative with the same API allowing the
>>>>>>> user
>>>>>>> to trade-off footprint for speed.
>>>>>>>
>>>>>>>  I think you're right- should be relatively easy.  I've been looking
>>>>>> at
>>>>>>
>>>>> porting seq2sparse  to the DSL for bit now and the stopper at the DSL
>>>>>
>>>> level
>>>
>>>> is that we don't have a distributed data structure for strings..Seems
>>>>>
>>>> like
>>>
>>>> getting a DataFrame implemented as Dmitriy mentioned above would take
>>>>>
>>>> care
>>>
>>>> of this problem.
>>>>>
>>>>> The other issue i'm a little fuzzy on  is the distributed collocation
>>>>> mapping-  it's a part of the seq2sparse code that I've not spent too
>>>>>
>>>> much
>>>
>>>> time in.
>>>>>
>>>>> I think that this would be very worthy effort as well-  I believe
>>>>> seq2sparse is a particular strong mahout feature.
>>>>>
>>>>> I'll start another thread since we're now way off topic from the
>>>>> refactoring proposal.
>>>>>
>>>>> My use for TF-IDF is for row similarity and would take a DRM (actually
>>>>> IndexedDataset) and calculate row/doc similarities. It works now but
>>>>>
>>>> only
>>>
>>>> using LLR. This is OK when thinking of the items as tags or metadata but
>>>>> for text tokens something like cosine may be better.
>>>>>
>>>>> I’d imagine a downsampling phase that would precede TF-IDF using LLR a
>>>>>
>>>> lot
>>>
>>>> like how CF preferences are downsampled. This would produce an
>>>>>
>>>> sparsified
>>>
>>>> all-docs DRM. Then (if the counts were saved) TF-IDF would re-weight the
>>>>> terms before row similarity uses cosine. This is not so good for search
>>>>> but
>>>>> should produce much better similarities than Solr’s “moreLikeThis” and
>>>>> does
>>>>> it for all pairs rather than one at a time.
>>>>>
>>>>> In any case it can be used to do a create a personalized content-based
>>>>> recommender or augment a CF recommender with one more indicator type.
>>>>>
>>>>> On Feb 3, 2015, at 3:37 PM, Andrew Palumbo<ap...@outlook.com>  wrote:
>>>>>
>>>>>
>>>>> On 02/03/2015 12:44 PM, Andrew Palumbo wrote:
>>>>>
>>>>>  On 02/03/2015 12:22 PM, Pat Ferrel wrote:
>>>>>>
>>>>>>  Some issues WRT lower level Spark integration:
>>>>>>> 1) interoperability with Spark data. TF-IDF is one example I actually
>>>>>>>
>>>>>>>  looked at. There may be other things we can pick up from their
>>>>>>
>>>>> committers
>>>
>>>> since they have an abundance.
>>>>>
>>>>>  2) wider acceptance of Mahout DSL. The DSL’s power was illustrated to
>>>>>> me when someone on the Spark list asked about matrix transpose and an
>>>>>>
>>>>> MLlib
>>>>> committer’s answer was something like “why would you want to do that?”.
>>>>> Usually you don’t actually execute the transpose but they don’t even
>>>>> support A’A, AA’, or A’B, which are core to what I work on. At present
>>>>>
>>>> you
>>>
>>>> pretty much have to choose between MLlib or Mahout for sparse matrix
>>>>> stuff.
>>>>> Maybe a half-way measure is some implicit conversions (ugh, I know). If
>>>>> the
>>>>> DSL could interchange datasets with MLlib, people would be pointed to
>>>>>
>>>> the
>>>
>>>> DSL for all of a bunch of “why would you want to do that?” features.
>>>>>
>>>> MLlib
>>>
>>>> seems to be algorithms, not math.
>>>>>
>>>>>  3) integration of Streaming. DStreams support most of the RDD
>>>>>> interface. Doing a batch recalc on a moving time window would nearly
>>>>>>
>>>>> fall
>>>
>>>> out of DStream backed DRMs. This isn’t the same as incremental updates
>>>>>
>>>> on
>>>
>>>> streaming but it’s a start.
>>>>>
>>>>>  Last year we were looking at Hadoop Mapreduce vs Spark, H2O, Flink
>>>>>> faster compute engines. So we jumped. Now the need is for streaming
>>>>>> and
>>>>>>
>>>>> especially incrementally updated streaming. Seems like we need to
>>>>>
>>>> address
>>>
>>>> this.
>>>>>
>>>>>  Andrew, regardless of the above having TF-IDF would be super
>>>>>> helpful—row similarity for content/text would benefit greatly.
>>>>>>    I will put a PR up soon.
>>>>>>
>>>>>>  Just to clarify, I'll be porting over the (very simple) TF, TFIDF
>>>>>
>>>> classes
>>>
>>>> and Weight interface over from mr-legacy to math-scala. They're
>>>>>
>>>> available
>>>
>>>> now in spark-shell but won't be after this refactoring.  These still
>>>>> require dictionary and a frequency count maps to vectorize incoming
>>>>>
>>>> text-
>>>
>>>> so they're more for use with the old MR seq2sparse and I don't think
>>>>>
>>>> they
>>>
>>>> can be used with Spark's HashingTF and IDF.  I'll put them up soon.
>>>>> Hopefully they'll be of some use.
>>>>>
>>>>>
>>>>>
>

Re: TF-IDF, seq2sparse and DataFrame support

Posted by Andrew Palumbo <ap...@outlook.com>.
This (last commit on this branch) should be the beginning of a 
workaround for the problem of reading and returning a Generic-Writable 
keyed Drm:

https://github.com/gcapan/mahout/commit/cd737cf1b7672d3d73fe206c7bad30aae3f37e14

However, the keyClassTag of the DrmLike returned by the mapBlock() calls, 
and finally by the method itself, is somehow converted to Object.  I'm 
not exactly sure why this is happening.  I think that the implicit 
evidence is being dropped in the mapBlock call on an [Object]-cast 
CheckpointedDrm.  Maybe calling it outside the scope of this method 
(breaking the method down) would fix it.

val tfMatrix = drmMetadata.keyClassTag match {

  case ct if ct == ClassTag.Int => {
    (drmWrap(rdd = tfVectors, ncol = numCols, cacheHint = CacheHint.NONE)
      (keyClassTag.asInstanceOf[ClassTag[Any]])).asInstanceOf[CheckpointedDrmSpark[Int]]
  }
  case ct if ct == ClassTag(classOf[String]) => {
    (drmWrap(rdd = tfVectors, ncol = numCols, cacheHint = CacheHint.NONE)
      (keyClassTag.asInstanceOf[ClassTag[Any]])).asInstanceOf[CheckpointedDrmSpark[String]]
  }
  case ct if ct == ClassTag.Long => {
    (drmWrap(rdd = tfVectors, ncol = numCols, cacheHint = CacheHint.NONE)
      (keyClassTag.asInstanceOf[ClassTag[Any]])).asInstanceOf[CheckpointedDrmSpark[Long]]
  }
  case _ => {
    (drmWrap(rdd = tfVectors, ncol = numCols, cacheHint = CacheHint.NONE)
      (keyClassTag.asInstanceOf[ClassTag[Any]])).asInstanceOf[CheckpointedDrmSpark[Int]]
  }
}

tfMatrix.checkpoint()

// make sure that the classtag of the tf matrix matches the metadata keyClasstag
assert(tfMatrix.keyClassTag == drmMetadata.keyClassTag)  <-- Passes here with eg. String keys

val tfIdfMatrix = tfMatrix.mapBlock(..){
                     ...idf transformation, etc...
                   }

assert(tfIdfMatrix.keyClassTag == drmMetadata.keyClassTag)  <-- Fails here in all cases,
                                                                 with tfIdfMatrix.keyClassTag as an Object.


I'll keep looking at it a bit.  If anybody has any ideas please let me know.






On 03/09/2015 02:12 PM, Gokhan Capan wrote:
> So, here is a sketch of a Spark implementation of seq2sparse, returning a
> (matrix:DrmLike, dictionary:Map):
>
> https://github.com/gcapan/mahout/tree/seq2sparse
>
> Although it should be possible, I couldn't manage to make it process
> non-integer document ids. Any fix would be appreciated. There is a simple
> test attached, but I think there is more to do in terms of handling all
> parameters of the original seq2sparse implementation.
>
> I put it directly to the SparkEngine ---not that I think of this object is
> the most appropriate placeholder, it just seemed convenient to me.
>
> Best
>
>
> Gokhan
>
> On Thu, Feb 5, 2015 at 3:48 AM, Pat Ferrel<pa...@occamsmachete.com>  wrote:
>
>> IndexedDataset might suffice until real DataFrames come along.
>>
>> On Feb 4, 2015, at 3:42 PM, Dmitriy Lyubimov<dl...@gmail.com>  wrote:
>>
>> Dealing with dictionaries is inevitably DataFrame for seq2sparse. It is a
>> byproduct of it IIRC. matrix definitely not a structure to hold those.
>>
>> On Wed, Feb 4, 2015 at 9:16 AM, Andrew Palumbo<ap...@outlook.com>  wrote:
>>
>>> On 02/04/2015 11:13 AM, Pat Ferrel wrote:
>>>
>>>> Andrew, not sure what you mean about storing strings. If you mean
>>>> something like a DRM of tokens, that is a DataFrame with row=doc column
>> =
>>>> token. A one row DataFrame is a slightly heavy weight string/document. A
>>>> DataFrame with token counts would be perfect for input TF-IDF, no? It
>> would
>>>> be a vector that maintains the tokens as ids for the counts, right?
>>>>
>>> Yes- DataFrames will be perfect for this.  The problem that I was
>>> referring to was that we don't have a DSL data structure to do the
>>> initial distributed tokenizing of the documents [1] line:257, [2]. For this
>>> I believe we would need something like a distributed vector of Strings that
>>> could be broadcast to a mapBlock closure and then tokenized from there.
>>> Even there, mapBlock may not be perfect for this, but some of the new
>>> distributed functions that Gokhan is working on may be.
>>>
>>>> I agree seq2sparse type input is a strong feature. Text files into an
>>>> all-documents DataFrame basically. Colocation?
>>>>
>>> As far as collocations, I believe the n-grams are computed and counted
>>> in the CollocDriver [3] (I might be wrong here... it's been a while since I
>>> looked at the code...). Either way, I don't think I ever looked too closely,
>>> and I was a bit fuzzy on this...
>>>
>>> These were just some thoughts that I had when briefly looking at porting
>>> seq2sparse to the DSL before.. Obviously we don't have to follow this
>>> algorithm but its a nice starting point.
>>>
>>> [1]https://github.com/apache/mahout/blob/master/mrlegacy/
>>> src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles
>>> .java
>>> [2]https://github.com/apache/mahout/blob/master/mrlegacy/
>>> src/main/java/org/apache/mahout/vectorizer/DocumentProcessor.java
>>> [3]https://github.com/apache/mahout/blob/master/mrlegacy/
>>> src/main/java/org/apache/mahout/vectorizer/collocations/llr/CollocDriver.
>>> java
>>>
>>>
>>>
>>>> On Feb 4, 2015, at 7:47 AM, Andrew Palumbo<ap...@outlook.com>  wrote:
>>>>
>>>> Just copied over the relevant last few messages to keep the other thread
>>>> on topic...
>>>>
>>>>
>>>> On 02/03/2015 08:22 PM, Dmitriy Lyubimov wrote:
>>>>
>>>>> I'd suggest to consider this: remember all this talk about
>>>>> language-integrated spark ql being basically dataframe manipulation
>> DSL?
>>>>> so now Spark devs are noticing this generality as well and are actually
>>>>> proposing to rename SchemaRDD into DataFrame and make it mainstream
>> data
>>>>> structure. (my "told you so" moment of sorts
>>>>>
>>>>> What i am getting at, i'd suggest to make DRM and Spark's newly renamed
>>>>> DataFrame our two major structures. In particular, standardize on using
>>>>> DataFrame for things that may include non-numerical data and require
>> more
>>>>> grace about column naming and manipulation. Maybe relevant to TF-IDF
>> work
>>>>> when it deals with non-matrix content.
>>>>>
>>>> Sounds like a worthy effort to me.  We'd be basically implementing an
>> API
>>>> at the math-scala level for SchemaRDD/Dataframe datastructures correct?
>>>>
>>>> On Tue, Feb 3, 2015 at 5:01 PM, Pat Ferrel<pa...@occamsmachete.com>
>> wrote:
>>>>> Seems like seq2sparse would be really easy to replace since it takes
>> text
>>>>>> files to start with, then the whole pipeline could be kept in rdds.
>> The
>>>>>> dictionaries and counts could be either in-memory maps or rdds for use
>>>>>> with
>>>>>> joins? This would get rid of sequence files completely from the
>>>>>> pipeline.
>>>>>> Item similarity uses in-memory maps but the plan is to make it more
>>>>>> scalable using joins as an alternative with the same API allowing the
>>>>>> user
>>>>>> to trade-off footprint for speed.
>>>>>>
>>>>> I think you're right- should be relatively easy.  I've been looking at
>>>> porting seq2sparse  to the DSL for bit now and the stopper at the DSL
>> level
>>>> is that we don't have a distributed data structure for strings..Seems
>> like
>>>> getting a DataFrame implemented as Dmitriy mentioned above would take
>> care
>>>> of this problem.
>>>>
>>>> The other issue i'm a little fuzzy on  is the distributed collocation
>>>> mapping-  it's a part of the seq2sparse code that I've not spent too
>> much
>>>> time in.
>>>>
>>>> I think that this would be very worthy effort as well-  I believe
>>>> seq2sparse is a particular strong mahout feature.
>>>>
>>>> I'll start another thread since we're now way off topic from the
>>>> refactoring proposal.
>>>>
>>>> My use for TF-IDF is for row similarity and would take a DRM (actually
>>>> IndexedDataset) and calculate row/doc similarities. It works now but
>> only
>>>> using LLR. This is OK when thinking of the items as tags or metadata but
>>>> for text tokens something like cosine may be better.
>>>>
>>>> I’d imagine a downsampling phase that would precede TF-IDF using LLR a
>> lot
>>>> like how CF preferences are downsampled. This would produce an
>> sparsified
>>>> all-docs DRM. Then (if the counts were saved) TF-IDF would re-weight the
>>>> terms before row similarity uses cosine. This is not so good for search
>>>> but
>>>> should produce much better similarities than Solr’s “moreLikeThis” and
>>>> does
>>>> it for all pairs rather than one at a time.
>>>>
>>>> In any case it can be used to do a create a personalized content-based
>>>> recommender or augment a CF recommender with one more indicator type.
>>>>
>>>> On Feb 3, 2015, at 3:37 PM, Andrew Palumbo<ap...@outlook.com>  wrote:
>>>>
>>>>
>>>> On 02/03/2015 12:44 PM, Andrew Palumbo wrote:
>>>>
>>>>> On 02/03/2015 12:22 PM, Pat Ferrel wrote:
>>>>>
>>>>>> Some issues WRT lower level Spark integration:
>>>>>> 1) interoperability with Spark data. TF-IDF is one example I actually
>>>>>>
>>>>> looked at. There may be other things we can pick up from their
>> committers
>>>> since they have an abundance.
>>>>
>>>>> 2) wider acceptance of Mahout DSL. The DSL’s power was illustrated to
>>>>> me when someone on the Spark list asked about matrix transpose and an
>>>> MLlib
>>>> committer’s answer was something like “why would you want to do that?”.
>>>> Usually you don’t actually execute the transpose but they don’t even
>>>> support A’A, AA’, or A’B, which are core to what I work on. At present
>> you
>>>> pretty much have to choose between MLlib or Mahout for sparse matrix
>>>> stuff.
>>>> Maybe a half-way measure is some implicit conversions (ugh, I know). If
>>>> the
>>>> DSL could interchange datasets with MLlib, people would be pointed to
>> the
>>>> DSL for all of a bunch of “why would you want to do that?” features.
>> MLlib
>>>> seems to be algorithms, not math.
>>>>
>>>>> 3) integration of Streaming. DStreams support most of the RDD
>>>>> interface. Doing a batch recalc on a moving time window would nearly
>> fall
>>>> out of DStream backed DRMs. This isn’t the same as incremental updates
>> on
>>>> streaming but it’s a start.
>>>>
>>>>> Last year we were looking at Hadoop Mapreduce vs Spark, H2O, Flink
>>>>> faster compute engines. So we jumped. Now the need is for streaming and
>>>> especially incrementally updated streaming. Seems like we need to
>> address
>>>> this.
>>>>
>>>>> Andrew, regardless of the above having TF-IDF would be super
>>>>> helpful—row similarity for content/text would benefit greatly.
>>>>>    I will put a PR up soon.
>>>>>
>>>> Just to clarify, I'll be porting over the (very simple) TF, TFIDF
>> classes
>>>> and Weight interface over from mr-legacy to math-scala. They're
>> available
>>>> now in spark-shell but won't be after this refactoring.  These still
>>>> require dictionary and a frequency count maps to vectorize incoming
>> text-
>>>> so they're more for use with the old MR seq2sparse and I don't think
>> they
>>>> can be used with Spark's HashingTF and IDF.  I'll put them up soon.
>>>> Hopefully they'll be of some use.
>>>>
>>>>


Re: TF-IDF, seq2sparse and DataFrame support

Posted by Pat Ferrel <pa...@occamsmachete.com>.
I think everyone agrees that getting this into a PR would be great. We need a modernized text pipeline and this is an excellent starting point. We can discuss there. 

On Mar 10, 2015, at 3:53 AM, Gokhan Capan <gk...@gmail.com> wrote:

Some answers:

- Non-integer document ids:
The implementation does not use operations defined for DrmLike[Int]-only,
so the row keys do not have to be Int's. I just couldn't manage to create
the returning DrmLike with the correct key type. Although while wrapping
into a DrmLike, I tried to pass the key-class using HDFS utils like they
are being used in drmDfsRead, but I somehow wasn't successful. So non-int
document ids are not an actual issue here.

- Breaking the implementation out to smaller pieces: Let's just collect the
requirements and adjust the implementation accordingly. I honestly didn't
think very much about where the implementation fits in, architecturally,
and what pieces are of public interest.

Best

Gokhan

On Tue, Mar 10, 2015 at 3:56 AM, Suneel Marthi <su...@gmail.com>
wrote:

> AP, How is ur impl different from Gokhan's?
> 
> On Mon, Mar 9, 2015 at 9:54 PM, Andrew Palumbo <ap...@outlook.com> wrote:
> 
>> BTW, I'm not sure o.a.m.nlp is the best package name for either; I was
>> using it because o.a.m.vectorizer, which is probably a better name, had
>> conflicts in mrlegacy.
>> 
>> 
>> On 03/09/2015 09:29 PM, Andrew Palumbo wrote:
>> 
>>> 
>>> I meant would o.a.m.nlp in the spark module be a good place for Gokhan's
>>> seq2sparse implementation to live.
>>> 
>>> On 03/09/2015 09:07 PM, Pat Ferrel wrote:
>>> 
>>>> Does o.a.m.nlp  in the spark module seem like a good place for this to
>>>>> live?
>>>>> 
>>>> I think you meant math-scala?
>>>> 
>>>> Actually we should rename math to core
>>>> 
>>>> 
>>>> On Mar 9, 2015, at 3:15 PM, Andrew Palumbo <ap...@outlook.com> wrote:
>>>> 
>>>> Cool- This is great! I think this is really important to have in.
>>>> 
>>>> +1 to a pull request for comments.
>>>> 
>>>> I have pr#75 (https://github.com/apache/mahout/pull/75) open - it has
>>>> very simple TF and TFIDF classes based on Lucene's IDF calculation and
>>>> MLlib's.  I just got a bad flu and haven't had a chance to push it.  It
>>>> creates an o.a.m.nlp package in mahout-math. I will push that as soon as I
>>>> can in case you want to use them.
>>>> 
>>>> Does o.a.m.nlp  in the spark module seem like a good place for this to
>>>> live?
>>>> 
>>>> Those classes may be of use to you- they're very simple and are intended
>>>> for new document vectorization once the legacy deps are removed from the
>>>> spark module.  They also might make interoperability easier.
>>>> 
>>>> One thought, having not been able to look at this too closely yet:
>>>>
>>>>> // do we need to calculate df-vector?
>>>>
>>>> 1.  We do need a document frequency map or vector to be able to
>>>> calculate the IDF terms when vectorizing a new document outside of the
>>>> original corpus.
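
For reference, a tiny sketch (not the PR code) of the Lucene-classic weighting
mentioned above: tf as the square root of the raw count and a smoothed-log IDF.

// classic Lucene (DefaultSimilarity-style) weights; illustrative only
def luceneTf(rawCount: Int): Double = math.sqrt(rawCount.toDouble)
def luceneIdf(df: Long, numDocs: Long): Double =
  1.0 + math.log(numDocs.toDouble / (df + 1.0))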
> 
>>> 
>>>> 
>>>> 
>>>> 
>>>> On 03/09/2015 05:10 PM, Pat Ferrel wrote:
>>>> 
>>>>> Ah, you are doing all the lucene analyzer, ngrams and other
> tokenizing,
>>>>> nice.
>>>>> 
>>>>> On Mar 9, 2015, at 2:07 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:
>>>>> 
>>>>> Ah I found the right button in Github no PR necessary.
>>>>> 
>>>>> On Mar 9, 2015, at 1:55 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:
>>>>> 
>>>>> If you create a PR it’s easier to see what was changed.
>>>>> 
>>>>> Wouldn’t it be better to read in files from a directory, assigning
>>>>> doc-id = filename and term-ids = terms, or are there still Hadoop pipeline
>>>>> tools that are needed to create the sequence files? This sort of mimics the
>>>>> way Spark reads SchemaRDDs from Json files.
>>>>> 
>>>>> BTW this can also be done with a new reader trait on the
>>>>> IndexedDataset. It will give you two bidirectional maps (BiMap) and a
>>>>> DrmLike[Int]. One BiMap gives the String <-> Int mapping for rows, the other does
>>>>> the same for columns (text tokens). This would be a few lines of code since
>>>>> the string mapping and DRM creation are already written; the only thing to
>>>>> do would be to map the doc/row ids to filenames. This allows you to take the
>>>>> non-int doc ids out of the DRM and replace them with a map. Not based on a
>>>>> Spark DataFrame yet, but it probably will be.
>>>>> 
>>>>> On Mar 9, 2015, at 11:12 AM, Gokhan Capan <gk...@gmail.com> wrote:
>>>>> 
>>>>> So, here is a sketch of a Spark implementation of seq2sparse,
> returning
>>>>> a
>>>>> (matrix:DrmLike, dictionary:Map):
>>>>> 
>>>>> https://github.com/gcapan/mahout/tree/seq2sparse
>>>>> 
>>>>> Although it should be possible, I couldn't manage to make it process
>>>>> non-integer document ids. Any fix would be appreciated. There is a
>>>>> simple
>>>>> test attached, but I think there is more to do in terms of handling
> all
>>>>> parameters of the original seq2sparse implementation.
>>>>> 
>>>>> I put it directly to the SparkEngine ---not that I think of this
> object
>>>>> is
>>>>> the most appropriate placeholder, it just seemed convenient to me.
>>>>> 
>>>>> Best
>>>>> 
>>>>> 
>>>>> Gokhan
>>>>> 
>>>>> On Thu, Feb 5, 2015 at 3:48 AM, Pat Ferrel <pa...@occamsmachete.com>
>>>>> wrote:
>>>>> 
>>>>> IndexedDataset might suffice until real DataFrames come along.
>>>>>> 
>>>>>> On Feb 4, 2015, at 3:42 PM, Dmitriy Lyubimov <dl...@gmail.com>
>>>>>> wrote:
>>>>>> 
>>>>>> Dealing with dictionaries is inevitably DataFrame for seq2sparse. It
>>>>>> is a
>>>>>> byproduct of it IIRC. matrix definitely not a structure to hold
> those.
>>>>>> 
>>>>>> On Wed, Feb 4, 2015 at 9:16 AM, Andrew Palumbo <ap...@outlook.com>
>>>>>> wrote:
>>>>>> 
>>>>>> On 02/04/2015 11:13 AM, Pat Ferrel wrote:
>>>>>>> 
>>>>>>> Andrew, not sure what you mean about storing strings. If you mean
>>>>>>>> something like a DRM of tokens, that is a DataFrame with row=doc
>>>>>>>> column
>>>>>>>> 
>>>>>>> =
>>>>>> 
>>>>>>> token. A one row DataFrame is a slightly heavy weight
>>>>>>>> string/document. A
>>>>>>>> DataFrame with token counts would be perfect for input TF-IDF, no?
> It
>>>>>>>> 
>>>>>>> would
>>>>>> 
>>>>>>> be a vector that maintains the tokens as ids for the counts, right?
>>>>>>>> 
>>>>>>>> Yes- dataframes will be perfect for this.  The problem that i was
>>>>>>> referring to was that we dont have a DSL Data Structure to to do the
>>>>>>> initial distributed tokenizing of the documents[1] line:257, [2] .
> For
>>>>>>> 
>>>>>> this
>>>>>> 
>>>>>>> I believe we would need something like a Distributed vector of
> Strings
>>>>>>> 
>>>>>> that
>>>>>> 
>>>>>>> could be broadcast to a mapBlock closure and then tokenized from
>>>>>>> there.
>>>>>>> Even there, MapBlock may not be perfect for this, but some of the
> new
>>>>>>> Distributed functions that Gockhan is working on may.
>>>>>>> 
>>>>>>> I agree seq2sparse type input is a strong feature. Text files into
> an
>>>>>>>> all-documents DataFrame basically. Colocation?
>>>>>>>> 
>>>>>>>> as far as collocations i believe that the n-gram are computed and
>>>>>>> counted
>>>>>>> in the CollocDriver [3] (i might be wrong her...its been a while
>>>>>>> since i
>>>>>>> looked at the code...) either way, I dont think I ever looked too
>>>>>>> closely
>>>>>>> and i was a bit fuzzy on this...
>>>>>>> 
>>>>>>> These were just some thoughts that I had when briefly looking at
>>>>>>> porting
>>>>>>> seq2sparse to the DSL before.. Obviously we don't have to follow
> this
>>>>>>> algorithm but its a nice starting point.
>>>>>>> 
>>>>>>> [1] https://github.com/apache/mahout/blob/master/mrlegacy/
>>>>>>> 
> src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles
>>>>>>> 
>>>>>>> .java
>>>>>>> [2] https://github.com/apache/mahout/blob/master/mrlegacy/
>>>>>>> src/main/java/org/apache/mahout/vectorizer/DocumentProcessor.java
>>>>>>> [3]https://github.com/apache/mahout/blob/master/mrlegacy/
>>>>>>> 
> src/main/java/org/apache/mahout/vectorizer/collocations/llr/CollocDriver.
>>>>>>> 
>>>>>>> java
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Feb 4, 2015, at 7:47 AM, Andrew Palumbo <ap...@outlook.com>
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>> Just copied over the relevant last few messages to keep the other
>>>>>>>> thread
>>>>>>>> on topic...
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On 02/03/2015 08:22 PM, Dmitriy Lyubimov wrote:
>>>>>>>> 
>>>>>>>> I'd suggest to consider this: remember all this talk about
>>>>>>>>> language-integrated spark ql being basically dataframe
> manipulation
>>>>>>>>> 
>>>>>>>> DSL?
>>>>>> 
>>>>>>> so now Spark devs are noticing this generality as well and are
>>>>>>>>> actually
>>>>>>>>> proposing to rename SchemaRDD into DataFrame and make it
> mainstream
>>>>>>>>> 
>>>>>>>> data
>>>>>> 
>>>>>>> structure. (my "told you so" moment of sorts
>>>>>>>>> 
>>>>>>>>> What i am getting at, i'd suggest to make DRM and Spark's newly
>>>>>>>>> renamed
>>>>>>>>> DataFrame our two major structures. In particular, standardize on
>>>>>>>>> using
>>>>>>>>> DataFrame for things that may include non-numerical data and
> require
>>>>>>>>> 
>>>>>>>> more
>>>>>> 
>>>>>>> grace about column naming and manipulation. Maybe relevant to TF-IDF
>>>>>>>>> 
>>>>>>>> work
>>>>>> 
>>>>>>> when it deals with non-matrix content.
>>>>>>>>> 
>>>>>>>>> Sounds like a worthy effort to me.  We'd be basically
> implementing
>>>>>>>> an
>>>>>>>> 
>>>>>>> API
>>>>>> 
>>>>>>> at the math-scala level for SchemaRDD/Dataframe datastructures
>>>>>>>> correct?
>>>>>>>> 
>>>>>>>> On Tue, Feb 3, 2015 at 5:01 PM, Pat Ferrel <pa...@occamsmachete.com>
>>>>>>>> 
>>>>>>> wrote:
>>>>>> 
>>>>>>> Seems like seq2sparse would be really easy to replace since it takes
>>>>>>>>> 
>>>>>>>> text
>>>>>> 
>>>>>>> files to start with, then the whole pipeline could be kept in rdds.
>>>>>>>>>> 
>>>>>>>>> The
>>>>>> 
>>>>>>> dictionaries and counts could be either in-memory maps or rdds for
> use
>>>>>>>>>> with
>>>>>>>>>> joins? This would get rid of sequence files completely from the
>>>>>>>>>> pipeline.
>>>>>>>>>> Item similarity uses in-memory maps but the plan is to make it
> more
>>>>>>>>>> scalable using joins as an alternative with the same API allowing
>>>>>>>>>> the
>>>>>>>>>> user
>>>>>>>>>> to trade-off footprint for speed.
>>>>>>>>>> 
>>>>>>>>>> I think you're right- should be relatively easy.  I've been
>>>>>>>>> looking at
>>>>>>>>> 
>>>>>>>> porting seq2sparse  to the DSL for bit now and the stopper at the
> DSL
>>>>>>>> 
>>>>>>> level
>>>>>> 
>>>>>>> is that we don't have a distributed data structure for
> strings..Seems
>>>>>>>> 
>>>>>>> like
>>>>>> 
>>>>>>> getting a DataFrame implemented as Dmitriy mentioned above would
> take
>>>>>>>> 
>>>>>>> care
>>>>>> 
>>>>>>> of this problem.
>>>>>>>> 
>>>>>>>> The other issue i'm a little fuzzy on  is the distributed
> collocation
>>>>>>>> mapping-  it's a part of the seq2sparse code that I've not spent
> too
>>>>>>>> 
>>>>>>> much
>>>>>> 
>>>>>>> time in.
>>>>>>>> 
>>>>>>>> I think that this would be very worthy effort as well- I believe
>>>>>>>> seq2sparse is a particular strong mahout feature.
>>>>>>>> 
>>>>>>>> I'll start another thread since we're now way off topic from the
>>>>>>>> refactoring proposal.
>>>>>>>> 
>>>>>>>> My use for TF-IDF is for row similarity and would take a DRM
>>>>>>>> (actually
>>>>>>>> IndexedDataset) and calculate row/doc similarities. It works now
> but
>>>>>>>> 
>>>>>>> only
>>>>>> 
>>>>>>> using LLR. This is OK when thinking of the items as tags or metadata
>>>>>>>> but
>>>>>>>> for text tokens something like cosine may be better.
>>>>>>>> 
>>>>>>>> I’d imagine a downsampling phase that would precede TF-IDF using
> LLR
>>>>>>>> a
>>>>>>>> 
>>>>>>> lot
>>>>>> 
>>>>>>> like how CF preferences are downsampled. This would produce an
>>>>>>>> 
>>>>>>> sparsified
>>>>>> 
>>>>>>> all-docs DRM. Then (if the counts were saved) TF-IDF would re-weight
>>>>>>>> the
>>>>>>>> terms before row similarity uses cosine. This is not so good for
>>>>>>>> search
>>>>>>>> but
>>>>>>>> should produce much better similarities than Solr’s “moreLikeThis”
>>>>>>>> and
>>>>>>>> does
>>>>>>>> it for all pairs rather than one at a time.
>>>>>>>> 
>>>>>>>> In any case it can be used to do a create a personalized
>>>>>>>> content-based
>>>>>>>> recommender or augment a CF recommender with one more indicator
> type.
>>>>>>>> 
>>>>>>>> On Feb 3, 2015, at 3:37 PM, Andrew Palumbo <ap...@outlook.com>
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On 02/03/2015 12:44 PM, Andrew Palumbo wrote:
>>>>>>>> 
>>>>>>>> On 02/03/2015 12:22 PM, Pat Ferrel wrote:
>>>>>>>>> 
>>>>>>>>> Some issues WRT lower level Spark integration:
>>>>>>>>>> 1) interoperability with Spark data. TF-IDF is one example I
>>>>>>>>>> actually
>>>>>>>>>> 
>>>>>>>>>> looked at. There may be other things we can pick up from their
>>>>>>>>> 
>>>>>>>> committers
>>>>>> 
>>>>>>> since they have an abundance.
>>>>>>>> 
>>>>>>>> 2) wider acceptance of Mahout DSL. The DSL’s power was illustrated
>>>>>>>>> to
>>>>>>>>> me when someone on the Spark list asked about matrix transpose and
>>>>>>>>> an
>>>>>>>>> 
>>>>>>>> MLlib
>>>>>>>> committer’s answer was something like “why would you want to do
>>>>>>>> that?”.
>>>>>>>> Usually you don’t actually execute the transpose but they don’t
> even
>>>>>>>> support A’A, AA’, or A’B, which are core to what I work on. At
>>>>>>>> present
>>>>>>>> 
>>>>>>> you
>>>>>> 
>>>>>>> pretty much have to choose between MLlib or Mahout for sparse matrix
>>>>>>>> stuff.
>>>>>>>> Maybe a half-way measure is some implicit conversions (ugh, I
> know).
>>>>>>>> If
>>>>>>>> the
>>>>>>>> DSL could interchange datasets with MLlib, people would be pointed
> to
>>>>>>>> 
>>>>>>> the
>>>>>> 
>>>>>>> DSL for all of a bunch of “why would you want to do that?” features.
>>>>>>>> 
>>>>>>> MLlib
>>>>>> 
>>>>>>> seems to be algorithms, not math.
>>>>>>>> 
>>>>>>>> 3) integration of Streaming. DStreams support most of the RDD
>>>>>>>>> interface. Doing a batch recalc on a moving time window would
> nearly
>>>>>>>>> 
>>>>>>>> fall
>>>>>> 
>>>>>>> out of DStream backed DRMs. This isn’t the same as incremental
> updates
>>>>>>>> 
>>>>>>> on
>>>>>> 
>>>>>>> streaming but it’s a start.
>>>>>>>> 
>>>>>>>> Last year we were looking at Hadoop Mapreduce vs Spark, H2O, Flink
>>>>>>>>> faster compute engines. So we jumped. Now the need is for
> streaming
>>>>>>>>> and
>>>>>>>>> 
>>>>>>>> especially incrementally updated streaming. Seems like we need to
>>>>>>>> 
>>>>>>> address
>>>>>> 
>>>>>>> this.
>>>>>>>> 
>>>>>>>> Andrew, regardless of the above having TF-IDF would be super
>>>>>>>>> helpful—row similarity for content/text would benefit greatly.
>>>>>>>>> I will put a PR up soon.
>>>>>>>>> 
>>>>>>>>> Just to clarify, I'll be porting over the (very simple) TF, TFIDF
>>>>>>>> 
>>>>>>> classes
>>>>>> 
>>>>>>> and Weight interface over from mr-legacy to math-scala. They're
>>>>>>>> 
>>>>>>> available
>>>>>> 
>>>>>>> now in spark-shell but won't be after this refactoring.  These still
>>>>>>>> require dictionary and a frequency count maps to vectorize incoming
>>>>>>>> 
>>>>>>> text-
>>>>>> 
>>>>>>> so they're more for use with the old MR seq2sparse and I don't think
>>>>>>>> 
>>>>>>> they
>>>>>> 
>>>>>>> can be used with Spark's HashingTF and IDF.  I'll put them up soon.
>>>>>>>> Hopefully they'll be of some use.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
> 


Re: TF-IDF, seq2sparse and DataFrame support

Posted by Gokhan Capan <gk...@gmail.com>.
Some answers:

- Non-integer document ids:
The implementation does not use operations defined only for DrmLike[Int],
so the row keys do not have to be Ints. I just couldn't manage to create
the returned DrmLike with the correct key type. While wrapping into a
DrmLike, I tried to pass the key class using the HDFS utils the way
drmDfsRead uses them, but I wasn't successful. So non-integer document ids
are not a fundamental issue here (a rough wrapping sketch follows below
these answers).

- Breaking the implementation out into smaller pieces: Let's just collect
the requirements and adjust the implementation accordingly. I honestly
didn't think very much about where the implementation fits architecturally,
or about which pieces are of public interest.
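[Editorial sketch, not from the patch: a minimal illustration of keeping
String row keys, assuming the drmWrap/DrmRdd helpers from
o.a.m.sparkbindings behave roughly as below and that a Spark context sc
(e.g. the one in the Mahout spark-shell) is in scope; the dictionary and
documents are made up.]

    import org.apache.mahout.math.{RandomAccessSparseVector, Vector}
    import org.apache.mahout.math.drm.DrmLike
    import org.apache.mahout.sparkbindings._

    // Made-up 3-term dictionary and two tiny documents keyed by file name.
    val dictionary = Map("mahout" -> 0, "spark" -> 1, "tfidf" -> 2)

    def countsToVector(counts: Map[String, Double]): Vector = {
      val v = new RandomAccessSparseVector(dictionary.size)
      counts.foreach { case (term, n) => v.setQuick(dictionary(term), n) }
      v
    }

    // DrmRdd[String] is just an RDD of (key, row vector) pairs.
    val rows: DrmRdd[String] = sc.parallelize(Seq(
      "docs/a.txt" -> countsToVector(Map("mahout" -> 2.0, "tfidf" -> 1.0)),
      "docs/b.txt" -> countsToVector(Map("spark" -> 1.0, "tfidf" -> 3.0))
    ))

    // The String keys are kept as-is; no Int re-keying is needed as long
    // as nothing downstream requires DrmLike[Int]-only operations.
    val drm: DrmLike[String] = drmWrap(rows, ncol = dictionary.size)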

Best

Gokhan


Re: TF-IDF, seq2sparse and DataFrame support

Posted by Andrew Palumbo <ap...@outlook.com>.
Sorry for any confusion... what I just pushed from #75 is not an
implementation of seq2sparse at all - just a really simple implementation
of the Lucene DefaultSimilarity wrapper classes used in the mrlegacy
seq2sparse implementation to compute the TF-IDF weight for a single term,
given a dictionary, a term frequency count, the corpus size and a
document frequency count:

https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/vectorizer/TFIDF.java
https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/vectorizer/Weight.java

I also added an MLlibTFIDF weight:

https://github.com/apache/mahout/blob/master/math-scala/src/main/scala/org/apache/mahout/nlp/tfidf/TFIDF.scala

For interoperability with MLlib's HashingTF/IDF, which uses a slightly
different formula.


The classes I pushed are really just to use for something simple like this:

     val tfidf: TFIDF = new TFIDF()
     val currentTfIdf = tfidf.calculate(termFreq, docFreq.toInt, docSize, totalDFSize.toInt)

I'm using them to vectorize a new document for Naive Bayes in a
Mahout spark-shell script for MAHOUT-1536 (using a model that was
trained with mrlegacy seq2sparse vectors):

https://github.com/andrewpalumbo/mahout/blob/MAHOUT-1536-scala/examples/bin/spark/ClassifyNewNBfull.scala
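[Editorial sketch, not the actual MAHOUT-1536 script: a rough,
hypothetical example of that kind of new-document vectorization. The
dictionary, document-frequency map and corpus size are made up, and the
calculate signature and parameter order follow the snippet above; the
import path is taken from the math-scala link above and may differ.]

    import org.apache.mahout.math.{RandomAccessSparseVector, Vector}
    import org.apache.mahout.nlp.tfidf.TFIDF  // adjust if the package differs

    // Made-up model artifacts as produced by the (mrlegacy) seq2sparse run.
    val dictionary: Map[String, Int] = Map("mahout" -> 0, "spark" -> 1, "tfidf" -> 2)
    val docFreq: Map[String, Long]   = Map("mahout" -> 40L, "spark" -> 25L, "tfidf" -> 10L)
    val totalDFSize = 100L            // number of documents in the training corpus

    val tfidf = new TFIDF()

    // Weight one new document's raw token counts with the same TF-IDF scheme.
    def vectorize(tokenCounts: Map[String, Int]): Vector = {
      val v = new RandomAccessSparseVector(dictionary.size)
      val docSize = tokenCounts.values.sum
      for ((term, tf) <- tokenCounts; idx <- dictionary.get(term)) {
        v.setQuick(idx, tfidf.calculate(tf, docFreq(term).toInt, docSize, totalDFSize.toInt))
      }
      v
    }

    val newDocVector = vectorize(Map("mahout" -> 2, "tfidf" -> 1))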

I was coincidentally going to push them over the weekend but didn't have
a chance, and I thought he might have some use for them.  Having looked at
Gokhan's seq2sparse implementation a little more, I don't think that he
really will have any use for them.

Regarding the package name, I was just suggesting that Gokhan could put
his implementation in o.a.m.nlp if SparkEngine is not where it will go.



Just looking more closely at the actual TF-IDF calculation now:

The mrlegacy TF-IDF weights are calculated by DefaultSimilarity as:

      sqrt(termFreq) * (log(numDocs / (docFreq + 1)) + 1.0)

If I'm reading it correctly, Gokhan's implementation is using:

      termFreq * log(numDocs / docFreq)  ;  where docFreq is always > 0

which is closer to the MLlib TF-IDF formula (without smoothing).


This is partly why I was thinking it is good to have `TermWeight`
classes: to keep the different (correct) formulas separate.
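[Editorial sketch, not the code in PR #75: a minimal, hypothetical
illustration of that idea; the names and the three-argument signature are
made up.]

    import scala.math.{log, sqrt}

    // One trait, one object per weighting formula.
    trait TermWeight {
      def calculate(termFreq: Double, docFreq: Int, numDocs: Int): Double
    }

    // Lucene DefaultSimilarity-style weight, as in the mrlegacy seq2sparse:
    //   sqrt(tf) * (log(numDocs / (docFreq + 1)) + 1.0)
    object LuceneTfIdf extends TermWeight {
      def calculate(termFreq: Double, docFreq: Int, numDocs: Int): Double =
        sqrt(termFreq) * (log(numDocs.toDouble / (docFreq + 1)) + 1.0)
    }

    // Unsmoothed weight, close to what the seq2sparse sketch computes:
    //   tf * log(numDocs / docFreq), with docFreq > 0
    // (for comparison, Spark MLlib's smoothed IDF is log((numDocs + 1) / (docFreq + 1)))
    object PlainTfIdf extends TermWeight {
      def calculate(termFreq: Double, docFreq: Int, numDocs: Int): Double = {
        require(docFreq > 0, "docFreq must be positive for the unsmoothed formula")
        termFreq * log(numDocs.toDouble / docFreq)
      }
    }

    // For a term occurring 3 times in a doc and in 10 of 100 docs:
    //   LuceneTfIdf.calculate(3, 10, 100) ~= sqrt(3) * (log(100/11) + 1) ~= 5.55
    //   PlainTfIdf.calculate(3, 10, 100)  = 3 * log(10)                  ~= 6.91

Keeping each formula behind its own object makes it harder to mix the
Lucene-compatible and MLlib-compatible weights by accident.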



Looking at my `MLlibTFIDF` code right now, I believe there may be a bug
in it and also some incorrect documentation... I will go over it tomorrow.






On 03/09/2015 09:56 PM, Suneel Marthi wrote:
> AP, How is ur impl different from Gokhan's?


Re: TF-IDF, seq2sparse and DataFrame support

Posted by Suneel Marthi <su...@gmail.com>.
AP, How is ur impl different from Gokhan's?

On Mon, Mar 9, 2015 at 9:54 PM, Andrew Palumbo <ap...@outlook.com> wrote:

> BTW, i'm not sure o.a.m.nlp is the best package name for either,  I was
> using because o.a.m.vectorizer, which is probably a better name, had
> conflicts in mrlegacy.

Re: TF-IDF, seq2sparse and DataFrame support

Posted by Andrew Palumbo <ap...@outlook.com>.
BTW, I'm not sure o.a.m.nlp is the best package name for either; I was
using it because o.a.m.vectorizer, which is probably a better name, had
conflicts in mrlegacy.

On 03/09/2015 09:29 PM, Andrew Palumbo wrote:
>
> I meant would o.a.m.nlp in the spark module be a good place for 
> Gokhan's seq2sparse implementation to live.


Re: TF-IDF, seq2sparse and DataFrame support

Posted by Andrew Palumbo <ap...@outlook.com>.
I meant: would o.a.m.nlp in the spark module be a good place for Gokhan's 
seq2sparse implementation to live?

On 03/09/2015 09:07 PM, Pat Ferrel wrote:
>> Does o.a.m.nlp  in the spark module seem like a good place for this to live?
> I think you meant math-scala?
>
> Actually we should rename math to core
>
>
> On Mar 9, 2015, at 3:15 PM, Andrew Palumbo <ap...@outlook.com> wrote:
>
> Cool- This is great! I think this is really important to have in.
>
> +1 to a pull request for comments.
>
> I have pr#75(https://github.com/apache/mahout/pull/75) open - It has very simple TF and TFIDF classes based on lucene's IDF calculation and MLlib's  I just got a bad flu and haven't had a chance to push it.  It creates an o.a.m.nlp package in mahout-math. I will push that as soon as i can in case you want to use them.
>
> Does o.a.m.nlp  in the spark module seem like a good place for this to live?
>
> Those classes may be of use to you- they're very simple and are intended for new document vectorization once the legacy deps are removed from the spark module.  They also might make interoperability with easier.
>
> One thought having not been able to look at this too closely yet.
>
>>> //do we need do calculate df-vector?
> 1.  We do need a document frequency map or vector to be able to calculate the IDF terms when vectorizing a new document outside of the original corpus.
>
>
>
>
> On 03/09/2015 05:10 PM, Pat Ferrel wrote:
>> Ah, you are doing all the lucene analyzer, ngrams and other tokenizing, nice.
>>
>> On Mar 9, 2015, at 2:07 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:
>>
>> Ah I found the right button in Github no PR necessary.
>>
>> On Mar 9, 2015, at 1:55 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:
>>
>> If you create a PR it’s easier to see what was changed.
>>
>> Wouldn’t it be better to read in files from a directory assigning doc-id = filename and term-ids = terms or are their still Hadoop pipeline tools that are needed to create the sequence files? This sort of mimics the way Spark reads SchemaRDDs from Json files.
>>
>> BTW this can also be done with a new reader trait on the IndexedDataset. It will give you two bidirectional maps (BiMap) and a DrmLike[Int]. One BiMap gives any String <-> Int for rows, the other does the same for columns (text tokens). This would be a few lines of code since the string mapping and DRM creation is already written, The only thing to do would be map the doc/row ids to filenames. This allows you to take the non-int doc ids out of the DRM and replace them with a map. Not based on a Spark dataframe yet probably will be.
>>
>> On Mar 9, 2015, at 11:12 AM, Gokhan Capan <gk...@gmail.com> wrote:
>>
>> So, here is a sketch of a Spark implementation of seq2sparse, returning a
>> (matrix:DrmLike, dictionary:Map):
>>
>> https://github.com/gcapan/mahout/tree/seq2sparse
>>
>> Although it should be possible, I couldn't manage to make it process
>> non-integer document ids. Any fix would be appreciated. There is a simple
>> test attached, but I think there is more to do in terms of handling all
>> parameters of the original seq2sparse implementation.
>>
>> I put it directly to the SparkEngine ---not that I think of this object is
>> the most appropriate placeholder, it just seemed convenient to me.
>>
>> Best
>>
>>
>> Gokhan
>>
>> On Thu, Feb 5, 2015 at 3:48 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:
>>
>>> IndexedDataset might suffice until real DataFrames come along.
>>>
>>> On Feb 4, 2015, at 3:42 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>>
>>> Dealing with dictionaries is inevitably DataFrame for seq2sparse. It is a
>>> byproduct of it IIRC. matrix definitely not a structure to hold those.
>>>
>>> On Wed, Feb 4, 2015 at 9:16 AM, Andrew Palumbo <ap...@outlook.com> wrote:
>>>
>>>> On 02/04/2015 11:13 AM, Pat Ferrel wrote:
>>>>
>>>>> Andrew, not sure what you mean about storing strings. If you mean
>>>>> something like a DRM of tokens, that is a DataFrame with row=doc column
>>> =
>>>>> token. A one row DataFrame is a slightly heavy weight string/document. A
>>>>> DataFrame with token counts would be perfect for input TF-IDF, no? It
>>> would
>>>>> be a vector that maintains the tokens as ids for the counts, right?
>>>>>
>>>> Yes- dataframes will be perfect for this.  The problem that i was
>>>> referring to was that we dont have a DSL Data Structure to to do the
>>>> initial distributed tokenizing of the documents[1] line:257, [2] . For
>>> this
>>>> I believe we would need something like a Distributed vector of Strings
>>> that
>>>> could be broadcast to a mapBlock closure and then tokenized from there.
>>>> Even there, MapBlock may not be perfect for this, but some of the new
>>>> Distributed functions that Gockhan is working on may.
>>>>
>>>>> I agree seq2sparse type input is a strong feature. Text files into an
>>>>> all-documents DataFrame basically. Colocation?
>>>>>
>>>> as far as collocations i believe that the n-gram are computed and counted
>>>> in the CollocDriver [3] (i might be wrong her...its been a while since i
>>>> looked at the code...) either way, I dont think I ever looked too closely
>>>> and i was a bit fuzzy on this...
>>>>
>>>> These were just some thoughts that I had when briefly looking at porting
>>>> seq2sparse to the DSL before.. Obviously we don't have to follow this
>>>> algorithm but its a nice starting point.
>>>>
>>>> [1] https://github.com/apache/mahout/blob/master/mrlegacy/
>>>> src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles
>>>> .java
>>>> [2] https://github.com/apache/mahout/blob/master/mrlegacy/
>>>> src/main/java/org/apache/mahout/vectorizer/DocumentProcessor.java
>>>> [3]https://github.com/apache/mahout/blob/master/mrlegacy/
>>>> src/main/java/org/apache/mahout/vectorizer/collocations/llr/CollocDriver.
>>>> java
>>>>
>>>>
>>>>
>>>>> On Feb 4, 2015, at 7:47 AM, Andrew Palumbo <ap...@outlook.com> wrote:
>>>>>
>>>>> Just copied over the relevant last few messages to keep the other thread
>>>>> on topic...
>>>>>
>>>>>
>>>>> On 02/03/2015 08:22 PM, Dmitriy Lyubimov wrote:
>>>>>
>>>>>> I'd suggest to consider this: remember all this talk about
>>>>>> language-integrated spark ql being basically dataframe manipulation
>>> DSL?
>>>>>> so now Spark devs are noticing this generality as well and are actually
>>>>>> proposing to rename SchemaRDD into DataFrame and make it mainstream
>>> data
>>>>>> structure. (my "told you so" moment of sorts
>>>>>>
>>>>>> What i am getting at, i'd suggest to make DRM and Spark's newly renamed
>>>>>> DataFrame our two major structures. In particular, standardize on using
>>>>>> DataFrame for things that may include non-numerical data and require
>>> more
>>>>>> grace about column naming and manipulation. Maybe relevant to TF-IDF
>>> work
>>>>>> when it deals with non-matrix content.
>>>>>>
>>>>> Sounds like a worthy effort to me.  We'd be basically implementing an
>>> API
>>>>> at the math-scala level for SchemaRDD/Dataframe datastructures correct?
>>>>>
>>>>> On Tue, Feb 3, 2015 at 5:01 PM, Pat Ferrel <pa...@occamsmachete.com>
>>> wrote:
>>>>>> Seems like seq2sparse would be really easy to replace since it takes
>>> text
>>>>>>> files to start with, then the whole pipeline could be kept in rdds.
>>> The
>>>>>>> dictionaries and counts could be either in-memory maps or rdds for use
>>>>>>> with
>>>>>>> joins? This would get rid of sequence files completely from the
>>>>>>> pipeline.
>>>>>>> Item similarity uses in-memory maps but the plan is to make it more
>>>>>>> scalable using joins as an alternative with the same API allowing the
>>>>>>> user
>>>>>>> to trade-off footprint for speed.
>>>>>>>
>>>>>> I think you're right- should be relatively easy.  I've been looking at
>>>>> porting seq2sparse  to the DSL for bit now and the stopper at the DSL
>>> level
>>>>> is that we don't have a distributed data structure for strings..Seems
>>> like
>>>>> getting a DataFrame implemented as Dmitriy mentioned above would take
>>> care
>>>>> of this problem.
>>>>>
>>>>> The other issue i'm a little fuzzy on  is the distributed collocation
>>>>> mapping-  it's a part of the seq2sparse code that I've not spent too
>>> much
>>>>> time in.
>>>>>
>>>>> I think that this would be very worthy effort as well-  I believe
>>>>> seq2sparse is a particular strong mahout feature.
>>>>>
>>>>> I'll start another thread since we're now way off topic from the
>>>>> refactoring proposal.
>>>>>
>>>>> My use for TF-IDF is for row similarity and would take a DRM (actually
>>>>> IndexedDataset) and calculate row/doc similarities. It works now but
>>> only
>>>>> using LLR. This is OK when thinking of the items as tags or metadata but
>>>>> for text tokens something like cosine may be better.
>>>>>
>>>>> I’d imagine a downsampling phase that would precede TF-IDF using LLR a
>>> lot
>>>>> like how CF preferences are downsampled. This would produce an
>>> sparsified
>>>>> all-docs DRM. Then (if the counts were saved) TF-IDF would re-weight the
>>>>> terms before row similarity uses cosine. This is not so good for search
>>>>> but
>>>>> should produce much better similarities than Solr’s “moreLikeThis” and
>>>>> does
>>>>> it for all pairs rather than one at a time.
>>>>>
>>>>> In any case it can be used to do a create a personalized content-based
>>>>> recommender or augment a CF recommender with one more indicator type.
>>>>>
>>>>> On Feb 3, 2015, at 3:37 PM, Andrew Palumbo <ap...@outlook.com> wrote:
>>>>>
>>>>>
>>>>> On 02/03/2015 12:44 PM, Andrew Palumbo wrote:
>>>>>
>>>>>> On 02/03/2015 12:22 PM, Pat Ferrel wrote:
>>>>>>
>>>>>>> Some issues WRT lower level Spark integration:
>>>>>>> 1) interoperability with Spark data. TF-IDF is one example I actually
>>>>>>>
>>>>>> looked at. There may be other things we can pick up from their
>>> committers
>>>>> since they have an abundance.
>>>>>
>>>>>> 2) wider acceptance of Mahout DSL. The DSL’s power was illustrated to
>>>>>> me when someone on the Spark list asked about matrix transpose and an
>>>>> MLlib
>>>>> committer’s answer was something like “why would you want to do that?”.
>>>>> Usually you don’t actually execute the transpose but they don’t even
>>>>> support A’A, AA’, or A’B, which are core to what I work on. At present
>>> you
>>>>> pretty much have to choose between MLlib or Mahout for sparse matrix
>>>>> stuff.
>>>>> Maybe a half-way measure is some implicit conversions (ugh, I know). If
>>>>> the
>>>>> DSL could interchange datasets with MLlib, people would be pointed to
>>> the
>>>>> DSL for all of a bunch of “why would you want to do that?” features.
>>> MLlib
>>>>> seems to be algorithms, not math.
>>>>>
>>>>>> 3) integration of Streaming. DStreams support most of the RDD
>>>>>> interface. Doing a batch recalc on a moving time window would nearly
>>> fall
>>>>> out of DStream backed DRMs. This isn’t the same as incremental updates
>>> on
>>>>> streaming but it’s a start.
>>>>>
>>>>>> Last year we were looking at Hadoop Mapreduce vs Spark, H2O, Flink
>>>>>> faster compute engines. So we jumped. Now the need is for streaming and
>>>>> especially incrementally updated streaming. Seems like we need to
>>> address
>>>>> this.
>>>>>
>>>>>> Andrew, regardless of the above having TF-IDF would be super
>>>>>> helpful—row similarity for content/text would benefit greatly.
>>>>>> I will put a PR up soon.
>>>>>>
>>>>> Just to clarify, I'll be porting over the (very simple) TF, TFIDF
>>> classes
>>>>> and Weight interface over from mr-legacy to math-scala. They're
>>> available
>>>>> now in spark-shell but won't be after this refactoring.  These still
>>>>> require dictionary and a frequency count maps to vectorize incoming
>>> text-
>>>>> so they're more for use with the old MR seq2sparse and I don't think
>>> they
>>>>> can be used with Spark's HashingTF and IDF.  I'll put them up soon.
>>>>> Hopefully they'll be of some use.
>>>>>
>>>>>
>>
>


Re: TF-IDF, seq2sparse and DataFrame support

Posted by Pat Ferrel <pa...@occamsmachete.com>.
> Does o.a.m.nlp  in the spark module seem like a good place for this to live?

I think you meant math-scala?

Actually we should rename math to core


On Mar 9, 2015, at 3:15 PM, Andrew Palumbo <ap...@outlook.com> wrote:

Cool- This is great! I think this is really important to have in.

+1 to a pull request for comments.

I have pr#75 (https://github.com/apache/mahout/pull/75) open - it has very simple TF and TFIDF classes based on lucene's IDF calculation and MLlib's. I just got a bad flu and haven't had a chance to push it.  It creates an o.a.m.nlp package in mahout-math. I will push that as soon as I can in case you want to use them.

Does o.a.m.nlp  in the spark module seem like a good place for this to live?

Those classes may be of use to you- they're very simple and are intended for new document vectorization once the legacy deps are removed from the spark module.  They also might make interoperability easier.

One thought, having not been able to look at this too closely yet:

>> //do we need do calculate df-vector?

1.  We do need a document frequency map or vector to be able to calculate the IDF terms when vectorizing a new document outside of the original corpus.




On 03/09/2015 05:10 PM, Pat Ferrel wrote:
> Ah, you are doing all the lucene analyzer, ngrams and other tokenizing, nice.
> 
> On Mar 9, 2015, at 2:07 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:
> 
> Ah I found the right button in Github no PR necessary.
> 
> On Mar 9, 2015, at 1:55 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:
> 
> If you create a PR it’s easier to see what was changed.
> 
> Wouldn’t it be better to read in files from a directory assigning doc-id = filename and term-ids = terms or are their still Hadoop pipeline tools that are needed to create the sequence files? This sort of mimics the way Spark reads SchemaRDDs from Json files.
> 
> BTW this can also be done with a new reader trait on the IndexedDataset. It will give you two bidirectional maps (BiMap) and a DrmLike[Int]. One BiMap gives any String <-> Int for rows, the other does the same for columns (text tokens). This would be a few lines of code since the string mapping and DRM creation is already written, The only thing to do would be map the doc/row ids to filenames. This allows you to take the non-int doc ids out of the DRM and replace them with a map. Not based on a Spark dataframe yet probably will be.
> 
> On Mar 9, 2015, at 11:12 AM, Gokhan Capan <gk...@gmail.com> wrote:
> 
> So, here is a sketch of a Spark implementation of seq2sparse, returning a
> (matrix:DrmLike, dictionary:Map):
> 
> https://github.com/gcapan/mahout/tree/seq2sparse
> 
> Although it should be possible, I couldn't manage to make it process
> non-integer document ids. Any fix would be appreciated. There is a simple
> test attached, but I think there is more to do in terms of handling all
> parameters of the original seq2sparse implementation.
> 
> I put it directly to the SparkEngine ---not that I think of this object is
> the most appropriate placeholder, it just seemed convenient to me.
> 
> Best
> 
> 
> Gokhan
> 
> On Thu, Feb 5, 2015 at 3:48 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:
> 
>> IndexedDataset might suffice until real DataFrames come along.
>> 
>> On Feb 4, 2015, at 3:42 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>> 
>> Dealing with dictionaries is inevitably DataFrame for seq2sparse. It is a
>> byproduct of it IIRC. matrix definitely not a structure to hold those.
>> 
>> On Wed, Feb 4, 2015 at 9:16 AM, Andrew Palumbo <ap...@outlook.com> wrote:
>> 
>>> On 02/04/2015 11:13 AM, Pat Ferrel wrote:
>>> 
>>>> Andrew, not sure what you mean about storing strings. If you mean
>>>> something like a DRM of tokens, that is a DataFrame with row=doc column
>> =
>>>> token. A one row DataFrame is a slightly heavy weight string/document. A
>>>> DataFrame with token counts would be perfect for input TF-IDF, no? It
>> would
>>>> be a vector that maintains the tokens as ids for the counts, right?
>>>> 
>>> Yes- dataframes will be perfect for this.  The problem that i was
>>> referring to was that we dont have a DSL Data Structure to to do the
>>> initial distributed tokenizing of the documents[1] line:257, [2] . For
>> this
>>> I believe we would need something like a Distributed vector of Strings
>> that
>>> could be broadcast to a mapBlock closure and then tokenized from there.
>>> Even there, MapBlock may not be perfect for this, but some of the new
>>> Distributed functions that Gockhan is working on may.
>>> 
>>>> I agree seq2sparse type input is a strong feature. Text files into an
>>>> all-documents DataFrame basically. Colocation?
>>>> 
>>> as far as collocations i believe that the n-gram are computed and counted
>>> in the CollocDriver [3] (i might be wrong her...its been a while since i
>>> looked at the code...) either way, I dont think I ever looked too closely
>>> and i was a bit fuzzy on this...
>>> 
>>> These were just some thoughts that I had when briefly looking at porting
>>> seq2sparse to the DSL before.. Obviously we don't have to follow this
>>> algorithm but its a nice starting point.
>>> 
>>> [1] https://github.com/apache/mahout/blob/master/mrlegacy/
>>> src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles
>>> .java
>>> [2] https://github.com/apache/mahout/blob/master/mrlegacy/
>>> src/main/java/org/apache/mahout/vectorizer/DocumentProcessor.java
>>> [3]https://github.com/apache/mahout/blob/master/mrlegacy/
>>> src/main/java/org/apache/mahout/vectorizer/collocations/llr/CollocDriver.
>>> java
>>> 
>>> 
>>> 
>>>> On Feb 4, 2015, at 7:47 AM, Andrew Palumbo <ap...@outlook.com> wrote:
>>>> 
>>>> Just copied over the relevant last few messages to keep the other thread
>>>> on topic...
>>>> 
>>>> 
>>>> On 02/03/2015 08:22 PM, Dmitriy Lyubimov wrote:
>>>> 
>>>>> I'd suggest to consider this: remember all this talk about
>>>>> language-integrated spark ql being basically dataframe manipulation
>> DSL?
>>>>> so now Spark devs are noticing this generality as well and are actually
>>>>> proposing to rename SchemaRDD into DataFrame and make it mainstream
>> data
>>>>> structure. (my "told you so" moment of sorts
>>>>> 
>>>>> What i am getting at, i'd suggest to make DRM and Spark's newly renamed
>>>>> DataFrame our two major structures. In particular, standardize on using
>>>>> DataFrame for things that may include non-numerical data and require
>> more
>>>>> grace about column naming and manipulation. Maybe relevant to TF-IDF
>> work
>>>>> when it deals with non-matrix content.
>>>>> 
>>>> Sounds like a worthy effort to me.  We'd be basically implementing an
>> API
>>>> at the math-scala level for SchemaRDD/Dataframe datastructures correct?
>>>> 
>>>> On Tue, Feb 3, 2015 at 5:01 PM, Pat Ferrel <pa...@occamsmachete.com>
>> wrote:
>>>>> Seems like seq2sparse would be really easy to replace since it takes
>> text
>>>>>> files to start with, then the whole pipeline could be kept in rdds.
>> The
>>>>>> dictionaries and counts could be either in-memory maps or rdds for use
>>>>>> with
>>>>>> joins? This would get rid of sequence files completely from the
>>>>>> pipeline.
>>>>>> Item similarity uses in-memory maps but the plan is to make it more
>>>>>> scalable using joins as an alternative with the same API allowing the
>>>>>> user
>>>>>> to trade-off footprint for speed.
>>>>>> 
>>>>> I think you're right- should be relatively easy.  I've been looking at
>>>> porting seq2sparse  to the DSL for bit now and the stopper at the DSL
>> level
>>>> is that we don't have a distributed data structure for strings..Seems
>> like
>>>> getting a DataFrame implemented as Dmitriy mentioned above would take
>> care
>>>> of this problem.
>>>> 
>>>> The other issue i'm a little fuzzy on  is the distributed collocation
>>>> mapping-  it's a part of the seq2sparse code that I've not spent too
>> much
>>>> time in.
>>>> 
>>>> I think that this would be very worthy effort as well-  I believe
>>>> seq2sparse is a particular strong mahout feature.
>>>> 
>>>> I'll start another thread since we're now way off topic from the
>>>> refactoring proposal.
>>>> 
>>>> My use for TF-IDF is for row similarity and would take a DRM (actually
>>>> IndexedDataset) and calculate row/doc similarities. It works now but
>> only
>>>> using LLR. This is OK when thinking of the items as tags or metadata but
>>>> for text tokens something like cosine may be better.
>>>> 
>>>> I’d imagine a downsampling phase that would precede TF-IDF using LLR a
>> lot
>>>> like how CF preferences are downsampled. This would produce an
>> sparsified
>>>> all-docs DRM. Then (if the counts were saved) TF-IDF would re-weight the
>>>> terms before row similarity uses cosine. This is not so good for search
>>>> but
>>>> should produce much better similarities than Solr’s “moreLikeThis” and
>>>> does
>>>> it for all pairs rather than one at a time.
>>>> 
>>>> In any case it can be used to do a create a personalized content-based
>>>> recommender or augment a CF recommender with one more indicator type.
>>>> 
>>>> On Feb 3, 2015, at 3:37 PM, Andrew Palumbo <ap...@outlook.com> wrote:
>>>> 
>>>> 
>>>> On 02/03/2015 12:44 PM, Andrew Palumbo wrote:
>>>> 
>>>>> On 02/03/2015 12:22 PM, Pat Ferrel wrote:
>>>>> 
>>>>>> Some issues WRT lower level Spark integration:
>>>>>> 1) interoperability with Spark data. TF-IDF is one example I actually
>>>>>> 
>>>>> looked at. There may be other things we can pick up from their
>> committers
>>>> since they have an abundance.
>>>> 
>>>>> 2) wider acceptance of Mahout DSL. The DSL’s power was illustrated to
>>>>> me when someone on the Spark list asked about matrix transpose and an
>>>> MLlib
>>>> committer’s answer was something like “why would you want to do that?”.
>>>> Usually you don’t actually execute the transpose but they don’t even
>>>> support A’A, AA’, or A’B, which are core to what I work on. At present
>> you
>>>> pretty much have to choose between MLlib or Mahout for sparse matrix
>>>> stuff.
>>>> Maybe a half-way measure is some implicit conversions (ugh, I know). If
>>>> the
>>>> DSL could interchange datasets with MLlib, people would be pointed to
>> the
>>>> DSL for all of a bunch of “why would you want to do that?” features.
>> MLlib
>>>> seems to be algorithms, not math.
>>>> 
>>>>> 3) integration of Streaming. DStreams support most of the RDD
>>>>> interface. Doing a batch recalc on a moving time window would nearly
>> fall
>>>> out of DStream backed DRMs. This isn’t the same as incremental updates
>> on
>>>> streaming but it’s a start.
>>>> 
>>>>> Last year we were looking at Hadoop Mapreduce vs Spark, H2O, Flink
>>>>> faster compute engines. So we jumped. Now the need is for streaming and
>>>> especially incrementally updated streaming. Seems like we need to
>> address
>>>> this.
>>>> 
>>>>> Andrew, regardless of the above having TF-IDF would be super
>>>>> helpful—row similarity for content/text would benefit greatly.
>>>>> I will put a PR up soon.
>>>>> 
>>>> Just to clarify, I'll be porting over the (very simple) TF, TFIDF
>> classes
>>>> and Weight interface over from mr-legacy to math-scala. They're
>> available
>>>> now in spark-shell but won't be after this refactoring.  These still
>>>> require dictionary and a frequency count maps to vectorize incoming
>> text-
>>>> so they're more for use with the old MR seq2sparse and I don't think
>> they
>>>> can be used with Spark's HashingTF and IDF.  I'll put them up soon.
>>>> Hopefully they'll be of some use.
>>>> 
>>>> 
>> 
> 
> 



Re: TF-IDF, seq2sparse and DataFrame support

Posted by Andrew Palumbo <ap...@outlook.com>.
Cool- This is great! I think this is really important to have in.

+1 to a pull request for comments.

I have pr#75 (https://github.com/apache/mahout/pull/75) open - it has 
very simple TF and TFIDF classes based on lucene's IDF calculation and 
MLlib's. I just got a bad flu and haven't had a chance to push it.  It 
creates an o.a.m.nlp package in mahout-math. I will push that as soon as 
I can in case you want to use them.

Does o.a.m.nlp  in the spark module seem like a good place for this to live?

Those classes may be of use to you- they're very simple and are intended 
for new document vectorization once the legacy deps are removed from the 
spark module.  They also might make interoperability easier.

One thought, having not been able to look at this too closely yet:

>> //do we need do calculate df-vector?

1.  We do need a document frequency map or vector to be able to 
calculate the IDF terms when vectorizing a new document outside of the 
original corpus.
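
As a concrete, minimal sketch of that (not the pr#75 code itself; the names and the lucene-style IDF formula are assumptions for illustration), vectorizing one new document against a saved dictionary and document frequency map could look roughly like:

import org.apache.mahout.math.{RandomAccessSparseVector, Vector}

// Sketch only: weight a new document's term counts with TF-IDF using artifacts
// saved from the original corpus. dictionary: term -> column index,
// df: column index -> number of docs containing that term.
def vectorizeNewDoc(tokens: Seq[String],
                    dictionary: Map[String, Int],
                    df: Map[Int, Long],
                    numDocs: Long): Vector = {
  val v = new RandomAccessSparseVector(dictionary.size)
  tokens.groupBy(identity).foreach { case (term, occurrences) =>
    dictionary.get(term).foreach { col =>
      val tf = occurrences.size.toDouble
      // lucene-style IDF: log(numDocs / (df + 1)) + 1 (assumed here, not the pr#75 Weight impl)
      val idf = math.log(numDocs.toDouble / (df.getOrElse(col, 0L) + 1)) + 1.0
      v.setQuick(col, tf * idf)
    }
  }
  v
}

Without that df map (or vector) only the TF half can be computed for a document outside the corpus, which is the point above.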




On 03/09/2015 05:10 PM, Pat Ferrel wrote:
> Ah, you are doing all the lucene analyzer, ngrams and other tokenizing, nice.
>
> On Mar 9, 2015, at 2:07 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:
>
> Ah I found the right button in Github no PR necessary.
>
> On Mar 9, 2015, at 1:55 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:
>
> If you create a PR it’s easier to see what was changed.
>
> Wouldn’t it be better to read in files from a directory assigning doc-id = filename and term-ids = terms or are their still Hadoop pipeline tools that are needed to create the sequence files? This sort of mimics the way Spark reads SchemaRDDs from Json files.
>
> BTW this can also be done with a new reader trait on the IndexedDataset. It will give you two bidirectional maps (BiMap) and a DrmLike[Int]. One BiMap gives any String <-> Int for rows, the other does the same for columns (text tokens). This would be a few lines of code since the string mapping and DRM creation is already written, The only thing to do would be map the doc/row ids to filenames. This allows you to take the non-int doc ids out of the DRM and replace them with a map. Not based on a Spark dataframe yet probably will be.
>
> On Mar 9, 2015, at 11:12 AM, Gokhan Capan <gk...@gmail.com> wrote:
>
> So, here is a sketch of a Spark implementation of seq2sparse, returning a
> (matrix:DrmLike, dictionary:Map):
>
> https://github.com/gcapan/mahout/tree/seq2sparse
>
> Although it should be possible, I couldn't manage to make it process
> non-integer document ids. Any fix would be appreciated. There is a simple
> test attached, but I think there is more to do in terms of handling all
> parameters of the original seq2sparse implementation.
>
> I put it directly to the SparkEngine ---not that I think of this object is
> the most appropriate placeholder, it just seemed convenient to me.
>
> Best
>
>
> Gokhan
>
> On Thu, Feb 5, 2015 at 3:48 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:
>
>> IndexedDataset might suffice until real DataFrames come along.
>>
>> On Feb 4, 2015, at 3:42 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>
>> Dealing with dictionaries is inevitably DataFrame for seq2sparse. It is a
>> byproduct of it IIRC. matrix definitely not a structure to hold those.
>>
>> On Wed, Feb 4, 2015 at 9:16 AM, Andrew Palumbo <ap...@outlook.com> wrote:
>>
>>> On 02/04/2015 11:13 AM, Pat Ferrel wrote:
>>>
>>>> Andrew, not sure what you mean about storing strings. If you mean
>>>> something like a DRM of tokens, that is a DataFrame with row=doc column
>> =
>>>> token. A one row DataFrame is a slightly heavy weight string/document. A
>>>> DataFrame with token counts would be perfect for input TF-IDF, no? It
>> would
>>>> be a vector that maintains the tokens as ids for the counts, right?
>>>>
>>> Yes- dataframes will be perfect for this.  The problem that i was
>>> referring to was that we dont have a DSL Data Structure to to do the
>>> initial distributed tokenizing of the documents[1] line:257, [2] . For
>> this
>>> I believe we would need something like a Distributed vector of Strings
>> that
>>> could be broadcast to a mapBlock closure and then tokenized from there.
>>> Even there, MapBlock may not be perfect for this, but some of the new
>>> Distributed functions that Gockhan is working on may.
>>>
>>>> I agree seq2sparse type input is a strong feature. Text files into an
>>>> all-documents DataFrame basically. Colocation?
>>>>
>>> as far as collocations i believe that the n-gram are computed and counted
>>> in the CollocDriver [3] (i might be wrong her...its been a while since i
>>> looked at the code...) either way, I dont think I ever looked too closely
>>> and i was a bit fuzzy on this...
>>>
>>> These were just some thoughts that I had when briefly looking at porting
>>> seq2sparse to the DSL before.. Obviously we don't have to follow this
>>> algorithm but its a nice starting point.
>>>
>>> [1] https://github.com/apache/mahout/blob/master/mrlegacy/
>>> src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles
>>> .java
>>> [2] https://github.com/apache/mahout/blob/master/mrlegacy/
>>> src/main/java/org/apache/mahout/vectorizer/DocumentProcessor.java
>>> [3]https://github.com/apache/mahout/blob/master/mrlegacy/
>>> src/main/java/org/apache/mahout/vectorizer/collocations/llr/CollocDriver.
>>> java
>>>
>>>
>>>
>>>> On Feb 4, 2015, at 7:47 AM, Andrew Palumbo <ap...@outlook.com> wrote:
>>>>
>>>> Just copied over the relevant last few messages to keep the other thread
>>>> on topic...
>>>>
>>>>
>>>> On 02/03/2015 08:22 PM, Dmitriy Lyubimov wrote:
>>>>
>>>>> I'd suggest to consider this: remember all this talk about
>>>>> language-integrated spark ql being basically dataframe manipulation
>> DSL?
>>>>> so now Spark devs are noticing this generality as well and are actually
>>>>> proposing to rename SchemaRDD into DataFrame and make it mainstream
>> data
>>>>> structure. (my "told you so" moment of sorts
>>>>>
>>>>> What i am getting at, i'd suggest to make DRM and Spark's newly renamed
>>>>> DataFrame our two major structures. In particular, standardize on using
>>>>> DataFrame for things that may include non-numerical data and require
>> more
>>>>> grace about column naming and manipulation. Maybe relevant to TF-IDF
>> work
>>>>> when it deals with non-matrix content.
>>>>>
>>>> Sounds like a worthy effort to me.  We'd be basically implementing an
>> API
>>>> at the math-scala level for SchemaRDD/Dataframe datastructures correct?
>>>>
>>>> On Tue, Feb 3, 2015 at 5:01 PM, Pat Ferrel <pa...@occamsmachete.com>
>> wrote:
>>>>> Seems like seq2sparse would be really easy to replace since it takes
>> text
>>>>>> files to start with, then the whole pipeline could be kept in rdds.
>> The
>>>>>> dictionaries and counts could be either in-memory maps or rdds for use
>>>>>> with
>>>>>> joins? This would get rid of sequence files completely from the
>>>>>> pipeline.
>>>>>> Item similarity uses in-memory maps but the plan is to make it more
>>>>>> scalable using joins as an alternative with the same API allowing the
>>>>>> user
>>>>>> to trade-off footprint for speed.
>>>>>>
>>>>> I think you're right- should be relatively easy.  I've been looking at
>>>> porting seq2sparse  to the DSL for bit now and the stopper at the DSL
>> level
>>>> is that we don't have a distributed data structure for strings..Seems
>> like
>>>> getting a DataFrame implemented as Dmitriy mentioned above would take
>> care
>>>> of this problem.
>>>>
>>>> The other issue i'm a little fuzzy on  is the distributed collocation
>>>> mapping-  it's a part of the seq2sparse code that I've not spent too
>> much
>>>> time in.
>>>>
>>>> I think that this would be very worthy effort as well-  I believe
>>>> seq2sparse is a particular strong mahout feature.
>>>>
>>>> I'll start another thread since we're now way off topic from the
>>>> refactoring proposal.
>>>>
>>>> My use for TF-IDF is for row similarity and would take a DRM (actually
>>>> IndexedDataset) and calculate row/doc similarities. It works now but
>> only
>>>> using LLR. This is OK when thinking of the items as tags or metadata but
>>>> for text tokens something like cosine may be better.
>>>>
>>>> I’d imagine a downsampling phase that would precede TF-IDF using LLR a
>> lot
>>>> like how CF preferences are downsampled. This would produce an
>> sparsified
>>>> all-docs DRM. Then (if the counts were saved) TF-IDF would re-weight the
>>>> terms before row similarity uses cosine. This is not so good for search
>>>> but
>>>> should produce much better similarities than Solr’s “moreLikeThis” and
>>>> does
>>>> it for all pairs rather than one at a time.
>>>>
>>>> In any case it can be used to do a create a personalized content-based
>>>> recommender or augment a CF recommender with one more indicator type.
>>>>
>>>> On Feb 3, 2015, at 3:37 PM, Andrew Palumbo <ap...@outlook.com> wrote:
>>>>
>>>>
>>>> On 02/03/2015 12:44 PM, Andrew Palumbo wrote:
>>>>
>>>>> On 02/03/2015 12:22 PM, Pat Ferrel wrote:
>>>>>
>>>>>> Some issues WRT lower level Spark integration:
>>>>>> 1) interoperability with Spark data. TF-IDF is one example I actually
>>>>>>
>>>>> looked at. There may be other things we can pick up from their
>> committers
>>>> since they have an abundance.
>>>>
>>>>> 2) wider acceptance of Mahout DSL. The DSL’s power was illustrated to
>>>>> me when someone on the Spark list asked about matrix transpose and an
>>>> MLlib
>>>> committer’s answer was something like “why would you want to do that?”.
>>>> Usually you don’t actually execute the transpose but they don’t even
>>>> support A’A, AA’, or A’B, which are core to what I work on. At present
>> you
>>>> pretty much have to choose between MLlib or Mahout for sparse matrix
>>>> stuff.
>>>> Maybe a half-way measure is some implicit conversions (ugh, I know). If
>>>> the
>>>> DSL could interchange datasets with MLlib, people would be pointed to
>> the
>>>> DSL for all of a bunch of “why would you want to do that?” features.
>> MLlib
>>>> seems to be algorithms, not math.
>>>>
>>>>> 3) integration of Streaming. DStreams support most of the RDD
>>>>> interface. Doing a batch recalc on a moving time window would nearly
>> fall
>>>> out of DStream backed DRMs. This isn’t the same as incremental updates
>> on
>>>> streaming but it’s a start.
>>>>
>>>>> Last year we were looking at Hadoop Mapreduce vs Spark, H2O, Flink
>>>>> faster compute engines. So we jumped. Now the need is for streaming and
>>>> especially incrementally updated streaming. Seems like we need to
>> address
>>>> this.
>>>>
>>>>> Andrew, regardless of the above having TF-IDF would be super
>>>>> helpful—row similarity for content/text would benefit greatly.
>>>>> I will put a PR up soon.
>>>>>
>>>> Just to clarify, I'll be porting over the (very simple) TF, TFIDF
>> classes
>>>> and Weight interface over from mr-legacy to math-scala. They're
>> available
>>>> now in spark-shell but won't be after this refactoring.  These still
>>>> require dictionary and a frequency count maps to vectorize incoming
>> text-
>>>> so they're more for use with the old MR seq2sparse and I don't think
>> they
>>>> can be used with Spark's HashingTF and IDF.  I'll put them up soon.
>>>> Hopefully they'll be of some use.
>>>>
>>>>
>>
>
>


Re: TF-IDF, seq2sparse and DataFrame support

Posted by Pat Ferrel <pa...@occamsmachete.com>.
There is a whole pipeline here and an interesting way of making parts accessible via nested function defs. 

Would it make sense to break them out into separate functions so the base function doesn’t take so many params? Maybe one big helper and smaller but separate pipeline functions so it would be easier to string together your own? For instance I’d like part-of-speech or even nlp as a filter and would never perform the tfidf or LLR in my recommender use cases since they are done in other places. I see they can be disabled. 

This would be useful for a content based recommender but needs a BiMap or the doc-ids preserved in the DRM rows, since they must be written to a search engine as application specific ids—not Mahout ints.

Input a matrix of doc-id, token; perform AA’ with LLR filtering of the tokens (spark-rowsimilarity) and write this to a search engine _using application specific tokens and doc-ids_. The search engine does the TF-IDF. Then either get similar docs for any doc-id or use the user’s history of doc-ids read as a query on AA’ to get personalized recs.
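
A rough sketch of that flow, with hypothetical names (only %*%, t, checkpoint and collect are assumed to be real DSL calls; the LLR downsampling and the search-engine write are stubbed out):

import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._

// drmA: docs x tokens DRM; rowIds: Mahout Int row key -> application doc-id
def indexDocSimilarities(drmA: DrmLike[Int], rowIds: Map[Int, String]): Unit = {
  // doc-doc similarity; in practice each row would be LLR-filtered / downsampled first
  val aat = (drmA %*% drmA.t).checkpoint()
  val inCore = aat.collect                        // assumes the result fits in memory
  for (r <- 0 until inCore.numRows()) {
    val docId = rowIds(r)                         // application-specific id, not a Mahout int
    val similarities = inCore.viewRow(r)          // similarity strengths to the other docs
    // writeToSearchEngine(docId, similarities)   // stub: index as one indicator field
  }
}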


On Mar 9, 2015, at 2:10 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:

Ah, you are doing all the lucene analyzer, ngrams and other tokenizing, nice.

On Mar 9, 2015, at 2:07 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:

Ah, I found the right button in GitHub; no PR necessary.

On Mar 9, 2015, at 1:55 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:

If you create a PR it’s easier to see what was changed.

Wouldn’t it be better to read in files from a directory assigning doc-id = filename and term-ids = terms, or are there still Hadoop pipeline tools that are needed to create the sequence files? This sort of mimics the way Spark reads SchemaRDDs from JSON files.

BTW this can also be done with a new reader trait on the IndexedDataset. It will give you two bidirectional maps (BiMap) and a DrmLike[Int]. One BiMap gives any String <-> Int for rows, the other does the same for columns (text tokens). This would be a few lines of code since the string mapping and DRM creation is already written. The only thing to do would be to map the doc/row ids to filenames. This allows you to take the non-int doc ids out of the DRM and replace them with a map. Not based on a Spark dataframe yet but probably will be.

On Mar 9, 2015, at 11:12 AM, Gokhan Capan <gk...@gmail.com> wrote:

So, here is a sketch of a Spark implementation of seq2sparse, returning a
(matrix:DrmLike, dictionary:Map):

https://github.com/gcapan/mahout/tree/seq2sparse

Although it should be possible, I couldn't manage to make it process
non-integer document ids. Any fix would be appreciated. There is a simple
test attached, but I think there is more to do in terms of handling all
parameters of the original seq2sparse implementation.

I put it directly into the SparkEngine --- not that I think this object is
the most appropriate place for it; it just seemed convenient to me.

Best


Gokhan
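
A minimal sketch of the same shape of pipeline, under assumed names (not the code in the branch above; analyzers, n-grams, pruning and the LLR collocation pass are all omitted):

import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD
import org.apache.mahout.math.{RandomAccessSparseVector, Vector}
import org.apache.mahout.math.drm.DrmLike
import org.apache.mahout.sparkbindings._

// docs: (docId, raw text). Returns a term-count DRM plus the dictionary,
// mirroring the (matrix, dictionary) return shape discussed above.
def sketchSeq2Sparse(docs: RDD[(Long, String)]): (DrmLike[Int], Map[String, Int]) = {
  val tokenized = docs.mapValues(_.toLowerCase.split("\\W+").filter(_.nonEmpty))
  // dictionary: token -> column index (collected to the driver, then broadcast)
  val dictionary = tokenized.flatMap(_._2).distinct().collect().zipWithIndex.toMap
  val bcastDict = tokenized.context.broadcast(dictionary)
  val vectors = tokenized.map { case (docId, tokens) =>
    val v: Vector = new RandomAccessSparseVector(bcastDict.value.size)
    tokens.groupBy(identity).foreach { case (t, occ) =>
      bcastDict.value.get(t).foreach(col => v.setQuick(col, occ.length.toDouble))
    }
    docId.toInt -> v    // Int row keys; non-Int doc ids are the open question above
  }
  (drmWrap(vectors), dictionary)
}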

On Thu, Feb 5, 2015 at 3:48 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> IndexedDataset might suffice until real DataFrames come along.
> 
> On Feb 4, 2015, at 3:42 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> 
> Dealing with dictionaries is inevitably DataFrame for seq2sparse. It is a
> byproduct of it IIRC. matrix definitely not a structure to hold those.
> 
> On Wed, Feb 4, 2015 at 9:16 AM, Andrew Palumbo <ap...@outlook.com> wrote:
> 
>> 
>> On 02/04/2015 11:13 AM, Pat Ferrel wrote:
>> 
>>> Andrew, not sure what you mean about storing strings. If you mean
>>> something like a DRM of tokens, that is a DataFrame with row=doc column
> =
>>> token. A one row DataFrame is a slightly heavy weight string/document. A
>>> DataFrame with token counts would be perfect for input TF-IDF, no? It
> would
>>> be a vector that maintains the tokens as ids for the counts, right?
>>> 
>> 
>> Yes- dataframes will be perfect for this.  The problem that i was
>> referring to was that we dont have a DSL Data Structure to to do the
>> initial distributed tokenizing of the documents[1] line:257, [2] . For
> this
>> I believe we would need something like a Distributed vector of Strings
> that
>> could be broadcast to a mapBlock closure and then tokenized from there.
>> Even there, MapBlock may not be perfect for this, but some of the new
>> Distributed functions that Gockhan is working on may.
>> 
>>> 
>>> I agree seq2sparse type input is a strong feature. Text files into an
>>> all-documents DataFrame basically. Colocation?
>>> 
>> as far as collocations i believe that the n-gram are computed and counted
>> in the CollocDriver [3] (i might be wrong her...its been a while since i
>> looked at the code...) either way, I dont think I ever looked too closely
>> and i was a bit fuzzy on this...
>> 
>> These were just some thoughts that I had when briefly looking at porting
>> seq2sparse to the DSL before.. Obviously we don't have to follow this
>> algorithm but its a nice starting point.
>> 
>> [1] https://github.com/apache/mahout/blob/master/mrlegacy/
>> src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles
>> .java
>> [2] https://github.com/apache/mahout/blob/master/mrlegacy/
>> src/main/java/org/apache/mahout/vectorizer/DocumentProcessor.java
>> [3]https://github.com/apache/mahout/blob/master/mrlegacy/
>> src/main/java/org/apache/mahout/vectorizer/collocations/llr/CollocDriver.
>> java
>> 
>> 
>> 
>>> On Feb 4, 2015, at 7:47 AM, Andrew Palumbo <ap...@outlook.com> wrote:
>>> 
>>> Just copied over the relevant last few messages to keep the other thread
>>> on topic...
>>> 
>>> 
>>> On 02/03/2015 08:22 PM, Dmitriy Lyubimov wrote:
>>> 
>>>> I'd suggest to consider this: remember all this talk about
>>>> language-integrated spark ql being basically dataframe manipulation
> DSL?
>>>> 
>>>> so now Spark devs are noticing this generality as well and are actually
>>>> proposing to rename SchemaRDD into DataFrame and make it mainstream
> data
>>>> structure. (my "told you so" moment of sorts
>>>> 
>>>> What i am getting at, i'd suggest to make DRM and Spark's newly renamed
>>>> DataFrame our two major structures. In particular, standardize on using
>>>> DataFrame for things that may include non-numerical data and require
> more
>>>> grace about column naming and manipulation. Maybe relevant to TF-IDF
> work
>>>> when it deals with non-matrix content.
>>>> 
>>> Sounds like a worthy effort to me.  We'd be basically implementing an
> API
>>> at the math-scala level for SchemaRDD/Dataframe datastructures correct?
>>> 
>>> On Tue, Feb 3, 2015 at 5:01 PM, Pat Ferrel <pa...@occamsmachete.com>
> wrote:
>>> 
>>>> Seems like seq2sparse would be really easy to replace since it takes
> text
>>>>> files to start with, then the whole pipeline could be kept in rdds.
> The
>>>>> dictionaries and counts could be either in-memory maps or rdds for use
>>>>> with
>>>>> joins? This would get rid of sequence files completely from the
>>>>> pipeline.
>>>>> Item similarity uses in-memory maps but the plan is to make it more
>>>>> scalable using joins as an alternative with the same API allowing the
>>>>> user
>>>>> to trade-off footprint for speed.
>>>>> 
>>>> I think you're right- should be relatively easy.  I've been looking at
>>> porting seq2sparse  to the DSL for bit now and the stopper at the DSL
> level
>>> is that we don't have a distributed data structure for strings..Seems
> like
>>> getting a DataFrame implemented as Dmitriy mentioned above would take
> care
>>> of this problem.
>>> 
>>> The other issue i'm a little fuzzy on  is the distributed collocation
>>> mapping-  it's a part of the seq2sparse code that I've not spent too
> much
>>> time in.
>>> 
>>> I think that this would be very worthy effort as well-  I believe
>>> seq2sparse is a particular strong mahout feature.
>>> 
>>> I'll start another thread since we're now way off topic from the
>>> refactoring proposal.
>>> 
>>> My use for TF-IDF is for row similarity and would take a DRM (actually
>>> IndexedDataset) and calculate row/doc similarities. It works now but
> only
>>> using LLR. This is OK when thinking of the items as tags or metadata but
>>> for text tokens something like cosine may be better.
>>> 
>>> I’d imagine a downsampling phase that would precede TF-IDF using LLR a
> lot
>>> like how CF preferences are downsampled. This would produce an
> sparsified
>>> all-docs DRM. Then (if the counts were saved) TF-IDF would re-weight the
>>> terms before row similarity uses cosine. This is not so good for search
>>> but
>>> should produce much better similarities than Solr’s “moreLikeThis” and
>>> does
>>> it for all pairs rather than one at a time.
>>> 
>>> In any case it can be used to do a create a personalized content-based
>>> recommender or augment a CF recommender with one more indicator type.
>>> 
>>> On Feb 3, 2015, at 3:37 PM, Andrew Palumbo <ap...@outlook.com> wrote:
>>> 
>>> 
>>> On 02/03/2015 12:44 PM, Andrew Palumbo wrote:
>>> 
>>>> On 02/03/2015 12:22 PM, Pat Ferrel wrote:
>>>> 
>>>>> Some issues WRT lower level Spark integration:
>>>>> 1) interoperability with Spark data. TF-IDF is one example I actually
>>>>> 
>>>> looked at. There may be other things we can pick up from their
> committers
>>> since they have an abundance.
>>> 
>>>> 2) wider acceptance of Mahout DSL. The DSL’s power was illustrated to
>>>>> 
>>>> me when someone on the Spark list asked about matrix transpose and an
>>> MLlib
>>> committer’s answer was something like “why would you want to do that?”.
>>> Usually you don’t actually execute the transpose but they don’t even
>>> support A’A, AA’, or A’B, which are core to what I work on. At present
> you
>>> pretty much have to choose between MLlib or Mahout for sparse matrix
>>> stuff.
>>> Maybe a half-way measure is some implicit conversions (ugh, I know). If
>>> the
>>> DSL could interchange datasets with MLlib, people would be pointed to
> the
>>> DSL for all of a bunch of “why would you want to do that?” features.
> MLlib
>>> seems to be algorithms, not math.
>>> 
>>>> 3) integration of Streaming. DStreams support most of the RDD
>>>>> 
>>>> interface. Doing a batch recalc on a moving time window would nearly
> fall
>>> out of DStream backed DRMs. This isn’t the same as incremental updates
> on
>>> streaming but it’s a start.
>>> 
>>>> Last year we were looking at Hadoop Mapreduce vs Spark, H2O, Flink
>>>>> 
>>>> faster compute engines. So we jumped. Now the need is for streaming and
>>> especially incrementally updated streaming. Seems like we need to
> address
>>> this.
>>> 
>>>> Andrew, regardless of the above having TF-IDF would be super
>>>>> 
>>>> helpful—row similarity for content/text would benefit greatly.
>>> 
>>>> I will put a PR up soon.
>>>> 
>>> Just to clarify, I'll be porting over the (very simple) TF, TFIDF
> classes
>>> and Weight interface over from mr-legacy to math-scala. They're
> available
>>> now in spark-shell but won't be after this refactoring.  These still
>>> require dictionary and a frequency count maps to vectorize incoming
> text-
>>> so they're more for use with the old MR seq2sparse and I don't think
> they
>>> can be used with Spark's HashingTF and IDF.  I'll put them up soon.
>>> Hopefully they'll be of some use.
>>> 
>>> 
>> 
> 
> 





Re: TF-IDF, seq2sparse and DataFrame support

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Ah, you are doing all the lucene analyzer, ngrams and other tokenizing, nice.

On Mar 9, 2015, at 2:07 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:

Ah, I found the right button in GitHub; no PR necessary.

On Mar 9, 2015, at 1:55 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:

If you create a PR it’s easier to see what was changed.

Wouldn’t it be better to read in files from a directory assigning doc-id = filename and term-ids = terms, or are there still Hadoop pipeline tools that are needed to create the sequence files? This sort of mimics the way Spark reads SchemaRDDs from JSON files.

BTW this can also be done with a new reader trait on the IndexedDataset. It will give you two bidirectional maps (BiMap) and a DrmLike[Int]. One BiMap gives any String <-> Int for rows, the other does the same for columns (text tokens). This would be a few lines of code since the string mapping and DRM creation is already written. The only thing to do would be to map the doc/row ids to filenames. This allows you to take the non-int doc ids out of the DRM and replace them with a map. Not based on a Spark dataframe yet but probably will be.

On Mar 9, 2015, at 11:12 AM, Gokhan Capan <gk...@gmail.com> wrote:

So, here is a sketch of a Spark implementation of seq2sparse, returning a
(matrix:DrmLike, dictionary:Map):

https://github.com/gcapan/mahout/tree/seq2sparse

Although it should be possible, I couldn't manage to make it process
non-integer document ids. Any fix would be appreciated. There is a simple
test attached, but I think there is more to do in terms of handling all
parameters of the original seq2sparse implementation.

I put it directly into the SparkEngine --- not that I think this object is
the most appropriate place for it; it just seemed convenient to me.

Best


Gokhan

On Thu, Feb 5, 2015 at 3:48 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> IndexedDataset might suffice until real DataFrames come along.
> 
> On Feb 4, 2015, at 3:42 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> 
> Dealing with dictionaries is inevitably DataFrame for seq2sparse. It is a
> byproduct of it IIRC. matrix definitely not a structure to hold those.
> 
> On Wed, Feb 4, 2015 at 9:16 AM, Andrew Palumbo <ap...@outlook.com> wrote:
> 
>> 
>> On 02/04/2015 11:13 AM, Pat Ferrel wrote:
>> 
>>> Andrew, not sure what you mean about storing strings. If you mean
>>> something like a DRM of tokens, that is a DataFrame with row=doc column
> =
>>> token. A one row DataFrame is a slightly heavy weight string/document. A
>>> DataFrame with token counts would be perfect for input TF-IDF, no? It
> would
>>> be a vector that maintains the tokens as ids for the counts, right?
>>> 
>> 
>> Yes- dataframes will be perfect for this.  The problem that i was
>> referring to was that we dont have a DSL Data Structure to to do the
>> initial distributed tokenizing of the documents[1] line:257, [2] . For
> this
>> I believe we would need something like a Distributed vector of Strings
> that
>> could be broadcast to a mapBlock closure and then tokenized from there.
>> Even there, MapBlock may not be perfect for this, but some of the new
>> Distributed functions that Gockhan is working on may.
>> 
>>> 
>>> I agree seq2sparse type input is a strong feature. Text files into an
>>> all-documents DataFrame basically. Colocation?
>>> 
>> as far as collocations i believe that the n-gram are computed and counted
>> in the CollocDriver [3] (i might be wrong her...its been a while since i
>> looked at the code...) either way, I dont think I ever looked too closely
>> and i was a bit fuzzy on this...
>> 
>> These were just some thoughts that I had when briefly looking at porting
>> seq2sparse to the DSL before.. Obviously we don't have to follow this
>> algorithm but its a nice starting point.
>> 
>> [1] https://github.com/apache/mahout/blob/master/mrlegacy/
>> src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles
>> .java
>> [2] https://github.com/apache/mahout/blob/master/mrlegacy/
>> src/main/java/org/apache/mahout/vectorizer/DocumentProcessor.java
>> [3]https://github.com/apache/mahout/blob/master/mrlegacy/
>> src/main/java/org/apache/mahout/vectorizer/collocations/llr/CollocDriver.
>> java
>> 
>> 
>> 
>>> On Feb 4, 2015, at 7:47 AM, Andrew Palumbo <ap...@outlook.com> wrote:
>>> 
>>> Just copied over the relevant last few messages to keep the other thread
>>> on topic...
>>> 
>>> 
>>> On 02/03/2015 08:22 PM, Dmitriy Lyubimov wrote:
>>> 
>>>> I'd suggest to consider this: remember all this talk about
>>>> language-integrated spark ql being basically dataframe manipulation
> DSL?
>>>> 
>>>> so now Spark devs are noticing this generality as well and are actually
>>>> proposing to rename SchemaRDD into DataFrame and make it mainstream
> data
>>>> structure. (my "told you so" moment of sorts
>>>> 
>>>> What i am getting at, i'd suggest to make DRM and Spark's newly renamed
>>>> DataFrame our two major structures. In particular, standardize on using
>>>> DataFrame for things that may include non-numerical data and require
> more
>>>> grace about column naming and manipulation. Maybe relevant to TF-IDF
> work
>>>> when it deals with non-matrix content.
>>>> 
>>> Sounds like a worthy effort to me.  We'd be basically implementing an
> API
>>> at the math-scala level for SchemaRDD/Dataframe datastructures correct?
>>> 
>>> On Tue, Feb 3, 2015 at 5:01 PM, Pat Ferrel <pa...@occamsmachete.com>
> wrote:
>>> 
>>>> Seems like seq2sparse would be really easy to replace since it takes
> text
>>>>> files to start with, then the whole pipeline could be kept in rdds.
> The
>>>>> dictionaries and counts could be either in-memory maps or rdds for use
>>>>> with
>>>>> joins? This would get rid of sequence files completely from the
>>>>> pipeline.
>>>>> Item similarity uses in-memory maps but the plan is to make it more
>>>>> scalable using joins as an alternative with the same API allowing the
>>>>> user
>>>>> to trade-off footprint for speed.
>>>>> 
>>>> I think you're right- should be relatively easy.  I've been looking at
>>> porting seq2sparse  to the DSL for bit now and the stopper at the DSL
> level
>>> is that we don't have a distributed data structure for strings..Seems
> like
>>> getting a DataFrame implemented as Dmitriy mentioned above would take
> care
>>> of this problem.
>>> 
>>> The other issue i'm a little fuzzy on  is the distributed collocation
>>> mapping-  it's a part of the seq2sparse code that I've not spent too
> much
>>> time in.
>>> 
>>> I think that this would be very worthy effort as well-  I believe
>>> seq2sparse is a particular strong mahout feature.
>>> 
>>> I'll start another thread since we're now way off topic from the
>>> refactoring proposal.
>>> 
>>> My use for TF-IDF is for row similarity and would take a DRM (actually
>>> IndexedDataset) and calculate row/doc similarities. It works now but
> only
>>> using LLR. This is OK when thinking of the items as tags or metadata but
>>> for text tokens something like cosine may be better.
>>> 
>>> I’d imagine a downsampling phase that would precede TF-IDF using LLR a
> lot
>>> like how CF preferences are downsampled. This would produce an
> sparsified
>>> all-docs DRM. Then (if the counts were saved) TF-IDF would re-weight the
>>> terms before row similarity uses cosine. This is not so good for search
>>> but
>>> should produce much better similarities than Solr’s “moreLikeThis” and
>>> does
>>> it for all pairs rather than one at a time.
>>> 
>>> In any case it can be used to do a create a personalized content-based
>>> recommender or augment a CF recommender with one more indicator type.
>>> 
>>> On Feb 3, 2015, at 3:37 PM, Andrew Palumbo <ap...@outlook.com> wrote:
>>> 
>>> 
>>> On 02/03/2015 12:44 PM, Andrew Palumbo wrote:
>>> 
>>>> On 02/03/2015 12:22 PM, Pat Ferrel wrote:
>>>> 
>>>>> Some issues WRT lower level Spark integration:
>>>>> 1) interoperability with Spark data. TF-IDF is one example I actually
>>>>> 
>>>> looked at. There may be other things we can pick up from their
> committers
>>> since they have an abundance.
>>> 
>>>> 2) wider acceptance of Mahout DSL. The DSL’s power was illustrated to
>>>>> 
>>>> me when someone on the Spark list asked about matrix transpose and an
>>> MLlib
>>> committer’s answer was something like “why would you want to do that?”.
>>> Usually you don’t actually execute the transpose but they don’t even
>>> support A’A, AA’, or A’B, which are core to what I work on. At present
> you
>>> pretty much have to choose between MLlib or Mahout for sparse matrix
>>> stuff.
>>> Maybe a half-way measure is some implicit conversions (ugh, I know). If
>>> the
>>> DSL could interchange datasets with MLlib, people would be pointed to
> the
>>> DSL for all of a bunch of “why would you want to do that?” features.
> MLlib
>>> seems to be algorithms, not math.
>>> 
>>>> 3) integration of Streaming. DStreams support most of the RDD
>>>>> 
>>>> interface. Doing a batch recalc on a moving time window would nearly
> fall
>>> out of DStream backed DRMs. This isn’t the same as incremental updates
> on
>>> streaming but it’s a start.
>>> 
>>>> Last year we were looking at Hadoop Mapreduce vs Spark, H2O, Flink
>>>>> 
>>>> faster compute engines. So we jumped. Now the need is for streaming and
>>> especially incrementally updated streaming. Seems like we need to
> address
>>> this.
>>> 
>>>> Andrew, regardless of the above having TF-IDF would be super
>>>>> 
>>>> helpful—row similarity for content/text would benefit greatly.
>>> 
>>>> I will put a PR up soon.
>>>> 
>>> Just to clarify, I'll be porting over the (very simple) TF, TFIDF
> classes
>>> and Weight interface over from mr-legacy to math-scala. They're
> available
>>> now in spark-shell but won't be after this refactoring.  These still
>>> require dictionary and a frequency count maps to vectorize incoming
> text-
>>> so they're more for use with the old MR seq2sparse and I don't think
> they
>>> can be used with Spark's HashingTF and IDF.  I'll put them up soon.
>>> Hopefully they'll be of some use.
>>> 
>>> 
>> 
> 
> 




Re: TF-IDF, seq2sparse and DataFrame support

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Ah, I found the right button in GitHub; no PR necessary.

On Mar 9, 2015, at 1:55 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:

If you create a PR it’s easier to see what was changed.

Wouldn’t it be better to read in files from a directory, assigning doc-id = filename and term-ids = terms, or are there still Hadoop pipeline tools that are needed to create the sequence files? This sort of mimics the way Spark reads SchemaRDDs from JSON files.

BTW this can also be done with a new reader trait on the IndexedDataset. It will give you two bidirectional maps (BiMap) and a DrmLike[Int]. One BiMap gives a String <-> Int mapping for rows, the other does the same for columns (text tokens). This would be a few lines of code since the string mapping and DRM creation are already written. The only thing to do would be to map the doc/row ids to filenames. This allows you to take the non-int doc ids out of the DRM and replace them with a map. It’s not based on a Spark DataFrame yet, but it probably will be.

On Mar 9, 2015, at 11:12 AM, Gokhan Capan <gk...@gmail.com> wrote:

So, here is a sketch of a Spark implementation of seq2sparse, returning a
(matrix:DrmLike, dictionary:Map):

https://github.com/gcapan/mahout/tree/seq2sparse

Although it should be possible, I couldn't manage to make it process
non-integer document ids. Any fix would be appreciated. There is a simple
test attached, but I think there is more to do in terms of handling all
parameters of the original seq2sparse implementation.

I put it directly into SparkEngine --- not that I think this object is
the most appropriate place for it; it just seemed convenient to me.

Best


Gokhan

On Thu, Feb 5, 2015 at 3:48 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> IndexedDataset might suffice until real DataFrames come along.
> 
> On Feb 4, 2015, at 3:42 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> 
> Dealing with dictionaries is inevitably DataFrame for seq2sparse. It is a
> byproduct of it IIRC. matrix definitely not a structure to hold those.
> 
> On Wed, Feb 4, 2015 at 9:16 AM, Andrew Palumbo <ap...@outlook.com> wrote:
> 
>> 
>> On 02/04/2015 11:13 AM, Pat Ferrel wrote:
>> 
>>> Andrew, not sure what you mean about storing strings. If you mean
>>> something like a DRM of tokens, that is a DataFrame with row=doc column
> =
>>> token. A one row DataFrame is a slightly heavy weight string/document. A
>>> DataFrame with token counts would be perfect for input TF-IDF, no? It
> would
>>> be a vector that maintains the tokens as ids for the counts, right?
>>> 
>> 
>> Yes- dataframes will be perfect for this.  The problem that i was
>> referring to was that we dont have a DSL Data Structure to to do the
>> initial distributed tokenizing of the documents[1] line:257, [2] . For
> this
>> I believe we would need something like a Distributed vector of Strings
> that
>> could be broadcast to a mapBlock closure and then tokenized from there.
>> Even there, MapBlock may not be perfect for this, but some of the new
>> Distributed functions that Gockhan is working on may.
>> 
>>> 
>>> I agree seq2sparse type input is a strong feature. Text files into an
>>> all-documents DataFrame basically. Colocation?
>>> 
>> as far as collocations i believe that the n-gram are computed and counted
>> in the CollocDriver [3] (i might be wrong her...its been a while since i
>> looked at the code...) either way, I dont think I ever looked too closely
>> and i was a bit fuzzy on this...
>> 
>> These were just some thoughts that I had when briefly looking at porting
>> seq2sparse to the DSL before.. Obviously we don't have to follow this
>> algorithm but its a nice starting point.
>> 
>> [1] https://github.com/apache/mahout/blob/master/mrlegacy/
>> src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles
>> .java
>> [2] https://github.com/apache/mahout/blob/master/mrlegacy/
>> src/main/java/org/apache/mahout/vectorizer/DocumentProcessor.java
>> [3]https://github.com/apache/mahout/blob/master/mrlegacy/
>> src/main/java/org/apache/mahout/vectorizer/collocations/llr/CollocDriver.
>> java
>> 
>> 
>> 
>>> On Feb 4, 2015, at 7:47 AM, Andrew Palumbo <ap...@outlook.com> wrote:
>>> 
>>> Just copied over the relevant last few messages to keep the other thread
>>> on topic...
>>> 
>>> 
>>> On 02/03/2015 08:22 PM, Dmitriy Lyubimov wrote:
>>> 
>>>> I'd suggest to consider this: remember all this talk about
>>>> language-integrated spark ql being basically dataframe manipulation
> DSL?
>>>> 
>>>> so now Spark devs are noticing this generality as well and are actually
>>>> proposing to rename SchemaRDD into DataFrame and make it mainstream
> data
>>>> structure. (my "told you so" moment of sorts
>>>> 
>>>> What i am getting at, i'd suggest to make DRM and Spark's newly renamed
>>>> DataFrame our two major structures. In particular, standardize on using
>>>> DataFrame for things that may include non-numerical data and require
> more
>>>> grace about column naming and manipulation. Maybe relevant to TF-IDF
> work
>>>> when it deals with non-matrix content.
>>>> 
>>> Sounds like a worthy effort to me.  We'd be basically implementing an
> API
>>> at the math-scala level for SchemaRDD/Dataframe datastructures correct?
>>> 
>>> On Tue, Feb 3, 2015 at 5:01 PM, Pat Ferrel <pa...@occamsmachete.com>
> wrote:
>>> 
>>>> Seems like seq2sparse would be really easy to replace since it takes
> text
>>>>> files to start with, then the whole pipeline could be kept in rdds.
> The
>>>>> dictionaries and counts could be either in-memory maps or rdds for use
>>>>> with
>>>>> joins? This would get rid of sequence files completely from the
>>>>> pipeline.
>>>>> Item similarity uses in-memory maps but the plan is to make it more
>>>>> scalable using joins as an alternative with the same API allowing the
>>>>> user
>>>>> to trade-off footprint for speed.
>>>>> 
>>>> I think you're right- should be relatively easy.  I've been looking at
>>> porting seq2sparse  to the DSL for bit now and the stopper at the DSL
> level
>>> is that we don't have a distributed data structure for strings..Seems
> like
>>> getting a DataFrame implemented as Dmitriy mentioned above would take
> care
>>> of this problem.
>>> 
>>> The other issue i'm a little fuzzy on  is the distributed collocation
>>> mapping-  it's a part of the seq2sparse code that I've not spent too
> much
>>> time in.
>>> 
>>> I think that this would be very worthy effort as well-  I believe
>>> seq2sparse is a particular strong mahout feature.
>>> 
>>> I'll start another thread since we're now way off topic from the
>>> refactoring proposal.
>>> 
>>> My use for TF-IDF is for row similarity and would take a DRM (actually
>>> IndexedDataset) and calculate row/doc similarities. It works now but
> only
>>> using LLR. This is OK when thinking of the items as tags or metadata but
>>> for text tokens something like cosine may be better.
>>> 
>>> I’d imagine a downsampling phase that would precede TF-IDF using LLR a
> lot
>>> like how CF preferences are downsampled. This would produce an
> sparsified
>>> all-docs DRM. Then (if the counts were saved) TF-IDF would re-weight the
>>> terms before row similarity uses cosine. This is not so good for search
>>> but
>>> should produce much better similarities than Solr’s “moreLikeThis” and
>>> does
>>> it for all pairs rather than one at a time.
>>> 
>>> In any case it can be used to do a create a personalized content-based
>>> recommender or augment a CF recommender with one more indicator type.
>>> 
>>> On Feb 3, 2015, at 3:37 PM, Andrew Palumbo <ap...@outlook.com> wrote:
>>> 
>>> 
>>> On 02/03/2015 12:44 PM, Andrew Palumbo wrote:
>>> 
>>>> On 02/03/2015 12:22 PM, Pat Ferrel wrote:
>>>> 
>>>>> Some issues WRT lower level Spark integration:
>>>>> 1) interoperability with Spark data. TF-IDF is one example I actually
>>>>> 
>>>> looked at. There may be other things we can pick up from their
> committers
>>> since they have an abundance.
>>> 
>>>> 2) wider acceptance of Mahout DSL. The DSL’s power was illustrated to
>>>>> 
>>>> me when someone on the Spark list asked about matrix transpose and an
>>> MLlib
>>> committer’s answer was something like “why would you want to do that?”.
>>> Usually you don’t actually execute the transpose but they don’t even
>>> support A’A, AA’, or A’B, which are core to what I work on. At present
> you
>>> pretty much have to choose between MLlib or Mahout for sparse matrix
>>> stuff.
>>> Maybe a half-way measure is some implicit conversions (ugh, I know). If
>>> the
>>> DSL could interchange datasets with MLlib, people would be pointed to
> the
>>> DSL for all of a bunch of “why would you want to do that?” features.
> MLlib
>>> seems to be algorithms, not math.
>>> 
>>>> 3) integration of Streaming. DStreams support most of the RDD
>>>>> 
>>>> interface. Doing a batch recalc on a moving time window would nearly
> fall
>>> out of DStream backed DRMs. This isn’t the same as incremental updates
> on
>>> streaming but it’s a start.
>>> 
>>>> Last year we were looking at Hadoop Mapreduce vs Spark, H2O, Flink
>>>>> 
>>>> faster compute engines. So we jumped. Now the need is for streaming and
>>> especially incrementally updated streaming. Seems like we need to
> address
>>> this.
>>> 
>>>> Andrew, regardless of the above having TF-IDF would be super
>>>>> 
>>>> helpful—row similarity for content/text would benefit greatly.
>>> 
>>>> I will put a PR up soon.
>>>> 
>>> Just to clarify, I'll be porting over the (very simple) TF, TFIDF
> classes
>>> and Weight interface over from mr-legacy to math-scala. They're
> available
>>> now in spark-shell but won't be after this refactoring.  These still
>>> require dictionary and a frequency count maps to vectorize incoming
> text-
>>> so they're more for use with the old MR seq2sparse and I don't think
> they
>>> can be used with Spark's HashingTF and IDF.  I'll put them up soon.
>>> Hopefully they'll be of some use.
>>> 
>>> 
>> 
> 
> 



Re: TF-IDF, seq2sparse and DataFrame support

Posted by Pat Ferrel <pa...@occamsmachete.com>.
If you create a PR it’s easier to see what was changed.

Wouldn’t it be better to read in files from a directory, assigning doc-id = filename and term-ids = terms, or are there still Hadoop pipeline tools that are needed to create the sequence files? This sort of mimics the way Spark reads SchemaRDDs from JSON files.

BTW this can also be done with a new reader trait on the IndexedDataset. It will give you two bidirectional maps (BiMap) and a DrmLike[Int]. One BiMap gives a String <-> Int mapping for rows, the other does the same for columns (text tokens). This would be a few lines of code since the string mapping and DRM creation are already written. The only thing to do would be to map the doc/row ids to filenames. This allows you to take the non-int doc ids out of the DRM and replace them with a map. It’s not based on a Spark DataFrame yet, but it probably will be.
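
Roughly what I have in mind, as an untested sketch using only Spark core (no IndexedDataset reader trait yet): plain Maps stand in for the BiMaps, an RDD of (row id, term counts) stands in for the DrmLike[Int], and the helper name readDocsAsMatrix is made up.

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Sketch only: doc-id = filename, term-id = position in a dictionary.
// Plain Maps stand in for BiMap; RDD[(Int, Map[Int, Double])] stands in for DrmLike[Int].
def readDocsAsMatrix(sc: SparkContext, dir: String)
  : (RDD[(Int, Map[Int, Double])], Map[String, Int], Map[String, Int]) = {

  // one (filename, full text) pair per document
  val docs: RDD[(String, String)] = sc.wholeTextFiles(dir)

  // row "BiMap": filename <-> row index
  val rowIds: Map[String, Int] = docs.keys.collect().zipWithIndex.toMap

  // column "BiMap": token <-> column index
  val tokenize = (text: String) => text.toLowerCase.split("\\W+").filter(_.nonEmpty)
  val colIds: Map[String, Int] =
    docs.flatMap { case (_, text) => tokenize(text) }.distinct().collect().zipWithIndex.toMap

  val rowB = sc.broadcast(rowIds)
  val colB = sc.broadcast(colIds)

  // sparse term-count rows, keyed by Int row ids
  val matrix = docs.map { case (name, text) =>
    (rowB.value(name), tokenize(text).groupBy(colB.value).mapValues(_.length.toDouble).toMap)
  }
  (matrix, rowIds, colIds)
}

From there the counts could be wrapped into a DRM and the two maps carried alongside, which is basically what the reader trait would formalize.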

On Mar 9, 2015, at 11:12 AM, Gokhan Capan <gk...@gmail.com> wrote:

So, here is a sketch of a Spark implementation of seq2sparse, returning a
(matrix:DrmLike, dictionary:Map):

https://github.com/gcapan/mahout/tree/seq2sparse

Although it should be possible, I couldn't manage to make it process
non-integer document ids. Any fix would be appreciated. There is a simple
test attached, but I think there is more to do in terms of handling all
parameters of the original seq2sparse implementation.

I put it directly into SparkEngine --- not that I think this object is
the most appropriate place for it; it just seemed convenient to me.

Best


Gokhan

On Thu, Feb 5, 2015 at 3:48 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> IndexedDataset might suffice until real DataFrames come along.
> 
> On Feb 4, 2015, at 3:42 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> 
> Dealing with dictionaries is inevitably DataFrame for seq2sparse. It is a
> byproduct of it IIRC. matrix definitely not a structure to hold those.
> 
> On Wed, Feb 4, 2015 at 9:16 AM, Andrew Palumbo <ap...@outlook.com> wrote:
> 
>> 
>> On 02/04/2015 11:13 AM, Pat Ferrel wrote:
>> 
>>> Andrew, not sure what you mean about storing strings. If you mean
>>> something like a DRM of tokens, that is a DataFrame with row=doc column
> =
>>> token. A one row DataFrame is a slightly heavy weight string/document. A
>>> DataFrame with token counts would be perfect for input TF-IDF, no? It
> would
>>> be a vector that maintains the tokens as ids for the counts, right?
>>> 
>> 
>> Yes- dataframes will be perfect for this.  The problem that i was
>> referring to was that we dont have a DSL Data Structure to to do the
>> initial distributed tokenizing of the documents[1] line:257, [2] . For
> this
>> I believe we would need something like a Distributed vector of Strings
> that
>> could be broadcast to a mapBlock closure and then tokenized from there.
>> Even there, MapBlock may not be perfect for this, but some of the new
>> Distributed functions that Gockhan is working on may.
>> 
>>> 
>>> I agree seq2sparse type input is a strong feature. Text files into an
>>> all-documents DataFrame basically. Colocation?
>>> 
>> as far as collocations i believe that the n-gram are computed and counted
>> in the CollocDriver [3] (i might be wrong her...its been a while since i
>> looked at the code...) either way, I dont think I ever looked too closely
>> and i was a bit fuzzy on this...
>> 
>> These were just some thoughts that I had when briefly looking at porting
>> seq2sparse to the DSL before.. Obviously we don't have to follow this
>> algorithm but its a nice starting point.
>> 
>> [1] https://github.com/apache/mahout/blob/master/mrlegacy/
>> src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles
>> .java
>> [2] https://github.com/apache/mahout/blob/master/mrlegacy/
>> src/main/java/org/apache/mahout/vectorizer/DocumentProcessor.java
>> [3]https://github.com/apache/mahout/blob/master/mrlegacy/
>> src/main/java/org/apache/mahout/vectorizer/collocations/llr/CollocDriver.
>> java
>> 
>> 
>> 
>>> On Feb 4, 2015, at 7:47 AM, Andrew Palumbo <ap...@outlook.com> wrote:
>>> 
>>> Just copied over the relevant last few messages to keep the other thread
>>> on topic...
>>> 
>>> 
>>> On 02/03/2015 08:22 PM, Dmitriy Lyubimov wrote:
>>> 
>>>> I'd suggest to consider this: remember all this talk about
>>>> language-integrated spark ql being basically dataframe manipulation
> DSL?
>>>> 
>>>> so now Spark devs are noticing this generality as well and are actually
>>>> proposing to rename SchemaRDD into DataFrame and make it mainstream
> data
>>>> structure. (my "told you so" moment of sorts
>>>> 
>>>> What i am getting at, i'd suggest to make DRM and Spark's newly renamed
>>>> DataFrame our two major structures. In particular, standardize on using
>>>> DataFrame for things that may include non-numerical data and require
> more
>>>> grace about column naming and manipulation. Maybe relevant to TF-IDF
> work
>>>> when it deals with non-matrix content.
>>>> 
>>> Sounds like a worthy effort to me.  We'd be basically implementing an
> API
>>> at the math-scala level for SchemaRDD/Dataframe datastructures correct?
>>> 
>>> On Tue, Feb 3, 2015 at 5:01 PM, Pat Ferrel <pa...@occamsmachete.com>
> wrote:
>>> 
>>>> Seems like seq2sparse would be really easy to replace since it takes
> text
>>>>> files to start with, then the whole pipeline could be kept in rdds.
> The
>>>>> dictionaries and counts could be either in-memory maps or rdds for use
>>>>> with
>>>>> joins? This would get rid of sequence files completely from the
>>>>> pipeline.
>>>>> Item similarity uses in-memory maps but the plan is to make it more
>>>>> scalable using joins as an alternative with the same API allowing the
>>>>> user
>>>>> to trade-off footprint for speed.
>>>>> 
>>>> I think you're right- should be relatively easy.  I've been looking at
>>> porting seq2sparse  to the DSL for bit now and the stopper at the DSL
> level
>>> is that we don't have a distributed data structure for strings..Seems
> like
>>> getting a DataFrame implemented as Dmitriy mentioned above would take
> care
>>> of this problem.
>>> 
>>> The other issue i'm a little fuzzy on  is the distributed collocation
>>> mapping-  it's a part of the seq2sparse code that I've not spent too
> much
>>> time in.
>>> 
>>> I think that this would be very worthy effort as well-  I believe
>>> seq2sparse is a particular strong mahout feature.
>>> 
>>> I'll start another thread since we're now way off topic from the
>>> refactoring proposal.
>>> 
>>> My use for TF-IDF is for row similarity and would take a DRM (actually
>>> IndexedDataset) and calculate row/doc similarities. It works now but
> only
>>> using LLR. This is OK when thinking of the items as tags or metadata but
>>> for text tokens something like cosine may be better.
>>> 
>>> I’d imagine a downsampling phase that would precede TF-IDF using LLR a
> lot
>>> like how CF preferences are downsampled. This would produce an
> sparsified
>>> all-docs DRM. Then (if the counts were saved) TF-IDF would re-weight the
>>> terms before row similarity uses cosine. This is not so good for search
>>> but
>>> should produce much better similarities than Solr’s “moreLikeThis” and
>>> does
>>> it for all pairs rather than one at a time.
>>> 
>>> In any case it can be used to do a create a personalized content-based
>>> recommender or augment a CF recommender with one more indicator type.
>>> 
>>> On Feb 3, 2015, at 3:37 PM, Andrew Palumbo <ap...@outlook.com> wrote:
>>> 
>>> 
>>> On 02/03/2015 12:44 PM, Andrew Palumbo wrote:
>>> 
>>>> On 02/03/2015 12:22 PM, Pat Ferrel wrote:
>>>> 
>>>>> Some issues WRT lower level Spark integration:
>>>>> 1) interoperability with Spark data. TF-IDF is one example I actually
>>>>> 
>>>> looked at. There may be other things we can pick up from their
> committers
>>> since they have an abundance.
>>> 
>>>> 2) wider acceptance of Mahout DSL. The DSL’s power was illustrated to
>>>>> 
>>>> me when someone on the Spark list asked about matrix transpose and an
>>> MLlib
>>> committer’s answer was something like “why would you want to do that?”.
>>> Usually you don’t actually execute the transpose but they don’t even
>>> support A’A, AA’, or A’B, which are core to what I work on. At present
> you
>>> pretty much have to choose between MLlib or Mahout for sparse matrix
>>> stuff.
>>> Maybe a half-way measure is some implicit conversions (ugh, I know). If
>>> the
>>> DSL could interchange datasets with MLlib, people would be pointed to
> the
>>> DSL for all of a bunch of “why would you want to do that?” features.
> MLlib
>>> seems to be algorithms, not math.
>>> 
>>>> 3) integration of Streaming. DStreams support most of the RDD
>>>>> 
>>>> interface. Doing a batch recalc on a moving time window would nearly
> fall
>>> out of DStream backed DRMs. This isn’t the same as incremental updates
> on
>>> streaming but it’s a start.
>>> 
>>>> Last year we were looking at Hadoop Mapreduce vs Spark, H2O, Flink
>>>>> 
>>>> faster compute engines. So we jumped. Now the need is for streaming and
>>> especially incrementally updated streaming. Seems like we need to
> address
>>> this.
>>> 
>>>> Andrew, regardless of the above having TF-IDF would be super
>>>>> 
>>>> helpful—row similarity for content/text would benefit greatly.
>>> 
>>>>  I will put a PR up soon.
>>>> 
>>> Just to clarify, I'll be porting over the (very simple) TF, TFIDF
> classes
>>> and Weight interface over from mr-legacy to math-scala. They're
> available
>>> now in spark-shell but won't be after this refactoring.  These still
>>> require dictionary and a frequency count maps to vectorize incoming
> text-
>>> so they're more for use with the old MR seq2sparse and I don't think
> they
>>> can be used with Spark's HashingTF and IDF.  I'll put them up soon.
>>> Hopefully they'll be of some use.
>>> 
>>> 
>> 
> 
> 


Re: TF-IDF, seq2sparse and DataFrame support

Posted by Gokhan Capan <gk...@gmail.com>.
So, here is a sketch of a Spark implementation of seq2sparse, returning a
(matrix:DrmLike, dictionary:Map):

https://github.com/gcapan/mahout/tree/seq2sparse

Although it should be possible, I couldn't manage to make it process
non-integer document ids. Any fix would be appreciated. There is a simple
test attached, but I think there is more to do in terms of handling all
parameters of the original seq2sparse implementation.
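
One direction that might work for the non-integer ids (untested sketch, not in
that branch): assign each distinct document key an Int with zipWithIndex, keep
the reverse map, and key the rows by the assigned Int.

import org.apache.spark.rdd.RDD

// Sketch: turn arbitrary (docId, tokens) pairs into Int-keyed rows plus a
// docId -> Int map, so the matrix side can stay keyed by Int.
def intKeyed(docs: RDD[(String, Seq[String])]): (RDD[(Int, Seq[String])], Map[String, Int]) = {
  val idMap: Map[String, Int] =
    docs.keys.distinct().zipWithIndex().mapValues(_.toInt).collect().toMap
  val idB = docs.sparkContext.broadcast(idMap)
  (docs.map { case (docId, tokens) => (idB.value(docId), tokens) }, idMap)
}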

I put it directly into SparkEngine --- not that I think this object is
the most appropriate place for it; it just seemed convenient to me.

Best


Gokhan

On Thu, Feb 5, 2015 at 3:48 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> IndexedDataset might suffice until real DataFrames come along.
>
> On Feb 4, 2015, at 3:42 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>
> Dealing with dictionaries is inevitably DataFrame for seq2sparse. It is a
> byproduct of it IIRC. matrix definitely not a structure to hold those.
>
> On Wed, Feb 4, 2015 at 9:16 AM, Andrew Palumbo <ap...@outlook.com> wrote:
>
> >
> > On 02/04/2015 11:13 AM, Pat Ferrel wrote:
> >
> >> Andrew, not sure what you mean about storing strings. If you mean
> >> something like a DRM of tokens, that is a DataFrame with row=doc column
> =
> >> token. A one row DataFrame is a slightly heavy weight string/document. A
> >> DataFrame with token counts would be perfect for input TF-IDF, no? It
> would
> >> be a vector that maintains the tokens as ids for the counts, right?
> >>
> >
> > Yes- dataframes will be perfect for this.  The problem that i was
> > referring to was that we dont have a DSL Data Structure to to do the
> > initial distributed tokenizing of the documents[1] line:257, [2] . For
> this
> > I believe we would need something like a Distributed vector of Strings
> that
> > could be broadcast to a mapBlock closure and then tokenized from there.
> > Even there, MapBlock may not be perfect for this, but some of the new
> > Distributed functions that Gockhan is working on may.
> >
> >>
> >> I agree seq2sparse type input is a strong feature. Text files into an
> >> all-documents DataFrame basically. Colocation?
> >>
> > as far as collocations i believe that the n-gram are computed and counted
> > in the CollocDriver [3] (i might be wrong her...its been a while since i
> > looked at the code...) either way, I dont think I ever looked too closely
> > and i was a bit fuzzy on this...
> >
> > These were just some thoughts that I had when briefly looking at porting
> > seq2sparse to the DSL before.. Obviously we don't have to follow this
> > algorithm but its a nice starting point.
> >
> > [1] https://github.com/apache/mahout/blob/master/mrlegacy/
> > src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles
> > .java
> > [2] https://github.com/apache/mahout/blob/master/mrlegacy/
> > src/main/java/org/apache/mahout/vectorizer/DocumentProcessor.java
> > [3]https://github.com/apache/mahout/blob/master/mrlegacy/
> > src/main/java/org/apache/mahout/vectorizer/collocations/llr/CollocDriver.
> > java
> >
> >
> >
> >> On Feb 4, 2015, at 7:47 AM, Andrew Palumbo <ap...@outlook.com> wrote:
> >>
> >> Just copied over the relevant last few messages to keep the other thread
> >> on topic...
> >>
> >>
> >> On 02/03/2015 08:22 PM, Dmitriy Lyubimov wrote:
> >>
> >>> I'd suggest to consider this: remember all this talk about
> >>> language-integrated spark ql being basically dataframe manipulation
> DSL?
> >>>
> >>> so now Spark devs are noticing this generality as well and are actually
> >>> proposing to rename SchemaRDD into DataFrame and make it mainstream
> data
> >>> structure. (my "told you so" moment of sorts
> >>>
> >>> What i am getting at, i'd suggest to make DRM and Spark's newly renamed
> >>> DataFrame our two major structures. In particular, standardize on using
> >>> DataFrame for things that may include non-numerical data and require
> more
> >>> grace about column naming and manipulation. Maybe relevant to TF-IDF
> work
> >>> when it deals with non-matrix content.
> >>>
> >> Sounds like a worthy effort to me.  We'd be basically implementing an
> API
> >> at the math-scala level for SchemaRDD/Dataframe datastructures correct?
> >>
> >> On Tue, Feb 3, 2015 at 5:01 PM, Pat Ferrel <pa...@occamsmachete.com>
> wrote:
> >>
> >>> Seems like seq2sparse would be really easy to replace since it takes
> text
> >>>> files to start with, then the whole pipeline could be kept in rdds.
> The
> >>>> dictionaries and counts could be either in-memory maps or rdds for use
> >>>> with
> >>>> joins? This would get rid of sequence files completely from the
> >>>> pipeline.
> >>>> Item similarity uses in-memory maps but the plan is to make it more
> >>>> scalable using joins as an alternative with the same API allowing the
> >>>> user
> >>>> to trade-off footprint for speed.
> >>>>
> >>> I think you're right- should be relatively easy.  I've been looking at
> >> porting seq2sparse  to the DSL for bit now and the stopper at the DSL
> level
> >> is that we don't have a distributed data structure for strings..Seems
> like
> >> getting a DataFrame implemented as Dmitriy mentioned above would take
> care
> >> of this problem.
> >>
> >> The other issue i'm a little fuzzy on  is the distributed collocation
> >> mapping-  it's a part of the seq2sparse code that I've not spent too
> much
> >> time in.
> >>
> >> I think that this would be very worthy effort as well-  I believe
> >> seq2sparse is a particular strong mahout feature.
> >>
> >> I'll start another thread since we're now way off topic from the
> >> refactoring proposal.
> >>
> >> My use for TF-IDF is for row similarity and would take a DRM (actually
> >> IndexedDataset) and calculate row/doc similarities. It works now but
> only
> >> using LLR. This is OK when thinking of the items as tags or metadata but
> >> for text tokens something like cosine may be better.
> >>
> >> I’d imagine a downsampling phase that would precede TF-IDF using LLR a
> lot
> >> like how CF preferences are downsampled. This would produce an
> sparsified
> >> all-docs DRM. Then (if the counts were saved) TF-IDF would re-weight the
> >> terms before row similarity uses cosine. This is not so good for search
> >> but
> >> should produce much better similarities than Solr’s “moreLikeThis” and
> >> does
> >> it for all pairs rather than one at a time.
> >>
> >> In any case it can be used to do a create a personalized content-based
> >> recommender or augment a CF recommender with one more indicator type.
> >>
> >> On Feb 3, 2015, at 3:37 PM, Andrew Palumbo <ap...@outlook.com> wrote:
> >>
> >>
> >> On 02/03/2015 12:44 PM, Andrew Palumbo wrote:
> >>
> >>> On 02/03/2015 12:22 PM, Pat Ferrel wrote:
> >>>
> >>>> Some issues WRT lower level Spark integration:
> >>>> 1) interoperability with Spark data. TF-IDF is one example I actually
> >>>>
> >>> looked at. There may be other things we can pick up from their
> committers
> >> since they have an abundance.
> >>
> >>> 2) wider acceptance of Mahout DSL. The DSL’s power was illustrated to
> >>>>
> >>> me when someone on the Spark list asked about matrix transpose and an
> >> MLlib
> >> committer’s answer was something like “why would you want to do that?”.
> >> Usually you don’t actually execute the transpose but they don’t even
> >> support A’A, AA’, or A’B, which are core to what I work on. At present
> you
> >> pretty much have to choose between MLlib or Mahout for sparse matrix
> >> stuff.
> >> Maybe a half-way measure is some implicit conversions (ugh, I know). If
> >> the
> >> DSL could interchange datasets with MLlib, people would be pointed to
> the
> >> DSL for all of a bunch of “why would you want to do that?” features.
> MLlib
> >> seems to be algorithms, not math.
> >>
> >>> 3) integration of Streaming. DStreams support most of the RDD
> >>>>
> >>> interface. Doing a batch recalc on a moving time window would nearly
> fall
> >> out of DStream backed DRMs. This isn’t the same as incremental updates
> on
> >> streaming but it’s a start.
> >>
> >>> Last year we were looking at Hadoop Mapreduce vs Spark, H2O, Flink
> >>>>
> >>> faster compute engines. So we jumped. Now the need is for streaming and
> >> especially incrementally updated streaming. Seems like we need to
> address
> >> this.
> >>
> >>> Andrew, regardless of the above having TF-IDF would be super
> >>>>
> >>> helpful—row similarity for content/text would benefit greatly.
> >>
> >>>   I will put a PR up soon.
> >>>
> >> Just to clarify, I'll be porting over the (very simple) TF, TFIDF
> classes
> >> and Weight interface over from mr-legacy to math-scala. They're
> available
> >> now in spark-shell but won't be after this refactoring.  These still
> >> require dictionary and a frequency count maps to vectorize incoming
> text-
> >> so they're more for use with the old MR seq2sparse and I don't think
> they
> >> can be used with Spark's HashingTF and IDF.  I'll put them up soon.
> >> Hopefully they'll be of some use.
> >>
> >>
> >
>
>

Re: TF-IDF, seq2sparse and DataFrame support

Posted by Pat Ferrel <pa...@occamsmachete.com>.
IndexedDataset might suffice until real DataFrames come along.

On Feb 4, 2015, at 3:42 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

Dealing with dictionaries inevitably means a DataFrame for seq2sparse; the
dictionary is a byproduct of it, IIRC. A matrix is definitely not the structure to hold those.

On Wed, Feb 4, 2015 at 9:16 AM, Andrew Palumbo <ap...@outlook.com> wrote:

> 
> On 02/04/2015 11:13 AM, Pat Ferrel wrote:
> 
>> Andrew, not sure what you mean about storing strings. If you mean
>> something like a DRM of tokens, that is a DataFrame with row=doc column =
>> token. A one row DataFrame is a slightly heavy weight string/document. A
>> DataFrame with token counts would be perfect for input TF-IDF, no? It would
>> be a vector that maintains the tokens as ids for the counts, right?
>> 
> 
> Yes- dataframes will be perfect for this.  The problem that i was
> referring to was that we dont have a DSL Data Structure to to do the
> initial distributed tokenizing of the documents[1] line:257, [2] . For this
> I believe we would need something like a Distributed vector of Strings that
> could be broadcast to a mapBlock closure and then tokenized from there.
> Even there, MapBlock may not be perfect for this, but some of the new
> Distributed functions that Gockhan is working on may.
> 
>> 
>> I agree seq2sparse type input is a strong feature. Text files into an
>> all-documents DataFrame basically. Colocation?
>> 
> as far as collocations i believe that the n-gram are computed and counted
> in the CollocDriver [3] (i might be wrong her...its been a while since i
> looked at the code...) either way, I dont think I ever looked too closely
> and i was a bit fuzzy on this...
> 
> These were just some thoughts that I had when briefly looking at porting
> seq2sparse to the DSL before.. Obviously we don't have to follow this
> algorithm but its a nice starting point.
> 
> [1] https://github.com/apache/mahout/blob/master/mrlegacy/
> src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles
> .java
> [2] https://github.com/apache/mahout/blob/master/mrlegacy/
> src/main/java/org/apache/mahout/vectorizer/DocumentProcessor.java
> [3]https://github.com/apache/mahout/blob/master/mrlegacy/
> src/main/java/org/apache/mahout/vectorizer/collocations/llr/CollocDriver.
> java
> 
> 
> 
>> On Feb 4, 2015, at 7:47 AM, Andrew Palumbo <ap...@outlook.com> wrote:
>> 
>> Just copied over the relevant last few messages to keep the other thread
>> on topic...
>> 
>> 
>> On 02/03/2015 08:22 PM, Dmitriy Lyubimov wrote:
>> 
>>> I'd suggest to consider this: remember all this talk about
>>> language-integrated spark ql being basically dataframe manipulation DSL?
>>> 
>>> so now Spark devs are noticing this generality as well and are actually
>>> proposing to rename SchemaRDD into DataFrame and make it mainstream data
>>> structure. (my "told you so" moment of sorts
>>> 
>>> What i am getting at, i'd suggest to make DRM and Spark's newly renamed
>>> DataFrame our two major structures. In particular, standardize on using
>>> DataFrame for things that may include non-numerical data and require more
>>> grace about column naming and manipulation. Maybe relevant to TF-IDF work
>>> when it deals with non-matrix content.
>>> 
>> Sounds like a worthy effort to me.  We'd be basically implementing an API
>> at the math-scala level for SchemaRDD/Dataframe datastructures correct?
>> 
>> On Tue, Feb 3, 2015 at 5:01 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:
>> 
>>> Seems like seq2sparse would be really easy to replace since it takes text
>>>> files to start with, then the whole pipeline could be kept in rdds. The
>>>> dictionaries and counts could be either in-memory maps or rdds for use
>>>> with
>>>> joins? This would get rid of sequence files completely from the
>>>> pipeline.
>>>> Item similarity uses in-memory maps but the plan is to make it more
>>>> scalable using joins as an alternative with the same API allowing the
>>>> user
>>>> to trade-off footprint for speed.
>>>> 
>>> I think you're right- should be relatively easy.  I've been looking at
>> porting seq2sparse  to the DSL for bit now and the stopper at the DSL level
>> is that we don't have a distributed data structure for strings..Seems like
>> getting a DataFrame implemented as Dmitriy mentioned above would take care
>> of this problem.
>> 
>> The other issue i'm a little fuzzy on  is the distributed collocation
>> mapping-  it's a part of the seq2sparse code that I've not spent too much
>> time in.
>> 
>> I think that this would be very worthy effort as well-  I believe
>> seq2sparse is a particular strong mahout feature.
>> 
>> I'll start another thread since we're now way off topic from the
>> refactoring proposal.
>> 
>> My use for TF-IDF is for row similarity and would take a DRM (actually
>> IndexedDataset) and calculate row/doc similarities. It works now but only
>> using LLR. This is OK when thinking of the items as tags or metadata but
>> for text tokens something like cosine may be better.
>> 
>> I’d imagine a downsampling phase that would precede TF-IDF using LLR a lot
>> like how CF preferences are downsampled. This would produce an sparsified
>> all-docs DRM. Then (if the counts were saved) TF-IDF would re-weight the
>> terms before row similarity uses cosine. This is not so good for search
>> but
>> should produce much better similarities than Solr’s “moreLikeThis” and
>> does
>> it for all pairs rather than one at a time.
>> 
>> In any case it can be used to do a create a personalized content-based
>> recommender or augment a CF recommender with one more indicator type.
>> 
>> On Feb 3, 2015, at 3:37 PM, Andrew Palumbo <ap...@outlook.com> wrote:
>> 
>> 
>> On 02/03/2015 12:44 PM, Andrew Palumbo wrote:
>> 
>>> On 02/03/2015 12:22 PM, Pat Ferrel wrote:
>>> 
>>>> Some issues WRT lower level Spark integration:
>>>> 1) interoperability with Spark data. TF-IDF is one example I actually
>>>> 
>>> looked at. There may be other things we can pick up from their committers
>> since they have an abundance.
>> 
>>> 2) wider acceptance of Mahout DSL. The DSL’s power was illustrated to
>>>> 
>>> me when someone on the Spark list asked about matrix transpose and an
>> MLlib
>> committer’s answer was something like “why would you want to do that?”.
>> Usually you don’t actually execute the transpose but they don’t even
>> support A’A, AA’, or A’B, which are core to what I work on. At present you
>> pretty much have to choose between MLlib or Mahout for sparse matrix
>> stuff.
>> Maybe a half-way measure is some implicit conversions (ugh, I know). If
>> the
>> DSL could interchange datasets with MLlib, people would be pointed to the
>> DSL for all of a bunch of “why would you want to do that?” features. MLlib
>> seems to be algorithms, not math.
>> 
>>> 3) integration of Streaming. DStreams support most of the RDD
>>>> 
>>> interface. Doing a batch recalc on a moving time window would nearly fall
>> out of DStream backed DRMs. This isn’t the same as incremental updates on
>> streaming but it’s a start.
>> 
>>> Last year we were looking at Hadoop Mapreduce vs Spark, H2O, Flink
>>>> 
>>> faster compute engines. So we jumped. Now the need is for streaming and
>> especially incrementally updated streaming. Seems like we need to address
>> this.
>> 
>>> Andrew, regardless of the above having TF-IDF would be super
>>>> 
>>> helpful—row similarity for content/text would benefit greatly.
>> 
>>>   I will put a PR up soon.
>>> 
>> Just to clarify, I'll be porting over the (very simple) TF, TFIDF classes
>> and Weight interface over from mr-legacy to math-scala. They're available
>> now in spark-shell but won't be after this refactoring.  These still
>> require dictionary and a frequency count maps to vectorize incoming text-
>> so they're more for use with the old MR seq2sparse and I don't think they
>> can be used with Spark's HashingTF and IDF.  I'll put them up soon.
>> Hopefully they'll be of some use.
>> 
>> 
> 


Re: TF-IDF, seq2sparse and DataFrame support

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Dealing with dictionaries inevitably means a DataFrame for seq2sparse; the
dictionary is a byproduct of it, IIRC. A matrix is definitely not the structure to hold those.
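
Something along these lines (rough sketch against the Spark 1.3 DataFrame API;
the Term case class and column names are made up):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext

// Sketch: keep the dictionary byproduct as a small DataFrame instead of forcing
// it into a matrix. token = the term, termId = its column index in the DRM,
// docFreq = number of documents containing the term.
case class Term(token: String, termId: Int, docFreq: Long)

def dictionaryFrame(sqlContext: SQLContext, terms: RDD[Term]) = {
  import sqlContext.implicits._
  val df = terms.toDF()
  df.registerTempTable("dictionary")   // queryable alongside the doc-term DRM
  df
}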

On Wed, Feb 4, 2015 at 9:16 AM, Andrew Palumbo <ap...@outlook.com> wrote:

>
> On 02/04/2015 11:13 AM, Pat Ferrel wrote:
>
>> Andrew, not sure what you mean about storing strings. If you mean
>> something like a DRM of tokens, that is a DataFrame with row=doc column =
>> token. A one row DataFrame is a slightly heavy weight string/document. A
>> DataFrame with token counts would be perfect for input TF-IDF, no? It would
>> be a vector that maintains the tokens as ids for the counts, right?
>>
>
> Yes- dataframes will be perfect for this.  The problem that i was
> referring to was that we dont have a DSL Data Structure to to do the
> initial distributed tokenizing of the documents[1] line:257, [2] . For this
> I believe we would need something like a Distributed vector of Strings that
> could be broadcast to a mapBlock closure and then tokenized from there.
> Even there, MapBlock may not be perfect for this, but some of the new
> Distributed functions that Gockhan is working on may.
>
>>
>> I agree seq2sparse type input is a strong feature. Text files into an
>> all-documents DataFrame basically. Colocation?
>>
> as far as collocations i believe that the n-gram are computed and counted
> in the CollocDriver [3] (i might be wrong her...its been a while since i
> looked at the code...) either way, I dont think I ever looked too closely
> and i was a bit fuzzy on this...
>
> These were just some thoughts that I had when briefly looking at porting
> seq2sparse to the DSL before.. Obviously we don't have to follow this
> algorithm but its a nice starting point.
>
> [1] https://github.com/apache/mahout/blob/master/mrlegacy/
> src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles
> .java
> [2] https://github.com/apache/mahout/blob/master/mrlegacy/
> src/main/java/org/apache/mahout/vectorizer/DocumentProcessor.java
> [3]https://github.com/apache/mahout/blob/master/mrlegacy/
> src/main/java/org/apache/mahout/vectorizer/collocations/llr/CollocDriver.
> java
>
>
>
>> On Feb 4, 2015, at 7:47 AM, Andrew Palumbo <ap...@outlook.com> wrote:
>>
>> Just copied over the relevant last few messages to keep the other thread
>> on topic...
>>
>>
>> On 02/03/2015 08:22 PM, Dmitriy Lyubimov wrote:
>>
>>> I'd suggest to consider this: remember all this talk about
>>> language-integrated spark ql being basically dataframe manipulation DSL?
>>>
>>> so now Spark devs are noticing this generality as well and are actually
>>> proposing to rename SchemaRDD into DataFrame and make it mainstream data
>>> structure. (my "told you so" moment of sorts
>>>
>>> What i am getting at, i'd suggest to make DRM and Spark's newly renamed
>>> DataFrame our two major structures. In particular, standardize on using
>>> DataFrame for things that may include non-numerical data and require more
>>> grace about column naming and manipulation. Maybe relevant to TF-IDF work
>>> when it deals with non-matrix content.
>>>
>> Sounds like a worthy effort to me.  We'd be basically implementing an API
>> at the math-scala level for SchemaRDD/Dataframe datastructures correct?
>>
>> On Tue, Feb 3, 2015 at 5:01 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:
>>
>>> Seems like seq2sparse would be really easy to replace since it takes text
>>>> files to start with, then the whole pipeline could be kept in rdds. The
>>>> dictionaries and counts could be either in-memory maps or rdds for use
>>>> with
>>>> joins? This would get rid of sequence files completely from the
>>>> pipeline.
>>>> Item similarity uses in-memory maps but the plan is to make it more
>>>> scalable using joins as an alternative with the same API allowing the
>>>> user
>>>> to trade-off footprint for speed.
>>>>
>>> I think you're right- should be relatively easy.  I've been looking at
>> porting seq2sparse  to the DSL for bit now and the stopper at the DSL level
>> is that we don't have a distributed data structure for strings..Seems like
>> getting a DataFrame implemented as Dmitriy mentioned above would take care
>> of this problem.
>>
>> The other issue i'm a little fuzzy on  is the distributed collocation
>> mapping-  it's a part of the seq2sparse code that I've not spent too much
>> time in.
>>
>> I think that this would be very worthy effort as well-  I believe
>> seq2sparse is a particular strong mahout feature.
>>
>> I'll start another thread since we're now way off topic from the
>> refactoring proposal.
>>
>> My use for TF-IDF is for row similarity and would take a DRM (actually
>> IndexedDataset) and calculate row/doc similarities. It works now but only
>> using LLR. This is OK when thinking of the items as tags or metadata but
>> for text tokens something like cosine may be better.
>>
>> I’d imagine a downsampling phase that would precede TF-IDF using LLR a lot
>> like how CF preferences are downsampled. This would produce an sparsified
>> all-docs DRM. Then (if the counts were saved) TF-IDF would re-weight the
>> terms before row similarity uses cosine. This is not so good for search
>> but
>> should produce much better similarities than Solr’s “moreLikeThis” and
>> does
>> it for all pairs rather than one at a time.
>>
>> In any case it can be used to do a create a personalized content-based
>> recommender or augment a CF recommender with one more indicator type.
>>
>> On Feb 3, 2015, at 3:37 PM, Andrew Palumbo <ap...@outlook.com> wrote:
>>
>>
>> On 02/03/2015 12:44 PM, Andrew Palumbo wrote:
>>
>>> On 02/03/2015 12:22 PM, Pat Ferrel wrote:
>>>
>>>> Some issues WRT lower level Spark integration:
>>>> 1) interoperability with Spark data. TF-IDF is one example I actually
>>>>
>>> looked at. There may be other things we can pick up from their committers
>> since they have an abundance.
>>
>>> 2) wider acceptance of Mahout DSL. The DSL’s power was illustrated to
>>>>
>>> me when someone on the Spark list asked about matrix transpose and an
>> MLlib
>> committer’s answer was something like “why would you want to do that?”.
>> Usually you don’t actually execute the transpose but they don’t even
>> support A’A, AA’, or A’B, which are core to what I work on. At present you
>> pretty much have to choose between MLlib or Mahout for sparse matrix
>> stuff.
>> Maybe a half-way measure is some implicit conversions (ugh, I know). If
>> the
>> DSL could interchange datasets with MLlib, people would be pointed to the
>> DSL for all of a bunch of “why would you want to do that?” features. MLlib
>> seems to be algorithms, not math.
>>
>>> 3) integration of Streaming. DStreams support most of the RDD
>>>>
>>> interface. Doing a batch recalc on a moving time window would nearly fall
>> out of DStream backed DRMs. This isn’t the same as incremental updates on
>> streaming but it’s a start.
>>
>>> Last year we were looking at Hadoop Mapreduce vs Spark, H2O, Flink
>>>>
>>> faster compute engines. So we jumped. Now the need is for streaming and
>> especially incrementally updated streaming. Seems like we need to address
>> this.
>>
>>> Andrew, regardless of the above having TF-IDF would be super
>>>>
>>> helpful—row similarity for content/text would benefit greatly.
>>
>>>    I will put a PR up soon.
>>>
>> Just to clarify, I'll be porting over the (very simple) TF, TFIDF classes
>> and Weight interface over from mr-legacy to math-scala. They're available
>> now in spark-shell but won't be after this refactoring.  These still
>> require dictionary and a frequency count maps to vectorize incoming text-
>> so they're more for use with the old MR seq2sparse and I don't think they
>> can be used with Spark's HashingTF and IDF.  I'll put them up soon.
>> Hopefully they'll be of some use.
>>
>>
>

Re: TF-IDF, seq2sparse and DataFrame support

Posted by Gokhan Capan <gk...@gmail.com>.
I think I have a sketch of an implementation for creating a DRM from a
sequence file of <Int, Text>s, a.k.a. seq2sparse, using Spark.

Give me a couple of days and I will provide an initial implementation.
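
The reading part would look roughly like this (untested sketch; assumes
IntWritable keys and Text values, and a trivial tokenizer in place of the Lucene
analyzer):

import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Sketch: read the <IntWritable, Text> sequence file and copy the Writables out
// immediately (Hadoop reuses the objects), leaving plain (Int docId, tokens) pairs.
def readSeq(sc: SparkContext, path: String): RDD[(Int, Seq[String])] =
  sc.sequenceFile(path, classOf[IntWritable], classOf[Text])
    .map { case (id, text) =>
      (id.get, text.toString.toLowerCase.split("\\W+").filter(_.nonEmpty).toSeq)
    }

From there the dictionary and the sparse count vectors follow the usual pattern:
collect the distinct tokens into a term -> index map, broadcast it, and count
tokens per document.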

Best

Gokhan

On Wed, Feb 4, 2015 at 7:16 PM, Andrew Palumbo <ap...@outlook.com> wrote:

>
> On 02/04/2015 11:13 AM, Pat Ferrel wrote:
>
>> Andrew, not sure what you mean about storing strings. If you mean
>> something like a DRM of tokens, that is a DataFrame with row=doc column =
>> token. A one row DataFrame is a slightly heavy weight string/document. A
>> DataFrame with token counts would be perfect for input TF-IDF, no? It would
>> be a vector that maintains the tokens as ids for the counts, right?
>>
>
> Yes- dataframes will be perfect for this.  The problem that i was
> referring to was that we dont have a DSL Data Structure to to do the
> initial distributed tokenizing of the documents[1] line:257, [2] . For this
> I believe we would need something like a Distributed vector of Strings that
> could be broadcast to a mapBlock closure and then tokenized from there.
> Even there, MapBlock may not be perfect for this, but some of the new
> Distributed functions that Gockhan is working on may.
>
>>
>> I agree seq2sparse type input is a strong feature. Text files into an
>> all-documents DataFrame basically. Colocation?
>>
> as far as collocations i believe that the n-gram are computed and counted
> in the CollocDriver [3] (i might be wrong her...its been a while since i
> looked at the code...) either way, I dont think I ever looked too closely
> and i was a bit fuzzy on this...
>
> These were just some thoughts that I had when briefly looking at porting
> seq2sparse to the DSL before.. Obviously we don't have to follow this
> algorithm but its a nice starting point.
>
> [1] https://github.com/apache/mahout/blob/master/mrlegacy/
> src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles
> .java
> [2] https://github.com/apache/mahout/blob/master/mrlegacy/
> src/main/java/org/apache/mahout/vectorizer/DocumentProcessor.java
> [3]https://github.com/apache/mahout/blob/master/mrlegacy/
> src/main/java/org/apache/mahout/vectorizer/collocations/llr/CollocDriver.
> java
>
>
>
>> On Feb 4, 2015, at 7:47 AM, Andrew Palumbo <ap...@outlook.com> wrote:
>>
>> Just copied over the relevant last few messages to keep the other thread
>> on topic...
>>
>>
>> On 02/03/2015 08:22 PM, Dmitriy Lyubimov wrote:
>>
>>> I'd suggest to consider this: remember all this talk about
>>> language-integrated spark ql being basically dataframe manipulation DSL?
>>>
>>> so now Spark devs are noticing this generality as well and are actually
>>> proposing to rename SchemaRDD into DataFrame and make it mainstream data
>>> structure. (my "told you so" moment of sorts
>>>
>>> What i am getting at, i'd suggest to make DRM and Spark's newly renamed
>>> DataFrame our two major structures. In particular, standardize on using
>>> DataFrame for things that may include non-numerical data and require more
>>> grace about column naming and manipulation. Maybe relevant to TF-IDF work
>>> when it deals with non-matrix content.
>>>
>> Sounds like a worthy effort to me.  We'd be basically implementing an API
>> at the math-scala level for SchemaRDD/Dataframe datastructures correct?
>>
>> On Tue, Feb 3, 2015 at 5:01 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:
>>
>>> Seems like seq2sparse would be really easy to replace since it takes text
>>>> files to start with, then the whole pipeline could be kept in rdds. The
>>>> dictionaries and counts could be either in-memory maps or rdds for use
>>>> with
>>>> joins? This would get rid of sequence files completely from the
>>>> pipeline.
>>>> Item similarity uses in-memory maps but the plan is to make it more
>>>> scalable using joins as an alternative with the same API allowing the
>>>> user
>>>> to trade-off footprint for speed.
>>>>
>>> I think you're right- should be relatively easy.  I've been looking at
>> porting seq2sparse  to the DSL for bit now and the stopper at the DSL level
>> is that we don't have a distributed data structure for strings..Seems like
>> getting a DataFrame implemented as Dmitriy mentioned above would take care
>> of this problem.
>>
>> The other issue i'm a little fuzzy on  is the distributed collocation
>> mapping-  it's a part of the seq2sparse code that I've not spent too much
>> time in.
>>
>> I think that this would be very worthy effort as well-  I believe
>> seq2sparse is a particular strong mahout feature.
>>
>> I'll start another thread since we're now way off topic from the
>> refactoring proposal.
>>
>> My use for TF-IDF is for row similarity and would take a DRM (actually
>> IndexedDataset) and calculate row/doc similarities. It works now but only
>> using LLR. This is OK when thinking of the items as tags or metadata but
>> for text tokens something like cosine may be better.
>>
>> I’d imagine a downsampling phase that would precede TF-IDF using LLR a lot
>> like how CF preferences are downsampled. This would produce an sparsified
>> all-docs DRM. Then (if the counts were saved) TF-IDF would re-weight the
>> terms before row similarity uses cosine. This is not so good for search
>> but
>> should produce much better similarities than Solr’s “moreLikeThis” and
>> does
>> it for all pairs rather than one at a time.
>>
>> In any case it can be used to do a create a personalized content-based
>> recommender or augment a CF recommender with one more indicator type.
>>
>> On Feb 3, 2015, at 3:37 PM, Andrew Palumbo <ap...@outlook.com> wrote:
>>
>>
>> On 02/03/2015 12:44 PM, Andrew Palumbo wrote:
>>
>>> On 02/03/2015 12:22 PM, Pat Ferrel wrote:
>>>
>>>> Some issues WRT lower level Spark integration:
>>>> 1) interoperability with Spark data. TF-IDF is one example I actually
>>>>
>>> looked at. There may be other things we can pick up from their committers
>> since they have an abundance.
>>
>>> 2) wider acceptance of Mahout DSL. The DSL’s power was illustrated to
>>>>
>>> me when someone on the Spark list asked about matrix transpose and an
>> MLlib
>> committer’s answer was something like “why would you want to do that?”.
>> Usually you don’t actually execute the transpose but they don’t even
>> support A’A, AA’, or A’B, which are core to what I work on. At present you
>> pretty much have to choose between MLlib or Mahout for sparse matrix
>> stuff.
>> Maybe a half-way measure is some implicit conversions (ugh, I know). If
>> the
>> DSL could interchange datasets with MLlib, people would be pointed to the
>> DSL for all of a bunch of “why would you want to do that?” features. MLlib
>> seems to be algorithms, not math.
>>
>>> 3) integration of Streaming. DStreams support most of the RDD
>>>>
>>> interface. Doing a batch recalc on a moving time window would nearly fall
>> out of DStream backed DRMs. This isn’t the same as incremental updates on
>> streaming but it’s a start.
>>
>>> Last year we were looking at Hadoop Mapreduce vs Spark, H2O, Flink
>>>>
>>> faster compute engines. So we jumped. Now the need is for streaming and
>> especially incrementally updated streaming. Seems like we need to address
>> this.
>>
>>> Andrew, regardless of the above having TF-IDF would be super
>>>>
>>> helpful—row similarity for content/text would benefit greatly.
>>
>>>    I will put a PR up soon.
>>>
>> Just to clarify, I'll be porting over the (very simple) TF, TFIDF classes
>> and Weight interface over from mr-legacy to math-scala. They're available
>> now in spark-shell but won't be after this refactoring.  These still
>> require dictionary and a frequency count maps to vectorize incoming text-
>> so they're more for use with the old MR seq2sparse and I don't think they
>> can be used with Spark's HashingTF and IDF.  I'll put them up soon.
>> Hopefully they'll be of some use.
>>
>>
>

Re: TF-IDF, seq2sparse and DataFrame support

Posted by Andrew Palumbo <ap...@outlook.com>.
On 02/04/2015 11:13 AM, Pat Ferrel wrote:
> Andrew, not sure what you mean about storing strings. If you mean something like a DRM of tokens, that is a DataFrame with row=doc column = token. A one row DataFrame is a slightly heavy weight string/document. A DataFrame with token counts would be perfect for input TF-IDF, no? It would be a vector that maintains the tokens as ids for the counts, right?

Yes, DataFrames will be perfect for this.  The problem that I was
referring to was that we don't have a DSL data structure to do the
initial distributed tokenizing of the documents [1] line:257, [2].  For
this I believe we would need something like a distributed vector of
Strings that could be broadcast to a mapBlock closure and then tokenized
from there.  Even then, mapBlock may not be perfect for this, but some
of the new distributed functions that Gokhan is working on may be.
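
Purely as a sketch of the kind of thing I mean (a plain RDD of strings as a
stand-in, not mapBlock, and the regex tokenizer is a made-up placeholder for the
Lucene analyzer that seq2sparse uses):

import org.apache.spark.rdd.RDD

// Stand-in for a "distributed vector of Strings": raw (docId, text) pairs in,
// (docId, tokens) pairs out, computed in parallel across the cluster.
def tokenizeDocs(raw: RDD[(String, String)]): RDD[(String, Seq[String])] =
  raw.mapValues { text =>
    text.toLowerCase.split("[^\\p{L}\\p{N}]+").filter(_.length > 1).toSeq
  }
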
>
> I agree seq2sparse type input is a strong feature. Text files into an all-documents DataFrame basically. Colocation?
As far as collocations go, I believe the n-grams are computed and
counted in the CollocDriver [3] (I might be wrong here... it's been a
while since I looked at the code).  Either way, I don't think I ever
looked too closely, and I was a bit fuzzy on this.
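
For what it's worth, the counting half of that is cheap to sketch in Spark
(bigrams only here; the LLR scoring and thresholding that CollocDriver does on
top of the counts is not shown):

import org.apache.spark.rdd.RDD

// Sketch: token sequences in, (bigram, count) pairs out. CollocDriver additionally
// scores these with LLR and keeps only significant collocations, omitted here.
def bigramCounts(docs: RDD[Seq[String]]): RDD[((String, String), Long)] =
  docs.flatMap(tokens => tokens.sliding(2).collect { case Seq(a, b) => ((a, b), 1L) })
    .reduceByKey(_ + _)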

These were just some thoughts that I had when briefly looking at porting
seq2sparse to the DSL before.  Obviously we don't have to follow this
algorithm, but it's a nice starting point.

[1] 
https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles.java
[2] 
https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/vectorizer/DocumentProcessor.java
[3]https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/vectorizer/collocations/llr/CollocDriver.java

>   
>
> On Feb 4, 2015, at 7:47 AM, Andrew Palumbo <ap...@outlook.com> wrote:
>
> Just copied over the relevant last few messages to keep the other thread on topic...
>
>
> On 02/03/2015 08:22 PM, Dmitriy Lyubimov wrote:
>> I'd suggest to consider this: remember all this talk about
>> language-integrated spark ql being basically dataframe manipulation DSL?
>>
>> so now Spark devs are noticing this generality as well and are actually
>> proposing to rename SchemaRDD into DataFrame and make it mainstream data
>> structure. (my "told you so" moment of sorts
>>
>> What i am getting at, i'd suggest to make DRM and Spark's newly renamed
>> DataFrame our two major structures. In particular, standardize on using
>> DataFrame for things that may include non-numerical data and require more
>> grace about column naming and manipulation. Maybe relevant to TF-IDF work
>> when it deals with non-matrix content.
> Sounds like a worthy effort to me.  We'd be basically implementing an API at the math-scala level for SchemaRDD/Dataframe datastructures correct?
>
> On Tue, Feb 3, 2015 at 5:01 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:
>>> Seems like seq2sparse would be really easy to replace since it takes text
>>> files to start with, then the whole pipeline could be kept in rdds. The
>>> dictionaries and counts could be either in-memory maps or rdds for use with
>>> joins? This would get rid of sequence files completely from the pipeline.
>>> Item similarity uses in-memory maps but the plan is to make it more
>>> scalable using joins as an alternative with the same API allowing the user
>>> to trade-off footprint for speed.
> I think you're right- should be relatively easy.  I've been looking at porting seq2sparse  to the DSL for bit now and the stopper at the DSL level is that we don't have a distributed data structure for strings..Seems like getting a DataFrame implemented as Dmitriy mentioned above would take care of this problem.
>
> The other issue i'm a little fuzzy on  is the distributed collocation mapping-  it's a part of the seq2sparse code that I've not spent too much time in.
>
> I think that this would be very worthy effort as well-  I believe seq2sparse is a particular strong mahout feature.
>
> I'll start another thread since we're now way off topic from the refactoring proposal.
>
> My use for TF-IDF is for row similarity and would take a DRM (actually
> IndexedDataset) and calculate row/doc similarities. It works now but only
> using LLR. This is OK when thinking of the items as tags or metadata but
> for text tokens something like cosine may be better.
>
> I’d imagine a downsampling phase that would precede TF-IDF using LLR a lot
> like how CF preferences are downsampled. This would produce an sparsified
> all-docs DRM. Then (if the counts were saved) TF-IDF would re-weight the
> terms before row similarity uses cosine. This is not so good for search but
> should produce much better similarities than Solr’s “moreLikeThis” and does
> it for all pairs rather than one at a time.
>
> In any case it can be used to do a create a personalized content-based
> recommender or augment a CF recommender with one more indicator type.
>
> On Feb 3, 2015, at 3:37 PM, Andrew Palumbo <ap...@outlook.com> wrote:
>
>
> On 02/03/2015 12:44 PM, Andrew Palumbo wrote:
>> On 02/03/2015 12:22 PM, Pat Ferrel wrote:
>>> Some issues WRT lower level Spark integration:
>>> 1) interoperability with Spark data. TF-IDF is one example I actually
> looked at. There may be other things we can pick up from their committers
> since they have an abundance.
>>> 2) wider acceptance of Mahout DSL. The DSL’s power was illustrated to
> me when someone on the Spark list asked about matrix transpose and an MLlib
> committer’s answer was something like “why would you want to do that?”.
> Usually you don’t actually execute the transpose but they don’t even
> support A’A, AA’, or A’B, which are core to what I work on. At present you
> pretty much have to choose between MLlib or Mahout for sparse matrix stuff.
> Maybe a half-way measure is some implicit conversions (ugh, I know). If the
> DSL could interchange datasets with MLlib, people would be pointed to the
> DSL for all of a bunch of “why would you want to do that?” features. MLlib
> seems to be algorithms, not math.
>>> 3) integration of Streaming. DStreams support most of the RDD
> interface. Doing a batch recalc on a moving time window would nearly fall
> out of DStream backed DRMs. This isn’t the same as incremental updates on
> streaming but it’s a start.
>>> Last year we were looking at Hadoop Mapreduce vs Spark, H2O, Flink
> faster compute engines. So we jumped. Now the need is for streaming and
> especially incrementally updated streaming. Seems like we need to address
> this.
>>> Andrew, regardless of the above having TF-IDF would be super
> helpful—row similarity for content/text would benefit greatly.
>>    I will put a PR up soon.
> Just to clarify, I'll be porting the (very simple) TF and TFIDF classes
> and the Weight interface over from mr-legacy to math-scala. They're available
> now in spark-shell but won't be after this refactoring.  These still
> require a dictionary and a frequency count map to vectorize incoming text-
> so they're more for use with the old MR seq2sparse and I don't think they
> can be used with Spark's HashingTF and IDF.  I'll put them up soon.
> Hopefully they'll be of some use.
>


Re: TF-IDF, seq2sparse and DataFrame support

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Andrew, not sure what you mean about storing strings. If you mean something like a DRM of tokens, that is a DataFrame with row = doc, column = token. A one-row DataFrame is a slightly heavyweight string/document. A DataFrame with token counts would be perfect input for TF-IDF, no? It would be a vector that maintains the tokens as ids for the counts, right?

I agree seq2sparse-type input is a strong feature. Text files into an all-documents DataFrame, basically. Collocation?


On Feb 4, 2015, at 7:47 AM, Andrew Palumbo <ap...@outlook.com> wrote:

Just copied over the relevant last few messages to keep the other thread on topic...


On 02/03/2015 08:22 PM, Dmitriy Lyubimov wrote:
> I'd suggest to consider this: remember all this talk about
> language-integrated spark ql being basically dataframe manipulation DSL?
> 
> so now Spark devs are noticing this generality as well and are actually
> proposing to rename SchemaRDD into DataFrame and make it mainstream data
> structure. (my "told you so" moment of sorts
> 
> What i am getting at, i'd suggest to make DRM and Spark's newly renamed
> DataFrame our two major structures. In particular, standardize on using
> DataFrame for things that may include non-numerical data and require more
> grace about column naming and manipulation. Maybe relevant to TF-IDF work
> when it deals with non-matrix content.
Sounds like a worthy effort to me.  We'd be basically implementing an API at the math-scala level for SchemaRDD/Dataframe datastructures correct?

On Tue, Feb 3, 2015 at 5:01 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:
>> Seems like seq2sparse would be really easy to replace since it takes text
>> files to start with, then the whole pipeline could be kept in rdds. The
>> dictionaries and counts could be either in-memory maps or rdds for use with
>> joins? This would get rid of sequence files completely from the pipeline.
>> Item similarity uses in-memory maps but the plan is to make it more
>> scalable using joins as an alternative with the same API allowing the user
>> to trade-off footprint for speed.

I think you're right- should be relatively easy.  I've been looking at porting seq2sparse  to the DSL for bit now and the stopper at the DSL level is that we don't have a distributed data structure for strings..Seems like getting a DataFrame implemented as Dmitriy mentioned above would take care of this problem.

The other issue i'm a little fuzzy on  is the distributed collocation mapping-  it's a part of the seq2sparse code that I've not spent too much time in.

I think that this would be very worthy effort as well-  I believe seq2sparse is a particular strong mahout feature.

I'll start another thread since we're now way off topic from the refactoring proposal.

My use for TF-IDF is for row similarity and would take a DRM (actually
IndexedDataset) and calculate row/doc similarities. It works now but only
using LLR. This is OK when thinking of the items as tags or metadata but
for text tokens something like cosine may be better.

I’d imagine a downsampling phase that would precede TF-IDF using LLR a lot
like how CF preferences are downsampled. This would produce an sparsified
all-docs DRM. Then (if the counts were saved) TF-IDF would re-weight the
terms before row similarity uses cosine. This is not so good for search but
should produce much better similarities than Solr’s “moreLikeThis” and does
it for all pairs rather than one at a time.

In any case it can be used to do a create a personalized content-based
recommender or augment a CF recommender with one more indicator type.

On Feb 3, 2015, at 3:37 PM, Andrew Palumbo <ap...@outlook.com> wrote:


On 02/03/2015 12:44 PM, Andrew Palumbo wrote:
> On 02/03/2015 12:22 PM, Pat Ferrel wrote:
>> Some issues WRT lower level Spark integration:
>> 1) interoperability with Spark data. TF-IDF is one example I actually
looked at. There may be other things we can pick up from their committers
since they have an abundance.
>> 2) wider acceptance of Mahout DSL. The DSL’s power was illustrated to
me when someone on the Spark list asked about matrix transpose and an MLlib
committer’s answer was something like “why would you want to do that?”.
Usually you don’t actually execute the transpose but they don’t even
support A’A, AA’, or A’B, which are core to what I work on. At present you
pretty much have to choose between MLlib or Mahout for sparse matrix stuff.
Maybe a half-way measure is some implicit conversions (ugh, I know). If the
DSL could interchange datasets with MLlib, people would be pointed to the
DSL for all of a bunch of “why would you want to do that?” features. MLlib
seems to be algorithms, not math.
>> 3) integration of Streaming. DStreams support most of the RDD
interface. Doing a batch recalc on a moving time window would nearly fall
out of DStream backed DRMs. This isn’t the same as incremental updates on
streaming but it’s a start.
>> Last year we were looking at Hadoop Mapreduce vs Spark, H2O, Flink
faster compute engines. So we jumped. Now the need is for streaming and
especially incrementally updated streaming. Seems like we need to address
this.
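
For what it's worth, the A'A, AA', and A'B products mentioned in point 2 are written directly in the math-scala DSL. A minimal sketch, assuming A and B are DRMs that have already been loaded (e.g. with drmDfsRead) and have compatible dimensions:

import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._

// Sketch only: these expressions are optimized lazily, so e.g. A'A does not
// physically materialize the transpose.
def grams(A: DrmLike[Int], B: DrmLike[Int]) = {
  val ata = A.t %*% A   // A'A
  val aat = A %*% A.t   // AA'
  val atb = A.t %*% B   // A'B
  (ata, aat, atb)
}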
>> Andrew, regardless of the above having TF-IDF would be super
helpful—row similarity for content/text would benefit greatly.
>   I will put a PR up soon.
Just to clarify, I'll be porting the (very simple) TF and TFIDF classes
and the Weight interface over from mr-legacy to math-scala. They're available
now in spark-shell but won't be after this refactoring.  These still
require a dictionary and a frequency count map to vectorize incoming text-
so they're more for use with the old MR seq2sparse and I don't think they
can be used with Spark's HashingTF and IDF.  I'll put them up soon.
Hopefully they'll be of some use.
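
To make the dictionary/frequency-map requirement above concrete, here is a rough sketch of the idea behind that kind of weighting; it is not the actual mr-legacy Weight/TF/TFIDF API, and all names below are made up for illustration:

import scala.math.log

// Sketch only: given a term -> column-index dictionary and a term ->
// document-frequency map, turn one tokenized document into a sparse
// (column index -> weight) TF-IDF representation.
def tfidfVector(tokens: Seq[String],
                dictionary: Map[String, Int],
                docFreq: Map[String, Int],
                numDocs: Int): Map[Int, Double] = {
  val tf = tokens.groupBy(identity).map { case (t, occurrences) => (t, occurrences.size) }
  tf.collect {
    case (term, freq) if dictionary.contains(term) =>
      val df  = docFreq.getOrElse(term, 1)
      val idf = log(numDocs.toDouble / df)
      dictionary(term) -> freq * idf
  }
}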