Posted to user@spark.apache.org by rishikesh <ri...@hotmail.com> on 2015/07/04 13:04:00 UTC

Feature Generation On Spark

Hi

I am new to Spark and am working on document classification. Before model
fitting, I need to do feature generation: each document is to be converted to
a feature vector. However, I am not sure how to do that. While testing
locally, I have a static list of tokens, and when I parse a file I look up
each token and increment its counter.
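
For concreteness, that local lookup-and-increment approach might look roughly
like this (a sketch only; the vocabulary and tokenizer are illustrative
placeholders):

val vocabulary = Seq("spark", "feature", "vector")   // hypothetical static token list
val index = vocabulary.zipWithIndex.toMap            // token -> position in the vector

def featurize(document: String): Array[Double] = {
  val counts = Array.fill(vocabulary.size)(0.0)
  for (token <- document.toLowerCase.split("\\s+"))
    index.get(token).foreach(i => counts(i) += 1.0)  // increment counter on a lookup hit
  counts
}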

In the case of Spark, I can create an RDD which loads all the documents;
however, I am not sure whether one file goes to one executor or is split
across multiple executors. If a file is split, then its partial feature
vectors need to be merged, but I am not able to figure out how to do that.

Thanks
Rishi



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Feature-Generation-On-Spark-tp23617.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.



RE: Feature Generation On Spark

Posted by Mohammed Guller <mo...@glassbeam.com>.
Try this (replace ... with the appropriate values for your environment):

import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector

val sc = new SparkContext(...)
// wholeTextFiles (note the plural) keeps each file as one (path, contents) record
val documents = sc.wholeTextFiles(...)
// naive whitespace tokenization; substitute a real tokenizer if needed
val tokenized = documents.map { case (path, document) => (path, document.split("\\s+")) }
val numFeatures = 100000
val hashingTF = new HashingTF(numFeatures)
// hash each token array into a fixed-length term-frequency vector
val featurized = tokenized.map { case (path, words) => (path, hashingTF.transform(words)) }
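
If the next step is model fitting, a possible continuation is sketched below;
the labelFor helper is a hypothetical placeholder for however class labels are
derived in your setup:

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.classification.NaiveBayes

// Hypothetical helper: derive a numeric class label from the file path.
def labelFor(path: String): Double = if (path.contains("spam")) 1.0 else 0.0

// Attach a label to each feature vector and train an MLlib classifier.
val labeled = featurized.map { case (path, vector) => LabeledPoint(labelFor(path), vector) }
val model = NaiveBayes.train(labeled)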


Mohammed

From: rishikesh thakur [mailto:rishikeshthakur@hotmail.com]
Sent: Friday, July 17, 2015 12:33 AM
To: Mohammed Guller
Subject: Re: Feature Generation On Spark


Thanks, I did look at the examples. I am using Spark 1.2, and I guess the modules mentioned there are not in 1.2; the import is failing.


Rishi


RE: Feature Generation On Spark

Posted by Mohammed Guller <mo...@glassbeam.com>.
Take a look at the examples here:
https://spark.apache.org/docs/latest/ml-guide.html

Mohammed


RE: Feature Generation On Spark

Posted by rishikesh thakur <ri...@hotmail.com>.
I have one document per file and each file is to be converted to a feature vector. Pretty much like standard feature construction for document classification.
Thanks
Rishi


Re: Feature Generation On Spark

Posted by ayan guha <gu...@gmail.com>.
Do you have one document per file, or multiple documents in a file?

RE: Feature Generation On Spark

Posted by rishikesh thakur <ri...@hotmail.com>.
Hi
Thanks, I guess this will solve my problem. I will load multiple files using wildcards, like *.csv. If I use wholeTextFiles instead of textFile, I will get the whole file contents as the value, which will in turn ensure one feature vector per file.

Thanks
Nitin
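
A minimal sketch of that idea, where the path and glob are hypothetical
placeholders:

val docs = sc.wholeTextFiles("data/docs/*.csv")
// Each record is (filePath, entireFileContents): one record -- and therefore
// one feature vector -- per file, no matter how the RDD is partitioned.
println(docs.count())   // should equal the number of matched files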

Re: Feature Generation On Spark

Posted by Michal Čizmazia <mi...@gmail.com>.
SparkContext has a method wholeTextFiles. Is that what you need?
