Posted to user@spark.apache.org by jatinpreet <ja...@gmail.com> on 2014/09/18 12:46:31 UTC

New API for TFIDF generation in Spark 1.1.0

Hi,

I have been running into memory overflow issues while creating TFIDF vectors
to be used in document classification with MLlib's Naive Bayes
classification implementation.

http://chimpler.wordpress.com/2014/06/11/classifiying-documents-using-naive-bayes-on-apache-spark-mllib/

Memory overflow and GC issues occur while collecting the IDFs for all the terms.
To give an idea of scale, I am reading around 615,000 small documents (around
4 GB of text data) from HBase and running the Spark program with 8 cores and
6 GB of executor memory. I have tried increasing the parallelism level and the
shuffle memory fraction, but to no avail.

The new TFIDF generation APIs caught my eye in the latest Spark version,
1.1.0. The example given in the official documentation shows the creation of
TFIDF vectors based on the hashing trick. I want to know if it will solve the
problem described above through reduced memory consumption.

Also, the example does not show how to create labeled points for a corpus
of pre-classified document data. For example, my training input looks
something like this:

DocumentType  |  Content
-----------------------------------------------------------------
D1                   |  This is Doc1 sample.
D1                   |  This also belongs to Doc1.
D1                   |  Yet another Doc1 sample.
D2                   |  Doc2 sample.
D2                   |  Sample content for Doc2.
D3                   |  The only sample for Doc3.
D4                   |  Doc4 sample looks like this.
D4                   |  This is Doc4 sample content.

I want to create labeled points from this sample data for training. Once the
Naive Bayes model is created, I will generate TFIDF vectors for the test
documents and predict their document type.

If the new API can solve my issue, how can I generate labeled points with it?
An example would be great.

Also, I have a special requirement of ignoring terms that occur in fewer than
two documents. This has important implications for accuracy in my use case
and needs to be accommodated while generating the TFIDF vectors.

Thanks,
Jatin



-----
Novice Big Data Programmer


Re: New API for TFIDF generation in Spark 1.1.0

Posted by nilesh <ni...@yahoo.com>.
Did some digging in the documentation. It looks like IDFModel.transform only
accepts an RDD as input, not individual elements. Is this a bug? I am asking
because HashingTF.transform accepts both an RDD and an individual document as
its input.

From your post replying to Jatin, it looks like you are using
IDFModel.transform with individual elements passed in through a map function.
That is not supported according to the Spark documentation, and it fails in my
prototyping as well.
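
To make the asymmetry concrete, here is a minimal sketch against the 1.1.0 API
as I read it (the toy docs RDD below is just for illustration):

import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Toy input, only to show which calls compile.
val docs: RDD[Seq[String]] = sc.parallelize(Seq(Seq("spark", "tfidf"), Seq("hashing", "trick")))

val tf = new HashingTF()
val single: Vector = tf.transform(Seq("spark", "tfidf"))   // per-document overload: compiles
val tfVectors: RDD[Vector] = tf.transform(docs)            // RDD overload: compiles
val idfModel = new IDF().fit(tfVectors)
val rescaled: RDD[Vector] = idfModel.transform(tfVectors)  // RDD[Vector] overload: compiles
// idfModel.transform(single)                               // no single-Vector overload in 1.1.0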

Help?





Re: New API for TFIDF generation in Spark 1.1.0

Posted by nilesh <ni...@yahoo.com>.
Hi Xiangrui,

I am trying to implement the TFIDF pipeline as per the instructions you sent
in your response to Jatin. I am getting an error at the IDF step. Here are my
steps; they run fine up to the last line, where compilation fails.

import org.apache.spark.SparkContext._   // needed for .values on pair RDDs in Spark 1.1
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.regression.LabeledPoint

val labeledDocs = sc.textFile("title_subcategory")

val stopwords = scala.io.Source.fromFile("stopwords.txt").getLines().toList

// column 2 holds the numeric label, column 1 the raw text
val labeledTerms = labeledDocs.map(_.split('\t')).map(x =>
  (x(2).toDouble, x(1).split(' ').map(_.toLowerCase).filter(!stopwords.contains(_)).toSeq))

val tf = new HashingTF()

val freqs = labeledTerms.map(x => (x._1, tf.transform(x._2)))

val idf = new IDF()

val idfModel = idf.fit(freqs.values)

val vectors = freqs.map(x => LabeledPoint(x._1, idfModel.transform(x._2)))

This is where it fails with the following error:

NBContentSubcategory.scala:39: error: overloaded method value transform with
alternatives:
  (dataset:
org.apache.spark.api.java.JavaRDD[org.apache.spark.mllib.linalg.Vector])org.apache.spark.api.java.JavaRDD[org.apache.spark.mllib.linalg.Vector]
<and>
  (dataset:
org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector])org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]
 cannot be applied to (org.apache.spark.mllib.linalg.Vector)
        val transformedValues = idfModel.transform(values)

It seems the compiler cannot resolve the call against the overloaded (Java and
Scala) transform methods, since neither alternative accepts a single Vector.
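
A workaround that seems to compile for me (sketch only, not verified at
scale): rescale the whole RDD of term-frequency vectors with the RDD overload
and zip the result back with the labels, instead of transforming one vector at
a time inside map:

// Rescale all TF vectors at once, then reattach the labels.
// zip should be safe here because transform is a one-to-one mapping,
// so both sides keep the same number of elements per partition.
val tfidf = idfModel.transform(freqs.values)
val vectors = freqs.keys.zip(tfidf).map { case (label, v) => LabeledPoint(label, v) }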

Any insights?

Thanks,
Nilesh





Re: New API for TFIDF generation in Spark 1.1.0

Posted by jatinpreet <ja...@gmail.com>.
Thanks Xiangrui and RJ for the responses.

RJ, I have created a JIRA for this. It would be great if you could look into
it. Here is the link to the improvement task:
https://issues.apache.org/jira/browse/SPARK-3614

Let me know if I can be of any help and please keep me posted!

Thanks,
Jatin



-----
Novice Big Data Programmer


Re: New API for TFIDF generation in Spark 1.1.0

Posted by RJ Nowling <rn...@gmail.com>.
Jatin,

If you file the JIRA and don't want to work on it, I'd be happy to step in
and take a stab at it.

RJ



-- 
em rnowling@gmail.com
c 954.496.2314

Re: New API for TFIDF generation in Spark 1.1.0

Posted by Xiangrui Meng <me...@gmail.com>.
Hi Jatin,

HashingTF should be able to solve the memory problem if you use a small
feature dimension. Please do not cache the input documents; cache the output
of HashingTF and IDF instead. We don't have a label indexer yet, so you need a
label-to-index map to convert the document types to double values, e.g.,
D1 -> 0.0, D2 -> 1.0, etc. Assuming the input is an
RDD[(label: String, doc: Seq[String])], the code should look like the
following:

val docTypeToLabel = Map("D1" -> 0.0, ...)
val tf = new HashingTF();
val freqs = input.map(x => (docTypeToLabel(x._1), tf.transform(x._2))).cache()
val idf = new IDF()
val idfModel = idf.fit(freqs.values)
val vectors = freqs.map(x => LabeledPoint(x._1, idfModel.transform(x._2)))
val nbModel = NaiveBayes.train(vectors)
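
For the prediction side of the question, a rough sketch along the same lines
(testDocs below is a placeholder for your tokenized test documents, not part
of the snippet above): hash the test documents with the same HashingTF,
rescale them with the fitted IDFModel, and call predict:

// Sketch: classify unseen documents with the trained model.
// testDocs: RDD[Seq[String]] is assumed to hold the tokenized test documents.
val labelToDocType = docTypeToLabel.map(_.swap)
val testTfidf = idfModel.transform(testDocs.map(doc => tf.transform(doc)))
val predictedTypes = nbModel.predict(testTfidf).map(labelToDocType)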

IDF doesn't currently provide a filter on minimum document frequency, but it
would be a nice option to add. Please create a JIRA and someone may work on it.
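
In the meantime, one possible manual pre-filter (sketch only, reusing the
input RDD from above): count document frequencies yourself and drop terms that
appear in fewer than two documents before hashing:

// Sketch: remove terms that occur in fewer than 2 documents before HashingTF.
// Note: collecting the rare-term set to the driver may be expensive for a
// very large vocabulary; a join-based filter would avoid that.
val docFreqs = input.flatMap { case (_, doc) => doc.distinct }.map(term => (term, 1)).reduceByKey(_ + _)
val rareTerms = sc.broadcast(docFreqs.filter(_._2 < 2).keys.collect().toSet)
val filteredInput = input.mapValues(doc => doc.filterNot(rareTerms.value))
// Then feed filteredInput into HashingTF and IDF exactly as above.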

Best,
Xiangrui

