Posted to user@spark.apache.org by Old-School <gi...@outlook.com> on 2017/04/03 12:05:18 UTC

Read file and represent rows as Vectors

I have a dataset that contains DocID, WordID and frequency (count), as shown
below. Note that the first three numbers represent (1) the number of
documents, (2) the number of words in the vocabulary, and (3) the total
number of words in the collection.

189
1430
12300
1 2 1
1 39 1
1 42 3
1 77 1
1 95 1
1 96 1
2 105 1
2 108 1
3 133 3


What I want to do is to read the data (ignoring the first three lines),
combine the entries per document, and finally represent each document as a
vector that contains the frequency of each WordID.

Based on the above dataset, the representation of documents 1, 2 and 3 will
be as follows (note that vocab_size can be extracted from the second line of
the data):

val data = Array(
  Vectors.sparse(vocab_size, Seq((2, 1.0), (39, 1.0), (42, 3.0), (77, 1.0),
    (95, 1.0), (96, 1.0))),
  Vectors.sparse(vocab_size, Seq((105, 1.0), (108, 1.0))),
  Vectors.sparse(vocab_size, Seq((133, 3.0))))


The problem is that I am not quite sure how to read the .txt.gz file as an
RDD and create an Array of sparse vectors as described above. Please note
that I actually want to pass the data array to the PCA transformer.
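
To make the question concrete, here is the rough pipeline I have in mind,
as an untested sketch (the file path is a placeholder, and I am assuming
the RDD-based spark.mllib API):

import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Untested sketch; "docword.txt.gz" is a placeholder path.
// textFile decompresses .gz files transparently.
val lines = sc.textFile("docword.txt.gz")
val vocab_size = lines.take(3)(1).trim.toInt     // second header line

val data: Array[Vector] = lines
  .zipWithIndex()
  .filter { case (_, idx) => idx >= 3 }          // skip the three header lines
  .map { case (line, _) =>
    val Array(docId, wordId, count) = line.trim.split("\\s+")
    // If WordIDs are 1-based, (wordId.toInt - 1) may be needed here.
    (docId.toInt, (wordId.toInt, count.toDouble))
  }
  .groupByKey()                                  // combine the words per document
  .sortByKey()                                   // keep documents in DocID order
  .map { case (_, pairs) => Vectors.sparse(vocab_size, pairs.toSeq) }
  .collect()

Does that look like the right direction? Since mllib's PCA.fit takes an
RDD[Vector], I suppose I could also drop the final collect() and feed the
RDD straight into new PCA(k).fit(...).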





Re: Read file and represent rows as Vectors

Posted by "颜发才 (Yan Facai)" <fa...@gmail.com>.
You can try the `mapPartitions` method.

There is an example here:
http://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html#mapPartitions
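
For skipping the three header lines in this particular dataset,
`mapPartitionsWithIndex` may be even handier. A rough, untested sketch
(placeholder path; it assumes the header lines all land in partition 0,
which holds here because a single gzip file is read as one partition):

val lines = sc.textFile("docword.txt.gz")   // placeholder path
// Drop the three metadata lines at the start of the first partition.
val body = lines.mapPartitionsWithIndex { (idx, iter) =>
  if (idx == 0) iter.drop(3) else iter
}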

On Mon, Apr 3, 2017 at 8:05 PM, Old-School <gi...@outlook.com>
wrote:

> [snip]

Re: Read file and represent rows as Vectors

Posted by Paul Tremblay <pa...@gmail.com>.
So if I am understanding your problem, you have the data in CSV files, but
the CSV files are gzipped? If so, Spark can read a gzipped file directly.
Sorry if I didn't understand your question.
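
For example (a minimal sketch; the path is a placeholder):

// textFile picks the gzip codec from the .gz extension automatically.
// Note that a gzip file is not splittable, so it is read as a single
// partition; repartition afterwards if you need more parallelism.
val lines = sc.textFile("/path/to/docword.txt.gz")
lines.take(5).foreach(println)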

Henry

On Mon, Apr 3, 2017 at 5:05 AM, Old-School <gi...@outlook.com>
wrote:

> [snip]


-- 
Paul Henry Tremblay
Robert Half Technology