Posted to user@spark.apache.org by Nan Zhu <zh...@gmail.com> on 2014/01/06 18:17:56 UTC
shared variable and ALS in mllib
Hi, all
I have a question about sharing a variable among tasks; it seems that neither a broadcast variable nor an accumulator solves my problem.
My dataset is a set of text files named 1.txt through 20000.txt.
Each file holds the users' ratings for one product. The product ID is given on the first line of the file ("1:" ... "20000:"),
and the following lines are ratings in the form "userid, rating".
I want to parse the input files with Spark and pass the result to the ALS implementation in MLlib.
ALS requires an RDD of Rating objects, where a Rating is a 3-tuple (user, product, rating).
My problem is that a file can be split across partitions, so a task that gets a later partition of a file never sees the first line (such as "1:") and cannot tell which product its ratings belong to.
How can I solve this, other than writing a script that transforms the files by adding the product ID to each line?
Best,
--
Nan Zhu
Re: shared variable and ALS in mllib
Posted by Nan Zhu <zh...@gmail.com>.
Thanks Jason. Yes, that's true, but how do I do the first step, assigning each file to its own partition?
It seems that sc.textFile() has no parameter for this.
The files are stored on S3.
Best,
--
Nan Zhu
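One way to keep each file's first line together with its ratings is to read every file as a single record and parse it in one pass. The sketch below shows only the parsing step in plain Scala, so it runs without Spark. The object and method names are hypothetical, and the file layout (a "1:" style header followed by "userid, rating" lines) is assumed from the description in this thread.

```scala
// Hypothetical helper: the name and signature are illustrative only.
object ProductFileParser {
  // Turns the full text of one product file into (user, product, rating) triples.
  def parseProductFile(content: String): Seq[(Int, Int, Double)] = {
    // Trim each line and drop blank ones, so a trailing newline adds no record.
    val lines = content.split("\n").map(_.trim).filter(_.nonEmpty)
    if (lines.isEmpty) Seq.empty
    else {
      // The first line carries the product ID, e.g. "1:".
      val product = lines.head.stripSuffix(":").toInt
      // Every remaining line is assumed to be "userid, rating".
      lines.tail.toSeq.map { line =>
        val Array(user, rating) = line.split(",").map(_.trim)
        (user.toInt, product, rating.toDouble)
      }
    }
  }
}
```

In Spark 1.0 and later (newer than the release current at the time of this thread), sc.wholeTextFiles(path) returns an RDD of (file name, file content) pairs, so something like sc.wholeTextFiles(path).flatMap { case (_, content) => ProductFileParser.parseProductFile(content) } should yield the triples, which can then be mapped to Rating objects. This assumes each file is small enough to be held in memory as one string.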
Re: shared variable and ALS in mllib
Posted by Jason Dai <ja...@gmail.com>.
If you assign each file to a standalone partition, then you can generate
the Rating RDD using something like the following:
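files.mapPartitions { part =>
  // The first line of each file holds the product ID, e.g. "1:"
  val product = part.next().trim.stripSuffix(":").toInt
  part.map { line =>
    val Array(user, rating) = line.split(",").map(_.trim)
    (user.toInt, product, rating.toDouble)
  }
}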
Thanks,
-Jason