Posted to user@spark.apache.org by Debasish Das <de...@gmail.com> on 2014/09/20 21:10:25 UTC

Distributed dictionary building

Hi,

I am building a dictionary of RDD[(String, Long)] and after the dictionary
is built and cached, I find key "almonds" at value 5187 using:

rdd.filter{case(product, index) => product == "almonds"}.collect

Output:

Debug product almonds index 5187

Now I take the same dictionary and write it out as:

dictionary.map{case(product, index) => product + "," + index}
.saveAsTextFile(outputPath)

Inside the map I also print what's the product at index 5187 and I get a
different product:

Debug Index 5187 userOrProduct cardigans

Is this expected behavior from map?

By the way, "almonds" and "apparel-cardigans" are just one off in the
index.

I am using spark-1.1, but it's a snapshot build.

Thanks.
Deb

Re: Distributed dictionary building

Posted by Nan Zhu <zh...@gmail.com>.

great, thanks 

-- 
Nan Zhu


On Tuesday, September 23, 2014 at 9:58 AM, Sean Owen wrote:

> Yes, Matei made a JIRA last week and I just suggested a PR:
> https://github.com/apache/spark/pull/2508 
> On Sep 23, 2014 2:55 PM, "Nan Zhu" <zhunanmcgill@gmail.com> wrote:
> > shall we document this in the API doc? 
> > 
> > Best, 
> > 
> > -- 
> > Nan Zhu
> > 
> > 
> > On Sunday, September 21, 2014 at 12:18 PM, Debasish Das wrote:
> > 
> > > zipWithUniqueId is also affected...
> > > 
> > > I had to persist the dictionaries to make use of the indices lower down in the flow...
> > > 
> > > On Sun, Sep 21, 2014 at 1:15 AM, Sean Owen <sowen@cloudera.com> wrote:
> > > > Reference - https://issues.apache.org/jira/browse/SPARK-3098
> > > > I imagine zipWithUniqueID is also affected, but may not happen to have
> > > > exhibited in your test.
> > > > 
> > > > On Sun, Sep 21, 2014 at 2:13 AM, Debasish Das <debasish.das83@gmail.com> wrote:
> > > > > Some more debug revealed that as Sean said I have to keep the dictionaries
> > > > > persisted till I am done with the RDD manipulation.....
> > > > >
> > > > > Thanks Sean for the pointer...would it be possible to point me to the JIRA
> > > > > as well ?
> > > > >
> > > > > Are there plans to make it more transparent for the users ?
> > > > >
> > > > > Is it possible for the DAG to speculate such things...similar to branch
> > > > > prediction ideas from comp arch...
> > > > >
> > > > >
> > > > >
> > > > > On Sat, Sep 20, 2014 at 1:56 PM, Debasish Das <debasish.das83@gmail.com>
> > > > > wrote:
> > > > >>
> > > > >> I changed zipWithIndex to zipWithUniqueId and that seems to be working...
> > > > >>
> > > > >> What's the difference between zipWithIndex vs zipWithUniqueId ?
> > > > >>
> > > > >> For zipWithIndex we don't need to run the count to compute the offset
> > > > >> which is needed for zipWithUniqueId and so zipWithIndex is efficient ? It's
> > > > >> not very clear from docs...
> > > > >>
> > > > >>
> > > > >> On Sat, Sep 20, 2014 at 1:48 PM, Debasish Das <debasish.das83@gmail.com>
> > > > >> wrote:
> > > > >>>
> > > > >>> I did not persist / cache it as I assumed zipWithIndex will preserve
> > > > >>> order...
> > > > >>>
> > > > >>> There is also zipWithUniqueId...I am trying that...If that also shows the
> > > > >>> same issue, we should make it clear in the docs...
> > > > >>>
> > > > >>> On Sat, Sep 20, 2014 at 1:44 PM, Sean Owen <sowen@cloudera.com> wrote:
> > > > >>>>
> > > > >>>> From offline question - zipWithIndex is being used to assign IDs. From a
> > > > >>>> recent JIRA discussion I understand this is not deterministic within a
> > > > >>>> partition so the index can be different when the RDD is reevaluated. If you
> > > > >>>> need it fixed, persist the zipped RDD on disk or in memory.
> > > > >>>>
> > > > >>>> On Sep 20, 2014 8:10 PM, "Debasish Das" <debasish.das83@gmail.com>
> > > > >>>> wrote:
> > > > >>>>>
> > > > >>>>> Hi,
> > > > >>>>>
> > > > >>>>> I am building a dictionary of RDD[(String, Long)] and after the
> > > > >>>>> dictionary is built and cached, I find key "almonds" at value 5187 using:
> > > > >>>>>
> > > > >>>>> rdd.filter{case(product, index) => product == "almonds"}.collect
> > > > >>>>>
> > > > >>>>> Output:
> > > > >>>>>
> > > > >>>>> Debug product almonds index 5187
> > > > >>>>>
> > > > >>>>> Now I take the same dictionary and write it out as:
> > > > >>>>>
> > > > >>>>> dictionary.map{case(product, index) => product + "," + index}
> > > > >>>>> .saveAsTextFile(outputPath)
> > > > >>>>>
> > > > >>>>> Inside the map I also print what's the product at index 5187 and I get
> > > > >>>>> a different product:
> > > > >>>>>
> > > > >>>>> Debug Index 5187 userOrProduct cardigans
> > > > >>>>>
> > > > >>>>> Is this an expected behavior from map ?
> > > > >>>>>
> > > > >>>>> By the way "almonds" and "apparel-cardigans" are just one off in the
> > > > >>>>> index...
> > > > >>>>>
> > > > >>>>> I am using spark-1.1 but it's a snapshot..
> > > > >>>>>
> > > > >>>>> Thanks.
> > > > >>>>> Deb
> > > > >>>>>
> > > > >>>>>
> > > > >>>
> > > > >>
> > > > >
> > > 
> >

Re: Distributed dictionary building

Posted by Sean Owen <so...@cloudera.com>.

Yes, Matei made a JIRA last week and I just suggested a PR:
https://github.com/apache/spark/pull/2508

Re: Distributed dictionary building

Posted by Nan Zhu <zh...@gmail.com>.

Shall we document this in the API docs?

Best, 

-- 
Nan Zhu



Re: Distributed dictionary building

Posted by Debasish Das <de...@gmail.com>.

zipWithUniqueId is also affected...

I had to persist the dictionaries to be able to use the indices further
down in the flow.


Re: Distributed dictionary building

Posted by Sean Owen <so...@cloudera.com>.

Reference - https://issues.apache.org/jira/browse/SPARK-3098
I imagine zipWithUniqueId is also affected, but the problem may simply not
have shown up in your test.
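
To illustrate why zipWithUniqueId can be affected in the same way, here is a
small sketch in plain Scala (no Spark). It models the id scheme from the API
docs (item i of partition k, out of n partitions, gets id i * n + k) and shows
that the ids move when the order inside a partition changes. The object and
method names are made up for illustration, and the "second run" ordering is an
assumed example of what a reevaluation might produce, not output from a real
job.

```scala
// Pure-Scala model, no Spark needed. All names here are illustrative only.
object ReorderSketch {
  // zipWithUniqueId scheme: item i of partition k (out of n) gets i * n + k.
  def uniqueIds(parts: Seq[Seq[String]]): Map[String, Long] = {
    val n = parts.size.toLong
    parts.zipWithIndex.flatMap { case (part, k) =>
      part.zipWithIndex.map { case (item, i) => item -> (i * n + k) }
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    // Same partitions and contents; only the order inside partition 0 differs,
    // as it might between two evaluations of an unpersisted RDD (assumed).
    val firstRun  = Seq(Seq("almonds", "apparel-cardigans"), Seq("socks"))
    val secondRun = Seq(Seq("apparel-cardigans", "almonds"), Seq("socks"))
    println(uniqueIds(firstRun))
    println(uniqueIds(secondRun))
  }
}
```

In this model "almonds" gets id 0 in the first run and id 2 in the second, so
anything that stored the first id would later point at a different product.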



Re: Distributed dictionary building

Posted by Debasish Das <de...@gmail.com>.

Some more debugging revealed that, as Sean said, I have to keep the
dictionaries persisted until I am done with the RDD manipulation.

Thanks, Sean, for the pointer. Would it be possible to point me to the JIRA
as well?

Are there plans to make this more transparent for users?

Is it possible for the DAG to speculate about such things, similar to branch
prediction ideas from computer architecture?




Re: Distributed dictionary building

Posted by Debasish Das <de...@gmail.com>.

I changed zipWithIndex to zipWithUniqueId and that seems to be working.

What's the difference between zipWithIndex and zipWithUniqueId?

Is it that zipWithIndex doesn't need to run the count to compute the offset,
which zipWithUniqueId does need, and so zipWithIndex is more efficient? It's
not very clear from the docs.
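
For what it's worth, the RDD API docs describe the trade-off the other way
round: zipWithIndex has to run a Spark job to learn the partition sizes when
the RDD has more than one partition, while zipWithUniqueId does not trigger a
job but leaves gaps in the ids. A rough model of the two schemes in plain Scala
(no Spark; partitions are modelled as nested Seqs, and all names are
hypothetical):

```scala
// Pure-Scala model of the two id schemes; no Spark required.
// A "partitioned" dataset is modelled as a Seq of partitions.
object ZipIdSketch {
  type Partitions[A] = Seq[Seq[A]]

  // zipWithIndex: consecutive indices 0, 1, 2, ... across partitions.
  // Each partition needs the sizes of all earlier partitions first
  // (in Spark this is the extra job when there is more than one partition).
  def withIndex[A](parts: Partitions[A]): Partitions[(A, Long)] = {
    val offsets = parts.map(_.size.toLong).scanLeft(0L)(_ + _)
    parts.zip(offsets).map { case (part, offset) =>
      part.zipWithIndex.map { case (item, i) => (item, offset + i) }
    }
  }

  // zipWithUniqueId: item i of partition k (out of n) gets id i * n + k.
  // No sizes are needed, but the ids can have gaps.
  def withUniqueId[A](parts: Partitions[A]): Partitions[(A, Long)] = {
    val n = parts.size.toLong
    parts.zipWithIndex.map { case (part, k) =>
      part.zipWithIndex.map { case (item, i) => (item, i * n + k) }
    }
  }

  def main(args: Array[String]): Unit = {
    val parts = Seq(Seq("almonds", "apparel-cardigans", "socks"), Seq("tea"))
    println(withIndex(parts))    // ids 0, 1, 2 in partition 0 and 3 in partition 1
    println(withUniqueId(parts)) // ids 0, 2, 4 in partition 0 and 1 in partition 1
  }
}
```

Both schemes derive the id from an element's position inside its partition, so
neither is stable if that position changes between evaluations.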



Re: Distributed dictionary building

Posted by Debasish Das <de...@gmail.com>.

I did not persist / cache it, as I assumed zipWithIndex would preserve
order.

There is also zipWithUniqueId; I am trying that now. If it shows the same
issue, we should make that clear in the docs.


Re: Distributed dictionary building

Posted by Sean Owen <so...@cloudera.com>.

From an offline question: zipWithIndex is being used to assign IDs. From a
recent JIRA discussion, I understand that element order within a partition is
not deterministic, so the index can be different when the RDD is reevaluated.
If you need the indices fixed, persist the zipped RDD on disk or in memory.
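
A minimal sketch of that advice, assuming spark-core is on the classpath and a
local master is acceptable. The object name, the sample data and the choice of
MEMORY_AND_DISK are illustrative assumptions, not taken from the original job:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Hypothetical helper: assign indices once, then pin them by persisting.
object PersistedDictionary {
  def build(products: RDD[String]): RDD[(String, Long)] =
    products.distinct().zipWithIndex().persist(StorageLevel.MEMORY_AND_DISK)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("dictionary-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Sample data, made up for this sketch.
      val products = sc.parallelize(
        Seq("almonds", "apparel-cardigans", "socks", "almonds"), 2)
      val dictionary = build(products)
      dictionary.count() // materialize the persisted RDD before reusing it

      // Both actions below read the same persisted (product, index) pairs.
      val lookup = dictionary.filter { case (product, _) => product == "almonds" }.collect()
      val lines = dictionary.map { case (product, index) => product + "," + index }.collect()
      lookup.foreach { case (product, index) => println(s"Debug product $product index $index") }
      lines.foreach(println)

      dictionary.unpersist()
    } finally {
      sc.stop()
    }
  }
}
```

Note that persisting is best effort: if cached blocks are lost, Spark
recomputes them and the indices may change again, so for a long pipeline it may
be safer to write the dictionary out and read it back.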