Posted to dev@mahout.apache.org by Pat Ferrel <pa...@occamsmachete.com> on 2015/05/19 17:21:29 UTC

Beyond Spark 1.1.1

We need to move to Spark 1.3 ASAP and set the stage for going beyond 1.3. The primary reason is that the big distros are there already or will be very soon. Many people using Mahout have their environment dictated by support orgs in their companies, so our current position of running only on Spark 1.1.1 puts many potential users out of luck.

Here are the problems I know of in moving Mahout ahead on Spark:
1) Guava in any backend code (executor closures) relies on being serialized with JavaSerializer, which is broken and hasn't been fixed in 1.2+. There is a workaround, which involves shipping a Guava jar to all Spark workers, but that is unacceptable in many cases. In the Spark 1.2 PR, Guava has been removed from the Scala code; that will probably be pushed to master this week. That leaves a bunch of uses of Guava in the Java mahout-math and mahout-hdfs modules. Andrew has (I think) removed the Preconditions and replaced them with asserts, but there remain some uses of Map and AbstractIterator from Guava. Not sure how many of these remain, but if anyone can help, please check here: https://issues.apache.org/jira/browse/MAHOUT-1708
2) The Mahout shell relies on APIs that are no longer available in Spark 1.3.
3) The API for writing to sequence files now requires implicit values that are not available in the current code. I think Andy did a temporary fix to write to object files instead, but that is probably not what we want to release.

I for one would dearly love to see Mahout 0.10.1 support Spark 1.3+, and soon. This is a call for help in cleaning these things up. Even with no new features, the fixes above would make Mahout much more usable in current environments.

Re: Beyond Spark 1.1.1

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Running on Spark 1.2 and 1.3 may not require that we remove Guava from the Java modules. It would be nice to pass no extra jars to the executors, but passing Guava is not what makes runs fail. The failure was in using JavaSerializer on a Guava class, and that use has been removed. From what little I have tested of the recent master, it now seems to run on Spark 1.3.

If this holds true, we have time to clean Guava out of mahout-math and mahout-hdfs.
But the following changes are still important:
1) The Spark sequence file I/O API has changed.
2) The shell doesn't run on 1.3.


Re: Beyond Spark 1.1.1

Posted by Pat Ferrel <pa...@occamsmachete.com>.
> These rules do not apply to the Java modules, of course.

So you are correct, but we do use the Scala constructs _in Scala_.

Re: Beyond Spark 1.1.1

Posted by Suneel Marthi <sm...@apache.org>.
Ok, I was talking about Java asserts. Fine then, go with it.

Re: Beyond Spark 1.1.1

Posted by Pat Ferrel <pa...@occamsmachete.com>.
BTW Scala assert, require, etc. are quite a different thing than Java assert. They do not use the Java assertion framework and _are_ indeed useful in production code for many of the reasons Preconditions were used. Scala provides several methods to check invariants and API contracts. They throw different exceptions, and some _can_ be disabled (elided at compile time), though this is controversial. They are peppered throughout the DSL code and afaik are not meant to be disabled. Think of them as a replacement for Preconditions.

These rules do not apply to the Java modules, of course.
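For contrast, here is a minimal sketch of the Java side (a hypothetical example, not Mahout code): a plain Java assert only fires when the JVM is started with -ea, so with default flags the check disappears entirely.

    public class JavaAssertSketch {

      // With default JVM flags this assert is skipped entirely; it only
      // throws AssertionError when the JVM is started with -ea.
      static int half(int n) {
        assert n % 2 == 0 : "expected an even number, got " + n;
        return n / 2;
      }

      public static void main(String[] args) {
        // Prints 3 with default flags; throws AssertionError under -ea.
        System.out.println(half(7));
      }
    }

That is why a straight Preconditions -> assert swap changes behavior: the checks silently vanish in a default production JVM.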

Re: Beyond Spark 1.1.1

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Are all those classes really needed for Scala/Spark? Seems like we should prune out non-dependencies if possible before we start changing the code. There are probably a lot of things that could be used with Mahout-Samsara but aren't explicitly in it. Do we lose much by moving those to another module?

Re: Beyond Spark 1.1.1

Posted by Andrew Musselman <an...@gmail.com>.
Might not be terrible; I didn't look too hard, but there are 97 instances of "com.google.common" in mahout-math and 4 in mahout-hdfs.

Re: Beyond Spark 1.1.1

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
PS assuming we clean mahout-math and the Scala modules -- this should be fairly easy. Maybe there's some stuff in the Colt classes, but there shouldn't be a lot?


Re: Beyond Spark 1.1.1

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Can't we just declare its own Guava dependency for mahout-mr? Or inherit it from wherever it is declared in the Hadoop we depend on?

Re: Beyond Spark 1.1.1

Posted by Pat Ferrel <pa...@occamsmachete.com>.
I was hoping someone knew the differences. Andrew and I are feeling our way along since we haven’t used either to any extent.

Re: Beyond Spark 1.1.1

Posted by Suneel Marthi <sm...@apache.org>.
Ok, I see your point if it's only for mahout-math and mahout-hdfs. Not sure it's just a straight replacement of Preconditions -> asserts though. Preconditions throw an exception if some condition is not satisfied; Java asserts are disabled by default and are never meant to be used in production code.

So the right fix would be to replace all references to Preconditions with some exception-handling boilerplate.
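For illustration, a minimal sketch of that boilerplate (hypothetical method names, not actual Mahout code), showing a Guava check and its plain-Java replacement side by side:

    import com.google.common.base.Preconditions;

    public class PreconditionsSketch {

      // Current Guava style: throws IllegalArgumentException on a bad
      // argument, whether or not JVM assertions are enabled.
      static double guavaStyle(double x) {
        Preconditions.checkArgument(x >= 0, "x must be non-negative: %s", x);
        return Math.sqrt(x);
      }

      // The Guava-free replacement: the same runtime behavior spelled out.
      static double plainJavaStyle(double x) {
        if (x < 0) {
          throw new IllegalArgumentException("x must be non-negative: " + x);
        }
        return Math.sqrt(x);
      }
    }

A bare assert is not equivalent, since it is skipped unless the JVM runs with -ea.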

Re: Beyond Spark 1.1.1

Posted by Pat Ferrel <pa...@occamsmachete.com>.
We only have to worry about mahout-math and mahout-hdfs.

Yes, Andrew was working on those; the Preconditions were replaced with plain Java asserts.

The uses you mention still remain in those two modules, but I see no good alternative to hacking them out. Maybe we can move some code out to mahout-mr if it's easier.

Re: Beyond Spark 1.1.1

Posted by Andrew Musselman <an...@gmail.com>.
I only looked at replacing Preconditions with asserts and found a bunch of other stuff from the Google common package, so I held off.

Re: Beyond Spark 1.1.1

Posted by Suneel Marthi <sm...@apache.org>.
I had tried minimizing the Guava dependency to a large extent in the run-up to 0.10.0. It's not as trivial as it seems: there are parts of the code (Collocations, lucene2seq, Lucene TokenStream processing, and tokenization code) that rely heavily on AbstractIterator, and there are sections of the code that convert a HashSet to a List (again, you have to use Guava for that if you want to avoid writing boilerplate to do the same).

Moreover, things that return something like Iterable<?> and need to be converted into a regular collection can easily be handled using Guava without writing our own boilerplate again.

Are we replacing all Preconditions with straight asserts now??
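For reference, a minimal sketch of the JDK-only equivalents being weighed against Guava here (helper names are hypothetical):

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class GuavaFreeCollections {

      // Instead of Guava's Lists.newArrayList(set): the JDK copy constructor.
      static <T> List<T> setToList(Set<T> set) {
        return new ArrayList<T>(set);
      }

      // Instead of Lists.newArrayList(iterable): a short loop that
      // materializes any Iterable into a regular collection.
      static <T> List<T> materialize(Iterable<T> items) {
        List<T> out = new ArrayList<T>();
        for (T item : items) {
          out.add(item);
        }
        return out;
      }

      public static void main(String[] args) {
        Set<String> s = new HashSet<String>();
        s.add("a");
        s.add("b");
        System.out.println(setToList(s));
        System.out.println(materialize(s));
      }
    }

AbstractIterator is the harder one; its lazy computeNext/hasNext pattern has no one-line JDK substitute, which is what makes those spots non-trivial.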

