Posted to dev@fluo.apache.org by Josh Elser <jo...@gmail.com> on 2016/10/24 00:05:23 UTC

Fwd: Hadoop Weekly #191

Congrats, the 1.0.0-incubating release was picked up by Hadoop Weekly :)
---------- Forwarded message ----------
From: "Hadoop Weekly" <in...@hadoopweekly.com>
Date: Oct 23, 2016 19:21
Subject: Hadoop Weekly #191
To: <jo...@gmail.com>
Cc:

Hadoop Weekly
> Issue #191
> 23 October 2016
>
> This week's issue is short and sweet with a few technical posts, two
> interesting news articles, and several exciting releases (including Apache
> Kafka 0.10.1.0). With Spark Summit Europe this week, expect lots of great
> content in the next issue. And if you're attending, please send interesting
> slides/talks my way!
>
> Technical
> =======
>
> Cloudera's CDH has supported intra-node disk balancing since version 5.8.2
> (the feature is also part of the Apache Hadoop 3.0.0 alpha release). With
> it, a data node can rebalance data blocks across its disks using the `hdfs
> diskbalancer` command. This post describes how the tool works and shows how
> to run it.
>
> http://blog.cloudera.com/blog/2016/10/how-to-use-the-new-hdfs-intra-datanode-disk-balancer-in-apache-hadoop/
>
>
> This post demonstrates the capabilities of the spark.ml library by
> building a logistic regression model to predict malignancy of cases from
> the Wisconsin Diagnostic Breast Cancer data set. The example code covers
> parsing, exploring a dataset with built-in statistics, extracting features
> from the input dataset, training the model, and evaluating the model.
>
> https://www.mapr.com/blog/predicting-breast-cancer-using-apache-spark-machine-learning-logistic-regression
>
>
> The Amazon Big Data blog has a tutorial for running RStudio with sparklyr
> on EMR. Thanks to a bootstrap action, a cluster complete with RStudio
> running on the master can be launched with a single command.
>
> https://aws.amazon.com/blogs/big-data/running-sparklyr-rstudios-r-interface-to-spark-on-amazon-emr/
>
>
> The Databricks blog features a list of seven tips for debugging Apache
> Spark code on Databricks. Most of the suggestions, like "Scale up Spark
> jobs slowly for really large datasets" and "Examine the partitioning for
> your dataset," are generally applicable to all Spark users.
>
> https://databricks.com/blog/2016/10/18/7-tips-to-debug-apache-spark-code-faster-with-databricks.html
>
>
> News
> ====
>
> InfoQ has an interview with Yahoo VP of Engineering, Peter Cnudde. Topics
> covered include Hadoop, Spark adoption at Yahoo (mostly for in-memory
> computing, not for ETL), and Caffe-on-Spark for deep learning.
>
> https://www.infoq.com/articles/peter-cnudde-yahoo-big-data
>
>
> ZDNet contributor Tony Baer reads between the lines of recent benchmarks
> by Cloudera and Hortonworks. The takeaways are as follows: 1) "SQL's the
> gateway drug to Hadoop," 2) Cloudera is trying to challenge Amazon (in this
> case Redshift), and 3) Hortonworks (via Hive's Live Long and Process, or
> LLAP) has caught up on the investment Cloudera made in Impala.
>
> http://www.zdnet.com/article/sql-on-hadoop-benchmarks-get-serious/
>
>
> Releases
> =======
>
> Apache Kafka 0.10.1.0 was released this week. It contains improvements
> from over 500 pull requests and the implementation of 15 Kafka Improvement
> Proposals. The Confluent blog has the highlights of additions/improvements
> to Kafka Server (time-based indexes, replication quotas, and improved log
> compaction), improvements to Kafka client APIs (interactive queries for
> Kafka Streams, improved memory management, secure quotas, and more), and
> bug fixes.
>
> http://mail-archives.apache.org/mod_mbox/kafka-users/201610.mbox/%3CCAJL4t_oz9q4T9vn6Z-EBoazWJFyqHw4Y0L-PTowD%2BpFhcPv0VQ%40mail.gmail.com%3E
> http://www.confluent.io/blog/announcing-apache-kafka-0-10-1-0/
>
> Apache Fluo (incubating) recently had its first release since entering
> the incubator. Fluo is a tool for making "incremental updates to large data
> sets stored in Apache Accumulo" a la Google's Percolator.
>
> https://fluo.apache.org/release/fluo-1.0.0-incubating/
>
>
> Apache Flume 1.7.0 was released. It adds support for a `taildir` source
> and includes a number of improvements and bug fixes. Many of these are
> around Flume's integration with Apache Kafka.
>
> http://flume.apache.org/releases/1.7.0.html
>
>
> Apache NiFi 0.7.1 was released as a follow-up to July's 0.7.0 release
> (version 1.0.0 was also recently released, in August). This release adds a
> number of improvements and bug fixes.
>
> https://cwiki.apache.org/confluence/display/NIFI/Release+Notes#ReleaseNotes-Version0.7.1
>
>
> Apache Giraph 1.2.0 was released. Highlights of the release include a new
> blocks API, support for graphs that don't fit in memory, and the addition
> of a new set of default configuration options based on Facebook's
> experience with Giraph.
>
> https://blogs.apache.org/giraph/entry/giraph_1_2_0_release
>
>
> `deeplearning4j` is a deep learning implementation that integrates with
> Hadoop and Spark and supports GPUs. Version 0.6.0 was recently released.
>
> https://github.com/deeplearning4j/deeplearning4j
>
>
> Events
> =====
> Curated by Datadog ( http://www.datadog.com )
> UNITED STATES
>
> California
> Uber Engineering Tech Talk Series (San Francisco) - Monday, October 24
> http://www.meetup.com/UberEvents/events/234789134/
>
> Real-Time Streaming and Exactly-Once Semantics with Kafka (San Francisco)
> - Tuesday, October 25
> http://www.meetup.com/MemSQL/events/234405914/
>
> Building Your First Spark & C* App + SMACK Stack + The Cassandra Odyssey
> (San Francisco) - Wednesday, October 26
> http://www.meetup.com/SF-Spark-and-Friends/events/234932979/
>
> Apache YARN Committers/Contributors Meetup #4 (Sunnyvale) - Thursday,
> October 27
> http://www.meetup.com/Hadoop-Contributors/events/234971372/
>
>
> Washington
> Kafka Palooza: LinkedIn, Microsoft Azure, MapR (Bellevue) - Monday,
> October 24
> http://www.meetup.com/Seattle-Apache-Kafka-Meetup/events/234836624/
>
>
> Nevada
> PixieDust: Making Python Visualizations Easier for Jupyter Notebooks with
> Spark (Las Vegas) - Monday, October 24
> http://www.meetup.com/Data-Science-Las-Vegas/events/234557659/
>
>
> Texas
> O&G Big Data Use Cases, by Hortonworks (Houston) - Thursday, October 27
> http://www.meetup.com/Houston-Hadoop-Meetup-Group/events/234282996/
>
>
> Kansas
> Using Data Quality to Support Analytics in Hadoop (Overland Park) -
> Tuesday, October 25
> http://www.meetup.com/Kansas-City-Big-Data-Projects-Group/events/234597551/
>
>
> Missouri
> Using Data Quality to Support Analytics in Hadoop (Kansas City) - Tuesday,
> October 25
> http://www.meetup.com/Kansas-City-Big-Data-Projects-Group/events/234597347/
>
>
> Illinois
> Big Data Streaming Platform Ecosystem (Chicago) - Tuesday, October 25
> http://www.meetup.com/ChicagoRealTimeStreamingAnalytics/events/234676872/
>
> Apache Spark 101 (Chicago) - Tuesday, October 25
> http://www.meetup.com/Chicago-Spark-Users/events/233999667/
>
>
> Ohio
> October Edition of MOHUG (Dublin) - Tuesday, October 25
> http://www.meetup.com/MOHUG-Mid-Ohio-Hadoop-User-Group/events/234416779/
>
>
> Florida
> Apache Spark (Miami) - Wednesday, October 26
> http://www.meetup.com/Miami-Hadoop-User-Group/events/234992451/
>
>
> New York
> Lambda-in-a-Box: Merging Apache Spark & HBase into an Open-Source Database
> (New York) - Thursday, October 27
> http://www.meetup.com/mysqlnyc/events/233775657/
>
> October Data Engineering Meetup (New York) - Thursday, October 27
> http://www.meetup.com/NYC-Data-Engineering/events/234946410/
>
>
> CANADA
> Toronto Apache Spark #14 (Toronto) - Wednesday, October 26
> http://www.meetup.com/Toronto-Apache-Spark/events/234878620/
>
> Introduction to MapR (Toronto) - Thursday, October 27
> http://www.meetup.com/Toronto-MapR-User-Group/events/231648976/
>
>
> UNITED KINGDOM
> Why SMACK for Fast Data (London) - Monday, October 24
> http://www.meetup.com/skillsmatter/events/234588911/
>
> Building Scalable Systems in a Changing Data Landscape (London) - Tuesday,
> October 25
> http://www.meetup.com/data-science-lab/events/234754144/
>
> Spark Structured Streaming in Practice (London) - Wednesday, October 26
> http://www.meetup.com/hadoop-users-group-uk/events/234876912/
>
>
> SPAIN
> Season Premiere with Reynold Xin, Co-Founder & Chief Architect at
> Databricks (Barcelona) - Thursday, October 27
> http://www.meetup.com/Spark-Barcelona/events/234463208/
>
> Introduction to Kafka (Malaga) - Friday, October 28
> http://www.meetup.com/Linux-Malaga/events/234826330/
>
>
> BELGIUM
> Spark Pre-Summit Meetup (Brussels) - Tuesday, October 25
> http://www.meetup.com/Spark-Belgium/events/234234256/
>
> Meeting on Streamsets, Datameer and Kudu (Kontich) - Tuesday, October 25
> http://www.meetup.com/Belgium-Cloudera-User-Group/events/234618841/
>
> Spark & Machine Learning Meetup (Brussels) - Thursday, October 27
> http://www.meetup.com/Data-Science-Community-Meetup/events/234173917/
>
>
> INDIA
> Introduction to Spark & Use Cases (Hyderabad) - Monday, October 24
> http://www.meetup.com/meetup-group-ytFpRTDs/events/234412261/
>
>
> AUSTRALIA
> Rethink SQL for Big Data with Apache Drill (Barton) - Tuesday, October 25
> http://www.meetup.com/Canberra-Big-Data-Converged-SQL-NoSQL-and-Real-Time/events/233463561/
>
> Spark Meetup October (Sydney) - Wednesday, October 26
> http://www.meetup.com/Sydney-Apache-Spark-User-Group/events/233723585/
>
> Rethink SQL for Big Data with Apache Drill (Melbourne) - Thursday, October
> 27
> http://www.meetup.com/Melbourne-Big-Data-Converged-SQL-NoSQL-and-Real-Time/events/233463459/
>
>
> ESTONIA
> Big Data: Spark and TensorFlow (Tallinn) - Monday, October 24
> http://www.meetup.com/Advanced-Java-Estonia/events/234612322/
>

Re: Hadoop Weekly #191

Posted by Keith Turner <ke...@deenlo.com>.
That's interesting. It seems their website has not been updated for a few
weeks. The last issue on the website is currently from 10/2.
