Spark Summit 2016 – Key Highlights


I recently had the chance to meet many interesting people at Spark Summit 2016, and I learned quite a bit about technology updates and where the industry is heading. In this post, I would like to summarize my key takeaways from the event.

The Spark Summit 2016 keynote was heavily focused on Deep Learning (DL). Jeff Dean of Google's TensorFlow project showcased how Google uses DL in most of its products: instant replies in the Inbox app, the Google Photos app suggesting text related to photos, real-time translation of text in images, and suggesting solar panels for your home by analyzing your rooftop. Google has also provided APIs so that the community can use these DL models to solve critical business problems without spending time reinventing the wheel.

Here are some great links if you would like to delve deeper on some of the DL products by Google:

DL keeps showing up in more day-to-day products, and those products keep getting better. Jeff even claimed that 60% of the replies in the Inbox mail app currently come from smart replies, which rely extensively on Deep Learning. Isn't that amazing?

It was not just Google. Andrew Ng, Chief Scientist at Baidu and co-founder of Coursera, also shared a lot of the data products he is building that make extensive use of Deep Learning (DL).

Needless to say, AI is going to revolutionize many industries, including healthcare, industrial, manufacturing, and transportation.


New features in Spark 2.0 & MLlib 2.0

Apache Spark 2.0

Source: Apache Spark 2.0 Technical Preview by Databricks

  1. Structured Streaming, which combines streaming and interactive analysis
  2. Tungsten phase 2, with speedups of 5-20x
  3. Unification of Datasets and DataFrames
  4. The DataFrame-based API becomes the primary MLlib API; the RDD-based API will still exist in maintenance mode
  5. Expansion of Python/R API
  6. Model persistence
  7. MLlib for exploratory data analysis
  8. The following new algorithms have made it into 2.0:
    1. Generalized Linear Models
    2. Approximate counting of distinct elements
    3. Approximate quantile algorithms
  9. Customizing ML pipelines
    1. 29 feature transformers (Tokenizer, Word2Vec)
    2. 21 models (for classification, regression, clustering)
    3. Model tuning & evaluation
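
To give a feel for what "approximate counting of distinct elements" means in the list above, here is a minimal pure-Python sketch of the HyperLogLog idea. It is only an illustration under my own assumptions: Spark's implementation is based on the related HyperLogLog++ algorithm, and the function name `estimate_distinct`, the helper `_hash64`, and the default of 4096 registers are hypothetical choices made for this example, not Spark APIs.

```python
import hashlib
import math


def _hash64(value):
    """Hypothetical helper: map any value to a 64-bit integer via SHA-1."""
    digest = hashlib.sha1(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def estimate_distinct(values, p=12):
    """Estimate the number of distinct items using 2**p small registers.

    Plain HyperLogLog with a linear-counting correction for small counts.
    This is an illustrative sketch, not Spark's HyperLogLog++ code.
    """
    m = 1 << p                      # number of registers
    registers = [0] * m
    tail_bits = 64 - p
    for v in values:
        h = _hash64(v)
        idx = h >> tail_bits                    # top p bits pick a register
        rest = h & ((1 << tail_bits) - 1)       # remaining bits
        # Position of the leftmost 1-bit in the remaining bits (1-based).
        rank = tail_bits - rest.bit_length() + 1
        if rank > registers[idx]:
            registers[idx] = rank

    alpha = 0.7213 / (1 + 1.079 / m)            # bias constant for m >= 128
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)

    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        # Few items seen: counting empty registers is more accurate.
        return m * math.log(m / zeros)
    return raw
```

Calling `estimate_distinct(user_ids)` on a large list returns a float close to the true distinct count while keeping only 4096 small integers in memory, however many values are scanned. With this many registers the typical error is on the order of a couple of percent, which is the trade-off that makes such algorithms attractive for exploratory analysis on big datasets.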

Other interesting talks related to Data Science:

Overall, the summit offered great exposure to the diverse initiatives under way in Spark and to how companies are positioning themselves amid the Industrial Internet boom. The coming months will be truly interesting as we watch the use cases that data science will open up for users.