I recently had the amazing chance to meet many interesting folks at Spark Summit 2016, and I learnt quite a bit about the technology updates and where the industry is heading. In this blog, I would like to summarize my key takeaways from the event.

The Spark Summit 2016 keynotes were heavily focused on Deep Learning (DL). Jeff Dean of Google's TensorFlow project showcased how Google is using DL across most of its products: instant replies in the Inbox app, Google Photos suggesting text related to your photos, real-time translation of text in images, and even suggesting solar panels for your home by analyzing your rooftop. Google has also provided APIs so the community can use these DL models to solve critical business problems without having to reinvent the wheel.
Here are some great links if you would like to delve deeper into some of Google's DL products:
We are seeing more and more DL in day-to-day products, and they keep getting better. Jeff even claimed that 60% of the replies in the Inbox mail app currently happen through Smart Reply, which relies extensively on Deep Learning. Isn't it amazing?
Not just Google: Andrew Ng, Chief Scientist at Baidu and co-founder of Coursera, also shared a lot of the awesome data products he is building that make extensive use of Deep Learning (DL).
Needless to say, AI is going to revolutionize many industries, from healthcare and manufacturing to transportation.
New features in Spark 2.0 & MLlib 2.0
- Structured Streaming, which combines streaming and interactive analysis
- Tungsten phase 2, bringing 5-20x speedups
- Unification of Datasets and DataFrames
- The DataFrame-based API becomes primary, while the RDD-based API remains in maintenance mode
- Expansion of the Python/R APIs
- Model persistence
- MLlib for exploratory data analysis
- The following new algorithms have made it into 2.0:
- Generalized Linear Models
- Approximate counting of distinct elements
- Approximate quantiles
- Customizing ML pipelines
- 29 feature transformers (e.g., Tokenizer, Word2Vec)
- 21 models (for classification, regression, clustering)
- Model tuning & evaluation
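The approximate distinct counting mentioned above is, in Spark, based on a HyperLogLog-style sketch. As a rough illustration of the idea only — a plain-Python sketch, not Spark's actual implementation — each item is hashed, a few leading bits of the hash pick a register, and each register remembers the longest run of leading zeros it has seen:

```python
import hashlib
import math

def _hash(x):
    # 64-bit hash derived from SHA-1 (illustrative choice, not Spark's hash)
    return int(hashlib.sha1(str(x).encode()).hexdigest(), 16) & ((1 << 64) - 1)

class HyperLogLog:
    def __init__(self, p=10):
        self.p = p            # first p bits of the hash select a register
        self.m = 1 << p       # number of registers (fixed memory footprint)
        self.registers = [0] * self.m

    def add(self, item):
        h = _hash(item)
        idx = h >> (64 - self.p)                    # register index
        rest = h & ((1 << (64 - self.p)) - 1)       # remaining bits
        # rank = position of the leftmost 1-bit in the remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)       # bias-correction constant
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        # small-range correction: fall back to linear counting
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)
        return raw

hll = HyperLogLog(p=10)
for i in range(10000):
    hll.add(f"user-{i}")
print(round(hll.estimate()))  # lands within a few percent of the true count of 10000
```

The appeal is that memory stays fixed at `m` registers no matter how large the stream is, with a relative error of roughly 1.04/sqrt(m) — the trade-off that makes the algorithm practical at Spark scale.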
Other interesting talks related to Data Science:
- Huohua: distributed time-series analysis by Two Sigma
- TimeSeriesRDD in Huohua
- Temporal joins
- Group functions on time-series data
- The elasticsearch-hadoop project
- The Apache SystemML project is going strong
- Baidu has built its Parallel Asynchronous Distributed Deep Learning Engine (PADDLE) with CPU & GPU support to run vision, speech, and NLP workloads at scale
- Automatic feature generation and model training on Spark using a Bayesian approach revealed a lot of interesting optimization opportunities in hyperparameter tuning
- The Red Hat team showed how they analyze log data to find anomalies and reduce false alarms using techniques like ensembles of decision trees and self-organizing maps
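The temporal joins mentioned in the Huohua talk are essentially "as-of" joins: each record on one side is matched with the most recent record on the other side at or before its timestamp. Here is a hedged stdlib-Python sketch of the idea — the function name and sample data are illustrative, not Huohua's API:

```python
from bisect import bisect_right

def temporal_join(left, right, tolerance=None):
    """As-of join: for each (ts, value) in `left`, attach the most recent
    `right` record whose timestamp is <= ts (and within `tolerance`, if given).
    Both inputs must be sorted by timestamp."""
    right_ts = [ts for ts, _ in right]
    joined = []
    for ts, val in left:
        i = bisect_right(right_ts, ts) - 1   # index of latest right ts <= ts
        match = None
        if i >= 0 and (tolerance is None or ts - right_ts[i] <= tolerance):
            match = right[i][1]
        joined.append((ts, val, match))
    return joined

# Illustrative data: match each trade with the latest quote before it
trades = [(1, "buy"), (5, "sell"), (9, "buy")]
quotes = [(0, 100.0), (4, 101.5), (8, 102.0)]
print(temporal_join(trades, quotes))
# -> [(1, 'buy', 100.0), (5, 'sell', 101.5), (9, 'buy', 102.0)]
```

Because both sides are sorted by timestamp, each lookup is a binary search, so the join runs in O(n log m); a distributed implementation additionally has to partition the series by time range so that matching records land on the same worker.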
The summit overall was amazing exposure to the diverse initiatives under way around Spark and to how companies are positioning themselves amid the Industrial Internet boom. The coming months will be truly interesting as we watch the use cases data science empowers users with.