Machine Learning Workshops - deepsense.io

Corporate Workshops

Custom-tailored machine learning workshops
to enhance company growth

State-of-the-Practice:
Innovative Machine Learning
Business Use Cases in the Enterprise

August 23 (Tuesday) 1:00-1:30pm EST

SIGN UP FOR THE WEBINAR

Machine Learning Workshops by deepsense.io

At deepsense.io we help innovative companies
develop faster and more effectively

Our intensive, hands-on training programs on machine learning, deep learning, data analysis and visualization provide the knowledge and experience needed in a growing number of industries.

We provide hands-on machine learning workshops, deep learning workshops and Apache Spark training on real data and case studies, including:

  • 2-day or 5-day formats, or longer training series
  • run exclusively for your team, when and where you need it
  • customized online training sessions
  • custom-tailored training materials available as needed
  • run by experienced instructors

Expert programmers provide the instruction during all of our workshops. They have both corporate and scientific backgrounds and are experienced in sharing their expertise. Thanks to their mixed backgrounds, they can apply advanced knowledge that is well-suited to real-world business needs.

We are happy to provide a wide variety of machine learning workshops, deep learning workshops and Apache Spark training topics for your benefit. Each training listed below is flexibly designed and can be run separately or in 2+ day block combinations. We also offer more comprehensive workshop series.

All deepsense.io workshop and training materials are created in-house.

If your company or team has any specific Big Data professional education needs, we also provide custom-tailored solutions. Let us know what you’re looking for, and we will be more than happy to meet the challenge.

I had true hands-on experience solving real issues frequently faced in a production environment. It was a true eye-opener, and I now feel much more comfortable deploying similar solutions in my work environment.

Laurent Isenegger

Software Engineer, Intel

deepsense.io machine learning workshop offerings for your company

Machine Learning (scikit-learn) & Big Data (Spark)

Summary

This intensive 2-day workshop uses hands-on exercises to teach a wide range of important machine learning algorithms and tools, selected for their value in today’s market. The workshop offers a short yet comprehensive introduction to machine learning (Python) and Big Data processing, and then moves to practical exercises with many programming assignments (Spark).

Participants use Python with libraries for data manipulation (Pandas), visualization (Seaborn) and machine learning (scikit-learn). They also learn IPython Notebook and PySpark from the Apache Spark project. By combining theory with an extensive set of practical exercises led by instructors who are practitioners in the field, each attendee benefits from a very effective learning experience.

We are focused on providing high-quality, professional training that guides each participant through machine learning and Big Data foundations. With us, workshop attendees can better understand why data science gives businesses an edge and how they can use Python and Spark to extract insights from data more effectively.

We believe that cooperation between science and business is the key to innovation and success.

Day 1: Machine Learning (scikit-learn)
Lecture: What is machine learning?

An introduction to machine learning: how it differs from traditional solutions and which problems can be easily solved with it. We give an overview of basic concepts (overfitting, cross-validation) and the types of machine learning algorithms (supervised vs unsupervised, classification vs regression).

IPython Notebook

IPython Notebook (aka Jupyter) is an interactive environment suitable for exploratory data analysis and for experimenting with various data transformations and machine learning techniques. We present some of its basic features, so that participants can use it smoothly throughout the workshop and in any other project they wish. All of the course material is provided as IPython notebooks.

Data Exploration

Before we can start any machine learning, we need to load the data and investigate what we have. We show the most important functions of Pandas and Seaborn, as well as the crucial steps each data scientist needs to perform when working with a new dataset.
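
A minimal exploration sketch in Pandas and Seaborn; the file name "rentals.csv" and its columns are hypothetical:

    import pandas as pd
    import seaborn as sns

    df = pd.read_csv("rentals.csv")   # hypothetical dataset file
    print(df.head())                  # first rows: check columns and types
    print(df.describe())              # summary statistics for numeric columns
    print(df.isnull().sum())          # count missing values per column
    sns.pairplot(df)                  # pairwise scatter plots of numeric columns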

Linear regression

Linear regression is a simple, yet powerful, technique for fitting models and predicting numerical values. Using a real dataset of bike rentals, we are going to show (a minimal sketch follows the list):

  • how to apply linear regression,

  • how to measure its performance (R squared, cross validation),

  • how to improve results by transforming variables,

  • how to interpret parameters of linear regression,

  • how to prevent overfitting with LASSO regularization.
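
A minimal scikit-learn sketch of the steps above, with synthetic data standing in for the bike rentals:

    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression, Lasso
    from sklearn.model_selection import cross_val_score

    X, y = make_regression(n_samples=200, n_features=5, noise=10.0)  # stand-in data
    lr = LinearRegression().fit(X, y)
    print(lr.score(X, y))                   # R squared on the training data
    print(cross_val_score(lr, X, y, cv=5))  # cross-validated R squared
    print(lr.coef_, lr.intercept_)          # parameters to interpret
    lasso = Lasso(alpha=0.1).fit(X, y)      # L1 regularization against overfitting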

Linear Regression – continuation
Logistic Regression

Logistic regression is an analogue of linear regression applicable to classification problems. We show (a minimal sketch follows the list):

  • intuition behind the general concept of logistic regression,

  • how it can be applied for predicting binary classes,

  • how its results can be visualized and interpreted,

  • how to interpret score and confusion matrix.
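
A minimal scikit-learn sketch on synthetic binary data:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import confusion_matrix

    X, y = make_classification(n_samples=200, n_features=4)  # stand-in data
    clf = LogisticRegression().fit(X, y)
    print(clf.score(X, y))                        # accuracy score
    print(clf.predict_proba(X)[:5, 1])            # predicted class probabilities
    print(confusion_matrix(y, clf.predict(X)))    # true vs predicted class counts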

Clustering

Clustering is the only kind of unsupervised learning covered in this workshop. We provide some motivation behind hierarchical aggregation and k-means, and show (a minimal sketch follows the list):

  • when it can be used to get some insight into data,

  • how to scale variables, so they are suitable for k-means,

  • how to interpret results.
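
A minimal k-means sketch with scaling, on synthetic data:

    from sklearn.datasets import make_blobs
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import KMeans

    X, _ = make_blobs(n_samples=300, centers=3)     # stand-in data
    X_scaled = StandardScaler().fit_transform(X)    # scale variables for k-means
    km = KMeans(n_clusters=3).fit(X_scaled)
    print(km.labels_)                               # cluster assignment per observation
    print(km.cluster_centers_)                      # centroids, to interpret clusters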

Random Forest

Random forest is a powerful technique, useful for both classification and regression. Very often it can be applied directly to the raw data, giving astonishingly good predictions. We show (a minimal sketch follows the list):

  • how to prepare data for scikit-learn random forest,

  • how to interpret feature importance,

  • pitfalls: measuring score with training, scaling variables.
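
A minimal random forest sketch, including the scoring pitfall:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=300, n_features=8)   # stand-in data
    rf = RandomForestClassifier(n_estimators=100).fit(X, y)
    print(rf.feature_importances_)          # relative importance of each feature
    # pitfall: rf.score(X, y) on the training set is over-optimistic;
    # use cross-validation instead:
    print(cross_val_score(rf, X, y, cv=5))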

Day 2: Big Data (Spark)
Challenge: Titanic prediction

Based on the Titanic survival dataset, we prepared an open-ended exercise to predict who is going to survive and who is going to perish.

Lecture: Big Data

An overview of big data: where we need to tackle large amounts of data, and what the opportunities and problems are. We will show the basics of the famous MapReduce paradigm and present the pros and cons of using Apache Spark vs Hadoop MapReduce.

MapReduce in Spark

MapReduce is the basic paradigm of big data processing and the core of Apache Hadoop. We show how it can be used from Spark, both in cases equivalent to Hadoop’s (but with much clearer and more succinct code) and in cases which cannot be solved within a single run of Hadoop. Moreover, Spark has tools going beyond the MapReduce paradigm, especially (a minimal sketch follows the list):

  • joins and group-by, similar to SQL concepts,

  • multi-step operations, with optional in-memory caching.
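
A minimal PySpark sketch: the classic word count expressed as map and reduce steps, plus a key-based join; the input file name is hypothetical:

    from pyspark import SparkContext

    sc = SparkContext("local", "demo")
    lines = sc.textFile("data.txt")                    # hypothetical input file
    counts = (lines.flatMap(lambda l: l.split())       # map: line -> words
                   .map(lambda w: (w, 1))              # map: word -> (word, 1)
                   .reduceByKey(lambda a, b: a + b))   # reduce: sum counts per word
    counts.cache()                                     # optional in-memory caching
    ages = sc.parallelize([("ann", 30), ("bob", 25)])
    jobs = sc.parallelize([("ann", "dev"), ("bob", "ops")])
    print(ages.join(jobs).collect())                   # SQL-like join on keys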

Beyond MapReduce
Machine Learning in Spark

Spark has built-in support for machine learning (MLlib), covering the most popular scalable algorithms, such as linear and logistic regression, random forest and k-means. We will show (a minimal sketch follows the list):

  • how to prepare data for machine learning in Spark,

  • how to train and use classifiers,

  • how to calculate scores.
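
A minimal MLlib sketch, assuming an existing SparkContext sc; the tiny inline dataset is illustrative:

    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.classification import LogisticRegressionWithLBFGS

    data = sc.parallelize([LabeledPoint(1.0, [1.0, 0.0]),   # (label, features)
                           LabeledPoint(0.0, [0.0, 1.0])])
    model = LogisticRegressionWithLBFGS.train(data)         # train a classifier
    pairs = data.map(lambda p: (p.label, model.predict(p.features)))
    accuracy = pairs.filter(lambda lp: lp[0] == lp[1]).count() / float(data.count())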

Data in action: Stack Exchange

We will work with real and up-to-date data from Stack Exchange Q&A sites, with a set of open-ended problems involving big data exploration and machine learning.

To learn more about our offer for your company, please contact the Workshop Coordinator:
Kamila Stepniowska: +1 206 225 9011, workshops@deepsense.io

Deep learning

Summary

This workshop presents an opportunity to acquire the basic concepts of deep learning, which revolutionizes tasks such as image recognition, speech analysis or time series prediction. Participants will get a taste of this exciting field and will be well equipped to begin individual exploration.

Requirements

  • Working knowledge of Python

  • Familiarity with basic math

  • No previous experience with neural networks or machine learning is necessary.

Day 1
Introduction to neural networks

This short lecture covers the main characteristics of artificial neural networks and the biological inspirations behind them. We compare neural networks with other models used in machine learning and describe their strengths and weaknesses. The (not-so-formal) distinction between deep learning and "shallow learning" is explored. Finally, we present modern applications of neural networks in various practical tasks.

Classification with multi-layer perceptron

Simple exercises that let participants grasp the key concepts of neural networks in practice. The exercises, presented as an interactive Python notebook, cover the following topics (a minimal sketch follows the list):

  • single neuron vs logistic regression

  • simple classification task on toy dataset

  • fitting the model

  • the role of hidden layer: non-linear classifiers

  • problems with complexity: overfitting
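
A minimal sketch of these ideas using scikit-learn's MLPClassifier on a toy dataset; the workshop exercises themselves build the networks in Theano:

    from sklearn.datasets import make_moons
    from sklearn.neural_network import MLPClassifier

    X, y = make_moons(n_samples=200, noise=0.2)   # toy two-class dataset
    # one hidden layer makes the classifier non-linear
    clf = MLPClassifier(hidden_layer_sizes=(10,), max_iter=1000).fit(X, y)
    print(clf.score(X, y))                        # watch for overfitting here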

Implementing neural networks in Python and Theano

Python is one of the most popular programming languages used for data analysis and fast prototyping of machine learning solutions. While it is not noted for performance, various libraries offering optimized algebraic operations and efficient implementations of neural networks exist (Theano, TensorFlow, Keras, Lasagne, Blocks, Pylearn2). We give a short overview of the available tools and discuss the possibilities of training neural networks on GPUs. In the later exercises, Theano is the library of choice.
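
A minimal Theano sketch of a single neuron (logistic regression) trained by gradient descent; shapes, data and learning rate are illustrative assumptions:

    import numpy as np
    import theano
    import theano.tensor as T

    x = T.matrix("x")                      # input batch
    t = T.vector("t")                      # binary targets
    w = theano.shared(np.zeros(2), "w")    # weights
    b = theano.shared(0.0, "b")            # bias
    p = T.nnet.sigmoid(T.dot(x, w) + b)    # neuron output
    cost = T.nnet.binary_crossentropy(p, t).mean()
    gw, gb = T.grad(cost, [w, b])          # symbolic gradients
    train = theano.function([x, t], cost,
                            updates=[(w, w - 0.1 * gw), (b, b - 0.1 * gb)])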

Training networks with backpropagation

The most popular algorithm for supervised learning in neural networks is backpropagation. Here we take a closer look at this algorithm and its characteristics. We will cover:

  • form of optimization (gradient descent, stochastic gradient descent),

  • known limitations (vanishing or exploding gradient, local minima, long training time),

  • additional hyperparameters (learning rate, momentum, weights regularization) and their impact.

Day 2
Image recognition with convolutional networks

A real breakthrough in image recognition comes from artificial networks inspired by the visual mechanisms present in living organisms. We discuss (a minimal sketch follows the list):

  • insights from studying visual processing in animals,

  • detecting local features,

  • the role of weight sharing,

  • introduction of pooling layers,

  • application to classifying high dimensional image dataset.
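
A minimal convolutional network sketch in Keras (one of the libraries listed above); the 28x28 grayscale input shape and layer sizes are illustrative:

    from keras.models import Sequential
    from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

    model = Sequential([
        Conv2D(16, (3, 3), activation="relu",
               input_shape=(28, 28, 1)),          # local features, shared weights
        MaxPooling2D((2, 2)),                     # pooling layer
        Flatten(),
        Dense(10, activation="softmax"),          # e.g. 10 image classes
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy")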

Recurrent neural networks

Neural networks are flexible models which are not limited to feed-forward processing. Through a recurrent network architecture it is possible to introduce a form of memory. We will focus on (a minimal sketch follows the list):

  • learning patterns through space and time,

  • describing recurrent architecture (echo state network (ESN), Long short-term memory (LSTM) network),

  • techniques for training recurrent networks,

  • application to time series prediction,

  • RNN as a pattern generator: generating English sentences.
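
A minimal recurrent network sketch in Keras for time series prediction: an LSTM learns to predict the next value of a sine wave (toy, assumed setup):

    import numpy as np
    from keras.models import Sequential
    from keras.layers import LSTM, Dense

    series = np.sin(np.linspace(0, 20, 200))                          # toy signal
    X = np.array([series[i:i + 10] for i in range(190)])[:, :, None]  # (samples, steps, features)
    y = series[10:]                                                   # next value per window
    model = Sequential([LSTM(32, input_shape=(10, 1)), Dense(1)])
    model.compile(optimizer="adam", loss="mse")
    model.fit(X, y, epochs=10, verbose=0)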

To learn more about our offer for your company, please contact the Workshop Coordinator:
Kamila Stepniowska: +1 206 225 9011, workshops@deepsense.io

Complex networks - analysis and visualization

Summary

A 2-day hands-on introduction to the analysis of graph data and complex networks with Gephi and Python + igraph. It covers the main concepts in the analysis of graph data (social networks, semantic networks, transport, network connections), providing fundamentals and practical skills. Complex networks differ from the most common, tabular data; they come with many specific methods, tools and techniques that can be used across datasets and disciplines.

Requirements

Basic knowledge of Python.
(The workshop can also be held in pure Gephi; in this case no programming skills are required.)

Outline

What is a graph?

An introduction to graph theory: objects built of vertices (or nodes) connected by edges (or links).

  • fundamental concepts

  • types of graphs

  • how to abstract data as a graph

  • real-world examples

Quantifying networks

How to measure and interpret the following properties of a network (a minimal sketch follows the list):

  • distance

  • density

  • degree distribution

  • clustering coefficient
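
A minimal python-igraph sketch of these measures, on an illustrative random graph:

    from igraph import Graph

    g = Graph.Erdos_Renyi(n=100, p=0.05)    # illustrative random graph
    print(g.diameter())                     # longest shortest path (distance)
    print(g.density())                      # fraction of possible edges present
    print(g.degree_distribution())          # histogram of node degrees
    print(g.transitivity_undirected())      # global clustering coefficient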

Node centrality

Very often we would like to distinguish nodes that play a special role, for example the most important websites. It is crucial to distinguish between the properties of various centrality measures, such as (a minimal sketch follows the list):

  • node degree

  • closeness centrality

  • betweenness centrality

  • PageRank
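
A minimal python-igraph sketch of the centrality measures, on a classic example network:

    from igraph import Graph

    g = Graph.Famous("Zachary")    # Zachary's karate club network
    print(g.degree())              # node degree
    print(g.closeness())           # closeness centrality
    print(g.betweenness())         # betweenness centrality
    print(g.pagerank())            # PageRank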

Artificial and natural networks

The structure of a network can tell us a lot about the data it describes. To obtain such insights, one needs to know these common structures in complex networks:

  • cliques, trees, bipartite graphs

  • classical graphs

  • random graphs

Graph visualization

Whether for exploration or for visual impact, graphical representations are crucial. We will focus on using Gephi for getting insightful and beautiful network visualizations.

  • Heatmaps

  • Diagrams

  • Force-directed models

Community detection

Most networks are too big to give direct insights, so we need to split them into subparts based on densely connected regions. This clustering, called “community detection”, simplifies analysis and visualization and gives direct insights. We cover (a minimal sketch follows the list):

  • Modularity maximization

  • Louvain

  • Infomap
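
A minimal community detection sketch with python-igraph:

    from igraph import Graph

    g = Graph.Famous("Zachary")
    clusters = g.community_multilevel()   # Louvain: modularity maximization
    print(clusters.modularity)            # quality of the split
    print(g.community_infomap())          # Infomap alternative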

To learn more about our offer for your company, please contact the Workshop Coordinator:
Kamila Stepniowska: +1 206 225 9011, workshops@deepsense.io

Python in Data Science – hands-on workshops

Summary

A 3h introduction to the Python data science environment for exploratory data analysis, consisting of a hands-on workshop with a real dataset. We focus on the following tools (and their interactions): the Jupyter/IPython Notebook environment for interactive data analysis, Pandas for working with tabular data, seaborn for plots and scikit-learn for machine learning.

The workshop intends to give an overview of tools and workflows for exploratory data analysis. We focus on the interoperability between key Python libraries (Jupyter, Pandas, seaborn, scikit-learn) during the hands-on tasks.

Target Audience

Current & future Data Scientists, current & future Data Analysts, Sociologists, Biologists, Physicists.

Python in data science


Jupyter/IPython Notebook

IPython Notebook (aka Jupyter) is an interactive environment suitable for exploratory data analysis, experimenting with various data transformations and machine learning techniques. All of the course materials are in IPython Notebook.

  • comments in Markdown + LaTeX

  • Jupyter for other languages (e.g. R, Julia, Haskell)

  • sharing notebooks on GitHub

Pandas

Pandas is a library for dealing with tabular data. Very often it is the backbone of data processing.

  • reading and writing CSV files

  • descriptive statistics

  • data transformations

  • plotting in Pandas

seaborn

seaborn provides a range of useful plots. Throughout the workshop we will show some of the most handy ones.

scikit-learn

scikit-learn is the main Python library for machine learning, offering a wide class of supervised and unsupervised machine learning techniques.

  • preparing data for machine learning

  • training a classifier

  • measuring performance (scores and plots for cross validation)

We will use two common models: linear regression and random forest.

Outline


We focus on the following tools (and their interactions):

  • Jupyter/IPython Notebook environment for interactive data analysis

  • Pandas for working with tabular data

  • seaborn for advanced plots

  • Scikit-learn for machine learning

Important

Participants are expected to:

  • have basic knowledge of programming (familiarity with Python syntax)

  • have basic experience working with data

  • ideally have basic knowledge of machine learning (a plus, but not a requirement)

To learn more about our offer for your company, please contact the Workshop Coordinator:
Kamila Stepniowska: +1 206 225 9011, workshops@deepsense.io

Reducing redundant information: dimensionality reduction techniques

Summary

Sometimes the abundance of available data can be overwhelming. To make sense of such vast datasets we need a way to select only those elements which are really important. This is a "beyond the machine learning basics" workshop introducing dimensionality reduction techniques.

Requirements

  • Working knowledge of Python

  • Basic experience with machine learning and data analysis (training classifiers, model evaluation)

  • Knowledge of Python data science libraries (numpy, scipy, matplotlib, scikit-learn) is recommended

Day 1
Who needs feature selection and embedding techniques?

We explain the role of various methods of dimensionality reduction in data analysis. They make it possible to make sense of large datasets with many attributes by narrowing the analysis down to just a few of the most important factors. We show how such tools are useful for understanding data structure, identifying key factors, preparing visualizations and training classifiers.

Exploratory analysis with Principal Component Analysis

One of the most popular feature transformation techniques is Principal Component Analysis (PCA). It describes the data in terms of linearly uncorrelated variables which explain the sources of variability in the data. This part covers an introduction to the algebraic concepts behind PCA and practical exercises in applying the method, choosing the right number of dimensions and interpreting the results.
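
A minimal PCA sketch on a standard example dataset:

    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA

    X = load_iris().data
    pca = PCA(n_components=2)
    X_2d = pca.fit_transform(X)              # project onto two components
    print(pca.explained_variance_ratio_)     # variability explained per component
    print(pca.components_)                   # loadings, for interpretation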

Visualization with multidimensional scaling

For the purposes of visualization, preserving distances between data points is more important than just decomposing covariance. We discuss classic multidimensional scaling and its relation to PCA. Simple examples show how to use it for data visualization.

Non-linear embeddings

We then show a broader class of non-linear methods which strive to preserve the data structure in a low-dimensional space. In a short overview we present:

  • Isomap technique incorporating geodesic distance

  • Discrete embedding and clustering with self-organizing map (SOM)

  • Stochastic embedding with t-SNE method

  • Semantic words embedding using neural networks (word2vec)

We then demonstrate the practical use of such methods on an image clustering problem.
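
A minimal sketch of two of the listed embeddings (Isomap and t-SNE) on a small image dataset:

    from sklearn.datasets import load_digits
    from sklearn.manifold import Isomap, TSNE

    X = load_digits().data[:200]                      # small image dataset
    X_iso = Isomap(n_components=2).fit_transform(X)   # geodesic-distance embedding
    X_tsne = TSNE(n_components=2).fit_transform(X)    # stochastic neighbor embedding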

Day 2
Filter methods for feature selection

In large datasets, often only a small subset of attributes is relevant to the classification problem. Filter methods are based on an initial screening of attributes in a model-independent way. We present some examples and let participants discover the advantages of such methods on their own.
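
A minimal filter-method sketch: score each attribute independently of any model and keep the top ones:

    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectKBest, f_classif

    X, y = make_classification(n_samples=200, n_features=20)  # stand-in data
    selector = SelectKBest(f_classif, k=10)    # univariate F-test per feature
    X_reduced = selector.fit_transform(X, y)
    print(selector.scores_)                    # per-feature screening scores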

Wrapper methods for feature selection

Wrapper methods select features to maximize a specific performance measure of the trained classifier. Here we demonstrate the advantages and disadvantages of such an approach. We cover (a minimal sketch follows the list):

  • Stepwise regression procedures

  • Global optimization methods for feature selection

  • Multiple testing problem and the dangers of overfitting
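
A minimal wrapper-method sketch: recursive feature elimination repeatedly refits a model and drops the weakest features:

    from sklearn.datasets import make_classification
    from sklearn.feature_selection import RFE
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=200, n_features=20)  # stand-in data
    rfe = RFE(LogisticRegression(), n_features_to_select=5).fit(X, y)
    print(rfe.support_)    # mask of selected features
    print(rfe.ranking_)    # elimination order of the rest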

Embedded methods for feature selection

Sometimes methods of scoring feature importance are already embedded in the classification model. We focus on two algorithms: penalized logistic regression and Random Forest. We cover (a minimal sketch follows the list):

  • Various forms of penalized regression (LASSO, Ridge, SCAD)

  • Calculating feature importance with tree-based methods (Random Forest, Gradient Tree Boosting)

  • Finding proper cut-off for important features

  • Discovering feature interdependencies
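
A minimal embedded-method sketch: feature importance comes directly from the fitted models:

    from sklearn.datasets import make_regression
    from sklearn.linear_model import LassoCV
    from sklearn.ensemble import RandomForestRegressor

    X, y = make_regression(n_samples=200, n_features=10, noise=5.0)  # stand-in data
    lasso = LassoCV().fit(X, y)
    print(lasso.coef_)                  # zeroed coefficients = dropped features
    rf = RandomForestRegressor(n_estimators=100).fit(X, y)
    print(rf.feature_importances_)      # tree-based importance scores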

To learn more about our offer for your company, please contact the Workshop Coordinator:
Kamila Stepniowska: +1 206 225 9011, workshops@deepsense.io

Interactive data visualization in D3.js

Summary

A 2-day hands-on workshop on creating interactive data visualizations in D3.js, providing a basic working knowledge of D3.js and the best practices of data visualization.

Once we start working with specific data, we can easily discover that the charts provided by standard plotting libraries are not enough, or that the abundance of data makes it impossible to cover within a set of static plots. Interactive visualizations can be used for your clients (to present data in a neat, original way) or as an in-house tool for data exploration.

The workshop will focus on pure D3.js, so that it can be used standalone, or within any other framework. The course will be based on practical exercises, so participants will have an opportunity to code every feature covered in the material.

Requirements

Basic knowledge of JavaScript is required. (If you program in another language, just look up Learn X in Y Minutes; if you are new to programming, take an introductory course, e.g. Eloquent JavaScript.) Deeper knowledge of JavaScript or HTML is a plus.

Outline

What is data visualization?

A lecture covering basics of data visualization - typical approaches, inspiring examples and common pitfalls.

  • Standard charts

  • Less standard charts (but still very useful)

  • Good and bad practices

D3.js basics

Very basic introduction to D3.js library and its main functionalities.

  • Creating selectors and binding data

  • Adding, modifying and removing elements

  • Chaining syntax

  • Good practices

SVG basics

D3.js can work with DOM or Canvas, but it shines when used with Scalable Vector Graphics (SVG).

  • SVG elements

  • Groups

  • Transformations

  • it’s not a <div>!

Preparing data

Typically we want to preprocess data so that it can be served quickly, leaving data munging to tools other than JavaScript.

  • CSV and JSON

  • Splitting semantics from visuals

  • Typical patterns

Scales and axes

Each data visualization turns data properties into visual attributes: positions, colors, shapes. D3.js has special objects facilitating such conversions.

  • Continuous

  • Ordinal

  • Color

  • Logarithmic and power scales

Interactivity

One of the key advantages of creating browser-based data visualizations is that we can make them interactive: to supplement them with additional data, to enable unrestricted data exploration, or simply to make them more visually appealing and engaging.

  • Click

  • Hover

  • Drag and drop

  • Zoom

Transition animations

When changing plot parameters, it helps to use smooth transitions.

  • Tweening position, color, opacity, size and other attributes

  • Applications for data changing and removal

Prebuilt visualizations

D3.js offers a number of visualizations as either built-in objects or external libraries. We will explore some of the most useful:

  • Force-directed graphs

  • Hierarchies

  • Sankey diagrams

Open-ended project

The last, and the most important, exercise is creating a custom data visualization from scratch.

To learn more about our offer for your company, please contact the Workshop Coordinator:
Kamila Stepniowska: +1 206 225 9011, workshops@deepsense.io

Introduction to Machine Learning

Target audience

Programmers and Tech Leaders who want to learn about the concepts of statistics, data visualization and machine learning.

Time

A 2-day workshop with a total duration of 14h, including breaks. On each day, 1h is devoted to a lunch break and social mixer.

Summary

This workshop introduces the basic concepts of machine learning. It provides an overview of algorithms and their applications. The course comprises lectures and a series of practical exercises. Participants should obtain a solid knowledge of machine learning basics, either to apply in practice or as a foundation for deeper study of the subject. The training is held in the Python programming language and shows how to make use of the numpy, pandas, seaborn and scikit-learn libraries.

Requirements

Some experience in programming (preferably in Python) is necessary for the practical exercises. We assume no prior knowledge of machine learning (it is an introductory course); however, some affinity with statistics might be helpful.

Agenda

Introductory talk: What is machine learning?

We start with an introduction to machine learning, a subfield of computer science that enables computers to learn without being explicitly programmed. We provide an overview of the types of machine learning algorithms (supervised vs unsupervised, classification vs regression) and the basic concepts (evaluation, overfitting, regularization). We present what a general modelling pipeline looks like, along with some example problems and selected applications. We also introduce the basic concepts of an instance, a feature, a label and a numeric response, and discuss how to represent knowledge.

Introduction to IPython notebook with elements of Python programming

IPython Notebook (aka Jupyter) is an interactive environment suitable for exploratory data analysis and for experimenting with various data transformations and machine learning techniques. We present some of its basic features, so that participants can use it smoothly throughout the workshop and in any other project they wish. We also cover the basics of the Python programming language that are necessary for later parts of the course. All material is provided as IPython notebooks.

Exploratory data analysis and preprocessing

Prior to creating a statistical or machine learning model, it is important to get familiar with the dataset we are working with. In fact, this is the most crucial step in creating a reliable and successful model. Using a real dataset of bike rentals, we discuss common problems that come with data and ways to handle them, e.g. missing values, outliers and how to preprocess data for modelling.
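
A minimal preprocessing sketch in Pandas; the toy DataFrame and column names are hypothetical:

    import pandas as pd

    df = pd.DataFrame({"count": [3, None, 5, 4, 6, 400],
                       "season": ["a", "b", "a", "b", "a", "b"]})  # toy stand-in
    df = df.dropna(subset=["count"])                 # or fillna(...) to impute
    low, high = df["count"].quantile([0.01, 0.99])
    df = df[df["count"].between(low, high)]          # crude outlier trimming
    df = pd.get_dummies(df, columns=["season"])      # encode categoricals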

Machine learning for labelled data - classification

In this part we present two models for classification. Logistic regression is a basic model applicable to classification problems. A decision tree is another machine learning technique, one which allows for non-linear interactions between features. In this part of the training we cover:

  • Intuition behind the general concept of logistic regression and a decision tree

  • How these models can be applied for predicting binary classes

  • How the results can be visualized, interpreted and evaluated

Evaluation of what has been learned

Once we are familiar with several methods for classification, we discuss the evaluation of model performance in greater detail. This is an essential part of successfully applying a model (a minimal sketch follows the list).

  • We discuss how to arrive at an unbiased performance measurement

  • We present an array of evaluation metrics tailored to particular prediction tasks

  • We discuss some common pitfalls in the model evaluation stage
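
A minimal evaluation sketch: measure on a held-out test set, with several metrics:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

    X, y = make_classification(n_samples=400)          # stand-in data
    X_tr, X_te, y_tr, y_te = train_test_split(X, y)    # unbiased measurement
    clf = LogisticRegression().fit(X_tr, y_tr)
    pred = clf.predict(X_te)
    print(accuracy_score(y_te, pred))
    print(precision_score(y_te, pred), recall_score(y_te, pred))
    print(roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))  # threshold-free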

Predicting numeric response - regression

Linear regression is a simple, yet powerful, technique for fitting models and predicting numerical values. Using the same dataset of bike rentals, we are going to show:

  • How to apply and interpret linear regression model

  • How to measure its performance

  • How to prevent overfitting by regularization

Tuning up performance - ensemble methods

Being able to improve a prediction model by even the slightest margin can lead to enormous gains. In this part, we present state-of-the-art ensemble modelling: a set of powerful techniques for combining several simple models into a more accurate one. We discuss:

  • The idea behind bagging, boosting, blending and stacking classifiers

  • How the Random Forest algorithm works

  • How to get insight from this model by feature importance metrics

Discovering structure in data - clustering

One of the goals of clustering is to partition the data into groups of similar observations. Clustering is the only kind of unsupervised learning covered here; we provide some motivation behind the k-means algorithm and show:

  • How to apply it to get insight into data

  • How to interpret results

  • How to decide on the number of clusters

To learn more about our offer for your company, please contact the Workshop Coordinator:
Kamila Stepniowska: +1 206 225 9011, workshops@deepsense.io

Big Data processing with Apache Spark

Summary

The course prepares attendees to work with the new problems that arise when analyzing Big Data from various sources using the Apache Spark family of tools. It consists of typical Big Data problems and their solutions using Apache Spark and other Big Data tools. Moreover, attendees receive a general overview of the advantages and disadvantages of using Apache Spark for solving their business problems. The course is also a good starting point for people entering the quickly growing field of Big Data processing and the novelties in problem solving that Apache Spark brings.

Prerequisites

The course requires experience in one of the following programming languages: Java, Scala or Python; the preferred training language is Java 8. Additionally, experience in data processing, functional programming, distributed processing and *nix systems is useful, but not mandatory.

Day 1
Introduction to Big Data

The first part introduces the general concept of Big Data with a historical background. The short lecture explains the technology, its usage and stakeholders involved in Big Data projects. Moreover, the general approach to Big Data problems is shown, followed by case studies and discussion.

Using a Hadoop cluster

While Hadoop is not the primary focus of this course, using Spark requires basic knowledge of the Big Data ecosystem. We give a short introduction to Hadoop and its complementary tools, such as YARN and HDFS, and position Spark in this environment. Basic cluster operations are demonstrated using Ambari as an example. This section includes training in practical cluster usage.

Data analysis with Apache Spark

Apache Spark is one of the fastest growing Big Data projects. Its popularity is due to its speed and ease of use. We introduce the Apache Spark project and its corresponding parts. All the principal concepts, such as Resilient Distributed Datasets, transformations and actions, are explained, along with how Spark fits in with Hadoop. All the major software functions are presented using easy-to-follow examples based on general practice. Moreover, basic job performance tuning is explained. The Spark jobs are tested locally as well as run on a small cluster, which simulates a real production environment.

Day 2
Tabular data analysis with Apache Spark SQL

Tables and SQL have been the lingua franca of data processing and analysis for many years. Spark SQL is an extension of Spark focusing on exactly these concepts. We show the usage of SQL and DataFrames in Spark and how they greatly help with handling and analyzing tabular data. Moreover, integration with Hive and other sources, including JDBC usage, is presented. As before, we work with simple but practical examples.
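
A minimal Spark SQL sketch in PySpark (the course's preferred training language is Java 8); it assumes an existing SparkSession named spark and a hypothetical input file:

    df = spark.read.json("people.json")     # hypothetical input file
    df.printSchema()
    df.filter(df.age > 30).show()           # DataFrame operations...
    df.createOrReplaceTempView("people")
    spark.sql("SELECT age, COUNT(*) FROM people GROUP BY age").show()  # ...or SQL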

Stream data analysis with Apache Spark Streaming

Streams of data are becoming more common in the modern world. Apache Spark Streaming addresses this need, providing a micro-batch extension of the Spark base project with a similarly easy-to-use wealth of features. Moreover, we explain the different streaming approaches and projects, such as Apache Storm. We illustrate the usage with simple stream processing examples.
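
A minimal Spark Streaming sketch in PySpark, assuming an existing SparkContext sc and a text source on localhost:9999:

    from pyspark.streaming import StreamingContext

    ssc = StreamingContext(sc, batchDuration=1)      # 1-second micro-batches
    lines = ssc.socketTextStream("localhost", 9999)
    counts = (lines.flatMap(lambda l: l.split())
                   .map(lambda w: (w, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()                                  # print each batch's counts
    ssc.start()
    ssc.awaitTermination()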

Data analysis in practice

The assignment allows attendees to put their new skills to use. While the assignment is done individually, the instructor is on hand to answer questions.

To learn more about our offer for your company, please contact the Workshop Coordinator:
Kamila Stepniowska: +1 206 225 9011, workshops@deepsense.io

Machine Learning in detail (scikit-learn)

Summary

A 2-day hands-on workshop introducing important machine learning algorithms. It will balance practical problem-solving with insight into the inner workings of the algorithms.

The workshop will be held in a suitable Python environment (scikit-learn, Pandas, Jupyter Notebook, and when necessary using lower-level libraries like NumPy, SciPy and Theano).

It is aimed at:

  • people who already use ML, but would like to get a better understanding of the algorithms that they use,

  • people who want to get into ML and have a quantitative background (e.g. physics, statistics or mathematics).

Requirements

  • Basic knowledge of Python.

  • Basic knowledge of linear algebra and calculus (matrices, derivatives).

Outline

Lecture: Machine Learning framework

Having a firm grasp of the core concepts of machine learning makes it easier to understand the variety of techniques used, and select & modify them to fit particular problems.

  • supervised vs unsupervised,

  • cross-validation,

  • cost function,

  • regularization,

  • hyperparameters.

Linear and logistic regression

These are simple algorithms commonly used both as complete solutions and as building blocks for more advanced algorithms. We will explore their foundations, applications and common extensions (a minimal sketch of the two solution strategies follows the list).

  • interpretation

  • regularization (Ridge, LASSO, elastic net)

  • least square fit for polynomials

  • logistic function, softmax

  • solution 1: linear algebra

  • solution 2: stochastic gradient descent
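
A minimal NumPy sketch of the two solution strategies for least squares, on synthetic data:

    import numpy as np

    rng = np.random.RandomState(0)
    X = np.c_[np.ones(100), rng.randn(100, 2)]          # bias column + two features
    y = X @ np.array([1.0, 2.0, -1.0]) + 0.1 * rng.randn(100)

    # solution 1: linear algebra (least-squares solver)
    w_exact, *_ = np.linalg.lstsq(X, y, rcond=None)

    # solution 2: stochastic gradient descent
    w = np.zeros(X.shape[1])
    for _ in range(1000):
        i = rng.randint(len(y))             # pick one sample
        grad = (X[i] @ w - y[i]) * X[i]     # gradient of the squared error
        w -= 0.01 * grad                    # small step downhill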

Random Forest

Random Forest is a very versatile algorithm, often working out of the box and providing great results. It is capable of using categorical and numerical variables, and works for both regression and classification.

  • classification trees

  • splitting rules, Gini impurity

  • ensemble methods

  • out-of-bag score

  • feature importance

  • cross-validation pitfalls

Gradient Boosted Trees

GBT is a more powerful analogue of RF. However, this comes at a cost: there are more parameters to tune, and it does not scale as well. A minimal grid search sketch follows the list.

  • boosting

  • grid search
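
A minimal grid search sketch for gradient boosted trees:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import GridSearchCV

    X, y = make_classification(n_samples=300)           # stand-in data
    params = {"n_estimators": [100, 300],
              "learning_rate": [0.05, 0.1],
              "max_depth": [2, 3]}
    search = GridSearchCV(GradientBoostingClassifier(), params, cv=5).fit(X, y)
    print(search.best_params_, search.best_score_)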

Clustering

How to cluster data when we do not know the correct answers, how to compare different clusterings and how to decide if our clusters make sense.

  • k-means algorithm

  • stability, internal consistency

  • normalized mutual information

Hierarchical aggregation

Hierarchical aggregation is a class of methods giving a fine-grained clustering: a family of techniques sharing common building blocks. A minimal sketch follows the list.

  • distances

  • linkage: single, complete, Ward method

  • dendrograms
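
A minimal hierarchical clustering sketch with SciPy, on synthetic data:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, dendrogram
    import matplotlib.pyplot as plt

    X = np.random.randn(30, 2)       # stand-in data
    Z = linkage(X, method="ward")    # also: "single", "complete"
    dendrogram(Z)                    # visualize the merge hierarchy
    plt.show()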

To learn more about our offer for your company, please contact the Workshop Coordinator:
Kamila Stepniowska: +1 206 225 9011, workshops@deepsense.io

Machine Learning case studies (scikit-learn)

Summary

2-day hands-on workshop centered around practical problem-solving with machine learning.

We focus on providing insight into how to approach problems, how to select an appropriate machine learning technique, and how to improve the solution’s performance through feature engineering and parameter tuning. During this workshop we will solve 4 different machine learning tasks, trying to improve performance and gain new insight into the data and ML capabilities.

The workshop will be held in a suitable Python environment (scikit-learn, Pandas, Jupyter Notebook).

It is aimed at:

  • people who already use ML, but would like to learn new methods and tricks, and get a lot of feedback,

  • people who know the basics of ML theory, but would like to get more practical exposure.

Requirements

  • Basic knowledge of Python.

  • Basic experience with machine learning.

Outline

Lecture: Machine Learning goals and methods

A big picture of how machine learning can be applied to prediction.

  • supervised vs unsupervised,

  • prediction and interpretation,

  • meaningful error penalty,

  • workflow.

While tackling 4 different machine learning problems we will cover the following topics:

Supervised techniques

An overview of popular techniques for supervised machine learning, with emphasis on their pros and cons (capabilities, performance, artifacts, etc.), so that we can choose the most promising methods for a given task.

  • linear & logistic regression

  • k-nearest neighbors

  • random forest

  • gradient-boosted trees

Unsupervised techniques

Sometimes we need to create new variables summarizing our results - either for other machine learning algorithms or for manual inspection. We will work with clustering and dimensionality reduction.

  • k-means

  • hierarchical aggregation

  • dimensionality reduction

Data transformations

Often the initial data preprocessing is as important as the machine learning algorithm. We will cover its most important aspects.

  • restructuring data

  • missing values

  • variable scaling

  • numerical and categorical

Extracting text data

The majority of ML algorithms use numerical and categorical data. Yet sometimes there is valuable data in text fields; how do we extract it? A minimal sketch follows the list.

  • regex

  • word count

  • aggregation

  • topic modelling
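
A minimal text extraction sketch; the example strings are hypothetical:

    import re
    from sklearn.feature_extraction.text import CountVectorizer

    texts = ["order 66 shipped late", "invoice 12 paid on time"]   # hypothetical
    numbers = [re.findall(r"\d+", t) for t in texts]               # regex extraction
    counts = CountVectorizer().fit_transform(texts)                # word counts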

Measuring performance

How to measure performance, compare different algorithms and tweak them.

  • grid search

  • cross-validation

  • holdout set

  • data independence

To learn more about our offer for your company, please contact the Workshop Coordinator:
Kamila Stepniowska: +1 206 225 9011, workshops@deepsense.io

Let’s talk about your needs and our solutions!
Please fill out the form. We will be happy to contact you soon.

Your name and surname: *

Phone number:

Your e-mail: *

Your company name: *

Country: *

City: *

What kind of training are you looking for?

Please describe what kind of Data Science knowledge and skills you would like
your team, employees or yourself to gain:

How big is the team you would like to provide training for?

How did you learn about deepsense.io corporate workshops?

Additional information:

* - required field