
In September 2019 I had the good fortune to attend ECML 2019 in beautiful Wurzburg, Germany.

I was there to present our paper

Bhalla, S. et al., 2019. Compact Representation of a Multi-dimensional Combustion Manifold Using Deep Neural Networks. In European Conference on Machine Learning. Wurzburg, Germany, p. 8.

It was a fantastic conference: a nice size, with a real diversity of topics rather than just a whole bunch of permutations on Deep Learning. What follows are some notes I took at the talks I attended; you can take a look at the full schedule here.

Monday Talks

Sridhar Mahadevan Tutorial

  • he’s now at Adobe Research
  • his slides for IJCAI are also available; that tutorial was longer than this one

Imagination and Causation

  • Imagination Science
    • he wants to do reasoning about future paths and outcomes without having data to back it up
    • how does he distinguish this from counterfactual reasoning? Counterfactual reasoning is just one part of it
    • also includes: analogy, abstraction, spatial/temporal projection
    • what is Combinatorial Creativity? mixing and matching existing components into combinations that never occur, like a sphinx
  • example: long term predictions for climate change
    • not just about predicting the next step but about reasoning about what society will do for the next few decades
  • example: search engine
    • what is the next step?
    • “Improbable” company essentially is trying to simulate reality
  • he thinks GANs are really the only novel thing to come out of Deep Learning recently
    • he doesn’t think ML will be fundamentally different in 20 years, it’s solved essentially
  • he’s talking about machines creating art
    • can machines ever create art?

Pearl’s Ladder of Causality

  • This is the new three-layered approach by Judea Pearl
  • “The Book of Why” is where he explains it in a simple way
  • Rung 1, association (seeing)
    • observational machine learning
  • Rung 2, intervention (doing)
    • Experimental science
    • Pearl: probabilities are an epiphenomenon resulting from the causal nature of the world
  • Rung 3, counterfactuals (imagining)
    • Can be thought of as necessary for Imagination

Combining Observation and Causality

  • GANs - the problem they solve is visualizing imagination
    • he says GANs are related to the Actor-Critic RL problem
    • Actor-Critic models converge, but that wasn’t known until 20 years later
    • Wasserstein GANs have a stronger theoretical basis
  • Seeing GANs as an optimization problem
    • min_G max_D V(D,G)
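
For reference, the value function in this minimax problem is usually written as in the original GAN formulation. The symbols p_data (data distribution), p_z (noise prior), and z (noise sample) are standard notation and are an assumption here; they do not come from the talk.

```latex
\min_G \max_D V(D, G)
  = \mathbb{E}_{x \sim p_{\mathrm{data}}}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]
```

The discriminator D pushes V up while the generator G pushes it down, which is why the solution is a saddle point rather than a minimum.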

Art and Imagination

  • CANs are a new model for simulating creativity
  • Can create fairly nice modern art paintings, but is it art or is it just sampling a space of images?
    • Is there a difference?
    • Does it mean anything? does it need to?

Relevant Optimization methods from Economics which would be useful for GANs

  • Optimization vs Equilibration
    • minimize a function over a feasible set
    • usually need to assume f(x) is differentiable
  • Equilibration from physics (Stampacchia in the 1960s)
    • they define the set of partial gradients for a function as a ‘vector field’
    • assume this field is given rather than f(x) being given
    • means we can solve optimization problems but also other problems
    • eg. there are some vector fields that don’t have a function with a gradient that generates the vector field
    • economists use this for domains where the vector field can be written down easily but no simple function leads to it via derivatives
  • Variational inequality
    • eg. traffic management
    • GANs are this kind of problem, which is why they work
    • he thinks game theory, GANs, traffic, RL etc are better solved using this approach
  • GANs don’t converge if you just do gradient descent
    • but if you use the Wasserstein loss function then you get a Nash equilibrium?
    • the surface for a GAN is a saddle, so gradient descent is bad: it is very unlikely to lead to an optimal point, and you will cycle around
    • a better idea sometimes is to move orthogonal to the gradient, because the thing you are trying to do is find an equilibrium point
  • Extragradient Method
    • Frobenius projection from the gradient
    • Project back to the main function if the negative vector field leads you out of the feasible set
  • Mirror Descent instead of Gradient Descent - by Nemirovsky and Yudin
    • gradients live in the dual of the original space; the two spaces happen to line up in the Euclidean case but not in others, so you can’t really add a gradient to a point directly
    • essentially you convert the state vector to the dual space first, then compute the gradient, update it and project back to the original space via a conjugate function
    • in Euclidean space, this whole mechanism collapses to gradient descent because the conversion is identity
    • this explains why multiplicative updates of gradients work better in many spaces
    • you can explain many ML methods using this idea
      • boosting
      • natural gradients - gradient descent should be done in Riemannian space using the Fisher information matrix
        • Kakade’s paper on Natural Actor-Critic for playing Tetris
  • Mirror-prox
    • take two gradient steps in the dual space, works even better
  • He calls the dual space the imagination space because it handles (something)
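
The cycling behaviour on a saddle, and the way an extragradient step avoids it, can be seen on the simplest possible game. This is a minimal sketch, assuming the toy objective f(x, y) = x * y (my choice, not from the talk) and no feasible-set constraint, so the projection step is omitted.

```python
import numpy as np

def field(w):
    """Vector field of the game min_x max_y x*y: (df/dx, -df/dy)."""
    x, y = w
    return np.array([y, -x])

def gradient_step(w, lr):
    # Simultaneous gradient descent on x and ascent on y.
    return w - lr * field(w)

def extragradient_step(w, lr):
    # Take a look-ahead step, then update from w using the field
    # evaluated at the look-ahead point.
    w_ahead = w - lr * field(w)
    return w - lr * field(w_ahead)

def distance_after(step, w0, lr=0.1, n_steps=200):
    """Distance from the equilibrium (0, 0) after n_steps updates."""
    w = np.array(w0, dtype=float)
    for _ in range(n_steps):
        w = step(w, lr)
    return np.linalg.norm(w)

start = [1.0, 1.0]
gd_dist = distance_after(gradient_step, start)
eg_dist = distance_after(extragradient_step, start)
print(gd_dist, eg_dist)
```

Plain gradient steps spiral away from the equilibrium at the origin, while the extragradient steps spiral inwards.
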

Reinforcement Learning as a type of Causal Learning

  • where you can try out particular actions
  • imagination values
    • what is the value of deviating from the action the policy advises
    • counterfactual kind of value function
    • not needed for an MDP but for POMDPs you need to imagine the possible worlds
    • even for multi-armed bandits it can help, because it leads you outside the existing policy actions
    • is it just exploration? it is related to off-policy exploration, how?

Tuesday Talks

Active Learning Anomaly Detection

  • Unsupervised and Active Learning using Maximin-based Anomaly Detection
  • Zahra Ghafoori (University of Melbourne), James C. Bezdek (University of Melbourne), Christopher Leckie (University of Melbourne), Shanika Karunasekera (University of Melbourne)
  • took some pictures of the AD background description which was very good
    • semi-supervised, with an oracle that can be queried about whether points are anomalies or not
  • they do active learning and compare to standard AD methods like OCSVM and iForest
  • better quality than iForest and faster because it only builds one model

OneClass Anomaly Detection

  • The Elliptical Basis Function Data Descriptor (EBFDD) Network - A One-Class Classification Approach to Anomaly Detection
  • Mehran Bazargani (The Insight Centre for Data Analytics, School of Computer Science, University College Dublin), Brian Mac Namee (The Insight Centre for Data Analytics, School of Computer Science, University College Dublin)
  • a cost function that turns RBF networks into a one-class classifier
  • assumptions
    • they are interested in streaming data
    • training data does not include any anomalies
  • RBF Networks
    • overview
    • three layers
      • input data
      • hidden layer of Gaussian kernels, initialize the (m,v) with k-means
      • output layer linearly combines them via a sigmoid
    • backprop learns the parameters of the Gaussians
    • not applicable to one class
  • their change
    • main idea: try to tightly fit the gaussians around the subspace of the normal points
    • introduce elliptical kernels instead of spherical
    • the ellipses can be stretched and shaped to adjust the amount of correlation (or not) between the dimensions
    • instead of Gaussian (m,v) they have a covariance matrix for each kernel
  • experiments
    • they use the Emmott and Dietterich paper as their guide
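
The difference between a spherical and an elliptical kernel is easy to see numerically. This is a minimal sketch of the two kernel shapes only, assuming a Gaussian form with a full covariance matrix; it is not the EBFDD cost function, and the function names and example values are mine.

```python
import numpy as np

def spherical_kernel(x, mean, variance):
    """Standard RBF unit: one shared variance for every dimension."""
    diff = x - mean
    return np.exp(-0.5 * diff @ diff / variance)

def elliptical_kernel(x, mean, cov):
    """Elliptical unit: a full covariance matrix stretches and rotates the kernel."""
    diff = x - mean
    return np.exp(-0.5 * diff @ np.linalg.solve(cov, diff))

mean = np.zeros(2)
cov = np.array([[1.0, 0.9], [0.9, 1.0]])  # strongly correlated dimensions
along = np.array([1.0, 1.0])     # follows the correlation direction
against = np.array([1.0, -1.0])  # same Euclidean distance, opposite direction

sph_along = spherical_kernel(along, mean, 1.0)
sph_against = spherical_kernel(against, mean, 1.0)
ell_along = elliptical_kernel(along, mean, cov)
ell_against = elliptical_kernel(against, mean, cov)
print(sph_along, sph_against, ell_along, ell_against)
```

The spherical kernel cannot tell the two points apart, while the elliptical kernel responds strongly to the point that follows the correlation and barely at all to the one that violates it, which is what lets the kernels fit tightly around the normal data.
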

Autoencoder Anomaly Detection

  • Robust Anomaly Detection in Images using Adversarial Autoencoders
  • Laura Beggel (Bosch Center for Artificial Intelligence, Renningen; Ludwig-Maximilians-University Munich), Michael Pfeiffer (Bosch Center for Artificial Intelligence, Renningen), Bernd Bischl (Ludwig-Maximilians-University Munich)
  • They use an autoencoder with a discriminator network instead of using KL divergence
  • neat idea : a point is anomalous if either of the following are true
    • point is in a dense region and has high reconstruction error compared to the data from training
    • point lies in a lower-density region with respect to the training distribution
    • this gets you both the density-based and the detailed classifier-based approaches, couldn’t any method use this?

Counterfactual Justification

  • Unjustified Classification Regions and Counterfactual Explanations In Machine Learning
  • Thibault Laugel (Sorbonne Université), Marie-Jeanne Lesot (Sorbonne Université), Christophe Marsala (Sorbonne Université), Xavier Renard (AXA, Paris), Marcin Detyniecki (Sorbonne Université; AXA, Paris; Polish Academy of Science)
  • they are trying to infer counterfactual explanations but being careful not to create ones which are not justified by the data
  • Idea : can you use decision trees to do counterfactual search for other ways to get the result you found?
    • some data point goes down the tree to a specific leaf, but then you can infer whether that is right, maybe it should have gone slightly elsewhere
    • does this imply the tree should be different or that the leaf’s label is wrong?


Ranking with Outliers

  • Fast and Parallelizable Ranking with Outliers from Pairwise Comparisons
  • Sungjin Im (University of California), Mahshid Montazer Qaem (University of California)
  • problem: given multiple partial orderings of elements/datapoints, the goal is to create a single, full, consistent one
  • cycles are a problem; standard methods are well understood from an algorithmic sense, but it is NP-hard
  • usually comes down to counting and minimizing the number of backward arcs from a DAG
    • is it like finding the maximal DAG in a graph?
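
The quantity being minimized can be sketched in a few lines. This is only the scoring step, assuming comparisons are given as hypothetical (winner, loser) pairs; it is not the paper’s ranking algorithm.

```python
def count_backward_arcs(ordering, comparisons):
    """Count (winner, loser) comparisons that the ordering contradicts."""
    position = {item: rank for rank, item in enumerate(ordering)}
    return sum(
        1 for winner, loser in comparisons
        if position[winner] > position[loser]
    )

# Hypothetical comparisons; a > b > c > a forms a cycle,
# so no ordering can satisfy all of them.
comparisons = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "d")]
forward = count_backward_arcs(["a", "b", "c", "d"], comparisons)
reverse = count_backward_arcs(["d", "c", "b", "a"], comparisons)
print(forward, reverse)
```

Because of the cycle, even the best ordering here has one backward arc.
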

Hierarchical Dense anomalies

  • CatchCore: Catching Hierarchical Dense Subtensor
  • Wenjie Feng (CAS Key Laboratory of Network Data Science & Technology, Institute of Computing Technology, University of Chinese Academy of Sciences),

Causal Effects Talk

  • Adjustment Criteria for Recovering Causal Effects from Missing Data
  • Mojdeh Saadati (Iowa State University), Jin Tian (Iowa State University)
  • Pearl backdoor condition, or ignorability
    • allows us, in certain cases, to infer a causal effect from observational data
  • but what if there is missing data? or you are concerned about selection bias affecting the outcomes
  • they produce two criteria for evaluating, using only the causal graph, whether the backdoor criterion can be used

Uplift Regression

  • Shrinkage Estimators for Uplift Regression
  • Krzysztof Rudaś (Warsaw University of Technology; Institute of Computer Science, Polish Academy of Sciences), Szymon Jaroszewicz (Institute of Computer Science, Polish Academy of Sciences)
  • example: marketing sends out a discount email
  • you get new data about purchases after the discount, but do any changes arise because of the discounts (uplift) or in spite of them?
  • Impact could even be reversed, they buy less because of the discount coupon
  • they improve upon the standard double regression approach for this
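
The standard double regression baseline mentioned above can be sketched as two separate least-squares fits. This is a minimal sketch of that baseline only, on synthetic data I made up for illustration; it does not include the shrinkage estimators from the paper.

```python
import numpy as np

def fit_linear(X, y):
    """Ordinary least squares with an intercept column."""
    X1 = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef

def predict_uplift(X, coef_treated, coef_control):
    """Uplift = predicted outcome if treated minus predicted outcome if not."""
    X1 = np.column_stack([np.ones(len(X)), X])
    return X1 @ coef_treated - X1 @ coef_control

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 1))
treated = rng.integers(0, 2, size=400).astype(bool)
# Synthetic purchases: the discount adds 0.5 * x on top of the baseline,
# so customers with negative x buy less when they get the coupon.
noise = rng.normal(scale=0.1, size=400)
y = 1.0 + 2.0 * X[:, 0] + treated * 0.5 * X[:, 0] + noise

coef_t = fit_linear(X[treated], y[treated])
coef_c = fit_linear(X[~treated], y[~treated])
uplift = predict_uplift(np.array([[1.0], [-1.0]]), coef_t, coef_c)
print(uplift)
```

The second customer gets a negative uplift, which is the reversed-impact case from the notes.
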

Wednesday Talks

Fast Gradient Boosting

  • Fast Gradient Boosting Decision Trees with Bit-Level Data Structures
  • Laurens Devos (KU Leuven), Wannes Meert (KU Leuven), Jesse Davis (KU Leuven)
  • XGBoost is the most popular now
  • also LightGBM, look that up
  • idea - using full ints and floats is too detailed, so they use bit-level data structures
  • existing gradient updates on decision trees: Fast Gradient Boosting (FGB)
    • move some datapoints from one leaf to a neighbouring leaf node (logically, it might not be a simple sibling)
  • BitBoost algorithm (theirs)
    • bit representation for each leaf? with one-hot encoding
    • then use the fast and/or/counts ability of bits during FGB
  • read this one in more detail, looks like very useful approach
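
The appeal of bit-level structures is that set operations over many examples become a handful of AND and popcount operations. This is a minimal sketch of that idea using Python integers as bitsets; the data and names are hypothetical, and it is not the actual data structure from the paper.

```python
def to_bitset(flags):
    """Pack a list of booleans into one int, where bit i is example i."""
    bits = 0
    for i, flag in enumerate(flags):
        if flag:
            bits |= 1 << i
    return bits

def popcount(bits):
    """Number of set bits, i.e. number of examples in the set."""
    return bin(bits).count("1")

# Hypothetical data: which examples sit in a tree node, and which of all
# examples have a one-hot encoded feature value (say colour == red).
in_node = to_bitset([True, True, False, True, True, False])
is_red = to_bitset([True, False, False, True, False, True])

left = in_node & is_red    # examples in the node that satisfy the split
right = in_node & ~is_red  # the rest of the node
n_left, n_right = popcount(left), popcount(right)
print(n_left, n_right)
```
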

Association Rules

  • Sets of Robust Rules, and How to Find Them
  • Jonas Fischer (Max Planck Institute for Informatics; Saarland University), Jilles Vreeken (CISPA Helmholtz Center for Information Security)
  • still useful in biology
  • people kept working on the Apriori algorithm and found you get a lot of that for free by mining conjunctions
  • they are still liked because you get clean, interpretable models and rules
  • problem is you get millions of rules
  • Grab algorithm
    • heuristics to reduce the rule space
    • so it seems like a smarter, more general version of Apriori algorithm

Black Box Explanation

  • Black Box Explanation by Learning Image Exemplars in the Latent Feature Space
  • Riccardo Guidotti (ISTI-CNR, Pisa), Anna Monreale (University of Pisa), Stan Matwin (Dalhousie University; Polish Academy of Sciences), Dino Pedreschi (University of Pisa)
  • Problems if you don’t explain
    • The Husky-Wolf classifier problem - it was very good but used snow in the background for all wolf images
  • current explanation approaches in image classifiers
    • saliency map, showing which pixels lead to which labels
    • DeepExplain and other gradient based approaches, visualize the gradient, but it is very specific to that trained network
    • prototype based methods - just show similar types of images that show you what the model is ‘thinking about’
  • ABELE - their method

TD Actor Critic

  • TD-Regularized Actor-Critic Methods
  • Simone Parisi (presented), Voot Tangkaratt, Jan Peters, Emtiyaz Khan
  • They are trying to deal with the instability of actor-critic methods when there is little data
  • The problem: using the critic in AC reduces the variance of simple gradient descent
    • but the critic introduces bias and so it often overshoots good parts of the space
  • their approach (TDRPG) - instead of trying to fix the actor or the critic estimates, they focus on the way they interact and stabilize that
    • they add a squared regularization penalty to the optimization step
    • they can add their regularizer to the training for any actor-critic approach
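
One way to write such a penalized actor objective, assuming the penalty is the critic’s squared TD error (as the TD in the title suggests). The symbols J, η, δ and the estimated critic Q̂ are my notation, not taken from the talk.

```latex
\max_{\theta} \; J(\theta) - \eta \, \mathbb{E}\big[\delta(s, a, s')^2\big],
\qquad
\delta(s, a, s') = r + \gamma \hat{Q}(s', a') - \hat{Q}(s, a)
```

When the critic is accurate the TD error is small and the penalty vanishes; when it is inaccurate the penalty slows the actor down, which is the interaction being stabilized.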

Deep Ordinal Reinforcement Learning

  • Alexander Zap (TU Darmstadt), Tobias Joppen (TU Darmstadt), Johannes Fürnkranz (TU Darmstadt)
  • problem: numerical rewards are arbitrary, changing the values can lead to completely different optimal policies
  • solution: use ordinal rewards instead,
    • map numerical rewards to a distinct ordered set
    • scale doesn’t matter anymore, just order
  • issues:
    • how to accumulate rewards? you don’t add them, you maintain a histogram/prob distribution of getting each reward in the state
    • value function uses this prob distribution to choose an action and return the associated ordinal value
    • how do you maximize it?
    • sometimes you need to know the relative difference between values and this isn’t appropriate
    • very interesting idea
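
One possible answer to the "how do you maximize it?" question is to compare actions by the probability that one yields a better ordinal reward than the other. This is a minimal sketch under that assumption, with made-up histograms; it is not necessarily the exact criterion used in the paper.

```python
import numpy as np

def normalize(counts):
    """Turn a histogram of observed ordinal rewards into a distribution."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum()

def prob_better(dist_a, dist_b):
    """P(reward of a > reward of b) + 0.5 * P(tie), using only the ordering."""
    dist_a, dist_b = np.asarray(dist_a), np.asarray(dist_b)
    strictly_below = np.cumsum(dist_b) - dist_b  # P(b < rank) for each rank
    return float(np.sum(dist_a * (strictly_below + 0.5 * dist_b)))

# Hypothetical histograms over three ordered rewards: bad < ok < good.
action_a = normalize([1, 2, 7])
action_b = normalize([5, 4, 1])
score_ab = prob_better(action_a, action_b)
score_ba = prob_better(action_b, action_a)
print(score_ab, score_ba)
```

Only the order of the reward categories enters the comparison, so rescaling the underlying numerical rewards cannot change which action is preferred.
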

Attentive Multi-Task Deep Reinforcement Learning

label: ECML2019,RL, skill learning

  • Timo Bräm (ETH Zurich), Gino Brunner (ETH Zurich)
  • Problem - Multi-task learning is hard
  • Existing approaches
    • transfer learning from agents that are experts at prior tasks - but might forget old skills
    • robust multitask RL (distillation, Teh, 2017) similar to what we are doing
  • Their approach - using attention…
    • train on all environments at once, but maintain explicit submodules for each task during training
    • can also enforce a smaller number of submodules than there are tasks
    • this forces generalization just like in autoencoders
  • There are multiple subnetworks to learn with but they are not explicitly designated to be for a specific task
  • Attention network is used to decide which subnetwork is the most relevant right now
  • question - how do you know it isn’t just more parameters that are helping?
  • experiments : grid world
  • criticism -
    • just 10 random seeds
    • domain very simple
    • are they changing the rewards along the way?
    • still needs lots of work
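
The attention mechanism described above can be sketched as a softmax-weighted mix of sub-network outputs. This is a minimal sketch with random single-layer sub-networks; the shapes, names and the tanh layers are my assumptions, not the architecture from the paper.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def attentive_output(state, subnet_weights, attention_weights):
    """Mix the outputs of shared sub-networks with state-dependent attention."""
    subnet_outputs = np.stack([np.tanh(W @ state) for W in subnet_weights])
    attention = softmax(attention_weights @ state)  # one weight per sub-network
    return attention @ subnet_outputs, attention

rng = np.random.default_rng(0)
state = rng.normal(size=4)
subnets = [rng.normal(size=(3, 4)) for _ in range(2)]  # 2 sub-networks, 3 outputs
attn_w = rng.normal(size=(2, 4))
output, attention = attentive_output(state, subnets, attn_w)
print(output, attention)
```

No sub-network is tied to a task here; which one dominates depends only on the state, and using fewer sub-networks than tasks forces them to be shared.
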

Sample-Efficient Model-Free Reinforcement Learning with Off-Policy Critics

label: ECML2019,RL

  • Denis Steckelmacher (Vrije Universiteit Brussel), Hélène Plisnier (Vrije Universiteit Brussel), Diederik M. Roijers (VU Amsterdam), Ann Nowé (Vrije Universiteit Brussel)
  • Problem: how to learn a policy with a small number of samples without having a model
  • They introduce some ways to speed up learning with policy gradients by reusing prior trajectories, like in clipped DQN

Policy Prediction Network: Model-Free Behavior Policy with Model-Based Learning in Continuous Action Space

label: ECML2019,RL

  • Zac Wellmer (Hong Kong University of Science and Technology), James T. Kwok (Hong Kong University of Science and Technology)
  • policy gradients, implicit RL
  • problem: want to do model-free, model-based combination for policy gradients
  • PPN : many changes and tricks combined together
  • use similar approach to PPO
    • clipping, latent space transition models

Safe Policy Improvement with Soft Baseline Bootstrapping

label: ECML2019,RL, safeAI

  • Kimia Nadjahi (Télécom Paris), Romain Laroche (Microsoft Research Montréal), Rémi Tachet des Combes (Microsoft Research Montréal)
  • to build a model of the uncertainty you can try to do MLE on the transition model learning
  • problem : how to train your RL but focussed on safe outcomes rather than highest performance in a simulation
    • This is defined as performing at least as well as the baseline with high probability.
  • existing approaches
    • SPIBB [Laroche2019, ICML] - if you are not sure enough then you don’t take the action, just use the baseline instead
      • problem is they have a very binary approach to something being safe or not
  • Their Approach
    • SPIBB but it’s soft, safe or not is not a binary choice
    • They allow an error-budget to control how much error they can safely allow for that domain.
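
The binary behaviour of the original SPIBB that the soft version relaxes can be sketched for a single state. This is a minimal sketch of the hard rule only, assuming a simple visit-count threshold (n_min is a hypothetical name); the soft variant with an error budget is not implemented here.

```python
import numpy as np

def spibb_policy(q_values, baseline_probs, counts, n_min):
    """Hard SPIBB-style improvement for one state.

    Actions seen fewer than n_min times keep the baseline probability;
    the remaining mass goes to the best well-estimated action.
    """
    q_values = np.asarray(q_values, dtype=float)
    baseline_probs = np.asarray(baseline_probs, dtype=float)
    uncertain = np.asarray(counts) < n_min
    policy = np.where(uncertain, baseline_probs, 0.0)
    if not uncertain.all():
        masked_q = np.where(uncertain, -np.inf, q_values)
        policy[np.argmax(masked_q)] += 1.0 - policy.sum()
    return policy

policy = spibb_policy(
    q_values=[1.0, 5.0, 3.0],
    baseline_probs=[0.5, 0.3, 0.2],
    counts=[50, 2, 40],  # action 1 looks best but was tried only twice
    n_min=10,
)
print(policy)
```

Action 1 keeps exactly its baseline probability because its estimate is not trusted, and the rest of the mass moves to action 2, the best action with enough data.
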


Stochastic Activation Actor Critic Methods

  • Wenling Shang (University of Amsterdam-Bosch-Deltalab), Herke van Hoof (University of Amsterdam-Bosch-Deltalab), Max Welling (University of Amsterdam-Bosch-Deltalab)
  • Perturbation of internal network weights to encourage exploration is well established.
  • but it doesn’t seem to work so well in Actor-Critic Deep RL
  • they add noise in clever ways to LSTMs to make it work
  • Joni Pajarinen, Hong Linh Thai, Riad Akrour, Jan Peters, Gerhard Neumann
  • Trust region policy search - using greedy updates the entropy drops too fast
  • Their approach
    • hard entropy constraint
    • something else

Stochastic One-Sided Full-Information Bandit

  • Haoyu Zhao (Tsinghua University), Wei Chen (Microsoft Research, Beijing)
  • problem: repeated second price auctions
  • prior work:
    • SODA assumes bidders are truthful, and that the distribution of bidders is iid
    • max bid does not need to be known
  • their approach:
    • iid bidders not required
    • need to know the max value of the bids
    • maintain a set of all good arms : determined by empirical mean

Practical Open-Loop Optimistic Planning

  • Edouard Leurent (SequeL team, INRIA Lille - Nord Europe; Renault Group), Odalric-Ambrym Maillard (SequeL team, INRIA Lille - Nord Europe)
  • highway-env environment on GitHub, for highway driving with human behaviours
  • assume a generative model
  • optimistic planning - they use this along with tree search
  • they recall that people have found failure cases of UCT
    • very deep branches where all the choices are bad, the optimistic bias leads to problems
  • solution (lots of Munos work in 2008, 2010) works for restricted classes of MDPs
  • OLOP showed that you can do UCB-style planning for stochastic and deterministic MDPs in the same way
    • but there was no empirical validation of it, so this work does that
    • they show it is actually quite different in practice
  • They improve this by using the KL divergence with OLOP

An Engineered Empirical Bernstein Bound

  • Mark Burgess (Australian National University), Archie C. Chapman (University of Sydney), Paul Scott (Australian National University)
  • Hoeffding bound - it’s a concentration inequality
  • Empirical Bernstein bound is similar - uses sample variances
  • Bennett’s inequality is much stronger and useful, but maybe hard to use? this would give you the perfect bound possible if you knew the full variance?
  • they define their own EBB bound, pretty complex formulation
  • they show it provides a tighter bound than Hoeffding on Bernoulli bandits
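
To see why using the sample variance helps, the two classical bounds can be compared numerically. This is a minimal sketch using the one-sided Hoeffding bound and the Maurer and Pontil form of the empirical Bernstein bound for samples in [0, 1]; the paper’s own engineered bound is more complex and is not reproduced here.

```python
import math

def hoeffding_radius(n, delta):
    """One-sided Hoeffding radius for the mean of n samples in [0, 1]."""
    return math.sqrt(math.log(1.0 / delta) / (2.0 * n))

def empirical_bernstein_radius(sample_variance, n, delta):
    """One-sided empirical Bernstein radius (Maurer and Pontil form)."""
    log_term = math.log(2.0 / delta)
    variance_part = math.sqrt(2.0 * sample_variance * log_term / n)
    correction = 7.0 * log_term / (3.0 * (n - 1))
    return variance_part + correction

n, delta = 1000, 0.05
hoeffding = hoeffding_radius(n, delta)
bernstein_low_var = empirical_bernstein_radius(0.01, n, delta)
bernstein_high_var = empirical_bernstein_radius(0.25, n, delta)
print(hoeffding, bernstein_low_var, bernstein_high_var)
```

With a small sample variance the empirical Bernstein radius is clearly tighter, but at the maximum variance of 0.25 it is looser than Hoeffding, which is the gap an engineered bound tries to close.
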

MACLEAN Earth Observation Workshop

Earth Orientation Parameters Time Series

  • G. Okhotnikov and N. Golyandina: EOP Time Series Prediction Using Singular Spectrum Analysis
  • The EOP data include 5 numbers that arrive as a time series
    • pole coordinates
    • length of day
    • changes in pole angles over time?
  • There is a service that publishes the daily values for the time series
  • important for navigation and satellites
  • SSA - https://en.wikipedia.org/wiki/Singular_spectrum_analysis is a common method used to separate out trend, noise etc from a time series
    • time series of length L
    • embedding, create a trajectory matrix
    • take SVD
    • group eigentriples together
    • diagonal averaging
    • then reverse the embedding
  • uses: allows you to extract the original sine wave if there was one
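
The SSA steps listed above can be sketched in NumPy. This is a minimal sketch of basic SSA, assuming a hand-picked window length and a synthetic noisy sine wave; the grouping step is simply "keep the first two eigentriples", which is my choice for a single sine and not something from the talk.

```python
import numpy as np

def ssa_components(series, window):
    """Return one reconstructed series per eigentriple (basic SSA)."""
    series = np.asarray(series, dtype=float)
    k = len(series) - window + 1
    # Embedding: columns of the trajectory matrix are lagged windows.
    trajectory = np.column_stack([series[i:i + window] for i in range(k)])
    u, s, vt = np.linalg.svd(trajectory, full_matrices=False)
    components = []
    for i in range(len(s)):
        elementary = s[i] * np.outer(u[:, i], vt[i])
        # Diagonal averaging: average each anti-diagonal back to one value.
        flipped = elementary[::-1]
        averaged = [flipped.diagonal(d).mean() for d in range(-window + 1, k)]
        components.append(averaged)
    return np.array(components)

rng = np.random.default_rng(0)
t = np.arange(200)
sine = np.sin(2 * np.pi * t / 25)
noisy = sine + 0.1 * rng.normal(size=t.size)

components = ssa_components(noisy, window=50)
recovered = components[:2].sum(axis=0)  # grouping: a sine needs two eigentriples
print(np.mean((noisy - sine) ** 2), np.mean((recovered - sine) ** 2))
```

Summing every component gives back the input series exactly, so the choice of which eigentriples to group is what separates signal from noise.
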

MvMF Loss for Predicting the Location of an Image on Earth’s Surface

  • M. Izbicki, E. Papalexakis and V. Tsotras: The MvMF Loss for Predicting Locations on the Earth’s Surface

  • problem: how to find the location of an image in the world just by looking at it
    • easy and hard problems : Eiffel Tower vs inside of an Apple store
  • data - database of Flickr images with GPS in them
  • their approach - Mixture of von Mises-Fisher Distribution
    • vMF is like Gaussian distribution for spheres - (m,s)
    • MvMF is a mixture of these just like a mixture of gaussians but it knows about sphere structure
    • works better for anything about locating predictions on the earth but it assumes the earth is a sphere
  • so using MvMF as the loss function in one of these algorithms, such as Google PlaNet [Weyland 2019], makes the predictions much smoother on the earth’s surface
  • other uses
    • estimate range of animals and birds from people’s social media images
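
The mixture density itself is short to write down. This is a minimal sketch of the negative log-likelihood of a location under a mixture of vMF components on the unit sphere, with hypothetical component centres and concentrations; it is not the exact loss or training setup from the paper.

```python
import numpy as np

def to_unit_vector(lat_deg, lon_deg):
    """Convert latitude/longitude in degrees to a point on the unit sphere."""
    lat, lon = np.radians(lat_deg), np.radians(lon_deg)
    return np.array([np.cos(lat) * np.cos(lon),
                     np.cos(lat) * np.sin(lon),
                     np.sin(lat)])

def vmf_log_density(x, mu, kappa):
    """Log density of a von Mises-Fisher distribution on the sphere in 3D."""
    # log(kappa / (4 * pi * sinh(kappa))) written in a numerically stable way.
    log_norm = np.log(kappa) - np.log(2 * np.pi) - np.log1p(-np.exp(-2 * kappa))
    return log_norm + kappa * (x @ mu - 1.0)

def mvmf_negative_log_likelihood(x, mus, kappas, weights):
    """Negative log-likelihood of x under a mixture of vMF components."""
    log_terms = [np.log(w) + vmf_log_density(x, mu, k)
                 for mu, k, w in zip(mus, kappas, weights)]
    return -np.logaddexp.reduce(log_terms)

# Hypothetical mixture with components centred on Paris and New York.
mus = [to_unit_vector(48.86, 2.35), to_unit_vector(40.71, -74.01)]
kappas = [100.0, 100.0]
weights = [0.5, 0.5]

loss_near_paris = mvmf_negative_log_likelihood(
    to_unit_vector(48.80, 2.30), mus, kappas, weights)
loss_sydney = mvmf_negative_log_likelihood(
    to_unit_vector(-33.87, 151.21), mus, kappas, weights)
print(loss_near_paris, loss_sydney)
```

Here mu plays the role of the mean direction and kappa the concentration, the (m,s) pair in the notes; a location far from every component gets a much larger loss.
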

Deep learning for power line inspection

  • Invited Speaker: Prof. Robert Jenssen
  • 25 person machine learning group at UIT in Northern Norway
  • they are at 70 degrees north, with a very weak magnetic field there, so they get the strongest northern lights
  • Northern Lights Deep Learning Workshop 2020 (January 20-21), about 100 people
  • problem: power line inspection using deep learning
    • how to use drones to inspect poles in remote regions to assess if maintenance work is needed
  • method: few shot learning? YOLO.
  • other applications
    • power lines
    • pipelines
    • roads, railways
  • simulated data
    • they generate synthetic images of landscapes with power lines, high resolution
    • train the detector and predictor on this
    • then use the trained model for real drone flight

Few Shot Learning

  • it is common for this type of domain to find new objects that were never encountered before

  • ZhenCVPR2018 - Ring Loss for convex feature normalization for face recognition
    • Few Shot Learning usually uses prototype points based on few data points
    • They think this ring loss approach produces a more natural representation
    • but it has some difficult input parameters
  • Their approach
    • their paper on improving few shot learning
    • they use a class-conditional dissimilarity measure
      • they want to measure distance between them based on the angle of the norms?
    • result - goal is that all points with the same class have the same norm
    • interesting - they use different scores for distances between points in the same class and ones that are in different classes
    • so it’s kind of like a normalization