ECML 2019 Notes
In September 2019 I had the good fortune to attend ECML 2019 in beautiful Wurzburg, Germany.
I was there to present our paper
Bhalla, S. et al., 2019. Compact Representation of a Multidimensional Combustion Manifold Using Deep Neural Networks. In European Conference on Machine Learning. Wurzburg, Germany, p. 8. read more pdf
It was a fantastic conference, a nice size and a real diversity of topics, not just a whole bunch of permutations on Deep Learning. What follows are some notes I took at talks which I attended, you can take a look at the full schedule here.
Monday Talks
SridharMahadevan Tutorial
 he’s now at AdobeResearch
 can also see his slides for IJCAI which was longer than this one
Imagination and Causation
 ImaginationScience
 he wants to do reasoning about future paths and outcomes without having data to back it up
 how does he distinguish is from counterfactual reasoning it is just one subset part of it
 also includes: analogy, abstraction, spatial/temporal projection
 what is Combinatorial Creativity? mixing and matching existing compoenets that never occur, like a sphinx
 example: long term predictions for climate change
 not just about predicing the next step but about reasoning baout what society will do for the next few decades
 example: search engine
 what is the next step?
 “Improbable” company essentially is trying to simulate reality
 he thinks GANs are really only novel thing to come out of Deep learning recently
 he doesn’t think ML will be fundamentally different in 20 years, it’s solved essentially
 he’s talking about machines creating art
 can machines ever create art?
Pearl’s Ladder of Causality
 This is that new three layered approach by JudeaPearl
 “The Book of Why” where he explains it in simple approach
Association
 observational machine learning
Intervention
 Experimental science
 Pearl: probabilities are an epiphenomanon resulting from the causal nature of the world
Counterfactuals
 Can be thought of as necessary for Imagination
Combining Observation and Causality
 GANs  solve problem is visualize imagination
 he says GANs are related to ActorCritic RL problem
 ActorCritic models converge but it wasn’t known until 20 years later
 Wasserstein GANs have more theoretical bassis
 Seeing GANs as an optimization problem
 min_G max_D V(D,G)
Art and Imagination
 CANs are a new model for simulating creativity
 Can create fairly nice modern art paintings, but is it art or it sampling a space of images?
 Is there a difference?
 Does it mean anything? does it need to?
Relevant Optimization methods from Economics which would be useful for GANs
 Optimization vs Equilibration
 minimize a function in feasible set
 usually need to assume f(x) is differentiable
 Equilibration from physics (Stampacchia in the 1960s)
 they define the set of partial gradients for a function as a ‘vector field’
 assume this field is given rather than f(x) being given
 means we can solve optimization problems but also other problems
 eg. there are some vector fields that don’t have a function with a gradient that generates the vector field
 economists use this for domains where the vector field can be written down easily but no simple function leads to it via derivatives
 Variational inequality
 eg. traffic management
 GANs are this kind of problem whihc is why they work
 he thinks game theory, GANs, traffic, RL etc are better solved using this approach
 GANs don’t converge if you just do gradient descent
 but if you use the Wasserstein loss function then get a nash equilibrium?
 the surface for a GAN is a saddle so gradient descent is bad, very unlikely to lead to a optimal appoint, you will cycle around
 a better idea sometimes is to move orthogonal to the gradient because the think you are trying to do is find an equilibrium point
 Extragradient Method
 Frobenius projection from the gradient
 Project back to the main function if the negative vector field leads you out of the feasible set
 Mirror Descent instead of Gradient Descent
 by Nemirovsky and Yudin
 gradients are in the dual of the original space, and this happens to line up in euclidean but wouldnt work others, so you can’t add them relaly
 essentially you convert the state vector to the dual space first, then compute the gradient, update it and project back to the original space via a conjugate function
 in Euclidean space, this whole mechanism collapses to gradient descent because the conversion is identity
 this explains why multiplicative updates of gradients work better in many spaces
 you can explain many ML methods using this idea
 boosting
 natural gradients  gradient descent should be done in reimanian space using the Fischer information matrix
 Kakade’s paper on Natural ActorCritic for playing Tetris
 Mirrorprox
 take two gradient steps in the dual space, works even better
 He calls the dual space the imagination space because it handles (something)

Reinforcement Learning as a type of Causal Learning
 where you can try out particular actions
 imagination values
 what is the value of deviating from the action the policy advices
 counterfactural kind of value function
 not needed for an MDP but for POMDPs yuo need to imagine the possible worlds
 even for multiarmed bandits it can help, because it leads you outside the existing policy actions
 is it just exploration? it is related to offpolicy exploration, how?
Tuesday Talks
Active Learning Anomaly Detection
 Unsupervised and Active Learning using Maximinbased Anomaly Detection
 Zahra Ghafoori (University of Melbourne), James C. Bezdek (University of Melbourne), Christopher Leckie (University of Melbourne), Shanika Karunasekera (University of Melbourne)
 took some pictures of the AD background description which was very good
 semisupervised which an oracle that can be queried about whether points are anomalies or not
 they do active learning and compare to standard AD methods like OCSVM and iForest
 better quality than iforest and faster because it only builds one model
OneClass Anomaly Detection
 The Elliptical Basis Function Data Descriptor (EBFDD) Network  A OneClass Classification Approach to Anomaly Detection
 MehranBazargani (The Insight Centre for Data Analytics, School of Computer Science, University College Dublin), Brian Mac Namee (The Insight Centre for Data Analytics, School of Computer Science, University College Dublin)
 a cost functiont hat turns RBF networks in to a one class classifier
 assumptions
 they are interested in streaming data
 training data does not include any anomalies
 RBF Networks
 overview
 three layers
 input data
 hidden layer of gaussian kernals, initialize the (m,v) with kmeans
 output layer lineraly combies them via a sigmoid
 backprop learns the paramters of the gaussians
 not applicable to one class
 their change
 main idea: try to tightly fit the gaussians around the subspace of the normal points
 introduce elliptical kernals instead of spherical
 the ellipses can be stretched and shaped to adjust the amount of correlation (or not) between the dimensions
 in stead of gaussian (m,v) they have cov matrix for each kernel
 expeirments
 they use the emmot and dietterich paper as their guide
Autoencoder Anomaly Detection
 they use the emmot and dietterich paper as their guide
 Robust Anomaly Detection in Images using Adversarial Autoencoders
 LauraBeggel (Bosch Center for Artificial Intelligence, Renningen; LudwigMaximiliansUniversity Munich), Michael Pfeiffer (Bosch Center for Artificial Intelligence, Renningen), Bernd Bischl (LudwigMaximiliansUniversity Munich)
 They use an AutoEnc with an discriminator network instead of using KL divergence
 neat idea : a point is anomalous if either of the following are true
 point is in a dense region and has high reconstruction error compared to the data from training
 point is a lower density with respect to the training distirbution
 this gets you botht he desnity based and deteailed clasified based approaches, couldn’t any method use this?
Counterfactual Justification
 Unjustified Classification Regions and Counterfactual Explanations In Machine Learning
 Thibault Laugel (Sorbonne Université), MarieJeanne Lesot (Sorbonne Université), Christophe Marsala (Sorbonne Université), Xavier Renard (AXA, Paris), Marcin Detyniecki (Sorbonne Université; AXA, Paris; Polish Academy of Science)
 they are trying to infer counterfactual explanations but being careful not to create ones wchih are not jsutified by the data
 Idea : can you use decision trees to do coutnerfactual search for other ways to get the result you found?
 some data point goes down the tree to a specific leaf, but then you can infer wether taht is right, maybe it should have gone sligtly elehwere
 does this imply the tree should be different or that the leaf’s label is wrong?
Ranking
 Fast and Parallelizable Ranking with Outliers from Pairwise Comparisons
 Sungjin Im (University of California), Mahshid Montazer Qaem (University of California)
 problem: given multiple partial ordering o felemnts/daatpoints, goal is to create a full, single consistent one
 cycles are a problem, standard methods are well understand from an algorithmic sense, it is NP Hard
 usually comes down to counting and minimizing the number of baward arcs from a DAg
 is it like finding the maximial DAG in a graph?
Hierarchical Dense anomalies
 is it like finding the maximial DAG in a graph?
 CatchCore: Catching Hierarchical Dense Subtensor
 Wenjie Feng (CAS Key Laboratory of Network Data Science & Technology, Institute of Computing Technology, University of Chinese Academy of Sciences),
Causal Effects Talk
 Adjustment Criteria for Recovering Causal Effects from Missing Data
 Mojdeh Saadati (Iowa State University), JinTian (Iowa State University)
 Pearl backdoor condition, or ignorablity
 allows us to in certain cases, infer a causal effect from observational data
 but what if there is msising data? or you are concerned about selection bias affecting the outcomes
 they produce two criteria for evaluating only using the causal graph whether the backdoor criterion can be used
Uplift Regression
 Shrinkage Estimators for Uplift Regression
 Krzysztof Rudaś (Warsaw University of Technology; Institute of Computer Science, Polish Academy of Sciences), Szymon Jaroszewicz (Institute of Computer Science, Polish Academy of Sciences)
 exmaple: marketing sends out a discount email
 you get new data about purchases after the discount, but do any change arise because of the discounts (uplift) or in spite of them?
 Impact could even be reversed, they buy less because of the dicsount coupon
 they improve upon the standard double regression approach for this
Wednesday Talks
Fast Gradient Boosting
 Fast Gradient Boosting Decision Trees with BitLevel Data Structures
 Laurens Devos (KU Leuven), Wannes Meert (KU Leuven), Jesse Davis (KU Leuven)
 XGBoost is the most popular now
 also LightGBM, look that up
 idea  using full ints and floats is too detailed they use bitlevel datascturcures
 existing gradient updates on decision trees Fast Gradient Boosting
 move some datapoints from one leaf to a neighbouring leaf ndoe (logically, it might not be asimple sibling)
 BitRoost algorithm (theirs)
 bit representation for each leaf? with one hot dncoding
 then use the fast and/or/counts ability of bits during FGB
 read this one in more detail, looks like very useful approach
Association Rules
 Sets of Robust Rules, and How to Find Them
 Jonas Fischer (Max Planck Institute for Informatics; Saarland University), Jilles Vreeken (CISPA Helmholtz Center for Information Security)
 still useful in biology
 people kept working on Apriori algorithm and found you get a lot of that for free by mining conjuctions
 they are still liked becuase you get clean interpretable models andrules
 problem is you get millions of rules
 Grab algorithm
 heuristics to reduce the rule space
 so it seems like a smarter, more general version of Apriori algorithm
Black Box Explanation
 Black Box Explanation by Learning Image Exemplars in the Latent Feature Space
 Riccardo Guidotti (ISTICNR, Pisa), Anna Monreale (University of Pisa), Stan Matwin (Dalhousie University; Polish Academy of Sciences), Dino Pedreschi (University of Pisa)
 Problems if you don’t explain
 The HuskyWolf classifier problem  it was very good but used snow in the background for all wolf images
 current explanation approaches in image clasifiers
 saliency map, showing which pixels leading to which labels
 DeepExplain and other gradient based approaches, visualize the gradient, but it is very specific to taht trained network
 prototype based methods  just show similar types of images that show you what the model is ‘thinking about’
 ABELE their method
TD Actor Critic
 TDRegularized ActorCritic Methods
 SimoneParisi(presented), Voot Tangkaratt, Jan Peters, EmtiyazKhan
 They are trying to deal with the instability of actorcritic methods when there is little data
 The problem: using the critic in AC reduces the variance of simple gradient descent
 but the critic introduces bias and so it often overshoots good parts of the space
 their approach (TDRPG)  instead of trying to fix the actor or the critic estimates, they focus on the way they interact and stabilize that
 they add a squared regularization penalty to the optimization step
 they can add their regulazer to the training for any actor critic approach
Deep Ordinal Reinforcement Learning
 AlexanderZap (TU Darmstadt), Tobias Joppen (TU Darmstadt), Johannes Fürnkranz (TU Darmstadt)
 problem: numerical rewards are arbitrary, changing the values can lead to completely different optimal policies
 solution: use orginal rewards instead,
 map numerical rewards to a dinstinct ordered set
 scale doesn’t matter anymore, just order
 issues:
 how to accumulate rewards? you don’t add them, you maintain a histogram/prob distribution of getting each reward in the state
 value function uses this prob distribtuion to choose actiopn and return the associated ordinal value
 how do you maximize it?
 sometimes you need to know the relative difference between values and this isn’t appropriate
 very interesting idea
Attentive MultiTask Deep Reinforcement Learning
label: ECML2019,RL, skill learning
 Timo Bräm (ETH Zurich), Gino Brunner (ETH Zurich)
 Problem  Multitask learning is hard
 Existing approaches
 transfer learning from agents experts at prior tasks  but might forget old skills
 robust multitask RL (dilstation, teh, 2017) similar to what we are doing
 Their approach  using attention…
 train on all environments at once, but maintain explicit submodules for each task during training
 can also enforce a smaller number of submodules than ther are tasks
 this forces generalization just like in autoencoders
 There are multiple subnetworks to learn with but they are not explicitly designated to be for a specific task
 Attention network is used to decide which subnetwork is the most relevant right now
 question  how do you know it isn’t just more paramters that are helping?
 experiments : grid world
 criticism 
 just 10 random seeds
 domain very simple
 are they changing the rewards along the way?
 still needs lots of work
SampleEfficient ModelFree Reinforcement Learning with OffPolicy Critics
label: ECML2019,RL
 DenisSteckelmacher (Vrije Universiteit Brussel), Hélène Plisnier (Vrije Universiteit Brussel), Diederik M. Roijers (VU Amsterdam), Ann Nowé (Vrije Universiteit Brussel)
 Problem: how to learn a policy with a small number of samples without having a model
 They introduce some ways to speed up learning with policy gradients to reuse prior trajectories like in clipped DQN
Policy Prediction Network: ModelFree Behavior Policy with ModelBased Learning in Continuous Action Space
label: ECML2019,RL
 ZacWellmer (Hong Kong University of Science), James T. Kwok (Technology)
 policy gradients, implicit RL
 problem: want to do modelfree, modelbased combination for policy gradients
 PPN : many changes and tricks compined together
 use similar approach to PPO
 clipping, latent space transition models
Safe Policy Improvement with Soft Baseline Bootstrapping
label: ECML2019,RL, safeAI
 KimiaNadjahi (Télécom Paris), Romain Laroche (Microsoft Research Montréal), Rémi Tachet des Combes (Microsoft Research Montréal)
 to build a model of the uncertainty you can try to do MLE on the transition modle learning
 problem : how to train your RL but focussed on safe outcomes rather than highest performance in a simulation
 This is defined as performing at least as well as the baseline with high probability.
 existing approaches
 SOMEMETHOD 
 SPIBB [Laroche2019, ICML]  if you are not sure enough then you don’t take the action, just use the baseline instead
 problem is they have a very binary approach to something being safe or not
 Their Approach
 SPIBB but it’s soft, safe or not is not a binary choice
 They allow an errorbudget to control how much error they can safely allow for that domain.
Thursday
Stochastic Activation Actor Critic Methods
 Wenling Shang (University of AmsterdamBoschDeltalab), Herke van Hoof (University of AmsterdamBoschDeltalab), Max Welling (University of AmsterdamBoschDeltalab)
 Pertubation of intenral network weights to encourage exploration is well established.
 but it doesn’t seem to work so well in ActorCritic Deep RL
 they add noise in clever ways to LSTMs to make it work
Compatible Natural Gradient Policy Search
 Joni Pajarinen, Hong Linh Thai, Riad Akrour, Jan Peters, Gerhard Neumann
 Trust region policy search  using greedy updates the entropy drops too fast
 Their approach
 hard entropy constraint
 something elseq
Stochastic OneSided FullInformation Bandit
 Haoyu Zhao (Tsinghua University), Wei Chen (Microsoft Research, Beijing)
 problem: repeated second price auctions
 prior work:
 SODA assumes bidders are truthful, and that distritbuion of bidders if iid
 max bid does not need to be known
 their approach:
 iid bidders not required
 need to know the max value of the bids
 maintain a set of all good arms : determined by empircal mean
Practical OpenLoop Optimistic Planning
 Edouard Leurent (SequeL team, INRIA Lille  Nord Europe; Renault Group), OdalricAmbrym Maillard (SequeL team, INRIA Lille  Nord Europe)
highwayenv
environmenta on github for highway driving of human behaviours assume a generative model
 optimisitic planning  they use this along with tree search

they recall people have found out out failing cases of UCT
 very deep branches where all the choices are bad, the optimisitc bias leads to problems
 solution (lots of Munos work in 2008,2010) works for restricted classes of MDPs and works
 OLOP showed that you can do UCB style planning for stochastic and deterministic MDPs the same
 but htere was no empirical vbalidation of it, so this work does that
 they show it is actually quite difference in pracic
 They improve this by using the KL divergneece with OLOP
An Engineered Empirical Bernstein Bound
 MarkBurgess (Australian National University), Archie C. Chapman (University of Sydney), Paul Scott (Australian National University)
 Hoefding bound  it’s a concentration inequality
 Empircal Bernstein bound is similar  uses sample vairnces
 Bennett’s inequality is much stronger and useful, but maybe hard to use? this would give you the perfect bound possible if you knew the full variance?
 they define their own EBB bound, pretty complex formulation
 they show it provides a tighter bound thatn hoeffding on bernoulli bandits
MACLEAN Earth Observation Workshop
Earth Orientation Parameters Time Series
 G. Okhotnikov and N. Golyandina: EOP Time Series Prediction Using Singular Spectrum Analysis
 The EOP data include 5 numbers that arrive as a time series
 pole coordinates
 lenght of day
 changes in pole angles over time?
 There is a service that publishes the daily values for the time series
 important for navigation and satellites
 SSA  https://en.wikipedia.org/wiki/Singular_spectrum_analysis is a common method used to seperate out trend, noise etc from a time series
 time series of lenght L
 embedding, create a trajectory matrix
 take SVD
 Group eigen triples together
 diagononal averaging
 then reverse the embedding
 uses: allows you to extract the orignal sine wave if there was one
MvMF Loss for Prediction locations of an image on Earth’s surface
– M. Izbicki, E. Papalexakis and V. Tsotras: The MvMF Loss for Predicting Locations on the Earth’s Surface
 problem how to locate the location of an image in the world just by looking at it
 easy and hard problems : eiffel tower vs inside of an apple store
 data  data base of fflickr images with gps in them
 their approach  Mixture of von MisesFisher Distritbuion
 vMF is like Gaussian distribution for spheres  (m,s)
 MvMF is a mixture of these just like a mixture of gaussians but it knows about sphere structure
 works better for anything about locating predictions on the earth but it assumes the earth is a sphere
 so using MvMF as the loss function in one of these algorithm, such as Google PlaNet [Weyland 2019], makes the predcitions much smoother on the earth’s surface
 other uses
 estimate range of animals and birds from people’s social media images
Deep learning for power line inspection
 Invited Speaker: Prof. Robert Jenssen
 25 person machine learning group at UIT in Northern Norway
 they are 70 degree north, very weak magentic field there so they get the strongest northern lights
 Northern Lights Deep Learning Workshop 2020 (January 2021), about 100 people
 problem: power line inspection using deep learning
 how to use drones to inspect poles in remote regions to assess if maintenance work is needed
 method: few shot learning? YOLO.
 other applications
 power lines
 piplines
 roads, railways
 simulated data
 they generate synthetic of landscapes with power lines, high resoltuion
 tarin the detector and predictor on this
 then use the trained model for real drone flight
Few Shot Learning

it is common for this type of domain to find new objects that were never enouctered before
 ZhenCVPR2018  Ring Loss for convex feature normalziation for face revognition
 FewShotLearning usually uses protoype points based on few data poiints
 They think this ring loss approach produces a more natural representation
 but it has some difficulat input paramters
 Their approach
 their paper on improving few shot learning
 they use a classconditional dissimilarity measure

 they want to mesure distance between them base on the angle of the norms?
 result  goal is that all points with hte same class have the same norm
 interesting  they use different scores for distances between points in the same class and ones that are in different classes
 so it’s kind of like a normalization