ECML 2019 Notes
In September 2019 I had the good fortune to attend ECML 2019 in beautiful Wurzburg, Germany.
I was there to present our paper
Bhalla, S. et al., 2019. Compact Representation of a Multi-dimensional Combustion Manifold Using Deep Neural Networks. In European Conference on Machine Learning. Wurzburg, Germany, p. 8. read more pdf
It was a fantastic conference, a nice size and a real diversity of topics, not just a whole bunch of permutations on Deep Learning. What follows are some notes I took at talks which I attended, you can take a look at the full schedule here.
Monday Talks
SridharMahadevan Tutorial
- he’s now at AdobeResearch
- can also see his slides for IJCAI which was longer than this one
Imagination and Causation
- ImaginationScience
- he wants to do reasoning about future paths and outcomes without having data to back it up
- how does he distinguish is from counterfactual reasoning it is just one subset part of it
- also includes: analogy, abstraction, spatial/temporal projection
- what is Combinatorial Creativity? mixing and matching existing compoenets that never occur, like a sphinx
- example: long term predictions for climate change
- not just about predicing the next step but about reasoning baout what society will do for the next few decades
- example: search engine
- what is the next step?
- “Improbable” company essentially is trying to simulate reality
- he thinks GANs are really only novel thing to come out of Deep learning recently
- he doesn’t think ML will be fundamentally different in 20 years, it’s solved essentially
- he’s talking about machines creating art
- can machines ever create art?
Pearl’s Ladder of Causality
- This is that new three layered approach by JudeaPearl
- “The Book of Why” where he explains it in simple approach
Association
- observational machine learning
Intervention
- Experimental science
- Pearl: probabilities are an epiphenomanon resulting from the causal nature of the world
Counterfactuals
- Can be thought of as necessary for Imagination
Combining Observation and Causality
- GANs - solve problem is visualize imagination
- he says GANs are related to Actor-Critic RL problem
- Actor-Critic models converge but it wasn’t known until 20 years later
- Wasserstein GANs have more theoretical bassis
- Seeing GANs as an optimization problem
- min_G max_D V(D,G)
Art and Imagination
- CANs are a new model for simulating creativity
- Can create fairly nice modern art paintings, but is it art or it sampling a space of images?
- Is there a difference?
- Does it mean anything? does it need to?
Relevant Optimization methods from Economics which would be useful for GANs
- Optimization vs Equilibration
- minimize a function in feasible set
- usually need to assume f(x) is differentiable
- Equilibration from physics (Stampacchia in the 1960s)
- they define the set of partial gradients for a function as a ‘vector field’
- assume this field is given rather than f(x) being given
- means we can solve optimization problems but also other problems
- eg. there are some vector fields that don’t have a function with a gradient that generates the vector field
- economists use this for domains where the vector field can be written down easily but no simple function leads to it via derivatives
- Variational inequality
- eg. traffic management
- GANs are this kind of problem whihc is why they work
- he thinks game theory, GANs, traffic, RL etc are better solved using this approach
- GANs don’t converge if you just do gradient descent
- but if you use the Wasserstein loss function then get a nash equilibrium?
- the surface for a GAN is a saddle so gradient descent is bad, very unlikely to lead to a optimal appoint, you will cycle around
- a better idea sometimes is to move orthogonal to the gradient because the think you are trying to do is find an equilibrium point
- Extragradient Method
- Frobenius projection from the gradient
- Project back to the main function if the negative vector field leads you out of the feasible set
- Mirror Descent instead of Gradient Descent
- by Nemirovsky and Yudin
- gradients are in the dual of the original space, and this happens to line up in euclidean but wouldnt work others, so you can’t add them relaly
- essentially you convert the state vector to the dual space first, then compute the gradient, update it and project back to the original space via a conjugate function
- in Euclidean space, this whole mechanism collapses to gradient descent because the conversion is identity
- this explains why multiplicative updates of gradients work better in many spaces
- you can explain many ML methods using this idea
- boosting
- natural gradients - gradient descent should be done in reimanian space using the Fischer information matrix
- Kakade’s paper on Natural Actor-Critic for playing Tetris
- Mirror-prox
- take two gradient steps in the dual space, works even better
- He calls the dual space the imagination space because it handles (something)
-
Reinforcement Learning as a type of Causal Learning
- where you can try out particular actions
- imagination values
- what is the value of deviating from the action the policy advices
- counterfactural kind of value function
- not needed for an MDP but for POMDPs yuo need to imagine the possible worlds
- even for multi-armed bandits it can help, because it leads you outside the existing policy actions
- is it just exploration? it is related to off-policy exploration, how?
Tuesday Talks
Active Learning Anomaly Detection
- Unsupervised and Active Learning using Maximin-based Anomaly Detection
- Zahra Ghafoori (University of Melbourne), James C. Bezdek (University of Melbourne), Christopher Leckie (University of Melbourne), Shanika Karunasekera (University of Melbourne)
- took some pictures of the AD background description which was very good
- semi-supervised which an oracle that can be queried about whether points are anomalies or not
- they do active learning and compare to standard AD methods like OCSVM and iForest
- better quality than iforest and faster because it only builds one model
OneClass Anomaly Detection
- The Elliptical Basis Function Data Descriptor (EBFDD) Network - A One-Class Classification Approach to Anomaly Detection
- MehranBazargani (The Insight Centre for Data Analytics, School of Computer Science, University College Dublin), Brian Mac Namee (The Insight Centre for Data Analytics, School of Computer Science, University College Dublin)
- a cost functiont hat turns RBF networks in to a one class classifier
- assumptions
- they are interested in streaming data
- training data does not include any anomalies
- RBF Networks
- overview
- three layers
- input data
- hidden layer of gaussian kernals, initialize the (m,v) with k-means
- output layer lineraly combies them via a sigmoid
- backprop learns the paramters of the gaussians
- not applicable to one class
- their change
- main idea: try to tightly fit the gaussians around the subspace of the normal points
- introduce elliptical kernals instead of spherical
- the ellipses can be stretched and shaped to adjust the amount of correlation (or not) between the dimensions
- in stead of gaussian (m,v) they have cov matrix for each kernel
- expeirments
- they use the emmot and dietterich paper as their guide
Autoencoder Anomaly Detection
- they use the emmot and dietterich paper as their guide
- Robust Anomaly Detection in Images using Adversarial Autoencoders
- LauraBeggel (Bosch Center for Artificial Intelligence, Renningen; Ludwig-Maximilians-University Munich), Michael Pfeiffer (Bosch Center for Artificial Intelligence, Renningen), Bernd Bischl (Ludwig-Maximilians-University Munich)
- They use an AutoEnc with an discriminator network instead of using KL divergence
- neat idea : a point is anomalous if either of the following are true
- point is in a dense region and has high reconstruction error compared to the data from training
- point is a lower density with respect to the training distirbution
- this gets you botht he desnity based and deteailed clasified based approaches, couldn’t any method use this?
Counterfactual Justification
- Unjustified Classification Regions and Counterfactual Explanations In Machine Learning
- Thibault Laugel (Sorbonne Université), Marie-Jeanne Lesot (Sorbonne Université), Christophe Marsala (Sorbonne Université), Xavier Renard (AXA, Paris), Marcin Detyniecki (Sorbonne Université; AXA, Paris; Polish Academy of Science)
- they are trying to infer counterfactual explanations but being careful not to create ones wchih are not jsutified by the data
- Idea : can you use decision trees to do coutnerfactual search for other ways to get the result you found?
- some data point goes down the tree to a specific leaf, but then you can infer wether taht is right, maybe it should have gone sligtly elehwere
- does this imply the tree should be different or that the leaf’s label is wrong?
Ranking
- Fast and Parallelizable Ranking with Outliers from Pairwise Comparisons
- Sungjin Im (University of California), Mahshid Montazer Qaem (University of California)
- problem: given multiple partial ordering o felemnts/daatpoints, goal is to create a full, single consistent one
- cycles are a problem, standard methods are well understand from an algorithmic sense, it is NP Hard
- usually comes down to counting and minimizing the number of baward arcs from a DAg
- is it like finding the maximial DAG in a graph?
Hierarchical Dense anomalies
- is it like finding the maximial DAG in a graph?
- CatchCore: Catching Hierarchical Dense Subtensor
- Wenjie Feng (CAS Key Laboratory of Network Data Science & Technology, Institute of Computing Technology, University of Chinese Academy of Sciences),
Causal Effects Talk
- Adjustment Criteria for Recovering Causal Effects from Missing Data
- Mojdeh Saadati (Iowa State University), JinTian (Iowa State University)
- Pearl backdoor condition, or ignorablity
- allows us to in certain cases, infer a causal effect from observational data
- but what if there is msising data? or you are concerned about selection bias affecting the outcomes
- they produce two criteria for evaluating only using the causal graph whether the backdoor criterion can be used
Uplift Regression
- Shrinkage Estimators for Uplift Regression
- Krzysztof Rudaś (Warsaw University of Technology; Institute of Computer Science, Polish Academy of Sciences), Szymon Jaroszewicz (Institute of Computer Science, Polish Academy of Sciences)
- exmaple: marketing sends out a discount email
- you get new data about purchases after the discount, but do any change arise because of the discounts (uplift) or in spite of them?
- Impact could even be reversed, they buy less because of the dicsount coupon
- they improve upon the standard double regression approach for this
Wednesday Talks
Fast Gradient Boosting
- Fast Gradient Boosting Decision Trees with Bit-Level Data Structures
- Laurens Devos (KU Leuven), Wannes Meert (KU Leuven), Jesse Davis (KU Leuven)
- XGBoost is the most popular now
- also LightGBM, look that up
- idea - using full ints and floats is too detailed they use bitlevel datascturcures
- existing gradient updates on decision trees Fast Gradient Boosting
- move some datapoints from one leaf to a neighbouring leaf ndoe (logically, it might not be asimple sibling)
- BitRoost algorithm (theirs)
- bit representation for each leaf? with one hot dncoding
- then use the fast and/or/counts ability of bits during FGB
- read this one in more detail, looks like very useful approach
Association Rules
- Sets of Robust Rules, and How to Find Them
- Jonas Fischer (Max Planck Institute for Informatics; Saarland University), Jilles Vreeken (CISPA Helmholtz Center for Information Security)
- still useful in biology
- people kept working on Apriori algorithm and found you get a lot of that for free by mining conjuctions
- they are still liked becuase you get clean interpretable models andrules
- problem is you get millions of rules
- Grab algorithm
- heuristics to reduce the rule space
- so it seems like a smarter, more general version of Apriori algorithm
Black Box Explanation
- Black Box Explanation by Learning Image Exemplars in the Latent Feature Space
- Riccardo Guidotti (ISTI-CNR, Pisa), Anna Monreale (University of Pisa), Stan Matwin (Dalhousie University; Polish Academy of Sciences), Dino Pedreschi (University of Pisa)
- Problems if you don’t explain
- The Husky-Wolf classifier problem - it was very good but used snow in the background for all wolf images
- current explanation approaches in image clasifiers
- saliency map, showing which pixels leading to which labels
- DeepExplain and other gradient based approaches, visualize the gradient, but it is very specific to taht trained network
- prototype based methods - just show similar types of images that show you what the model is ‘thinking about’
- ABELE their method
TD Actor Critic
- TD-Regularized Actor-Critic Methods
- SimoneParisi(presented), Voot Tangkaratt, Jan Peters, EmtiyazKhan
- They are trying to deal with the instability of actor-critic methods when there is little data
- The problem: using the critic in AC reduces the variance of simple gradient descent
- but the critic introduces bias and so it often overshoots good parts of the space
- their approach (TDRPG) - instead of trying to fix the actor or the critic estimates, they focus on the way they interact and stabilize that
- they add a squared regularization penalty to the optimization step
- they can add their regulazer to the training for any actor critic approach
Deep Ordinal Reinforcement Learning
- AlexanderZap (TU Darmstadt), Tobias Joppen (TU Darmstadt), Johannes Fürnkranz (TU Darmstadt)
- problem: numerical rewards are arbitrary, changing the values can lead to completely different optimal policies
- solution: use orginal rewards instead,
- map numerical rewards to a dinstinct ordered set
- scale doesn’t matter anymore, just order
- issues:
- how to accumulate rewards? you don’t add them, you maintain a histogram/prob distribution of getting each reward in the state
- value function uses this prob distribtuion to choose actiopn and return the associated ordinal value
- how do you maximize it?
- sometimes you need to know the relative difference between values and this isn’t appropriate
- very interesting idea
Attentive Multi-Task Deep Reinforcement Learning
label: ECML2019,RL, skill learning
- Timo Bräm (ETH Zurich), Gino Brunner (ETH Zurich)
- Problem - Multi-task learning is hard
- Existing approaches
- transfer learning from agents experts at prior tasks - but might forget old skills
- robust multitask RL (dilstation, teh, 2017) similar to what we are doing
- Their approach - using attention…
- train on all environments at once, but maintain explicit submodules for each task during training
- can also enforce a smaller number of submodules than ther are tasks
- this forces generalization just like in autoencoders
- There are multiple subnetworks to learn with but they are not explicitly designated to be for a specific task
- Attention network is used to decide which subnetwork is the most relevant right now
- question - how do you know it isn’t just more paramters that are helping?
- experiments : grid world
- criticism -
- just 10 random seeds
- domain very simple
- are they changing the rewards along the way?
- still needs lots of work
Sample-Efficient Model-Free Reinforcement Learning with Off-Policy Critics
label: ECML2019,RL
- DenisSteckelmacher (Vrije Universiteit Brussel), Hélène Plisnier (Vrije Universiteit Brussel), Diederik M. Roijers (VU Amsterdam), Ann Nowé (Vrije Universiteit Brussel)
- Problem: how to learn a policy with a small number of samples without having a model
- They introduce some ways to speed up learning with policy gradients to reuse prior trajectories like in clipped DQN
Policy Prediction Network: Model-Free Behavior Policy with Model-Based Learning in Continuous Action Space
label: ECML2019,RL
- ZacWellmer (Hong Kong University of Science), James T. Kwok (Technology)
- policy gradients, implicit RL
- problem: want to do model-free, model-based combination for policy gradients
- PPN : many changes and tricks compined together
- use similar approach to PPO
- clipping, latent space transition models
Safe Policy Improvement with Soft Baseline Bootstrapping
label: ECML2019,RL, safeAI
- KimiaNadjahi (Télécom Paris), Romain Laroche (Microsoft Research Montréal), Rémi Tachet des Combes (Microsoft Research Montréal)
- to build a model of the uncertainty you can try to do MLE on the transition modle learning
- problem : how to train your RL but focussed on safe outcomes rather than highest performance in a simulation
- This is defined as performing at least as well as the baseline with high probability.
- existing approaches
- SOMEMETHOD -
- SPIBB [Laroche2019, ICML] - if you are not sure enough then you don’t take the action, just use the baseline instead
- problem is they have a very binary approach to something being safe or not
- Their Approach
- SPIBB but it’s soft, safe or not is not a binary choice
- They allow an error-budget to control how much error they can safely allow for that domain.
Thursday
Stochastic Activation Actor Critic Methods
- Wenling Shang (University of Amsterdam-Bosch-Deltalab), Herke van Hoof (University of Amsterdam-Bosch-Deltalab), Max Welling (University of Amsterdam-Bosch-Deltalab)
- Pertubation of intenral network weights to encourage exploration is well established.
- but it doesn’t seem to work so well in Actor-Critic Deep RL
- they add noise in clever ways to LSTMs to make it work
Compatible Natural Gradient Policy Search
- Joni Pajarinen, Hong Linh Thai, Riad Akrour, Jan Peters, Gerhard Neumann
- Trust region policy search - using greedy updates the entropy drops too fast
- Their approach
- hard entropy constraint
- something elseq
Stochastic One-Sided Full-Information Bandit
- Haoyu Zhao (Tsinghua University), Wei Chen (Microsoft Research, Beijing)
- problem: repeated second price auctions
- prior work:
- SODA assumes bidders are truthful, and that distritbuion of bidders if iid
- max bid does not need to be known
- their approach:
- iid bidders not required
- need to know the max value of the bids
- maintain a set of all good arms : determined by empircal mean
Practical Open-Loop Optimistic Planning
- Edouard Leurent (SequeL team, INRIA Lille - Nord Europe; Renault Group), Odalric-Ambrym Maillard (SequeL team, INRIA Lille - Nord Europe)
highway-env
environmenta on github for highway driving of human behaviours- assume a generative model
- optimisitic planning - they use this along with tree search
-
they recall people have found out out failing cases of UCT
- very deep branches where all the choices are bad, the optimisitc bias leads to problems
- solution (lots of Munos work in 2008,2010) works for restricted classes of MDPs and works
- OLOP showed that you can do UCB style planning for stochastic and deterministic MDPs the same
- but htere was no empirical vbalidation of it, so this work does that
- they show it is actually quite difference in pracic
- They improve this by using the KL divergneece with OLOP
An Engineered Empirical Bernstein Bound
- MarkBurgess (Australian National University), Archie C. Chapman (University of Sydney), Paul Scott (Australian National University)
- Hoefding bound - it’s a concentration inequality
- Empircal Bernstein bound is similar - uses sample vairnces
- Bennett’s inequality is much stronger and useful, but maybe hard to use? this would give you the perfect bound possible if you knew the full variance?
- they define their own EBB bound, pretty complex formulation
- they show it provides a tighter bound thatn hoeffding on bernoulli bandits
MACLEAN Earth Observation Workshop
Earth Orientation Parameters Time Series
- G. Okhotnikov and N. Golyandina: EOP Time Series Prediction Using Singular Spectrum Analysis
- The EOP data include 5 numbers that arrive as a time series
- pole coordinates
- lenght of day
- changes in pole angles over time?
- There is a service that publishes the daily values for the time series
- important for navigation and satellites
- SSA - https://en.wikipedia.org/wiki/Singular_spectrum_analysis is a common method used to seperate out trend, noise etc from a time series
- time series of lenght L
- embedding, create a trajectory matrix
- take SVD
- Group eigen triples together
- diagononal averaging
- then reverse the embedding
- uses: allows you to extract the orignal sine wave if there was one
MvMF Loss for Prediction locations of an image on Earth’s surface
– M. Izbicki, E. Papalexakis and V. Tsotras: The MvMF Loss for Predicting Locations on the Earth’s Surface
- problem how to locate the location of an image in the world just by looking at it
- easy and hard problems : eiffel tower vs inside of an apple store
- data - data base of fflickr images with gps in them
- their approach - Mixture of von Mises-Fisher Distritbuion
- vMF is like Gaussian distribution for spheres - (m,s)
- MvMF is a mixture of these just like a mixture of gaussians but it knows about sphere structure
- works better for anything about locating predictions on the earth but it assumes the earth is a sphere
- so using MvMF as the loss function in one of these algorithm, such as Google PlaNet [Weyland 2019], makes the predcitions much smoother on the earth’s surface
- other uses
- estimate range of animals and birds from people’s social media images
Deep learning for power line inspection
- Invited Speaker: Prof. Robert Jenssen
- 25 person machine learning group at UIT in Northern Norway
- they are 70 degree north, very weak magentic field there so they get the strongest northern lights
- Northern Lights Deep Learning Workshop 2020 (January 20-21), about 100 people
- problem: power line inspection using deep learning
- how to use drones to inspect poles in remote regions to assess if maintenance work is needed
- method: few shot learning? YOLO.
- other applications
- power lines
- piplines
- roads, railways
- simulated data
- they generate synthetic of landscapes with power lines, high resoltuion
- tarin the detector and predictor on this
- then use the trained model for real drone flight
Few Shot Learning
-
it is common for this type of domain to find new objects that were never enouctered before
- ZhenCVPR2018 - Ring Loss for convex feature normalziation for face revognition
- FewShotLearning usually uses protoype points based on few data poiints
- They think this ring loss approach produces a more natural representation
- but it has some difficulat input paramters
- Their approach
- their paper on improving few shot learning
- they use a class-conditional dissimilarity measure
-
- they want to mesure distance between them base on the angle of the norms?
- result - goal is that all points with hte same class have the same norm
- interesting - they use different scores for distances between points in the same class and ones that are in different classes
- so it’s kind of like a normalization