## Machine Learning Seminar Series

#### Wednesdays, 1-2pm in the Harper Center (Booth) Room 219 Pizza provided by UChicago CS Department

April 3, 2019 **in Crerar room 390**: Lorenzo Orecchia, Boston University

### First-Order Methods Unleashed: Scalable Optimization in the Age of Big Data

Abstract:
First-order methods are a fundamental tool in the design of efficient algorithms for large-scale computational problems. Besides being the optimization workhorse of machine learning, first-order methods have recently served as a springboard for a number of algorithmic advances in discrete optimization, including submodular optimization and maximum flow problems. In this talk, I will showcase a number of results from my research that demonstrate the power of first-order methods as a generic framework for algorithm design.
In the first part, I will describe my view of first-order methods as discretizations of continuous dynamical systems over curved spaces. For convex optimization, such dynamics conserve a specific quantity — the product of time and a notion of duality gap — which immediately guarantees convergence to optimum. This primal-dual view helps us to both design novel algorithms and simplify the analyses of existing ones. In particular, I will discuss how it yields a simple, intuitive analysis of accelerated algorithms and how it allows us to port such algorithms to contexts that do not squarely match standard smoothness assumptions.
In the second part, we will see how to exploit problem-specific structure by preconditioning, i.e., by endowing the space with a curved geometry that facilitates the convergence of the dynamics above. In particular, I will describe how different random-walk-based algorithms for graph partitioning arise from different preconditionings of the same optimization problem, and how combinatorial preconditioners yield nearly-linear-time algorithms for flow problems over undirected graph.
Bio:
Lorenzo Orecchia is an assistant professor in the Department of Computer Science at Boston University. Lorenzo’s research focuses on the design of efficient algorithms for fundamental computational challenges in machine learning and combinatorial optimization. His approach is based on combining ideas from continuous and discrete optimization into a single framework for algorithm design. Lorenzo obtained his PhD in computer science at UC Berkeley under the supervision of Satish Rao in 2011, and was an applied mathematics instructor at MIT under the supervision of Jon Kelner until 2014. He was a recipient of the 2014 SODA Best Paper award and a co-organizer of the Simons semester “Bridging Continuous and Discrete Optimization” in Fall 2017.

April 10, 2019 **in room C25**: Mesrob Ohannessian, TTIC

### From Fair Decisions to Social Benefit

Using data efficiently is no longer our only burden. As automated decisions in socio-technical systems affect more lives than ever, it is imperative that we also use data responsibly. Being non-discriminatory and fair is one such responsibility. To date, explorations of fairness in intelligent decision systems have mostly ignored long-term influence on the underlying population. In this talk I give a first comprehensive perspective of this problem. Among the various insights I provide is quantifying both sides of the mismatch hypothesis: when can we hope that affirmative action is indeed beneficial to society? Bio: Mesrob I. Ohannessian is a Research Assistant Professor at the Toyota Technological Institute at Chicago. He was previously a postdoc at UCSD, MSR-Inria, and Université Paris-Sud. He received his PhD in EECS from MIT. His research interests are in machine learning, statistics, information theory, and their applications, particularly to problems marked by data scarcity and to decisions that affect society.

April 17, 2019 **in room C04**: Lin Yang, Princeton University

### Learn Policy Optimally via Efficiently Utilizing Data

Recent years have witnessed increasing empirical successes in reinforcement learning. Nevertheless, it is an irony that many theoretical problems in this field are not well understood even in the most basic setting. For instance, the optimal sample and time complexities of policy learning in finite-state Markov decision process still remain unclear. Given a state-transition sampler, we develop a novel algorithm that learns an approximate-optimal policy in near-optimal time and using a minimal number of samples. The algorithm makes updates by processing samples in a “streaming” fashion, which requires small memory and naturally adapts to large-scale data. Our result resolves the long-standing open problem on the sample complexity of Markov decision process and provides new insights on how to use data efficiently in learning and optimization.
The algorithm and analysis can be extended to solve two-person stochastic games and feature-based Markov decision problems while achieving near-optimal sample complexity. We further illustrate several other examples of learning and optimization over streaming data, with applications in accelerating Astrophysical discoveries and improving network securities.
Bio:
Lin Yang is currently a postdoctoral researcher at Princeton University working with Prof. Mengdi Wang. He obtained two Ph.D. degrees simultaneously in Computer Science and in Physics & Astronomy from Johns Hopkins University in 2017. Prior to that, he obtained a bachelor’s degree from Tsinghua University. His research focuses on developing fast algorithms for large-scale optimization and machine learning. This includes reinforcement learning and streaming methods for optimization and function approximations. His algorithms have been applied to real-world applications including accelerating astrophysical discoveries and improving network security. He has published numerous papers in top Computer Science conferences including NeurIPS, ICML, STOC, and PODS. At Johns Hopkins, he was a recipient of the Dean Robert H. Roy Fellowship.

April 24, 2019: Avrim Blum, TTIC

### Algorithmic fairness in online decision-making

There is growing concern about fairness in algorithmic decision making: Are algorithmic decisions treating different groups fairly? How can we make them fairer? What do we even mean by fair? In this talk I will discuss some of our work on this topic, focusing on the setting of online decision making. For instance, a classic result states that given a collection of predictors, one can adaptively combine them to perform nearly as well as the best of those predictors in hindsight (achieve “no regret”) without any stochastic assumptions. Can one extend this guarantee so that if the predictors are themselves fair (according to a given definition), then the overall combination is fair as well (according to the same definition)? I will discuss this and other issues. This is joint work with Suriya Gunasekar, Thodoris Lykouris, and Nati Srebro.

April 29, 2019: (2:30pm in Crerar 390) Tong Zhang, Hong Kong University of Science and Technology.

### Modern Techniques of Statistical Optimization for Machine Learning

Abstract:
Many problems in machine learning rely on statistics and optimization. To solve these problems, new techniques are needed. I will show some of these new techniques through some machine learning problems I have recently worked on, such as nonconvex stochastic optimization, distributed training, adversarial attack, generative model, etc.
Bio:
Tong Zhang is a Professor of Computer Science and Mathematics at The Hong Kong University of Science and Technology. Previously, he was a professor at Rutgers University, and worked at IBM, Yahoo, Baidu, and Tencent. Tong Zhang’s research interests include machine learning algorithms and theory, statistical methods for big data and their applications. He is a fellow of ASA and IMS, and he has been on the editorial boards of leading machine learning journals and program committees of top machine learning conferences. Tong Zhang received a B.A. in mathematics and computer science from Cornell University and a Ph.D. in Computer Science from Stanford University.

May 1, 2019: Tengyuan Liang, University of Chicago Booth

### New Thoughts on Adaptivity, Generalization and Interpolation Motivated from Neural Networks

Consider the problem: given data pair (x, y) drawn from a population with f_*(x) = E[y|x], specify a neural network and run gradient flow on the weights over time until reaching any stationarity. How does f_t, the function computed by the neural network at time t, relate to f_*, in terms of approximation and representation? What are the provable benefits of the adaptive representation by neural networks compared to the pre-specified fixed basis representation in the classical nonparametric literature? We answer the above questions via a dynamic reproducing kernel Hilbert space (RKHS) approach indexed by the training process of neural networks. We show that when reaching any local stationarity, gradient flow learns an adaptive RKHS representation, and performs the global least squares projection onto the adaptive RKHS, simultaneously. In addition, we prove that as the RKHS is data-adaptive and task-specific, the residual for f_* lies in a subspace that is smaller than the orthogonal complement of the RKHS, formalizing the representation and approximation benefits of neural networks. Then we will move to discuss generalization for interpolation methods in RKHS. In the absence of explicit regularization, Kernel “Ridgeless” Regression with nonlinear kernels has the potential to fit the training data perfectly. It has been observed empirically, however, that such interpolated solutions can still generalize well on test data. We isolate a phenomenon of implicit regularization for minimum-norm interpolated solutions which is due to a combination of high dimensionality of the input data, curvature of the kernel function, and favorable geometric properties of the data such as an eigenvalue decay of the empirical covariance and kernel matrices. In addition to deriving a data-dependent upper bound on the out-of-sample error, we present experimental evidence suggesting that the phenomenon occurs in the MNIST dataset.

May 8, 2019: Rosemary Braun, Northwestern University

May 15, 2019: Yali Amit, UChicago Statistics

May 22, 2019: UChicago Geometric Data Analysis Workshop

May 29, 2019: Dan McDonald

June 5, 2019: TBD

## Past Quarters:

Jan. 16, 2019: Sumeet Katariya, UW-Madison

### Adaptive Sampling for Ranking and Clustering

Abstract: The multi-armed bandit framework has been used to study sequential decision making for various objectives: minimizing regret, finding the best-item, finding the top-k items. In this work we study sequential decision making for *ranking*, where the goal is to sort items according to their meanls by adaptively sampling from their unknown reward distributions. An approximate notion of ranking is clustered or quantile ranking, where the goal is to find not the exact rank of an item, but only which quantile it belongs to. This formulation is motivated from a social science application at the Knowledge Lab at UChicago. In this talk, I will discuss a computationally efficient PAC algorithm for quantile ranking and its analysis. This setup naturally leads to the problem of clustering, which in its simplest form reduces to the problem of identifying the maximum gap between the means, and I will also discuss algorithms for adaptive clustering. This is joint work with Ardhendu Tripathy, Lalit Jain, Nandana Sengupta, James Evans, and Rob Nowak.

Jan. 23, 2019:

Karl Stratos, TTIC
### Challenges and Hopes of Unsupervised Learning by Mutual Information Maximization

Abstract: The objective of maximizing mutual information has a rich history of success in unsupervised learning and is recently of great interest to the deep learning community. Unfortunately, measuring mutual information from a finite sample is a notoriously difficult task. As a result, past works often resort to an approximation by a sample-estimatable lower bound. In this work, we give universal limitations of measuring mutual information that no such approach can overcome. Specifically, we show that any distribution-free high-confidence lower bound on mutual information estimated from N samples is at most logarithmic in N, which implies that the lower bound methodology is fundamentally flawed whenever the true mutual information is nontrivial. We argue that a better approach is to treat mutual information as a difference of entropies and measuring each entropy byl the cross-entropy upper bound. We present an empirical investigation with translated sentences and related news articles to support our argument. This is joint work with David McAllester.

Feb. 6, 2019:

Chao Gao, University of Chicago Statistics
### Convergence Rates of Variational Posterior Distributions

We study convergence rates of variational posterior distributions for nonparametric and high-dimensional inference. We formulate general conditions on prior, likelihood, and variational class that characterize the convergence rates. Under similar “prior mass and testing” conditions considered in the literature, the rate is found to be the sum of two terms. The first term stands for the convergence rate of the true posterior distribution, and the second term is contributed by the variational approximation error. For a class of priors that admit the structure of a mixture of product measures, we propose a novel prior mass condition, under which the variational approximation error of the generalized mean-field class is dominated by convergence rate of the true posterior. We demonstrate the applicability of our general results for various models, prior distributions and variational classes by deriving convergence rates of the corresponding variational posteriors.

Feb. 13, 2019:

Veronika Rockova, Univeristy of Chicago, Booth
### On Theory for BART

Since their inception in the 1980¹s, regression trees have been one of the more widely used non-parametric prediction methods. Tree-structured methods yield a histogram reconstruction of the regression surface, where the bins correspond to terminal nodes of recursive partitioning. Trees are powerful, yet susceptible to over-fitting. Strategies against overfitting have traditionally relied on pruning greedily grown trees. The Bayesian framework offers an alternative remedy against overfitting through priors. Roughly speaking, a good prior charges smaller trees where overfitting does not occur. While the consistency of random histograms, trees and their ensembles has been studied quite extensively, the theoretical understanding of the Bayesian counterparts has been missing. In this paper, we take first steps towards understanding why/when do Bayesian trees and forests not overfit. To address this question, we study the speed at which the posterior concentrates around the true smooth regression function. We propose a spike-and-tree variant of the popular Bayesian CART prior and establish new theoretical results showing that regression trees (and forests) (a) are capable of recovering smooth regression surfaces (with smoothness not exceeding one), achieving optimal rates up to a log factor, (b) can adapt to the unknown level of smoothness and (c) can perform effective dimension reduction when p > n. Going further, we also show semi and non-parametric Bernstein-von Mises theorems showing that BART is fundamentally justified from a frequentist point of view. These results provide a piece of missing theoretical evidence explaining why Bayesian trees (and additive variants thereof) have worked so well in practice.

Feb. 20, 2019:

Ruey Tsay, University of Chicago, Booth
### Analysis of Big Dependent Data

Most works in the machine learning and high-dimensional statistical inference assume independence. Most data, on the other hand, are either dynamically or spatially dependent. In this talk, I discuss the impact of dependence on statistical inference of high-dimensional data analysis, including LASSO regression and generalized linear models. The presentation depends on some recent joint papers, focusing on modeling big dependent data for which either the dimension or the sample size or both are large. We demonstrate the analyses using PM2.5 data collected at multiple monitoring stations and at various frequencies.

Feb. 27, 2019: No Seminar

March 5, 2019 **(TUESDAY), Harper Center 3B from 12:10 to 1:10 p.m.:** Jorge Nocedal, Northwestern University

### Zero-Order Methods for the Optimization of Noisy Functions

This talk presents a finite difference quasi-Newton method for the minimization of noisy functions. The method takes advantage of the scalability and power of BFGS updating, and employs an adaptive procedure for choosing the differencing interval h based on the noise estimation techniques of Hamming and More and Wild. This noise estimation procedure and the selection of h are inexpensive but not always accurate, and to prevent failures the algorithm incorporates a recovery mechanism that takes appropriate action in the case when the line search procedure is unable to produce an acceptable point. A novel convergence analysis is presented that considers the effect of a noisy line search procedure. Numerical experiments comparing the method to a model based trust region method are presented.

March 6, 2019: Bryan Pardo, Northwestern University

### Audio source separation models that learn without ground truth and are open to user correction

Separating an audio scene into isolated sources is a fundamental problem in computer audition, analogous to image segmentation in visual scene analysis. It is an enabling technology for many tasks, such as automatic speech recognition, labeling sound objects in an acoustic scene, music transcription, and remixing of existing recordings. Source separation systems based on deep learning are currently the most successful approaches for solving the underdetermined separation problem, where there are more sound sources (e.g. instruments in a band) than channels (a stereo recording has two channels). Currently, deep learning systems that perform source separation are trained on many mixtures (e.g., tens of thousands) for which the ground truth decompositions are already known. Since most real-world recordings have no such decomposition available, developers train systems on artificial mixtures created from isolated individual recordings. Although there are large databases of isolated speech, it is impractical to find or build large databases of isolated recordings for every arbitrary sound. This fundamentally limits the range of sounds that deep models can learn to separate. Once learned, a deep model’s output is take-it-or-leave it and it can be difficult for the end user to affect either the current output or to give corrective feedback for the future. In this talk Prof. Pardo discusses recent work in two areas. The first is bootstrapping learning of a scene segmentation model using an acoustic cue known to be used in human audition. This allows learning a model without access to ground-truth decompositions of acoustic scenes. The second is ongoing work to provide an interface for an end user to interact with a deep model, to affect the current separation and improve future separation by allowing for retraining of the model from corrective feedback. BIO: Bryan Pardo is an associate professor in the Northwestern University Department of Electrical Engineering and Computer Science. Prof. Pardo received a M. Mus. in Jazz Studies in 2001 and a Ph.D. in Computer Science in 2005, both from the University of Michigan. He has authored over 100 peer-reviewed publications. He has developed speech analysis software for the Speech and Hearing department of the Ohio State University, statistical software for SPSS and worked as a machine learning researcher for General Dynamics. While finishing his doctorate, he taught in the Music Department of Madonna University.

March 13, 2019: Sebastian Stitch, EPFL

### Error Feedback for Communication Efficient SGD

Huge scale machine learning problems are nowadays tackled by distributed optimization algorithms, i.e. algorithms that leverage the compute power of many devices for training. The communication overhead is a key bottleneck that hinders perfect scalability. Various recent works proposed to use quantization or sparsification techniques to reduce the amount of data that needs to be communicated. We analyze Stochastic Gradient Descent (SGD) with k-sparsification (for instance top-k or random-k) and compression (for instance quantization) and show that these schemes converge at the same rate as vanilla SGD when equipped with error compensation (i.e. keeping track of accumulated errors in memory). That is, communication can be reduced by a factor of the dimension of the problem (sometimes even more) whilst still converging at the same rate.

March 20, 2019: Spring Break, No Seminar

March 27, 2019: Spring Break, No Seminar