Machine Learning Seminar Series
Spring 2022: Wednesdays at 11:30
Wednesday, Jun. 1, 2022 in TTIC 530: Chenhao Tan, University of Chicago
Towards Human-Centered Explanations of AI Predictions
Wednesday, Apr. 27, 2022 in TTIC 530: Yuexi Wang, University of Chicago
Approximate Bayesian Computation via Classification
Wednesday, Apr. 20, 2022 in TTIC 530: Rad Niazadeh, University of Chicago
Optimal Algorithms for Continuous Non-Monotone Submodular Maximization: Theory and Applications
Fall 2021: Wednesdays at 10:30
Wednesday, Dec. 01, 2021 in TTIC 530: Mladen Kolar, University of Chicago
Estimating Differential Networks from Functional Data
Wednesday, Nov. 17, 2021 in TTIC 530: Omar Montasser, TTIC
Transductive Robust Learning Guarantees
Wednesday, Nov. 10, 2021 in TTIC 530: Frederic Koehler, UC Berkeley
Learning Ising Models with Latent Variables
Wednesday, Nov. 3, 2021 in TTIC 530: Zhimei Ren, University of Chicago
Policy Learning with Adaptively Collected Data
Wednesday, Oct. 20, 2021 in TTIC 530: Cong Ma, University of Chicago
Bridging Offline Reinforcement Learning and Imitation Learning: A Tale of Pessimism
Spring 2020: Fridays at 10:30
Friday, Apr. 3, 2020: No talk
Friday, Apr. 10, 2020: No talk
Friday, Apr. 17, 2020: Pritish Kamath, TTIC
Friday, Apr. 24, 2020: Dimitris Papailiopoulos, UW-Madison
Friday, May 1, 2020: Shirley Wu, UT-Austin
Friday, May 8, 2020: Vaishaal Shankar, UC-Berkeley
Friday, May 15, 2020: No talk
Friday, May 22, 2020: Cyrus Rashtchian, UCSD
Friday, May 29, 2020: Laurent Lessard, UW-Madison
Friday, Jun. 5, 2020: Yuanzhi Li, CMU
Winter 2020: Fridays at 10:30
Friday, Jan. 10, 2020 in TTIC 526: Mina Karzand, UW-Madison
Focused Learning in Tree-Structured Graphical Models
Abstract: This talk is about the predictive power of learned graphical models where we show that an incorrect or incomplete combinatorial structure (graph) can nevertheless yield accurate predictions.
In particular, in the first half of the talk, I look into learning tree structured Ising models in which the learned model is used subsequently for prediction based on partial observations (given the realization of a subset of variables, predict the value of the remaining ones). The vast majority of previous work on learning graphical models aims to correctly recover the underlying graph structure. In the data-constrained regime, learning the entire graph structure correctly is usually impossible. I show that it is possible to efficiently learn a tree model that gives accurate predictions even when there is insufficient data to learn the correct structure.
The second half of the talk is about speciation rate estimation in phylogenetic trees. This problem is essentially one of inferring features of the model (in this case, the speciation or extinction rate) from partial observations (the sequences at the leaves of the tree) of a latent tree model (phylogeny). I show that to estimate the speciation rate efficiently, it is not necessary to follow the popular approach of reconstructing the complete tree structure as an intermediate step, which requires long DNA sequences. Instead, one can extract precisely the right type of information about the rates by zooming into carefully chosen local structures. My results show that an incomplete and partially incorrect summary of the tree structure is enough to estimate the speciation rate with the minimax optimal dependence on the length of observed DNA sequences.
Joint work with Guy Bresler, Sebastien Roch and Robert Nowak.
Bio: Mina Karzand is a postdoctoral associate at the University of Wisconsin-Madison. Before that, she was a postdoctoral research associate at MIT, where she also received her PhD in Electrical Engineering and Computer Science. Her research interests are in the design and analysis of data-driven inference and decision-making systems at the intersection of machine learning, probability, and information theory.
Friday, Jan. 17, 2020 in Crerar 390: Rina Foygel Barber, University of Chicago
Predictive inference with the jackknife+
Abstract: We introduce the jackknife+, a novel method for constructing predictive confidence intervals that is robust to the distribution of the data. The jackknife+ modifies the well-known jackknife (leave-one-out cross-validation) to account for the variability in the fitted regression function when we subsample the training data. Assuming exchangeable training samples, we prove that the jackknife+ permits rigorous coverage guarantees regardless of the distribution of the data points, for any algorithm that treats the training points symmetrically. Such guarantees are not possible for the original jackknife, and we demonstrate examples where its coverage rate may actually vanish. Our theoretical and empirical analysis reveals that the jackknife and jackknife+ intervals achieve nearly exact coverage and have similar lengths whenever the fitting algorithm obeys some form of stability. We also extend the method to the setting of K-fold cross-validation. Our methods are related to cross-conformal prediction proposed by Vovk [2015] and we discuss connections. This work is joint with Emmanuel Candes, Aaditya Ramdas, and Ryan Tibshirani.
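For readers who want the mechanics, here is a minimal sketch of the jackknife+ interval in Python (an illustration assuming scikit-learn's LinearRegression as the base regressor; any algorithm that treats the training points symmetrically can be substituted):

import numpy as np
from sklearn.linear_model import LinearRegression

def jackknife_plus_interval(X, y, x_test, alpha=0.1):
    # Fit n leave-one-out models; each contributes a shifted prediction.
    n = len(y)
    lo, hi = np.empty(n), np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i                        # leave point i out
        model = LinearRegression().fit(X[keep], y[keep])
        r_i = abs(y[i] - model.predict(X[i:i + 1])[0])  # leave-one-out residual
        pred = model.predict(x_test.reshape(1, -1))[0]
        lo[i], hi[i] = pred - r_i, pred + r_i
    k = int(np.ceil((1 - alpha) * (n + 1)))             # order-statistic index
    return np.sort(lo)[n - k], np.sort(hi)[k - 1]       # (1 - alpha) interval

The endpoints are order statistics of the leave-one-out predictions shifted by their residuals, which is what delivers the distribution-free coverage guarantee described above.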
Bio: Rina Foygel Barber is an Associate Professor in the Department of Statistics at the University of Chicago. Before starting at U of C, she was an NSF postdoctoral fellow during 2012-13 in the Department of Statistics at Stanford University, supervised by Emmanuel Candès. She received her PhD in Statistics at the University of Chicago in 2012, advised by Mathias Drton and Nati Srebro, and an MS in Mathematics at the University of Chicago in 2009. Prior to graduate school, she was a mathematics teacher at the Park School of Baltimore from 2005 to 2007.
Friday, Jan. 24, 2020 in TTIC 526: Steve Hanneke, TTIC
Learning Whenever Learning is Possible: Universal Learning under General Stochastic Processes
Abstract: I will present a general theory of learning and generalization without the i.i.d. assumption, starting from first principles. We focus on the problem of universal consistency: the ability to estimate any function from X to Y. We endeavor to answer the question: Does there exist a learning algorithm that is universally consistent under every process on X for which universal consistency is possible? Remarkably, we find the answer is “Yes”. Thus, we replace the i.i.d. assumption with only the most natural (and necessary!) assumption: that learning is at least possible. Along the way, we also find a concise characterization of the family of all processes that admit universally consistent learners. One challenge for learning is that some of these processes do not even have a law of large numbers.
Bio: Steve Hanneke is a Research Assistant Professor at the Toyota Technological Institute at Chicago. His research explores the theory of machine learning, with a focus on reducing the number of training examples sufficient for learning. His work develops new approaches to supervised, semi-supervised, transfer, and active learning, and also revisits the basic probabilistic assumptions at the foundation of learning theory. Steve earned a Bachelor of Science degree in Computer Science from UIUC in 2005 and a Ph.D. in Machine Learning from Carnegie Mellon University in 2009 with a dissertation on the theoretical foundations of active learning.
Friday, Jan. 31, 2020 in Crerar 390: Tengyu Ma, Stanford
Designing Explicit Regularizers for Deep Models
Abstract: I will discuss some recent results on designing explicit regularizers to improve the generalization performance of deep neural networks. We derive data-dependent generalization bounds for deep neural networks. We empirically regularize the bounds and obtain improved generalization performance (in terms of the standard accuracy or the robust accuracy). I will also touch on recent results on applying these techniques to imbalanced datasets.
Based on joint work with Colin Wei, Kaidi Cao, Adrien Gaidon, and Nikos Arechiga.
Bio: Tengyu Ma is an assistant professor of Computer Science and Statistics at Stanford University. He received his Ph.D. from Princeton University and B.E. from Tsinghua University. His research interests include topics in machine learning and algorithms, such as deep learning and its theory, non-convex optimization, deep reinforcement learning, representation learning, and high-dimensional statistics. He is a recipient of NIPS’16 best student paper award, COLT’18 best paper award, and ACM Doctoral Dissertation Award Honorable Mention.
Friday, Feb. 7, 2020 in TTIC 526: Yixin Wang, Columbia
The Blessings of Multiple Causes
Abstract: Causal inference from observational data is a vital problem, but it comes with strong assumptions. Most methods assume that we observe all confounders, variables that affect both the causal variables and the outcome variables. But whether we have observed all confounders is a famously untestable assumption. We describe the deconfounder, a way to do causal inference from observational data allowing for unobserved confounding.
How does the deconfounder work? The deconfounder is designed for problems of multiple causal inferences: scientific studies that involve many causes whose effects are simultaneously of interest. The deconfounder uses the correlation among causes as evidence for unobserved confounders, combining unsupervised machine learning and predictive model checking to perform causal inference. We study the theoretical requirements for the deconfounder to provide unbiased causal estimates, along with its limitations and tradeoffs. We demonstrate the deconfounder on real-world data and simulation studies.
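A heavily simplified sketch of that recipe (an illustration only: PCA stands in for the probabilistic factor model, and the predictive model-checking step is omitted):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

def deconfounder_estimate(A, y, n_factors=2):
    # Step 1: fit a factor model to the matrix of causes A (n x m).
    Z = PCA(n_components=n_factors).fit_transform(A)  # substitute confounder
    # Step 2: adjust for the inferred factors in the outcome model.
    X = np.hstack([A, Z])
    model = LinearRegression().fit(X, y)
    return model.coef_[:A.shape[1]]  # adjusted effect estimates for the m causes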
Friday, Feb. 14, 2020 in Crerar 390: Bryon Aragam, University of Chicago, Booth
A general framework for learning DAGs with NO TEARS
Abstract: Interpretability and causality have been acknowledged as key ingredients to the success and evolution of modern machine learning systems. Graphical models, and more specifically directed acyclic graphs (DAGs, also known as Bayesian networks), are an established tool for learning and representing interpretable causal models. Unfortunately, estimating the structure of DAGs from data is a notoriously difficult problem, and as a result existing approaches rely on various local heuristics for enforcing the acyclicity constraint. In this talk, we introduce a fundamentally different strategy: We formulate the structure learning problem as a purely continuous optimization problem that avoids this combinatorial constraint entirely. This optimization problem can be efficiently solved by standard numerical algorithms, avoiding handcrafted algorithms which also makes implementation particularly easy. As a result, we obtain a general framework for learning parametric, nonparametric, and dynamic DAG models that includes GLMs, additive noise models, and index models as special cases.
Joint work with Xun Zheng, Chen Dan, Pradeep Ravikumar, and Eric P. Xing.
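To make the key idea tangible, here is a small sketch of the continuous acyclicity function h(W) = tr(exp(W ∘ W)) − d used by NO TEARS, which vanishes exactly on weighted adjacency matrices of DAGs:

import numpy as np
from scipy.linalg import expm

def acyclicity(W):
    # h(W) = tr(exp(W * W)) - d, where W * W is the elementwise square.
    d = W.shape[0]
    return np.trace(expm(W * W)) - d

W_dag = np.array([[0.0, 1.0],
                  [0.0, 0.0]])   # single edge 1 -> 2: acyclic
W_cyc = np.array([[0.0, 1.0],
                  [1.0, 0.0]])   # edges 1 -> 2 and 2 -> 1: a cycle

print(acyclicity(W_dag))  # ~0.0
print(acyclicity(W_cyc))  # > 0, so the constraint penalizes the cycle

Minimizing a fit loss subject to h(W) = 0 is the continuous program that replaces the combinatorial acyclicity search.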
Bio: Bryon Aragam is an Assistant Professor and Topel Faculty Scholar in the Booth School of Business at the University of Chicago. His research interests include statistical machine learning, nonparametric statistics, and optimization. He is also involved with developing open-source software and solving problems in interpretability, ethics, and fairness in artificial intelligence.
Prior to joining the University of Chicago, Bryon was a project scientist and postdoctoral researcher in the Machine Learning Department at Carnegie Mellon University. He completed his PhD in Statistics and a Masters in Applied Mathematics at UCLA. He has also served as a data science consultant for technology and marketing firms, where he has worked on problems in survey design and methodology, ranking, customer retention, and logistics.
Friday, Feb. 21, 2020 in TTIC 526: TTIC Student Workshop
Friday, Feb. 28, 2020 in TTIC 526: Xiang Cheng, Berkeley
Sampling as Optimization and Optimization as Sampling
Abstract: This talk presents a series of results that draw connections between optimization and sampling. In one such result, we show that the Langevin SDE corresponds to the gradient flow of KL divergence with respect to the 2-Wasserstein metric in probability space. This allows us to prove convergence of Langevin MCMC in KL divergence, and even achieve accelerated rates in a similar fashion to Nesterov’s accelerated gradient descent. In the reverse direction, we can also show that Stochastic Gradient Descent may be viewed as the discretization of a certain Stochastic Differential Equation with a state-dependent diffusion matrix that corresponds to the covariance matrix of the sampled stochastic gradient. This theory helps us explain the behavior of SGD in settings such as the training of deep neural networks, where it has been observed that larger noise (in the form of smaller batch-size/larger step-size) gives smaller generalization error.
Based on joint work with Peter Bartlett, Niladri Chatterji, Michael Jordan, Yi-An Ma, and Dong Yin.
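To make the sampling-as-optimization dictionary concrete, here is a minimal sketch (an illustration, not the speaker's code) of the Euler discretization of the Langevin SDE, i.e., unadjusted Langevin MCMC:

import numpy as np

def langevin_mcmc(grad_U, x0, eta=1e-2, steps=10_000, rng=None):
    # Iterates x <- x - eta * grad U(x) + sqrt(2 * eta) * N(0, I),
    # which approximately samples from p(x) proportional to exp(-U(x)).
    rng = rng or np.random.default_rng(0)
    x = np.array(x0, dtype=float)
    samples = []
    for _ in range(steps):
        x = x - eta * grad_U(x) + np.sqrt(2 * eta) * rng.standard_normal(x.shape)
        samples.append(x.copy())
    return np.array(samples)

# Example: U(x) = ||x||^2 / 2 gives a standard Gaussian target; grad U(x) = x.
chain = langevin_mcmc(lambda x: x, x0=np.zeros(2))
print(chain.mean(axis=0), chain.var(axis=0))  # roughly 0 mean, variance near 1

The drift term is exactly a gradient descent step on U; the Wasserstein gradient-flow view described in the abstract explains why this dynamics decreases the KL divergence to the target.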
Friday, Mar. 6, 2020: No talk
Friday, Mar. 13, 2020 in Crerar 390: Samory Kpotufe, Columbia
Some Recent Insights on Transfer Learning
Abstract: A common situation in Machine Learning is one where training data is not fully representative of a target population due to bias in the sampling mechanism or high costs in sampling the target population; in such situations, we aim to ’transfer’ relevant information from the training data (a.k.a. source data) to the target application. How much information is in the source data? How much target data should we collect if any? These are all practical questions that depend crucially on ‘how far’ the source domain is from the target. However, how to properly measure ‘distance’ between source and target domains remains largely unclear.
In this talk we will argue that much of the traditional notions of ‘distance’ (e.g. KL-divergence, extensions of TV such as D_A discrepancy, density-ratios, Wasserstein distance) can yield an over-pessimistic picture of transferability. Instead, we show that some new notions of ‘relative dimension’ between source and target (which we simply term ‘transfer-exponents’) capture a continuum from easy to hard transfer. Transfer-exponents uncover a rich set of situations where transfer is possible even at fast rates, help answer questions such as the benefit of unlabeled or labeled target data, yield a sense of optimal vs. suboptimal transfer heuristics, and have interesting implications for related problems such as multi-task learning.
Finally, transfer-exponents provide guidance as to *how* to efficiently sample target data so as to guarantee improvement over source data alone. We illustrate these new insights through various simulations on controlled data, and on the popular CIFAR-10 image dataset.
The talk is based on work with Guillaume Martinet, and ongoing work with Steve Hanneke.
Friday, Mar. 20, 2020 in TTIC 526: Filip Hanzely, KAUST
Fall 2019: Fridays at 10:30
SPECIAL TIME: Wednesday, Oct. 2, 2019 at 1:30pm in TTIC room 526: Arthur Szlam, Facebook Research
Language and Interaction in Minecraft
Abstract: I will discuss a research program aimed at building a Minecraft assistant, in order to facilitate the study of agents that can complete tasks specified by dialogue, and eventually, to learn from dialogue interactions. I will describe the tools and platform we have built allowing players to interact with the agents and to record those interactions, and the data we have collected. In addition, I will describe an initial agent from which we (and hopefully others) can iterate.
Friday, Oct. 11, 2019 in TTIC room 526: Peter Bartlett, UC-Berkeley
Optimizing probability distributions for machine learning: sampling meets optimization
Abstract: Optimization and sampling are both of central importance in large-scale machine learning problems, but they are typically viewed as very different problems. This talk presents recent results that exploit the interplay between them. Viewing Markov chain Monte Carlo sampling algorithms as performing an optimization over the space of probability distributions, we demonstrate analogs of Nesterov’s acceleration approach in the sampling domain, in the form of a discretization of an underdamped Langevin diffusion. In the other direction, we view stochastic gradient optimization methods, such as those that are common in deep learning, as sampling algorithms, and study the finite-time convergence of their iterates to a stationary distribution.
Joint work with Yasin Abbasi-Yadkori, Niladri Chatterji, Xiang Cheng, Michael Jordan, Yi-An Ma, Wenlong Mou, and Martin Wainwright.
Bio: Peter Bartlett is a professor in the Computer Science Division and Department of Statistics and Associate Director of the Simons Institute for the Theory of Computing at the University of California at Berkeley. His research interests include machine learning and statistical learning theory. He is the co-author, with Martin Anthony, of the book Neural Network Learning: Theoretical Foundations. He has served as an associate editor of the journals Bernoulli, Mathematics of Operations Research, the Journal of Artificial Intelligence Research, the Journal of Machine Learning Research, and the IEEE Transactions on Information Theory, and as program committee co-chair for COLT and NeurIPS. He was awarded the Malcolm McIntosh Prize for Physical Scientist of the Year in Australia in 2001, and was chosen as an Institute of Mathematical Statistics Medallion Lecturer in 2008, an IMS Fellow and Australian Laureate Fellow in 2011, and a Fellow of the ACM in 2018. He was elected to the Australian Academy of Science in 2015.
Friday, Oct. 18, 2019 in Crerar room 390: Madeleine Udell, Cornell
Big data is low rank
Abstract: Matrices of low rank are pervasive in big data, appearing in recommender systems, movie preferences, topic models, medical records, and genomics. While there is a vast literature on how to exploit low rank structure in these datasets, there is less attention on explaining why low rank structure appears in the first place. In this talk, we explain the abundance of low rank matrices in big data by proving that certain latent variable models associated to piecewise analytic functions are of log-rank. Any large matrix from such a latent variable model can be approximated, up to a small error, by a low rank matrix. Armed with this theorem, we show how to use a low rank modeling framework to exploit low rank structure even for datasets that are not numeric, with applications in the social sciences, medicine, and automated machine learning.
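A quick numerical illustration of the claim (a toy example of mine, not from the talk): a large matrix sampled from a smooth latent variable model has rapidly decaying singular values, so its numerical rank grows far slower than its size.

import numpy as np

rng = np.random.default_rng(0)
u = rng.uniform(size=(1000, 1))      # row latent variables
v = rng.uniform(size=(1, 1000))      # column latent variables
M = 1.0 / (1.0 + u + v)              # entries from an analytic function of (u, v)
s = np.linalg.svd(M, compute_uv=False)
print(np.sum(s > 1e-8 * s[0]))       # numerical rank: tiny relative to 1000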
Friday, Oct. 25, 2019 in TTIC room 526: Shay Moran, Technion
Convex Set Disjointness, Distributed Learning of Halfspaces, and LP Feasibility
Abstract: We study the Convex Set Disjointness (CSD) problem, where two players have input sets taken from an arbitrary fixed domain U ⊂ R^d of size |U| = n. Their mutual goal is to decide using minimum communication whether the convex hulls of their sets intersect (equivalently, whether their sets can be separated by a hyperplane).
Different forms of this problem naturally arise in distributed learning and optimization: it is equivalent to Distributed Linear Program (LP) Feasibility, a basic task in distributed optimization, and it is tightly linked to Distributed Learning of Halfspaces in R^d.
In communication complexity theory, CSD can be viewed as a geometric interpolation between the classical problems of Set Disjointness (when d ≥ n − 1) and Greater-Than (when d = 1).
We establish a nearly tight bound of Θ̃(d log n) on the communication complexity of learning halfspaces in R^d.
For Convex Set Disjointness (and the equivalent task of distributed LP feasibility) we derive upper and lower bounds of Õ(d² log n) and Ω̃(d log n). These results improve upon several previous works in distributed learning and optimization.
Unlike typical works in communication complexity, the main technical contribution of this work lies in the upper bounds. In particular, our protocols are based on a Container Lemma for Halfspaces and on two variants of Carathéodory’s Theorem, which may be of independent interest. These geometric statements are used by our protocols to provide a compressed summary of the players’ input.
Joint work with Mark Braverman, Gillat Kol, and Raghuvansh R. Saxena (Princeton University).
Friday, Nov. 1, 2019 in Crerar room 390: Ohad Shamir, Weizmann
Training Neural Networks: The Bigger the Better?
Abstract: Artificial neural networks are nowadays routinely trained to solve challenging learning tasks, but our theoretical understanding of this phenomenon remains quite limited. One increasingly popular approach, which is aligned with practice, is to study how making the network sufficiently large (a.k.a. “over-parameterized”) makes the associated training problem easier. In this talk, I’ll describe some of the possibilities and challenges in understanding neural networks using this approach.
Based on joint works with Itay Safran and Gilad Yehudai.
Bio: Ohad Shamir is a faculty member at the Department of Computer Science and Applied Mathematics at the Weizmann Institute. He received his PhD in 2010 at the Hebrew University, and between 2010-2013 and 2017-2018 was a researcher at Microsoft Research in Boston. His research focuses on theoretical machine learning, in areas such as theory of deep learning, learning with information and communication constraints, and topics at the intersection of machine learning and optimization. He received several awards, and served as program co-chair of COLT as well as a member of its steering committee.
Special Talk: Distinguished Lecture at TTIC, Friday, Nov. 8, 2019 in TTIC room 526: Shafi Goldwasser, MIT
Friday, Nov. 15, 2019 in Crerar 390: Greg Ongie, University of Chicago; and Blake Woodworth, TTIC
Greg Ongie: A function space view of overparameterized neural networks
Abstract: Contrary to classical bias/variance tradeoffs, deep learning practitioners have observed that vastly overparameterized neural networks with the capacity to fit virtually any labels nevertheless generalize well when trained on real data. One possible explanation of this phenomenon is that complexity control is being achieved by implicitly or explicitly controlling the magnitude of the weights of the network. This raises the question: What functions are well-approximated by neural networks whose weights are bounded in norm? In this talk, I will give some partial answers to this question. In particular, I will give a precise characterization of the space of functions realizable as a two-layer (i.e., one hidden layer) neural network with ReLU activations having an unbounded number of units, but where the Euclidean norm of the weights in the network remains bounded. Surprisingly, this characterization is naturally posed in terms of the Radon transform as used in computational imaging, and I will show how tools from Radon transform analysis yield novel insights about learning with two and three-layer ReLU networks.
Blake Woodworth: The complexity of finding stationary points in convex and non-convex optimization
Abstract: Non-convex optimization algorithms typically guarantee convergence to approximate stationary points of the objective. However, the fundamental complexity of finding such points is poorly understood, even in the convex setting, and especially in comparison to our very thorough understanding of the complexity of finding points with near optimal function value in the convex case. In this talk, I will discuss two recent papers in which we tightly bound the stochastic first-order oracle complexity of finding an approximate stationary point, first for the convex case and then for the non-convex case. An important implication of our work is that, in a certain sense, plain SGD is an optimal algorithm for stochastic non-convex optimization.
Friday, Nov. 22, 2019 in TTIC 526: Ramya Vinayak, University of Washington
Learning from Sparse Data
Abstract: In many scientific domains, the number of individuals in the population under study is often very large; however, the number of observations available per individual is often very limited (sparse). Limited observations prohibit accurate estimation of parameters of interest for any given individual. In this sparse data regime, the key question is, how accurately can we estimate the distribution of parameters over the population? This problem arises in various domains such as epidemiology, psychology, health care, biology, and social sciences. As an example, suppose for a large random sample of the population we have observations of whether a person caught the flu for each year over the past 5 years. We cannot accurately estimate the probability of any given person catching the flu with only 5 observations, however our goal is to estimate the distribution of these probabilities over the whole population. Such an estimated distribution can be used in downstream tasks, like testing and estimating properties of the distribution.
In this talk, I will present our recent results where we show that the maximum likelihood estimator (MLE) is minimax optimal in the sparse observation regime. While the MLE for this problem was proposed as early as the late 1960’s, how accurately the MLE recovers the true distribution was not known. Our work closes this gap. In the course of our analysis, we provide novel bounds on the coefficients of Bernstein polynomials approximating Lipschitz-1 functions. Furthermore, the MLE is also efficiently computable in this setting and we evaluate the performance of MLE on both synthetic and real datasets.
Joint work with Weihao Kong, Gregory Valiant, and Sham Kakade.
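To make the setup concrete, here is a hedged sketch (an illustration of mine; the paper analyzes the exact nonparametric MLE) of computing a grid-based MLE by EM on the implied binomial mixture:

import numpy as np
from scipy.stats import binom

def npmle_em(counts, t=5, n_grid=50, iters=500):
    # Each person contributes a count out of t trials; estimate the mixing
    # distribution of success probabilities over a fixed grid via EM.
    grid = np.linspace(0.01, 0.99, n_grid)
    w = np.full(n_grid, 1.0 / n_grid)                 # mixing weights over grid
    L = binom.pmf(counts[:, None], t, grid[None, :])  # n x n_grid likelihoods
    for _ in range(iters):
        post = w * L
        post /= post.sum(axis=1, keepdims=True)       # E-step: posterior over grid
        w = post.mean(axis=0)                         # M-step: re-estimate weights
    return grid, w

# Toy usage: a Beta(2, 5) population observed through 5 coin flips per person.
rng = np.random.default_rng(1)
counts = rng.binomial(5, rng.beta(2, 5, size=10_000))
grid, w = npmle_em(counts)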
Bio: Ramya Korlakai Vinayak is a postdoctoral researcher at the Paul G. Allen School of Computer Science and Engineering at the University of Washington, working with Sham Kakade. Her research interests broadly span the areas of machine learning, statistical inference, and crowdsourcing. She received a Ph.D. from Caltech where she was advised by Babak Hassibi. She is a recipient of the Schlumberger Foundation Faculty of the Future fellowship from 2013-15. She obtained her Master’s from Caltech and her Bachelor’s from IIT Madras.
Friday, Nov. 29, 2019: Thanksgiving break
Friday, Dec. 6, 2019 in TTIC 526: Zhaoran Wang, Northwestern
On the Computational and Statistical Efficiency of Policy Optimization in (Deep) Reinforcement Learning
Abstract: Coupled with powerful function approximators such as neural networks, policy optimization plays a key role in the tremendous empirical successes of deep reinforcement learning. In sharp contrast, the theoretical understandings of policy optimization remain rather limited from both the computational and statistical perspectives. From the perspective of computational efficiency, it remains unclear whether policy optimization converges to the globally optimal policy in a finite number of iterations, even given infinite data. From the perspective of statistical efficiency, it remains unclear how to attain the globally optimal policy with a finite regret or sample complexity.
To address the computational question, I will show that, under suitable conditions, natural policy gradient/proximal policy optimization/trust-region policy optimization (NPG/PPO/TRPO) converges to the globally optimal policy at a sublinear rate, even when it is coupled with neural networks. To address the statistical question, I will present an optimistic variant of NPG/PPO/TRPO, namely OPPO, which incorporates exploration in a principled manner and attains √T regret.
(Joint work with Qi Cai, Chi Jin, Jason Lee, Boyi Liu, Zhuoran Yang)
Bio: Zhaoran Wang is an assistant professor at Northwestern University, working at the interface of machine learning, statistics, and optimization. He is the recipient of the AISTATS (Artificial Intelligence and Statistics Conference) notable paper award, ASA (American Statistical Association) best student paper in statistical learning and data mining, INFORMS (Institute for Operations Research and the Management Sciences) best student paper finalist in data mining, and the Microsoft fellowship.
Friday, Dec. 13, 2019: NeurIPS
Academic year 2018-2019
Wednesdays, 1-2pm in the Harper Center (Booth), Room 219. Pizza provided by the UChicago CS Department.
April 3, 2019 in Crerar room 390: Lorenzo Orecchia, Boston University
First-Order Methods Unleashed: Scalable Optimization in the Age of Big Data
Abstract: First-order methods are a fundamental tool in the design of efficient algorithms for large-scale computational problems. Besides being the optimization workhorse of machine learning, first-order methods have recently served as a springboard for a number of algorithmic advances in discrete optimization, including submodular optimization and maximum flow problems. In this talk, I will showcase a number of results from my research that demonstrate the power of first-order methods as a generic framework for algorithm design.
In the first part, I will describe my view of first-order methods as discretizations of continuous dynamical systems over curved spaces. For convex optimization, such dynamics conserve a specific quantity, the product of time and a notion of duality gap, which immediately guarantees convergence to optimum. This primal-dual view helps us to both design novel algorithms and simplify the analyses of existing ones. In particular, I will discuss how it yields a simple, intuitive analysis of accelerated algorithms and how it allows us to port such algorithms to contexts that do not squarely match standard smoothness assumptions.
In the second part, we will see how to exploit problem-specific structure by preconditioning, i.e., by endowing the space with a curved geometry that facilitates the convergence of the dynamics above. In particular, I will describe how different random-walk-based algorithms for graph partitioning arise from different preconditionings of the same optimization problem, and how combinatorial preconditioners yield nearly-linear-time algorithms for flow problems over undirected graphs.
Bio: Lorenzo Orecchia is an assistant professor in the Department of Computer Science at Boston University. Lorenzo’s research focuses on the design of efficient algorithms for fundamental computational challenges in machine learning and combinatorial optimization. His approach is based on combining ideas from continuous and discrete optimization into a single framework for algorithm design. Lorenzo obtained his PhD in computer science at UC Berkeley under the supervision of Satish Rao in 2011, and was an applied mathematics instructor at MIT under the supervision of Jon Kelner until 2014. He was a recipient of the 2014 SODA Best Paper award and a co-organizer of the Simons semester “Bridging Continuous and Discrete Optimization” in Fall 2017.
April 10, 2019 in room C25: Mesrob Ohannessian, TTIC
From Fair Decisions to Social Benefit
Abstract: Using data efficiently is no longer our only burden. As automated decisions in socio-technical systems affect more lives than ever, it is imperative that we also use data responsibly. Being non-discriminatory and fair is one such responsibility. To date, explorations of fairness in intelligent decision systems have mostly ignored long-term influence on the underlying population. In this talk I give a first comprehensive perspective of this problem. Among the various insights I provide is quantifying both sides of the mismatch hypothesis: when can we hope that affirmative action is indeed beneficial to society?
Bio: Mesrob I. Ohannessian is a Research Assistant Professor at the Toyota Technological Institute at Chicago. He was previously a postdoc at UCSD, MSR-Inria, and Université Paris-Sud. He received his PhD in EECS from MIT. His research interests are in machine learning, statistics, information theory, and their applications, particularly to problems marked by data scarcity and to decisions that affect society.
April 17, 2019 in room C04: Lin Yang, Princeton University
Learn Policy Optimally via Efficiently Utilizing Data
Abstract: Recent years have witnessed increasing empirical successes in reinforcement learning. Nevertheless, it is an irony that many theoretical problems in this field are not well understood even in the most basic setting. For instance, the optimal sample and time complexities of policy learning in finite-state Markov decision processes still remain unclear. Given a state-transition sampler, we develop a novel algorithm that learns an approximately optimal policy in near-optimal time and using a minimal number of samples. The algorithm makes updates by processing samples in a “streaming” fashion, which requires small memory and naturally adapts to large-scale data. Our result resolves the long-standing open problem on the sample complexity of Markov decision processes and provides new insights on how to use data efficiently in learning and optimization. The algorithm and analysis can be extended to solve two-person stochastic games and feature-based Markov decision problems while achieving near-optimal sample complexity. We will further illustrate several other examples of learning and optimization over streaming data, with applications in accelerating astrophysical discoveries and improving network security.
Bio: Lin Yang is currently a postdoctoral researcher at Princeton University working with Prof. Mengdi Wang. He obtained two Ph.D. degrees simultaneously, in Computer Science and in Physics & Astronomy, from Johns Hopkins University in 2017. Prior to that, he obtained a bachelor’s degree from Tsinghua University. His research focuses on developing fast algorithms for large-scale optimization and machine learning, including reinforcement learning and streaming methods for optimization and function approximation. His algorithms have been applied to real-world applications including accelerating astrophysical discoveries and improving network security. He has published numerous papers in top computer science conferences including NeurIPS, ICML, STOC, and PODS. At Johns Hopkins, he was a recipient of the Dean Robert H. Roy Fellowship.
April 24, 2019: Avrim Blum, TTIC
Algorithmic fairness in online decision-making
There is growing concern about fairness in algorithmic decision making: Are algorithmic decisions treating different groups fairly? How can we make them fairer? What do we even mean by fair? In this talk I will discuss some of our work on this topic, focusing on the setting of online decision making. For instance, a classic result states that given a collection of predictors, one can adaptively combine them to perform nearly as well as the best of those predictors in hindsight (achieve “no regret”) without any stochastic assumptions. Can one extend this guarantee so that if the predictors are themselves fair (according to a given definition), then the overall combination is fair as well (according to the same definition)? I will discuss this and other issues. This is joint work with Suriya Gunasekar, Thodoris Lykouris, and Nati Srebro.
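For context, the classic adaptive-combination result referenced here is achieved by multiplicative weights (Hedge); a minimal sketch:

import numpy as np

def hedge(losses, eta=0.1):
    # losses: (T, n) array of per-round losses for n predictors in [0, 1].
    # Returns the weight vector played at each round.
    T, n = losses.shape
    w = np.ones(n) / n
    history = []
    for t in range(T):
        history.append(w.copy())
        w = w * np.exp(-eta * losses[t])   # downweight predictors that erred
        w /= w.sum()
    return np.array(history)

The fairness question in the talk asks whether such a combination inherits fairness properties from the individual predictors.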
April 29, 2019: (2:30pm in Crerar 390) Tong Zhang, Hong Kong University of Science and Technology.
Modern Techniques of Statistical Optimization for Machine Learning
Abstract: Many problems in machine learning rely on statistics and optimization. To solve these problems, new techniques are needed. I will show some of these new techniques through some machine learning problems I have recently worked on, such as nonconvex stochastic optimization, distributed training, adversarial attacks, and generative models.
Bio: Tong Zhang is a Professor of Computer Science and Mathematics at The Hong Kong University of Science and Technology. Previously, he was a professor at Rutgers University, and worked at IBM, Yahoo, Baidu, and Tencent. Tong Zhang’s research interests include machine learning algorithms and theory, statistical methods for big data, and their applications. He is a fellow of the ASA and IMS, and he has been on the editorial boards of leading machine learning journals and the program committees of top machine learning conferences. Tong Zhang received a B.A. in mathematics and computer science from Cornell University and a Ph.D. in Computer Science from Stanford University.
May 1, 2019: Tengyuan Liang, University of Chicago Booth
New Thoughts on Adaptivity, Generalization and Interpolation Motivated from Neural Networks
Consider the problem: given a data pair (x, y) drawn from a population with f_*(x) = E[y|x], specify a neural network and run gradient flow on the weights over time until reaching any stationarity. How does f_t, the function computed by the neural network at time t, relate to f_*, in terms of approximation and representation? What are the provable benefits of the adaptive representation by neural networks compared to the pre-specified fixed basis representation in the classical nonparametric literature? We answer the above questions via a dynamic reproducing kernel Hilbert space (RKHS) approach indexed by the training process of neural networks. We show that when reaching any local stationarity, gradient flow learns an adaptive RKHS representation and performs the global least squares projection onto the adaptive RKHS, simultaneously. In addition, we prove that as the RKHS is data-adaptive and task-specific, the residual for f_* lies in a subspace that is smaller than the orthogonal complement of the RKHS, formalizing the representation and approximation benefits of neural networks.
Then we will move on to discuss generalization for interpolation methods in RKHS. In the absence of explicit regularization, Kernel “Ridgeless” Regression with nonlinear kernels has the potential to fit the training data perfectly. It has been observed empirically, however, that such interpolated solutions can still generalize well on test data. We isolate a phenomenon of implicit regularization for minimum-norm interpolated solutions which is due to a combination of high dimensionality of the input data, curvature of the kernel function, and favorable geometric properties of the data such as an eigenvalue decay of the empirical covariance and kernel matrices. In addition to deriving a data-dependent upper bound on the out-of-sample error, we present experimental evidence suggesting that the phenomenon occurs in the MNIST dataset.
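To fix ideas on the interpolation part, a tiny sketch of minimum-norm (“ridgeless”) kernel regression, assuming a Gaussian kernel:

import numpy as np

def gauss_kernel(A, B, gamma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

def ridgeless_fit_predict(X, y, X_test, gamma=1.0):
    K = gauss_kernel(X, X, gamma)
    alpha = np.linalg.solve(K, y)   # no ridge term: K @ alpha = y exactly,
    # so the fitted function interpolates every training point.
    return gauss_kernel(X_test, X, gamma) @ alpha

Despite fitting the training data perfectly, such predictors can generalize, which is the implicit-regularization phenomenon the talk analyzes.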
May 8, 2019: Rosemary Braun, Northwestern University
Using Gene Expression to Tell Time
Abstract: Determining the state of an individual’s internal physiological clock has important implications for precision medicine, from diagnosing neurological disorders to optimizing drug delivery. To be useful, such a test must be accurate, minimally burdensome to the patient, and robust to differences in patient protocols, sample collection, and assay technologies. In this talk I will present TimeSignature, a novel machine-learning algorithm to estimate circadian variables from gene expression in human blood. By making use of the high dimensionality of the gene expression measurements and exploiting the periodic nature of the circadian variables we wish to predict, TimeSignature can be applied to samples from disparate studies and yield highly accurate results despite systematic differences between the studies. This generalizability is unique amongst expression-based predictors and addresses a major challenge in the development of robust biomarker tests. This talk will detail the method, present several applications, and discuss our recent work to extend it.
Bio: Rosemary Braun (PhD, MPH) is an Assistant Professor in the Division of Biostatistics (Dept of Preventive Medicine, Feinberg School of Medicine) and the Department of Engineering Sciences and Applied Mathematics at Northwestern University. Dr Braun’s overarching research interests are in the development of mathematical and computational methods to elucidate how large-scale biological phenomena emerge from the complex interplay of thousands of microscopic interactions. To this end, her research group develops machine-learning methods for the statistical analysis of high-dimensional omics data; graph-theoretical approaches for modeling the behavior of regulatory networks; and dynamical simulations to study molecular interactions. Recent publications have appeared in PNAS, Nucleic Acids Research, and Bioinformatics. Dr Braun obtained her PhD in Physics in 2004 from the University of Illinois at Urbana-Champaign, where she studied the statistical physics of living systems under Klaus Schulten. This was followed by an MPH (Concentration in Biostatistics) at Johns Hopkins in 2006. Prior to joining Northwestern, she was a Postdoctoral Fellow at the National Cancer Institute (2006-2011).
May 15, 2019: Yali Amit, UChicago Statistics
Optimization of latent variables in deep network applications
Abstract: The primary use of deep networks is to provide direct prediction of response variables, whether discrete or continuous, following a forward pass through the network. In some cases, however, we know that the values of certain well-defined but unobserved latent variables are key to successful prediction, for example the scale or rotation of an object in an image. In deep learning the typical solution is to provide extensive (also known as augmented) training sets where the expected range of values of the latent variables is well represented. In generative modeling it is more common to define a statistical model with a distribution of the observed data conditional on some unobserved latent variables, and online optimization or simulation with respect to the latent variables is required when computing likelihoods. I will describe some experiments with ‘deformable classifiers’, where we train a deep network together with an optimization step over predefined latent variables, and require this optimization to also be performed online during test time. I will show that this enables learning with much smaller data sets, at the cost of more intensive computation, and provides as output not just the class label but the optimal instantiation of the latent variables for the test example. I will also show some applications of these ideas for training generator networks.
Bio: Yali Amit is Professor of Statistics and Computer Science at the University of Chicago. His fields of interest are computer vision, machine learning, and computational neuroscience.
May 22, 2019: UChicago Geometric Data Analysis Workshop
May 29, 2019 in SHFE 203 (Saieh Hall For Economics): Dan McDonald
Trend filtering in exponential families
Trend filtering is a modern approach to nonparametric regression that is more adaptive to local smoothness than splines or basis procedures. Current analysis of trend filtering focuses on estimating a function corrupted by Gaussian noise, but our work extends this technique to general exponential family distributions. This extension is motivated by the need to study massive, gridded climate data derived from polar-orbiting satellites. We present algorithms tailored to large problems, theoretical results for general loss functions, and principled methods for tuning parameter selection without excess computation.
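A small sketch of the Gaussian-loss l1 trend filtering problem that this work generalizes (using cvxpy for the convex solve; the exponential-family extension replaces the squared loss with the appropriate negative log-likelihood):

import cvxpy as cp
import numpy as np

def trend_filter(y, lam=10.0, order=2):
    # Minimize 0.5 * ||y - beta||^2 + lam * ||D beta||_1, where D takes
    # discrete differences of the given order (order=2 gives piecewise-linear fits).
    n = len(y)
    beta = cp.Variable(n)
    D = np.diff(np.eye(n), n=order, axis=0)
    objective = cp.Minimize(0.5 * cp.sum_squares(y - beta) + lam * cp.norm1(D @ beta))
    cp.Problem(objective).solve()
    return beta.value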
Past Quarters:
Jan. 16, 2019: Sumeet Katariya, UW-Madison
Adaptive Sampling for Ranking and Clustering
Abstract: The multi-armed bandit framework has been used to study sequential decision making for various objectives: minimizing regret, finding the best item, finding the top-k items. In this work we study sequential decision making for *ranking*, where the goal is to sort items according to their means by adaptively sampling from their unknown reward distributions. An approximate notion of ranking is clustered or quantile ranking, where the goal is to find not the exact rank of an item, but only which quantile it belongs to. This formulation is motivated by a social science application at the Knowledge Lab at UChicago. In this talk, I will discuss a computationally efficient PAC algorithm for quantile ranking and its analysis. This setup naturally leads to the problem of clustering, which in its simplest form reduces to the problem of identifying the maximum gap between the means, and I will also discuss algorithms for adaptive clustering. This is joint work with Ardhendu Tripathy, Lalit Jain, Nandana Sengupta, James Evans, and Rob Nowak.
Jan. 23, 2019: Karl Stratos, TTIC
Challenges and Hopes of Unsupervised Learning by Mutual Information Maximization
Abstract: The objective of maximizing mutual information has a rich history of success in unsupervised learning and is recently of great interest to the deep learning community. Unfortunately, measuring mutual information from a finite sample is a notoriously difficult task. As a result, past works often resort to an approximation by a sample-estimatable lower bound. In this work, we give universal limitations of measuring mutual information that no such approach can overcome. Specifically, we show that any distribution-free high-confidence lower bound on mutual information estimated from N samples is at most logarithmic in N, which implies that the lower bound methodology is fundamentally flawed whenever the true mutual information is nontrivial. We argue that a better approach is to treat mutual information as a difference of entropies and measure each entropy by the cross-entropy upper bound. We present an empirical investigation with translated sentences and related news articles to support our argument. This is joint work with David McAllester.
Feb. 6, 2019: Chao Gao, University of Chicago Statistics
Convergence Rates of Variational Posterior Distributions
We study convergence rates of variational posterior distributions for nonparametric and high-dimensional inference. We formulate general conditions on prior, likelihood, and variational class that characterize the convergence rates. Under similar “prior mass and testing” conditions considered in the literature, the rate is found to be the sum of two terms. The first term stands for the convergence rate of the true posterior distribution, and the second term is contributed by the variational approximation error. For a class of priors that admit the structure of a mixture of product measures, we propose a novel prior mass condition, under which the variational approximation error of the generalized mean-field class is dominated by convergence rate of the true posterior. We demonstrate the applicability of our general results for various models, prior distributions and variational classes by deriving convergence rates of the corresponding variational posteriors.
Feb. 13, 2019: Veronika Rockova, University of Chicago, Booth
On Theory for BART
Since their inception in the 1980s, regression trees have been one of the more widely used non-parametric prediction methods. Tree-structured methods yield a histogram reconstruction of the regression surface, where the bins correspond to terminal nodes of recursive partitioning. Trees are powerful, yet susceptible to over-fitting. Strategies against overfitting have traditionally relied on pruning greedily grown trees. The Bayesian framework offers an alternative remedy against overfitting through priors. Roughly speaking, a good prior charges smaller trees where overfitting does not occur. While the consistency of random histograms, trees and their ensembles has been studied quite extensively, the theoretical understanding of the Bayesian counterparts has been missing. In this paper, we take first steps towards understanding why/when do Bayesian trees and forests not overfit. To address this question, we study the speed at which the posterior concentrates around the true smooth regression function. We propose a spike-and-tree variant of the popular Bayesian CART prior and establish new theoretical results showing that regression trees (and forests) (a) are capable of recovering smooth regression surfaces (with smoothness not exceeding one), achieving optimal rates up to a log factor, (b) can adapt to the unknown level of smoothness and (c) can perform effective dimension reduction when p > n. Going further, we also show semi and non-parametric Bernstein-von Mises theorems showing that BART is fundamentally justified from a frequentist point of view. These results provide a piece of missing theoretical evidence explaining why Bayesian trees (and additive variants thereof) have worked so well in practice.
Feb. 20, 2019: Ruey Tsay, University of Chicago, Booth
Analysis of Big Dependent Data
Most work in machine learning and high-dimensional statistical inference assumes independence. Most data, on the other hand, are either dynamically or spatially dependent. In this talk, I discuss the impact of dependence on statistical inference in high-dimensional data analysis, including LASSO regression and generalized linear models. The presentation draws on some recent joint papers focusing on modeling big dependent data for which either the dimension or the sample size or both are large. We demonstrate the analyses using PM2.5 data collected at multiple monitoring stations and at various frequencies.
Feb. 27, 2019: No Seminar
March 5, 2019 (TUESDAY), Harper Center 3B from 12:10 to 1:10 p.m.: Jorge Nocedal, Northwestern University
Zero-Order Methods for the Optimization of Noisy Functions
This talk presents a finite difference quasi-Newton method for the minimization of noisy functions. The method takes advantage of the scalability and power of BFGS updating, and employs an adaptive procedure for choosing the differencing interval h based on the noise estimation techniques of Hamming and of Moré and Wild. This noise estimation procedure and the selection of h are inexpensive but not always accurate, and to prevent failures the algorithm incorporates a recovery mechanism that takes appropriate action in the case when the line search procedure is unable to produce an acceptable point. A novel convergence analysis is presented that considers the effect of a noisy line search procedure. Numerical experiments comparing the method to a model based trust region method are presented.
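As a rough illustration of the ingredients (a sketch of mine, with the differencing interval chosen as h ~ sqrt(eps_f), the classic balance of truncation error against noise for forward differences; the talk's adaptive procedure is more refined):

import numpy as np

def fd_gradient(f, x, eps_f):
    # Forward-difference gradient of a noisy function with noise level eps_f.
    h = 2.0 * np.sqrt(eps_f)   # assumed O(sqrt(eps_f)) differencing interval
    g = np.empty_like(x)
    fx = f(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - fx) / h
    return g

Such gradient estimates can then drive BFGS updates, with h re-estimated when the line search fails.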
March 6, 2019: Bryan Pardo, Northwestern University
Audio source separation models that learn without ground truth and are open to user correction
Abstract: Separating an audio scene into isolated sources is a fundamental problem in computer audition, analogous to image segmentation in visual scene analysis. It is an enabling technology for many tasks, such as automatic speech recognition, labeling sound objects in an acoustic scene, music transcription, and remixing of existing recordings. Source separation systems based on deep learning are currently the most successful approaches for solving the underdetermined separation problem, where there are more sound sources (e.g. instruments in a band) than channels (a stereo recording has two channels). Currently, deep learning systems that perform source separation are trained on many mixtures (e.g., tens of thousands) for which the ground truth decompositions are already known. Since most real-world recordings have no such decomposition available, developers train systems on artificial mixtures created from isolated individual recordings. Although there are large databases of isolated speech, it is impractical to find or build large databases of isolated recordings for every arbitrary sound. This fundamentally limits the range of sounds that deep models can learn to separate. Once learned, a deep model’s output is take-it-or-leave-it, and it can be difficult for the end user to affect either the current output or to give corrective feedback for the future.
In this talk Prof. Pardo discusses recent work in two areas. The first is bootstrapping learning of a scene segmentation model using an acoustic cue known to be used in human audition. This allows learning a model without access to ground-truth decompositions of acoustic scenes. The second is ongoing work to provide an interface for an end user to interact with a deep model, to affect the current separation and improve future separation by allowing for retraining of the model from corrective feedback.
Bio: Bryan Pardo is an associate professor in the Northwestern University Department of Electrical Engineering and Computer Science. Prof. Pardo received a M. Mus. in Jazz Studies in 2001 and a Ph.D. in Computer Science in 2005, both from the University of Michigan. He has authored over 100 peer-reviewed publications. He has developed speech analysis software for the Speech and Hearing department of the Ohio State University, statistical software for SPSS, and worked as a machine learning researcher for General Dynamics. While finishing his doctorate, he taught in the Music Department of Madonna University.
March 13, 2019: Sebastian Stich, EPFL
Error Feedback for Communication Efficient SGD
Huge-scale machine learning problems are nowadays tackled by distributed optimization algorithms, i.e. algorithms that leverage the compute power of many devices for training. The communication overhead is a key bottleneck that hinders perfect scalability. Various recent works proposed to use quantization or sparsification techniques to reduce the amount of data that needs to be communicated. We analyze Stochastic Gradient Descent (SGD) with k-sparsification (for instance top-k or random-k) and compression (for instance quantization) and show that these schemes converge at the same rate as vanilla SGD when equipped with error compensation (i.e. keeping track of accumulated errors in memory). That is, communication can be reduced by a factor of the dimension of the problem (sometimes even more) whilst still converging at the same rate.
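A single-worker sketch of the error-compensation mechanism (an illustration of mine, assuming top-k sparsification as the compressor):

import numpy as np

def top_k(v, k):
    # Keep only the k largest-magnitude coordinates of v.
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def ef_sgd_step(w, grad, memory, lr=0.1, k=1):
    p = lr * grad + memory   # add back what compression previously dropped
    update = top_k(p, k)     # only k coordinates need to be communicated
    memory = p - update      # remember the discarded mass for next time
    return w - update, memory

# Toy usage: minimize f(w) = ||w||^2 / 2, whose gradient is w.
w, mem = np.array([1.0, -2.0, 3.0]), np.zeros(3)
for _ in range(200):
    w, mem = ef_sgd_step(w, grad=w, memory=mem, lr=0.1, k=1)
print(w)  # approaches the origin despite one-coordinate updates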
March 20, 2019: Spring Break, No Seminar
March 27, 2019: Spring Break, No Seminar