2022-23
By Radu V. Craiu and Evgeny Levi
Annual Review of Statistics and Its Application | 2023 | 10, 379-400
Rich data generating mechanisms are ubiquitous in this age of information and require complex statistical models to draw meaningful inference. While Bayesian analysis has seen enormous development in the last 30 years, benefiting from the impetus given by the successful application of Markov chain Monte Carlo (MCMC) sampling, the combination of big data and complex models conspires to produce significant challenges for traditional MCMC algorithms. We review modern algorithmic developments that address these challenges and compare their performance using numerical experiments.
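Illustrative sketch (not from the review): the computational bottleneck being addressed is visible in a generic random-walk Metropolis sampler, where every accept/reject decision requires a likelihood evaluation over the entire dataset. The Gaussian model, data size, and step size below are placeholder assumptions.

```python
import numpy as np

# Generic random-walk Metropolis for a posterior over theta given data y.
# Each iteration evaluates log_post over all of y -- the per-iteration cost
# that scalable MCMC methods (subsampling, divide-and-conquer, approximate
# kernels) are designed to reduce.
rng = np.random.default_rng(1)
y = rng.normal(loc=2.0, scale=1.0, size=1_000_000)    # "big data" placeholder

def log_post(theta):
    # N(0, 10^2) prior and N(theta, 1) likelihood; the sum runs over all observations
    return -0.5 * theta**2 / 100.0 - 0.5 * np.sum((y - theta) ** 2)

theta, lp = 0.0, log_post(0.0)
for _ in range(200):
    prop = theta + 0.01 * rng.normal()
    lp_prop = log_post(prop)                          # full-data evaluation
    if np.log(rng.uniform()) < lp_prop - lp:
        theta, lp = prop, lp_prop
```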
By Lindsay Katz and Rohan Alexander
Scientific Data | 2023 | 10, 567
Public knowledge of what is said in parliament is a tenet of democracy, and a critical resource for political science research. In Australia, following the British tradition, the written record of what is said in parliament is known as Hansard. While the Australian Hansard has always been publicly available, it has been difficult to use for the purpose of large-scale macro- and micro-level text analysis because it has only been available as PDFs or XMLs. Following the lead of the Linked Parliamentary Data project which achieved this for Canada, we provide a new, comprehensive, high-quality, rectangular database that captures proceedings of the Australian parliamentary debates from 1998 to 2022. The database is publicly available and can be linked to other datasets such as election results. The creation and accessibility of this database enables the exploration of new questions and serves as a valuable resource for both researchers and policymakers.
By Roberto Casarin, Radu V. Craiu, Lorenzo Frattarolo and Christian Robert
Statistical Science | 2023 (accepted)
We identify recurrent ingredients in the antithetic sampling literature, leading to a unified sampling framework. We introduce a new class of antithetic schemes that includes the most widely used antithetic proposals.
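Illustrative sketch (background only, not the new class of schemes introduced in the paper): the classical antithetic construction pairs each uniform draw U with 1 − U so that negatively correlated evaluations of a monotone integrand partially cancel each other's noise.

```python
import numpy as np

# Classical antithetic variates for estimating E[f(U)], U ~ Uniform(0, 1).
rng = np.random.default_rng(0)
f = np.exp                                      # monotone integrand, true mean e - 1
n = 100_000

u = rng.uniform(size=n)
plain = f(rng.uniform(size=2 * n)).mean()       # 2n independent evaluations
antithetic = 0.5 * (f(u) + f(1.0 - u)).mean()   # n antithetic pairs, also 2n evaluations

# For monotone f, f(U) and f(1 - U) are negatively correlated, so the
# antithetic estimator has lower variance at the same computational cost.
```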
By Annie Collins and Rohan Alexander
Scientometrics | 2022
To examine the reproducibility of COVID-19 research, we create a dataset of pre-prints posted to arXiv, bioRxiv, and medRxiv between 28 January 2020 and 30 June 2021 that are related to COVID-19. We extract the text from these pre-prints and parse them looking for keyword markers signaling the availability of the data and code underpinning the pre-print. For the pre-prints that are in our sample, we are unable to find markers of either open data or open code for 75% of those on arXiv, 67% of those on bioRxiv, and 79% of those on medRxiv.
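Illustrative sketch (the marker list and helper below are hypothetical, not the authors' curated dictionary or code): the keyword screening step amounts to searching the extracted text for phrases and links that signal open data or open code.

```python
import re

# Hypothetical open-data / open-code markers for screening pre-print text.
OPEN_DATA_MARKERS = [r"data (?:are|is) available", r"\bzenodo\b", r"\bosf\.io\b", r"github\.com/"]
OPEN_CODE_MARKERS = [r"code (?:are|is) available", r"source code", r"github\.com/"]

def has_marker(text, patterns):
    """Return True if any marker pattern appears in the text."""
    return any(re.search(p, text, flags=re.IGNORECASE) for p in patterns)

text = "All code and data are available at github.com/example/repo."
print(has_marker(text, OPEN_DATA_MARKERS), has_marker(text, OPEN_CODE_MARKERS))
```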
Previous Publications
Robust Risk-Aware Reinforcement Learning
by Sebastian Jaimungal, Silvana M. Pesenti, Ye Sheng Wang, and Hariom Tatsat
SIAM Journal on Financial Mathematics | 2021 | 13 (1), 213-226
Short Summary: We present a reinforcement learning (RL) approach for robust optimization of risk-aware performance criteria. To allow agents to express a wide variety of risk-reward profiles, we assess the value of a policy using rank dependent expected utility (RDEU). RDEU allows agents to seek gains, while simultaneously protecting themselves against downside risk. To robustify optimal policies against model uncertainty, we assess a policy not by its distribution but rather by the worst possible distribution that lies within a Wasserstein ball around it. Thus, our problem formulation may be viewed as an actor/agent choosing a policy (the outer problem) and the adversary then acting to worsen the performance of that strategy (the inner problem). We develop explicit policy gradient formulae for the inner and outer problems and show their efficacy on three prototypical financial problems: robust portfolio allocation, benchmark optimization, and statistical arbitrage.
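For reference, one standard textbook form of rank-dependent expected utility for an outcome X with distribution function F_X, utility u, and an increasing probability distortion g with g(0) = 0 and g(1) = 1, is shown below; the exact parameterization used in the paper may differ.

```latex
\mathrm{RDEU}(X) \;=\; \int_{\mathbb{R}} u(x)\, \mathrm{d}\big(g \circ F_X\big)(x)
```

When g is the identity this reduces to ordinary expected utility; choosing u and g separately lets an agent trade off gain-seeking against protection from downside risk, as described above.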
Computational Skills by Stealth in Introductory Data Science Teaching
by Wesley Burr, Fanny Chevalier, Christopher Collins, Alison L Gibbs, Raymond Ng, and Chris Wild
Teaching Statistics | 2021 (accepted) | DOI: 10.1111/test.12277
Short Summary: In 2010, Nolan and Temple Lang proposed the “integration of computing concepts into statistics curricula at all levels.” The unprecedented growth in data and the emphasis on data science have provided an impetus to finally realize full implementations of this in new statistics and data science programs and courses. We discuss a proposal for the stealth development of computational skills in students’ exposure to introductory data science through careful, scaffolded exposure to computation and its power. Our intent is to support students, regardless of interest and self-efficacy in coding, in becoming data-driven learners who are capable of asking complex questions about the world around them, and then answering those questions through data-driven inquiry. Reference is made to the computer science and statistics consensus curriculum frameworks recently published by the International Data Science in Schools Project (IDSSP) for secondary school data science or introductory tertiary programs, designed to optimize the accessibility of data science.
Double Happiness: Enhancing the Coupled Gains of L-lag Coupling via Control Variates
by Radu V. Craiu and Xiao-Li Meng
Statistica Sinica | 2021 (Accepted)
Short Summary: The paper adds two innovations to the general construction of unbiased MCMC estimators using L-lag coupling that has been developed, in a series of papers, by Pierre Jacob and his collaborators. One is to consider the use of control variates to increase the efficiency of the estimators. The control variates are easily available, since they are provided by the coupling construction itself. An added bonus is that the new estimators lead to tighter bounds on the total variation distance between the chain's distribution after k iterations and its stationary distribution.
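Illustrative sketch (generic control variates only; in the paper the control variates arise automatically from the L-lag coupling construction rather than from a hand-picked statistic like the one assumed here):

```python
import numpy as np

# Control variates: estimate E[X] using a statistic C with known mean zero.
rng = np.random.default_rng(0)
n = 100_000
u = rng.uniform(size=n)
x = np.exp(u)                              # target draws, E[exp(U)] = e - 1
c = u - 0.5                                # control variate, E[C] = 0

beta = np.cov(x, c)[0, 1] / np.var(c)      # estimated variance-optimal coefficient
plain = x.mean()
controlled = (x - beta * c).mean()         # same expectation, reduced variance
```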
Dual Space Preconditioning for Gradient Descent
by Chris J. Maddison, Daniel Paulin, Yee Whye Teh, and Arnaud Doucet
SIAM J. Optim. | 2021 | 31 (1), 991-1016
Short Summary: The conditions of relative smoothness and relative strong convexity were recently introduced for the analysis of Bregman gradient methods for convex optimization. We introduce a generalized left-preconditioning method for gradient descent and show that its convergence on an essentially smooth convex objective function can be guaranteed via an application of relative smoothness in the dual space. Our relative smoothness assumption is between the designed preconditioner and the convex conjugate of the objective, and it generalizes the typical Lipschitz gradient assumption. Under dual relative strong convexity, we obtain linear convergence with a generalized condition number that is invariant under horizontal translations, distinguishing it from Bregman gradient methods. Thus, in principle our method is capable of improving the conditioning of gradient descent on problems with a non-Lipschitz gradient or nonstrongly convex structure. We demonstrate our method on p-norm regression and exponential penalty function minimization.
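For readers new to the terminology, the primal-space conditions being generalized are stated below, with D_h(y, x) = h(y) − h(x) − ⟨∇h(x), y − x⟩ the Bregman divergence of a reference function h; the paper's dual conditions are the analogous statements relating the designed preconditioner to the convex conjugate of the objective.

```latex
\text{$f$ is $L$-smooth relative to $h$:}\qquad
f(y) \le f(x) + \langle \nabla f(x),\, y - x\rangle + L\, D_h(y, x)

\text{$f$ is $\mu$-strongly convex relative to $h$:}\qquad
f(y) \ge f(x) + \langle \nabla f(x),\, y - x\rangle + \mu\, D_h(y, x)
```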
Finding Our Way in the Dark: Approximate MCMC for Approximate Bayesian Methods
by Evgeny Levi and Radu V. Craiu
Bayesian Analysis | 2021 (Accepted)
Short Summary: In this paper we design perturbed MCMC samplers that can be used within the Approximate Bayesian Computation (ABC) and Bayesian Synthetic Likelihood (BSL) paradigms to significantly accelerate computation while maintaining control over computational efficiency. The proposed strategy relies on recycling samples from the chain’s past.
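Illustrative sketch (the vanilla ABC rejection baseline that such samplers accelerate; the prior, simulator, summary statistic, and tolerance are placeholder assumptions, and the paper's perturbed MCMC samplers recycle past simulations rather than generating fresh ones at every step):

```python
import numpy as np

# Vanilla ABC rejection sampling for the mean of a Gaussian.
rng = np.random.default_rng(0)
y_obs = rng.normal(loc=1.5, scale=1.0, size=200)
s_obs = y_obs.mean()                     # summary statistic of the observed data
eps = 0.05                               # tolerance

accepted = []
while len(accepted) < 200:
    theta = rng.normal(loc=0.0, scale=5.0)                  # draw from the prior
    y_sim = rng.normal(loc=theta, scale=1.0, size=200)      # simulate a dataset
    if abs(y_sim.mean() - s_obs) < eps:                     # compare summaries
        accepted.append(theta)                              # approximate posterior draw
```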
Gradient Estimation with Stochastic Softmax Tricks
by Max B. Paulus, Dami Choi, Daniel Tarlow, Andreas Krause, and Chris J. Maddison
2020
Short Summary: The Gumbel-Max trick is the basis of many relaxed gradient estimators. These estimators are easy to implement and low variance, but the goal of scaling them comprehensively to large combinatorial distributions is still outstanding. Working within the perturbation model framework, we introduce stochastic softmax tricks, which generalize the Gumbel-Softmax trick to combinatorial spaces. Our framework is a unified perspective on existing relaxed estimators for perturbation models, and it contains many novel relaxations. We design structured relaxations for subset selection, spanning trees, arborescences, and others. When compared to less structured baselines, we find that stochastic softmax tricks can be used to train latent variable models that perform better and discover more latent structure.
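Illustrative sketch (the basic Gumbel-Max trick and its Gumbel-Softmax relaxation only, not the structured relaxations introduced in the paper): adding independent Gumbel noise to the logits and taking an argmax yields an exact categorical sample, and replacing the argmax with a temperature-controlled softmax yields a differentiable surrogate.

```python
import numpy as np

rng = np.random.default_rng(0)
logits = np.array([1.0, 0.2, -0.5, 2.0])     # unnormalized log-probabilities

def gumbel(shape):
    return -np.log(-np.log(rng.uniform(size=shape)))

g = gumbel(logits.shape)
hard_sample = np.argmax(logits + g)          # Gumbel-Max: exact categorical sample

tau = 0.5                                    # relaxation temperature
z = np.exp((logits + g) / tau)
soft_sample = z / z.sum()                    # Gumbel-Softmax: differentiable in the logits
```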
In praise of small data
by Nancy Reid
Notices of the American Mathematical Society | 2021 | 68, 105-113
Short Summary: The over-promotion of “Big Data” has perhaps settled down, but the data are still there, and the rapid development of the new field of data science is a response to this. As more data become available, the questions asked become more complex, and big data can quickly turn into small data. Statistical science has developed an arsenal of methods and models for learning under uncertainty over its 200-year history. Some thoughts on the interplay between statistical and data science, their interactions with science, and the ongoing relevance of statistical theory are presented and illustrated.
LocusFocus: Web-based colocalization for the annotation and functional follow-up of GWAS
by Naim Panjwani, Fan Wang, Scott Mastromatteo, Allen Bao, Cheng Wang, Gengming He, Jiafen Gong, Johanna M. Rommens, Lei Sun, and Lisa J. Strug
PLOS Computational Biology | 2020 | 16(10):e1008336
Short Summary: Genome-wide association studies (GWAS) have primarily identified trait-associated loci in the non-coding genome. Colocalization analyses of SNP associations from GWAS with expression quantitative trait loci (eQTL) evidence enable the generation of hypotheses about responsible mechanism, genes and tissues of origin to guide functional characterization. Here, we present a web-based colocalization browsing and testing tool named LocusFocus. LocusFocus formally tests colocalization using our established Simple Sum method to identify the most relevant genes and tissues for a particular GWAS locus in the presence of high linkage disequilibrium and/or allelic heterogeneity. We demonstrate the utility of LocusFocus, following up on a genome-wide significant locus from a GWAS of meconium ileus (an intestinal obstruction in cystic fibrosis). Using LocusFocus for colocalization analysis with eQTL data suggests variation in ATP12A gene expression in the pancreas rather than intestine is responsible for the GWAS locus. LocusFocus has no operating system dependencies and may be installed in a local web server. LocusFocus is available under the MIT license, with full documentation and source code accessible on GitHub.
Oops I Took A Gradient: Scalable Sampling for Discrete Distributions
by Will Grathwohl, Milad Hashemi, Kevin Swersky, David Duvenaud, and Chris Maddison
International Conference on Machine Learning | 2021 (accepted)
Short Summary: We propose a general and scalable approximate sampling strategy for probabilistic models with discrete variables. Our approach uses gradients of the likelihood function with respect to its discrete inputs to propose updates in a Metropolis-Hastings sampler. We show empirically that this approach outperforms generic samplers in a number of difficult settings including Ising models, Potts models, restricted Boltzmann machines, and factorial hidden Markov models. We also demonstrate the use of our improved sampler for training deep energy-based models on high dimensional discrete data. This approach outperforms variational auto-encoders and existing energy-based models. Finally, we give bounds showing that our approach is near-optimal in the class of samplers which propose local updates.
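Illustrative sketch (a simplified gradient-informed Metropolis-Hastings step on a small quadratic binary model; the model, proposal temperature, and other details are assumptions rather than the paper's exact sampler): a first-order Taylor estimate of the change in log-probability from flipping each bit drives the proposal, and a standard Metropolis-Hastings correction keeps the chain exact.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 20
W = rng.normal(scale=0.1, size=(D, D)); W = (W + W.T) / 2.0
b = rng.normal(scale=0.5, size=D)

def log_prob(x):                       # unnormalized log-probability of x in {0,1}^D
    return x @ W @ x + b @ x

def grad_log_prob(x):                  # gradient of the quadratic form w.r.t. x
    return (W + W.T) @ x + b

def flip_logits(x):
    # First-order estimate of log p(flip_i(x)) - log p(x): (1 - 2 x_i) * grad_i
    return (1.0 - 2.0 * x) * grad_log_prob(x)

def mh_step(x):
    logits = flip_logits(x)
    q = np.exp(logits - logits.max()); q /= q.sum()
    i = rng.choice(D, p=q)                         # gradient-informed choice of bit to flip
    y = x.copy(); y[i] = 1.0 - y[i]
    logits_rev = flip_logits(y)
    q_rev = np.exp(logits_rev - logits_rev.max()); q_rev /= q_rev.sum()
    log_alpha = log_prob(y) - log_prob(x) + np.log(q_rev[i]) - np.log(q[i])
    return y if np.log(rng.uniform()) < log_alpha else x

x = rng.integers(0, 2, size=D).astype(float)
for _ in range(2000):
    x = mh_step(x)
```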
Scalable Gradients for Stochastic Differential Equations
by Xuechen Li, Ting-Kam Leonard Wong, Ricky Tian Qi Chen, and David Duvenaud
Conference on AI and Statistics | 2020
Short Summary: We generalize the adjoint sensitivity method to stochastic differential equations, allowing time-efficient and constant-memory computation of gradients with high-order adaptive solvers. Specifically, we derive a stochastic differential equation whose solution is the gradient, a memory-efficient algorithm for caching noise, and conditions under which numerical solutions converge. In addition, we combine our method with gradient-based stochastic variational inference for latent stochastic differential equations. We use our method to fit stochastic dynamics defined by neural networks, achieving competitive performance on a 50-dimensional motion capture dataset.
Statistical power in COVID-19 case-control host genomic study design
by Yu-Chung Lin, Jennifer D. Brooks, Shelley B. Bull, France Gagnon, Celia M. T. Greenwood, Rayjean J. Hung, Jerald Lawless, Andrew D. Paterson, Lei Sun, and Lisa J. Strug
Genome Medicine | 2020 | Volume 12, Article 115
Short Summary: The identification of genetic variation that directly impacts infection susceptibility to SARS-CoV-2 and disease severity of COVID-19 is an important step towards risk stratification, personalized treatment plans, therapeutic, and vaccine development and deployment. Given the importance of study design in infectious disease genetic epidemiology, we use simulation and draw on current estimates of exposure, infectivity, and test accuracy of COVID-19 to demonstrate the feasibility of detecting host genetic factors associated with susceptibility and severity in published COVID-19 study designs. We demonstrate that limited phenotypic data and exposure/infection information in the early stages of the pandemic significantly impact the ability to detect most genetic variants with moderate effect sizes, especially when studying susceptibility to SARS-CoV-2 infection. Our insights can aid in the interpretation of genetic findings emerging in the literature and guide the design of future host genetic studies.
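Illustrative sketch (a generic simulation-based power calculation with placeholder parameter values, not the study's design-specific simulation, which additionally models exposure, infectivity, and test accuracy): generate genotypes and disease status under an assumed allele frequency and odds ratio, test for association, and record the rejection rate.

```python
import numpy as np
from scipy import stats

# Power by simulation for a single variant under an additive logistic model.
rng = np.random.default_rng(0)
maf, log_or, baseline = 0.2, np.log(1.5), -3.0     # placeholder assumptions
n, n_sims, alpha = 20_000, 500, 5e-8               # sample size, replicates, GWAS threshold

hits = 0
for _ in range(n_sims):
    g = rng.binomial(2, maf, size=n)                            # genotypes 0/1/2
    p = 1.0 / (1.0 + np.exp(-(baseline + log_or * g)))          # disease probability
    d = rng.binomial(1, p)                                      # case/control status
    n1, n0 = d.sum(), n - d.sum()
    f1 = g[d == 1].sum() / (2 * n1)                             # case allele frequency
    f0 = g[d == 0].sum() / (2 * n0)                             # control allele frequency
    fbar = g.sum() / (2 * n)
    z2 = (f1 - f0) ** 2 / (fbar * (1 - fbar) * (1 / (2 * n1) + 1 / (2 * n0)))
    hits += stats.chi2.sf(z2, df=1) < alpha                     # allelic chi-square test
print("estimated power:", hits / n_sims)
```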
The Building Blocks of Statistical Education in the Data Science Ecosystem
by Alison L. Gibbs and Nathan Taback
Harvard Data Science Review | 2021 (accepted)
Adaptive Component-wise Multiple-Try Metropolis
by Jinyoung Yang, Evgeny Levi, R.V. Craiu, and J.S. Rosenthal
Journal of Computational and Graphical Statistics | 2018
Short Summary: Adaptive MCMC for targets with irregular characteristics
Backpropagation through the Void: Optimizing control variates for black-box gradient estimation
by Will Grathwohl, Dami Choi, Yuhuai Wu, Geoffrey Roeder, and David Duvenaud
International Conference on Learning Representations | 2018
Short Summary: We learn low-variance, unbiased gradient estimators for any function of random variables. We backprop through a neural net surrogate of the original function, which is optimized to minimize gradient variance during the optimization of the original objective. We train discrete latent-variable models, and do continuous and discrete reinforcement learning with an adaptive, action-conditional baseline.
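Illustrative sketch (the simplest version of the idea, with a constant baseline rather than the learned neural-network control variate used in the paper; the toy objective is an assumption): a baseline subtracted from a score-function (REINFORCE) estimator leaves the gradient unbiased while reducing its variance.

```python
import numpy as np

# Score-function gradient of E_{b ~ Bernoulli(sigmoid(theta))}[f(b)],
# with and without a constant baseline control variate.
rng = np.random.default_rng(0)
theta = 0.3
f = lambda b: (b - 0.45) ** 2                 # toy black-box objective

def grad_samples(baseline, n=50_000):
    p = 1.0 / (1.0 + np.exp(-theta))
    b = rng.binomial(1, p, size=n).astype(float)
    score = b - p                             # d/dtheta of log Bernoulli(b; sigmoid(theta))
    return (f(b) - baseline) * score          # unbiased for any constant baseline

print(np.var(grad_samples(0.0)),                        # no baseline
      np.var(grad_samples(0.5 * (f(0.0) + f(1.0)))))    # centred baseline: lower variance
```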
Global Non-convex Optimization with Discretized Diffusions
by Murat A. Erdogdu, Lester Mackey, and Ohad Shamir
Advances in Neural Information Processing Systems | 2018 (to appear)
Short summary: An Euler discretization of the Langevin diffusion is known to converge to the global minimizers of certain convex and non-convex optimization problems. We show that this property holds for any suitably smooth diffusion and that different diffusions are suitable for optimizing different classes of convex and non-convex functions. This allows us to design diffusions suitable for globally optimizing convex and non-convex functions not covered by the existing Langevin theory. Our non-asymptotic analysis delivers computable optimization and integration error bounds based on easily accessed properties of the objective and chosen diffusion. Central to our approach are new explicit Stein factor bounds on the solutions of Poisson equations. We complement these results with improved optimization guarantees for targets other than the standard Gibbs measure.
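Illustrative sketch (the Euler-discretized Langevin diffusion that serves as the starting point of the analysis; the step size and the double-well objective are placeholder assumptions, and the paper designs and analyzes more general diffusions):

```python
import numpy as np

# Unadjusted Langevin algorithm: Euler discretization of the Langevin diffusion,
# whose stationary distribution is proportional to exp(-f).
rng = np.random.default_rng(0)

def f(x):                                     # simple non-convex (double-well) objective
    return (x**2 - 1.0) ** 2

def grad_f(x):
    return 4.0 * x * (x**2 - 1.0)

eta, n_steps = 1e-3, 50_000
x = 3.0
traj = np.empty(n_steps)
for k in range(n_steps):
    x = x - eta * grad_f(x) + np.sqrt(2.0 * eta) * rng.normal()
    traj[k] = x
# The iterates concentrate near the global minimizers x = -1 and x = +1.
```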
Neural Ordinary Differential Equations
by Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David Duvenaud
Advances in Neural Information Processing Systems | 2018
Short Summary: We introduce a new family of deep neural network models. Instead of specifying a discrete sequence of hidden layers, we parameterize the derivative of the hidden state using a neural network. The output of the network is computed using a black-box differential equation solver. These continuous-depth models have constant memory cost, adapt their evaluation strategy to each input, and can explicitly trade numerical precision for speed. We demonstrate these properties in continuous-depth residual networks and continuous-time latent variable models. We also construct continuous normalizing flows, a generative model that can train by maximum likelihood, without partitioning or ordering the data dimensions. For training, we show how to scalably backpropagate through any ODE solver, without access to its internal operations. This allows end-to-end training of ODEs within larger models.
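Illustrative sketch (a minimal continuous-depth model, assuming PyTorch and the torchdiffeq package are installed; the network size, time grid, and loss are placeholder choices):

```python
import torch
from torchdiffeq import odeint_adjoint as odeint   # adjoint method: constant-memory backprop

class ODEFunc(torch.nn.Module):
    """Parameterizes the derivative dh/dt of the hidden state with a small MLP."""
    def __init__(self, dim=2, hidden=32):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(dim, hidden), torch.nn.Tanh(), torch.nn.Linear(hidden, dim))

    def forward(self, t, h):
        return self.net(h)

func = ODEFunc()
h0 = torch.randn(16, 2)                  # batch of initial hidden states
t = torch.linspace(0.0, 1.0, 10)         # evaluation times
h_t = odeint(func, h0, t)                # black-box ODE solve, shape (10, 16, 2)
loss = h_t[-1].pow(2).mean()
loss.backward()                          # gradients w.r.t. the parameters of func via the adjoint
```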
Scalable Approximations for Generalized Linear Problems
by Murat A. Erdogdu, Mohsen Bayati, and Lee H. Dicker
Journal of Machine Learning Research | 2018 (to appear)
Short Summary: In stochastic optimization, the population risk is generally approximated by the empirical risk. However, in the large-scale setting, minimization of the empirical risk may be computationally restrictive. In this paper, we design an efficient algorithm to approximate the population risk minimizer in generalized linear problems such as binary classification with surrogate losses and generalized linear regression models. We focus on large-scale problems, where the iterative minimization of the empirical risk is computationally intractable because the number of observations $n$ is much larger than the dimension of the parameter $p$, i.e., $n \gg p \gg 1$. We show that under random sub-Gaussian design, the true minimizer of the population risk is approximately proportional to the corresponding ordinary least squares (OLS) estimator. Using this relation, we design an algorithm that achieves the same accuracy as the empirical risk minimizer through iterations that attain up to a quadratic convergence rate, and that are computationally cheaper than any batch optimization algorithm by at least a factor of $\mathcal{O}(p)$. We provide theoretical guarantees for our algorithm, and analyze the convergence behavior in terms of data dimensions.
Stability of Adversarial Markov Chains, with an Application to Adaptive MCMC Algorithms
by R.V. Craiu, L. Gray, K. Latuszynski, N. Madras, G.O. Roberts, and J.S. Rosenthal
Annals of Applied Probability | 2015 | Vol. 25(6), pp. 3592-3623
Short Summary: Provides a simple way to verify the correct convergence of adaptive MCMC algorithms, thus opening up new avenues for computational progress and accurate estimation.