new

Get trending papers in your email inbox!

Subscribe

Daily Papers

byAK and the research community

Feb 19

A Flexible Parametric Modelling Framework for Survival Analysis

We introduce a general, flexible, parametric survival modelling framework which encompasses key shapes of hazard function (constant, increasing, decreasing, up-then-down, down-then-up), various common survival distributions (log-logistic, Burr type XII, Weibull, Gompertz), and includes defective distributions (i.e., cure models). This generality is achieved using four basic distributional parameters: two scale-type parameters and two shape parameters. Generalising to covariate dependence, the scale-type regression components correspond to accelerated failure time (AFT) and proportional hazards (PH) models. Therefore, this general formulation unifies the most popular survival models which allows us to consider the practical value of possible modelling choices for survival data. Furthermore, in line with our proposed flexible baseline distribution, we advocate the use of multi-parameter regression in which more than one distributional parameter depends on covariates - rather than the usual convention of having a single covariate-dependent (scale) parameter. While many choices are available, we suggest introducing covariates through just one or other of the two scale parameters, which covers AFT and PH models, in combination with a `power' shape parameter, which allows for more complex non-AFT/non-PH effects, while the other shape parameter remains covariate-independent, and handles automatic selection of the baseline distribution. We explore inferential issues in simulations, both with and without a covariate, with particular focus on evidence concerning the need, or otherwise, to include both AFT and PH parameters. We illustrate the efficacy of our modelling framework by investigating differences between treatment groups using data from a lung cancer study and a melanoma study. Censoring is accommodated throughout.

  • 3 authors
·
Jan 10, 2019

The Slepian model based independent interval approximation of persistency and zero-level exceedance distributions

In physics and engineering literature, the distribution of the excursion-above-zero time distribution (exceedance distribution) for a stationary Gaussian process has been approximated by a stationary switching process with independently distributed switching times. The approach matched the covariance of the clipped Gaussian process with the one for the stationary switching process and the distribution of the latter was used as the so-called independent interval approximation (IIA). The approach successfully assessed the persistency exponent for many physically important processes but left an unanswered question when such an approach leads to a mathematically meaningful and proper exceedance distribution. Here we address this question by proposing an alternative matching of the expected values of the clipped Slepian process and the corresponding switched process initiated at the origin. The method has allowed resolving the mathematical correctness of the matching method for a large subclass of the Gaussian processes with monotonic covariance, for which we provide a sufficient condition for the validity of the IIA. Within this class, the IIA produces a valid distribution for the excursion time and is represented in an explicit stochastic form that connects directly to the covariance of the underlying Gaussian process. We compare the excursion level distributions as well as the corresponding persistency exponents obtained through the IIA method with numerically computed exact distributions, and the simulated distribution for several important Gaussian models. We also argue that for stationary Gaussian processes with a non-monotonic covariance, the IIA fails and should not be used.

  • 2 authors
·
Jan 3, 2024

Machine Learning with Multitype Protected Attributes: Intersectional Fairness through Regularisation

Ensuring equitable treatment (fairness) across protected attributes (such as gender or ethnicity) is a critical issue in machine learning. Most existing literature focuses on binary classification, but achieving fairness in regression tasks-such as insurance pricing or hiring score assessments-is equally important. Moreover, anti-discrimination laws also apply to continuous attributes, such as age, for which many existing methods are not applicable. In practice, multiple protected attributes can exist simultaneously; however, methods targeting fairness across several attributes often overlook so-called "fairness gerrymandering", thereby ignoring disparities among intersectional subgroups (e.g., African-American women or Hispanic men). In this paper, we propose a distance covariance regularisation framework that mitigates the association between model predictions and protected attributes, in line with the fairness definition of demographic parity, and that captures both linear and nonlinear dependencies. To enhance applicability in the presence of multiple protected attributes, we extend our framework by incorporating two multivariate dependence measures based on distance covariance: the previously proposed joint distance covariance (JdCov) and our novel concatenated distance covariance (CCdCov), which effectively address fairness gerrymandering in both regression and classification tasks involving protected attributes of various types. We discuss and illustrate how to calibrate regularisation strength, including a method based on Jensen-Shannon divergence, which quantifies dissimilarities in prediction distributions across groups. We apply our framework to the COMPAS recidivism dataset and a large motor insurance claims dataset.

  • 5 authors
·
Sep 9, 2025

One-connection rule for structural equation models

Linear structural equation models are multivariate statistical models encoded by mixed graphs. In particular, the set of covariance matrices for distributions belonging to a linear structural equation model for a fixed mixed graph G=(V, D,B) is parameterized by a rational function with parameters for each vertex and edge in G. This rational parametrization naturally allows for the study of these models from an algebraic and combinatorial point of view. Indeed, this point of view has led to a collection of results in the literature, mainly focusing on questions related to identifiability and determining relationships between covariances (i.e., finding polynomials in the Gaussian vanishing ideal). So far, a large proportion of these results has focused on the case when D, the directed part of the mixed graph G, is acyclic. This is due to the fact that in the acyclic case, the parametrization becomes polynomial and there is a description of the entries of the covariance matrices in terms of a finite sum. We move beyond the acyclic case and give a closed form expression for the entries of the covariance matrices in terms of the one-connections in a graph obtained from D through some small operations. This closed form expression then allows us to show that if G is simple, then the parametrization map is generically finite-to-one. Finally, having a closed form expression for the covariance matrices allows for the development of an algorithm for systematically exploring possible polynomials in the Gaussian vanishing ideal.

  • 4 authors
·
Oct 1, 2022

Chronos-2: From Univariate to Universal Forecasting

Pretrained time series models have enabled inference-only forecasting systems that produce accurate predictions without task-specific training. However, existing approaches largely focus on univariate forecasting, limiting their applicability in real-world scenarios where multivariate data and covariates play a crucial role. We present Chronos-2, a pretrained model capable of handling univariate, multivariate, and covariate-informed forecasting tasks in a zero-shot manner. Chronos-2 employs a group attention mechanism that facilitates in-context learning (ICL) through efficient information sharing across multiple time series within a group, which may represent sets of related series, variates of a multivariate series, or targets and covariates in a forecasting task. These general capabilities are achieved through training on synthetic datasets that impose diverse multivariate structures on univariate series. Chronos-2 delivers state-of-the-art performance across three comprehensive benchmarks: fev-bench, GIFT-Eval, and Chronos Benchmark II. On fev-bench, which emphasizes multivariate and covariate-informed forecasting, Chronos-2's universal ICL capabilities lead to substantial improvements over existing models. On tasks involving covariates, it consistently outperforms baselines by a wide margin. Case studies in the energy and retail domains further highlight its practical advantages. The in-context learning capabilities of Chronos-2 establish it as a general-purpose forecasting model that can be used "as is" in real-world forecasting pipelines.

amazon Amazon
·
Oct 17, 2025 3

fastHDMI: Fast Mutual Information Estimation for High-Dimensional Data

In this paper, we introduce fastHDMI, a Python package designed for efficient variable screening in high-dimensional datasets, particularly neuroimaging data. This work pioneers the application of three mutual information estimation methods for neuroimaging variable selection, a novel approach implemented via fastHDMI. These advancements enhance our ability to analyze the complex structures of neuroimaging datasets, providing improved tools for variable selection in high-dimensional spaces. Using the preprocessed ABIDE dataset, we evaluate the performance of these methods through extensive simulations. The tests cover a range of conditions, including linear and nonlinear associations, as well as continuous and binary outcomes. Our results highlight the superiority of the FFTKDE-based mutual information estimation for feature screening in continuous nonlinear outcomes, while binning-based methods outperform others for binary outcomes with nonlinear probability preimages. For linear simulations, both Pearson correlation and FFTKDE-based methods show comparable performance for continuous outcomes, while Pearson excels in binary outcomes with linear probability preimages. A comprehensive case study using the ABIDE dataset further demonstrates fastHDMI's practical utility, showcasing the predictive power of models built from variables selected using our screening techniques. This research affirms the computational efficiency and methodological strength of fastHDMI, significantly enriching the toolkit available for neuroimaging analysis.

  • 5 authors
·
Oct 13, 2024

Weighted least-squares approximation with determinantal point processes and generalized volume sampling

We consider the problem of approximating a function from L^2 by an element of a given m-dimensional space V_m, associated with some feature map varphi, using evaluations of the function at random points x_1,dots,x_n. After recalling some results on optimal weighted least-squares using independent and identically distributed points, we consider weighted least-squares using projection determinantal point processes (DPP) or volume sampling. These distributions introduce dependence between the points that promotes diversity in the selected features varphi(x_i). We first provide a generalized version of volume-rescaled sampling yielding quasi-optimality results in expectation with a number of samples n = O(mlog(m)), that means that the expected L^2 error is bounded by a constant times the best approximation error in L^2. Also, further assuming that the function is in some normed vector space H continuously embedded in L^2, we further prove that the approximation is almost surely bounded by the best approximation error measured in the H-norm. This includes the cases of functions from L^infty or reproducing kernel Hilbert spaces. Finally, we present an alternative strategy consisting in using independent repetitions of projection DPP (or volume sampling), yielding similar error bounds as with i.i.d. or volume sampling, but in practice with a much lower number of samples. Numerical experiments illustrate the performance of the different strategies.

  • 2 authors
·
Dec 21, 2023

Kernel Density Estimators in Large Dimensions

This paper studies Kernel density estimation for a high-dimensional distribution rho(x). Traditional approaches have focused on the limit of large number of data points n and fixed dimension d. We analyze instead the regime where both the number n of data points y_i and their dimensionality d grow with a fixed ratio alpha=(log n)/d. Our study reveals three distinct statistical regimes for the kernel-based estimate of the density hat rho_h^{D}(x)=1{n h^d}sum_{i=1}^n Kleft(x-y_i{h}right), depending on the bandwidth h: a classical regime for large bandwidth where the Central Limit Theorem (CLT) holds, which is akin to the one found in traditional approaches. Below a certain value of the bandwidth, h_{CLT}(alpha), we find that the CLT breaks down. The statistics of hat rho_h^{D}(x) for a fixed x drawn from rho(x) is given by a heavy-tailed distribution (an alpha-stable distribution). In particular below a value h_G(alpha), we find that hat rho_h^{D}(x) is governed by extreme value statistics: only a few points in the database matter and give the dominant contribution to the density estimator. We provide a detailed analysis for high-dimensional multivariate Gaussian data. We show that the optimal bandwidth threshold based on Kullback-Leibler divergence lies in the new statistical regime identified in this paper. Our findings reveal limitations of classical approaches, show the relevance of these new statistical regimes, and offer new insights for Kernel density estimation in high-dimensional settings.

  • 2 authors
·
Aug 11, 2024

Applying the Polynomial Maximization Method to Estimate ARIMA Models with Asymmetric Non-Gaussian Innovations

Classical estimators for ARIMA parameters (MLE, CSS, OLS) assume Gaussian innovations, an assumption frequently violated in financial and economic data exhibiting asymmetric distributions with heavy tails. We develop and validate the second-order polynomial maximization method (PMM2) for estimating ARIMA(p,d,q) models with non-Gaussian innovations. PMM2 is a semiparametric technique that exploits higher-order moments and cumulants without requiring full distributional specification. Monte Carlo experiments (128,000 simulations) across sample sizes N in {100, 200, 500, 1000} and four innovation distributions demonstrate that PMM2 substantially outperforms classical methods for asymmetric innovations. For ARIMA(1,1,0) with N=500, relative efficiency reaches 1.58--1.90 for Gamma, lognormal, and χ^2(3) innovations (37--47\% variance reduction). Under Gaussian innovations PMM2 matches OLS efficiency, avoiding the precision loss typical of robust estimators. The method delivers major gains for moderate asymmetry (|γ_3| geq 0.5) and N geq 200, with computational costs comparable to MLE. PMM2 provides an effective alternative for time series with asymmetric innovations typical of financial markets, macroeconomic indicators, and industrial measurements. Future extensions include seasonal SARIMA models, GARCH integration, and automatic order selection.

  • 1 authors
·
Nov 10, 2025 1

Extending Mixture of Experts Model to Investigate Heterogeneity of Trajectories: When, Where and How to Add Which Covariates

Researchers are usually interested in examining the impact of covariates when separating heterogeneous samples into latent classes that are more homogeneous. The majority of theoretical and empirical studies with such aims have focused on identifying covariates as predictors of class membership in the structural equation modeling framework. In other words, the covariates only indirectly affect the sample heterogeneity. However, the covariates' influence on between-individual differences can also be direct. This article presents a mixture model that investigates covariates to explain within-cluster and between-cluster heterogeneity simultaneously, known as a mixture-of-experts (MoE) model. This study aims to extend the MoE framework to investigate heterogeneity in nonlinear trajectories: to identify latent classes, covariates as predictors to clusters, and covariates that explain within-cluster differences in change patterns over time. Our simulation studies demonstrate that the proposed model generally estimates the parameters unbiasedly, precisely and exhibits appropriate empirical coverage for a nominal 95% confidence interval. This study also proposes implementing structural equation model forests to shrink the covariate space of the proposed mixture model. We illustrate how to select covariates and construct the proposed model with longitudinal mathematics achievement data. Additionally, we demonstrate that the proposed mixture model can be further extended in the structural equation modeling framework by allowing the covariates that have direct effects to be time-varying.

  • 2 authors
·
Jul 5, 2020

RoLA: A Real-Time Online Lightweight Anomaly Detection System for Multivariate Time Series

A multivariate time series refers to observations of two or more variables taken from a device or a system simultaneously over time. There is an increasing need to monitor multivariate time series and detect anomalies in real time to ensure proper system operation and good service quality. It is also highly desirable to have a lightweight anomaly detection system that considers correlations between different variables, adapts to changes in the pattern of the multivariate time series, offers immediate responses, and provides supportive information regarding detection results based on unsupervised learning and online model training. In the past decade, many multivariate time series anomaly detection approaches have been introduced. However, they are unable to offer all the above-mentioned features. In this paper, we propose RoLA, a real-time online lightweight anomaly detection system for multivariate time series based on a divide-and-conquer strategy, parallel processing, and the majority rule. RoLA employs multiple lightweight anomaly detectors to monitor multivariate time series in parallel, determine the correlations between variables dynamically on the fly, and then jointly detect anomalies based on the majority rule in real time. To demonstrate the performance of RoLA, we conducted an experiment based on a public dataset provided by the FerryBox of the One Ocean Expedition. The results show that RoLA provides satisfactory detection accuracy and lightweight performance.

  • 2 authors
·
May 25, 2023

MLE convergence speed to information projection of exponential family: Criterion for model dimension and sample size -- complete proof version--

For a parametric model of distributions, the closest distribution in the model to the true distribution located outside the model is considered. Measuring the closeness between two distributions with the Kullback-Leibler (K-L) divergence, the closest distribution is called the "information projection." The estimation risk of the maximum likelihood estimator (MLE) is defined as the expectation of K-L divergence between the information projection and the predictive distribution with plugged-in MLE. Here, the asymptotic expansion of the risk is derived up to n^{-2}-order, and the sufficient condition on the risk for the Bayes error rate between the true distribution and the information projection to be lower than a specified value is investigated. Combining these results, the "p-n criterion" is proposed, which determines whether the MLE is sufficiently close to the information projection for the given model and sample. In particular, the criterion for an exponential family model is relatively simple and can be used for a complex model with no explicit form of normalizing constant. This criterion can constitute a solution to the sample size or model acceptance problem. Use of the p-n criteria is demonstrated for two practical datasets. The relationship between the results and information criteria is also studied.

  • 1 authors
·
May 19, 2021

Efficient estimation of multiple expectations with the same sample by adaptive importance sampling and control variates

Some classical uncertainty quantification problems require the estimation of multiple expectations. Estimating all of them accurately is crucial and can have a major impact on the analysis to perform, and standard existing Monte Carlo methods can be costly to do so. We propose here a new procedure based on importance sampling and control variates for estimating more efficiently multiple expectations with the same sample. We first show that there exists a family of optimal estimators combining both importance sampling and control variates, which however cannot be used in practice because they require the knowledge of the values of the expectations to estimate. Motivated by the form of these optimal estimators and some interesting properties, we therefore propose an adaptive algorithm. The general idea is to adaptively update the parameters of the estimators for approaching the optimal ones. We suggest then a quantitative stopping criterion that exploits the trade-off between approaching these optimal parameters and having a sufficient budget left. This left budget is then used to draw a new independent sample from the final sampling distribution, allowing to get unbiased estimators of the expectations. We show how to apply our procedure to sensitivity analysis, by estimating Sobol' indices and quantifying the impact of the input distributions. Finally, realistic test cases show the practical interest of the proposed algorithm, and its significant improvement over estimating the expectations separately.

  • 3 authors
·
Nov 30, 2022

Partial Correlations in Compositional Data Analysis

Partial correlations quantify linear association between two variables adjusting for the influence of the remaining variables. They form the backbone for graphical models and are readily obtained from the inverse of the covariance matrix. For compositional data, the covariance structure is specified from log ratios of variables, so unless we try to "open" the data via a normalization, this implies changes in the definition and interpretation of partial correlations. In the present work, we elucidate how results derived by Aitchison (1986) lead to a natural definition of partial correlation that has a number of advantages over current measures of association. For this, we show that the residuals of log-ratios between a variable with a reference, when adjusting for all remaining variables including the reference, are reference-independent. Since the reference itself can be controlled for, correlations between residuals are defined for the variables directly without the necessity to recur to ratios except when specifying which variables are partialled out. Thus, perhaps surprisingly, partial correlations do not have the problems commonly found with measures of pairwise association on compositional data. They are well-defined between two variables, are properly scaled, and allow for negative association. By design, they are subcompositionally incoherent, but they share this property with conventional partial correlations (where results change when adjusting for the influence of fewer variables). We discuss the equivalence with normalization-based approaches whenever the normalizing variables are controlled for. We also discuss the partial variances and correlations we obtain from a previously studied data set of Roman glass cups.

  • 1 authors
·
Apr 20, 2019

SHARP: Social Harm Analysis via Risk Profiles for Measuring Inequities in Large Language Models

Large language models (LLMs) are increasingly deployed in high-stakes domains, where rare but severe failures can result in irreversible harm. However, prevailing evaluation benchmarks often reduce complex social risk to mean-centered scalar scores, thereby obscuring distributional structure, cross-dimensional interactions, and worst-case behavior. This paper introduces Social Harm Analysis via Risk Profiles (SHARP), a framework for multidimensional, distribution-aware evaluation of social harm. SHARP models harm as a multivariate random variable and integrates explicit decomposition into bias, fairness, ethics, and epistemic reliability with a union-of-failures aggregation reparameterized as additive cumulative log-risk. The framework further employs risk-sensitive distributional statistics, with Conditional Value at Risk (CVaR95) as a primary metric, to characterize worst-case model behavior. Application of SHARP to eleven frontier LLMs, evaluated on a fixed corpus of n=901 socially sensitive prompts, reveals that models with similar average risk can exhibit more than twofold differences in tail exposure and volatility. Across models, dimension-wise marginal tail behavior varies systematically across harm dimensions, with bias exhibiting the strongest tail severities, epistemic and fairness risks occupying intermediate regimes, and ethical misalignment consistently lower; together, these patterns reveal heterogeneous, model-dependent failure structures that scalar benchmarks conflate. These findings indicate that responsible evaluation and governance of LLMs require moving beyond scalar averages toward multidimensional, tail-sensitive risk profiling.

  • 3 authors
·
Jan 28 2

Multi-resolution Networks For Flexible Irregular Time Series Modeling (Multi-FIT)

Missing values, irregularly collected samples, and multi-resolution signals commonly occur in multivariate time series data, making predictive tasks difficult. These challenges are especially prevalent in the healthcare domain, where patients' vital signs and electronic records are collected at different frequencies and have occasionally missing information due to the imperfections in equipment or patient circumstances. Researchers have handled each of these issues differently, often handling missing data through mean value imputation and then using sequence models over the multivariate signals while ignoring the different resolution of signals. We propose a unified model named Multi-resolution Flexible Irregular Time series Network (Multi-FIT). The building block for Multi-FIT is the FIT network. The FIT network creates an informative dense representation at each time step using signal information such as last observed value, time difference since the last observed time stamp and overall mean for the signal. Vertical FIT (FIT-V) is a variant of FIT which also models the relationship between different temporal signals while creating the informative dense representations for the signal. The multi-FIT model uses multiple FIT networks for sets of signals with different resolutions, further facilitating the construction of flexible representations. Our model has three main contributions: a.) it does not impute values but rather creates informative representations to provide flexibility to the model for creating task-specific representations b.) it models the relationship between different signals in the form of support signals c.) it models different resolutions in parallel before merging them for the final prediction task. The FIT, FIT-V and Multi-FIT networks improve upon the state-of-the-art models for three predictive tasks, including the forecasting of patient survival.

  • 7 authors
·
Apr 30, 2019

Can Multimodal LLMs Perform Time Series Anomaly Detection?

Large language models (LLMs) have been increasingly used in time series analysis. However, the potential of multimodal LLMs (MLLMs), particularly vision-language models, for time series remains largely under-explored. One natural way for humans to detect time series anomalies is through visualization and textual description. Motivated by this, we raise a critical and practical research question: Can multimodal LLMs perform time series anomaly detection? To answer this, we propose VisualTimeAnomaly benchmark to evaluate MLLMs in time series anomaly detection (TSAD). Our approach transforms time series numerical data into the image format and feed these images into various MLLMs, including proprietary models (GPT-4o and Gemini-1.5) and open-source models (LLaVA-NeXT and Qwen2-VL), each with one larger and one smaller variant. In total, VisualTimeAnomaly contains 12.4k time series images spanning 3 scenarios and 3 anomaly granularities with 9 anomaly types across 8 MLLMs. Starting with the univariate case (point- and range-wise anomalies), we extend our evaluation to more practical scenarios, including multivariate and irregular time series scenarios, and variate-wise anomalies. Our study reveals several key insights: 1) MLLMs detect range- and variate-wise anomalies more effectively than point-wise anomalies. 2) MLLMs are highly robust to irregular time series, even with 25% of the data missing. 3) Open-source MLLMs perform comparably to proprietary models in TSAD. While open-source MLLMs excel on univariate time series, proprietary MLLMs demonstrate superior effectiveness on multivariate time series. To the best of our knowledge, this is the first work to comprehensively investigate MLLMs for TSAD, particularly for multivariate and irregular time series scenarios. We release our dataset and code at https://github.com/mllm-ts/VisualTimeAnomaly to support future research.

  • 6 authors
·
Feb 24, 2025

CSIM: A Copula-based similarity index sensitive to local changes for Image quality assessment

Image similarity metrics play an important role in computer vision applications, as they are used in image processing, computer vision and machine learning. Furthermore, those metrics enable tasks such as image retrieval, object recognition and quality assessment, essential in fields like healthcare, astronomy and surveillance. Existing metrics, such as PSNR, MSE, SSIM, ISSM and FSIM, often face limitations in terms of either speed, complexity or sensitivity to small changes in images. To address these challenges, a novel image similarity metric, namely CSIM, that combines real-time while being sensitive to subtle image variations is investigated in this paper. The novel metric uses Gaussian Copula from probability theory to transform an image into vectors of pixel distribution associated to local image patches. These vectors contain, in addition to intensities and pixel positions, information on the dependencies between pixel values, capturing the structural relationships within the image. By leveraging the properties of Copulas, CSIM effectively models the joint distribution of pixel intensities, enabling a more nuanced comparison of image patches making it more sensitive to local changes compared to other metrics. Experimental results demonstrate that CSIM outperforms existing similarity metrics in various image distortion scenarios, including noise, compression artifacts and blur. The metric's ability to detect subtle differences makes it suitable for applications requiring high precision, such as medical imaging, where the detection of minor anomalies can be of a high importance. The results obtained in this work can be reproduced from this Github repository: https://github.com/safouaneelg/copulasimilarity.

  • 4 authors
·
Oct 2, 2024

Environment-Adaptive Covariate Selection: Learning When to Use Spurious Correlations for Out-of-Distribution Prediction

Out-of-distribution (OOD) prediction is often approached by restricting models to causal or invariant covariates, avoiding non-causal spurious associations that may be unstable across environments. Despite its theoretical appeal, this strategy frequently underperforms empirical risk minimization (ERM) in practice. We investigate the source of this gap and show that such failures naturally arise when only a subset of the true causes of the outcome is observed. In these settings, non-causal spurious covariates can serve as informative proxies for unobserved causes and substantially improve prediction, except under distribution shifts that break these proxy relationships. Consequently, the optimal set of predictive covariates is neither universal nor necessarily exhibits invariant relationships with the outcome across all environments, but instead depends on the specific type of shift encountered. Crucially, we observe that different covariate shifts induce distinct, observable signatures in the covariate distribution itself. Moreover, these signatures can be extracted from unlabeled data in the target OOD environment and used to assess when proxy covariates remain reliable and when they fail. Building on this observation, we propose an environment-adaptive covariate selection (EACS) algorithm that maps environment-level covariate summaries to environment-specific covariate sets, while allowing the incorporation of prior causal knowledge as constraints. Across simulations and applied datasets, EACS consistently outperforms static causal, invariant, and ERM-based predictors under diverse distribution shifts.

  • 2 authors
·
Jan 5

Contributions to Robust and Efficient Methods for Analysis of High Dimensional Data

A ubiquitous feature of data of our era is their extra-large sizes and dimensions. Analyzing such high-dimensional data poses significant challenges, since the feature dimension is often much larger than the sample size. This thesis introduces robust and computationally efficient methods to address several common challenges associated with high-dimensional data. In my first manuscript, I propose a coherent approach to variable screening that accommodates nonlinear associations. I develop a novel variable screening method that transcends traditional linear assumptions by leveraging mutual information, with an intended application in neuroimaging data. This approach allows for accurate identification of important variables by capturing nonlinear as well as linear relationships between the outcome and covariates. Building on this foundation, I develop new optimization methods for sparse estimation using nonconvex penalties in my second manuscript. These methods address notable challenges in current statistical computing practices, facilitating computationally efficient and robust analyses of complex datasets. The proposed method can be applied to a general class of optimization problems. In my third manuscript, I contribute to robust modeling of high-dimensional correlated observations by developing a mixed-effects model based on Tsallis power-law entropy maximization and discussed the theoretical properties of such distribution. This model surpasses the constraints of conventional Gaussian models by accommodating a broader class of distributions with enhanced robustness to outliers. Additionally, I develop a proximal nonlinear conjugate gradient algorithm that accelerates convergence while maintaining numerical stability, along with rigorous statistical properties for the proposed framework.

  • 1 authors
·
Sep 9, 2025

Regression Discontinuity Design with Distribution-Valued Outcomes

This article introduces Regression Discontinuity Design (RDD) with Distribution-Valued Outcomes (R3D), extending the standard RDD framework to settings where the outcome is a distribution rather than a scalar. Such settings arise when treatment is assigned at a higher level of aggregation than the outcome-for example, when a subsidy is allocated based on a firm-level revenue cutoff while the outcome of interest is the distribution of employee wages within the firm. Since standard RDD methods cannot accommodate such two-level randomness, I propose a novel approach based on random distributions. The target estimand is a "local average quantile treatment effect", which averages across random quantiles. To estimate this target, I introduce two related approaches: one that extends local polynomial regression to random quantiles and another based on local Fr\'echet regression, a form of functional regression. For both estimators, I establish asymptotic normality and develop uniform, debiased confidence bands together with a data-driven bandwidth selection procedure. Simulations validate these theoretical properties and show existing methods to be biased and inconsistent in this setting. I then apply the proposed methods to study the effects of gubernatorial party control on within-state income distributions in the US, using a close-election design. The results suggest a classic equality-efficiency tradeoff under Democratic governorship, driven by reductions in income at the top of the distribution.

  • 1 authors
·
Apr 4, 2025

Chimera: Effectively Modeling Multivariate Time Series with 2-Dimensional State Space Models

Modeling multivariate time series is a well-established problem with a wide range of applications from healthcare to financial markets. Traditional State Space Models (SSMs) are classical approaches for univariate time series modeling due to their simplicity and expressive power to represent linear dependencies. They, however, have fundamentally limited expressive power to capture non-linear dependencies, are slow in practice, and fail to model the inter-variate information flow. Despite recent attempts to improve the expressive power of SSMs by using deep structured SSMs, the existing methods are either limited to univariate time series, fail to model complex patterns (e.g., seasonal patterns), fail to dynamically model the dependencies of variate and time dimensions, and/or are input-independent. We present Chimera that uses two input-dependent 2-D SSM heads with different discretization processes to learn long-term progression and seasonal patterns. To improve the efficiency of complex 2D recurrence, we present a fast training using a new 2-dimensional parallel selective scan. We further present and discuss 2-dimensional Mamba and Mamba-2 as the spacial cases of our 2D SSM. Our experimental evaluation shows the superior performance of Chimera on extensive and diverse benchmarks, including ECG and speech time series classification, long-term and short-term time series forecasting, and time series anomaly detection.

  • 3 authors
·
Jun 6, 2024 1

A Nearly-Optimal Bound for Fast Regression with ell_infty Guarantee

Given a matrix Ain R^{ntimes d} and a vector bin R^n, we consider the regression problem with ell_infty guarantees: finding a vector x'in R^d such that |x'-x^*|_infty leq epsilon{d}cdot |Ax^*-b|_2cdot |A^dagger| where x^*=argmin_{xin R^d}|Ax-b|_2. One popular approach for solving such ell_2 regression problem is via sketching: picking a structured random matrix Sin R^{mtimes n} with mll n and SA can be quickly computed, solve the ``sketched'' regression problem argmin_{xin R^d} |SAx-Sb|_2. In this paper, we show that in order to obtain such ell_infty guarantee for ell_2 regression, one has to use sketching matrices that are dense. To the best of our knowledge, this is the first user case in which dense sketching matrices are necessary. On the algorithmic side, we prove that there exists a distribution of dense sketching matrices with m=epsilon^{-2}dlog^3(n/delta) such that solving the sketched regression problem gives the ell_infty guarantee, with probability at least 1-delta. Moreover, the matrix SA can be computed in time O(ndlog n). Our row count is nearly-optimal up to logarithmic factors, and significantly improves the result in [Price, Song and Woodruff, ICALP'17], in which a super-linear in d rows, m=Omega(epsilon^{-2}d^{1+gamma}) for gamma=Theta(frac{loglog n{log d}}) is required. We also develop a novel analytical framework for ell_infty guarantee regression that utilizes the Oblivious Coordinate-wise Embedding (OCE) property introduced in [Song and Yu, ICML'21]. Our analysis is arguably much simpler and more general than [Price, Song and Woodruff, ICALP'17], and it extends to dense sketches for tensor product of vectors.

  • 4 authors
·
Feb 1, 2023

Predictive Multiplicity in Probabilistic Classification

Machine learning models are often used to inform real world risk assessment tasks: predicting consumer default risk, predicting whether a person suffers from a serious illness, or predicting a person's risk to appear in court. Given multiple models that perform almost equally well for a prediction task, to what extent do predictions vary across these models? If predictions are relatively consistent for similar models, then the standard approach of choosing the model that optimizes a penalized loss suffices. But what if predictions vary significantly for similar models? In machine learning, this is referred to as predictive multiplicity i.e. the prevalence of conflicting predictions assigned by near-optimal competing models. In this paper, we present a framework for measuring predictive multiplicity in probabilistic classification (predicting the probability of a positive outcome). We introduce measures that capture the variation in risk estimates over the set of competing models, and develop optimization-based methods to compute these measures efficiently and reliably for convex empirical risk minimization problems. We demonstrate the incidence and prevalence of predictive multiplicity in real-world tasks. Further, we provide insight into how predictive multiplicity arises by analyzing the relationship between predictive multiplicity and data set characteristics (outliers, separability, and majority-minority structure). Our results emphasize the need to report predictive multiplicity more widely.

  • 3 authors
·
Jun 2, 2022

What Benefits Drive Membership in Medicare Advantage Plans?

We seek to identify the most relevant benefits offered by Medicare Advantage Health Plans that drive membership and market share. As an example, we explore plans operating in a single county in New Jersey between 2018 and 2023. A dataset of benefits from publicly available data sources was created and the variance inflation factor was applied to identify the correlation between the extracted features, to avoid multicollinearity and overparameterization problems. We categorized the variable Market Share and used it as a multinomial response variable with three categories: less than 0.3\%, 0.3\% to 1.5\%, and over 1.5\%. Categories were chosen to achieve approximately uniform distribution of plans (47, 60, and 65 respectively). We built a multinomial Lasso model using 5-fold cross-validation to tune the penalty parameter. Lasso forced some features to be dropped from the model, which reduces the risk of overfitting and increases the interpretability of the results. For each category, important variables are different. Certain brands drive market share, as do PPO plans and prescription drug coverage. Benefits, particularly ancillary benefits that are not part of CMS's required benefits, appear to have little influence, while financial terms such as deductibles, copays, and out-of-pocket limits are associated with higher market share. Finally, we evaluated the predictive accuracy of the Lasso model with the test set. The accuracy is 0.76.

  • 2 authors
·
Nov 3, 2025

Diffusion-Driven Generation of Minimally Preprocessed Brain MRI

The purpose of this study is to present and compare three denoising diffusion probabilistic models (DDPMs) that generate 3D T_1-weighted MRI human brain images. Three DDPMs were trained using 80,675 image volumes from 42,406 subjects spanning 38 publicly available brain MRI datasets. These images had approximately 1 mm isotropic resolution and were manually inspected by three human experts to exclude those with poor quality, field-of-view issues, and excessive pathology. The images were minimally preprocessed to preserve the visual variability of the data. Furthermore, to enable the DDPMs to produce images with natural orientation variations and inhomogeneity, the images were neither registered to a common coordinate system nor bias field corrected. Evaluations included segmentation, Frechet Inception Distance (FID), and qualitative inspection. Regarding results, all three DDPMs generated coherent MR brain volumes. The velocity and flow prediction models achieved lower FIDs than the sample prediction model. However, all three models had higher FIDs compared to real images across multiple cohorts. In a permutation experiment, the generated brain regional volume distributions differed statistically from real data. However, the velocity and flow prediction models had fewer statistically different volume distributions in the thalamus and putamen. In conclusion this work presents and releases the first 3D non-latent diffusion model for brain data without skullstripping or registration. Despite the negative results in statistical testing, the presented DDPMs are capable of generating high-resolution 3D T_1-weighted brain images. All model weights and corresponding inference code are publicly available at https://github.com/piksl-research/medforj .

  • 4 authors
·
Oct 29, 2025