How partial differential equations can unravel information in data
The advance of data science and the solution of big data questions rely heavily on fundamental mathematical techniques. We are surrounded by technology that collects, transmits and manipulates data on an immense scale; the key is the application and development of advanced mathematics for the efficient gathering and manipulation of ‘data’ (values of qualitative or quantitative variables) and the efficient extraction of ‘information’ (the content and meaning present in data).
A recent special issue of the European Journal of Applied Mathematics (EJAM) on ‘Big Data and Partial Differential Equations’ discusses a particularly novel range of mathematical techniques, based on so-called partial differential equations (PDEs), for the analysis of and simulation from large-scale and high-dimensional data. A partial differential equation relates an unknown function of several independent variables to its partial derivatives. Based on Newton’s differential calculus, PDEs are often used to describe phenomena in nature in which the rates of change of quantities are compared. They have applications in physics, biology, engineering, computer science, social sciences and economics, and describe (by way of example) the laws of quantum mechanics, relativity and fluid dynamics. This EJAM special issue presents an exciting range of examples of their application to pressing data science problems.
The papers “A new analytical approach to consistency and overfitting in regularized empirical risk minimization” by Nicolás García Trillos and Ryan Murray, and “On the game p-Laplacian on weighted graphs with applications in image processing and data clustering” by Abderrahim Elmoataz, Xavier Desquesnes, and Matthieu Toutain, both combine partial differential equations with graph-based methods.
The paper by García Trillos and Murray proves an asymptotic consistency result for regularized empirical risk minimization. A graph-based functional is proposed as the objective function. It includes a regularization term to prevent overfitting, and the balance between risk minimization and regularization is tuned with the regularization parameter λ. Modern discrete-to-continuum Γ-convergence techniques are used to show that, in the limit of infinitely many data points, the correct continuum functional is obtained, provided λ is chosen in the correct regime. Precise conditions are obtained which show that λ has to be large enough to prevent overfitting, yet small enough to prevent underfitting.
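The trade-off governed by λ can be illustrated with a minimal Python sketch. It is not the functional studied by García Trillos and Murray: as a stand-in, it assumes a squared loss and a quadratic graph-Laplacian regularizer on an ε-neighbourhood graph, and the names `graph_laplacian`, `regularized_erm`, `eps` and `lam` are hypothetical.

```python
import numpy as np

def graph_laplacian(X, eps):
    """Unnormalized Laplacian of the eps-neighbourhood graph on the rows of X.

    Assumption: two points are joined by an edge of weight 1 when their
    Euclidean distance is below eps.
    """
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = (d2 < eps ** 2).astype(float)
    np.fill_diagonal(W, 0.0)
    return np.diag(W.sum(axis=1)) - W

def regularized_erm(L, y, lam):
    """Minimize (1/n) * sum_i (u_i - y_i)^2 + lam * u^T L u over u.

    Setting the gradient to zero gives the linear system
    (I + n * lam * L) u = y.
    """
    n = len(y)
    return np.linalg.solve(np.eye(n) + n * lam * L, y)

# Noisy labels on 20 points of the unit interval (illustrative data only).
rng = np.random.default_rng(0)
X = np.linspace(0.0, 1.0, 20)[:, None]
y = np.sign(X[:, 0] - 0.5) + 0.3 * rng.standard_normal(20)
L = graph_laplacian(X, eps=0.2)

u_overfit = regularized_erm(L, y, lam=0.0)    # reproduces the noisy labels
u_balanced = regularized_erm(L, y, lam=0.05)  # smoothed along the graph
u_underfit = regularized_erm(L, y, lam=1e3)   # nearly constant
```

In this toy setting λ = 0 simply returns the noisy labels, while a very large λ flattens the minimizer towards the mean label on a connected graph; the useful regime lies in between.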
The Elmoataz, Desquesnes and Toutain paper uses a graph-based, game-theoretic p-Laplacian for tasks such as image inpainting and semi-supervised classification. The key idea is to mimic the definition of the continuum p-Laplacian in a graph setting and to solve a Dirichlet problem that uses this graph p-Laplacian and incorporates a priori known data in its boundary conditions. Existence and uniqueness of solutions are proven, and numerical experiments show how this idea is applied in practice.
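A rough Python sketch of such a graph Dirichlet problem is given below. It is a simplified illustration rather than the operator defined in the paper: it assumes that the value at each unlabelled node is a convex combination of the weighted mean of its neighbours (the 2-Laplacian part) and the midpoint of their maximum and minimum (an unweighted ∞-Laplacian part), with known values held fixed on the labelled nodes. The function name `solve_graph_dirichlet` and the parameter `alpha` are hypothetical.

```python
import numpy as np

def solve_graph_dirichlet(W, boundary, alpha=0.5, iters=5000, tol=1e-10):
    """Fixed-point iteration for a simplified game-type p-Laplacian on a graph.

    W        : symmetric (n, n) array of nonnegative weights, zero diagonal.
    boundary : dict mapping node index -> prescribed (known) value.
    alpha    : weight in [0, 1] of the midrange (infinity-Laplacian) part.
    Assumption: the graph is connected, so every node has a neighbour.
    """
    n = W.shape[0]
    u = np.zeros(n)
    for i, value in boundary.items():
        u[i] = value
    interior = [i for i in range(n) if i not in boundary]
    for _ in range(iters):
        u_new = u.copy()
        for i in interior:
            nbrs = np.nonzero(W[i])[0]
            mean = W[i, nbrs] @ u[nbrs] / W[i, nbrs].sum()
            midrange = 0.5 * (u[nbrs].max() + u[nbrs].min())
            u_new[i] = alpha * midrange + (1.0 - alpha) * mean
        converged = np.max(np.abs(u_new - u)) < tol
        u = u_new
        if converged:
            break
    return u

# Path graph with 5 nodes; the two end nodes carry known labels 0 and 1.
W = np.zeros((5, 5))
for i in range(4):
    W[i, i + 1] = W[i + 1, i] = 1.0
u = solve_graph_dirichlet(W, {0: 0.0, 4: 1.0})
```

For semi-supervised classification the labelled nodes play the role of the boundary, and thresholding the interpolated values (here at 0.5) assigns a class to each unlabelled node.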
Another example is the paper “Hybrid PDE solver for data-driven problems and modern branching” by Francisco Bernal, Gonçalo dos Reis and Greig Smith, which deals with probabilistic domain decomposition for the numerical solution of partial differential equations that model phenomena driven by very large data sets.
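Probabilistic methods of this kind rest on representing the solution of a PDE at a single point as an expectation over random paths. The following Python sketch shows only that basic building block, under simplifying assumptions: it treats the one-dimensional heat equation u_t = ½ u_xx with initial condition g, for which u(t, x) = E[g(x + W_t)] with W a Brownian motion. It is not the hybrid solver or the branching scheme of the paper, and the name `heat_solution_mc` is hypothetical.

```python
import numpy as np

def heat_solution_mc(g, x, t, n_samples=200_000, seed=0):
    """Monte Carlo estimate of u(t, x) for u_t = 0.5 * u_xx, u(0, .) = g.

    Uses the representation u(t, x) = E[g(x + W_t)], where W_t is normally
    distributed with mean 0 and variance t.
    """
    rng = np.random.default_rng(seed)
    endpoints = x + np.sqrt(t) * rng.standard_normal(n_samples)
    return g(endpoints).mean()

# For g(x) = x^2 the exact solution is u(t, x) = x^2 + t.
estimate = heat_solution_mc(lambda z: z ** 2, x=1.0, t=1.0)
exact = 1.0 ** 2 + 1.0
```

Because each point value is computed independently of all the others, estimates of this type can be evaluated in parallel, for example on the interfaces between subdomains.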
Recent years have seen an increasing interest from applied analysts in applying the models and techniques from variational methods and PDEs to problems in data science. This issue of the European Journal of Applied Mathematics highlights some of these exciting developments in this young and growing area.