AGU 2019
EP53C
Machine Learning Applications in Earth Surface Processes Research

**Welcome to the extended version of my AGU poster!**


Copyright - The Authors 2019

Collaboration between scientists and stakeholders increasingly focuses on large-scale challenges linked to the competing needs of aquatic ecosystems and human activities. In that context, geomorphic classifications of rivers are critical to the assessment of ecohydraulic suitability across a watershed or basin. In recent years, hierarchical geomorphic classifications derived statistically from field data have gained traction and have been applied in many study areas. Yet, because they rely on costly in-situ data collection, such bottom-up classifications have limited spatial coverage, and comparison across study areas remains a difficult exercise that often hinges on expert knowledge. In this study, to provide a reliable quantitative approach for comparing classifications, we closely investigate the outputs from a data-driven, machine-learning-enabled framework applied to seven study areas covering most of California (USA). The present analysis has two main components. First, we leverage the gap in performance between traditional machine learning models and deep learning models trained to identify reach-scale channel types. Importantly, this difference in performance is tied to the mismatch in scale between remote sensing predictors and channel-type attributes. Second, we characterize the probabilistic outputs of the machine-learning models in each region using machine-learning techniques, spatial statistics and information theory measures. In particular, we outline the relationship between model uncertainties and the underlying predictor space. Such an approach increases the interpretability of individual geomorphic classifications, identifies their characteristic spatial scale and paves the way to combining distinct regional classifications into an overarching statewide classification. Furthermore, this study provides decisive insights for top-down geomorphic classifications enabled by the increasing coverage and resolution of remote sensing data.

- The proposed framework quantitatively characterizes and compares statistical classifications established in different areas of study.
- The difference between traditional and deep learning performance identifies the minimum degree of information needed to separate classes.
- Nested resampling and entropy-based measures underline the stability of random forests in statistical learning and predictive modeling.

Machine learning is increasingly central to deriving the regional classifications needed to address the competing needs of aquatic ecosystems and human activities. For example, McManamay et al. (2018) described physical habitat diversity in the eastern United States with a combinatorial multi-layer approach. Wolfe et al. (2019) clustered watersheds in Canada based on climate, geological, topographical, and land-cover data. Yang, Griffiths, and Zammit (2019) predicted three types of surface–groundwater interaction from geology, hydrology, and land use data. Henshaw et al. (2019) discriminated between five river types using process indicators extracted from remote sensing imagery. Beyond these studies with clear implications for regional management, Beechie and Imaki (2014) distinguished between four channel patterns in the Columbia River basin, USA, and Clubb, Bookhagen, and Rheinwalt (2019) identified geomorphic domains by clustering river profiles.

Despite the great predictive ability of such data-driven classifications, comparison across study areas remains a difficult exercise that often hinges on expert knowledge. This is in part linked to costly in-situ data collection, which leads to limited training datasets. Furthermore, there is no existing quantitative framework to compare similar statistical classifications established in vastly different regions. In particular, the characteristic spatial scale of such classifications is unknown, making it difficult to combine them or compare associated findings. In this study, we develop a reliable quantitative approach for comparing classifications and leverage a rare example of five statistical classifications describing the types of river occurring in different regions of California (Byrne et al. 2019). The developed approach increases the interpretability of each individual geomorphic classification, identifies their characteristic spatial scale and paves the way to combining distinct regional classifications into an overarching statewide classification.

In recent years, deep learning approaches have had unreasonable success in predicting complex patterns (LeCun, Bengio, and Hinton 2015). More precisely, deep neural networks have the ability to approximate any function between input and output (Cybenko 1989; Hornik, Stinchcombe, and White 1989) while avoiding local optima (Baldassi et al. 2016) and with a limited number of parameters (H. W. Lin, Tegmark, and Rolnick 2017). This success is decisively linked to the stacked architecture of deep neural networks, which can reverse the hierarchical generative process responsible for the complex relationship between output and input data (H. W. Lin, Tegmark, and Rolnick 2017). Such a hierarchical process is best reversed when information is distilled near-perfectly from one layer to another, that is, from one step of the generative process to the next (H. W. Lin, Tegmark, and Rolnick 2017; Tishby and Zaslavsky 2015). In our case, the level of information in the input data is constant across regions of study, but the level of information contained in the class labels is both unknown and unlikely to perfectly match the information in the input data.

Support Vector Machine (SVM) and Random Forest (RF) are two of the most widely used traditional machine learning methods (Cortes and Vapnik 1995; Ho 1995; Breiman 2001). SVM is a maximum-margin classifier in which the width of the margin is defined by the distance between each class’ closest points, which form the support vectors of the class boundary. Conceptualized as an ensemble of classification and regression trees (Breiman 1984), RF includes at each split of each tree an information selection process based on the Gini impurity or on an information theory measure. While this inherent feature selection is one interesting attribute of RF, its decisive strengths are its ensemble decision process and the internal bagging process leading to uncorrelated trees. Such characteristics lead to great performance when the training dataset is small, noisy or both (Fox et al. 2017).
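To make the split-selection step concrete, the following is a minimal sketch of the two kinds of criteria mentioned above (Gini-based and entropy-based), written in plain Python. The function names, the class labels and the slope values are hypothetical and chosen for illustration only; this is not the implementation used in the study.

```python
import math
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 - sum_k p_k^2 over the class proportions p_k."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def shannon_entropy(labels):
    """Shannon entropy (in bits) of the class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def split_gain(labels, goes_left, impurity=gini_impurity):
    """Decrease in impurity obtained by splitting `labels` with a boolean mask."""
    left = [y for y, m in zip(labels, goes_left) if m]
    right = [y for y, m in zip(labels, goes_left) if not m]
    n = len(labels)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(labels) - weighted


# Hypothetical reaches: channel slope (m/m) and a made-up channel-type label.
slope = [0.002, 0.004, 0.006, 0.008, 0.030, 0.050, 0.070, 0.090]
label = ["type_A"] * 4 + ["type_B"] * 4

# Candidate split "slope < 0.01" separates the two made-up types perfectly.
mask = [s < 0.01 for s in slope]
gain_gini = split_gain(label, mask, gini_impurity)
gain_entropy = split_gain(label, mask, shannon_entropy)
```

In a tree, each candidate split would be scored this way and the one with the largest gain retained; a random forest additionally restricts each split to a random subset of predictors.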

To estimate the information needed to discriminate between the different classes of channel types, we leverage the median **Jensen-Shannon distance**, an information theory measure. The median Jensen-Shannon distance of a classification, \(\tilde{d}_{JS}\), estimates the typical degree of information needed to separate class examples. In our case, we calculate the average \(\bar{d}_{JS}\) from the distributions of channel attributes measured in-situ. Channel types with high \(\bar{d}_{JS}\) are defined from more distinct underlying information and thus require a lower degree of information to discriminate between them. Conversely, channel types with low \(\bar{d}_{JS}\) are defined from less distinct underlying information and thus require a higher degree of information to discriminate between them. Importantly, we expect confined classes to have a low \(\bar{d}_{JS}\) and to require a higher degree of discrimination information.

In the following, we detail the relations between Jensen-Shannon distance, entropy and Kullback-Leibler divergence. First, we introduce **Shannon’s entropy**, which describes how predictable a random variable \(X\) with discrete probability mass function \(P\) over \(n\) outcomes is (Shannon 1948):

\[ H(X) = - \sum_{i=1}^n P(x_i) \log_b P(x_i) \]

Here, \(b\) represents the base of the logarithm. We use \(b = 2\), so that all metrics have units of bits. Second, we introduce the **Kullback-Leibler divergence**, which describes the mean information for discriminating between discrete probability distributions \(P\) and \(Q\) by observing \(P\) (Kullback and Leibler 1951):

\[ D_{KL}(P,Q) = \sum_{i=1}^n P(x_i) \log_b \frac{P(x_i)}{Q(x_i)} \]

Formally, the Kullback-Leibler divergence is the expectation of the logarithmic difference between discrete probability distributions \(P\) and \(Q\) with respect to probabilities \(P\). Because of this, the Kullback-Leibler divergence is asymmetric and, in non-trivial cases, \(D_{KL}(P,Q) \neq D_{KL}(Q ,P)\). The Kullback-Leibler divergence is related to entropy:

\[ D_{KL}(P,Q) = H(P,Q) - H(P) \]

where \(H(P,Q)\) is the cross-entropy such that \(H(P,Q) = - \sum_{i=1}^n P(x_i) \log_b Q(x_i)\). Third, the **Jensen-Shannon divergence** is a measure of discrimination between two probability functions that is directly related to the Kullback-Leibler divergence (J. Lin 1991; Topsoe 2000):

\[D_{JS}(P,Q) = \frac{1}{2} \lbrack D_{KL}(P,R) + D_{KL}(Q,R) \rbrack \]

with \(R = \frac{1}{2} (P+Q)\) the midpoint probability. Finally, the **Jensen-Shannon distance**, \(d_{JS} = D_{JS}^{1/2}\), retains the advantageous symmetric property of \(D_{JS}\) while also satisfying the triangle inequality, which makes it a proper distance metric (Endres and Schindelin 2003).
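The chain of definitions above (entropy, cross-entropy, Kullback-Leibler divergence, Jensen-Shannon divergence and distance) can be sketched in a few lines of plain Python. The function names and the example distributions are illustrative assumptions, not part of the original analysis; terms with \(P(x_i) = 0\) are skipped under the usual convention \(0 \log 0 = 0\), and `kl_divergence` assumes \(Q(x_i) > 0\) wherever \(P(x_i) > 0\).

```python
import math


def entropy(p, base=2):
    """Shannon entropy H(P) of a discrete distribution."""
    return -sum(pi * math.log(pi, base) for pi in p if pi > 0)


def cross_entropy(p, q, base=2):
    """Cross-entropy H(P,Q); assumes q_i > 0 wherever p_i > 0."""
    return -sum(pi * math.log(qi, base) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p, q, base=2):
    """Kullback-Leibler divergence D_KL(P,Q); assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi, base) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q, base=2):
    """Jensen-Shannon divergence with midpoint distribution R = (P + Q) / 2."""
    r = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * (kl_divergence(p, r, base) + kl_divergence(q, r, base))


def js_distance(p, q, base=2):
    """Jensen-Shannon distance, the square root of the divergence."""
    # max(..., 0.0) guards against tiny negative values from rounding.
    return math.sqrt(max(js_divergence(p, q, base), 0.0))


# Made-up distributions over three bins, for illustration only.
P = [0.7, 0.2, 0.1]
Q = [0.1, 0.3, 0.6]

asymmetric = kl_divergence(P, Q) != kl_divergence(Q, P)
identity_gap = kl_divergence(P, Q) - (cross_entropy(P, Q) - entropy(P))
symmetric_gap = js_distance(P, Q) - js_distance(Q, P)
```

With \(b = 2\), the Jensen-Shannon distance is bounded between 0 (identical distributions) and 1 (distributions with disjoint support), which makes values comparable across attributes and regions.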

We assess the performance of machine learning models in statistical learning and predictive modeling.

The performance in statistical learning is assessed by benchmarking machine-learning models across the five regions of study using **area-under-curve** and **hyper-parameter tuning entropy**. Based on previous research, we limit the base learners to random forest (RF; Breiman 2001), support vector machine (SVM; Cortes and Vapnik 1995) and deep artificial neural network (ANN; LeCun, Bengio, and Hinton 2015). The observations are balanced using the Synthetic Minority Over-sampling Technique (SMOTE; Chawla et al. 2002). The input data are filtered for no-variance predictors, centered and scaled, and missing values are imputed with the median. The tuning length is set to 32 for both discrete tuning (RF, SVM) and random discrete tuning (ANN). The benchmark is performed with nested resampling, which estimates the robustness of the tuning process and limits over-fitting by using two nested loops: an inner loop for model tuning and an outer loop for model selection (Bischl et al. 2012). Here, the outer resampling is a 10-fold stratified cross-validation repeated 10 times and the inner resampling is a 10-fold stratified cross-validation. While traditional resampling leads to a distribution of model performance, nested resampling additionally provides a distribution of best-tuned hyper-parameters. Consequently, in addition to area-under-curve (AUC), the performance of each model is assessed by estimating the hyper-parameter tuning entropy from the distribution of its best-tuned hyper-parameters. AUC is preferred here to accuracy for its higher discrimination performance, its relation to class-separability and its suitability for limited datasets (Rosset 2004; Huang and Ling 2005; Ferri, Hernández-Orallo, and Modroiu 2009). In total, these benchmark parameters lead to training 480,000 distinct models.
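As a sanity check on the benchmark size, and to illustrate what a hyper-parameter tuning entropy could look like, here is a minimal Python sketch. It assumes that one model is fitted per region, learner, outer iteration, inner fold and candidate setting, a counting convention that reproduces the stated total. The `tuning_entropy` function and the lists of best-tuned values are hypothetical; the study's exact estimator may differ.

```python
import math
from collections import Counter

# Benchmark settings stated in the text.
n_regions = 5
n_learners = 3            # RF, SVM, ANN
outer_folds = 10
outer_repeats = 10
inner_folds = 10
tuning_length = 32

# Assumption: one fit per (region, learner, outer iteration, inner fold, candidate).
n_models = (n_regions * n_learners * outer_folds * outer_repeats
            * inner_folds * tuning_length)


def tuning_entropy(best_values):
    """Shannon entropy (bits) of the empirical distribution of best-tuned values."""
    n = len(best_values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(best_values).values())


# Hypothetical best-tuned values of one hyper-parameter across outer iterations.
stable_tuning = [4] * 9 + [8]                        # almost always the same optimum
unstable_tuning = [1, 2, 4, 8, 16, 2, 4, 8, 1, 16]   # optimum changes between folds

h_stable = tuning_entropy(stable_tuning)
h_unstable = tuning_entropy(unstable_tuning)
```

Under this reading, a low tuning entropy means the outer iterations repeatedly select the same hyper-parameters, which is one way to quantify the stability of a learner's tuning process.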

The performance in predictive modeling is assessed in two quantitative ways: from a **regional entropy** measure and from a network-scale **entropy rate**. First, a regional prediction entropy is computed from the distribution of discrete class predictions over the region of study. Such prediction entropy is expected to have a value close to the entropy computed from the distribution of class examples in the training data. In particular, significantly lower values would identify biased models. Second, the entropy *rate* leverages the network structure of the predictions and estimates the stability of the predictions from the transition probabilities between channel types. Such an entropy rate prognosticates the prediction skill of a model (Stephenson and Dolas-Reyes 2000; Roulston and Smith 2002) and helps select the models providing the best information (Daley and Vere-Jones 2004; Nearing and Gupta 2015). Both metrics are computed from predictions after a cross-validated multinomial calibration that corrects the potential distortion of posterior probabilities and improves model performance (DeGroot and Fienberg 1983; Zadrozny 2002; Niculescu-Mizil and Caruana 2005).
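The two predictive-modeling metrics can be sketched as follows. This is an illustrative reading, not the study's code: the entropy rate is computed with the standard Markov-chain formula \(-\sum_i \pi_i \sum_j P_{ij} \log_2 P_{ij}\), where the transition probabilities \(P_{ij}\) are estimated from pairs of predicted classes on connected reaches and, as an assumption, \(\pi_i\) is approximated by the empirical frequency of the upstream class. All names and example data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def prediction_entropy(labels):
    """Shannon entropy (bits) of the distribution of discrete class predictions."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def entropy_rate(transitions):
    """Entropy rate (bits) from (upstream_class, downstream_class) pairs.

    Assumption: the weight of each upstream class is its empirical frequency
    among the observed transitions, standing in for a stationary distribution.
    """
    counts = defaultdict(Counter)
    for upstream, downstream in transitions:
        counts[upstream][downstream] += 1
    total = len(transitions)
    rate = 0.0
    for row in counts.values():
        row_total = sum(row.values())
        weight = row_total / total
        for c in row.values():
            p = c / row_total          # estimated transition probability
            rate -= weight * p * math.log2(p)
    return rate


# Hypothetical predictions along a stream network with two channel types.
persistent = [("A", "A")] * 5 + [("B", "B")] * 5          # types never change
erratic = [("A", "A"), ("A", "B"), ("B", "A"), ("B", "B")] * 3

rate_persistent = entropy_rate(persistent)
rate_erratic = entropy_rate(erratic)
regional = prediction_entropy(["A"] * 6 + ["B"] * 6)
```

Both example sets have the same regional entropy, since the two types are equally frequent, but very different entropy rates. This is why the two metrics are complementary: the first looks only at class frequencies, the second at how predictions are arranged along the network.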

To understand what explains the performance of traditional machine learning models and of deep learning models, we perform **a correlation analysis between variables describing each region and the performance of the machine learning models**. The regional variables included are the number of observations, area, observation density, number of classes, number of confined classes and the degree of required discrimination information \(\bar{d}_{JS}\) in the region, aggregated as a median over all classes and as minima \(\min\left\{\bar{d}_{JS}\right\}\) over all classes and over confined classes only. The performance metrics included are AUC, accuracy, hyper-parameter tuning entropy and entropy rate. In addition, we correlate the difference in statistical learning performance between traditional and deep learning models with the regional variables. Both Pearson’s and Spearman’s correlations were computed on scaled data and yielded similar results.
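For readers less familiar with the two coefficients, the following plain-Python sketch computes Pearson's and Spearman's correlations on z-scored data. The helper names are hypothetical and the example vectors are made up (they are not results from this study); Spearman's coefficient is obtained as Pearson's correlation of the average ranks.

```python
import math


def zscore(x):
    """Center and scale a sample to zero mean and unit (sample) standard deviation."""
    m = sum(x) / len(x)
    s = math.sqrt(sum((v - m) ** 2 for v in x) / (len(x) - 1))
    return [(v - m) / s for v in x]


def pearson(x, y):
    """Pearson's product-moment correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(x):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [0.0] * len(x)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and x[order[j + 1]] == x[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = average_rank
        i = j + 1
    return r


def spearman(x, y):
    """Spearman's rank correlation: Pearson's correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Made-up values for five regions: a regional variable and a performance metric.
regional_variable = [1.0, 2.0, 3.0, 4.0, 5.0]
performance = [1.0, 4.0, 9.0, 16.0, 25.0]   # monotone but non-linear relation

r_pearson = pearson(zscore(regional_variable), zscore(performance))
r_spearman = spearman(zscore(regional_variable), zscore(performance))
```

Comparing the two coefficients is a quick check for monotone but non-linear relations, for which Spearman's coefficient is higher than Pearson's, as in this made-up example.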

The following graphs present the average value of the Jensen-Shannon distance, \(\bar{d}_{JS}\), across field attributes for channel types, on a one-class vs one-class basis. The median and mean values of the entire regional matrix are given in parentheses.
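A one-class vs one-class matrix of this kind could be assembled as in the sketch below, which averages the Jensen-Shannon distance over field attributes for every pair of channel types and then summarizes the matrix. The attribute names, class names and binned distributions are entirely hypothetical, and the binning of field attributes into shared classes is an assumption of this example.

```python
import math
from itertools import combinations
from statistics import mean, median


def js_distance(p, q):
    """Jensen-Shannon distance (base 2) between two discrete distributions."""
    r = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(u, v):
        return sum(a * math.log2(a / b) for a, b in zip(u, v) if a > 0)

    return math.sqrt(max(0.5 * (kl(p, r) + kl(q, r)), 0.0))


# Hypothetical binned distributions of two field attributes for three channel types.
histograms = {
    "type_1": {"slope": [0.7, 0.2, 0.1], "width": [0.6, 0.3, 0.1]},
    "type_2": {"slope": [0.1, 0.3, 0.6], "width": [0.5, 0.4, 0.1]},
    "type_3": {"slope": [0.1, 0.2, 0.7], "width": [0.1, 0.2, 0.7]},
}


def mean_js_matrix(histograms):
    """Average d_JS across shared attributes for every pair of classes."""
    pairs = {}
    for a, b in combinations(sorted(histograms), 2):
        shared = histograms[a].keys() & histograms[b].keys()
        pairs[(a, b)] = mean(
            js_distance(histograms[a][k], histograms[b][k]) for k in shared
        )
    return pairs


matrix = mean_js_matrix(histograms)
matrix_median = median(matrix.values())
matrix_mean = mean(matrix.values())
matrix_min = min(matrix.values())   # hardest pair of classes to separate
```

In this made-up example, `type_2` and `type_3` share nearly the same slope distribution, so their average distance is lower than that of `type_1` and `type_3`, which differ on both attributes.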