Given training data \((x_i, y_i)\), \(i = 1,\dots,n\), with labels \(y_i \in \{-1,+1\}\), we seek the hyperplane \(f(x) = w^T x + b\) that separates the two (linearly separable) classes with maximum margin.
The primal form is:
\[ \min_{w,b} \frac{1}{2} \|w\|^2 \] subject to \[ y_i (w^T x_i + b) \ge 1, \quad \forall i. \]
We form the Lagrangian with multipliers \(\alpha_i \ge 0\):
\[ \mathcal{L}(w, b, \alpha) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^n \alpha_i [y_i (w^T x_i + b) - 1]. \]
Take partial derivatives:
Derivative w.r.t. \(w\): \[\frac{\partial \mathcal{L}}{\partial w} = w - \sum_i \alpha_i y_i x_i = 0 \Rightarrow w = \sum_i \alpha_i y_i x_i.\]
Derivative w.r.t. \(b\): \[\frac{\partial \mathcal{L}}{\partial b} = -\sum_i \alpha_i y_i = 0.\]
Substituting these conditions back into the Lagrangian yields the dual problem:
\[ \max_\alpha \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i=1}^n\sum_{j=1}^n \alpha_i \alpha_j y_i y_j (x_i^T x_j) \] subject to \[ \sum_i \alpha_i y_i = 0, \quad \alpha_i \ge 0. \]
This is a Quadratic Programming (QP) problem.
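To make the dual concrete, here is a minimal sketch (an illustration, not how e1071 solves it) that feeds the hard-margin dual to solve.QP() from the quadprog package, assuming quadprog is installed; a small ridge is added to the quadratic term so it is numerically positive definite.
# Solve the hard-margin dual with quadprog (illustrative sketch)
library(quadprog)
X <- matrix(c(1,2, 2,3, 3,3, 2,1), ncol = 2, byrow = TRUE)
y <- c(1, 1, -1, -1)
n <- nrow(X)
K <- X %*% t(X)                               # linear kernel (Gram matrix)
Dmat <- (y %*% t(y)) * K + diag(1e-6, n)      # y_i y_j x_i^T x_j, ridged for PD
dvec <- rep(1, n)                             # max sum(alpha) - 1/2 alpha' D alpha
Amat <- cbind(y, diag(n))                     # sum_i alpha_i y_i = 0 ; alpha_i >= 0
bvec <- rep(0, n + 1)
sol   <- solve.QP(Dmat, dvec, Amat, bvec, meq = 1)
alpha <- sol$solution
w  <- colSums(alpha * y * X)                  # w = sum_i alpha_i y_i x_i
sv <- which(alpha > 1e-6)                     # support vectors
b  <- mean(y[sv] - X[sv, , drop = FALSE] %*% w)
c(w = w, b = b)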
For non-separable data, introduce slack variables \(\xi_i\).
\[ \min_{w,b,\xi} \frac{1}{2} \|w\|^2 + C \sum_{i=1}^n \xi_i \] subject to \[ y_i (w^T x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0. \]
Lagrangian:
\[ \mathcal{L}(w,b,\xi,\alpha,\mu) = \frac{1}{2}\|w\|^2 + C \sum_i \xi_i - \sum_i \alpha_i [y_i(w^T x_i + b) - 1 + \xi_i] - \sum_i \mu_i \xi_i \] with \(\alpha_i, \mu_i \ge 0\).
Derivative wrt \(w\): \[w=\sum_{i=1}^n \alpha_iy_ix_i\]
Derivative wrt \(b\): \[\sum_i\alpha_iy_i=0\]
Derivative wrt \(\xi_i\): \[C-\alpha_i-\mu_i=0\implies 0\leq\alpha_i\leq C\]
Substituting gradients yields the dual:
\[ \max_\alpha \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i^T x_j) \] subject to \[ \sum_i \alpha_i y_i = 0, \] \[ 0 \le \alpha_i \le C. \]
The only difference from the hard-margin dual is the box constraint \(0 \le \alpha_i \le C\) on the multipliers; the objective is unchanged.
Replace dot products with a kernel function:
\[K(x_i, x_j) = \phi(x_i)^T \phi(x_j).\]
Dual becomes:
\[ \max_\alpha \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \] subject to \[ \sum_i \alpha_i y_i = 0, \] \[ 0 \le \alpha_i \le C. \]
# Example dataset
x= matrix(c(1,2, 2,3, 3,3, 2,1), ncol=2, byrow=TRUE)
y= c(1,1,-1,-1)
library(e1071)
## Warning: package 'e1071' was built under R version 4.2.3
model= svm(x, y, type="C-classification", kernel="linear", cost=1, scale=FALSE)
model
##
## Call:
## svm.default(x = x, y = y, scale = FALSE, type = "C-classification",
## kernel = "linear", cost = 1)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
##
## Number of Support Vectors: 4
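The same e1071 interface supports nonlinear kernels through the kernel trick; a hedged variant of the fit above with the RBF kernel (the gamma argument is the kernel width) would be:
# RBF-kernel SVM on the same toy data (illustrative)
model_rbf= svm(x, y, type="C-classification", kernel="radial", gamma=0.5, cost=1, scale=FALSE)
table(fitted = predict(model_rbf, x), truth = y)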
Random Forest (RF) is an ensemble method introduced by Breiman (2001) that builds a collection of decision trees trained on bootstrap samples and with randomized feature selection at each split. The ensemble prediction is an average (regression) or majority vote (classification). The two key mechanisms that make RF effective are:
- bootstrap aggregation (bagging), which averages trees grown on resampled datasets and reduces variance; and
- random selection of candidate predictors (mtry) at each split, which reduces correlation between trees and therefore further reduces ensemble variance.
RF is nonparametric, handles high-dimensional inputs, and requires relatively little preprocessing.
A regression tree partitions the predictor space into disjoint regions \(R_1,\dots,R_M\) and predicts the mean of the response within each region. For squared error loss, a single split is chosen to minimize within-node sum of squares. Formally, for a split on variable \(j\) at value \(s\):
\[ \min_{j,s} \Bigg\{ \sum_{x_i \in R_1(j,s)} (y_i - \bar{y}_{R_1})^2 + \sum_{x_i \in R_2(j,s)} (y_i - \bar{y}_{R_2})^2 \Bigg\}, \]
where \(R_1(j,s) = \{x: x_j \le s\}\) and \(R_2(j,s) = \{x: x_j > s\}\).
Trees are grown recursively and typically grown deep in RF (no pruning) to keep bias low.
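To make the split criterion concrete, the following minimal sketch (illustrative, not the randomForest internals) searches all candidate cut points on a single predictor and returns the one minimizing the within-node sum of squares:
# Exhaustive best-split search on one predictor (squared error criterion)
best_split <- function(x, y) {
  cuts <- sort(unique(x))
  cuts <- (head(cuts, -1) + tail(cuts, -1)) / 2          # candidate values of s
  sse  <- sapply(cuts, function(s) {
    left <- y[x <= s]; right <- y[x > s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  list(s = cuts[which.min(sse)], sse = min(sse))
}
set.seed(1)
x <- runif(100)
y <- ifelse(x > 0.5, 2, 0) + rnorm(100, 0, 0.3)
best_split(x, y)                                          # chosen cut should be near 0.5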
Bagging aims to reduce variance by averaging many unstable estimators. Suppose we have an estimator \(\hat{f}(x; \mathcal{D})\) trained on dataset \(\mathcal{D}\). Draw \(B\) bootstrap datasets \(\mathcal{D}^{(b)}\) and form trees \(T^{(b)}(x) = \hat{f}(x; \mathcal{D}^{(b)})\). The bagged predictor is
\[ \hat{f}_{\text{bag}}(x) = \frac{1}{B} \sum_{b=1}^B T^{(b)}(x). \]
If \(\mathrm{Var}[T(x)] = \sigma^2\) and pairwise correlation between trees is \(\rho\), the variance of the bagged predictor is approximately:
\[ \mathrm{Var}[\hat{f}_{\text{bag}}(x)] = \rho \sigma^2 + \frac{1-\rho}{B} \sigma^2. \]
As \(B\to\infty\), variance tends to \(\rho \sigma^2\). Thus, reducing the correlation \(\rho\) among trees is crucial; Random Forest uses random feature selection to do this.
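A quick numeric check of this expression with assumed values of \(\sigma^2\) and \(\rho\) shows how the variance approaches the floor \(\rho\sigma^2\) as \(B\) grows:
# Variance of the bagged predictor for assumed sigma^2 = 4, rho = 0.3
sigma2 <- 4; rho <- 0.3
B <- c(1, 10, 100, 1000)
rho * sigma2 + (1 - rho) / B * sigma2    # tends to rho * sigma2 = 1.2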
Given training data \(\mathcal{D} = \{(x_i,y_i)\}_{i=1}^n\), predictors \(p\), and hyperparameters \(B\) (number of trees) and \(m\) (mtry):
For \(b = 1,\dots,B\): draw a bootstrap sample \(\mathcal{D}^{(b)}\) from \(\mathcal{D}\) and grow a tree \(T^{(b)}\) on it, at each split selecting the best split among \(m\) randomly chosen predictors (no pruning). For regression, the forest prediction is the average over trees:
\[ \hat{f}_{RF}(x) = \frac{1}{B} \sum_{b=1}^B T^{(b)}(x). \]
For classification, the ensemble predicts the class with the largest number of votes across trees.
Practical defaults: for regression \(m \approx p/3\), for classification \(m \approx \sqrt{p}\).
Each bootstrap sample excludes on average about \(e^{-1} \approx 0.368\) fraction of the observations. For an observation \(i\), collect the subset of trees for which \(i\) was OOB. The OOB prediction for \(x_i\) is the average (or majority vote) across those trees. The OOB error is:
\[ \mathrm{Err}_{OOB} = \frac{1}{n} \sum_{i=1}^n L\big(y_i, \hat{f}_{OOB}(x_i)\big), \]
where \(L\) is an appropriate loss (squared error or 0–1 loss). OOB error is a nearly unbiased estimator of test error and is commonly used for model selection and quick diagnostics.
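As a small self-contained check (a toy example separate from the analysis below), the randomForest package stores each observation's out-of-bag prediction in the fitted object's predicted component and the running OOB MSE in mse, so the OOB error can be recomputed by hand:
# Toy illustration of OOB error with randomForest
library(randomForest)
set.seed(1)
d <- data.frame(x = runif(200))
d$y <- sin(4 * d$x) + rnorm(200, 0, 0.2)
rf0 <- randomForest(y ~ x, data = d, ntree = 300)
mean((d$y - rf0$predicted)^2)   # manual OOB MSE from OOB predictions
tail(rf0$mse, 1)                # OOB MSE reported by the package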
Two widely used importance metrics are:
Mean Decrease in Impurity (MDI): For each split in each tree, record the impurity reduction (e.g., reduction in MSE for regression or Gini impurity for classification). Sum these decreases for each variable across all trees and normalize.
Permutation Importance (Mean Decrease Accuracy, MDA): For each variable \(X_j\), permute its values in the out-of-bag samples, recompute the OOB prediction error, and record the increase over the error obtained with \(X_j\) intact; average this increase across trees (optionally scaled by its standard deviation).
Permutation importance reflects effect on predictive accuracy; MDI can be biased toward variables with many categories or continuous variables.
Consider the decomposition of expected squared error at a point \(x\):
\[ \mathbb{E}\big[(Y - \hat{f}_{RF}(x))^2\big] = \underbrace{\sigma^2_{\epsilon}}_{\text{irreducible}} + \underbrace{\big(\mathbb{E}[\hat{f}_{RF}(x)] - f(x)\big)^2}_{\text{bias}^2} + \underbrace{\mathrm{Var}[\hat{f}_{RF}(x)]}_{\text{variance}}. \]
In RF we intentionally grow deep trees (low bias) and reduce variance by averaging. The variance term includes the tree–tree correlation \(\rho\); lowering \(m\) (mtry) tends to reduce \(\rho\) but may increase bias slightly. Thus \(m\) controls the bias–variance trade-off.
Common hyperparameters:
- ntree (B): number of trees. Larger B reduces variance; typical values: 200–2000.
- mtry (m): number of variables sampled at each split. Controls correlation; typical defaults: \(p/3\) (regression), \(\sqrt{p}\) (classification).
- nodesize: minimum observations in terminal nodes (affects smoothness).
- maxnodes / maxdepth: restrict tree size if desired.
- sample_size / replace: bootstrap sample size and whether sampling is with replacement.
Tuning guidance: mtry is usually the most important. Use OOB error or inner CV to tune mtry and nodesize, and keep ntree large enough to stabilize errors.
Under some regularity conditions (e.g., trees grown with appropriate splitting rules, terminal node sizes shrinking properly), Breiman-style forests and certain simplified variants have been shown to be consistent: the RF predictor converges to the true regression function as \(n\to\infty\). More recent theoretical work (e.g., Biau, Scornet, Wager) studies rates of convergence and conditions under which forests adapt to sparsity or low-dimensional structure.
A key theoretical insight is that the randomness (bootstrap + feature subsampling) stabilizes the high-variance nature of trees while preserving their low bias for complex functions.
# Example: synthetic regression dataset and Random Forest analysis
set.seed(42)
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
# synthetic data
n= 500
p= 10
X= matrix(runif(n * p, -1, 1), nrow = n)
dimnames(X)= list(NULL, paste0("X", 1:p))
# true function: nonlinear in first three variables
Y= with(as.data.frame(X), 3*X1 - 2 * X2^2 + sin(3 * X3) + rnorm(n, 0, 0.5))
df= data.frame(Y = Y, X)
# Fit RF
rf= randomForest(Y ~ ., data = df, ntree = 500, mtry = floor(p/3), importance = TRUE)
print(rf)
##
## Call:
## randomForest(formula = Y ~ ., data = df, ntree = 500, mtry = floor(p/3), importance = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 0.6608643
## % Var explained: 85.07
# OOB error plot
plot(rf)
# Variable importance (permutation)
importance(rf)
## %IncMSE IncNodePurity
## X1 109.3714244 1330.22300
## X2 22.0475191 174.26256
## X3 43.6675479 252.22440
## X4 1.5293574 52.20063
## X5 0.2096404 67.99561
## X6 0.9692954 62.69983
## X7 -0.7914752 57.43699
## X8 -4.3104377 52.38036
## X9 -0.9216762 60.16870
## X10 2.7006062 71.46540
varImpPlot(rf, type= 1) # 1 = permutation importance (mean decrease in accuracy)
Tuning mtry with caret (nested CV friendly setup):
library(caret)
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.2.3
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
## Loading required package: lattice
set.seed(123)
ctrl= trainControl(method= "cv", number= 5, search= "grid")
# grid over mtry
grid= expand.grid(.mtry = c(2, 3, 4, 6, 8))
caret_rf= train(Y ~ ., data = df, method = "rf", trControl = ctrl, tuneGrid = grid, ntree = 500)
caret_rf
## Random Forest
##
## 500 samples
## 10 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 400, 400, 400, 400, 400
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 0.9560004 0.8750425 0.7768376
## 3 0.8204965 0.8839073 0.6653442
## 4 0.7688577 0.8842186 0.6216426
## 6 0.7435022 0.8825609 0.5964405
## 8 0.7532739 0.8779009 0.6012917
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 6.
plot(caret_rf)
library(pdp)
# PDP for X1
pdp_X1= partial(rf, pred.var= "X1", train= df)
plotPartial(pdp_X1)
# 2-D PDP for X1 and X3
pdp_X1X3= partial(rf, pred.var = c("X1","X3"), chull = TRUE, train = df)
plotPartial(pdp_X1X3, levelplot = FALSE)
When reporting Random Forest results in a paper or presentation, state the settings used: ntree, mtry, nodesize, and any sampling options.

Gradient Boosting Machines (GBM) are a class of powerful ensemble methods that build a strong predictive model by sequentially fitting many weak learners (usually shallow regression trees) to the negative gradient (pseudo-residuals) of a chosen loss function. The method is due to Friedman (2001) and can be viewed as steepest-descent (functional gradient) optimization in function space.
Key ingredients: a differentiable loss function, a weak base learner (typically a shallow regression tree), an additive update with a shrinkage (learning-rate) factor, and, optionally, subsampling of the training data at each iteration.
GBM is flexible (custom loss functions possible) and often yields excellent predictive performance on structured/tabular data.
Let \(F(x)\) denote the ensemble model. We aim to minimize empirical risk:
\[ \hat{R}(F) = \sum_{i=1}^n L\big(y_i, F(x_i)\big). \]
The functional gradient descent algorithm constructs an additive model:
\[ F_M(x) = F_0(x) + \sum_{m=1}^M \nu \gamma_m h_m(x), \]
where:
\(F_0(x)\) is an initial constant (e.g., for squared loss, the sample mean),
\(h_m(x)\) is the base learner at iteration \(m\),
\(\nu\) is the learning rate (shrinkage), and
\(\gamma_m\) is a step-length determined by line search.
At iteration \(m\):
Compute pseudo-residuals (negative gradients): \[ r_{im} = - \left[ \frac{\partial L\big(y_i, F(x_i)\big)}{\partial F(x_i)} \right]_{F = F_{m-1}}. \]
Fit the base learner to the pseudo-residuals: find \(h_m(x)\) such that \(h_m(x_i) \approx r_{im}\).
Compute multiplier (line search): \[ \gamma_m = \arg\min_{\gamma} \sum_{i=1}^n L\big(y_i, F_{m-1}(x_i) + \gamma h_m(x_i)\big). \]
Update model: \[ F_m(x) = F_{m-1}(x) + \nu \gamma_m h_m(x). \]
This is equivalent to performing gradient descent in the space of functions spanned by the base learners.
Squared error loss: \[ L(y,F)=\tfrac{1}{2}(y-F)^2 \]
Pseudo-residuals: \[ r_{im} = y_i - F_{m-1}(x_i). \]
Line search gives \(\gamma_m=1\) when trees are fit to residuals directly (if tree predictions are the fitted residuals).
Thus, the update simplifies to:
\[F_m(x)= F_{m-1}(x)+ \nu h_m(x)\]
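The squared-error case is easy to code from scratch; the sketch below (illustrative only, using rpart stumps as the base learners rather than gbm's trees) implements exactly the update \(F_m = F_{m-1} + \nu h_m\):
# From-scratch gradient boosting with squared error loss (illustrative)
library(rpart)
set.seed(1)
n <- 300
x <- runif(n, -2, 2)
y <- sin(2 * x) + rnorm(n, 0, 0.3)
nu <- 0.1; M <- 200
F_hat <- rep(mean(y), n)                       # F_0: constant initial fit
for (m in seq_len(M)) {
  r <- y - F_hat                               # pseudo-residuals for squared loss
  stump <- rpart(r ~ x, data = data.frame(x = x, r = r),
                 control = rpart.control(maxdepth = 1))
  F_hat <- F_hat + nu * predict(stump, data.frame(x = x))   # shrunken update
}
mean((y - F_hat)^2)                            # training MSE after M iterations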
Absolute (L1) loss: \[ L(y,F)=|y-F| \]
Pseudo-residuals are signs of residuals; line search finds median-type updates.
For binary labels \(y\in\{0,1\}\) or \(\{\pm1\}\), logistic loss is common.
Using \(y\in\{0,1\}\): \[ L(y,F)= -y \log p - (1-y) \log(1-p), \quad p = \sigma(F) = \frac{1}{1+e^{-F}}. \]
Pseudo-residuals (for coding \(y\in\{0,1\}\)): \[ r_{im} = y_i - p_{m-1}(x_i). \]
Alternatively with \(y\in\{\pm1\}\), one obtains the familiar form involving \(2y/(1+e^{2yF})\).
GBM allows any differentiable loss (Poisson, quantile, Cox partial likelihood for survival, etc.) by defining the appropriate negative gradient.
Trees used in GBM are typically shallow (small interaction.depth) to ensure the weak-learner property. Typical choices:
- interaction.depth (max depth): 1–5 (1 = stumps)
- n.minobsinnode (minimum samples per leaf)
Shallow trees produce localized, low-bias corrections to the model at each step; deeper trees increase per-iteration complexity and risk overfitting.
Learning rate (\(\nu\)) controls contribution of each tree. Typical values: 0.001–0.1. Smaller values require more trees but often improve generalization.
Number of trees (M or n.trees) must be large enough when \(\nu\) is small. Use early stopping based on a validation set to choose optimal number of trees.
Interaction depth determines interaction order captured per tree. Lower depth → more additive structure; higher depth → captures interactions faster but may overfit.
Trade-off strategy: choose small \(\nu\) (e.g., 0.01) and large \(n.trees\) (e.g., up to several thousand) with early stopping.
Introducing subsampling of the training data at each iteration
(bag.fraction < 1) yields stochastic gradient boosting (Friedman,
2002). Typical bag.fraction values: 0.5–0.8.
Benefits: additional variance reduction through the extra randomization, an implicit regularization effect, and lower per-iteration computational cost.

Algorithmic change: at each iteration, fit the tree to a subsample of the training data drawn without replacement. Subsampling is tuned alongside the tree-complexity controls interaction.depth and n.minobsinnode.

XGBoost and LightGBM add further regularization (lambda, alpha) and algorithmic improvements (histogram-based splits, sparsity-awareness).
Key hyperparameters (caret/gbm naming):
- n.trees (number of trees): 100–5000 (use early stopping)
- shrinkage (learning rate): 0.001–0.1 (commonly 0.01)
- interaction.depth (max depth): 1–10 (commonly 1–5)
- n.minobsinnode: 5–20
- bag.fraction: 0.5–1.0
Tuning strategy: fix a small shrinkage (e.g., 0.01) and tune interaction.depth and n.minobsinnode on a grid; choose n.trees by early stopping; set bag.fraction (0.5–0.8) to add variance reduction.
Your inner grid (example): shrinkage ∈ {0.01, 0.05}, interaction.depth ∈ {1, 3, 5}, n.trees ∈ {200, 500} is a good starter grid; expand with lower shrinkage and larger n.trees when compute allows.
GBM can be framed as greedy function approximation and, under certain conditions, is consistent. Theoretical analysis is complex due to the sequential, data-dependent nature of tree fitting. Recent research studies GBM generalization under assumptions on tree complexity, learning rate, and sample size.
Important intuition:
- Shrinkage and subsampling act as regularizers controlling capacity.
- Shallow base learners favor bias control; many iterations control variance.
# Synthetic regression example using gbm
set.seed(101)
library(gbm)
## Loaded gbm 2.2.2
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
# Synthetic data
n= 1000
p= 8
X= as.data.frame(matrix(runif(n * p, -2, 2), nrow = n))
colnames(X)= paste0("X", 1:p)
Y= with(X, 1.5 * X1 - 2 * X2^2 + sin(2 * X3) + rnorm(n, 0, 0.8))
train_df= data.frame(Y = Y, X)
# Fit gbm (squared error)
set.seed(123)
gbm_model= gbm(
formula = Y ~ .,
data = train_df,
distribution = "gaussian",
n.trees = 1000,
interaction.depth = 3,
shrinkage = 0.01,
n.minobsinnode = 10,
bag.fraction = 0.7,
train.fraction = 0.8,
verbose = FALSE)
# Best iteration by OOB estimation
best_iter= gbm.perf(gbm_model,method= "OOB",plot.it= FALSE)
## OOB generally underestimates the optimal number of iterations although predictive performance is reasonably competitive. Using cv_folds>1 when calling gbm usually results in improved predictive performance.
best_iter
## [1] 745
## attr(,"smoother")
## Call:
## loess(formula = object$oobag.improve ~ x, enp.target = min(max(4,
## length(x)/10), 50))
##
## Number of Observations: 1000
## Equivalent Number of Parameters: 40
## Residual Standard Error: 0.002814
# Partial dependence for X1
pdp_X1= plot(gbm_model, i.var= "X1",n.trees= best_iter,return.grid= TRUE)
head(pdp_X1)
## X1 y
## 1 -1.998512 -5.294761
## 2 -1.958151 -5.294761
## 3 -1.917791 -5.223514
## 4 -1.877431 -5.223514
## 5 -1.837071 -5.185695
## 6 -1.796710 -5.171040
library(caret)
set.seed(202)
grid= expand.grid(
.n.trees = c(200, 500, 1000),
.interaction.depth = c(1, 3, 5),
.shrinkage = c(0.01, 0.05),
.n.minobsinnode = c(5, 10)
)
ctrl= trainControl(method = "cv", number = 5)
caret_gbm= train(Y ~ ., data = train_df,
method = "gbm",
trControl = ctrl,
tuneGrid = grid,
verbose = FALSE)
caret_gbm
## Stochastic Gradient Boosting
##
## 1000 samples
## 8 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 800, 800, 800, 800, 800
## Resampling results across tuning parameters:
##
## shrinkage interaction.depth n.minobsinnode n.trees RMSE Rsquared
## 0.01 1 5 200 2.2037916 0.7847005
## 0.01 1 5 500 1.5433776 0.8435053
## 0.01 1 5 1000 1.1519684 0.8896137
## 0.01 1 10 200 2.2009155 0.7862203
## 0.01 1 10 500 1.5444175 0.8430913
## 0.01 1 10 1000 1.1501194 0.8889046
## 0.01 3 5 200 1.5290604 0.8465255
## 0.01 3 5 500 1.0106880 0.9089685
## 0.01 3 5 1000 0.8903781 0.9221595
## 0.01 3 10 200 1.5348811 0.8462302
## 0.01 3 10 500 1.0077826 0.9100372
## 0.01 3 10 1000 0.8884905 0.9225327
## 0.01 5 5 200 1.3651513 0.8700563
## 0.01 5 5 500 0.9260343 0.9195604
## 0.01 5 5 1000 0.8833141 0.9230565
## 0.01 5 10 200 1.3645595 0.8708424
## 0.01 5 10 500 0.9227677 0.9199001
## 0.01 5 10 1000 0.8771536 0.9240475
## 0.05 1 5 200 1.1493840 0.8888889
## 0.05 1 5 500 0.8890472 0.9228757
## 0.05 1 5 1000 0.8718172 0.9248368
## 0.05 1 10 200 1.1419223 0.8900205
## 0.05 1 10 500 0.8903988 0.9223743
## 0.05 1 10 1000 0.8699625 0.9251037
## 0.05 3 5 200 0.9022204 0.9199051
## 0.05 3 5 500 0.9126036 0.9175669
## 0.05 3 5 1000 0.9312784 0.9142047
## 0.05 3 10 200 0.8912592 0.9217999
## 0.05 3 10 500 0.9012182 0.9195659
## 0.05 3 10 1000 0.9276446 0.9148586
## 0.05 5 5 200 0.9009020 0.9197521
## 0.05 5 5 500 0.9200159 0.9161603
## 0.05 5 5 1000 0.9444168 0.9116945
## 0.05 5 10 200 0.8954884 0.9207513
## 0.05 5 10 500 0.9212543 0.9159846
## 0.05 5 10 1000 0.9446524 0.9117000
## MAE
## 1.7793451
## 1.2354141
## 0.9175543
## 1.7776756
## 1.2366568
## 0.9172518
## 1.2135261
## 0.8022057
## 0.7072477
## 1.2200945
## 0.7988983
## 0.7056066
## 1.0867415
## 0.7356025
## 0.7012343
## 1.0846713
## 0.7316237
## 0.6972771
## 0.9178267
## 0.7020513
## 0.6927187
## 0.9120316
## 0.7091831
## 0.6921418
## 0.7158640
## 0.7239467
## 0.7339373
## 0.7059225
## 0.7111839
## 0.7304012
## 0.7171558
## 0.7290537
## 0.7482224
## 0.7123519
## 0.7298683
## 0.7495009
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 1000, interaction.depth =
## 1, shrinkage = 0.05 and n.minobsinnode = 10.
plot(caret_gbm)
When train.fraction is used in gbm, the model tracks performance on the held-out validation portion, and gbm.perf(..., method = "test") can be used to find the optimal number of trees. In caret, use trainControl with method = "cv" or method = "repeatedcv", include sufficiently large n.trees values in the tuning grid, and let the resampling results effectively perform the early stopping.
When reporting GBM results, state shrinkage, interaction.depth, n.minobsinnode, bag.fraction, and the number of trees (n.trees) actually used.

XGBoost (Chen & Guestrin, 2016) is an optimized implementation of gradient boosting that uses a regularized objective, second-order Taylor expansion, and algorithmic improvements (sparsity-awareness, column subsampling, weighted quantile sketch, parallelized split finding). It is widely used for tabular supervised learning due to its speed and predictive performance.
XGBoost fits an additive model of trees:
\[ \hat{y}_i = \sum_{k=1}^K f_k(x_i), \qquad f_k \in \mathcal{F}, \]
where \(\mathcal{F}\) is the space of regression trees.
XGBoost minimizes the regularized objective:
\[ \mathcal{L}(\phi) = \sum_{i=1}^n l\big(y_i, \hat{y}_i\big) + \sum_{k=1}^K \Omega(f_k), \]
where \(l\) is the training loss (e.g., squared error, logistic loss) and the regularization term for a tree \(f\) with \(T\) leaves is
\[ \Omega(f) = \gamma T + \tfrac{1}{2} \lambda \sum_{j=1}^T w_j^2. \]
Here \(w_j\) are the leaf weights, \(\gamma\) penalizes tree complexity (number of leaves) and \(\lambda\) is an \(L_2\) penalty on leaf weights.
At iteration \(t\), suppose the model so far is \(\hat{y}_i^{(t-1)}\). We add a new tree \(f_t\) and approximate the objective by a second-order Taylor expansion of the loss around \(\hat{y}_i^{(t-1)}\):
\[ l\big(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\big) \approx l(y_i, \hat{y}_i^{(t-1)}) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t(x_i)^2, \]
where
\[ g_i = \left. \frac{\partial l(y_i, \hat{y})}{\partial \hat{y}} \right|_{\hat{y}=\hat{y}_i^{(t-1)}}, \qquad h_i = \left. \frac{\partial^2 l(y_i, \hat{y})}{\partial \hat{y}^2} \right|_{\hat{y}=\hat{y}_i^{(t-1)}}. \]
Plugging this approximation into the objective (and dropping constants independent of \(f_t\)) yields the per-iteration objective to minimize:
\[ \tilde{\mathcal{L}}^{(t)} = \sum_{i=1}^n \big[g_i f_t(x_i) + \tfrac{1}{2} h_i f_t(x_i)^2\big] + \Omega(f_t). \]
A regression tree partitions instances into \(T\) disjoint leaves. For leaf \(j\) with instance set \(I_j\) and constant prediction \(w_j\) on that leaf, the objective contribution from that leaf is
\[ \sum_{i\in I_j} \big[g_i w_j + \tfrac{1}{2} h_i w_j^2\big] + \tfrac{1}{2}\lambda w_j^2 + \gamma. \]
Solving for the optimal \(w_j\) by setting derivative to zero gives
\[ w_j^* = -\frac{\sum_{i\in I_j} g_i}{\sum_{i\in I_j} h_i + \lambda}. \]
The resulting optimal objective reduction (score) for leaf \(j\) is
\[ \text{Score}(I_j) = -\tfrac{1}{2} \frac{(\sum_{i\in I_j} g_i)^2}{\sum_{i\in I_j} h_i + \lambda} + \gamma. \]
For a candidate split that divides parent set \(I\) into left/right \(I_L, I_R\), the gain is
\[ \text{Gain} = \tfrac{1}{2} \left( \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{G^2}{H + \lambda} \right) - \gamma, \]
where \(G = \sum_{i\in I} g_i\), \(H = \sum_{i\in I} h_i\), and similarly for \(G_L,G_R,H_L,H_R\). Splits with Gain > 0 (and larger Gain preferred) are chosen.
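A small numeric illustration with made-up gradients and Hessians evaluates the optimal leaf weights and the split gain by hand:
# Toy evaluation of w_j* and Gain (assumed values of g_i, h_i)
g <- c(-1.2, 0.8, 0.5, -0.3); h <- rep(1, 4)
lambda <- 1; gamma_pen <- 0
left <- 1:2; right <- 3:4
GL <- sum(g[left]);  HL <- sum(h[left])
GR <- sum(g[right]); HR <- sum(h[right])
G  <- sum(g);        H  <- sum(h)
w_left  <- -GL / (HL + lambda)                 # optimal left-leaf weight
w_right <- -GR / (HR + lambda)                 # optimal right-leaf weight
gain <- 0.5 * (GL^2/(HL + lambda) + GR^2/(HR + lambda) - G^2/(H + lambda)) - gamma_pen
c(w_left = w_left, w_right = w_right, gain = gain)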
Column subsampling (colsample_bytree, colsample_bylevel) reduces overfitting and speeds up learning, and the regularization terms lambda (L2), alpha (L1), and gamma (minimum split loss) control model complexity.

For common losses, the gradients and Hessians are as follows.

Squared error (regression): \(l(y,\hat{y}) = \tfrac{1}{2}(y-\hat{y})^2\)
\[ g_i = \hat{y}_i - y_i, \qquad h_i = 1. \]
Logistic (binary classification, y in {0,1}): with log-loss
\[ p_i = \sigma(\hat{y}_i) = \frac{1}{1+e^{-\hat{y}_i}}, \] \[ g_i = p_i - y_i, \qquad h_i = p_i(1-p_i). \]
(For y coded as {±1} different conventions apply; XGBoost expects y in {0,1} or real values depending on objective.)
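A quick check of these formulas at a single toy prediction (y coded in {0,1}):
# Logistic-loss gradient and Hessian at one point
yhat <- 0.4; y <- 1
p <- 1 / (1 + exp(-yhat))
c(g = p - y, h = p * (1 - p))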
Key hyperparameters and practical recommendations:
- eta (learning rate): 0.01–0.3. Smaller values require more trees. Typical: 0.01, 0.05, 0.1.
- nrounds / n_estimators: 100–5000; use early stopping on a validation set.
- max_depth: 3–10 (controls tree complexity).
- min_child_weight: 1–10 (minimum sum of Hessian in a leaf; prevents overfitting).
- subsample: 0.5–1.0 (row subsampling).
- colsample_bytree: 0.3–1.0 (column subsampling).
- lambda, alpha: L2/L1 regularization on leaf weights.
- gamma: minimum loss reduction required to make a split.
Tuning strategy: 1. Tune max_depth and min_child_weight first (tree complexity). 2. Tune subsample and colsample_bytree next. 3. Tune the regularization parameters lambda/alpha and gamma. 4. Use a small eta (0.01–0.1) and a large nrounds; select nrounds by early stopping.
# Example: regression using xgboost
library(xgboost)
set.seed(123)
# synthetic data
n= 1000; p= 8
X= matrix(runif(n*p, -2, 2), nrow = n)
y= 1.5*X[,1]-2*X[,2]^2+sin(2*X[,3])+rnorm(n,0,0.8)
# split
tr_idx= sample(seq_len(n), size = 0.8*n)
train_x= X[tr_idx, ]; train_y= y[tr_idx]
valid_x= X[-tr_idx, ]; valid_y= y[-tr_idx]
dtrain= xgb.DMatrix(data= train_x,label= train_y)
dvalid= xgb.DMatrix(data= valid_x, label = valid_y)
watchlist= list(train=dtrain,valid=dvalid)
params= list(objective = 'reg:squarederror',
eta = 0.05,
max_depth = 5,
subsample = 0.8,
colsample_bytree = 0.8,
lambda = 1,
alpha = 0)
xgb_mod= xgb.train(params = params,
data = dtrain,
nrounds = 2000,
watchlist = watchlist,
early_stopping_rounds = 50,
print_every_n = 50)
## [1] train-rmse:4.213583 valid-rmse:4.202594
## Multiple eval metrics are present. Will use valid_rmse for early stopping.
## Will train until valid_rmse hasn't improved in 50 rounds.
##
## [51] train-rmse:0.903106 valid-rmse:1.337959
## [101] train-rmse:0.523974 valid-rmse:1.086337
## [151] train-rmse:0.411107 valid-rmse:1.048792
## [201] train-rmse:0.337788 valid-rmse:1.041219
## [251] train-rmse:0.284987 valid-rmse:1.038966
## [301] train-rmse:0.241613 valid-rmse:1.037616
## Stopping. Best iteration:
## [294] train-rmse:0.247224 valid-rmse:1.036510
# variable importance
imp= xgb.importance(model = xgb_mod)
print(head(imp))
## Feature Gain Cover Frequency
## <char> <num> <num> <num>
## 1: f1 0.55512803 0.21116600 0.18110535
## 2: f0 0.30282739 0.17333012 0.17882889
## 3: f2 0.06072744 0.14247355 0.14202605
## 4: f4 0.02525411 0.08501805 0.10054382
## 5: f5 0.01561302 0.11812182 0.11268496
## 6: f3 0.01426617 0.09036454 0.09750854
# partial dependence via pdp package or xgboost::xgb.plot.importance
GPBoost integrates gradient-boosted decision trees with Gaussian Processes (GPs) or mixed-effects models. It is designed for data with hierarchical structure (clusters, repeated measures) or spatial/temporal correlation where capturing both complex fixed effects and structured random effects is important.
Model conceptually decomposes the signal into a nonparametric part learned by boosting and a correlated random effect modeled by GP or random intercepts/slopes.
A general GPBoost model for continuous outcomes can be written as:
\[ \mathbf{y} = f(\mathbf{X}) + \mathbf{Zb} + \boldsymbol{\epsilon}, \]
where:
\(f(\mathbf{X})\) is the boosting predictor (sum of trees),
\(\mathbf{Zb}\) represents random effects (with \(\mathbf{b} \sim \mathcal{N}(0, \Sigma(\theta))\)),
\(\boldsymbol{\epsilon} \sim \mathcal{N}(0, \sigma^2 I)\) is observation noise.
A Gaussian Process is a special case where \(\mathbf{Z} = I\) and \(\Sigma(\theta)\) is a covariance kernel defined by hyperparameters \(\theta\) (e.g., Matérn, RBF).
Conditional on \(f\) the marginal distribution of \(\mathbf{y}\) is
\[ \mathbf{y} \sim \mathcal{N}\big(f(\mathbf{X}),\; \Sigma(\theta) + \sigma^2 I\big). \]
The (negative) log marginal likelihood is therefore:
\[ \ell(\theta; f) = \tfrac{1}{2}(\mathbf{y} - f)^\top K^{-1}(\theta) (\mathbf{y} - f) + \tfrac{1}{2} \log |K(\theta)| + \tfrac{n}{2}\log(2\pi), \]
with \(K(\theta) = \Sigma(\theta) + \sigma^2 I\).
GPBoost maximizes the marginal likelihood w.r.t. covariance parameters \(\theta\) (or uses REML) while estimating \(f\) via boosting.
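The Gaussian log marginal likelihood itself is easy to evaluate for a given residual vector and covariance matrix; the sketch below (an illustration of the formula above, not gpboost internals) uses a Cholesky factorization, with an exponential kernel chosen purely as an example:
# Negative log marginal likelihood for residuals r with covariance K
neg_log_marglik <- function(r, K) {
  n <- length(r)
  R <- chol(K)                                   # K = t(R) %*% R
  alpha <- backsolve(R, forwardsolve(t(R), r))   # alpha = K^{-1} r
  0.5 * sum(r * alpha) + sum(log(diag(R))) + 0.5 * n * log(2 * pi)
}
set.seed(1)
s <- seq(0, 1, length.out = 50)                        # toy 1-D locations
K <- exp(-as.matrix(dist(s)) / 0.2) + diag(0.1, 50)    # exponential kernel + nugget
r <- as.vector(t(chol(K)) %*% rnorm(50))               # a draw with covariance K
neg_log_marglik(r, K)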
A practical training algorithm alternates between two steps: (1) holding the covariance parameters \(\theta\) fixed, update \(f\) with one or more boosting steps (trees fit to the appropriate gradients); (2) holding \(f\) fixed, update \(\theta\) by maximizing the (restricted) marginal likelihood.
Repeat until convergence. This alternation resembles an EM or block-coordinate optimization.
Common covariance structures in GPBoost include grouped (clustered) random intercepts and slopes, spatial Gaussian Processes with Matérn or exponential kernels, and temporal or spatio-temporal kernels. The choice depends on domain knowledge (spatial coordinates, repeated measures, cluster IDs).
Given fitted \(f\) and estimated \(\theta\), predictive distribution at new input \(x_\ast\) with possible new grouping structure is obtained by standard GP conditioning formulas, combined with predictions from boosting for fixed effects. This yields predictive means and variances that account for both tree-based uncertainty (approximate) and GP/random-effect uncertainty (exact within Gaussian assumptions).
GPBoost implementations provide functions for obtaining best linear unbiased predictors (BLUPs) for random effects and predictive intervals.
Boosting-related: learning_rate, max_depth, nrounds, subsample, colsample_bytree — tune similarly to XGBoost/GBM.
GP-related: kernel type, length-scale(s) \(\rho\), process variance \(\sigma^2\), nugget/noise \(\sigma^2_\epsilon\), group-level variance \(\tau^2\). These are estimated by maximizing marginal likelihood (REML) but may require sensible starting values and bounds.
Computational note: covariance matrix operations scale poorly with large n (\(O(n^3)\)). GPBoost uses structure (grouped random effects, sparse approximations) to scale to larger datasets, but caution is needed for very large n.
Note: The gpboost R package is available on CRAN/GitHub (installation may require dependencies). Below is a minimal example; refer to the package docs for full functionality.
# Example: Gaussian mixed model + boosting (illustrative)
# install.packages("gpboost") # if not installed
library(gpboost)
## Loading required package: R6
## Warning: package 'R6' was built under R version 4.2.3
##
## Attaching package: 'gpboost'
## The following objects are masked from 'package:xgboost':
##
## getinfo, setinfo, slice
set.seed(123)
# ------------------------------------
# 1. Simulate grouped data
# ------------------------------------
n_groups= 50
obs_per_group= 20
n= n_groups * obs_per_group
group= rep(1:n_groups, each = obs_per_group)
X1= runif(n, -2, 2)
X2= runif(n, -1, 1)
# true underlying function
f_true= 2*X1 + sin(2*X2)
# random intercepts for each group
tau= 1
b_group= rnorm(n_groups, 0, tau)
# response
y= f_true + b_group[group] + rnorm(n, 0, 0.5)
# feature matrix
X= cbind(X1, X2)
# ------------------------------------
# 2. Define GPBoost random effect model
# ------------------------------------
gp_model= GPModel(group_data = group,group_rand_coef_data = NULL,likelihood = "gaussian")
# ------------------------------------
# 3. Convert predictors to GPBoost Dataset
# ------------------------------------
dtrain= gpb.Dataset(data = X,label = y)
# ------------------------------------
# 4. Train the GPBoost model
# ------------------------------------
fit = gpb.train(data = dtrain,
gp_model = gp_model,
nrounds = 200,
learning_rate = 0.05,
max_depth = 4,
min_data_in_leaf = 10,objective = "regression_l2",verbose = 0)
# ------------------------------------
# 5. Prediction (fixed + random)
# ------------------------------------
pred= predict(
fit,
data = X,
group_data_pred = group
)$response_mean
head(pred)
## [1] -3.158746 1.689104 -2.543622 2.828992 3.659621 -4.254217
Artificial Neural Networks (ANNs) are computational models inspired by the structure and learning process of the human brain. They are especially powerful for capturing non-linear relationships and complex interactions that traditional statistical models may not detect.
An ANN consists of interconnected units called neurons, arranged in layers. Each neuron:
- receives input signals,
- multiplies the inputs by weights and adds a bias,
- passes the result through a non-linear activation function, and
- produces an output for the next layer.
For input vector \(x = (x_1, \ldots, x_p)\):
\[ z = \sum_{j=1}^p w_j x_j + b \]
\[ \hat{y} = f(z) \]
The input layer represents the predictor variables; no computation happens here.
Activation functions introduce non-linearity.
| Activation | Formula | Use Case |
|---|---|---|
| ReLU | \(\max(0,z)\) | Deep networks, fast training |
| Sigmoid | \(\frac{1}{1+e^{-z}}\) | Binary classification |
| tanh | \(\tanh(z)\) | Symmetric output |
| Linear | \(z\) | Regression |
Forward propagation computes outputs layer by layer:
\[ a^{(l)} = f(W^{(l)} a^{(l-1)} + b^{(l)}) \]
Ends at the output layer.
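A minimal sketch of this forward pass for one hidden layer (assumed sizes, ReLU hidden units, linear output for regression):
# Forward pass: a1 = f(W1 x + b1), y_hat = W2 a1 + b2
set.seed(1)
relu <- function(z) pmax(0, z)
p <- 3; H <- 4
x  <- rnorm(p)                              # one input vector
W1 <- matrix(rnorm(H * p), H, p); b1 <- rnorm(H)
W2 <- matrix(rnorm(H), 1, H);     b2 <- rnorm(1)
a1 <- relu(W1 %*% x + b1)                   # hidden-layer activations
y_hat <- as.numeric(W2 %*% a1 + b2)         # linear output
y_hat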
Backpropagation updates the weights by computing gradients of the loss using the chain rule.
Weight update:
\[ w_{new} = w_{old} - \eta \frac{\partial L}{\partial w} \]
where \(\eta\) is the learning rate.
Common optimizers:
- SGD
- Adam (most widely used)
- RMSProp
Important tuning parameters:
- number of hidden layers
- neurons per layer
- activation functions
- learning rate
- batch size
- epochs
- dropout rate
- L1/L2 regularization strength
| Model | Strengths | Weaknesses |
|---|---|---|
| ANN | Learns complex patterns | Needs large data |
| SVM | Good in high dimensions | Hard to scale |
| Random Forest | Stable, interpretable | Less flexible for complex boundaries |
library(nnet)
set.seed(123)
# Example regression data
n= 200
x1= runif(n, -2, 2)
x2= runif(n, -1, 1)
y= 3*x1-2*x2+sin(2*x1)+rnorm(n, 0, 0.3)
df= data.frame(x1, x2, y)
# Scale features
X_scaled= scale(df[, c("x1","x2")])
df_scaled= data.frame(X_scaled, y = df$y)
# Fit ANN with 5 hidden units
ann_fit= nnet(y~x1+x2,data = df_scaled,
size = 5, # hidden neurons
decay = 0.01, # weight decay
linout = TRUE, # regression
maxit = 500,
trace = FALSE)
# Predictions
pred= predict(ann_fit, df_scaled)
# Performance
mse= mean((pred - df_scaled$y)^2)
mse
## [1] 0.08343786
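A hedged follow-up sketch: the two tuning parameters caret exposes for nnet are size and decay, so a simple cross-validated grid search over them (reusing df_scaled from above) could look like this; linout, trace, and maxit are passed through to nnet.
# Tune size and decay for nnet via 5-fold CV with caret (illustrative)
library(caret)
set.seed(123)
grid <- expand.grid(size = c(3, 5, 8), decay = c(0.001, 0.01, 0.1))
ann_cv <- train(y ~ x1 + x2, data = df_scaled, method = "nnet",
                linout = TRUE, trace = FALSE, maxit = 500,
                tuneGrid = grid,
                trControl = trainControl(method = "cv", number = 5))
ann_cv$bestTune                             # selected size and decay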