Given training data \((x_i, y_i)\), \(i = 1,\dots,n\), with labels \(y_i \in \{-1,+1\}\), we seek the hyperplane \(f(x) = w^T x + b\) that separates the two (linearly separable) classes with maximum margin.
The primal form is:
\[ \min_{w,b} \frac{1}{2} \|w\|^2 \] subject to \[ y_i (w^T x_i + b) \ge 1, \quad \forall i. \]
We form the Lagrangian with multipliers \(\alpha_i \ge 0\):
\[ \mathcal{L}(w, b, \alpha) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^n \alpha_i [y_i (w^T x_i + b) - 1]. \]
Take partial derivatives:
Derivative w.r.t. \(w\): \[\frac{\partial \mathcal{L}}{\partial w} = w - \sum_i \alpha_i y_i x_i = 0 \Rightarrow w = \sum_i \alpha_i y_i x_i.\]
Derivative w.r.t. \(b\): \[\frac{\partial \mathcal{L}}{\partial b} = -\sum_i \alpha_i y_i = 0.\]
Substituting these conditions back into the Lagrangian yields the dual problem:
\[ \max_\alpha \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i=1}^n\sum_{j=1}^n \alpha_i \alpha_j y_i y_j (x_i^T x_j) \] subject to \[ \sum_i \alpha_i y_i = 0, \quad \alpha_i \ge 0. \]
This is a Quadratic Programming (QP) problem.
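To make the dual concrete, here is a minimal sketch (an illustration, not how e1071 solves it) that feeds the hard-margin dual to solve.QP() from the quadprog package, assuming quadprog is installed; a small ridge is added to the quadratic term so it is numerically positive definite.
# Solve the hard-margin dual with quadprog (illustrative sketch)
library(quadprog)
X <- matrix(c(1,2, 2,3, 3,3, 2,1), ncol = 2, byrow = TRUE)
y <- c(1, 1, -1, -1)
n <- nrow(X)
K <- X %*% t(X)                               # linear kernel (Gram matrix)
Dmat <- (y %*% t(y)) * K + diag(1e-6, n)      # y_i y_j x_i^T x_j, ridged for PD
dvec <- rep(1, n)                             # max sum(alpha) - 1/2 alpha' D alpha
Amat <- cbind(y, diag(n))                     # sum_i alpha_i y_i = 0 ; alpha_i >= 0
bvec <- rep(0, n + 1)
sol   <- solve.QP(Dmat, dvec, Amat, bvec, meq = 1)
alpha <- sol$solution
w  <- colSums(alpha * y * X)                  # w = sum_i alpha_i y_i x_i
sv <- which(alpha > 1e-6)                     # support vectors
b  <- mean(y[sv] - X[sv, , drop = FALSE] %*% w)
c(w = w, b = b)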
For non-separable data, introduce slack variables \(\xi_i\).
\[ \min_{w,b,\xi} \frac{1}{2} \|w\|^2 + C \sum_{i=1}^n \xi_i \] subject to \[ y_i (w^T x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0. \]
Lagrangian:
\[ \mathcal{L}(w,b,\xi,\alpha,\mu) = \frac{1}{2}\|w\|^2 + C \sum_i \xi_i - \sum_i \alpha_i [y_i(w^T x_i + b) - 1 + \xi_i] - \sum_i \mu_i \xi_i \] with \(\alpha_i, \mu_i \ge 0\).
Derivative wrt \(w\): \[w=\sum_{i=1}^n \alpha_iy_ix_i\]
Derivative wrt \(b\): \[\sum_i\alpha_iy_i=0\]
Derivative wrt \(\xi_i\): \[C-\alpha_i-\mu_i=0\implies 0\leq\alpha_i\leq C\]
Substituting gradients yields the dual:
\[ \max_\alpha \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i^T x_j) \] subject to \[ \sum_i \alpha_i y_i = 0, \] \[ 0 \le \alpha_i \le C. \]
The only difference from the hard-margin dual is the box constraint \(0 \le \alpha_i \le C\) on the multipliers; the objective is unchanged.
Replace dot products with a kernel function:
\[K(x_i, x_j) = \phi(x_i)^T \phi(x_j).\]
Dual becomes:
\[ \max_\alpha \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \] subject to \[ \sum_i \alpha_i y_i = 0, \] \[ 0 \le \alpha_i \le C. \]
# Example dataset
x= matrix(c(1,2, 2,3, 3,3, 2,1), ncol=2, byrow=TRUE)
y= c(1,1,-1,-1)
library(e1071)
## Warning: package 'e1071' was built under R version 4.2.3
model= svm(x, y, type="C-classification", kernel="linear", cost=1, scale=FALSE)
model
##
## Call:
## svm.default(x = x, y = y, scale = FALSE, type = "C-classification",
## kernel = "linear", cost = 1)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
##
## Number of Support Vectors: 4
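The same e1071 interface supports nonlinear kernels through the kernel trick; a hedged variant of the fit above with the RBF kernel (the gamma argument is the kernel width) would be:
# RBF-kernel SVM on the same toy data (illustrative)
model_rbf= svm(x, y, type="C-classification", kernel="radial", gamma=0.5, cost=1, scale=FALSE)
table(fitted = predict(model_rbf, x), truth = y)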
Random Forest (RF) is an ensemble method introduced by Breiman (2001) that builds a collection of decision trees trained on bootstrap samples and with randomized feature selection at each split. The ensemble prediction is an average (regression) or majority vote (classification). The two key mechanisms that make RF effective are:
- bootstrap aggregation (bagging), which averages trees grown on resampled datasets and reduces variance; and
- random selection of candidate predictors (mtry) at each split, which reduces correlation between trees and therefore further reduces ensemble variance.
RF is nonparametric, handles high-dimensional inputs, and requires relatively little preprocessing.
A regression tree partitions the predictor space into disjoint regions \(R_1,\dots,R_M\) and predicts the mean of the response within each region. For squared error loss, a single split is chosen to minimize within-node sum of squares. Formally, for a split on variable \(j\) at value \(s\):
\[ \min_{j,s} \Bigg\{ \sum_{x_i \in R_1(j,s)} (y_i - \bar{y}_{R_1})^2 + \sum_{x_i \in R_2(j,s)} (y_i - \bar{y}_{R_2})^2 \Bigg\}, \]
where \(R_1(j,s) = \{x: x_j \le s\}\) and \(R_2(j,s) = \{x: x_j > s\}\).
Trees are grown recursively and typically grown deep in RF (no pruning) to keep bias low.
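To make the split criterion concrete, the following minimal sketch (illustrative, not the randomForest internals) searches all candidate cut points on a single predictor and returns the one minimizing the within-node sum of squares:
# Exhaustive best-split search on one predictor (squared error criterion)
best_split <- function(x, y) {
  cuts <- sort(unique(x))
  cuts <- (head(cuts, -1) + tail(cuts, -1)) / 2          # candidate values of s
  sse  <- sapply(cuts, function(s) {
    left <- y[x <= s]; right <- y[x > s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  list(s = cuts[which.min(sse)], sse = min(sse))
}
set.seed(1)
x <- runif(100)
y <- ifelse(x > 0.5, 2, 0) + rnorm(100, 0, 0.3)
best_split(x, y)                                          # chosen cut should be near 0.5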
Bagging aims to reduce variance by averaging many unstable estimators. Suppose we have an estimator \(\hat{f}(x; \mathcal{D})\) trained on dataset \(\mathcal{D}\). Draw \(B\) bootstrap datasets \(\mathcal{D}^{(b)}\) and form trees \(T^{(b)}(x) = \hat{f}(x; \mathcal{D}^{(b)})\). The bagged predictor is
\[ \hat{f}_{\text{bag}}(x) = \frac{1}{B} \sum_{b=1}^B T^{(b)}(x). \]
If \(\mathrm{Var}[T(x)] = \sigma^2\) and pairwise correlation between trees is \(\rho\), the variance of the bagged predictor is approximately:
\[ \mathrm{Var}[\hat{f}_{\text{bag}}(x)] = \rho \sigma^2 + \frac{1-\rho}{B} \sigma^2. \]
As \(B\to\infty\), variance tends to \(\rho \sigma^2\). Thus, reducing the correlation \(\rho\) among trees is crucial; Random Forest uses random feature selection to do this.
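A quick numeric check of this expression with assumed values of \(\sigma^2\) and \(\rho\) shows how the variance approaches the floor \(\rho\sigma^2\) as \(B\) grows:
# Variance of the bagged predictor for assumed sigma^2 = 4, rho = 0.3
sigma2 <- 4; rho <- 0.3
B <- c(1, 10, 100, 1000)
rho * sigma2 + (1 - rho) / B * sigma2    # tends to rho * sigma2 = 1.2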
Given training data \(\mathcal{D} = \{(x_i,y_i)\}_{i=1}^n\), predictors \(p\), and hyperparameters \(B\) (number of trees) and \(m\) (mtry):
For \(b = 1,\dots,B\): draw a bootstrap sample \(\mathcal{D}^{(b)}\) from \(\mathcal{D}\) and grow a tree \(T^{(b)}\) on it, at each split selecting the best split among \(m\) randomly chosen predictors (no pruning). For regression, the forest prediction is the average over trees:
\[ \hat{f}_{RF}(x) = \frac{1}{B} \sum_{b=1}^B T^{(b)}(x). \]
For classification, the ensemble predicts the class with the largest number of votes across trees.
Practical defaults: for regression \(m \approx p/3\), for classification \(m \approx \sqrt{p}\).
Each bootstrap sample excludes on average about \(e^{-1} \approx 0.368\) fraction of the observations. For an observation \(i\), collect the subset of trees for which \(i\) was OOB. The OOB prediction for \(x_i\) is the average (or majority vote) across those trees. The OOB error is:
\[ \mathrm{Err}_{OOB} = \frac{1}{n} \sum_{i=1}^n L\big(y_i, \hat{f}_{OOB}(x_i)\big), \]
where \(L\) is an appropriate loss (squared error or 0–1 loss). OOB error is a nearly unbiased estimator of test error and is commonly used for model selection and quick diagnostics.
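As a small self-contained check (a toy example separate from the analysis below), the randomForest package stores each observation's out-of-bag prediction in the fitted object's predicted component and the running OOB MSE in mse, so the OOB error can be recomputed by hand:
# Toy illustration of OOB error with randomForest
library(randomForest)
set.seed(1)
d <- data.frame(x = runif(200))
d$y <- sin(4 * d$x) + rnorm(200, 0, 0.2)
rf0 <- randomForest(y ~ x, data = d, ntree = 300)
mean((d$y - rf0$predicted)^2)   # manual OOB MSE from OOB predictions
tail(rf0$mse, 1)                # OOB MSE reported by the package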
Two widely used importance metrics are:
Mean Decrease in Impurity (MDI): For each split in each tree, record the impurity reduction (e.g., reduction in MSE for regression or Gini impurity for classification). Sum these decreases for each variable across all trees and normalize.
Permutation Importance (Mean Decrease Accuracy, MDA): For each variable \(X_j\), permute its values in the out-of-bag samples, recompute the OOB prediction error, and record the increase over the error obtained with \(X_j\) intact; average this increase across trees (optionally scaled by its standard deviation).
Permutation importance reflects effect on predictive accuracy; MDI can be biased toward variables with many categories or continuous variables.
Consider the decomposition of expected squared error at a point \(x\):
\[ \mathbb{E}\big[(Y - \hat{f}_{RF}(x))^2\big] = \underbrace{\sigma^2_{\epsilon}}_{\text{irreducible}} + \underbrace{\big(\mathbb{E}[\hat{f}_{RF}(x)] - f(x)\big)^2}_{\text{bias}^2} + \underbrace{\mathrm{Var}[\hat{f}_{RF}(x)]}_{\text{variance}}. \]
In RF we intentionally grow deep trees (low bias) and reduce variance by averaging. The variance term includes the tree–tree correlation \(\rho\); lowering \(m\) (mtry) tends to reduce \(\rho\) but may increase bias slightly. Thus \(m\) controls the bias–variance trade-off.
Common hyperparameters:
- ntree (B): number of trees. Larger B reduces variance; typical values: 200–2000.
- mtry (m): number of variables sampled at each split. Controls correlation; typical defaults: \(p/3\) (regression), \(\sqrt{p}\) (classification).
- nodesize: minimum observations in terminal nodes (affects smoothness).
- maxnodes / maxdepth: restrict tree size if desired.
- sample_size / replace: bootstrap sample size and whether sampling is with replacement.
Tuning guidance: mtry is usually the most important. Use OOB error or inner CV to tune mtry and nodesize, and keep ntree large enough to stabilize errors.
Under some regularity conditions (e.g., trees grown with appropriate splitting rules, terminal node sizes shrinking properly), Breiman-style forests and certain simplified variants have been shown to be consistent: the RF predictor converges to the true regression function as \(n\to\infty\). More recent theoretical work (e.g., Biau, Scornet, Wager) studies rates of convergence and conditions under which forests adapt to sparsity or low-dimensional structure.
A key theoretical insight is that the randomness (bootstrap + feature subsampling) stabilizes the high-variance nature of trees while preserving their low bias for complex functions.
# Example: synthetic regression dataset and Random Forest analysis
set.seed(42)
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
# synthetic data
n= 500
p= 10
X= matrix(runif(n * p, -1, 1), nrow = n)
dimnames(X)= list(NULL, paste0("X", 1:p))
# true function: nonlinear in first three variables
Y= with(as.data.frame(X), 3*X1 - 2 * X2^2 + sin(3 * X3) + rnorm(n, 0, 0.5))
df= data.frame(Y = Y, X)
# Fit RF
rf= randomForest(Y ~ ., data = df, ntree = 500, mtry = floor(p/3), importance = TRUE)
print(rf)
##
## Call:
## randomForest(formula = Y ~ ., data = df, ntree = 500, mtry = floor(p/3), importance = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 0.6608643
## % Var explained: 85.07
# OOB error plot
plot(rf)
# Variable importance (permutation)
importance(rf)
## %IncMSE IncNodePurity
## X1 109.3714244 1330.22300
## X2 22.0475191 174.26256
## X3 43.6675479 252.22440
## X4 1.5293574 52.20063
## X5 0.2096404 67.99561
## X6 0.9692954 62.69983
## X7 -0.7914752 57.43699
## X8 -4.3104377 52.38036
## X9 -0.9216762 60.16870
## X10 2.7006062 71.46540
varImpPlot(rf, type= 1) # 1 = permutation importance (mean decrease in accuracy)
Tuning mtry with caret (nested CV friendly setup):
library(caret)
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.2.3
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
## Loading required package: lattice
set.seed(123)
ctrl= trainControl(method= "cv", number= 5, search= "grid")
# grid over mtry
grid= expand.grid(.mtry = c(2, 3, 4, 6, 8))
caret_rf= train(Y ~ ., data = df, method = "rf", trControl = ctrl, tuneGrid = grid, ntree = 500)
caret_rf
## Random Forest
##
## 500 samples
## 10 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 400, 400, 400, 400, 400
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 0.9560004 0.8750425 0.7768376
## 3 0.8204965 0.8839073 0.6653442
## 4 0.7688577 0.8842186 0.6216426
## 6 0.7435022 0.8825609 0.5964405
## 8 0.7532739 0.8779009 0.6012917
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 6.
plot(caret_rf)
library(pdp)
# PDP for X1
pdp_X1= partial(rf, pred.var= "X1", train= df)
plotPartial(pdp_X1)
# 2-D PDP for X1 and X3
pdp_X1X3= partial(rf, pred.var = c("X1","X3"), chull = TRUE, train = df)
plotPartial(pdp_X1X3, levelplot = FALSE)
When reporting Random Forest results in a paper or presentation, state the settings used: ntree, mtry, nodesize, and any sampling options.

Gradient Boosting Machines (GBM) are a class of powerful ensemble methods that build a strong predictive model by sequentially fitting many weak learners (usually shallow regression trees) to the negative gradient (pseudo-residuals) of a chosen loss function. The method is due to Friedman (2001) and can be viewed as steepest-descent (functional gradient) optimization in function space.
Key ingredients: a differentiable loss function, a weak base learner (typically a shallow regression tree), an additive update with a shrinkage (learning-rate) factor, and, optionally, subsampling of the training data at each iteration.
GBM is flexible (custom loss functions possible) and often yields excellent predictive performance on structured/tabular data.
Let \(F(x)\) denote the ensemble model. We aim to minimize empirical risk:
\[ \hat{R}(F) = \sum_{i=1}^n L\big(y_i, F(x_i)\big). \]
The functional gradient descent algorithm constructs an additive model:
\[ F_M(x) = F_0(x) + \sum_{m=1}^M \nu \gamma_m h_m(x), \]
where:
\(F_0(x)\) is an initial constant (e.g., for squared loss, the sample mean),
\(h_m(x)\) is the base learner at iteration \(m\),
\(\nu\) is the learning rate (shrinkage), and
\(\gamma_m\) is a step-length determined by line search.
At iteration \(m\):
Compute pseudo-residuals (negative gradients): \[ r_{im} = - \left[ \frac{\partial L\big(y_i, F(x_i)\big)}{\partial F(x_i)} \right]_{F = F_{m-1}}. \]
Fit the base learner to the pseudo-residuals: find \(h_m(x)\) such that \(h_m(x_i) \approx r_{im}\).
Compute multiplier (line search): \[ \gamma_m = \arg\min_{\gamma} \sum_{i=1}^n L\big(y_i, F_{m-1}(x_i) + \gamma h_m(x_i)\big). \]
Update model: \[ F_m(x) = F_{m-1}(x) + \nu \gamma_m h_m(x). \]
This is equivalent to performing gradient descent in the space of functions spanned by the base learners.
Squared error loss: \[ L(y,F)=\tfrac{1}{2}(y-F)^2 \]
Pseudo-residuals: \[ r_{im} = y_i - F_{m-1}(x_i). \]
Line search gives \(\gamma_m=1\) when trees are fit to residuals directly (if tree predictions are the fitted residuals).
Thus, the update simplifies to:
\[F_m(x)= F_{m-1}(x)+ \nu h_m(x)\]
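The squared-error case is easy to code from scratch; the sketch below (illustrative only, using rpart stumps as the base learners rather than gbm's trees) implements exactly the update \(F_m = F_{m-1} + \nu h_m\):
# From-scratch gradient boosting with squared error loss (illustrative)
library(rpart)
set.seed(1)
n <- 300
x <- runif(n, -2, 2)
y <- sin(2 * x) + rnorm(n, 0, 0.3)
nu <- 0.1; M <- 200
F_hat <- rep(mean(y), n)                       # F_0: constant initial fit
for (m in seq_len(M)) {
  r <- y - F_hat                               # pseudo-residuals for squared loss
  stump <- rpart(r ~ x, data = data.frame(x = x, r = r),
                 control = rpart.control(maxdepth = 1))
  F_hat <- F_hat + nu * predict(stump, data.frame(x = x))   # shrunken update
}
mean((y - F_hat)^2)                            # training MSE after M iterations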
Absolute (L1) loss: \[ L(y,F)=|y-F| \]
Pseudo-residuals are signs of residuals; line search finds median-type updates.
For binary labels \(y\in\{0,1\}\) or \(\{\pm1\}\), logistic loss is common.
Using \(y\in\{0,1\}\): \[ L(y,F)= -y \log p - (1-y) \log(1-p), \quad p = \sigma(F) = \frac{1}{1+e^{-F}}. \]
Pseudo-residuals (for coding \(y\in\{0,1\}\)): \[ r_{im} = y_i - p_{m-1}(x_i). \]
Alternatively with \(y\in\{\pm1\}\), one obtains the familiar form involving \(2y/(1+e^{2yF})\).
GBM allows any differentiable loss (Poisson, quantile, Cox partial likelihood for survival, etc.) by defining the appropriate negative gradient.
Trees used in GBM are typically shallow (small interaction.depth) to ensure the weak-learner property. Typical choices:
- interaction.depth (max depth): 1–5 (1 = stumps)
- n.minobsinnode (minimum samples per leaf)
Shallow trees produce localized, low-bias corrections to the model at each step; deeper trees increase per-iteration complexity and risk overfitting.
Learning rate (\(\nu\)) controls contribution of each tree. Typical values: 0.001–0.1. Smaller values require more trees but often improve generalization.
Number of trees (M or n.trees) must be large enough when \(\nu\) is small. Use early stopping based on a validation set to choose optimal number of trees.
Interaction depth determines interaction order captured per tree. Lower depth → more additive structure; higher depth → captures interactions faster but may overfit.
Trade-off strategy: choose small \(\nu\) (e.g., 0.01) and large \(n.trees\) (e.g., up to several thousand) with early stopping.
Introducing subsampling of the training data at each iteration
(bag.fraction < 1) yields stochastic gradient boosting (Friedman,
2002). Typical bag.fraction values: 0.5–0.8.
Benefits: additional variance reduction through the extra randomization, an implicit regularization effect, and lower per-iteration computational cost.

Algorithmic change: at each iteration, fit the tree to a subsample of the training data drawn without replacement. Subsampling is tuned alongside the tree-complexity controls interaction.depth and n.minobsinnode.

XGBoost and LightGBM add further regularization (lambda, alpha) and algorithmic improvements (histogram-based splits, sparsity-awareness).
Key hyperparameters (caret/gbm naming):
- n.trees (number of trees): 100–5000 (use early stopping)
- shrinkage (learning rate): 0.001–0.1 (commonly 0.01)
- interaction.depth (max depth): 1–10 (commonly 1–5)
- n.minobsinnode: 5–20
- bag.fraction: 0.5–1.0
Tuning strategy: fix a small shrinkage (e.g., 0.01) and tune interaction.depth and n.minobsinnode on a grid; choose n.trees by early stopping; set bag.fraction (0.5–0.8) to add variance reduction.
Your inner grid (example): shrinkage ∈ {0.01, 0.05}, interaction.depth ∈ {1, 3, 5}, n.trees ∈ {200, 500} is a good starter grid; expand with lower shrinkage and larger n.trees when compute allows.
GBM can be framed as greedy function approximation and, under certain conditions, is consistent. Theoretical analysis is complex due to the sequential, data-dependent nature of tree fitting. Recent research studies GBM generalization under assumptions on tree complexity, learning rate, and sample size.
Important intuition:
- Shrinkage and subsampling act as regularizers controlling capacity.
- Shallow base learners favor bias control; many iterations control variance.
# Synthetic regression example using gbm
set.seed(101)
library(gbm)
## Loaded gbm 2.2.2
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
# Synthetic data
n= 1000
p= 8
X= as.data.frame(matrix(runif(n * p, -2, 2), nrow = n))
colnames(X)= paste0("X", 1:p)
Y= with(X, 1.5 * X1 - 2 * X2^2 + sin(2 * X3) + rnorm(n, 0, 0.8))
train_df= data.frame(Y = Y, X)
# Fit gbm (squared error)
set.seed(123)
gbm_model= gbm(
formula = Y ~ .,
data = train_df,
distribution = "gaussian",
n.trees = 1000,
interaction.depth = 3,
shrinkage = 0.01,
n.minobsinnode = 10,
bag.fraction = 0.7,
train.fraction = 0.8,
verbose = FALSE)
# Best iteration by OOB estimation
best_iter= gbm.perf(gbm_model,method= "OOB",plot.it= FALSE)
## OOB generally underestimates the optimal number of iterations although predictive performance is reasonably competitive. Using cv_folds>1 when calling gbm usually results in improved predictive performance.
best_iter
## [1] 745
## attr(,"smoother")
## Call:
## loess(formula = object$oobag.improve ~ x, enp.target = min(max(4,
## length(x)/10), 50))
##
## Number of Observations: 1000
## Equivalent Number of Parameters: 40
## Residual Standard Error: 0.002814
# Partial dependence for X1
pdp_X1= plot(gbm_model, i.var= "X1",n.trees= best_iter,return.grid= TRUE)
head(pdp_X1)
## X1 y
## 1 -1.998512 -5.294761
## 2 -1.958151 -5.294761
## 3 -1.917791 -5.223514
## 4 -1.877431 -5.223514
## 5 -1.837071 -5.185695
## 6 -1.796710 -5.171040
library(caret)
set.seed(202)
grid= expand.grid(
.n.trees = c(200, 500, 1000),
.interaction.depth = c(1, 3, 5),
.shrinkage = c(0.01, 0.05),
.n.minobsinnode = c(5, 10)
)
ctrl= trainControl(method = "cv", number = 5)
caret_gbm= train(Y ~ ., data = train_df,
method = "gbm",
trControl = ctrl,
tuneGrid = grid,
verbose = FALSE)
caret_gbm
## Stochastic Gradient Boosting
##
## 1000 samples
## 8 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 800, 800, 800, 800, 800
## Resampling results across tuning parameters:
##
## shrinkage interaction.depth n.minobsinnode n.trees RMSE Rsquared
## 0.01 1 5 200 2.2037916 0.7847005
## 0.01 1 5 500 1.5433776 0.8435053
## 0.01 1 5 1000 1.1519684 0.8896137
## 0.01 1 10 200 2.2009155 0.7862203
## 0.01 1 10 500 1.5444175 0.8430913
## 0.01 1 10 1000 1.1501194 0.8889046
## 0.01 3 5 200 1.5290604 0.8465255
## 0.01 3 5 500 1.0106880 0.9089685
## 0.01 3 5 1000 0.8903781 0.9221595
## 0.01 3 10 200 1.5348811 0.8462302
## 0.01 3 10 500 1.0077826 0.9100372
## 0.01 3 10 1000 0.8884905 0.9225327
## 0.01 5 5 200 1.3651513 0.8700563
## 0.01 5 5 500 0.9260343 0.9195604
## 0.01 5 5 1000 0.8833141 0.9230565
## 0.01 5 10 200 1.3645595 0.8708424
## 0.01 5 10 500 0.9227677 0.9199001
## 0.01 5 10 1000 0.8771536 0.9240475
## 0.05 1 5 200 1.1493840 0.8888889
## 0.05 1 5 500 0.8890472 0.9228757
## 0.05 1 5 1000 0.8718172 0.9248368
## 0.05 1 10 200 1.1419223 0.8900205
## 0.05 1 10 500 0.8903988 0.9223743
## 0.05 1 10 1000 0.8699625 0.9251037
## 0.05 3 5 200 0.9022204 0.9199051
## 0.05 3 5 500 0.9126036 0.9175669
## 0.05 3 5 1000 0.9312784 0.9142047
## 0.05 3 10 200 0.8912592 0.9217999
## 0.05 3 10 500 0.9012182 0.9195659
## 0.05 3 10 1000 0.9276446 0.9148586
## 0.05 5 5 200 0.9009020 0.9197521
## 0.05 5 5 500 0.9200159 0.9161603
## 0.05 5 5 1000 0.9444168 0.9116945
## 0.05 5 10 200 0.8954884 0.9207513
## 0.05 5 10 500 0.9212543 0.9159846
## 0.05 5 10 1000 0.9446524 0.9117000
## MAE
## 1.7793451
## 1.2354141
## 0.9175543
## 1.7776756
## 1.2366568
## 0.9172518
## 1.2135261
## 0.8022057
## 0.7072477
## 1.2200945
## 0.7988983
## 0.7056066
## 1.0867415
## 0.7356025
## 0.7012343
## 1.0846713
## 0.7316237
## 0.6972771
## 0.9178267
## 0.7020513
## 0.6927187
## 0.9120316
## 0.7091831
## 0.6921418
## 0.7158640
## 0.7239467
## 0.7339373
## 0.7059225
## 0.7111839
## 0.7304012
## 0.7171558
## 0.7290537
## 0.7482224
## 0.7123519
## 0.7298683
## 0.7495009
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 1000, interaction.depth =
## 1, shrinkage = 0.05 and n.minobsinnode = 10.
plot(caret_gbm)
When train.fraction is used in gbm, the model tracks performance on the held-out validation portion, and gbm.perf(..., method = "test") can be used to find the optimal number of trees. In caret, use trainControl with method = "cv" or method = "repeatedcv", include sufficiently large n.trees values in the tuning grid, and let the resampling results effectively perform the early stopping.
When reporting GBM results, state shrinkage, interaction.depth, n.minobsinnode, bag.fraction, and the number of trees (n.trees) actually used.

XGBoost (Chen & Guestrin, 2016) is an optimized implementation of gradient boosting that uses a regularized objective, second-order Taylor expansion, and algorithmic improvements (sparsity-awareness, column subsampling, weighted quantile sketch, parallelized split finding). It is widely used for tabular supervised learning due to its speed and predictive performance.
XGBoost fits an additive model of trees:
\[ \hat{y}_i = \sum_{k=1}^K f_k(x_i), \qquad f_k \in \mathcal{F}, \]
where \(\mathcal{F}\) is the space of regression trees.
XGBoost minimizes the regularized objective:
\[ \mathcal{L}(\phi) = \sum_{i=1}^n l\big(y_i, \hat{y}_i\big) + \sum_{k=1}^K \Omega(f_k), \]
where \(l\) is the training loss (e.g., squared error, logistic loss) and the regularization term for a tree \(f\) with \(T\) leaves is
\[ \Omega(f) = \gamma T + \tfrac{1}{2} \lambda \sum_{j=1}^T w_j^2. \]
Here \(w_j\) are the leaf weights, \(\gamma\) penalizes tree complexity (number of leaves) and \(\lambda\) is an \(L_2\) penalty on leaf weights.
At iteration \(t\), suppose the model so far is \(\hat{y}_i^{(t-1)}\). We add a new tree \(f_t\) and approximate the objective by a second-order Taylor expansion of the loss around \(\hat{y}_i^{(t-1)}\):
\[ l\big(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\big) \approx l(y_i, \hat{y}_i^{(t-1)}) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t(x_i)^2, \]
where
\[ g_i = \left. \frac{\partial l(y_i, \hat{y})}{\partial \hat{y}} \right|_{\hat{y}=\hat{y}_i^{(t-1)}}, \qquad h_i = \left. \frac{\partial^2 l(y_i, \hat{y})}{\partial \hat{y}^2} \right|_{\hat{y}=\hat{y}_i^{(t-1)}}. \]
Plugging this approximation into the objective (and dropping constants independent of \(f_t\)) yields the per-iteration objective to minimize:
\[ \tilde{\mathcal{L}}^{(t)} = \sum_{i=1}^n \big[g_i f_t(x_i) + \tfrac{1}{2} h_i f_t(x_i)^2\big] + \Omega(f_t). \]
A regression tree partitions instances into \(T\) disjoint leaves. For leaf \(j\) with instance set \(I_j\) and constant prediction \(w_j\) on that leaf, the objective contribution from that leaf is
\[ \sum_{i\in I_j} \big[g_i w_j + \tfrac{1}{2} h_i w_j^2\big] + \tfrac{1}{2}\lambda w_j^2 + \gamma. \]
Solving for the optimal \(w_j\) by setting derivative to zero gives
\[ w_j^* = -\frac{\sum_{i\in I_j} g_i}{\sum_{i\in I_j} h_i + \lambda}. \]
The resulting optimal objective reduction (score) for leaf \(j\) is
\[ \text{Score}(I_j) = -\tfrac{1}{2} \frac{(\sum_{i\in I_j} g_i)^2}{\sum_{i\in I_j} h_i + \lambda} + \gamma. \]
For a candidate split that divides parent set \(I\) into left/right \(I_L, I_R\), the gain is
\[ \text{Gain} = \tfrac{1}{2} \left( \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{G^2}{H + \lambda} \right) - \gamma, \]
where \(G = \sum_{i\in I} g_i\), \(H = \sum_{i\in I} h_i\), and similarly for \(G_L,G_R,H_L,H_R\). Splits with Gain > 0 (and larger Gain preferred) are chosen.
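A small numeric illustration with made-up gradients and Hessians evaluates the optimal leaf weights and the split gain by hand:
# Toy evaluation of w_j* and Gain (assumed values of g_i, h_i)
g <- c(-1.2, 0.8, 0.5, -0.3); h <- rep(1, 4)
lambda <- 1; gamma_pen <- 0
left <- 1:2; right <- 3:4
GL <- sum(g[left]);  HL <- sum(h[left])
GR <- sum(g[right]); HR <- sum(h[right])
G  <- sum(g);        H  <- sum(h)
w_left  <- -GL / (HL + lambda)                 # optimal left-leaf weight
w_right <- -GR / (HR + lambda)                 # optimal right-leaf weight
gain <- 0.5 * (GL^2/(HL + lambda) + GR^2/(HR + lambda) - G^2/(H + lambda)) - gamma_pen
c(w_left = w_left, w_right = w_right, gain = gain)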
Column subsampling (colsample_bytree, colsample_bylevel) reduces overfitting and speeds up learning, and the regularization terms lambda (L2), alpha (L1), and gamma (minimum split loss) control model complexity.

For common losses, the gradients and Hessians are as follows.

Squared error (regression): \(l(y,\hat{y}) = \tfrac{1}{2}(y-\hat{y})^2\)
\[ g_i = \hat{y}_i - y_i, \qquad h_i = 1. \]
Logistic (binary classification, y in {0,1}): with log-loss
\[ p_i = \sigma(\hat{y}_i) = \frac{1}{1+e^{-\hat{y}_i}}, \] \[ g_i = p_i - y_i, \qquad h_i = p_i(1-p_i). \]
(For y coded as {±1} different conventions apply; XGBoost expects y in {0,1} or real values depending on objective.)
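A quick check of these formulas at a single toy prediction (y coded in {0,1}):
# Logistic-loss gradient and Hessian at one point
yhat <- 0.4; y <- 1
p <- 1 / (1 + exp(-yhat))
c(g = p - y, h = p * (1 - p))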
Key hyperparameters and practical recommendations:
- eta (learning rate): 0.01–0.3. Smaller values require more trees. Typical: 0.01, 0.05, 0.1.
- nrounds / n_estimators: 100–5000; use early stopping on a validation set.
- max_depth: 3–10 (controls tree complexity).
- min_child_weight: 1–10 (minimum sum of Hessian in a leaf; prevents overfitting).
- subsample: 0.5–1.0 (row subsampling).
- colsample_bytree: 0.3–1.0 (column subsampling).
- lambda, alpha: L2/L1 regularization on leaf weights.
- gamma: minimum loss reduction required to make a split.
Tuning strategy: 1. Tune max_depth and min_child_weight first (tree complexity). 2. Tune subsample and colsample_bytree next. 3. Tune the regularization parameters lambda/alpha and gamma. 4. Use a small eta (0.01–0.1) and a large nrounds; select nrounds by early stopping.
# Example: regression using xgboost
library(xgboost)
set.seed(123)
# synthetic data
n= 1000; p= 8
X= matrix(runif(n*p, -2, 2), nrow = n)
y= 1.5*X[,1]-2*X[,2]^2+sin(2*X[,3])+rnorm(n,0,0.8)
# split
tr_idx= sample(seq_len(n), size = 0.8*n)
train_x= X[tr_idx, ]; train_y= y[tr_idx]
valid_x= X[-tr_idx, ]; valid_y= y[-tr_idx]
dtrain= xgb.DMatrix(data= train_x,label= train_y)
dvalid= xgb.DMatrix(data= valid_x, label = valid_y)
watchlist= list(train=dtrain,valid=dvalid)
params= list(objective = 'reg:squarederror',
eta = 0.05,
max_depth = 5,
subsample = 0.8,
colsample_bytree = 0.8,
lambda = 1,
alpha = 0)
xgb_mod= xgb.train(params = params,
data = dtrain,
nrounds = 2000,
watchlist = watchlist,
early_stopping_rounds = 50,
print_every_n = 50)
## [1] train-rmse:4.213583 valid-rmse:4.202594
## Multiple eval metrics are present. Will use valid_rmse for early stopping.
## Will train until valid_rmse hasn't improved in 50 rounds.
##
## [51] train-rmse:0.903106 valid-rmse:1.337959
## [101] train-rmse:0.523974 valid-rmse:1.086337
## [151] train-rmse:0.411107 valid-rmse:1.048792
## [201] train-rmse:0.337788 valid-rmse:1.041219
## [251] train-rmse:0.284987 valid-rmse:1.038966
## [301] train-rmse:0.241613 valid-rmse:1.037616
## Stopping. Best iteration:
## [294] train-rmse:0.247224 valid-rmse:1.036510
# variable importance
imp= xgb.importance(model = xgb_mod)
print(head(imp))
## Feature Gain Cover Frequency
## <char> <num> <num> <num>
## 1: f1 0.55512803 0.21116600 0.18110535
## 2: f0 0.30282739 0.17333012 0.17882889
## 3: f2 0.06072744 0.14247355 0.14202605
## 4: f4 0.02525411 0.08501805 0.10054382
## 5: f5 0.01561302 0.11812182 0.11268496
## 6: f3 0.01426617 0.09036454 0.09750854
# partial dependence via pdp package or xgboost::xgb.plot.importance
GPBoost integrates gradient-boosted decision trees with Gaussian Processes (GPs) or mixed-effects models. It is designed for data with hierarchical structure (clusters, repeated measures) or spatial/temporal correlation where capturing both complex fixed effects and structured random effects is important.
Model conceptually decomposes the signal into a nonparametric part learned by boosting and a correlated random effect modeled by GP or random intercepts/slopes.
A general GPBoost model for continuous outcomes can be written as:
\[ \mathbf{y} = f(\mathbf{X}) + \mathbf{Zb} + \boldsymbol{\epsilon}, \]
where:
\(f(\mathbf{X})\) is the boosting predictor (sum of trees),
\(\mathbf{Zb}\) represents random effects (with \(\mathbf{b} \sim \mathcal{N}(0, \Sigma(\theta))\)),
\(\boldsymbol{\epsilon} \sim \mathcal{N}(0, \sigma^2 I)\) is observation noise.
A Gaussian Process is a special case where \(\mathbf{Z} = I\) and \(\Sigma(\theta)\) is a covariance kernel defined by hyperparameters \(\theta\) (e.g., Matérn, RBF).
Conditional on \(f\) the marginal distribution of \(\mathbf{y}\) is
\[ \mathbf{y} \sim \mathcal{N}\big(f(\mathbf{X}),\; \Sigma(\theta) + \sigma^2 I\big). \]
The (negative) log marginal likelihood is therefore:
\[ \ell(\theta; f) = \tfrac{1}{2}(\mathbf{y} - f)^\top K^{-1}(\theta) (\mathbf{y} - f) + \tfrac{1}{2} \log |K(\theta)| + \tfrac{n}{2}\log(2\pi), \]
with \(K(\theta) = \Sigma(\theta) + \sigma^2 I\).
GPBoost maximizes the marginal likelihood w.r.t. covariance parameters \(\theta\) (or uses REML) while estimating \(f\) via boosting.
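The Gaussian log marginal likelihood itself is easy to evaluate for a given residual vector and covariance matrix; the sketch below (an illustration of the formula above, not gpboost internals) uses a Cholesky factorization, with an exponential kernel chosen purely as an example:
# Negative log marginal likelihood for residuals r with covariance K
neg_log_marglik <- function(r, K) {
  n <- length(r)
  R <- chol(K)                                   # K = t(R) %*% R
  alpha <- backsolve(R, forwardsolve(t(R), r))   # alpha = K^{-1} r
  0.5 * sum(r * alpha) + sum(log(diag(R))) + 0.5 * n * log(2 * pi)
}
set.seed(1)
s <- seq(0, 1, length.out = 50)                        # toy 1-D locations
K <- exp(-as.matrix(dist(s)) / 0.2) + diag(0.1, 50)    # exponential kernel + nugget
r <- as.vector(t(chol(K)) %*% rnorm(50))               # a draw with covariance K
neg_log_marglik(r, K)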
A practical training algorithm alternates between two steps: (1) holding the covariance parameters \(\theta\) fixed, update \(f\) with one or more boosting steps (trees fit to the appropriate gradients); (2) holding \(f\) fixed, update \(\theta\) by maximizing the (restricted) marginal likelihood.
Repeat until convergence. This alternation resembles an EM or block-coordinate optimization.
Common covariance structures in GPBoost include grouped (clustered) random intercepts and slopes, spatial Gaussian Processes with Matérn or exponential kernels, and temporal or spatio-temporal kernels. The choice depends on domain knowledge (spatial coordinates, repeated measures, cluster IDs).
Given fitted \(f\) and estimated \(\theta\), predictive distribution at new input \(x_\ast\) with possible new grouping structure is obtained by standard GP conditioning formulas, combined with predictions from boosting for fixed effects. This yields predictive means and variances that account for both tree-based uncertainty (approximate) and GP/random-effect uncertainty (exact within Gaussian assumptions).
GPBoost implementations provide functions for obtaining best linear unbiased predictors (BLUPs) for random effects and predictive intervals.
Boosting-related: learning_rate, max_depth, nrounds, subsample, colsample_bytree — tune similarly to XGBoost/GBM.
GP-related: kernel type, length-scale(s) \(\rho\), process variance \(\sigma^2\), nugget/noise \(\sigma^2_\epsilon\), group-level variance \(\tau^2\). These are estimated by maximizing marginal likelihood (REML) but may require sensible starting values and bounds.
Computational note: covariance matrix operations scale poorly with large n (\(O(n^3)\)). GPBoost uses structure (grouped random effects, sparse approximations) to scale to larger datasets, but caution is needed for very large n.
Note: The gpboost R package is available on CRAN/GitHub (installation may require dependencies). Below is a minimal example; refer to the package docs for full functionality.
# Example: Gaussian mixed model + boosting (illustrative)
# install.packages("gpboost") # if not installed
library(gpboost)
## Loading required package: R6
## Warning: package 'R6' was built under R version 4.2.3
##
## Attaching package: 'gpboost'
## The following objects are masked from 'package:xgboost':
##
## getinfo, setinfo, slice
set.seed(123)
# ------------------------------------
# 1. Simulate grouped data
# ------------------------------------
n_groups= 50
obs_per_group= 20
n= n_groups * obs_per_group
group= rep(1:n_groups, each = obs_per_group)
X1= runif(n, -2, 2)
X2= runif(n, -1, 1)
# true underlying function
f_true= 2*X1 + sin(2*X2)
# random intercepts for each group
tau= 1
b_group= rnorm(n_groups, 0, tau)
# response
y= f_true + b_group[group] + rnorm(n, 0, 0.5)
# feature matrix
X= cbind(X1, X2)
# ------------------------------------
# 2. Define GPBoost random effect model
# ------------------------------------
gp_model= GPModel(group_data = group,group_rand_coef_data = NULL,likelihood = "gaussian")
# ------------------------------------
# 3. Convert predictors to GPBoost Dataset
# ------------------------------------
dtrain= gpb.Dataset(data = X,label = y)
# ------------------------------------
# 4. Train the GPBoost model
# ------------------------------------
fit = gpb.train(data = dtrain,
gp_model = gp_model,
nrounds = 200,
learning_rate = 0.05,
max_depth = 4,
min_data_in_leaf = 10,objective = "regression_l2",verbose = 0)
# ------------------------------------
# 5. Prediction (fixed + random)
# ------------------------------------
pred= predict(
fit,
data = X,
group_data_pred = group
)$response_mean
head(pred)
## [1] -3.158746 1.689104 -2.543622 2.828992 3.659621 -4.254217
Artificial Neural Networks (ANNs) are computational models inspired by the structure and learning process of the human brain. They are especially powerful for capturing non-linear relationships and complex interactions that traditional statistical models may not detect.
An ANN consists of interconnected units called neurons, arranged in layers. Each neuron:
- receives input signals,
- multiplies the inputs by weights and adds a bias,
- passes the result through a non-linear activation function, and
- produces an output for the next layer.
For input vector \(x = (x_1, \ldots, x_p)\):
\[ z = \sum_{j=1}^p w_j x_j + b \]
\[ \hat{y} = f(z) \]
The input layer represents the predictor variables; no computation happens here.
Activation functions introduce non-linearity.
| Activation | Formula | Use Case |
|---|---|---|
| ReLU | \(\max(0,z)\) | Deep networks, fast training |
| Sigmoid | \(\frac{1}{1+e^{-z}}\) | Binary classification |
| tanh | \(\tanh(z)\) | Symmetric output |
| Linear | \(z\) | Regression |
Forward propagation computes outputs layer by layer:
\[ a^{(l)} = f(W^{(l)} a^{(l-1)} + b^{(l)}) \]
Ends at the output layer.
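A minimal sketch of this forward pass for one hidden layer (assumed sizes, ReLU hidden units, linear output for regression):
# Forward pass: a1 = f(W1 x + b1), y_hat = W2 a1 + b2
set.seed(1)
relu <- function(z) pmax(0, z)
p <- 3; H <- 4
x  <- rnorm(p)                              # one input vector
W1 <- matrix(rnorm(H * p), H, p); b1 <- rnorm(H)
W2 <- matrix(rnorm(H), 1, H);     b2 <- rnorm(1)
a1 <- relu(W1 %*% x + b1)                   # hidden-layer activations
y_hat <- as.numeric(W2 %*% a1 + b2)         # linear output
y_hat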
Backpropagation updates the weights by computing gradients of the loss using the chain rule.
Weight update:
\[ w_{new} = w_{old} - \eta \frac{\partial L}{\partial w} \]
where \(\eta\) is the learning rate.
Common optimizers:
- SGD
- Adam (most widely used)
- RMSProp
Important tuning parameters:
- number of hidden layers
- neurons per layer
- activation functions
- learning rate
- batch size
- epochs
- dropout rate
- L1/L2 regularization strength
| Model | Strengths | Weaknesses |
|---|---|---|
| ANN | Learns complex patterns | Needs large data |
| SVM | Good in high dimensions | Hard to scale |
| Random Forest | Stable, interpretable | Less flexible for complex boundaries |
library(nnet)
set.seed(123)
# Example regression data
n= 200
x1= runif(n, -2, 2)
x2= runif(n, -1, 1)
y= 3*x1-2*x2+sin(2*x1)+rnorm(n, 0, 0.3)
df= data.frame(x1, x2, y)
# Scale features
X_scaled= scale(df[, c("x1","x2")])
df_scaled= data.frame(X_scaled, y = df$y)
# Fit ANN with 5 hidden units
ann_fit= nnet(y~x1+x2,data = df_scaled,
size = 5, # hidden neurons
decay = 0.01, # weight decay
linout = TRUE, # regression
maxit = 500,
trace = FALSE)
# Predictions
pred= predict(ann_fit, df_scaled)
# Performance
mse= mean((pred - df_scaled$y)^2)
mse
## [1] 0.08343786
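A hedged follow-up sketch: the two tuning parameters caret exposes for nnet are size and decay, so a simple cross-validated grid search over them (reusing df_scaled from above) could look like this; linout, trace, and maxit are passed through to nnet.
# Tune size and decay for nnet via 5-fold CV with caret (illustrative)
library(caret)
set.seed(123)
grid <- expand.grid(size = c(3, 5, 8), decay = c(0.001, 0.01, 0.1))
ann_cv <- train(y ~ x1 + x2, data = df_scaled, method = "nnet",
                linout = TRUE, trace = FALSE, maxit = 500,
                tuneGrid = grid,
                trControl = trainControl(method = "cv", number = 5))
ann_cv$bestTune                             # selected size and decay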