Regression is a statistical method used to study the relationship between variables. It helps us understand how one or more independent variables (predictors) influence a dependent variable (response).
At its core, regression answers this question:
“How does a change in one variable affect another?”
Why Study Regression?
Regression is a powerful tool used to:
Predict future outcomes
Identify trends and relationships
Test hypotheses
Inform decisions and policies
Real-World Questions Answered by Regression

| Context | Question |
| --- | --- |
| Genetics | Are daughters taller than their mothers? |
| Education | Does reducing class size improve student performance? |
| Geology | Can we predict the time of Old Faithful's next eruption using the last one? |
| Health & Nutrition | Do dietary changes reduce cholesterol levels? |
| Economics & Demographics | Do wealthier countries have lower birth rates? |
| Transportation | Can better highway designs lower accident rates? |
| Environmental Science | Is water usage increasing over the years? |
| Real Estate & Conservation | Do conservation easements reduce land values? |
Linear Regression: The Foundation
Here we will focus on Linear Regression, the most commonly used regression technique.
What is Linear Regression?
Linear regression models the relationship between variables by fitting a straight line to the data. It assumes the response is a linear function of the predictors.
It is easy to interpret.
It forms the basis for more advanced regression techniques.
It is widely applicable in fields like economics, biology, engineering, and social sciences.
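To make the idea concrete, here is a minimal sketch of fitting a straight line by least squares in Python. The x and y values are hypothetical illustration data, not taken from the textbook.

```python
import numpy as np

# Hypothetical predictor/response pairs (illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Least-squares fit of y = b0 + b1*x; np.polyfit returns [slope, intercept]
b1, b0 = np.polyfit(x, y, deg=1)
print(f"intercept = {b0:.2f}, slope = {b1:.2f}")

# Fitted values on the line
y_hat = b0 + b1 * x
```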
Objective of Regression Analysis
The main goal of regression is to:
Summarize complex data in a simple and meaningful way
Understand relationships between variables
Make predictions and inform decisions
Represent relationships elegantly and effectively
Sometimes, a theory or prior knowledge may guide the form of the relationship (e.g., linear, quadratic).
Key Takeaways
Regression studies dependence between variables.
It helps answer a wide range of real-world questions.
Linear regression is the cornerstone of most regression techniques.
Simplicity, interpretability, and utility are at the heart of regression analysis.
Scatterplots: A First Look at Regression
Understanding the Basics
In simple regression, we study how one variable (X), called the predictor, influences another variable (Y), called the response. We observe data in pairs:
X: Independent variable (e.g., mother's height)
Y: Dependent variable (e.g., daughter's height)
To explore the relationship visually, we use a scatterplot.
What is a Scatterplot?
A scatterplot is a graph that shows each observation as a point, with the predictor (X) on the horizontal axis and the response (Y) on the vertical axis.
Inheritance of Height: A Historical Example
Karl Pearson (1893–1898) collected data on 1375 mother–daughter pairs in the UK. He wanted to understand: "Do taller mothers tend to have taller daughters?"
| Predictor | Response |
| --- | --- |
| Mother's height (mheight) | Daughter's height (dheight) |
We visualize the data using a scatterplot:
Figure: Jittered Scatterplot
Adds slight randomness to avoid overplotting (many points at same location).
Gives clearer view of point density.
Figure: Original Data
Shows exact height values (rounded to the nearest inch).
Suffers from overplotting: multiple data points overlap.
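As a rough sketch of how such a jittered plot could be produced (assuming NumPy and matplotlib are available; the heights below are simulated stand-ins, not Pearson's actual measurements):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Simulated mother/daughter heights, rounded to the nearest inch
mheight = np.round(rng.normal(62.5, 2.3, size=500))
dheight = np.round(30 + 0.5 * mheight + rng.normal(0, 2.3, size=500))

# Jitter: add uniform noise of at most +/- 0.5 inch to unstack overlapping points
def jitter(v):
    return v + rng.uniform(-0.5, 0.5, size=v.shape)

fig, ax = plt.subplots()
ax.scatter(jitter(mheight), jitter(dheight), s=8, alpha=0.4)
ax.set_xlabel("mheight (inches)")
ax.set_ylabel("dheight (inches)")
ax.set_aspect("equal")  # equal scaling for a fair comparison
plt.show()
```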
Key Insights from the Scatterplot
Equal Axes for Fair Comparison
Since mothers and daughters have similar height ranges, both axes should be scaled equally.
A perfect 45° line would represent identical heights.
Jittering Removes Overlap
Helps in showing all data points.
A small random value (±0.5) is added to each value to unstack the points.
Detecting Dependence
Scatter of points changes with the predictor.
See the visual comparison using mheight = 58, 64, 68: the average daughter's height increases with mother's height.
Elliptical Shape Suggests Linearity
Points form an ellipse tilted upward.
Implies a positive linear trend.
A good candidate for simple linear regression.
Special Data Points

| Type | Description | Role |
| --- | --- | --- |
| Leverage points | Unusually high or low X-values | Strong influence on the regression line |
| Outliers | Unusually high or low Y-values for a given X | May indicate anomalies or errors |
A scatterplot helps check whether the response depends on the predictor; here, taller mothers generally have taller daughters, showing an upward trend. Figure 1.2 shows that as mother's height increases (58, 64, 68 inches), the average daughter's height also increases. Points far from the others horizontally are leverage points; vertically distant ones are potential outliers.
Summary: Why Scatterplots Matter
Offer visual insight before any formal modeling.
Help assess:
Strength and direction of relationship
Suitability for regression (e.g., linearity)
Presence of outliers or influential points
Pro Tips
Always plot your data first; regression comes later!
Use jittering when data is rounded.
Maintain equal axis scaling when variables are measured on similar scales.
Forbes's Data (Figures 1.3 & 1.4): Collected in the 19th century by James D. Forbes, the data show the relationship between boiling point and atmospheric pressure. Figure 1.3a reveals a curved trend, and the residuals in Figure 1.3b show a poor linear fit. After applying a log transformation to pressure, Figure 1.4a becomes linear, and the residuals in Figure 1.4b scatter evenly, indicating a good linear model.
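A quick sketch of that workflow, using made-up boiling-point and pressure values rather than Forbes's actual measurements, might look like this:

```python
import numpy as np

# Hypothetical boiling points (deg F) and pressures (inches Hg), for illustration only
boiling_point = np.array([194.0, 199.5, 204.6, 209.5, 212.2])
pressure = np.array([20.8, 23.2, 25.9, 28.8, 30.1])

# Straight-line fit to the raw pressure (curvature tends to show up in the residuals)
b1_raw, b0_raw = np.polyfit(boiling_point, pressure, deg=1)
resid_raw = pressure - (b0_raw + b1_raw * boiling_point)

# Straight-line fit after a log transformation of the response
log_p = np.log10(pressure)
b1_log, b0_log = np.polyfit(boiling_point, log_p, deg=1)
resid_log = log_p - (b0_log + b1_log * boiling_point)

print(resid_raw, resid_log)  # compare the residual patterns
```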
Smallmouth Bass Growth (Figure 1.5): Collected from Lake Ontario in the 1970s, this data shows how length increases with age. However, there’s large variability among fish of the same age, meaning regression estimates the average size, not individual growth. Predictions carry uncertainty due to biological variation.
Snowfall Prediction (Figure 1.6): From Flagstaff, Arizona, data from 1910 to 1960 compare early and late seasonal snowfall. The scatterplot shows no meaningful trend: early snowfall (Sept–Dec) does not predict later snowfall (Jan–May). Correlation is weak or absent.
Turkey Weight Gain (Figure 1.7): From an animal nutrition study in the 1960s, the data explore how turkey weight gain varies with methionine dose and its source. Gain increases with dose, but overlapping variability and different slopes among sources suggest potential interactions and biological complexity.
Mean Function
The mean function describes how the average value of a response variable (Y) changes with a predictor variable (X). It is written as E(Y | X = x), meaning the expected value of Y given that X takes a specific value x. In many cases, we model this relationship using a straight line, such as E(Y | X = x) = β₀ + β₁x, where β₀ is the intercept and β₁ is the slope. This linear form represents a simple mean function and helps us understand the trend between variables.
In the Galton height dataset (collected in 1885–1886), we examine how a daughter's height (dheight) depends on her mother's height (mheight). If mothers and daughters were exactly the same height, we'd expect a mean function with slope 1. This is shown as a dashed line in Figure 1.8. However, the observed data shows a solid line with slope less than 1, indicating that tall mothers tend to have slightly shorter daughters, and short mothers slightly taller daughters. This effect is known as regression to the mean, a phenomenon first noted by Francis Galton.
In another example, Figure 1.5 uses smallmouth bass data to illustrate the concept. The dashed curve connects the average length of fish at each age, forming an empirical estimate of E(length | age). This curve acts as the mean function, summarizing how expected fish length increases with age. The individual data points still show variation around the curve, reminding us that not all fish grow at the same rate.
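One way to picture such an empirical mean function is to average the response at each observed predictor value; the sketch below does this for hypothetical age/length numbers standing in for the bass data.

```python
import numpy as np

# Hypothetical ages (years) and lengths (mm), standing in for the bass data
age = np.array([1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4])
length = np.array([72, 80, 77, 110, 125, 118, 152, 160, 178, 190, 183])

# Empirical estimate of E(length | age = a): the sample mean at each age
for a in np.unique(age):
    print(a, length[age == a].mean())
```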
What does “mean” have to do with it?
The mean (average) serves as a reference point for understanding the relationship between mothers’ and daughters’ heights. In the context of Galton’s data:
If height were perfectly inherited, we’d expect that a mother one inch taller than average would have a daughter also one inch taller than average.
This would result in a mean function (i.e., the expected daughter's height given mother's height) with a slope of 1, meaning daughters' heights track perfectly with mothers'.
But what was observed?
The regression line (the solid line) had a slope less than 1, meaning:
Tall mothers tend to have daughters who are tall but not quite as tall.
Short mothers tend to have daughters who are short but not quite as short.
Over many cases, extreme values (very tall or very short) regress toward the mean height.
Why is the mean function important here?
It defines expectations:
The mean function tells us the expected daughter's height for a given mother's height.
Without the concept of the mean, we can’t define or detect regression to the mean.
It reveals the pattern:
The deviation from the dashed line (slope = 1) shows that heredity is not perfect.
This pattern emerges statistically when analyzing many mother–daughter pairs and comparing them to the mean of the population.
What does "regression to the mean" actually mean?
It means that:
Children of extreme parents (very tall or very short) tend to be closer to the average height than their parents.
It's not because of any active "pull" toward the mean; it's a statistical effect that arises because traits like height are influenced by both genetics and environment, and some of the extremes are due to random variation.
Thus, the mean function is central to regression: it captures how the average response changes with the predictor and provides the basis for building statistical models.
Understanding Variance Functions in Regression
When analyzing how a response variable (like a daughter's height) depends on a predictor (like a mother's height), we don't just look at the average trend; we also care about how spread out the data is. This is where the variance function comes in.
What Is a Variance Function?
The variance function tells us how variable the response is when we fix the predictor at a certain value. Mathematically, it is written as:
Var(Y | X = x) (read as: the variance of Y given X equals x)
This function gives insight into how consistent or spread out the response values are around their mean, for each fixed value of the predictor.
Visual Examples
Figure 1.2: Daughters' Height vs. Mothers' Height
For the Galton height data:
The variability in daughters' heights for given mother's heights (58, 64, 68 inches) appears roughly constant.
This means the spread of daughter heights doesn't change much across different mother heights.
Figure 1.5: Smallmouth Bass Length by Age
Here too, the variance in fish lengths across ages seems fairly consistent.
Although it’s not guaranteed, assuming constant variance is reasonable in this case.
Turkey Data Example
In the turkey weight dataset, only treatment means are plotted, not individual pen-level values.
So, we can't evaluate the variance function directly, since the graph doesn't show within-treatment variability.
Common Assumption in Linear Models
In many simple linear regression models, we assume the variance stays the same for all values of x:
Var(Y | X = x) = σ²
Where:
σ² (sigma squared) is a positive constant,
It reflects a uniform level of variability across all predictor values.
This assumption helps simplify model fitting, although more advanced models (explored in Chapter 7) can allow variance to change with x.
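As a rough illustration, one could compare the spread of the response at a few fixed predictor values; the snippet below does this with simulated heights rather than the actual Galton data.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated mother/daughter heights in place of the Galton data
mheight = rng.integers(55, 71, size=2000)
dheight = 30 + 0.5 * mheight + rng.normal(0, 2.3, size=2000)

# Sample variance of dheight at a few fixed mother heights
for x in (58, 64, 68):
    spread = dheight[mheight == x].var(ddof=1)
    print(f"Var(dheight | mheight = {x}) ~ {spread:.2f}")

# Roughly equal values are consistent with the constant-variance
# assumption Var(Y | X = x) = sigma^2.
```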
Key Takeaway
While the mean function tells us the average trend, the variance function reveals how spread out the data is at each level of the predictor. Both are essential for building and interpreting effective regression models.
Disclaimer: The concepts, explanations, and figures presented on this webpage are adapted from the textbook Applied Linear Regression by Sanford Weisberg, 4th Edition, Wiley-Interscience (2013). All rights and credits belong to the original author and publisher.
A regression tree is a type of decision tree adapted for regression problems, where the target variable is continuous rather than categorical. While traditional decision trees, such as ID3, C4.5, and CART, are primarily designed for classification tasks, regression trees use similar principles but apply different metrics to split the data. Instead of focusing on metrics like information gain, gain ratio, or the Gini index (used for classification), regression trees typically use metrics related to minimizing error, such as variance reduction or mean squared error (MSE).
The most common algorithm for building regression trees is the CART (Classification and Regression Trees) algorithm. While CART works for classification tasks by using the Gini index, for regression problems, CART adapts by using the variance of the target variable as the criterion for splitting nodes.
Step-by-Step Guide for Regression Trees with CART:
Let's walk through a simple example using the same dataset used for classification in a prior experiment (golf playing decision). However, in this case, the target variable represents the number of golf players, which is a continuous numerical value rather than a categorical one (like true/false in the original experiment).
1. Data Understanding
In the previous classification version of this dataset, the target column represented the decision to play golf (True/False). Here, the target column represents the number of golf players, which is a real number. The features might be the same (e.g., weather conditions, time of day), but the key difference is that the target is continuous rather than categorical.
2. Handling Continuous Target Variable
Since the target is now a continuous variable, we cannot use traditional classification metrics like counting occurrences of "True" and "False." Instead, we use the variance of the target variable as a way to decide how to split the data. The objective is to reduce the variance within each subset after a split.
3. Splitting the Data (Using Variance)
For each potential split, we calculate the variance (or standard deviation) of the target variable in the two child nodes.
The goal is to choose the split that results in the largest reduction in variance. The split with the least variance in each subset will indicate the most "homogeneous" groups with respect to the target variable.
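A minimal sketch of scoring one candidate split by variance reduction might look like the following; the outlook and players values are invented purely for illustration.

```python
import numpy as np

# Hypothetical feature and continuous target (number of golf players)
outlook = np.array(["sunny", "sunny", "rain", "rain", "overcast", "overcast", "sunny", "rain"])
players = np.array([25.0, 30.0, 46.0, 45.0, 52.0, 23.0, 35.0, 44.0])

def variance_reduction(target, mask):
    """Drop in variance achieved by splitting the node with the boolean mask."""
    left, right = target[mask], target[~mask]
    if len(left) == 0 or len(right) == 0:
        return 0.0
    n = len(target)
    weighted = (len(left) / n) * left.var() + (len(right) / n) * right.var()
    return target.var() - weighted

# Score the candidate split "outlook == sunny" vs. everything else
print(variance_reduction(players, outlook == "sunny"))
```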
4. Recursive Process
The tree-building process proceeds recursively. After finding the best split based on variance reduction, the dataset is divided into two subsets. The same process is then applied to each of these subsets until the stopping criteria are met (e.g., a maximum tree depth or a minimum number of data points in a leaf node).
5. Prediction with Regression Tree
Once the tree is built, each leaf node will contain the predicted value for the target variable. For regression trees, the value predicted by a leaf node is typically the mean of the target variable for the instances in that leaf.
New instances are passed down the tree, and the prediction is made by reaching a leaf and taking the average of the target values in that leaf.
6. Pruning the Tree
Just like classification trees, regression trees can be prone to overfitting. A tree that grows too deep may model noise in the data, leading to poor generalization. To counter this, pruning can be applied, which involves cutting back some branches of the tree to improve performance on unseen data.
Example with the Golf Dataset:
In the golf dataset, where we have the number of golf players as the target variable, we might use weather conditions or time of day as the features. Since the target variable is a continuous value (e.g., the number of players), the algorithm would calculate the variance of the number of players within different subsets of the data based on these features. The tree would split the data based on the feature that most reduces this variance at each step.
For instance, if the weather condition (e.g., sunny or rainy) significantly reduces the variance of the number of players, this would be chosen as a splitting criterion. The process continues recursively until the tree reaches a satisfactory depth or further splits no longer reduce variance significantly.
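For a complete fit-and-predict round trip, scikit-learn's DecisionTreeRegressor implements a CART-style regression tree; the tiny golf-style dataset below is invented for illustration only.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Features: outlook encoded as 0 = sunny, 1 = overcast, 2 = rain; temperature in deg F
X = np.array([[0, 85], [0, 80], [1, 83], [2, 70], [2, 68],
              [1, 64], [0, 72], [2, 75], [1, 81], [0, 69]])
y = np.array([25, 30, 46, 45, 52, 23, 35, 44, 50, 38])  # number of players

# Limit depth and leaf size as simple pre-pruning against overfitting
tree = DecisionTreeRegressor(max_depth=3, min_samples_leaf=2, random_state=0)
tree.fit(X, y)

# Each prediction is the mean of the training targets in the leaf that is reached
print(tree.predict([[1, 78], [0, 88]]))
```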
Regression trees serve as powerful tools for handling continuous target variables, and the CART algorithm adapts this idea by focusing on minimizing variance within subsets rather than classifying instances into discrete classes. This extends decision trees beyond classification, enabling them to predict continuous outcomes effectively.