Real data coding exercises
MSc in Statistics for Data Science at Carlos III University of Madrid
Eduardo García-Portugués
2020-03-10, v1.0
Datasets
MNIST dataset
The MNIST database contains images of handwritten 0–9 digits used in mailing. The first 60,000 images are
stored in the object MNIST inside MNIST-tSNE.RData. Each observation is a grayscale image made of 28 × 28
pixels that is recorded as a vector of length 28^2 = 784 by concatenating the pixel columns. The entries of
these vectors are numbers in the interval [0, 1] indicating the gray level of the pixel: 0 for white, 1
for black. These vectorised images are stored in the 60,000 × 784 matrix MNIST$x. The typical goal is to
classify the images into the digits they represent. These 0–9 labels are stored in MNIST$labels.
The data is high-dimensional (it lives in R^784) and therefore hard to analyse. To simplify its statistical
treatment, the nonlinear dimension-reduction technique t-SNE has been applied. t-SNE can be regarded as a kind
of “nonlinear PCA” that maps the data into a low-dimensional representation, in this case in R^2. The “t-SNE
scores” are stored in MNIST$y_tsne.
The following code chunk illustrates the basics of the data.
# Load data
load("datasets/MNIST-tSNE.RData")
# Visualization of an individual image
show_digit <- function(vec, col = gray(12:1 / 12), ...) {
image(matrix(vec, nrow = 28)[, 28:1], col = col, ...)
}
# Plot the first 10 images (one image per row of MNIST$x)
par(mfrow = c(1, 10), mar = c(0, 0, 0, 0))
for (i in 1:10) show_digit(MNIST$x[i, ], axes = FALSE)
# Data structure
str(MNIST)
## List of 3
## $ x : num [1:60000, 1:784] 0 0 0 0 0 0 0 0 0 0 ...
## $ labels: int [1:60000] 5 0 4 1 9 2 1 3 1 4 ...
## $ y_tsne: num [1:60000, 1:2] -58.5 -19 85.2 -23.8 68.3 ...
# Plot t-SNE scores
par(mfrow = c(1, 1), mar = c(5, 4, 4, 2) + 0.1)
plot(MNIST$y_tsne, col = rainbow(10)[MNIST$labels + 1], pch = 16, cex = 0.25)
legend("bottomright", col = rainbow(10), pch = 16, legend = 0:9)
[Figure: scatterplot of MNIST$y_tsne[, 1] vs. MNIST$y_tsne[, 2], points colored by digit label]
sunspots_births dataset
The sunspots_births dataset from the rotasym package contains the recorded sunspot births during
1872–2018 from the Debrecen Photoheliographic Data (DPD) sunspot catalogue and the revised version of the
Greenwich Photoheliographic Results (GPR) sunspot catalogue. The dataset contains 51,303 observations of 6
features of the births of groups of sunspots. These features include their positions in spherical coordinates
(theta and phi), their size (total_area), and their distance to the center of the solar disk (dist_sun_disc).
The data has been recently analysed in this paper.
The following code chunk illustrates the basics of the data.
# Install packages
# install.packages("rotasym")
# Load data
data("sunspots_births", package = "rotasym")
# Data summary
summary(sunspots_births)
## date cycle total_area
## Min. :1874-04-17 11:38:00 Min. :11.00 Min. : 0.00
## 1st Qu.:1929-04-01 01:00:01 1st Qu.:16.00 1st Qu.: 4.00
## Median :1964-12-16 08:24:00 Median :20.00 Median : 12.00
## Mean :1959-06-19 23:05:53 Mean :18.98 Mean : 54.63
## 3rd Qu.:1990-07-09 03:43:59 3rd Qu.:22.00 3rd Qu.: 43.00
## Max. :2018-06-19 04:58:46 Max. :24.00 Max. :2803.00
## dist_sun_disc theta phi
## Min. :0.0030 Min. :0.000 Min. :-1.0384709
## 1st Qu.:0.4637 1st Qu.:1.600 1st Qu.:-0.2565634
## Median :0.7450 Median :3.121 Median : 0.0087266
## Mean :0.6895 Mean :3.134 Mean : 0.0002379
## 3rd Qu.:0.9530 3rd Qu.:4.679 3rd Qu.: 0.2548181
## Max. :1.0000 Max. :6.283 Max. : 1.0419616
# Obtain the data from the 23rd solar cycle
sunspots_23 <- subset(sunspots_births, cycle == 23)
# Transform to Cartesian coordinates
X <- cbind(cos(sunspots_23$phi) * cos(sunspots_23$theta),
cos(sunspots_23$phi) * sin(sunspots_23$theta),
sin(sunspots_23$phi))
# Plot data
n_cols <- 100
rgl::plot3d(0, 0, 0, xlim = c(-1, 1), ylim = c(-1, 1), zlim = c(-1, 1),
radius = 1, type = "s", col = "lightblue", alpha = 0.25,
lit = FALSE)
cuts <- cut(x = sunspots_23$date, include.lowest = TRUE,
breaks = quantile(sunspots_23$date,
probs = seq(0, 1, l = n_cols + 1)))
rgl::points3d(X, col = viridisLite::viridis(n_cols)[cuts])
# Spörer's law: sunspots at the beginning of the solar cycle (dark blue
# color) tend to appear at higher latitudes, gradually decreasing to the
# equator as the solar cycle advances (yellow color)
Exercises
Chapter 2
• Exercise 2.19 (solved in exercise_2_19.R). Consider the sunspots_births dataset.
a. Compute and plot the kernel density estimator for phi using the DPI selector. Describe the result.
b. Compute and plot the kernel density derivative estimator for phi using the adequate DPI selector.
Using a horizontal line at y = 0, determine approximately the location of the main mode(s).
c. Compute the log-transformed kernel density estimator with adj.positive = 1 for total_area
using the NS selector.
d. Draw the histogram of M = 10^4 samples simulated from the kernel density estimator obtained in a.
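The course's intended tools for this exercise are presumably those of the ks package (hpi, kdde, kde). As a hedged base-R illustration of the kde-plus-resampling idea behind a. and d., the following sketch uses simulated angles as a stand-in for phi (the real data requires rotasym) and the Sheather–Jones plug-in bw.SJ() as a stand-in for the DPI selector:

```r
# Sketch only: simulated angles stand in for sunspots_births$phi
set.seed(42)
phi <- rnorm(500, mean = 0, sd = 0.3)

# Plug-in bandwidth (Sheather-Jones) and kernel density estimator
h <- bw.SJ(phi)
kde <- density(phi, bw = h)
plot(kde, main = "kde of phi (stand-in data)")

# d. Simulating from a kde: resample the data, then add kernel noise
M <- 1e4
samp <- sample(phi, size = M, replace = TRUE) + rnorm(M, sd = h)
hist(samp, breaks = 50, probability = TRUE)
lines(kde, col = 2)
```

The resampling step works because a Gaussian kde is exactly the mixture obtained by picking a data point uniformly at random and perturbing it with N(0, h^2) noise.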
• Exercise 2.20 (solved in exercise_2_20.R). Consider the MNIST dataset.
a. Compute the average gray level, av_gray_one, for each image of the digit “1”.
b. Compute and plot the kde of av_gray_one, taking into account that it is a positive variable.
c. Overlay the lognormal distribution density, with parameters estimated by maximum likelihood
(use MASS::fitdistr).
d. Repeat c. for the Weibull density.
e. Which parametric model seems more adequate?
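For steps c.–e., MASS::fitdistr returns the maximum likelihood estimates together with the attained log-likelihood, which gives a quick way to compare the two parametric candidates. A minimal sketch, using simulated positive values as a stand-in for av_gray_one (the real values come from MNIST$x):

```r
library(MASS)

# Stand-in sample for av_gray_one
set.seed(1)
av_gray_one <- rlnorm(1000, meanlog = -2, sdlog = 0.3)

# ML fits of the two candidate densities
fit_ln <- fitdistr(av_gray_one, densfun = "lognormal")
fit_wb <- fitdistr(av_gray_one, densfun = "weibull")

# Overlay both fitted densities on a kde of the data
plot(density(av_gray_one))
curve(dlnorm(x, fit_ln$estimate[1], fit_ln$estimate[2]), add = TRUE, col = 2)
curve(dweibull(x, fit_wb$estimate[1], fit_wb$estimate[2]), add = TRUE, col = 3)

# Larger log-likelihood suggests the more adequate model
c(lognormal = fit_ln$loglik, weibull = fit_wb$loglik)
```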
Chapter 3
• Exercise 3.25 (solved in exercise_3_25.R). Load the ovals.RData file.
a. Split the dataset into the training sample, comprised of the first 2,000 observations, and the test
sample (rest of the sample). Plot the dataset with colors for its classes. What can you say about
the classification problem?
b. Using the training sample, compute the plug-in bandwidth matrices for all the classes.
c. Use these plug-in bandwidths to perform kernel discriminant analysis.
d. Plot the contours of the kernel density estimator of each class and the classes partitions. Use
coherent colors between contours and points.
e. Predict the class for the test sample and compare with the true classes. Then report the successful
classification rate.
f. Compare the successful classification rate with the one given by LDA. Is it better than kernel
discriminant analysis?
g. Repeat f. with QDA.
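The kernel discriminant analysis of b.–e. relies on the ks package (Hpi for the plug-in bandwidth matrices, kda for the classifier) and is not sketched here. The LDA/QDA baselines of f.–g. can be illustrated with MASS on simulated two-class data standing in for ovals.RData:

```r
library(MASS)

# Simulated two-class data as a stand-in for ovals.RData
set.seed(7)
n <- 1000
x <- rbind(cbind(rnorm(n), rnorm(n)), cbind(rnorm(n, 2), rnorm(n, 2)))
df <- data.frame(x1 = x[, 1], x2 = x[, 2], lab = factor(rep(1:2, each = n)))

# Train/test split
idx <- sample(nrow(df), 1500)
fit_lda <- lda(lab ~ x1 + x2, data = df[idx, ])
fit_qda <- qda(lab ~ x1 + x2, data = df[idx, ])

# Successful classification rates on the test sample
rate_lda <- mean(predict(fit_lda, df[-idx, ])$class == df$lab[-idx])
rate_qda <- mean(predict(fit_qda, df[-idx, ])$class == df$lab[-idx])
c(LDA = rate_lda, QDA = rate_qda)
```

On the real ovals data, where the class boundaries are nonlinear, the comparison against kernel discriminant analysis is the interesting part.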
• Exercise 3.26 (solved in exercise_3_26.R). Consider the MNIST dataset. Classify the digit images, via
the t-SNE scores, into the digit labels:
a. Split the dataset into the training sample, comprised of the first 50,000 t-SNE scores and their
associated labels, and the test sample (rest of the sample).
b. Using the training sample, compute the plug-in bandwidth matrices for all the classes.
c. Use these plug-in bandwidths to perform kernel discriminant analysis.
d. Plot the contours of the kernel density estimator of each class and overlay the t-SNE scores as
points. Use coherent colors between contours and points, and add a legend.
e. Obtain the successful classification rate of the kernel discriminant analysis.
• Exercise 3.27 (solved in exercise_3_27.R). Consider the MNIST dataset. Investigate the different ways
of writing the digit “3” using kernel mean shift clustering. Follow the next steps:
a. Consider the first 2,000 t-SNE scores (for the sake of computational expediency) of the class “3”.
b. Compute the normal scale bandwidth matrix that is appropriate for performing kernel mean shift
clustering with the first 2,000 t-SNE scores of the class “3”.
c. Do kernel mean shift clustering on that subset using the previously obtained bandwidth and obtain
the modes ξ_j, j = 1, . . . , M, that cluster the t-SNE scores.
d. Determine the M images that have the closest t-SNE scores, using the Euclidean distance, to the
modes ξ_j.
e. Show the closest images associated to the modes. Do they represent different forms of drawing the
digit “3”?
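ks::kms is presumably the intended tool for c., but the underlying iteration is simple enough to sketch in base R. This one-dimensional toy (the actual exercise works on two-dimensional t-SNE scores) repeatedly shifts a point to the kernel-weighted mean of the sample until it settles on a mode:

```r
# Two well-separated modes near -2 and 2
set.seed(3)
x <- c(rnorm(100, -2, 0.5), rnorm(100, 2, 0.5))
h <- 0.5

# Mean shift: move x0 to the kernel-weighted average of the data
mean_shift <- function(x0, x, h, iters = 200) {
  for (k in 1:iters) {
    w <- dnorm((x0 - x) / h)
    x0 <- sum(w * x) / sum(w)
  }
  x0
}

# Starting points are attracted to their nearest mode
modes <- sapply(c(-1.5, 1.5), mean_shift, x = x, h = h)
```

Assigning each observation to the mode its mean shift path converges to is exactly the clustering used in the exercise.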
Chapter 4
• Exercise 4.14 (solved in exercise_4_14.R).
a. Code your own function that computes the local cubic estimator. The function must take as input
the vector of evaluation points x, the sample data, the bandwidth h, and the kernel K. The result
must be a vector of the same length as x containing the estimator evaluated at x.
b. Test the implementation by estimating the regression function in the location model Y = m(X) + ε,
where m(x) = (x − 1)^2, X ∼ N(1, 1), and ε ∼ N(0, 0.5). Do it from a sample of size n = 500.
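A hedged sketch of one possible implementation of a.: at each evaluation point x0, fit a weighted degree-3 polynomial in (X − x0) and read off the intercept. The bandwidth h = 0.5 is an arbitrary choice for illustration, and N(0, 0.5) is read as variance 0.5:

```r
# Local cubic estimator via weighted least squares at each x0
loc_cubic <- function(x, X, Y, h, K = dnorm) {
  sapply(x, function(x0) {
    Z <- cbind(1, X - x0, (X - x0)^2, (X - x0)^3)  # design in (X - x0)
    w <- K((X - x0) / h)                           # kernel weights
    beta <- solve(crossprod(Z, w * Z), crossprod(Z, w * Y))
    beta[1]  # intercept estimates m(x0)
  })
}

# b. Test on Y = (X - 1)^2 + eps, X ~ N(1, 1), eps ~ N(0, 0.5), n = 500
set.seed(123)
n <- 500
X <- rnorm(n, mean = 1, sd = 1)
Y <- (X - 1)^2 + rnorm(n, sd = sqrt(0.5))
x_grid <- seq(-1, 3, length.out = 100)
m_hat <- loc_cubic(x_grid, X, Y, h = 0.5)
plot(X, Y, pch = 16, cex = 0.3)
lines(x_grid, m_hat, col = 2)
curve((x - 1)^2, add = TRUE, col = 3)
```

Since m here is a polynomial of degree 2 ≤ 3, the local cubic fit is essentially unbiased and the red and green curves should nearly coincide.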
• Exercise 4.15 (solved in exercise_4_15.R). Consider the sunspots_births dataset. Then:
a. Filter the dataset to account only for the 23rd cycle.
b. Inspect the graphical relation between dist_sun_disc (response) and log10(total_area) (predictor).
c. Compute the CV bandwidth for the above regression, for the local constant estimator.
d. Compute and plot the local constant estimator using CV bandwidth.
e. Repeat c. and d. for the local linear estimator.
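The course presumably relies on a ready-made CV selector for c.; as a hedged base-R sketch of what c.–d. compute, the following minimizes the leave-one-out CV error of the Nadaraya–Watson (local constant) estimator on simulated data standing in for the dist_sun_disc vs. log10(total_area) relation:

```r
# Stand-in data for the regression
set.seed(11)
n <- 300
x <- runif(n, 0, 3)
y <- sin(x) + rnorm(n, sd = 0.2)

# Nadaraya-Watson (local constant) estimator
nw <- function(x0, X, Y, h) {
  sapply(x0, function(z) {
    w <- dnorm((z - X) / h)
    sum(w * Y) / sum(w)
  })
}

# Leave-one-out cross-validation error for a given bandwidth
cv_err <- function(h, X, Y) {
  mean(sapply(seq_along(X), function(i) {
    (Y[i] - nw(X[i], X[-i], Y[-i], h))^2
  }))
}

# CV bandwidth and fitted curve
h_cv <- optimize(cv_err, interval = c(0.05, 1), X = x, Y = y)$minimum
x_grid <- seq(0, 3, length.out = 100)
plot(x, y, pch = 16, cex = 0.5)
lines(x_grid, nw(x_grid, x, y, h_cv), col = 2)
```

For e., the same CV objective applies with the local linear estimator substituted for nw().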