
Real data coding exercises

MSc in Statistics for Data Science at Carlos III University of Madrid

Eduardo García-Portugués

2020-03-10, v1.0

Datasets

MNIST dataset

The MNIST database contains images of handwritten 0–9 digits used in mailing. The first 60,000 images are stored in the object MNIST inside MNIST-tSNE.RData. Each observation is a grayscale image made of 28 × 28 pixels that is recorded as a vector of length 28^2 = 784 by concatenating the pixel columns. The entries of these vectors are numbers in the interval [0, 1] indicating the level of grayness of the pixel: 0 for white, 1 for black. These vectorised images are stored in the 60,000 × 784 matrix MNIST$x. The typical goal is to classify the images as one of the digits they represent. These 0–9 labels are stored in MNIST$labels.

The data is high-dimensional (lives in R^784) and therefore hard to analyse. To simplify its statistical treatment, the nonlinear dimension-reduction technique t-SNE has been applied. t-SNE can be regarded as a kind of "nonlinear PCA" that maps the data into a low-dimensional representation, in this case in R^2. The "t-SNE scores" are stored in MNIST$y_tsne.

The following code chunk illustrates the basics of the data.

# Load data
load("datasets/MNIST-tSNE.RData")

# Visualization of an individual image
show_digit <- function(vec, col = gray(12:1 / 12), ...) {
  image(matrix(vec, nrow = 28)[, 28:1], col = col, ...)
}

# Plot the first 10 images (in row order)
par(mfrow = c(1, 10), mar = c(0, 0, 0, 0))
for (i in 1:10) show_digit(MNIST$x[i, ], axes = FALSE)

# Data structure
str(MNIST)
## List of 3
##  $ x     : num [1:60000, 1:784] 0 0 0 0 0 0 0 0 0 0 ...
##  $ labels: int [1:60000] 5 0 4 1 9 2 1 3 1 4 ...
##  $ y_tsne: num [1:60000, 1:2] -58.5 -19 85.2 -23.8 68.3 ...


# Plot t-SNE scores
par(mfrow = c(1, 1), mar = c(5, 4, 4, 2) + 0.1)
plot(MNIST$y_tsne, col = rainbow(10)[MNIST$labels + 1], pch = 16, cex = 0.25)
legend("bottomright", col = rainbow(10), pch = 16, legend = 0:9)

[Figure: scatterplot of MNIST$y_tsne[,1] vs. MNIST$y_tsne[,2], colored by digit label]

sunspots_births dataset

The sunspots_births dataset from the rotasym package contains the recorded sunspot births during 1872–2018 from the Debrecen Photoheliographic Data (DPD) sunspot catalogue and the revised version of the Greenwich Photoheliographic Results (GPR) sunspot catalogue. The dataset contains 51,303 observations of 6 features of the births of groups of sunspots. These features include their positions in spherical coordinates (theta and phi), their size (total_area), and their distance to the center of the solar disk (dist_sun_disc). The data has been recently analysed in this paper.

The following code chunk illustrates the basics of the data.

# Install packages
# install.packages("rotasym")

# Load data
data("sunspots_births", package = "rotasym")

# Data summary
summary(sunspots_births)
##       date                         cycle        total_area
##  Min.   :1874-04-17 11:38:00  Min.   :11.00  Min.   :   0.00
##  1st Qu.:1929-04-01 01:00:01  1st Qu.:16.00  1st Qu.:   4.00
##  Median :1964-12-16 08:24:00  Median :20.00  Median :  12.00
##  Mean   :1959-06-19 23:05:53  Mean   :18.98  Mean   :  54.63
##  3rd Qu.:1990-07-09 03:43:59  3rd Qu.:22.00  3rd Qu.:  43.00
##  Max.   :2018-06-19 04:58:46  Max.   :24.00  Max.   :2803.00
##  dist_sun_disc        theta            phi
##  Min.   :0.0030   Min.   :0.000   Min.   :-1.0384709
##  1st Qu.:0.4637   1st Qu.:1.600   1st Qu.:-0.2565634
##  Median :0.7450   Median :3.121   Median : 0.0087266
##  Mean   :0.6895   Mean   :3.134   Mean   : 0.0002379
##  3rd Qu.:0.9530   3rd Qu.:4.679   3rd Qu.: 0.2548181
##  Max.   :1.0000   Max.   :6.283   Max.   : 1.0419616

# Obtain the data from the 23rd solar cycle
sunspots_23 <- subset(sunspots_births, cycle == 23)

# Transform to Cartesian coordinates
X <- cbind(cos(sunspots_23$phi) * cos(sunspots_23$theta),
           cos(sunspots_23$phi) * sin(sunspots_23$theta),
           sin(sunspots_23$phi))

# Plot data
n_cols <- 100
rgl::plot3d(0, 0, 0, xlim = c(-1, 1), ylim = c(-1, 1), zlim = c(-1, 1),
            radius = 1, type = "s", col = "lightblue", alpha = 0.25,
            lit = FALSE)
cuts <- cut(x = sunspots_23$date, include.lowest = TRUE,
            breaks = quantile(sunspots_23$date,
                              probs = seq(0, 1, l = n_cols + 1)))
rgl::points3d(X, col = viridisLite::viridis(n_cols)[cuts])
# Spörer's law: sunspots at the beginning of the solar cycle (dark blue
# color) tend to appear at higher latitudes, gradually decreasing to the
# equator as the solar cycle advances (yellow color)

Exercises

Chapter 2

• Exercise 2.19 (solved in exercise_2_19.R). Consider the sunspots_births dataset.
  a. Compute and plot the kernel density estimator for phi using the DPI selector. Describe the result.
  b. Compute and plot the kernel density derivative estimator for phi using the adequate DPI selector. Using a horizontal line at y = 0, determine approximately the location of the main mode(s).
  c. Compute the log-transformed kernel density estimator with adj.positive = 1 for total_area using the NS selector.
  d. Draw the histogram of M = 10^4 samples simulated from the kernel density estimator obtained in a.
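Part a. can be sketched with the ks package, whose hpi function implements the DPI bandwidth selector (object names below are illustrative):

```r
# Sketch for a., assuming the rotasym and ks packages are installed
data("sunspots_births", package = "rotasym")
phi <- sunspots_births$phi

# DPI bandwidth and kernel density estimator
h_dpi <- ks::hpi(phi)
kde_phi <- ks::kde(x = phi, h = h_dpi)
plot(kde_phi, xlab = "phi")
```

The plot method of ks draws the estimated density curve, from which the main modes can be read off.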

• Exercise 2.20 (solved in exercise_2_20.R). Consider the MNIST dataset.
  a. Compute the average gray level, av_gray_one, for each image of the digit "1".
  b. Compute and plot the kde of av_gray_one, taking into account that it is a positive variable.
  c. Overlay the lognormal distribution density, with parameters estimated by maximum likelihood (use MASS::fitdistr).
  d. Repeat c. for the Weibull density.
  e. Which parametric model seems more adequate?
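A sketch of parts a. and b., assuming MNIST has been loaded as in the "Datasets" section (ks's positive = TRUE argument performs the log-transformed kde appropriate for positive data):

```r
# a. Average gray level for each image of the digit "1"
# (rows of MNIST$x are vectorised images)
av_gray_one <- rowMeans(MNIST$x[MNIST$labels == 1, ])

# b. kde of a positive variable via the log-transformed kde in ks
kde_av <- ks::kde(x = av_gray_one, positive = TRUE)
plot(kde_av, xlab = "av_gray_one")
```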

Chapter 3

• Exercise 3.25 (solved in exercise_3_25.R). Load the ovals.RData file.
  a. Split the dataset into the training sample, comprised of the first 2,000 observations, and the test sample (rest of the sample). Plot the dataset with colors for its classes. What can you say about the classification problem?
  b. Using the training sample, compute the plug-in bandwidth matrices for all the classes.
  c. Use these plug-in bandwidths to perform kernel discriminant analysis.
  d. Plot the contours of the kernel density estimator of each class and the class partitions. Use coherent colors between contours and points.
  e. Predict the class for the test sample and compare with the true classes. Then report the successful classification rate.
  f. Compare the successful classification rate with the one given by LDA. Is it better than kernel discriminant analysis?
  g. Repeat f. with QDA.
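Parts b., c., and e. can be sketched with ks: Hkda computes per-class plug-in bandwidth matrices and kda performs the discriminant analysis. The two-class data below is a stand-in for the ovals.RData split, and all object names are illustrative:

```r
# Illustrative data standing in for the training/test split of a.
set.seed(1)
train_x <- rbind(matrix(rnorm(200), ncol = 2),
                 matrix(rnorm(200, mean = 3), ncol = 2))
train_lab <- rep(c(1, 2), each = 100)
test_x <- rbind(matrix(rnorm(40), ncol = 2),
                matrix(rnorm(40, mean = 3), ncol = 2))
test_lab <- rep(c(1, 2), each = 20)

# b. Plug-in bandwidth matrices, one per class
H <- ks::Hkda(x = train_x, x.group = train_lab, bw = "plugin")

# c. Kernel discriminant analysis with those bandwidths
kda_fit <- ks::kda(x = train_x, x.group = train_lab, Hs = H)

# e. Successful classification rate on the test sample
pred <- predict(kda_fit, x = test_x)
mean(as.character(pred) == as.character(test_lab))
```

MASS::lda and MASS::qda give the LDA and QDA competitors of f. and g. through the same fit-then-predict pattern.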

• Exercise 3.26 (solved in exercise_3_26.R). Consider the MNIST dataset. Classify the digit images, via the t-SNE scores, into the digit labels:
  a. Split the dataset into the training sample, comprised of the first 50,000 t-SNE scores and their associated labels, and the test sample (rest of the sample).
  b. Using the training sample, compute the plug-in bandwidth matrices for all the classes.
  c. Use these plug-in bandwidths to perform kernel discriminant analysis.
  d. Plot the contours of the kernel density estimator of each class and overlay the t-SNE scores as points. Use coherent colors between contours and points, and add a legend.
  e. Obtain the successful classification rate of the kernel discriminant analysis.

• Exercise 3.27 (solved in exercise_3_27.R). Consider the MNIST dataset. Investigate the different ways of writing the digit "3" using kernel mean shift clustering. Follow the next steps:
  a. Consider the first 2,000 t-SNE scores (for the sake of computational expediency) of the class "3".
  b. Compute the normal scale bandwidth matrix that is appropriate for performing kernel mean shift clustering with the first 2,000 t-SNE scores of the class "3".
  c. Do kernel mean shift clustering on that subset using the previously obtained bandwidth and obtain the modes ξ_j, j = 1, ..., M, that cluster the t-SNE scores.
  d. Determine the M images that have the closest t-SNE scores, using the Euclidean distance, to the modes ξ_j.
  e. Show the closest images associated to the modes. Do they represent different forms of drawing the digit "3"?
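A sketch of b.–d. with ks, assuming MNIST has been loaded as in the "Datasets" section: Hns with deriv.order = 1 gives the normal-scale bandwidth for the density gradient (which is what mean shift ascends), and kms performs the clustering.

```r
# a. First 2,000 t-SNE scores of the class "3"
scores_3 <- MNIST$y_tsne[MNIST$labels == 3, ][1:2000, ]

# b. Normal-scale bandwidth for the density gradient (deriv.order = 1)
H <- ks::Hns(x = scores_3, deriv.order = 1)

# c. Kernel mean shift clustering; the modes are in kms_fit$mode
kms_fit <- ks::kms(x = scores_3, H = H)

# d. Index of the image whose t-SNE score is closest to each mode
closest <- apply(kms_fit$mode, 1, function(m)
  which.min(colSums((t(scores_3) - m)^2)))
```

The indices in closest can then be fed to show_digit (applied to the corresponding rows of MNIST$x) to display the images of e.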

Chapter 4

• Exercise 4.14 (solved in exercise_4_14.R).
  a. Code your own function that computes the local cubic estimator. The function must take as input the vector of evaluation points x, the sample data, the bandwidth h, and the kernel K. The result must be a vector of the same length as x containing the estimator evaluated at x.
  b. Test the implementation by estimating the regression function in the location model Y = m(X) + ε, where m(x) = (x − 1)^2, X ∼ N(1, 1), and ε ∼ N(0, 0.5). Do it from a sample of size n = 500.
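One possible base-R implementation of a. and b. (the interface follows the statement; reading N(0, 0.5) as variance 0.5 is an assumption):

```r
# a. Local cubic estimator: at each evaluation point x0, fit a weighted
# cubic polynomial by least squares; the estimate of m(x0) is beta_0
loc_cubic <- function(x, X, Y, h, K = dnorm) {
  sapply(x, function(x0) {
    d <- X - x0
    W <- K(d / h)               # kernel weights
    B <- cbind(1, d, d^2, d^3)  # local design matrix
    beta <- solve(crossprod(B, W * B), crossprod(B, W * Y))
    beta[1]                     # intercept = local cubic fit at x0
  })
}

# b. Test in the location model Y = (X - 1)^2 + eps
set.seed(42)
n <- 500
X <- rnorm(n, mean = 1, sd = 1)
eps <- rnorm(n, sd = sqrt(0.5))  # N(0, 0.5) taken as variance 0.5
Y <- (X - 1)^2 + eps
x_grid <- seq(-1, 3, length.out = 200)
m_hat <- loc_cubic(x = x_grid, X = X, Y = Y, h = 0.5)
plot(X, Y, col = "gray")
lines(x_grid, m_hat, col = 2, lwd = 2)     # estimate
curve((x - 1)^2, add = TRUE, col = 4, lwd = 2)  # true m
```

Since the local polynomial has degree 3 and the true regression function is quadratic, the estimator is essentially unbiased here and the red and blue curves should nearly overlap.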

• Exercise 4.15 (solved in exercise_4_15.R). Consider the sunspots_births dataset. Then:
  a. Filter the dataset to account only for the 23rd cycle.
  b. Inspect the graphical relation between dist_sun_disc (response) and log10(total_area) (predictor).
  c. Compute the CV bandwidth for the above regression, for the local constant estimator.
  d. Compute and plot the local constant estimator using CV bandwidth.
  e. Repeat c. and d. for the local linear estimator.
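One way to sketch c. and d. without extra packages is a hand-rolled local constant (Nadaraya-Watson) estimator with a least-squares cross-validation bandwidth; the leave-one-out shortcut below is standard for linear smoothers, and all names are illustrative:

```r
# Local constant (Nadaraya-Watson) estimator
nw <- function(x, X, Y, h, K = dnorm) {
  sapply(x, function(x0) {
    w <- K((X - x0) / h)
    sum(w * Y) / sum(w)
  })
}

# Least-squares CV criterion via the leave-one-out shortcut:
# CV(h) = mean(((Y_i - m_hat(X_i)) / (1 - W_ii))^2),
# with W_ii = K(0) / sum_j K((X_i - X_j) / h)
cv_nw <- function(h, X, Y, K = dnorm) {
  m_hat <- nw(X, X, Y, h, K)
  W_ii <- K(0) / sapply(X, function(x0) sum(K((X - x0) / h)))
  mean(((Y - m_hat) / (1 - W_ii))^2)
}

# c.-d. applied to the 23rd cycle (sunspots_23 as obtained above;
# observations with total_area == 0 would need dropping before log10):
# pred <- log10(sunspots_23$total_area)
# resp <- sunspots_23$dist_sun_disc
# h_cv <- optimize(cv_nw, interval = c(0.01, 2), X = pred, Y = resp)$minimum
# x_grid <- seq(min(pred), max(pred), length.out = 200)
# plot(pred, resp, col = "gray")
# lines(x_grid, nw(x_grid, pred, resp, h_cv), col = 2, lwd = 2)
```

For e., replacing nw by a local linear fit (weighted least squares with design matrix cbind(1, X - x0), as in Exercise 4.14 but with degree 1) and reusing the same CV machinery is one option.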


