DSCI550: Data Science at Scale

Homework 3, Spring 2024

SHOW EACH STEP OF COMPUTATION.

1. (25 pts) (Decision Tree) Using the following training dataset, construct a decision tree using Information Gain and Entropy as discussed in class. Use attributes V1, V2, V3, and V4 to predict C.
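The entropy and information-gain computations behind each split can be sketched as follows. Since the assignment's training table is not reproduced above, the `rows` and `labels` below are made-up stand-ins with only two attributes; substitute the real V1–V4 columns and class C.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    """Entropy reduction from splitting on the attribute at attr_index."""
    n = len(labels)
    # Partition the labels by the attribute's value.
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical stand-in rows (two attributes) with class C (assumption,
# not the assignment's data).
rows = [(1, 0), (1, 1), (0, 0), (0, 1)]
labels = ['yes', 'yes', 'no', 'no']
print(information_gain(rows, labels, 0))  # attribute 0 separates the classes perfectly -> 1.0
print(information_gain(rows, labels, 1))  # attribute 1 carries no information -> 0.0
```

At each node you pick the attribute with the largest gain, split, and recurse on the partitions until the labels are pure.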

2. (20 pts) (Naïve Bayes Classifier) We have data on 1000 patients, each diagnosed as Flu, Allergy, or Other Disease based on three symptoms, as shown. This is our training set; we will use it to predict the diagnosis of any new patient we encounter.

A new patient presents with “High fever, No Sneezing, and Runny Nose”. Is this Flu, Allergy, or Other? Use the Naïve Bayes classifier.
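The computation can be sketched as below: for each class, multiply the prior P(class) by the conditional probabilities P(symptom | class), all estimated by counting, and pick the class with the largest product. The 1000-patient table is not reproduced above, so the five records here are made-up stand-ins; replace them with the real counts.

```python
from collections import Counter

# Made-up records standing in for the assignment's 1000-patient table (assumption).
# Each record: ((fever, sneezing, runny_nose), diagnosis)
data = [
    (('high', 'no', 'yes'), 'Flu'),
    (('high', 'no', 'yes'), 'Flu'),
    (('no', 'yes', 'yes'), 'Allergy'),
    (('no', 'yes', 'no'), 'Allergy'),
    (('no', 'no', 'no'), 'Other'),
]

def naive_bayes(data, symptoms):
    """Return the diagnosis maximizing P(class) * prod_i P(symptom_i | class)."""
    class_counts = Counter(d for _, d in data)
    n = len(data)
    scores = {}
    for cls, count in class_counts.items():
        score = count / n                       # prior P(class)
        for i, value in enumerate(symptoms):
            match = sum(1 for s, d in data if d == cls and s[i] == value)
            score *= match / count              # likelihood P(symptom_i | class)
        scores[cls] = score
    return max(scores, key=scores.get)

print(naive_bayes(data, ('high', 'no', 'yes')))  # -> Flu (on this toy data)
```

In a hand computation you would show each prior and each conditional probability explicitly before multiplying.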

3. (15 pts) (Regression) A company is investigating the relationship between its advertising expenditures and the sales of its products. The following data represent a sample of 10 products. Note that AD = advertising dollars in thousands ($K) and S = sales in thousands of dollars.

1) (5 pts) Find the equation of the regression line, using Advertising dollars as the independent variable and Sales as the response variable.

2) (3 pts) Plot the scatter diagram and the regression line.

3) (5 pts) Find r² and interpret it in the context of the problem.

4) (2 pts) Use the line to predict the Sales if Advertising dollars = $50 K.

4. (20 pts) (Hierarchical Clustering) Five, two-dimensional data points are shown below with their distance matrix, i.e., the symmetric matrix that gives the pairwise distance between any two points.

Use the distance matrix to perform the following two types of hierarchical clustering: MIN (single-link) and MAX (complete-link) distance. Show your results by drawing a dendrogram. Note: the dendrogram should clearly show the order in which the points are merged.
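The agglomerative procedure can be sketched as follows: repeatedly merge the two closest clusters, where "closest" uses the minimum pairwise distance for MIN (single link) and the maximum for MAX (complete link). The merge order returned is exactly what the dendrogram records. The 5-point distance matrix is not reproduced above, so the 3×3 matrix in the test is a made-up stand-in.

```python
def hierarchical_cluster(dist, linkage="min"):
    """Agglomerative clustering from a symmetric distance matrix.
    linkage='min' (single link) or 'max' (complete link).
    Returns the merge order as a list of (cluster_a, cluster_b, distance)."""
    clusters = [frozenset([i]) for i in range(len(dist))]
    merges = []
    agg = min if linkage == "min" else max
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Cluster-to-cluster distance under the chosen linkage.
                d = agg(dist[p][q] for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((set(clusters[i]), set(clusters[j]), d))
        clusters = ([c for k, c in enumerate(clusters) if k not in (i, j)]
                    + [clusters[i] | clusters[j]])
    return merges
```

Each tuple in the result gives one horizontal bar of the dendrogram: the two clusters joined and the height (distance) at which they merge.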

5. (20 pts) k-Means Clustering: For the following six points,

1) Use the k-means algorithm to show the final clustering result assuming initially we assign A1, A6 as the center of each cluster, respectively.

2) Use the k-means algorithm to show the final clustering result assuming initially we assign A3, A4 as the center of each cluster, respectively.

3) Compute the quality of the k-means clustering using the Sum of Squared Errors (SSE), a cohesion measure of how close the data points in a cluster are to the cluster centroid. Given a set of observations (x1, x2, …, xn), where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k (≤ n) sets S = {S1, S2, …, Sk} so as to minimize the intra-cluster sum of squares:

SSE = Σᵢ₌₁ᵏ Σ_{x ∈ Sᵢ} ‖x − μᵢ‖²

where μᵢ is the mean of the points in Sᵢ.

Based on the SSE values of 1) and 2), which clustering is better?
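The full procedure for parts 1)–3) can be sketched as below: assign each point to its nearest center, recompute centroids, repeat until the centroids stop moving, then sum the squared distances to get the SSE. The assignment's six points A1–A6 are not reproduced above, so the coordinates here are made-up stand-ins; rerun with each pair of initial centers and compare the two SSE values (lower is better).

```python
def kmeans(points, centers):
    """Plain k-means; returns (clusters, final centers, sse)."""
    while True:
        # Assignment step: each point goes to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)),
                    key=lambda j: sum((pc - cc) ** 2
                                      for pc, cc in zip(p, centers[j])))
            clusters[i].append(p)
        # Update step: move each center to its cluster's mean.
        new = [tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[i]
               for i, cl in enumerate(clusters)]
        if new == centers:          # converged
            break
        centers = new
    sse = sum(sum((pc - cc) ** 2 for pc, cc in zip(p, centers[i]))
              for i, cl in enumerate(clusters) for p in cl)
    return clusters, centers, sse

# Hypothetical points standing in for A1..A6 (assumption).
pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
_, centers, sse = kmeans(pts, [(1, 1), (8, 8)])   # seed with two of the points
print(sse)  # the run with the lower SSE is the better clustering
```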






