Date: 2019-04-11 10:55

Assignment 2 - R Text Analysis

Due 15/04 - Before Class

How to submit your assignment

1. Download the R workspace “newspapers” and open it in RStudio

2. Read the instructions and finish Questions 1 to 4

3. Save your R-script to “yourlastname studentID assignment2.R”

Part I. The Ideological Bias of Newspapers

Text analysis gives researchers a powerful set of tools for extracting general information from a large body of documents.

This exercise is based on Gentzkow, M. and Shapiro, J. M. 2010. “What Drives Media Slant? Evidence From U.S. Daily Newspapers.” Econometrica, 78(1): 35-71.

We will analyze data from newspapers across the country to see what topics they cover and how those topics are related to their ideological bias. The authors computed a measure of a newspaper’s “slant” by comparing its language to speeches made by Democrats and Republicans in the U.S. Congress.

You will use three data sources for this analysis. The first, ‘dtm’, is a document-term matrix with one row per newspaper, containing the 1000 phrases (stemmed and processed) that do the best job of identifying the speaker as a Republican or a Democrat. For example, “living in poverty” is a phrase most frequently spoken by Democrats, while “global war on terror” is a phrase most frequently spoken by Republicans; a phrase like “exchange rate” would not be included in this dataset, as it is used often by members of both parties and is thus a poor indicator of ideology.

The second object, ‘papers’, contains some data on the newspapers on which ‘dtm’ is based. The row names in ‘dtm’ correspond to the ‘newsid’ variable in ‘papers’. The variables are:

Name          Description
‘newsid’      The newspaper ID
‘paper’       The newspaper name
‘city’        The city in which the newspaper is based
‘state’       The state in which the newspaper is based
‘district’    Congressional district where the newspaper is based (data for Texas only)
‘nslant’      The “ideological slant” (lower numbers mean more Democratic)
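Because the row names of ‘dtm’ correspond to ‘newsid’, the two objects can be lined up row by row before any joint analysis. The following is a minimal sketch on made-up toy data; it assumes ‘dtm’ can be treated like an ordinary matrix with row names, and every value and phrase in it is hypothetical.

```r
# Toy stand-ins for the real objects (all values are invented for illustration).
papers <- data.frame(
  newsid = c("n3", "n1", "n2"),
  paper  = c("Paper C", "Paper A", "Paper B"),
  state  = c("NJ", "TX", "NJ"),
  nslant = c(0.42, 0.51, 0.39),
  stringsAsFactors = FALSE
)
dtm <- matrix(
  c(2, 0, 1,
    0, 3, 1,
    1, 1, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("n1", "n2", "n3"),
                  c("phrase_a", "phrase_b", "phrase_c"))
)

# Look up each row name of dtm in papers$newsid.
idx <- match(rownames(dtm), papers$newsid)

# Reorder papers so that row i describes the same newspaper as row i of dtm.
papers_aligned <- papers[idx, ]
```

After this step, logical conditions built from `papers_aligned` (for example on `state`) can be used directly to select rows of `dtm`.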

The third object, ‘cong’, contains data on members of Congress based on their political speech, which we will compare to the ideological slant of newspapers from the areas that these legislators represent. The variables are:

Name          Description
‘legname’     Legislator’s name
‘state’       Legislator’s state
‘district’    Legislator’s Congressional district
‘chamber’     Chamber in which legislator serves (House or Senate)
‘party’       Legislator’s party
‘cslant’      Ideological slant based on legislator’s speech (lower numbers mean more Democratic)

Question 1

We will first focus on the slant of newspapers, which the authors define as the tendency to use language that would sway readers to the political left or right.

Load the data and plot the distribution of ‘nslant’ in the ‘papers’ data frame, with a vertical line at the median. Which newspaper in the country has the largest left-wing slant? What about right?
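One way to approach the plotting step is sketched below with base R graphics. The data frame is a small invented stand-in for ‘papers’ (in the assignment it would come from the loaded workspace), and the names `med`, `most_left` and `most_right` are only illustrative.

```r
# Toy data; the real 'papers' data frame comes from the workspace.
papers <- data.frame(
  paper  = c("Paper A", "Paper B", "Paper C", "Paper D", "Paper E"),
  nslant = c(0.51, 0.39, 0.42, 0.47, 0.44),
  stringsAsFactors = FALSE
)

med <- median(papers$nslant)

# Histogram of the slant measure with a dashed vertical line at the median.
hist(papers$nslant,
     main = "Distribution of newspaper slant",
     xlab = "nslant (lower = more Democratic)")
abline(v = med, lty = 2, lwd = 2)

# Lower nslant means more Democratic, so the minimum is the most left-leaning
# paper and the maximum is the most right-leaning one.
most_left  <- papers$paper[which.min(papers$nslant)]
most_right <- papers$paper[which.max(papers$nslant)]
```

Note that `which.min()` and `which.max()` return only the first position in case of ties, so ties in the real data would need a separate check.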

Question 2

We will now explore the relationship between the political slant of newspapers and the language used by members of Congress.

Using the dataset ‘cong’, compute average slant by state separately for the House and Senate. Now use ‘papers’ to compute the average newspaper slant by state. Make two plots with Congressional slant on the x-axis and newspaper slant on the y-axis: one for the House, one for the Senate. Include a best-fit line in each plot, red for the Senate and green for the House. Label your axes, title your plots, and make sure the axes are the same for comparability. Can you conclude that newspapers are influenced by the political language of elected officials? How else can you interpret the results?
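The aggregation and plotting steps might be sketched as follows. All data here are invented, and the chamber codes "H" and "S" are an assumption; the actual coding of ‘chamber’ in ‘cong’ should be checked first. The helper `plot_chamber()` is hypothetical and exists only for this illustration.

```r
# Toy stand-ins for 'cong' and 'papers' (values and chamber codes are assumed).
cong <- data.frame(
  state   = c("NJ", "NJ", "NJ", "TX", "TX", "TX", "OH", "OH", "OH"),
  chamber = c("H", "H", "S", "H", "H", "S", "H", "S", "S"),
  cslant  = c(0.40, 0.44, 0.41, 0.52, 0.50, 0.55, 0.46, 0.45, 0.49)
)
papers <- data.frame(
  state  = c("NJ", "NJ", "TX", "TX", "OH"),
  nslant = c(0.41, 0.43, 0.50, 0.48, 0.45)
)

# Average legislator slant by state and chamber; average newspaper slant by state.
cong_avg  <- aggregate(cslant ~ state + chamber, data = cong, FUN = mean)
paper_avg <- aggregate(nslant ~ state, data = papers, FUN = mean)

# One row per state and chamber, with both averages side by side.
both <- merge(cong_avg, paper_avg, by = "state")

# Shared axis limits so the two panels are directly comparable.
lims <- range(c(both$cslant, both$nslant))

plot_chamber <- function(code, label, colour) {
  d <- both[both$chamber == code, ]
  plot(d$cslant, d$nslant, xlim = lims, ylim = lims,
       xlab = "Average Congressional slant",
       ylab = "Average newspaper slant",
       main = paste("Newspaper vs.", label, "slant by state"))
  fit <- lm(nslant ~ cslant, data = d)
  abline(fit, col = colour)  # least-squares best-fit line
  invisible(fit)
}

par(mfrow = c(1, 2))
fit_house  <- plot_chamber("H", "House", "green")
fit_senate <- plot_chamber("S", "Senate", "red")
```

A positive slope in such a plot shows only that the two averages move together across states; it does not by itself indicate which way any influence runs.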

Question 3

Identify the most important terms for capturing regional variation in what is considered newsworthy: the terms that appear frequently in some documents, but not across all documents. To do so, compute the *term frequency-inverse document frequency (tf-idf)* for each phrase and newspaper combination in the dataset (for this, use the ‘tm’ package and the ‘dtm’ object originally provided).
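To make the quantity concrete, here is a minimal base-R sketch of one common tf-idf definition on an invented count matrix: term frequency is the count divided by the document's total count, and inverse document frequency is log2 of the number of documents over the number of documents containing the term. This is an illustration under those assumptions, not necessarily the exact weighting the ‘tm’ package applies.

```r
# Toy document-term counts: rows are documents, columns are terms (invented).
counts <- matrix(
  c(4, 0, 1,
    0, 3, 1,
    0, 0, 2),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("n1", "n2", "n3"), c("term_a", "term_b", "term_c"))
)

# Term frequency: each count divided by the total count in its document (row).
tf <- counts / rowSums(counts)

# Inverse document frequency: log2(number of documents / documents containing the term).
doc_freq <- colSums(counts > 0)
idf <- log2(nrow(counts) / doc_freq)

# Multiply each column of tf by the idf weight of its term.
tfidf <- sweep(tf, 2, idf, `*`)
```

Here `term_c` appears in every document, so its idf is zero and it drops out, while `term_a` and `term_b` each appear in a single document and receive high weight there. The ‘tm’ package provides `weightTfIdf()` for weighting a document-term matrix; its documentation describes the exact normalization it uses.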

Question 4

Cluster all the newspapers from New Jersey on their tf-idf measure. Apply the kmeans algorithm with 3 clusters. Summarize the results by printing out the ten most important terms at the centroid of each of the resulting clusters, and show which newspapers belong to each cluster. What topics does NJ care about?
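The clustering and summary steps could look roughly like the sketch below, which runs `kmeans()` from base R on an invented tf-idf matrix. In the assignment the input would be the tf-idf rows for New Jersey newspapers only (how the state is coded in ‘papers’ is an assumption to verify), and ten terms would be listed per centroid rather than the two shown here.

```r
set.seed(1)  # kmeans uses random starting centroids

# Toy tf-idf matrix: 6 newspapers x 4 terms, built to contain three clear groups.
tfidf <- matrix(
  c(0.9, 0.1, 0.0, 0.0,
    0.8, 0.2, 0.0, 0.0,
    0.0, 0.0, 0.9, 0.1,
    0.0, 0.1, 0.8, 0.0,
    0.0, 0.7, 0.0, 0.9,
    0.1, 0.8, 0.0, 0.8),
  nrow = 6, byrow = TRUE,
  dimnames = list(paste0("paper", 1:6),
                  c("term_a", "term_b", "term_c", "term_d"))
)

# Three clusters; several random restarts make the result more stable.
km <- kmeans(tfidf, centers = 3, nstart = 20)

# For each centroid, the terms with the largest weights (top 2 here).
n_top <- 2
top_terms <- apply(km$centers, 1, function(centre) {
  names(sort(centre, decreasing = TRUE))[seq_len(n_top)]
})

# Newspapers grouped by the cluster they were assigned to.
members <- split(rownames(tfidf), km$cluster)

print(top_terms)
print(members)
```

Cluster numbers are arbitrary labels and can change between runs, so the interpretation should rest on the top terms and the member newspapers, not on the cluster index.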
