Assignment 2 - R Text Analysis
Due 15/04 - Before Class
How to submit your assignment
1. Download the R workspace “newspapers” and open it in RStudio
2. Read the instructions and complete Questions 1 to 4
3. Save your R script as “yourlastname studentID assignment2.R”
Part I. The Ideological Bias of Newspapers
Text analysis gives researchers a powerful set of tools for extracting general information
from a large body of documents.
This exercise is based on Gentzkow, M. and Shapiro, J. M. 2010. “What Drives Media
Slant? Evidence From U.S. Daily Newspapers.” Econometrica, 78(1): 35-71.
We will analyze data from newspapers across the country to see what topics they
cover and how those topics are related to their ideological bias. The authors computed
a measure of a newspaper’s “slant” by comparing its language to speeches made by
Democrats and Republicans in the U.S. Congress.
You will use three data sources for this analysis. The first, ‘dtm’, is a document-term
matrix with one row per newspaper, containing the 1,000 phrases – stemmed and
processed – that do the best job of identifying the speaker as a Republican or a Democrat.
For example, “living in poverty” is a phrase most frequently spoken by Democrats, while
“global war on terror” is a phrase most frequently spoken by Republicans; a phrase like
“exchange rate” would not be included in this dataset, as it is used often by members of
both parties and is thus a poor indicator of ideology.
The second object, ‘papers’, contains some data on the newspapers on which ‘dtm’
is based. The row names in ‘dtm’ correspond to the ‘newsid’ variable in ‘papers’. The
variables are:
Name         Description
‘newsid’     The newspaper ID
‘paper’      The newspaper name
‘city’       The city in which the newspaper is based
‘state’      The state in which the newspaper is based
‘district’   Congressional district where the newspaper is based (data for Texas only)
‘nslant’     The “ideological slant” (lower numbers mean more Democratic)
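Because the row names of ‘dtm’ are newspaper IDs, it can help to line up the rows of ‘papers’ with the rows of ‘dtm’ before combining the two. The sketch below shows one way to do that with `match()`. It uses small made-up stand-ins for both objects (a plain matrix instead of the real document-term matrix, and invented IDs, names and slant values), so treat it as an illustration, not as the contents of the workspace.

```r
# Toy stand-ins for the workspace objects; every value here is invented.
dtm <- matrix(c(2, 0, 1,
                0, 3, 1),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("102", "101"),
                              c("phrase_a", "phrase_b", "phrase_c")))

papers <- data.frame(newsid = c(101, 102),
                     paper  = c("Paper One", "Paper Two"),
                     state  = c("NJ", "TX"),
                     nslant = c(0.41, 0.47),
                     stringsAsFactors = FALSE)

# For each row of 'dtm', find the matching row of 'papers'.
# An NA here would flag a newspaper ID with no metadata.
idx <- match(rownames(dtm), as.character(papers$newsid))
stopifnot(!anyNA(idx))

# Metadata reordered so that row i describes row i of 'dtm'.
papers_aligned <- papers[idx, ]
print(papers_aligned$paper)
```

With the rows aligned, a logical index built from ‘papers’ (for example, a state filter) can be applied directly to the rows of ‘dtm’.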
The third object, ‘cong’, contains data on members of Congress based on their political
speech, which we will compare to the ideological slant of newspapers from the areas that
these legislators represent. The variables are:
Name         Description
‘legname’    Legislator’s name
‘state’      Legislator’s state
‘district’   Legislator’s Congressional district
‘chamber’    Chamber in which legislator serves (House or Senate)
‘party’      Legislator’s party
‘cslant’     Ideological slant based on legislator’s speech (lower numbers mean more Democratic)
Question 1
We will first focus on the slant of newspapers, which the authors define as the
tendency to use language that would sway readers to the political left or right.
Load the data and plot the distribution of ‘nslant’ in the ‘papers’ data frame, with a
vertical line at the median. Which newspaper in the country has the largest left-wing
slant? Which has the largest right-wing slant?
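As a rough illustration of the mechanics, the sketch below draws a histogram with a median line and looks up the extreme rows using base R. The data frame is a made-up five-row stand-in for ‘papers’ (names and values are invented); with the real workspace loaded, the same calls would be applied to the actual ‘papers’ object.

```r
# Hypothetical toy data; the real 'papers' comes from the workspace.
papers <- data.frame(
  paper  = c("Paper A", "Paper B", "Paper C", "Paper D", "Paper E"),
  nslant = c(0.38, 0.45, 0.41, 0.52, 0.47),
  stringsAsFactors = FALSE
)

med <- median(papers$nslant)

# Histogram of slant with a dashed vertical line at the median.
hist(papers$nslant,
     main = "Distribution of newspaper slant",
     xlab = "nslant (lower = more Democratic)")
abline(v = med, lty = 2, lwd = 2)

# Lower nslant means more Democratic, so the minimum is the most
# left-leaning paper and the maximum the most right-leaning one.
most_left  <- papers$paper[which.min(papers$nslant)]
most_right <- papers$paper[which.max(papers$nslant)]
print(c(left = most_left, right = most_right))
```

`which.min()` and `which.max()` return only the first extreme row, so ties would need a separate check.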
Question 2
We will now explore the relationship between the political slant of newspapers and
the language used by members of Congress.
Using the dataset ‘cong’, compute average slant by state separately for the House
and Senate. Now use ‘papers’ to compute the average newspaper slant by state.
Make two plots with Congressional slant on the x-axis and newspaper slant on the
y-axis – one for the House, one for the Senate. Include a best-fit line in each plot –
a red one for the Senate and a green one for the House. Label your axes, title your
plots, and make sure the axes are the same for comparability. Can you conclude
that newspapers are influenced by the political language of elected officials? How
else can you interpret the results?
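One possible way to organize these steps in base R is sketched below: `aggregate()` for the state averages, `merge()` to join them, and a small helper that draws each scatter plot with a fitted line on shared axis limits. The data are invented, the helper name `plot_chamber` is hypothetical, and the chamber labels "House" and "Senate" are an assumption about how ‘chamber’ is coded in the real ‘cong’ object.

```r
# Toy stand-ins; all values are made up, and the chamber coding is assumed.
cong <- data.frame(
  state   = c("NJ", "NJ", "NJ", "TX", "TX", "TX", "OH", "OH", "OH"),
  chamber = c("House", "House", "Senate", "House", "House", "Senate",
              "House", "Senate", "Senate"),
  cslant  = c(0.40, 0.44, 0.42, 0.52, 0.56, 0.55, 0.47, 0.45, 0.49),
  stringsAsFactors = FALSE
)
papers <- data.frame(
  state  = c("NJ", "NJ", "TX", "OH", "OH"),
  nslant = c(0.42, 0.44, 0.50, 0.45, 0.47),
  stringsAsFactors = FALSE
)

# Average slant by state and chamber, and average newspaper slant by state.
cong_avg  <- aggregate(cslant ~ state + chamber, data = cong, FUN = mean)
paper_avg <- aggregate(nslant ~ state, data = papers, FUN = mean)

# One row per state-chamber pair, with both averages side by side.
both <- merge(cong_avg, paper_avg, by = "state")

# Shared axis limits so the two panels are directly comparable.
xlim <- range(both$cslant)
ylim <- range(both$nslant)

# Hypothetical helper: scatter plot plus least-squares line for one chamber.
plot_chamber <- function(chamber_name, line_colour) {
  d <- both[both$chamber == chamber_name, ]
  plot(d$cslant, d$nslant, xlim = xlim, ylim = ylim,
       xlab = "Average Congressional slant",
       ylab = "Average newspaper slant",
       main = paste("Newspaper vs.", chamber_name, "slant by state"))
  abline(lm(nslant ~ cslant, data = d), col = line_colour)
}

par(mfrow = c(1, 2))
plot_chamber("House", "green")
plot_chamber("Senate", "red")
```

`merge()` keeps only states present in both tables, so states with no newspapers in the data drop out of the plots.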
Question 3
Identify the most important terms for capturing regional variation in what is considered
newsworthy – the terms that appear frequently in some documents, but not across
all documents. To do so, compute the *term frequency-inverse document frequency
(tf-idf)* for each phrase and newspaper combination in the dataset (for this, use the
‘tm’ package and the ‘dtm’ object originally provided).
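To make the weighting concrete, the sketch below computes one common variant of tf-idf by hand on a tiny invented count matrix: term frequency as a phrase's share of a newspaper's counts, multiplied by the base-2 log of the number of newspapers divided by the number of newspapers using the phrase. In the ‘tm’ package, `weightTfIdf()` applies a weighting of this kind directly to a document-term matrix. The manual version here is only meant to show what the numbers represent, and the counts and names are made up.

```r
# Toy count matrix: rows are newspapers, columns are phrases (values invented).
counts <- matrix(c(4, 0, 1,
                   0, 2, 1,
                   0, 0, 2),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("paper1", "paper2", "paper3"),
                                 c("phrase_a", "phrase_b", "phrase_c")))

# Term frequency: each phrase's share of the counts within a newspaper.
tf <- counts / rowSums(counts)

# Document frequency: number of newspapers in which each phrase appears.
doc_freq <- colSums(counts > 0)

# Inverse document frequency: phrases used by every newspaper get weight 0.
idf <- log2(nrow(counts) / doc_freq)

# tf-idf: scale each column of tf by that phrase's idf.
tfidf <- sweep(tf, 2, idf, `*`)
print(round(tfidf, 3))
```

In this toy example "phrase_c" appears in every newspaper, so its tf-idf is zero everywhere, while "phrase_a" scores highly for the one newspaper that uses it heavily.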
Question 4
Cluster all the newspapers from New Jersey on their tf-idf measure. Apply the k-means
algorithm with 3 clusters. Summarize the results by printing out the ten most
important terms at the centroid of each of the resulting clusters, and show which
newspapers belong to each cluster. What topics does New Jersey care about?
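The general pattern (subset rows, run `kmeans()`, then sort each centroid to find its heaviest terms) can be sketched on toy data as follows. The matrix, newspaper names and state labels are invented, and the code "NJ" is an assumption about how ‘state’ is coded. A weighted document-term matrix from ‘tm’ would also need converting with `as.matrix()` first. The toy example prints the top two terms per cluster only because it has four terms in total.

```r
set.seed(1)  # k-means starts from random centers, so fix the seed

# Toy tf-idf matrix (values invented): 7 newspapers, 4 phrases.
tfidf <- matrix(c(0.9, 0.1, 0.0, 0.0,
                  0.8, 0.2, 0.0, 0.0,
                  0.0, 0.1, 0.9, 0.0,
                  0.0, 0.2, 0.8, 0.0,
                  0.0, 0.0, 0.1, 0.9,
                  0.0, 0.0, 0.2, 0.8,
                  0.5, 0.5, 0.0, 0.0),
                ncol = 4, byrow = TRUE,
                dimnames = list(paste0("paper", 1:7),
                                c("term_a", "term_b", "term_c", "term_d")))
state <- c("NJ", "NJ", "NJ", "NJ", "NJ", "NJ", "NY")  # assumed state coding

# Keep only the New Jersey rows.
nj <- tfidf[state == "NJ", ]

# Three clusters; several random starts reduce the risk of a poor solution.
km <- kmeans(nj, centers = 3, nstart = 25)

n_top <- 2  # the assignment asks for ten; the toy data only has four terms
for (k in 1:3) {
  centre <- sort(km$centers[k, ], decreasing = TRUE)
  cat("Cluster", k, "\n")
  cat("  top terms:", paste(head(names(centre), n_top), collapse = ", "), "\n")
  cat("  members:  ", paste(rownames(nj)[km$cluster == k], collapse = ", "), "\n")
}
```

Cluster numbers are arbitrary labels and can change between runs, so interpretation should rest on the terms and members rather than on the cluster index.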