Working with the FEC Political Contribution Data
options(encoding = "native.enc", replace.assign = TRUE, width = 90)
The Problem
I recently read a blog that announced the Forest of Advocacy website. In their words,This site is dedicated to providing visualizations gleaned from big data regarding politics. We will provide at least one new visualization every week from now to the election. So enjoy, and come back next Monday.The first visualization they did examined the political contraptions of individuals over time within particular organizations. From this, as they point out,
Over the decades, one sees lines sketched out, reflecting the partisanship of individuals over time. For each organization, we also provide the net contributions of the entire organization, and the names of biggest Democratic, Republican, and 'bipartisan' contributors (the individual with the highest product of Democratic and Republican contributions).“This made me ask the question what about the University of Minnesota? To which party do their employees contribute?
Obtaining the Data
As a data geek, I immediately went to the Federal Election Commission webpage and saw what I could download to answer this question. On their Downloadable Data Files page there were several files that I could download. As with most Federal government datasets, it is not intuitively obvious what data these files contain given their names. Unlike most datasets, these have detailed codebooks to accompany each dataset! (Props to you FEC.)I wanted the individual contributions data (Detailed Data File). The FEC has individual contribution data for every year back to 1999–2000 available for download. I downloaded the ind12.zip data (unzips to a 327.7 MB .txt file) which has 1,994,874 records of contribution(s) from an individual to a federal committee if the contribution was at least $200. I also grabbed the Committee Master File, the Candidate Master File, and the Candidate Committee Linkage File. These files contain information that can be merged with the contributions data to link the committee and candidate information with that of the contributor.
Importing the Data into R
Now comes the fun…getting this into R so that we can actually get the data we need to answer our question. Let's start with the individual contributions data. This is a text file with 21 variables that are delimited by the pipe (|
) character. The first line of data looks like this:C00492371|N|M2|P|11990223949|15|IND|BIRDWELL, BRAD A.|CYPRESS|TX|77429|BIRDWELL|
CONTRACTOR|01232011|500||SA17A.4153|714660|||4021720111136021419
The natural choice to read in such data is to use the read.table()
function. So I tried it.pc <- read.table(file = "~/Desktop/itcont.txt", header = FALSE, sep = "|")
## Error: line 37 did not have 21 elements
For some reason the function believed that line 37 didn't have the 21 variables that the previous lines did. What to do? You could open the file in a text editor and see if there is a problem with line 37. However, my thought was, if there is an error at line 37 that I have to fix and there are 1,994,874 records this could be an all-year process of finding and fixing problems.My solution was to use a little known function called
count.fields()
. This function counts the number of fields, as separated by sep=
, in each of the lines of the file to be read. Then I used the table()
function to figure out how many problematic lines there were.n <- count.fields(file = "~/Desktop/itcont.txt", sep = "|")
table(n)
## n
## 8 9 12 13 16 17 18 20 21
## 8486 290 16700 1673 23 943 14 685 1966060
It looks like there are several lines (n = 28,814) that do not seem to have 21 fields. The largest set seems to have only 12 field. Below I use the readLines()
function to read in the data line-by-line, and in the next line, I identify the lines that only seemed to have 12 fields. (Note: I use the head()
function to save space.)x <- readLines("~/Desktop/itcont.txt")
head(x[n == 12])
## [1] "C00063164|N|M2||11930387788|15|IND|GOMPELS, NINA|SAVANNAH|GA|31404|MCDONALD'S|LICENSEE|01312011|250||2A1FCCF1182FA65F0CC|714744|||4031620111137409329"
## [2] "C00063164|N|M2||11930387788|15|IND|GRUBER, GREG|SANDY|UT|84093|MCDONALD'S|LICENSEE|01312011|500||8C3DFE145D4B7356E88|714744|||4031620111137409330"
## [3] "C00063164|N|M2||11930387788|15|IND|GUERRERO, TILA|WATSONVILLE|CA|95076|MCDONALD'S|LICENSEE|01312011|1000||D5C6C704081D8AD1910|714744|||4031620111137409331"
## [4] "C00063164|N|M2||11930387779|15|IND|ABBATE, JAMES A.|MERCED|CA|95341|MCDONALD'S|LICENSEE|01312011|2000||0ECDDCDE3B4CF443D22|714744|||4031620111137409302"
## [5] "C00063164|N|M2||11930387779|15|IND|ADAMS, ALLISON|TAMPA|FL|33607|MCDONALD'S|LICENSEE|01312011|500||543088A9C13C4BE1A71|714744|||4031620111137409304"
## [6] "C00063164|N|M2||11930387780|15|IND|ARMSTRONG, DONALD D.|BEAVERTON|OR|97006|MCDONALD'S|LICENSEE|01312011|2600||8DBDD6D125EDFAB256C|714744|||4031620111137409307"
Each of these lines stopped reading 12 fields in. The problem with these lines is the apostrophe in the 12th field. R is treating this like a string, so everything after the apostrophe appears as a single field (there is no closing apostrophe). We can remedy this by using the quote=
argument to let R know that strings need to be in double-quotes rather than single quotes. One more issue is that we cannot write quote="""
, because R will usethe first two quotation marks. We need to escape the middle quote using the backslash (\
). Now we try again,pc <- read.table(file = "~/Desktop/itcont.txt", header = FALSE, sep = "|", quote = "\"")
## Error: line 1151 did not have 21 elements
Okay. This time we made it through 1,150 lines before the code broke. Again we use count.fields()
this time adding the quote=
argument.n <- count.fields(file = "~/Desktop/itcont.txt", sep = "|", quote = "\"")
table(n)
## n
## 8 9 12 13 20 21
## 13 1 439 18 73 1994330
Once again, the 12th field is giving problems. We again examine some of those fields.head(x[n == 12])
## [1] "C00030718|N|M3||11930496503|15|IND|DURANTE, KATHLEEN S|CHEYENNE|WY|82009|#1 PROPERTIES|REAL ESTATE BROKER|01262011|300||A09703181A2EF43EFA3B|717445|||4031820111137607355"
## [2] "C00030718|N|M3||11930496503|15|IND|BIGGS, CYNTHIA L MRS.|LARAMIE|WY|82070|#1 PROPERTIES|REAL ESTATE BROKER|01262011|300||A553A97B481D44B5EA16|717445|||4031820111137607357"
## [3] "C00401190|N|Q1|P|11930643549|15|IND|PEER, WILBUR T.|UPPER MARLBORO|MD|20774|MANAGEMENT CONSULTANTS INC #1|CONSULTANT|03312011|500||C4754843|722315|||4041320111138070572"
## [4] "C00089086|N|M2||11990218895|15|IND|MONTGOMERY, SHERI|EVANSTON|WY|82930|UNITA COUNTY SCHOOL DIST #4|OCCUPATIONAL THERAPIST|01192011|400||38594231|713984|||4031120111137183347"
## [5] "C00349746|N|M2||11990229557|15|IND|ANKER, ED|VISALIA|CA|93291|SHADY ACRES DAIRY #2|DAIRYMAN|01122011|261||SA11AI.28076|715177|||4031520111137265488"
## [6] "C00349746|N|M2||11990229568|15|IND|ROELOFFS, JOHNNY|TIPTON|CA|93272|J R DAIRY #2|DAIRYMAN|01122011|292||SA11AI.28030|715177|||4031520111137265522"
A-ha! The hash tag (pound sign) is causing problems. Remember that R treats the hash tag as a character to indicate a comment. Thus, everything on a line following a hash tag is commented out. We want the hash tag to be part of the string in that field. We can now add the argument comment.char=""
to the read.table()
function.pc <- read.table(file = "~/Desktop/itcont.txt", header = FALSE, sep = "|", quote = "\"", comment.char = "")
Great! It worked! Well…we should check that.nrow(pc)
## [1] 1994874
length(pc)
## [1] 21
Perfect. Now we can use the online codebook to add the variable names to our data.names(pc) <- c("CMTE_ID", "AMNDT_IND", "RPT_TP", "TRANSACTION_PGI", "IMAGE_NUM", "TRANSACTION_TP",
"ENTITY_TP", "NAME", "CITY", "STATE", "ZIP_CODE", "EMPLOYER", "OCCUPATION", "TRANSACTION_DT",
"TRANSACTION_AMT", "OTHER_ID", "TRAN_ID", "FILE_NUM", "MEMO_CD", "MEMO_TEXT", "SUB_ID")
Reading in the Other Files and Merging the Results
While we can breathe a big sigh of relief at getting the contribution data read in, it isn't enough. A quick scan of the variables suggest that this file does not include important information. For example, which political party the individual contributed to. Scanning the codebooks for the other files—the Committee Master File, the Candidate Master File, and the Candidate Committee Linkage File—we find that both the Committee Master File and the Candidate Master File have party affiliation data.How do you get the data from these files into our individual contributions data? To do this we use the
merge()
function. Before doing that, however, it is worth a quick diversion to talk about databases and tables.Databases
Databases are alternative ways to store data. They are particularly useful for large datasets like this one. The key to understanding databases is to remember that they are composed of tables. Tables are nothing more than rectangular arrays (they have rows and columns). For example, in R, a data frame is a table—it is made up of rows and columns. Most databases include multiple tables.We can think of each of the files from our political contribution files as tables.
File | Rows x Columns |
---|---|
Individual Contributions File | 1,994,874 x 21 |
Committee Master File | 13,093 x 15 |
Candidate Master File | 5,345 x 15 |
Linkage File | 5,954 x 7 |
CMTE_ID
is common to both tables. So, we can use that to link the two tables or files. Merging Data Frames
To examine how we will merge two data frames together, we first read in the Committee Master File and Candidate Master File.cm <- read.table(file = "~/Desktop/cm.txt", sep = "|", header = FALSE, quote = "\"", comment.char = "")
nrow(cm)
## [1] 13093
length(cm)
## [1] 15
names(cm) <- c("CMTE_ID", "CMTE_NM", "TRES_NM", "CMTE_ST1", "CMTE_ST2", "CMTE_CITY", "CMTE_ST",
"CMTE_ZIP", "CMTE_DSGN", "CMTE_TP", "CMTE_PTY_AFFILIATION", "CMTE_FILING_FREQ", "ORG_TP",
"CONNECTED_ORG_NM", "CAND_ID")
cn <- read.table(file = "~/Desktop/cn.txt", sep = "|", header = FALSE, quote = "\"", comment.char = "")
nrow(cn)
## [1] 5345
length(cn)
## [1] 15
names(cn) <- c("CAND_ID", "CAND_NAME", "CAND_PTY_AFFILIATION", "CAND_ELECTION_YR", "CAND_OFFICE_ST",
"CAND_OFFICE", "CAND_OFFICE_DISTRICT", "CAND_ICI", "CAND_STATUS", "CAND_PCC", "CAND_ST1",
"CAND_ST2", "CAND_CITY", "CAND_ST", "CAND_ZIP")
We then will use the merge()
function to join the two tables (data frames) together. This function takes as its initial two arguments the names of the objects that the data frames are assigned to, namely x=
and y=
(e.g., x=pc
and y=cm
). It is important to note that the data in the second argument (y
) will be added to the rows of data from the first argument (x
). After providing the names of the data frames, the argument by=
is used to give the names of the variable names that are common to both data frames. (If there is more than one, use the c()
function to bind them together.) We also use the all.x=TRUE
argument. This will keep any rows in x
that don't have a match in y
. If this argument is not included then rows in x
without matches in y
will be dropped.pccm <- merge(x = pc, y = cm, by = "CMTE_ID", all.x = TRUE)
nrow(pccm)
## [1] 1994874
length(pccm)
## [1] 35
The new data frame will have the same number of rows as the data frame in the x=
argument. The number of variables will be equal to the sum of the unique variables in both data frames and the number of common variables between the two. Note how the results are different if we reverse the order of the two data frames.cmpc <- merge(x = cm, y = pc, by = "CMTE_ID", all.x = TRUE)
nrow(cmpc)
## [1] 2001639
length(cmpc)
## [1] 35
The number of variables is the same, but the number of rows is different because this time the information from the pc
data frame was added to the rows in the cm
data. In SQL terms, this is called a left-inner join. If you are familiar with SQL, you might like to use the sqldf package.We can also join the candidate data to our
pccm
data.pccmcn <- merge(x = pccm, y = cn, by = "CAND_ID", all.x = TRUE)
nrow(pccmcn)
## [1] 1994874
length(pccmcn)
## [1] 49
Finding University of Minnesota Contributors with Grep
The data frame includes data from all contribotors, and I was interested in only the University of Minnesota employees. The variableEMPLOYER
contains information on the contributor's employer. This variable contains character (string) data. To find particular strings we can use the grep()
function. Grep was named after the Unix qed/ed editor subcommand “g/re/p”. In this subcommand, g stands for global, re stands for a regular expression, and p stands for print. In essence these subcommands would globally search some text for a given regular expression and print the lines containing matches. The
grep()
function takes as its first argument, the character string to search for and as its second argument the variable (or vector to search). Here we search for the string “UNIVERSITY OF MINNESOTA” in the EMPLOYER
variable and print the first fe results.head(grep("UNIVERSITY OF MINNESOTA", pccmcn$EMPLOYER))
## [1] 13529 14294 19110 21758 22403 23838
The function outputs the row number for any row in which the string “UNIVERSITY OF MINNESOTA” is found in the EMPLOYER
variable. We can use indexing to assign the rows of data that include this string in a new data frame.umn <- pccmcn[grep("UNIVERSITY OF MINNESOTA", pccmcn$EMPLOYER), ]
nrow(umn)
## [1] 336
length(umn)
## [1] 49
The output shows that 336 rows have the string “UNIVERSITY OF MINNESOTA” as the employer. Note also that the new data frame has retained all of the columns as our previously merged data frame. We can now examine the affilation of the political party that these emplyees contributed to.table(umn$CAND_PTY_AFFILIATION)
##
## AIP AMP CIT CON CRV CST DCG DEM DFL FED GRE GRN IAP IDP IND IP LIB N N/A NNE NOP
## 0 0 0 0 0 0 0 0 165 27 0 1 0 0 0 0 0 0 0 0 0 0
## NPA OTH PAF PRI REF REP RTL SUS SWP TEA TX UNK W WFP
## 0 0 0 0 0 42 0 0 0 0 0 0 0 0
table(umn$CMTE_PTY_AFFILIATION)
##
## 0 ACP AIP AMP CIT COM CON CRV CST DEM DFL FED FWP GOP GRE GRN HOS IND LIB MW NNE
## 78 0 0 0 0 0 0 0 0 0 150 63 0 0 0 1 0 0 0 0 0 0
## NON NPA OTH PAC PAF PRO REF REP SOC STA SUS SWP TEA UNK UPA UTP WF WFP WHG WNC
## 0 0 0 0 0 0 0 43 0 0 0 0 0 1 0 0 0 0 0 0
The only parties that University of Minnesota employees have contributed to are the Democratic, DFL, and Republican. We will append a variable to the umn
data frame called PARTY
that has two levels DEM/DFL
and REP
. umn$PARTY[umn$CAND_PTY_AFFILIATION == "DEM"] <- "DEM/DFL"
umn$PARTY[umn$CAND_PTY_AFFILIATION == "DFL"] <- "DEM/DFL"
umn$PARTY[umn$CMTE_PTY_AFFILIATION == "DEM"] <- "DEM/DFL"
umn$PARTY[umn$CMTE_PTY_AFFILIATION == "DFL"] <- "DEM/DFL"
umn$PARTY[umn$CAND_PTY_AFFILIATION == "REP"] <- "REP"
umn$PARTY[umn$CMTE_PTY_AFFILIATION == "REP"] <- "REP"
table(umn$PARTY)
##
## DEM/DFL REP
## 242 44
Plotting the Contributions with ggplot
First we load the ggplot2 library.library(ggplot2)
We plot the contributions conditional on party. (To learn more about the use of ggplot see the excellent online documentation.)ggplot(data = na.omit(umn), aes(x = PARTY, y = TRANSACTION_AMT, color = PARTY)) +
geom_jitter(alpha = 0.3, position = position_jitter(width = 0.1, height = 0)) +
scale_color_manual(name = "Affiliation", values = c("Black", "Red")) +
xlab("") +
ylab("Contribution") +
theme_bw()
We note that four observations are negative.
umn2 <- umn[umn$TRANSACTION_AMT > 0, ]
ggplot(data = na.omit(umn2), aes(x = PARTY, y = TRANSACTION_AMT, color = PARTY)) +
geom_boxplot() +
geom_jitter(alpha = 0.3, position = position_jitter(width = 0.1, height = 0)) +
stat_summary(fun.y = "mean", geom = "point", size = 4) +
scale_color_manual(name = "Affiliation", values = c("Black", "Red")) +
xlab("") +
ylab("Contribution") +
theme_bw()
Removing these, we observe that the political contributions seem to be comparable (medians are roughly equal) for the two sides of the political aisle. However, there is a great deal more variation in the Republican contributions.
--> Type MIME Type text/html Resource Type Document Location Full URL http://rstudio-pubs-static.s3.amazonaws.com/1689_9eee1ef2fbe948f6ba5901a17ca52308.html Scheme http Host rstudio-pubs-static.s3.amazonaws.com Path /1689_9eee1ef2fbe948f6ba5901a17ca52308.html Filename 1689_9eee1ef2fbe948f6ba5901a17ca52308.html
No comments:
Post a Comment