9.27.2012

9.26.2012

Digital Scrapbook: Broomball

Here is the broomball team that I played with for a couple years in the late 90's early 00's. We were pretty good.






Big Ten Heat Map (Week 4)


9.24.2012

Digital Scrapbook: Soccer Teams

Here are several of the soccer teams that I coached over the years.











9.15.2012

Fonts

Here are two pieces on fonts. The first, a letter to the world from Comic Sans printed in McSweeney's. [Read it here.] The second reports on the results of a study asking whether particular fonts are more believable [Read]. Sorry Comic Sans.

Political Contributions

Working with the FEC Political Contribution Data <!-- Styles for R syntax highlighter

Working with the FEC Political Contribution Data

options(encoding = "native.enc", replace.assign = TRUE, width = 90)

The Problem

I recently read a blog that announced the Forest of Advocacy website. In their words,
This site is dedicated to providing visualizations gleaned from big data regarding politics. We will provide at least one new visualization every week from now to the election. So enjoy, and come back next Monday.
The first visualization they did examined the political contraptions of individuals over time within particular organizations. From this, as they point out,
Over the decades, one sees lines sketched out, reflecting the partisanship of individuals over time. For each organization, we also provide the net contributions of the entire organization, and the names of biggest Democratic, Republican, and 'bipartisan' contributors (the individual with the highest product of Democratic and Republican contributions).“
This made me ask the question what about the University of Minnesota? To which party do their employees contribute?

Obtaining the Data

As a data geek, I immediately went to the Federal Election Commission webpage and saw what I could download to answer this question. On their Downloadable Data Files page there were several files that I could download. As with most Federal government datasets, it is not intuitively obvious what data these files contain given their names. Unlike most datasets, these have detailed codebooks to accompany each dataset! (Props to you FEC.)
I wanted the individual contributions data (Detailed Data File). The FEC has individual contribution data for every year back to 1999–2000 available for download. I downloaded the ind12.zip data (unzips to a 327.7 MB .txt file) which has 1,994,874 records of contribution(s) from an individual to a federal committee if the contribution was at least $200. I also grabbed the Committee Master File, the Candidate Master File, and the Candidate Committee Linkage File. These files contain information that can be merged with the contributions data to link the committee and candidate information with that of the contributor.

Importing the Data into R

Now comes the fun…getting this into R so that we can actually get the data we need to answer our question. Let's start with the individual contributions data. This is a text file with 21 variables that are delimited by the pipe (|) character. The first line of data looks like this:
C00492371|N|M2|P|11990223949|15|IND|BIRDWELL, BRAD A.|CYPRESS|TX|77429|BIRDWELL|
CONTRACTOR|01232011|500||SA17A.4153|714660|||4021720111136021419
The natural choice to read in such data is to use the read.table() function. So I tried it.
pc <- read.table(file = "~/Desktop/itcont.txt", header = FALSE, sep = "|")
## Error: line 37 did not have 21 elements
For some reason the function believed that line 37 didn't have the 21 variables that the previous lines did. What to do? You could open the file in a text editor and see if there is a problem with line 37. However, my thought was, if there is an error at line 37 that I have to fix and there are 1,994,874 records this could be an all-year process of finding and fixing problems.
My solution was to use a little known function called count.fields(). This function counts the number of fields, as separated by sep=, in each of the lines of the file to be read. Then I used the table() function to figure out how many problematic lines there were.
n <- count.fields(file = "~/Desktop/itcont.txt", sep = "|")
table(n)
## n
##       8       9      12      13      16      17      18      20      21 
##    8486     290   16700    1673      23     943      14     685 1966060
It looks like there are several lines (n = 28,814) that do not seem to have 21 fields. The largest set seems to have only 12 field. Below I use the readLines() function to read in the data line-by-line, and in the next line, I identify the lines that only seemed to have 12 fields. (Note: I use the head() function to save space.)
x <- readLines("~/Desktop/itcont.txt")
head(x[n == 12])
## [1] "C00063164|N|M2||11930387788|15|IND|GOMPELS, NINA|SAVANNAH|GA|31404|MCDONALD'S|LICENSEE|01312011|250||2A1FCCF1182FA65F0CC|714744|||4031620111137409329"         
## [2] "C00063164|N|M2||11930387788|15|IND|GRUBER, GREG|SANDY|UT|84093|MCDONALD'S|LICENSEE|01312011|500||8C3DFE145D4B7356E88|714744|||4031620111137409330"             
## [3] "C00063164|N|M2||11930387788|15|IND|GUERRERO, TILA|WATSONVILLE|CA|95076|MCDONALD'S|LICENSEE|01312011|1000||D5C6C704081D8AD1910|714744|||4031620111137409331"    
## [4] "C00063164|N|M2||11930387779|15|IND|ABBATE, JAMES A.|MERCED|CA|95341|MCDONALD'S|LICENSEE|01312011|2000||0ECDDCDE3B4CF443D22|714744|||4031620111137409302"       
## [5] "C00063164|N|M2||11930387779|15|IND|ADAMS, ALLISON|TAMPA|FL|33607|MCDONALD'S|LICENSEE|01312011|500||543088A9C13C4BE1A71|714744|||4031620111137409304"           
## [6] "C00063164|N|M2||11930387780|15|IND|ARMSTRONG, DONALD D.|BEAVERTON|OR|97006|MCDONALD'S|LICENSEE|01312011|2600||8DBDD6D125EDFAB256C|714744|||4031620111137409307"
Each of these lines stopped reading 12 fields in. The problem with these lines is the apostrophe in the 12th field. R is treating this like a string, so everything after the apostrophe appears as a single field (there is no closing apostrophe). We can remedy this by using the quote=argument to let R know that strings need to be in double-quotes rather than single quotes. One more issue is that we cannot write quote=""", because R will usethe first two quotation marks. We need to escape the middle quote using the backslash (\). Now we try again,
pc <- read.table(file = "~/Desktop/itcont.txt", header = FALSE, sep = "|", quote = "\"")
## Error: line 1151 did not have 21 elements
Okay. This time we made it through 1,150 lines before the code broke. Again we use count.fields() this time adding the quote= argument.
n <- count.fields(file = "~/Desktop/itcont.txt", sep = "|", quote = "\"")
table(n)
## n
##       8       9      12      13      20      21 
##      13       1     439      18      73 1994330
Once again, the 12th field is giving problems. We again examine some of those fields.
head(x[n == 12])
## [1] "C00030718|N|M3||11930496503|15|IND|DURANTE, KATHLEEN S|CHEYENNE|WY|82009|#1 PROPERTIES|REAL ESTATE BROKER|01262011|300||A09703181A2EF43EFA3B|717445|||4031820111137607355"    
## [2] "C00030718|N|M3||11930496503|15|IND|BIGGS, CYNTHIA L MRS.|LARAMIE|WY|82070|#1 PROPERTIES|REAL ESTATE BROKER|01262011|300||A553A97B481D44B5EA16|717445|||4031820111137607357"   
## [3] "C00401190|N|Q1|P|11930643549|15|IND|PEER, WILBUR T.|UPPER MARLBORO|MD|20774|MANAGEMENT CONSULTANTS INC #1|CONSULTANT|03312011|500||C4754843|722315|||4041320111138070572"     
## [4] "C00089086|N|M2||11990218895|15|IND|MONTGOMERY, SHERI|EVANSTON|WY|82930|UNITA COUNTY SCHOOL DIST #4|OCCUPATIONAL THERAPIST|01192011|400||38594231|713984|||4031120111137183347"
## [5] "C00349746|N|M2||11990229557|15|IND|ANKER, ED|VISALIA|CA|93291|SHADY ACRES DAIRY #2|DAIRYMAN|01122011|261||SA11AI.28076|715177|||4031520111137265488"                          
## [6] "C00349746|N|M2||11990229568|15|IND|ROELOFFS, JOHNNY|TIPTON|CA|93272|J R DAIRY #2|DAIRYMAN|01122011|292||SA11AI.28030|715177|||4031520111137265522"
A-ha! The hash tag (pound sign) is causing problems. Remember that R treats the hash tag as a character to indicate a comment. Thus, everything on a line following a hash tag is commented out. We want the hash tag to be part of the string in that field. We can now add the argument comment.char="" to the read.table() function.
pc <- read.table(file = "~/Desktop/itcont.txt", header = FALSE, sep = "|", quote = "\"", comment.char = "")
Great! It worked! Well…we should check that.
nrow(pc)
## [1] 1994874
length(pc)
## [1] 21
Perfect. Now we can use the online codebook to add the variable names to our data.
names(pc) <- c("CMTE_ID", "AMNDT_IND", "RPT_TP", "TRANSACTION_PGI", "IMAGE_NUM", "TRANSACTION_TP", 
    "ENTITY_TP", "NAME", "CITY", "STATE", "ZIP_CODE", "EMPLOYER", "OCCUPATION", "TRANSACTION_DT", 
    "TRANSACTION_AMT", "OTHER_ID", "TRAN_ID", "FILE_NUM", "MEMO_CD", "MEMO_TEXT", "SUB_ID")

Reading in the Other Files and Merging the Results

While we can breathe a big sigh of relief at getting the contribution data read in, it isn't enough. A quick scan of the variables suggest that this file does not include important information. For example, which political party the individual contributed to. Scanning the codebooks for the other files—the Committee Master File, the Candidate Master File, and the Candidate Committee Linkage File—we find that both the Committee Master File and the Candidate Master File have party affiliation data.
How do you get the data from these files into our individual contributions data? To do this we use the merge() function. Before doing that, however, it is worth a quick diversion to talk about databases and tables.

Databases

Databases are alternative ways to store data. They are particularly useful for large datasets like this one. The key to understanding databases is to remember that they are composed of tables. Tables are nothing more than rectangular arrays (they have rows and columns). For example, in R, a data frame is a table—it is made up of rows and columns. Most databases include multiple tables.
We can think of each of the files from our political contribution files as tables.
File Rows x Columns
Individual Contributions File 1,994,874 x 21
Committee Master File 13,093 x 15
Candidate Master File 5,345 x 15
Linkage File 5,954 x 7
Note that from the codebooks we can also see that these "tables” have some columns with common variable names, other variables are different across tables. This is because in a database, each table (should) represent different information. The common variable names are used to link information across tables. For example, the Individual Contributions File contains file on individuals who made contributions and the Committee Master File contains information on political committees. The variable CMTE_ID is common to both tables. So, we can use that to link the two tables or files.

Merging Data Frames

To examine how we will merge two data frames together, we first read in the Committee Master File and Candidate Master File.
cm <- read.table(file = "~/Desktop/cm.txt", sep = "|", header = FALSE, quote = "\"", comment.char = "")
nrow(cm)
## [1] 13093
length(cm)
## [1] 15
names(cm) <- c("CMTE_ID", "CMTE_NM", "TRES_NM", "CMTE_ST1", "CMTE_ST2", "CMTE_CITY", "CMTE_ST", 
    "CMTE_ZIP", "CMTE_DSGN", "CMTE_TP", "CMTE_PTY_AFFILIATION", "CMTE_FILING_FREQ", "ORG_TP", 
    "CONNECTED_ORG_NM", "CAND_ID")
cn <- read.table(file = "~/Desktop/cn.txt", sep = "|", header = FALSE, quote = "\"", comment.char = "")
nrow(cn)
## [1] 5345
length(cn)
## [1] 15
names(cn) <- c("CAND_ID", "CAND_NAME", "CAND_PTY_AFFILIATION", "CAND_ELECTION_YR", "CAND_OFFICE_ST", 
    "CAND_OFFICE", "CAND_OFFICE_DISTRICT", "CAND_ICI", "CAND_STATUS", "CAND_PCC", "CAND_ST1", 
    "CAND_ST2", "CAND_CITY", "CAND_ST", "CAND_ZIP")
We then will use the merge() function to join the two tables (data frames) together. This function takes as its initial two arguments the names of the objects that the data frames are assigned to, namely x= and y= (e.g., x=pc and y=cm). It is important to note that the data in the second argument (y) will be added to the rows of data from the first argument (x). After providing the names of the data frames, the argument by= is used to give the names of the variable names that are common to both data frames. (If there is more than one, use the c() function to bind them together.) We also use the all.x=TRUE argument. This will keep any rows in x that don't have a match in y. If this argument is not included then rows in x without matches in y will be dropped.
pccm <- merge(x = pc, y = cm, by = "CMTE_ID", all.x = TRUE)
nrow(pccm)
## [1] 1994874
length(pccm)
## [1] 35
The new data frame will have the same number of rows as the data frame in the x= argument. The number of variables will be equal to the sum of the unique variables in both data frames and the number of common variables between the two. Note how the results are different if we reverse the order of the two data frames.
cmpc <- merge(x = cm, y = pc, by = "CMTE_ID", all.x = TRUE)
nrow(cmpc)
## [1] 2001639
length(cmpc)
## [1] 35
The number of variables is the same, but the number of rows is different because this time the information from the pc data frame was added to the rows in the cm data. In SQL terms, this is called a left-inner join. If you are familiar with SQL, you might like to use the sqldf package.
We can also join the candidate data to our pccm data.
pccmcn <- merge(x = pccm, y = cn, by = "CAND_ID", all.x = TRUE)
nrow(pccmcn)
## [1] 1994874
length(pccmcn)
## [1] 49

Finding University of Minnesota Contributors with Grep

The data frame includes data from all contribotors, and I was interested in only the University of Minnesota employees. The variable EMPLOYER contains information on the contributor's employer. This variable contains character (string) data. To find particular strings we can use the grep() function. Grep was named after the Unix qed/ed editor subcommand “g/re/p”. In this subcommand, g stands for global, re stands for a regular expression, and p stands for print. In essence these subcommands would globally search some text for a given regular expression and print the lines containing matches.
The grep() function takes as its first argument, the character string to search for and as its second argument the variable (or vector to search). Here we search for the string “UNIVERSITY OF MINNESOTA” in the EMPLOYER variable and print the first fe results.
head(grep("UNIVERSITY OF MINNESOTA", pccmcn$EMPLOYER))
## [1] 13529 14294 19110 21758 22403 23838
The function outputs the row number for any row in which the string “UNIVERSITY OF MINNESOTA” is found in the EMPLOYER variable. We can use indexing to assign the rows of data that include this string in a new data frame.
umn <- pccmcn[grep("UNIVERSITY OF MINNESOTA", pccmcn$EMPLOYER), ]
nrow(umn)
## [1] 336
length(umn)
## [1] 49
The output shows that 336 rows have the string “UNIVERSITY OF MINNESOTA” as the employer. Note also that the new data frame has retained all of the columns as our previously merged data frame. We can now examine the affilation of the political party that these emplyees contributed to.
table(umn$CAND_PTY_AFFILIATION)
## 
##     AIP AMP CIT CON CRV CST DCG DEM DFL FED GRE GRN IAP IDP IND  IP LIB   N N/A NNE NOP 
##   0   0   0   0   0   0   0   0 165  27   0   1   0   0   0   0   0   0   0   0   0   0 
## NPA OTH PAF PRI REF REP RTL SUS SWP TEA  TX UNK   W WFP 
##   0   0   0   0   0  42   0   0   0   0   0   0   0   0
table(umn$CMTE_PTY_AFFILIATION)
## 
##       0 ACP AIP AMP CIT COM CON CRV CST DEM DFL FED FWP GOP GRE GRN HOS IND LIB  MW NNE 
##  78   0   0   0   0   0   0   0   0   0 150  63   0   0   0   1   0   0   0   0   0   0 
## NON NPA OTH PAC PAF PRO REF REP SOC STA SUS SWP TEA UNK UPA UTP  WF WFP WHG WNC 
##   0   0   0   0   0   0   0  43   0   0   0   0   0   1   0   0   0   0   0   0
The only parties that University of Minnesota employees have contributed to are the Democratic, DFL, and Republican. We will append a variable to the umn data frame called PARTY that has two levels DEM/DFL and REP.
umn$PARTY[umn$CAND_PTY_AFFILIATION == "DEM"] <- "DEM/DFL"
umn$PARTY[umn$CAND_PTY_AFFILIATION == "DFL"] <- "DEM/DFL"
umn$PARTY[umn$CMTE_PTY_AFFILIATION == "DEM"] <- "DEM/DFL"
umn$PARTY[umn$CMTE_PTY_AFFILIATION == "DFL"] <- "DEM/DFL"
umn$PARTY[umn$CAND_PTY_AFFILIATION == "REP"] <- "REP"
umn$PARTY[umn$CMTE_PTY_AFFILIATION == "REP"] <- "REP"
table(umn$PARTY)
## 
## DEM/DFL     REP 
##     242      44

Plotting the Contributions with ggplot

First we load the ggplot2 library.
library(ggplot2)
We plot the contributions conditional on party. (To learn more about the use of ggplot see the excellent online documentation.)
ggplot(data = na.omit(umn), aes(x = PARTY, y = TRANSACTION_AMT, color = PARTY)) +
  geom_jitter(alpha = 0.3, position = position_jitter(width = 0.1, height = 0)) +
    scale_color_manual(name = "Affiliation", values = c("Black", "Red")) +
    xlab("") +
    ylab("Contribution") +
    theme_bw()
plot of chunk unnamed-chunk-20
We note that four observations are negative.
umn2 <- umn[umn$TRANSACTION_AMT > 0, ]
ggplot(data = na.omit(umn2), aes(x = PARTY, y = TRANSACTION_AMT, color = PARTY)) +
  geom_boxplot() +
  geom_jitter(alpha = 0.3, position = position_jitter(width = 0.1, height = 0)) +
  stat_summary(fun.y = "mean", geom = "point", size = 4) +
  scale_color_manual(name = "Affiliation", values = c("Black", "Red")) +
    xlab("") +
    ylab("Contribution") +
    theme_bw()
plot of chunk unnamed-chunk-21
Removing these, we observe that the political contributions seem to be comparable (medians are roughly equal) for the two sides of the political aisle. However, there is a great deal more variation in the Republican contributions.
--> Type MIME Type text/html Resource Type Document Location Full URL http://rstudio-pubs-static.s3.amazonaws.com/1689_9eee1ef2fbe948f6ba5901a17ca52308.html Scheme http Host rstudio-pubs-static.s3.amazonaws.com Path /1689_9eee1ef2fbe948f6ba5901a17ca52308.html Filename 1689_9eee1ef2fbe948f6ba5901a17ca52308.html