9.27.2012

9.26.2012

Digital Scrapbook: Broomball

Here is the broomball team that I played with for a couple years in the late 90's early 00's. We were pretty good.






Big Ten Heat Map (Week 4)


9.24.2012

Digital Scrapbook: Soccer Teams

Here are several of the soccer teams that I coached over the years.











9.15.2012

Fonts

Here are two pieces on fonts. The first, a letter to the world from Comic Sans printed in McSweeney's. [Read it here.] The second reports on the results of a study asking whether particular fonts are more believable [Read]. Sorry Comic Sans.

Political Contributions

Working with the FEC Political Contribution Data <!-- Styles for R syntax highlighter

Working with the FEC Political Contribution Data

options(encoding = "native.enc", replace.assign = TRUE, width = 90)

The Problem

I recently read a blog that announced the Forest of Advocacy website. In their words,
This site is dedicated to providing visualizations gleaned from big data regarding politics. We will provide at least one new visualization every week from now to the election. So enjoy, and come back next Monday.
The first visualization they did examined the political contraptions of individuals over time within particular organizations. From this, as they point out,
Over the decades, one sees lines sketched out, reflecting the partisanship of individuals over time. For each organization, we also provide the net contributions of the entire organization, and the names of biggest Democratic, Republican, and 'bipartisan' contributors (the individual with the highest product of Democratic and Republican contributions).“
This made me ask the question what about the University of Minnesota? To which party do their employees contribute?

Obtaining the Data

As a data geek, I immediately went to the Federal Election Commission webpage and saw what I could download to answer this question. On their Downloadable Data Files page there were several files that I could download. As with most Federal government datasets, it is not intuitively obvious what data these files contain given their names. Unlike most datasets, these have detailed codebooks to accompany each dataset! (Props to you FEC.)
I wanted the individual contributions data (Detailed Data File). The FEC has individual contribution data for every year back to 1999–2000 available for download. I downloaded the ind12.zip data (unzips to a 327.7 MB .txt file) which has 1,994,874 records of contribution(s) from an individual to a federal committee if the contribution was at least $200. I also grabbed the Committee Master File, the Candidate Master File, and the Candidate Committee Linkage File. These files contain information that can be merged with the contributions data to link the committee and candidate information with that of the contributor.

Importing the Data into R

Now comes the fun…getting this into R so that we can actually get the data we need to answer our question. Let's start with the individual contributions data. This is a text file with 21 variables that are delimited by the pipe (|) character. The first line of data looks like this:
C00492371|N|M2|P|11990223949|15|IND|BIRDWELL, BRAD A.|CYPRESS|TX|77429|BIRDWELL|
CONTRACTOR|01232011|500||SA17A.4153|714660|||4021720111136021419
The natural choice to read in such data is to use the read.table() function. So I tried it.
pc <- read.table(file = "~/Desktop/itcont.txt", header = FALSE, sep = "|")
## Error: line 37 did not have 21 elements
For some reason the function believed that line 37 didn't have the 21 variables that the previous lines did. What to do? You could open the file in a text editor and see if there is a problem with line 37. However, my thought was, if there is an error at line 37 that I have to fix and there are 1,994,874 records this could be an all-year process of finding and fixing problems.
My solution was to use a little known function called count.fields(). This function counts the number of fields, as separated by sep=, in each of the lines of the file to be read. Then I used the table() function to figure out how many problematic lines there were.
n <- count.fields(file = "~/Desktop/itcont.txt", sep = "|")
table(n)
## n
##       8       9      12      13      16      17      18      20      21 
##    8486     290   16700    1673      23     943      14     685 1966060
It looks like there are several lines (n = 28,814) that do not seem to have 21 fields. The largest set seems to have only 12 field. Below I use the readLines() function to read in the data line-by-line, and in the next line, I identify the lines that only seemed to have 12 fields. (Note: I use the head() function to save space.)
x <- readLines("~/Desktop/itcont.txt")
head(x[n == 12])
## [1] "C00063164|N|M2||11930387788|15|IND|GOMPELS, NINA|SAVANNAH|GA|31404|MCDONALD'S|LICENSEE|01312011|250||2A1FCCF1182FA65F0CC|714744|||4031620111137409329"         
## [2] "C00063164|N|M2||11930387788|15|IND|GRUBER, GREG|SANDY|UT|84093|MCDONALD'S|LICENSEE|01312011|500||8C3DFE145D4B7356E88|714744|||4031620111137409330"             
## [3] "C00063164|N|M2||11930387788|15|IND|GUERRERO, TILA|WATSONVILLE|CA|95076|MCDONALD'S|LICENSEE|01312011|1000||D5C6C704081D8AD1910|714744|||4031620111137409331"    
## [4] "C00063164|N|M2||11930387779|15|IND|ABBATE, JAMES A.|MERCED|CA|95341|MCDONALD'S|LICENSEE|01312011|2000||0ECDDCDE3B4CF443D22|714744|||4031620111137409302"       
## [5] "C00063164|N|M2||11930387779|15|IND|ADAMS, ALLISON|TAMPA|FL|33607|MCDONALD'S|LICENSEE|01312011|500||543088A9C13C4BE1A71|714744|||4031620111137409304"           
## [6] "C00063164|N|M2||11930387780|15|IND|ARMSTRONG, DONALD D.|BEAVERTON|OR|97006|MCDONALD'S|LICENSEE|01312011|2600||8DBDD6D125EDFAB256C|714744|||4031620111137409307"
Each of these lines stopped reading 12 fields in. The problem with these lines is the apostrophe in the 12th field. R is treating this like a string, so everything after the apostrophe appears as a single field (there is no closing apostrophe). We can remedy this by using the quote=argument to let R know that strings need to be in double-quotes rather than single quotes. One more issue is that we cannot write quote=""", because R will usethe first two quotation marks. We need to escape the middle quote using the backslash (\). Now we try again,
pc <- read.table(file = "~/Desktop/itcont.txt", header = FALSE, sep = "|", quote = "\"")
## Error: line 1151 did not have 21 elements
Okay. This time we made it through 1,150 lines before the code broke. Again we use count.fields() this time adding the quote= argument.
n <- count.fields(file = "~/Desktop/itcont.txt", sep = "|", quote = "\"")
table(n)
## n
##       8       9      12      13      20      21 
##      13       1     439      18      73 1994330
Once again, the 12th field is giving problems. We again examine some of those fields.
head(x[n == 12])
## [1] "C00030718|N|M3||11930496503|15|IND|DURANTE, KATHLEEN S|CHEYENNE|WY|82009|#1 PROPERTIES|REAL ESTATE BROKER|01262011|300||A09703181A2EF43EFA3B|717445|||4031820111137607355"    
## [2] "C00030718|N|M3||11930496503|15|IND|BIGGS, CYNTHIA L MRS.|LARAMIE|WY|82070|#1 PROPERTIES|REAL ESTATE BROKER|01262011|300||A553A97B481D44B5EA16|717445|||4031820111137607357"   
## [3] "C00401190|N|Q1|P|11930643549|15|IND|PEER, WILBUR T.|UPPER MARLBORO|MD|20774|MANAGEMENT CONSULTANTS INC #1|CONSULTANT|03312011|500||C4754843|722315|||4041320111138070572"     
## [4] "C00089086|N|M2||11990218895|15|IND|MONTGOMERY, SHERI|EVANSTON|WY|82930|UNITA COUNTY SCHOOL DIST #4|OCCUPATIONAL THERAPIST|01192011|400||38594231|713984|||4031120111137183347"
## [5] "C00349746|N|M2||11990229557|15|IND|ANKER, ED|VISALIA|CA|93291|SHADY ACRES DAIRY #2|DAIRYMAN|01122011|261||SA11AI.28076|715177|||4031520111137265488"                          
## [6] "C00349746|N|M2||11990229568|15|IND|ROELOFFS, JOHNNY|TIPTON|CA|93272|J R DAIRY #2|DAIRYMAN|01122011|292||SA11AI.28030|715177|||4031520111137265522"
A-ha! The hash tag (pound sign) is causing problems. Remember that R treats the hash tag as a character to indicate a comment. Thus, everything on a line following a hash tag is commented out. We want the hash tag to be part of the string in that field. We can now add the argument comment.char="" to the read.table() function.
pc <- read.table(file = "~/Desktop/itcont.txt", header = FALSE, sep = "|", quote = "\"", comment.char = "")
Great! It worked! Well…we should check that.
nrow(pc)
## [1] 1994874
length(pc)
## [1] 21
Perfect. Now we can use the online codebook to add the variable names to our data.
names(pc) <- c("CMTE_ID", "AMNDT_IND", "RPT_TP", "TRANSACTION_PGI", "IMAGE_NUM", "TRANSACTION_TP", 
    "ENTITY_TP", "NAME", "CITY", "STATE", "ZIP_CODE", "EMPLOYER", "OCCUPATION", "TRANSACTION_DT", 
    "TRANSACTION_AMT", "OTHER_ID", "TRAN_ID", "FILE_NUM", "MEMO_CD", "MEMO_TEXT", "SUB_ID")

Reading in the Other Files and Merging the Results

While we can breathe a big sigh of relief at getting the contribution data read in, it isn't enough. A quick scan of the variables suggest that this file does not include important information. For example, which political party the individual contributed to. Scanning the codebooks for the other files—the Committee Master File, the Candidate Master File, and the Candidate Committee Linkage File—we find that both the Committee Master File and the Candidate Master File have party affiliation data.
How do you get the data from these files into our individual contributions data? To do this we use the merge() function. Before doing that, however, it is worth a quick diversion to talk about databases and tables.

Databases

Databases are alternative ways to store data. They are particularly useful for large datasets like this one. The key to understanding databases is to remember that they are composed of tables. Tables are nothing more than rectangular arrays (they have rows and columns). For example, in R, a data frame is a table—it is made up of rows and columns. Most databases include multiple tables.
We can think of each of the files from our political contribution files as tables.
File Rows x Columns
Individual Contributions File 1,994,874 x 21
Committee Master File 13,093 x 15
Candidate Master File 5,345 x 15
Linkage File 5,954 x 7
Note that from the codebooks we can also see that these "tables” have some columns with common variable names, other variables are different across tables. This is because in a database, each table (should) represent different information. The common variable names are used to link information across tables. For example, the Individual Contributions File contains file on individuals who made contributions and the Committee Master File contains information on political committees. The variable CMTE_ID is common to both tables. So, we can use that to link the two tables or files.

Merging Data Frames

To examine how we will merge two data frames together, we first read in the Committee Master File and Candidate Master File.
cm <- read.table(file = "~/Desktop/cm.txt", sep = "|", header = FALSE, quote = "\"", comment.char = "")
nrow(cm)
## [1] 13093
length(cm)
## [1] 15
names(cm) <- c("CMTE_ID", "CMTE_NM", "TRES_NM", "CMTE_ST1", "CMTE_ST2", "CMTE_CITY", "CMTE_ST", 
    "CMTE_ZIP", "CMTE_DSGN", "CMTE_TP", "CMTE_PTY_AFFILIATION", "CMTE_FILING_FREQ", "ORG_TP", 
    "CONNECTED_ORG_NM", "CAND_ID")
cn <- read.table(file = "~/Desktop/cn.txt", sep = "|", header = FALSE, quote = "\"", comment.char = "")
nrow(cn)
## [1] 5345
length(cn)
## [1] 15
names(cn) <- c("CAND_ID", "CAND_NAME", "CAND_PTY_AFFILIATION", "CAND_ELECTION_YR", "CAND_OFFICE_ST", 
    "CAND_OFFICE", "CAND_OFFICE_DISTRICT", "CAND_ICI", "CAND_STATUS", "CAND_PCC", "CAND_ST1", 
    "CAND_ST2", "CAND_CITY", "CAND_ST", "CAND_ZIP")
We then will use the merge() function to join the two tables (data frames) together. This function takes as its initial two arguments the names of the objects that the data frames are assigned to, namely x= and y= (e.g., x=pc and y=cm). It is important to note that the data in the second argument (y) will be added to the rows of data from the first argument (x). After providing the names of the data frames, the argument by= is used to give the names of the variable names that are common to both data frames. (If there is more than one, use the c() function to bind them together.) We also use the all.x=TRUE argument. This will keep any rows in x that don't have a match in y. If this argument is not included then rows in x without matches in y will be dropped.
pccm <- merge(x = pc, y = cm, by = "CMTE_ID", all.x = TRUE)
nrow(pccm)
## [1] 1994874
length(pccm)
## [1] 35
The new data frame will have the same number of rows as the data frame in the x= argument. The number of variables will be equal to the sum of the unique variables in both data frames and the number of common variables between the two. Note how the results are different if we reverse the order of the two data frames.
cmpc <- merge(x = cm, y = pc, by = "CMTE_ID", all.x = TRUE)
nrow(cmpc)
## [1] 2001639
length(cmpc)
## [1] 35
The number of variables is the same, but the number of rows is different because this time the information from the pc data frame was added to the rows in the cm data. In SQL terms, this is called a left-inner join. If you are familiar with SQL, you might like to use the sqldf package.
We can also join the candidate data to our pccm data.
pccmcn <- merge(x = pccm, y = cn, by = "CAND_ID", all.x = TRUE)
nrow(pccmcn)
## [1] 1994874
length(pccmcn)
## [1] 49

Finding University of Minnesota Contributors with Grep

The data frame includes data from all contribotors, and I was interested in only the University of Minnesota employees. The variable EMPLOYER contains information on the contributor's employer. This variable contains character (string) data. To find particular strings we can use the grep() function. Grep was named after the Unix qed/ed editor subcommand “g/re/p”. In this subcommand, g stands for global, re stands for a regular expression, and p stands for print. In essence these subcommands would globally search some text for a given regular expression and print the lines containing matches.
The grep() function takes as its first argument, the character string to search for and as its second argument the variable (or vector to search). Here we search for the string “UNIVERSITY OF MINNESOTA” in the EMPLOYER variable and print the first fe results.
head(grep("UNIVERSITY OF MINNESOTA", pccmcn$EMPLOYER))
## [1] 13529 14294 19110 21758 22403 23838
The function outputs the row number for any row in which the string “UNIVERSITY OF MINNESOTA” is found in the EMPLOYER variable. We can use indexing to assign the rows of data that include this string in a new data frame.
umn <- pccmcn[grep("UNIVERSITY OF MINNESOTA", pccmcn$EMPLOYER), ]
nrow(umn)
## [1] 336
length(umn)
## [1] 49
The output shows that 336 rows have the string “UNIVERSITY OF MINNESOTA” as the employer. Note also that the new data frame has retained all of the columns as our previously merged data frame. We can now examine the affilation of the political party that these emplyees contributed to.
table(umn$CAND_PTY_AFFILIATION)
## 
##     AIP AMP CIT CON CRV CST DCG DEM DFL FED GRE GRN IAP IDP IND  IP LIB   N N/A NNE NOP 
##   0   0   0   0   0   0   0   0 165  27   0   1   0   0   0   0   0   0   0   0   0   0 
## NPA OTH PAF PRI REF REP RTL SUS SWP TEA  TX UNK   W WFP 
##   0   0   0   0   0  42   0   0   0   0   0   0   0   0
table(umn$CMTE_PTY_AFFILIATION)
## 
##       0 ACP AIP AMP CIT COM CON CRV CST DEM DFL FED FWP GOP GRE GRN HOS IND LIB  MW NNE 
##  78   0   0   0   0   0   0   0   0   0 150  63   0   0   0   1   0   0   0   0   0   0 
## NON NPA OTH PAC PAF PRO REF REP SOC STA SUS SWP TEA UNK UPA UTP  WF WFP WHG WNC 
##   0   0   0   0   0   0   0  43   0   0   0   0   0   1   0   0   0   0   0   0
The only parties that University of Minnesota employees have contributed to are the Democratic, DFL, and Republican. We will append a variable to the umn data frame called PARTY that has two levels DEM/DFL and REP.
umn$PARTY[umn$CAND_PTY_AFFILIATION == "DEM"] <- "DEM/DFL"
umn$PARTY[umn$CAND_PTY_AFFILIATION == "DFL"] <- "DEM/DFL"
umn$PARTY[umn$CMTE_PTY_AFFILIATION == "DEM"] <- "DEM/DFL"
umn$PARTY[umn$CMTE_PTY_AFFILIATION == "DFL"] <- "DEM/DFL"
umn$PARTY[umn$CAND_PTY_AFFILIATION == "REP"] <- "REP"
umn$PARTY[umn$CMTE_PTY_AFFILIATION == "REP"] <- "REP"
table(umn$PARTY)
## 
## DEM/DFL     REP 
##     242      44

Plotting the Contributions with ggplot

First we load the ggplot2 library.
library(ggplot2)
We plot the contributions conditional on party. (To learn more about the use of ggplot see the excellent online documentation.)
ggplot(data = na.omit(umn), aes(x = PARTY, y = TRANSACTION_AMT, color = PARTY)) +
  geom_jitter(alpha = 0.3, position = position_jitter(width = 0.1, height = 0)) +
    scale_color_manual(name = "Affiliation", values = c("Black", "Red")) +
    xlab("") +
    ylab("Contribution") +
    theme_bw()
plot of chunk unnamed-chunk-20
We note that four observations are negative.
umn2 <- umn[umn$TRANSACTION_AMT > 0, ]
ggplot(data = na.omit(umn2), aes(x = PARTY, y = TRANSACTION_AMT, color = PARTY)) +
  geom_boxplot() +
  geom_jitter(alpha = 0.3, position = position_jitter(width = 0.1, height = 0)) +
  stat_summary(fun.y = "mean", geom = "point", size = 4) +
  scale_color_manual(name = "Affiliation", values = c("Black", "Red")) +
    xlab("") +
    ylab("Contribution") +
    theme_bw()
plot of chunk unnamed-chunk-21
Removing these, we observe that the political contributions seem to be comparable (medians are roughly equal) for the two sides of the political aisle. However, there is a great deal more variation in the Republican contributions.
--> Type MIME Type text/html Resource Type Document Location Full URL http://rstudio-pubs-static.s3.amazonaws.com/1689_9eee1ef2fbe948f6ba5901a17ca52308.html Scheme http Host rstudio-pubs-static.s3.amazonaws.com Path /1689_9eee1ef2fbe948f6ba5901a17ca52308.html Filename 1689_9eee1ef2fbe948f6ba5901a17ca52308.html

9.12.2012

How Google Makes its Maps

An exciting piece of journalism about data and computing appeared in the Atlantic.
"Behind every Google Map, there is a much more complex map that's the key to your queries but hidden from your view. The deep map contains the logic of places: their no-left-turns and freeway on-ramps, speed limits and traffic conditions. This is the data that you're drawing from when you ask Google to navigate you from point A to point B -- and last week, Google showed me the internal map and demonstrated how it was built. It's the first time the company has let anyone watch how the project it calls GT, or "Ground Truth," actually works. " Read the article online here [read article].

9.10.2012

2–0

That is the record of the Gophers. It has been a long time since that has occurred. It feels good. I finally know what it is like to feel like a USC or LSU fan. To even make the weekend better, the Vikings also won, and the Badgers and Packers lost! When is the last time we had a Gopher/Viking win and Badger/Packer loss in the same weekend???

On a more somber note, my condolences to Iowa fans. Your neighbors to the north know exactly what it feel like to lose to a team you should beat, and in that loss realize that you need a running back, a quarterback, a defense, a wide receiver, and a secondary. We know the agony of three quarters of hope followed by a quarter of disaster and pain–seeing your season turn from a bowl-bound 7-5 to 2-10 on a single interception or failed first down. Been there.

I read a few excellent articles and blog postings over the weekend. One such blog entry gave a great answer to the question that is asked to all faculty members at one point, aside from teaching, what do academics do? [Read the blog here.] The author, a USC faculty member, has categorized our work into five dimensions that nicely summarize what we do:

  • Learning
  • Discovering
  • Sharing
  • Helping 
  • Fund-raising
These dimensions, as he points out, are waaaay more useful then teaching, research and service, the big-three that universities seem to have coveted for years. The last two (helping and fund-raising) I would like to see changed, not because they are incorrect, but rather because they don't have the "humph" that the others do. They are quite descriptive, and encompass what we do, but you would never see the universities starting a campaign of "Driven to Help" or "Driven to Fund-Raise". Maybe citizenship would be better than helping. But, I do like the participle "ing" on the end of each noun....and citizening doesn't cut it.

Another awesome read last night was Jay-Z’s “99 Problems,” Verse 2: A Close Reading With Fourth Amendment Guidance for Cops and Perps. This is a line-by-line analysis of the second verse of Jay-Z's masterpiece written from the perspective of a criminal procedure professor. He examines the ins-and-outs of various case law relating to the legalities of a traffic stop as documented by Mr. Z. For example, he writes
A traffic stop is a legitimate seizure of the person, for purposes of investigating the violation of the traffic law and writing up the citation. But it cannot be prolonged for longer than reasonably necessary to complete that legitimate activity...And the Fourth Amendment rule is very clear: if the police detain you after they’ve finished processing the ticket—or if they simply dawdle over the ticket processing for an unreasonable length of time—in order to get a K-9 team there, then the eventual dog sniff will be the fruit of an illegal detention, and any evidence found will be suppressed. The officer in Jay-Z’s case apparently knew this, and so released Jay-Z after the stop when the K-9 unit he’d called was late in arriving. I got 99 problems but a bitch ain’t one...
Lastly, I have been reading the Nieman Journalism Labs review of J-schools. They have had big-wigs comment on what these schools are doing well and what they wish students would have learned. The most common wish has been that journalists learn some coding. Here are two takes.

Brian Boyer, the head of NPR’s news apps team, suggests a course in Hacker Journalism 101. He provides a reading list and some assignments. He also offers three virtues of a programmer that the students would apply to all of their work in the course,

  • Laziness: I will do anything to work less.
  • Impatience: The waiting, it makes me crazy.
  • Hubris: I can make this computer do anything.
Miranda Mulligan, executive director at Northwestern’s Knight News Innovation Lab, also promoted learning to code, arguing that journalists need to learn how to code if they want to become better (and employable) storytellers. [Read Miranda's take here.]

And yet if you attend any event with a collection of jouro-nerd types, inevitably the same question will come up. Someone will ask...“How can we tell better stories on the web?” and proceed to bemoan the tedium of reading a daily newspaper and a newspaper website...It is our job as educators to remove fear of learning, a fear notoriously prevalent in journalists...For me, there’s only one response to this: Journalists should learn more about code. Understanding our medium makes us better storytellers. For an industry that prides itself on being smart, tolerating ignorance of the Internet is just stupid.
Forgot one other thing. Saw this and thought every undergraduate should be given an orientation to the information on searching with Google in these slides.