3.17.2012

Some Word Clouds

I was dorking around with R today and decided to create some word clouds. The first is of the 2012 State of the University Address given by President Kaler on March 1, 2012. After removing punctuation (except for hyphens) and numbers, converting everything to lowercase, dropping common English stopwords, and stripping whitespace, I used the tm package to create a document-term matrix (which records each word along with its frequency). After summing the counts for each word and putting them in a data frame, I used the brewer.pal() and wordcloud() functions to create the word cloud itself.


The result shows that President Kaler used all of the terms one would expect in a State of the University Address. The terms "students", "faculty", and "research" are all prominent, as are "budget", "tuition", "balancing", "support", "learning", and other administrative catch-all words.

The code I used was:
library(tm)
library(wordcloud)

## Read in the data from a folder which contains the text document(s)
(ovid <- Corpus(DirSource("/Users/andrewz/Documents/Data/State-of-the-University/"), 
    readerControl = list(reader = readPlain)))

## Document preparation
sotu <- tm_map(ovid, removePunctuation, preserve_intra_word_dashes = TRUE)
sotu <- tm_map(sotu, removeNumbers)
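## In tm 0.6 and later, base functions such as tolower must be wrapped in content_transformer()
sotu <- tm_map(sotu, content_transformer(tolower))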
sotu <- tm_map(sotu, stripWhitespace)
sotu <- tm_map(sotu, removeWords, stopwords("english"))
sotu <- tm_map(sotu, stripWhitespace)

## Create document-term matrix
dtm <- DocumentTermMatrix(sotu)
m <- as.matrix(dtm)
v <- sort(colSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)

## Plot
pal <- brewer.pal(7, "Set3")
pdf("/Users/andrewz/Desktop/SOTU.pdf", width = 8.33, height = 6.67, bg = "black")
wordcloud(d$word, d$freq,
    #scale = c(8, 0.3),
    min.freq = 3,
    #max.words = 100,
    #random.order = TRUE,
    rot.per = 0.15,
    colors = pal,
    vfont = c("sans serif", "plain")
    )
dev.off()
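
To make the word-counting step less of a black box, here is a minimal base-R sketch of the same idea on a made-up sentence. The text and the one-word stopword list are invented for illustration, and this is only an approximation of the cleaning steps above, not what tm does internally.

```r
## Toy text standing in for the speech (made up for illustration)
txt <- "Students and faculty support research. Research supports students; first-rate students!"

## Drop punctuation except hyphens, drop digits, lowercase, collapse whitespace
clean <- gsub("[^[:alnum:][:space:]-]", "", txt)
clean <- gsub("[[:digit:]]+", "", clean)
clean <- tolower(clean)
clean <- gsub("[[:space:]]+", " ", trimws(clean))

## Split into words and drop stopwords (hypothetical one-word list)
words <- strsplit(clean, " ", fixed = TRUE)[[1]]
words <- words[!words %in% c("and")]

## Count each word and sort from most to least frequent
v <- sort(table(words), decreasing = TRUE)
d <- data.frame(word = names(v), freq = as.integer(v), stringsAsFactors = FALSE)
head(d)
```

The resulting data frame has the same two columns (word and freq) that get passed to wordcloud() above, so it is a handy way to sanity-check which words will come out largest.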


The second word cloud is based on my Google Scholar page. The cloud on the left-hand side shows my co-authors (sized by how frequently each appears) and the cloud on the right-hand side shows terms that show up in the work linked to my Scholar page.

The summary citation information can also be printed in R. Mine is:


Total papers = 20
Median citations per paper = 1.5
Median (citations / # of authors) per paper = 0.4166667
H-index = 6
G-index = 9
M-index = 1
First author H-index = 4
Last author H-index = 2
First or last author H-index = 5
First or second author H-index = 5
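
For anyone wondering where numbers like the H-index and G-index come from, here is a small sketch using the usual textbook definitions. The citation counts are hypothetical (not my actual record), the function names h_index() and g_index() are my own, and I am assuming rather than asserting that the summary function computes them the same way.

```r
## Hypothetical citation counts, one per paper (not my actual record)
cites <- c(25, 12, 9, 7, 6, 3, 1, 1, 0, 0)

## H-index: largest h such that h papers have at least h citations each
h_index <- function(x) {
  x <- sort(x, decreasing = TRUE)
  sum(x >= seq_along(x))
}

## G-index: largest g such that the top g papers together have at least g^2 citations
## (capped at the number of papers in this version)
g_index <- function(x) {
  x <- sort(x, decreasing = TRUE)
  sum(cumsum(x) >= seq_along(x)^2)
}

h_index(cites)
g_index(cites)
median(cites)   # median citations per paper
```

Both functions rely on the counts being sorted from largest to smallest, so that once the condition fails for one paper it fails for every paper after it.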


The code is below:
source("http://biostat.jhsph.edu/~jleek/code/googleCite.r")
out <- googleCite("http://scholar.google.com/citations?user=cWpN_s8AAAAJ&hl=en", 
    pdfname = "/Users/andrewz/Desktop/Zieffler_wordcloud.pdf")
gcSummary(out)

