The result shows that President Kaler used the terms one would expect in a State of the University Address. The terms "students", "faculty", and "research" are all prominent, as are "budget", "tuition", "balancing", "support", "learning", and other administrative catch-all words.
The code I used was:
library(tm)
library(wordcloud)  # also loads RColorBrewer, which provides brewer.pal()

## Read in the data from a folder which contains the text document(s)
(ovid <- Corpus(DirSource("/Users/andrewz/Documents/Data/State-of-the-University/"),
                readerControl = list(reader = readPlain)))

## Document preparation
sotu <- tm_map(ovid, removePunctuation, preserve_intra_word_dashes = TRUE)
sotu <- tm_map(sotu, removeNumbers)
## Recent versions of tm need content_transformer() around base functions such as tolower
sotu <- tm_map(sotu, content_transformer(tolower))
sotu <- tm_map(sotu, stripWhitespace)
sotu <- tm_map(sotu, removeWords, stopwords("english"))
sotu <- tm_map(sotu, stripWhitespace)

## Create document-term matrix (documents in rows, terms in columns)
tdm <- DocumentTermMatrix(sotu)
m <- as.matrix(tdm)
v <- sort(colSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)

## Plot
pal <- brewer.pal(7, "Set3")
pdf("/Users/andrewz/Desktop/SOTU.pdf", width = 8.33, height = 6.67, bg = "black")
wordcloud(d$word, d$freq,
  #scale = c(8, 0.3),
  min.freq = 3,
  #max.words = 100,
  #random.order = TRUE,
  rot.per = 0.15,
  colors = pal,
  vfont = c("sans serif", "plain")
)
dev.off()
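A word cloud only conveys relative sizes, so it can help to look at the counts behind it. The following is a minimal base-R sketch of that counting step, not the tm pipeline used above. The text in `txt` is made up for illustration (it is not the address), and the three-word stop list is a hand-picked stand-in for the much longer list that `stopwords("english")` returns.

```r
## Made-up text standing in for the speech (not the actual address)
txt <- "Students and faculty support research; research supports students."

## Lower-case, strip punctuation, then split on whitespace
words <- strsplit(gsub("[[:punct:]]", "", tolower(txt)), "\\s+")[[1]]

## Drop a tiny, hand-picked stop word list (an assumption for this sketch)
words <- words[!words %in% c("and", "the", "of")]

## Frequency table, most frequent first
freq <- sort(table(words), decreasing = TRUE)
head(freq, 3)
```

With the real data, the equivalent check would be to look at the first few rows of the data frame `d` built in the code above.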
The second word cloud is based on my Google Scholar page. The cloud on the left-hand side shows my co-authors (sized by how often each appears), and the cloud on the right-hand side shows terms that appear in the work linked to my Scholar page.
Total papers = 20
Median citations per paper = 1.5
Median (citations / # of authors) per paper = 0.4166667
H-index = 6
G-index = 9
M-index = 1
First author H-index = 4
Last author H-index = 2
First or last author H-index = 5
First or second author H-index = 5
The code is below:
source("http://biostat.jhsph.edu/~jleek/code/googleCite.r")
out <- googleCite("http://scholar.google.com/citations?user=cWpN_s8AAAAJ&hl=en",
                  pdfname = "/Users/andrewz/Desktop/Zieffler_wordcloud.pdf")
gcSummary(out)
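For readers unfamiliar with the indices in the summary, here is a minimal sketch of how an h-index and a g-index can be computed from a vector of per-paper citation counts. It uses the standard definitions and is not taken from the googleCite code, which may handle details differently. The function names `h_index` and `g_index` and the citation counts are hypothetical, chosen only for illustration.

```r
## Hypothetical citation counts, one entry per paper (not real data)
citations <- c(10, 8, 5, 4, 3, 0)

## h-index: the largest h such that h papers have at least h citations each
h_index <- function(cites) {
  sorted <- sort(cites, decreasing = TRUE)
  sum(sorted >= seq_along(sorted))
}

## g-index: the largest g such that the g most-cited papers together
## have at least g^2 citations (capped here at the number of papers)
g_index <- function(cites) {
  sorted <- sort(cites, decreasing = TRUE)
  ranks <- seq_along(sorted)
  ok <- cumsum(sorted) >= ranks^2
  if (any(ok)) max(ranks[ok]) else 0
}

h_index(citations)  # 4
g_index(citations)  # 5
```

Because the g-index credits the total citations of the top papers rather than requiring each one to clear the bar, it is always at least as large as the h-index, which matches the pattern in the summary (H-index = 6, G-index = 9).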