Just wrapped up a couple weeks of text mining, topic modeling, and sentiment analysis using R! The Wells Fargo Campus Analytic Challenge asked students from 13 Universities to submit code as well as analytical responses to parse topic and substance from a data set of around 250,000 social media posts.
Having never done anything like this before, it took me a while to identify an approach and the R packages that would be most helpful. Luckily, R has a couple of very rich packages (and package vignettes!) for doing this sort of work. Unfortunately, the vignettes were one of the few sources available to investigate implementation and examples. One good web resource I found for the ‘LDA’ package was Carson Sievert’s page here. He walks through a straightforward implementation and shows off some nifty visuals as well.
I did most of the clean up and text pre-processing work using ‘tm’ and ‘dplyr’. My favorite package though is ‘STM‘, which I used to fit the topic model. It’s got lots of functionality and is relatively intuitive. It supports several D3 visualizations as well. I originally used the ‘LDA’ package, but the topics where much more coherent in the STM model. Both packages use an LDA model to fit the data, but the STM package also uses a generative model, but uses a slightly different algortihm (“Spectral”) and that seemed to make the difference. STM also has some nice interactive visuals (D3) that were converted into htmlwidgets by Kent Russell, who is very talented and a very nice guy – he takes requests!
A great package for sentiment analysis (and more) is qdap. I had a lot of fun going through the qdap vignette, which is extremely comprehensive. It took a long time to work through it but it was just awesome.
You can see my submission (best D3 experience on Firefox or Safari) here
And view the code here
Hope this helps someone else get started in Topic Modeling and doing Sentiment Analysis in R!