This week I dove into more computational text analysis. While I gained a better understanding of the methodology through the articles we read, when it came time to learn the tools, my performance was a bit lackluster. For God’s sake, it took me hours just to get the Natural Language Toolkit (NLTK) to download. I was finally able to get through the required exercises after following Roy’s advice and a tip from my colleague Katherine Knowles (who also had issues with the setup): when you run the NLTK download cell in Anaconda, the downloader window opens behind your browser windows. Normally, I work on the exercises on Saturday and post my reflections for the week, but as you can see, it spilled over to Sunday this week. Given I had no background in Python, I expected this to be a bumpy week, and it was. The problem wasn’t getting my feet wet with a new language. It was me wandering aimlessly into the world of open-source computational notebooks.


Once I finally got things up and running, I realized just how lazy I have been in the past with learning any sort of computer code. My experience is very much “this needs to be done, let me copy and paste what I can to get the result I need.” As a data specialist at a high school, I attended multiple trainings and conferences to learn how to use one relational database system that was anything but open-source. I became familiar with SQL queries and created pages for administrators to pull student data from our database. When queries started having multiple conditions, the code became insufferably long and impractical-looking, but I just pasted it from another user in my sharing network without understanding what it meant. I have always been an indolent language learner, be it computer or natural language. Being able to order coffee, eat, and find the bathroom has kind of been my take on language. I know, stereotypical person from the US. It was humbling to actually do exercises that tested what I was learning.

It is a shame it took me so long to get everything set up, because I enjoyed what I could finish in the tutorials (especially exploring the Holy Grail; I did not know that Python was named after Monty Python. How cool is that?) and would like to spend more time dinking around. But, alas, spending thirty hours learning Python just doesn’t fit right now and probably won’t help me work toward the proposal I should be writing for my mapping project. But at least I now understand the infrastructure and am set up to revisit the NLTK tutorials.
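For anyone curious what those tutorial exercises feel like, the flavor can be sketched in plain Python without any NLTK downloads. The sample line below is a made-up stand-in for corpus text, and `Counter` plays the role NLTK’s `FreqDist` plays in the book exercises:

```python
from collections import Counter
import re

# A toy stand-in for a corpus text (hypothetical sample line, not NLTK data).
text = "It is a silly place. It is a very silly place indeed."

# Lowercase and split on word characters -- roughly what tokenizing
# and normalizing a text gives you before counting.
tokens = re.findall(r"[a-z']+", text.lower())

# Counter stands in for nltk.FreqDist: it maps each word to its count.
freq = Counter(tokens)
print(freq.most_common(3))
```

With NLTK properly installed, the equivalent exercise swaps in `nltk.FreqDist` over one of the book’s texts; the counting idea is the same.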

The readings this week were also extremely helpful in furthering my understanding of the potentials and limitations of computational text analysis. I would suggest Dong Nguyen et al.’s article for anyone trying to figure out how social and cultural research can be accomplished through computational analysis. Their step-by-step deconstruction of the methodology works well for those of us who were trained in social science and think in a classic hourglass-of-research framework. What was most interesting to me, and goes back to questions I had last week, is the selection and quality of textual data.

The authors discuss the proliferation of born-digital data and the ethical issues with using this information without the consent of those who produced it, and, more important, the potential for open-source methodologies to be used in nefarious ways. Analyzing Reddit for hate speech is a great example: if a model is created that traces hate speech, one could theoretically use the same program to map networks of racists. The other issue I found interesting was the selectivity of which analog texts have been digitized and the need to marry digital research with qualitative studies.

Sandeep Soni et al.’s work on abolitionist newspapers uses robust statistical analysis of semantic change over time, randomly aggregating newspapers from the Accessible Archives, which provides racial and gender identification of the newspapers’ editors. While their results cannot take the place of qualitative research, which relies on human intuition, they direct further research questions and narrative opportunities. Their finding that women-edited papers, both Black and white, were at the vanguard of lexical change opens up interesting avenues for historians. To find that lexical change, they focused on the “neighbor” words around standard words like “freedom,” and, more important, randomized their analysis to make up for gaps in time and in the archive. This is important, and it brings me to my final reflection, which goes back to the idea of availability.
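The “neighbor words” idea can be illustrated with a toy sketch. Soni et al. train word embeddings from the newspapers themselves; the hand-made three-dimensional vectors below are entirely invented for illustration, but the mechanics (represent words as vectors, rank neighbors by cosine similarity) are the core of the technique:

```python
import math

# Hypothetical, hand-made word vectors -- NOT the embeddings Soni et al.
# trained, just enough to show how "neighbors" of a word are ranked.
vectors = {
    "freedom":      [0.9, 0.1, 0.3],
    "liberty":      [0.8, 0.2, 0.4],
    "emancipation": [0.7, 0.3, 0.5],
    "harvest":      [0.1, 0.9, 0.2],
}

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of magnitudes."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def neighbors(word, k=2):
    """Rank every other word by its cosine similarity to `word`."""
    target = vectors[word]
    scored = [(w, cosine(target, v)) for w, v in vectors.items() if w != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

print(neighbors("freedom"))
```

Tracking how a word’s nearest neighbors shift between time slices of a corpus is one standard way to measure semantic change; in this toy data, “freedom” sits closer to “liberty” and “emancipation” than to “harvest.”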

Nguyen et al. point to the importance of understanding why particular newspapers are digitized. Silences in the archive increase as access expands. Historians are applying traditional archival methods to digitized documents, which is not the same as visiting an archive. The bias of this approach can be addressed by marrying computational methods with text selection.

For example (and I don’t want to simply pick on one book, because this practice is widespread and this is an excellent monograph, but it came to mind while reading it), Paul Escott’s work on white supremacy in the Civil War North analyzes northern newspapers to show the explicit racism throughout the region, even among many abolitionist-supporting publications. Only two papers were “systematically studied” in Michigan: the Detroit Free Press and the Grand Haven News. It seemed like an odd choice to pick this random little town, but then again, the digital version of this paper is easily accessible. Escott’s extensive analysis and piles of qualitative data make for a compelling argument. However, some front-end work like the statistical research done by Soni et al. could have answered the question I had while reading the book, echoed by Nguyen et al.: “are the sources representative?” Perhaps I am beginning to see the value in this stuff.
