Method

Overview

The Stratford Heritage Guide website is built on the Bootstrap framework. The map is constructed in Leaflet using Mapbox tiles. There is a layer for each of the guidebooks used in this project, and the icons are numbered to indicate the path one should take through the town based on that guidebook's description. Clicking the link in any given pin takes users to a page for that site that analyzes the way it is described across the guidebooks using topic modeling. This section describes how the documents were prepared and analyzed with digital tools, and how the resulting data is represented visually on each site page.
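The map itself lives in the site's JavaScript, but a minimal sketch of the layer-per-guidebook structure, written here in TypeScript, may help clarify the design. The coordinates, file paths, Mapbox style id, and access token below are placeholders rather than the project's actual values.

    // Minimal sketch of the layer-per-guidebook map structure (placeholder data).
    import * as L from "leaflet";

    interface Site {
      name: string;
      order: number;              // position in this guidebook's suggested route
      coords: [number, number];
      pageUrl: string;            // site page with the topic-model graphs
    }

    const map = L.map("map").setView([52.1917, -1.7083], 15); // Stratford-upon-Avon

    // Mapbox tiles served through Leaflet's tile-layer interface; the style id
    // and access token are placeholders.
    L.tileLayer("https://api.mapbox.com/styles/v1/{id}/tiles/{z}/{x}/{y}?access_token={accessToken}", {
      id: "mapbox/streets-v11",
      accessToken: "YOUR_MAPBOX_TOKEN",
      attribution: "&copy; Mapbox &copy; OpenStreetMap contributors",
    }).addTo(map);

    // One layer group per guidebook; each marker is numbered to trace that
    // book's route, and its popup links to the corresponding site page.
    function guidebookLayer(sites: Site[]): L.LayerGroup {
      const markers = sites.map((site) =>
        L.marker(site.coords, {
          icon: L.divIcon({ className: "route-number", html: String(site.order) }),
        }).bindPopup(`<a href="${site.pageUrl}">${site.name}</a>`)
      );
      return L.layerGroup(markers);
    }

    // Hypothetical data for a single guidebook layer.
    const overlays = {
      "1851 guidebook": guidebookLayer([
        { name: "Shakespeare's Birthplace", order: 1, coords: [52.1944, -1.7081], pageUrl: "sites/birthplace.html" },
      ]),
    };
    L.control.layers(undefined, overlays).addTo(map);

Keeping each guidebook's markers in their own layer group is what allows the Leaflet layer control to switch between one guidebook's itinerary and another's.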

OCR


The majority of the data used in this project was collected from digitized guidebooks spanning the 19th to 21st centuries that are housed on Google Books. The quality of the digitized guidebooks varies significantly across the corpus due to the lack of standardized scanning requirements for the Google Books initiative. I used Google Docs OCR to convert the scans from PDFs to text documents so that they could be read by the Topic Modeling Tool. I then cleaned these OCR copies, which involved removing images and stray characters and correcting errors that resulted from the software being unable to read the font. If the same page was accidentally included multiple times, I deleted the repeated pages. Some pages may be missing from the guidebooks; those missing pages may someday be transcribed from archival or library copies.
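The cleaning itself was done by hand, but the kinds of repetitive checks it involved could be scripted along the lines of the sketch below. The filenames, the character whitelist, and the assumption that a form-feed character separates pages are all illustrative, not a description of the actual cleanup.

    import * as fs from "fs";

    // Strip the kinds of OCR noise described above: stray non-text characters,
    // runs of spaces left by page layouts, and blank gaps where images sat.
    function cleanOcrText(raw: string): string {
      return raw
        .replace(/[^\x20-\x7E\n]/g, "")  // keep printable ASCII and newlines (placeholder whitelist)
        .replace(/[ \t]+/g, " ")         // collapse runs of spaces and tabs
        .replace(/\n{3,}/g, "\n\n")      // collapse long blank gaps
        .trim();
    }

    // Flag pages that were accidentally included twice, assuming a form-feed
    // character separates pages in the exported text (an assumption, not a
    // fact about the Google Docs export).
    function findRepeatedPages(text: string): number[] {
      const pages = text.split("\f").map((p) => p.trim());
      const seen = new Set<string>();
      const repeats: number[] = [];
      pages.forEach((page, i) => {
        if (page && seen.has(page)) repeats.push(i + 1);
        else seen.add(page);
      });
      return repeats;
    }

    const raw = fs.readFileSync("guidebook_1851.txt", "utf8");       // hypothetical filename
    fs.writeFileSync("guidebook_1851_clean.txt", cleanOcrText(raw));
    console.log("Possible repeated pages:", findRepeatedPages(raw));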

Topic Modeling

After preparing the guidebooks as individual text files, I ran them through the Topic Modeling Tool (TMT). I set the parameters to generate twenty topics over 500 iterations of the model. The software then generated a CSV report indicating the frequency of each topic in each guidebook; the topic frequencies for a single guidebook sum to one. To generate data for the individual site pages, I copied the chapter or section on each site from each guidebook and placed it in its own text file. I then ran this collection of section files through the TMT using the same parameters as for the full corpus. Before each run I had to remove the .DS_Store file, as the tool would otherwise process it along with the text files. Ultimately, I chose not to supply a stop word list. I had originally intended to remove "Shakespeare" and "Stratford-upon-Avon," but decided it might be interesting to examine changes in the spelling and use of these terms across time.
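The TMT itself is a graphical tool, so the sketch below only illustrates the housekeeping around each run: removing .DS_Store from the corpus folder and confirming that each document's topic scores in the output CSV sum to one. The folder names, output filename, and column layout are assumptions; only the "one row per document, one proportion per topic, summing to one" shape comes from the description above.

    import * as fs from "fs";
    import * as path from "path";

    // Remove .DS_Store so the tool does not treat it as another document.
    function removeDsStore(corpusDir: string): void {
      const junk = path.join(corpusDir, ".DS_Store");
      if (fs.existsSync(junk)) fs.unlinkSync(junk);
    }

    // Read the topic-proportion CSV and confirm that each guidebook's twenty
    // topic scores sum to (approximately) one. Assumes the document name sits
    // in the first column with one proportion per topic after it.
    function checkTopicProportions(csvPath: string): void {
      const rows = fs.readFileSync(csvPath, "utf8").trim().split("\n").slice(1);
      for (const row of rows) {
        const cells = row.split(",");
        const total = cells.slice(1).map(Number).reduce((sum, score) => sum + score, 0);
        console.log(`${cells[0]}: topic scores sum to ${total.toFixed(3)}`);
      }
    }

    removeDsStore("corpus/full_guidebooks");               // hypothetical folder name
    checkTopicProportions("output/topic-proportions.csv"); // hypothetical output path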

Data Visualizations

Using the data derived from the generated topics, I created a series of interactive data visualizations in Flourish. For each site, I created two graphs: one that includes all of the topics that resulted from the topic models, and one that shows only the main topics. Both are stacked column charts, chosen to clearly show the breakdown of topics for each guidebook. The first graph on each page includes every topic the TMT generated for that site, regardless of how infrequently it appears in that particular text, so each column adds up to the full topic score of one. For the second graph, I included only topics with a score of .15 or above, which generally left about eight to twelve of the twenty topics represented. This removes the infrequent topics and makes the graph somewhat clearer, but the total topic score for each guidebook no longer necessarily adds up to one.
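Flourish is driven through its web interface, so the only scriptable step is preparing the filtered table that gets uploaded for the second graph. A sketch of that filtering, using the .15 threshold described above, might look like the following; the filenames and column layout are assumptions.

    import * as fs from "fs";

    const THRESHOLD = 0.15; // topics at or above this score appear in the second graph

    // Assumes a header row of "guidebook,Topic 1,...,Topic 20" followed by one
    // row per guidebook; writes a long-format table (guidebook, topic, score).
    function filterMainTopics(csvPath: string, outPath: string): void {
      const lines = fs.readFileSync(csvPath, "utf8").trim().split("\n");
      const header = lines[0].split(",");
      const out: string[] = ["guidebook,topic,score"];

      for (const line of lines.slice(1)) {
        const cells = line.split(",");
        const guidebook = cells[0];
        cells.slice(1).forEach((value, i) => {
          const score = Number(value);
          if (score >= THRESHOLD) out.push(`${guidebook},${header[i + 1]},${score}`);
        });
      }
      fs.writeFileSync(outPath, out.join("\n"));
    }

    // Hypothetical filenames for one site's data.
    filterMainTopics("output/birthplace-sections.csv", "flourish/birthplace-main-topics.csv");

The resulting table can then be reshaped or pasted into the Flourish stacked-column-chart template by hand.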