For the past year or so I’ve been using text mining to study the historical meanings of words. (An early version of this work can be found here.) Lately, working with Eric Gidal, I’ve been experimenting with a geospatial approach to text analysis, looking at placenames in particular. What kind of word is “Edinburgh”? What are some of its various connotations?
Our data is drawn primarily from a collection of nineteenth-century British geography: gazetteers, statistical accounts, and topographical dictionaries. (Downloaded from the Internet Archive.) A topographical dictionary is just like it sounds: a big book with an alphabetical list of placenames and a description for each:
As you can see, these are highly structured documents, and the OCR was quite good, so it was simply a matter of parsing the plain text, capturing the name and description of each place, then matching those names to a list of places published by the U.K. government.
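For readers who want to try something similar, here is a minimal sketch of that parse-and-match step. It assumes each entry opens with a headword in capitals followed by a comma; real dictionaries vary, and the sample text, the `ENTRY` pattern, and the `OFFICIAL` lookup table are all invented for illustration rather than taken from our pipeline.

```python
import re

# Hypothetical OCR text: each entry starts with an upper-case headword and a comma.
SAMPLE = """ABERFOYLE, a parish in the county of Perth, containing the village of Aberfoyle.
OBAN, a sea-port town in the district of Lorn, county of Argyll.
INVERARY, a royal burgh in the county of Argyll."""

# Hypothetical stand-in for an official list of places: name -> (lat, lon).
# Coordinates are approximate and for illustration only.
OFFICIAL = {
    "aberfoyle": (56.18, -4.38),
    "oban": (56.41, -5.47),
    "inveraray": (56.23, -5.07),
}

# Headword in capitals, then a comma, then the description to the end of the line.
ENTRY = re.compile(r"^([A-Z][A-Z' -]+), (.+)$", re.MULTILINE)

def parse_entries(text):
    """Return (name, description) pairs for every entry found in the text."""
    return [(m.group(1).title(), m.group(2).strip()) for m in ENTRY.finditer(text)]

def georeference(entries, lookup):
    """Split entries into those found in the lookup table and those that are not."""
    matched, unmatched = [], []
    for name, description in entries:
        coords = lookup.get(name.lower())
        if coords is None:
            unmatched.append(name)
        else:
            matched.append((name, coords, description))
    return matched, unmatched

entries = parse_entries(SAMPLE)
matched, unmatched = georeference(entries, OFFICIAL)
```

In this toy run the historical spelling "Inverary" fails to match the modern "Inveraray", which is the kind of case where exact lookup needs to give way to variant lists or fuzzy matching.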
Here’s a map that performs a hotspot analysis on the place descriptions. It highlights the clusters of points that are most likely to have the word “Argyll” in their descriptions. Towns or cities in Argyllshire, it turns out, were often described as being “in Argyll,” so a hotspot analysis of the word returns points in a region that overlaps almost exactly with the historical boundaries of Argyllshire county. The cool thing about this map is that it discovers administrative boundaries in a bottom-up way, extrapolating those boundaries from text-based data points.
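This post doesn't go into which statistic sits behind the map, but a common choice for hotspot analysis is the Getis-Ord Gi* score, which compares the values around each point with the overall mean. The sketch below is an assumption-laden illustration: it uses planar coordinates, binary weights within a fixed `radius`, and invented toy data in which `values` stands for the rate of a word such as "Argyll" in each description.

```python
import math

def gi_star(points, values, radius):
    """Getis-Ord Gi* z-score for each point, with binary fixed-distance weights.

    Each point counts as its own neighbour, as the "star" variant requires.
    """
    n = len(points)
    mean = sum(values) / n
    # Guard against tiny negative values caused by floating-point rounding.
    variance = max(0.0, sum(v * v for v in values) / n - mean * mean)
    s = math.sqrt(variance)
    scores = []
    for xi, yi in points:
        weights = [1.0 if math.hypot(xi - xj, yi - yj) <= radius else 0.0
                   for xj, yj in points]
        w_sum = sum(weights)
        w_sq_sum = sum(w * w for w in weights)
        numerator = sum(w * v for w, v in zip(weights, values)) - mean * w_sum
        denominator = s * math.sqrt((n * w_sq_sum - w_sum ** 2) / (n - 1))
        # A zero denominator means no variation or no contrast; report 0.
        scores.append(numerator / denominator if denominator else 0.0)
    return scores

# Toy data: three nearby places that mention the word, three distant ones that do not.
POINTS = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
VALUES = [1, 1, 1, 0, 0, 0]
SCORES = gi_star(POINTS, VALUES, radius=2)
```

Large positive scores mark clusters of high values (hotspots) and large negative scores mark clusters of low values; scores beyond roughly 1.96 in either direction are conventionally treated as significant at the 5% level.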
But descriptions aren’t limited to official geographies. Nineteenth-century gazetteers are rich testimonies of environmental and cultural change. So here’s a similar map, only instead of showing points with a high rate of “Argyll,” it shows places with descriptions that use the word “Caledonia.”
“Caledonia” was the ancient Latin name of north Britain. It was often invoked by Scottish nationalists who felt uneasy under English hegemony. Caledonia itself was never really a nation — more a hodgepodge of principalities and clans on the furthest outskirts of the Roman Empire. But it loomed large in the historical imagination of Scottish writers.
This map, then, actually is a map of Caledonia (sort of, at least insofar as “Caledonia” itself was a back-projection of the Scottish Enlightenment). It’s a map of places that nineteenth-century writers associated with the ancient land.
I begin with these examples just to highlight a kind of metaphysical disconnect at the heart of geospatial text analysis. On the one hand it depends on grounding analysis in “real life” data drawn from official sources. On the other hand, by exposing the many definitions and narratives that attach to official placename concepts, it draws out their multiple, often conflicting conceptualizations.
We’ve started calling this line of inquiry historical geospatial semantics.
The Mapping Scotland project
This work stems from Eric’s book on the reception history of Ossian, a mythical Gaelic poet whose epics were “discovered” in the 1760s and 1770s by the poet and translator James Macpherson. Although the epics themselves were largely fabricated, Macpherson borrowed from a real tradition of Ossianic Gaelic verse that he found in old manuscripts and heard recited in northern villages. Controversy over Macpherson’s work simmered for decades (was it a mere hoax?), but Scottish writers were long among his strongest advocates. In the nineteenth century, at the same time that Scotland’s countryside was transformed by industrialization and surveyed by government agencies in unprecedented detail, Scottish historians and geographers scoured the landscapes of the Highlands and the Hebrides for evidence of their nation’s heroic past.
The study of Ossianic geography is the study of how meanings attach to places. At its most abstract, it’s about reconciling two representational systems: natural language and geographic models. In the eighteenth and nineteenth centuries, geographic models took the form of place descriptions, maps, and statistics. The work of these geographers evolved into the British Ordnance Survey, which is still the official geographic agency of the U.K. and now maintains detailed files of the British landscape, available for download and easy to use in GIS software. Perhaps strangely, because of the institutional continuity of the Ordnance Survey, the geographic imagination of the nineteenth century in fact survives in modern computerized systems, in the form of towns, rivers, and lakes named after their mythical predecessors. Ossianic geographies permeate modern GIS just as they infiltrated and sparked the imaginations of Macpherson and his later defenders.
But how does language touch its physical environments? How should we characterize the relationships among meanings and the spaces of their circulation?
Our idea was to dig through some representative texts and mark up all the places mentioned, referring them back to places named in the Ordnance Survey, while also capturing their descriptions. Then, we automatically georeferenced 47 volumes of nineteenth-century gazetteers and topographical dictionaries. Across this corpus, which totals more than 7 million words, we captured 70,000 descriptions of almost 20,000 unique places. We ask: How were Scottish places described? How did Ossianic and official geographies differ, and what, if any, literary, economic, and environmental concepts informed them?
Luckily, although our questions and primary materials are somewhat novel, we’re not the first scholars to try something like this. Geospatial text analysis has a small following within both digital humanities and geography (although the two groups don’t seem to talk to one another, as far as I can tell). Matthew Wilkens has been a leading proponent of quantitative approaches to literary geography. So, too, have Ian Gregory and Andrew Hardie, whose essays on this topic cover very important technical ground. I recommend in particular their chapter from the recent collection, Deep Maps and Spatial Narratives (IUP, 2015). [1]
Among geographers, the subfield of “geospatial semantics” has a lot in common with work like Gregory’s and Hardie’s. It studies geographic concepts and designs formal ontologies for representing those concepts in software systems. Like other aspects of natural language, human words for places and spaces lack crisp definition. As Werner Kuhn has emphasized, “geospatial information is often based on human perception and social agreements” and for this reason will be marked by “vagueness, uncertainty, and [differing] levels of granularity.” Geospatial software systems must be sensitive to the various and conflicting “naive” geographies of human discourse. As Andrea Ballatore and colleagues have remarked, “To share geographic information across a community, it is necessary to extract concepts from the chaotic repository of implicit knowledge that lies in human minds.” [2]
The point here is not that human cognition is flawed, but that place is linguistically constructed. From a GIS perspective, this has practical consequences. Environmental, commercial, and government data needs to operate across very different linguistic models and so must be sensitive to the nuances of language. For the spatial humanities, this sensitivity has a very different connotation: it suggests that GIS, with its emphasis on modeling multiplicity, shares a real commensurability with postmodern geography.
Humanists looking to do this kind of work are best off starting with Gregory’s and Hardie’s work, I think. Adam Crymble has a nice post in Programming Historian if you like working in Python. Within geospatial semantics, Angela Schwering has written some excellent surveys of the field, and the work of Krzysztof Janowicz and Andrea Ballatore, to name just a few others, is pretty exciting.
Geospatial semantics: profiles & footprints
When combined with corpus linguistics, geospatial analysis sits at the nexus of two theoretical traditions. The first comes from linguistics, from the work of Zellig Harris and J. R. Firth in the 1950s: the distributional hypothesis of computational semantics says that similar words will tend to appear in similar contexts. The second comes from geography: the notion of spatial autocorrelation proposes that places that are physically close to each other will tend to share similar properties. These ideas can be combined into one proposition (with a dash of temporality thrown in for taste):
Similar places at similar times will tend to be described using similar words.
This hypothesis proposes that geospatial text analysis should work for answering these kinds of questions:
How were the Highlands described? Which regions were most distinctly imagined as sites of Ossianic epic? Which towns were most affected by industrialization? Which regions were affiliated with ancient Scottish nations, peoples, or heroes? Which places bear the mark of myth?
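The spatial half of that proposition can be checked for any single word. One standard global measure of spatial autocorrelation is Moran's I; the sketch below is a simplified illustration rather than our actual workflow, and it assumes planar coordinates, binary weights within a fixed `radius`, and invented toy values standing for a word's rate at each place.

```python
import math

def morans_i(points, values, radius):
    """Global Moran's I with binary weights for distinct points within `radius`."""
    n = len(points)
    mean = sum(values) / n
    deviations = [v - mean for v in values]
    cross_products = 0.0
    weight_total = 0.0
    for i, (xi, yi) in enumerate(points):
        for j, (xj, yj) in enumerate(points):
            if i != j and math.hypot(xi - xj, yi - yj) <= radius:
                weight_total += 1.0
                cross_products += deviations[i] * deviations[j]
    squared = sum(d * d for d in deviations)
    if weight_total == 0 or squared == 0:
        return 0.0  # no neighbours or no variation: nothing to measure
    return (n / weight_total) * (cross_products / squared)

# Toy data: two groups of three neighbouring places.
POINTS = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
CLUSTERED = [1, 1, 1, 0, 0, 0]  # neighbours share the same value
MIXED = [1, 0, 1, 0, 1, 0]      # neighbours tend to differ

I_CLUSTERED = morans_i(POINTS, CLUSTERED, radius=2)
I_MIXED = morans_i(POINTS, MIXED, radius=2)
```

Positive values mean that nearby places tend to resemble each other on that word, and negative values mean that neighbours tend to differ.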
When analyzing georeferenced texts, there are two basic ways to go about it. You can either choose one place and look at all the ways it has been described, or you can choose one word or topic, then map all the places where it was invoked.
We find these definitions helpful:
The semantic profile of a place refers to the linguistic distribution of the vocabulary of the collocates of its name(s). The semantic footprint of a concept refers to the geographic distribution of its placename collocates.
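In code, the two definitions are mirror images of each other. This sketch makes some simplifying assumptions: it treats the whole description as the collocation window, uses a tiny hand-made stopword list, and runs over invented `RECORDS` rather than the real corpus.

```python
import re
from collections import Counter

# Hypothetical georeferenced records: (placename, description).
RECORDS = [
    ("Oban", "a sea-port with herring boats in Argyll"),
    ("Inveraray", "a burgh in Argyll noted for herring"),
    ("Airdrie", "a town with coal pits and iron works"),
    ("Oban", "a harbour town near the caves of Fingal"),
]

STOPWORDS = {"a", "in", "of", "the", "with", "and", "for", "near"}

def tokenize(text):
    """Lower-case the text and keep alphabetic tokens that are not stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def semantic_profile(place, records):
    """Word counts across every description attached to one place."""
    counts = Counter()
    for name, description in records:
        if name == place:
            counts.update(tokenize(description))
    return counts

def semantic_footprint(term, records):
    """Counts of a term's occurrences, keyed by the place being described."""
    counts = Counter()
    target = term.lower()
    for name, description in records:
        hits = tokenize(description).count(target)
        if hits:
            counts[name] += hits
    return counts

oban_profile = semantic_profile("Oban", RECORDS)
herring_footprint = semantic_footprint("herring", RECORDS)
```

Joining a footprint to coordinates gives the points to map, and comparing profiles gives a way to compare places.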
These two modes of analysis suggest two corresponding ways to think about similarity:
Places are semantically similar if they share similar profiles. Concepts are geographically similar if they share similar footprints.
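One simple way to put a number on "similar" is cosine similarity between count vectors. That is our choice for illustration here, not the only reasonable measure, and the counts below are invented.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse count vectors stored as dicts."""
    shared = set(a) & set(b)
    dot = sum(a[k] * b[k] for k in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical semantic profiles (word -> count) for three places.
oban = {"herring": 3, "boats": 2, "harbour": 1}
inveraray = {"herring": 2, "boats": 1, "castle": 1}
airdrie = {"coal": 4, "iron": 2, "pits": 1}

fishing_pair = cosine(oban, inveraray)   # two fishing towns
mixed_pair = cosine(oban, airdrie)       # a fishing town and a mining town
```

The same function works on footprints: swap the word counts for counts keyed by placename, and two concepts score as similar when they are invoked in the same places.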
This suggests a path forward for how “distant reading” might contribute to our knowledge of historical geography. If places change in culturally significant ways, those changes should be reflected in the discourse. It also suggests how GIS can contribute to intellectual history, but that will be a bit trickier. From the perspective of intellectual history, geosemantics opens a set of questions we only now can formulate. How were ideas distributed geographically? What kinds of ideas were distributed in what kinds of ways?
Industrialization & epic
Much more could be said. Right now, though, our research is still at the “Let’s see if we can get the numbers to show us what we already expect to find” phase. We want to be able to track cultural geographies of various kinds, to identify competing cultural landscapes, and to account for change over time. We start with Ossian because we know it well.
Here are a couple of maps showing the results of a topic model we ran over the topographical dictionaries. We performed a hotspot analysis, just as above, to identify spatially distributed clusters of the texts. The first one shows descriptions of fishing (“herring”, “cod”) and waterfront development (“docks”, “boats”).
This one shows results from the same model, but displaying a topic that has to do with the mining industry, with a special emphasis on coal:
The other maps show formally and culturally defined regions, like “Argyll” and “Caledonia.” By contrast, maps of fishing and mining expose functionally defined regions.
Lastly, here’s the “Caledonia” map again, but with an extra layer showing a hotspot analysis of the word “Fingal.” Fingal was the name of the warrior-king in Ossian’s epics. The “Fingal” points are blue, and the purple region shows where that word overlaps with “Caledonia.”
This purple region overlaps perfectly with the Scottish Highlands, as well as with the points most closely associated with Ossian by scholars, whose work we track separately. (But I’ll spare you yet another map.)
Notice that the area associated with “Fingal” is very distinct from the region associated with the Scottish mining industry in the map above, but “Caledonia” crosses both regions. Remember that Eric’s book was about how the Ossian controversy erupted during a moment of important cultural shift in Scotland: industrialization combined with nationalism to drive geographers and literary historians to seek out a mythic past and to locate that past quite concretely in the landscape of northern Scotland.
These maps show that dynamic quite distinctly, and it’s now actually possible to perform statistical analyses that show, for example, that Ossianic and industrializing discourses were negatively correlated geographically (that is, they tended to separate from each other) while both being positively correlated to Scottish nationalism more generally.
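The simplest version of such a test is a Pearson correlation between two per-place rates. The sketch below uses invented numbers, not our results, and it ignores the spatial dependence between neighboring places, which a more careful analysis would need to account for.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a constant series has no defined correlation; report 0
    return covariance / (spread_x * spread_y)

# Invented toy rates per place (occurrences per 1,000 words), one entry per place.
OSSIANIC = [5, 4, 3, 1, 0]     # e.g. vocabulary such as "Fingal"
INDUSTRIAL = [0, 1, 2, 4, 6]   # e.g. vocabulary such as "coal"

r = pearson(OSSIANIC, INDUSTRIAL)
```

A negative `r` means the two vocabularies tend to appear in different places, and a positive one means they tend to appear together.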
In their broad outlines, these results fit well with a lot of what we know about nineteenth-century Scottish literature and history. As soon as you dig down a bit, though, they raise innumerable questions.
Thanks go out to Eric for sharing his project with me, to the University of South Carolina for our initial seed grant, to Sarah Thompson, a PhD student who worked with me last summer on the initial designs, and to Jeannie Britton and Tony Jarrells, who have consulted along the way.
[1] See Gregory et al., “Spatializing and Analyzing Digital Texts: Corpora, GIS, and Places,” in Deep Maps and Spatial Narratives (IUP, 2015). This essay builds from Gregory’s and Hardie’s earlier article, “Visual GISting: Bringing Together Corpus Linguistics and Geographical Information Systems,” Literary & Linguistic Computing 26, 3 (2011): 297-314. The newer essay is more detailed and spares readers the unfortunate neologism, “visual gisting.”
[2] See Werner Kuhn, “Geospatial Semantics: Why, of What, and How?” Journal on Data Semantics III (2005): 1-24; and Andrea Ballatore, David C. Wilson, and Michela Bertolotto, “Computing the Semantic Similarity of Geographic Terms Using Volunteered Lexical Definitions,” International Journal of Geographical Information Science 27, 10 (2013): 2099-2118.