Recent Posts

Microsoft Annual Shareholders Meeting

AT&T signs $2 billion cloud deal with Microsoft

While AWS leads the cloud infrastructure market by a wide margin, Microsoft isn’t doing too badly, sitting solidly in second place as the only other company with double-digit share. Today, it announced a major deal with AT&T that encompasses both Azure cloud infrastructure services and Office 365.

A person with knowledge of the contract pegged the combined deal at a clean $2 billion, a nice feather in Microsoft’s cloud cap. According to a Microsoft blog post announcing the deal, AT&T aims to move the vast majority of its non-networking workloads to the public cloud by 2024, and Microsoft just got itself a big slice of that pie, doubtless one that rivals AWS, Google and IBM (which closed the $34 billion Red Hat deal last week) would dearly have loved to win.

As you would expect, Microsoft CEO Satya Nadella framed the deal in lofty terms of transformation and innovation. “Together, we will apply the power of Azure and Microsoft 365 to transform the way AT&T’s workforce collaborates and to shape the future of media and communications for people everywhere,” he said in a statement in the blog post announcement.

To that end, the two companies are hoping to collaborate on emerging technologies like 5G, and believe that by combining Azure with AT&T’s 5G network they can help customers create new kinds of applications and solutions. In one example cited in the blog post, the speed of the 5G network combined with Azure AI-powered live voice translation could let first responders communicate in the moment with someone who speaks a different language.

It’s worth noting that while the plan to bring Office 365 to AT&T’s 250,000 employees is a nice win, that piece of the deal falls under the SaaS umbrella, so it won’t help Microsoft’s cloud infrastructure market share. Still, no matter how you look at it, this is a big deal.

Parsing the EEBO-TCP Imprint

Earlier today I posted an image of EEBO-TCP as a Giant Hairball, and I’ve had some questions about how the data was put together and a few requests to see it, so here’s a brief narrative with some download links at the bottom.

Inspired by the incredible work over at the Early Modern OCR Project (eMOP) led by Laura Mandell, I thought I should share some of the initial work I’ve done parsing early modern imprints. eMOP recently released data from their project in XML form, linking parsed imprints to EEBO-TCP and ESTC data. Their files can be found here: https://github.com/Early-Modern-OCR/ImprintDB

Identifying and differentiating the printers and booksellers who produced old books is rarely a straightforward process. Publication data from title pages are notoriously irregular. Spelling variations in names and incomplete or inaccurate attributions are common. Names are often given in Latin and often listed only as initials. As a result, title page imprints appear in forms like this: “London: printed by T.N. for H. Heringman.” For this reason, library catalogs, which have been inherited by digital projects like Early English Books Online, typically offer only the character string of each imprint, leaving it to human readers to figure out who these people are.

Cleaning up publication metadata and making it available for search and analysis would have many important research applications for scholars working on the history of publishing, authorship, and other areas of print history. My own interests are in network analysis. Who published with whom? How did different political, religious, and literary ideas circulate in the print marketplace? Especially now that so much of the early record is available in full-text form, improving the metadata is a major task facing scholars right now.

Matthew Christy, eMOP’s co-project manager and lead developer, worked with their team to break the imprints up into attribution statements, marking out “Printed for” and “Printed by” relationships. Their work is hugely valuable. Working with Travis Mullen here at the University of South Carolina, we tackled the problem from a different angle. Our goal was to pull out the names to see if we could reconcile common entities across the catalog. If one book was attributed to “T.N.”, another to “Thomas Newcombe”, and a third to “T. Newcomb”, we wanted each to be attributed to the same person. Using a combination of algorithmic matching and hand correction, we figured this should be doable. The results are here: http://github.com/michaelgavin/htn

Before delving into our process, a few caveats should be kept in mind. First of all, imprints, as I mentioned above, are less than perfectly reliable. Names were often left off completely; sometimes false names were added in their place. Like eMOP’s, our technique does nothing to solve this problem. We can only parse the information available. Even in the case of false imprints, though, it makes sense to us to capture what the books actually say.

Second, we haven’t yet reconciled the names to existing name authority files, like those published by the Library of Congress or VIAF. Many of our printers and booksellers are included in linked data resources, but many aren’t. In the long term, we’d like to get them all into shape to be linked up to other resources, but we have set that ambition aside for now.

Third, for both ideological and practical reasons, we looked only at books freely available from the EEBO Text Creation Partnership. On principle, I don’t really like working with proprietary data. Even among the freely available material, though, there were practical problems. American imprints from Evans and eighteenth-century books from ECCO were far more difficult to process (for reasons that will become clear).

Lastly, as with any computer-aided process, some errors slipped through, so our data’s still far from perfect. The initial pass returned a little over 30,000 attributions, and of those about 5% were easy-to-spot errors. We tried to clean these out by hand, but errors and omissions certainly remain. I am putting the initial data out now, in part, to invite collaboration from anyone who might be interested in building up or further correcting the metadata.
What did we do?

Basically, we designed a little decision-tree algorithm to read each imprint, pull out name words, and then find likely matches in the British Book Trade Index.

What makes the BBTI a great resource is that they include almost everything. If a name is on an imprint, there’s a very good chance that it’s somewhere in the BBTI. The other great thing about BBTI is that, although they don’t standardize their names, they do provide one crucial piece of data: trade dates. Unlike birth or death years, trade dates refer to a person’s professional life. The initial trade date is usually the year of the first imprint they appear on or the year they were taken on as an apprentice. This means we didn’t have to search the entire BBTI for every book; we just had to look for names in the small subset of stationers active around the time of each book.

We designed a custom set of processing rules for the imprints. Names of streets and neighborhoods were taken out, as were names of bookshops. So

“Oxon : Printed by L. Lichfield and are to be sold by A. Stephens, 1683.”

gets reduced to a vector of five words:

[1] “Oxon” “L” “Lichfield” “A” “Stephens”
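In R, that reduction step might look something like the sketch below. The stopword list here is an illustrative subset; our actual rule set, which also strips street, neighborhood, and bookshop names, was considerably longer.

# Reduce an imprint to candidate name words by stripping digits,
# punctuation, and common non-name vocabulary (illustrative subset).
imprint_stopwords <- c("printed", "by", "and", "are", "to", "be", "sold", "for")
clean_imprint <- function(imprint) {
  words <- unlist(strsplit(imprint, "[[:space:][:punct:][:digit:]]+"))
  words <- words[words != ""]
  words[!tolower(words) %in% imprint_stopwords]
}
clean_imprint("Oxon : Printed by L. Lichfield and are to be sold by A. Stephens, 1683.")
# [1] "Oxon"      "L"         "Lichfield" "A"         "Stephens"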

The core process then had three steps:

1. Subset the BBTI to look only at entries where the initial trade date was within range of the imprint date.
2. For each word in the imprint, search by last name, looking for matches or near matches.
3. Look at the word to the left of the target word in the imprint. Select only those candidates with the same first letter, then choose the closest match. If there are multiple matches or no matches, just skip to the next word in the imprint.
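A rough sketch of that loop in R is below. The BBTI column names (lastName, firstName, tradeDate) and the fuzzy-matching tolerances are illustrative assumptions, not the exact ones we used.

# Sketch of the core matching step. `bbti` is assumed to be a data
# frame with columns bbtiID, lastName, firstName, tradeDate, role.
match_imprint <- function(words, imprint_year, bbti, window = 50) {
  # Step 1: consider only stationers whose initial trade date
  # falls within range of the imprint date.
  active <- bbti[bbti$tradeDate >= imprint_year - window &
                 bbti$tradeDate <= imprint_year, ]
  matches <- list()
  for (i in seq_along(words)[-1]) {
    surname <- words[i]
    initial <- words[i - 1]  # the word to the left of the target word
    # Step 2: exact or near matches on the last name.
    hits <- active[agrepl(surname, active$lastName, max.distance = 1), ]
    # Step 3: keep candidates whose first name shares the left word's
    # first letter, then take the closest surname match.
    hits <- hits[substr(hits$firstName, 1, 1) == substr(initial, 1, 1), ]
    if (nrow(hits) == 0) next                # no match: skip
    d <- adist(surname, hits$lastName)
    if (sum(d == min(d)) > 1) next           # multiple matches: skip
    matches[[length(matches) + 1]] <- hits[which.min(d), ]
  }
  do.call(rbind, matches)
}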

Using the example above, the algorithm searched through four candidate pairs:

“Oxon L”, “L Lichfield”, “Lichfield A”, and “A Stephens”

The first and the third didn’t hit any matches. The second and the fourth returned these two:

bbtiID  name                   tradeDate  TCP     Role
483541  Lichfield, Leonard II  1657       A36460  Printer, Bookseller (antiquarian)
483551  Stephens, Anthony      1657       A36460  Bookseller

The result was almost always the exact name I would have chosen if I’d looked it up by hand. The system differentiates Leonard Lichfield Jr. from Leonard Lichfield Sr. by the publication date, and the roles are just the occupation titles given by BBTI. Unlike eMOP’s, these don’t differentiate “Printed for” from “Printed by” statements, but the roles seemed generally very consistent. (It’ll be interesting now to cross-reference our results with theirs.) Overall the algorithm did a good job catching spelling variation (even, often, in the Latin) while also distinguishing the Jacobs and Johns from the Josephs.

There were lots of special cases that had to be handled separately. Because of “Saint Paul’s Churchyard” in all its variations, the name “Paul” was particularly difficult and had to have its own set of pre-processing rules. Surnames derived from first names, like “Johnson,” “Thomson,” or “Williams,” caused lots of little problems, but they were easy to clean out in post-processing. Names like “Iohn” and “VViliam” were changed in pre-processing to “John” and “William.” There were quite a few cases like these, but not too many for the relatively small EEBO dataset. Our technique might not scale up to the entire ESTC, though. As I mentioned above, about 5% of the results were obviously false matches, and I have no doubt that a small number slipped through my attempts to catch them by hand. No effort has yet been made to measure the accuracy of the dataset as it exists. The ESTC is an order of magnitude larger, which means the initial results would need to be better. Also, because our algorithm looks for first name or first initial matches, it doesn’t work nearly as well on eighteenth-century imprints, when many printers and sellers referred to themselves as “Mr. So-and-so.” Some adjustments would need to be made.
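A few of those pre-processing substitutions, sketched in R (an illustrative subset):

# Illustrative spelling normalizations applied before matching.
normalize_name <- function(x) {
  x <- gsub("^Io", "Jo", x)            # "Iohn"    -> "John"
  x <- gsub("^VV", "W", x)             # "VViliam" -> "Wiliam"
  x <- gsub("^Wiliam$", "William", x)  # then fix the single "l"
  x
}
normalize_name(c("Iohn", "VViliam"))
# [1] "John"    "William"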

Overall, after hand correction, the process resulted in about 29,000 stationer attributions over 22,000 EEBO-TCP entries. The total dataset, including authors and others, includes 64,887 attributions over the EEBO, Evans, and ECCO TCP documents.

Historical Geospatial Semantics

For the past year or so I’ve been using text mining to study the historical meanings of words. (An early version of this work can be found here.) Lately, working with Eric Gidal, I’ve been experimenting with a geospatial approach to text analysis, looking at placenames in particular. What kind of word is “Edinburgh”? What are some of its various connotations?

Our data is drawn primarily from a collection of nineteenth-century British geography: gazetteers, statistical accounts, and topographical dictionaries. (Downloaded from the Internet Archive.) A topographical dictionary is just what it sounds like: a big book with an alphabetical list of placenames and a description for each.

These are highly structured documents, and the OCR was quite good, so it was simply a matter of parsing the plain text, capturing the name and description of each place, then matching those names to a list of places published by the U.K. government.
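A sketch of that parsing step in R. The entry pattern here (an all-caps headword followed by a comma at the start of a line) is a simplification of the formats we actually handled, and os_places stands in for the official placename table:

# Split a plain-text topographical dictionary into (name, description)
# pairs, assuming each entry opens with an all-caps headword.
parse_entries <- function(lines) {
  is_head <- grepl("^[A-Z][A-Z'-]+,", lines)  # e.g. "EDINBURGH, the capital..."
  entry_id <- cumsum(is_head)
  entries <- split(lines[entry_id > 0], entry_id[entry_id > 0])
  data.frame(
    name = sub(",.*$", "", vapply(entries, `[`, "", 1)),
    description = vapply(entries, paste, "", collapse = " "),
    stringsAsFactors = FALSE
  )
}
# Then match extracted names against the official placename list:
# places <- merge(parse_entries(lines), os_places, by = "name")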

Here’s a map showing a hotspot analysis of the place descriptions. It highlights the clusters of points that are most likely to have the word “Argyll” in their descriptions. Towns and cities in Argyllshire, it turns out, were often described as being “in Argyll,” so a hotspot analysis of the word returns points in a region that overlaps almost exactly with the historical boundaries of Argyllshire. The cool thing about this map is that it discovers administrative boundaries in a bottom-up way, extrapolating them from text-based data points.
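Under the hood, a hotspot analysis of this kind can be run with the Getis-Ord local G statistic. A minimal sketch using the spdep package, assuming a places table with coordinates and descriptions (the k-nearest-neighbour setting and the cutoff are illustrative):

library(spdep)
# Indicator: does a place's description mention "Argyll"?
x <- as.numeric(grepl("Argyll", places$description, ignore.case = TRUE))
# Spatial weights from each point's 8 nearest neighbours.
coords <- cbind(places$lon, places$lat)
nb <- knn2nb(knearneigh(coords, k = 8))
lw <- nb2listw(nb, style = "B")
# Getis-Ord local G: high z-scores mark clusters of "Argyll" mentions.
places$gi <- as.numeric(localG(x, lw))
hotspots <- places[places$gi > 1.96, ]   # roughly 95% confidence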

But descriptions aren’t limited to official geographies. Nineteenth-century gazetteers are rich testimonies of environmental and cultural change. So here’s a similar map, only instead of showing points with a high rate of “Argyll,” it shows places with descriptions that use the word “Caledonia.”

“Caledonia” was the ancient Latin name of north Britain. It was often invoked by Scottish nationalists who felt uneasy under English hegemony. Caledonia itself was never really a nation — more a hodgepodge of principalities and clans on the furthest outskirts of the Roman Empire. But it loomed large in the historical imagination of Scottish writers.

This map, then, actually is a map of Caledonia (sort of, at least insofar as “Caledonia” itself was a back-projection of the Scottish Enlightenment). It’s a map of places that nineteenth-century writers associated with the ancient land.

I begin with these examples just to highlight a kind of metaphysical disconnect at the heart of geospatial text analysis. On the one hand it depends on grounding analysis in “real life” data drawn from official sources. On the other hand, by exposing the many definitions and narratives that attach to official placename concepts, it draws out their multiple, often conflicting conceptualizations.

We’ve started calling this line of inquiry historical geospatial semantics.

The Mapping Scotland project

This work stems from Eric’s book on the reception history of Ossian, a mythical Gaelic poet whose epics were “discovered” in the 1760s and 1770s by the poet and translator, James Macpherson. Although the epics themselves were largely fabricated, Macpherson borrowed from a real tradition of Ossianic Gaelic verse that he found in old manuscripts and heard recited in northern villages. Controversy over Macpherson’s work simmered for decades — Was it a mere hoax? — but Scottish writers were long among his strongest advocates. In the nineteenth century, at the same time that Scotland’s countrysides were transformed by industrialization and surveyed by government agencies with unprecedented levels of detail, Scottish historians and geographers scoured the landscapes of the Highlands and the Hebrides for evidence of their nation’s heroic past.

The study of Ossianic geography is the study of how meanings attach to places. At the most abstract, it’s about reconciling two representational systems: natural language and geographic models. In the eighteenth and nineteenth centuries, geographic models took the form of place descriptions, maps, and statistics. The work of these geographers evolved into the British Ordnance Survey, which is still the official geographic agency of the U.K. and now maintains detailed files of the British landscape, available for download and easy to use in GIS software. Perhaps strangely, because of the institutional continuity of the Ordnance Survey, the geographic imagination of the nineteenth century in fact survives in modern computerized systems, in the forms of towns, rivers, and lakes named after their mythical predecessors. Ossianic geographies permeate modern GIS systems just as they infiltrated and sparked the imaginations of Macpherson and his later defenders.

But how does language touch its physical environments? How should we characterize the relationships among meanings and the spaces of their circulation?

Our idea was to dig through some representative texts and mark up all the places mentioned, referring them back to places named in the Ordnance Survey, while also capturing their descriptions. Then, we automatically georeferenced 47 volumes of nineteenth-century gazetteers and topographical dictionaries. Across this corpus, which totals more than 7 million words, we captured 70,000 descriptions of almost 20,000 unique places. We ask: How were Scottish places described? How did Ossianic and official geographies differ, and what, if any, literary, economic, and environmental concepts informed them?

Geospatial semantics

Luckily, although our questions and primary materials are somewhat novel, we’re not the first scholars to try something like this. Geospatial text analysis has a small following within both digital humanities and geography (although the two groups don’t seem to talk to one another, as far as I can tell). Matthew Wilkens has been a leading proponent of quantitative approaches to literary geography. So, too, have Ian Gregory and Andrew Hardie, whose essays on this topic cover very important technical ground. I recommend in particular their chapter from the recent collection, Deep Maps and Spatial Narratives (IUP, 2015).[1]

Among geographers, the subfield of “geospatial semantics” has a lot in common with work like Gregory’s and Hardie’s. It studies geographic concepts and designs formal ontologies for representing those concepts in software systems. Like other aspects of natural language, human words for places and spaces lack crisp definition. As Werner Kuhn has emphasized, “geospatial information is often based on human perception and social agreements” and for this reason will be marked by “vagueness, uncertainty, and [differing] levels of granularity.” Geospatial software systems must be sensitive to the various and conflicting “naive” geographies of human discourse. As Andrea Ballatore et al. have remarked, “To share geographic information across a community, it is necessary to extract concepts from the chaotic repository of implicit knowledge that lies in human minds.”[2]

The point here is not that human cognition is flawed, but that place is linguistically constructed. From a GIS perspective, this has practical consequences. Environmental, commercial, and government data needs to operate across very different linguistic models and so must be sensitive to the nuances of language. For the spatial humanities, this sensitivity has a very different connotation: it suggests that GIS, with its emphasis on modeling multiplicity, shares a real commensurability with postmodern geography.

Humanists looking to do this kind of work are best off starting with Gregory’s and Hardie’s work, I think. Adam Crymble has a nice post in the Programming Historian if you like working in Python. Angela Schwering has also written some excellent surveys of the field, and the work of Krzysztof Janowicz and Andrea Ballatore, to name just a few others, is pretty exciting.

Geospatial semantics: profiles & footprints

When combined with corpus linguistics, geospatial analysis sits at the nexus of two theoretical traditions. The first comes from the work of Zellig Harris and J. R. Firth in the 1950s: the distributional hypothesis of computational semantics, which says that similar words will tend to appear in similar contexts. The second comes from geography: the notion of spatial autocorrelation, which proposes that places that are physically close to each other will tend to share similar properties. These ideas can be combined into one proposition (with a dash of temporality thrown in for taste):

Similar places at similar times will tend to be described using similar words.

This hypothesis proposes that geospatial text analysis should work for answering these kinds of questions:

How were the Highlands described? Which regions were most distinctly imagined as sites of Ossianic epic? Which towns were most affected by industrialization? Which regions were affiliated with ancient Scottish nations, peoples, or heroes? Which places bear the mark of myth?

When analyzing georeferenced texts, there are two basic ways to go about it. You can either choose one place and look at all the ways it has been described, or you can choose one word or topic, then map all the places where it was invoked.

We find these definitions helpful:

  • The semantic profile of a place refers to the linguistic distribution of the vocabulary collocated with its name(s).
  • The semantic footprint of a concept refers to the geographic distribution of its placename collocates.

These two modes of analysis suggest two corresponding ways to think about similarity:

  • Places are semantically similar if they share similar profiles.
  • Concepts are geographically similar if they share similar footprints.
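Both notions can be read off a single place-by-term matrix. A minimal sketch, assuming a count matrix M with one row per place and one column per description word:

# M: place-by-term count matrix (rows named by place, columns by word).
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))
profile   <- function(place) M[place, ]  # a place's vocabulary distribution
footprint <- function(term)  M[, term]   # a word's distribution over places
# Places are semantically similar if they share similar profiles:
cosine(profile("Edinburgh"), profile("Glasgow"))
# Concepts are geographically similar if they share similar footprints:
cosine(footprint("caledonia"), footprint("fingal"))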

This framework suggests a path forward for how “distant reading” might contribute to our knowledge of historical geography. If places change in culturally significant ways, those changes should be reflected in the discourse. It also suggests how GIS can contribute to intellectual history, but that will be a bit trickier. From the perspective of intellectual history, geosemantics opens a set of questions we only now can formulate. How were ideas distributed geographically? What kinds of ideas were distributed in what kinds of ways?

Industrialization & epic

Much more could be said. Right now, though, our research is still at the “Let’s see if we can get the numbers to show us what we already expect to find” phase. We want to be able to track cultural geographies of various kinds, to identify competing cultural landscapes, and to account for change over time. We start with Ossian because we know it well.

Here are a couple of maps showing the results of a topic model we ran over the topographical dictionaries. We performed a hotspot analysis, just as above, to identify spatially distributed clusters of the texts. The first one shows descriptions of fishing (“herring”, “cod”) and waterfront development (“docks”, “boats”).

This one shows results from the same model, but displaying a topic that has to do with the mining industry, with a special emphasis on coal.
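For the curious, here is a sketch of the pipeline behind these maps, using the tm and topicmodels packages; the number of topics, the seed, and the topic index are illustrative choices:

library(tm)
library(topicmodels)
# Build a document-term matrix from the place descriptions.
corpus <- VCorpus(VectorSource(places$description))
dtm <- DocumentTermMatrix(corpus,
         control = list(removePunctuation = TRUE, stopwords = TRUE))
# Fit an LDA topic model and pull each place's weight for one topic,
# e.g. the fishing topic ("herring", "cod", "docks", "boats").
lda <- LDA(dtm, k = 30, control = list(seed = 1))
topic_weights <- posterior(lda)$topics   # place-by-topic matrix
places$fishing <- topic_weights[, 7]     # illustrative topic index
# These weights feed the same hotspot analysis shown earlier.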

The other maps show formally and culturally defined regions, like “Argyll” and “Caledonia.” By contrast, maps of fishing and mining expose functionally defined regions.

Lastly, here’s the “Caledonia” map again, but with an extra layer showing a hotspot analysis of the word “Fingal.” Fingal was the name of the warrior-king in Ossian’s epics. The “Fingal” points are blue, and the purple region shows where that word overlaps with “Caledonia.”

This purple region overlaps perfectly with the Scottish Highlands, as well as with the points most closely associated with Ossian by scholars, whose work we track separately. (But I’ll spare you yet another map.)

Notice that the area associated with “Fingal” is very distinct from the region associated with the Scottish mining industry in the map above, but “Caledonia” crosses both regions. Remember that Eric’s book was about how the Ossian controversy erupted during a moment of important cultural shift in Scotland: industrialization combined with nationalism to drive geographers and literary historians to seek out a mythic past and to locate that past quite concretely in the landscape of northern Scotland.

These maps show that dynamic quite distinctly, and it’s now actually possible to perform statistical analyses that show, for example, that Ossianic and industrializing discourses were negatively correlated geographically (that is, they tended to separate from each other) while both being positively correlated to Scottish nationalism more generally.
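Once each discourse has a per-place hotspot score, that claim reduces to a correlation. A minimal sketch with illustrative column names:

# Hotspot z-scores computed as above, one column per term.
cor(places$fingal_gi, places$mining_gi)      # negative: discourses separate
cor(places$fingal_gi, places$caledonia_gi)   # positive
cor(places$mining_gi, places$caledonia_gi)   # positive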

In their broad outlines, these results fit well with a lot of what we know about nineteenth-century Scottish literature and history. As soon as you dig down a bit, though, they raise innumerable questions.

Notes

Thanks go out to Eric for sharing his project with me, to the University of South Carolina for our initial seed grant, to Sarah Thompson, a PhD student who worked with me last summer to work through the initial designs, and to Jeannie Britton and Tony Jarrells, who have consulted along the way.

  1. See Gregory et al., “Spatializing and Analyzing Digital Texts: Corpora, GIS, and Places,” in Deep Maps and Spatial Narratives (IUP, 2015). This essay builds from Gregory’s and Hardie’s earlier article, “Visual GISting: Bringing Together Corpus Linguistics and Geographical Information Systems,” Literary & Linguistic Computing 26, 3 (2011): 297-314. The newer essay is more detailed and spares readers the unfortunate neologism, “visual gisting.”
  2. See Werner Kuhn, “Geospatial Semantics: Why, of What, and How?” Journal on Data Semantics III (2005): 1-24; and Andrea Ballatore, David C. Wilson, and Michela Bertolotto, “Computing the Semantic Similarity of Geographic Terms Using Volunteered Lexical Definitions,” International Journal of Geographical Information Science 27, 10 (2013): 2099-2118.