This lesson requires you to use the command line. If you have no previous experience using the command line you may find it helpful to work through the Programming Historian Bash Command Line lesson.
In this lesson you will first learn what topic modeling is and why you might want to employ it in your research. You will then learn how to install and work with the MALLET natural language processing toolkit to do so. MALLET involves modifying an environment variable (essentially, setting up a short-cut so that your computer always knows where to find the MALLET program) and working with the command line (i.e., by typing in commands manually, rather than clicking on icons or menus). We will run the topic modeler on some example files, and look at the kinds of outputs that MALLET produces. This will give us a good idea of how it can be used on a corpus of texts to identify topics found in the documents without reading them individually.
Please see the MALLET users’ discussion list for the full range of things one can do with the software.
(We would like to thank Robert Nelson and Elijah Meeks for hints and tips in getting MALLET to run for us the first time, and for their examples of what can be done with this tool.)
What Is Topic Modeling and for Whom Is It Useful?
A topic modeling tool takes a single text (or corpus) and looks for patterns in the use of words; it is an attempt to inject semantic meaning into vocabulary. Before you begin with topic modeling, you should ask yourself whether or not it is likely to be useful for your project. Matthew Kirschenbaum’s Distant Reading (a talk given at the 2009 National Science Foundation Symposium on the Next Generation of Data Mining and Cyber-Enabled Discovery for Innovation) and Stephen Ramsay’s Reading Machines are good places for beginning to understand in which circumstances a technique such as this could be most effective. As with all tools, just because you can use it, doesn’t necessarily mean that you should. If you are working with a small number of documents (or even a single document) it may well be that simple frequency counts are sufficient, in which case something like Voyant Tools might be appropriate. However, if you have hundreds of documents from an archive and you wish to understand something of what the archive contains without necessarily reading every document, then topic modeling might be a good approach.
Topic models represent a family of computer programs that extract topics from texts. A topic to the computer is a list of words that occur in statistically meaningful ways. A text can be an email, a blog post, a book chapter, a journal article, a diary entry – that is, any kind of unstructured text. By unstructured we mean that there are no computer-readable annotations that tell the computer the semantic meaning of the words in the text.
Topic modeling programs do not know anything about the meaning of the words in a text. Instead, they assume that any piece of text is composed (by an author) by selecting words from possible baskets of words, where each basket corresponds to a topic. If that is true, then it becomes possible to mathematically decompose a text into the probable baskets whence the words first came. The tool goes through this process over and over again until it settles on the most likely distribution of words into baskets, which we call topics.
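To make the "baskets" intuition concrete, here is a minimal sketch in Python of composing a document by drawing words from topic baskets. The word lists and proportions are invented for illustration; this shows only the generative assumption, not how MALLET works internally:

```python
import random

# Two toy "baskets" (topics). These word lists are invented for
# illustration; a real model would learn them from a corpus.
TOPICS = {
    "economy": ["jobs", "growth", "banks", "unemployment", "stock"],
    "war": ["troops", "taliban", "terror", "afghanistan", "conflict"],
}

def compose_text(n_words, topic_mix, seed=42):
    """Generate a toy 'document': for each word, first pick a basket
    according to topic_mix, then draw a word from that basket."""
    rng = random.Random(seed)
    words = []
    for _ in range(n_words):
        topic = rng.choices(list(topic_mix),
                            weights=list(topic_mix.values()))[0]
        words.append(rng.choice(TOPICS[topic]))
    return words

# A speech that is 70% about the economy, 30% about the war:
speech = compose_text(10, {"economy": 0.7, "war": 0.3})
```

Topic modeling runs this assumption in reverse: given only the finished speeches, it tries to recover the baskets and the mixes.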
There are many different topic modeling programs available; this tutorial uses one called MALLET. If one used it on a series of political speeches for example, the program would return a list of topics and the keywords composing those topics. Each of these lists is a topic according to the algorithm. Using the example of political speeches, the list might look like:
- Job Jobs Loss Unemployment Growth
- Economy Sector Economics Stock Banks
- Afghanistan War Troops Middle-East Taliban Terror
- Election Opponent Upcoming President
- et cetera
By examining the keywords we can discern that the politician who gave the speeches was concerned with the economy, jobs, the Middle East, the upcoming election, and so on.
As Scott Weingart warns, there are many dangers that face those who use topic modeling without fully understanding it. For instance, we might be interested in word use as a proxy for placement along a political spectrum. Topic modeling could certainly help with that, but we have to remember that the proxy is not in itself the thing we seek to understand – as Andrew Gelman demonstrates in his mock study of zombies using Google Trends. Ted Underwood and Lisa Rhody (see Further Reading) argue that we as historians would be better to think of these categories as discourses; however for our purposes here we will continue to use the word: topic.
Note: You will sometimes come across the term “LDA” when looking into the bibliography of topic modeling. LDA and Topic Model are often used synonymously, but the LDA technique is actually a special case of topic modeling created by David Blei and friends in 2002. It was not the first technique now considered topic modeling, but it is by far the most popular. The myriad variations of topic modeling have resulted in an alphabet soup of techniques and programs to implement them that might be confusing or overwhelming to the uninitiated; ignore them for now. They all work in much the same way. MALLET uses LDA.
Examples of topic models employed by historians:
- Rob Nelson, Mining the Dispatch
- Cameron Blevins, “Topic Modeling Martha Ballard’s Diary” Historying, April 1, 2010.
- David J Newman and Sharon Block, “Probabilistic topic decomposition of an eighteenth century American newspaper,” Journal of the American Society for Information Science and Technology vol. 57, no. 6 (April 1, 2006): 753-767.
There are many tools one could use to create topic models, but at the time of this writing (summer 2017) the simplest tool to run your text through is called MALLET. MALLET uses an implementation of Gibbs sampling, a statistical technique meant to quickly construct a sample distribution, to create its topic models. MALLET requires using the command line – we’ll talk about that more in a moment, although you typically use the same few commands over and over.
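As a rough illustration of what Gibbs sampling is doing under the hood, here is a toy collapsed Gibbs sampler for a topic model in plain Python. This is a sketch of the statistical idea only — MALLET's actual implementation is far more sophisticated and efficient, and every name and parameter here is our own invention:

```python
import random
from collections import defaultdict

def gibbs_lda(docs, n_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler over tokenized documents."""
    rng = random.Random(seed)
    vocab_size = len({w for d in docs for w in d})
    doc_topic = [defaultdict(int) for _ in docs]              # topic counts per doc
    topic_word = [defaultdict(int) for _ in range(n_topics)]  # word counts per topic
    topic_total = [0] * n_topics                              # tokens per topic
    assignments = []                                          # topic of each token

    # Random initialization: assign every token to a random topic.
    for di, doc in enumerate(docs):
        z = []
        for w in doc:
            t = rng.randrange(n_topics)
            z.append(t)
            doc_topic[di][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1
        assignments.append(z)

    # Repeatedly resample each token's topic given all the others.
    for _ in range(iters):
        for di, doc in enumerate(docs):
            for wi, w in enumerate(doc):
                t = assignments[di][wi]
                doc_topic[di][t] -= 1      # remove current assignment
                topic_word[t][w] -= 1
                topic_total[t] -= 1
                # Weight each topic by how much this document already
                # uses it, and how much the topic already uses this word.
                weights = [(doc_topic[di][k] + alpha)
                           * (topic_word[k][w] + beta)
                           / (topic_total[k] + vocab_size * beta)
                           for k in range(n_topics)]
                t = rng.choices(range(n_topics), weights=weights)[0]
                assignments[di][wi] = t
                doc_topic[di][t] += 1
                topic_word[t][w] += 1
                topic_total[t] += 1
    return topic_word, doc_topic
```

After enough passes the assignments settle into the "most likely distribution of words into baskets" described earlier; the randomness of the initialization and sampling is also why MALLET's output differs slightly from run to run.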
The installation instructions are different for Windows and Mac. Follow the instructions appropriate for you below:
- Go to the MALLET project page. You can download MALLET here.
- You will also need the Java developer’s kit – that is, not the regular Java that’s on every computer, but the one that lets you program things. Install this on your computer.
- Unzip MALLET into your C:\ directory. This is important: it cannot be anywhere else. You will then have a directory called mallet-2.0.8 or similar. For simplicity’s sake, rename this directory just mallet.
- MALLET uses an environment variable to tell the computer where to find all the various components of its processes when it is running. It’s rather like a shortcut for the program. A programmer cannot know exactly where every user will install a program, so the programmer creates a variable in the code that will always stand in for that location. We tell the computer, once, where that location is by setting the environment variable. If you moved the program to a new location, you’d have to change the variable.
To create an environment variable in Windows 7, click through to your system’s Environment Variables settings (Figures 1,2,3). Click new and type MALLET_HOME in the variable name box. It must be like this – all caps, with an underscore – since that is the shortcut that the programmer built into the program and all of its subroutines. Then type the exact path (location) of where you unzipped MALLET in the variable value, e.g., C:\mallet.
To see if you have been successful, please read on to the next section.
Running MALLET using the Command Line
MALLET is run from the command line, also known as Command Prompt (Figure 4). If you remember MS-DOS, or have ever played with a Unix computer Terminal, this will be familiar. The command line is where you can type commands directly, rather than clicking on icons and menus.
- Click on your Start Menu, type cmd in the search box, and hit enter. You’ll get the command prompt window, which will have a cursor at C:\Users\user> (or similar; see Figure 4).
- Type cd .. (that is: cd-space-period-period) to change directory. Keep doing this until you’re at the C:\ directory (as in Figure 5).
- Then type cd mallet and you are in the MALLET directory. Anything you type in the command prompt window is a command. There are commands like cd (change directory) and dir (list directory contents) that the computer understands. You have to tell the computer explicitly that ‘this is a MALLET command’ when you want to use MALLET. You do this by telling the computer to grab its instructions from the MALLET bin, a subfolder in MALLET that contains the core operating routines.
- Type bin\mallet as in Figure 6. If all has gone well, you should be presented with a list of MALLET commands – congratulations! If you get an error message, check your typing. Did you use the wrong slash? Did you set up the environment variable correctly? Is MALLET located at C:\mallet?
You are now ready to skip ahead to the next section.
Many of the instructions for OS X installation are similar to Windows, with a few differences. In fact, it is a bit easier.
- Download and install MALLET.
- Download the Java Development Kit.
Unzip MALLET into a directory on your system (for ease of following along with this tutorial, your user directory works, but anywhere is okay). Once it is unzipped, open up your Terminal window (in the Applications/Utilities directory in your Finder). Navigate to the directory where you unzipped MALLET using the Terminal. If you unzipped it into your user directory as was suggested in this lesson, you can navigate to the correct directory by typing cd followed by its name. (cd is short for “change directory” when working in the Terminal.)
The same commands will work from this directory, except that you need to prefix each one with ./ (period-slash). This needs to be done before all MALLET commands when working on a Mac.
Going forward, the commands for MALLET on a Mac will be nearly identical to those on Windows, except for the direction of the slashes (there are a few other minor differences that will be noted as they arise). If on Windows a command would be bin\mallet, on a Mac you would instead type ./bin/mallet.
A list of commands should appear. If it does, congratulations – you’ve installed it correctly!
Typing in MALLET Commands
Now that you have MALLET installed, it is time to learn what commands are available to use with the program. There are nine MALLET commands you can use (see Figure 6 above). Sometimes you can combine multiple instructions. At the Command Prompt or Terminal (depending on your operating system), try typing:
You are presented with the error message that import-dir is not recognized as an internal or external command, operable program, or batch file. This is because we forgot to tell the computer to look in the MALLET bin for it. Try again, with
Remember, the direction of the slash matters (see Figure 7, which provides an entire transcript of what we have done so far in the tutorial). We checked that we had installed MALLET by typing in bin\mallet. We then made the mistake of typing import-dir on its own a few lines further down. After that, we successfully called up the help file, which told us what import-dir does, and listed all of the potential parameters you can set for this tool.
Note: there is a difference in MALLET commands between a single hyphen and a double hyphen. A single hyphen is simply part of the name; it replaces a space (e.g., import-dir rather than import dir), since spaces offset multiple commands or parameters. These parameters let us tweak the file that is created when we import our texts into MALLET. A double hyphen (as with --help above) modifies, adds a sub-command to, or specifies some sort of parameter for the command.
For Windows users, if you got the error ‘exception in thread “main” java.lang.NoClassDefFoundError:’ it might be because you installed MALLET somewhere other than in the C:\ directory; installing it elsewhere will produce this error message. The second thing to check is that your MALLET_HOME environment variable is set correctly. In either of these cases, check the Windows installation instructions and double check that you followed them properly.
Working with data
MALLET comes pre-packaged with sample files with which you can practice. Type dir at the prompt, and you are given the listing of the MALLET directory contents. One of those directories is called sample-data. You know it is a directory because it has the word <dir> beside it.
Type cd sample-data. Type dir again. Using what you know, navigate first to the web and then the en directories. You can look inside these files by typing the full name of a file (with extension).
Note that you cannot now run any MALLET commands from this directory. Try it:
You get the error message. You will have to navigate back to the main MALLET folder to run the commands. This is because of the way MALLET and its components are structured.
In the en directory, there are a number of .txt files. Each one of these files is a single document, the text of a number of different web pages. The entire folder can be considered to be a corpus of data. To work with this corpus and find out what the topics are that compose these individual documents, we need to transform them from several individual text files into a single MALLET-format file. MALLET can import more than one file at a time: we can import an entire directory of text files using the import-dir command. The commands below import the directory, turn it into a MALLET file, keep the original texts in the order in which they were listed, and strip out the stop words (words such as and, the, but, and if that occur in such frequencies that they obstruct analysis) using the default English stop word list. Try the following, which will use sample data.
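The effect of stop word removal can be sketched in a few lines of Python. The stoplist below is tiny and invented for illustration; MALLET applies its own, much longer, default English list during import:

```python
# A tiny invented stoplist; MALLET ships a much longer default English list.
STOPWORDS = {"and", "the", "but", "if", "of", "a", "in", "to", "is"}

def strip_stopwords(text):
    """Lowercase, tokenize on whitespace, and drop stopwords."""
    return [w for w in text.lower().split() if w not in STOPWORDS]

tokens = strip_stopwords("The election and the economy if nothing else")
# tokens -> ['election', 'economy', 'nothing', 'else']
```

Dropping these high-frequency function words leaves the content words that the topic model can actually cluster meaningfully.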
If you type dir now (or ls for Mac), you will find a file called tutorial.mallet. (If you get an error message, you can hit the up arrow key on your keyboard to recall the last command you typed, and look carefully for typos.) This file now contains all of your data, in a format that MALLET can work with.
Try running it again now with different data. For example, let’s imagine we wanted to use the German sample data instead. We would use:
And then finally, you could use your own data. Change the import command to point at a directory that contains your own research files. Good luck!
If you are unsure how directories work, we suggest the Programming Historian lesson “Introduction to the Bash Command Line”.
Mac instructions are similar to those above for Windows, but note some of the differences below:
Issues with Big Data
If you’re working with large file collections – or indeed, very large files – you may run into issues with your heap space, your computer’s working memory. This issue will initially arise during the import sequence, if it is relevant. By default, MALLET allows for 1GB of memory to be used. If you run into the following error message, you’ve run into your limit:
If your system has more memory, you can try increasing the memory allocated to your Java virtual machine. To do so, you need to edit the launch script found in the bin subdirectory of your MALLET folder. Using Komodo Edit (see Mac, Windows, Linux for installation instructions), open the mallet.bat file if you are using Windows, or the mallet file if you are using Linux or OS X.
Find the following line:
You can then change the 1g value upwards – to 2g, 4g, or even higher depending on your system’s RAM, which you can find out by looking up the machine’s system information.
Save your changes. You should now be able to avoid the error. If not, increase the value again.
Your first topic model
At the command prompt in the MALLET directory, type:
This command opens your tutorial.mallet file, and runs the topic model routine on it using only the default settings. As it iterates through the routine, trying to find the best division of words into topics, your command prompt window will fill with output from each run. When it is done, you can scroll up to see what it was outputting (as in Figure 8).
The computer is printing out the key words, the words that help define a statistically significant topic, per the routine. In Figure 8, the first topic it prints out might look like this (your key words might look a bit different):
If you are a fan of cricket, you will recognize that all of these words could be used to describe a cricket match. What we are dealing with here is a topic related to Australian cricket. If you open the corresponding file in sample-data\web\en, you will see that it is a brief biography of the noted Australian cricketer Clem Hill. The 0 and the 5 we will talk about later in the lesson. Note that MALLET includes an element of randomness, so the keyword lists will look different every time the program is run, even on the same set of data.
Go back to the main MALLET directory, and type dir. You will see that there is no output file. While we successfully created a topic model, we did not save the output! At the command prompt, type
Here, we have told MALLET to create a topic model (train-topics), and everything with a double hyphen afterwards sets different parameters:
- --input tutorial.mallet opens your tutorial.mallet file
- --num-topics 20 trains MALLET to find 20 topics
- --output-state topic-state.gz outputs every word in your corpus of materials and the topic it belongs to into a compressed file (.gz; see www.gzip.org on how to unzip this)
- --output-topic-keys tutorial_keys.txt outputs a text document showing you what the top key words are for each topic
- and --output-doc-topics tutorial_composition.txt outputs a text file indicating the breakdown, by percentage, of each topic within each original text file you imported. (To see the full range of possible parameters that you may wish to tweak, type bin\mallet train-topics --help at the prompt.)
Type dir. Your outputted files will be at the bottom of the list of files and directories in the MALLET directory. Open tutorial_keys.txt in a word processor (Figure 9). You are presented with a series of paragraphs. The first paragraph is topic 0; the second paragraph is topic 1; the third paragraph is topic 2; etc. (The output begins counting at 0 rather than 1; so if you ask it to determine 20 topics, your list will run from 0 to 19.) The second number in each paragraph is the Dirichlet parameter for the topic. This is related to an option which we did not use, and so its default value was kept (this is why every topic in this file has the number 2.5).
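If you prefer to work with the keys file programmatically rather than in a word processor, its layout (topic number, Dirichlet parameter, then key words, separated by tabs) can be parsed with a short script. The sample data below is invented, but follows the shape just described:

```python
def parse_topic_keys(text):
    """Parse topic-keys output: one topic per line, formatted as
    topic number, Dirichlet parameter, then the key words."""
    topics = {}
    for line in text.strip().splitlines():
        num, param, words = line.split("\t", 2)
        topics[int(num)] = {"param": float(param), "words": words.split()}
    return topics

# An invented two-topic sample in the same shape as the tutorial output:
sample = "0\t2.5\tcricket bat wicket australia\n1\t2.5\teconomy jobs growth"
keys = parse_topic_keys(sample)
# keys[0]["words"][0] -> 'cricket'; every topic here has the default 2.5
```

A dictionary like this makes it easy to, say, print the top five words of each topic or compare key word lists across runs.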
If when you ran the topic model routine you had included the --optimize-interval option,
the output might look like this:
That is, the first number is the topic (topic 0), and the second number gives an indication of the weight of that topic. In general, including --optimize-interval leads to better topics.
The composition of your documents
What topics compose your documents? The answer is in the tutorial_composition.txt file. To stay organized, import the file into a spreadsheet (Excel, Open Office, etc). You will have a spreadsheet with #doc, source, topic, and proportion columns. All subsequent columns run topic, proportion, topic, proportion, and so on, as in Figure 10.
This can be a somewhat difficult file to read. The topics begin in the third column, in this case Column C, and continue until the last topic in Column V. This is because we have trained 20 topics – if we trained 25, for example, they would run until column AA.
From this, you can see that doc# 0 (i.e., the first document loaded into MALLET) has topic 0 at a percentage of 0.43% (column C). We can see that topic 17 is the principal topic, at 59.05%, by locating the highest value in the row. Your own topics may be numbered differently, given the element of randomness in MALLET.
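A small script can save you hunting for the highest value by hand. Assuming one row of the composition file laid out as described above (#doc, source, then alternating topic and proportion columns — the sample row below is invented), a sketch in Python:

```python
def dominant_topic(row):
    """Given one row of the composition file as a list
    [doc#, source, topic, proportion, topic, proportion, ...],
    return the (topic, proportion) pair with the highest proportion."""
    pairs = list(zip(row[2::2], row[3::2]))  # (topic, proportion) pairs
    return max(pairs, key=lambda p: p[1])

# An invented row in the layout described above:
row = [0, "file:/.../doc0.txt", 17, 0.5905, 0, 0.0043, 3, 0.21]
topic, proportion = dominant_topic(row)
# topic -> 17, proportion -> 0.5905
```

Run over every row, this gives you each document's principal topic, which is exactly the quantity you would graph when looking for change over time.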
If you have a corpus of text files that are arranged in chronological order (e.g., named so that the earliest file sorts first), then you can graph this output in your spreadsheet program, and begin to see changes over time, as Robert Nelson has done in Mining the Dispatch.
How do you know the number of topics to search for? Is there a natural number of topics? What we have found is that one has to run train-topics with varying numbers of topics to see how the composition file breaks down. If we end up with the majority of our original texts all in a very limited number of topics, then we take that as a signal that we need to increase the number of topics; the settings were too coarse. There are computational ways of searching for an appropriate number, including MALLET’s hlda routine, but for the reader of this tutorial it is probably quicker to cycle through several iterations (for more, see Griffiths, T. L., & Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences, 101, 5228-5235).
Getting your own texts into MALLET
The sample-data folder in MALLET is your guide to how you should arrange your texts. You want to put everything you wish to topic model into a single folder within your MALLET directory. Your texts should be in .txt format (that is, you create them with Notepad, or in Word choose ‘Save As…’ and select plain text). You have to make some decisions. Do you want to explore topics at a paragraph-by-paragraph level? Then each file should contain one paragraph. Things like page numbers or other identifiers can be indicated in the name you give the file. If you are working with a diary, each text file might be a single entry. (Note that when naming folders or files, do not leave spaces in the name. Instead use underscores to represent spaces.) If the texts that you are interested in are on the web, you might be able to automate this process.
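If your source is one long text file, a short script can do the splitting for you. This is a sketch only — the function name and numbering scheme are our own invention — but it writes one paragraph per .txt file, with underscores rather than spaces in the names, ready for import:

```python
from pathlib import Path

def split_into_paragraph_files(text, out_dir, stem="doc"):
    """Split a plain-text document on blank lines and write each
    paragraph to its own .txt file, numbered with underscores
    (no spaces), so the whole directory can be imported at once."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    paths = []
    for i, para in enumerate(paragraphs):
        path = out / f"{stem}_{i:03d}.txt"  # e.g. doc_000.txt, doc_001.txt
        path.write_text(para, encoding="utf-8")
        paths.append(path)
    return paths
```

Splitting on single newlines instead of blank lines, or on dated headings for a diary, is a one-line change to the split step.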
Further Reading about Topic Modeling
To see a fully worked out example of topic modeling with a body of materials culled from webpages, see Mining the Open Web with Looted Heritage Draft.
You can grab the data for yourself at Figshare.com, which includes a number of files. Each individual file is a single news report.
- For extensive background and bibliography on topic modeling you may wish to begin with Scott Weingart’s Guided Tour to Topic Modeling
- Ted Underwood’s ‘Topic modeling made just simple enough’ is an important discussion on interpreting the meaning of topics.
- Lisa Rhody’s post on interpreting topics is also illuminating. ‘Some Assembly Required’ Lisa @ Work August 22, 2012.
- Clay Templeton, ‘Topic Modeling in the Humanities: An Overview | Maryland Institute for Technology in the Humanities’, n.d.
- David Blei, Andrew Ng, and Michael Jordan, ‘Latent dirichlet allocation,’ The Journal of Machine Learning Research 3 (2003).
- Finally, also consult David Mimno’s bibliography of topic modeling articles. They’re tagged by topic to make finding the right one for a particular application that much easier. Also take a look at his recent article on Computational Historiography from ACM Transactions on Computational Logic which goes through a hundred years of Classics journals to learn something about the field. While the article should be read as a good example of topic modeling, his ‘Methods’ section is especially important, in that it discusses preparing text for this sort of analysis.
About the authors
Shawn Graham is associate professor of digital humanities and history at Carleton University. Scott Weingart is a historian of science and digital humanities specialist at Carnegie Mellon University. Ian Milligan is an associate professor of history at the University of Waterloo.
The Digital Humanities Contribution to Topic Modeling
Scott B. Weingart and Elijah Meeks
Topic modeling could stand in as a synecdoche of digital humanities. It is distant reading in the most pure sense: focused on corpora and not individual texts, treating the works themselves as unceremonious “buckets of words,” and providing seductive but obscure results in the forms of easily interpreted (and manipulated) “topics.” In its most commonly used tool, it runs in the command line. To achieve its results, it leverages occult statistical methods like “Dirichlet priors” and “Bayesian models.” Were a critic of digital humanities to dream up the worst stereotype of the field, he or she would likely create something very much like this, and then name a popular implementation of it after a hammer.
Since 2010, introductions to topic modeling for humanists have appeared with increasing frequency. Most offer you a list of words, all apparently related yet in no discernible order, identified as a “topic.” You’re introduced to topics, and how a computer came to generate them automatically without any prior knowledge of word definitions or grammar. It’s amazing, you read, but not magic: a simple algorithm that can be understood easily if only you are willing to dedicate an hour or two to learn it. The results would speak for themselves, and a decade ago you would have been forgiven if you imagined only a human could have produced the algorithm’s output. You would marvel at the output, for a moment, before realizing there isn’t much immediately apparent you can actually do with it, and the article would list a few potential applications along with a slew of caveats and dangers. We are ready, now, for a more sustained and thorough exploration of topic modeling.
In our role as guest editors, we have designed this issue of the Journal of Digital Humanities to push the conversation on topic modeling and also to reflect on the larger community in which it is situated. We believe the rapid pace of communication about topic modeling, the focus on workshops and gray literature and snippets of code, the mixed methods invoked and used, are an ideal introduction to what it means to be a digital humanist in a networked world. This is not to say that the issue is another round in defining the digital humanities — far from it — the pieces herein provide an understanding of how to do topic modeling, what to use, its dangers, and some excellent examples of topic models in practice.
Just as tools are enshrined methodologies, methods like topic modeling are reflections of movements. Topic modeling itself is about 15 years old, arriving from the world of computer science, machine learning, and information retrieval. It describes a method of extracting clusters of words from sets of documents. Topic modeling has been applied to datasets in multiple domains, from bioinformatics to comparative literature, and to documents ranging in size from monographs to tweets. One particular variety of topic model, an approach called Latent Dirichlet Allocation (LDA), along with its various derivatives, has been the most popular approach to topic modeling in the humanities.
LDA originated in Michael I. Jordan’s computer science lab in 2002/2003 in collaboration with David M. Blei and Andrew Y. Ng, and the term LDA has since become nearly synonymous with topic modeling in general. Over the last several years, LDA crept slowly into the humanities world. The software team behind MALLET, by far the most popular tool for topic modeling in the humanities, was led by computer scientist Andrew McCallum and eventually included David Mimno, a researcher with a background in digital humanities. Around the same time, computer scientist David J. Newman and historian Sharon Block collaborated on topic modeling an eighteenth century newspaper, a project culminating in the history article “Doing More with Digitization” in 2006. Others at Stanford and elsewhere continued working fairly quietly combining topic modeling with digital humanities for some time, before the explosion of interest that began in 2010.
Two widely circulated blog posts first introduced topic modeling to the broader digital humanities community: Matthew L. Jockers on topic modeling a Day of DH and Cameron Blevins on a late eighteenth century diary. Then at one of the first NEH-funded Institutes for Advanced Topics in the Digital Humanities, held at UCLA in August 2010 and focusing on network analysis, Mimno, Blei, and David Smith introduced many digital humanists to topic modeling for the first time. Since that time, dozens of tutorials, walkthroughs, techniques, implementations, and cries of frustration have been posted through various web outlets, often inspiring multithreaded conversations, reply posts, or backchannel Twitter chatter.
In this additional way topic modeling typifies digital humanities: the work is almost entirely represented in that gray literature. While there is a hefty bibliography for spatial analysis in humanities scholarship, for example, in order to follow research that deploys topic modeling for humanities inquiry you must read blogs and attend conference presentations and workshops. For those not already participating in the conversation, this dispersed discussion can be a circuitous and imposing barrier to entry. In addition to sprawling across blogs, tweets, and comment threads, contributions also span methods and disciplines, employ sophisticated visualizations, sometimes delve into statistics and code, and other times adopt the language of literary critique.
This topical issue of the Journal of Digital Humanities is meant to catch and present the most salient elements of the topic modeling conversation: a comprehensive introduction, technical details, applications, and critiques from a humanistic perspective. By doing so, we hope to make topic modeling more accessible for new digital humanities scholars, highlight the need for existing practitioners to continue to develop their theoretical approaches, and further sketch out the relationship between this particular method and those of the broader digital humanities community.
This issue also features an experimental this-space-left-intentionally-blank section; any conversation inspired by this issue over the next month, either posted as a comment or tagged on Twitter using #JDHTopics, will eventually be folded into the issue itself as supplemental material. Naturally, this forthcoming section also will include some topic modeling of that material. While we hope the engagement with this issue continues for some time, only material submitted by May 11, 2013 will be included in the final addition to the issue.
Section 1: Concepts
The creator of LDA, David M. Blei, opens the issue with an original article offering a grand narrative of topic modeling and its application in the humanities. He explains the basic principles behind topic modeling, frames it in relation to probabilistic modeling as a field, and explores modeling as a tool for finding and expressing meaning. Blei urges humanities scholars to focus on the model in topic modeling, echoing Willard McCarty’s claim that “modeling points the way to a computing that is of as well as in the humanities: a continual process of coming to know by manipulating representations.”
A more instructional piece is presented by Megan R. Brett, to frame the conversations appearing in this issue. Originally written to introduce students to topic modeling, Brett brings together many invaluable resources and examples. Those unfamiliar with topic modeling will find this piece particularly helpful context for the remaining articles in this special issue.
Next, David Mimno’s presentation, given at the Maryland Institute for Technology in the Humanities (MITH) topic modeling workshop in November 2012, provides the most accessible introduction to the math behind topic modeling available. Mimno argues that those intending to implement topic modeling should understand the details behind topic modeling, and offers an insightful presentation about how topic models are trained, evaluated, and visualized.
Section 2: Applications and Critiques
If topic modeling has recently inspired a wealth of introductions for humanists, actual applications written in humanities channels have been harder to come by until very recently. Perhaps two of the most notable projects are Matthew L. Jockers’ forthcoming book Macroanalysis, which explores literature using — among other methods — topic modeling, and David Mimno’s recent article on topic modeling the last century of classics journals.
Lisa M. Rhody provides a long piece drawn from her dissertation research project that extends the traditionally thematic-oriented topic modeling to figurative and poetic language. She explores the productive failure of topic modeling, which highlights the processual nature of topic modeling, and reinforces the dialectic with traditional reading. Rhody’s work is perhaps the best evidence thus far that what we might have identified as cohesive “topics” are more complex than simple thematic connections; indeed, “topics” are more closely related to what Ted Underwood calls “discourses,” a comparison discussed in greater detail within the article. Some of her raw model data is available in an appendix.
Andrew Goldstone and Ted Underwood offer a history of literary criticism through topic models of PMLA. In this piece, originally cross-posted on their blogs, they integrate network analysis and representation to better understand and simultaneously complicate the results of the topic models they run. By highlighting the process of topic modeling, Goldstone and Underwood reveal how different methodological choices may lead to contrasting results.
Because topic modeling transforms or compresses free-form data (raw narrative text) into structured data (topics as distributions over word tokens, and documents as mixtures of topics), it is tempting to think of it as “solving” text. Ben Schmidt addresses this in an expansion and revision of his earlier critiques of topic modeling and its use in the humanities. As with other pieces in this issue, his research integrates what are becoming less and less distinct computational methods — in this case data visualization and spatial analysis — to better understand the strengths and weaknesses of topic models. The result is a call for caution in the use of topic modeling because it moves scholars away from interpreting language — their great strength — toward interpreting “topics,” an ill-defined act which might provide the false security of having resolved the distinction between a word and the thing that it represents. Schmidt’s code is available in an appendix.
Section 3: Tools
Two tools in particular have enjoyed wide adoption among digital humanists: MALLET, produced by Andrew McCallum and collaborators, and Paper Machines, created and developed by Jo Guldi and Chris Johnson-Roberson. For those looking to try their hand at topic modeling their own sets of documents, the Programming Historian includes a tutorial on MALLET by Shawn Graham, Scott B. Weingart, and Ian Milligan; and Sarita Alami of the Emory Digital Scholarship Commons offers a two-part series (Part I, Part II) introducing Paper Machines.
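To give a sense of what working with MALLET looks like in practice, a typical session follows the two-step pattern below. This is a minimal sketch, assuming MALLET is installed and run from its own directory; the input directory and output filenames are illustrative placeholders, not prescribed names.

```shell
# Import a directory of plain-text files into MALLET's binary format,
# keeping word order and removing common stopwords.
bin/mallet import-dir --input sample-texts \
  --output corpus.mallet --keep-sequence --remove-stopwords

# Train a 20-topic model, writing the top words per topic and the
# per-document topic proportions to plain-text files.
bin/mallet train-topics --input corpus.mallet --num-topics 20 \
  --output-topic-keys topic_keys.txt \
  --output-doc-topics doc_composition.txt
```

The two output files, a list of topics as ranked word lists and a table of topic proportions per document, are the structured data that the articles in this issue analyze, visualize, and critique.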
Ian Milligan and Shawn Graham, authors of the Programming Historian’s tutorial on MALLET (with Scott B. Weingart), offer here a review not only of how the tool works, but what it means as an instantiation of a method. The review includes links to tutorials and guides to get started, as well as some rumination on the “magic” of topic modeling.
Adam Crymble provides a review of Paper Machines, an open source tool which connects with Zotero to analyze sets of documents collected therein. Crymble situates topic modeling in a typical research ecosystem of analysis and search, and ties it into the growing prevalence of information visualization techniques in digital humanities work.
In digital humanities research we use tools, make tools, and theorize tools not because we are all information scientists, but because tools are the formal instantiation of methods. That is why MALLET often stands in for topic modeling and topic modeling often stands in for the digital humanities.
The work in this issue integrates the Natural Language Processing technique of topic modeling with network representation, GIS, and information visualization. This approach takes advantage of the growing accessibility of tools and methods that had until recently required great resources (technical, professional, and financial). When a scholar employs MALLET, the tool itself instantiates an argument about text made through topic modeling. Scholars can choose to engage with and adjust the algorithms in MALLET. But the tool also allows for uncritical use of machinery built for Natural Language Processing.
The humanities are unused to such formal simulacra, however, and so a journal about scholarship might appear to be a journal about tools and software. But none of the authors in this issue simply run the tools and accept the results as “useful” or “interesting” for humanities scholarship. Instead, they critically wrestle with the process. Their work is done with as much of a focus on what the computational techniques obscure as on what they reveal.
Traditional humanities scholars often equate digital humanities with technological optimism. Rather, the opposite is true: digital humanists offer the jaundiced realization that computational techniques like topic modeling — long held inaccessible and unapproachable and therefore unassailable — are not an upgrade from simplistic human-driven research, but merely more tools in the ever-growing shed. Whether as part of a particular research agenda, as a method enshrined in tools, or as part of a larger movement toward modeling in the humanities, topic modeling in the humanities has been deployed critically. The adoption of “critical technique” is just what you would expect from scholars accustomed to “critical reading.”
About Scott B. Weingart and Elijah Meeks
Scott B. Weingart is an NSF Graduate Research Fellow and PhD student at Indiana University, where he studies Information Science and History of Science. His research focuses on the intersection of historiographic and quantitative methodologies, particularly as they can be used to study scholarly communications in the past and present. He also writes a blog called the scottbot irregular, aiming to make computational tools and big data analytics accessible to a wider, humanities-oriented audience. When not researching, Scott fights for open access and the reform of modern scholarly communication.
Elijah Meeks is the Digital Humanities Specialist at Stanford University, where he helps bring network analysis, text analysis, and spatial analysis to bear on traditional humanities research questions. He has worked as the technical lead on The Digital Gazetteer of the Song Dynasty, Authorial London, and ORBIS: The Stanford Geospatial Network Model of the Roman World. In his time at Stanford, he's worked with Mapping the Republic of Letters, the Stanford Literary Lab, and the Spatial History Lab, as well as individual faculty and graduate students, to explore a wide variety of digital humanities research questions.