Friday, April 13, 2018
Time: 12:00-3:00 pm
Location: UC Irvine Humanities Hall 251
Topic modelling begins with the insight that texts are constructed from building blocks called "topics".
Topic modelling algorithms use information in the texts themselves to generate these topics.
We can explore the results of this process in order to learn something about the texts that were used to create the model.
A topic model can produce amazing, magical insights about your texts...
(my current definition)
(Computationally) finding quantitative patterns in natural language samples and attributing meaning to these patterns.
Pre-processing creates a “deformed” version of the original text for analysis.
Statistical processing transforms the text from natural language to quantitative data. This type of “deformance” typically involves dimensionality reduction, a simplification of the data so that it can be represented in two-dimensional space.
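As a minimal sketch of what this "deformance" looks like in practice (my own illustration, not part of the workshop materials — the stopword list is invented), pre-processing might lowercase, tokenize, and filter a text before any statistics are computed:

```python
import re

STOPWORDS = {"the", "of", "and", "a", "to", "in"}  # tiny illustrative list

def preprocess(text):
    """Deform a natural-language text into a list of analysis-ready tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())      # lowercase + tokenize
    return [t for t in tokens if t not in STOPWORDS]  # drop stopwords

print(preprocess("The Deformance and Interpretation of a text."))
# ['deformance', 'interpretation', 'text']
```

Every choice here (what counts as a token, which stopwords to drop) is a decision that belongs in the "narrative of meaning" described below.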
See Samuels, Lisa, and Jerome McGann. “Deformance and Interpretation.” New Literary History 30.1 (1999): 25–56.
A “narrative of meaning” is an account of the significance of the results of text analysis.
Such a narrative must include an account of the decisions made as part of pre-processing, statistical processing, and visualisation steps in the workflow.
Topic modelling attempts to map out the semantic categories that make up a collection of documents.
Formal definition: A probability distribution over terms.
Informal definition: Some potentially meaningful category onto which we can map the terms and documents in our collection.
Working definition: A list of terms (usually words) from your document collection, each of which has a certain probability of occurring with the other terms in the list.
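As a toy illustration of the formal definition (the terms and probabilities here are invented), a topic can be thought of as a mapping from terms to probabilities that sum to 1:

```python
# A hypothetical "topic": a probability distribution over terms.
# In a real model every vocabulary term has some (often tiny) probability;
# here the remainder is collapsed into a single entry for readability.
topic = {
    "religion": 0.31,
    "faith": 0.22,
    "buddhist": 0.14,
    "god": 0.09,
    "philosophy": 0.24,  # probability mass of all remaining terms, collapsed
}

# The probabilities across the whole vocabulary sum to 1
print(sum(topic.values()))  # 1.0
```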
Traditional methods rely on our contextual knowledge of the documents to identify something like topics.
Algorithmic approaches use only the information contained within the documents themselves to identify topics.*
* But they are typically tuned by human decisions which require some prior assumptions and disciplinary understanding of the material.
It uses an algorithm to generate the topics by examining the tendency of individual terms to occur together in the same documents.
In other words, topics are inferred from the documents themselves.
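A minimal sketch (my own illustration, far simpler than the actual algorithm) of what "tendency of terms to occur together" means: counting how often pairs of terms share a document in a toy corpus:

```python
from collections import Counter
from itertools import combinations

# Toy documents, already reduced to token lists by pre-processing
docs = [
    ["faith", "religion", "god"],
    ["religion", "god", "doctrine"],
    ["students", "school", "teachers"],
    ["school", "students", "college"],
]

cooccurrence = Counter()
for doc in docs:
    # count each unordered pair of distinct terms once per document
    for pair in combinations(sorted(set(doc)), 2):
        cooccurrence[pair] += 1

print(cooccurrence[("god", "religion")])    # 2: they share two documents
print(cooccurrence[("school", "students")]) # 2: a second cluster emerges
```

Terms that repeatedly co-occur ("god"/"religion", "school"/"students") are the raw material from which an algorithm like LDA infers topics.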
One of the most popular algorithms is Latent Dirichlet Allocation (LDA).
Voilà! Each bag of words is a “topic”.
Matthew Jockers, “The LDA Buffet is Now Open; or, Latent Dirichlet Allocation for English Majors”
Ted Underwood, “Topic modeling made just simple enough”
Follow the links in Scott Weingart, “Topic Modeling for Humanists: A Guided Tour” (provides a gentle pathway into the statistical intricacies)
Christof Schöch, “Topic Modeling Workshop: Beyond the Black Box” (https://christofs.github.io/topic-modeling-edinburgh/#/).
Source: Ted Underwood, “Topic modeling made just simple enough”
Topic | Prominence | Keywords |
---|---|---|
0 | 0.05337 | atlanta buck sherman coffee lil city soldiers union donaldson war opera music men dance wrote vamp |
1 | 0.03937 | chinese singapore quantum scientific china percent han evidence xinjiang physics bjork language study ethnic test culture pretest memory miksic |
2 | 0.06374 | dr medical pain osteopathic medicine brain patients doctors creativity care health smithsonian touro patient benson cancer physician skorton physicians |
3 | 0.12499 | book poetry books literary writer writers american literature fiction writing poet author freud english novels culture published review true |
4 | 0.14779 | museum art mr ms arts museums artist center artists music hong kong contemporary works director china painting local institute |
5 | 0.0484 | love robinson mr godzilla slater gorman movie mother lila literature read sachs happy taught asked writing house child lived |
6 | 0.03013 | oct org street nov center museum art gallery sundays saturdays theater free road noon arts connecticut tuesdays avenue university |
7 | 0.09138 | johnson rights editor mr wilson poverty civil war vietnam kristof president writer jan ants hope human bill lyndon presidents |
8 | 0.18166 | technology computer business ms tech engineering jobs mr women people science percent ireland work skills companies fields number company |
9 | 0.08085 | israel women police violence black gender war church poland white northern officers country trial racism rights civil justice rice |
10 | 0.94475 | advertisement people time years work make life world year young part day made place back great good times things |
11 | 0.08681 | mr chief smith russian vista company russia financial times equity dealbook million reports street private berggruen york bank executive |
12 | 0.1135 | street york show free sunday children saturday theater city monday tour friday martin center members students manhattan village west |
13 | 0.17297 | times video photo community commencement article york lesson credit read tumblr online students blog digital college plan twitter news |
14 | 0.4395 | university american research mr studies international faculty state center work dr director arts universities academic bard advertisement education history |
15 | 0.55649 | people human world professor science humanities time knowledge life questions study learn social find ways change thinking problem don |
16 | 0.10946 | mr ms professor marriage york wrote degree newark mondale mother born received father school aboulela ajami price married home |
17 | 0.3896 | years government mr president report programs public american humanities state ms year million information board budget today left private |
18 | 0.07622 | religion religious buddhist faith philosophy traditions god derrida philosophers life beliefs hope buddhism jesus doctrine stone deconstruction theology lives |
19 | 0.31793 | students school education college schools student teachers teaching graduate year harvard percent colleges class high graduates job learning universities |
Topic model data courtesy of Alan Liu.
The most popular implementation used by digital humanists is MALLET (written in Java).
You can run MALLET as a desktop app with the GUI Topic Modeling Tool. This is especially useful for students.
Good implementations are available for the Python and R programming languages (the Python gensim library is very accessible).
If you run MALLET directly, you get the latest version.
A few MALLET functions are not implemented in the GUI Topic Modeling Tool, such as random-seed, which ensures that topic models are reproducible.
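For reference, a typical command-line run (a sketch assuming MALLET is installed and your texts are in a folder called input; file names here are illustrative) uses the import-dir and train-topics commands, with --random-seed for reproducibility:

```shell
# Import a directory of plain-text files into MALLET's internal format
bin/mallet import-dir --input input --output corpus.mallet \
    --keep-sequence --remove-stopwords

# Train a 20-topic model; --random-seed makes the run reproducible
bin/mallet train-topics --input corpus.mallet --num-topics 20 \
    --random-seed 42 \
    --output-topic-keys topic_keys.txt \
    --output-doc-topics doc_topics.txt
```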
The Programming Historian, “Getting Started with Topic Modeling and MALLET”
DARIAH-DE, Text Analysis with Topic Models for the Humanities and Social Sciences (Python and MALLET)
Beginners Guide to Topic Modeling in Python (uses the Python gensim library)
Matthew Jockers, Text Analysis with R for Students of Literature
See the WhatEvery1Says Report on Topic Modeling Interfaces for an overview of other methods of visualising topic models.
Good topics are normally judged by the “semantic coherence” of their terms, but there is no proven statistical heuristic for demonstrating this.
Typically, human intuition is used to label the topics (e.g. Religion and Deconstructionism: religion religious buddhist faith philosophy traditions god derrida ...).
Less semantically coherent topics can be the most interesting because they bring together terms human users might not relate.
Junk topics can be ignored, but a “good” topic model will have a relatively low percentage of junk topics.
If each stage is a transformation (“deformance”) of the source text, how do we relate the results of this transformation to the original?
How does the size and nature of our data affect the results of topic modelling?
What human decisions influence the construction of the model?
Mac | Windows |
---|---|
1. Download TopicModelingTool.dmg. 2. Open it by double-clicking. 3. Drag the app into your Applications folder (or into any folder at all). 4. Run the app by double-clicking. | 1. Download TopicModelingTool.zip. 2. Extract the files into any folder. 3. Open the folder containing the files. 4. Double-click the file called TopicModelingTool.exe to run it. |
On the Mac, if you get an error saying that the file is from an “unidentified developer”, press control while double-clicking. You will be given an option to run the file.
The tool requires an input directory and an output directory. Dump your text collection in the former. All documents should be text files in the same directory.

1. Create a project folder (e.g. tm_project).
2. Inside it, create a folder called input and a folder called output.
3. Copy your texts into the input folder.

Important: All texts should be in plain text format, and all files should be at the same level. If you find yourself encountering problems with character encoding, read the advice in the Quickstart Guide.
Need some data? There are some sample sets in the workshop sandbox: https://bit.ly/2Hfk3YP.
To train your first model:

1. Click Learn Topics.
2. When the process finishes, go to the output folder and explore the contents. Look especially at the output_html folder and open all_topics.html in a browser.

To experiment with the number of topics:

1. Create a new folder called output2. In the GUI Topic Modeling Tool, set that as the new output folder.
2. Set Number of Topics to “20”.
3. Click the Optional settings... button.
4. Select the Preserve raw MALLET output checkbox and change the number of topic words to print to “10”. Click OK.
5. Click Learn Topics.
6. When finished, go to the output2 folder and open all_topics.html in a browser. What differences are there?

To visualise the topics as multiclouds in Lexos:

1. Click the Document Clouds toggle so that it reads Topic Clouds.
2. Click the Upload File button. In your output2 folder, find the file words-topic-counts.txt in the output_mallet folder and select it.
3. Click Get Graphs. The multiclouds will take time to generate and your browser may freeze. Be patient.

If you click Convert topics to documents, your topics will be converted into “texts” which you can explore with the other features of Lexos (e.g. cluster analysis).

To apply a custom stopword list:

1. Download the stopwords.txt file from the workshop sandbox repository (https://bit.ly/2Hfk3YP).
2. Create a new folder called output3. In the GUI Topic Modeling Tool, set that as the new output folder.
3. Click the Optional settings... button.
4. Select Remove default English stopwords.
5. Click the Stopword file... button and select stopwords.txt. Click OK.
6. Click Learn Topics.
7. When finished, go to the output3 folder and open all_topics.html in a browser. What differences are there?

To explore the remaining settings:

1. Create a new folder called output4. In the GUI Topic Modeling Tool, set that as the new output folder.
2. Click the Optional settings... button and select the options you wish to use.
3. Click OK.
4. Click Learn Topics.
5. When finished, go to the output4 folder and explore the differences.

Some settings worth trying:

- Interval between hyperprior optimizations (also referred to as "hyperparameter optimization"). In short, this allows more general topics to be more prominent in the model. For a fuller explanation, see Christof Schöch's “Topic Modeling with MALLET: Hyperparameter Optimization”.
- The Metadata tool as documented in the Quickstart Guide.
- The --random-seed command to get reproducible results.

Slideshow produced by Scott Kleinman.
Sponsored by UC Irvine, Humanities Commons, and the WhatEvery1Says Project.