University of Luxembourg
October 18, 2016
Doing Digital History: Introduction to Tools and Technology
Which aspects define big data?
What is correlation?
What are two methods of machine learning?
What is 'close reading' anyway? What do you do when reading a text?
[Y]ou invest so much in individual texts only if you think that very few of them really matter (Moretti, 2013)
The injunction of traditional historians to look at “everything” cannot survive in a digital era in which “everything” has survived (Rosenzweig, 2003)
At bottom, it’s a theological exercise—very solemn treatment of very few texts taken very seriously—whereas what we really need is a little pact with the devil: we know how to read texts, now let’s learn how not to read them. Distant reading: where distance, let me repeat it, is a condition of knowledge: it allows you to focus on units that are much smaller or much larger than the text (Moretti, 2013)
[U]nderstanding literature not by studying particular texts, but by aggregating and analyzing massive amounts of data. (Schulz, 2011)
Rather than analyzing a book page by page, analyze a corpus book by book
Corpus (here): an aggregated set of sources
The charts aim to show how one variable relates to another
Y-axis is often frequency per X words
X-axis is often time
Not always the case, e.g. Gendered Language in Teacher Reviews
What are the Y- and X-axes here?
What does colour mean?
Can we move even further away from the book?
Analysing the use of words to interpret the culture
The -omics denotes big data, which is what that suffix has come to imply in modern biology and beyond. (Aiden & Michel, 2013, Uncharted)
The culture is the culture of Boas: empirically knowable, its vast variations a matter of endless curiosity and genuine celebration.
If we list all the words by frequency
Each word occurs roughly 1/rank as often as the most frequent word (Zipf's law)
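The rank-frequency relation can be checked on any text. The following is a minimal sketch, not a method from the lecture: the function name `rank_frequency`, the simple regular-expression tokeniser, and the sample sentence are all assumptions made for illustration.

```python
import re
from collections import Counter

def rank_frequency(text):
    """Return (rank, word, count, predicted) tuples, most frequent word first.

    `predicted` is the count expected if each word occurred exactly
    1/rank as often as the most frequent word.
    """
    # Assumption: a crude tokeniser (lowercase letters and apostrophes only).
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words).most_common()
    if not counts:
        return []
    top_count = counts[0][1]
    return [
        (rank, word, count, top_count / rank)
        for rank, (word, count) in enumerate(counts, start=1)
    ]

# Invented sample text, far too short to show the pattern reliably.
sample = "the cat and the dog and the bird saw the fox"
table = rank_frequency(sample)
for rank, word, count, predicted in table[:3]:
    print(rank, word, count, predicted)
```

On a real corpus the observed counts only approximate the predicted column; the interesting part is how close the fit is across thousands of ranks.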
Can we use concordances?
An n-gram is a sequence of n consecutive words
If we collect enough n-grams, we could recreate the original sentences from a book:
So: words and phrases that are rare are excluded so that full texts cannot be reconstructed
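A minimal sketch of both ideas, extracting n-grams and chaining them back into a sentence. The helper names `ngrams` and `rebuild` and the whitespace tokenisation are assumptions for illustration; the reconstruction only works here because every overlap in the toy sentence is unique and n is at least 2.

```python
def ngrams(text, n):
    """Split text on whitespace and return every run of n consecutive words."""
    tokens = text.split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rebuild(grams):
    """Chain n-grams (n >= 2) whose first n-1 words match the text so far.

    Assumes the n-grams come from one sentence with unambiguous overlaps.
    """
    remaining = list(grams)
    overlap = len(remaining[0]) - 1
    # The opening n-gram is the one whose prefix is nobody else's suffix.
    suffixes = {g[1:] for g in remaining}
    start = next(g for g in remaining if g[:-1] not in suffixes)
    remaining.remove(start)
    tokens = list(start)
    while remaining:
        tail = tuple(tokens[-overlap:])
        following = next((g for g in remaining if g[:-1] == tail), None)
        if following is None:
            break  # no n-gram continues the text; stop with what we have
        tokens.append(following[-1])
        remaining.remove(following)
    return " ".join(tokens)

sentence = "we know how to read texts"
bigrams = ngrams(sentence, 2)
trigrams = ngrams(sentence, 3)

# Sorting scrambles the original order; the overlaps still recover it.
restored = rebuild(sorted(trigrams))
print(bigrams)
print(restored)
```

If a rare n-gram is missing from the collection, the chain breaks at that point, which is why excluding rare phrases prevents full reconstruction.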
500 billion words, written over 5 centuries, 7 different languages, 4% of all books ever published (2009)
In 2012, up to 6% (all digitized books combined would have been 20%)
[W]hat the y-axis shows is this: of all the bigrams contained in our sample of books written in English and published in the United States, what percentage of them are "nursery school" or "child care"? Of all the unigrams, what percentage of them are "kindergarten"?
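The percentage described in the quote can be sketched as follows. This is an illustration under stated assumptions, not the actual Google Books pipeline: the function name `ngram_share` and the two tiny "years" of text are invented, and the shares are therefore far higher than anything seen in a real corpus.

```python
def ngram_share(tokens_by_year, phrase):
    """Per year: what percentage of all n-grams of that length equal `phrase`?"""
    target = tuple(phrase.lower().split())
    n = len(target)
    shares = {}
    for year, tokens in tokens_by_year.items():
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        # The denominator is every n-gram of the same length in that year.
        shares[year] = 100 * grams.count(target) / len(grams) if grams else 0.0
    return shares

# Invented toy corpus: one short token list per publication year.
corpus = {
    1950: "the nursery school opened near the school".split(),
    1960: "child care and nursery school and child care".split(),
}

nursery = ngram_share(corpus, "nursery school")
childcare = ngram_share(corpus, "child care")
print(nursery)
print(childcare)
```

Because the denominator changes every year, a rising line can mean the phrase became more common or that the rest of the corpus changed around it.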
What makes this chart difficult to interpret?
Important to note: smoothing
Makes the chart easier to read, but abstracts away information
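A minimal sketch of what smoothing does, assuming a simple centred moving average in which a smoothing value of k averages each year with up to k years on either side. The function name `smooth` and the series are invented for illustration.

```python
def smooth(values, k):
    """Average each value with up to k neighbours on either side."""
    smoothed = []
    for i in range(len(values)):
        # Near the edges the window is shorter, so fewer values are averaged.
        window = values[max(0, i - k): i + k + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed

# Invented series: a single-year spike in otherwise empty data.
raw = [0.0, 0.0, 9.0, 0.0, 0.0]
smoothed = smooth(raw, 1)
print(smoothed)
```

The one-year spike is spread over three years at a third of its height: the line is easier to read, but the chart no longer shows that the whole effect came from a single year.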
Recent research on Google Books:
[O]ne study that used Google Books to make broad claims about the changing nature of childhood in the mid-20th century, a study that failed to acknowledge that parenting manuals emerged as a genre during that era.
25 October (double lecture)
Reading: (see Moodle)
Work in pairs
Use one of the tools discussed today to try and find something you find interesting. Document your steps and choices and discuss why a finding is of interest, and whether you can be certain of this finding.
Hand in the assignment as HTML; include your name and a decent profile photo
Do note: the finding itself is not the most important aspect of the assignment
Email to firstname.lastname@example.org before the start of the next lecture