Distant Reading

Max Kemman
University of Luxembourg
October 18, 2016

Doing Digital History: Introduction to Tools and Technology

Recap from last time

Which aspects define big data?

What is correlation?

What are two methods of machine learning?

Today

  • What is distant reading?
  • Culturomics
  • Google Books NGram Viewer
  • Biases in the chart
  • Hands-on
  • Next time
    • Assignment

What is distant reading?

What is 'close reading' anyway? What do you do when reading a text?

[Y]ou invest so much in individual texts only if you think that very few of them really matter

Moretti (2013)

Against close reading

The injunction of traditional historians to look at “everything” cannot survive in a digital era in which “everything” has survived

Rosenzweig (2003)

Pact with the devil

At bottom, it’s a theological exercise—very solemn treatment of very few texts taken very seriously—whereas what we really need is a little pact with the devil: we know how to read texts, now let’s learn how not to read them. Distant reading: where distance, let me repeat it, is a condition of knowledge: it allows you to focus on units that are much smaller or much larger than the text

Moretti (2013)

Distant reading

[U]nderstanding literature not by studying particular texts, but by aggregating and analyzing massive amounts of data. (Schulz, 2011)

Rather than analyzing a book page by page, analyze a corpus book by book

Corpus (here): an aggregated set of sources

Viewing the aggregate (Moretti)

Viewing the aggregate (Aiden & Michel)

The charts

The charts aim to show how one variable relates to another

Vertical: y-axis

Horizontal: x-axis

Y-axis is often frequency per X words

X-axis is often time

X-Axis

Not always the case, e.g. Gendered Language in Teacher Reviews

What are the y- and x-axes here?

What does colour mean?

Reading the distance

Looking closer (but not too close!)

Playing around with the view

Finding a correlation

Culturomics

Can we move even further away from the book?

Analysing the use of words to interpret the culture

The -omics denotes big data, which is what that suffix has come to imply in modern biology and beyond.
The culture is the culture of Boas: empirically knowable, its vast variations a matter of endless curiosity and genuine celebration.

Aiden & Michel (2013) Uncharted

Zipf's Law?

If we list all the words by frequency

The word at rank r occurs roughly 1/r times as often as the most frequent word
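
A minimal sketch of such a rank-frequency list in Python. The sample sentence is invented for illustration, and real corpora follow Zipf's law only approximately; the last column is what the law would predict from the top word's count.

```python
from collections import Counter

def rank_frequency(text):
    """Return (rank, word, count, zipf_prediction) rows, most frequent word first."""
    counts = Counter(text.lower().split())
    ranked = counts.most_common()
    if not ranked:
        return []
    top_count = ranked[0][1]
    rows = []
    for rank, (word, count) in enumerate(ranked, start=1):
        # Zipf's law predicts: count of the top word divided by the rank
        rows.append((rank, word, count, top_count / rank))
    return rows

# Toy text (an assumption, not taken from any corpus)
sample = "the cat and the dog and the bird saw the cat"
for rank, word, count, predicted in rank_frequency(sample):
    print(rank, word, count, round(predicted, 2))
```

Comparing the observed count with the prediction per rank shows how closely a text follows the law.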

Guidelines for creating a dataset

  1. Protect rights of the millions of people who created the original dataset
  2. Needs to be interesting
  3. Cannot run counter to the purposes of the company that serves as the data's gatekeeper
  4. Needs to be something one can actually generate in practice

Can we use concordances?

N-grams

An n-gram is a sequence of n words

  1. this (1-gram)
  2. this book (2-gram)
  3. this book is (3-gram)
  4. this book is very (4-gram)
  5. this book is very heavy (5-gram)
  6. etc
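
The list above can be generated with a few lines of Python. This is a sketch that splits on whitespace only; real tools also deal with punctuation and capitalisation, which is left out here.

```python
def ngrams(text, n):
    """Return all n-grams (as strings) of a whitespace-separated text."""
    words = text.split()
    # Slide a window of n words over the text, one word at a time
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "this book is very heavy"
for n in range(1, 6):
    print(n, ngrams(sentence, n))
```

A sentence of 5 words contains five 1-grams, four 2-grams, and so on down to a single 5-gram.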

Protecting copyright

If we collect enough n-grams, we could recreate the original sentences from a book:

  1. this book is very heavy
  2. is very heavy and I
  3. heavy and I have to
  4. I have to lift it

So: words and phrases that are rare are excluded so that full texts cannot be reconstructed
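
A sketch of this filtering step in Python. The three "books" and the threshold `min_count=3` are invented for illustration; the actual threshold used for a published dataset is a choice made by its creators.

```python
from collections import Counter

def ngrams(words, n):
    """Return all n-grams of a list of words as tuples."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def frequent_ngrams(books, n, min_count):
    """Count n-grams over all books and keep only those seen at least min_count times."""
    counts = Counter()
    for book in books:
        counts.update(ngrams(book.split(), n))
    return {gram: c for gram, c in counts.items() if c >= min_count}

# Toy corpus (hypothetical): only the opening phrase is shared by all three
books = [
    "this book is very heavy and I have to lift it",
    "this book is very heavy but I like it",
    "this book is very heavy so I left it",
]
kept = frequent_ngrams(books, 5, min_count=3)
print(kept)
```

Only the shared 5-gram survives; the rare 5-grams that would let someone chain the rest of each sentence back together are dropped.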

Reading the distance

Looking closer (but not too close!)

Other examples of Culturomics

Lev Manovich coined "cultural analytics"

Visual units of analysis

http://selfiecity.net/
http://www.the-everyday.net/

Google Books NGram Viewer

500 billion words, written over 5 centuries, 7 different languages, 4% of all books ever published (2009)

In 2012, up to 6% (including all books digitised by then would have given 20%)

https://books.google.com/ngrams

What does GBooks show?

[W]hat the y-axis shows is this: of all the bigrams contained in our sample of books written in English and published in the United States, what percentage of them are "nursery school" or "child care"? Of all the unigrams, what percentage of them are "kindergarten"?
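
A sketch of that calculation in Python. All counts below are invented for illustration; the point is that the y-axis is a share of all n-grams per year, not a raw count.

```python
def yearly_share(ngram_counts, totals):
    """Percentage of all n-grams in each year that are the given n-gram."""
    return {
        year: 100 * ngram_counts.get(year, 0) / totals[year]
        for year in sorted(totals)
    }

# Hypothetical numbers: occurrences of one word, and all unigrams per year
kindergarten = {1900: 50, 1950: 400}
all_unigrams = {1900: 1_000_000, 1950: 10_000_000}

for year, share in yearly_share(kindergarten, all_unigrams).items():
    print(year, f"{share:.4f}%")
```

In this toy example the raw count rises from 50 to 400, yet the share falls, because the number of words published per year grew even faster.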

Google Ngrams raw data

Advanced usage

  • Wildcards: * in place of a word
  • Inflection: book/booked/booking -> book_INF
  • Case insensitive (book, BOOK -> 2 different words or the same?)
  • Part-of-speech tags: nouns vs verbs: tackle_VERB vs tackle_NOUN -> full list of tags given
  • Combining ngrams: + - / * :
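
To make the case-insensitive option concrete, here is a sketch in Python of what merging case variants amounts to. The counts are invented, and this illustrates the idea only; it is not a description of how the Ngram Viewer is implemented.

```python
from collections import Counter

def merge_case(counts):
    """Add up the counts of words that differ only in capitalisation."""
    merged = Counter()
    for word, count in counts.items():
        merged[word.lower()] += count
    return merged

# Hypothetical counts for one year
counts = {"book": 120, "Book": 30, "BOOK": 5, "books": 40}
print(merge_case(counts))
```

Note that "books" stays separate: merging inflections is a different choice from merging capitalisation.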

https://books.google.com/ngrams/info

Biases in the charts

What makes this chart difficult to interpret?

OCR

The bumps

Important to note is smoothing

Makes the chart easier to read, but abstracts away information
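
A sketch of smoothing as a centred moving average in Python: each year's value is replaced by the mean of itself and its k neighbours on either side. The data are invented, and averaging over a shorter window at the edges is an assumption of this sketch.

```python
def smooth(values, k):
    """Centred moving average with k neighbours on each side (fewer at the edges)."""
    smoothed = []
    for i in range(len(values)):
        window = values[max(0, i - k): i + k + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed

# Hypothetical yearly values with a single spike in the middle year
raw = [0, 0, 9, 0, 0]
print(smooth(raw, 0))  # no smoothing: the spike is visible
print(smooth(raw, 1))  # the spike is spread over three years
```

With smoothing, a one-year spike looks like a lower, broader bump, which is the information the chart abstracts away.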

ftrolling,strolling Google ngram

What is in the corpus?

(Source)

What is in the corpus?

Recent research on Google Books:

  • Raw numbers do not reflect popularity
  • Lots of scientific literature
  • Language
  • Spurious correlations?
    [O]ne study that used Google Books to make broad claims about the changing nature of childhood in the mid-20th century, a study that failed to acknowledge that parenting manuals emerged as a genre during that era.

(Source)

Hands-on

Go to http://bookworm.culturomics.org/ and choose a bookworm or other ngram viewer

Or use the Google Books Ngram Viewer https://books.google.com/ngrams

Try the same keywords in different tools

Try different keywords in the same tool

For next time

25 October (double lecture)

What? Investigating what a corpus is about

Reading: (see Moodle)

  • Braake, S. ter, & Fokkens, A. (2015). How to Make it in History. Working Towards a Methodology of Canon Research with Digital Methods. In Biographical Data in a Digital World 2015 (pp. 85–93).

Assignment

Work in pairs

Use one of the tools discussed today to try and find something you find interesting. Document your steps and choices and discuss why a finding is of interest, and whether you can be certain of this finding.

Hand in the assignment in HTML; include your name and a decent profile photo

Assignment

Grading

  • 1pt for free
  • 3pts for HTML
  • 3pts for documentation of your process
  • 3pts for critical reflection on your finding

Do note: the finding itself is not the most important aspect of the assignment

Email to max.kemman@uni.lu before the start of the next lecture