
Google Does Culturomics

Kaylee

Just found this tool -- Google Ngram Viewer.

Google has a tool that allows you to search for words and terms among their library of digitized books. As a result, you can see how the usage of phrases like "rock and roll", "The Great War", "World War I", etc. has waxed and waned over the past several hundred years.

The default search parameters cover the years 1800 - 2000. I think the default of 1800 was chosen because there were fewer than 500,000 books published in English before the 19th century. More info on how the Ngram viewer works here and here.

Per the last link, the Ngram viewer searches through over 5.2 million books!* And it's lightning fast.

I'm old enough to remember when I used to fall asleep waiting for my tiny spreadsheet file to update.

This is amazing stuff.




* if you select the English corpus
 
Be careful about the limitations of Google's optical character recognition.

For a good example, search "Internet". There's a healthy rise in results in the first decade of the 20th century. Why? It looks like the OCR program saw the abbreviation "Internat." for "International" and decided to "correct" it to a word that was already in its dictionary.

The same thing happens for Louis-Antoine-Cyprien Infernet, a French naval officer who was at the Battle of Trafalgar.

Although I am grateful for discovering the latter, since his is an interesting story and his name makes a useful nickname for the Web.
 
Excellent catch. So I'm guessing that you downloaded the raw data?

Although I am grateful for discovering the latter, since his is an interesting story and his name makes a useful nickname for the Web.

Great idea!
 
Excellent catch.
Thank you. I've seen results from OCR scanning of books from Google and other places. Mistakes happen often enough for me to expect some noise in this tool.

So I'm guessing that you downloaded the raw data?
No, I just looked at some results from the Ngram Viewer itself. Below the graph is a row of date ranges. I clicked on the first two, refined some of the time parameters, and looked at the resulting book pages. In the first time period for "Internet", look at the results for "THE GENTLEMAN'S MAGAZINE AND HIFTORICAL CHRONICLE". (Even the title is spelled wrong, thanks to the computer's misreading of the long 's'.)

Click the book search for "1906 - 1997", and then adjust the end date to 1910 (bottom of the left-hand column). Then look at the first result: "The Geographical journal: Volume 27 - Page 519", which mentions "Rep. Eighth Internat. G. Congress 1904 (1905)". The OCR even interpreted one of these "G." as an "O."

It's a fun tool, but people should know that it might kick up some erroneous and even anachronistic results.
 
I searched the C word (well, why not?), and found quite a few earlier than one might expect. Nearly all of them, apart from dictionaries of nautical terms, turn out to be misreadings of cant or other words.
 
I'm glad you posted -- I didn't notice that the years at the bottom of the page were links to a listing of the sources. That makes it even more useful even though, as you pointed out, the OCR scanning is buggy.
 
:D

Thanks Rat. Nice to see you doing your part to be the 'E' in the JREF :) .
 
You're welcome. You may also find that misreading the long s in 'suck' can cause some unintentionally funny passages, as it were.
 
"Becalmed" shows a similar downward trend, but "steam" shows a definite peak at about 1905.
 
Rather interesting: I did a search for "atheist", thinking that the frequency of use would have increased... but in fact, there seems to be a general downward trend, with the peaks being in 1810 and 1840.

By contrast, the phrase "critical thinking" is almost non-existent prior to 1920, then has a very steep increase in use thereafter.

"Evolution" has a predictable increase following Darwin's publication, but more puzzling to me is the distinct decrease in use of the term from 1930 to 1960.
 
I like looking at this stuff and just trying to figure out "why".

Tried "Marilyn Monroe", and found that her stats are almost zero until after her death, presumably because that's when people actually started writing books about her. But her popularity in literature takes a major nosedive (almost back to zero) in the '70s, then increases rapidly after that, with a continuous upward trend. Don't know if it would be a contributing factor, but the increase in use of her name seems to coincide generally with the time period of Madonna's rise to fame.

I did a search for "Hitler", expecting to find it (as a family name) prior to Adolf, but with a major increase in use after the events of WW II; but in fact, the most frequent use in literature was between 1800-1850... long before the rise and fall of Adolf Hitler's Nazis. And looking at the results, one discovers that the word "hitler" (lower case) was used variously as an adjective and verb:

"the reputation of the French arms would hitler in the opinion of the public"
"We ft/id also littie difference between the temperature os the air at the bottom of this hitler mountain, and that of its summit"
"and yet he would neither hitler it to be taken out"

Yet as mentioned by others, there are transcription errors (see the second quote above, where "find" has been scanned as "ft/id" and "little" as "littie"), so I'm not sure if these results come from transcription errors (the word in the second quote, for example, could be "littler", not "hitler"), or if this is a word that simply fell out of use. I did a search of a few online dictionaries, and can find no reference to any meaning of the word other than those tied directly to Adolf.
 
The N-word was used the most during the 1940s. The early 1800s were the least racist time period. :p

The F-word is by far the most interesting. After the 1960s it increases exponentially. Prior to 1820 we see an exponential decline. In between, it's almost unused.

God decreases gradually.

My first name gained a lot of popularity around my birthday. Odd to think that there are tons of other people with the same mentality as my parents.
 
