"In a 'closed vocabulary' approach," Kern said, "psychologists might pick a list of words they think signal positive emotion, like 'contented,' 'enthusiastic' or 'wonderful' and then look at the frequency of a person's use of these words as a way to measure how happy that person is. However, closed vocabulary approaches have several limitations, including that they do not always measure what they intend to measure."
"For example," Ungar said, "one might find the energy sector uses more negative emotion words, simply because they use the word 'crude' more. But this points to the need to use multi-word expressions to understand the intended meaning. 'Crude oil' is different than 'crude,' and, likewise, being 'sick of' is different from merely being 'sick.'"
Another inherent limitation of the closed vocabulary approach is that it relies upon a preconceived, fixed set of words. Such a study might be able to confirm that depressed people do indeed use expected words (like "sad") more frequently, but it cannot generate new insights, such as the finding that they talk less about sports or social activities than happy people do.
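To make the closed vocabulary idea and the "crude oil" problem concrete, here is a minimal Python sketch. The word lists, the phrase list and the function names are hypothetical, chosen only for illustration; they are not the lexicons or code used by the researchers.

```python
import re

# Hypothetical, hand-picked lexicon (an assumption for this sketch only).
NEGATIVE_WORDS = {"crude", "sick"}

# Hypothetical multi-word expressions whose meaning differs from their parts.
NEUTRAL_PHRASES = ["crude oil"]


def tokenize(text):
    """Lowercase the text and split it into word tokens.

    Underscores are kept so that a joined phrase stays a single token.
    """
    return re.findall(r"[a-z_']+", text.lower())


def closed_vocab_rate(text, lexicon):
    """Return the share of tokens that appear in a fixed word list."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in lexicon)
    return hits / len(tokens)


def join_phrases(text, phrases):
    """Rewrite each listed phrase as one token, e.g. 'crude oil' -> 'crude_oil'."""
    joined = text.lower()
    for phrase in phrases:
        joined = joined.replace(phrase, phrase.replace(" ", "_"))
    return joined


sample = "Crude oil prices rose; the crude remark was ignored."
naive_rate = closed_vocab_rate(sample, NEGATIVE_WORDS)
phrase_aware_rate = closed_vocab_rate(
    join_phrases(sample, NEUTRAL_PHRASES), NEGATIVE_WORDS
)
```

In this toy sample the naive count treats both uses of "crude" as negative emotion, while the phrase-aware count skips the one inside "crude oil." Either way, the measure can only ever see the words someone thought to put on the list.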
Past psychological language studies have necessarily relied on closed vocabulary approaches as their small sample sizes made open approaches impractical. The emergence of massive language datasets afforded by social media now allows for qualitatively different analyses.
"Most words occur rarely -- any sample of writing, including Facebook status updates, only contains a small portion of the average vocabulary," Schwartz said. "This means that, for all but the most common words, you need writing samples from many people in order to make connections with psychological traits. Traditional studies have found interesting connections with pre-chosen categories of words such as 'positive emotion' or 'function words.' However, the billions of word instances available in social media allow us to find patterns at a much richer level."
The open-vocabulary approach, by contrast, derives important words and phrases from the sample itself. With more than 700 million words, phrases and topics drilled out of this study's sample of
This large data size was critical to the specific technique the team used, known as differential language analysis, or DLA. The researchers used DLA to isolate the words and phrases that clustered around the various characteristics self-reported in the volunteers' questionnaires: age, gender and scores for the "Big Five" personality traits, which are extraversion, agreeableness, conscientiousness, neuroticism and openness. The Big Five model was chosen as it is a common and well-studied way of quantifying personality traits, but the researchers' method could be applied to models that measure other characteristics, including depression or happiness.
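The core idea of an open-vocabulary analysis can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the researchers' DLA pipeline: it uses a plain Pearson correlation between each word's relative frequency and a trait score, and it leaves out phrases, topics, controls for age and gender, and significance testing. The function names, the `min_users` threshold and the toy data are all hypothetical.

```python
from collections import Counter
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists; 0.0 if either is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def word_trait_correlations(texts, scores, min_users=2):
    """Rank every word found in the sample by its correlation with a trait.

    texts  -- one string of writing per person
    scores -- one trait score per person, in the same order
    Words used by fewer than min_users people are skipped, since rare words
    give unreliable correlations.
    """
    counts = [Counter(text.lower().split()) for text in texts]
    totals = [sum(c.values()) or 1 for c in counts]
    vocabulary = set().union(*counts)  # derived from the data, not preselected

    results = {}
    for word in vocabulary:
        if sum(1 for c in counts if word in c) < min_users:
            continue
        rel_freqs = [c[word] / total for c, total in zip(counts, totals)]
        results[word] = pearson(rel_freqs, scores)
    return sorted(results.items(), key=lambda item: item[1], reverse=True)


# Toy data, invented for illustration only.
texts = [
    "party party tonight",
    "great party tonight",
    "reading anime tonight",
    "anime anime reading",
]
extraversion = [5.0, 4.0, 2.0, 1.0]
ranked = word_trait_correlations(texts, extraversion)
```

Words at the top of `ranked` are those most associated with high scores on the trait, and words at the bottom with low scores. The `min_users` filter reflects the point made above: most words are rare, so many writers are needed before a word's link to a trait can be estimated at all.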
To visualize their results, the researchers created word clouds that summarized the language that statistically predicted a given trait, with the correlation strength of a word in a given cluster being represented by its size. For example, a word cloud that shows language used by extraverts prominently features words and phrases like "party," "great night" and "hit me up," while a word cloud for introverts features many references to Japanese media and emoticons.