
Natural language processing and social media intelligence
Mining insights from social media data requires more than sorting and counting words.
By Alan Morrison and Steve Hamby



Most enterprises are more than eager to further develop their capabilities in social media intelligence (SMI)—the ability to mine the public social media cloud to glean business insights and act on them. They understand the essential value of finding customers who discuss products and services candidly in public forums. The impact SMI can have goes beyond basic market research and test marketing. In the best cases, companies can uncover clues to help them revisit product and marketing strategies.

“Ideally, social media can function as a really big focus group,” says Jeff Auker, a director in PwC’s Customer Impact practice. Enterprises, which spend billions on focus groups, spent nearly $1.6 billion in 2011 on social media marketing, according to Forrester Research. That number is expected to grow to nearly $5 billion by 2016.1

Auker cites the example of a media company’s use of SocialRep,2 a tool that uses a mix of natural language processing (NLP) techniques to scan social media. Preliminary scanning for the company, which was looking for a gentler approach to countering piracy, led to insights about how motivations for movie piracy differ by geography. “In India, it’s the grinding poverty. In Eastern Europe, it’s the underlying socialist culture there, which is, ‘my stuff is your stuff.’ There, somebody would buy a film and freely copy it for their friends. In either place, though, intellectual property rights didn’t hold the same moral sway that they did in some other parts of the world,” Auker says.

This article explores the primary characteristics of NLP, which is the key to SMI, and how NLP is applied to social media analytics. The article considers what’s in the realm of the possible when mining social media text, and how informed human analysis becomes essential when interpreting the conversations that machines are attempting to evaluate.


Natural language processing: Its components and social media applications

NLP technologies for SMI are just emerging. When used well, they serve as a more targeted, semantically based complement to pure statistical analysis, which is more scalable and able to tackle much larger data sets. While statistical analysis looks at the relative frequencies of word occurrences and the relationships between words, NLP tries to achieve deeper insights into the meanings of conversations.

“It takes very rare skill sets in the NLP community to figure this stuff out. It’s incredibly processing and storage intensive, and it takes awhile. If you used pure NLP to tell me everything that’s going on, by the time you indexed all the conversations, it might be days or weeks later. By then, the whole universe isn’t what it used to be.”
—Jeff Auker, PwC

The best NLP tools can provide a level of competitive advantage, but it’s a challenging area for both users and vendors. “It takes very rare skill sets in the NLP community to figure this stuff out,” Auker says. “It’s incredibly processing and storage intensive, and it takes awhile. If you used pure NLP to tell me everything that’s going on, by the time you indexed all the conversations, it might be days or weeks later. By then, the whole universe isn’t what it used to be.”

First-generation social media monitoring tools provided some direct business value, but they also left users with more questions than answers. And context was a key missing ingredient. Rick Whitney, a director in PwC’s Customer Impact practice, makes the following distinction between the first- and second-generation SMI tools: “Without good NLP, the first-generation tools don’t give you that same context,” he says.

What constitutes good NLP is open to debate, but it’s clear that some of the more useful methods blend different levels of detailed analysis with sophisticated filtering, while others stay attuned to the full context of the conversations to ensure that novel and interesting findings that could inadvertently be screened out still make it through the filters.

Types of NLP

NLP consists of several subareas of computer-assisted language analysis, ways to help scale the extraction of meaning from text or speech. NLP software has been used for several years to mine data from unstructured data sources, and the software had its origins in the intelligence community. During the past few years, the locus has shifted to social media intelligence and marketing, with literally hundreds of vendors springing up.

NLP techniques span a wide range, from analysis of individual words and entities, to relationships and events, to phrases and sentences, to document-level analysis. (See Figure 1.)

Figure 1
The varied paths to meaning in text analytics

The primary NLP techniques include these:

Word or entity (individual element) analysis

  • Word sense disambiguation—Identifies the most likely meaning of ambiguous words based on context and related words in the text. For example, it will determine if the word “bank” refers to a financial institution, the edge of a body of water, the act of relying on something, or one of the word’s many other possible meanings.
  • Named entity recognition (NER)—Identifies proper nouns. Capitalization analysis can help with NER in English, for instance, but capitalization varies by language and is entirely absent in some.
  • Entity classification—Assigns categories to recognized entities. For example, “John Smith” might be classified as a person, whereas “John Smith Agency” might be classified as an organization, or more specifically “insurance company.”
  • Part of speech (POS) tagging—Assigns a part of speech (such as noun, verb, or adjective) to every word to form a foundation for phrase- or sentence-level analysis.
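As a concrete sketch of the first technique, word sense disambiguation can be approximated with Lesk-style gloss overlap: pick the sense whose dictionary definition shares the most words with the surrounding text. The tiny sense inventory below is hypothetical, and production systems use full lexicons and statistical models, but the mechanism is the same:

```python
# Simplified Lesk-style word sense disambiguation: choose the sense
# whose gloss (definition) shares the most words with the context.
# The sense inventory below is illustrative, not a real lexicon.
SENSES = {
    "bank": {
        "financial": "institution that accepts deposits and lends money",
        "river": "sloping land along the edge of a body of water",
        "rely": "act of relying or counting on something",
    }
}

def disambiguate(word, context):
    """Return the sense of `word` whose gloss overlaps `context` most."""
    context_words = set(context.lower().split())
    best_sense, best_overlap = None, -1
    for sense, gloss in SENSES.get(word, {}).items():
        overlap = len(context_words & set(gloss.split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

sense = disambiguate("bank", "the bank lends money and accepts deposits")
```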

Relationship and event analysis

  • Relationship analysis—Determines relationships within and across sentences. For example, “John’s wife Sally …” implies a symmetric relationship of spouse.
  • Event analysis—Determines the type of activity based on the verb and entities that have been assigned to a classification. For example, an event “BlogPost” may have two types associated with it—a blog post about a company versus a blog post about its competitors—even though a single verb “blogged” initiated the two events. Event analysis can also define relationships between entities in a sentence or phrase; the phrase “Sally shot John” might establish a relationship between John and Sally of murder, where John is also categorized as the murder victim.
  • Co-reference resolution—Identifies words that refer to the same entity. For example, in these two sentences—“John bought a gun. He fired the gun when he went to the shooting range.”—the “He” in the second sentence refers to “John” in the first sentence; therefore, the events in the second sentence are about John.
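A toy illustration of the co-reference idea, assuming a naive heuristic that maps each personal pronoun to the most recently seen capitalized name (real resolvers use gender, number, and syntactic features):

```python
import re

PRONOUNS = {"he", "she", "him", "her", "his", "hers"}

def resolve_pronouns(text):
    """Replace personal pronouns with the most recently seen capitalized
    token. This deliberately ignores sentence-initial capitalization and
    gender agreement; it shows only the shape of co-reference resolution."""
    last_name, out = None, []
    for tok in re.findall(r"[A-Za-z]+", text):
        if tok.lower() in PRONOUNS:
            out.append(last_name if last_name else tok)
        else:
            if tok[0].isupper():
                last_name = tok
            out.append(tok)
    return " ".join(out)

resolved = resolve_pronouns(
    "John bought a gun. He fired the gun when he went to the shooting range.")
```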

Syntactic (phrase and sentence construction) analysis

  • Syntactic parsing—Generates a parse tree, or the structure of sentences and phrases within a document, which can lead to helpful distinctions at the document level. Syntactic parsing often involves the concept of sentence segmentation, which builds on tokenization, or word segmentation, in which words are discovered within a string of characters. In English and other languages, words are separated by spaces, but this is not true in some languages (for instance, Chinese).
  • Language services—Range from translation to parsing and extracting in native languages. For global organizations, these services are a major differentiator because of the different techniques required for different languages.
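For English, the sentence segmentation and tokenization layers mentioned above can be sketched with regular expressions (a rough approximation; real segmenters must also handle abbreviations, quotations, and languages like Chinese that lack space-delimited words):

```python
import re

def sentence_segment(text):
    """Split at sentence-final punctuation followed by whitespace and a
    capital letter. Abbreviations such as 'Dr.' would wrongly split here."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())

def tokenize(sentence):
    """Word segmentation for space-delimited languages: words and
    punctuation marks become separate tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)

sentences = sentence_segment("John bought a gun. He went to the range.")
tokens = tokenize(sentences[0])
```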

Document analysis

  • Summarization and topic identification—Topic identification distills the subject of an entire document or subsection into a few words; summarization, by contrast, produces a longer synopsis of the document or subsection.
  • Sentiment analysis—Recognizes subjective information in a document that can be used to identify “polarity,” or whether the opinions expressed about entities and topics are positive or negative. This analysis is often used to determine trends in public opinion, but it also has other uses, such as determining confidence in facts extracted using NLP.
  • Metadata analysis—Identifies and analyzes the document source, users, dates, and times created or modified.
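At its simplest, sentiment analysis can be sketched as lexicon-based polarity scoring with negation handling; the word lists below are illustrative stand-ins for the much larger lexicons and statistical models real tools use:

```python
# Lexicon-based polarity: count positive and negative words, flipping
# the sign when a negator immediately precedes the word.
POSITIVE = {"great", "love", "excellent", "fast", "reliable"}
NEGATIVE = {"bad", "hate", "slow", "broken", "terrible"}
NEGATORS = {"not", "never", "no"}

def polarity(text):
    """Return >0 for net-positive text, <0 for net-negative, 0 for neutral."""
    words = text.lower().split()
    score = 0
    for i, w in enumerate(words):
        sign = 1 if w in POSITIVE else -1 if w in NEGATIVE else 0
        if sign and i > 0 and words[i - 1] in NEGATORS:
            sign = -sign  # "not great" counts as negative
        score += sign
    return score
```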

NLP applications require the use of several of these techniques together. Some of the most compelling NLP applications for social media analytics include enhanced extraction, filtered keyword search, social graph analysis, and predictive and sentiment analysis.

Enhanced extraction

NLP tools are being used to mine both the text and the metadata in social media. For example, the inTTENSITY Social Media Command Center (SMCC) integrates Attensity Analyze with Inxight ThingFinder—both established tools—to provide a parser for social media sources that include metadata and text. The inTTENSITY solution uses Attensity Analyze for predicate analysis to provide relationship and event analysis, and it uses ThingFinder for noun identification.

Filtered keyword search

Many keyword search methods exist. Most require lists of keywords to be defined and generated. Documents containing those words are matched. WordStream is one of the prominent tools in keyword search for SMI. It provides several ways for enterprises to filter keyword searches.
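The underlying mechanism can be sketched as boolean term filtering (the filter syntax of any particular tool, WordStream included, differs; the posts and terms here are invented):

```python
# Filtered keyword search: keep documents that contain every required
# term, at least one optional term, and no excluded term.
def keyword_filter(docs, all_of=(), any_of=(), none_of=()):
    hits = []
    for doc in docs:
        words = set(doc.lower().split())
        if not set(all_of) <= words:
            continue                      # missing a required term
        if any_of and not words & set(any_of):
            continue                      # no optional term present
        if words & set(none_of):
            continue                      # contains an excluded term
        hits.append(doc)
    return hits

posts = [
    "love my new ford escort",
    "ford dealership service was slow",
    "escort service ads",               # noise a brand filter should drop
]
hits = keyword_filter(posts, all_of=["escort"], any_of=["ford", "chevy"])
```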

Social graph analysis

Social graphs assist in the study of a subject of interest, such as a customer, employee, or brand. These graphs can be used to:

  • Determine key influencers in each major node section
  • Discover if one aspect of the brand needs more attention than others
  • Identify threats and opportunities based on competitors and industry
  • Provide a model for collaborative brainstorming
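One common proxy for the first item, finding key influencers, is centrality in a mention graph. A minimal sketch using degree centrality (production tools use richer measures such as PageRank over far larger graphs; the accounts below are invented):

```python
from collections import defaultdict

def degree_centrality(edges):
    """Rank nodes by the number of edges touching them.
    edges: iterable of (mentioner, mentioned) pairs."""
    degree = defaultdict(int)
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return sorted(degree.items(), key=lambda kv: kv[1], reverse=True)

mentions = [("ann", "brand"), ("bob", "brand"), ("ann", "bob"),
            ("cat", "ann"), ("dan", "ann")]
ranking = degree_centrality(mentions)  # "ann" is the best-connected node
```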

Many NLP-based social graph tools extract and classify entities and relationships in accordance with a defined ontology or graph. But some social media graph analytics vendors, such as Nexalogy Environics, rely on more flexible approaches outside standard NLP. “NLP rests upon what we call static ontologies—for example, the English language represented in a network of tags on about 30,000 concepts could be considered a static ontology,” Claude Théoret, president of Nexalogy Environics, explains. “The problem is that the moment you hit something that’s not in the ontology, then there’s no way of figuring out what the tags are.”

In contrast, Nexalogy Environics generates an ontology for each data set, which makes it possible to capture meaning missed by techniques that are looking just for previously defined terms. “That’s why our stuff is not quite real time,” he says, “because the amount of number crunching you have to do is huge and there’s no human intervention whatsoever.” (For an example of Nexalogy’s approach, see the article, “The third wave of customer analytics,” on page 06.)

Predictive analysis and early warning

Predictive analysis can take many forms, and it may or may not involve NLP. Predictive modeling and statistical analysis can be used effectively without the help of NLP to analyze a social network and find and target influencers in specific areas. Before he came to PwC, Mark Paich, a director in the firm’s advisory service, did some agent-based modeling3 for a Los Angeles–based manufacturer that hoped to change public attitudes about its products. “We had data on which products people had from the competitors and which products they had from this particular firm. And we also had some survey data about attitudes that people had toward the product. We were able to say something about what type of people, according to demographic characteristics, had different attitudes.”

Paich’s agent-based modeling effort matched attitudes with the manufacturer’s product types. “We calibrated the model on the basis of some fairly detailed geographic data to get a sense as to whose purchases influenced whose purchases,” Paich says. “We didn’t have direct data that said, ‘I influence you.’ We made some assumptions about what the network would look like, based on studies of who talks to whom. Birds of a feather flock together, so people in the same age groups who have other things in common tend to talk to each other. We got a decent approximation of what a network might look like, and then we were able to do some statistical analysis.”

That statistical analysis helped with the influencer targeting. According to Paich, “It said that if you want to sell more of this product, here are the key neighborhoods. We identified the key neighborhood census tracts you want to target to best exploit the social network effect.”

Predictive modeling is helpful when the level of specificity needed is high (as in the Los Angeles manufacturer’s example), and it’s essential when the cost of a wrong decision is high.4 But in other cases, less formal social media intelligence collection and analysis are often sufficient. When it comes to brand awareness, NLP can help provide context surrounding a spike in social media traffic about a brand or a competitor’s brand.

“As Clay Shirky pointed out in 2003, influence is only influential within a context.”
—Claude Théoret, Nexalogy Environics

That spike could be a key data point to initiate further action or research to remediate a problem before it gets worse or to take advantage of a market opportunity before a competitor does. (See the article, “The third wave of customer analytics,” on page 06.) Because social media is typically faster than other data sources in delivering early indications, it’s becoming a preferred means of identifying trends. Many companies mine social media to determine who the key influencers are for a particular product. But mining the context of the conversations via interest graph analysis is important. “As Clay Shirky pointed out in 2003, influence is only influential within a context,” Théoret says.

Nearly all SMI products provide some form of timeline analysis of social media traffic with historical analysis and trending predictions.
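A minimal sketch of that kind of timeline trending, flagging days whose mention volume far exceeds a trailing baseline (the window and threshold are illustrative; real tools tune them per source):

```python
def spikes(counts, window=7, factor=2.0):
    """Return indices where a day's count exceeds `factor` times the
    mean of the preceding `window` days."""
    flagged = []
    for i in range(window, len(counts)):
        baseline = sum(counts[i - window:i]) / window
        if counts[i] > factor * baseline:
            flagged.append(i)
    return flagged

daily_mentions = [40, 42, 38, 41, 39, 40, 43, 120, 44, 41]
alerts = spikes(daily_mentions)  # day 7 stands out against the baseline
```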

Sentiment analysis

Even when overall social media traffic is within expected norms or predicted trends, the difference between positive, neutral, and negative sentiment can stand out. Sentiment analysis can suggest whether a brand, customer support, or a service is better or worse than normal. Correlating sentiment to recent changes in product assembly, for example, could provide essential feedback.

Most customer sentiment analysis today is conducted only with statistical analysis. Government intelligence agencies have led with more advanced methods that include semantic analysis. In the US intelligence community, media intelligence generally provides early indications of events important to US interests, such as assessing the impact of terrorist activities on voting in countries the United States is aiding, or mining social media for early indications of a disease outbreak. In these examples, social media proves to be one of the fastest, most accurate sources for this analysis.


NLP-related best practices

After considering the breadth of NLP, one key takeaway is to make effective use of a blend of methods. Too simple an approach can’t eliminate noise sufficiently or help users get to answers that are available. Too complicated an approach can filter out information that companies really need to have.

“Our models are built on seeds from analysts with years of experience in each industry. We can put in the word ‘Escort’ or ‘Suburban,’ and then behind that put a car brand such as ‘Ford’ or ‘Chevy.’ The models combined could be strings of 250 filters of various types.”
—Vince Schiavone, ListenLogic

Some tools classify many different relevant contexts. ListenLogic, for example, combines lexical, semantic, and statistical analysis, as well as models the company has developed to establish specific industry context. “Our models are built on seeds from analysts with years of experience in each industry. We can put in the word ‘Escort’ or ‘Suburban,’ and then behind that put a car brand such as ‘Ford’ or ‘Chevy,’” says Vince Schiavone, co-founder and executive chairman of ListenLogic. “The models combined could be strings of 250 filters of various types.” The models fall into five categories:

  • Direct concept filtering—Filtering based on the language of social media
  • Ontological—Models describing specific clients and their product lines
  • Action—Activity associated with buyers of those products
  • Persona—Classes of social media users who are posting
  • Topic—Discovery algorithms for new topics and topic focusing
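The layered-filter idea can be sketched as function composition, with each small filter passing or rejecting a post (the filters, field names, and terms here are hypothetical, not ListenLogic’s actual models):

```python
def make_term_filter(field, terms):
    """Build a filter that passes a post whose `field` mentions any term."""
    terms = {t.lower() for t in terms}
    return lambda post: bool(set(post[field].lower().split()) & terms)

def chain(filters):
    """Compose filters: a post must pass every filter in the chain."""
    return lambda post: all(f(post) for f in filters)

pipeline = chain([
    make_term_filter("text", ["escort", "suburban"]),  # direct concept
    make_term_filter("text", ["ford", "chevy"]),       # ontological (brand)
    make_term_filter("persona", ["owner", "buyer"]),   # persona class
])

hit = pipeline({"text": "test drove a chevy suburban today",
                "persona": "buyer"})
```

In a production pipeline the chain would run hundreds of such filters, mixing lexical, statistical, and model-based tests rather than simple term lookups.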

Other tools, including those from Nexalogy Environics, take a bottom-up approach: they use a data set as it comes and, with the help of several proprietary, generally applicable algorithms, categorize it on the fly. Equally important, Nexalogy’s analysts provide interpretations of the data that might not be evident to customers using the same tool. Both kinds of tools have strengths and weaknesses. Table 1 summarizes some of the key best practices when collecting SMI.

Table 1
A few NLP best practices



Conclusion: A machine-assisted and iterative process, rather than just processing alone

Good analysis requires consideration of a number of different clues and quite a bit of back-and-forth. It’s not a linear process. Some of that process can be automated, and certainly it’s in a company’s interest to push the level of automation. But it’s also essential not to put too much faith in a tool or assume that some kind of automated service will lead to insights that are truly game changing. It’s much more likely that the tool provides a way into some far more extensive investigation, which could lead to some helpful insights, which then must be acted upon effectively.

One of the most promising aspects of NLP adoption is the acknowledgment that structuring the data is necessary to help machines interpret it. Developers have gone to great lengths to see how much knowledge they can extract with the help of statistical analysis methods, and it still has legs. Search engine companies, for example, have taken pure statistical analysis to new levels, making it possible to pair a commonly used phrase in one language with a phrase in another based on some observation of how frequently those phrases are used. So statistically based processing is clearly useful. But it’s equally clear from seeing so many opaque social media analyses that it’s insufficient.

An in-memory appliance to explore graph data

YarcData’s uRiKA analytics appliance,1 announced at O’Reilly’s Strata data science conference in March 2012, is designed to analyze the relationships between nodes in large graph data sets. To accomplish this feat, the system can take advantage of as much as 512TB of DRAM and 8,192 processors with over a million active threads.

In-memory appliances like these allow very large data sets to be stored and analyzed in active or main memory, avoiding memory swapping to disk that introduces lots of latency. It’s possible to load full business intelligence (BI) suites, for example, into RAM to speed up the response time as much as 100 times. (See “What in-memory technology does” on page 33 for more information on in-memory appliances.) With compression, it’s apparent that analysts can query true big data (data sets of greater than 1PB) directly in main memory with appliances of this size.

Besides the sheer size of the system, uRiKA differs from other appliances because it’s designed to analyze graph data (edges and nodes) that take the form of subject-verb-object triples. This kind of graph data can describe relationships between people, places, and things scalably. Flexible and richly described data relationships constitute an additional data dimension users can mine, so it’s now possible, for example, to query for patterns evident in the graphs that aren’t evident otherwise, whether unknown or purposely hidden.2
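A toy version of such a triple store and pattern query (the data is invented; an appliance like uRiKA operates on billions of triples with hardware parallelism):

```python
# Subject-verb-object triples describing a small social graph, plus a
# pattern matcher where None acts as a wildcard.
triples = [
    ("ann", "follows", "bob"),
    ("bob", "follows", "cat"),
    ("ann", "mentions", "brand"),
    ("cat", "mentions", "brand"),
]

def match(pattern, data):
    """Return triples matching (s, p, o); None matches anything."""
    return [t for t in data
            if all(p is None or p == v for p, v in zip(pattern, t))]

# Who mentions the brand?
mentioners = [s for s, _, _ in match((None, "mentions", "brand"), triples)]
```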

But mining graph data, as YarcData (a unit of Cray) explains, demands a system that can process graphs without relying on caching, because mining graphs requires exploring many alternative paths individually with the help of millions of threads—a very memory- and processor-intensive task. Putting the full graph in a single random access memory space makes it possible to query it and retrieve results in a timely fashion.

The first customers for uRiKA are government agencies and medical research institutes like the Mayo Clinic, but it’s evident that social media analytics developers and users would also benefit from this kind of appliance. Mining the social graph and the larger interest graph (the relationships between people, places, and things) is just beginning.3 Claude Théoret of Nexalogy Environics has pointed out that crunching the relationships between nodes at web scale hasn’t previously been possible. Analyzing the nodes themselves only goes so far.

1 The YarcData uRiKA Graph Appliance: Big Data Relationship Analytics, Cray white paper, March 2012, accessed April 3, 2012.

2 Michael Feldman, “Cray Parlays Supercomputing Technology Into Big Data Appliance,” Datanami, March 2, 2012, accessed April 3, 2012.

3 See “The collaboration paradox,” Technology Forecast 2011, Issue 3,, for more information on the interest graph.

Structuring textual data, as with numerical data, is important. Enterprises cannot get to the web of data if the data is not in an analysis-friendly form—a database of sorts. But even when something materializes resembling a better described and structured web, not everything in the text of a social media conversation will be clear. The hope is to glean useful clues and starting points from which individuals can begin their own explorations.

Perhaps one of the more telling trends in social media is the rise of online word-of-mouth marketing and other similar approaches that borrow from anthropology. So-called social ethnographers are monitoring how online business users behave, and these ethnographers are using NLP-based tools to land them in a neighborhood of interest and help them zoom in once there. The challenge is how to create a new social science of online media, one in which the tools are integrated with the science.

1 Shar VanBoskirk, US Interactive Marketing Forecast, 2011 To 2016, Forrester Research report, August 24, 2011, accessed February 12, 2012.

2 PwC has joint business relationships with SocialRep, ListenLogic, and some of the other vendors mentioned in this publication.

3 Agent-based modeling is a means of understanding the behavior of a system by simulating the behavior of individual actors, or agents, within that system. For more on agent-based modeling, see the article “Embracing unpredictability” and the interview with Mark Paich, “Using simulation tools for strategic decision making,” in Technology Forecast 2010, Issue 1, accessed February 14, 2012.

4 For more information on best practices for the use of predictive analytics, see Putting predictive analytics to work, PwC white paper, January 2012, accessed February 14, 2012.
