Corpus-based methods

Projects such as Open Mind Common Sense (OMCS), ISI Learner and MindPixel are data-centric efforts to accumulate raw commonsense data as a first step toward engineering commonsense reasoning and, more generally, intelligent systems. These projects target applications in which breadth of knowledge matters more than depth. In some circumstances, such broad but superficial knowledge may even provide a useful illusion of deep knowledge. Consider, for example, the almost uncanny ability of modern search engines to return websites that match a query.

In these approaches, large corpora are gathered from ‘free’ resources such as volunteer contributions and the World Wide Web, or by structuring play in online games that are specially crafted to make knowledge elicitation entertaining. Unfortunately, the cost savings of such ‘free’ resources come at the expense of the quality, precision and depth of the extracted knowledge. Indeed, corpus-based methods present a bootstrap problem: it is difficult to understand the commonsense knowledge implied in a database without already possessing the commonsense knowledge needed to disambiguate and interpret the data in the first place. Consequently, corpus-based methods emphasize shallow or surface knowledge: they treat sentences as mere ‘bags of words’, or emphasize factual surface knowledge (of a parser) rather than the deep commonsense implications of the data.
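
To make the ‘bag of words’ simplification concrete, the following minimal Python sketch reduces a sentence to unordered word counts. The function name bag_of_words and the example sentences are illustrative assumptions, not taken from any of the projects above. It shows how two sentences with opposite meanings can become indistinguishable once word order is discarded.

```python
from collections import Counter


def bag_of_words(sentence: str) -> Counter:
    """Reduce a sentence to unordered word counts, discarding word order.

    Hypothetical helper for illustration only: it lowercases the text and
    splits on whitespace, with no stemming or punctuation handling.
    """
    return Counter(sentence.lower().split())


# Two sentences that describe opposite events...
first = bag_of_words("the dog bit the man")
second = bag_of_words("the man bit the dog")

# ...yield exactly the same bag, so who bit whom is lost.
same_bag = first == second
```

Any method that works only from such counts cannot recover who did what to whom, which is the kind of deeper implication the paragraph above refers to.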

Consider EmpathyBuddy, an e-mail client that identifies the emotional content of a message. While a traditional approach to designing EmpathyBuddy might begin with rich theories of human emotions and conversational style, it turns out that EmpathyBuddy can convincingly identify the emotional subtext of a message through simple statistical associations between the words in the message and emotive words. EmpathyBuddy computes distances in the ConceptNet semantic network (built from Open Mind Common Sense data) and uses these distances as a shallow but effective estimate of the emotional character of each word. Its operation is analogous to that of modern statistical spam filters. A spam filter does not need to fully understand the true meaning of a message if keywords or phrases such as ‘Viagra’ or ‘unclaimed prize’ are highly indicative of spam.
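
The idea of using network distance as a shallow estimate of emotional character can be sketched in a few lines of Python. This is not EmpathyBuddy's actual algorithm and does not use ConceptNet's data or API: the tiny network TOY_EDGES, the emotion list EMOTIONS, the scoring rule 1 / (1 + distance) and all function names are hypothetical, chosen only to illustrate the principle.

```python
from collections import deque

# Hypothetical toy semantic network (undirected edges between concepts).
# A real system would draw on a far larger network such as ConceptNet.
TOY_EDGES = [
    ("funeral", "death"), ("death", "sad"),
    ("party", "friends"), ("friends", "happy"),
    ("party", "cake"), ("cake", "happy"),
    ("exam", "stress"), ("stress", "fear"),
]
EMOTIONS = ["happy", "sad", "fear"]  # assumed set of emotive anchor words


def build_graph(edges):
    """Build an adjacency map from a list of undirected edges."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def distances_from(graph, start):
    """Breadth-first search: hop count from start to each reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def emotion_scores(graph, message):
    """Sum a closeness score per emotion over all known words in the message."""
    scores = {emotion: 0.0 for emotion in EMOTIONS}
    for word in message.lower().split():
        if word not in graph:
            continue  # unknown words contribute nothing
        dist = distances_from(graph, word)
        for emotion in EMOTIONS:
            if emotion in dist:
                # Closer emotive words contribute more (assumed weighting).
                scores[emotion] += 1.0 / (1 + dist[emotion])
    return scores


def dominant_emotion(graph, message):
    """Return the highest-scoring emotion, or None if no word is recognised."""
    scores = emotion_scores(graph, message)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

With this toy network, dominant_emotion(build_graph(TOY_EDGES), "we had cake at the party") yields "happy", although the sketch has no model of parties, cake or happiness beyond the links between the words.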