Bayes’ Rule and App Search
September 27th, 2011 | Technology
Paul Christiano is a fourth-year mathematics student at MIT with an emphasis on theoretical computer science, and a past summer engineering intern at Quixey. Paul was awarded “Best Research Paper” at the 2011 ACM Symposium on Theory of Computing (STOC).
In order to provide relevant search results, Quixey needs to integrate many different sources of evidence — not only each app’s name and description, but content from all over the web that refers to specific apps. Aggregating huge quantities of information into a single judgment is a notoriously difficult problem, and modern machine learning offers many approaches.
When you need to incorporate just a few pieces of information, there’s a mathematical version of “brute force” that works quite well, based on Bayes’ Rule:

P(A | B) = P(B | A) · P(A) / P(B)

Here A is a hypothesis (say, “this app is a game”) and B is a piece of evidence (say, “a blog called this app challenging”).
If you’re not familiar with mathematical formalism and probability theory, this might look a little opaque. Let’s go through a simple example to show intuitively how this sort of mathematical reasoning might work.
Suppose I have no prior knowledge about an app, other than that a blog described it as “challenging.” What is the probability that the app is a reasonable result for the query “games”?
Say I know that about 10% of apps are reasonable results for the query “games.” I also know that a blog describes an app as “challenging” with probability 50% if the app is a game. Further, I know that if an app isn’t a game, then a blog would only describe it as “challenging” with about 5% probability.
So let’s do the math, treating “reasonable result for the query ‘games’” as synonymous with “game.” I know that 10% of apps are games, and 50% of those are described as “challenging”. This means that 5% of all apps are “games which are described as challenging.”
I know that 90% of apps are not games, and 5% of those are described as challenging. Therefore, 4.5% of all apps are “non-game apps which are described as challenging.”
In total, 9.5% of all apps are described as “challenging”, while 5% of all apps are “games which are described as challenging,” so 5% / 9.5% ≈ 53% of all apps described as challenging are games. Knowing nothing else, when I see an app described as “challenging,” I conclude it has a 53% chance of being a game.
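The arithmetic above fits in a few lines of Python. The numbers are the illustrative ones from this example, not real Quixey statistics:

```python
def posterior(prior, p_evidence_given_h, p_evidence_given_not_h):
    """Bayes' Rule: P(H | E) = P(E | H) * P(H) / P(E)."""
    p_evidence = (p_evidence_given_h * prior
                  + p_evidence_given_not_h * (1 - prior))
    return p_evidence_given_h * prior / p_evidence

# 10% of apps are games; games are called "challenging" 50% of the
# time, non-games only 5% of the time.
p_game = posterior(0.10, 0.50, 0.05)
print(round(p_game, 3))  # 0.526 -- roughly the 53% from the text
```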
Processing a Mountain of Evidence
Fortunately, Quixey doesn’t have to guess whether an app is a good result for a query after only looking at one piece of evidence. We look at a huge volume of evidence before making our decision. But having that much data introduces new problems.
Perhaps the most pressing question is: How do we get all the knowledge necessary to do our analysis? It may be the case that 50% of all games are described as challenging by blogs, but how do we know that unless we already know which apps are games?
The solution is for human programmers to write a description, in general terms, of what sort of evidence is relevant and how. Then an inference algorithm is responsible for filling in all the details. This is done by applying Bayes’ Rule to the structure of the human-defined model, instead of simply to the apps themselves.
Here’s an example of this approach. Suppose we expect that one of two different models describes our data. One model might specify that 50% of all games are described as challenging, while a different model might specify that only 5% of games are described as challenging. Initially, we have some beliefs about which model is accurate — say, we think the two models are equally likely. As we observe data, we can improve our beliefs. If one model consistently describes the data better than the other, then its likelihood increases. This is similar to our first example, in which our “this app is a game” hypothesis became more likely once we saw that the app was described as “challenging.”
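This two-model comparison can be sketched directly. The observation sequence below is invented for illustration (True means a blog called some game “challenging”):

```python
# Two candidate models for P(a game is described as "challenging").
models = {"A": 0.50, "B": 0.05}
belief = {"A": 0.5, "B": 0.5}  # the two models start out equally likely

observations = [True, True, False, True, False, True]  # hypothetical data

for challenging in observations:
    # Likelihood of this observation under each model.
    like = {m: (p if challenging else 1 - p) for m, p in models.items()}
    # Bayes' Rule: reweight each model by how well it explained the data.
    total = sum(belief[m] * like[m] for m in models)
    belief = {m: belief[m] * like[m] / total for m in models}

# Model A explains frequent "challenging" mentions far better,
# so nearly all of our belief ends up on it.
print(belief)
```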
Specifying two models may not be any easier than specifying one. However, specifying a large class of models that contains a good one may be quite easy. For example, consider the infinite list of models of the form: “X% of all games are described as challenging.” This class does contain a good model, even though it took almost no thought to write down. If an algorithm is able to automatically learn which of these models work well, then it may be able to use evidence effectively without humans injecting any knowledge at all, except our very general understanding that the logical properties of an app affect how blogs describe it.
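The same update runs over the whole family of models at once. A sketch, discretizing X to a grid of candidate values (the observations are again invented):

```python
# Candidate models: "X% of all games are described as challenging,"
# for X on a grid from 0 to 100, with a uniform prior over the grid.
grid = [x / 100 for x in range(101)]
belief = [1 / len(grid)] * len(grid)

observations = [True, False, True, True, False]  # hypothetical blog data

for challenging in observations:
    like = [p if challenging else 1 - p for p in grid]
    total = sum(b * l for b, l in zip(belief, like))
    belief = [b * l / total for b, l in zip(belief, like)]

# Posterior mean of X: our best estimate of the unknown rate.
mean_x = sum(p * b for p, b in zip(grid, belief))
```

After three “challenging” mentions out of five, the posterior mean sits near 4/7 (the classic Laplace-smoothed estimate), even though no human specified which model was right.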
So far, we’ve only talked about models that relate a single piece of evidence to a judgment. Ultimately, we need to combine many different pieces of evidence to make our judgment. So how can we easily write down a class of models that relates many pieces of evidence to each other, and then search that class for a good model?
A full answer would involve much more math than we’ve used so far, but the intuition can be communicated effectively. The idea is to describe the general framework by which different pieces of evidence relate to each other — and equally important, which pieces of evidence don’t relate to each other — and then to consider models which respect these relationships.
Applying these ideas to apps is necessarily quite complicated (and a trade secret!) so here’s a popular illustration from a different domain.
Suppose I’m interested in knowing about the relationship between the grass being wet, the season, recent rainfall and the sprinklers. We can visually represent the properties we care about as four circles, or nodes:
We imagine each circle as holding the value of the corresponding variable. For example, the circle “Rainfall” contains a value indicating how much it has rained recently. Similarly, the circle “Sprinkler” contains a value indicating how recently the sprinklers have been used, “Season” contains a value indicating the current season, and “Wet Lawn” contains a value indicating how wet the lawn is. These different values are all uncertain, but related to one another: the season affects the probability of sprinkler use and rainfall, rainfall makes the lawn wet and makes sprinkler use less likely, and sprinkler use also makes the lawn wet. We can capture these relationships in a diagram, where two nodes are connected if they are related.
The most important part of this picture isn’t the relationships that have been drawn in, but the relationships that haven’t: there’s no line connecting “Season” to “Wet Lawn” directly. The season does affect how wet the lawn is, but only via the intermediates “Sprinkler” and “Rainfall”. This is expressed by a mathematical property known as conditional independence: if we already know about recent rainfall and sprinkler use, then knowing how wet the lawn is tells us nothing about the season. This model, of course, is a simplification of reality. In practice, the picture is much more complicated, and other factors play a role in lawn dampness, such as heat and humidity.
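We can check this conditional independence numerically. Below is a toy version of the sprinkler network with made-up probability tables (every number is invented for illustration); because “Wet Lawn” depends only on rainfall and sprinkler use, learning about the lawn changes nothing once those two are known:

```python
from itertools import product

# Made-up conditional probability tables for the sprinkler network.
p_season = {"summer": 0.5, "winter": 0.5}
p_rain = {"summer": 0.2, "winter": 0.6}             # P(rain | season)
p_sprinkler = {                                      # P(sprinkler | season, rain)
    ("summer", True): 0.1, ("summer", False): 0.7,
    ("winter", True): 0.01, ("winter", False): 0.2,
}
p_wet = {                                            # P(wet | rain, sprinkler)
    (True, True): 0.99, (True, False): 0.9,
    (False, True): 0.8, (False, False): 0.05,
}

def joint(season, rain, sprinkler, wet):
    """Probability of one full assignment, following the arrows in the diagram."""
    p = p_season[season]
    p *= p_rain[season] if rain else 1 - p_rain[season]
    p *= p_sprinkler[season, rain] if sprinkler else 1 - p_sprinkler[season, rain]
    p *= p_wet[rain, sprinkler] if wet else 1 - p_wet[rain, sprinkler]
    return p

def prob(query, given):
    """P(query | given), by summing the joint over all assignments."""
    num = den = 0.0
    for s, r, k, w in product(p_season, [True, False], [True, False], [True, False]):
        a = {"season": s, "rain": r, "sprinkler": k, "wet": w}
        p = joint(s, r, k, w)
        if all(a[v] == x for v, x in given.items()):
            den += p
            if all(a[v] == x for v, x in query.items()):
                num += p
    return num / den

# Once rain and sprinkler use are known, the lawn tells us nothing new:
with_wet = prob({"season": "summer"}, {"rain": True, "sprinkler": False, "wet": True})
without = prob({"season": "summer"}, {"rain": True, "sprinkler": False})
```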
What this picture does not tell us is exactly how the season relates to rainfall or sprinkler use, or how these relate to lawn dampness. Fortunately, we may not have to specify this information exactly to get useful results. We can consider a very large class of possible models that obey the relationships in the picture. Then gradually we learn which of these models most accurately describe the relationship between the quantities in question.
For app search, we construct a much, much larger picture, which communicates considerably richer information. Then we construct a model that’s consistent with the relationships (and non-relationships) described in our intuitive picture.
In our first simple example, it was easy to compute that 53% of apps described as “challenging” are games. But our actual models are much more complicated, and furthermore, our algorithms are tasked with learning which model to use — so inferring properties of apps can become difficult. In fact, most computer scientists are extremely confident that such inference cannot be performed exactly in a reasonable amount of time, regardless of the sophistication of the algorithms involved. To circumvent this difficulty, Quixey relies on approximate inference.
Many approaches to approximate inference rely on a very important and general idea called local search. Instead of trying to find the best explanation for the data all at once, we start with an initial “best guess,” a preliminary model that explains the data passably. We then proceed by trying to improve this best guess incrementally. Instead of trying to find the best model from the space of all possible models, we try to find the best model that is similar to our current best guess. Once we have found an improvement on our current best guess, we repeat the process. Under reasonable conditions, our best guess will continue improving, becoming a better and better approximation to the true model.
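A bare-bones sketch of local search, here hill-climbing on the “X% of all games are described as challenging” parameter from the earlier example (the observation data is invented):

```python
import math

observations = [True, True, False, True, False, True, True]  # hypothetical

def log_likelihood(p):
    """How well the model "games are challenging with probability p" explains the data."""
    return sum(math.log(p if c else 1 - p) for c in observations)

# Start from an initial "best guess" and repeatedly move to the best
# nearby model, stopping once no neighbor is an improvement.
guess, step = 0.5, 0.01
while True:
    neighbors = [q for q in (guess - step, guess + step) if 0 < q < 1]
    best = max(neighbors, key=log_likelihood)
    if log_likelihood(best) <= log_likelihood(guess):
        break
    guess = best
```

On this data the search climbs to roughly 5/7, the fraction of “challenging” mentions — each step only looks at nearby models, but the guesses keep improving until no local move helps.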
Applications of this approach include expectation maximization and Gibbs sampling — the interested reader can find a vast literature on both of these algorithms, and on approximate inference in general.
Quixey’s search combines many techniques, and we’re constantly experimenting with new approaches. Bayesian inference algorithms, in their various forms, are an essential piece of our search technology.