How to Build a Search Engine

August 9th, 2011 | Posted by Quixey in Technology - (0 Comments)

This post was written by one of our search advisors, Eric Glover.  Eric Glover has more than twelve years of commercial web search experience and a Ph.D. in Artificial Intelligence from the University of Michigan. He has held key technical roles at multiple search engine companies and is a recognized expert in web scale machine learning and categorization.

In this blog post, I’d like to talk about the main components of search. Search has five main parts:Step 1: Receive the user’s query. Step 2: Parse the query to construct a ‘database query.’ Step 3: Using the database query, assemble the ‘consideration set’ of results to consider for scoring. Step 4: Score the consideration set. Step 5: Present results to the user based on the scores from Step 4.

These steps are common to all major search engines. Notably, Google, eBay, Bing, and Amazon apply unique algorithms and search different data for different purposes, and end up with different results.

Quixey is particularly unique among search engines. We invented a new type of search – functional search – specifically for apps. In this post, I’ll explain what standard search engines do and what we do differently.

Step 1 – Receive the user’s query.

Receiving the user’s query isn’t as simple as receiving what the user has typed. Sometimes external factors influence the query’s meaning – factors like the user’s country, browser, language, or device.

Screen shot 2011-08-09 at 12.01.52 PM.png

This is particularly crucial for an app search engine and functional search. For example, whether users search from an iPhone or an Android phone should influence how the search engine understands their queries. Moreover, someone in Russia may want different apps than someone in the United States.

Step 2 – Parse the query to construct a ‘database query’.

The old-school, simple method of parsing a query is to split it into individual words. Sometimes this involves removing ‘stop words’ (the, is, at, which) and the ‘stem’ of the query – for example, “help me to do my taxes” could be simplified into ['do','help','me','my',taxes','to'], or even something as simple as ['tax'].

As you might imagine, this is a dangerous strategy. It ignores the fact that user intent is not always a direct function of the individual words entered by the searcher. When searching for apps, “convert mp3 to flac” is a very different query than “mp3 AND flac.” Likewise, “tax” and “do my taxes” mean very different things. Misspellings, extra words, missing words, and unrelated words will severely throw off a search engine that doesn’t make this distinction.

It is a huge task to properly parse queries for your database. Most other search engines ignore the problem or solve it poorly. A query like ‘restaurants’, when interpreted incorrectly, will return so many results that the search engine won’t be able to score them all. Linguistics, intent-analysis and data considerations make this step extremely difficult.

Step 3 – Using the database query, assemble the ‘consideration set’ of results to consider for scoring.

Once the query has been transformed for optimal searching, you can use your new ‘database query’ to query the database. It is worth noting that the database query might have virtually no text in common with the keywords the user entered.

You can then select a consideration set based on the database query. The consideration set defines which results are worth considering. For example, queries like “paleontology” might generate a reasonably small consideration set, but queries like ‘racing games’ can generate consideration sets so large that scoring is impractical. That means it’s really important to parse queries properly in Step 2, as well as use algorithms that form manageable consideration sets.

Not only must the consideration set be small enough, but the algorithms used to generate it must be computationally efficient.

Step 4 – Score the consideration set.

A CS grad student writing a paper about ranking can win a Best Paper award even if his algorithm takes an hour to run a single query. Unfortunately, a commercial search engine can’t tell the user to wait hours for results. This creates a trade-off, often not discussed in academic circles, between the result response time and quality of results. The quality of results also correlates with the number of results considered.

Scoring is often the most computationally costly part of a search. Scoring results works as follows:

  1. Map each result to a set of (usually numerical) features, like the keywords’ frequency in the description.
  2. Apply a function to these features to calculate a meaningful score.

Competitors in the search space are differentiated by the data they access and the algorithms they use to score results. Quixey stands apart because of the wealth of data we use to score apps, and the precision of our scoring algorithm.

This combination is so successful because we invented functional search specifically for apps. Apps have a different set of important features than static web pages, so the standard text features sought by traditional search engines aren’t effective for finding apps. What makes one web page a better result than another is meaningless for apps, which often lack official home pages.

For example, the number of times an app is downloaded is more important than how many people link to its homepage. Likewise, the number of reviews might not be useful, because this metric strongly favors older apps.

Step 5 – Present results to the user based on the scores from Step 4.

It might seem easy to present results to the users by sorting by score. Unfortunately, this problem is more complex. What information should you display, and how do you organize the results? Should a result with a score of 49.9999 really display below a result with a score of 50.0000?

A simple ‘title text-match’ app search works fine when users knows the names of the apps they are looking for. However, our data shows that 86% of queries describe functions, not app names. People need to be able to find apps even if they don’t know what those apps are called.

Any type of ranking system is quite difficult and requires a lot of expertise. Our search requires collecting the right data, analyzing queries, effectively using and building a database, scoring based on the right features, and presenting the results in an intelligent way that helps users determine what they want.

In my next post, I will go into more detail about how Quixey does search differently.

Without Quixey, app search often requires knowing the app’s name or  keywords in its app store description. But you shouldn’t need to know an app’s name to find it – it’s hard enough to memorize app names like Uber, Shazam and Kayak.

Quixey lets you find apps just by answering the question:

Apps are solutions to everyday problems like calling cabs, identifying music and planning trips. People know what they want to do, but they don’t know the app that can help them do it. Unfortunately, current app stores aren’t good at searching for apps by function.

For example, “games for 3 year olds” returns nothing in the iPhone app store:

This is most likely because the query “games for 3 year olds” is a long query. The App Store’s search struggles with queries written in plain English and queries longer than three words. This makes it difficult to find what you need.

In contrast, Quixey is designed to process queries exactly how you naturally write them.

For example, see our results for the same query:

Our core technology (functional search) allows us to find good results, even when the official stores can’t. How do we do it? We gather outside data. Since our search engine is constantly scanning the web for descriptions, reviews, and blogs about apps, it learns exactly what each app can do. This enables Quixey to find apps that do what you want, no matter how you describe them.

Let’s take a look at another example on the App Store.

The selection found with this query is seemingly random. The results include 2 video editors, an ecards app and a few other random apps. On the other hand, results for “edit photos” are much better. Despite the fact that these queries are so similar, the average app search engine gets confused with one and not the other.

At Quixey, understanding natural language is one of our specialties and we don’t run into the same problem.

The past two examples represent a universal problem. I can’t tell you how many times I’ve heard “I want to find X and all I get is random apps,” or “I always think there must be an app for that, but my app store doesn’t have anything.” You might know the feeling.

Of course, I don’t mean to pick on iPhone. The same problem exists across all platforms.

So, the takeaway: Quixey is different because you don’t need to know an app’s name or its exact description to find what you want. Just answer the question: what do you want to do?