CTO Liron Shapira gives a deep dive into Quixey’s Functional Search™ technology at a Bay Area Search meetup.

YouTube Preview Image

Last August, Facebook doubled their iOS app’s performance by using native code instead of HTML5. It was a big deal, at least according to Mark Zuckerberg. In September, Zuckerberg said betting on HTML5 was Facebook’s biggest mistake with mobile.

HTML5 web apps have a history of inferior performance compared to native. Slow JavaScript execution has been a problem since the dawn of web apps. And on any smartphone today, you know that if you see a jerky UI effect, it’s probably rendering with HTML5 and JavaScript.

The good news is that HTML5 performance is relentlessly improving. New browser versions pack performance features like faster JavaScript interpreters and hardware acceleration. And today, web apps rival performance of native and desktop applications, at least according to Google’s HTML5 Rocks advocacy site.

Maybe the story ends with HTML5 performance catching up to native apps and staying neck and neck. But you know what else might happen? HTML5 might pull ahead. In the long term, we think HTML5 apps will be faster than native apps.

Resource Caching

The web has an amazing infrastructure for reusing parts of an app, as long as the parts you want to reuse are “resources”, meaning they have their own URLs. It’s perfect for many applications, such as cat pics:

The image above was served from http://www.lolcats.com/images/u/07/22/lolcatsdotcomkzi5b2ktz93zav8f.jpg. Since the HTML on this page told your browser to download and display that resource, the image is now in your browser’s cache. Any time you go to a web page that contains an HTML tag like this:

<img src="http://www.lolcats.com/images/u/07/22/lolcatsdotcomkzi5b2ktz93zav8f.jpg">

The lolcat image will display instantly with zero download lag. That’s the beauty of resource caching.

Unfortunately, resource caching doesn’t work too well for web apps today. Web apps typically send code to your browser in the form of a few JavaScript resources (a.k.a. files). JavaScript files aren’t as similar to cat pics as you might think – there’s usually no other app that wants to reuse them. The main exception is popular JavaScript libraries, which make great resources because they get reused a lot. In fact, your browser probably has most of the Google Hosted Libraries in its cache right now.

If you factor an app (native or web) into a hierarchy of UI components, you can see that the same components get written over and over again in varying forms. Every good developer knows reusing code is better than unwittingly rewriting it, but it’s easier said than done.

We think the web will evolve to make it easier to factor web apps into components. W3C’s web components initiative is a step in the right direction. And we’ve previously written about how web components should be treated as resources on the web. When the web finally gets a robust solution for component reusability, resource caching will give web app performance a boost over native apps.

Incremental Downloading

The world’s snappiest native app still makes you wait twenty seconds the first time you want to use it. That’s about how long it  takes to download and install it.

On the other hand, if you get a link to a web app you’ve never used before, it typically takes only about a second to start using it. That’s because a properly engineered web app only downloads and runs a little bit of code on your machine at a time, just what it needs to render its current screenful of content.

Incremental downloading is a big advantage for web apps in theory, but it’s not so simple in practice. Remember what it feels like to click around a typical web app today – chunks of UI blinking in and out as data gets refreshed from the server. The problem is that web developers are being forced to roll their own custom solutions for graceful incremental downloading, and it’s hard to get that right. So you rarely get the same seamless state transitions in an incrementally-downloading web app that you’d get from a fully-downloaded native app.

That said, we’re willing to bet that the web will eventually make seamless incremental downloading the norm. The implementation details will all be taken care of by web browsers, and abstracted away from developers. In particular, if our components-as-resources idea plays out on the web, then browsers will be able to automate incremental downloading using an approach similar to link prefetching.

Embedded Native Code

Today, native apps can run fast native code that browsers can’t. But next-generation web browsers are chipping away at that advantage. It’s a safe bet that browsers will eventually let web apps run low-level native code alongside high-level JavaScript.

Google is leading the way with its Native Client for Chrome. In fact, we already mentioned Native Client when we included “reverse hybrid apps” in our Native-to-Web App Spectrum.

Optimizing Compilers

When you develop apps, sometimes you have to make tradeoffs. When Facebook rewrote their iPhone app with more native code, they traded off development time and code reusability for performance.

But sometimes you don’t have to make tradeoffs. Sometimes you can have it all. If you switch from writing assembly to writing C, you win better elegance, you win better performance, and you lose nothing. Because it turns out C compilers are better than humans at optimizing assembly code, so C code effectively runs faster than assembly.

As for the web, we think the rapidly-evolving infrastructure for downloading and executing HTML5 will do for web apps what optimizing compilers did for C.

The New Norm

Last month, HTML5 app platform maker Sencha shocked everyone by releasing Fastbook - an HTML5 clone of Facebook’s native iOS app which is noticeably faster despite using zero native code.

Today, Fastbook’s speed is an impressive achievement. But pretty soon, we bet optimal performance will be the norm for HTML5.

People talk as if web apps and native apps are fundamentally disjoint. But we see a four-point spectrum from native app to web app.

The Native-to-Web App Spectrum

1. Native App

A native app consists entirely of native code. Native apps have some advantages:

  • Native UI elements. Native apps use UI elements from their platform’s standard library. Users are accustomed to the look and feel of native UI elements. And they usually feel snappy.
  • Better performance. Native apps run code directly on a device, close to the metal, rather than in a browser. High-performance apps need to cut out the middleman, i.e. layers of interpreters, to run smoothly.
  • Device features. Native apps usually get better access to device features like file storage, GPS, sensors and vibration.
  • Ecosystem features. A platform’s ecosystem, e.g. iTunes ecosystem for iOS apps, often exposes special APIs to native apps. For example, native iOS apps can have one-click payment buttons for in-app purchases through the user’s iTunes account.
  • App stores. Most app stores focus on listing native apps, at least for now.

2. Hybrid App

A hybrid app is a native app with embedded iFrame-like HTML5 web content.

Developers love building hybrid apps because the same HTML5-based functionality can be easily shared across platforms and devices. If you want to write native apps for Android and Windows Phone, you don’t have to build two native apps from scratch. You can build 90% of the functionality in HTML5, put it on the web, and embed it into both apps. Now you only have to do 10% of the platform-specific engineering work.

Platforms treat hybrid apps the same as fully native apps. Hybrid apps get full access to native UI elements, devices features, ecosystem features and app stores. They can run the same high-performance code that full native apps use. The difference is that if parts of the app don’t need great performance, hybrid apps can strategically implement them with HTML5.

For example, Gmail for iOS is a hybrid app. It uses the iOS UIWebView element to slyly embed rectangular frames of HTML5-powered UI into their native app.

3. Hybrid Web App

A hybrid web app is an HTML5 web app with embedded sections running native code. You could call it a “reverse hybrid app” because it’s native-in-web instead of web-in-native.

Hybrid web apps aren’t native apps, and they lack many of their advantages – especially device, ecosystem and app store advantages. But hybrid web apps also have unique advantages over native apps and hybrid apps:

  • Sections of the web app can get the optimized performance of a native app.
  • The download can be faster if different parts of the hybrid web app embed different sections of native code.
  • There’s no installation step.
  • They can take advantage of web app standards like URLs and <meta> tags. The various standards around metadata, searchability and interoperation for web apps are richer and more universal than for native apps.

Hybrid apps seem like a natural point on the spectrum from native apps to web apps. But actually, they don’t really exist yet. While most app platforms support embedding web content, most web browsers don’t support embedding native code, for both technical and security reasons.

But given the unique advantages of hybrid web apps, that’s likely to change. Google already has a project called Native Client to enable web apps viewed in the Chrome browser to run native code. It’s powerful stuff, as you can see from their demo of Native Client running the Quake game engine in Chrome.

4. Web App

A (non-hybrid) web app is written in pure HTML5 and runs in a web browser.

HTML5 web apps have one supreme advantage: They run on almost every platform. This is the fundamental force that’s attracting app development toward HTML5 and making it spread out across the hybrid region of the spectrum.

Web apps traditionally don’t have direct access to the device they run on. But many device features, like GPS and local storage, have already been made available to web apps through HTML5 APIs. If the trend continues, HTML5 will have rich support for native device features.

Conclusion

We saw a lot of pure native apps a few years ago, when new mobile device platforms were proliferating. It seems like high-profile hybrid apps are more common now. And hybrid web apps seem like a promising new technology that’s almost ready for prime time. So now is a good time to ask, are native apps moving to the web?

If you want to talk about the evolution of the web, you can just talk about the evolution of URLs.

A URL is supposed to tell your browser to fetch a “resource” on the web – that’s what the “R” in URL stands for. In the old days, the resource you’d fetch would be a static HTML page, and websites were made out of lots of HTML pages. Then in the early days of Web 2.0, it was common to see single-page web apps that ran all their functionality on one page with one URL. But it didn’t take long for developers to realize URLs are essential for linking and SEO, so we got JavaScript APIs for exposing URLs.

Today, everyone agrees static pages aren’t the only type of resource you can load in a browser. Google Maps is a single page app, but it treats map positions and zoom levels as resources. Dragging the map around makes it give you a different link URL.

In general, a web app’s state is a type of resource on the dynamic web. HTML5 has given us the following analogy:

static web resource : dynamic web resource :: page : state

In addition to pages and states, which are the primary resources you browse to directly, the web has “supporting resources” that also have URLs, like images and CSS style sheets. Web apps almost always include references to supporting resources:

<link rel="stylesheet" type="text/css" href="theme.css" /> <img src="http://lolcat.com/images/lolcats/1327.jpg" />

It’s natural to ask whether there’s an analogy like the one above that can be made for supporting resources:

 static web supporting resource : dynamic web supporting resources :: images and stylesheets : _____?_____ 

We want to propose completing the analogy with components.

By “components”, we mean the building blocks of any robust object-oriented UI architecture. For example:

We’ll argue why it makes sense to treat components as a new type of supporting resource that web apps can reference.

Component Namespacing

How do you give your components unique names?

  • Web Components have global names like x-fancybutton.
  • jQuery UI components have global names like slider.
  • QuickUI components have global names like MenuBar.
  • GWT actually has a solid global namespacing system taken from Java, with component names likecom.google.gwt.user.client.ui.StackPanel.

Good namespacing is essential for component reusability. Jan Miksovsky, the creator of QuickUI, has pointed out that more than half of web app UI code is reinventing results already achieved many times before. Jan distinguishes feature-specific, app-specific, company-specific, domain-specific and general-purpose components. If you’ve written a component-based web app, chances are you haven’t thought through good enough namespacing for sharing code at all these levels. URLs are a convenient solution for namespacing components at all levels of generality.

Component Distribution

How do you package and distribute your components?

  • The Web Components spec doesn’t say anything about component distribution. But all the examples have component definitions like <element extends="button" name="x-fancybutton"> inlined into the same HTML file where they’re referenced with code like<button is="x-fancybutton">. It would be nice to be able to write<button is="http://mysite.com/components/fancybutton">, to take advantage of the web’s standard resource-referencing convention.
  • jQuery UI lets you manually select a package of components to download into your web app’s source tree.
  • GWT makes you manually download the components you want, put them in your source tree, then write import statements where you want them in your code.
  • QuickUI lets you drop individual components into your source tree by checking out its Github repo. QuickUI has also packaged its entire controls catalog into a few hosted JS and CSS files on the web. The idea that libraries are resources is popular on the dynamic web today, e.g. Google Hosted Libraries lets you reference full copies of the world’s dozen or so most popular JS libraries. But the idea that individual components are resources is still new.
  • If you roll your own components, you probably just created a bunch of globally-namespaced component files in your source directory tree. Maybe you have a separate repository to distribute components within your company.

There’s no excuse for the pathetic state of component distribution on the web, when one of the web’s greatest strengths is the ease of embedding resources. To use a component in your web app, you should just be able to refer to its URL anywhere your code needs it. Even if you don’t want to wait for the web to offer native support for components as resources, you can easily just write a build script that reads your code to determine what component URLs it needs to download into your source directory before deploying your app.

This scheme lets you incorporate anyone’s components into your web app with zero friction. Conversely, it makes distributing your own components as easy as hosting any web resource.

Component Versioning

How do you version your components? Maybe you rely on your source control system’s commmit identifiers to refer to versions of your in-house components. Chances are, you don’t do any component versioning at all.

If we locate components at URLs, we can solve component versioning the same way we solved library versioning – by making version-specific component URLs. Just like Google hosts version 1.9.2 of jQuery UI athttp://ajax.googleapis.com/ajax/libs/jqueryui/1.9.2/jquery-ui.min.js, it would be great if Google would host version 1.9.2 of just the jQuery UI slider component at a URL likehttp://ajax.googleapis.com/ajax/libs/jqueryui/1.9.2/components/slider.js.

If web apps could directly reference components by URL, that would even let two parts of your web app seamlessly use two different versions of the jQuery UI slider component. This is the kind of robust resource referencing we’re used to on the static web, but inexplicably missing in the dynamic web.

Component Optimization

By treating components as resources referenced in client-side code, we can add browser support for fetching and prefetching them only when they’re needed. We can make better use of CDNs by factoring reusable components out of web apps. And by fixing those all-too-ubiquitous web apps that transfer their whole component library to the client before rendering their first welcome screen, we can drastically speed up the web.

The Web (Not the Browser) As A Platform

Ten years ago, we had the static web. Today, HTML5 has supposedly given us the dynamic web – but what we’ve really gotten are heavy chunks of client-side software embedded into the static web. It’s not enough for the browser to evolve into a platform for client-side software. We want to see the web evolve into a platform that supports any balance of client- and server-side software. Treating components as resources with URLs is the next step of that evolution.

If you’re a software architect and not a theoretical physicist, you don’t have to be as insightful as Einstein. But it helps!

The Principle of Locality

When Albert Einstein published his Special Theory of Relativity, he knew it was only a temporary hack. The theory was accurate and powerful, but it violated an important principle – the principle of locality.

The principle of locality says that you can never have action at a distance - you can only make things happen by starting chains of local effects. If I want to say “hi” to you, I can’t vibrate your eardrum from a distance. I can only vibrate the air near my mouth, and start a chain reaction that eventually reaches your ear.

Einstein was sure that all physical laws were local in nature. He wrote:

Of objects far apart in space, A and B: external influence on A has no direct influence on B; this is known as the Principle of Local Action, which is used consistently only in field theory. If this axiom were to be completely abolished, the idea of the existence of quasienclosed systems, and thereby the postulation of laws which can be checked empirically in the accepted sense, would become impossible. – Albert Einstein, ”Quantum Mechanics and Reality” (“Quanten-Mechanik und Wirklichkeit”, Dialectica 2:320-324, 1948)

We know sound vibrations obey the principle of locality, but what about gravity? Don’t gravitational forces act at a distance? That’s what Isaac Newton thought in 1686, when he invented the original concept of gravity. Newton thought the Earth, as it moves around the sun, instantly changes the direction of its gravitational pull on the moon. (Newton believed that if you want to send a message faster than light, you can theoretically just hop around in a Morse code pattern, and someone far away from you can receive your message instantaneously by measuring the disturbance in the gravitational force field at their own location.)

Einstein wasn’t happy with the non-locality of Newton’s theory of gravity. But when he published his Special Theory of Relativity in 1905, he still hadn’t managed to give a local account of gravity. Similarly, a lot of programmers talk a good game about the virtues of modularity, componentization and “local” variables. There’s even a Wikipedia article called “Action at a distance (computer programming)“. And yet, it seems like modern software architectures violate locality a lot.

Locality in Software Architecture

jQuery is Non-Local

Say you make a UI component for a restaurant review. You design the component so it can take just two input parameters – a data object of type Restaurant and a data object of type Review - and render something like this:

It’s nice to think of your components as self-sufficient modules that only interface with the rest of your app by passing a few parameter values. But if your app’s architecture doesn’t obey the principle of locality, you won’t be able to rely on this convenient property.

A typical non-local thing people do in web apps is use globally-scoped jQuery selectors. For example, if the app learns that John C. just changed his user icon, it might execute an instruction like $(".user_icon").attr("src", john.iconUrl) from somewhere outside the review component. This is certainly one way to dynamically update John’s icon images everywhere they appear on the page. But since this approach bypasses the component’s parameter interface, it’s action at a distance.

It’s as if the physical universe had let me talk to you by directly vibrating your eardrum. It’s a quick and dirty way to get my message to you, but since the sound doesn’t propagate in a local way, the earmuffs you just bought to block the sound can’t function. By violating locality in the case of user-icon-propagation, I’ve undermined the modularity of my whole component architecture.

So how do I write locality-preserving code for updating John C.’s user icon everywhere it appears on your web page? The trick is to propagate the iconUrl update through every component’s explicitly-defined interface, instead of voodoo-jQuerying your way in. One way to do that is to think of your review-displayer component as having an “input wire” for a user object, in this case john. Then when you want to update the displayed icon, you just have to alter the signal on the ReviewComponent.user wire. And if you use a model-view framework like Backbone, you can just make john an instance of your user model class, and connect it to your ReviewComponent.user wire.

AJAX is Non-Local

You’ve followed in Einstein’s footsteps, and now you have a ReviewComponent whose behavior is fully determined by its local surroundings (the objects connected to its “input wires”). But what if you want your ReviewComponent to have dynamic AJAX functionality?

Say you want to build another feature for your ReviewComponent: a live, up-to-the-minute reading of the user’s rank on the site (as determined by some secret formula combining number of reviews written and upvotes from other users). The naive approach for loading the data is to call something like $.ajax('/getUserRank') in your ReviewComponent implementation. Seems like a reasonable solution, right? Too bad it completely destroys the property of locality.

You started with a ReviewComponent which state and behavior was entirely determined by short wires poking into its local environment. Now your ReviewComponent has a giant wire that travels all the way out of the local environment and into the browser’s AJAX API. (It’s a shame people don’t use programming IDEs that visualize the component dependency graph, because it’s all too easy to write an innocent-seeming line of code that introduces a huge wire.)

Ok, so you violated locality and accessed your AJAX API directly from inside a UI component… what’s the big deal? It might seem like your component architecture can withstand these kinds of locality violations. But if you put on your Einstein hat and aim for true elegance – true archutectural modularity and component reusability – you might find that your code is more fragile than you think when it comes to these kinds of locality violations. Here’s the thing about AJAX calls inside components:

You broke the invariant that a component’s state is a function of the data on its input wires. Now you don’t have an elegant representation of your component’s state.

The fix for this is to have the component’s AJAX response handler put the results of the call on an input wire.

You hard-coded the global address of the AJAX call handler. Now it’s not really a component, because it’s tethered to your server.

The fix for this is to leave method-addressing semantics up to the control’s local environment. Instead of sending the getUserRank method call directly to your faraway, globally-addressed server, you can ask your local environment to getUserRank for you, whatever that may mean in context. Of course, the most obvious thing for the component’s environment to do is to route your getUserRank call to the nearest server, and you get the same effect as $.ajax. But if you place your component in different environments, you can imagine that some of them will route your method call to a different server, or even answer your method call using client-side logic. If you realize that AJAX calls are really method calls, and hard-coded AJAX violates locality, you’re one step closer to the holy grail of modularity – truly reusable UI components.

You baked a specific data-retrieval mechanism into your UI code. Now you can’t take advantage of frameworks that help with things like polling, caching and WebSockets.

Your real goal isn’t to “do an AJAX call”. Your real goal is to wire your view component’s UI to the output of a (parameterized) method named getUserRank. The fact that you’re using an AJAX call to jump the gap between your client-side and server-side app should be completely irrelevant to your component. First, the client-server gap isn’t located “adjacent” to your component, so it isn’t your component’s business (the previous section addresses this point further). And second, the fact that you’re using AJAX as your data-retrieval mechanism should also be irrelevant to your component.

A modular component is a unit of UI-rendering logic specified as a function of locally-present data. A modular component never has to talk about data-retrieval concepts like AJAX, polling, caching, websockets, JSON serialization, etc. The ReviewComponent in our example should say “I want to propagate the abstract output of getUserRank into this part of my UI like so”. It should be up to the external environment to take care that the abstract output of getUserRank is present on the component’s input wires. Locality is the key to factoring data retrieval logic out of UI rendering logic.

For this reason, it’s a shame to ever write $.ajax in a component’s code. Unfortunately, we don’t know of any programming frameworks with idiomatic ways to write fully local components. The Functional Reactive Programming model, as seen in frameworks like FlapjaxElm and Asana’s Luna, looks like a step in the right direction for true componentization. For now, you might have to bite the bullet and keep writing AJAX calls that violate locality.

Conclusion

The principle of locality points the way forward in software architecture as surely as in physics. Einstein could have just taken a break after formulating his Special Theory, but he wasn’t happy with a still-non-local account of gravitation. Einstein’s obsessive dedication to the principle of locality led to his most powerful and elegant theory ten years later – the General Theory of Relativity.

If you design your app’s architecture with locality in mind, you probably won’t discover one of the most profound insights in the entire 400-year history of science. But you’ll write amazingly elegant and modular code. Whether you’re unlocking the secrets of the universe or programming an app, it pays to think about locality.

Search engines are a great example of a situation where fast reads matter more than anything, which is a perfect scenario for data denormalization. At Quixey, we have all kinds of data flowing in, we run intensive computations on it, and we try to denormalize their output as much as possible.

Data streams in from outside sources

  • We monitor various data feeds.
  • We crawl app stores and the open web.
  • We accept input directly from app developers, such as edits to app descriptions.

We maintain our own normalized model of the app universe

  • We merge together different app data from the iTunes and Google Play stores to form a more coherent picture of the Angry Birds app in our own schema.
  • We calculate and store statistics about apps, developers, platforms and other data in our system.
  • We have MongoDB indexes on our normalized fields, and on some denormalized values (e.g. we denormalize some field values to all-lowercase versions for case-insensitive lookups).

We build search indexes

  • We build various types of inverted indexes.
  • We run an offline process that uses our normalized app data to calculate values for various types of features.
  • We re-index some of our normalized app data in realtime.

When a user searches, we need to crunch data from various parts of our schema. And we need everything to be lightning fast to minimize search response time. Our solution is a three-part data denormalization architecture that can handle all our needs. It looks like this:

As you can see, it’s pretty simple and elegant. But it actually took us a lot of thought to get to this point. Try thinking rigorously about how your own denormalization specification should look, in terms of normalized kernels and denormalization-propagation graphs. You might discover elegant new solutions for your systems.

Good data architecture isn’t just about having a well-designed schema. We’ll show you why it’s also about having a rigorous specification for data denormalization.

First, a quick note about database indexes. Indexes are a special case of data denormalization, so we’ll come back to them at the end of the post. For now, assume a database table/collection/etc is only indexed by primary key or ID. Assume the only available queries are fast ID-based lookups and slow brute-force scans.

An Example

To make our ideas more concrete, let’s write a plausibly-realistic pseudo-schema for an app like RottenTomatoes:

Movie #42
name: 'Toy Story'
year: 1995
actors: [Actor #11, Actor #30, Actor #60] // foreign key references

Actor #11
name: 'Tom Hanks'
height: 1.83 // meters
movies: [Movie #35, Movie #42, Movie #58, Movie #75, Movie #93]
numMovies: 5

You can see we’ve represented the fact that Tom Hanks acted in Toy Story, and we’re using a denormalized schema for Actor-Movie relationships.

Denormalization just means data redundancy. Our schema is denormalized because our Actor.movies and Movie.actors fields are redundant with respect to each other, and our Actor.numMovies field is redundant with respect to both Movie.actors and Actor.movies. If we delete Actor.movies and Actor.numMovies from our schema, we don’t lose any information. We can still read out Tom Hanks’ list of movies and number of movies, although it would involve linearly scanning our Movies. We originally chose to denormalize our data because we wanted to speed up these kinds of read operations at the expense of a little more storage and longer write operations.

What would you say if an engineer showed you this schema and asked you to review their data architecture? You might say that it looks fine and it’s time for them to start coding. But wait – there’s an unanswered question about data denormalization that should be part of any software project’s data architecture phase. You probably didn’t think to ask it, because it’s technically not a question about the schema. It just happens to be a question about something a little beyond the representational power of most schema languages. Can you guess what it is?

The Hidden Question of Denormalization

Whenever you’re looking at a schema design, you always want to ask about two denormalization concepts: normalized kernels and denormalization-propagation graphs.

Normalized Kernels

A normalized kernel (NK) of a denormalized schema is a smaller, normalized schema that contains a subset of the fields in the denormalized schema – and doesn’t lose any of the information in the original schema’s data. You can create an NK for any schema by deleting redundant fields one by one until there are no redundant fields left. (I mean, there are no fields left whose value can be entirely determined by looking at the values of other fields in your NK. At finer granularity, there will always be lots of data redundancy from an information-theory perspective, and that’s okay.) Denormalized schemas have many possible NKs. You should choose one NK, and make it “the” NK for your schema. And you should treat the choice of a normalized kernel as an important data architecture decision for your app, even though it doesn’t affect your actual denormalized schema.

Let’s pick a normalized kernel for our example app. We know we have to keep either Movie.actors or Actor.movies in our NK, and we can’t keep both, because they’re 100% redundant. Which one do you think we should keep?

Sometimes it’s easy to make a decision about which NK fields to keep, like if one of the fields in a redundant set intuitively strikes you as being more fundamental, or if one of the fields is obviously more closely tied to your app’s core functionality. But in our example of Movie.actors vs. Actor.movies, the choice is totally arbitrary. These kinds of arbitrary NK situations happen all the time. You just have to make a choice and stick with it, so that you get a rigorously-defined NK for your schema. Let’s go ahead and choose Movie.actors to be included in our example NK. Let’s not include Actor.numMovies in our NK, because its data is entirely deducible from Movie.actors (and not vice versa).

Now we’ve got a valid NK for our denormalized schema: {Movie.name, Movie.year, Movie.actors, Actor.name, Actor.height}.

Denormalization-Propagation Graphs

A denormalization-propagation graph (DPG) is a visual representation of an automated data denormalization process that shows how information propagates out of your NK fields and into your non-NK fields. Formally, a DPG is a directed acyclic graph whose nodes represent schema fields, and whose edges are chosen such that every field’s data can be deduced from the data in the fields that point to it – except the nodes for fields in the NK, which simply don’t have incoming edges. Our example schema with our NK can have either of the DPGs shown here:

Why NKs and DPGs matter

Normalized kernels and denormalization-propagation graphs are fun to think about, but what’s the payoff? Why not just ignore the denormalization question and write your app the same way you always do? Two reasons:

  1. Abstracting away the implementation details of data denormalization. When an actor comes out with a new movie, your app has to explicitly issue database updates to three fields: Movie.actors, Actor.movies and Actor.numMovies. Will you always remember to do these three updates together? You know better than to think so, so you write ad-hoc functions that bundle together database writes and maintain the consistency of denormalized data. Now you have an addActorInMovie function and a deleteActorInMovie function (so your app can handle the case when actors back out of their movie contracts), among others. In a good-sized app, you’ll get tons of ad-hoc functions like this, and you’ll have to manually update their code whenever you change your schema (in our example, both functions will require a programmer’s attention if we add a Movie.numActors field). Are all these ad-hoc denormalization-related functions really necessary? Hell no. If you have an NK and a DPG, you can just write app code that inserts/deletes/updates the data in your NK. A denormalization-propagation framework that isn’t app-specific can take it from there. Don’t turn yourself into a human compiler by writing and maintaining ad-hoc functions to do the same work that a simple algorithm in a denormalization-propagation framework can do.
  2. Recovering from app-caused data inconsistency. I’m not talking about low-level data corruption like if the power goes out at a bad time or a cosmic ray flips a bit in your RAM. I’m talking about data inconsistency that’s probably your app’s own fault, like if you accidentally deployed some code with a buggy UPDATE call. (Actually, it shouldn’t be possible to create data-inconsistency-causing bugs if your database wrapper uses a good denormalization-propagation framework, but maybe you work on an app with an ordinary database stack and no one on your team has thought rigorously about denormalization specifications.) As long as you’re happy with the current state of the data in your normalized kernel, you can just re-run a denormalizatoin-propagation algorithm to recompute all your denormalized data, starting from your NK and propagating outward as specified by your DPG – and data consistency will be restored.

Indexes

Now that we know how to think about data denormalization, let’s think about indexes. It turns out that indexes give you all the aforementioned benefits of normalized kernels and denormalization-propagation graphs, because database programmers have already thought about NKs and DPGs for you. But they don’t call them “NKs” and “DPGs”; they call them “data” and “index definitions”. Your indexes themselves are just data which lies outside your NK, and your database has a hidden schema for storing your index data together with your NK data. You’ve probably never thought about things this way because database programmers have taken my advice from part 1 of the last section. Databases abstract away their denormalization-propagation algorithm, a.k.a. indexer, so that your app code just has to issue inserts/updates/deletes on “your data”, a.k.a. the normalized kernel of your data with respect to your index definitions.

If you’re good at using indexes, and you make sure you always keep your (non-index) data normalized, then your database’s indexer is the only denormalization-propagation algorithm you need. But most sophisticated DPGs are beyond the expressive power of a typical database’s index definition language. If you’re doing schema design for a good-sized app, you’ll probably end up with a DPG that can’t be converted into a set of index definitions. For example, maybe you want to denormalize the sum Proposition.yesVotes + Proposition.noVotes into the field Proposition.totalVotes (because you want to denormalize further into an index on Proposition.totalVotes which you can use to efficiently sort ballot propositions by their total number of votes). So you should try to use a data stack that comes with a more general denormalization-propagation framework than a typical database indexer – ideally one that can handle any DPG.

Conclusion

If you understand NKs and DPGs, you’ll think more clearly about how your app’s data denormalization works. Even if you don’t have a general data-propagation framework installed – even if you’re using a conventional data stack. You’ll plan your app’s architecture better and you’ll write more elegant code. To tell you the truth, we haven’t yet needed a denormalization-propagation framework here at Quixey, and we don’t know if a good framework even exists. If you really want to understand data denormalization, you should write one!

That mentality, of starting a company around a problem not around a solution, is what let us see the domain better than anybody else. That’s what let us invent apps as first order objects, and invent the concept of Functional Search™.

Want to know the thinking and philosophy behind Quixey’s Functional Search™ technology? It’s all explained in this new video with Liron Shapira, Quixey co-founder and CTO.

YouTube Preview Image

In our engineering interviews, we often ask candidates to write a binary search on paper, and the candidates always do it with lower and upper variables. But they almost always fail to maintain consistent semantics for these variables, and their algorithms are usually buggy as a result.

For example, this Quixey Challenge problem demonstrates a bug we often see during interviews. Can you find it?

def find_in_sorted(arr, x):
    '''
        Returns index of x in arr
        Precondition: arr is sorted
    '''

    def binsearch(lower, upper):
        if lower == upper:
            return -1
        mid = lower + (upper - lower) // 2
        if x < arr[mid]:
            return binsearch(lower, mid)
        elif x > arr[mid]:
            return binsearch(mid, upper)
        else:
            return mid

    return binsearch(0, len(arr))

We tell our interview candidates to take their time and make sure their algorithm is correct before we judge it. They often respond that it’s impossible to have confidence about their algorithm’s correctness before running it. This “guess and check” approach to algorithm design is fine for quick hacks, but we’re going to explain why it lacks something important.

Rice’s Theorem

You’ve probably heard of the halting problem, and you’ve probably seen a proof of why a program’s halting behavior is impossible to analyze in the general case. But you might not know there’s a profound generalization called Rice’s theorem which follows as a simple corollary.

Imagine you’re given an arbitrary computer program. You know you might not ever be able to figure out whether it halts, but it seems like you should at least be able to figure out whether its behavior is equivalent to the program print 'hello world', right?

Actually, Rice’s theorem says that computer programs are resistant to understanding of any kind in the general case. So no matter how much static analysis and simulation you run on the code, no matter how many different sticks you poke it with, even the seemingly trivial property of being a Hello World implementation is generally impossible to discern.

As a programmer, you have to wade into the Ricean ocean of un-analyzable code, and find a way to float. Each line of code you write is an opportunity to fall off the surface of provable correctness into the chaotic froth.

Invariants

In light of Rice’s Theorem, how can you manage to write correct programs? The answer is that you stay away from the Ricean “general case” by maintaining invariants – statements about your program that remain true as the program runs.

A common type of invariant you want to maintain is variable semantics. For example, a good binary search might maintain the invariant that the lower variable is always less than or equal to the algorithm’s final output (except when the binary search turns up empty-handed).

Another common type of invariant applies to the program’s control logic. For example, you can declare that any time the value we’re binary-searching for isn’t present in the input list, control must be routed to a single return -1 statement. Many engineering candidates fail to maintain this invariant, and their algorithms crash when given an empty list. If you’re careful to maintain the invariant above, an empty list won’t even be a special case, because your algorithm will handle it the same way it handles an empty search interval demarcated by the values of lower and upper.

It’s important to approach coding problems with invariants in mind. An invariant is like a fragile shell that protects the correctness of your code as long as it’s properly maintained. And if you’ve ever tried gluing a cracked egg back together, you know what it’s like to debug an algorithm with no well-maintained invariants.

When we watch an engineering candidate write code, we try to gauge how well they’re thinking in terms of invariants. If we see them adding a bunch of special-case logic to their binary search, it’s a telltale sign that they haven’t developed an invariant-savvy mindset. The best programmers have a high-level understanding of their code, and know how to wield invariants to gracefully avoid off-by-one bugs. That’s why we test engineering candidates by asking them to write a perfect binary search on paper.

We hire engineers who think in algorithms, dream in code, and eat control structures for breakfast (e.g. Froot Loops). We’re looking for people with a sixth sense for good code – people who can fix bugs with their mind. So today, we’re introducing the Quixey Challenge.

The Quixey Challenge
We show you a famous algorithm with a bug, you fix it in 1 minute or less.

The Prize

If you successfully complete the challenge, we’ll send you

  1. $100
  2. An exclusive “founder’s edition” Quixey T-shirt

Not bad for 1 minute!

Ground Rules

  1. The algorithm will be implemented as a Python function.
  2. The function will not have a syntax error.
  3. The function won’t always return the correct thing.
  4. It will be possible to make the function return the correct thing either by adding one line of code, or by modifying one line of code.

Test Drive

YouTube Preview Image

Winners

  1. Tom McCabe
  2. Mathijs Vogelzang
  3. Zachary Travis
  4. Andy Terrel
  5. Zack Bloom
  6. Tanooj Luthra
  7. Dylan Lukes
  8. Felipe Restrepo
  9. David Bieber
  10. Jacob Hurwitz
  11. Mark Vitale
  12. Karl-Aksel Puulmann
  13. Arjun Comar
  14. Anand Oza
  15. Shawn Park

How to Play

Got a minute? Just send an IM to Skype user quixeychallenge. We’ll be monitoring that account today (Monday, October 10) from 7am to 7pm PDT. Good luck!

*Update: Due to the success of the Quixey Challenge, we created quixeychallenge.com, where you can complete practice questions and qualify for upcoming challenges.