Google’s VPS: An Analysis Attempt from Afar

It was an impressive-looking demo reel: an indoor guidance system that uses no on-site hardware at all. That runs counter to the usual wisdom that indoor positioning is inaccurate without radio support. It is a promise of epic proportions, and yet, thinking it through, it doesn’t look as revolutionary as the demo reel makes it out to be.

First, note that the demo requires a Tango phone. This isn’t an issue for the scanning part – a store can easily spare the cash to buy or rent a phone and perform a large, detailed walkthrough of the entire store with no customers around – but it is a bit more troublesome for the access part. This isn’t to say that the demo is unimpressive – it certainly is impressive – but a more impressive demo would have used any camera to view the output, leveraging Google’s massive cloud processing to make sense of the image input. But I digress.

Little detail was publicly shown on how VPS works under the hood, so I am left to assume that it is a scaled-up version of Tango’s workings. That is to say, I assume the main magic in the cloud processing and preparation is building a model of the space, so that when Tango selects its anchor POIs, it can match them against the ones in the database. I could not find anyone at Google I/O who was willing or able to divulge more on the subject, but some tidbits did come out elsewhere.
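Google has said nothing about how that matching works, so the following is purely my speculation about the general shape of it: brute-force nearest-neighbour matching of feature descriptors with a ratio test to throw out ambiguous hits. The function names, descriptor shapes and threshold are all my own assumptions, not anything Google has confirmed.

```python
import numpy as np

def match_pois(query_desc: np.ndarray, db_desc: np.ndarray, ratio: float = 0.75):
    """Match live POI descriptors against a prebuilt database (hypothetical).

    query_desc: (Q, D) descriptors detected on the phone right now.
    db_desc:    (N, D) descriptors from the cloud-built model of the space.
    Returns (query_index, db_index) pairs that pass the ratio test.
    """
    matches = []
    for qi, q in enumerate(query_desc):
        dists = np.linalg.norm(db_desc - q, axis=1)   # distance to every database descriptor
        best, second = np.argsort(dists)[:2]          # two closest candidates
        # keep the match only if it is clearly better than the runner-up
        if dists[best] < ratio * dists[second]:
            matches.append((qi, int(best)))
    return matches
```

At VPS scale the database obviously could not be searched brute force like this; presumably some spatially partitioned index narrows the candidates first, which is where the coarse positioning discussed below would come in.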

On Tango itself, I/O17 had a semi-supervised demo where participants were allowed to carry the phones around. The scenes were fairly staged, presumably to let the phones easily pick up tracking points. For the purposes of this entry, I will assume POI detection is identical between the Tango AR demos and the VPS demo. That scenario worked fairly well, but it benefited from an abundance of POIs: a patterned floor, a textured wall, and very bright lights.

A demo that fared quite differently tried to show off Tango as a museum-goer’s AR tool. It was staged in a similar fashion to the Singapore Art and Science Museum tour: we were each given a Tango phone on a selfie stick and allowed to walk around the small “exhibit”, while a tour guide led us through the show and could direct the AR world for all participants. The main POI staging appeared to be the floor, as captured by @stevekovach:

Blurry, but at the bottom of the picture we see a very conspicuously patterned tile on the floor. That tile was the “base” for all the AR demos shown. Notably, when I walked around the exhibit – something we were allowed to do but few people did – and viewed it from the other side of the globe, the AR picture had a tendency to wander off the viewing area. Similarly, trying to view the AR image from below – pointing the camera up towards the ceiling – would cause the world to shift.

In a quick chat with a technician, I learned some interesting tidbits about that behaviour.

  • Tango’s positioning is relatively coarse. The phone can deduce its position at a granularity of around 1 metre, but needs to make assumptions to recover the rest of the accuracy.
  • One of the ways is to make assumptions about the world. For example, the impressive Tango room-scanning demo relies on the assumption that any wall less than approximately 10 cm thick is a back-to-back wall, and will render it flat (look carefully at the demo and you will notice that the wall between rooms is unnaturally thin).
  • Another way is to triangulate between known POIs. Assuming you capture, and are reasonably certain about, a set of POIs, it is possible to measure your movement relative to those POIs (a sketch of the idea follows below this list).
  • For the Tango room scanning, all processing appears to be done as a batch at the end. While the user is walking around, the phone records a video that takes up approximately 1 GB per minute, consisting of raw, uncompressed RGB data plus the depth-sensor data (some rough numbers below suggest that figure is plausible).
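The 1 GB per minute figure sounds about right for raw frames. A quick back-of-envelope check – the resolutions and frame rates here are nothing more than my guesses, not published Tango specifications:

```python
# Back-of-envelope check of the ~1 GB/minute figure.
# The resolutions and frame rates below are assumptions, not Tango specs.
rgb_bytes_per_sec = 640 * 480 * 3 * 30      # 640x480 RGB, 3 bytes/pixel, 30 fps
depth_bytes_per_sec = 320 * 180 * 2 * 5     # 16-bit depth map at 5 fps
total_gb_per_min = (rgb_bytes_per_sec + depth_bytes_per_sec) * 60 / 1e9
print(f"{total_gb_per_min:.1f} GB per minute")   # ~1.7 GB/min, the right order of magnitude
```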
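The triangulation point is worth unpacking, since it is presumably what closes the gap between the 1 metre grid and AR-grade accuracy. One standard way to do it – I have no idea whether Tango does exactly this – is to take the POIs the phone currently sees, look up their known coordinates in the scanned model, and solve for the rigid transform that best aligns the two point sets (the Kabsch/Umeyama alignment). A minimal sketch, with all names and shapes being my own assumptions:

```python
import numpy as np

def estimate_pose_from_pois(poi_camera: np.ndarray, poi_world: np.ndarray):
    """Find R, t such that R @ p_camera + t ~= p_world for matched POIs,
    via the Kabsch/Umeyama least-squares alignment.

    poi_camera: (N, 3) POI positions as measured in the phone's frame.
    poi_world:  (N, 3) the same POIs' known positions in the scanned model.
    """
    cc = poi_camera.mean(axis=0)                    # centroids of both point sets
    cw = poi_world.mean(axis=0)
    H = (poi_camera - cc).T @ (poi_world - cw)      # cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))          # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cw - R @ cc                                 # t is the phone's world position,
    return R, t                                     # since the phone sits at its frame's origin
```

With three or more well-matched, non-collinear POIs this pins the pose down far more tightly than the 1 metre grid, which would also explain the behaviour in the museum demo: good tracking near the patterned tile, and drift whenever those POIs fall out of view.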

Putting these pieces together, it looks reasonably possible that Google’s cloud processing for the scanning is a scaled-up version of the room scanning. A Lowe’s walkthrough would be 30–60 minutes – far more data than the phone could reasonably process, but easily within reach of the cloud. The phone would then use its position recognition to get a rough “lock” on its location within the 1 metre grid, and then use nearby known POIs to triangulate the rest of the way down to the accuracy needed for AR rendering and navigational guidance.
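Under that reading, the runtime flow on the phone would look roughly like the sketch below, reusing the hypothetical helpers from earlier. Again, this is my guess at the architecture, not anything Google has described; the grid size, the model layout and the fallback behaviour are all assumptions.

```python
def locate(coarse_position, live_descriptors, live_poi_camera, model):
    """Hypothetical two-stage localisation: coarse grid lock, then POI refinement.

    model is assumed to map a 1 m grid cell to the (descriptors, world coords)
    of the POIs that the cloud-side scan recorded for that neighbourhood.
    """
    cell = tuple(int(c) for c in coarse_position)        # snap to the ~1 m grid
    db_desc, db_world = model[cell]                      # fetch nearby known POIs
    matches = match_pois(live_descriptors, db_desc)      # ratio-tested matches
    if len(matches) < 3:
        return coarse_position, None                     # too few POIs: stay coarse
    q_idx, d_idx = zip(*matches)
    R, t = estimate_pose_from_pois(live_poi_camera[list(q_idx)], db_world[list(d_idx)])
    return t, R                                          # refined position and orientation
```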

It’s an impressive technology nevertheless, and at Google’s pace of miniaturisation, every phone might eventually be a Tango phone.
