Augmented Reality has never really been of high interest to me over the last couple of years. I think this is most probably down to the idea that AR is still in its early days: the technology is not yet powerful enough to create anything of use, and it has yet to be widely adopted.
Recently, however, I have taken a dive into Apple’s ARKit framework. After watching the examples shown at WWDC last year and exploring some of the examples provided in their documentation I have become a lot more excited. I say excited because although what is on offer at the moment may not be completely stable, ARKit provides a lot of potential for the future and is only going to get better. Just the way ARKit relays and processes information was enough to keep me exploring the science behind it!
I have unfortunately found it difficult to find a resource that explains the essentials of ARKit. As always, the developer documentation is extremely informative but mainly through implementation, types, and methods, rather than providing palatable overviews.
This story is split into the three main components of ARKit: world tracking, scene understanding, and scene rendering. It aims to provide a high-level overview rather than a working tutorial.
Breaking things down
ARKit can be split into 3 distinct parts:
1. World Tracking
2. Scene Understanding
3. Scene Rendering
Each component directly complements the others, ultimately to display 2D or 3D virtual content for the user to see and interact with.
How to track the world
ARKit uses a technique called visual-inertial odometry, which combines what the camera sees with data from the device’s motion sensors to estimate the translation (motion) and orientation of the device’s camera in relation to its starting position in reality.
In simple terms, the device tracks where it is in the world in relation to features in its surrounding environment, such as a corner of a table. The more features it can see, the better the tracking and positioning of virtual content.
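To make the idea of translation a little more concrete, here is a minimal Swift sketch. ARKit reports the camera's pose as a 4x4 transform whose fourth column holds the position in metres. The `Transform` typealias, the two helper functions, and the example poses below are my own stand-ins for illustration, not ARKit types, so the sketch can run without a device.

```swift
// A column-major 4x4 transform written as four columns. ARKit itself uses
// simd_float4x4; this tuple is only a stand-in so the sketch runs without ARKit.
typealias Transform = (SIMD4<Float>, SIMD4<Float>, SIMD4<Float>, SIMD4<Float>)

/// Pulls the translation (x, y, z in metres) out of the fourth column.
func translation(of transform: Transform) -> SIMD3<Float> {
    let column = transform.3
    return SIMD3<Float>(column.x, column.y, column.z)
}

/// How far the camera has moved between two poses, per axis.
func movement(from start: Transform, to current: Transform) -> SIMD3<Float> {
    return translation(of: current) - translation(of: start)
}

// Hypothetical poses: the session starts at the origin with no rotation,
// and the camera later sits 0.3 m along x and 0.4 m along z.
let start: Transform = (
    SIMD4<Float>(1, 0, 0, 0),
    SIMD4<Float>(0, 1, 0, 0),
    SIMD4<Float>(0, 0, 1, 0),
    SIMD4<Float>(0, 0, 0, 1)
)
let current: Transform = (
    SIMD4<Float>(1, 0, 0, 0),
    SIMD4<Float>(0, 1, 0, 0),
    SIMD4<Float>(0, 0, 1, 0),
    SIMD4<Float>(0.3, 0, 0.4, 1)
)

let moved = movement(from: start, to: current)
print(moved)
```

The first three columns carry the orientation, which this sketch leaves as the identity to keep things simple.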
World Tracking makes use of Scene Understanding, which ultimately generates a 3D coordinate system from the device. But before this coordinate system is established, let’s talk about feature points.
ARKit automatically detects features from the scene produced by the device’s camera and tracks those features across multiple frames. As the user moves and more features are detected, information such as the camera’s position relative to the features is collected. The more features that are detected, the higher the accuracy.
The yellow dots displayed on the image show the feature points being tracked. Notice how these features are the differences in the wood grain. The more of these differences in the camera’s scene the more feature points there will be for ARKit to recognise.
The information relayed back about the planes includes a node and an anchor. The node is a SceneKit node that ARKit has created, and it holds some information in its properties such as its orientation and position. The anchor provides further information such as the extent (the estimated width and length) as well as the centre of the plane.
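As a rough illustration of what the centre and extent give you, the sketch below works out the corners and area of a detected plane. `PlaneEstimate` is a hypothetical stand-in I have made up for this example, not ARKit's own anchor type. It assumes a horizontal plane whose extent is stored along x (width) and z (length), which is how ARKit's plane anchors report it as far as I understand.

```swift
/// A stand-in for the two values described above; not ARKit's own type.
struct PlaneEstimate {
    var center: SIMD3<Float>   // centre of the plane, in the anchor's local space
    var extent: SIMD3<Float>   // estimated size: x is the width, z is the length

    /// Estimated surface area in square metres.
    var area: Float {
        return extent.x * extent.z
    }

    /// The four corners of the plane, in the anchor's local space.
    var corners: [SIMD3<Float>] {
        let halfX = extent.x / 2
        let halfZ = extent.z / 2
        return [
            center + SIMD3<Float>(-halfX, 0, -halfZ),
            center + SIMD3<Float>(halfX, 0, -halfZ),
            center + SIMD3<Float>(halfX, 0, halfZ),
            center + SIMD3<Float>(-halfX, 0, halfZ)
        ]
    }
}

// Hypothetical values for a table top, 1 m wide and 0.5 m long.
let table = PlaneEstimate(center: SIMD3<Float>(0.1, 0, -0.2),
                          extent: SIMD3<Float>(1.0, 0, 0.5))
print(table.area)
print(table.corners)
```

Because the extent is only an estimate, these values tend to change as ARKit sees more of the surface.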
Just to clarify, a node is essentially a coordinate point in the scene and suggests where this plane actually is in the 3D coordinate system described earlier.
So, how does ARKit actually position virtual content?
Well, content usually starts out from a 2D point on the device’s display, such as the centre of the screen or the spot the user has tapped. This 2D point tells us where on the screen the content should appear, but what about the distance from the device?
Introducing Hit Testing. Imagine a hit test as a ray that the device fires out from the 2D point given to us above. The intersection of the said ray and the plane detected from earlier provides the 3D coordinate point. We now have a coordinate of (x: red, y: green, z: blue) where virtual content can be placed 🎊.
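The ray idea can be sketched with a little vector maths. The function below is my own simplified illustration of a ray meeting a plane, not ARKit's actual implementation, and the camera position, ray direction, and floor plane are made-up values.

```swift
/// Returns the point where a ray meets a plane, or nil if it never does.
func intersect(rayOrigin: SIMD3<Float>,
               rayDirection: SIMD3<Float>,
               planePoint: SIMD3<Float>,
               planeNormal: SIMD3<Float>) -> SIMD3<Float>? {
    // Dot product of the ray direction and the plane normal.
    let denominator = (rayDirection * planeNormal).sum()
    // A ray parallel to the plane never crosses it.
    guard abs(denominator) > 1e-6 else { return nil }

    // Distance along the ray (in multiples of rayDirection) to the plane.
    let t = ((planePoint - rayOrigin) * planeNormal).sum() / denominator
    // A negative t means the plane is behind the ray's starting point.
    guard t >= 0 else { return nil }

    return rayOrigin + rayDirection * t
}

// Hypothetical set-up: the camera is 1 m above a floor plane at y = 0
// and the ray points down and forwards.
let hit = intersect(rayOrigin: SIMD3<Float>(0, 1, 0),
                    rayDirection: SIMD3<Float>(0, -1, -1),
                    planePoint: SIMD3<Float>(0, 0, 0),
                    planeNormal: SIMD3<Float>(0, 1, 0))
print(hit as Any)
```

Returning an optional keeps the two failure cases (a parallel ray, or a plane behind the camera) explicit rather than handing back a meaningless point.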
You can also place virtual content manually and measure distances using hit tests. ARKit’s SceneKit view, ARSCNView, provides a method that can fire hit tests given a 2D coordinate, which can be provided through gesture recognisers. The resulting 3D point can be used to display content. The distance between multiple coordinate points selected by the user can be used to calculate measurements in the real world too.
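Measuring between two points boils down to the straight-line distance between their 3D coordinates. Here is a minimal sketch, assuming two hit test results have already been turned into world positions in metres. The `distance(from:to:)` helper and the tap positions are hypothetical.

```swift
/// Straight-line distance between two points, in the same unit as the inputs.
func distance(from a: SIMD3<Float>, to b: SIMD3<Float>) -> Float {
    let delta = b - a
    // Square each component, add them up, then take the square root.
    return (delta * delta).sum().squareRoot()
}

// Hypothetical world positions returned by two taps on the same surface.
let firstTap = SIMD3<Float>(0, 0, -1)
let secondTap = SIMD3<Float>(0.3, 0, -1.4)

let metres = distance(from: firstTap, to: secondTap)
print(metres * 100, "cm")
```

Since ARKit works in metres, multiplying by 100 gives a reading in centimetres.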
Keeping things simple, the two high-level frameworks used to render content are SceneKit and SpriteKit. SceneKit is used to display 3D content and SpriteKit 2D content.
I have yet to explore how SceneKit actually renders content, but it displays content from a scene. This scene contains a hierarchy of nodes, which can be visual graphics from the assets folder or shape geometry. SpriteKit is similar, although it uses 2D SKNodes which rotate as the user moves around, almost like a 2D visual overlay at a certain coordinate.
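To picture what a hierarchy of nodes means, here is a small sketch of a scene graph. `SketchNode` is a hypothetical class of my own and a heavy simplification: it only handles position, whereas real SceneKit nodes also carry rotation, scale, and geometry.

```swift
/// A simplified stand-in for a scene-graph node: translation only.
final class SketchNode {
    let name: String
    var position: SIMD3<Float>                    // relative to the parent node
    private(set) var children: [SketchNode] = []
    private(set) weak var parent: SketchNode?     // weak to avoid a retain cycle

    init(name: String, position: SIMD3<Float> = .zero) {
        self.name = name
        self.position = position
    }

    func addChild(_ child: SketchNode) {
        child.parent = self
        children.append(child)
    }

    /// Position in the scene's coordinate space, found by walking up to the root.
    var worldPosition: SIMD3<Float> {
        return (parent?.worldPosition ?? .zero) + position
    }
}

// Hypothetical scene: a cup sitting on a detected table plane.
let root = SketchNode(name: "root")
let tablePlane = SketchNode(name: "table", position: SIMD3<Float>(0, -0.5, -1))
let cup = SketchNode(name: "cup", position: SIMD3<Float>(0.2, 0.1, 0))
root.addChild(tablePlane)
tablePlane.addChild(cup)

print(cup.worldPosition)
```

The useful property of a hierarchy like this is that moving the table node moves the cup with it, because the cup's position is stored relative to its parent.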
There is still a lot to learn.
As I’ve mentioned, this is a high-level overview of the processes behind ARKit that tries not to focus too much on types or implementation. Ultimately, there is so much you can do with ARKit, such as content interaction, environmental texturing, light estimation (shadows), image and 3D object detection, and more. The framework will keep growing. I’m excited for what the future holds and the innovative ideas that people will inevitably come up with.
Check out Mitch’s website for more iOS and industry relevant blogs: https://medium.com/@mitch_little