Analysing visual content using HoloLens, Computer Vision APIs, Unity and the Windows Mixed Reality Toolkit

In these days, I’m exploring the combination of HoloLens/Windows Mixed Reality and the capabilities offered by Cognitive Services to analyse and extract information from images captured via the device camera and processed using the Computer Vision APIs and the intelligent cloud.
In this article, we’ll explore the steps I followed for creating a Unity application running on HoloLens and communicating with the Microsoft AI platform.

Registering for Computer Vision APIs

The first step was to navigate to the Azure portal https://portal.azure.com and create a new Computer Vision API resource:

I noted down the Keys and Endpoint and started investigating how to approach the code for capturing images on HoloLens and sending them to the intelligent cloud for processing.

Before creating the Unity experience, I decided to start with a simple UWP app for analysing images.

Writing the UWP test app and the shared library

There are already some samples available for Cognitive Services APIs, so I decided to reuse some code available and described in this article here supplemented by some camera capture UI in UWP.

I created a new Universal Windows app and library (CognitiveServicesVisionLibrary) to provide, respectively, a test UI and some reusable code that could be referenced later by the HoloLens experience.

The Computer Vision APIs can be accessed via the package Microsoft.ProjectOxford.Vision available on NuGet so I added a reference to both projects:

The test UI contains an image and two buttons: one for selecting a file using a FileOpenPicker and another for capturing a new image using the CameraCaptureUI. I decided to wrap these two actions in an InteractionsHelper class:

I then worked on the shared library creating a helper class for processing the image using the Vision APIs available in Microsoft.ProjectOxford.Vision and parsing the result.

Tip: after creating the VisionServiceClient, I received an unauthorised error when specifying only the key: the error disappeared by also specifying the endpoint URL available in the Azure portal.

I then launched the test UI, and the image was successfully analysed, and the results returned from the Computer Vision APIs, in this case identifying a building and several other tags like outdoor, city, park: great!

I also added a Speech Synthesizer playing the general description returned by the Cognitive Services call:

I then moved to HoloLens and started creating the interface using Unity, the Mixed Reality Toolkit and UWP.

Creating the Unity HoloLens experience

First of all, I created a new Unity project using Unity 2017.2.1p4 and then added a new folder named Scenes and saved the active scene as CognitiveServicesVision Scene.

I downloaded the corresponding version of the Mixed Reality Toolkit from the releases section of the GitHub project https://github.com/Microsoft/MixedRealityToolkit-Unity/releases and imported the toolkit package HoloToolkit-Unity-2017.2.1.1.unitypackage using the menu Assets->Import Package->Custom package.

Then, I applied the Mixed Reality Project settings using the corresponding item in the toolkit menu:

And selected the Scene Settings adding the Camera, Input Manager and Default Cursor prefabs:

And finally set the UWP capabilities as I needed access to the camera for retrieving the image, the microphone for speech recognition and internet client for communicating with Cognitive Services:

I was then ready to add the logic to retrieve the image from the camera, save it to the HoloLens device and then call the Computer Vision APIs.

Creating the Unity Script

The CameraCaptureUI UWP API is not available in HoloLens, so I had to research a way to capture an image from Unity, save it to the device and then convert it to a StorageFile ready to be used by the CognitiveServicesVisionLibrary implemented as part of the previous project.

First of all, I enabled the Experimental (.NET 4.6 Equivalent) Scripting Runtime version in the Unity player for using features like async/await. Then, I enabled the PicturesLibrary capability in the Publishing Settings to save the captured image to the device.

Then, I created a Scripts folder and added a new PhotoManager.cs script taking as a starting point the implementation available in this GitHub project.

The script can be attached to a TextMesh component visualising the status:

Initialising the PhotoCapture API available in Unity https://docs.unity3d.com/Manual/windowsholographic-photocapture.html

Saving the photo to the pictures library folder and then passing it to the library created in the previous section:

The code references the CognitiveServicesVisionLibrary UWP library created previously: to use it from Unity, I created a new Plugins folder in my project and ensured that the Build output of the Visual Studio library project was copied to this folder:

And then set the import settings in Unity for the custom library:

And for the NuGet library too:

Nearly there! Let’s see now how I enabled Speech recognition and Tagalong/Billboard using the Mixed Reality Toolkit.

Enabling Speech

I decided to implement a very minimal UI for this project, using the speech capabilities available in HoloLens for all the interactions.

In this way, a user can just simply say the work Describe to trigger the image acquisition and the processing using the Computer Vision API, and then naturally listening to the results.

In the Unity project, I selected the InputManager object:

And added a new Speech Input Handler Component to it:

Then, I mapped the keyword Describe with the TakePhoto() method available in the PhotoManager.cs script already attached to the TextMesh that I previously named as Status Text Object.

The last step required to enable Text to Speech for receiving the output: I simply added a Text to Speech component to my TextMesh:

And enabled the speech in the script using StartSpeaking():

I also added other two components available in the Mixed Reality Toolkit: Tagalong and Billboard to have the status text following me and not anchored to a specific location:

I was then able to generate the final package using Unity specifying the starting scene:

And then I deployed the solution to the HoloLens device and started extracting and categorising visual data using HoloLens, Camera, Speech and the Cognitive Services Computer Vision APIs.

Conclusions

The combination of Mixed Reality and Cognitive Services opens a new world of experiences combining the capabilities of HoloLens and all the power of the intelligent cloud. In this article, I’ve analysed the Computer Vision APIs, but a similar approach could be applied to augment Windows Mixed Reality apps and enrich them with the AI platform https://www.microsoft.com/en-gb/ai.

The source code for this article is available on GitHub: https://github.com/davidezordan/CognitiveServicesSamples

 

Leave a Reply

Your email address will not be published. Required fields are marked *