In an earlier post, in the overview of the Cortana offering, we talked about the Cognitive services offered by Microsoft.

In the meantime we got our invitation for the Custom Recognition Intelligent Service, or CRIS.
The information is still very much under construction, but it seems that this service offers an elaborate set of tools to configure your speech-to-text applications in a detailed way.


There is already a speech-to-text API in the Microsoft offering, so why this extra one? If you expect voice queries to your application to contain particular vocabulary items, such as product names or jargon that rarely occur in typical speech, it is likely that you can obtain improved performance by customizing the language model.

For example, if you were building an app to search MSDN by voice, it’s likely that terms like “object-oriented” or “namespace” or “dot net” will appear more frequently than in typical voice applications. Customizing the language model will enable the system to learn this.

Similarly, customizing the model can enable the system to learn to do a better job recognizing speech in atypical environments. For example, if you have an app designed to be used by workers in a warehouse or factory, a customized acoustic model can more accurately recognize speech in the presence of the noises found in these environments.

What is it?

By uploading speech and/or text data to CRIS (that reflects your application and your users), you can create custom models that can be used in conjunction with Microsoft’s existing speech models.


Speech recognition systems are composed of several components. Two of the most important are the acoustic model and the language model.

The acoustic model is a classifier that assigns each short fragment of audio to one of a number of phonemes, or sound units, in a given language. For example, the word “speech” consists of four phonemes: “s p iy ch”.

The language model is a probability distribution over sequences of words. The language model helps the system decide among sequences of words that sound similar, based on the likelihood of the word sequences themselves. For example, “recognize speech” and “wreck a nice beach” sound alike but the first hypothesis is far more likely to occur, and therefore will be assigned a higher score by the language model.
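
To make the idea of "a higher score" concrete, here is a minimal sketch of a toy bigram language model with add-one smoothing. It is only an illustration and not how CRIS works internally: the tiny corpus, the function names `train_bigram` and `log_prob`, and the choice of smoothing are all our own assumptions.

```python
import math
from collections import Counter


def train_bigram(corpus):
    """Count unigrams and bigrams over a list of sentences (toy example)."""
    unigrams = Counter()
    bigrams = Counter()
    vocab = {"<s>", "</s>"}
    for sentence in corpus:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        vocab.update(tokens)
        # Count each token as a "previous word" and each adjacent pair.
        unigrams.update(tokens[:-1])
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return unigrams, bigrams, len(vocab)


def log_prob(sentence, model):
    """Log-probability of a sentence under the bigram model (add-one smoothing)."""
    unigrams, bigrams, vocab_size = model
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    total = 0.0
    for prev, cur in zip(tokens[:-1], tokens[1:]):
        # Add-one smoothing keeps unseen word pairs from getting probability zero.
        total += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))
    return total


# Hypothetical in-domain training text, invented for this illustration.
corpus = [
    "we recognize speech",
    "systems recognize speech well",
    "recognize speech in noise",
    "a nice day at the beach",
]
model = train_bigram(corpus)

# The in-domain phrase scores higher than its sound-alike.
print(log_prob("recognize speech", model) > log_prob("wreck a nice beach", model))
```

With this corpus the model has seen “recognize speech” several times and “wreck” never, so the first hypothesis wins. Customizing the language model with your own text shifts these counts toward your domain in the same spirit.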

Both the acoustic and language models are statistical models learned from training data. As a result, they perform best when the speech they encounter when used in applications is similar to the data observed during training.

How does it work?

You upload your different “models” to the environment and then you have to deploy them. The requirements for the acoustic model are:

File Format: RIFF (WAV)
Sampling Rate: 8000 Hz or 16000 Hz
Channels: 1 (mono)
Sample Format: PCM, 16 bit integers
File Duration: 0.1 seconds < duration < 60 seconds
Silence Collar: > 0.1 seconds
Archive Format: Zip
Maximum Archive Size: 2 GB
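
Before zipping and uploading, it can save time to check each recording against the requirements listed above. The following is a rough sketch using only Python's standard `wave` module. The function name `check_wav` is our own, and the silence collar is not checked here because that would require looking at the audio samples themselves.

```python
import wave

ALLOWED_RATES = {8000, 16000}  # sampling rates from the requirements above


def check_wav(path):
    """Return a list of problems found in one WAV file (empty list means OK)."""
    problems = []
    # The wave module only reads uncompressed PCM RIFF/WAV files and raises
    # wave.Error for anything else.
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1:
            problems.append("must be mono (1 channel)")
        if wav.getsampwidth() != 2:
            problems.append("samples must be 16 bit (2 bytes)")
        rate = wav.getframerate()
        if rate not in ALLOWED_RATES:
            problems.append("sampling rate must be 8000 Hz or 16000 Hz")
        duration = wav.getnframes() / rate if rate else 0.0
        if not 0.1 < duration < 60:
            problems.append("duration must be between 0.1 and 60 seconds")
    return problems
```

You would call `check_wav("recording.wav")` for each file (the file name is just a placeholder) and only add the files that return an empty list to your Zip archive.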

The custom language model (the text file) for the corresponding acoustic model has the following properties:

Text Encoding (en-US): US-ASCII or UTF-8
Text Encoding (zh-CN): UTF-8
Utterances per line: 1
Maximum File Size: 2 GB
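
The text file can be checked in a similar way. This is a small sketch under a few assumptions of ours: the function name `check_language_data` is made up, "2 GB" is read as 2 × 1024³ bytes, and a line is only flagged when it is empty or not valid UTF-8 (US-ASCII is a subset of UTF-8, so one check covers both encodings). A script cannot really tell whether a line holds exactly one utterance, so that part stays up to you.

```python
import os

MAX_BYTES = 2 * 1024 ** 3  # assumption: "2 GB" interpreted as 2 GiB


def check_language_data(path):
    """Return a list of problems found in a language data text file."""
    problems = []
    if os.path.getsize(path) > MAX_BYTES:
        problems.append("file is larger than 2 GB")
    # Read raw bytes so that each line can be decoded and reported separately.
    with open(path, "rb") as handle:
        for number, raw in enumerate(handle, start=1):
            try:
                line = raw.decode("utf-8")
            except UnicodeDecodeError:
                problems.append(f"line {number}: not valid UTF-8")
                continue
            if not line.strip():
                problems.append(f"line {number}: empty, expected one utterance")
    return problems
```

An empty result from `check_language_data("utterances.txt")` (again a placeholder name) means the file passed these basic checks.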

After that you have to create your endpoint (URL), where you list your acoustic and language model.


You then deploy the endpoints (URL).


Then you are ready to call your CRIS model, either through the client API or through REST.

It is as easy as that!