Machine learning and advanced analytics
In an earlier post we mentioned the Cortana Analytics suite as it was called at that time. The name changed to Cortana Intelligence suite in the meantime.
Microsoft put five different categories of tools and solutions under the Cortana umbrella. In the first post we discussed the “Information Management” part, in the second we focused on “Big data stores”, and this one is all about “Machine learning and advanced analytics”.
There are three different parts under this heading: Machine Learning, HDInsight and Stream Analytics. We discuss them in more detail in the next chapters, but focus mainly on the Microsoft offerings.
Machine Learning is one of our favourite parts in the Cortana stack. The whole solution is based on the Azure Machine Learning Studio, an online tool to “draw” data science processes. At the beginning of 2016 we posted a first article on the subject; you can find it here.
The Azure ML Studio can be found here; you should definitely take a look and try it out, as trial accounts are available. It is a very nice tool for laying out the different steps and actions as a flow diagram. Below you can see a screenshot from an actual exercise in Microsoft’s Data Science track, something we will discuss later once I have finished the whole track:
On the left of the screen you find all the different actions you can use in a step. There is plenty of choice and they are updated regularly. If you don’t find what you are looking for, you can always integrate your own code in R or Python; it integrates seamlessly into the environment.
A nice touch is that within the canvas you can use a lot of example datasets to start playing and testing different ML models:
Once you start building what they call an “experiment”, you can drag and drop the different actions onto the canvas. At the end you can even publish your model as a web service very easily and use it in whatever application you want:
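As a rough illustration of calling such a published web service from your own application, the Python sketch below only builds the HTTP request. The URL, the API key, the column names and the `input1` payload layout are placeholders and assumptions on our part, not values from a real experiment; the API help page of your own web service shows the exact format to use.

```python
import json
import urllib.request

# Hypothetical placeholders: copy the real values from your web service's page.
SCORING_URL = "https://example.invalid/your-workspace/your-service/execute"
API_KEY = "your-api-key"


def build_scoring_request(columns, rows, url=SCORING_URL, api_key=API_KEY):
    """Build (but do not send) a POST request carrying the rows to score.

    The payload layout below is an assumption for illustration only.
    """
    payload = {
        "Inputs": {"input1": {"ColumnNames": columns, "Values": rows}},
        "GlobalParameters": {},
    }
    headers = {
        "Content-Type": "application/json",
        "Authorization": "Bearer " + api_key,
    }
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def score(request):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With real values filled in, `score(build_scoring_request(["age", "income"], [["42", "50000"]]))` would send one row to the model; the column names here are made up.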
And using this in Excel:
As stated earlier, we are currently enrolled in the Data Science Track in the Microsoft Professional Degree Program and make extensive use of the features in ML Studio. So far so good!
HDInsight is the managed Apache Hadoop, Spark, R (Server), HBase and Storm cloud service from Microsoft. The following services are available:
In later articles we will do some tests on this environment, for the moment we focus on the pure Microsoft products such as the one below.
Stream Analytics is a pure Microsoft offering that lets you rapidly develop and deploy low-cost solutions to gain real-time insights from devices, sensors, infrastructure, and applications. It is typically used for Internet of Things (IoT) scenarios, such as real-time remote management and monitoring, or gaining insights from devices like mobile phones and connected cars. The whole idea is to use certain (machine learning) algorithms to predict movements in the data and, for example, carry out preventive maintenance to avoid problems later.
The principle is based on different “inputs” that “feed” your data stream. You can query this data stream using a SQL-like query language and use the results to create real-time dashboards in Power BI, which we will discuss in the next post.
The solution guarantees event delivery, so no “input” that has been received is ever lost. In the picture below you can see a graphical overview of the whole process:
Event Inputs versus Reference Data
There is a distinction between incoming live data that changes very often (several times per second) and data that should be available but doesn’t change that often. For example, in the scope of car data monitoring, the event inputs are all the data coming from the different sensors, while the reference data is the make and model of the car or its owner, something that changes rarely or not at all for a particular experiment.
Temporal functions allow you to perform operations within a certain time interval (or time window). Stream Analytics currently supports the following:
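To make the distinction concrete, here is a minimal Python sketch of enriching fast-changing events with slow-changing reference data. It is not how the service works internally; the field names (`car_id`, `speed_kmh`, `make`, `model`) and the `enrich` helper are made up for the car monitoring example.

```python
# Reference data: changes rarely, keyed by a (hypothetical) car identifier.
REFERENCE = {
    "car-1": {"make": "Volvo", "model": "V60"},
}


def enrich(events, reference):
    """Yield each event merged with the reference record for its car_id.

    Events without a matching reference record are passed through unchanged.
    """
    for event in events:
        ref = reference.get(event["car_id"], {})
        # Event fields win if a key exists on both sides.
        yield {**ref, **event}


if __name__ == "__main__":
    # Event inputs: would arrive several times per second in a real stream.
    sensor_events = [
        {"car_id": "car-1", "speed_kmh": 87},
        {"car_id": "car-9", "speed_kmh": 50},
    ]
    for row in enrich(sensor_events, REFERENCE):
        print(row)
```

Keeping the reference data in a lookup table means it is loaded once and reused for every incoming event, which is the reason to treat the two kinds of data differently.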
- Tumbling Windows: repeating, non-overlapping, fixed interval windows.
- Hopping Windows: fixed-size windows that hop forward in time by a fixed period and can overlap.
- Sliding Windows: fixed-length windows that move continuously along the timeline and produce output only when an event enters or leaves the window (the mean of some sample in a time period, for example).
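The tumbling and hopping windows listed above can be sketched in plain Python. This is only an illustration of the idea, not the service's implementation: timestamps are integer seconds, windows are taken as half-open intervals starting at multiples of the hop, and both conventions are simplifying assumptions of ours.

```python
from collections import defaultdict


def tumbling_windows(events, size):
    """Group (timestamp, value) pairs into non-overlapping windows.

    Each window covers [start, start + size); the dict is keyed by start.
    """
    windows = defaultdict(list)
    for ts, value in events:
        start = (ts // size) * size
        windows[start].append(value)
    return dict(windows)


def hopping_windows(events, size, hop):
    """Group events into fixed-size windows that start every `hop` seconds.

    With hop < size the windows overlap, so one event can land in several.
    """
    windows = defaultdict(list)
    for ts, value in events:
        # Smallest window start (a multiple of hop) that still covers ts.
        start = max(((ts - size) // hop + 1) * hop, 0)
        while start <= ts:
            windows[start].append(value)
            start += hop
    return dict(windows)


if __name__ == "__main__":
    readings = [(1, 10), (4, 20), (7, 30), (12, 40)]
    print(tumbling_windows(readings, size=5))
    print(hopping_windows(readings, size=10, hop=5))
```

A hopping window whose hop equals its size behaves like a tumbling window, which is why the hopping window is often described as the more generic of the two.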
There are certain functions to handle out-of-order events and to manage “late arrivals”.
The pricing is based on volumes and streaming units.
- Volume is the size of data processed by the streaming jobs (in GB). Currently this is around $0.001 per GB (prices may vary).
- Streaming Unit is a blended measure of CPU, memory use and throughput. This is priced in $ per hour (around $0.031 per hour, prices may vary).
Note that the throughput of a single streaming unit is limited to 1 MB per second.
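A quick back-of-the-envelope calculation shows how the two price components add up. The sketch below uses the approximate prices quoted above and assumes that the 1 MB per second limit applies per streaming unit and that 1 GB equals 1000 MB; the function names are ours and the figures are indicative only.

```python
PRICE_PER_GB = 0.001       # USD per GB processed (approximate, from the text)
PRICE_PER_SU_HOUR = 0.031  # USD per streaming unit per hour (approximate)


def estimate_cost(gb_processed, streaming_units, hours):
    """Return the estimated cost in USD: volume part plus streaming unit part."""
    volume_cost = gb_processed * PRICE_PER_GB
    unit_cost = streaming_units * hours * PRICE_PER_SU_HOUR
    return volume_cost + unit_cost


def max_gb_per_day(streaming_units, mb_per_second=1.0):
    """Upper bound on the daily volume, assuming the limit is per streaming unit."""
    seconds_per_day = 86_400
    return streaming_units * mb_per_second * seconds_per_day / 1000


if __name__ == "__main__":
    # One streaming unit running for a 30-day month, processing 100 GB.
    print(round(estimate_cost(100, 1, 30 * 24), 2))
    print(max_gb_per_day(1))
```

For one streaming unit running a full 30-day month and processing 100 GB, this comes to roughly $22.42, almost all of it from the streaming unit hours rather than the data volume.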
One of the nicest parts of this offering is without doubt the Azure ML Studio. In the weeks I spent with it I was impressed by its ease of use and capabilities. I am not convinced that it is a tool you can hand over to business people, but with some training and a few templates they should be able to do simple tasks and modifications. For the more serious data scientist there is the integration with Python or R (or both if you like), which opens up a ton of possibilities.
In this session we focussed on use and capabilities; in one of our next articles we are going to test its performance and speed.