In an earlier post we mentioned the “Cortana Analytics” suite, as it was called at the time. It has since been renamed the Cortana Intelligence Suite.
There is an interesting link on the site: “What’s included”. Microsoft groups five categories of tools and solutions under the Cortana umbrella; we will dive a bit deeper into them in this and future posts. There is a nice graphical overview on the Microsoft site, where the different categories are the boxes in blue:
- Information Management
- Big data stores
- Machine learning and advanced analytics
- Dashboards and visualizations
In a series of posts we will go through all of them and, where possible, do the hands-on tutorials!
Note that the tutorials require an Azure subscription. All the tutorials we tried also worked on an Azure trial subscription. We provide links to the tutorials so you can test them yourself.
Under the information management category you can find three subdivisions:
1. Data Factory
This is basically the tooling for creating, scheduling and monitoring data pipelines. From the page:
- Create, schedule, orchestrate, and manage data pipelines
- Visualize data
- Connect to on-premises and cloud data sources
- Monitor data pipeline health
- Automate cloud resource management
So, what is this all about?
It is a collection of tools to automate, orchestrate and transform data: what is usually called ETL (extract, transform and load).
There is a dedicated service in Azure that provides pipelines in which you can transform your data. You create one or more datasets, which are consumed and produced by activities; a pipeline is a group of activities, and the activities run on so-called linked services.
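To make the relationships between these building blocks concrete, here is a small, purely illustrative Python sketch. The class and field names are our own shorthand, not the actual Data Factory API; it only models how linked services, datasets, activities and pipelines fit together:

```python
from dataclasses import dataclass, field
from typing import List

# A linked service points to a data store or a compute resource
# (e.g. blob storage or an HDInsight cluster). Illustrative names only.
@dataclass
class LinkedService:
    name: str

# A dataset represents data that lives on a linked service.
@dataclass
class Dataset:
    name: str
    linked_service: LinkedService

# An activity consumes input datasets, produces output datasets,
# and runs on a linked service (e.g. a Hive activity on HDInsight).
@dataclass
class Activity:
    name: str
    runs_on: LinkedService
    inputs: List[Dataset] = field(default_factory=list)
    outputs: List[Dataset] = field(default_factory=list)

# A pipeline is simply a group of activities.
@dataclass
class Pipeline:
    name: str
    activities: List[Activity] = field(default_factory=list)

storage = LinkedService("blob-storage")
hdinsight = LinkedService("on-demand-hdinsight")
raw = Dataset("raw-logs", storage)
clean = Dataset("clean-logs", storage)
hive = Activity("run-hive-script", hdinsight, inputs=[raw], outputs=[clean])
pipeline = Pipeline("transform-logs", activities=[hive])
print(pipeline.activities[0].runs_on.name)  # on-demand-hdinsight
```

This mirrors the shape of the tutorial below: one transformation activity, running on an on-demand compute resource, wired into a pipeline.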
To build these different components you can use several methods: PowerShell, Visual Studio or the Data Factory Editor. We did the test with the Data Factory Editor tutorial.
The tutorial is clear and easy to follow; at the end of the exercise we had created a pipeline with a transformation activity (an HDInsight activity) that runs a Hive script on an on-demand HDInsight cluster. Nice!
2. Data Catalog
According to the site, the Data Catalog is a “place” that lets you find the data you need. From the page:
- Spend less time looking for data, and more time getting value from it.
- Register enterprise data assets.
- Discover data assets and unlock their potential.
- Capture tribal knowledge to make data more understandable.
- Bridge the gap between IT and the business, allowing everyone to contribute their insights.
- Let your data live where you want it; connect with the tools you choose
- Control who can discover registered data assets.
- Integrate into existing tools and processes with open REST APIs.
The Data Catalog is a place that lists all the data sources that users can access in your company. First, as an administrator, you register the different sources in the catalog. You can enrich the sources with extra information (metadata) that can be used for searching. Please note that the data itself is NOT copied into the catalog, although you can choose to upload “preview” data.
The purpose is two-fold:
1) It is a catalog you can use to find and browse data sources, check the data model and get a preview of the data (if applicable). Users can also annotate the content in their own words.
2) You can use the catalog to consume data sources; it then acts as a kind of “proxy” to the actual data source, so users do not need to know all the technical details of the source.
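To illustrate the two-fold purpose above, here is a toy, in-memory catalog in Python. It is purely conceptual, with made-up names (this is not the Azure Data Catalog API): it registers sources with metadata, searches on that metadata, and hands out connection details without ever copying the data itself:

```python
# Toy data catalog: stores metadata and (optional) previews, never the data.
# All names are illustrative; this is not the Azure Data Catalog API.
class Catalog:
    def __init__(self):
        self.assets = {}

    def register(self, name, connection, tags=None, preview=None):
        # Only metadata and an optional preview are stored.
        self.assets[name] = {
            "connection": connection,
            "tags": set(tags or []),
            "preview": preview,
        }

    def search(self, tag):
        # Purpose 1: find and browse data sources via their metadata.
        return [n for n, a in self.assets.items() if tag in a["tags"]]

    def connection_for(self, name):
        # Purpose 2: act as a "proxy", so users need only the asset name.
        return self.assets[name]["connection"]

catalog = Catalog()
catalog.register("sales-db", "sqlserver://corp/sales",
                 tags=["sales", "finance"], preview=[("2016-01", 1200)])
catalog.register("web-logs", "wasb://logs@store", tags=["web"])
print(catalog.search("sales"))             # ['sales-db']
print(catalog.connection_for("web-logs"))  # wasb://logs@store
```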
We did the test, based on the exercises provided here.
Once again, the tutorial is clear and easy to follow. We had some hiccups with certain parts, but those were beginner's errors.
⇒ Only one Data Catalog can be created per subscription (at the time of writing), but why would you need more?
⇒ You need Azure Active Directory set up; there is no access with other accounts.
3. Event Hubs
The purpose of Event Hubs is to log events and to connect devices. It is what is often called a “publish-subscribe” model, where you can log millions of events per second and stream them into different applications. There is a very good description of the concepts in this article.
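The publish-subscribe idea itself is simple; as a rough sketch (not the Event Hubs API, just the pattern), a hub fans each published event out to every subscriber:

```python
# Minimal publish-subscribe sketch; illustrative only, not Azure Event Hubs.
class Hub:
    def __init__(self):
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def publish(self, event):
        # Every subscriber receives every published event ("fan-out").
        for callback in self.subscribers:
            callback(event)

hub = Hub()
received = []
hub.subscribe(received.append)                         # e.g. a stream processor
hub.subscribe(lambda e: received.append(("copy", e)))  # e.g. an archiver
hub.publish({"device": "sensor-1", "temp": 21.5})
print(len(received))  # 2
```

The real service adds the parts that matter at scale (partitioning, checkpointing, retention), but the producer/consumer decoupling shown here is the core idea.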
And there is even a tutorial; it can be found here. We haven't tested it yet but will come back to it later in another IoT context.
To see how much incoming (ingress) and outgoing (egress) traffic the infrastructure can take, and how much data is stored and for how long, you can refer to the following FAQ.
We will do some practical testing in the future, where we will feed the infrastructure with actual data coming from IoT devices. Note that this infrastructure relies completely on Azure. There is also the possibility of a hybrid scenario, where the service layer runs on premises; this might be a good fit for high-volume, high-velocity data. You can find more info here.
The content on the site is evolving fast (see the article dates on some of the samples and tutorials). It is worth mentioning that in the few trials we did, we got very good support from the Azure team.
It seems that Microsoft is working very hard to get their Information Management concepts and tools in place. Something to follow in the coming months…
In our next posts we will take a closer look at what “Big data stores” and “Machine learning and advanced analytics” are all about.