Big data stores
In an earlier post we mentioned the Cortana Analytics Suite, as it was called at the time. It has since been renamed the Cortana Intelligence Suite.
Microsoft puts five categories of tools and solutions under the Cortana umbrella. In the first post we discussed the “Information Management” part; in this post we focus on “Big data stores”.
Two products fall under Big data stores: Data Lake and SQL Data Warehouse.
Data Lake
According to the information on the site, the Data Lake is all about:
- Store and analyze data of any kind and size.
- Develop faster, debug and optimize smarter.
- Interactively explore patterns in your data.
- No learning curve—use U-SQL, Apache Spark, Hive, HBase, and Storm.
- Managed and supported with an enterprise-grade SLA.
- Dynamically scales to match your business priorities.
- Enterprise-grade security with Azure Active Directory.
- Built on YARN (Apache Hadoop resource mgmt), designed for the cloud.
You can split Azure Data Lake into two parts: the analytics part and the actual store, as shown below:
If you look closely, you can see that this is the “Apache Hadoop” stack that Microsoft offers on Azure, with an extra twist such as U-SQL…
U-SQL?
U-SQL is a language for querying large data sets. The data does not have to live in a traditional relational database (RDBMS); it can also come from unstructured data sources. The language is a kind of mix between T-SQL and C#.
It executes queries in batches, so for small record sets it won’t be the quickest system. U-SQL excels at querying large data sets, where it can process billions of rows in a short timeframe. We won’t go into detail here; there is a good introduction on the Azure site. Microsoft is positioning this language as a low-barrier entry into large data set manipulation languages such as Hive:
Taking the issues of both SQL-based and procedural languages into account, we designed U-SQL from the ground-up as an evolution of the declarative SQL language with native extensibility through user code written in C#. This unifies both paradigms, unifies structured, unstructured, and remote data processing, unifies the declarative and custom imperative coding experience, and unifies the experience around extending your language capabilities.
We have to see where this is going, but the initial idea is certainly not bad.
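To make the idea in the quote a little more tangible, here is a rough analogy rather than real U-SQL: a small Python sketch that uses SQLite to run a declarative SQL query which calls a function written in the host language, much as U-SQL lets a query call C# code. Everything in it (the searchlog table, its columns, the normalize_region function and the sample rows) is made up for illustration, and SQLite only stands in for the Data Lake engine to show the shape of the approach.

```python
import sqlite3


def normalize_region(raw):
    """Hypothetical user code: tidy up a free-text region code."""
    return raw.strip().lower()


def build_demo_connection():
    """Create an in-memory database with a few made-up search log rows."""
    conn = sqlite3.connect(":memory:")
    # Register the Python function so SQL statements can call it by name.
    conn.create_function("normalize_region", 1, normalize_region)
    conn.execute("CREATE TABLE searchlog (raw_region TEXT, duration INTEGER)")
    conn.executemany(
        "INSERT INTO searchlog VALUES (?, ?)",
        [(" EN-GB ", 30), ("en-gb", 12), ("en-US", 7)],
    )
    return conn


def total_duration_per_region(conn):
    """Declarative query whose grouping key is computed by the user function."""
    query = """
        SELECT normalize_region(raw_region) AS clean_region,
               SUM(duration) AS total
        FROM searchlog
        GROUP BY clean_region
        ORDER BY clean_region
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    print(total_duration_per_region(build_demo_connection()))
```

The query stays declarative (select, group, order) while the messy clean-up logic lives in ordinary code. In U-SQL that custom logic would be written in C# instead of Python.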
SQL Data Warehouse
This offering is centered around providing performance and scalability on demand. From their site:
- Petabyte scale with massively parallel processing.
- Independent scaling of compute and storage, in seconds.
- Transact-SQL queries across relational and non-relational data.
- Full enterprise-class SQL Server experience.
- Works seamlessly with Power BI, Machine Learning, HDInsight, and Data Factory.
For more information we refer to their site, but we would like to go a bit deeper into one particular item: PolyBase.
PolyBase…?
Using PolyBase, leverage Transact-SQL to query seamlessly across both relational data in a relational database and non-relational data in common Hadoop formats. A single Transact-SQL command combines non-relational data from Azure blob storage with relational tables in the data warehouse.
What is PolyBase, then? It is a mechanism for using traditional T-SQL to work with both data from classic SQL Server databases and data in non-relational stores such as Hadoop or Azure Blob storage.
As you can see in the picture above, it is a technology that is “installed” in SQL Server. The user can transparently query relational and non-relational data sources. PolyBase is built into SQL Server 2016 and does not require special software or tools, nor does it require an understanding of Map/Reduce, Hive or other Hadoop-related concepts.
People following the SQL Server world know this “extension” from SQL Server Parallel Data Warehouse (PDW).
PolyBase versus U-SQL
The difference between PolyBase and U-SQL is that PolyBase uses the traditional T-SQL syntax (with some minor schema changes, it seems) to work with data from heterogeneous sources, while still running in Microsoft SQL Server.
U-SQL is a SQL language extended with features from C#, making it possible to query non-relational data sources. You submit U-SQL jobs from a client such as Visual Studio, and they do not require a SQL Server.
Conclusion
The big data stores part can be split into the Apache stack on Azure on the one hand and the SQL Server capabilities on the other. There are some interesting “newcomers”, such as U-SQL and PolyBase, that bridge the two worlds.
In our next posts we will take a closer look at what “Machine Learning and advanced analytics” is all about, and we will hopefully get some hands-on experience.