What is Data Science? Is this something new?

March 2016

No. The term has been used more and more in recent years, but it has existed for quite a while. Some see it as just another buzzword and state that it is basically statistics; others say it is nothing more than the business analytics we have known for many years.

According to Wikipedia it is a discipline to extract knowledge from structured and unstructured data. The important part is “science”, where the focus is on a structured, repeatable and verifiable process.

Data science is already visible in our daily life and will become more and more apparent:

  • the smart devices in homes that spit out thousands of data points per day,
  • the self-driving cars that receive millions of pieces of information per second to base their decisions on,
  • fraud detection systems,
  • personalized ads on websites,
  • … and so on.

Data science is becoming more and more accessible to the masses. Very powerful hardware is now much cheaper than it was a decade ago. Companies see advantages in the results, so they invest in it.

It is time to jump on the data train. Start capturing your data now!

Isn’t this all about statistics?

Yes and no. Statistics (or statistical methods and tools) are heavily used in Data Science, but the focus is more on asking the right question and predicting the correct answer. No question, no data science. There is a great post on this from Hadley Wickham, which can be found here.

Is this related to big data?

In Data Science size does not play a role. Of course, the size of the data will have an impact on the methods and tools used, but basically we are trying to understand, model and predict the data. You can have data science projects based on small data sets or on terabytes of data and more.

So what about Machine Learning and Predictive Analysis?

Machine Learning is a collection of methods and tools to “teach” machines (computers) and make them understand what the data is all about. Usually, algorithms are trained on existing datasets and then used to make predictions about new data. This is why machine learning and predictive analysis are often mentioned in the same context. There is a very good article on Wikipedia on Machine Learning.
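To make the idea of “teaching” a machine a bit more concrete, here is a minimal sketch in Python. It uses a nearest-neighbour rule, chosen only because it is one of the simplest learning methods; the customer figures, the labels and the function name `predict` are made up for illustration.

```python
import math

# Hypothetical training data: (visits per month, average basket in EUR) -> label.
training = [
    ((2.0, 15.0), "occasional"),
    ((3.0, 20.0), "occasional"),
    ((9.0, 60.0), "loyal"),
    ((11.0, 75.0), "loyal"),
]


def predict(point, examples):
    """Return the label of the training example closest to `point`."""
    nearest = min(examples, key=lambda example: math.dist(point, example[0]))
    return nearest[1]


# Two new, unseen customers are labelled by looking at the most similar known one.
print(predict((10.0, 70.0), training))
print(predict((1.0, 10.0), training))
```

The “learning” here is nothing more than remembering examples and comparing new data against them. Real projects typically use more robust algorithms and far more data.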

Do we need this in our company?

Yes. Even small companies with reasonably small data volumes can already do some interesting processing in the area of prediction. This can be related to discovering certain patterns in sales, manufacturing, pricing of materials and goods, site visits and so on.

It is slowly becoming a standard in more and more companies.

Start collecting data now, even if you don’t have all the tools and techniques in place for predictions or machine learning. Later on this data will certainly be of use.

Do we still need Business Intelligence?

Of course you do. Business Intelligence focuses on consolidating and visualizing known information. Using BI you delve through tons of information to get what you are looking for: sales figures, IoT samples, number of visits to your website and so on.

With Data Science you try to discover the patterns you do not see using BI and provide forecasting capabilities. Using data science techniques you can correlate your data with other information and see what impact that information has. For example, why is there a downward trend in sales during certain periods? When you include several other parameters and use data science techniques, you can discover, explain and predict these patterns.
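As a small illustration of correlating sales with another parameter, the sketch below computes the Pearson correlation coefficient in plain Python. The monthly temperature and sales figures are invented for the example, and `pearson` is just a helper name chosen here.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


# Hypothetical monthly figures: average temperature (°C) and units sold.
temperature = [2, 5, 10, 15, 20, 25]
sales = [120, 115, 100, 90, 70, 60]

r = pearson(temperature, sales)
print(round(r, 2))
```

A value close to -1 or +1 suggests a strong linear relationship, while a value near 0 suggests none. Keep in mind that a correlation only hints at a possible explanation; it does not prove that one parameter causes the other.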

What are the typical steps in a Data Science project?

First of all you need a question to answer. Without this there is no need to do data science. Once there is a question to be answered, data scientists take certain steps to come to results. These are:

  1. Getting and cleaning the data
  2. Exploring the data
  3. Building the (prediction) models
  4. Creating and communicating the results
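The four steps above can be sketched in a few lines of Python. This is only a toy example under stated assumptions: the question is “how do sales depend on advertising spend?”, the records are invented, the model is a simple least-squares line, and all function names are hypothetical.

```python
# Hypothetical raw records, as they might arrive from an export (all values are text).
raw = [
    {"ad_spend": "10", "sales": "25"},
    {"ad_spend": "20", "sales": "45"},
    {"ad_spend": "", "sales": "50"},      # missing value
    {"ad_spend": "30", "sales": "65"},
    {"ad_spend": "40", "sales": "n/a"},   # invalid value
    {"ad_spend": "50", "sales": "105"},
]


def clean(records):
    """Step 1: keep only the records whose fields can be read as numbers."""
    rows = []
    for record in records:
        try:
            rows.append((float(record["ad_spend"]), float(record["sales"])))
        except ValueError:
            continue  # skip missing or invalid values
    return rows


def explore(rows):
    """Step 2: compute a few summary figures to get a feel for the data."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    return {"count": len(rows), "mean_ad_spend": sum(xs) / len(xs), "mean_sales": sum(ys) / len(ys)}


def fit_line(rows):
    """Step 3: fit sales = slope * ad_spend + intercept by least squares."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in rows) / sum((x - mean_x) ** 2 for x, _ in rows)
    return slope, mean_y - slope * mean_x


def report(summary, slope, intercept, planned_spend):
    """Step 4: communicate the result in plain language."""
    forecast = slope * planned_spend + intercept
    return (f"Based on {summary['count']} usable records, a spend of "
            f"{planned_spend:.0f} is expected to give sales of about {forecast:.0f}.")


rows = clean(raw)
summary = explore(rows)
slope, intercept = fit_line(rows)
print(report(summary, slope, intercept, planned_spend=60))
```

In a real project each step is far more involved, and cleaning and exploring usually take most of the time, but the overall flow stays the same.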

What technology and tools are used?

There are quite a lot of tools on the market used in data science. Which ones you need depends on the kind of work you are doing within the project. For example, for storing and retrieving data the typical relational databases (SQL) are used next to distributed systems such as Hadoop and its family. R and Python are programming languages that have existed for many years now and have built up a huge fan base. They have a lot of libraries for all kinds of data science jobs you want to do.

On the front end things are a bit more complex. A lot of people are still using Excel for graphing, or specific libraries in R or Python. More and more tools are written in JavaScript and can provide wonderful graphics (D3 for example). This is a market that is moving rapidly. Pick the tools you feel comfortable with.

What is Microsoft doing in all this?

Microsoft has launched its Cortana Analytics Suite as part of an overall program offering all kinds of tools and technologies to enable businesses to do data science. For more info please refer to the blog post on Cortana Analytics Suite.

What do data scientists need?

A question: is there something you want to get from your data, something you don’t immediately see using your day-to-day BI tools? Are you trying to discover certain patterns or anomalies, and would you like to predict them?

The data: data scientists need data. In any format. It is better to provide more data if it is available. Part of the data science project is to clean and explore the data.

What can they give back?

Usually the deliverable of a data science project is a report. Most reports include graphs with explanations. Sometimes the data scientists can also provide you with polished data that you can work on further yourself. The reports should be delivered in a reproducible and open format, ready to integrate into your environment.

Can we do it in our company?

Yes, you can. But you will need the people and the tools (see above).

Often data science companies offer their services through a cloud service. In such a case you upload your data to an environment that is used by the data scientist. They should of course provide the necessary privacy and non-disclosure agreements. Make sure to check this. Some companies will even provide you with the algorithms so you can reproduce the results. In such a case you do need the necessary people and tools to do so.

Can I become a data scientist?

Of course. In one of the coming articles we will explore some possibilities.

Can we get some help on these topics?

Please fill out the form here and we will get back to you, or check out our offering page.