Introduction to Machine Learning, Data Science, AI, Deep Learning, and Statistics


In this blog, I cover the different kinds of data scientists, and how data science compares to and overlaps with related fields such as machine learning, deep learning, AI, statistics, IoT, operations research, and applied mathematics. As data science is a broad discipline, I start by describing the various types of data scientists that one may encounter in any business setting: you might even discover that you are a data scientist yourself, without knowing it. As in any scientific discipline, data scientists may borrow techniques from related disciplines, though we have built our own arsenal, especially skills and algorithms to manage very large unstructured data sets in automated ways, even without human interaction, to perform transactions in real time or to make predictions.



1. Different Types of Data Scientists

To get started and gain some historical perspective, you can read my article about 9 types of data scientists, published in 2014, or my article where I compare data science with 16 analytic disciplines, also published in 2014.

The following articles, published during the same period, are still useful:
  • Data Scientist vs. Data Architect
  • Data Scientist vs. Data Engineer
  • Data Scientist vs. Statistician
  • Data Scientist versus Business Analyst


More recently (August 2016), Ajit Jaokar discussed Type A (Analytics) versus Type B (Builder) data scientists:

Type A Data Scientist: The A is for analysis. The Type A data scientist can code well enough to work with data but is not necessarily an expert programmer. The Type A data scientist may be an expert in experimental design, forecasting, modeling, statistical inference, or other things typically taught in statistics departments. Generally speaking though, the work product of a data scientist is not "p-values and confidence intervals" as academic statistics sometimes seems to suggest (and as it sometimes is for traditional statisticians working in the pharmaceutical industry, for example). At Google, Type A data scientists are known variously as Statistician, Quantitative Analyst, Decision Support Engineering Analyst, or Data Scientist, and probably a few more.

Type B Data Scientist: The B is for building. Type B data scientists share some statistical background with Type A, but they are also very strong coders and may be trained software engineers. The Type B data scientist is mainly interested in using data "in production." They build models that interact with users, often serving recommendations (products, people you may know, ads, movies, search results).

I also wrote about the ABCD's of business process optimization, where D stands for data science, C for computer science, B for business science, and A for analytics science. Data science may or may not involve coding or mathematical practice, as you can read in my article on low-level versus high-level data science. In a startup, data scientists generally wear several hats, such as executive, data miner, data engineer or architect, researcher, statistician, modeler (as in predictive modeling) or developer.

While the data scientist is generally portrayed as a coder experienced in R, Python, SQL, Hadoop and statistics, this is just the tip of the iceberg, made popular by data camps focusing on teaching some elements of data science. But just like a lab technician can call herself a physicist, the real physicist is much more than that, and her domains of expertise are varied: astronomy, mathematical physics, nuclear physics (which is borderline chemistry), mechanics, electrical engineering, signal processing (also a sub-field of data science) and many more. The same can be said about data scientists: fields are as varied as bioinformatics, information technology, simulations and quality control, computational finance, epidemiology, industrial engineering, and even number theory.

In my case, over the last 10 years, I have specialized in machine-to-machine and device-to-device communications, developing systems to automatically process large data sets and to perform automated transactions: for instance, purchasing Internet traffic or automatically generating content. It implies developing algorithms that work with unstructured data, and it is at the intersection of AI (artificial intelligence), IoT (Internet of things), and data science. This is referred to as deep data science. It is relatively math-free, and it involves relatively little coding (mostly APIs), but it is quite data-intensive (including building data systems) and based on brand new statistical technology designed specifically for this context.

Before that, I worked on credit card fraud detection in real time. Earlier in my career (circa 1990) I worked on image remote sensing technology, among other things to identify patterns (or shapes or features, for instance) in satellite images and to perform image segmentation: at that time my research was labeled as computational statistics, but the people doing the exact same thing in the computer science department next door in my home university called their research artificial intelligence. Today, it would be called data science or artificial intelligence, the sub-domains being signal processing, computer vision or IoT.

Also, data scientists can be found anywhere in the lifecycle of data science projects: at the data-gathering stage, or the data exploration stage, all the way up to statistical modeling and maintaining existing systems.

2. Machine Learning versus Deep Learning

Before digging deeper into the link between data science and machine learning, let's briefly discuss machine learning and deep learning. Machine learning is a set of algorithms that train on a data set to make predictions or take actions in order to optimize some systems. For instance, supervised classification algorithms are used to classify potential clients into good or bad prospects, for loan purposes, based on historical data. The techniques involved, for a given task (e.g. supervised clustering), are varied: naive Bayes, SVM, neural nets, ensembles, association rules, decision trees, logistic regression, or a combination of many. For a detailed list of algorithms, click here. For a list of machine learning problems, click here.
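To make the loan example concrete, here is a minimal sketch of supervised classification using the simplest possible decision tree (a single split, sometimes called a decision stump). The data, the two features and the function names are all made up for illustration; a real system would use far more data and a library implementation.

```python
# Hypothetical historical loan data (all values invented for this sketch):
# (annual income in $1000s, debt-to-income ratio, label)
# label 1 = good prospect (repaid), 0 = bad prospect (defaulted).
HISTORY = [
    (20, 0.90, 0), (25, 0.80, 0), (30, 0.70, 0), (35, 0.85, 0),
    (60, 0.30, 1), (70, 0.20, 1), (80, 0.40, 1), (65, 0.10, 1),
]

def fit_stump(rows):
    """Train a depth-one decision tree: pick the single feature, threshold
    and direction that misclassify the fewest training rows."""
    best = None  # (errors, feature_index, threshold, label_if_above)
    n_features = len(rows[0]) - 1
    for f in range(n_features):
        values = sorted({row[f] for row in rows})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            for label_if_above in (0, 1):
                errors = sum(
                    (label_if_above if row[f] > threshold else 1 - label_if_above) != row[-1]
                    for row in rows
                )
                if best is None or errors < best[0]:
                    best = (errors, f, threshold, label_if_above)
    return best[1:]

def predict(stump, features):
    """Classify one client using the learned split."""
    f, threshold, label_if_above = stump
    return label_if_above if features[f] > threshold else 1 - label_if_above

stump = fit_stump(HISTORY)

# Two new (hypothetical) clients that were not in the training set.
new_clients = [(28, 0.75), (72, 0.25)]
predictions = [predict(stump, c) for c in new_clients]
print(stump, predictions)
```

The historical data plays the role of the training set: the split is not typed in by hand but chosen by the algorithm, then applied to clients it has never seen.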

All of this is a subset of data science. When these algorithms are automated, as in automated piloting or driver-less cars, it is called AI, and more specifically, deep learning. Click here for another article comparing machine learning with deep learning. If the data collected comes from sensors and if it is transmitted via the Internet, then it is machine learning or data science or deep learning applied to IoT.

Some people have a different definition of deep learning. They consider deep learning to be neural networks (a machine learning technique) with many layers. The question was asked on Quora recently, and below is a more detailed explanation (source: Quora):

AI (Artificial intelligence) is a subfield of computer science that was created in the 1960s, and it was (is) concerned with solving tasks that are easy for humans, but hard for computers. In particular, a so-called Strong AI would be a system that can do anything a human can (perhaps without purely physical things). This is fairly generic and includes all kinds of tasks, such as planning, moving around in the world, recognizing objects and sounds, speaking, translating, performing social or business transactions, creative work (making art or poetry), etc.

NLP (Natural language processing) is the part of AI that has to do with language (usually written).

Machine learning is concerned with one aspect of this: given some AI problem that can be described in discrete terms (e.g. out of a particular set of actions, which one is the right one), and given a lot of information about the world, figure out what is the “correct” action, without having the programmer program it in. Typically some outside process is needed to judge whether the action was correct or not. In mathematical terms, it’s a function: you feed in some input, and you want it to produce the right output, so the whole problem is simply to build a model of this mathematical function in some automatic way. To draw a distinction with AI, if I can write a very clever program that has human-like behavior, it can be AI, but unless its parameters are automatically learned from data, it’s not machine learning.
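The difference between a function that is programmed in and one whose parameters are learned from data can be sketched as follows. The task (temperature conversion) and the function names are my own illustration, not part of the Quora answer; the point is only where the two numbers in the formula come from.

```python
def hand_coded(celsius):
    """The parameters 1.8 and 32 were typed in by the programmer."""
    return 1.8 * celsius + 32.0

def fit_line(examples):
    """Learn slope and intercept from (input, output) pairs by least squares."""
    n = len(examples)
    mean_x = sum(x for x, _ in examples) / n
    mean_y = sum(y for _, y in examples) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in examples)
    var = sum((x - mean_x) ** 2 for x, _ in examples)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Training examples: observed (Celsius, Fahrenheit) pairs.
examples = [(0, 32), (10, 50), (20, 68), (30, 86), (40, 104)]
slope, intercept = fit_line(examples)

def learned(celsius):
    """Same functional form, but the parameters came from the data."""
    return slope * celsius + intercept

print(hand_coded(100), learned(100))
```

Both functions map an input to an output, but only the second one qualifies as machine learning in the sense described above: feed it different examples and it becomes a different function, with no change to the code.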

Deep learning is one kind of machine learning that’s very popular now. It involves a particular kind of mathematical model that can be thought of as a composition of simple blocks (function composition) of a certain type, and where some of these blocks can be adjusted to better predict the final outcome.
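The idea of composing simple blocks, some of them adjustable, can be sketched in a few lines. This is a toy model with two adjustable linear blocks around one fixed block (tanh), fitted to made-up data; the gradient is estimated numerically for simplicity, whereas real deep learning libraries use backpropagation and far larger models.

```python
import math

def linear(w, b):
    """Adjustable block: x -> w * x + b."""
    return lambda x: w * x + b

def compose(*blocks):
    """Chain blocks left to right: compose(f, g)(x) == g(f(x))."""
    def run(x):
        for block in blocks:
            x = block(x)
        return x
    return run

def model(params):
    """Adjustable block, fixed nonlinear block, adjustable block."""
    w1, b1, w2, b2 = params
    return compose(linear(w1, b1), math.tanh, linear(w2, b2))

def loss(params, data):
    """Mean squared error between model output and target."""
    f = model(params)
    return sum((f(x) - y) ** 2 for x, y in data) / len(data)

def gradient(params, data, eps=1e-6):
    """Finite-difference estimate of d(loss)/d(param), one parameter at a time."""
    grads = []
    for i in range(len(params)):
        up = list(params)
        up[i] += eps
        down = list(params)
        down[i] -= eps
        grads.append((loss(up, data) - loss(down, data)) / (2 * eps))
    return grads

# Made-up training pairs (x, y).
data = [(-2.0, -1.0), (-1.0, -0.8), (0.0, 0.0), (1.0, 0.8), (2.0, 1.0)]
params = [0.5, 0.0, 0.5, 0.0]  # arbitrary starting values

loss_before = loss(params, data)
for _ in range(500):
    grads = gradient(params, data)
    # Adjust each block a little in the direction that reduces the error.
    params = [p - 0.1 * g for p, g in zip(params, grads)]
loss_after = loss(params, data)
print(loss_before, loss_after)
```

A deep network follows the same pattern with many more blocks stacked in sequence, each with many parameters.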

What is the difference between machine learning and statistics?

This article tries to answer the question. The author writes that statistics is machine learning with confidence intervals for the quantities being predicted or estimated. I tend to disagree, as I have built engineer-friendly confidence intervals that don't require any mathematical or statistical knowledge. 

3. Data Science versus Machine Learning

Machine learning and statistics are part of data science. The word learning in machine learning means that the algorithms depend on some data, used as a training set, to fine-tune some model or algorithm parameters. This encompasses many techniques such as regression, naive Bayes or supervised clustering. But not all techniques fit in this category. For instance, unsupervised clustering - a statistical and data science technique - aims at detecting clusters and cluster structures without any a priori knowledge or training set to help the classification algorithm. A human being is needed to label the clusters found. Some techniques are hybrid, such as semi-supervised classification. Some pattern detection or density estimation techniques fit in this category.
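To illustrate the unsupervised case, here is a minimal sketch of k-means clustering (one common unsupervised technique, chosen here as an example). The points, the starting centroids and the cluster names are all hypothetical. Note that no labels are given to the algorithm; a human names the clusters afterwards.

```python
def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm: alternate between assigning each point to its
    nearest centroid and moving each centroid to the mean of its points.
    Only distances between points are used; there are no labels."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centroids[k])),
            )
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster happens to be empty.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Made-up, unlabeled observations with two features each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

# Assumption: start from two of the observed points.
centroids, clusters = kmeans(points, [points[0], points[3]])

# The algorithm only returns groups; a human being attaches meaning to them.
human_labels = {0: "low-value segment", 1: "high-value segment"}
for index, cluster in enumerate(clusters):
    print(human_labels[index], cluster)
```

Contrast this with the supervised setting, where the training set already carries the labels and the algorithm tunes its parameters to reproduce them.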

Data science is much more than machine learning though. Data, in data science, may or may not come from a machine or mechanical process (survey data could be manually collected, clinical trials involve a specific type of small data) and it might have nothing to do with learning as I have just discussed. But the main difference is the fact that data science covers the whole spectrum of data processing, not just the algorithmic or statistical aspects. In particular, data science also covers.

Of course, in many organizations, data scientists focus on only one part of this process.

