Authored by Gero Presser, co-founder and managing partner of Quinscape GmbH in Dortmund (@gero_presser)
Machine learning is in the ascendancy. Particularly when it comes to pattern recognition, it is the method of choice, with tangible applications in fraud detection, image recognition, predictive maintenance, and train delay prediction. In day-to-day machine learning (ML) work, and in the quest to deploy the knowledge gained, we typically encounter three main problems (though by no means the only ones):
Data Quality – Data from multiple sources across multiple time frames can be difficult to collate into clean, coherent data sets that yield the maximum benefit from machine learning. Typical issues include missing data, inconsistent data values, autocorrelation and so forth (a short cleaning sketch follows this list).
Business Relevance – While the technology underpinning the machine learning revolution has been progressing more rapidly than ever, much of its application today occurs without much thought given to business value.
Operationalizing Models – Once models have gone through the build and tuning cycle, it is critical to deploy the results of the machine learning process into the wider business. This is a difficult bridge to cross, as predictive modelers are typically not IT solution experts, and vice versa.
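To make the data quality point concrete, here is a minimal pandas sketch of two typical cleaning steps: surfacing missing values and normalizing inconsistently encoded categories. The data frame, column names, and values are invented purely for illustration.

```python
import pandas as pd

# Hypothetical sensor readings merged from two sources; the columns
# and values are made up for illustration.
df = pd.DataFrame({
    "machine_id": [1, 1, 2, 2],
    "temperature": [71.2, None, 69.8, 70.1],  # missing value
    "status": ["OK", "ok", "OK", "FAILED"],   # inconsistent encoding
})

# Surface missing values per column before modeling.
print(df.isna().sum())

# Normalize inconsistently encoded categorical values.
df["status"] = df["status"].str.upper()

# One simple imputation strategy: fill missing temperatures with the median.
df["temperature"] = df["temperature"].fillna(df["temperature"].median())
```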
There is also a whole toolbox of algorithms behind machine learning, each of which can be tuned for greater accuracy using so-called hyperparameters. With the popular k-nearest neighbors algorithm, for example, the hyperparameter k specifies the number of neighbors we want to take into account. In a neural network, the hyperparameters extend to the entire architecture of the network.
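As a minimal illustration of what a hyperparameter does in practice, the following scikit-learn sketch varies k for a k-nearest neighbors classifier; the Iris data set and the chosen values of k are our assumptions for illustration, not examples from the tools discussed here.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# The hyperparameter k (n_neighbors) controls how many neighbors vote
# on each prediction; different values can change accuracy noticeably.
for k in (1, 5, 15):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(f"k={k}: test accuracy {model.score(X_test, y_test):.3f}")
```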
A key task for data scientists today is finding the right algorithm for a given problem and "setting" it correctly. In reality, however, the range of tasks is much larger. A data scientist has to understand the business perspective of a problem, assess the data situation, prepare the data appropriately, and arrive at a model that lends itself to evaluation. This is typically a cyclical process that follows the cross-industry standard process for data mining (CRISP-DM) [1].
Correspondingly, projects in the field of machine learning are complex and demand the time of multiple people qualified in a range of fields (business, IT, data science). Furthermore, it is often unclear at the outset what the outcome will be; in this sense, such projects are risky.
The Relevance of AutoML
To this day, data science projects cannot be fully automated. Certain steps of a project, however, can be: this is the idea behind automated machine learning (AutoML). AutoML can, for example, assist in the choice of algorithm. A data scientist usually compares the results of several algorithms on the problem and selects one after weighing a range of factors (e.g. quality, complexity/duration, robustness). Another aspect that can be automated in certain cases is the setting of hyperparameters: many algorithms can be adjusted by means of parameters, and their quality optimized with respect to the specific problem.
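The kind of systematic hyperparameter tuning described above can be sketched with scikit-learn's grid search, which tries each candidate value and scores it with cross-validation; the data set and the small grid over k are assumed here for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Grid search systematically tries each candidate value of k and scores
# it with 5-fold cross-validation -- exactly the kind of repetitive
# tuning loop that AutoML tools take off the data scientist's hands.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": list(range(1, 21))},
    cv=5,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```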
AutoML is a resource that can accelerate data science projects by automating parts or individual steps, leading to an increase in productivity. It is extremely useful, for instance, in the evaluation of algorithms. Because of this, many libraries and tools have adopted AutoML as a supplementary function. Notable examples include auto-sklearn (in the Python community) and DataRobot, which specializes in AutoML. RapidMiner, for example, provides assistants that compare different algorithms and very quickly find the best one for a specific problem [2].
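To give a rough, code-level flavor of what such an algorithm comparison involves, the sketch below scores a few candidate scikit-learn classifiers under the same cross-validation protocol; the candidate models and the Iris data set are illustrative assumptions, not taken from the tools named above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Score several candidate algorithms with the same cross-validation split.
# A data scientist would weigh these scores against robustness, runtime,
# and interpretability before choosing one.
candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(),
    "k-nearest neighbors": KNeighborsClassifier(),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```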
Nevertheless, AutoML should not be understood as a one-size-fits-all solution capable of fully automating data science projects and dispensing with the need for data scientists. In this sense, it is, unfortunately, not the Holy Grail.
As in other specialist fields, automation is useful first and foremost for tedious technical tasks, where highly skilled professionals would otherwise spend most of their time systematically trying out certain parameter sets and then comparing the results – a job that really is better left to machines.
What remains is a wealth of challenges that still have to be addressed by humans. This begins with understanding the actual problem itself and covers diverse, mostly very time-consuming, tasks ranging from data engineering to deployment. AutoML is a useful tool, but it’s not the Holy Grail yet.
[1] Kenneth Jensen, own work based on: ftp://public.dhe.ibm.com/software/analytics/spss/documentation/modeler/18.0/en/ModelerCRISPDM.pdf (Figure 1), CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=24930610
[2] http://www.rapidminer.com/
About the author Dr. Gero Presser
Dr. Gero Presser is a co-founder and managing partner of Quinscape GmbH in Dortmund. Quinscape has positioned itself on the German market as a leading system integrator for the Talend, Jaspersoft/Spotfire, Kony and Intrexx platforms and, with its staff of 100, serves renowned customers including SMEs, large corporations and the public sector.
Gero Presser did his doctorate in decision-making theory in the field of artificial intelligence. At Quinscape, he is responsible for building up the Business Intelligence business field, with a focus on analytics and integration.