AGI solutions are being continuously investigated. However, neural networks, currently the most promising mainstream technology, have contributed to some extraordinary results but still fall short of achieving AGI.
This criticism is not new. Most recently, Gary Marcus, in “Deep Learning: A Critical Appraisal”, arXiv:1801.00631v1, has outlined many issues with current deep learning architectures, in particular their inability to ‘understand’ the information they manipulate and the fact that they mostly work only in a ‘stable’ world. As Marcus states in his article: ‘The logic of deep learning is such that it is likely to work best in highly stable worlds, like the board game Go, which has unvarying rules, and less well in systems such as politics and economics that are constantly changing. To the extent that deep learning is applied in tasks such as stock prediction, there is a good chance that it will eventually face the fate of Google Flu Trends, which initially did a great job of predicting epidemological [sic] data on search trends, only to complete [sic] miss things like the peak of the 2013 flu season (Lazer, Kennedy, King & Vespignani, 2014)’. Even one of the so-called ‘fathers’ of deep learning, Geoffrey Hinton, has recently voiced his concern that deep learning needs to start over.
There is no doubt about the recent successes of deep learning. The most popular and successful machine learning applications used today, such as those for image and speech recognition, use a particular machine learning approach called neural nets. The field began in 1943 with the seminal paper by Walter Pitts and Warren McCulloch, “A Logical Calculus of the Ideas Immanent in Nervous Activity”, published in Bulletin of Mathematical Biophysics 5:115-133, whose neuron model laid the foundation for the explosion of deep learning. In 1957, Frank Rosenblatt improved on the Pitts-McCulloch neuron by adding the concept of a “weighted sum” of the inputs, therefore allowing for non-binary input and output. Rosenblatt’s new model, also known by the generic name of perceptron, was described in his report “The Perceptron: A Perceiving and Recognizing Automaton”, submitted to Cornell Aeronautical Laboratory in 1957.
The Pitts-McCulloch model and the perceptron were both limited: while they could learn to work as general classifiers, they could only discriminate linearly separable data. The main reason is that their structure is shallow, in the sense that information moves from input to output without undergoing any processing other than the inputs being weighted and linearly added. This changed with the introduction of extra neural layers, each of which transforms its input and extracts basic information in the process, allowing for a non-linear analysis of the data. Training a deep neural net made up of several layers raises a problem of its own, however: tuning the weights to minimise the cost function for the categorisation task at hand relies on calculus and on a process called “back-propagation”, which was developed and refined over a span of about thirty years by several authors, including Henry Kelley, Arthur Bryson, Stuart Dreyfus and, later, Seppo Linnainmaa, Paul Werbos, Geoffrey Hinton and Yann LeCun. Back-propagation, while conceptually easy to understand, can be computationally expensive for large networks. The use of GPUs partly mitigates this issue, but other issues, such as the vanishing gradient problem in very deep neural nets, can still make it difficult for a deep network to learn efficiently.
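The linear-separability limit described above can be illustrated with a minimal sketch. It assumes a plain perceptron learning rule with a hard threshold; the function name `train_perceptron`, the learning rate and the epoch budget are illustrative choices, not taken from Rosenblatt's report. The AND gate is linearly separable and is learned, while XOR is not and never yields an error-free pass over the data.

```python
def train_perceptron(samples, epochs=50, lr=1.0):
    """Perceptron learning rule on 2-input binary data (illustrative sketch).

    Returns (weights, bias, converged); converged is True if some epoch
    finished without a single misclassified sample.
    """
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for (x1, x2), target in samples:
            # Weighted sum of the inputs followed by a hard threshold.
            output = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            delta = target - output
            if delta != 0:
                # Nudge the weights towards the correct answer.
                w[0] += lr * delta * x1
                w[1] += lr * delta * x2
                b += lr * delta
                errors += 1
        if errors == 0:
            return w, b, True
    return w, b, False


AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

_, _, and_converged = train_perceptron(AND)
_, _, xor_converged = train_perceptron(XOR)
print("AND learned:", and_converged)
print("XOR learned:", xor_converged)
```

No single straight line separates the true and false outputs of XOR, so no choice of two weights and a bias can classify all four points, however long the training runs.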
Most importantly, even when a neural net does learn efficiently and can be used to solve a complex problem, its definition depends on some important choices such as the number of layers, the number of neurons in each layer, and the activation function. This dependence is not limited to neural networks: every machine learning algorithm depends on a set of hyper-parameters.
For example, some of the classic neural networks used for image recognition are AlexNet, VGG16 and GoogLeNet. AlexNet is made up of 8 layers, the first 5 of which are convolutional while the last 3 are fully connected. VGG16 has 16 layers, and the kernels of its convolutional layers are smaller than those in AlexNet. Classical machine learning algorithms also have important parameters that need to be set and that can affect their performance. Even simple algorithms such as linear regression, when fitted by gradient descent, are affected by hyper-parameters such as the learning rate which, if set too large, may prevent the algorithm from converging or, if set too small, can make convergence impractically slow.
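The effect of the learning rate can be seen in a small sketch, assuming a one-parameter linear regression (a slope with no intercept) fitted by plain gradient descent on the mean squared error. The data, the function name `fit_slope` and the three learning rates are made up for illustration.

```python
def fit_slope(xs, ys, lr, steps=100):
    """Fit y = w * x by gradient descent on the mean squared error."""
    w = 0.0
    n = len(xs)
    for _ in range(steps):
        # Derivative of mean((w * x - y) ** 2) with respect to w.
        grad = (2.0 / n) * sum(x * (w * x - y) for x, y in zip(xs, ys))
        w -= lr * grad
    return w


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated with a true slope of 2

for lr in (0.0001, 0.05, 0.2):
    print(f"learning rate {lr}: slope after 100 steps = {fit_slope(xs, ys, lr)}")
```

For this data each step multiplies the error in the slope by (1 - 15 * lr), so the middle rate converges quickly, the smallest rate is still far from 2 after 100 steps, and the largest rate overshoots further on every step and diverges.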
Richard Bland, in “Learning XOR: exploring the space of a classic problem”, Computing Science Technical Report, 1998, analyses how something as simple as a different weight initialisation can dramatically change how well a neural net learns, even for a simple problem like the “Exclusive OR” gate (a digital logic gate that gives a true output when the number of true inputs is odd). In particular, if the range of the initial weights is less than 0.05, the net never converges.
These issues have given rise to several startups and projects that try to tackle them. However, they often take a brute-force approach, in which many combinations of hyper-parameters are tested, much as with the GridSearchCV functionality in Python’s scikit-learn.
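As a rough illustration of what such a brute-force search looks like, the sketch below runs scikit-learn's `GridSearchCV` over a small grid. The choice of a support vector classifier, the Iris data and the grid values are assumptions made for the example only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Every combination in the grid is trained and scored: 3 * 2 = 6 candidates,
# each evaluated with 5-fold cross-validation.
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print("best hyper-parameters:", search.best_params_)
print("best cross-validated accuracy:", round(search.best_score_, 3))
```

The cost grows multiplicatively with every hyper-parameter added to the grid, which is why this kind of search is described as brute force.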
An automated approach to machine learning is currently a sought-after holy grail in artificial intelligence and the basis for AGI (Artificial General Intelligence).
A typical data science approach to a problem consists of several phases:
- data collection and labelling,
- feature analysis,
- data wrangling and feature selection,
- choice of the most appropriate machine learning algorithm for the specific problem, and
- hyper-parameter tuning.
The issue with most machine learning solutions is that all these steps need to be performed manually by a human analyst or data scientist. While, as we have seen, some approaches exist for the last two steps (algorithm choice and hyper-parameter tuning), they are mostly brute-force approaches that go through many different choices and keep the one with the best accuracy (or whatever metric is being used) on the problem. A comprehensive solution to AGI would therefore need to improve on this approach and also find a solution for the second and third steps (feature analysis, and data wrangling and feature selection).
It is, in fact, well known that the same algorithm can perform very differently on the same problem when given different features. PCA (Principal Component Analysis) is often used by analysts and data scientists to reduce the number of features, combining strongly correlated features into a few components that retain most of the variance in the data. Most problems are then solved by stringing together a series of algorithms and approaches that need to be applied by a human so that the best solution can be achieved using machine learning. This is because there is currently no single algorithm that works for every problem, and each problem needs to be addressed with an ad-hoc solution. An AGI solution would be able to perform all these steps seamlessly: analyse a dataset, extract the best performing features, and solve any problem.
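A minimal sketch of this kind of manual feature reduction, assuming scikit-learn's `PCA` and the Iris data as a stand-in dataset: four measured features are projected onto two principal components, and the analyst then inspects how much of the variance those components retain.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)  # 150 samples, 4 features

# Keeping 2 components is a choice the analyst makes by hand.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print("reduced shape:", X_reduced.shape)
print("variance retained per component:", pca.explained_variance_ratio_)
```

The number of components to keep is itself another setting that has to be chosen by a person, which is the point made in the paragraph above.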
Several papers have been published comparing the results of different algorithms. One of the most complete, albeit already 13 years old, is by Rich Caruana and Alexandru Niculescu-Mizil, “An Empirical Comparison of Supervised Learning Algorithms”, Proceedings of the 23rd International Conference on Machine Learning, 161-168, in which they conclude: “Even the best models sometimes perform poorly, and models with poor average performance occasionally perform exceptionally well”. Similarly, in 1995, another comprehensive study by King et al., “Statlog: comparison of classification algorithms on large realworld problems”, Applied Artificial Intelligence, 9, clearly states: “The main conclusion is that: there is no single best algorithm, and it is a case of horses for courses. The best algorithm for a particular dataset depends crucially on features of that dataset”. In addition, “in many problems human understanding is likely to be crucial if the learned rules are to be useful, e.g., in legal decision making, medical and engineering diagnosis, knowledge discovery in chemistry and biology, pharmaceutical and engineering design”. Although many years have elapsed, this is still very much the case, even with powerful deep neural networks, the advance of deep learning, and its almost ubiquitous use by internet giants such as Google, Facebook and Amazon.
As François Chollet mentions in his “On the Measure of Intelligence”, arXiv:1911.01547v2, ‘solely measuring skill at any given task falls short of measuring intelligence, because skill is heavily modulated by prior knowledge and experience: unlimited priors or unlimited training data allow experimenters to “buy” arbitrary levels of skills for a system, in a way that masks the system’s own generalization power’. He also mentions: ‘although we are able to engineer systems that perform extremely well on specific tasks, they have still stark limitations, being brittle, data hungry, unable to make sense of situations that deviate slightly from their training data or the assumptions of their creators, and unable to repurpose themselves to deal with novel tasks without significant involvement from human researchers’.
Another important issue is explainability. In many regulated markets, such as finance or insurance, any machine learning algorithm needs to be “explainable” in order to meet regulatory requirements. Neural networks, however, are typically considered “black boxes”: they do not allow their results to be explained, and so it is not possible to rule out, for example, discrimination in their decisions. The Microsoft experiment with its Twitter chatbot Tay has unfortunately become famous. It was meant to learn from other users and independently tweet new content but, regrettably, Tay soon learned to tweet inflammatory and offensive content and was quickly shut down by Microsoft. The inability to understand how a machine learning algorithm arrives at a solution can therefore mask inadvertent (learned) discrimination in its decision-making processes. For this reason, neural nets cannot be used in many regulated businesses, despite their potentially superior classification ability with respect to more basic machine learning approaches such as simple linear regression.
All of this shows the importance of developing an AGI that is also explainable so that it can be used for different problems and its solution can be explained for regulatory purposes.
The important steps for an AGI are those we already discussed:
- the ability to automatically extract the best features, without the need for a human analyst to select them beforehand and manipulate the data,
- the ability to successfully solve any classification problem without needing to find the best algorithm for each one.
This, in turn, allows for a true AGI in which it is not necessary to deploy a different machine learning algorithm for each problem.
As an aside, it is worth noting that true AGI does not imply human-level intelligence, but simply the ability to generalise the same solution to different problems.
For example, two very simple, but quite different, classification problems for machine learning are the MNIST dataset and the Iris dataset. It is very simple to create neural networks (or other machine learning algorithms) that can achieve over 99% accuracy in the first case and over 97% in the second for these ubiquitous tasks.
The MNIST dataset consists of 28 by 28 greyscale images labelled with values from 0 to 9 according to the digit the image represents. For the MNIST dataset, one can usually build a simple convolutional network that achieves over 99% accuracy. Convolutional networks have been around since the 1990s and were developed primarily by Yann LeCun. In a similar way to how retinotopic mapping works in human vision, neurons in convolutional networks are connected only to neurons that are physically close to them. This allows for much greater accuracy when it comes to image recognition, of which the digit recognition problem is just a simple example.
The Iris dataset is made up of measurements of the length and width of both the sepals and the petals of flowers belonging to three distinct classes: setosa, versicolor and virginica. The task is a classification problem that takes the petal and sepal lengths and widths (i.e. vector data) as input and returns a species name (i.e. a string) as output.
However, even for these simple problems, while we can use neural nets to solve both the digit classification and the Iris classification problems, it is impossible to use the same neural net to solve both. The neural net needs to be rewritten to account for the different input size, the different number of classes, and a different architecture.
This means that even two problems as simple as digit recognition and Iris classification need completely different neural net architectures to be solved, and therefore there is no unique neural net architecture (or any other machine learning algorithm) that can solve both problems at the same time. This issue, of course, is even more pronounced for more complex problems.
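The dependence on input size and number of classes can be made concrete with a small sketch. It assumes scikit-learn's `MLPClassifier` and uses scikit-learn's bundled 8 by 8 digits data as a small stand-in for MNIST (it is not the 28 by 28 MNIST set); the helper `fit_small_net` and its hidden-layer size are illustrative only.

```python
from sklearn.datasets import load_digits, load_iris
from sklearn.neural_network import MLPClassifier

X_iris, y_iris = load_iris(return_X_y=True)        # 4 features, 3 classes
X_digits, y_digits = load_digits(return_X_y=True)  # 64 features, 10 classes


def fit_small_net(X, y):
    """Train a one-hidden-layer network (hypothetical helper for this sketch)."""
    net = MLPClassifier(hidden_layer_sizes=(16,), max_iter=300, random_state=0)
    return net.fit(X, y)


iris_net = fit_small_net(X_iris, y_iris)
digits_net = fit_small_net(X_digits, y_digits)

# The first and last weight matrices are tied to the dataset's dimensions.
print("iris net:  ", iris_net.coefs_[0].shape, iris_net.coefs_[-1].shape)
print("digits net:", digits_net.coefs_[0].shape, digits_net.coefs_[-1].shape)

# A network trained for one problem cannot even accept the other's input.
try:
    iris_net.predict(X_digits)
    reusable = True
except ValueError:
    reusable = False
print("iris net usable on digits:", reusable)
```

This sketch uses fully connected layers for both problems. A convolutional network for images, as discussed above, would differ from the Iris network in layer types as well as in sizes.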
To satisfy the requirements for a truly general artificial intelligence, the same algorithm should satisfactorily solve both problems. General artificial intelligence has been defined, in the context of AI research, by Legg and Hutter in “A collection of definitions of intelligence”, arXiv:0706.3639, by summarising over 70 definitions from the literature into a single statement: “Intelligence measures an agent’s ability to achieve goals in a wide range of environments”. Hernandez-Orallo, in “Evaluation in Artificial Intelligence: from task-oriented to ability-oriented measurements”, paraphrases McCarthy and defines AI as “the science and engineering of making machines do tasks they have never seen and have not been prepared for beforehand”. Because a neural network has to be manually re-configured by changing its architecture, number of layers, etc., current solutions, including those not based on neural nets, do not satisfy either of these definitions. However, a system that does not require any manual reconfiguration but only re-training on the new data, even in a wide range of environments, meets both definitions and can be considered true AGI.
For all these reasons, current state-of-the-art mainstream classical machine learning and deep learning are still quite far from achieving true AGI.
Bland, Richard. Learning XOR: exploring the space of a classic problem. Computing Science Technical Report, (1998)
Caruana, Rich & Niculescu-Mizil, Alexandru. An Empirical Comparison of Supervised Learning Algorithms. Proceedings of the 23rd International Conference on Machine Learning, 161-168, (2006)
Chollet, François. On the Measure of Intelligence, arXiv:1911.01547v2, (2019)
Hernandez-Orallo, J. Evaluation in Artificial Intelligence: from task-oriented to ability-oriented measurements. Artificial Intelligence Review, pages 397-447, (2017).
King et al. Statlog: comparison of classification algorithms on large realworld problems. Applied Artificial Intelligence, 9, (1995)
Lazer, D., Kennedy, R., King, G. & Vespignani, A. Big data. The parable of Google Flu: traps in big data analysis. Science, 343(6176), 1203-1205, (2014)
LeCun, Y. Generalization and network design strategies. Technical Report CRG-TR-89-4, (1989).
LeCun, Y., Bengio, Y. & Hinton, G. Deep Learning. Nature, 521(7553):436-444, (2015).
Marcus, Gary. Deep Learning: A Critical Appraisal, arXiv:1801.00631v1, (2018)
McCarthy, J. Generality in Artificial Intelligence. Communications of the ACM, 30(12):1030-1035, (1987)
Pitts, W. & McCulloch, W. A Logical Calculus of the Ideas Immanent in Nervous Activity. Bulletin of Mathematical Biophysics 5:115-133, (1943)
Rosenblatt, Frank. The Perceptron: A Perceiving and Recognizing Automaton. Cornell Aeronautical Laboratory, Report No. 85-460-1, (1957)
Legg, Shane & Hutter, Marcus. A collection of definitions of intelligence, arXiv:0706.3639, (2007)