Why is the developmental approach important?

Today's AI systems learn by passively processing large amounts of supervised or unsupervised data. They cannot act to produce their own data, observe causal links, or test hypotheses; yet this is how learning actually takes place in infants and young children. Statistical correlations are not enough to build meaningful and grounded representations of the world: it is crucial to have a body that interacts with the environment and builds up grounded sensorimotor invariants, and then grounded language conventions through interactive language games with others.

We don't understand the brain and don't yet have good ways to observe its functioning. We do, however, have a large body of research in cognitive psychology on child and animal behavior that we can study to understand how intelligence develops and how cognitive abilities and representations are built in young children.

The goal of developmental AI is to use this corpus of knowledge to reverse engineer a theory of intelligence, defined as a grounded dynamic interaction between an agent and its physical and social environment. We can then build machines that relate to the same reality as we do, "understand" the concepts they have built and use in cooperation with others, and ultimately become better AIs that can work with us and help us on practical, real-world issues.

Why not focus on deep learning only?

Deep learning systems, or more generally end-to-end differentiable learning systems, have achieved remarkable results in classification and prediction, and more recently in natural language processing with LLMs. They have also shown clear limitations in explainability and, more generally, in the capacity to understand and form coherent internal models. We are aware that important research is under way to address these issues, and these efforts might eventually succeed, but at the same time there are reasons to consider that fundamental limits could apply to differentiable models.

Once some perceptual invariants have been identified by self-supervised means, classical algorithmic methods could be much more efficient, generalizable, and explainable than forcing the system to implement similar processing with neurons alone.
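To make this division of labor concrete, here is a minimal sketch in Python. A stand-in for a learned self-supervised encoder (just a random projection here, purely an illustrative assumption) quantizes raw signals into discrete symbols, after which exact classical operations take over:

```python
# Minimal hybrid sketch: a (stand-in) learned encoder produces features,
# clustering turns them into discrete symbols, and from there exact,
# explainable classical algorithms do the rest.
from collections import Counter

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

def encode(observations):
    # Stand-in for a self-supervised encoder; a real system would use a
    # trained network here.
    projection = rng.normal(size=(observations.shape[1], 8))
    return observations @ projection

# Raw "sensor" data: 300 observations of a 32-dimensional signal.
observations = rng.normal(size=(300, 32))

# Step 1 (statistical): compress observations into features, then
# quantize them into one discrete symbol per observation.
features = encode(observations)
symbols = KMeans(n_clusters=5, n_init=10).fit_predict(features)

# Step 2 (classical): with discrete symbols in hand, counting, sorting
# and comparing categories are exact, cheap, and fully explainable --
# no gradient descent required.
counts = Counter(symbols)
symbol, n = counts.most_common(1)[0]
print(f"most frequent symbol: {symbol} ({n} occurrences)")
```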

Beyond categorization, deep learning models can further help traditional algorithms by providing oracles and search heuristics to overcome scalability bottlenecks. Combining the strengths of both approaches seems like an exciting path for progress.
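As an illustration of the oracle idea, the following sketch plugs a "learned" cost-to-go estimate into classical A* search. The noisy Manhattan distance below is our stand-in for a trained model; everything else is plain symbolic search:

```python
# Neural-guided search sketch: a classical A* loop steered by a learned
# heuristic. learned_heuristic() stands in for a neural cost estimator.
import heapq
import random

def learned_heuristic(state, goal):
    # Stand-in for a neural cost-to-go oracle: an imperfect (noisy)
    # Manhattan distance estimate.
    dx, dy = abs(state[0] - goal[0]), abs(state[1] - goal[1])
    return (dx + dy) * (1.0 + 0.1 * random.random())

def astar(start, goal, passable):
    frontier = [(learned_heuristic(start, goal), 0, start, [start])]
    seen = set()
    while frontier:
        _, cost, state, path = heapq.heappop(frontier)
        if state == goal:
            return path
        if state in seen:
            continue
        seen.add(state)
        x, y = state
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if passable(nxt) and nxt not in seen:
                estimate = cost + 1 + learned_heuristic(nxt, goal)
                heapq.heappush(frontier, (estimate, cost + 1, nxt, path + [nxt]))
    return None

# 5x5 open grid: the oracle only orders the frontier, so even a noisy
# estimate yields a valid path.
print(astar((0, 0), (4, 4), lambda p: 0 <= p[0] < 5 and 0 <= p[1] < 5))
```

Note that with an imperfect oracle the path found may not be optimal, but the classical layer still guarantees that the answer is a valid path and that every step of the search can be inspected.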

It turns out that many of the most successful applications of deep learning already combine traditional algorithmic methods with neural nets. Insisting on end-to-end differentiability closes the door on many optimizations and possible shortcuts, and seems to contradict the observation that high-level cognition does not occur, at least at the conscious level, in a fluid way, but rather as a form of discrete trial-and-error processing, reminiscent of evolutionary dynamics at the scale of ideas.

Exploration of the solution space by gradient methods is only applicable when a gradient is computable, while many high-level cognitive activities happen in a non-continuous, non-differentiable way, where seemingly random or heuristically guided recombination in a representation space tends to occur. Even if we could in principle eventually emulate this process with artificial neurons, it might come at a huge cost in research time and computational energy.
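A toy illustration of such non-differentiable exploration, assuming a deliberately simple (1+1) evolutionary scheme: the objective (matching a hidden bitstring) offers no gradient at all, yet random mutation plus selection solves it:

```python
# Discrete, gradient-free search: a (1+1) evolutionary strategy on
# bitstrings. The fitness function is non-differentiable by construction.
import random

random.seed(1)
N = 40
target = [random.randint(0, 1) for _ in range(N)]

def fitness(bits):
    # Number of bits matching a hidden target: no gradient to follow.
    return sum(a == b for a, b in zip(bits, target))

current = [random.randint(0, 1) for _ in range(N)]
for step in range(5000):
    # Random recombination in representation space: flip each bit with
    # probability 1/N, keep the mutant if it is at least as fit.
    mutant = [bit ^ (random.random() < 1 / N) for bit in current]
    if fitness(mutant) >= fitness(current):
        current = mutant
    if fitness(current) == N:
        break
print(f"matched {fitness(current)}/{N} bits after {step + 1} steps")
```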

What we propose with the hybrid approach is simply to take a step back and use all the tools in our arsenal, in order to get tangible results in the shortest amount of time.

Should we take inspiration from the brain?

The reality as of today is that we still don't understand much about the brain. We have limited capabilities to measure what is going on inside it and to perform the experiments needed to figure out how it works. It seems more likely that AI will help us better understand the brain than the opposite.

To take a famous analogy, we did not invent aerodynamics by studying birds; but once we understood the Navier-Stokes equations, we were able to abstract our understanding from nature, build airplanes, and ultimately explain how birds fly. We believe a similar situation will happen with AI: we first need to come up with a theory of intelligence, from which we can build intelligent machines and, ultimately, understand the brain.

What do you mean by "common sense" in AI?

This is an old and fundamental question in AI, related to the question of what "meaning" is. As with words like "intelligence" or "understanding", we have an intuitive idea of what it is, but there is no agreement in the community on a clear definition.

The developmental AI approach posits that meaning is the internal abstraction that forms when sensorimotor information is categorized by the agent. Meaning is different from language, which is merely a negotiated convention between agents for "serializing" meaning into discrete tokens. Words relate to internal abstractions (meanings) that are similar enough from one individual to another for them to successfully communicate, plan, and act jointly. This is the fitness against which language conventions are selected. Note that these internal abstractions might materialize as patterns of neural activation or as memory configurations in silicon, but this detail has no importance as long as communication via language is successful.
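One concrete mechanism with exactly this flavor, from the language-game literature, is the naming game. The sketch below is our minimal rendition (population size, adoption rule, and word invention are illustrative assumptions): agents privately map shared meanings to invented words, and a common convention emerges purely from communicative success:

```python
# Minimal naming-game sketch: agents invent words for shared meanings
# and align their private lexicons through repeated pairwise games.
import random

random.seed(0)
MEANINGS = ["red-ball", "blue-cube"]

class Agent:
    def __init__(self):
        self.lexicon = {}  # private mapping: meaning -> word

    def speak(self, meaning):
        # Invent a word if none is known for this meaning.
        if meaning not in self.lexicon:
            self.lexicon[meaning] = f"w{random.randint(0, 999)}"
        return self.lexicon[meaning]

    def hear(self, word, meaning_shown):
        # Adoption rule: when the hearer's word differs, align with the
        # speaker. Communicative success is the selection pressure.
        if self.lexicon.get(meaning_shown) != word:
            self.lexicon[meaning_shown] = word

agents = [Agent() for _ in range(10)]
for _ in range(500):
    speaker, hearer = random.sample(agents, 2)
    meaning = random.choice(MEANINGS)  # joint attention on one referent
    hearer.hear(speaker.speak(meaning), meaning)

# After many games the population tends to converge on one word per
# meaning: a negotiated convention, selected by communicative success.
print({m: {a.lexicon.get(m) for a in agents} for m in MEANINGS})
```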

Most current AI models for natural language processing do not have internalized meaning; they treat words as fundamental, self-sufficient entities that are put in correspondence at a statistical level. Words, however, are just proxies for the underlying meanings, and there is no obvious link between the internal structure of the meanings the words relate to and the structure of their statistical relationships.

The developmental AI approach aims at building the internal meaningful structures first, and then learning to serialize them into language (which is a social activity) in order to synchronize them with the corresponding structures in other people and achieve successful communication.

These meaningful structures, extracted from the categorization of sensorimotor signals over a lifelong period, are in all likelihood extremely intricate and large, and cover all domains of physical and social interaction. This corpus of knowledge is what we can call "common sense" for an AI agent; it is tied to the agent's embodiment in the physical and social world and to how the agent chose to categorize it.

Why VR experiments and not standard simulation, or even robots?

Two aspects are crucial in the developmental AI paradigm: embodiment in a physical world, and social interaction to build joint shared representations, make joint plans, evolve language, and so on.

Obviously, robots are good platforms to explore these issues, but unfortunately they are expensive, require a lot of maintenance, and allow no shortcuts when we want to study particular aspects of interaction while ignoring others (for example, you cannot avoid solving the problem of grasping objects, which is still a difficult problem in robotics, even if all you want to study is human-robot task cooperation). Robots remain one of the end goals of human-level intelligence, for solving real-world problems that involve physical interaction, but they are a very slow and costly platform to start with.

The first requirement, a physical world, can easily be met in modern simulations, where it is possible to reproduce object dynamics and simulate a robot body. Video games have made tremendous progress in rendering realistic environments, and this can be achieved easily with off-the-shelf engines. The second requirement, social interaction, calls for a seamless interface for the human to interact in the simulation. VR has made this extremely simple and intuitive, and it also captures the subtle body movements and gestures that play an important role in the non-verbal early phase of development. Low-cost VR headsets are already available and would allow building games/apps that let non-experts interact with, and raise, a "baby AI" in a virtual setting.
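As an illustration, here is the kind of per-frame message such a VR interface could stream to the learning agent. The field layout below is purely our assumption for the sake of the example, not a finished protocol:

```python
# Hypothetical per-frame message from a VR client to the learning agent:
# head and hand poses carry the non-verbal channel, an optional
# transcribed utterance carries the verbal one.
import json
from dataclasses import dataclass, asdict
from typing import Optional, Tuple

@dataclass
class Pose:
    position: Tuple[float, float, float]          # meters, world frame
    rotation: Tuple[float, float, float, float]   # unit quaternion

@dataclass
class FrameUpdate:
    timestamp: float
    head: Pose
    left_hand: Pose
    right_hand: Pose
    grip: Tuple[float, float]   # controller grip values in [0, 1]
    utterance: Optional[str]    # speech-to-text result, if any

frame = FrameUpdate(
    timestamp=12.34,
    head=Pose((0.0, 1.7, 0.0), (0.0, 0.0, 0.0, 1.0)),
    left_hand=Pose((-0.3, 1.2, 0.4), (0.0, 0.0, 0.0, 1.0)),
    right_hand=Pose((0.3, 1.2, 0.4), (0.0, 0.0, 0.0, 1.0)),
    grip=(0.0, 0.8),
    utterance="look at the ball",
)
print(json.dumps(asdict(frame)))  # one such message per rendered frame
```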

Here is an early technical implementation of the kind of software solution that we plan to use in developmental AI experiments: 

What is the time frame for human level intelligence?

It is impossible to tell, because there is still so much to understand and so much fundamental research to conduct. It seems, however, that the technological pieces of the puzzle are in place for the first time: powerful computational capabilities, immersive VR, deep learning to bridge the gap between sensors and categories, networked gaming to boost the scale of learning for interactive AIs, and business interest for financing. We believe that under these circumstances, significant progress can be made within the next few decades.