Why Does AI Have an “A”? Dark Matter of Intelligence — The Missing Matter of General Human Intelligence
While the question that should sit at the top of the mountain for the human-parallel intelligence research community is how and when consciousness engineering will be achieved, let us first scratch the surface of human intelligence to understand the missing pieces that keep AI systems artificial. If we do not address these core phenomena of human intelligence, this homeostatic state will be carried along into the future.
Historical Dimension — The Dawn of Philosophy Era
If we have to argue about the origins of reasoning, we first have to enter the dimension of evolution: reasoning about interactions with the environment leads to optimized survival instincts, which in turn drive the evolution of a being relative to other beings in that environment. This paradigm of reasoning is about environmental comprehension. Once human beings evolved from earlier hominids into “Homo sapiens,” they were able to reason about the self, existence, the purpose of life, and consciousness, which is the highest form of evolutionary reasoning to date.
This philosophical reasoning is the era in which figures such as Socrates, Plato, and Aristotle came into play, talking about the human self, the universe, the conscious being, and the purpose of life. It also involved a higher order of mathematical thinking, an abstract form of symbol processing in which symbols represent real-world entities. This elementary symbol-processing abstraction was taken up by humans with sharper reasoning skills, notably Pythagoras, to build a higher-level abstraction over 2-dimensional objects using the first defined mathematical symbols. Euclid of Alexandria was another figure who came along to define abstract ideas on top of this elementary reasoning and formulate Euclidean geometry. These convoluted human reasoning efforts, built on top of one another, led to the creation of probability, information processing, calculus, linear algebra, quantum physics, and so on, with which the current effort is to build AI systems that mimic what the very first philosophers arrived at through their reasoning.
It is very much as if we, as humans, built a ladder, and now, sitting on top of it, we are trying to formulate another level of abstraction that can build the human ladder from scratch by itself.
There are certain core elements of human reasoning that are essential to creating any form of abstraction, whether from rudimentary or from other abstract formulations. We will try to understand all of those elements, in order to remove the “A” and replace it with an “H”.
The Poverty of Stimulus — The very first stepping stone
While this phrase comes from the work of the famous computational linguist Noam Chomsky, it applies to the whole human learning paradigm and should be the very first focus of the machine learning paradigm. It basically means that the human brain has an innate capacity for learning and producing language through experience alone, without receiving direct teaching from others. One of the questions the concept tackles is how children are able to understand and use computationally complex rules without any direct instruction in the correctness of those rules or in how to use them.
Our current attempts at mimicking human intelligence should put sample-efficient learning algorithms first. The current generation of learning systems is trained on humongous amounts of training data. Yes, this came after the big data era started, but the approach needs to be altered.
The curse of Inductive Bias
The floating-point values we initialize our network weights with constitute the initialization strategy, and the set of assumptions a learner starts from is called its inductive bias. In the current breed of machine learning algorithms, this initialization has a huge impact on the training process. It is as if there are millions of roads connecting multiple destinations, and only a subset of those roads leads to yours, so you have to carefully choose the path you start on; otherwise you would never even get near your desired destination. A simple analogy: a hedge maze.
The biggest question in AI learning philosophy right now is finding the least-effort route to the best network initialization and optimization.
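To make the initialization point concrete, here is a minimal sketch (a toy two-layer tanh network in NumPy, not any particular framework) showing that an all-zero initialization never breaks symmetry between hidden units, while a Xavier-style random initialization does:

```python
import numpy as np

rng = np.random.default_rng(0)

def hidden_grads(W1):
    """One backward pass through a tiny two-layer tanh network;
    returns the gradient w.r.t. the first-layer weights W1."""
    X = rng.standard_normal((8, 3))        # toy input batch
    y = rng.standard_normal((8, 1))        # toy targets
    W2 = np.ones((4, 1))                   # fixed second layer
    h = np.tanh(X @ W1)                    # hidden activations
    err = h @ W2 - y                       # output error
    dh = (err @ W2.T) * (1 - h ** 2)       # backprop through tanh
    return X.T @ dh / len(X)

# All-zero init: every hidden unit receives an identical gradient,
# so the units can never differentiate (symmetry is never broken).
gz = hidden_grads(np.zeros((3, 4)))
print(np.allclose(gz, gz[:, :1]))          # True

# Xavier/Glorot-style init: scale random weights by sqrt(1 / fan_in)
# so activation variance is roughly preserved, and symmetry is broken.
gx = hidden_grads(rng.standard_normal((3, 4)) * np.sqrt(1 / 3))
print(np.allclose(gx, gx[:, :1]))          # False
```

With zero weights, each hidden unit computes the same function and receives the same gradient, so training can never pull them apart; the randomized scheme starts each unit on a different "road."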
Sample-Efficient Nonconvex Optimization
All machine learning problems (AI is the bigger umbrella — don't use the terms interchangeably) boil down to reducing a single value: the loss, which is the sole indicator that the training process is moving along the expected path of learning. In information-theoretic (Claude Shannon) terms, the objective function is the average, or expectation, of some loss function over a finite or infinite dataset. Optimizing these non-convex loss functions on matrix-compute machines is in general an NP-hard problem.
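The "objective as an expectation of a per-sample loss" can be written down directly. A minimal sketch, assuming a toy linear-regression dataset and a squared loss:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy linear-regression dataset: y = X w* + noise
X = rng.standard_normal((100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.standard_normal(100)

def per_sample_loss(w, xi, yi):
    """Squared loss on a single example."""
    return (xi @ w - yi) ** 2

def objective(w):
    """Objective = empirical expectation of the per-sample loss."""
    return np.mean([per_sample_loss(w, xi, yi) for xi, yi in zip(X, y)])

# Far from the true weights, the expected loss is large;
# at the true weights it shrinks to roughly the noise variance.
print(objective(np.zeros(3)) > objective(w_true))  # True
```

Every gradient-based trainer is, one way or another, driving this single expectation downward.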
There has been recent work on the stochastic nested variance-reduced gradient algorithm (SNVRG), developed from stochastic gradient descent methods and variance reduction techniques. SNVRG achieves a near-optimal convergence rate for finding a first-order stationary point of a non-convex function. Local optima are found by examining the curvature information of the stationary points SNVRG finds, and from the local optimum information and that near-optimal convergence rate, an algorithm for approaching the global minimum is formulated. I won't get into the details of how the algorithm works, since that would stretch this writing to biblical length.
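Without reproducing SNVRG itself, the underlying variance-reduction idea can be sketched with its single-loop ancestor, SVRG: periodically compute a full gradient at a snapshot point and use it to correct each stochastic gradient. The sketch below runs it on a toy least-squares problem; the hyperparameters are illustrative, not tuned:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 200, 5
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)
y = X @ w_star                             # noiseless targets

def grad_i(w, i):
    """Stochastic gradient of the squared loss on example i."""
    return 2.0 * (X[i] @ w - y[i]) * X[i]

def full_grad(w):
    return 2.0 * X.T @ (X @ w - y) / n

def svrg(w, epochs=50, m=300, lr=0.002):
    """Single-level variance reduction (SVRG): correct each stochastic
    gradient with a full gradient taken at a periodic snapshot."""
    for _ in range(epochs):
        w_snap = w.copy()
        mu = full_grad(w_snap)             # anchor gradient at the snapshot
        for _ in range(m):
            i = rng.integers(n)
            # variance-reduced estimate: unbiased, and its variance
            # shrinks as w approaches the snapshot and the optimum
            v = grad_i(w, i) - grad_i(w_snap, i) + mu
            w = w - lr * v
    return w

w_hat = svrg(np.zeros(d))
print(np.linalg.norm(w_hat - w_star) < 1e-2)   # True: converges to the minimizer
```

SNVRG nests several such snapshot levels instead of one, which is where its improved convergence rate comes from.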
In a nutshell, the right paradigm of learning is finding the highest abstractions of conception using the least information about a process. That is true intelligence in terms of observational information processing.
Dark Matter of Intelligence — It is there and we are missing it.
In the 1980s, physicists proposed what is now the standard model of cosmology, in which the observable mass-energy of the universe is around 5%; the rest is dark matter and dark energy, whose properties and characteristics cannot be observed directly and must be reasoned about from the visible mass-energy. Dark matter is significant in explaining the formation, evolution, and motion of the visible universe.
Taking inspiration from this physics concept, we can argue that the information processed by AI systems (the tensor-form signals of the vision, speech, language, and motor domains) is the mass-energy. The joint representation of all the signals in the environment, and joint inference over them, can be termed the dark matter of perception, and its outcome, intelligence.
A road from big data for small tasks to small data for big tasks
The current AI paradigm focuses on collecting gigantic datasets and then using them to perform inference over tasks like sentence classification, topic generation, image classification, object detection, etc. Humans, by contrast, can make rich inferences from sparse data and achieve deep understanding from a single picture using acquired knowledge such as the laws of physics. For example, observing the falling water in an image of Niagara Falls, humans know it is flowing downwards and not upwards (a still image cannot tell you which direction the water is flowing) because gravity is in effect. There are factors behind human activity, such as intent, causality, physics, functionality, social intent, and individual preferences, of which the pixels cannot speak.
All the factors defined above are missing from the computer vision literature, which focuses on classification, localization, and reconstruction, while the answers to why, how, and what if are missing. The why, how, and what if are answered using causal reasoning, intuitive physics, learning of functionality and affordance, intent prediction, and utility learning. This is what is called small data for big tasks.
Continuum Spectrum of darkness
Let us talk about what exactly the dark elements in the AI perceptual field are. The darkness comes in a spectrum: there are things that are very straightforward to pick up from visual cues, and things that are very difficult to pick up. For example, the face of a person sits at the lightest end of the darkness spectrum, meaning it is very straightforward. In the above image, the reason the ketchup bottle is positioned upside down sits at the darkest end of the information-darkness spectrum: it is not straightforward to deduce that the bottle is kept upside down to accumulate the ketchup toward the cap, because the bottle seems almost empty; otherwise why would you keep the bottle upside down?
What are computer vision systems missing? — The Neuroscientific dimension
From the biological perspective, the majority of living creatures have a single vision system with which they perform thousands of tasks. This contrasts with the dominant contemporary stream of thought in computer vision research, where a single model is designed for a single specific task. In the above-mentioned scene, if a human were given the task of making tea, he would first identify the objects needed (object recognition): the tea, the water, the milk, the gas stove, and maybe sugar, if the person is not hitting the gym or diabetic (probably). Then comes grasping the objects, which requires object manipulation, and then planning how to make the tea: heat the water first, then add the tea, let it heat for some time, then add some sugar and, at the end, the milk. All of this is part of task planning.
Prior research has shown that it takes a human just one minute to figure out all the above-mentioned details utilizing a single vision system facilitating all the subtasks.
- The experimentation on the left shows how the cortical dorsal and ventral regions respond to different types of objects; here the objects of interest are tools and faces.
- The results show that recognizing a face inside an image utilizes a different mechanism from recognizing an object that can be manipulated as a tool.
- Taken together, these results indicate that our biological vision system possesses a mechanism for perceiving object functionality (i.e., how an object can be manipulated as a tool) that is independent of the mechanism governing face recognition (and recognition of other objects).
Despite good performance on object classification, recognition, localization, tracking, etc., current DNNs cannot account for the image-level behavioral patterns of primates, calling attention to the need to account more precisely for the neural mechanisms underlying primate object vision.
Task/need-driven perceptual understanding
Human vision organizes its representations during inference, even for simple categorization tasks, based on the goal or the need. For example, a kitchen can be categorized as an indoor scene, a place to cook, a place to socialize, or specifically as one’s own kitchen. Thus information gathering and scene categorization are constrained by the categorization task, suggesting a bidirectional interplay between the visual input and the viewer’s need.
The above images show how a viewer visualizes the same view (a) when asked to report it as a restaurant (b) and as a cafeteria (c).
Also, the 3D model of the world is driven by the task at hand. For example, grasping a mug could result in two different grasps: the cylindrical grasp of the mug body and the hook grasp of the mug handle. Such findings also suggest that vision (identifying graspable parts) is largely driven by tasks; different tasks result in diverse visual representations.
Let’s now move towards the core elements of the understanding with Humanlike Common Sense.
Towards Understanding with Humanlike Common Sense
Human cognitive ability comprises more than the visible, traditional recognition and categorization of objects. Human visual understanding involves causal interpretation of objects in motion and anticipation of the motives of agents in the perceptual field. Humanlike common sense can therefore be achieved with joint representation learning, where the first component is the visible, traditional recognition and categorization of objects, scenes, actions, events, and so forth, and the second component is the “dark” higher-level concepts of fluents, causality, physics, functionality, intentions and goals, utility, and preference. Let’s understand each of the dark-matter components in detail.
Fluent and Perceived Causality
Causality is the study of the cause and effect of entities under observation, from another person’s perspective. This behavioral phenomenon helps us understand why a process exhibits the nature it does. With causal understanding, events are stitched together along the temporal dimension to interpret actions and the reasons behind the course of their happening.
How does the Human visual system perceive causality?
The question popping up in your head should be: can we see causality? The answer is more than one-dimensional: both yes and no. Physical causality can be observed with vision being non-deceptive, while non-physical causality can be deceptive to the human visual cortical system. Let’s try to understand this with examples.
Non-Deceptive Physical Causality
Above is an example of a non-deceptive physical causality phenomenon, where the human visual cortex deduces cause and effect by looking at the motion of billiard balls. Consider the red ball as A and the green ball as B. Case a) demonstrates the launching of ball B by ball A in motion; the series of physical-interaction images helps us visually perceive the causal effect of the collision between A and B. Next consider case b), where both balls are set in motion after the collision between ball A and ball B, but this time A and B move together with no distance between them (entraining).
You could argue that case a) and case b) have the same causal effect, so why are they considered different cases? The answer lies in a concept called retinotopic adaptation: after prolonged viewing of the launching effect, subsequently viewed displays are judged as non-causal if they are located within the same retinotopic coordinates.
The experimental results showed that retinotopically specific adaptation does not transfer between launching and entraining, indicating that they are indeed fundamentally distinct categories of causal perception in vision. One could argue here that retinotopic adaptation can induce a perceptual deception, which in some sense is a neural defect and should fall under the category of deceptive physical causality. This is a pseudo-fair statement that will become clear in the section below.
Deceptive Physical Causality
This case is governed by what could be called the law of knowing all the variables in the perceptual field. The work of a magician is basically the work of physical deception. Consider the magic trick in which a magician appears to move a glass of water by blowing at it: the visual perception system understands that the magician’s blown air is moving the glass, while in fact a magnet placed under the surface on which the glass moves is causing the motion. This observation can also be related to the phenomenon that correlation is not causation.
Non-Deceptive Non-Physical Causality
Consider the case where two people are speaking. In the course of the conversation, say person A starts speaking and then person B speaks, and they are the only two sitting in the room. You know for sure it is person A’s speech that instigated person B’s utterance, without any physical stimulus being involved. Now consider the case in which you mute the conversation and then observe the two persons: the input is a pure visual signal, unassisted by auditory signals. Would a third person be able to guess whether person A’s utterance led to person B’s utterance, which in this case is just movement of the mouth and maybe hand gestures? You would probably say yes: there will be visual cues, such as person A looking at person B while moving his lips and person B nodding his head, or in the worst case looking down or somewhere else. These indicators help us understand non-physical causality, which is definitely a difficult perceptual field.
Deceptive Non-Physical Causality
This case intersects heavily with deceptive physical causality, since it is also based on the concept of knowing all the variables: a latent variable in the environment plays the trick of making the perceived causality deceptive. The difference from the physical deceptive case is that the deception is harder to uncover. Consider the case where the two folks mentioned above start to talk and person B laughs at one point in the conversation. A third person’s visual system would perceive person B’s laughter as a response to person A’s conversational stimulus, but that is not the case: person B burst out laughing because of a related incident he recalled while listening to person A. This case is very difficult to perceive, because even person A could not figure out the reason for the laughter, so how would you expect a third-person view to deduce the causality of the laughter?
Transferability of causal understanding
The physical behavior of objects, in terms of abiding by the laws of physics, tends to be consistent across different environments. This consistency should let us transfer causal behavior learned in one environment to another, making causal knowledge transferable. Assuming the dynamics of the world are constant, causal relationships will remain true regardless of observational changes to the environment.
I do have second thoughts about the causal nature of entities remaining constant across different environments. For example, consider the screwdriver. Previously learned causal effects make an AI system perceive a person with a screwdriver in hand as tightening screws in the process of fixing objects. In a new environment, a person with a screwdriver in hand tries to kill another person, which does not match the causal reasoning the system learned historically, so the system will not be on alert when it finds a human holding a screwdriver in such an environment.
Humans excel at understanding their physical environment and interacting with objects undergoing dynamic state changes, making approximate predictions from observed events. The knowledge underlying such activities is termed intuitive physics.
Human cognition relation with Intuitive Physics
The most widely used approach is the violation-of-expectation method, in which infants see two test events: an expected event and an unexpected event that violates the expectation, for example a stack of Legos, showing which stacks could potentially fall and which will remain stable. In such complex and dynamic events, the ability to perceive, predict, and therefore appropriately interact with objects in the physical world relies on rapid physical inference about the environment. Hence intuitive physics is a core component of human commonsense knowledge and enables a wide range of object and scene understanding.
Neuromechanical motor experiments have shown that systematic parietal and frontal regions are engaged when humans perform physical inferences, even when simply viewing physically rich scenes. These findings suggest that these brain regions use a generalized mental engine for intuitive physical inference, that is, the brain’s “physics engine.” This indicates a very intimate relationship between the cognitive and neural mechanisms for understanding intuitive physics and the mechanisms for preparing appropriate actions, which in turn is a critical component linking perception to action.
Physics Reasoning in the Computer Vision Systems
The current breed of computer vision systems at best creates a 3D map of the visual information thrown at it. Statistical tools help us understand patterns in the generated 3D scenes, but not the naturally occurring complexity and ambiguity in the 3D map. For example, a 3D reconstruction of a scene with a glass on a table doesn't tell the system that the glass is prone to falling off the table and can break. Meta-reasoning has to be added on top of the 3D reconstruction of the scene to deduce such observations.
The above image is an example of a safety and stability scoring system for objects in a scene. It is constructed using a voxel scene representation, where the degree of instability is depicted by the ball of a pendulum: the farther the ball swings out, the more unstable the object.
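The stability intuition can be sketched with the classical support-polygon test (a deliberate simplification, not the voxel pipeline itself): an object is statically stable when its center of mass projects inside its base of support, and the distance to the nearest edge of the base gives a graded score:

```python
import numpy as np

def stability_margin(com_xy, support_min, support_max):
    """Signed margin of the center-of-mass projection w.r.t. an
    axis-aligned rectangular base of support.
    Positive = stable (distance to the nearest edge);
    negative = unstable (how far the COM overhangs the base)."""
    com = np.asarray(com_xy, float)
    lo = np.asarray(support_min, float)
    hi = np.asarray(support_max, float)
    margins = np.minimum(com - lo, hi - com)   # per-axis distance to the edges
    return float(margins.min())

# A glass sitting well inside the table surface: stable, margin ~0.3
print(round(stability_margin([0.5, 0.5], [0.0, 0.0], [1.0, 0.8]), 3))   # 0.3
# The same glass pushed past the table edge: unstable, margin ~-0.2
print(round(stability_margin([1.2, 0.4], [0.0, 0.0], [1.0, 0.8]), 3))   # -0.2
```

A graded score like this is exactly the kind of "dark" judgment a raw 3D reconstruction does not provide on its own.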
The above image shows the 13 physics concepts involved in understanding object dynamics. Material, density, volume, and mass are estimated using the 3D mesh reconstruction, while contact areas, momentum, impulse, pressure, force, velocity, and acceleration are tracked using the 3D trajectory reconstruction.
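Several of these quantities follow mechanically from a reconstructed trajectory via finite differences. A minimal sketch, assuming the mass has already been estimated and the trajectory is sampled at fixed intervals:

```python
import numpy as np

def dynamics_from_trajectory(positions, mass, dt):
    """Recover kinematic and dynamic quantities from a sampled 3D
    trajectory via finite differences (a stand-in for a full 3D
    trajectory reconstruction pipeline)."""
    p = np.asarray(positions, float)
    v = np.diff(p, axis=0) / dt            # velocity between frames
    a = np.diff(v, axis=0) / dt            # acceleration
    momentum = mass * v                    # p = m v
    force = mass * a                       # F = m a
    impulse = np.diff(momentum, axis=0)    # J = delta p = F dt
    return v, a, momentum, force, impulse

# Free fall sampled at 10 Hz: z drops as -(g/2) t^2
dt, g, m = 0.1, 9.8, 2.0
t = np.arange(5) * dt
traj = np.stack([np.zeros_like(t), np.zeros_like(t), -0.5 * g * t**2], axis=1)
v, a, mom, F, J = dynamics_from_trajectory(traj, m, dt)
print(np.allclose(a[:, 2], -g))        # True: recovered vertical acceleration is -g
print(np.allclose(F[:, 2], -m * g))    # True: recovered force equals the weight
```

The point is that once positions are tracked over time, momentum, force, and impulse are derived quantities rather than extra observations.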
Therefore, there are two major aspects of physics reasoning that we need to answer: a) stability and safety understanding, and b) physical relationships in 3D scenes. Altogether, the laws of physics, and how they relate to and among objects in a scene, are critical “dark” matter for an intelligent agent to perceive and understand; some of the most promising computer vision methods outlined above have understood and incorporated this insight.
Functionality and Affordance
Visual perception and scene understanding are based on affordance theory. An affordance is the possibility of an action that can be performed with an entity: the button in an elevator can be pressed; a chair affords being sat on. The functional understanding of objects is highly subjective to the intention or goal in mind. A screwdriver in a scene can be used to tighten a loosened screw, or it can be used to hammer a nail if there is no hammer at hand and hammering is what you require.
This is deeply related to the perception of causality, as to understand how an object can be used, an agent must understand what change of state will result if an object is interacted with in any way.
The Evolutionary dimension to the Functionality and affordance
The ability to use an object as a tool to alter another object and accomplish a task has traditionally been regarded as an indicator of intelligence and complex cognition, separating humans from other animals.
This ability to use a tool to perform an action is not always about optimizing the process; there is also a feeling of contentment and enjoyment that pushes organisms to use an object as a tool for a particular task. For example, New Caledonian crows can bend a piece of straight wire into a hook and use it to lift a bucket containing food from a vertical pipe. They behave optimistically after using tools, and effort cannot explain their optimism; instead, they appear to enjoy, or be intrinsically motivated by, tool use.
“The theory of affordances rescues us from the philosophical muddle of assuming fixed classes of objects, each defined by its common features and then given a name . . . You do not have to classify and label things in order to perceive what they afford . . . It is never necessary to distinguish all the features of an object and, in fact, it would be impossible to do so.”
Some researchers have formulated affordance and functionality recognition as a task-oriented object recognition problem, based on unraveling the physics, causality, and function of the objects in the scene. Therefore, any object can be used for certain tasks in some scenario, and memorizing objects and their functionality in different environments will not help us here.
Container and Containment Relationship
There has been a series of studies on how logical thinking develops in infants, and the famous experiment used is called the container and containment relationship problem. As early as two and a half months old, infants can understand the container and containment problem, i.e., they start to understand that one object can be encapsulated inside another object to retain it. This points to the field of commonsense reasoning and qualitative representations for reasoning, focusing on ontology, topology, first-order logic, and knowledge bases.
Another example in affordance-understanding research is the “chair.” Reasoning about its geometry and function proceeds through visual signals. Beyond a geometric and functional understanding of the object, an interactive understanding is required: the forces and pressures on body parts need to be analyzed to determine whether a particular object affords a particular use. For example, with chairs of different shapes and sizes, to find out which chair serves the purpose best, a human physiological component has to be added to the analysis.
Thus, affordance and functionality understanding is not only a question we are trying to answer w.r.t. the embodiment of AI systems; it is a question we are still exploring w.r.t. the understanding of the human cognitive system too.
Intentions and Goals
The ability of a system to imagine a desired future state of an object w.r.t. a defined task in a particular environment describes the intentionality of the system. Optimal intentions require rational actions in relation to the desired outcome, devising the most efficient possible action plan.
Perceiving and understanding the actions of an agent in an environment is driven by its beliefs and desires. The field of developmental psychology tries to understand the progression of the human understanding of intention. Up to 12 months, the human infant understands the actions of other humans as muscular movements without motives. Between 12 and 24 months, the infant develops a sense of environment and actions. After 24 months of environmental exposure, the child is able to understand the concept of an intention leading to an action and then to an outcome.
We as humans do not encode the complete details of human motion in space; instead, we perceive motion in terms of intent. It is this constructed understanding of actions in terms of the actors’ goals and intentions that humans encode in memory and later retrieve.
Rich Social Relationships and the Mental States
Adults can perceive and attribute mental states from nothing but the simple motions of geometric shapes. Subjects viewing the 2D motion of some geometric shapes made statements characterizing different shapes as different sexes and attached personalities to them. This points to the fact that humans make rich social inferences about these inanimate objects just by observing their motion. It remains unclear whether the demonstrated visual perception of social relationships and mental states was attributable more to the dynamic motion of the stimuli or to the relative attributes (size, shape, etc.) of the protagonists.
Intuitive Agency Theory
This embodies the rationality principle. The theory states that humans view themselves and others as causal agents:
- They devote their limited time and resources only to those actions that change the world in accordance with their intentions and desires.
- They achieve their intentions rationally by maximizing their utility while minimizing their costs, given their beliefs about the world.
The chasing-subtlety experiment above shows that a wolf is seen as chasing a sheep when it moves within a heat-seeking region, i.e., a 60-degree arc keeping the sheep at its center. The sheep perceives the movement of the wolf as the intention to chase, where 0 degrees, i.e., the head-on movement of the wolf, is considered the highest level of danger for the sheep. This result is consistent with the “rationality principle”: human perception assumes that an agent’s intentional action will be one that maximizes its efficiency in reaching its goal.
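The chasing-arc test reduces to a cone check: is the wolf's heading within 30 degrees either side of the direct wolf-to-sheep line? A minimal sketch (the 60-degree arc comes from the experiment; the function names are mine):

```python
import numpy as np

def chasing_deviation(wolf_pos, wolf_vel, sheep_pos):
    """Angle in degrees between the wolf's heading and the direct
    wolf-to-sheep line; 0 degrees is a head-on pursuit."""
    to_sheep = np.asarray(sheep_pos, float) - np.asarray(wolf_pos, float)
    v = np.asarray(wolf_vel, float)
    cos = to_sheep @ v / (np.linalg.norm(to_sheep) * np.linalg.norm(v))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def perceived_as_chasing(deviation_deg, arc_deg=60.0):
    """Inside the 60-degree chasing arc (30 degrees either side of
    head-on) the motion is perceived as pursuit."""
    return deviation_deg <= arc_deg / 2

# Head-on pursuit: 0-degree deviation, maximal perceived threat
d0 = chasing_deviation([0, 0], [1, 0], [5, 0])
# Oblique motion, 45 degrees off the sheep's direction: outside the arc
d45 = chasing_deviation([0, 0], [1, 1], [5, 0])
print(d0, perceived_as_chasing(d0))           # 0.0 True
print(round(d45), perceived_as_chasing(d45))  # 45 False
```

Small deviation implies efficient pursuit of the goal, which is why the perceived threat peaks at head-on motion.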
Intention for the AI systems
In order to better predict intent from pixel inputs, it is necessary to fully exploit comprehensive cues such as motion trajectories, gaze dynamics, body posture and movements, human-object relationships, and communicative gestures. Apart from the movement-based intention understanding mentioned above, gaze-communication understanding is also an important aspect of scene understanding. There are atomic-level gazes that are hard to interpret from the observer’s point of view to understand the individual’s intent, while event-level gaze communication is much easier to get hold of and points exactly toward the action being performed.
There are a lot of dark-matter cues that contribute to human intent analysis: facial expression, head pose, body posture and orientation, arm motion, gestures, proxemics, and relationships with other agents and objects. Imitating the same in the robotics domain is needed to attempt an understanding of social affordance (gazes), so as to build systems with intent understanding.
Utility and Preference
An agent makes rational decisions based on its beliefs and desires so as to maximize its expected utility. This is known as the principle of maximum expected utility.
In the classical definition, the utility that a decision-maker gains from making a choice is measured with a utility function: a ranking of an individual’s preferences such that U(a) > U(b) when choice a is preferred over choice b.
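A minimal sketch of the maximum-expected-utility principle (the actions, probabilities, and utilities below are made up purely for illustration):

```python
def expected_utility(outcomes):
    """outcomes: list of (probability, utility) pairs for one action."""
    return sum(p * u for p, u in outcomes)

def meu_choice(actions):
    """Pick the action that maximizes expected utility."""
    return max(actions, key=lambda a: expected_utility(actions[a]))

actions = {
    # action: [(probability of outcome, utility of outcome), ...]
    "take_umbrella":  [(0.3, 8), (0.7, 6)],    # rain / no rain
    "leave_umbrella": [(0.3, 0), (0.7, 10)],
}
print(round(expected_utility(actions["take_umbrella"]), 2))   # 6.6
print(round(expected_utility(actions["leave_umbrella"]), 2))  # 7.0
print(meu_choice(actions))                                    # leave_umbrella
```

Note that an observer who only sees the chosen action can fit a utility function to it without the agent ever computing one explicitly, which is exactly the point the next excerpt makes.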
The existence of a utility function that describes an agent’s preference behavior does not necessarily mean that the agent is explicitly maximizing that utility function in its own deliberations. By observing a rational agent’s preferences, however, an observer can construct a utility function that represents what the agent is actually trying to achieve, even if the agent does not know it.
The above excerpt means that you can define a function w.r.t. the activity performed by an entity, but you cannot tell whether that particular behavior is aimed at maximizing that output. For example, going back to the New Caledonian crows, who bend wire into hooks to extract food, we never know whether they are doing this for fun or from a utilitarian perspective. This makes it even harder for AI systems to understand the true nature of the subject whose utility function they must draft.
The article covered some of the areas (FPICU: functionality, physics, intent, causality, and utility) for which the current breed of AI systems has no answers, or at least for which no concrete system covering these areas has been built. There are deeper things we have not covered in this article, whose understanding is still a mystery even for humans: the emotion-chemical system, and morality, which is driven by belief as well as by the emotional center of the neural factory. If AI systems were to develop a religion, what would that religion look like? These questions are all blank slates for the brightest consciousnesses working in this field right now.
Have Faith in the Evolutionary progress
Considering where human evolution started and where it is now, let’s hope that the progressive nature of evolution, whose most prominent component right now is the progress in human cognition, leads in time to the emergence of this new creation: the human-imitating artificial system.
Breakthroughs are never systematic outcomes. They emerge from a tangential path, a surprise sitting on top of years of work which started in divergent directions.