While you are reading this article, you are paying “attention” to the text in front of you, a latent attentive behavior triggered by a need for a specific type of information. Your eyes focus on certain text, you cognitively decipher it, and that deciphering triggers neurons to fire and register it in your memory. All of this requires energy, and you don’t want any of that attentive energy to go down the drain on a process that yields no constructive outcome. So the question is: given a task, what exactly should you be attentive to in an environment so that you accomplish the task with minimal energy expenditure? This is where attention economics comes into play. Before getting into the computer-science domain of attention, let’s first see how neuroscientific attention works.
In physiological terms, attention can be defined as the alertness or ability to engage with the surrounding environment. This is very analogous to the word vigilance in psychology; attention is used synonymously with arousal, alertness, and vigilance in the neuroscience community.
To study attention patterns, researchers have taken subjects in different physiological states (sleep-deprived, on sedatives, in the normal sleep-wake cycle), asked them to perform a certain task, e.g., placing a ball in a square box in a specific region, and correlated the performance with EEG signals. Interestingly, performance did not degrade significantly when the task was backed by a fulfilling reward. In other words, more than the physiological state, it was the incentive to perform the task that drove the subjects’ attention levels.
Visual processing in the sensory system has been the dominant inspiration for modeling attention in the computer-science community, and visual paradigms are frequently used in studies meant to address more general, cognitive aspects of attention. Visual attention can be broken down into two categories: spatial and feature-based attention.
Rapid eye movements called saccades occur multiple times per second and indicate where in space the visual signal is currently focused. Light falling on the central region of the retina, called the fovea, receives the highest-acuity processing, and it is when the light signal lands on the fovea that attentive neural responses are triggered.
Researchers study these rapid eye movements across subjects viewing presented images to understand the areas of focus, i.e., the saliency in the image. The first few saccades are of the most interest: they are the most reflexive ones, unbiased by any prior spatial information. The high-level task also has an effect, e.g., the saccade pattern changes if the subjects are asked to guess the age of the person in the image.
In general, the changes associated with attention are believed to increase the signal-to-noise ratio of the neurons that represent the attended stimulus, however, they can also impact communication between brain areas. To this end, attention’s effect on neural synchrony is important. Within a visual area, attention has been shown to increase spiking coherence. When a group of neurons fires synchronously, their ability to influence shared downstream areas is enhanced. Furthermore, attention may also be working to directly coordinate communication across areas.
In feature-based attention, subjects are asked to attend to a particular kind of feature in the image, e.g., the eyes of a person, or the circles in the background of an image. Visual search tasks, such as looking for a pattern, trigger feature-based attention. Potential sources of top-down feature-based attention have been found in the prefrontal cortex, where sustained activity encodes the attended feature.
Congruency Conflict with other Sensory Modalities
In general, the use of multiple congruent sensory signals aids the detection of objects when compared to relying only on a single modality. Interestingly, some studies suggest that humans may have a bias for the visual domain, even when the signal from another domain is equally valid. Specifically, the visual domain appears to dominate most in tasks that require identifying the spatial location of a cue. This can be seen most readily in ventriloquism, where the visual cue of the dummy’s mouth moving overrides auditory evidence about the true location of the vocal source.
Overall, there are different ways in which a subject might be influenced to attend to an object: voluntary, goal-driven (top-down) attention, called endogenous attention, and stimulus-driven (bottom-up) attention, called exogenous attention.
Executive Control, Memory and Attention
Combining sensory inputs with past knowledge in order to coordinate multiple systems for efficient task selection and execution is the role of executive control, and this control is linked to the prefrontal cortex. A consequence of the three-way relationship between executive control, working memory, and attention is that the contents of working memory can impact attention, even when that is not desirable for the task. For example, if a subject has to keep an object in working memory while simultaneously performing a visual search for a separate object, the presence of the stored object in the search array can negatively interfere with the search. This suggests that working memory can interfere with the executive control of attention.
Beyond the flexible control of attention within a sensory modality, attention can also be shifted between modalities. Behavioral experiments indicate that switching attention either between two different tasks within a sensory modality (for example, going from locating a visual object to identifying it) or between sensory modalities (switching from an auditory task to a visual one) incurs a switching cost. This cost is usually measured as the extent to which performance is worse on trials just after the task has been switched versus trials where the same task is repeated. Interestingly, task switching within a modality seems to incur a larger cost than switching between modalities. Such findings are believed to stem from the fact that switching within a modality requires reconfiguring the same neural circuits, which is more difficult than merely engaging the circuitry of a different sensory system.
It is the job of the executive controller to be aware of the cost of these attention shifts and try to minimize them.
Memory is the oxygen of attention
- When subjects are asked to memorize a list of words while simultaneously engaging in a secondary task that divides their attention, their ability to consciously recall those words later is impaired (though their ability to recognize the words as familiar is not so affected).
- Implicit statistical learning can also be biased by attention.
- Even if memory retrieval does not pull from shared attentional resources, it is still clear that some memories are selected for more vivid retrieval at any given moment than others.
- Some forms of memory occur automatically and within the sensory processing stream itself. Priming is a well-known phenomenon in psychology wherein the presence of a stimulus at one point in time impacts how later stimuli are processed or interpreted. For example, the word “doctor” may be recognized more quickly following the word “hospital” than the word “school.”
- Adaptation can also be considered a form of implicit memory. Here, neural responses decrease after repeated exposure to the same stimulus. By reducing the response to repetition, changes in the stimulus become more salient.
Attention in AI
Well, the AI industry has been taken by storm over the past three years by papers on attention-based neural architectures performing at the edge of what pseudo-intelligence can do. Numerous papers have tried to solve the quadratic compute-complexity issue of the attention layer. One breed of transformers, namely Performer, Linformer, Synthesizer, and Linear Transformer, uses low-rank and kernel tricks to bring the complexity down from quadratic to n log n or even linear. Another breed, Set Transformer, Longformer, Routing Transformer, and BigBird, focuses on making transformers memory-efficient. All of these methods try to make the “brute-force” attention mechanism more efficient, but at heart the process is still brute force.
Let’s go one step back: what is attention in the neural-network domain?
Consider this sentence: “Attention is not all you need”. Assuming we all understand the concept of word embeddings, get the vector representation of each word and construct the embedding-based input matrix for the sentence. We take the dot product of each vector embedding with every other vector embedding to create a cosine-similarity score matrix; for six words, that is 36 similarity computations. Each score is then normalized by the sum of the scores computed for that word. Finally, each word’s normalized scores are used as weights to combine the embeddings into a vector called the context-aware embedding of input x.
We create query, key, and value weight matrices for all the words in the sentence. The query is the word in the sentence we want to attend from, the key is the word it is compared against (their dot product, once normalized, gives the similarity score), and the value is the vector embedding that gets weighted by that score. This attention score is computed for all the words in the sentence with respect to each other. For a better understanding, please refer to link 3 in the acknowledgment section.
So basically, attention is all normalized dot products multiplied with the word-embedding vectors, done for each entity/word. As we can see, this way of computing attention is brute force and deterministic in nature; it does not learn by itself which attention scores are actually required, nor compute only those.
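The dot-product-and-normalize procedure described above can be sketched in plain Python. The 2-d embeddings below are made up purely for illustration (real embeddings are learned and much higher-dimensional), and softmax is used for the normalization step, as standard transformers do:

```python
import math

# Toy 2-d embeddings for the six words of "Attention is not all you need".
# The numbers are invented for illustration; real embeddings are learned.
words = ["Attention", "is", "not", "all", "you", "need"]
emb = {
    "Attention": [1.0, 0.2], "is": [0.3, 0.8], "not": [0.5, 0.1],
    "all": [0.9, 0.4], "you": [0.2, 0.7], "need": [0.8, 0.6],
}

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

context = {}
for w in words:
    # 6 dot products per word -> 36 in total for the sentence
    scores = [dot(emb[w], emb[v]) for v in words]
    weights = softmax(scores)  # normalized similarity scores
    # context-aware embedding: weighted sum of all the embedding vectors
    context[w] = [sum(wt * emb[v][k] for wt, v in zip(weights, words))
                  for k in range(2)]

print(context["Attention"])
```

Every pair of words is scored regardless of relevance, which is exactly the brute-force quality the text complains about.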
Bayesian Attention Belief Network — A Probabilistic approach to Attention
A more sophisticated approach to the attention mechanism is the Bayesian Attention Belief Network. It constructs a decoder network by modeling the attention weights with a hierarchy of gamma distributions, and an encoder network by stacking Weibull distributions, yielding a model that can progressively learn to attend to different patterns. These models turn out to be remarkably robust to adversarial attacks. This school of thought, building a belief-updated attention map between the embeddings in the input space, helps us create evolving attention mechanisms. Now let’s see what attention looks like in a general-intelligence system, in terms of attention economics.
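The core idea of treating attention weights as random variables rather than deterministic values can be illustrated with a minimal sketch. This is not the actual Bayesian Attention Belief Network (which learns hierarchies of gamma/Weibull parameters end to end); it only shows, under invented parameters, what sampling attention weights from Weibull distributions looks like:

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is reproducible

def stochastic_attention(scores, shape=2.0):
    """Sample attention weights from Weibull distributions whose scale
    tracks the deterministic scores, then renormalize.

    Illustrative sketch only: the real Bayesian Attention Belief Network
    stacks learned gamma/Weibull hierarchies, not this fixed recipe."""
    draws = [random.weibullvariate(math.exp(s), shape) for s in scores]
    total = sum(draws)
    return [d / total for d in draws]

weights = stochastic_attention([0.9, 0.1, 0.4])
print(weights)  # a fresh draw each call: attention is now a distribution
```

Because each forward pass samples a different attention map, an adversary cannot rely on one fixed gradient path, which is one intuition for the robustness claim above.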
AGI and Attention
In a computer-based general-intelligence system, attentive energy is computational energy: the resources required to perform the logical operations contributing to the desired output. The management of these system resources is typically called “artificial attention”: a resource-management and control mechanism that assigns limited system resources to processing the most relevant information.
Artificial attention has three dimensions: abundant information, limited resources, and time constraints. We need to define an attention process that finds a sweet spot among the three.
The field of narrow AI has a preprogrammed regime: which information to process for attention, how much resource it needs, and even the time complexity of the algorithm (e.g., O(n²)) are all fixed at design time. This is not the case with AGI systems; a predefined paradigm will not work, a dynamic one is needed. Thus, when the task is partially defined or unspecified at design time, the following are unknown:
- Data Relevance: What kind of information is relevant to the system’s operation?
- Process Relevance: How frequently does the system have to sample information?
- Operational Efficiency: How quickly does the system need to make decisions?
- Resource Requirements: How much resource does the system need?
An AGI system has to generate the answers to the above questions by interacting with its operating environment, and it needs to be capable of autonomous self-reconfiguration at the architecture level.
Let’s walk through a general attention process. It starts with the real-world environment, with data of different modalities being sampled from sensory devices. A data-biasing component makes sure the data is matched against the attention patterns derived from the end goal. The data items are then sent to the process-relevance stage. Process relevance triggers the actuation devices, i.e., it decides which data modality deserves more focus and provides weights for the sampled data. Contextual process evaluation is defined as a heuristic within a process. The data-and-process mapping then adds the process weight to the samples, on top of the actuation weights, so that sampling can be made efficient.
Now, let’s move to a specific paradigm of the attention mechanism, defined within the OpenCog system.
Economics Attention Network — ECAN
Attention in OpenCog is embodied in atoms, the lowest-level information entities in the hypergraph that OpenCog uses to represent knowledge. To understand what atoms and the AtomSpace are, consider reading my blog on the same.
Attention involves keeping track of the importance of the atoms in the AtomSpace. There is an element called a MindAgent: a software module, think of it as a thread. A MindAgent pays for the attention that atoms consume, and atoms are in turn rewarded with importance when they help achieve system goals. New MindAgents can be added at any time while a process is running on an OpenCog server, whenever new atoms are queried. Each atom in the attentional focus carries an attention value.
There are subtypes of MindAgents:
- ImportanceUpdatingAgent — pays wages to and collects rent from atoms in the form of short-term importance (STI) and long-term importance (LTI) weights.
- ImportanceDiffusionAgents — spread STI along HebbianLinks. HebbianLinks are connections defined between atoms that indicate which atoms tend to be important at the same time. If an atom comes into the attentional focus of a process, its HebbianLinks hold the relevance of that atom within the process.
- ImportanceSpreadingAgents — take the excess importance an atom has and spread it along its HebbianLinks. Excess importance is the amount of STI above the importance-spreading threshold, which is usually greater than the attentional-focus boundary, ensuring that importance spreading does not remove atoms from the attentional focus.
- ForgettingAgents — remove atoms with low LTI.
- HebbianLinkUpdatingAgents — update the weights of HebbianLinks based on the attentional focus of the OpenCog instance at the time a process executes.
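The diffusion step above can be sketched with a toy example. The atom names, link strengths, and diffusion rate below are all invented for illustration and this is not the actual OpenCog API, but it shows the basic mechanic: each atom gives away a share of its STI along its HebbianLinks, in proportion to link strength:

```python
# Hypothetical sketch of STI diffusion along HebbianLinks; the names and
# the exact diffusion rule are illustrative, not the OpenCog implementation.
sti = {"cat": 100.0, "animal": 20.0, "mat": 5.0}

# HebbianLinks: (source, target, strength) -- atoms important together.
hebbian_links = [("cat", "animal", 0.8), ("cat", "mat", 0.2)]

DIFFUSION_RATE = 0.5  # fraction of an atom's STI available to spread

def diffuse(sti, links, rate=DIFFUSION_RATE):
    updated = dict(sti)
    for src, dst, strength in links:
        amount = sti[src] * rate * strength  # share proportional to link
        updated[src] -= amount
        updated[dst] += amount
    return updated

sti = diffuse(sti, hebbian_links)
print(sti)  # total STI is conserved; it only moves between atoms
```

Note that STI behaves like a conserved currency here: diffusion redistributes attention rather than creating it, which is the economic flavor of ECAN.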
Understand the flow
- Atoms are given stimulus by a MindAgent if they have been useful in achieving the MindAgent’s goals.
- STI is spread between atoms along HebbianLinks, either by the ImportanceDiffusionAgent or the ImportanceSpreadingAgent.
- The HebbianLinkUpdatingAgent updates the HebbianLink truth values, based on whether linked atoms are in the Attentional Focus or not.
- The ForgettingAgent removes atoms that fall below a threshold LTI.
Economic Action Selection
OpenCogPrime has an object called the Active Schema Pool, which contains the schemata that are currently “active”. It checks which schema needs to be executed based on the schema status. Each schema consists of a set of modules, and the role of action selection is to choose which modules to execute during a given cycle, and which executing modules to pause. The modules within a schema are related in two ways:
- Module 1 transfers its STI currency to Module 2.
- Module 1’s termination leads to Module 2’s initialization.
When a contradiction exists between executing Module M1 and Module M2, the module that can pay the most wins.
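This highest-bidder rule is simple enough to sketch directly. The module names and currency amounts below are hypothetical, made up purely to illustrate the auction-style resolution:

```python
# Illustrative sketch of economic action selection: when two modules
# conflict, the one that can pay the most STI wins. Names are hypothetical.
def select_module(modules):
    """modules: dict mapping module name -> available STI currency."""
    return max(modules, key=modules.get)

conflict = {"M1": 35.0, "M2": 50.0}
winner = select_module(conflict)
print(winner)  # M2 can pay the most, so it wins
```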
Economic Goal Selection
Goals are not any specific objects; rather, any atom may be taken as a goal. Goals with LTI currency are considered supergoals.
Mathematical Intuition Behind the Attention Economics
Short-term importance can be formulated as the current importance value + wage (attention income) − rent (attention expenditure). The rent can be formulated in two ways. With static rent, if the importance value is above a certain threshold, s_af, the rent charged is the rent value itself, else 0. With linear rent, the charge is weighted by (s_i − s_af) and normalized by (recentMaxSTI − s_af), else 0.
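A small sketch of this wage/rent update, with invented constants (the actual ECAN parameter values and names differ; this only follows the formulation in the text):

```python
# Hedged sketch of the STI update: new_sti = sti + wage - rent.
# RENT, S_AF, and recent_max are illustrative placeholders.

RENT = 1.0   # base rent charged per cycle
S_AF = 10.0  # attentional-focus boundary threshold (s_af in the text)

def static_rent(sti):
    # full rent if the atom sits above the focus boundary, else nothing
    return RENT if sti > S_AF else 0.0

def linear_rent(sti, recent_max):
    # rent scaled by how far above the boundary the atom sits
    if sti <= S_AF:
        return 0.0
    return RENT * (sti - S_AF) / (recent_max - S_AF)

def update_sti(sti, wage, recent_max, linear=False):
    rent = linear_rent(sti, recent_max) if linear else static_rent(sti)
    return sti + wage - rent

print(update_sti(20.0, 2.0, recent_max=50.0))               # static rent
print(update_sti(20.0, 2.0, recent_max=50.0, linear=True))  # linear rent
```

Rent drains attention from atoms that linger in the focus, while wages reward atoms that actually contribute, so importance reflects ongoing usefulness rather than past glory.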
The conjunction importance c_ij, the value of a HebbianLink, is formulated using a norm function on the STI values, normalized by recentMaxSTI (or recentMinSTI for negative values). A conjunction decay factor is defined that weights the newly computed conjunction value against the previous conjunction value; their weighted sum makes up the updated conjunction value. That is how the HebbianLink value is computed.
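One plausible reading of this decay-weighted blend is an exponential moving average over the normalized STI product. The function names, the decay value, and the exact norm below are assumptions for illustration, not the OpenCog implementation:

```python
# Hedged sketch of the HebbianLink (conjunction) update as described above.

def norm_sti(sti, recent_max_sti, recent_min_sti):
    # normalize STI into [-1, 1]: positive values against the recent max,
    # negative values against the recent min
    if sti >= 0:
        return sti / recent_max_sti
    return sti / abs(recent_min_sti)

def update_conjunction(c_prev, sti_i, sti_j, decay=0.9,
                       recent_max=100.0, recent_min=-100.0):
    # new conjunction: product of the two atoms' normalized STIs
    c_new = (norm_sti(sti_i, recent_max, recent_min)
             * norm_sti(sti_j, recent_max, recent_min))
    # decay-weighted blend of the previous and new conjunction values
    return decay * c_prev + (1.0 - decay) * c_new

print(update_conjunction(0.5, sti_i=80.0, sti_j=60.0))
```

With a decay near 1, the link value changes slowly, so a HebbianLink encodes a long-run history of two atoms being important together rather than a single coincidence.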
I have been working to understand the current implementation of the same by forking the code repo. I feel there could be gaps between the system as it actually is and the understanding of it that I propose here, but that’s where the fun of learning comes in: iteratively covering up the gaps in your understanding of a system.
There are certain things in the OpenCog attention-allocation system that have been proposed as future work or are still in development. You can check out the documentation for the same.
Attention in general intelligence involves managing the reasoning energy while making sure you achieve the expected goal. This involves understanding the environment and which variables to focus on more, something that can change dynamically over time. The expected goal itself can have sub-goals to achieve, which recursively requires executing the very phenomenon just defined.
This field is still very much a dark room in which people have been shooting without much success, and overall the same applies to the field of general intelligence itself.
Let me know your thoughts or ideas in the comment, to enlighten me in the right direction.
Till next time, stay safe ☺ !!