AI Explanation Mining — My Thesis Anecdote

Recently, I finished my Master’s thesis, which sits in the domain of AI explanation alignment. There is already work on AI alignment, i.e. pressing an AI system’s learning policy to be coherent with the moral expectations of a human-developed ethical system. There is, however, very little research that lays a foundation for associating the explanations generated for specific AI algorithms with a tangible property observed in human behaviour. Explaining any decision is, in general, a process of elaborating on the hierarchy of causal reasoning behind it: it unravels the property of the process that produces the observed behaviour. For example, suppose the word “heavy” in a sentence is given a high explanation weight in a classification problem. How do we justify that the word “heavy” is correctly understood by the explanation system, and subsequently by the AI model or, in general intelligence terms, the seed AI?

One domain that has been worked on for some years is human emotions. Emotions are proxied by sentiments, a domain for which numerous datasets and traditional ML and DL models have been developed. Sentiment analysis has been extended into polarity mining and polarity classification, and a large volume of this work targets customer reviews. The point is that a lot of engineering effort has gone into this emotional neurodynamics domain, but very few have taken a step back to understand human emotions from a neurobiological perspective and then worked on inspired engineering efforts towards a seed AI that truly learns to mimic human emotional intelligence. It is also worth understanding why inferring emotions from physical cues is not an ethically sound practice and is a false, pseudo-causal indicator.

Let’s start with a bird’s-eye view of the history of explainable AI, then look at some methods applied in the explainable AI domain to generate reasoning across different fields, and then narrow down to the one method I inspected in my thesis work and aligned with the sentiment classification problem. The sentiment classification problem was in turn aligned with the human psychological trait of neuroticism, to validate whether the explanations generated by the method of interest align with the markers of neuroticism in text. This deduction of a person’s psychological state from textual cues comes under the field of personality computing.

Explainable and Ethical AI — Cosmic Perspective

What exactly is a good explanation, and what does it mean for an AI to be ethical?

Scope of Explainable Artificial Intelligence

The question of what exactly an explanation is needs a conference of its own. The literature specifically discusses social biases in the explanation of a phenomenon and the involvement of counterfactuals. Explanation is a cognitive psychological phenomenon that involves the individual’s perceptual field and historic experiential knowledge base. First, explanations are “counterfactual”: they address why the process followed one specific path and not any other possibility. Second, the explainer should be able to empathise with the cognitive field of the “explainee” while explaining; this is where event semantics or terminology might differ when explaining the same procedure to a diverse audience. It is instructive to see how explanations of the same phenomenon, including AI, differ across a spectrum of audiences. In simple terms, a precise explanation is a lossless transfer of information from one person to another, bringing transparency to a specific topic of discussion. There is another, less discussed aspect of explanation: understanding-validation feedback. Once an explanation has been put forward, a signal needs to travel back to the explainer to acknowledge that the person on the other side has grasped exactly the cognitive abstraction the explainer tried to convey.

The “ethical-ness” of AI is again bound to a human-centric answer. The moral values that humans abide by are what we expect the seed AI or the trained models to learn. Yet within the human sociological system itself there are conflicts. For example, if killing a being is immoral in general, then killing a chicken for food is also unethical. A future bot that learns about human food culture from human social values would conclude that killing something for consumption is not a wrong deed. The premise under which you kill holds the most value here. But if the bot learns that it is perfectly fine to kill for food, and its owner says “I am really hungry” when there is nothing in the kitchen to eat, would the bot go out to hunt and kill something to fulfil its master’s wish? The term “something” is abstract and a cause for concern, since we humans, as omnivores, have no defined boundary for what we can or cannot consume in a crisis. Take the situation to the worst case: only two humans and a bot are left on the planet, and you tell the bot that you are dying of hunger and need help. Would it kill the other person for your survival, since it has learnt from humans themselves that they are omnivores and hunt to survive? The word “survival” needs attention here, since extreme scenarios help us define the edge of the morality hyperplane, where sociologically learnt biases are overridden by elementary natural instincts.

So what are human-centric ethics, and are those the ethics that should be injected into AI, considering that the environment these entities interact with may become intergalactic in time? The decisions taken by the bot at a cosmic level would then be only human-aware and not cosmic-aware. Stretch the situation to the edge of the edge, where killing a person might help them enter a parallel dimension and escape the mortal flesh. Such a value would never be understood by, or embedded in, an AI that learnt from human interaction. It would never be able to help humans make optimised cosmic-level decisions, because humans themselves have never experienced them. The blind spot of ethics and human experience!

A dynamic update of ethics as human awareness of the cosmos expands is a more sustainable path, but there is no way to cover blind spots, and that is why they are called “blind spots”. The better term, in my view, is “narrow AI ethics”: ethics that consider the premise of the application and the laws dictating the terms of use of AI in that domain. So the exact answer needs a subject-matter expert (SME) of the particular domain, and even there conflicts are bound to arise, since as humans we do want to incentivise our decisions in a business context. This is where localised human interest comes into play: getting the most out of the current conscious span, with a full stop at physical mortality.

The last two years have been quite stormy for the AI community. Issues surfaced in the large language models engineered by big tech companies (not naming any here). There were some unjustified firings from the ethics departments of those organisations. I consider the individuals who blew the whistle on the issues with the fancy trillion-parameter language model paradigm to be heroes. Several researchers have worked in this direction, and their work in the AI ecosystem deserves genuine applause.

The first tangible work in explainable AI systems goes back to expert systems (the 1960s). Expert systems worked on the premise of domain knowledge curated by experts, over which the bot, model or seed AI performed its actions. The very first expert system was Dendral (1965), a project in the bioinformatics domain that determined the structure of unknown molecules. Edward Feigenbaum, who worked on it with his colleagues, is considered the father of expert systems. Intelligence based on a knowledge base plus an inference architecture started to lose steam as rule sets became intractable and memory limitations made it hard to keep track of the domain intellect, and so expert systems began to lose their colour.

In terms of the ethical and explainable dimensions, expert systems fared at a balanced level. The “explainable-ness” of the system was a transparent affair: the system was written down as rules crafted by domain experts, and backtracking the inference to build the causal path was a fair process with no ambiguity in the methodology. The only concern with the explanation, and the system as a whole, was the cognitive and knowledge stance of the individual or group who constructed the domain rules and the pool of knowledge graphs. That stance could be a localised construction of intellect, while the system might be exposed for inference across any social or geographical area on the planet. For example, if an expert system were used for loan approval, rules drafted by loan providers in one region might fail to lend fairly to an individual in another region, not even across international boundaries but within a single state. This custom-ness of rules, blindfolded by expert-centric social and knowledge bias, was one of the prominent ethical issues with expert systems.

Methods Applied in AI Explanation — Human-Centric View

This section gives an overview of the explanation methods defined in the current AI ecosystem. Again, as mentioned in the section above, the premise of an explanation is human-centred. The relation between the explainer and the explainee is subjective: a self-driving car’s engineer has different explainability requirements than the rider in that car.

AI explanation methods fall into three broad categories.

  1. Spatial locality: this considers how much of the input space an explanation covers, and splits into local and global methods. Local methods explain a single prediction, while global methods take the whole feature map or feature set into consideration.
  2. Specificity: model-specific methods apply to one model family, for example linear regression weights, which directly show each feature’s influence on the output. The same reading cannot be applied to neural networks, since the huge weight matrices make no real-world sense on their own. Model-agnostic methods, by contrast, treat the model as a black box.
  3. The flow of the signal: gradient-based methods trace the signal through backpropagation, while perturbation-based methods perturb the input and probe its effect on the output in the forward pass.

Saliency Maps

These are specific to computer vision problems. A saliency map is obtained by backpropagating and computing the gradient of the logits with respect to the input of the network. There are three main approaches to getting saliency maps for an image.

Deconvolution Method

To recognise which features in the input image an intermediate layer of the network is looking for, a deconvolution operation is applied to that intermediate layer. Because max pooling is non-invertible, a concept of switches is used: the switches record the positions of the maxima in the forward pass so that they can be recovered in the reconstruction.

Backpropagation Method

I am assuming the reader understands how neural nets work in terms of the loss function, cost function, logits and backpropagation; most of the terms used here are neural-network jargon. Saliency maps backpropagate from the output score, not the loss value. The main difference between the two methods is how they handle ReLU in the backward pass: the deconvolution method applies the ReLU to the reconstruction signal itself, passing only its positive values, whereas the backpropagation method passes the signal only where the input to the ReLU was positive in the forward pass.
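
As a rough illustration of backpropagating from the output score, here is a minimal sketch on a tiny one-hidden-layer ReLU network. The weights `W1` and `w2`, the input `x` and the function names are made up for this example; a real saliency map would come from a trained CNN and an autodiff framework.

```python
def score(x, W1, w2):
    """Class score (logit) of a tiny one-hidden-layer ReLU network."""
    pre = [sum(w * xi for w, xi in zip(row, x)) for row in W1]
    hidden = [max(0.0, p) for p in pre]
    return sum(w * h for w, h in zip(w2, hidden))


def saliency(x, W1, w2):
    """Gradient of the class score with respect to each input feature."""
    pre = [sum(w * xi for w, xi in zip(row, x)) for row in W1]
    # ReLU passes the gradient only where its forward input was positive.
    upstream = [w if p > 0 else 0.0 for w, p in zip(w2, pre)]
    return [sum(upstream[j] * W1[j][i] for j in range(len(W1)))
            for i in range(len(x))]


# Hypothetical weights and input, chosen only for illustration.
W1 = [[1.0, -2.0, 0.5], [-1.0, 1.0, 2.0]]
w2 = [2.0, -1.0]
x = [1.0, 0.2, -0.3]

grads = saliency(x, W1, w2)
saliency_map = [abs(g) for g in grads]  # magnitude per input feature
```

The absolute value of the gradient is one common way to turn the gradient into a per-feature saliency; other choices are possible.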


Removing an artefact from the image would enhance the model’s confidence in the entity of interest. This is one of the most widely adopted approaches across engineering fields: you remove or change one parameter in the system and observe how the system responds to an otherwise fixed input.

Guided backpropagation

Instead of masking the signal based only on the positions of negative values of the input signal in the forward pass (backpropagation), or only on the negative values of the reconstruction signal flowing from top to bottom (deconvolution), guided backpropagation masks the signal if either of these cases occurs.
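
The three ways of passing a signal backward through a ReLU can be put side by side in a small sketch. The function name `relu_backward`, the rule labels and the example vectors are assumptions made for this illustration, not code from any of the original papers.

```python
def relu_backward(forward_input, upstream, rule):
    """Pass a backward signal through a ReLU under one of three rules."""
    result = []
    for x, g in zip(forward_input, upstream):
        if rule == "backprop":
            keep = x > 0            # mask where the forward input was negative
        elif rule == "deconvnet":
            keep = g > 0            # mask where the backward signal is negative
        elif rule == "guided":
            keep = x > 0 and g > 0  # mask if either case occurs
        else:
            raise ValueError("unknown rule: " + rule)
        result.append(g if keep else 0.0)
    return result


# Hypothetical forward inputs to the ReLU and signal arriving from above.
forward_input = [1.0, -1.0, 2.0, -3.0]
upstream = [0.5, 0.7, -0.2, -0.4]

by_rule = {rule: relu_backward(forward_input, upstream, rule)
           for rule in ("backprop", "deconvnet", "guided")}
```

Guided backpropagation keeps only the first entry here, because it is the only position where both the forward input and the backward signal are positive.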

Figure: differences between the different propagation-based techniques.

Unreliability in Saliency Maps

Consider the contours and decision boundary of a loss function (L) for a two-class classification task, together with the direction of the gradient of the loss with respect to the input space. Neural networks with many parameters have decision boundaries that are roughly piecewise linear with many transitions (Goodfellow, Shlens, and Szegedy 2014). Points near the transitions are especially fragile to interpretability-based analysis: a small perturbation to the input changes the direction of ∇xL from horizontal to vertical, directly affecting feature-importance analyses.

Class Activation Maps

In this breed of methods, the fully connected layer at the end is replaced by global average pooling, which averages the activations of each feature map and concatenates these averages into a vector. A weighted sum of this vector is then fed to the final softmax layer. Projecting the output weights back onto the convolutional feature maps gives the class activation map.
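
A minimal sketch of this projection, assuming two hypothetical 2x2 feature maps and made-up class weights: the class score is the weighted sum of the pooled averages, and the class activation map applies the same weights at every spatial position.

```python
def global_average_pool(feature_map):
    """Average all activations of one 2-D feature map."""
    values = [v for row in feature_map for v in row]
    return sum(values) / len(values)


def class_activation_map(feature_maps, class_weights):
    """Weighted sum of the feature maps at every spatial position."""
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    return [[sum(w * fmap[i][j] for w, fmap in zip(class_weights, feature_maps))
             for j in range(width)]
            for i in range(height)]


# Hypothetical last-layer feature maps and output weights for one class.
feature_maps = [
    [[1.0, 0.0], [0.0, 3.0]],
    [[0.0, 2.0], [2.0, 0.0]],
]
class_weights = [0.5, -1.0]

pooled = [global_average_pool(f) for f in feature_maps]
class_score = sum(w * p for w, p in zip(class_weights, pooled))
cam = class_activation_map(feature_maps, class_weights)
```

Because pooling and the weighted sum are both linear, the spatial average of `cam` equals `class_score`, which is why the map can be read as a per-location breakdown of the score.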

Grad-CAM is a more versatile version of CAM that can produce visual explanations for any arbitrary CNN even if the network contains a stack of fully connected layers too.

Here k indexes the activation maps in the last convolutional layer and c is the class of interest. Alpha is computed as the importance of feature map k for the target class c.

To only consider the pixels that have a positive influence on the score of the class of interest, a ReLU nonlinearity is also applied to the summation:
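
For reference, the Grad-CAM definitions as commonly stated can be written as below. The notation is assumed here because the original figure is not available in this text: y^c is the score of class c, A^k_{ij} is the activation of feature map k at position (i, j), and Z is the number of spatial positions.

```latex
\alpha_k^c = \frac{1}{Z} \sum_{i} \sum_{j} \frac{\partial y^c}{\partial A^k_{ij}},
\qquad
L^c_{\text{Grad-CAM}} = \mathrm{ReLU}\!\left( \sum_{k} \alpha_k^c \, A^k \right)
```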

The Grad-CAM visualization is combined with the guided backpropagation visualization to create a Guided Grad-CAM visualization.

The above methods, CAM and saliency maps, are specific to convolutional neural networks.

Layerwise Relevance Propagation

This method uses the neural network weights and the activations created by the forward pass to propagate the output back through the network, all the way to the input layer. There, we can visualize which pixels really contributed to the output. It is a conservative technique, meaning the value of the target output logit is conserved through the backward propagation and equals the sum of the relevance map R of the input layer. This holds for any two consecutive layers j and k and, by transitivity, for the input and output layers.

Here j and k are neurons in two consecutive layers, a is the activation of the respective neuron, and w is the weight between the two neurons.
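
The conservation property can be checked with a small sketch of the basic LRP rule, in which neuron j receives from neuron k a share of relevance proportional to a_j * w_jk. This is one of several LRP variants; the function name, the `eps` stabiliser and the numbers are assumptions made for illustration.

```python
def lrp_linear(activations, weights, relevance_out, eps=1e-9):
    """Redistribute relevance from an upper layer (k) to a lower layer (j).

    weights[j][k] connects lower neuron j to upper neuron k.
    """
    n_in, n_out = len(activations), len(relevance_out)
    relevance_in = [0.0] * n_in
    for k in range(n_out):
        z = [activations[j] * weights[j][k] for j in range(n_in)]
        total = sum(z)
        # Small stabiliser (an assumption) to avoid division by zero.
        denom = total + (eps if total >= 0 else -eps)
        for j in range(n_in):
            relevance_in[j] += z[j] / denom * relevance_out[k]
    return relevance_in


# Hypothetical activations, weights and upper-layer relevance.
activations = [1.0, 2.0, 0.5]
weights = [[0.5, 1.0], [0.25, -0.5], [1.0, 2.0]]
relevance_out = [1.5, 0.5]

relevance_in = lrp_linear(activations, weights, relevance_out)
```

Summing `relevance_in` gives back the total of `relevance_out` up to the stabiliser, which is the layer-to-layer conservation described above.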

Under the hood, the method uses the Taylor series, which approximates a function by a polynomial. It is the expansion of a function about a single point: an infinite sum of terms built from the function’s derivatives at that point.
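
In standard notation, with a as the expansion point (a symbol assumed here), the series and its first-order truncation, which is the form typically used for relevance decomposition, read:

```latex
f(x) = \sum_{n=0}^{\infty} \frac{f^{(n)}(a)}{n!} \, (x - a)^n
\qquad\Longrightarrow\qquad
f(x) \approx f(a) + f'(a)\,(x - a)
```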

Taylor series Expansion formula

Local Interpretable Model-agnostic Explanations (LIME)

LIME fits a linear model to a local region of the decision boundary and then uses the weights of that linear model as the explanation features.

  1. Perturb the data around one observation (LIME is a local, post hoc interpretation technique).
  2. Calculate the distance (similarity score) between the perturbed samples and the original observation.
  3. Make predictions on the new data using the complex model.
  4. Take N features at random from the whole feature set and see which combination of N features gives the best likelihood score for the class that the complex model predicted.
  5. Fit a linear model to the perturbed data using those N features, weighting the data points by their similarity scores.
  6. Use the feature weights of the simple model as the explanation of the complex model’s local behaviour.
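
The steps above can be sketched in plain Python as follows. This is a simplified illustration, not the LIME library: it skips the feature-selection step and keeps all features, and the Gaussian perturbation, the exponential kernel, the parameter defaults and the `black_box` function are all assumptions.

```python
import math
import random


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def lime_explain(black_box, instance, n_samples=500, sigma=0.1,
                 kernel_width=0.25, seed=0):
    rng = random.Random(seed)
    rows, targets, weights = [], [], []
    for _ in range(n_samples):
        # Step 1: perturb the observation.
        z = [xi + rng.gauss(0.0, sigma) for xi in instance]
        # Step 2: similarity score from the distance to the observation.
        dist2 = sum((zi - xi) ** 2 for zi, xi in zip(z, instance))
        weights.append(math.exp(-dist2 / kernel_width ** 2))
        # Step 3: query the complex model on the perturbed sample.
        targets.append(black_box(z))
        rows.append([1.0] + z)  # leading 1.0 is the intercept column
    # Step 5: weighted least squares, (X^T W X) beta = X^T W y.
    p = len(instance) + 1
    A = [[sum(w * r[a] * r[b] for w, r in zip(weights, rows))
          for b in range(p)] for a in range(p)]
    rhs = [sum(w * r[a] * t for w, r, t in zip(weights, rows, targets))
           for a in range(p)]
    beta = solve(A, rhs)
    # Step 6: the slopes are the local explanation.
    return beta[1:]


def black_box(z):
    """Hypothetical complex model: non-linear in z[0], linear in z[1]."""
    return z[0] ** 2 + 3.0 * z[1]


explanation = lime_explain(black_box, [1.0, 0.0])
```

Around the point (1, 0) the true local slopes of this toy model are 2 and 3, so the fitted weights should land close to those values.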

Shapley Values

This method is of the highest interest in this blog, since the explanation mining discussed in the coming sections focuses on it. But for now, back to the Shapley method. SHAP came out of the University of Washington and builds on Shapley values from cooperative game theory, whose originator, Lloyd Shapley, shared the Nobel Prize in Economics in 2012.

Interpolating the concept to models and features, each feature becomes a player and the model outcome becomes the result of the game. The algorithm has some issues: computing the Shapley values takes exponential time as the number of features grows, and it is unclear how to run a model with missing features. The SHAP library solves many of these implementation problems. The SHAP Kernel explainer approximates the Shapley values and fills in missing features from the training set, from K-means representatives, or from the median of the dataset. The SHAP Tree explainer produces explanations in polynomial time, keeping track of the tree traversal to avoid repetitions. The SHAP Deep explainer uses backpropagation, building on the ideas from DeepLIFT.

Shapley value attributions can be aggregated over different data points to generate an aggregated explanation.
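
To make the players-and-game picture concrete, here is a brute-force sketch of exact Shapley values for a toy model. It enumerates every coalition, so it shows the exponential cost mentioned above; treating a missing feature as "replaced by a baseline value" is one possible convention assumed here, and `model`, `instance` and `baseline` are made up.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, instance, baseline):
    """Exact Shapley values by enumerating all feature coalitions."""
    n = len(instance)

    def value(coalition):
        # Features outside the coalition are "missing": use the baseline.
        mixed = [instance[i] if i in coalition else baseline[i]
                 for i in range(n)]
        return model(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                members = set(coalition)
                total += weight * (value(members | {i}) - value(members))
        phi.append(total)
    return phi


def model(x):
    """Hypothetical model: one additive term and one interaction term."""
    return 2.0 * x[0] + x[1] * x[2]


instance = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(model, instance, baseline)
```

The additive feature gets exactly its own contribution, the two interacting features split their joint effect equally, and the values sum to the difference between the prediction and the baseline prediction.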


DeepLIFT learns feature importance through activation differences. It is an example-specific, post hoc interpretation method. This backprop-based approach overcomes an issue faced by vanilla backpropagation-based explanation methods. It uses a reference activation, i.e. the activation with respect to a neutral input. Even if the gradient dies off, the difference in activation with respect to the reference value is not zero. Basically, this overcomes the saturation and thresholding problems that gradients suffer from around a threshold boundary.

For a specific layer and neuron, we determine how much the value of the target neuron t changes with respect to its reference value when the inputs to the neuron are changed. The change in each input, multiplied by a multiplier that relates it to the change in the target neuron, is called that input’s contribution. The reference value is the activation of the neuron when the network is supplied with a reference input. The choice of reference input depends on domain knowledge.

The approach is conceptually extremely simple but tricky to implement. DeepLIFT recognizes that what we care about is not the gradient, which describes how y changes as x changes at the point x, but the slope, which describes how y changes as x differs from the baseline. In fact, if we consider the slope instead of the gradient, we can redefine the importance of a feature as the slope multiplied by the feature’s difference from the baseline.
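
A single saturating unit is enough to show the difference between gradient and slope. This is only a one-input sketch under assumed names and values; the actual DeepLIFT algorithm propagates such multipliers layer by layer through the network.

```python
def saturating_unit(x):
    """y = min(x, 1): responds to x up to 1, then saturates."""
    return min(x, 1.0)


def gradient(f, x, eps=1e-6):
    """Central finite-difference estimate of df/dx at x."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)


def slope(f, x, baseline):
    """Change in output per unit change in input, relative to the baseline."""
    return (f(x) - f(baseline)) / (x - baseline)


# Hypothetical input in the saturated region and a neutral baseline.
x, baseline = 2.0, 0.0

grad_importance = gradient(saturating_unit, x) * x
slope_importance = slope(saturating_unit, x, baseline) * (x - baseline)
```

The gradient-based importance is zero because the unit is saturated at x = 2, while the slope-based importance recovers the full change in output between the baseline and the input.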

How to pick a baseline:

We look at the distribution of instances on which the model was trained, i.e. its prior distribution. For example, if 2% of the training images were laptops, then with no further training the model tends to carry a small prediction bias for that instance type. Averaging over the class distribution, we can find the class type that should be used as a baseline. How to find the correct baseline for a DeepLIFT model is still an area of research.

The methods above are some of the explanation methods currently being adopted in industry. Most of them are still used within restricted ecosystems and are not opened up to production, since a layer of inspection and validation, specific to the problem use case, needs to be added on top of them. This is where explanation mining comes in: understanding whether the generated explanation aligns with the actual phenomenon in place, or whether it is just throwing out some numbers.

Section Summary

Here is a flow map of when to use a particular explanation method, taken from a source added in the references section. It should be taken with a pinch of salt.

Hate Speech — Neuroticism — AI — Sentiments — Hypocrisy

The thesis topic was chosen because it is one of the most prominent sociological issues projected online. Hate crime against women has increased N-fold with the advent of social media platforms, due to the creation of online echo chambers of hate speech. Incels are the most dominant online community spreading misogynistic ideology. Mass murders have been committed in the US, Canada and Europe by individuals aligned with Incel ideology. Online hate speech detection and monitoring systems are available, but their efficacy and reliability are open problems of high importance, since the decisions of such systems can have legal consequences. The trustworthiness of black-box models, and of the algorithms explaining them, is a novel space of exploration. Further, the alignment of explanation methods with real-world properties is an unexplored area.

Literature Review

A systematic review showed that 73.4% of women across different countries who voiced political opinions and acted as feminists were met with abusive comments, rape threats and, in the worst case, stalking. Violence against women was declared a global health problem in the United Nations’ 1993 Declaration on the Elimination of Violence against Women. Incel communities started as a therapy group for cis-normative heterosexual males to discuss their problems in attaining a healthy relationship with women. Later, the online communities became hate speech chambers, and several of them are currently active. The majority of members are males from Western Europe and North America. In 2014, Elliot Rodger, a 22-year-old member of the community, killed 2 girls and 3 Chinese men and shot himself in California. In 2018, Alek Minassian killed 11 people in Toronto, Canada. He had Asperger’s syndrome, an autism spectrum condition, and hailed Elliot Rodger in a social media post. The very first reported and affirmed case related to Incels dates back to 1989 in Montreal, when Marc Lépine killed 14 women.

Personality Models

Personality computing models such as the Big Five, the six-factor model, the psychobiological model and Supernumerary Personality Traits are reviewed in the current study. These studies use lexicographic analysis to deduce personality traits. The Big Five model is the most extensively used and is part of the current study. It involves five personality characteristics: Openness (exploratory, highly imaginative), Conscientiousness (thoughtfulness, goal-oriented etc.), Extraversion (adaptability to an environment), Agreeableness (kindness, affection etc.), and Neuroticism (feelings of negativity, anxiety, depression). Neuroticism has a high degree of positive correlation with the brain’s response to negative stimuli. It is associated with relationship problems, dissatisfaction in marital dynamics and dissolution of marriage. Neuroticism’s strong relation with sentiments was validated in a study where individuals who wrote highly negative reviews exhibited high levels of neuroticism.

Neurobiological dimension

Triple Vulnerability Theory

The triple vulnerability theory describes disorders of anxiety and mood imbalance. It involves a psychological vulnerability (unpredictability), a biological vulnerability (uncontrollability), and a vulnerability arising from the accumulation of the two states. Family and twin studies found neuroticism to be heritable, with genetics contributing 0.4 to 0.6 of the variance in the expressed trait. The rest of the variance comes from the unshared environment. The genetic influence, relative to the environment, is more prominent at a younger age.

Heightened neurobiological reactivity in the amygdala and inhibited control by the prefrontal cortex structures are linked to the 5-HTTLPR polymorphism in the serotonin transporter gene. The socio-psychological dimension is another dominating factor. Traumatic incidents in infancy contribute to neuroticism with no genetic influence. The controllability of the environment, the parents’ behaviour towards the child’s ownership, and life events affect this dimension.

AI Explanation Mining

The study mined the explanations generated over a hate speech (misogyny) classification model. Let’s dive into the details with some salient points explaining the thesis process.

Problem Statement

The main aim of the study was to understand whether the explanation weights generated by the HEDGE algorithm align with the psychological trait of neuroticism, one of the most prominent characteristics driving the sociological behaviour exhibited by Incel individuals online. With the advent of deep learning, or black-box, models came the emergence of a subfield called explainable AI. The need to develop trust in the decision-making of these systems carries great weight when they are used in law enforcement and crowd monitoring. One such domain is online hate speech against women. Several online hate speech detection models have been built, but there has been sparse work on validating their decision-making. Further, validating the correctness of an explanation requires relating it to a tangible property of the phenomenon being explained. In the case of hate speech, sentiment plays an important role. It has also been found that misogynistic personalities propagating hate speech online have neuroticism as their most dominant psychological trait, and neuroticism is indicated by the use of negative words in the text. The current work attempts to validate whether the explanation generated by HEDGE for misogynistic text aligns with the sentiments of the words, i.e. whether a correlation between the explanation feature weights and the sentiments indicates that the explanation reflects neuroticism in the author’s personality.

Misogyny Detection Datasets and Models

  • Corpora and data sources: EVALITA 2018 (English and Italian). The TRAC-2 dataset, used in the current study, is a multilingual annotated corpus of 20k comments from YouTube, Facebook and Twitter in English, Hindi and Bangla. The EACL Misogyny Dataset contains 17k crowd-sourced Reddit posts and comments.
  • Traditional methods for hate speech detection involve Logistic Regression, Random Forest, Naive Bayes and Support Vector Machines, combined with n-gram features, ELMo, character n-grams and composite features. Random Forest and SVM were the best-performing models, with composite and n-gram features.
  • In the deep learning paradigm, RNNs with LSTM and GRU cells, combinations of RNNs with CNNs, and attention-based models have been used, with the attention-based BERT model’s performance serving as the benchmark, subject to the embeddings used.
  • Model explainability algorithms covered include LIME, saliency maps, layerwise relevance propagation, SHAP values etc., but there is no specific work on the generic validation of an explanation through alignment with another tangible property, such as sentiment in the current work’s premise.
  • HEDGE (Hierarchical Explanation via Divisive Generation) is a local post hoc method that outperforms LIME and contextual decomposition in sentiment classification explanation and provides an interaction score between tokens, which is missing in the other methods.


  • The TRAC-2 dev set is segregated into classes based on the level of aggression and on being gendered or not.
  • The misogyny detection model’s predictions are generated for each subset and split into correct and failed predictions.
  • The correctly predicted data points are passed to HEDGE to generate explanations.
  • A Lexico-Sentiment Resource is created using SentiWordNet and pretrained DL models.
  • Correlation analysis is performed between the explanation feature weights and the sentiments generated by the Lexico-Sentiment Resource to validate the alignment with neuroticism.
  • The correlation analysis uses symmetric and asymmetric measures of association. Symmetric: Pearson correlation (linear) and Spearman correlation (monotonic, rank-based). Asymmetric and non-linear: Predictive Power Score.
  • The correlation score is generated separately for every subset of the data on which the model’s predictions were correct, i.e. overtly aggressive gendered (OG), covertly aggressive gendered (CG), non-aggressive gendered (NG), overtly aggressive non-gendered (ONG), covertly aggressive non-gendered (CNG) and non-aggressive non-gendered (NNG).
  • A comparative analysis checks for the expected trend in correlation as the study moves across the subsets, ordered by their degree of aggression.
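
The two symmetric measures can be sketched in plain Python to show how they differ. The feature weights and sentiment scores below are made-up numbers, not thesis data, and a real analysis would normally use a statistics library that also reports significance.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation: strength of the linear relation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical data: sentiment falls monotonically but non-linearly
# as the explanation feature weight rises.
feature_weights = [0.1, 0.2, 0.4, 0.8, 1.6]
sentiment_scores = [-(w ** 3) for w in feature_weights]

r_pearson = pearson(feature_weights, sentiment_scores)
r_spearman = spearman(feature_weights, sentiment_scores)
```

On this toy data the Spearman coefficient reaches -1 because the relation is perfectly monotonic, while the Pearson coefficient stays above -1 because the relation is not linear.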


Spearman Coefficient Results
  • The Spearman correlation indicates the degree of monotonic relation between two entities, i.e. whether one consistently increases or decreases as the other increases.
  • The results are statistically significant only for the ONG class, with a weak negative association.
  • The expected trend is that in high-aggression text, words with high positive feature weights have strongly negative sentiment values, giving a strong negative association.
Predictive Power Score Results
  • The Predictive Power Score uses a non-linear predictor model that fits one variable from the other in a regression setting.
  • The baseline represents the naive model, i.e. a predictive model built without any hyperparameter tuning.
  • The predictive power scores are asymmetric, i.e. the score for predicting explanation feature weights from sentiments differs from the score for predicting sentiments from explanation feature weights.
  • Predicting sentiment from the feature weights has a lower predictive power than predicting feature weights from the sentiment score. The relation of interest in the study is the sentiment score predicted from the feature weights, and its weak predictive power indicates a weak relation between the explanation feature weights and the sentiment score.
  • The Pearson correlation score increases from OG to NG and from ONG to NNG and then hits a plateau. Ideally, it should keep increasing in the positive direction from ONG to NNG and from OG to NG, but the results are statistically significant for OG and ONG only.
  • The Spearman correlation shows a plateaued trend moving from OG to NG and from ONG to NNG. The association is weak and the results are significant only for the ONG class.
  • The predictive power score from feature weights to sentiment score differs by a significant delta from the score from sentiment score to feature weights. The predictive power scores range between 0 and 0.2, which points to a weak association. The dotted line represents the baseline in the predictive power results.


  • A zero to weak association is found between the generated explanation feature weights and the sentiment scores of the words in the misogynistic text across all the correlation methods, which points to weak alignment of the generated explanation with neuroticism.
  • There are multiple possible causes within the current study’s premise. The TRAC-2 dataset has a distribution bias towards the non-aggressive and non-gendered classes, which suggests that the misogyny detection model was not of high quality in the first place. Another dimension is the explanation method itself. There is scope for future work on a comparative correlation analysis between the explanation feature weights and the sentiment scores across different explanation methods, keeping the misogyny detection model fixed.
  • Further, training the same model architecture on different training datasets, generating inferences over a frozen test set and evaluating the relation for a specific explanation method can also help locate where exactly the issue might be.
  • Another dimension is the Lexico-Sentiment Resource used to generate the sentiments. The current study lacks a neighbourhood sentiment influence, which would give a holistic, i.e. contextual, sentiment for each word. Variations of the Lexico-Sentiment Resource can also be experimented with to understand which resource gives a comparatively better association with the feature weights, provided the quality of the sentiment scores is evaluated first.

For the extensively mined results, refer to the thesis. Overall, the study shows there is currently no significant work validating the explanations generated by these algorithms, whose numbers are yet another black box in the equation. We need foundational work in this space to be able to develop confidence in the explanations exhibited. Till then, keep up the scientific temper and honest efforts in the field.


  1. Causal Representation Learning —



Minimalist Bayesian Sapien
