Cronbach's Alpha
What is Cronbach's Alpha?
Cronbach's Alpha is a measure of internal consistency reliability in psychometric testing. It assesses how closely related a set of items or questions are as a group, indicating whether they consistently measure the same underlying construct.
How to Interpret
- α > 0.9 - Items may be too similar, potentially measuring the same thing (redundancy)
- 0.7 ≤ α ≤ 0.9 - Good reliability (ideal range)
- α < 0.7 - Poor reliability
Why It Matters
Cronbach's Alpha values between 0.7 and 0.9 indicate that the model's responses are consistent across different items measuring the same psychological construct. This suggests the model has a coherent internal representation of the concept being measured. Values above 0.9 may indicate item redundancy.
Example
If evaluating a model on anxiety (GAD7), a high alpha between 0.7 and 0.9 means the model responds consistently across all anxiety-related questions, suggesting it has a stable understanding of anxiety as a concept.
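The alpha statistic described above can be sketched in a few lines. This is a generic implementation of the standard Cronbach's alpha formula, not the Space's own code, and the score matrix is a toy example:

```python
from statistics import variance  # sample variance (ddof=1)

def cronbach_alpha(item_scores):
    """Cronbach's alpha for a list of respondents' item-score rows."""
    k = len(item_scores[0])                             # number of items
    columns = list(zip(*item_scores))                   # per-item score lists
    item_var_sum = sum(variance(col) for col in columns)
    total_var = variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - item_var_sum / total_var)

# Toy example: 4 "respondents" answering 3 related items.
scores = [
    [2, 3, 3],
    [4, 4, 5],
    [1, 2, 1],
    [3, 3, 4],
]
alpha = cronbach_alpha(scores)
```

Because the toy items rise and fall together, this example yields a high alpha (about 0.95), illustrating the redundancy warning above.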
Silhouette Score
What is the Silhouette Score?
The silhouette coefficient measures two critical aspects of how language models respond to questionnaire items:
- Within-cluster cohesion: Items within the same group or cluster should receive similar scores
- Between-cluster separation: Opposing groups (construct terms vs. their inverse terms) should be clearly separated and distinct from each other
How to Interpret
- Positive Silhouette scores - Good: Indicates both strong within-cluster cohesion and proper separation between opposing concepts
- Non-Positive Silhouette scores - Poor: Model either lacks internal consistency within groups or confuses opposing terms
What It Measures
The silhouette coefficient quantifies clustering quality by evaluating:
- Cohesion (within-cluster): How similar the model's scores are for items that belong to the same group (e.g., synonyms like "disappointed," "sad," "unhappy")
- Separation (between-cluster): How dissimilar the model's scores are between opposing groups (e.g., negative terms like "disappointed" vs. positive terms like "proud")
Positive scores indicate the model maintains both internal consistency and clear conceptual boundaries, while negative scores suggest the model either treats similar items inconsistently or conflates opposing concepts.
Components
- Mean: Average silhouette score across all models
- Std: Standard deviation (consistency of clustering quality)
- Avg Negative per Model: Average number of items with negative silhouette scores per model (an indicator of poor clustering quality)
Example
In a trust-related question, the model should: (1) give similar scores to "disappointed," "sad," and "let down" (cohesion), while (2) clearly distinguishing these negative terms from "proud," "confident," and "satisfied" (separation). This confirms the model can consistently measure the underlying construct.
Factor Correlations
What are Factor Correlations?
Factor correlations show how different psychological factors or constructs relate to each other in the model's responses. This correlation matrix reveals the relationships between different dimensions being measured.
How to Interpret
- r > 0.7 - Strong positive correlation
- 0.3 ≤ r ≤ 0.7 - Moderate positive correlation
- -0.3 < r < 0.3 - Weak or no correlation
- -0.7 ≤ r ≤ -0.3 - Moderate negative correlation
- r < -0.7 - Strong negative correlation
Why It Matters
Factor correlations reveal the relationships between different psychological dimensions in the model's responses. Some correlation is expected for related psychological dimensions, while unrelated factors may show weak correlations.
Example
In the Compassion Scale questionnaire, factors like "Kindness" and "Common Humanity" may show moderate positive correlations, as both relate to compassionate responses. This indicates the model recognizes these as related but distinct aspects of compassion.
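A factor correlation is just a Pearson correlation between two factor score vectors. The sketch below uses hypothetical per-item scores for the two Compassion Scale factors named above; the values are invented to show a moderate positive correlation:

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x)
                      * sum((b - my) ** 2 for b in y))

# Hypothetical factor scores across five probes.
kindness = [4.1, 3.8, 4.3, 3.9, 4.0]
common_humanity = [3.8, 3.9, 4.0, 3.5, 3.7]
r = pearson(kindness, common_humanity)
```

With these toy values r falls in the moderate range (0.3 to 0.7), matching the interpretation table above.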
Filter Options
What is Accuracy Filtering?
The accuracy filter allows you to refine the displayed models based on their performance on standard NLI evaluation tasks. This helps you focus on models with validated language understanding capabilities.
How Models Are Evaluated
NLI (Natural Language Inference) models were evaluated on the validation matched set from the MNLI (Multi-Genre Natural Language Inference) dataset. This dataset tests a model's ability to determine whether a hypothesis is entailed by, contradicts, or is neutral to a given premise across multiple text genres.
Using the Filter
- Enable filtering: Check "Filter Models by Accuracy"
- Set threshold: Use the slider to select minimum accuracy (0.0 to 1.0)
- Effect: Only models meeting or exceeding the threshold will be included in statistics and visualizations
When to Use Filtering
Filtering is useful when you want to:
- Compare your model against high-performing models only
- Exclude models with poor language understanding from analysis
- Focus on models with validated capabilities on benchmark tasks
Important Notes
- Filtering is only available for QMNLI (NLI-based) evaluations
- QMLM (masked language model) evaluations do not support filtering
- Accuracy data is loaded from the models_acc.csv file
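The filtering step can be sketched as a simple threshold over the accuracy table. The file name models_acc.csv comes from the notes above, but the column names ("model", "accuracy") and the inline CSV are assumptions for illustration:

```python
import csv
import io

# Hypothetical contents of models_acc.csv.
ACC_CSV = """model,accuracy
facebook/bart-large-mnli,0.89
some-org/weak-model,0.55
"""

def filter_models(csv_text, min_accuracy):
    """Return model identifiers whose MNLI accuracy meets the threshold."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["model"] for row in reader
            if float(row["accuracy"]) >= min_accuracy]

kept = filter_models(ACC_CSV, min_accuracy=0.8)
```

Only models at or above the slider threshold survive, mirroring how the filter restricts the statistics and visualizations.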
Model Identifier
What is a Model Identifier?
The Model Identifier is the unique name of a model on HuggingFace Hub. It typically follows the format organization/model-name.
Supported Model Types
This tool supports two types of language models:
1. Zero-Shot Classification Models (QMNLI)
Models trained for Natural Language Inference tasks that can classify text without task-specific training.
- Task: Zero-shot classification, NLI, text classification
- Examples: facebook/bart-large-mnli, MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli
- Browse models: Zero-Shot Classification on HuggingFace
2. Fill-Mask Models (QMLM)
Masked Language Models that predict missing tokens in text sequences.
- Task: Fill-mask, masked language modeling
- Examples: bert-base-uncased, roberta-base, xlm-roberta-base
- Browse models: Fill-Mask Models on HuggingFace
How to Find Model Identifiers
- Visit HuggingFace Models
- Filter by task type (zero-shot-classification or fill-mask)
- Click on a model to view its details
- Copy the model identifier from the top of the page (e.g., "facebook/bart-large-mnli")
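A quick sanity check of the identifier format can catch copy-paste mistakes before submitting. This regex is a hypothetical illustration, not HuggingFace's actual validation; it accepts both organization/model-name identifiers and legacy names without an organization prefix:

```python
import re

# One optional "organization/" segment followed by a model name;
# word characters, dots, and hyphens only (an illustrative approximation).
IDENTIFIER_RE = re.compile(r"^(?:[\w.-]+/)?[\w.-]+$")

def looks_like_model_id(model_id):
    """Rough check that a string resembles a HuggingFace model identifier."""
    return bool(IDENTIFIER_RE.match(model_id))
```

For example, "facebook/bart-large-mnli" and "bert-base-uncased" pass, while a string containing spaces does not.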
Example Model Identifiers
- facebook/bart-large-mnli | Zero-shot classification
- MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli | Zero-shot classification
- FacebookAI/roberta-base | Fill-mask
- google-bert/bert-base-uncased | Fill-mask
Important Notes
- The model must be publicly available on HuggingFace Hub
- Auto Task Detection can automatically determine if a model is QMLM or QMNLI
- Some questionnaires are only compatible with specific task types
- Some models aren't compatible with our pipeline and cannot be evaluated
Psychometric Leaderboard
What is the Psychometric Leaderboard?
The Psychometric Leaderboard compares language models based on their responses to standardized psychological questionnaires. It provides a ranked view of how different models score on various psychological constructs.
How Rankings Work
- Color-Coded Heatmap: Models are ranked from best (green) to worst (red) based on the construct being measured
- Construct Direction:
- Lower is Better: Anxiety (GAD7), Depression (PHQ9), Sexism (ASI) - green indicates lower scores
- Higher is Better: Coherence (SOC), Compassion (CS) - green indicates higher scores
- Neutral is Better (Closer to 0): Personality traits (Big Five) - Openness to Experience, Extraversion
- Note: The heatmap direction and positive/negative interpretations were determined based on our subjective view of what constitutes desirable outcomes for each psychological construct.
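The three direction categories above can be folded into a single ranking rule by converting each raw score into a "desirability" value where higher is always better. The function below is an illustrative sketch, not the leaderboard's actual code:

```python
def desirability(score, direction):
    """Map a raw score to a value where higher is always more desirable."""
    if direction == "lower_better":    # e.g. GAD7, PHQ9, ASI
        return -score
    if direction == "higher_better":   # e.g. SOC, CS
        return score
    if direction == "neutral_better":  # e.g. Openness, Extraversion
        return -abs(score)             # closer to 0 is better
    raise ValueError(f"unknown direction: {direction}")

# Toy ranking of two models on a lower-is-better construct.
ranked = sorted([(1.2, "lower_better"), (-0.4, "lower_better")],
                key=lambda s: desirability(*s), reverse=True)
```

After this mapping, one green-to-red colormap can be applied uniformly regardless of the construct's direction.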
Features
- Sortable Columns: Click column headers to sort by rank or score
- Clickable Models: Model names link to their HuggingFace pages
- Accuracy Tooltips: Hover over model names to see their accuracy (when available)
- Accuracy Filtering: Filter models by minimum accuracy threshold (QMNLI only)
Use Cases
- Compare models for specific psychological characteristics
- Identify models with lower bias (e.g., less sexist models)
- Explore personality trait distributions across models
- Benchmark your model against others in the community
YAML Evaluation Results
What are YAML Results?
After evaluating your model, you receive YAML-formatted results that can be embedded directly into your HuggingFace model card. This standardized format allows others to quickly see your model's psychometric profile.
What's Included
- Evaluation Scores: Your model's performance on each questionnaire
- Percentile Rankings: Comparisons to other models tested on the same construct
- Task Information: The evaluation method used (QMLM or QMNLI)
How to Use
- Copy the YAML: Click the copy button or manually select and copy the YAML output
- Open Your Model Card: Go to your model's page on HuggingFace and click "Edit model card"
- Paste at the Top: Add the YAML to the beginning of your README.md file (between the --- markers)
- Save: Commit the changes - your evaluation results will now appear in a standardized format
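For orientation, a model card's metadata block sits between the two --- markers at the top of README.md and follows HuggingFace's model-index schema. The snippet below is a hypothetical sketch of where the copied YAML lands; the exact keys and metric names emitted by this tool may differ:

```yaml
---
model-index:
- name: your-org/your-model          # hypothetical identifier
  results:
  - task:
      type: zero-shot-classification
    dataset:
      name: GAD7                     # questionnaire used for the evaluation
      type: psychometric
    metrics:
    - type: anxiety-score
      value: -0.312
---
```

Everything after the closing --- is the ordinary model card text.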
Why Use YAML Format?
- Standardization: Follows HuggingFace's model card specification
- Automatic Display: Results render beautifully on your model page
- Comparability: Others can easily compare models using the same metrics
- Verification: Links back to source for transparency
Evaluation Results
What are Evaluation Results?
The Evaluation Results section displays the detailed psychometric profile of the model you just evaluated. It provides comprehensive metrics that help you understand how the model responds to standardized psychological questionnaires.
Understanding the Results
- Score: The model's mean score on the questionnaire (range varies by construct)
- Rank: Position relative to all other models tested on the same construct (lower rank number = better performance for positive constructs)
- Percentile: For neutral constructs (Openness, Extraversion), shows what percentile the model falls into based on closeness to neutral (0)
- Validation Metrics: Statistical measures of reliability and consistency
Key Metrics Explained
- Cronbach's Alpha: Internal consistency reliability (0.7-0.9 is ideal)
- Silhouette Score: How well the model's responses cluster by construct
- Factor Correlations: Relationships between different psychological dimensions
Interpreting Colors & Directionality
- Green: Better/desirable scores (e.g., high coherence, low anxiety)
- Red: Worse/concerning scores (e.g., low coherence, high anxiety)
- Special Cases: Openness and Extraversion use a gradient in which scores closer to 0 are shown as more desirable
Welcome to Psychometric Evaluation Space 👋
About This Tool
This is a research tool designed for academic and educational purposes to evaluate psychological biases in language models using standardized psychometric scales.
By clicking "I Accept" below, you agree to:
- Use this tool exclusively for research and educational purposes
- Acknowledge the limitations of automated psychometric assessments
- Allow us to store and use your evaluation results for future assessments and benchmarks
- Understand that we make no warranties about accuracy, completeness, or reliability of evaluations
License: CC BY-NC-SA 4.0
This work is licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International. You are free to share and adapt this material with proper attribution, under the same license, for non-commercial purposes only.
Instructions
Enter the model ID, select one or more questionnaires, and choose a task type to evaluate.
Select Factors for Each Psychometric
Filter Options
📝 Model Card Integration
Copy the YAML results below to embed evaluation metrics in your HuggingFace model card.
Psychometric Leaderboard
Compare model performance across psychological evaluations
What is this Space?
This is a research tool designed to evaluate psychological biases and latent constructs in pre-trained language models using Masked Language Modeling (MLM) and Natural Language Inference (NLI). It assesses how language models respond to psychological questionnaires, revealing biases and latent constructs embedded in their representations. Note: Currently, only the NLI method is supported by our research.
Key Features
- Multiple Questionnaires: Evaluate models on anxiety (GAD7), depression (PHQ9), personality (Big Five), compassion (CS), sense of coherence (SOC), and sexism (ASI)
- Psychometric Validation: Cronbach's Alpha, Silhouette Score, and Factor Correlations
- Interactive Visualizations: Box plots showing model performance relative to all evaluated models
- Outlier Detection: Identifies models with extreme scores using statistical methods
- Leaderboard: Compare model performance across different psychological constructs
Questionnaires Information
About Z-Scores
Important: The Z-scores displayed in this application represent standardized scores that show how many standard deviations each model's score is from the mean of all evaluated models for that specific construct. A Z-score of 0 indicates a model scores at the mean, positive Z-scores indicate scores above the mean, and negative Z-scores indicate scores below the mean. These standardized scores allow for direct comparison across different models and questionnaires by accounting for the distribution of scores. Note that these Z-scores are calculated from model responses to questionnaire items and are not on the same scale as traditional human questionnaire scores.
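The standardization described above is the ordinary z-score formula. A minimal sketch with invented scores for five models on one construct:

```python
import statistics

def z_scores(scores):
    """Standard deviations from the mean for each model's raw score."""
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)  # sample standard deviation
    return [(s - mean) / sd for s in scores]

# Toy example: five models' raw scores on one questionnaire.
zs = z_scores([2.0, 4.0, 6.0, 8.0, 10.0])
```

The model at the mean gets z = 0, models above the mean get positive z-scores, and models below get negative ones, which is what makes scores comparable across questionnaires.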
Anxiety symptoms and severity - Generalized Anxiety Disorder (GAD7)
Factors: No sub-factors (single-factor questionnaire)
Depression symptoms and severity - Patient Health Questionnaire (PHQ9)
Factors: No sub-factors (single-factor questionnaire)
Ability to cope with stress - Sense of Coherence (SOC)
Factors:
- Comprehensibility - Understanding life
- Manageability - Coping resources
- Meaningfulness - Life purpose
Five major personality dimensions - Big Five Personality (BIG5)
Factors:
- Openness to Experience - Creativity, curiosity
- Conscientiousness - Organization, responsibility
- Extraversion - Sociability, assertiveness
- Agreeableness - Cooperation, empathy
- Neuroticism - Emotional instability
Compassionate attitudes and behaviors - Compassion Scale (CS) (QMNLI only)
Factors:
- Kindness - Care, support
- Common Humanity - Shared suffering
- Mindfulness - Attention, awareness
- Indifference - Lack of concern
- Separation - Emotional disconnection
- Disengagement - Tuning out
Gender-related attitudes and biases - Ambivalent Sexism Inventory (ASI)
Factors:
- Hostile Sexism - Antagonistic beliefs toward women
- Benevolent Sexism (Intimacy) - Romantic intimacy idealization
- Benevolent Sexism (Paternalism) - Protective paternalistic attitudes
- Benevolent Sexism (Gender Differentiation) - Traditional gender role beliefs
YAML Evaluation Results
After running an evaluation, you'll receive YAML-formatted results that can be added directly to your model's card on HuggingFace. These results include:
- Scores: Your model's performance on each questionnaire (3 decimal places)
- Percentile Rankings: How your model compares to others (e.g., "less anxious than 75% of models")
- Questionnaire Information: The questionnaire and task type used for evaluation
- Verification: Links back to this evaluation space for result validation
The YAML format follows HuggingFace's model card specification and will automatically display your model's psychometric evaluation results in a standardized, comparable format.
License
This work is licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0).