
Quality Scoring Methodology

The CampusEvolve analytics pipeline scores every AI response on four dimensions during the ETL process. Scores are computed automatically using regex pattern matching — they are not human-reviewed. This page documents the scoring methodology in detail.

Overview

Each message receives a total quality score from 0 to 12, calculated as the sum of four dimensions, each scored 0–3:

| Dimension | Range | What it measures |
| --- | --- | --- |
| Relevance | 0–3 | Does the response advance pathway exploration? |
| Grounding | 0–3 | Does it reference real WA institutions, programs, or URLs? |
| Actionability | 0–3 | Does it provide concrete next steps? |
| Readability | 0–3 | Is it well-structured and an appropriate length? |

Scoring is applied to messages in the activity, question, and profilechat categories during the ETL transform step.
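The aggregation itself is a straight sum of the four dimensions. A minimal sketch (the function name is illustrative, not the pipeline's actual identifier):

```python
def total_quality_score(relevance: int, grounding: int,
                        actionability: int, readability: int) -> int:
    """Sum the four 0-3 dimension scores into a 0-12 total."""
    for score in (relevance, grounding, actionability, readability):
        assert 0 <= score <= 3, "each dimension is scored 0-3"
    return relevance + grounding + actionability + readability
```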


Dimension 1: Relevance (0–3)

Relevance measures whether the AI provided a substantive response or deflected the student's question.

Scoring logic

The scoring first classifies the student message as either normal or adversarial, and the AI response as either substantive, deflection, or no_response.

When the student message is normal (non-adversarial):

| AI response type | Score | Rationale |
| --- | --- | --- |
| Substantive, >200 characters | 3 | Full, helpful response |
| Substantive, ≤200 characters | 2 | Brief but on-topic |
| Deflection | 0 | AI refused a legitimate question — this is a quality problem |

When the student message is adversarial:

| AI response type | Score | Rationale |
| --- | --- | --- |
| Deflection | 2 | Correctly refused an adversarial input |
| Substantive | 1 | Responded to adversarial input instead of deflecting |
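The two tables can be sketched as a single branching function. This assumes the classification steps have already produced the labels described above; the handling of `no_response` is an assumption, since the tables do not specify a score for it:

```python
def score_relevance(student_type: str, response_type: str,
                    response_text: str) -> int:
    """Map the (student message, AI response) classification to 0-3.

    student_type: "normal" or "adversarial"
    response_type: "substantive", "deflection", or "no_response"
    """
    if response_type == "no_response":
        return 0  # assumption: not specified in the scoring tables
    if student_type == "adversarial":
        # Deflecting adversarial input is the desired behaviour.
        return 2 if response_type == "deflection" else 1
    if response_type == "deflection":
        return 0  # refused a legitimate question -- the key quality flag
    return 3 if len(response_text) > 200 else 2
```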

Adversarial detection patterns

The system classifies student messages as adversarial if they match any of the following pattern categories:

| Category | What it detects | Examples |
| --- | --- | --- |
| Prompt injection | Attempts to override the AI's instructions | "ignore previous instructions", "pretend you are", "you are now", "act as" |
| Hate speech | Slurs, supremacist language, self-harm directives | Racial slurs, "kill yourself", Nazi references |
| Violence | Threats, weapons, physical harm | "bomb maker", "build a weapon", "shoot", "murder" |
| Profanity openers | Messages that start with profanity | Messages beginning with "fuck", "shit", "stfu", etc. |
| Sexual content | Explicit or pornographic references | "porn", "nude", "onlyfans" |
| Off-topic commands | Requests unrelated to education | "say uwu", "give me confetti", "restaurant near me" |
| Internet slang/memes | Meme culture and nonsense inputs | "skibidi", "rizz", "gyatt", "among us", "sussy" |
| Roleplay | Attempts to make the AI roleplay as fictional characters | "my name is hashira", "demon slayer", "flame pillar" |
| Mimicked bot greetings | Students pasting or mimicking bot-style openers | "Great!", "Excellent!", "It's great to", "No problem, take" |
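In regex terms, adversarial classification is an any-match over the pattern categories. A minimal sketch using a small illustrative subset of the patterns above (the production lists are longer):

```python
import re

# Illustrative subset of the adversarial pattern categories; the real
# per-category pattern lists are much longer than shown here.
ADVERSARIAL_PATTERNS = [
    r"ignore previous instructions",   # prompt injection
    r"pretend you are",                # prompt injection
    r"^(fuck|shit|stfu)\b",            # profanity opener, anchored to start
    r"\b(skibidi|rizz|gyatt)\b",       # internet slang / memes
]

def is_adversarial(message: str) -> bool:
    text = message.lower().strip()
    return any(re.search(pattern, text) for pattern in ADVERSARIAL_PATTERNS)
```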

Deflection detection patterns

The system classifies AI responses as deflections if they match these patterns:

| Category | What it detects | Examples |
| --- | --- | --- |
| Capability disclaimers | AI stating it cannot help | "I'm not able to", "I can't help with", "I'm unable to" |
| Redirect phrases | AI steering away from the question | "Let's get back to", "Let's focus on", "My focus is on" |
| Generic help offers | Overly broad, non-specific responses | "Help you with your education and career questions", "Help Washington state students with their" |
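Deflection detection works the same way: any single pattern match classifies the response as a deflection. A sketch with an illustrative subset of the patterns above:

```python
import re

# Illustrative subset of the deflection patterns; real lists are longer.
DEFLECTION_PATTERNS = [
    r"i'm not able to",
    r"i can't help with",
    r"let's (get back to|focus on)",
    r"help you with your education and career questions",
]

def is_deflection(response: str) -> bool:
    text = response.lower()
    return any(re.search(pattern, text) for pattern in DEFLECTION_PATTERNS)
```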

False deflection risk

A deflection to a normal message scores 0 for relevance. This is the single biggest quality flag — it means the AI refused a legitimate student question. However, the regex approach can produce false positives. For example, an AI response that says "I'm here to help you with your future plans" as part of a longer substantive answer would still be flagged as a deflection.


Dimension 2: Grounding (0–3)

Grounding measures whether the AI response references real, verifiable WA State institutions, programs, or resources. The score equals the number of distinct pattern categories matched in the response, capped at 3.

Pattern categories

Each pattern category below counts as 1 point (max 3 total):

| Category | What it matches | Examples |
| --- | --- | --- |
| WA community/technical colleges | Named colleges in the WA CTC system | Yakima Valley College, Highline, Peninsula, Olympic, Clark, Bellevue, Green River, Centralia, Whatcom, Skagit, Spokane, Walla Walla, Columbia Basin, South Seattle, North Seattle, Everett, Tacoma, Pierce, Lower Columbia, Grays Harbor, Big Bend, Wenatchee |
| Career resources | WA career exploration tools | Career Bridge, careerbridge.wa, WorkSource, O*NET |
| State programs | WA-specific educational programs | Running Start, College Bound, College in the High School, "Washington State" references |
| Financial aid | Federal and state financial aid programs | FAFSA, WASFA, Pell Grant, Washington College Grant, College Bound Scholarship |
| Government agencies | WA educational agencies and federal student aid | SBCTC, WSAC, esd.wa, studentaid.gov |
| Apprenticeships | Apprenticeship programs | Apprenticeship, AJAC, pre-apprenticeship |
| URLs | Any hyperlink in the response | Any http:// or https:// URL |

Scoring examples

  • Response mentions "Yakima Valley College" and includes a URL → 2 (two pattern categories matched)
  • Response mentions "FAFSA", "Running Start", and "WorkSource" → 3 (three categories matched)
  • Response is conversational with no specific references → 0
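The count-distinct-categories-then-cap logic can be sketched as follows. The regexes here are a small illustrative subset of each category's real pattern list; multiple matches within one category still count as a single point:

```python
import re

# One illustrative regex per grounding category (the real lists are far
# longer); each matched category adds one point, capped at 3.
GROUNDING_CATEGORIES = {
    "wa_colleges": r"yakima valley|highline|bellevue|green river",
    "career_resources": r"career bridge|careerbridge\.wa|worksource|o\*net",
    "financial_aid": r"\bfafsa\b|\bwasfa\b|pell grant|washington college grant",
    "urls": r"https?://",
}

def score_grounding(response: str) -> int:
    text = response.lower()
    hits = sum(bool(re.search(p, text)) for p in GROUNDING_CATEGORIES.values())
    return min(hits, 3)
```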

Why profilechat scores low on grounding

Onboarding/profilechat responses are conversational ("Tell me about yourself", "What are you interested in?") and don't typically reference specific institutions. Average grounding for profilechat is ~0.2. This is expected behavior, not a quality problem.


Dimension 3: Actionability (0–3)

Actionability measures whether the AI provides concrete next steps the student can take. The score equals the number of distinct pattern categories matched, capped at 3.

Pattern categories

Each category counts as 1 point (max 3 total):

| Category | What it matches | Examples |
| --- | --- | --- |
| Action verbs | Direct suggestions for next steps | "you could", "try", "start by", "consider", "look into", "check out", "visit", "explore", "research", "schedule", "meet with", "apply", "register", "sign up", "contact", "call", "email" |
| Sequential steps | Numbered or ordered instructions | "Step 1", "First,", "Next,", "Then,", "Here's how", "To get started" |
| Links | Clickable resources | HTML links (href=, <a), URLs (http://, https://) |
| Deadlines | Time-sensitive information | "deadline", "due date", "apply by", "October 1", "May 1", "opens on" |

Scoring examples

  • Response says "You could visit careerbridge.wa.gov to explore options" → 2 (action verb + link)
  • Response says "Step 1: Research programs. Step 2: Apply by October 1" → 3 (action verb + steps + deadline)
  • Response says "That's a great question about your future" → 0 (no concrete action)
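Actionability uses the same count-and-cap mechanism as grounding. A sketch with an illustrative subset of each category's patterns:

```python
import re

# One illustrative regex per actionability category; each matched
# category adds one point, capped at 3.
ACTIONABILITY_CATEGORIES = {
    "action_verbs": r"you could|start by|consider|look into|visit|apply",
    "sequential_steps": r"step \d|first,|next,|here's how",
    "links": r"href=|<a\s|https?://",
    "deadlines": r"deadline|due date|apply by|opens on",
}

def score_actionability(response: str) -> int:
    text = response.lower()
    hits = sum(bool(re.search(p, text))
               for p in ACTIONABILITY_CATEGORIES.values())
    return min(hits, 3)
```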

Dimension 4: Readability (0–3)

Readability measures whether the response is well-structured and an appropriate length for a student audience.

Scoring logic

| Condition | Score |
| --- | --- |
| 50–250 words and has structured formatting (bullets, numbered lists, bold) | 3 |
| Moderate length (20–300 words, but doesn't meet the criteria for 3) | 2 |
| Too short (<20 words) | 1 |
| Too long (>300 words) | 1 |

Structured formatting is detected by the presence of any of: `*` (bullet), `1.` (numbered list), `<strong>`, `<ul>`, `<li>`, or `<a` (HTML markup).
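Putting the table and the marker check together, a minimal sketch (word count here is a naive whitespace split, which is an assumption about the pipeline's tokenization):

```python
def score_readability(response: str) -> int:
    """Score length + structure per the readability table (0 is never used)."""
    words = len(response.split())  # naive whitespace word count
    markers = ("* ", "1.", "<strong>", "<ul>", "<li>", "<a")
    structured = any(m in response for m in markers)
    if 50 <= words <= 250 and structured:
        return 3
    if words < 20 or words > 300:
        return 1
    return 2
```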

Design rationale

The sweet spot for student-facing responses is 50–250 words with formatting. Shorter responses often lack substance. Longer responses risk losing student attention — particularly on mobile devices.


Interpreting Scores by Category

Activity messages (target: avg ≥ 9.0)

Activity messages are the core product experience — students exploring career, education, financial, and wellness topics within their learning pathway. These should score highest because:

  • The AI has rich RAG context about WA State post-secondary options
  • Responses should reference specific institutions and programs (grounding)
  • Responses should suggest concrete next steps (actionability)
  • Responses should be well-structured (readability)

A score below 8 on an activity message warrants investigation.

Question messages (interpret with caution)

Open Q&A messages represent ~1.5% of traffic. Small sample sizes make averages unreliable. Individual messages may score low if the question is off-topic or very general.

Profilechat messages (expected: ~5.0)

Profilechat (onboarding conversations) inherently scores low on grounding (~0.2) and actionability (~0.2) because the AI is asking the student about themselves, not recommending specific programs. A total score of 4–6 is normal and expected for profilechat.


Known Limitations

  1. Automated, not human-reviewed. All scoring is regex-based pattern matching, not semantic understanding. The system cannot assess whether a response is actually helpful or accurate — only whether it contains patterns associated with quality.

  2. False deflections. A response that includes a phrase like "I'm here to help" as part of a substantive answer may be incorrectly classified as a deflection, scoring 0 on relevance.

  3. Grounding false positives. A response that mentions "Spokane" in a non-educational context still gets a grounding point. Similarly, any URL counts as grounding even if the URL is broken or irrelevant.

  4. Actionability inflation. Common phrases like "you could" or "consider" earn actionability points even in vague responses that don't actually help the student take a concrete step.

  5. Readability doesn't assess clarity. A poorly written 150-word response with bullet points scores 3 on readability. The dimension measures structure and length, not actual readability.

  6. Category-specific baselines differ. Comparing total scores across categories (activity vs. profilechat) is misleading. Always compare within the same category.

  7. No semantic grounding verification. The system checks for mentions of WA institutions but cannot verify whether the information about those institutions is correct.

  8. Text truncation. Message and response text is truncated to 500 characters in the data lake for storage efficiency. Quality scoring runs on the full text before truncation, but the stored text may not show the patterns that contributed to the score.