
Quality Scoring Methodology

The CampusEvolve analytics pipeline scores every AI response on four dimensions during the ETL process. Scores are computed automatically using regex pattern matching — they are not human-reviewed. This page documents the scoring methodology in detail.

Overview

Each message receives a total quality score from 0 to 12, calculated as the sum of four dimensions, each scored 0–3:

| Dimension | Range | What it measures |
| --- | --- | --- |
| Relevance | 0–3 | Does the response advance pathway exploration? |
| Grounding | 0–3 | Does it reference real WA institutions, programs, or URLs? |
| Actionability | 0–3 | Does it provide concrete next steps? |
| Readability | 0–3 | Is it well-structured and an appropriate length? |

Scoring is applied to messages in the activity, question, and profilechat categories during the ETL transform step.
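The aggregation itself is a straight sum of the four dimensions. A minimal sketch (the function name is illustrative, not the pipeline's actual identifier):

```python
def total_quality_score(relevance: int, grounding: int,
                        actionability: int, readability: int) -> int:
    """Sum the four 0-3 dimension scores into a 0-12 total."""
    for score in (relevance, grounding, actionability, readability):
        assert 0 <= score <= 3, "each dimension is scored 0-3"
    return relevance + grounding + actionability + readability
```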


Dimension 1: Relevance (0–3)

Relevance measures whether the AI provided a substantive response or deflected the student's question.

Scoring logic

The scoring first classifies the student message as either normal or adversarial, and the AI response as either substantive, deflection, or no_response.

When the student message is normal (non-adversarial):

| AI response type | Score | Rationale |
| --- | --- | --- |
| Substantive, >200 characters | 3 | Full, helpful response |
| Substantive, ≤200 characters | 2 | Brief but on-topic |
| Deflection | 0 | AI refused a legitimate question — this is a quality problem |

When the student message is adversarial:

| AI response type | Score | Rationale |
| --- | --- | --- |
| Deflection | 2 | Correctly refused an adversarial input |
| Substantive | 1 | Responded to adversarial input instead of deflecting |
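The two tables can be sketched as a single branching function. This assumes the classification steps have already produced the labels described above; the handling of `no_response` is an assumption, since the tables do not specify a score for it:

```python
def score_relevance(student_type: str, response_type: str,
                    response_text: str) -> int:
    """Map the (student message, AI response) classification to 0-3.

    student_type: "normal" or "adversarial"
    response_type: "substantive", "deflection", or "no_response"
    """
    if response_type == "no_response":
        return 0  # assumption: not specified in the scoring tables
    if student_type == "adversarial":
        # Deflecting adversarial input is the desired behaviour.
        return 2 if response_type == "deflection" else 1
    if response_type == "deflection":
        return 0  # refused a legitimate question -- the key quality flag
    return 3 if len(response_text) > 200 else 2
```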

Adversarial detection patterns

The system classifies student messages as adversarial if they match any of the following pattern categories:

| Category | What it detects | Examples |
| --- | --- | --- |
| Prompt injection | Attempts to override the AI's instructions | "ignore previous instructions", "pretend you are", "you are now", "act as" |
| Hate speech | Slurs, supremacist language, self-harm directives | Racial slurs, "kill yourself", Nazi references |
| Violence | Threats, weapons, physical harm | "bomb maker", "build a weapon", "shoot", "murder" |
| Profanity openers | Messages that start with profanity | Messages beginning with "fuck", "shit", "stfu", etc. |
| Sexual content | Explicit or pornographic references | "porn", "nude", "onlyfans" |
| Off-topic commands | Requests unrelated to education | "say uwu", "give me confetti", "restaurant near me" |
| Internet slang/memes | Meme culture and nonsense inputs | "skibidi", "rizz", "gyatt", "among us", "sussy" |
| Roleplay | Attempts to make the AI roleplay as fictional characters | "my name is hashira", "demon slayer", "flame pillar" |
| Mimicked bot greetings | Students pasting or mimicking bot-style openers | "Great!", "Excellent!", "It's great to", "No problem, take" |
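In regex terms, adversarial classification is an any-match over the pattern categories. A minimal sketch using a small illustrative subset of the patterns above (the production lists are longer):

```python
import re

# Illustrative subset of the adversarial pattern categories; the real
# per-category pattern lists are much longer than shown here.
ADVERSARIAL_PATTERNS = [
    r"ignore previous instructions",   # prompt injection
    r"pretend you are",                # prompt injection
    r"^(fuck|shit|stfu)\b",            # profanity opener, anchored to start
    r"\b(skibidi|rizz|gyatt)\b",       # internet slang / memes
]

def is_adversarial(message: str) -> bool:
    text = message.lower().strip()
    return any(re.search(pattern, text) for pattern in ADVERSARIAL_PATTERNS)
```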

Deflection detection patterns

The system classifies AI responses as deflections if they match these patterns:

| Category | What it detects | Examples |
| --- | --- | --- |
| Capability disclaimers | AI stating it cannot help | "I'm not able to", "I can't help with", "I'm unable to" |
| Redirect phrases | AI steering away from the question | "Let's get back to", "Let's focus on", "My focus is on" |
| Generic help offers | Overly broad, non-specific responses | "Help you with your education and career questions", "Help Washington state students with their" |
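Deflection detection works the same way: any single pattern match classifies the response as a deflection. A sketch with an illustrative subset of the patterns above:

```python
import re

# Illustrative subset of the deflection patterns; real lists are longer.
DEFLECTION_PATTERNS = [
    r"i'm not able to",
    r"i can't help with",
    r"let's (get back to|focus on)",
    r"help you with your education and career questions",
]

def is_deflection(response: str) -> bool:
    text = response.lower()
    return any(re.search(pattern, text) for pattern in DEFLECTION_PATTERNS)
```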

False deflection risk

A deflection to a normal message scores 0 for relevance. This is the single biggest quality flag — it means the AI refused a legitimate student question. However, the regex approach can produce false positives. For example, an AI response that says "I'm here to help you with your future plans" as part of a longer substantive answer would still be flagged as a deflection.


Dimension 2: Grounding (0–3)

Grounding measures whether the AI response references real, verifiable WA State institutions, programs, or resources. The score equals the number of distinct pattern categories matched in the response, capped at 3.

Pattern categories

Each pattern category below counts as 1 point (max 3 total):

| Category | What it matches | Examples |
| --- | --- | --- |
| WA community/technical colleges | Named colleges in the WA CTC system | Yakima Valley College, Highline, Peninsula, Olympic, Clark, Bellevue, Green River, Centralia, Whatcom, Skagit, Spokane, Walla Walla, Columbia Basin, South Seattle, North Seattle, Everett, Tacoma, Pierce, Lower Columbia, Grays Harbor, Big Bend, Wenatchee |
| Career resources | WA career exploration tools | Career Bridge, careerbridge.wa, WorkSource, O*NET |
| State programs | WA-specific educational programs | Running Start, College Bound, College in the High School, "Washington State" references |
| Financial aid | Federal and state financial aid programs | FAFSA, WASFA, Pell Grant, Washington College Grant, College Bound Scholarship |
| Government agencies | WA educational agencies and federal student aid | SBCTC, WSAC, esd.wa, studentaid.gov |
| Apprenticeships | Apprenticeship programs | Apprenticeship, AJAC, pre-apprenticeship |
| URLs | Any hyperlink in the response | Any http:// or https:// URL |

Scoring examples

  • Response mentions "Yakima Valley College" and includes a URL → 2 (two pattern categories matched)
  • Response mentions "FAFSA", "Running Start", and "WorkSource" → 3 (three categories matched)
  • Response is conversational with no specific references → 0
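The count-distinct-categories-then-cap logic can be sketched as follows. The regexes here are a small illustrative subset of each category's real pattern list; multiple matches within one category still count as a single point:

```python
import re

# One illustrative regex per grounding category (the real lists are far
# longer); each matched category adds one point, capped at 3.
GROUNDING_CATEGORIES = {
    "wa_colleges": r"yakima valley|highline|bellevue|green river",
    "career_resources": r"career bridge|careerbridge\.wa|worksource|o\*net",
    "financial_aid": r"\bfafsa\b|\bwasfa\b|pell grant|washington college grant",
    "urls": r"https?://",
}

def score_grounding(response: str) -> int:
    text = response.lower()
    hits = sum(bool(re.search(p, text)) for p in GROUNDING_CATEGORIES.values())
    return min(hits, 3)
```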

Why profilechat scores low on grounding

Onboarding/profilechat responses are conversational ("Tell me about yourself", "What are you interested in?") and don't typically reference specific institutions. Average grounding for profilechat is ~0.2. This is expected behavior, not a quality problem.


Dimension 3: Actionability (0–3)

Actionability measures whether the AI provides concrete next steps the student can take. The score equals the number of distinct pattern categories matched, capped at 3.

Pattern categories

Each category counts as 1 point (max 3 total):

| Category | What it matches | Examples |
| --- | --- | --- |
| Action verbs | Direct suggestions for next steps | "you could", "try", "start by", "consider", "look into", "check out", "visit", "explore", "research", "schedule", "meet with", "apply", "register", "sign up", "contact", "call", "email" |
| Sequential steps | Numbered or ordered instructions | "Step 1", "First,", "Next,", "Then,", "Here's how", "To get started" |
| Links | Clickable resources | HTML links (href=, <a), URLs (http://, https://) |
| Deadlines | Time-sensitive information | "deadline", "due date", "apply by", "October 1", "May 1", "opens on" |

Scoring examples

  • Response says "You could visit careerbridge.wa.gov to explore options" → 2 (action verb + link)
  • Response says "Step 1: Research programs. Step 2: Apply by October 1" → 3 (action verb + steps + deadline)
  • Response says "That's a great question about your future" → 0 (no concrete action)
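Actionability uses the same count-and-cap mechanism as grounding. A sketch with an illustrative subset of each category's patterns:

```python
import re

# One illustrative regex per actionability category; each matched
# category adds one point, capped at 3.
ACTIONABILITY_CATEGORIES = {
    "action_verbs": r"you could|start by|consider|look into|visit|apply",
    "sequential_steps": r"step \d|first,|next,|here's how",
    "links": r"href=|<a\s|https?://",
    "deadlines": r"deadline|due date|apply by|opens on",
}

def score_actionability(response: str) -> int:
    text = response.lower()
    hits = sum(bool(re.search(p, text))
               for p in ACTIONABILITY_CATEGORIES.values())
    return min(hits, 3)
```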

Dimension 4: Readability (0–3)

Readability measures whether the response is well-structured and an appropriate length for a student audience.

Scoring logic

| Condition | Score |
| --- | --- |
| 50–250 words and has structured formatting (bullets, numbered lists, bold) | 3 |
| Moderate length (20–300 words, but doesn't meet the criteria for 3) | 2 |
| Too short (<20 words) | 1 |
| Too long (>300 words) | 1 |

Structured formatting is detected by the presence of any of: `*` (bullet), `1.` (numbered list), `<strong>`, `<ul>`, `<li>`, or `<a` (HTML markup).
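Putting the table and the marker check together, a minimal sketch (word count here is a naive whitespace split, which is an assumption about the pipeline's tokenization):

```python
def score_readability(response: str) -> int:
    """Score length + structure per the readability table (0 is never used)."""
    words = len(response.split())  # naive whitespace word count
    markers = ("* ", "1.", "<strong>", "<ul>", "<li>", "<a")
    structured = any(m in response for m in markers)
    if 50 <= words <= 250 and structured:
        return 3
    if words < 20 or words > 300:
        return 1
    return 2
```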

Design rationale

The sweet spot for student-facing responses is 50–250 words with formatting. Shorter responses often lack substance. Longer responses risk losing student attention — particularly on mobile devices.


Interpreting Scores by Category

Activity messages (target: avg ≥ 9.0)

Activity messages are the core product experience — students exploring career, education, financial, and wellness topics within their learning pathway. These should score highest because:

  • The AI has rich RAG context about WA State post-secondary options
  • Responses should reference specific institutions and programs (grounding)
  • Responses should suggest concrete next steps (actionability)
  • Responses should be well-structured (readability)

A score below 8 on an activity message warrants investigation.

Question messages (interpret with caution)

Open Q&A messages represent ~1.5% of traffic. Small sample sizes make averages unreliable. Individual messages may score low if the question is off-topic or very general.

Profilechat messages (expected: ~5.0)

Profilechat (onboarding conversations) inherently scores low on grounding (~0.2) and actionability (~0.2) because the AI is asking the student about themselves, not recommending specific programs. A total score of 4–6 is normal and expected for profilechat.


Known Limitations

  1. Automated, not human-reviewed. All scoring is regex-based pattern matching, not semantic understanding. The system cannot assess whether a response is actually helpful or accurate — only whether it contains patterns associated with quality.

  2. False deflections. A response that includes a phrase like "I'm here to help" as part of a substantive answer may be incorrectly classified as a deflection, scoring 0 on relevance.

  3. Grounding false positives. A response that mentions "Spokane" in a non-educational context still gets a grounding point. Similarly, any URL counts as grounding even if the URL is broken or irrelevant.

  4. Actionability inflation. Common phrases like "you could" or "consider" earn actionability points even in vague responses that don't actually help the student take a concrete step.

  5. Readability doesn't assess clarity. A poorly written 150-word response with bullet points scores 3 on readability. The dimension measures structure and length, not actual readability.

  6. Category-specific baselines differ. Comparing total scores across categories (activity vs. profilechat) is misleading. Always compare within the same category.

  7. No semantic grounding verification. The system checks for mentions of WA institutions but cannot verify whether the information about those institutions is correct.

  8. Text truncation. Message and response text is truncated to 500 characters in the data lake for storage efficiency. Quality scoring runs on the full text before truncation, but the stored text may not show the patterns that contributed to the score.