
ETL Pipeline

The analytics ETL extracts data from MongoDB Atlas, transforms it into Parquet, and loads it into Azure Data Lake for querying by Synapse and Metabase.

Data Flow

MongoDB Atlas (pathways-prod)
    ↓ hourly ETL (VM cron job)
Azure Data Lake Gen2 — raw/ (JSONL)
    ↓ transform step (same script)
Azure Data Lake Gen2 — curated/ (Parquet)
    ↓ Synapse serverless SQL views
Metabase dashboards (analytics.campusevolve.ai)
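The raw-to-curated step in the flow above flattens each extracted MongoDB document into a flat row before it is written as Parquet. A minimal sketch of that flattening; the field names and one-level nesting rule here are illustrative assumptions, not lifted from etl.py:

```python
from datetime import datetime, timezone

def flatten_doc(doc: dict) -> dict:
    """Flatten one MongoDB document into a flat row suitable for a
    columnar (Parquet) table. Field names are illustrative only."""
    row = {}
    for key, value in doc.items():
        if isinstance(value, dict):
            # One level of nesting: promote sub-fields with a dotted prefix
            for sub_key, sub_value in value.items():
                row[f"{key}.{sub_key}"] = sub_value
        elif isinstance(value, datetime):
            # Normalize timestamps to ISO-8601 UTC strings
            row[key] = value.astimezone(timezone.utc).isoformat()
        else:
            row[key] = value
    return row

doc = {
    "_id": "abc123",
    "createdAt": datetime(2026, 3, 21, tzinfo=timezone.utc),
    "meta": {"category": "question"},
}
print(flatten_doc(doc)["meta.category"])  # question
```

In the real pipeline the raw JSONL keeps the document as extracted, and only the curated Parquet gets this kind of flattened schema.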

Schedule

The ETL runs every hour on the hour via cron (0 * * * *). Worst-case latency between a student interaction and its appearance in dashboards is therefore about one hour, plus the run's own execution time.

Collections Extracted

Collection   Contents
profiles     Student profiles (excluding testers, archived, developers)
messages     Student/AI message pairs with quality scores computed at transform time
actions      User actions (login, signup, disclosure, onboarding)
paths        AI-generated learning pathways
tasks        Individual tasks within pathways
errorlogs    Application error records

Watermark System

The ETL uses incremental extraction with a watermark per collection. Each collection's watermark is the createdAt timestamp of the last document successfully extracted.

  • Watermarks are stored in Azure Table Storage (watermarks table)
  • Each run queries MongoDB for documents with createdAt > watermark
  • After successful extraction, the watermark is updated to the latest createdAt
  • If no watermark exists for a collection, it defaults to 90 days ago

This ensures each run only processes new data, keeping extraction fast and MongoDB load minimal.
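The watermark rules above reduce to two small pieces of pure logic: pick the effective watermark (defaulting to 90 days ago when none is stored) and advance it after a successful run. A sketch of that logic, with the Azure Table Storage reads/writes omitted; function names are assumptions, not the actual etl.py internals:

```python
from datetime import datetime, timedelta, timezone
from typing import List, Optional

DEFAULT_LOOKBACK = timedelta(days=90)

def effective_watermark(stored: Optional[datetime], now: datetime) -> datetime:
    """No stored watermark for a collection -> default to 90 days ago."""
    return stored if stored is not None else now - DEFAULT_LOOKBACK

def advance_watermark(current: datetime,
                      extracted_created_ats: List[datetime]) -> datetime:
    """After a successful extraction, move the watermark to the newest
    createdAt seen; an empty batch leaves it unchanged."""
    return max(extracted_created_ats, default=current)

# The extraction query is then (PyMongo-style, illustrative):
#   collection.find({"createdAt": {"$gt": effective_watermark(stored, now)}})
```

Advancing to the newest extracted createdAt (rather than to "now") means a document written mid-run is still picked up by the next hourly run.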

Tester Exclusion

Tester data is excluded at extraction time and never enters the data lake. The ETL treats a profile as a tester account if it matches any of the following criteria:

  • isTester: true flag on the profile
  • isArchived: true flag on the profile
  • first_name field exists and is non-empty (real students don't have this field; test profiles created by developers do)

For non-profile collections (messages, actions, paths, tasks), the ETL first collects all tester profile IDs, then filters by userId NOT IN tester_ids.
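The two-stage exclusion above can be expressed as PyMongo-style filter documents. A sketch assuming the field names listed; the exact query construction in etl.py may differ:

```python
# Profiles matching ANY of these criteria are treated as tester accounts.
TESTER_FILTER = {
    "$or": [
        {"isTester": True},
        {"isArchived": True},
        # Real students never have first_name; developer-created test
        # profiles do, and it is non-empty.
        {"first_name": {"$exists": True, "$ne": ""}},
    ]
}

def non_tester_filter(tester_ids: list) -> dict:
    """Filter for non-profile collections: keep only documents whose
    userId is NOT one of the collected tester profile IDs."""
    return {"userId": {"$nin": tester_ids}}

# Usage (hypothetical):
#   tester_ids = db.profiles.distinct("_id", TESTER_FILTER)
#   docs = db.messages.find({**non_tester_filter(tester_ids),
#                            "createdAt": {"$gt": watermark}})
```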

Quality Scoring

Quality scores are computed during the transform step for messages in the activity, question, and profilechat categories. See the Quality Scoring page for full methodology.


Current VM Setup

Cost: ~$8/mo (B1s VM + static IP)

MongoDB Atlas requires IP whitelisting. Azure serverless compute (Consumption Functions, Consumption Container Apps) uses shared outbound IP pools that can't be reliably whitelisted. A dedicated VM provides a static public IP.

Resource          Value
VM                etl-vm in ce-analytics-prod-rg
Static IP         20.25.140.178
SSH               ssh azureuser@20.25.140.178
ETL script        /home/azureuser/etl/etl.py
Config            /home/azureuser/etl/.env
Logs              /home/azureuser/etl/etl.log
Schedule          Hourly via cron (0 * * * *)
Atlas whitelist   20.25.140.178/32 (comment: ETL VM Azure West US 3)

SSH Access

ssh azureuser@20.25.140.178

Running Manually

# SSH into the VM first
ssh azureuser@20.25.140.178

# Run a normal incremental ETL
cd ~/etl && export $(grep -v '^#' .env | xargs) && python3 etl.py

# Check logs
tail -50 ~/etl/etl.log

# Check cron schedule
crontab -l

Backfill

To re-extract the last 90 days of data (resets all watermarks):

cd ~/etl && export $(grep -v '^#' .env | xargs) && python3 etl.py --backfill

Use backfill when:

  • Quality scoring logic has changed and you need to re-score historical messages
  • Data was missed due to an outage
  • A new collection was added to the ETL
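The effect of --backfill on the watermark table can be sketched as follows; this models the behavior described above (every watermark reset to 90 days ago) and is an assumption about etl.py, not its actual code:

```python
from datetime import datetime, timedelta, timezone
from typing import Dict

COLLECTIONS = ["profiles", "messages", "actions", "paths", "tasks", "errorlogs"]

def backfill_watermarks(now: datetime) -> Dict[str, datetime]:
    """Reset every collection's watermark to 90 days before `now`, so the
    next extraction re-processes the full backfill window."""
    cutoff = now - timedelta(days=90)
    return {collection: cutoff for collection in COLLECTIONS}
```

Note that a backfill rewrites the last 90 days of raw/ and curated/ output, so any downstream consumer reading those paths should tolerate files being replaced.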

Checking Logs

# Last 50 lines
tail -50 ~/etl/etl.log

# Follow logs in real time
tail -f ~/etl/etl.log

# Search for errors
grep -i error ~/etl/etl.log | tail -20

A healthy run looks like:

2026-03-21 00:00:01 [INFO] ETL run: 20260321_000001
2026-03-21 00:00:02 [INFO] MongoDB: pathways-prod
2026-03-21 00:00:02 [INFO] Excluding 47 tester/archived profiles
2026-03-21 00:00:03 [INFO] messages (since 2026-03-20T23:00:00):
2026-03-21 00:00:04 [INFO]   raw: 12 docs → raw/messages/date=2026-03-21/messages_20260321_000001.jsonl
2026-03-21 00:00:04 [INFO]   curated: 12 rows → curated/messages/date=2026-03-21/messages_20260321_000001.parquet
...
2026-03-21 00:00:08 [INFO] Done: 15 documents across 6 collections

Troubleshooting

ETL produces 0 documents every run

Check the watermark. If the watermark is ahead of the actual data (e.g., due to a timezone issue), the query createdAt > watermark returns nothing. Run a backfill to reset watermarks.
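One common way the timezone issue manifests: a naive local timestamp stored as the watermark compares as "ahead" of UTC createdAt values. A hypothetical diagnostic check, normalizing both sides to UTC before comparing:

```python
from datetime import datetime, timezone

def as_utc(ts: datetime) -> datetime:
    """Treat naive timestamps as UTC; convert aware ones to UTC."""
    if ts.tzinfo is None:
        return ts.replace(tzinfo=timezone.utc)
    return ts.astimezone(timezone.utc)

def watermark_is_ahead(watermark: datetime, newest_created_at: datetime) -> bool:
    """True when the stored watermark is newer than the newest document in
    MongoDB, which makes `createdAt > watermark` match nothing."""
    return as_utc(watermark) > as_utc(newest_created_at)
```

If this check comes back True for a collection, the stored watermark is bad and a backfill (which resets it) is the fix.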

MongoDB connection timeout

Check the Atlas IP whitelist. If the VM's IP changed (rare but possible if the static IP was deallocated), the connection will fail. Verify the IP:

# On the VM
curl -s ifconfig.me

Compare with the Atlas whitelist entry (20.25.140.178). If different, update the whitelist in the MongoDB Atlas console under Network Access.

Cron not running

# Verify cron is scheduled
crontab -l

# Check if cron service is running
systemctl status cron

Data shows in raw/ but not curated/

The transform step may be failing on specific documents. Check the logs for "Skipping doc" warnings, which indicate individual document transform failures.

Stale dashboard data

  1. Check if the ETL ran recently: tail -5 ~/etl/etl.log
  2. Check if the VM is up: ssh azureuser@20.25.140.178
  3. If the VM rebooted, cron restarts automatically, but any in-flight run at reboot time is lost

VM Limitations

  • VM needs OS patches (unattended-upgrades is enabled by default on Ubuntu)
  • No automatic recovery of interrupted runs: cron restarts with the OS, but any run in flight at reboot is lost
  • No built-in alerting if the ETL fails — check logs manually or add monitoring
  • Single point of failure — if the VM goes down, data stops flowing

Future Migration: NAT Gateway

When the pipeline becomes mission-critical or needs SLA guarantees, the plan is to migrate to a cloud-native architecture using Azure Functions with a NAT Gateway for static IP.

Target architecture

Azure Functions (Flex Consumption or EP1)
    └── VNet Integration
         └── Subnet with NAT Gateway
              └── Static Public IP
                   └── Whitelisted in MongoDB Atlas

Cost comparison

Setup                          Monthly Cost
Current VM (B1s + static IP)   ~$8
NAT Gateway (EP1 + NAT + IP)   ~$111

Benefits of NAT Gateway over VM

  • Fully managed — no OS patches, no SSH keys to manage
  • Auto-scaling — handles load spikes without intervention
  • Built-in monitoring — Azure Functions metrics, App Insights, alerting
  • Built-in retry — timer trigger automatically retries on failure
  • No single point of failure — Azure manages availability

The migration steps are documented in the vm-etl/NETWORKING.md file in the analytics-pipeline repo.