
ETL Pipeline

The analytics ETL extracts data from MongoDB Atlas, transforms it into Parquet, and loads it into Azure Data Lake for querying by Synapse and Metabase.

Data Flow

MongoDB Atlas (pathways-prod)
    ↓ hourly ETL (VM cron job)
Azure Data Lake Gen2 — raw/ (JSONL)
    ↓ transform step (same script)
Azure Data Lake Gen2 — curated/ (Parquet)
    ↓ Synapse serverless SQL views
Metabase dashboards (analytics.campusevolve.ai)
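The raw-to-curated step in the flow above flattens each extracted MongoDB document into a flat row before it is written as Parquet. A minimal sketch of that flattening; the field names and one-level nesting rule here are illustrative assumptions, not lifted from etl.py:

```python
from datetime import datetime, timezone

def flatten_doc(doc: dict) -> dict:
    """Flatten one MongoDB document into a flat row suitable for a
    columnar (Parquet) table. Field names are illustrative only."""
    row = {}
    for key, value in doc.items():
        if isinstance(value, dict):
            # One level of nesting: promote sub-fields with a dotted prefix
            for sub_key, sub_value in value.items():
                row[f"{key}.{sub_key}"] = sub_value
        elif isinstance(value, datetime):
            # Normalize timestamps to ISO-8601 UTC strings
            row[key] = value.astimezone(timezone.utc).isoformat()
        else:
            row[key] = value
    return row

doc = {
    "_id": "abc123",
    "createdAt": datetime(2026, 3, 21, tzinfo=timezone.utc),
    "meta": {"category": "question"},
}
print(flatten_doc(doc)["meta.category"])  # question
```

In the real pipeline the raw JSONL keeps the document as extracted, and only the curated Parquet gets this kind of flattened schema.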

Schedule

The ETL runs every hour on the hour via cron (0 * * * *). Worst-case latency between a student interaction and its appearance in dashboards is therefore about one hour, plus the run's own execution time.

Collections Extracted

Collection   Contents
profiles     Student profiles (excluding testers, archived, developers)
messages     Student/AI message pairs with quality scores computed at transform time
actions      User actions (login, signup, disclosure, onboarding)
paths        AI-generated learning pathways
tasks        Individual tasks within pathways
errorlogs    Application error records

Watermark System

The ETL uses incremental extraction with a watermark per collection. Each collection's watermark is the createdAt timestamp of the last document successfully extracted.

  • Watermarks are stored in Azure Table Storage (watermarks table)
  • Each run queries MongoDB for documents with createdAt > watermark
  • After successful extraction, the watermark is updated to the latest createdAt
  • If no watermark exists for a collection, it defaults to 90 days ago

This ensures each run only processes new data, keeping extraction fast and MongoDB load minimal.
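The watermark rules above reduce to two small pieces of pure logic: pick the effective watermark (defaulting to 90 days ago when none is stored) and advance it after a successful run. A sketch of that logic, with the Azure Table Storage reads/writes omitted; function names are assumptions, not the actual etl.py internals:

```python
from datetime import datetime, timedelta, timezone
from typing import List, Optional

DEFAULT_LOOKBACK = timedelta(days=90)

def effective_watermark(stored: Optional[datetime], now: datetime) -> datetime:
    """No stored watermark for a collection -> default to 90 days ago."""
    return stored if stored is not None else now - DEFAULT_LOOKBACK

def advance_watermark(current: datetime,
                      extracted_created_ats: List[datetime]) -> datetime:
    """After a successful extraction, move the watermark to the newest
    createdAt seen; an empty batch leaves it unchanged."""
    return max(extracted_created_ats, default=current)

# The extraction query is then (PyMongo-style, illustrative):
#   collection.find({"createdAt": {"$gt": effective_watermark(stored, now)}})
```

Advancing to the newest extracted createdAt (rather than to "now") means a document written mid-run is still picked up by the next hourly run.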

Tester Exclusion

Tester data is excluded at extraction time and never enters the data lake. The ETL treats a profile as a tester account if it matches any of the following criteria:

  • isTester: true flag on the profile
  • isArchived: true flag on the profile
  • first_name field exists and is non-empty (real students don't have this field; test profiles created by developers do)

For non-profile collections (messages, actions, paths, tasks), the ETL first collects all tester profile IDs, then filters by userId NOT IN tester_ids.
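The two-stage exclusion above can be expressed as PyMongo-style filter documents. A sketch assuming the field names listed; the exact query construction in etl.py may differ:

```python
# Profiles matching ANY of these criteria are treated as tester accounts.
TESTER_FILTER = {
    "$or": [
        {"isTester": True},
        {"isArchived": True},
        # Real students never have first_name; developer-created test
        # profiles do, and it is non-empty.
        {"first_name": {"$exists": True, "$ne": ""}},
    ]
}

def non_tester_filter(tester_ids: list) -> dict:
    """Filter for non-profile collections: keep only documents whose
    userId is NOT one of the collected tester profile IDs."""
    return {"userId": {"$nin": tester_ids}}

# Usage (hypothetical):
#   tester_ids = db.profiles.distinct("_id", TESTER_FILTER)
#   docs = db.messages.find({**non_tester_filter(tester_ids),
#                            "createdAt": {"$gt": watermark}})
```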

Quality Scoring

Quality scores are computed during the transform step for messages in the activity, question, and profilechat categories. See the Quality Scoring page for full methodology.


Current VM Setup

Cost: ~$8/mo (B1s VM + static IP)

MongoDB Atlas requires IP whitelisting. Azure serverless compute (Consumption Functions, Consumption Container Apps) uses shared outbound IP pools that can't be reliably whitelisted. A dedicated VM provides a static public IP.

Resource          Value
VM                etl-vm in ce-analytics-prod-rg
Static IP         20.25.140.178
SSH               ssh azureuser@20.25.140.178
ETL script        /home/azureuser/etl/etl.py
Config            /home/azureuser/etl/.env
Logs              /home/azureuser/etl/etl.log
Schedule          Hourly via cron (0 * * * *)
Atlas whitelist   20.25.140.178/32 (comment: ETL VM Azure West US 3)

SSH Access

ssh azureuser@20.25.140.178

Running Manually

# SSH into the VM first
ssh azureuser@20.25.140.178

# Run a normal incremental ETL
cd ~/etl && export $(grep -v '^#' .env | xargs) && python3 etl.py

# Check logs
tail -50 ~/etl/etl.log

# Check cron schedule
crontab -l

Backfill

To re-extract the last 90 days of data (resets all watermarks):

cd ~/etl && export $(grep -v '^#' .env | xargs) && python3 etl.py --backfill

Use backfill when:

  • Quality scoring logic has changed and you need to re-score historical messages
  • Data was missed due to an outage
  • A new collection was added to the ETL
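The effect of --backfill on the watermark table can be sketched as follows; this models the behavior described above (every watermark reset to 90 days ago) and is an assumption about etl.py, not its actual code:

```python
from datetime import datetime, timedelta, timezone
from typing import Dict

COLLECTIONS = ["profiles", "messages", "actions", "paths", "tasks", "errorlogs"]

def backfill_watermarks(now: datetime) -> Dict[str, datetime]:
    """Reset every collection's watermark to 90 days before `now`, so the
    next extraction re-processes the full backfill window."""
    cutoff = now - timedelta(days=90)
    return {collection: cutoff for collection in COLLECTIONS}
```

Note that a backfill rewrites the last 90 days of raw/ and curated/ output, so any downstream consumer reading those paths should tolerate files being replaced.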

Checking Logs

# Last 50 lines
tail -50 ~/etl/etl.log

# Follow logs in real time
tail -f ~/etl/etl.log

# Search for errors
grep -i error ~/etl/etl.log | tail -20

A healthy run looks like:

2026-03-21 00:00:01 [INFO] ETL run: 20260321_000001
2026-03-21 00:00:02 [INFO] MongoDB: pathways-prod
2026-03-21 00:00:02 [INFO] Excluding 47 tester/archived profiles
2026-03-21 00:00:03 [INFO] messages (since 2026-03-20T23:00:00):
2026-03-21 00:00:04 [INFO]   raw: 12 docs → raw/messages/date=2026-03-21/messages_20260321_000001.jsonl
2026-03-21 00:00:04 [INFO]   curated: 12 rows → curated/messages/date=2026-03-21/messages_20260321_000001.parquet
...
2026-03-21 00:00:08 [INFO] Done: 15 documents across 6 collections

Troubleshooting

ETL produces 0 documents every run

Check the watermark. If the watermark is ahead of the actual data (e.g., due to a timezone issue), the query createdAt > watermark returns nothing. Run a backfill to reset watermarks.
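One common way the timezone issue manifests: a naive local timestamp stored as the watermark compares as "ahead" of UTC createdAt values. A hypothetical diagnostic check, normalizing both sides to UTC before comparing:

```python
from datetime import datetime, timezone

def as_utc(ts: datetime) -> datetime:
    """Treat naive timestamps as UTC; convert aware ones to UTC."""
    if ts.tzinfo is None:
        return ts.replace(tzinfo=timezone.utc)
    return ts.astimezone(timezone.utc)

def watermark_is_ahead(watermark: datetime, newest_created_at: datetime) -> bool:
    """True when the stored watermark is newer than the newest document in
    MongoDB, which makes `createdAt > watermark` match nothing."""
    return as_utc(watermark) > as_utc(newest_created_at)
```

If this check comes back True for a collection, the stored watermark is bad and a backfill (which resets it) is the fix.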

MongoDB connection timeout

Check the Atlas IP whitelist. If the VM's IP changed (rare but possible if the static IP was deallocated), the connection will fail. Verify the IP:

# On the VM
curl -s ifconfig.me

Compare with the Atlas whitelist entry (20.25.140.178). If different, update the whitelist in the MongoDB Atlas console under Network Access.

Cron not running

# Verify cron is scheduled
crontab -l

# Check if cron service is running
systemctl status cron

Data shows in raw/ but not curated/

The transform step may be failing on specific documents. Check the logs for "Skipping doc" warnings, which indicate individual document transform failures.

Stale dashboard data

  1. Check if the ETL ran recently: tail -5 ~/etl/etl.log
  2. Check if the VM is up: ssh azureuser@20.25.140.178
  3. If the VM rebooted, cron restarts automatically, but any in-flight run at reboot time is lost

VM Limitations

  • VM needs OS patches (unattended-upgrades is enabled by default on Ubuntu)
  • No automatic recovery of interrupted runs: cron restarts with the OS, but any run in flight at reboot is lost
  • No built-in alerting if the ETL fails — check logs manually or add monitoring
  • Single point of failure — if the VM goes down, data stops flowing

Future Migration: NAT Gateway

When the pipeline becomes mission-critical or needs SLA guarantees, the plan is to migrate to a cloud-native architecture using Azure Functions with a NAT Gateway for static IP.

Target architecture

Azure Functions (Flex Consumption or EP1)
    └── VNet Integration
         └── Subnet with NAT Gateway
              └── Static Public IP
                   └── Whitelisted in MongoDB Atlas

Cost comparison

Setup                          Monthly Cost
Current VM (B1s + static IP)   ~$8
NAT Gateway (EP1 + NAT + IP)   ~$111

Benefits of NAT Gateway over VM

  • Fully managed — no OS patches, no SSH keys to manage
  • Auto-scaling — handles load spikes without intervention
  • Built-in monitoring — Azure Functions metrics, App Insights, alerting
  • Built-in retry — timer trigger automatically retries on failure
  • No single point of failure — Azure manages availability

The migration steps are documented in the vm-etl/NETWORKING.md file in the analytics-pipeline repo.