ETL Pipeline¶
The analytics ETL extracts data from MongoDB Atlas, transforms it into Parquet, and loads it into Azure Data Lake for querying by Synapse and Metabase.
Data Flow¶
MongoDB Atlas (pathways-prod)
↓ hourly ETL (VM cron job)
Azure Data Lake Gen2 — raw/ (JSONL)
↓ transform step (same script)
Azure Data Lake Gen2 — curated/ (Parquet)
↓ Synapse serverless SQL views
Metabase dashboards (analytics.campusevolve.ai)
Schedule¶
The ETL runs every hour on the hour via cron (`0 * * * *`). The maximum latency between a student interaction and its appearance on a dashboard is about one hour.
Collections Extracted¶
| Collection | Contents |
|---|---|
| `profiles` | Student profiles (excluding testers, archived, developers) |
| `messages` | Student/AI message pairs with quality scores computed at transform time |
| `actions` | User actions (login, signup, disclosure, onboarding) |
| `paths` | AI-generated learning pathways |
| `tasks` | Individual tasks within pathways |
| `errorlogs` | Application error records |
Watermark System¶
The ETL uses incremental extraction with a watermark per collection. Each collection's watermark is the createdAt timestamp of the last document successfully extracted.
- Watermarks are stored in Azure Table Storage (`watermarkstable`)
- Each run queries MongoDB for documents with `createdAt > watermark`
- After successful extraction, the watermark is updated to the latest `createdAt`
- If no watermark exists for a collection, it defaults to 90 days ago
This ensures each run only processes new data, keeping extraction fast and MongoDB load minimal.
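The watermark logic above can be sketched roughly as follows. This is an illustrative sketch, not the actual `etl.py` code: the Azure Table Storage lookup is replaced by an in-memory dict, and the function names are invented for the example.

```python
from datetime import datetime, timedelta, timezone

# Stand-in for watermarkstable in Azure Table Storage.
# In etl.py this would be read/written via the Azure Tables SDK.
_watermarks: dict[str, datetime] = {}

def get_watermark(collection: str) -> datetime:
    """Return the last extracted createdAt, defaulting to 90 days ago."""
    default = datetime.now(timezone.utc) - timedelta(days=90)
    return _watermarks.get(collection, default)

def extract_incremental(collection, find):
    """Fetch only documents newer than the watermark, then advance it.

    `find` stands in for a MongoDB collection query taking a filter dict.
    """
    watermark = get_watermark(collection)
    docs = list(find({"createdAt": {"$gt": watermark}}))
    if docs:
        # Advance the watermark to the newest document seen this run.
        _watermarks[collection] = max(d["createdAt"] for d in docs)
    return docs
```

Because the watermark only advances after a successful extraction, a failed run leaves it untouched and the next run re-extracts the same window.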
Tester Exclusion¶
Tester data is excluded at extraction time and never enters the data lake. The ETL identifies tester accounts by:
- `isTester: true` flag on the profile
- `isArchived: true` flag on the profile
- `first_name` field exists and is non-empty (real students don't have this field; test profiles created by developers do)
For non-profile collections (`messages`, `actions`, `paths`, `tasks`), the ETL first collects all tester profile IDs, then filters by `userId NOT IN tester_ids`.
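Expressed as PyMongo-style filter documents, the exclusion rules above look roughly like this. The field names come from this page; the helper names are illustrative, not taken from `etl.py`:

```python
def tester_profile_filter() -> dict:
    """MongoDB filter matching tester/archived/dev-created profiles."""
    return {
        "$or": [
            {"isTester": True},
            {"isArchived": True},
            # Real students don't have first_name; dev test profiles do.
            {"first_name": {"$exists": True, "$nin": [None, ""]}},
        ]
    }

def non_tester_filter(tester_ids: list) -> dict:
    """Filter for messages/actions/paths/tasks: drop docs owned by testers."""
    return {"userId": {"$nin": tester_ids}}
```

A real run would first query `profiles` with `tester_profile_filter()` to collect the tester `_id` values, then pass that list to `non_tester_filter()` for each of the other collections.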
Quality Scoring¶
Quality scores are computed during the transform step for messages in the `activity`, `question`, and `profilechat` categories. See the Quality Scoring page for full methodology.
Current VM Setup¶
Cost: ~$8/mo (B1s VM + static IP)
MongoDB Atlas requires IP whitelisting. Azure serverless compute (Consumption Functions, Consumption Container Apps) uses shared outbound IP pools that can't be reliably whitelisted. A dedicated VM provides a static public IP.
| Resource | Value |
|---|---|
| VM | etl-vm in ce-analytics-prod-rg |
| Static IP | 20.25.140.178 |
| SSH | ssh azureuser@20.25.140.178 |
| ETL script | /home/azureuser/etl/etl.py |
| Config | /home/azureuser/etl/.env |
| Logs | /home/azureuser/etl/etl.log |
| Schedule | Hourly via cron (0 * * * *) |
| Atlas whitelist | 20.25.140.178/32 (comment: ETL VM Azure West US 3) |
SSH Access¶
ssh azureuser@20.25.140.178
Running Manually¶
# SSH into the VM first
ssh azureuser@20.25.140.178
# Run a normal incremental ETL
cd ~/etl && export $(grep -v '^#' .env | xargs) && python3 etl.py
# Check logs
tail -50 ~/etl/etl.log
# Check cron schedule
crontab -l
Backfill¶
To re-extract the last 90 days of data (resets all watermarks):
cd ~/etl && export $(grep -v '^#' .env | xargs) && python3 etl.py --backfill
Use backfill when:
- Quality scoring logic has changed and you need to re-score historical messages
- Data was missed due to an outage
- A new collection was added to the ETL
Checking Logs¶
# Last 50 lines
tail -50 ~/etl/etl.log
# Follow logs in real time
tail -f ~/etl/etl.log
# Search for errors
grep -i error ~/etl/etl.log | tail -20
A healthy run looks like:
2026-03-21 00:00:01 [INFO] ETL run: 20260321_000001
2026-03-21 00:00:02 [INFO] MongoDB: pathways-prod
2026-03-21 00:00:02 [INFO] Excluding 47 tester/archived profiles
2026-03-21 00:00:03 [INFO] messages (since 2026-03-20T23:00:00):
2026-03-21 00:00:04 [INFO] raw: 12 docs → raw/messages/date=2026-03-21/messages_20260321_000001.jsonl
2026-03-21 00:00:04 [INFO] curated: 12 rows → curated/messages/date=2026-03-21/messages_20260321_000001.parquet
...
2026-03-21 00:00:08 [INFO] Done: 15 documents across 6 collections
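The output paths in the log follow a date-partitioned naming scheme with a per-run timestamp suffix. A sketch of how such paths could be built (the function name is illustrative, not from `etl.py`):

```python
from datetime import datetime

def output_paths(collection: str, run_ts: datetime) -> tuple[str, str]:
    """Build the raw JSONL and curated Parquet paths seen in the ETL log,
    using hive-style date=YYYY-MM-DD partition folders."""
    date = run_ts.strftime("%Y-%m-%d")
    run_id = run_ts.strftime("%Y%m%d_%H%M%S")
    raw = f"raw/{collection}/date={date}/{collection}_{run_id}.jsonl"
    curated = f"curated/{collection}/date={date}/{collection}_{run_id}.parquet"
    return raw, curated
```

The `date=` partition folders let Synapse serverless SQL prune by date when querying the curated Parquet files.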
Troubleshooting¶
ETL produces 0 documents every run¶
Check the watermark. If the watermark is ahead of the actual data (e.g., due to a timezone issue), the query `createdAt > watermark` returns nothing. Run a backfill to reset watermarks.
MongoDB connection timeout¶
Check the Atlas IP whitelist. If the VM's IP changed (rare but possible if the static IP was deallocated), the connection will fail. Verify the IP:
# On the VM
curl -s ifconfig.me
Compare with the Atlas whitelist entry (20.25.140.178). If different, update the whitelist in the MongoDB Atlas console under Network Access.
Cron not running¶
# Verify cron is scheduled
crontab -l
# Check if cron service is running
systemctl status cron
Data shows in raw/ but not curated/¶
The transform step may be failing on specific documents. Check the logs for "Skipping doc" warnings, which indicate individual document transform failures.
Stale dashboard data¶
- Check if the ETL ran recently: `tail -5 ~/etl/etl.log`
- Check if the VM is up: `ssh azureuser@20.25.140.178`
- If the VM rebooted, cron restarts automatically, but any in-flight run at reboot time is lost
VM Limitations¶
- VM needs OS patches (unattended-upgrades is enabled by default on Ubuntu)
- No automatic recovery if the VM reboots — cron restarts with the OS, but in-flight runs are lost
- No built-in alerting if the ETL fails — check logs manually or add monitoring
- Single point of failure — if the VM goes down, data stops flowing
Future Migration: NAT Gateway¶
When the pipeline becomes mission-critical or needs SLA guarantees, the plan is to migrate to a cloud-native architecture using Azure Functions with a NAT Gateway for static IP.
Target architecture¶
Azure Functions (Flex Consumption or EP1)
└── VNet Integration
└── Subnet with NAT Gateway
└── Static Public IP
└── Whitelisted in MongoDB Atlas
Cost comparison¶
| Setup | Monthly Cost |
|---|---|
| Current VM (B1s + static IP) | ~$8 |
| NAT Gateway (EP1 + NAT + IP) | ~$111 |
Benefits of NAT Gateway over VM¶
- Fully managed — no OS patches, no SSH keys to manage
- Auto-scaling — handles load spikes without intervention
- Built-in monitoring — Azure Functions metrics, App Insights, alerting
- Built-in retry — timer trigger automatically retries on failure
- No single point of failure — Azure manages availability
The migration steps are documented in the vm-etl/NETWORKING.md file in the analytics-pipeline repo.