Abdulhakim Bashir


My e-portfolio, based on work carried out during my MSc program in Artificial Intelligence and Machine Learning at the University of Essex.

MSc Computing Project October 2025 A

This page presents the final dissertation artefacts for Encoder-Based Policy Guardrails for Autonomous Web Agents, including the dissertation, defense deck, benchmark-grounded PCM pipeline, trained model, and focused SuiteCRM pilot.

Project Overview

The project investigates whether a lightweight encoder can act as a practical policy-compliance guardrail for autonomous web agents. The final artefact is a DeBERTa-v3-based Policy Compliance Module (PCM) trained on a benchmark-grounded synthetic corpus derived from ST-WebAgentBench and evaluated both offline and in a small live SuiteCRM pilot.

Key Artefacts

Dissertation PDF

The final submitted dissertation, including methodology, results, figures, limitations, and future work.

Defense Deck

The presentation used to communicate the research problem, artefact design, empirical evidence, and live pilot findings.

Project README

A practical guide to the final repository contents, reproduction steps, benchmark-grounded dataset, and retained comparison artefacts.

Hugging Face Model

The final benchmark-grounded PCM checkpoint released as a reusable text-classification artefact.

GitHub Repository

The full repository subtree for the dissertation artefacts, scripts, dataset, notebook, and evaluation harness.

Training Notebook

The notebook used to train and evaluate the benchmark-grounded PCM on cloud GPU infrastructure.

Final Results

Evaluation       Precision  Recall  F1      FPR     ROC-AUC
Standard test    0.9972     1.0000  0.9986  0.0028  1.0000
Challenge split  1.0000     0.8424  0.9145  0.0000  0.9792
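
As a quick consistency check on the table above, F1 can be recomputed from the reported precision and recall as their harmonic mean. The sketch below is illustrative only: the function name and the dictionary layout are assumptions made for this example and are not taken from the project repository.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the results table above.
reported = {
    "standard_test": (0.9972, 1.0000, 0.9986),
    "challenge_split": (1.0000, 0.8424, 0.9145),
}

# Recompute F1 to four decimal places, matching the table's precision.
recomputed = {
    name: round(f1_score(p, r), 4) for name, (p, r, _) in reported.items()
}

for name, (_, _, reported_f1) in reported.items():
    print(f"{name}: recomputed={recomputed[name]:.4f} reported={reported_f1:.4f}")
```

Both recomputed values agree with the reported F1 column to four decimal places. The challenge split's lower F1 comes entirely from recall, since its precision is 1.0000.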

Focused live SuiteCRM pilot:

What Can Be Browsed Here

Research Contribution

The main contribution is a benchmark-grounded, encoder-based compliance layer that can be placed in front of a BrowserGym-compatible web agent without modifying the base agent itself. On the challenge split the PCM reaches a precision of 1.0000 with no false positives (FPR 0.0000) at a recall of 0.8424, while the live pilot highlights the remaining calibration problem that must be solved before broader deployment.
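
The idea of a compliance layer that sits in front of an unmodified agent can be sketched as a thin wrapper. This is a minimal illustration under stated assumptions, not the project's implementation: `GuardedAgent`, `violation_score`, the `"BLOCKED"` fallback, the 0.5 threshold, and the keyword-based stub scorer are all hypothetical. In the real artefact the score would come from the trained PCM, and the agent would be a BrowserGym-compatible agent.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class GuardedAgent:
    """Wraps an agent with a policy check; the base agent is not modified."""

    agent: Callable[[str], str]                   # observation -> proposed action
    violation_score: Callable[[str, str], float]  # (policy, action) -> P(violation)
    policy: str
    threshold: float = 0.5            # hypothetical cut-off; needs calibration
    fallback_action: str = "BLOCKED"  # hypothetical placeholder action

    def act(self, observation: str) -> str:
        # Ask the untouched base agent for its proposed action.
        action = self.agent(observation)
        # Score the proposed action against the policy text.
        score = self.violation_score(self.policy, action)
        # Replace the action with the fallback if it looks non-compliant.
        return self.fallback_action if score >= self.threshold else action


def stub_agent(observation: str) -> str:
    """Toy agent: proposes a delete whenever a delete control is visible."""
    return "click('delete')" if "delete" in observation else "click('save')"


def stub_scorer(policy: str, action: str) -> float:
    """Toy stand-in for the classifier: flags any action mentioning 'delete'."""
    return 0.9 if "delete" in action else 0.1


guarded = GuardedAgent(
    agent=stub_agent,
    violation_score=stub_scorer,
    policy="Never delete customer records.",
)

blocked = guarded.act("record page with delete button")
allowed = guarded.act("record page with edit form")
print(blocked, allowed)
```

The threshold is where the calibration problem noted above would show up: a cut-off that works on an offline split may block too much or too little once the scorer sees live pages.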

Notes