Every previous tutorial in this series used tabular, structured data: features extracted from audit logs, ROP gadgets, PE files, or network flows. URLs are different. A URL is a string, a sequence of characters whose meaning comes from structure (https://, domain, path, parameters), lexical patterns (random character strings vs. readable words), and context (does this domain look like a bank?).
This makes URL classification a natural language processing problem, which means transformers are a natural fit. This tutorial fine-tunes DistilBERT, a smaller, faster version of BERT (covered in Part 2 of the Transformers & LLMs series), to classify URLs as phishing or legitimate. We’ll confront the tokenization challenges head-on: transformer tokenizers were trained on English text, not URLs, so subword tokenization produces unintuitive results on domain names and path components. We’ll compare the fine-tuned model against a simple TF-IDF + logistic regression baseline to understand when the complexity of a transformer is justified.
Why URLs are hard to classify
URL classification looks simple but has subtle challenges:
Legitimate: https://www.amazon.com/dp/B08N5WRWNW
Phishing: https://www.amaz0n-secure.com/login/verify.php?id=8372
Legitimate: https://accounts.google.com/signin
Phishing: https://accounts-google.com.signin-verify.xyz/auth
Legitimate: https://github.com/user/repo/issues
Phishing: https://github-auth.com/login?redirect=github.com
Distinguishing features:
| Feature | Legitimate | Phishing |
|---|---|---|
| Domain age | Old, established | Often new |
| Subdomain depth | Low (www.example.com) | Often deep (login.secure.example.xyz.com) |
| Path randomness | Readable paths | Random strings, encoded characters |
| Special characters | Few | Hyphens, numbers substituting letters |
| TLD | .com, .org, .edu | .xyz, .top, .buzz, .tk |
| HTTPS | Usually present | Increasingly present (not reliable) |
| Brand impersonation | N/A | Contains brand names in wrong positions |
Domain age requires WHOIS lookups, an external enrichment step. The models in this tutorial classify URLs from the string alone, so they rely on the other signals in this table.
A simple rule-based system catches obvious cases. The value of ML is in the gray area: URLs that look almost legitimate but have subtle deviations.
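To make the contrast concrete, here is a minimal sketch of one such rule, checking for a brand name in the wrong position. The brand list and the "last two labels" shortcut for the registered domain are assumptions for illustration; a real system would consult the Public Suffix List.

```python
from urllib.parse import urlparse

# Hypothetical brand list, for illustration only.
BRANDS = ("google", "amazon", "github")

def naive_registered_domain(hostname):
    """Last two labels of the hostname.

    Assumption: ignores multi-part suffixes such as .co.uk.
    """
    return ".".join(hostname.split(".")[-2:])

def brand_in_wrong_position(url):
    """True if a brand appears in the hostname but is not the registered domain."""
    hostname = urlparse(url).hostname or ""
    registered_label = naive_registered_domain(hostname).split(".")[0]
    return any(brand in hostname and registered_label != brand for brand in BRANDS)

for url in (
    "https://accounts.google.com/signin",
    "https://accounts-google.com.signin-verify.xyz/auth",
    "https://www.amaz0n-secure.com/login/verify.php?id=8372",
):
    print(brand_in_wrong_position(url), url)
```

The rule flags the second URL but misses the third, because amaz0n never matches the literal string amazon. That miss is the kind of gray-area case a learned model is meant to cover.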
Setting up the environment
python -m venv venv && source venv/bin/activate
pip install torch transformers datasets scikit-learn pandas numpy matplotlib
Note
Fine-tuning on the full ~584K-row training set takes roughly 30-60 minutes on a modern consumer GPU (RTX 3080/4080 class) and 8+ hours on CPU. For faster iteration while learning the pipeline, sample a subset once the DataFrame df from the Dataset section is loaded:
df = df.groupby('label').sample(50000, random_state=42)
If you have a CUDA-capable GPU, install the CUDA-enabled PyTorch version.
Dataset
We use the ealvaradob/phishing-dataset from Hugging Face. Its urls subset contains roughly 811,000 labeled URLs (about 53% legitimate, 47% phishing), which is large enough to fine-tune a transformer and still have a meaningful held-out test set. The datasets library is already in the pip install, so loading is a single call.
Warning
Supply-chain risk
trust_remote_code=True executes Python code from the dataset repository. Review the dataset script before running it, and pin a specific revision with revision='<commit-hash>' to prevent silent updates.
import pandas as pd
import numpy as np
from datasets import load_dataset
from sklearn.model_selection import train_test_split
ds = load_dataset('ealvaradob/phishing-dataset', 'urls', split='train',
                  trust_remote_code=True)
df = ds.to_pandas().rename(columns={'text': 'url'})
print(f'Total URLs: {len(df)}')
print(f' Legitimate: {sum(df.label == 0)}')
print(f' Phishing: {sum(df.label == 1)}')
Total URLs: 811446
 Legitimate: 428102
 Phishing: 383344
Note
Any raw-URL dataset with a url and label column will work. What matters is that the dataset preserves the full URL string. Feature-engineered datasets (numeric columns only, no raw text) are useful for tabular baselines but cannot be fed through a tokenizer.
train_df, test_df = train_test_split(df, test_size=0.2, stratify=df['label'], random_state=42)
train_df, val_df = train_test_split(train_df, test_size=0.1, stratify=train_df['label'], random_state=42)
print(f'Train: {len(train_df)}, Val: {len(val_df)}, Test: {len(test_df)}')
Train: 584241, Val: 64916, Test: 162289
Warning
Evaluation bias in random splits
A random split can leak domain and campaign patterns between train and test. URLs from the same phishing kit often share structural patterns, so a model that memorizes those patterns scores well on a random test set but generalizes poorly. For production evaluation, split by registered domain or by collection date so the test set contains only domains the model has never seen. The random split here is sufficient for learning the pipeline, but treat the resulting metrics as optimistic upper bounds.
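One way to split by registered domain is to hash each domain into a bucket, so every URL from the same domain lands on the same side. The sketch below uses only the standard library. The helper names and the "last two labels" approximation of the registered domain are assumptions, not part of the pipeline above.

```python
import hashlib
from urllib.parse import urlparse

def naive_registered_domain(url):
    """Approximate the registered domain as the last two hostname labels.

    Assumption: ignores multi-part public suffixes such as .co.uk.
    """
    hostname = urlparse(url).hostname or ""
    return ".".join(hostname.split(".")[-2:])

def assign_split(url, test_fraction=0.2):
    """Deterministically map a URL's domain to 'train' or 'test'."""
    domain = naive_registered_domain(url)
    digest = hashlib.sha256(domain.encode("utf-8")).digest()
    # Interpret the first 8 bytes as a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "test" if bucket < test_fraction else "train"

urls = [
    "https://login.example.com/a",
    "https://mail.example.com/b",
    "https://github.com/user/repo/issues",
]
for url in urls:
    print(assign_split(url), url)
```

Note that test_fraction applies to domains, not rows, so the row-level ratio can drift when a few domains dominate the data. scikit-learn's GroupShuffleSplit, given the domain as the group key, is an alternative if you prefer to stay within the library used elsewhere in this tutorial.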
Baseline: TF-IDF + Logistic Regression
Before fine-tuning a transformer, establish a baseline with a simple approach.
Character-level TF-IDF
Standard word-level TF-IDF doesn’t work well on URLs because URLs aren’t natural language. Character n-gram TF-IDF captures the structural patterns:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
# Character n-gram TF-IDF (3 to 5 character sequences)
tfidf = TfidfVectorizer(
    analyzer='char',
    ngram_range=(3, 5),
    max_features=50000,
)
X_train_tfidf = tfidf.fit_transform(train_df['url'])
X_val_tfidf = tfidf.transform(val_df['url'])
X_test_tfidf = tfidf.transform(test_df['url'])
# Train logistic regression
lr = LogisticRegression(max_iter=1000, C=1.0)
lr.fit(X_train_tfidf, train_df['label'])
# Evaluate
lr_preds = lr.predict(X_test_tfidf)
lr_probs = lr.predict_proba(X_test_tfidf)[:, 1]
lr_auc = roc_auc_score(test_df['label'], lr_probs)
print('--- TF-IDF + Logistic Regression ---')
print(classification_report(test_df['label'], lr_preds, target_names=['legitimate', 'phishing']))
print(f'ROC AUC: {lr_auc:.4f}')
What the baseline learns
Examine the most discriminative n-grams:
feature_names = tfidf.get_feature_names_out()
coefs = lr.coef_[0]
# Top phishing indicators (positive coefficients)
phishing_indicators = sorted(zip(feature_names, coefs), key=lambda x: -x[1])[:20]
print('Top phishing n-gram indicators:')
for ngram, coef in phishing_indicators:
    print(f' {coef:+.3f} "{ngram}"')
# Top legitimate indicators (negative coefficients)
legit_indicators = sorted(zip(feature_names, coefs), key=lambda x: x[1])[:20]
print('\nTop legitimate n-gram indicators:')
for ngram, coef in legit_indicators:
    print(f' {coef:+.3f} "{ngram}"')
Top phishing n-gram indicators:
+2.341 ".tk/"
+1.987 "-log"
+1.876 "0n-s"
+1.654 ".xyz"
+1.543 "veri"
+1.432 ".php"
...
Top legitimate n-gram indicators:
-2.123 ".com"
-1.876 "wiki"
-1.654 "gith"
-1.543 "goog"
...
The baseline is surprisingly strong. Character n-grams capture domain reputation (.tk vs .com), impersonation patterns (character substitutions), and structural signals (.php paths, deep subdomain chains).
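To see what the vectorizer is counting, the following sketch enumerates the raw character n-grams of a short string. It is a simplified stand-in for the char analyzer, not its actual implementation: the real TfidfVectorizer also lowercases input by default and applies TF-IDF weighting on top of the counts.

```python
def char_ngrams(text, n_min=3, n_max=5):
    """All contiguous character n-grams with lengths n_min..n_max."""
    return [
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    ]

grams = char_ngrams("amaz0n.tk/")
print(len(grams))          # 21 n-grams: 8 of length 3, 7 of length 4, 6 of length 5
print(".tk/" in grams)     # True
print("0n.t" in grams)     # True
```

A single suspicious TLD or digit substitution shows up in several overlapping n-grams, which is why a linear model over these features can pick up on it.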
Fine-tuning DistilBERT
Tokenization challenges
DistilBERT’s WordPiece tokenizer was trained on English Wikipedia and BookCorpus. It doesn’t know URL structure:
from transformers import DistilBertTokenizer
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
# How the tokenizer sees URLs
url = 'https://www.amaz0n-secure.com/login/verify.php?id=8372'
tokens = tokenizer.tokenize(url)
print(f'URL: {url}')
print(f'Tokens ({len(tokens)}): {tokens}')
URL: https://www.amaz0n-secure.com/login/verify.php?id=8372
Tokens (25): ['https', ':', '/', '/', 'www', '.', 'am', '##az', '##0', '##n',
 '-', 'secure', '.', 'com', '/', 'login', '/', 'verify', '.', 'php',
 '?', 'id', '=', '83', '##72']
The tokenizer splits amaz0n into am, ##az, ##0, ##n; it doesn’t recognize this as a brand impersonation. The model must learn to combine these subword tokens to recognize the pattern.
Despite this limitation, BERT-family models can still learn URL classification because:
- The attention mechanism can relate distant tokens (connecting the domain parts with the path structure)
- Fine-tuning adapts the model’s representations to the URL domain
- The [CLS] token aggregates information from all positions
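The split of amaz0n follows from how WordPiece segments a word: greedy longest-match-first against a fixed vocabulary. The sketch below illustrates that matching step with a toy vocabulary, which is an assumption for illustration. The real tokenizer has roughly 30K entries and first lowercases and splits on punctuation.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation of a single word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Toy vocabulary (hypothetical).
TOY_VOCAB = {"amazon", "secure", "am", "##az", "##0", "##n"}

print(wordpiece("amazon", TOY_VOCAB))  # ['amazon']
print(wordpiece("amaz0n", TOY_VOCAB))  # ['am', '##az', '##0', '##n']
```

A single character substitution turns one known token into four fragments, so the impersonation signal is spread across several positions that the model has to recombine.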
Dataset preparation
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import DistilBertTokenizer

class URLDataset(Dataset):
    """Wraps a list of URLs and labels for batched tokenization."""

    def __init__(self, urls, labels, tokenizer, max_length=128):
        self.urls = urls
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.urls)

    def __getitem__(self, idx):
        encoding = self.tokenizer(
            self.urls[idx],
            max_length=self.max_length,
            padding='max_length',
            truncation=True,
            return_tensors='pt',
        )
        return {
            'input_ids': encoding['input_ids'].squeeze(),
            'attention_mask': encoding['attention_mask'].squeeze(),
            'label': torch.tensor(self.labels[idx], dtype=torch.float),
        }

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
train_dataset = URLDataset(train_df['url'].tolist(), train_df['label'].tolist(), tokenizer)
val_dataset = URLDataset(val_df['url'].tolist(), val_df['label'].tolist(), tokenizer)
test_dataset = URLDataset(test_df['url'].tolist(), test_df['label'].tolist(), tokenizer)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=64)
test_loader = DataLoader(test_dataset, batch_size=64)
Model
from transformers import DistilBertModel
import torch.nn as nn

class URLClassifier(nn.Module):
    def __init__(self, dropout=0.3):
        super().__init__()
        self.bert = DistilBertModel.from_pretrained('distilbert-base-uncased')
        self.classifier = nn.Sequential(
            nn.Linear(768, 256),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(256, 1),
        )

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        # Use [CLS] token representation (first token)
        cls_output = outputs.last_hidden_state[:, 0, :]
        return self.classifier(cls_output)
Training loop
def train_model(model, train_loader, val_loader, epochs=5, lr=2e-5):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = model.to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.01)
    criterion = nn.BCEWithLogitsLoss()

    # Linear warmup scheduler
    total_steps = len(train_loader) * epochs
    warmup_steps = total_steps // 10

    def lr_lambda(step):
        if step < warmup_steps:
            return step / warmup_steps
        return max(0.0, 1.0 - (step - warmup_steps) / (total_steps - warmup_steps))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

    for epoch in range(epochs):
        model.train()
        total_loss = 0
        correct = 0
        total = 0
        for batch in train_loader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['label'].to(device).unsqueeze(1)
            optimizer.zero_grad()
            logits = model(input_ids, attention_mask)
            loss = criterion(logits, labels)
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            scheduler.step()
            total_loss += loss.item() * len(labels)
            preds = (torch.sigmoid(logits) > 0.5).float()
            correct += (preds == labels).sum().item()
            total += len(labels)
        train_acc = correct / total

        # Validation
        model.eval()
        val_correct = 0
        val_total = 0
        val_loss = 0
        with torch.no_grad():
            for batch in val_loader:
                input_ids = batch['input_ids'].to(device)
                attention_mask = batch['attention_mask'].to(device)
                labels = batch['label'].to(device).unsqueeze(1)
                logits = model(input_ids, attention_mask)
                loss = criterion(logits, labels)
                val_loss += loss.item() * len(labels)
                preds = (torch.sigmoid(logits) > 0.5).float()
                val_correct += (preds == labels).sum().item()
                val_total += len(labels)
        val_acc = val_correct / val_total
        print(f'Epoch {epoch+1}/{epochs} '
              f'train_loss={total_loss/total:.4f} train_acc={train_acc:.4f} '
              f'val_loss={val_loss/val_total:.4f} val_acc={val_acc:.4f}')
    return model

model = URLClassifier()
model = train_model(model, train_loader, val_loader)
Representative output from a single random-split run (your values will vary with hardware, library versions, and random seed):
Epoch 1/5 train_loss=0.2341 train_acc=0.9123 val_loss=0.1234 val_acc=0.9567
Epoch 2/5 train_loss=0.0876 train_acc=0.9678 val_loss=0.0765 val_acc=0.9712
Epoch 3/5 train_loss=0.0543 train_acc=0.9801 val_loss=0.0612 val_acc=0.9789
Epoch 4/5 train_loss=0.0387 train_acc=0.9867 val_loss=0.0587 val_acc=0.9801
Epoch 5/5 train_loss=0.0298 train_acc=0.9912 val_loss=0.0601 val_acc=0.9798
Evaluation and comparison
from sklearn.metrics import classification_report, roc_auc_score

def evaluate_transformer(model, test_loader, test_labels):
    device = next(model.parameters()).device
    model.eval()
    all_probs = []
    with torch.no_grad():
        for batch in test_loader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            logits = model(input_ids, attention_mask)
            probs = torch.sigmoid(logits).cpu().numpy()
            all_probs.extend(probs.flatten())
    all_probs = np.array(all_probs)
    preds = (all_probs > 0.5).astype(int)
    print('--- DistilBERT Fine-tuned ---')
    print(classification_report(test_labels, preds, target_names=['legitimate', 'phishing']))
    auc = roc_auc_score(test_labels, all_probs)
    print(f'ROC AUC: {auc:.4f}')
    return all_probs, preds

bert_probs, bert_preds = evaluate_transformer(model, test_loader, test_df['label'].values)
Head-to-head comparison
The values below are representative of a single random-split run with default hyperparameters. Expect variance across seeds, and note that random-split metrics overstate generalization to unseen campaigns (see the split warning above).
| Model | AUC | Precision (phishing) | Recall (phishing) | F1 (phishing) | Params | Inference speed |
|---|---|---|---|---|---|---|
| TF-IDF + LR | 0.976 | 0.96 | 0.94 | 0.95 | ~50K | ~100K URLs/sec |
| DistilBERT | 0.991 | 0.98 | 0.97 | 0.97 | ~67M | ~500 URLs/sec |
The transformer outperforms the baseline, particularly on subtle phishing attempts where the URL is carefully crafted to look legitimate. But in this comparison it is about 200x slower at inference (~500 vs. ~100K URLs/sec) and has over 1,000x more parameters (~67M vs. ~50K).
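A quick back-of-the-envelope calculation shows what those ratios mean operationally. The throughput and parameter figures come from the table; the daily volume of 10 million URLs is a hypothetical workload chosen for illustration.

```python
# Figures from the comparison table (approximate).
tfidf_rate = 100_000      # URLs per second
bert_rate = 500           # URLs per second
tfidf_params = 50_000
bert_params = 67_000_000

speed_ratio = tfidf_rate / bert_rate        # 200.0
param_ratio = bert_params / tfidf_params    # 1340.0

# Hypothetical workload: 10 million URLs per day on a single worker.
daily_urls = 10_000_000
tfidf_seconds = daily_urls / tfidf_rate     # 100.0 seconds
bert_hours = daily_urls / bert_rate / 3600  # about 5.6 hours

print(f'Speed ratio: {speed_ratio:.0f}x, parameter ratio: {param_ratio:.0f}x')
print(f'TF-IDF + LR: {tfidf_seconds:.0f} s, DistilBERT: {bert_hours:.1f} h')
```

Under these assumptions the baseline clears the daily volume in under two minutes, while the transformer needs several hours on a single worker.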
Where the transformer wins
Examine cases where DistilBERT correctly classifies but TF-IDF fails:
# Find disagreements
bert_correct = bert_preds == test_df['label'].values
lr_correct = lr_preds == test_df['label'].values
bert_wins = bert_correct & ~lr_correct
print(f'\nDistilBERT correct, TF-IDF wrong: {bert_wins.sum()} samples')
print('Examples:')
for idx in np.where(bert_wins)[0][:10]:
    url = test_df.iloc[idx]['url']
    label = 'phishing' if test_df.iloc[idx]['label'] == 1 else 'legitimate'
    print(f' [{label}] {url}')
The transformer typically wins on:
- Sophisticated brand impersonation, URLs that contain the real brand domain as a subdomain or path component of a different domain
- Legitimate URLs with unusual structure, URLs that use parameters or encodings that look suspicious to n-gram matchers but are actually benign
- Typosquatting and leet substitutions (0 for O, 1 for l) that the transformer can learn to recognize from surrounding context. True Unicode homograph attacks (Cyrillic а vs. Latin a) are a separate problem: browsers display punycode, so the classifier would need IDN-aware normalization to catch them.
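As a minimal sketch of what IDN-aware preprocessing could start with, the snippet below detects punycode labels in a hostname and converts a Unicode hostname to its ASCII form using Python's built-in idna codec (which implements the older IDNA 2003 rules). The function names are assumptions for illustration, and this only surfaces the signal; it does not decide whether a name is a homograph.

```python
from urllib.parse import urlparse

def has_punycode_label(url):
    """True if any hostname label uses the punycode 'xn--' prefix."""
    hostname = urlparse(url).hostname or ""
    return any(label.startswith("xn--") for label in hostname.split("."))

def to_ascii_hostname(hostname):
    """Convert a (possibly Unicode) hostname to its ASCII/punycode form."""
    return hostname.encode("idna").decode("ascii")

# '\u0430' is Cyrillic 'а', visually close to Latin 'a'.
lookalike = "\u0430pple.com"
ascii_form = to_ascii_hostname(lookalike)

print(ascii_form.startswith("xn--"))                    # True
print(has_punycode_label(f"https://{ascii_form}/login"))  # True
print(has_punycode_label("https://apple.com/login"))      # False
```

A flag like this could be passed to the classifier as an extra feature or used to route the URL to a dedicated homograph check.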
Practical recommendations
Use TF-IDF + LR when:
- You need high throughput (email gateway, proxy server)
- Interpretability matters (which n-grams triggered the classification?)
- Training data is limited (<5K samples)
- You don’t have GPU resources
Use the transformer when:
- Accuracy on edge cases is critical
- You can afford GPU inference (or batch processing)
- You have abundant training data (>50K samples)
- You plan to combine with page content analysis (for multilingual URLs, switch to a multilingual model like bert-base-multilingual-cased or a byte/character-level model; distilbert-base-uncased is English-centric)
In production, use both. TF-IDF provides fast first-pass filtering. URLs that score near the decision boundary get a second check with the transformer. Because only a small share of traffic reaches the second stage, this two-stage approach can come close to transformer-level accuracy at near-TF-IDF throughput.
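The staged idea can be sketched as a small routing function. The thresholds (0.1 and 0.9) and the two scoring callables are hypothetical placeholders. In practice the band would be tuned on validation data, and the callables would wrap the fitted TF-IDF pipeline and the fine-tuned model.

```python
def two_stage_score(url, fast_model, slow_model, low=0.1, high=0.9):
    """Return (phishing probability, stage used).

    fast_model and slow_model are callables mapping a URL to a probability.
    The slow model is only consulted when the fast score is ambiguous.
    """
    p_fast = fast_model(url)
    if p_fast <= low or p_fast >= high:
        return p_fast, "fast"
    return slow_model(url), "slow"

# Stub scorers standing in for the real models (illustration only).
def stub_fast(url):
    return 0.02 if url.endswith(".org/") else 0.55

def stub_slow(url):
    return 0.97

print(two_stage_score("https://example.org/", stub_fast, stub_slow))  # (0.02, 'fast')
print(two_stage_score("https://examp1e-login.com/", stub_fast, stub_slow))  # (0.97, 'slow')
```

The width of the ambiguous band controls the trade-off: a wider band sends more traffic to the transformer and costs throughput, a narrower one relies more on the baseline.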
Feature engineering alternatives
For comparison, here’s what a pure feature-engineering approach looks like, extracting handcrafted features from URLs without any NLP:
from urllib.parse import urlparse
from ipaddress import ip_address
import math

def _is_ip(hostname):
    """Check whether a hostname is an IPv4 or IPv6 address."""
    try:
        ip_address(hostname)
        return True
    except ValueError:
        return False

def _entropy(s):
    """Shannon entropy of a string."""
    if not s:
        return 0.0
    probs = [s.count(c) / len(s) for c in set(s)]
    return -sum(p * math.log2(p) for p in probs if p > 0)

def extract_url_features(url):
    """Extract structural features from a URL."""
    try:
        parsed = urlparse(url)
    except Exception:
        parsed = urlparse('http://invalid')
    domain = parsed.netloc or ''
    hostname = parsed.hostname or ''
    path = parsed.path or ''
    query = parsed.query or ''
    features = {
        'url_length': len(url),
        'domain_length': len(domain),
        'path_length': len(path),
        'query_length': len(query),
        'num_dots': url.count('.'),
        'num_hyphens': url.count('-'),
        'num_underscores': url.count('_'),
        'num_slashes': url.count('/'),
        'num_digits': sum(c.isdigit() for c in url),
        'num_special': sum(not c.isalnum() and c not in './-_:?' for c in url),
        'has_https': int(parsed.scheme == 'https'),
        'has_ip': int(_is_ip(hostname)),
        'subdomain_depth': max(len(hostname.split('.')) - 2, 0),
        'path_depth': len([p for p in path.split('/') if p]),
        'has_at_symbol': int('@' in url),
        'has_double_slash_redirect': int('//' in path),
        'digit_ratio': sum(c.isdigit() for c in url) / max(len(url), 1),
        'uppercase_ratio': sum(c.isupper() for c in url) / max(len(url), 1),
        'domain_entropy': _entropy(domain),
        'path_entropy': _entropy(path),
        'tld_is_suspicious': int(hostname.split('.')[-1] in
                                 ('tk', 'xyz', 'top', 'buzz', 'gq', 'ml', 'cf', 'ga')),
    }
    return features
These handcrafted features + XGBoost typically achieve AUC 0.96-0.98, which is in the same range as the TF-IDF baseline (0.976 in the comparison table) and below the transformer (0.991). The advantage is speed and interpretability. The disadvantage is maintenance: you have to manually identify and code every pattern, and new attack techniques require new features.
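The entropy features are the least intuitive of the set, so a short worked example may help. This sketch reimplements Shannon entropy with collections.Counter (an equivalent formulation, not the code above) and compares a readable path segment with a random-looking one. The sample strings are made up for illustration.

```python
import math
from collections import Counter

def shannon_entropy(s):
    """Shannon entropy in bits per character."""
    if not s:
        return 0.0
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())

print(shannon_entropy("aaaa") == 0.0)   # True: one symbol, no uncertainty
print(shannon_entropy("aabb"))          # 1.0: two equally likely symbols
print(shannon_entropy("abcd"))          # 2.0: four equally likely symbols

readable = shannon_entropy("login")        # 5 distinct characters
random_like = shannon_entropy("x7qz9k2p")  # 8 distinct characters
print(random_like > readable)              # True
```

Entropy grows with both string length and character variety, so it is most useful alongside the length features rather than on its own.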
Limitations
Adversarial robustness. Attackers who know the model exists can craft URLs that evade it, using URL shorteners, legitimate redirect services, or carefully chosen character combinations. No URL classifier is adversary-proof.
URL shorteners. Shortened URLs (bit.ly, t.co) remove all structural signals. The classifier sees only the shortener’s domain, which is legitimate. Resolving shortened URLs before classification mitigates this but adds latency.
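A minimal sketch of the resolve-then-classify step is shown below. The shortener set contains only the two services named above, and fetch_location is a hypothetical callable that returns the redirect target for a URL (for example, the Location header of an HTTP response) or None. The network call itself is deliberately left out.

```python
from urllib.parse import urljoin, urlparse

# Only the shorteners mentioned in the text; a real list would be longer.
SHORTENERS = {"bit.ly", "t.co"}

def resolve_shortened(url, fetch_location, max_hops=5):
    """Follow shortener redirects until a non-shortener URL is reached.

    fetch_location is a caller-supplied function (hypothetical) that maps a
    URL to its redirect target, or None if there is no redirect.
    """
    for _ in range(max_hops):
        hostname = urlparse(url).hostname or ""
        if hostname not in SHORTENERS:
            return url
        target = fetch_location(url)
        if target is None:
            return url
        url = urljoin(url, target)  # handles relative Location values
    return url

# Stubbed redirect table standing in for real HTTP requests.
redirects = {
    "https://bit.ly/abc": "https://t.co/xyz",
    "https://t.co/xyz": "https://example.com/login",
}
print(resolve_shortened("https://bit.ly/abc", redirects.get))
```

The max_hops cap bounds the added latency and guards against redirect loops. The resolved URL, not the shortened one, is what would then be passed to the classifier.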
Context matters. The same URL can be phishing in one context (emailed to a user claiming to be from their bank) and legitimate in another (posted on the real website). URL classification is one signal among many in a phishing detection pipeline.
Tokenizer limitations. DistilBERT’s WordPiece tokenizer wasn’t designed for URLs. Training a custom tokenizer on URL data (character-level BPE, for example) could improve results, but would require pre-training a language model from scratch on URL corpora: a significant effort.
Next steps
This tutorial moved from tabular features to raw text, showing that a fine-tuned transformer can squeeze out extra accuracy on subtle phishing URLs while a simple TF-IDF baseline handles the bulk of cases at orders-of-magnitude higher throughput. The practical takeaway is that the two approaches complement each other in a staged pipeline.
The ML for Security series will continue with new tutorials. In the meantime, if you want to go deeper on the transformer side, the Transformers & LLMs series covers the architecture and training pipeline in detail. For the security operations angle, try integrating the URL classifier into a log processing pipeline using the techniques from the anomaly detection tutorial.