Phishing URL Detection with Fine-Tuned Transformers

Every previous tutorial in this series used tabular, structured data: features extracted from audit logs, ROP gadgets, PE files, or network flows. URLs are different. A URL is a string, a sequence of characters whose meaning comes from structure (https://, domain, path, parameters), lexical patterns (random character strings vs. readable words), and context (does this domain look like a bank?).

This makes URL classification a natural language processing problem, which means transformers are a natural fit. This tutorial fine-tunes DistilBERT, a smaller, faster version of BERT (covered in Part 2 of the Transformers & LLMs series), to classify URLs as phishing or legitimate. We’ll confront the tokenization challenges head-on: transformer tokenizers were trained on English text, not URLs, so subword tokenization produces unintuitive results on domain names and path components. We’ll compare the fine-tuned model against a simple TF-IDF + logistic regression baseline to understand when the complexity of a transformer is justified.

Why URLs are hard to classify

URL classification looks simple but has subtle challenges:

Legitimate:  https://www.amazon.com/dp/B08N5WRWNW
Phishing:    https://www.amaz0n-secure.com/login/verify.php?id=8372

Legitimate:  https://accounts.google.com/signin
Phishing:    https://accounts-google.com.signin-verify.xyz/auth

Legitimate:  https://github.com/user/repo/issues
Phishing:    https://github-auth.com/login?redirect=github.com

Distinguishing features:

Feature	Legitimate	Phishing
Domain age	Old, established	Often new
Subdomain depth	Low (www.example.com)	Often deep (login.secure.example.xyz.com)
Path randomness	Readable paths	Random strings, encoded characters
Special characters	Few	Hyphens, numbers substituting letters
TLD	.com, .org, .edu	.xyz, .top, .buzz, .tk
HTTPS	Usually present	Increasingly present (not reliable)
Brand impersonation	N/A	Contains brand names in wrong positions

Domain age requires WHOIS lookups, an external enrichment step. The models in this tutorial classify URLs from the string alone, so they rely on the other signals in this table.

A simple rule-based system catches obvious cases. The value of ML is in the gray area: URLs that look almost legitimate but have subtle deviations.

Setting up the environment

python -m venv venv && source venv/bin/activate
pip install torch transformers datasets scikit-learn pandas numpy matplotlib

Note

Fine-tuning on the full ~584K-row training set takes roughly 30-60 minutes on a modern consumer GPU (RTX 3080/4080 class) and 8+ hours on CPU. For faster iteration while learning the pipeline, sample a subset: df = df.groupby('label').sample(50000, random_state=42). If you have a CUDA-capable GPU, install the CUDA-enabled PyTorch version.

Dataset

We use the ealvaradob/phishing-dataset from Hugging Face. Its urls subset contains roughly 800,000 labeled URLs (about 53% legitimate, 47% phishing), which is large enough to fine-tune a transformer and still have a meaningful held-out test set. The datasets library is already in the pip install, so loading is a single call.

Warning

Supply-chain risk: trust_remote_code=True executes Python code from the dataset repository. Review the dataset script before running it, and pin a specific revision with revision='<commit-hash>' to prevent silent updates.

import pandas as pd
import numpy as np
from datasets import load_dataset
from sklearn.model_selection import train_test_split

ds = load_dataset('ealvaradob/phishing-dataset', 'urls', split='train',
                  trust_remote_code=True)
df = ds.to_pandas().rename(columns={'text': 'url'})

print(f'Total URLs: {len(df)}')
print(f'  Legitimate: {sum(df.label == 0)}')
print(f'  Phishing:   {sum(df.label == 1)}')

Total URLs: 811446
  Legitimate: 428102
  Phishing:   383344

Note

Any raw-URL dataset with a url and label column will work. What matters is that the dataset preserves the full URL string. Feature-engineered datasets (numeric columns only, no raw text) are useful for tabular baselines but cannot be fed through a tokenizer.

train_df, test_df = train_test_split(df, test_size=0.2, stratify=df['label'], random_state=42)
train_df, val_df = train_test_split(train_df, test_size=0.1, stratify=train_df['label'], random_state=42)

print(f'Train: {len(train_df)}, Val: {len(val_df)}, Test: {len(test_df)}')

Train: 584241, Val: 64916, Test: 162289

Warning

Evaluation bias in random splits: A random split can leak domain and campaign patterns between train and test. URLs from the same phishing kit often share structural patterns, so a model that memorizes those patterns scores well on a random test set but generalizes poorly. For production evaluation, split by registered domain or by collection date so the test set contains only domains the model has never seen. The random split here is sufficient for learning the pipeline, but treat the resulting metrics as optimistic upper bounds.
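One way to approximate a domain-level split without extra dependencies is to hash each URL's hostname into a bucket, so that every URL from the same host lands on the same side. The sketch below is illustrative and is not the split used for the numbers in this tutorial. The name split_bucket and its test_fraction parameter are hypothetical, and the hostname is only a stand-in for the registered domain: login.example.com and www.example.com would still hash separately unless you first reduce them with a public-suffix-aware library.

```python
import hashlib
from urllib.parse import urlsplit


def split_bucket(url, test_fraction=0.2):
    """Assign a URL to 'train' or 'test' based on a hash of its hostname."""
    try:
        host = urlsplit(url).hostname or url
    except ValueError:
        host = url  # malformed URL: fall back to hashing the raw string
    digest = hashlib.sha256(host.lower().encode("utf-8")).digest()
    # Map the first 8 bytes of the digest to a float in [0, 1).
    value = int.from_bytes(digest[:8], "big") / 2**64
    return "test" if value < test_fraction else "train"


# Two URLs on the same host always share a bucket.
print(split_bucket("https://evil.example.xyz/login"))
print(split_bucket("https://evil.example.xyz/verify.php?id=1"))
```

On the DataFrame this could be applied as df['split'] = df['url'].map(split_bucket). The realized test share will drift from 20% because hosts contribute unequal numbers of URLs.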

Baseline: TF-IDF + Logistic Regression

Before fine-tuning a transformer, establish a baseline with a simple approach.

Character-level TF-IDF

Standard word-level TF-IDF doesn’t work well on URLs because URLs aren’t natural language. Character n-gram TF-IDF captures the structural patterns:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score

# Character n-gram TF-IDF (3 to 5 character sequences)
tfidf = TfidfVectorizer(
    analyzer='char',
    ngram_range=(3, 5),
    max_features=50000,
)

X_train_tfidf = tfidf.fit_transform(train_df['url'])
X_val_tfidf = tfidf.transform(val_df['url'])
X_test_tfidf = tfidf.transform(test_df['url'])

# Train logistic regression
lr = LogisticRegression(max_iter=1000, C=1.0)
lr.fit(X_train_tfidf, train_df['label'])

# Evaluate
lr_preds = lr.predict(X_test_tfidf)
lr_probs = lr.predict_proba(X_test_tfidf)[:, 1]
lr_auc = roc_auc_score(test_df['label'], lr_probs)

print('--- TF-IDF + Logistic Regression ---')
print(classification_report(test_df['label'], lr_preds, target_names=['legitimate', 'phishing']))
print(f'ROC AUC: {lr_auc:.4f}')
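To make the feature space concrete, the following sketch enumerates the 3-to-5-character substrings of a single URL, which is roughly what the vectorizer counts before applying TF-IDF weights. The helper char_ngrams is a hypothetical illustration, not part of scikit-learn. It lowercases its input because TfidfVectorizer does so by default.

```python
def char_ngrams(text, n_min=3, n_max=5):
    """List every substring of length n_min..n_max, lowercased."""
    text = text.lower()
    return [
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    ]


grams = char_ngrams("amaz0n-secure.tk/login")
print(len(grams))
# N-grams that span the digit substitution in the fake brand name
print([g for g in grams if "0" in g][:6])
```

N-grams such as "0n-s" or ".tk/" straddle boundaries that a word-level tokenizer would discard, which is why they can act as impersonation and TLD signals.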

What the baseline learns

Examine the most discriminative n-grams:

feature_names = tfidf.get_feature_names_out()
coefs = lr.coef_[0]

# Top phishing indicators (positive coefficients)
phishing_indicators = sorted(zip(feature_names, coefs), key=lambda x: -x[1])[:20]
print('Top phishing n-gram indicators:')
for ngram, coef in phishing_indicators:
    print(f'  {coef:+.3f}  "{ngram}"')

# Top legitimate indicators (negative coefficients)
legit_indicators = sorted(zip(feature_names, coefs), key=lambda x: x[1])[:20]
print('\nTop legitimate n-gram indicators:')
for ngram, coef in legit_indicators:
    print(f'  {coef:+.3f}  "{ngram}"')

Top phishing n-gram indicators:
  +2.341  ".tk/"
  +1.987  "-log"
  +1.876  "0n-s"
  +1.654  ".xyz"
  +1.543  "veri"
  +1.432  ".php"
  ...

Top legitimate n-gram indicators:
  -2.123  ".com"
  -1.876  "wiki"
  -1.654  "gith"
  -1.543  "goog"
  ...

The baseline is surprisingly strong. Character n-grams capture domain reputation (.tk vs .com), impersonation patterns (character substitutions), and structural signals (.php paths, deep subdomain chains).

Fine-tuning DistilBERT

Tokenization challenges

DistilBERT’s WordPiece tokenizer was trained on English Wikipedia and BookCorpus. It doesn’t know URL structure:

from transformers import DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

# How the tokenizer sees URLs
url = 'https://www.amaz0n-secure.com/login/verify.php?id=8372'
tokens = tokenizer.tokenize(url)
print(f'URL: {url}')
print(f'Tokens ({len(tokens)}): {tokens}')

URL: https://www.amaz0n-secure.com/login/verify.php?id=8372
Tokens (25): ['https', ':', '/', '/', 'www', '.', 'am', '##az', '##0', '##n',
              '-', 'secure', '.', 'com', '/', 'login', '/', 'verify', '.', 'php',
              '?', 'id', '=', '83', '##72']

The tokenizer splits amaz0n into am, ##az, ##0, ##n; it doesn’t recognize this as a brand impersonation. The model must learn to combine these subword tokens to recognize the pattern.
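The split follows from how WordPiece tokenizes at inference time: it greedily takes the longest vocabulary entry that matches at the current position, then continues with "##"-prefixed continuation pieces. The sketch below reimplements that matching rule on a tiny, made-up vocabulary (TOY_VOCAB is an assumption for illustration; the real vocabulary has about 30,000 entries, and the real tokenizer also splits on punctuation first).

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1  # shrink the candidate from the right
        if match is None:
            return [unk]  # no piece matches at this position
        pieces.append(match)
        start = end
    return pieces


# Hypothetical vocabulary: the real brand is one token, the spoof is not.
TOY_VOCAB = {"amazon", "am", "##az", "##0", "##n"}

print(wordpiece("amazon", TOY_VOCAB))
print(wordpiece("amaz0n", TOY_VOCAB))
```

With this toy vocabulary the correctly spelled brand stays a single token, while the single-character substitution fragments it into four pieces that share no token with the original.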

Despite this limitation, BERT-family models can still learn URL classification because:

The attention mechanism can relate distant tokens (connecting the domain parts with the path structure)
Fine-tuning adapts the model’s representations to the URL domain
The [CLS] token aggregates information from all positions

Dataset preparation

import torch
from torch.utils.data import Dataset, DataLoader
from transformers import DistilBertTokenizer

class URLDataset(Dataset):
    """Wraps lists of URLs and labels; each URL is tokenized on access in __getitem__."""

    def __init__(self, urls, labels, tokenizer, max_length=128):
        self.urls = urls
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.urls)

    def __getitem__(self, idx):
        encoding = self.tokenizer(
            self.urls[idx],
            max_length=self.max_length,
            padding='max_length',
            truncation=True,
            return_tensors='pt',
        )
        return {
            'input_ids': encoding['input_ids'].squeeze(),
            'attention_mask': encoding['attention_mask'].squeeze(),
            'label': torch.tensor(self.labels[idx], dtype=torch.float),
        }

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

train_dataset = URLDataset(train_df['url'].tolist(), train_df['label'].tolist(), tokenizer)
val_dataset = URLDataset(val_df['url'].tolist(), val_df['label'].tolist(), tokenizer)
test_dataset = URLDataset(test_df['url'].tolist(), test_df['label'].tolist(), tokenizer)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=64)
test_loader = DataLoader(test_dataset, batch_size=64)

Model

from transformers import DistilBertModel
import torch.nn as nn

class URLClassifier(nn.Module):
    def __init__(self, dropout=0.3):
        super().__init__()
        self.bert = DistilBertModel.from_pretrained('distilbert-base-uncased')
        self.classifier = nn.Sequential(
            nn.Linear(768, 256),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(256, 1),
        )

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        # Use [CLS] token representation (first token)
        cls_output = outputs.last_hidden_state[:, 0, :]
        return self.classifier(cls_output)

Training loop

def train_model(model, train_loader, val_loader, epochs=5, lr=2e-5):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = model.to(device)

    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.01)
    criterion = nn.BCEWithLogitsLoss()

    # Linear warmup scheduler
    total_steps = len(train_loader) * epochs
    warmup_steps = total_steps // 10

    def lr_lambda(step):
        if step < warmup_steps:
            return step / warmup_steps
        return max(0.0, 1.0 - (step - warmup_steps) / (total_steps - warmup_steps))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

    for epoch in range(epochs):
        model.train()
        total_loss = 0
        correct = 0
        total = 0

        for batch in train_loader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['label'].to(device).unsqueeze(1)

            optimizer.zero_grad()
            logits = model(input_ids, attention_mask)
            loss = criterion(logits, labels)
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            scheduler.step()

            total_loss += loss.item() * len(labels)
            preds = (torch.sigmoid(logits) > 0.5).float()
            correct += (preds == labels).sum().item()
            total += len(labels)

        train_acc = correct / total

        # Validation
        model.eval()
        val_correct = 0
        val_total = 0
        val_loss = 0
        with torch.no_grad():
            for batch in val_loader:
                input_ids = batch['input_ids'].to(device)
                attention_mask = batch['attention_mask'].to(device)
                labels = batch['label'].to(device).unsqueeze(1)

                logits = model(input_ids, attention_mask)
                loss = criterion(logits, labels)
                val_loss += loss.item() * len(labels)

                preds = (torch.sigmoid(logits) > 0.5).float()
                val_correct += (preds == labels).sum().item()
                val_total += len(labels)

        val_acc = val_correct / val_total
        print(f'Epoch {epoch+1}/{epochs}  '
              f'train_loss={total_loss/total:.4f}  train_acc={train_acc:.4f}  '
              f'val_loss={val_loss/val_total:.4f}  val_acc={val_acc:.4f}')

    return model

model = URLClassifier()
model = train_model(model, train_loader, val_loader)

Representative output from a single random-split run (your values will vary with hardware, library versions, and random seed):

Epoch 1/5  train_loss=0.2341  train_acc=0.9123  val_loss=0.1234  val_acc=0.9567
Epoch 2/5  train_loss=0.0876  train_acc=0.9678  val_loss=0.0765  val_acc=0.9712
Epoch 3/5  train_loss=0.0543  train_acc=0.9801  val_loss=0.0612  val_acc=0.9789
Epoch 4/5  train_loss=0.0387  train_acc=0.9867  val_loss=0.0587  val_acc=0.9801
Epoch 5/5  train_loss=0.0298  train_acc=0.9912  val_loss=0.0601  val_acc=0.9798
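The warmup-then-decay rule inside train_model can be inspected in isolation. The sketch below copies the same piecewise-linear formula into a standalone function (lr_multiplier is a hypothetical name) and evaluates it for a made-up run of 1,000 steps. In the real loop, total_steps is len(train_loader) * epochs.

```python
def lr_multiplier(step, total_steps, warmup_steps):
    """Same piecewise-linear rule as lr_lambda in train_model."""
    if step < warmup_steps:
        return step / warmup_steps  # linear ramp from 0 to 1
    # Linear decay from 1 back to 0 over the remaining steps
    return max(0.0, 1.0 - (step - warmup_steps) / (total_steps - warmup_steps))


base_lr = 2e-5
total_steps = 1000  # hypothetical value for illustration
warmup_steps = total_steps // 10

for step in (0, 50, 100, 550, 1000):
    m = lr_multiplier(step, total_steps, warmup_steps)
    print(f"step {step:>4}: multiplier={m:.2f}  lr={base_lr * m:.2e}")
```

The multiplier is 0 at step 0, so the very first update is applied with a learning rate of zero. It reaches the full 2e-5 at the end of warmup and falls back to half of that midway through the decay phase.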

Evaluation and comparison

from sklearn.metrics import classification_report, roc_auc_score

def evaluate_transformer(model, test_loader, test_labels):
    device = next(model.parameters()).device
    model.eval()

    all_probs = []
    with torch.no_grad():
        for batch in test_loader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            logits = model(input_ids, attention_mask)
            probs = torch.sigmoid(logits).cpu().numpy()
            all_probs.extend(probs.flatten())

    all_probs = np.array(all_probs)
    preds = (all_probs > 0.5).astype(int)

    print('--- DistilBERT Fine-tuned ---')
    print(classification_report(test_labels, preds, target_names=['legitimate', 'phishing']))
    auc = roc_auc_score(test_labels, all_probs)
    print(f'ROC AUC: {auc:.4f}')
    return all_probs, preds

bert_probs, bert_preds = evaluate_transformer(model, test_loader, test_df['label'].values)

Head-to-head comparison

The values below are representative of a single random-split run with default hyperparameters. Expect variance across seeds, and note that random-split metrics overstate generalization to unseen campaigns (see the split warning above).

Model	AUC	Precision (phishing)	Recall (phishing)	F1 (phishing)	Params	Inference speed
TF-IDF + LR	0.976	0.96	0.94	0.95	~50K	~100K URLs/sec
DistilBERT	0.991	0.98	0.97	0.97	~67M	~500 URLs/sec

The transformer outperforms the baseline, particularly on subtle phishing attempts where the URL is carefully crafted to look legitimate. But by the figures in the table it is roughly 200x slower at inference (~500 vs. ~100K URLs/sec) and has more than 1,000x as many parameters (~67M vs. ~50K).

Where the transformer wins

Examine cases where DistilBERT correctly classifies but TF-IDF fails:

# Find disagreements
bert_correct = bert_preds == test_df['label'].values
lr_correct = lr_preds == test_df['label'].values

bert_wins = bert_correct & ~lr_correct
print(f'\nDistilBERT correct, TF-IDF wrong: {bert_wins.sum()} samples')
print('Examples:')
for idx in np.where(bert_wins)[0][:10]:
    url = test_df.iloc[idx]['url']
    label = 'phishing' if test_df.iloc[idx]['label'] == 1 else 'legitimate'
    print(f'  [{label}] {url}')

The transformer typically wins on:

Sophisticated brand impersonation: URLs that contain the real brand domain as a subdomain or path component of a different domain
Legitimate URLs with unusual structure: URLs that use parameters or encodings that look suspicious to n-gram matchers but are actually benign
Typosquatting and leet substitutions (0 for o, 1 for l) that the transformer can learn to recognize from surrounding context. True Unicode homograph attacks (Cyrillic а vs. Latin a) are a separate problem: such domains are often displayed and stored in punycode form, so the classifier would need IDN-aware normalization to catch them.
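As a rough sketch of what IDN-aware normalization could look like, the code below decodes punycode (xn--) hostname labels with Python's built-in idna codec and flags labels that mix scripts. The function names are hypothetical, the script check is a crude heuristic based on Unicode character names, and the built-in codec implements the older IDNA 2003 rules, so treat this as a starting point only.

```python
import unicodedata
from urllib.parse import urlsplit


def decode_idn_hostname(url):
    """Return the hostname with any xn-- labels decoded to Unicode."""
    host = urlsplit(url).hostname or ""
    try:
        return host.encode("ascii").decode("idna")
    except UnicodeError:
        return host  # already non-ASCII, or not valid punycode


def scripts_used(label):
    """Rough script detection: first word of each letter's Unicode name."""
    return {unicodedata.name(ch, "UNKNOWN").split()[0] for ch in label if ch.isalpha()}


def has_mixed_script_label(url):
    """True if any hostname label combines letters from more than one script."""
    host = decode_idn_hostname(url)
    return any(len(scripts_used(label)) > 1 for label in host.split("."))


# Build a punycode hostname whose first letter is Cyrillic "а" (U+0430).
spoofed_host = "\u0430pple.com".encode("idna").decode("ascii")
print(spoofed_host.startswith("xn--"))
print(has_mixed_script_label(f"https://{spoofed_host}/login"))
print(has_mixed_script_label("https://apple.com/login"))
```

A flag like this could be added as an extra feature or used as a pre-filter. It does not catch spoofs written entirely in one non-Latin script.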

Practical recommendations

Use TF-IDF + LR when:

You need high throughput (email gateway, proxy server)
Interpretability matters (which n-grams triggered the classification?)
Training data is limited (<5K samples)
You don’t have GPU resources

Use the transformer when:

Accuracy on edge cases is critical
You can afford GPU inference (or batch processing)
You have abundant training data (>50K samples)
You plan to combine URL classification with page content analysis. For multilingual URLs, switch to a multilingual model like bert-base-multilingual-cased or a byte/character-level model, since distilbert-base-uncased is English-centric.

In production, use both. TF-IDF provides fast first-pass filtering. URLs that score near the decision boundary get a second check with the transformer. When only a small share of URLs is escalated, this two-stage approach can approach transformer-level accuracy at a fraction of the transformer's inference cost.
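A minimal sketch of such a cascade follows. The function names, the 0.1/0.9 escalation band, and the dictionary-backed stand-in scorers are all assumptions for illustration. In practice fast_score would wrap lr.predict_proba and slow_score would wrap the fine-tuned model. The throughput estimate assumes URLs are processed sequentially on one machine, using the representative rates from the comparison table.

```python
def cascade_predict(urls, fast_score, slow_score, low=0.1, high=0.9, threshold=0.5):
    """Score every URL with the fast model; escalate only the uncertain ones."""
    results = []
    for url in urls:
        prob, stage = fast_score(url), "fast"
        if low < prob < high:  # near the decision boundary
            prob, stage = slow_score(url), "slow"
        results.append((url, prob >= threshold, stage))
    return results


def effective_throughput(fast_rate, slow_rate, escalated_fraction):
    """URLs/sec when every URL pays the fast cost and a fraction also pays the slow cost."""
    seconds_per_url = 1.0 / fast_rate + escalated_fraction / slow_rate
    return 1.0 / seconds_per_url


# Stand-in scorers for demonstration only (not real model outputs).
demo_fast = {"https://a.example/": 0.02, "https://b.example/login": 0.55}
demo_slow = {"https://b.example/login": 0.93}
results = cascade_predict(demo_fast, demo_fast.get, demo_slow.get)
print(results)

for fraction in (0.001, 0.01, 0.05):
    print(fraction, round(effective_throughput(100_000, 500, fraction)))
```

Under these assumptions, escalating 0.1% of URLs keeps throughput around 83K URLs/sec, 1% drops it to about 33K, and 5% to about 9K, so the width of the escalation band is the main tuning knob.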

Feature engineering alternatives

For comparison, here’s what a pure feature-engineering approach looks like, extracting handcrafted features from URLs without any NLP:

from urllib.parse import urlparse
from ipaddress import ip_address
import math

def _is_ip(hostname):
    """Check whether a hostname is an IPv4 or IPv6 address."""
    try:
        ip_address(hostname)
        return True
    except ValueError:
        return False

def _entropy(s):
    """Shannon entropy of a string."""
    if not s:
        return 0.0
    probs = [s.count(c) / len(s) for c in set(s)]
    return -sum(p * math.log2(p) for p in probs if p > 0)

def extract_url_features(url):
    """Extract structural features from a URL."""
    try:
        parsed = urlparse(url)
    except Exception:
        parsed = urlparse('http://invalid')

    domain = parsed.netloc or ''
    hostname = parsed.hostname or ''
    path = parsed.path or ''
    query = parsed.query or ''

    features = {
        'url_length': len(url),
        'domain_length': len(domain),
        'path_length': len(path),
        'query_length': len(query),
        'num_dots': url.count('.'),
        'num_hyphens': url.count('-'),
        'num_underscores': url.count('_'),
        'num_slashes': url.count('/'),
        'num_digits': sum(c.isdigit() for c in url),
        'num_special': sum(not c.isalnum() and c not in './-_:?' for c in url),
        'has_https': int(parsed.scheme == 'https'),
        'has_ip': int(_is_ip(hostname)),
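        # Naive depth: assumes a single-label TLD, so example.co.uk counts as
        # depth 1 and an IPv4 host as depth 2
        'subdomain_depth': max(len(hostname.split('.')) - 2, 0),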
        'path_depth': len([p for p in path.split('/') if p]),
        'has_at_symbol': int('@' in url),
        'has_double_slash_redirect': int('//' in path),
        'digit_ratio': sum(c.isdigit() for c in url) / max(len(url), 1),
        'uppercase_ratio': sum(c.isupper() for c in url) / max(len(url), 1),
        'domain_entropy': _entropy(domain),
        'path_entropy': _entropy(path),
        'tld_is_suspicious': int(hostname.split('.')[-1] in
                                  ('tk', 'xyz', 'top', 'buzz', 'gq', 'ml', 'cf', 'ga')),
    }
    return features
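The two entropy features are the least self-explanatory ones in the dictionary. Shannon entropy, H = -Σ p(c) log2 p(c) over the characters c of the string, is 0 bits for a string of one repeated character and log2(k) bits for a string of k distinct characters that each appear equally often. The sketch below restates the same computation with collections.Counter (shannon_entropy is a hypothetical name) and checks those two boundary cases.

```python
import math
from collections import Counter


def shannon_entropy(s):
    """Shannon entropy in bits per character; 0.0 for an empty string."""
    if not s:
        return 0.0
    counts = Counter(s)
    probs = [n / len(s) for n in counts.values()]
    return sum(-p * math.log2(p) for p in probs)


print(shannon_entropy("aaaa"))  # one symbol: no uncertainty
print(shannon_entropy("abcd"))  # four equally likely symbols: log2(4) bits
print(shannon_entropy("/login/verify.php") < shannon_entropy("/x9f3kq7zt1bw8e2a"))
```

Readable paths reuse a small set of characters, while random tokens spread probability across many, which is why path entropy tends to rise for machine-generated path components.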

These handcrafted features + XGBoost (a separate install, not included in the pip command above) typically achieve AUC 0.96-0.98, which is comparable to the TF-IDF baseline (0.976) and below the transformer (0.991). The advantage is speed and interpretability. The disadvantage is maintenance: you have to manually identify and code every pattern, and new attack techniques require new features.

Limitations

Adversarial robustness. Attackers who know the model exists can craft URLs that evade it, using URL shorteners, legitimate redirect services, or carefully chosen character combinations. No URL classifier is adversary-proof.

URL shorteners. Shortened URLs (bit.ly, t.co) remove all structural signals. The classifier sees only the shortener’s domain, which is legitimate. Resolving shortened URLs before classification mitigates this but adds latency.
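A pipeline can at least detect when resolution is needed before it trusts the classifier's score. The sketch below checks the hostname against a small hard-coded set. KNOWN_SHORTENERS and needs_resolution are hypothetical names, the list is illustrative and far from complete, and the actual redirect resolution (a network request) is deliberately left out.

```python
from urllib.parse import urlsplit

# Illustrative, incomplete list; a real deployment would maintain a larger one.
KNOWN_SHORTENERS = {"bit.ly", "t.co", "tinyurl.com"}


def needs_resolution(url):
    """True if the URL points at a known shortener and should be expanded first."""
    try:
        host = urlsplit(url).hostname or ""
    except ValueError:
        return False  # malformed URL: nothing to resolve
    if host.startswith("www."):
        host = host[4:]
    return host in KNOWN_SHORTENERS


print(needs_resolution("https://bit.ly/3abcXYZ"))
print(needs_resolution("https://www.amazon.com/dp/B08N5WRWNW"))
```

URLs flagged this way can be routed to a resolver and the expanded URL classified instead, which adds latency only for the shortened subset.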

Context matters. The same URL can be phishing in one context (emailed to a user claiming to be from their bank) and legitimate in another (posted on the real website). URL classification is one signal among many in a phishing detection pipeline.

Tokenizer limitations. DistilBERT’s WordPiece tokenizer wasn’t designed for URLs. Training a custom tokenizer on URL data (character-level BPE, for example) could improve results, but would require pre-training a language model from scratch on URL corpora: a significant effort.

Next steps

This tutorial moved from tabular features to raw text, showing that a fine-tuned transformer can squeeze out extra accuracy on subtle phishing URLs while a simple TF-IDF baseline handles the bulk of cases at orders-of-magnitude higher throughput. The practical takeaway is that the two approaches complement each other in a staged pipeline.

The ML for Security series will continue with new tutorials. In the meantime, if you want to go deeper on the transformer side, the Transformers & LLMs series covers the architecture and training pipeline in detail. For the security operations angle, try integrating the URL classifier into a log processing pipeline using the techniques from the anomaly detection tutorial.