Data Augmentation Using LLMs for Better Phishing Datasets

Can data augmentation elevate your data science status, leaving your peers in awe? Absolutely! Earlier this week, during my day-to-day as an ML Engineer, I was dealing with a particularly small phishing dataset. I needed to figure out how to increase the dataset size while generally keeping the integrity intact. So, allow me to demonstrate how utilizing Large Language Models (LLMs) for data augmentation can revolutionize your phishing datasets. And, in turn, significantly enhance your model performance.


Phishing attacks continually evolve as a cyber security threat. Now, more than ever, vendors must stay ahead and adapt their offerings with increased sophistication. Data augmentation using Large Language Models (LLMs) is one technique that can greatly enhance phishing datasets for machine learning applications. In this in-depth guide, I’ll dive into data augmentation, examine the power of LLMs in boosting your security datasets, and reinforce your cyber security confidence—ultimately earning your boss’s love and admiration.

What is Data Augmentation?

Understanding Data Augmentation

Data augmentation, a machine learning technique, expands and diversifies training datasets by generating new samples from the original data. Data augmentation using Large Language Models (LLMs) is like the art of recycling, but for data: we tweak our existing data slightly to create new, unique data. Just like recycling helps us make the most out of our resources, data augmentation helps us get the most out of our data. This, in turn, allows us to create “synthetic” phishing datasets while retaining the unique value of the originals. Done well, this process can enhance model performance, reduce overfitting, and promote generalization. In the context of phishing, having diverse and robust datasets is essential for effectively training models to identify phishing attacks and help prevent them.

The Need for Data Augmentation

Size Matters:

Most security datasets are incredibly unbalanced. As an example, the ratio of malicious to benign samples is typically somewhere between 1:100 and 1:1000. With augmentation, you can 10X your dataset size, for example taking 7k emails to 70k.
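
To make that arithmetic concrete, here is a small sketch. Only the 7k to 70k figure comes from the example above; the 1:100 starting ratio and the nine variants per email are assumptions I picked for illustration, and `augmented_counts` is a hypothetical helper.

```python
def augmented_counts(malicious: int, benign: int, variants_per_sample: int):
    """Return (new malicious count, benign-per-malicious ratio) after augmentation."""
    # Each original malicious sample is kept and gains `variants_per_sample` variants.
    new_malicious = malicious * (1 + variants_per_sample)
    return new_malicious, benign / new_malicious


# Assumed numbers: 7k phishing emails at a 1:100 ratio means 700k benign emails.
malicious, benign = 7_000, 700_000
new_malicious, ratio = augmented_counts(malicious, benign, variants_per_sample=9)
print(f"malicious: {malicious} -> {new_malicious}")
print(f"benign per malicious: {benign / malicious:.0f} -> {ratio:.0f}")
```

Under those assumptions the malicious class grows tenfold and the imbalance shrinks from 1:100 to 1:10.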

Battle Overfitting:

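An overfit system becomes too specialized in the data it was trained on and fails to accurately identify new, slightly different phishing attempts. Augmented samples expose the model to more variation during training, which makes that kind of memorization harder.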

Techniques for Data Augmentation

Several data augmentation techniques can be applied to enhance datasets, depending on the nature of the data. For instance, in text-based phishing datasets, common techniques generally encompass some of the following:

  • Synonym Replacement: Replacing words in the text with their synonyms to create new samples.
  • Random Insertion: Inserting random words into the text to increase variability.
  • Text Generation: Adding additional words to the text to create longer, more varied samples.
  • Shuffling Words: Rearranging the order of words in a sentence or a paragraph to create new samples.
  • Back Translation: Translating the text to another language and then translating it back to the original language to introduce slight variations.
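
Three of these techniques are simple enough to sketch with nothing but the standard library. This is a minimal illustration, not a production augmenter: the `SYNONYMS` table and the insertion vocabulary are hand-written assumptions, and the function names are my own.

```python
import random

# Hypothetical, hand-written synonym table; a real project would more likely
# use a thesaurus resource or a language model.
SYNONYMS = {
    "payment": ["invoice", "bill"],
    "overdue": ["late", "outstanding"],
}


def synonym_replacement(words, rng):
    """Replace one word that has a known synonym; otherwise return a copy unchanged."""
    candidates = [i for i, w in enumerate(words) if w.lower() in SYNONYMS]
    if not candidates:
        return list(words)
    i = rng.choice(candidates)
    out = list(words)
    out[i] = rng.choice(SYNONYMS[words[i].lower()])
    return out


def random_insertion(words, rng, vocabulary):
    """Insert one word from `vocabulary` at a random position."""
    out = list(words)
    out.insert(rng.randrange(len(out) + 1), rng.choice(vocabulary))
    return out


def shuffle_words(words, rng):
    """Return the same words in a random order."""
    out = list(words)
    rng.shuffle(out)
    return out


rng = random.Random(0)  # seeded so runs are repeatable
subject = "Account payment overdue".split()
print(" ".join(synonym_replacement(subject, rng)))
print(" ".join(random_insertion(subject, rng, ["urgent", "notice", "final"])))
print(" ".join(shuffle_words(subject, rng)))
```

Each function returns a new list and leaves the input untouched, so the original sample stays in the dataset alongside its variants.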

The Power of LLMs in Data Augmentation

But first, what are LLMs?

Large Language Models (LLMs), such as OpenAI’s GPT series, are sophisticated machine learning models trained on vast volumes of text data. Their ability to comprehend and generate human-like text makes them ideal instruments for data augmentation.

Advantages of Using LLMs for Data Augmentation

LLMs offer several advantages over traditional data augmentation techniques when it comes to enhancing phishing datasets:

  • Contextual Understanding: LLMs can understand the context and semantics of the input text, enabling them to generate more realistic and relevant samples.
  • Scalability: With their powerful text-generation capabilities, LLMs can quickly generate large quantities of new samples to expand datasets.
  • Versatility: LLMs can be fine-tuned for specific tasks, making them adaptable to a wide range of cyber security applications.
  • Reduced Manual Effort: By automating the data augmentation process, LLMs can save time and resources compared to manual techniques.

Implementing Data Augmentation with LLMs

To get the most out of LLMs for phishing dataset augmentation, it helps to fine-tune the model on phishing-related terms and patterns. Accomplish this by training the LLM on a smaller, curated dataset of phishing emails or web pages, enabling it to generate pertinent and persuasive phishing samples.

Applying LLMs to Phishing Datasets

Once fine-tuned, the LLM can augment phishing datasets by generating new samples. Creating these new samples generally falls into one of two areas:

  • Conditional Text Generation: Providing the LLM with a prompt or a portion of a phishing text and asking it to complete the rest or, in some cases, simply to add additional words to the sample. This can result in diverse samples that maintain the structure and intent of the original data.
  • Text Paraphrasing: Using the LLM to rephrase existing phishing texts, generating new variations that retain the original meaning. This can increase the dataset’s diversity without losing important contextual information. I tend to think using translations falls into this category.
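
Here is a rough sketch of how those two modes could be wired up. Everything in it is an assumption for illustration: the prompt wording, the function names, and the `generate` callable, which stands in for whatever LLM call you actually use. The `echo_generate` stub just returns the text after the prompt's colon so the example runs without a model.

```python
from typing import Callable, List


def completion_prompt(sample: str, keep_fraction: float = 0.5) -> str:
    """Keep the first part of a sample and ask the model to finish it."""
    words = sample.split()
    keep = max(1, int(len(words) * keep_fraction))
    prefix = " ".join(words[:keep])
    return f"Complete this email subject line, keeping the same intent: {prefix}"


def paraphrase_prompt(sample: str) -> str:
    """Ask the model for a reworded version with the same meaning."""
    return f"Rewrite this email subject line with the same meaning: {sample}"


def augment(samples: List[str], generate: Callable[[str], str]) -> List[str]:
    """Produce one completion and one paraphrase per sample via `generate`."""
    out = []
    for s in samples:
        out.append(generate(completion_prompt(s)))
        out.append(generate(paraphrase_prompt(s)))
    return out


# Stand-in generator so the sketch runs without a model; swap in a real LLM call.
def echo_generate(prompt: str) -> str:
    return prompt.split(": ", 1)[1]


print(augment(["Account payment overdue"], echo_generate))
```

Passing the generator in as a parameter keeps the prompt logic testable and lets you change models without touching the rest of the pipeline.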

Technique #1: Random Replacement

In this first technique, I’m going to show you how to replace a random word with a new, similar word. I’m going to use an example phishing subject line of “Account payment overdue” for all testing. I’ll be using a pre-trained BERT model from Hugging Face with a “fill-mask” pipeline to achieve my goal.

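# 05/03/2023
# Created by James Bower
# Notebook cell: lines starting with "!" are shell commands (Jupyter/Colab)
!pip install transformers
!pip install sacremoses
import random
from transformers import pipeline
# Fill-mask pipeline backed by a pre-trained BERT model
unmasker = pipeline('fill-mask', model='bert-base-uncased')
#unmasker = pipeline('fill-mask', model='distilbert-base-uncased')
input_text = "Account payment overdue"
orig_text_list = input_text.split()
# Random index of the word to replace (any position, including the first)
rand_idx = random.randrange(len(orig_text_list))
orig_word = orig_text_list[rand_idx]
# Swap the chosen word for BERT's mask token
new_text_list = orig_text_list.copy()
new_text_list[rand_idx] = '[MASK]'
new_mask_sent = ' '.join(new_text_list)
print("Masked sentence->",new_mask_sent)
# The pipeline returns its top predictions, best first
augmented_text_list = unmasker(new_mask_sent)
# Drop predictions that just restore the original word
# (the uncased model emits lowercase tokens, so compare in lowercase)
candidates = [res['sequence'] for res in augmented_text_list
              if res['token_str'] != orig_word.lower()]
# This keeps the last (lowest-ranked) remaining candidate; use candidates[0]
# for the model's most likely replacement. Fall back to the input if none remain.
augmented_text = candidates[-1] if candidates else input_text
print("Augmented text->",augmented_text)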
Masked sentence-> Account [MASK] overdue
Augmented text-> account is overdue

Nice! As you can see, my new sample is pretty close to the original. Additionally, just imagine that I feed the augmented text back into this script to create a variation of that, over and over again.
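
That feedback loop could look something like the sketch below. It is an assumption-laden illustration: `augment_repeatedly` is a hypothetical helper, `augment_once` stands in for the fill-mask step, and the `toy_chain` lookup table (including the made-up "account is closed" variant) only exists so the example runs without downloading a model.

```python
from typing import Callable, List


def augment_repeatedly(seed: str, augment_once: Callable[[str], str], rounds: int) -> List[str]:
    """Feed each new variant back in, collecting unique variants in order."""
    seen = {seed.lower()}
    variants = []
    current = seed
    for _ in range(rounds):
        current = augment_once(current)
        key = current.lower()
        if key in seen:
            # The augmenter has cycled back to text we already have; stop early.
            break
        seen.add(key)
        variants.append(current)
    return variants


# Toy stand-in for the fill-mask step; the entries are illustrative, not model output.
toy_chain = {
    "Account payment overdue": "account is overdue",
    "account is overdue": "account is closed",
    "account is closed": "account is overdue",  # cycles back to an earlier variant
}
print(augment_repeatedly("Account payment overdue", toy_chain.__getitem__, rounds=5))
```

Each round rewrites the previous variant rather than the original, so later samples can drift away from the original meaning. It is worth spot-checking them before they go into a training set.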

Technique #2: Back Translation

In the second technique, I’m still going to use an example phishing subject line of “Account payment overdue”. This time I’m going to translate it into German, and then back into English. The same approach works with other pivot languages, such as French. I’m using these models directly from Hugging Face, with PyTorch under the hood.

# 05/03/2023
# Created by James Bower
# Notebook cell: lines starting with "!" are shell commands (Jupyter/Colab)
!pip install transformers
!pip install sacremoses
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
# English to German using the pipeline API and T5
translator_en_to_de = pipeline("translation_en_to_de", model='t5-base')
# German to English using the BERT2BERT model
# pad_token must be a token string (a length such as 512 belongs in model_max_length)
tokenizer = AutoTokenizer.from_pretrained("google/bert2bert_L-24_wmt_de_en", pad_token="<pad>", eos_token="</s>", bos_token="<s>")
model_de_to_en = AutoModelForSeq2SeqLM.from_pretrained("google/bert2bert_L-24_wmt_de_en")
# Our example phishing subject text
input_text = "Account payment overdue"
# Step 1: English -> German
en_to_de_output = translator_en_to_de(input_text)
translated_text = en_to_de_output[0]['translation_text']
print("Translated text->",translated_text)
# Step 2: German -> English (tokenize, generate, decode)
input_ids = tokenizer(translated_text, return_tensors="pt", add_special_tokens=False).input_ids
output_ids = model_de_to_en.generate(input_ids)[0]
augmented_text = tokenizer.decode(output_ids, skip_special_tokens=True)
print("Augmented Text->",augmented_text)
Translated text-> Zahlungsverzug auf dem Konto
Augmented Text-> Late payment on the account

As you can see our original text of “Account payment overdue” became “Zahlungsverzug auf dem Konto” which then became “Late payment on the account”. How cool is that?
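
If you want to round-trip through more than one language, a small wrapper keeps things tidy. This is a sketch under assumptions: `back_translate` and the `Translator` type are my own names, and the translators here are lookup-table stand-ins built from the German round trip shown above, so the example runs without any models. The "fr" entry is only a placeholder that returns its input, not a real translator.

```python
from typing import Callable, Dict, Tuple

Translator = Callable[[str], str]


def back_translate(text: str, pivots: Dict[str, Tuple[Translator, Translator]]) -> Dict[str, str]:
    """Round-trip `text` through each pivot language and keep new, unique results."""
    seen = {text.lower()}
    results = {}
    for lang, (to_pivot, from_pivot) in pivots.items():
        candidate = from_pivot(to_pivot(text))
        if candidate.lower() in seen:
            continue  # an unchanged or duplicate round trip adds nothing to the dataset
        seen.add(candidate.lower())
        results[lang] = candidate
    return results


# Lookup-table stand-ins that replay the German round trip from the article.
to_de = {"Account payment overdue": "Zahlungsverzug auf dem Konto"}.__getitem__
from_de = {"Zahlungsverzug auf dem Konto": "Late payment on the account"}.__getitem__


def identity(s: str) -> str:
    """Placeholder for a second language pair; NOT a real translator."""
    return s


pivots = {"de": (to_de, from_de), "fr": (identity, identity)}
print(back_translate("Account payment overdue", pivots))
```

To use real models, pass in functions that wrap the translation calls from the script above. For French, `t5-base` can also handle English to French through the `translation_en_to_fr` pipeline task, while the French to English direction needs a separate model of your choosing.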

Strengthening Phishing Products with Enhanced Phishing Datasets

Improved Machine Learning Models

Employing LLM-augmented phishing datasets for training machine learning models can enhance their performance in detecting and combating phishing threats. By introducing models to diverse and realistic samples, they can recognize intricate phishing patterns, adjust to evolving threats, and generalize more effectively in real-world situations.

Enhanced machine learning models enable new phishing products to promptly detect and address threats, minimizing phishing attack impacts and protecting sensitive customer data.

General FAQs

Q: What is an example of data augmentation?

A: Within text-based phishing datasets, one data augmentation example is synonym replacement. By employing this method and swapping words with their synonyms, new and slightly varied samples are created. Consequently, this enriches the dataset’s diversity while maintaining the original meaning and context.

Q: What is data augmentation vs preprocessing?

A: On one hand, data augmentation involves generating new samples from existing data to enrich and diversify the dataset. On the other hand, preprocessing entails cleaning, transforming, and normalizing data to suit machine learning models. Although both methods aim to boost data quality, data augmentation emphasizes dataset expansion, while preprocessing concentrates on preparing data for efficient model training.

Q: Why is data augmentation used in deep learning?

A: Primarily, data augmentation in deep learning aims to expand and diversify the training dataset, resulting in improved model performance, reduced overfitting, and better generalization. By supplying the model with a varied and representative dataset, it can effectively learn complex patterns and adapt to previously unseen data. That said, augmentation adds another layer of complexity to the training pipeline, so weigh that cost before adopting it.

Q: What is data augmentation for CNNs?

A: In the context of Convolutional Neural Networks (CNNs), data augmentation typically involves techniques for image data, such as rotation, scaling, flipping, and translation. By employing these methods, new image samples are created through applying transformations to the original images. Consequently, this increases the dataset’s diversity and assists the CNN in recognizing objects and patterns under various conditions and orientations.

Q: What is the reason for data augmentation?

A: Fundamentally, the main purpose of data augmentation is to enrich and diversify training datasets for machine learning models. By generating new samples from the original data, it enables models to learn from a more representative dataset. Consequently, this enhances performance, reduces overfitting, and bolsters their ability to generalize to new, unseen data.

Q: What is the principle of data augmentation?

A: Essentially, the underlying principle of data augmentation is to create new samples from existing data through various transformations or alterations, aiming to enlarge and diversify the dataset. Consequently, this technique enables machine learning models to learn from a more diverse and representative dataset. As a result, it leads to enhanced performance, superior generalization, and decreased overfitting.


Employing LLMs enhances data augmentation for phishing datasets, which can help strengthen a vendor’s cyber security models. By harnessing LLMs’ capabilities, organizations can develop diverse, high-quality datasets that help machine learning models detect and counteract phishing threats efficiently. Proper implementation and assessment are vital. Done well, LLM-based data augmentation plays a crucial part in combating phishing attacks and, ultimately, in fostering a more secure digital environment.

And that’s that! I hope you have found this useful. Also, make sure to keep coming back to my blog, as I’m going to be putting out more text analytics and NLP posts here:

And as always, thank you for taking the time to read this. If you have any comments, questions, or critiques, please reach out to me on our FREE ML Security Discord Server – HERE
