How to Convert Threat Intelligence PDF into Text

Here’s your guide to converting threat intelligence from cumbersome PDFs (blah to work with) into easily accessible text that you can feed straight into your cybersecurity analysis and response. I’m sure that, like me, you’ve struggled to convert threat intelligence PDFs into text. Well, I hear you, friend. As you know, my Python isn’t the greatest, so this took me a while to figure out. But all those lonely nights have paid off. Continue reading to see how I did it.

The PDF that I’m using is a threat assessment from the Senate Select Committee on Intelligence. Why? Because it sounds cool, of course! And if I’m going to work on a project, I at least want to feel like James Bond doing it! I put it up on Dropbox so you can grab it yourself.

https://bit.ly/41pNT2C

Let’s look at a paragraph from the second page as our example for comparison. It’s pretty straightforward as far as text goes. Things get a lot more difficult with tables and images, but we’ll discuss that in another blog post.

Here is the code I used. I pinned the version of pdfminer to help prevent any issues, and I ran this in Python 2.5.

# Shell: install pip, then the pinned pdfminer release.
easy_install pip
pip install 'pdfminer==20080407'

# Python 2 script. The imports assume a pdfminer release that ships the
# pdfminer.pdfpage and pdfminer.pdfdocument modules.
from __future__ import with_statement  # required for "with" on Python 2.5

import urllib
from cStringIO import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser


def download_pdf(url, file_name):
    # Save whatever the URL returns to file_name on disk.
    urllib.urlretrieve(url, file_name)


def convert_pdf_to_txt(file_path):
    # Extract the text of every page and return it as one UTF-8 string.
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec='utf-8', laparams=laparams)
    fp = open(file_path, 'rb')
    try:
        parser = PDFParser(fp)
        doc = PDFDocument(parser)
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page in PDFPage.create_pages(doc):
            interpreter.process_page(page)
    finally:
        # Release the file and the converter even if a page fails to parse.
        fp.close()
        device.close()
    text = retstr.getvalue()
    retstr.close()
    return text


# dl=1 asks Dropbox for the file itself; dl=0 returns the HTML preview page.
pdf_url = "https://www.dropbox.com/s/e7ho2tentz3fns4/2009_threat_assessment_intelligence_subcommittee.pdf?dl=1"
input_pdf = "input.pdf"
output_txt = "output.txt"

download_pdf(pdf_url, input_pdf)
with open(output_txt, "w") as output_file:
    output_file.write(convert_pdf_to_txt(input_pdf))
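
One thing that tripped me up and is worth checking before you convert anything: is the file you downloaded really a PDF? A share link can hand back an HTML preview page instead of the document, and the parser will then fail in confusing ways. Here is a minimal sketch of a sanity check, assuming the download was saved to disk as in the script above. The name looks_like_pdf is my own hypothetical helper, not part of pdfminer.

```python
def looks_like_pdf(file_path):
    """Return True if the file starts with the PDF magic bytes.

    Hypothetical helper (not part of pdfminer). It only inspects the
    first five bytes, so it catches "this is an HTML page" mistakes
    but does not prove the PDF is well formed.
    """
    with open(file_path, "rb") as fp:
        return fp.read(5) == b"%PDF-"
```

Calling looks_like_pdf("input.pdf") right after the download and stopping if it returns False should save you from chasing parser errors that are really download errors.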

Here’s what it should look like now. You can see we have some newline and spacing issues, along with some slight formatting issues. But overall I would say it looks pretty darn good and is definitely something we can work with moving forward.
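
If the newline and spacing issues bother you, they can be tidied up with a few regular expressions. This is only a rough sketch under my own assumptions: blank lines mark paragraph breaks, single newlines inside a paragraph are just line wrapping, and form feeds are page separators. The function normalize_extracted_text is a hypothetical helper I made up for this post, not something pdfminer provides.

```python
import re


def normalize_extracted_text(text):
    """Tidy whitespace in text pulled out of a PDF (hypothetical helper)."""
    # Use "\n" for every line ending and treat form feeds as line breaks.
    text = text.replace("\r\n", "\n").replace("\r", "\n").replace("\x0c", "\n")
    # A blank line (possibly containing spaces) separates paragraphs.
    paragraphs = re.split(r"\n\s*\n", text)
    cleaned = []
    for para in paragraphs:
        # Collapse wrapped lines, tabs, and repeated spaces to one space.
        para = re.sub(r"\s+", " ", para).strip()
        if para:
            cleaned.append(para)
    # Put exactly one blank line between paragraphs.
    return "\n\n".join(cleaned)
```

You would run the converted text through normalize_extracted_text before writing it to the output file. It will not fix tables or columns, and it will happily merge lines that were meant to stay separate (like headings without a blank line after them), so check the result by eye.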

And there you have it! Here is the link to our new “text” file.

https://www.dropbox.com/s/gsk1gieyihg3od0/threat_assessment.txt?dl=0

I hope you have found this useful. This is something I’ve been wanting to figure out for some time now. Also make sure to keep coming back to my blog as I’m going to be putting out more threat intelligence posts here:

https://www.jamesbower.com/category/threat-intelligence/

And as always, thank you for taking the time to read this. If you have any comments, questions, or critiques, please reach out to me on our FREE ML Security Discord Server.
