Here’s your guide to converting threat intelligence from cumbersome PDFs (a pain to work with) into easily accessible text that you can feed straight into your cybersecurity analysis and response. I’m sure that, like me, you’ve struggled to convert threat intelligence PDFs into text. Well, I hear you, friend. As you know, my Python isn’t the greatest, so this took me a while to figure out. But all those lonely nights have paid off. Continue reading for how I did it.
The PDF that I’m using is a threat assessment from the Senate Select Committee on Intelligence. Why? Because it sounds cool, of course! And if I’m going to work on a project, I at least want to feel like James Bond doing it! I put it up on Dropbox so you can grab it yourself.
Let’s look at a paragraph from the second page as our example for comparison. It’s pretty straightforward as far as text goes. Things get a lot more difficult when dealing with tables and images, but we’ll discuss that in another blog post.
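Here is the code I used. I pinned the version of pdfminer to help prevent any issues, and I ran this under Python 2. Two caveats: the with statement at the end of the script needs Python 2.6 or later (on 2.5 you would have to add "from __future__ import with_statement" at the top), and the script imports pdfminer.pdfpage and pdfminer.pdfdocument, so if the release you pin does not ship those modules, pin a later one that does.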
pip install 'pdfminer==20080407'
import urllib
from cStringIO import StringIO

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage


def download_pdf(url, file_name):
    # Save the remote PDF to a local file.
    urllib.urlretrieve(url, file_name)


def convert_pdf_to_txt(file_path):
    # The resource manager caches shared resources such as fonts.
    rsrcmgr = PDFResourceManager()
    # In-memory buffer that collects the extracted text.
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    with open(file_path, 'rb') as fp:
        parser = PDFParser(fp)
        doc = PDFDocument(parser)
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        # Feed every page through the interpreter; the text lands in retstr.
        for page in PDFPage.create_pages(doc):
            interpreter.process_page(page)
    device.close()
    text = retstr.getvalue()
    retstr.close()
    return text


# dl=1 asks Dropbox for the file itself; dl=0 returns an HTML preview page,
# which pdfminer cannot parse.
pdf_url = "https://www.dropbox.com/s/e7ho2tentz3fns4/2009_threat_assessment_intelligence_subcommittee.pdf?dl=1"
input_pdf = "input.pdf"
output_txt = "output.txt"

download_pdf(pdf_url, input_pdf)
with open(output_txt, "w") as output_file:
    output_file.write(convert_pdf_to_txt(input_pdf))
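One thing worth checking after the download step: share links sometimes hand back an HTML preview page rather than the PDF itself, and pdfminer will then choke on it. Here is a minimal sketch of a sanity check, assuming it is enough to look for the %PDF- marker at the very start of the file. The function name looks_like_pdf is my own invention for illustration and is not part of pdfminer.

```python
def looks_like_pdf(path):
    """Return True if the file at `path` begins with the PDF magic bytes.

    Hypothetical helper (not a pdfminer API). It only inspects the first
    five bytes, so it catches "I downloaded a web page by mistake" but
    does not prove the PDF is valid or complete.
    """
    with open(path, "rb") as fh:
        header = fh.read(5)
    return header == b"%PDF-"
```

You could call looks_like_pdf("input.pdf") right after the download and bail out with a clear message if it returns False, instead of waiting for a parser error.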
Here’s what it should look like now. You can see we have some newline and spacing issues, plus some slight formatting issues. But overall I would say it looks pretty darn good and is definitely something we can work with moving forward.
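The newline and spacing issues can be tamed a bit with a small cleanup pass over the extracted text. What follows is only a rough sketch of one way to do it, not what I ran for this post: it assumes paragraphs are separated by blank lines and that a hyphen at the end of a line means a word was split. The name tidy_extracted_text is hypothetical.

```python
import re


def tidy_extracted_text(raw):
    """Collapse stray newlines and spaces in text pulled out of a PDF.

    Hypothetical helper. Assumes blank lines separate paragraphs and that
    "-" directly before a line break marks a word split across lines.
    """
    # Normalise line endings so the rest only has to deal with "\n".
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    # Form feeds are commonly used as page separators; treat them as breaks.
    text = text.replace("\f", "\n")
    # Re-join words hyphenated across a line break.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    cleaned = []
    # Split on blank lines, then squeeze each paragraph onto one line.
    for paragraph in re.split(r"\n\s*\n", text):
        paragraph = re.sub(r"\s+", " ", paragraph).strip()
        if paragraph:
            cleaned.append(paragraph)
    return "\n\n".join(cleaned)
```

One trade-off to keep in mind: the hyphen rule also glues together genuinely hyphenated words that happen to break at the end of a line (a "well-" / "known" split becomes "wellknown"), so give the result a quick read before relying on it.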
And there you have it! And here is the link to our new “text” file.
I hope you have found this useful. This is something I’ve been wanting to figure out for some time now. Also make sure to keep coming back to my blog as I’m going to be putting out more threat intelligence posts here:
And as always, thank you for taking the time to read this. If you have any comments, questions, or critiques, please reach out to me on our FREE ML Security Discord Server – HERE