Do you need to extract the right data from a list of PDF files but right now you’re stuck?
If yes, you’ve come to the right place.
Note: This article covers PDF documents that are machine-readable. If that’s not your case, I recommend you first run them through Adobe Acrobat Pro, which can make them machine-readable for you automatically. Then, come back here.
In this article, you will learn:
- How to extract the content of a PDF file in R (two techniques)
- How to clean the raw document so that you can isolate the data you want
After explaining the tools I’m using, I will show you a couple of examples so that you can easily replicate them on your own problem.
This article was first published on Medium, before I had a website.
Why PDF files?
When I started working as a freelance data scientist, I did several jobs that consisted solely of extracting data from PDF files.
My clients usually had two options: Either do it manually (or hire someone to do it), or try to find a way to automate it.
Since the first option gets really tedious and costly as the number of files increases, they turned to the second one, and I helped them with it.
For example, a client had thousands of invoices that all had the same structure and wanted to get important data from them:
- the number of sold items,
- the profits made at each transaction,
- the data about his customers
Having everything in PDF files isn’t handy at all. Instead, he wanted a clean spreadsheet where he could easily find who bought what and when and make calculations from it.
Another classic example is when you want to do data analysis on reports or official documents. You will usually find those saved as PDF files rather than freely accessible on webpages.
Similarly, I needed to extract thousands of speeches made at the U.N. General Assembly.
So, how do you even get started?
Two techniques to extract raw text from PDF files
The first technique requires you to install the pdftools package from CRAN:
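A minimal way to do this, assuming you have an internet connection and a CRAN mirror configured (the check with requireNamespace is just a convenience so the package isn't reinstalled every time):

```r
# Install pdftools from CRAN only if it is not already available
if (!requireNamespace("pdftools", quietly = TRUE)) {
  install.packages("pdftools")
}

# Load the package for the current session
library(pdftools)
```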
A quick glance at the documentation will show you the few functions of the package, the most important of which is pdf_text.
For this article, I will use an official record from the UN that you can find on this link
This function will directly import the raw text into a character vector, with spaces to show the white space and \n to show the line breaks.
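Here is a small sketch of that import. Since the record itself isn't bundled here, the code draws a stand-in one-page PDF with base R graphics; pdf_file and its contents are placeholders of mine, so swap in the path to your own document.

```r
library(pdftools)

# Stand-in document: a one-page PDF drawn with base R graphics.
# Replace `pdf_file` with the path to your own PDF.
pdf_file <- tempfile(fileext = ".pdf")
pdf(pdf_file)
plot.new()
text(0.5, 0.6, "First line of the page")
text(0.5, 0.4, "Second line of the page")
invisible(dev.off())

# pdf_text returns a character vector with one element per page
txt <- pdf_text(pdf_file)
length(txt)   # number of pages in the document
cat(txt[1])   # print the first page, rendering the \n line breaks
```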
Having a full page in one element of a vector is not the most practical. Using strsplit will help you separate lines from each other:
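As a sketch, here is the split applied to a made-up page string (the text is a toy stand-in for what pdf_text returns, not content from the actual record):

```r
# Toy stand-in for one page returned by pdf_text (made-up content)
page <- "Title of the page\nFirst line of text\n\nSecond paragraph"

# strsplit returns a list, so take the first element to get the lines
rows <- strsplit(page, "\n")[[1]]
rows

# With several pages at once, you get a list with one vector per page
pages <- c(page, "Another page\nwith two lines")
rows_by_page <- strsplit(pages, "\n")
lengths(rows_by_page)
```

Note that empty lines survive the split as empty strings, which is handy later for spotting paragraph breaks.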
If you want to know more about the functions of the pdftools package, I recommend you read Introducing pdftools - A fast and portable PDF extractor, written by the author himself.
tm is the go-to package when it comes to doing text mining/analysis in R.
For our problem, it will help us import a PDF document in R while keeping its structure intact. Plus, it makes it ready for any text analysis you want to do later.
The readPDF function from the tm package doesn’t actually read a PDF file the way pdf_text did in the previous example. Instead, it helps you create your own reading function, the benefit being that you can choose whichever PDF extraction engine you want.
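When this article was written, the default engine was xpdf, available at http://www.xpdfreader.com/download.html (recent versions of tm default to pdftools instead, so you may need to request xpdf explicitly through the engine argument).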
You have to:
- Download the archive from the website (under the Xpdf tools section).
- Unzip it.
- Make sure it is in the PATH of your computer.
Then, you can create your PDF extracting function:
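A minimal sketch of such a function, assuming the xpdf tools are installed and on your PATH. I set engine = "xpdf" explicitly so the result doesn't depend on which default your version of tm uses; the name read is just my choice.

```r
library(tm)

# readPDF does not read anything yet: it returns a reader function.
# "-layout" is passed on to the text extraction tool as a command-line option.
read <- readPDF(engine = "xpdf", control = list(text = "-layout"))
```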
The control argument enables you to set up parameters as you would write them in the command line. Think of the above function as writing pdftotext -layout in the shell (pdftotext is the command-line tool from the Xpdf tools that does the extraction).
Then, you’re ready to import the PDF document:
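Here is one way the import could look, as a sketch under a few assumptions: the reader is recreated so the block stands on its own, the document is a stand-in PDF drawn with base R (replace pdf_file with your own path), and the call is skipped if the xpdf command-line tools can't be found.

```r
library(tm)

# Reader function using the xpdf engine with the -layout option
read <- readPDF(engine = "xpdf", control = list(text = "-layout"))

# Stand-in document drawn with base R; replace with your own file path
pdf_file <- tempfile(fileext = ".pdf")
pdf(pdf_file)
plot.new()
text(0.5, 0.5, "Stand-in page")
invisible(dev.off())

# The xpdf engine shells out to pdftotext and pdfinfo, so check for them first
if (all(nzchar(Sys.which(c("pdftotext", "pdfinfo"))))) {
  doc <- read(elem = list(uri = pdf_file), language = "en", id = "doc1")
  rows <- content(doc)   # the extracted text
  print(head(rows))
} else {
  message("pdftotext/pdfinfo not found in PATH; skipping the import")
}
```

The same reader can also be handed to a tm corpus constructor through readerControl = list(reader = read), which is convenient once you have many files.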
Notice the difference with the excerpt from the first method. New empty lines appeared, corresponding more closely to the document. This can help to identify where the header stops in this case.
Another difference is how pages are managed. With the second method, you get the whole text at once, with page breaks marked by the \f character. With the first method, you simply had a character vector where 1 page = 1 element.
This is the first line of the second page, with an added \f in front of it.
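If you want to recover the pages from that single block of text, one possible approach is to use the \f marker to number the lines. This is a sketch on made-up lines, and the variable names (new_page, page, pages) are mine:

```r
# Toy lines standing in for the imported text (made-up content);
# "\f" marks the first line of a new page
doc <- c("Header of page 1", "Some text", "\fHeader of page 2", "More text")

# TRUE on the first line of every page after the first one
new_page <- startsWith(doc, "\f")

# Running count of page breaks gives the page number of each line
page <- cumsum(new_page) + 1

# Drop the marker now that the page number is recorded
doc <- sub("\f", "", doc, fixed = TRUE)

# One character vector per page
pages <- split(doc, page)
```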
Extract the right information
Naturally, you don’t want to stop there. Once you have the PDF document in R, you want to extract the actual pieces of text that interest you, and get rid of the rest.
That’s what this part is about.
I will use a few common tools for string manipulation in R:
- Base string manipulation functions (such as
My goal is to extract all the speeches from the speakers of the document we’ve worked on so far (this one), but I don’t care about the speeches from the president.
Here are the steps I will follow:
- Clean the headers and footers on all pages.
- Get the two columns together.
- Find the rows of the speakers.
- Extract the correct rows.
I will use regular expressions (regex) regularly in the code. If you have absolutely no knowledge of them, I recommend you follow a tutorial first, because they are essential as soon as you start managing text data.
If you have some basic knowledge, that should be enough. I’m not a big expert either.
1. Clean the headers and footers on all pages.
Notice how each page contains text at the top and at the bottom that will interfere with our extraction.
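A simple way to handle this, assuming the header and footer take up a fixed number of lines on every page, is to drop those lines by position. The sketch below uses made-up pages; strip_margins, n_header and n_footer are hypothetical names, and the counts need to be adjusted to your own document.

```r
# Toy pages (made-up content): 2 header lines and 1 footer line each
pages <- list(
  c("DOC HEADER", "01/01/2000", "body line 1", "body line 2", "1/2"),
  c("DOC HEADER", "01/01/2000", "body line 3", "2/2")
)

# Keep only the lines between the header and the footer
strip_margins <- function(rows, n_header = 2, n_footer = 1) {
  idx <- seq_along(rows)
  rows[idx > n_header & idx <= length(rows) - n_footer]
}

cleaned <- lapply(pages, strip_margins)
cleaned
```

If the first page has a longer header than the others, you can call the function separately on it with a different n_header.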
Now, our document is a bit cleaner. The next step is to do something about the two columns, which are super annoying.
2. Get the two columns together.
My idea (there might be better ones) is to use the str_split function to split the rows every time two spaces appear (i.e. it’s not a normal sentence).
Then, because sometimes there are multiple spaces together at the beginning of the rows, I detect where there is text, where there is not, and I pick the elements with text.
It’s a bit arbitrary, you’ll see, but it works:
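Here is a sketch of that idea on one made-up page, using str_split from the stringr package. The helper split_columns is a hypothetical name, and the logic assumes that text inside a column never contains two spaces in a row.

```r
library(stringr)

# Toy rows from a two-column page (made-up content)
rows <- c(
  "left line one     right line one",
  "                  right line two",
  "left line three"
)

split_columns <- function(rows) {
  # Split each row wherever two or more spaces appear
  parts <- str_split(rows, " {2,}")
  # A row starting with spaces yields "" as first piece: the left column is empty
  left <- vapply(parts, function(p) p[1], character(1))
  # The second piece, when there is one, belongs to the right column
  right <- vapply(parts, function(p) if (length(p) > 1) p[2] else "", character(1))
  list(left = left, right = right)
}

cols <- split_columns(rows)

# Reading order: the whole left column first, then the right one
page_text <- c(cols$left[cols$left != ""], cols$right[cols$right != ""])
page_text
```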
Now, let’s put it together, thanks to the marker page that we added earlier:
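One way this could be done, assuming the columns are stored in a data frame with a page column (the data frame cols and its contents are made up for the sketch): process each page separately, left column before right column, then chain the pages in order.

```r
# Toy result of the column split, with the page number of each row (made-up content)
cols <- data.frame(
  page  = c(1, 1, 2, 2),
  left  = c("p1 left a", "p1 left b", "p2 left a", ""),
  right = c("p1 right a", "", "p2 right a", "p2 right b"),
  stringsAsFactors = FALSE
)

# For each page: non-empty left-column rows first, then non-empty right-column rows
text_by_page <- lapply(split(cols, cols$page), function(p) {
  c(p$left[p$left != ""], p$right[p$right != ""])
})

# Chain the pages into one vector of text lines in reading order
all_rows <- unlist(text_by_page, use.names = FALSE)
all_rows
```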
Now that we have a nice clean vector of all text lines in the right order, we can start extracting the speeches.
3. Find the rows of the speakers
This is where you must look into the document to spot some patterns that would help us detect where the speeches start and end.
It’s actually fairly easy, since all speakers are introduced with “Mr.” or “Mrs.”, and the president is always called “The President:” or “The Acting President:”.
Let’s get these rows:
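A sketch with base R's grep on made-up rows (the names and countries are placeholders, not taken from the record):

```r
# Toy text lines (made-up content)
rows <- c(
  "The President: I call on the first speaker.",
  "Mr. Example (Country A): Thank you.",
  "We welcome the report.",
  "Mrs. Sample (Country B): My delegation agrees.",
  "The Acting President: I thank the speaker."
)

# Rows that start with "Mr. " or "Mrs. "
speaker_rows <- grep("^(Mr|Mrs)\\. ", rows)

# Rows that start with "The President:" or "The Acting President:"
president_rows <- grep("^The (Acting )?President:", rows)

speaker_rows
president_rows
```

Anchoring the patterns with ^ matters: without it, any line that merely mentions a "Mr." in the middle of a sentence would be flagged as the start of a speech.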
Now it’s easy. We know where the speeches start, and they always end with someone else speaking (whether another speaker or the president).
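That logic can be sketched as follows, again on made-up rows: every detected row starts a block, each block ends right before the next one starts, and only the blocks opened by a speaker are kept. The variable names are mine.

```r
# Toy text lines (made-up content)
rows <- c(
  "The President: I call on the first speaker.",
  "Mr. Example (Country A): Thank you.",
  "We welcome the report.",
  "Mrs. Sample (Country B): My delegation agrees.",
  "The Acting President: I thank the speaker."
)
speaker_rows <- grep("^(Mr|Mrs)\\. ", rows)
president_rows <- grep("^The (Acting )?President:", rows)

# Every detected row opens a block of text
starts <- sort(c(speaker_rows, president_rows))

# Each block ends on the row before the next start; the last one ends with the text
ends <- c(starts[-1] - 1, length(rows))

# Keep only the blocks opened by a speaker, not by the president
keep <- starts %in% speaker_rows

# Paste the rows of each speech into a single string
speeches <- Map(
  function(from, to) paste(rows[from:to], collapse = " "),
  starts[keep], ends[keep]
)
speeches
```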
Finally, we get all the speeches in a list. We can now analyze what each country representative talks about, how this evolves over more documents, over the years, depending on the topic discussed, etc.
Now, one could argue that for one document, it would be easier to extract it in a semi-manual way (by specifying the row numbers manually, for example). This is true.
But the idea here is to replicate this same process over hundreds, or even thousands, of such documents.
This is where the fun begins, as they will all have their specificities, the format might evolve, sometimes stuff is misspelled, etc. In fact, even with this example, the extraction is not perfect! You can try to improve it if you want.