Issues with PDFs come up frequently. I briefly touched on how I deal with PDFs in my article Digitizing Learning Materials, which is mainly about digitization from scratch, i.e., using scanners, fixing scanned images and using OCR. This article is about incorporating already digitized materials (PDFs) for Incremental Reading in SuperMemo.
TL;DR: For copyable PDFs, copy the whole PDF, add empty lines between paragraphs, use a tool to remove line breaks, then paste into SuperMemo. For non-copyable PDFs, use OCR software.
First and foremost, avoid PDF whenever possible. PDF is difficult to edit and format. It’s like having your food mixed with sand: it takes a lot of time to remove the unwanted parts. Actively look for other file formats. For example, for the paper I frequently mention, an EPUB version is available, which can be imported into SuperMemo perfectly.
Different Types of PDFs
There are basically three types of PDFs:
1. Images. There’s nothing you can do except use OCR software to convert them into editable text.
Source: Applying the Science of Learning
2. OCR (Optical Character Recognition) text. The text has been OCR-ed and will more than likely contain errors. What you see may not be what you get when you copy it. If your PDF is heavy in symbols and equations, it’s a bad idea to copy and paste. I treat this like the first type: I use OCR software to remove the formatting.
3. Genuine copyable text. After copy and paste, it contains no errors. This is the only type we can work on. Example:
Source: Distributed Systems
How can you tell 2. from 3.?
If it is a scanned document yet allows copy and paste, you know it’s OCR-ed. You can also try highlighting the text. If it doesn’t get highlighted perfectly, it’s probably OCR-ed. The most reliable test is to copy and paste: if the text shows a lot of errors and weird symbols after you paste it into Microsoft Word, you know it is OCR-ed.
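The "weird symbols" part of that copy-and-paste check can be roughly automated. The Python below is only a sketch under stated assumptions: the function name `suspicious_ratio` is made up for illustration, and it counts just replacement characters, control characters, private-use characters and unassigned code points. It will not catch ordinary OCR misreadings such as a wrong letter, so it is a first filter, not a substitute for reading the pasted text.

```python
import unicodedata


def suspicious_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that look like debris.

    Hypothetical helper: the choice of "suspicious" categories is an assumption.
    """
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    # Cc = control, Co = private use, Cn = unassigned.
    # U+FFFD is the Unicode replacement character.
    bad = sum(
        1
        for c in chars
        if c == "\ufffd" or unicodedata.category(c) in {"Cc", "Co", "Cn"}
    )
    return bad / len(chars)
```

Paste a page of copied text into a string and call `suspicious_ratio(page)`. A value clearly above zero suggests the text layer is not genuine; what counts as "clearly" is a judgment call, not a fixed threshold.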
1. & 2. If I Have Non-copyable or OCR PDFs
I just dump the whole PDF into FineReader for Optical Character Recognition (OCR). In FineReader, after processing, there are various output options. I choose Word (.docx) because of the navigation function. It accurately separates the chapters for me. After selecting a whole chapter in Word, I copy and paste with PureText.
3. Extracting Content From a Genuine PDF
This is only possible when your PDF is genuinely copyable (type 3). If the PDF is OCR-ed, cleaning up the copied text takes so much time that it’s worse than simply using OCR software.
I only do this for high-stakes materials, basically text that contains a lot of symbols, numbers and equations. Examples would be mathematics, computer science, physics and chemistry. On the other hand, if the source material involves very few symbols, such as history or psychology, then I just dump the whole PDF into FineReader and OCR it.
With genuinely copyable PDFs, the major benefit of copy and paste over OCR software is, of course, accuracy. The copied text is 100% correct. The problem is the line breaks. If you paste it into SuperMemo, it looks like this:
What I want, however:
PS: If you don’t care about the line breaks, you can simply copy the whole PDF at once, rinse it through Notepad, and paste it straight into SuperMemo.
My Step-by-step Process
1. Highlight and copy a whole chapter in the PDF
2. Paste it into Word
3. Add an empty line between paragraphs manually
4. (Optional) Remove headers, page numbers, etc.
5. Use this Line Break Removal Tool to remove line breaks but not paragraph breaks.
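Steps 4 and 5 can also be scripted. The Python below is a minimal sketch of what a line-break removal tool does, not the code behind the tool linked above. The function names `drop_page_furniture` and `unwrap_paragraphs` and the `headers` parameter are hypothetical, and it assumes that empty lines mark paragraph boundaries and that page numbers sit on lines of their own. Words hyphenated across lines are not rejoined.

```python
import re


def drop_page_furniture(text: str, headers: tuple[str, ...] = ()) -> str:
    """Drop lines that are only a page number or exactly match a known header."""
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        # Assumption: a digits-only line is a page number.
        if stripped.isdigit() or (stripped and stripped in headers):
            continue
        kept.append(line)
    return "\n".join(kept)


def unwrap_paragraphs(text: str) -> str:
    """Join hard-wrapped lines; keep empty lines as paragraph breaks."""
    # Normalize Windows and old Mac line endings first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # One or more empty (or whitespace-only) lines separate paragraphs.
    blocks = re.split(r"\n\s*\n", text.strip())
    paragraphs = []
    for block in blocks:
        # Collapse every run of whitespace, including line breaks, to one space.
        joined = " ".join(block.split())
        if joined:
            paragraphs.append(joined)
    return "\n\n".join(paragraphs)


if __name__ == "__main__":
    copied = (
        "The copied text is\n"
        "correct, but every line\n"
        "ends with a break.\n"
        "\n"
        "This is the second\n"
        "paragraph.\n"
    )
    print(unwrap_paragraphs(copied))
```

Without the empty line between the two paragraphs, the same call returns a single paragraph, which is why step 3 matters.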
The most time-consuming step is manually adding empty lines to the copied content in Word. If you copy the whole PDF and remove all the line breaks without adding empty lines first, it will end up as one big jumbled block of text:
So for every paragraph, you need to add an empty line:
What about figures, images and pictures?
At this point you can decide whether to include figures in the Word document. You can use the Windows Snipping Tool (Win+Shift+S) to capture them and insert them in their appropriate places. I recommend Greenshot for faster and easier image clipping.
You can then convert the .docx into .html, open the HTML file, and copy and paste it into SuperMemo. That HTML contains everything you need. All images will be displayed properly because SuperMemo references the local copies of your images, so you don’t have to open the original PDF when you encounter figures and pictures. Personally, I don’t include images: when I come across one during Incremental Reading, I just open the original PDF and refer back to it.
This is the best way I know to extract information from copyable PDFs. It takes time to prepare your material this way but I think the time cost is bearable; the time investment upfront for authenticity and accuracy is worth it.