Using Python To Convert PDFs To Images

Python is like that annoying kid in school who thinks he can do anything, but in this case, he can! With Python you can build web applications, AI models, chatbots and just about anything you can imagine.

Python is also really good at acting as a wrapper for complex tasks. Often, the heavy lifting is done by low-level libraries written in C or C++ that have Python bindings. One common task for most developers is file manipulation: reading from or writing to a file, or converting between file types.

In this article, I’ll compare and contrast various Python libraries that can convert PDFs to images.

Installing Python

If you already have Python installed, you can skip this step. However, for those who haven’t, read on.

For this tutorial I’ll be using ActiveState’s Python, which is built from vetted source code and regularly maintained for security. You have two choices:

  1. Download and install the pre-built “PDF to JPG” runtime environment for Windows 10 or CentOS 7, or
  2. Build your own custom Python runtime with just the packages you’ll need for this project:
    1. Create a free ActiveState Platform account.
    2. Click the Get Started button and choose Python and the OS you’re working in. Choose the packages you’ll need for this tutorial, including pdf2image and PyPDF2.
    3. Once the runtime builds, download the State Tool and use it to install your runtime.

And that’s it! You now have Python installed in a virtual environment.
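Before moving on, it can help to confirm that the packages are importable from the new runtime. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is made up for illustration, and the module names assume the packages chosen above (`pdf2image` and `PyPDF2`).

```python
from importlib.util import find_spec


def missing_packages(module_names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in module_names if find_spec(name) is None]


if __name__ == "__main__":
    # Module names assumed from the packages selected for this tutorial.
    missing = missing_packages(["pdf2image", "PyPDF2"])
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("All tutorial packages are importable.")
```

This only checks the Python packages. The C and C++ libraries they wrap are installed separately.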

      Ghostscript For Manipulating PDFs

      A very popular tool for manipulating PDF and PostScript formats is Ghostscript. It’s a C library that has bindings in Python in order to provide for easy access from various applications.

      Ghostscript has been around since 1988, and the last release happened a few months ago (April 2019 as of this writing). It’s safe to say that this library is not only proven, but actively maintained. However, be aware that it’s licensed under the GNU Affero General Public License (AGPL), which may prevent it from being a good fit for enterprise applications.

      To get started, install the Python Ghostscript package:

      ```
      pip install ghostscript
      ``` 

      Let’s look at the code to convert a PDF file to an image. This is straightforward, and most of the code comes from the package’s documentation page on PyPI.

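      import locale

      import ghostscript


      def pdf2jpeg(pdf_input_path, jpeg_output_path):
          """Render a PDF to a JPEG file with Ghostscript."""
          args = ["pdf2jpeg",  # program name placeholder; actual value doesn't matter
                  "-dNOPAUSE",  # don't pause after each page
                  "-dBATCH",  # exit once the input has been processed
                  "-sDEVICE=jpeg",
                  "-r144",  # resolution in DPI
                  "-sOutputFile=" + jpeg_output_path,
                  pdf_input_path]

          # The documented example passes the arguments as bytes, so encode them
          encoding = locale.getpreferredencoding()
          args = [a.encode(encoding) for a in args]

          ghostscript.Ghostscript(*args)


      pdf2jpeg(
          "...Fixate/ActiveState/pdf/a.pdf",
          "...Fixate/ActiveState/pdf/a.jpeg",
      )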

      Save the script as gh.py. To execute it, run:

      ```
      python gh.py
      ```

      If the Ghostscript C library isn’t installed yet, you will encounter an error. The last line of the traceback says:

      ```
      RuntimeError: Can not find Ghostscript library (libgs)
      ```

      This means that the Ghostscript Python library we installed isn’t able to find the Ghostscript C library on the development machine. The Python package is just a wrapper around the C library that actually does all the work. So we need to do a second install in order to deploy the C library on our machine.

      If you’re on a Mac with brew installed, you can just run:

      ```
      brew install ghostscript
      ```

      To see installation steps for other platforms, please visit the Ghostscript installation page.
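To confirm that the C library is now visible before re-running the script, a quick check along these lines may help. It is a sketch that uses the standard library’s `ctypes.util.find_library`; the helper name is hypothetical, and the assumption that the library is registered under the name `gs` holds on typical macOS and Linux installs but not on Windows, where the wrapper’s own lookup may differ.

```python
from ctypes.util import find_library


def find_ghostscript_library(name="gs"):
    """Return the shared-library name or path for Ghostscript, or None if not found."""
    # "gs" is an assumed library name (libgs); adjust it for your platform.
    return find_library(name)


if __name__ == "__main__":
    location = find_ghostscript_library()
    if location is None:
        print("libgs not found; install the Ghostscript C library first.")
    else:
        print("Found Ghostscript library:", location)
```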

      Executing the script gh.py again will now perform the conversion of a PDF file named a.pdf into a graphic file named a.jpeg.
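One caveat: with a single output file name, every page of a multi-page PDF is written to that same file, so as far as I know only the last page survives. Ghostscript accepts a page counter such as `%03d` in the output file name to get one file per page. The sketch below only builds the argument list; the function name `build_jpeg_args` and the file names are placeholders for illustration.

```python
def build_jpeg_args(pdf_input_path, output_pattern, dpi=144):
    """Build a Ghostscript argument list that renders each page to its own JPEG."""
    if "%" not in output_pattern:
        raise ValueError("output_pattern needs a page counter such as %03d")
    return [
        "pdf2jpeg",                         # program name placeholder; value is ignored
        "-dNOPAUSE",                        # don't pause after each page
        "-dBATCH",                          # exit once the input has been processed
        "-dSAFER",                          # restrict file access during rendering
        "-sDEVICE=jpeg",
        f"-r{dpi}",                         # resolution in DPI
        "-sOutputFile=" + output_pattern,   # e.g. a-001.jpeg, a-002.jpeg, ...
        pdf_input_path,
    ]


if __name__ == "__main__":
    # Placeholder file names, not paths from the article.
    print(build_jpeg_args("a.pdf", "a-%03d.jpeg"))
```

The resulting list can be encoded and passed to `ghostscript.Ghostscript(*args)` in the same way as in the script above.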

      Ghostscript was first introduced to manage PostScript files, a file format used by printers and fax machines (yes, fax!). But even in the publishing industry, PostScript files have almost entirely been replaced by PDFs. Originally, PDFs were just compiled PostScript files, but since PDF v1.4, Adobe no longer uses PostScript as the basis of the PDF format. Even so, Ghostscript still includes both PDF and PostScript manipulation capabilities.

      Advantages of Ghostscript:

      1. Has been around for more than 30 years, and is still consistently maintained.
      2. Has easy bindings for Python.
      3. Has an extensive feature list.

      Disadvantages of Ghostscript:

      1. Needs the C library to be installed first, as the Python package is just a wrapper for the core C library that does the actual conversion.
      2. AGPL-licensed, which may limit usage in commercial applications.

        Poppler And pdf2image For PDF Conversion

        Poppler is an open-source library written in C++ for rendering PDF documents. It is commonly used on Linux, where PDF viewers in the GNOME and KDE desktop environments rely on it. Its development is hosted by freedesktop.org.

        Poppler was initially launched in 2005 and is still actively supported. The Python package pdf2image is a Python wrapper for Poppler.

        Since ActiveState’s Python already contains the pdf2image Python wrapper, all we need to install is the Poppler C++ library:

        ```
        brew install poppler
        ```
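To check that the install worked, you can look for Poppler’s command-line tools on the `PATH`. This sketch assumes that pdf2image relies on the `pdftoppm` and `pdfinfo` executables, which is my understanding of how the wrapper works; the helper name is hypothetical.

```python
import shutil


def find_poppler_tools(tool_names=("pdftoppm", "pdfinfo")):
    """Map each Poppler command-line tool to its location on PATH, or None if absent."""
    return {name: shutil.which(name) for name in tool_names}


if __name__ == "__main__":
    for tool, path in find_poppler_tools().items():
        print(f"{tool}: {path or 'not found on PATH'}")
```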

        Now, it’s extremely straightforward to convert a PDF to an image:

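        from pdf2image import convert_from_path

        # Render every page of the PDF at 500 DPI; returns a list of PIL images
        pages = convert_from_path('...Fixate/ActiveState/pdf/a.pdf', dpi=500)
        for number, page in enumerate(pages, start=1):
            # One file per page, so later pages don't overwrite earlier ones
            page.save(f'p2i-{number}.jpg', 'JPEG')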
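For long documents, rendering every page at a high DPI can use a lot of memory. A possible variation, sketched here, renders only a page range and writes zero-padded file names so the output sorts in page order. The helper names, the default DPI, and the file naming scheme are my own choices; `first_page` and `last_page` are keyword arguments of `convert_from_path`.

```python
from pathlib import Path


def save_pages(pages, out_dir, stem="page", start=1):
    """Save page images as numbered JPEG files and return the written paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Pad the counter so file names sort in page order (page-01, page-02, ...).
    width = len(str(start + len(pages) - 1))
    saved = []
    for number, page in enumerate(pages, start=start):
        target = out_dir / f"{stem}-{number:0{width}d}.jpg"
        page.save(target, "JPEG")
        saved.append(target)
    return saved


def convert_page_range(pdf_path, out_dir, first_page=1, last_page=None, dpi=200):
    """Render a range of pages with pdf2image and save them as JPEGs."""
    # Imported here so save_pages stays usable without pdf2image installed.
    from pdf2image import convert_from_path

    pages = convert_from_path(
        pdf_path, dpi=dpi, first_page=first_page, last_page=last_page
    )
    return save_pages(pages, out_dir, start=first_page)
```

Calling `convert_page_range(pdf_path, "out", first_page=1, last_page=3)` would write up to three JPEG files into a folder named `out`.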

        Both Poppler and Ghostscript have the advantage of being mature software utilities. However, Ghostscript was created primarily to manage PostScript files, while Poppler was meant from its inception to be a PDF tool. Poppler’s command-line utilities cover common PDF tasks such as merging, splitting, and converting pages to images or text. It pays to be built more than 15 years after your competition!

        Advantages of pdf2image:

        1. The underlying Poppler library has been around for almost 15 years, and is still consistently maintained.
        2. Has easy bindings for Python.
        3. pdf2image features an MIT license, which is generally acceptable for enterprise/commercial use.

        Disadvantages of pdf2image:

        1. It requires a C++ library to be installed, as the Python package is just a wrapper.

        Extracting Data From PDF Files With PyPDF2

        All the examples we’ve spoken about so far are Python wrappers for a much larger C or C++ codebase. With PyPDF2, the entire PDF manipulation logic is written only in Python. This means there is no need to install any other dependent libraries. However, this also means that while PyPDF2 is great at creating, adding and removing pages, it struggles to convert and extract textual data from a PDF file.

        Let’s look at how text can be extracted from a PDF:

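        from PyPDF2 import PdfReader

        # PyPDF2 2.x/3.x API; older releases used PdfFileReader, getPage and extractText
        with open('...Fixate/ActiveState/pdf/a.pdf', 'rb') as pdf_file:
            reader = PdfReader(pdf_file)

            # Number of pages in the document
            print(len(reader.pages))

            # Extract the text of the first page
            first_page = reader.pages[0]
            print(first_page.extract_text())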

        With PyPDF2, it is quite simple to manipulate PDFs programmatically. The Python syntax is extremely intuitive. This would be useful in scenarios where information needs to be extracted and then processed in a larger workflow.
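As an illustration of that kind of manipulation, here is a sketch that copies a PDF while dropping some of its pages. It assumes the PyPDF2 2.x/3.x API (`PdfReader`, `PdfWriter`, `add_page`); the function names and the zero-based page numbering are my own choices rather than anything prescribed by the library.

```python
def pages_to_keep(page_count, pages_to_remove):
    """Return the zero-based indexes of the pages that survive removal."""
    remove = set(pages_to_remove)
    return [index for index in range(page_count) if index not in remove]


def remove_pages(src_path, dst_path, pages_to_remove):
    """Copy src_path to dst_path, skipping the given zero-based page indexes."""
    # Imported lazily; assumes the PyPDF2 2.x/3.x API.
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(src_path)
    writer = PdfWriter()
    for index in pages_to_keep(len(reader.pages), pages_to_remove):
        writer.add_page(reader.pages[index])
    with open(dst_path, "wb") as output_file:
        writer.write(output_file)
```

For example, `remove_pages(src, dst, [0])` would write a copy of the document without its first page.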

        However, it’s important to note that text extraction is only possible when the PDF contains actual text, as programmatically created PDFs do. If the PDF is just a scanned image of a document, PyPDF2 has nothing to extract other than the image itself.
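A rough way to tell the two cases apart is to extract the text of every page and see whether anything comes back. The sketch below assumes the PyPDF2 2.x/3.x API (`PdfReader`, `extract_text`); the helper names are hypothetical, and an empty result is only a hint that the file is scanned, not proof.

```python
def looks_scanned(page_texts):
    """Return True when no page yielded any non-whitespace text."""
    return not any(text and text.strip() for text in page_texts)


def extract_page_texts(pdf_path):
    """Return the extracted text of each page, using '' where nothing was found."""
    # Imported lazily; assumes the PyPDF2 2.x/3.x API.
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]
```

A workflow could call `looks_scanned(extract_page_texts(path))` and route scanned files to an image-based tool instead.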

        PyPDF2 also doesn’t have any capabilities to convert a PDF file into an image, which is understandable since it does not use any core PDF libraries. So if you want to convert your PDF to an image file, the best you can do is extract text and write it to an image file.

        Advantages of PyPDF2:

        1. Written entirely in Python, so there’s no “helper” library to install.
        2. PyPDF2 features a BSD-3 license, which is generally acceptable for enterprise/commercial use.

        Disadvantages of PyPDF2:

        1. Very limited functionality for scanned PDF files.
        2. Much slower compared to Ghostscript and pdf2image, since the code is pure Python.

        Conclusions

        Python is loaded with packages that make large, complex tasks achievable with just a few lines, and PDF manipulation is no different. Although a full-featured, Python-only package has yet to be released, solutions that act as wrappers around C/C++ libraries work great for converting PDF files directly to images. In this case, it’s really a toss-up between Ghostscript and pdf2image, unless your company frowns on AGPL-licensed code. But if you’re looking to just extract specific data from PDF files, PyPDF2 is a great Python-only solution.


Swaathi Kakarla is the co-founder and CTO at Skcript. She enjoys talking and writing about code efficiency, performance and startups. In her free time she finds solace in yoga, bicycling and contributing to open source.

