Opening Django Model Files in PyMuPDF

I have been working on a file upload feature for a Django project. We use django-storages to store files in AWS S3, but we store them on the filesystem for local development.

PyMuPDF is a Python library for reading and writing PDF files. It's very fast and I've been delighted to use it.

I ran into a small issue getting my Django file uploads to work with PyMuPDF. This post will outline my notes on how I worked through it.

Opening a file from the local filesystem

Let's say we have a model SlideDeck with a FileField that stores a PDF file:

from django.db import models


class SlideDeck(models.Model):
    pdf = models.FileField(
        upload_to="slide_decks",
        max_length=500,  # max_length must be an integer, not a string
    )

And a model SlideImage that stores an image for each slide in the PDF:

class SlideImage(models.Model):
    slide_deck = models.ForeignKey(
        SlideDeck,
        on_delete=models.CASCADE,
    )
    img = models.ImageField(
        upload_to="slide_decks/images",
        max_length=500,
    )

And we have a helper function to create a list of SlideImage objects from a SlideDeck object:

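import pymupdf


def _create_empty_slides(slide_deck: SlideDeck) -> list[SlideImage]:
    # The FileField on SlideDeck is named "pdf"; .path needs local storage
    pdf_document = pymupdf.open(slide_deck.pdf.path)
    # Create one (still empty) SlideImage row per page in the PDF
    slide_img_objs = SlideImage.objects.bulk_create(
        [
            SlideImage(slide_deck=slide_deck)
            for _ in range(len(pdf_document))
        ]
    )
    pdf_document.close()
    return slide_img_objs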

The above works with the Django default FileSystemStorage backend. But if you're using a remote storage backend like the AWS S3 backend from django-storages, the file will not be available on the filesystem.

We are using AWS S3 and this was the error I ran into when trying to open a file with open(model_obj.file.path):

NotImplementedError: This backend doesn't support absolute paths.
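If you only need to know whether a local path exists for a given file, one option is to catch that exception instead of inspecting the configured backend. This is a minimal sketch rather than anything Django provides: local_path_or_none is a hypothetical helper name, and it only assumes that the object it receives has a path attribute that may raise NotImplementedError.

```python
from typing import Any, Optional


def local_path_or_none(field_file: Any) -> Optional[str]:
    """Return field_file.path, or None if the storage has no local paths.

    Hypothetical helper: `field_file` is assumed to behave like a Django
    FieldFile, whose `.path` raises NotImplementedError on storage backends
    that do not keep files on the local filesystem.
    """
    try:
        return field_file.path
    except NotImplementedError:
        # Raised by backends that don't support absolute paths
        return None
```

A caller could pass the returned path straight to pymupdf.open() when it is not None, and fall back to reading the file's bytes otherwise.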

Should you open with file.name or file.path?

Assuming we have a model FieldFile object:

file = model_obj.file

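file.name returns the file's path relative to the root of its storage (for FileSystemStorage, that root is MEDIA_ROOT by default).

file.path returns the absolute path on the local filesystem. It is only available for storage backends that keep files on the local filesystem; other backends raise the NotImplementedError shown above.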

We could then rewrite the above _create_empty_slides() function like this:

def _create_empty_slides(slide_deck: SlideDeck) -> list[SlideImage]:
    pdf_document = pymupdf.open(slide_deck.pdf.name)
    slide_img_objs = SlideImage.objects.bulk_create(
        [
            SlideImage(slide_deck=slide_deck)
            for i in range(len(pdf_document))
        ]
    )
    pdf_document.close()
    return slide_img_objs

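That can still work on the filesystem, but only when the relative name happens to resolve from the process's current working directory, because file.name is relative to the storage root rather than absolute. It's also not portable to a remote storage backend. More on that later.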
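To make the relative-versus-absolute distinction concrete, here is a small stdlib-only illustration. The media root and file name are made-up values, and posixpath stands in for what a filesystem storage does when it joins its root with the stored name.

```python
import posixpath

# Hypothetical values for illustration only
MEDIA_ROOT = "/srv/app/media"   # root directory of the local storage
name = "slide_decks/deck.pdf"   # what file.name looks like

# file.name is not an absolute path, so opening it directly
# depends on the current working directory
is_absolute = posixpath.isabs(name)

# An absolute path only appears once the storage root is joined in,
# which is roughly what file.path gives you on local storage
full_path = posixpath.join(MEDIA_ROOT, name)
```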

Opening an InMemoryUploadedFile

There's a GitHub discussion about uploaded files in PyMuPDF. The author had to seek back to the beginning of the uploaded file before reading it:

import pymupdf
from django.http import HttpRequest


def get_pages_of_uploaded_pdf(request: HttpRequest) -> int:
    # Get the file
    uploaded_pdf = request.FILES.get("user_uploaded_pdf")

    # Set read location
    uploaded_pdf.seek(0)

    # Open the file
    pdf_document = pymupdf.open(
        stream=uploaded_pdf.read(), filetype="pdf"
    )
    num_pages = len(pdf_document)
    pdf_document.close()
    return num_pages
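The seek matters whenever something has already read from the file object: the read position then sits at the end, and a second read() returns empty bytes. I'm assuming that is what happened in the discussion. The snippet below reproduces the behavior with io.BytesIO standing in for the uploaded file, and the byte string is just placeholder content.

```python
import io

# Stand-in for an uploaded file; the content is a placeholder
buffer = io.BytesIO(b"%PDF-1.7 placeholder bytes")

first_read = buffer.read()    # reads everything, position is now at the end
second_read = buffer.read()   # nothing left to read from this position

buffer.seek(0)                # rewind to the beginning
third_read = buffer.read()    # the full content is available again
```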

Opening a file from remote storage

The PyMuPDF docs have a section on opening remote files. They outline getting the remote file content as a bytes object, and then opening the file with PyMuPDF:

import pymupdf
import requests


r = requests.get("https://mupdf.com/docs/mupdf_explored.pdf")
data = r.content
doc = pymupdf.open(stream=data)

I started to implement this in our project, but didn't want to have to check for what backend we're using every time we want to open a file.

So instead, we can leverage Django's default_storage object to open the file:

import pymupdf
from django.core.files.storage import default_storage


def get_pages_from_pdf(slide_deck: SlideDeck) -> int:
    # Get the file bytes data
    with default_storage.open(slide_deck.pdf.name, "rb") as f:
        content: bytes = f.read()

    # Open the file with pymupdf
    pdf_document = pymupdf.open(stream=content)
    num_pages = len(pdf_document)
    pdf_document.close()
    return num_pages

This reads the entire file into memory as a bytes object in the content variable. If you're low on memory, this might create problems.
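If memory is a concern, one possible alternative is to copy the file to a temporary file in chunks and let PyMuPDF open it by path. This is a sketch I have not benchmarked: temp_pdf_copy is a hypothetical helper, and it assumes the object returned by the storage's open() supports chunked read() calls like a regular binary file.

```python
import contextlib
import os
import shutil
import tempfile
from typing import BinaryIO, Iterator


@contextlib.contextmanager
def temp_pdf_copy(
    source: BinaryIO, chunk_size: int = 1024 * 1024
) -> Iterator[str]:
    """Copy `source` to a temporary .pdf file in chunks and yield its path.

    The temporary file is deleted when the `with` block exits.
    """
    tmp = tempfile.NamedTemporaryFile(suffix=".pdf", delete=False)
    try:
        # Close the temp file after copying so it can be reopened by path
        with tmp:
            shutil.copyfileobj(source, tmp, chunk_size)
        yield tmp.name
    finally:
        os.remove(tmp.name)


# Hypothetical usage with the names from this post (needs Django and PyMuPDF):
#
# with default_storage.open(slide_deck.pdf.name, "rb") as f:
#     with temp_pdf_copy(f) as path:
#         with pymupdf.open(path) as pdf_document:
#             num_pages = len(pdf_document)
```

The trade-off is extra disk I/O and a temporary file to clean up, in exchange for not holding the whole PDF in a single Python bytes object.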

Opening a file with pymupdf as a context manager

We can rewrite the above function to use a context manager, automatically closing the file when we're done:

import pymupdf
from django.core.files.storage import default_storage


def get_pages_from_pdf(slide_deck: SlideDeck) -> int:
    # Get the file bytes data
    with default_storage.open(slide_deck.pdf.name, "rb") as f:
        content: bytes = f.read()

    # Open the file with pymupdf
    with pymupdf.open(stream=content) as pdf_document:
        num_pages = len(pdf_document)
    return num_pages
