Opening Django Model Files in PyMuPDF
I have been working on a file upload feature for a Django project. We use django-storages
to store files in AWS S3, but we store them on the filesystem for local development.
PyMuPDF is a Python library for reading and writing PDF files. It's very fast and I've been delighted to use it.
I ran into a small issue getting my Django file uploads to work with PyMuPDF. This post will outline my notes on how I worked through it.
Opening a file from the local filesystem
Let's say we have a model SlideDeck with a FileField that stores a PDF file:
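from django.db import models

class SlideDeck(models.Model):
    # Named "file" to match the slide_deck.file usage in the functions below
    file = models.FileField(
        upload_to="slide_decks",
        max_length=500,
    )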
And a model SlideImage that stores an image for each slide in the PDF:
class SlideImage(models.Model):
    slide_deck = models.ForeignKey(
        SlideDeck,
        on_delete=models.CASCADE,
    )
    img = models.ImageField(
        upload_to="slide_decks/images",
        max_length=500,
    )
And we have a helper function to create a list of SlideImage objects from a SlideDeck object:
import pymupdf

def _create_empty_slides(slide_deck: SlideDeck) -> list[SlideImage]:
    # Open the PDF from its absolute path on the local filesystem
    pdf_document = pymupdf.open(slide_deck.file.path)
    # Create one empty SlideImage per page in the PDF
    slide_img_objs = SlideImage.objects.bulk_create(
        [
            SlideImage(slide_deck=slide_deck)
            for _ in range(len(pdf_document))
        ]
    )
    pdf_document.close()
    return slide_img_objs
The above works with Django's default FileSystemStorage backend. But if you're using a remote storage backend like the AWS S3 backend from django-storages, the file will not be available on the local filesystem.
We are using AWS S3, and this was the error I ran into when trying to open a file with open(model_obj.file.path):
NotImplementedError: This backend doesn't support absolute paths.
Should you open with file.name or file.path?
Assuming we have a FieldFile object from a model instance:
file = model_obj.file
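file.name returns the path to the file relative to the root of the storage backend.
file.path returns the absolute path on the local filesystem, and raises the NotImplementedError shown above on backends that have no local paths.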
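To make the difference concrete, here is a minimal sketch that uses two hypothetical stand-in classes instead of real Django FieldFile objects. LocalFile, RemoteFile, the "/srv/media/" prefix, and the local_path_or_none() helper are all assumptions made for illustration. The only behavior borrowed from Django is that a backend without local paths raises NotImplementedError when .path is accessed.

```python
from typing import Optional


class LocalFile:
    """Hypothetical stand-in for a FieldFile stored on the local filesystem."""

    name = "slide_decks/deck.pdf"

    @property
    def path(self) -> str:
        # "/srv/media/" is an assumed storage root, for illustration only
        return "/srv/media/" + self.name


class RemoteFile:
    """Hypothetical stand-in for a FieldFile stored on a remote backend."""

    name = "slide_decks/deck.pdf"

    @property
    def path(self) -> str:
        # Mirrors the error raised by backends that have no local paths
        raise NotImplementedError("This backend doesn't support absolute paths.")


def local_path_or_none(field_file) -> Optional[str]:
    """Return the absolute path if the backend provides one, else None."""
    try:
        return field_file.path
    except NotImplementedError:
        return None
```

Both stand-ins expose the same name, but only the local one can hand back a path. Code that needs to run against either backend should therefore avoid depending on .path.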
We could then rewrite the above _create_empty_slides() function like this:
def _create_empty_slides(slide_deck: SlideDeck) -> list[SlideImage]:
    # Open the PDF using its path relative to the storage root
    pdf_document = pymupdf.open(slide_deck.file.name)
    slide_img_objs = SlideImage.objects.bulk_create(
        [
            SlideImage(slide_deck=slide_deck)
            for _ in range(len(pdf_document))
        ]
    )
    pdf_document.close()
    return slide_img_objs
That can still work on the filesystem, provided the relative path resolves from the process's working directory, but it's not portable to a remote storage backend. More on that later.
Opening an InMemoryUploadedFile
There's a GitHub discussion about uploaded files in PyMuPDF. The author had to seek back to the beginning of the file before opening it:
import pymupdf
from django.http import HttpRequest

def get_pages_of_uploaded_pdf(request: HttpRequest) -> int:
    # Get the file
    uploaded_pdf = request.FILES.get("user_uploaded_pdf")
    # Set read location
    uploaded_pdf.seek(0)
    # Open the file
    pdf_document = pymupdf.open(
        stream=uploaded_pdf.read(), filetype="pdf"
    )
    num_pages = len(pdf_document)
    pdf_document.close()
    return num_pages
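The seek(0) call matters whenever something earlier has already read from the upload, because a file-like object remembers its read position. Here is a small sketch of that behavior, with io.BytesIO standing in for the uploaded file. The byte string is made-up sample data, not a valid PDF.

```python
import io

# io.BytesIO stands in for the uploaded file object here
uploaded = io.BytesIO(b"%PDF-1.7 example bytes")

first_read = uploaded.read()   # consumes the stream; position is now at the end
second_read = uploaded.read()  # nothing left to read, so this is empty

uploaded.seek(0)               # rewind to the start of the stream
after_seek = uploaded.read()   # the full content is available again
```

Without the rewind, the second read hands an empty bytes object to whatever consumes it next.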
Opening a file from remote storage
The PyMuPDF docs have a section on opening remote files. They outline getting the remote file content as a bytes object, and then opening the file with PyMuPDF:
import pymupdf
import requests
r = requests.get("https://mupdf.com/docs/mupdf_explored.pdf")
data = r.content
doc = pymupdf.open(stream=data)
I started to implement this in our project, but didn't want to check which backend we're using every time we open a file.
So instead, we can leverage Django's default_storage object to open the file:
import pymupdf
from django.core.files.storage import default_storage

def get_pages_from_pdf(slide_deck: SlideDeck) -> int:
    # Get the file bytes data
    with default_storage.open(slide_deck.file.name, "rb") as f:
        content: bytes = f.read()
    # Open the file with pymupdf
    pdf_document = pymupdf.open(stream=content)
    num_pages = len(pdf_document)
    pdf_document.close()
    return num_pages
This reads the entire file into memory as a bytes object in the content variable. If you're low on memory, this might create problems.
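If holding the whole file in a bytes object is a concern, one possible alternative is to copy the stream to a temporary file in chunks and open that file by path instead. The sketch below is an assumption on my part rather than something from the PyMuPDF docs, and as_temp_file() is a hypothetical helper name. It only uses the standard library, so it works with any binary file-like object.

```python
import contextlib
import os
import shutil
import tempfile
from typing import BinaryIO, Iterator


@contextlib.contextmanager
def as_temp_file(src: BinaryIO, suffix: str = ".pdf") -> Iterator[str]:
    """Copy a binary stream to a temporary file and yield the file's path."""
    tmp = tempfile.NamedTemporaryFile(suffix=suffix, delete=False)
    try:
        with tmp:
            # copyfileobj reads and writes in fixed-size chunks,
            # so the whole file is never held in memory at once
            shutil.copyfileobj(src, tmp)
        yield tmp.name
    finally:
        # Remove the temporary file once the caller is done with it
        os.remove(tmp.name)


# Possible usage with the names from this post (untested assumption):
#
#   with default_storage.open(slide_deck.file.name, "rb") as f:
#       with as_temp_file(f) as path:
#           with pymupdf.open(path) as pdf_document:
#               num_pages = len(pdf_document)
```

The trade-off is extra disk I/O and a temporary file to clean up, so this is probably only worth it for large PDFs.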
Opening a file with pymupdf as a context manager
We can rewrite the above function to use a context manager, automatically closing the file when we're done:
import pymupdf
from django.core.files.storage import default_storage

def get_pages_from_pdf(slide_deck: SlideDeck) -> int:
    # Get the file bytes data
    with default_storage.open(slide_deck.file.name, "rb") as f:
        content: bytes = f.read()
    # Open the file with pymupdf
    with pymupdf.open(stream=content) as pdf_document:
        num_pages = len(pdf_document)
    return num_pages
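The main benefit of the with form is that the document is closed even if something fails partway through. The sketch below shows that with a hypothetical FakeDocument class, which is not part of PyMuPDF and only imitates the context manager protocol. Its simulated error stands in for any failure while working with the document.

```python
class FakeDocument:
    """Hypothetical stand-in for a document object that supports `with`."""

    def __init__(self) -> None:
        self.closed = False

    def __enter__(self) -> "FakeDocument":
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        # Always close, whether or not the block raised
        self.close()
        return False  # do not swallow the exception

    def close(self) -> None:
        self.closed = True

    def __len__(self) -> int:
        raise ValueError("simulated failure while counting pages")


doc = FakeDocument()
try:
    with doc:
        len(doc)  # raises inside the with block
except ValueError:
    pass

was_closed = doc.closed  # close() still ran despite the exception
```

With a manual close() call placed after the work, the same exception would skip the close entirely.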