In this tutorial, we will write a Python script that extracts images from PDF files and saves them to the local disk using the PyMuPDF and Pillow libraries.
PyMuPDF can open PDF, XPS, OpenXPS, EPUB and several other document formats. It runs on all major platforms, including Windows, macOS and Linux.
First, we will install both libraries:
pip install PyMuPDF
pip install Pillow
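One thing that trips people up: the names you import differ from the names you install. PyMuPDF is imported as fitz and Pillow as PIL. Here is a small sketch that checks whether both are available without importing them. The helper name check_installed is made up for this example, and it only uses the standard library.

```python
import importlib.util

# Import names differ from the pip package names:
# PyMuPDF is imported as "fitz", Pillow as "PIL".
REQUIRED = {"PyMuPDF": "fitz", "Pillow": "PIL"}


def check_installed(required=REQUIRED):
    """Return {pip_name: True/False} without importing the packages."""
    return {
        pip_name: importlib.util.find_spec(module_name) is not None
        for pip_name, module_name in required.items()
    }


if __name__ == "__main__":
    for pip_name, ok in check_installed().items():
        status = "installed" if ok else f"missing, run: pip install {pip_name}"
        print(f"{pip_name}: {status}")
```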
I am going to test with this PDF file, but feel free to try your own PDF files. Make sure you set the correct path to your file. To keep things simple, I am going to place the PDF file and the script in the same folder, which will be my working directory.
What will it do?
It will look for all the images on every page of your PDF and save them in your working directory.
NOTE: Do not place this script on your desktop and run it there, or else BOOM: every extracted image lands on your desktop.
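If you would rather not worry about where the script sits, one option is to send the images to their own sub-folder. This is only a sketch: the helper make_output_dir and the folder name extracted_images are my own choices, not something the script requires.

```python
from pathlib import Path


def make_output_dir(base_dir=".", name="extracted_images"):
    """Create the sub-folder if it does not exist yet and return its path."""
    out_dir = Path(base_dir) / name
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir


if __name__ == "__main__":
    print("Images would be saved in:", make_output_dir().resolve())
```

You would then build each output file name as out_dir / "image1_1.png" (and so on) instead of a bare file name.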
# PDF_IMAGE_Extractor.py
import io

import fitz  # PyMuPDF
from PIL import Image

# Enter the name or path of the PDF file
infile = "<YOUR PDF FILE>.pdf"
pdf_file = fitz.open(infile)

# iterate over the PDF pages
for page_index in range(len(pdf_file)):
    # get the page itself
    page = pdf_file[page_index]
    # get_images() was named getImageList() in older PyMuPDF releases
    image_list = page.get_images()
    # print the number of images found on this page
    if image_list:
        print(f"[+] Found a total of {len(image_list)} images in page {page_index}")
    else:
        print("[!] No images found on page", page_index)
    for image_index, img in enumerate(image_list, start=1):
        # get the XREF of the image
        xref = img[0]
        # extract the image bytes; extract_image() was named extractImage()
        # in older PyMuPDF releases
        base_image = pdf_file.extract_image(xref)
        image_bytes = base_image["image"]
        # get the image extension
        image_ext = base_image["ext"]
        # load it into PIL
        image = Image.open(io.BytesIO(image_bytes))
        # save it to the local disk, closing the file when done
        with open(f"image{page_index+1}_{image_index}.{image_ext}", "wb") as fp:
            image.save(fp)
We're using the get_images() method (named getImageList() in older PyMuPDF releases) to list all the image objects on a particular page as a list of tuples. To get an image's cross-reference number (xref), we simply take the first element of its tuple.
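An image that is reused on several pages, such as a logo, shows up in each page's list with the same xref, so it would be saved once per page. Here is a minimal sketch for keeping only the first occurrence. It assumes nothing beyond each tuple starting with the xref. The helper name unique_images is hypothetical, and the sample tuples are shortened stand-ins for real entries.

```python
def unique_images(pages):
    """Yield (page_index, xref) once per distinct xref.

    `pages` is an iterable of per-page image lists, where each entry is a
    tuple whose first element is the image xref.
    """
    seen = set()
    for page_index, image_list in enumerate(pages):
        for img in image_list:
            xref = img[0]
            if xref in seen:
                continue  # already handled on an earlier page
            seen.add(xref)
            yield page_index, xref


# Shortened stand-in tuples; real entries carry more fields after the xref.
sample_pages = [[(12, "logo")], [(12, "logo"), (27, "chart")], []]
print(list(unique_images(sample_pages)))  # [(0, 12), (1, 27)]
```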
After that, we use the extract_image() method (named extractImage() in older PyMuPDF releases), which returns a dictionary holding the image bytes along with additional information such as the image extension.
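Besides the "image" and "ext" entries used in the script, that dictionary also carries fields such as "width" and "height". Here is a small sketch that turns them into a one-line summary. The helper describe_image is my own, the sample dictionary is hand-made, and .get() is used in case a key is absent in your version.

```python
def describe_image(base_image):
    """Summarise one extracted image from its metadata dictionary."""
    size_kb = len(base_image["image"]) / 1024
    # .get() keeps the sketch usable if a key is missing
    width = base_image.get("width", "?")
    height = base_image.get("height", "?")
    return f"{base_image['ext']} {width}x{height}, {size_kb:.1f} KB"


# Hand-made stand-in for the dictionary; real image bytes would differ.
sample = {"ext": "png", "width": 640, "height": 480, "image": b"\x00" * 2048}
print(describe_image(sample))  # png 640x480, 2.0 KB
```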
Finally, we convert the image bytes to a PIL image instance and save it to the local disk using the save() method, which accepts a file pointer (or a file name) as an argument. We simply name each image after its page number and its index on that page.
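Opening the bytes with Pillow and saving them again re-encodes the image. If you do not need Pillow for anything else, the extracted bytes can be written to disk unchanged. Here is a minimal sketch using only the standard library. The function name save_raw_image and the out_dir parameter are assumptions of mine, and the naming follows the same page and image scheme as the script.

```python
from pathlib import Path


def save_raw_image(image_bytes, image_ext, page_index, image_index, out_dir="."):
    """Write the extracted bytes unchanged and return the file path."""
    # Same naming scheme as the script: 1-based page number, then image number
    path = Path(out_dir) / f"image{page_index + 1}_{image_index}.{image_ext}"
    path.write_bytes(image_bytes)
    return path
```

Inside the loop this would be called with the values the script already has: image_bytes, image_ext, page_index and image_index.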
That was it!
After running the script on my test PDF, I get the following output (yours will depend on your file):
[+] Found a total of 1 images in page 0
[+] Found a total of 1 images in page 1
[!] No images found on page 2
[+] Found a total of 1 images in page 3
[+] Found a total of 1 images in page 4
[+] Found a total of 1 images in page 5
[+] Found a total of 1 images in page 6
[+] Found a total of 1 images in page 7
[+] Found a total of 1 images in page 8
[+] Found a total of 1 images in page 9
[!] No images found on page 10
[+] Found a total of 1 images in page 11
And the images are saved as well, in the current directory.
Conclusion
Alright, we have successfully extracted the images from that PDF file without losing image quality. For more information on how the library works, I suggest you take a look at the documentation.