Extracting Images and Image Information from PDFs with Python


Kwon Hyun-wook (Exceller) · 2024. 9. 29. 18:19

Before We Begin

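PDF (Portable Document Format) files are widely used for sharing and preserving documents because of their consistent formatting and flexibility. Besides text, a PDF can contain images. This post shows how to extract images, and information about those images, from a PDF file with Python.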

Kwon Hyun-wook (Exceller) | CEO of iExceller.com · Microsoft Excel MVP · Excel solution provider · Author

※ This post is based on the article below, but it also includes a good deal of the author's own opinions and additional material.

Original article: Extract Images and Image Information from PDF with Python
URL: https://medium.com/@alice.yang_10652/extract-images-and-image-information-from-pdf-with-python-10719a3bda81

A Python Library for Extracting Image Information from PDFs

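To extract images and image information from PDF files in Python, we will use Spire.PDF for Python. It is a feature-rich, user-friendly library designed for creating, reading, editing, and converting PDF files within Python applications. You can install Spire.PDF for Python from PyPI with the following pip command.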

pip install Spire.Pdf

If Spire.PDF for Python is already installed and you want to upgrade to the latest version, use the following pip command.

pip install --upgrade Spire.Pdf

More details on installation are available [here].
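Before running the examples, it can help to confirm that the package is actually present in the active environment. The following is a minimal sketch using only the standard library; the helper name installed_version is hypothetical, and it assumes the distribution is registered under the same name used in the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "Spire.Pdf" is the distribution name taken from the pip command above.
    found = installed_version("Spire.Pdf")
    print(found if found else "Spire.Pdf is not installed in this environment")
```

If the check reports that the package is missing, the usual cause is that pip installed it into a different interpreter or virtual environment than the one running the script.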

Extracting Images from a PDF with Python

The PdfImageHelper class in Spire.PDF for Python provides a convenient way to work with the images in a PDF. To get the images on a page, call PdfImageHelper.GetImagesInfo. It returns a list of PdfImageInfo objects, each representing one image on that PDF page. Once you have a PdfImageInfo object, you can save its image to a file with PdfImageInfo.Image.Save().

The code below shows how to extract images from a PDF file using Python and Spire.PDF for Python.

import os

from spire.pdf.common import *
from spire.pdf import *

def extract_images_from_pdf(pdf_path, output_dir):
    """
    Extracts all images from a PDF file and saves them to the specified output directory.

    Args:
        pdf_path (str): The path to the PDF file.
        output_dir (str): The directory where the extracted images will be saved.
    """
    # Make sure the output directory exists before saving into it
    os.makedirs(output_dir, exist_ok=True)

    # Create a PdfDocument object and load the PDF file
    doc = PdfDocument()
    doc.LoadFromFile(pdf_path)

    # Create a PdfImageHelper object
    image_helper = PdfImageHelper()

    image_count = 1
    # Iterate over all pages in the PDF
    for page_index in range(doc.Pages.Count):
        # Get the image information for the current page
        image_infos = image_helper.GetImagesInfo(doc.Pages[page_index])

        # Extract and save the images
        for image_index in range(len(image_infos)):
            # Get the image
            image = image_infos[image_index].Image
            # Specify the output file name
            output_file = os.path.join(output_dir, f"Image-{image_count}.png")
            # Save the image
            image.Save(output_file)
            image_count += 1

    # Close the PdfDocument object
    doc.Close()

# Example usage
extract_images_from_pdf("Sample.pdf", "C:/Users/Administrator/Desktop/Images")
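The function above calls doc.Close() only when every step succeeds, so an error while saving an image would leave the document open. One possible way to tighten this is a small context manager, sketched below. The name open_pdf and the document_factory parameter are hypothetical and not part of Spire.PDF; the sketch assumes only the LoadFromFile() and Close() methods already used in the code above.

```python
from contextlib import contextmanager


@contextmanager
def open_pdf(pdf_path, document_factory):
    """Load a PDF and make sure Close() runs, even if processing fails.

    document_factory is any zero-argument callable that returns an object
    with LoadFromFile() and Close() methods (assumption: Spire.PDF's
    PdfDocument class fits this shape, as in the example above).
    """
    doc = document_factory()
    try:
        doc.LoadFromFile(pdf_path)
        yield doc
    finally:
        # Runs on normal exit and when an exception propagates
        doc.Close()
```

With Spire.PDF installed, the page loop could then sit inside `with open_pdf("Sample.pdf", PdfDocument) as doc:` and the explicit doc.Close() call at the end would no longer be needed.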

Extracting Image Information from a PDF with Python

To extract image information from a PDF, such as position (x and y coordinates), width, and height, use the PdfImageInfo.Bounds.X, PdfImageInfo.Bounds.Y, PdfImageInfo.Bounds.Width, and PdfImageInfo.Bounds.Height properties.

The code below shows how to extract image information such as position (x and y coordinates), width, and height from a PDF file using Python and Spire.PDF for Python.

from spire.pdf.common import *
from spire.pdf import *

def print_pdf_image_info(pdf_path):
    """
    Prints information about the images in a PDF file.
    
    Args:
        pdf_path (str): The path to the PDF file.
    """
    # Create a PdfDocument object and load the PDF file
    doc = PdfDocument()
    doc.LoadFromFile(pdf_path)

    # Create a PdfImageHelper object
    image_helper = PdfImageHelper()

    # Iterate over all pages in the PDF
    for page_index in range(doc.Pages.Count):
        page = doc.Pages[page_index]

        # Get the image information for the current page
        image_infos = image_helper.GetImagesInfo(page)

        # Print the image information
        for image_index, image_info in enumerate(image_infos):
            print(f"Page {page_index + 1}, Image {image_index + 1}:")
            print(f"  Image position: ({image_info.Bounds.X}, {image_info.Bounds.Y})")
            print(f"  Image size: {image_info.Bounds.Width} x {image_info.Bounds.Height}")

    # Close the PdfDocument object
    doc.Close()

# Example usage
print_pdf_image_info("Sample.pdf")
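Printing is fine for a quick look, but the same values are often easier to work with in a file. The sketch below flattens the four Bounds properties into a plain record and writes the records to CSV. The helper names bounds_to_record and write_records_csv and the column names are hypothetical; the only Spire.PDF detail assumed is the Bounds.X, Bounds.Y, Bounds.Width, and Bounds.Height properties described above.

```python
import csv


def bounds_to_record(page_number, image_number, bounds):
    """Flatten one image's Bounds into a plain dict.

    bounds is any object exposing X, Y, Width, and Height attributes,
    such as the Bounds of a PdfImageInfo object.
    """
    return {
        "page": page_number,
        "image": image_number,
        "x": bounds.X,
        "y": bounds.Y,
        "width": bounds.Width,
        "height": bounds.Height,
    }


def write_records_csv(records, csv_path):
    """Write the records to a CSV file with a header row."""
    fieldnames = ["page", "image", "x", "y", "width", "height"]
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)
```

Inside the loop of print_pdf_image_info, each print group could be replaced by appending bounds_to_record(page_index + 1, image_index + 1, image_info.Bounds) to a list, followed by a single write_records_csv call after the loop.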

Wrapping Up

We looked at how to extract images from a PDF file with Python. The same approach also retrieves details about each image, such as its position (x and y coordinates), width, and height.
