
PDF to Markdown Converter - Setup & Usage Guide

Overview

This is a Python script that converts PDF files to clean Markdown format, extracting text content and document metadata.

Features:

  • Extracts text from all PDF pages
  • Preserves page structure with page headers
  • Extracts metadata (title, author, creation date)
  • Generates YAML front matter in Markdown files
  • Robust error handling (skips problematic PDFs)
  • Detailed logging and conversion summary
  • Multiple CLI options for flexibility

Installation

Prerequisites

  • Python 3.8 or higher
  • pip (Python package manager)

Setup Steps

  1. Clone or download this project (if you haven't already)

  2. Install dependencies:

    pip install -r requirements.txt
    

    This installs:

    • pypdf >= 3.0.0 - For PDF text extraction
    • python-dateutil >= 2.8.0 - For date parsing

  3. Verify installation:

    python3 pdf_to_markdown.py --help
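
If the help command fails, it can be useful to check the two dependencies directly. The following is a minimal sketch, not part of the script; the helper name missing_modules is hypothetical, and the import names are assumed from the packages in requirements.txt (python-dateutil is imported as dateutil).

```python
import importlib.util

# Import names are assumptions based on requirements.txt:
# the "python-dateutil" distribution is imported as "dateutil".
REQUIRED_MODULES = ("pypdf", "dateutil")


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All dependencies are importable.")
```

An empty result means both packages are visible to the interpreter that python3 resolves to.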
    

Usage

Basic Usage

Convert all PDFs in the default folder (./artikel):

python3 pdf_to_markdown.py

Convert PDFs from a custom input folder:

python3 pdf_to_markdown.py /path/to/pdf/folder

Specify both input and output folders:

python3 pdf_to_markdown.py /path/to/input /path/to/output

Advanced Options

Verbose mode (detailed logging):

python3 pdf_to_markdown.py -v ./artikel
python3 pdf_to_markdown.py --verbose ./artikel

Quiet mode (suppress output except errors):

python3 pdf_to_markdown.py -q ./artikel
python3 pdf_to_markdown.py --quiet ./artikel

Dry run (preview without writing files):

python3 pdf_to_markdown.py --dry-run ./artikel
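
The options above map naturally onto Python's argparse module. The sketch below shows one way such a command line could be declared; it is an illustration, not the script's actual parser. The function name build_parser, the default values, and the choice to make -v and -q mutually exclusive are all assumptions.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(
        prog="pdf_to_markdown.py",
        description="Convert PDF files to Markdown.",
    )
    # Both positionals are optional. The ./artikel default is an assumption
    # taken from the guide's examples.
    parser.add_argument("input_dir", nargs="?", default="./artikel")
    parser.add_argument("output_dir", nargs="?", default=None)
    # Verbose and quiet contradict each other, so reject using both at once.
    noise = parser.add_mutually_exclusive_group()
    noise.add_argument("-v", "--verbose", action="store_true")
    noise.add_argument("-q", "--quiet", action="store_true")
    parser.add_argument("--dry-run", action="store_true")
    return parser


args = build_parser().parse_args(["-v", "--dry-run", "./artikel"])
print(args.input_dir, args.verbose, args.dry_run)  # prints: ./artikel True True
```

argparse exposes --dry-run as args.dry_run, so the conversion code can skip the write step when that attribute is true.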

Examples

# Process all PDFs in the artikel folder and save them to artikel/converted
python3 pdf_to_markdown.py

# Process PDFs in custom location with verbose output
python3 pdf_to_markdown.py -v ~/Documents/PDFs

# Test what would be converted without writing files
python3 pdf_to_markdown.py --dry-run ./artikel

# Convert and save to specific output directory
python3 pdf_to_markdown.py ./input_pdfs ./output_markdown

Output Format

Each converted PDF becomes a Markdown file with the following structure:

---
title: Document Title
author: Author Name
created: 2024-02-23
converted: 2024-02-23 14:32:15
source: original_filename.pdf
---

# Document Title

## Page 1

[Extracted text from page 1...]

## Page 2

[Extracted text from page 2...]

Front Matter Fields:

  • title - Document title (from PDF metadata or filename)
  • author - Document author (if available in PDF metadata)
  • created - PDF creation date (if available in metadata)
  • converted - Timestamp of when the conversion occurred
  • source - Original PDF filename
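
A front matter block with these fields can be assembled from a handful of strings. The sketch below is a hypothetical helper, not the script's own code: build_front_matter and yaml_quote are made-up names, and unlike the sample above it double-quotes string values so that a title containing ":" or "#" still parses as YAML.

```python
from datetime import datetime


def yaml_quote(value):
    """Double-quote a string so ':' or '#' in a value cannot break the YAML."""
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_front_matter(source, title=None, author=None, created=None, converted=None):
    converted = converted or datetime.now()
    # Fall back to the filename without its extension when no title is present.
    title = title or source.rsplit(".", 1)[0]
    lines = ["---", f"title: {yaml_quote(title)}"]
    if author:  # optional: only written when the PDF metadata provides it
        lines.append(f"author: {yaml_quote(author)}")
    if created:  # optional: a date or datetime object
        lines.append(f"created: {created:%Y-%m-%d}")
    lines.append(f"converted: {converted:%Y-%m-%d %H:%M:%S}")
    lines.append(f"source: {yaml_quote(source)}")
    lines.append("---")
    return "\n".join(lines)


print(build_front_matter("report.pdf", author="Author Name"))
```

Omitting missing fields, rather than writing empty values, keeps the output valid for tools that read the front matter.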

Troubleshooting

Issue: ModuleNotFoundError: No module named 'pypdf'

Solution: Install dependencies:

pip install -r requirements.txt

Issue: PDF has no extractable text

This typically happens with:

  • Scanned PDFs (image-based, no embedded text layer)
  • Corrupted PDFs
  • Encrypted PDFs

The script will:

  • Log a warning for the file
  • Create a Markdown file that contains the metadata and a note that text extraction failed
  • Continue processing other PDFs
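
To find out in advance which files are affected, the pages can be probed with pypdf before converting. This is a rough sketch under assumptions: the function names and the four labels are invented for illustration, and a PDF with intentionally blank pages would be reported as partial-text. pypdf raises an exception for files it cannot parse, so callers may want to wrap diagnose_pdf in try/except.

```python
def text_coverage(page_texts):
    """Return the fraction of pages that contain any non-whitespace text."""
    if not page_texts:
        return 0.0
    with_text = sum(1 for text in page_texts if text and text.strip())
    return with_text / len(page_texts)


def diagnose_pdf(path):
    """Classify a PDF as 'encrypted', 'no-text', 'partial-text' or 'ok'."""
    from pypdf import PdfReader  # imported here so text_coverage() works without pypdf

    reader = PdfReader(path)
    if reader.is_encrypted:
        # Text cannot be extracted until the document has been decrypted.
        return "encrypted"
    coverage = text_coverage([page.extract_text() for page in reader.pages])
    if coverage == 0.0:
        return "no-text"  # typical for scanned, image-only PDFs
    return "ok" if coverage == 1.0 else "partial-text"
```

A no-text result usually means the file needs OCR before this converter can do anything useful with it.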

Issue: Permission denied when writing files

Solution: Ensure you have write permission for the output directory. If you own the directory, you can grant it with:

chmod 755 /path/to/output/directory
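
The same check can be made from Python before any conversion work starts, so the failure shows up early with a clear message. A minimal sketch; ensure_writable_dir is a hypothetical helper and not a function of the script.

```python
import os
from pathlib import Path


def ensure_writable_dir(path):
    """Create the directory if needed and fail early when it is not writable."""
    out = Path(path)
    out.mkdir(parents=True, exist_ok=True)
    # W_OK: may create files here; X_OK: may enter the directory.
    if not os.access(out, os.W_OK | os.X_OK):
        raise PermissionError(f"Cannot write to {out}")
    return out
```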

Issue: Special characters or encoding problems

The script uses UTF-8 encoding by default, which can represent every Unicode character. If you encounter issues:

  • Ensure your terminal supports UTF-8
  • Check if the PDF contains unusual character encodings
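
Two habits help with such problems when post-processing the output yourself: always name the encoding when writing files, and optionally normalize the extracted text. The sketch below uses hypothetical helper names and is not taken from the script. Note that NFKC normalization is lossy; it folds ligatures such as U+FB01 into "fi" but also rewrites other compatibility characters, for example superscript digits.

```python
import unicodedata
from pathlib import Path


def normalize_text(text):
    """Fold compatibility characters, e.g. the ligature U+FB01 into 'fi'."""
    return unicodedata.normalize("NFKC", text)


def write_markdown(path, text):
    # Pass the encoding explicitly; without it Python may use the platform's
    # locale encoding, which is not always UTF-8 (notably on Windows).
    Path(path).write_text(normalize_text(text), encoding="utf-8")
```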

Output Statistics

After processing, the script displays a summary:

============================================================
CONVERSION SUMMARY
============================================================
Total PDFs:       25
Successful:       23
Failed:           2
Output directory: /path/to/converted
============================================================

If any PDFs failed to convert, details are logged for debugging.
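
A summary in this layout can be produced with plain string formatting. The sketch below reproduces the column alignment of the sample above; format_summary is a hypothetical name and the script's real implementation may differ.

```python
def format_summary(total, successful, output_dir, width=60):
    """Render the conversion summary with labels padded to a common column."""
    rule = "=" * width
    rows = [
        ("Total PDFs:", total),
        ("Successful:", successful),
        ("Failed:", total - successful),
        ("Output directory:", output_dir),
    ]
    # Pad every label to 17 characters, the length of the longest label.
    body = [f"{label:<17} {value}" for label, value in rows]
    return "\n".join([rule, "CONVERSION SUMMARY", rule, *body, rule])


print(format_summary(25, 23, "/path/to/converted"))
```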

File Structure

.
├── pdf_to_markdown.py      # Main conversion script
├── requirements.txt         # Python dependencies
└── PDF_CONVERTER_GUIDE.md  # This file

How It Works

  1. Discovers PDFs - Finds all .pdf files in the input directory
  2. Extracts Metadata - Reads title, author, and creation date from PDF metadata
  3. Extracts Text - Processes each page and extracts text content
  4. Creates Markdown - Formats extracted content with metadata front matter
  5. Saves Files - Writes Markdown files to output directory with same names as PDFs
  6. Reports Results - Displays conversion summary and any errors
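
The six steps can be sketched in a few functions. This is a simplified illustration under assumptions, not the script's source: the function names are invented, the front matter is reduced to title, author, and source, and only PdfReader, reader.metadata, and page.extract_text() are real pypdf calls.

```python
from pathlib import Path


def discover_pdfs(input_dir):
    """Step 1: top-level PDFs only, sorted for a stable processing order."""
    return sorted(Path(input_dir).glob("*.pdf"))


def read_pdf(path):
    """Steps 2-3: return (metadata dict, list of page texts) using pypdf."""
    from pypdf import PdfReader  # lazy import: the other steps need no pypdf

    reader = PdfReader(path)
    info = reader.metadata  # may be None when the PDF has no info dictionary
    meta = {
        "title": (info.title if info else None) or Path(path).stem,
        "author": info.author if info else None,
    }
    pages = [page.extract_text() or "" for page in reader.pages]
    return meta, pages


def render_markdown(meta, pages, source):
    """Step 4: front matter, a title heading, then one section per page."""
    lines = ["---", f"title: {meta['title']}"]
    if meta.get("author"):
        lines.append(f"author: {meta['author']}")
    lines += [f"source: {source}", "---", "", f"# {meta['title']}"]
    for number, text in enumerate(pages, start=1):
        lines += ["", f"## Page {number}", "", text.strip()]
    return "\n".join(lines) + "\n"


def convert_folder(input_dir, output_dir, read=read_pdf):
    """Steps 5-6: write one .md per PDF and return (converted, failed) lists."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    converted, failed = [], []
    for pdf in discover_pdfs(input_dir):
        try:
            meta, pages = read(pdf)
            target = out / f"{pdf.stem}.md"
            target.write_text(render_markdown(meta, pages, pdf.name), encoding="utf-8")
            converted.append(target)
        except Exception as exc:  # skip problematic PDFs and keep going
            failed.append((pdf, exc))
    return converted, failed
```

Catching the exception per file is what lets one broken PDF be reported without stopping the rest of the batch.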

Limitations

  • No image extraction - Images in PDFs are not extracted or embedded
  • Text-only - Requires PDFs with an embedded text layer (scanned PDFs yield little or no text)
  • Limited layout preservation - Complex multi-column layouts may not be preserved
  • No recursive search - Only the top-level input directory is searched (not subdirectories)

Advanced: Customizing the Script

To process subdirectories:

Replace this line in the script:

pdf_files = list(self.input_dir.glob('*.pdf'))

With:

pdf_files = list(self.input_dir.glob('**/*.pdf'))
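
Once subdirectories are included, two PDFs with the same filename in different folders would map to the same output file. One way around this is to mirror the input folder structure in the output directory. The sketch below is hypothetical (plan_outputs is not part of the script) and uses rglob('*.pdf'), which is equivalent to glob('**/*.pdf').

```python
from pathlib import Path


def plan_outputs(input_dir, output_dir):
    """Map each PDF found recursively to an output path that keeps its subfolder."""
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    plan = {}
    for pdf in sorted(input_dir.rglob("*.pdf")):
        relative = pdf.relative_to(input_dir)  # e.g. sub/a.pdf
        plan[pdf] = output_dir / relative.with_suffix(".md")
    return plan
```

Each target's parent folder must exist before writing, for example via target.parent.mkdir(parents=True, exist_ok=True).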

To include image extraction:

The script currently skips images. To add image extraction:

  1. Replace pypdf with PyMuPDF (package pymupdf, imported as fitz) for better image support
  2. Modify the extract_text() method to save images
  3. Update create_markdown() to reference extracted images
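
A rough sketch of what these steps could look like with PyMuPDF follows. It assumes pymupdf is installed (it is not listed in requirements.txt), and the function names, the file naming scheme, and the alt text are invented for illustration. How the references are merged into the page sections is left to create_markdown().

```python
from pathlib import Path


def image_markdown(image_path, page_number, index):
    """Build the Markdown reference for one extracted image."""
    return f"![Page {page_number}, image {index}]({Path(image_path).as_posix()})"


def extract_images(pdf_path, image_dir):
    """Save embedded images with PyMuPDF; return a Markdown reference per image."""
    import fitz  # PyMuPDF; install with: pip install pymupdf

    image_dir = Path(image_dir)
    image_dir.mkdir(parents=True, exist_ok=True)
    references = []
    with fitz.open(pdf_path) as doc:
        for page_number, page in enumerate(doc, start=1):
            for index, image in enumerate(page.get_images(full=True), start=1):
                xref = image[0]  # first tuple entry is the image's xref number
                data = doc.extract_image(xref)  # dict with "image" bytes and "ext"
                name = f"{Path(pdf_path).stem}_p{page_number}_{index}.{data['ext']}"
                (image_dir / name).write_bytes(data["image"])
                references.append(image_markdown(image_dir / name, page_number, index))
    return references
```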

Support & Feedback

For issues or feature requests, visit: https://github.com/anomalyco/opencode

License

This script is provided as-is for use in your project.


Version: 1.0
Last Updated: 2024-02-23