MM4go c7ff6a8a29 Add PDF to Markdown converter with mise task runner

- Implement pdf_to_markdown.py script with pypdf for text extraction
- Extract metadata (title, author, creation date) from PDFs
- Generate clean Markdown files with YAML front matter
- Add comprehensive error handling and logging
- Create mise.toml with 10+ convenient tasks for conversion
- Provide detailed documentation (4 guides + quick reference)
- Successfully convert all 18 PDF files in artikel/ folder to Markdown
- Include .gitignore for Python cache and local config

2026-02-23 14:58:58 +01:00


PDF to Markdown Converter - Setup & Usage Guide

Overview

This is a Python script that converts PDF files to clean Markdown format, extracting text content and document metadata.

Features:

✅ Extracts text from all PDF pages
✅ Preserves page structure with page headers
✅ Extracts metadata (title, author, creation date)
✅ Generates YAML front matter in Markdown files
✅ Robust error handling (skips problematic PDFs)
✅ Detailed logging and conversion summary
✅ Multiple CLI options for flexibility

Installation

Prerequisites

Python 3.8 or higher
pip (Python package manager)

Setup Steps

1. Clone or download this project (if you haven't already).
2. Install dependencies:
```
pip install -r requirements.txt
```
This installs:
- pypdf >= 3.0.0 - For PDF text extraction
- python-dateutil >= 2.8.0 - For date parsing
3. Verify installation:
```
python3 pdf_to_markdown.py --help
```

Usage

Basic Usage

Convert all PDFs in the default folder (./artikel):

python3 pdf_to_markdown.py

Convert PDFs from a custom input folder:

python3 pdf_to_markdown.py /path/to/pdf/folder

Specify both input and output folders:

python3 pdf_to_markdown.py /path/to/input /path/to/output

Advanced Options

Verbose mode (detailed logging):

python3 pdf_to_markdown.py -v ./artikel
python3 pdf_to_markdown.py --verbose ./artikel

Quiet mode (suppress output except errors):

python3 pdf_to_markdown.py -q ./artikel
python3 pdf_to_markdown.py --quiet ./artikel

Dry run (preview without writing files):

python3 pdf_to_markdown.py --dry-run ./artikel
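
The options above map naturally onto `argparse`. The following is a minimal sketch of such a command line, not the script's actual parser: the names `build_parser` and `resolve_dirs` are hypothetical, the mutually exclusive `-v`/`-q` group is a design assumption, and the `converted` subfolder default is inferred from the examples in this guide.

```python
import argparse
from pathlib import Path
from typing import Tuple


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        prog="pdf_to_markdown.py",
        description="Convert PDF files to Markdown.",
    )
    # Both positionals are optional; the output default is resolved later.
    parser.add_argument("input_dir", nargs="?", default="./artikel")
    parser.add_argument("output_dir", nargs="?", default=None)
    # Assumption: -v and -q contradict each other, so only one is accepted.
    verbosity = parser.add_mutually_exclusive_group()
    verbosity.add_argument("-v", "--verbose", action="store_true")
    verbosity.add_argument("-q", "--quiet", action="store_true")
    parser.add_argument("--dry-run", action="store_true")
    return parser


def resolve_dirs(args: argparse.Namespace) -> Tuple[Path, Path]:
    """Return (input_dir, output_dir), defaulting output to <input>/converted."""
    input_dir = Path(args.input_dir)
    if args.output_dir:
        return input_dir, Path(args.output_dir)
    return input_dir, input_dir / "converted"


# Equivalent to: python3 pdf_to_markdown.py -v ./artikel
args = build_parser().parse_args(["-v", "./artikel"])
print(resolve_dirs(args), args.verbose, args.dry_run)
```

Passing an explicit list to `parse_args` keeps the sketch testable without touching `sys.argv`.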

Examples

# Process all PDFs in artikel folder, save to artikel/converted
python3 pdf_to_markdown.py

# Process PDFs in custom location with verbose output
python3 pdf_to_markdown.py -v ~/Documents/PDFs

# Test what would be converted without writing files
python3 pdf_to_markdown.py --dry-run ./artikel

# Convert and save to specific output directory
python3 pdf_to_markdown.py ./input_pdfs ./output_markdown

Output Format

Each converted PDF becomes a Markdown file with the following structure:

---
title: Document Title
author: Author Name
created: 2024-02-23
converted: 2024-02-23 14:32:15
source: original_filename.pdf
---

# Document Title

## Page 1

[Extracted text from page 1...]

## Page 2

[Extracted text from page 2...]

Front Matter Fields:

title - Document title (from PDF metadata or filename)
author - Document author (if available in PDF metadata)
created - PDF creation date (if available in metadata)
converted - Timestamp of when the conversion occurred
source - Original PDF filename
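
As an illustration of how these fields could be assembled, here is a small standard-library sketch. It is not the script's own code: `pdf_date_to_iso` and `build_front_matter` are hypothetical names, the date helper only approximates what the script reportedly does with python-dateutil, and values are double-quoted (unlike the unquoted sample above) so that titles containing colons stay valid YAML.

```python
import json
import re
from datetime import datetime
from typing import Optional


def pdf_date_to_iso(raw: Optional[str]) -> Optional[str]:
    """Turn a PDF date such as "D:20240223143215+01'00'" into "2024-02-23"."""
    match = re.match(r"(?:D:)?(\d{4})(\d{2})?(\d{2})?", raw or "")
    if not match:
        return None
    # In PDF date strings, month and day are optional and default to 01.
    year, month, day = match.group(1), match.group(2) or "01", match.group(3) or "01"
    return f"{year}-{month}-{day}"


def build_front_matter(title: str, source: str,
                       author: Optional[str] = None,
                       created: Optional[str] = None,
                       converted: Optional[datetime] = None) -> str:
    """Return a YAML front matter block; empty optional fields are omitted."""
    converted = converted or datetime.now()
    fields = [
        ("title", title),
        ("author", author),
        ("created", created),
        ("converted", converted.strftime("%Y-%m-%d %H:%M:%S")),
        ("source", source),
    ]
    lines = ["---"]
    for key, value in fields:
        if value:
            # A JSON string is also a valid YAML double-quoted scalar, so
            # json.dumps safely escapes colons, quotes, and '#' in values.
            lines.append(f"{key}: {json.dumps(value, ensure_ascii=False)}")
    lines.append("---")
    return "\n".join(lines) + "\n"


print(build_front_matter("Report: Q1", "report.pdf",
                         created=pdf_date_to_iso("D:20240223143215+01'00'")))
```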

Troubleshooting

Issue: `ModuleNotFoundError: No module named 'pypdf'`

Solution: Install dependencies:

pip install -r requirements.txt

Issue: PDF has no extractable text

This typically happens with:

Scanned PDFs (image-based, no embedded text layer)
Corrupted PDFs
Encrypted PDFs

The script will:

Log a warning for the file
Create a Markdown file that contains the metadata and a note that text extraction failed
Continue processing other PDFs
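
The fallback described above could look roughly like the sketch below. It assumes the per-page strings have already been extracted (for example with pypdf's `page.extract_text()`); `render_body` and `NO_TEXT_NOTE` are hypothetical names, and the wording of the note is made up for illustration.

```python
from typing import List

# Hypothetical placeholder text; the real script's wording may differ.
NO_TEXT_NOTE = "*No extractable text found (scanned, encrypted, or damaged PDF?).*"


def render_body(title: str, page_texts: List[str]) -> str:
    """Render the Markdown body, or a note when every page is empty."""
    parts = [f"# {title}", ""]
    # Whitespace-only pages count as empty, which is typical for scans.
    if not any(text.strip() for text in page_texts):
        parts += [NO_TEXT_NOTE, ""]
        return "\n".join(parts)
    for number, text in enumerate(page_texts, start=1):
        parts += [f"## Page {number}", "", text.strip(), ""]
    return "\n".join(parts)


print(render_body("Scanned contract", ["", "  \n"]))
```

Because the function returns a body either way, the caller can still write the file and move on to the next PDF.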

Issue: Permission denied when writing files

Solution: Make sure your user can write to the output directory. If you own the directory, this grants you write access:

chmod 755 /path/to/output/directory

Issue: Special characters or encoding problems

The script uses UTF-8 encoding by default, which can represent any Unicode character. If text still comes out garbled:

Ensure your terminal supports UTF-8
Check whether the PDF uses fonts with nonstandard or missing character mappings, which garble the extracted text itself

Output Statistics

After processing, the script displays a summary:

============================================================
CONVERSION SUMMARY
============================================================
Total PDFs:       25
Successful:       23
Failed:           2
Output directory: /path/to/converted
============================================================

If any PDFs failed to convert, details are logged for debugging.
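
A summary in this layout can be produced with a small counter object. The sketch below is one possible way to do it; `ConversionStats` and its fields are hypothetical names, and the 60-character rule and 18-character label column are read off the sample output above.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ConversionStats:
    """Counts for one run; failures maps filename -> reason (hypothetical)."""
    total: int = 0
    failures: Dict[str, str] = field(default_factory=dict)

    @property
    def successful(self) -> int:
        return self.total - len(self.failures)

    def summary(self, output_dir: str) -> str:
        rule = "=" * 60
        rows = [
            ("Total PDFs:", self.total),
            ("Successful:", self.successful),
            ("Failed:", len(self.failures)),
            ("Output directory:", output_dir),
        ]
        lines = [rule, "CONVERSION SUMMARY", rule]
        # Left-align labels in an 18-character column so values line up.
        lines += [f"{label:<18}{value}" for label, value in rows]
        lines.append(rule)
        return "\n".join(lines)


stats = ConversionStats(total=25, failures={"a.pdf": "encrypted", "b.pdf": "corrupt"})
print(stats.summary("/path/to/converted"))
```

Keeping the failure reasons alongside the counts means the per-file details can be logged after the summary without a second pass.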

File Structure

.
├── pdf_to_markdown.py      # Main conversion script
├── requirements.txt         # Python dependencies
└── README.md               # This file

How It Works

1. Discovers PDFs - Finds all .pdf files in the input directory
2. Extracts Metadata - Reads title, author, and creation date from PDF metadata
3. Extracts Text - Processes each page and extracts text content
4. Creates Markdown - Formats extracted content with metadata front matter
5. Saves Files - Writes Markdown files to the output directory, using each PDF's base name with a .md extension
6. Reports Results - Displays conversion summary and any errors
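
The steps above can be sketched as a single loop. This is a simplified outline under stated assumptions, not the script's implementation: `convert_folder` is a hypothetical function, and the metadata and text extraction steps are collapsed into a caller-supplied `convert_one` callable so the example runs without pypdf.

```python
import logging
from pathlib import Path
from typing import Callable, List, Tuple

logger = logging.getLogger("pdf_to_markdown_sketch")


def convert_folder(input_dir: Path, output_dir: Path,
                   convert_one: Callable[[Path], str],
                   dry_run: bool = False) -> Tuple[List[Path], List[Path]]:
    """Run convert_one on every top-level PDF; return (targets, failed)."""
    targets: List[Path] = []
    failed: List[Path] = []
    # Sorted for a stable order; '*.pdf' does not descend into subfolders.
    for pdf_path in sorted(input_dir.glob("*.pdf")):
        target = output_dir / f"{pdf_path.stem}.md"
        try:
            markdown = convert_one(pdf_path)
        except Exception as exc:  # one bad PDF must not stop the batch
            logger.warning("Skipping %s: %s", pdf_path.name, exc)
            failed.append(pdf_path)
            continue
        if not dry_run:
            output_dir.mkdir(parents=True, exist_ok=True)
            target.write_text(markdown, encoding="utf-8")
        targets.append(target)
    return targets, failed


def fake_convert(pdf_path: Path) -> str:
    """Stand-in for the real metadata + text extraction step."""
    return f"# {pdf_path.stem}\n"


if __name__ == "__main__":
    done, errors = convert_folder(Path("./artikel"), Path("./artikel/converted"),
                                  fake_convert, dry_run=True)
    print(len(done), "would be written,", len(errors), "failed")
```

Injecting the per-file converter keeps the batch logic (discovery, naming, skip-on-error, dry run) separate from PDF parsing, which makes each part easier to test.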

Limitations

No image extraction - Images in PDFs are not extracted or embedded
Text-only - Requires PDFs with extractable text (scanned, image-only PDFs yield no text)
Limited layout preservation - Complex multi-column layouts may not be perfectly preserved
No recursive search - Only searches the top-level directory (not subdirectories)

Advanced: Customizing the Script

To process subdirectories:

Replace this line in the script:

pdf_files = list(self.input_dir.glob('*.pdf'))

With:

pdf_files = list(self.input_dir.glob('**/*.pdf'))
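
One thing to watch with the recursive pattern: if output files are still named only after the PDF's base name, two PDFs called `report.pdf` in different subfolders would overwrite each other. A possible way around this, sketched with a hypothetical helper `output_path_for`, is to mirror the relative path under the output directory.

```python
from pathlib import Path


def output_path_for(pdf_path: Path, input_dir: Path, output_dir: Path) -> Path:
    """Mirror the PDF's location under input_dir inside output_dir."""
    relative = pdf_path.relative_to(input_dir)
    # with_suffix swaps only the final extension: report.v2.pdf -> report.v2.md
    return (output_dir / relative).with_suffix(".md")


print(output_path_for(Path("in/2024/report.pdf"), Path("in"), Path("out")))
```

Before writing, the target's parent folder would need to exist, for example via `target.parent.mkdir(parents=True, exist_ok=True)`.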

To include image extraction:

The script currently skips images. To add image extraction:

Replace pypdf with PyMuPDF (installed as `pymupdf`, traditionally imported as `fitz`) for better image support
Modify the extract_text() method to save images
Update create_markdown() to reference extracted images

Support & Feedback

For issues or feature requests, visit: https://github.com/anomalyco/opencode

License

This script is provided as-is for use in your project.

Version: 1.0
Last Updated: 2026-02-23
