PDF to Markdown Converter - Setup & Usage Guide
Overview
This is a Python script that converts PDF files to clean Markdown format, extracting text content and document metadata.
Features:
- ✅ Extracts text from all PDF pages
- ✅ Preserves page structure with page headers
- ✅ Extracts metadata (title, author, creation date)
- ✅ Generates YAML front matter in Markdown files
- ✅ Robust error handling (skips problematic PDFs)
- ✅ Detailed logging and conversion summary
- ✅ Multiple CLI options for flexibility
Installation
Prerequisites
- Python 3.8 or higher
- pip (Python package manager)
Setup Steps
- Clone or download this project (if you haven't already)
- Install dependencies:
pip install -r requirements.txt
This installs:
- pypdf >= 3.0.0 - For PDF text extraction
- python-dateutil >= 2.8.0 - For date parsing
- Verify installation:
python3 pdf_to_markdown.py --help
Usage
Basic Usage
Convert all PDFs in default folder (./artikel):
python3 pdf_to_markdown.py
Convert PDFs from custom input folder:
python3 pdf_to_markdown.py /path/to/pdf/folder
Specify both input and output folders:
python3 pdf_to_markdown.py /path/to/input /path/to/output
Advanced Options
Verbose mode (detailed logging):
python3 pdf_to_markdown.py -v ./artikel
python3 pdf_to_markdown.py --verbose ./artikel
Quiet mode (suppress output except errors):
python3 pdf_to_markdown.py -q ./artikel
python3 pdf_to_markdown.py --quiet ./artikel
Dry run (preview without writing files):
python3 pdf_to_markdown.py --dry-run ./artikel
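The options above could be wired up with argparse roughly as follows. This is a minimal sketch, not the script's actual code: the function names build_parser and parse_args, the mutually exclusive handling of -v and -q, and the default output folder (a "converted" subfolder of the input folder) are assumptions inferred from the usage examples.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options (hypothetical reconstruction)."""
    parser = argparse.ArgumentParser(
        prog="pdf_to_markdown.py",
        description="Convert PDF files to Markdown.",
    )
    # Both positionals are optional; the default input folder follows the guide.
    parser.add_argument("input_dir", nargs="?", type=Path, default=Path("./artikel"))
    parser.add_argument("output_dir", nargs="?", type=Path, default=None)
    # Assumption: -v and -q contradict each other, so only one may be given.
    verbosity = parser.add_mutually_exclusive_group()
    verbosity.add_argument("-v", "--verbose", action="store_true")
    verbosity.add_argument("-q", "--quiet", action="store_true")
    parser.add_argument("--dry-run", action="store_true")
    return parser


def parse_args(argv=None) -> argparse.Namespace:
    """Parse argv (or sys.argv when None) and fill in the output folder."""
    args = build_parser().parse_args(argv)
    if args.output_dir is None:
        # Assumed default: a "converted" subfolder inside the input folder.
        args.output_dir = args.input_dir / "converted"
    return args
```

Calling parse_args(["--dry-run", "./artikel"]) yields a namespace with dry_run set and output_dir pointing at artikel/converted.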
Examples
# Process all PDFs in artikel folder, save to artikel/converted
python3 pdf_to_markdown.py
# Process PDFs in custom location with verbose output
python3 pdf_to_markdown.py -v ~/Documents/PDFs
# Test what would be converted without writing files
python3 pdf_to_markdown.py --dry-run ./artikel
# Convert and save to specific output directory
python3 pdf_to_markdown.py ./input_pdfs ./output_markdown
Output Format
Each converted PDF becomes a Markdown file with the following structure:
---
title: Document Title
author: Author Name
created: 2024-02-23
converted: 2024-02-23 14:32:15
source: original_filename.pdf
---
# Document Title
## Page 1
[Extracted text from page 1...]
## Page 2
[Extracted text from page 2...]
Front Matter Sections:
- title - Document title (from PDF metadata or filename)
- author - Document author (if available in PDF metadata)
- created - PDF creation date (if available in metadata)
- converted - Timestamp of when the conversion occurred
- source - Original PDF filename
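A front matter block with these fields could be assembled along the following lines. This is an illustrative sketch: build_front_matter and yaml_quote are hypothetical names, and the script's own implementation may differ. Unlike the sample output above, the sketch double-quotes string values so that a title containing a colon still produces valid YAML.

```python
from datetime import datetime
from pathlib import Path


def yaml_quote(value: str) -> str:
    """Double-quote a value so colons or '#' in it cannot break the YAML."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_front_matter(source, title=None, author=None, created=None, converted=None):
    """Return a YAML front matter block; optional fields are omitted when missing."""
    converted = converted or datetime.now()
    # Fall back to the filename (without extension) when metadata has no title.
    title = title or Path(source).stem
    lines = ["---", f"title: {yaml_quote(title)}"]
    if author:
        lines.append(f"author: {yaml_quote(author)}")
    if created:
        lines.append(f"created: {created:%Y-%m-%d}")
    lines.append(f"converted: {converted:%Y-%m-%d %H:%M:%S}")
    lines.append(f"source: {yaml_quote(source)}")
    lines.append("---")
    return "\n".join(lines) + "\n"
```

Omitting author and created when the PDF lacks them matches the "if available" wording in the field list.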
Troubleshooting
Issue: ModuleNotFoundError: No module named 'pypdf'
Solution: Install dependencies:
pip install -r requirements.txt
Issue: PDF has no extractable text
This typically happens with:
- Scanned PDFs (image-based, no embedded text layer)
- Corrupted PDFs
- Encrypted PDFs
The script will:
- Log a warning for the file
- Create a Markdown file with metadata but note that text extraction failed
- Continue processing other PDFs
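The fallback behaviour described here might look like the sketch below. The names render_body and NO_TEXT_NOTE and the exact wording of the note are assumptions for illustration; the function takes the per-page strings that a text extractor returned, so it runs without any PDF library.

```python
import logging

logger = logging.getLogger("pdf_to_markdown")

# Hypothetical wording; the real script may phrase its note differently.
NO_TEXT_NOTE = "*No extractable text found (the PDF may be scanned, corrupted, or encrypted).*"


def render_body(title, page_texts, source):
    """Build the Markdown body from per-page texts (each a str or None)."""
    cleaned = [(text or "").strip() for text in page_texts]
    lines = [f"# {title}", ""]
    if not any(cleaned):
        # Every page came back empty: warn, keep the file, and add a note.
        logger.warning("No extractable text in %s", source)
        lines += [NO_TEXT_NOTE, ""]
        return "\n".join(lines)
    for number, text in enumerate(cleaned, start=1):
        lines += [f"## Page {number}", "", text, ""]
    return "\n".join(lines)
```

Because the function returns normally in both cases, a caller looping over many PDFs simply moves on to the next file.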
Issue: Permission denied when writing files
Solution: Ensure you have write permissions to the output directory:
chmod 755 /path/to/output/directory
Issue: Special characters or encoding problems
The script uses UTF-8 encoding by default, which handles most character sets. If you encounter issues:
- Ensure your terminal supports UTF-8
- Check if the PDF contains unusual character encodings
Output Statistics
After processing, the script displays a summary:
============================================================
CONVERSION SUMMARY
============================================================
Total PDFs: 25
Successful: 23
Failed: 2
Output directory: /path/to/converted
============================================================
If any PDFs failed to convert, details are logged for debugging.
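A summary in this shape can be produced with a few lines of string formatting. The sketch below is an assumption about how such a report could be built; format_summary and its width parameter are hypothetical names, not taken from the script.

```python
def format_summary(total, successful, failed, output_dir, width=60):
    """Return the conversion summary as a single multi-line string."""
    rule = "=" * width
    return "\n".join(
        [
            rule,
            "CONVERSION SUMMARY",
            rule,
            f"Total PDFs: {total}",
            f"Successful: {successful}",
            f"Failed: {failed}",
            f"Output directory: {output_dir}",
            rule,
        ]
    )
```

Returning a string instead of printing directly keeps the function easy to test and lets quiet mode decide whether to display it.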
File Structure
.
├── pdf_to_markdown.py # Main conversion script
├── requirements.txt # Python dependencies
└── README.md # This file
How It Works
- Discovers PDFs - Finds all .pdf files in the input directory
- Extracts Metadata - Reads title, author, and creation date from PDF metadata
- Extracts Text - Processes each page and extracts text content
- Creates Markdown - Formats extracted content with metadata front matter
- Saves Files - Writes Markdown files to output directory with same names as PDFs
- Reports Results - Displays conversion summary and any errors
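The six steps can be sketched as a single loop. This is a simplified illustration under stated assumptions, not the script's actual code: convert_folder and extract_with_pypdf are hypothetical names, front matter is left out for brevity, and the extractor is passed in as a parameter so the loop can be exercised without real PDFs. The pypdf calls used (PdfReader, reader.metadata, reader.pages, page.extract_text) are part of pypdf's public API.

```python
from pathlib import Path


def extract_with_pypdf(pdf_path):
    """Return (metadata dict, list of page texts) using pypdf."""
    from pypdf import PdfReader  # imported lazily so the loop runs without pypdf

    reader = PdfReader(str(pdf_path))
    meta = reader.metadata  # may be None when the PDF carries no metadata
    info = {
        "title": meta.title if meta else None,
        "author": meta.author if meta else None,
    }
    pages = [page.extract_text() or "" for page in reader.pages]
    return info, pages


def convert_folder(input_dir, output_dir, extract=extract_with_pypdf, dry_run=False):
    """Convert every top-level PDF; return (succeeded names, failed (name, error))."""
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    succeeded, failed = [], []
    for pdf_path in sorted(input_dir.glob("*.pdf")):  # 1. discover
        try:
            info, pages = extract(pdf_path)  # 2 and 3. metadata and text
            title = info.get("title") or pdf_path.stem
            parts = [f"# {title}", ""]
            for number, text in enumerate(pages, start=1):  # 4. create Markdown
                parts += [f"## Page {number}", "", text.strip(), ""]
            if not dry_run:  # 5. save under the same base name
                output_dir.mkdir(parents=True, exist_ok=True)
                target = output_dir / f"{pdf_path.stem}.md"
                target.write_text("\n".join(parts), encoding="utf-8")
            succeeded.append(pdf_path.name)
        except Exception as exc:  # skip problematic PDFs and keep going
            failed.append((pdf_path.name, str(exc)))
    return succeeded, failed  # 6. report
```

Catching exceptions per file is what lets one corrupted PDF fail without aborting the rest of the batch.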
Limitations
- No image extraction - Images in PDFs are not extracted or embedded
- Text-only - Requires PDFs with extractable text (scanned PDFs won't work well)
- Layout preservation - Complex multi-column layouts may not be perfectly preserved
- No recursive search - Only the top-level input directory is scanned, not its subdirectories
Advanced: Customizing the Script
To process subdirectories:
Replace this line in the script:
pdf_files = list(self.input_dir.glob('*.pdf'))
With:
pdf_files = list(self.input_dir.glob('**/*.pdf'))
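The difference between the two patterns can be seen in the standalone sketch below. The helper names find_pdfs and output_path_for are hypothetical. The second helper is included because, if the script writes all output into one flat folder, recursive discovery could make two PDFs with the same name in different subfolders overwrite each other; mirroring the relative path is one way to avoid that.

```python
from pathlib import Path


def find_pdfs(input_dir, recursive=False):
    """List PDF files, optionally descending into subdirectories."""
    # "**/" matches zero or more directory levels, so top-level files are included.
    pattern = "**/*.pdf" if recursive else "*.pdf"
    return sorted(p for p in Path(input_dir).glob(pattern) if p.is_file())


def output_path_for(pdf_path, input_dir, output_dir):
    """Mirror the PDF's path relative to input_dir under output_dir, with .md suffix."""
    relative = Path(pdf_path).relative_to(input_dir)
    return Path(output_dir) / relative.with_suffix(".md")
```

On case-sensitive file systems these patterns match only a lowercase .pdf extension.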
To include image extraction:
The script currently skips images. To add image extraction:
- Replace pypdf with pymupdf (fitz) for better image support
- Modify the extract_text() method to save images
- Update create_markdown() to reference extracted images
Support & Feedback
For issues or feature requests, visit: https://github.com/anomalyco/opencode
License
This script is provided as-is for use in your project.
Version: 1.0
Last Updated: 2024-02-23