
# PDF to Markdown Converter - Setup & Usage Guide
## Overview
This Python script converts PDF files to clean Markdown, extracting the text content and document metadata of each file.
**Features:**
- ✅ Extracts text from all PDF pages
- ✅ Preserves page structure with a `## Page N` heading per page
- ✅ Extracts metadata (title, author, creation date)
- ✅ Generates YAML front matter in Markdown files
- ✅ Robust error handling (skips problematic PDFs)
- ✅ Detailed logging and conversion summary
- ✅ Multiple CLI options for flexibility
## Installation
### Prerequisites
- Python 3.8 or higher
- pip (Python package manager)
### Setup Steps
1. **Clone or download this project** (if you haven't already)
2. **Install dependencies:**
```bash
pip install -r requirements.txt
```
This installs:
- `pypdf` >= 3.0.0 - For PDF text extraction
- `python-dateutil` >= 2.8.0 - For date parsing
3. **Verify installation:**
```bash
python3 pdf_to_markdown.py --help
```
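If `--help` fails, it can be useful to check which dependencies Python actually sees. The following is a minimal sketch using only the standard library (`importlib.metadata`, available from Python 3.8); the helper names `installed_versions` and `report` are hypothetical and not part of the script.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names taken from the dependency list above.
REQUIRED = ("pypdf", "python-dateutil")


def installed_versions(packages=REQUIRED):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


def report(packages=REQUIRED):
    """Print one line per dependency (hypothetical helper for manual checks)."""
    for name, ver in installed_versions(packages).items():
        print(f"{name}: {ver if ver else 'NOT INSTALLED'}")
```

Calling `report()` in the same environment you run the converter from shows whether `pip install` targeted the right interpreter.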
## Usage
### Basic Usage
**Convert all PDFs in default folder (`./artikel`):**
```bash
python3 pdf_to_markdown.py
```
**Convert PDFs from custom input folder:**
```bash
python3 pdf_to_markdown.py /path/to/pdf/folder
```
**Specify both input and output folders:**
```bash
python3 pdf_to_markdown.py /path/to/input /path/to/output
```
### Advanced Options
**Verbose mode** (detailed logging):
```bash
python3 pdf_to_markdown.py -v ./artikel
python3 pdf_to_markdown.py --verbose ./artikel
```
**Quiet mode** (suppress output except errors):
```bash
python3 pdf_to_markdown.py -q ./artikel
python3 pdf_to_markdown.py --quiet ./artikel
```
**Dry run** (preview without writing files):
```bash
python3 pdf_to_markdown.py --dry-run ./artikel
```
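For readers who want to see how such a command line could be wired up, here is a sketch with `argparse`. It is not the script's actual parser: the function names, the mutually exclusive `-v`/`-q` group, and the default output folder (`<input>/converted`, inferred from the examples in this guide) are assumptions.

```python
import argparse
from pathlib import Path


def build_parser():
    """Build a parser mirroring the options documented in this guide."""
    parser = argparse.ArgumentParser(
        description="Convert PDF files to Markdown (illustrative sketch)."
    )
    # Both folders are optional positionals, matching the usage shown above.
    parser.add_argument("input_dir", nargs="?", type=Path, default=Path("./artikel"))
    parser.add_argument("output_dir", nargs="?", type=Path, default=None)
    # Assumption: verbose and quiet cannot be combined.
    level = parser.add_mutually_exclusive_group()
    level.add_argument("-v", "--verbose", action="store_true")
    level.add_argument("-q", "--quiet", action="store_true")
    parser.add_argument("--dry-run", action="store_true")
    return parser


def parse_cli(argv=None):
    """Parse arguments and fill in the assumed default output folder."""
    args = build_parser().parse_args(argv)
    if args.output_dir is None:
        # Assumed default: a "converted" subfolder of the input directory.
        args.output_dir = args.input_dir / "converted"
    return args
```

For example, `parse_cli(["-v", "./artikel"])` yields `verbose=True` with the output folder set to `artikel/converted`.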
### Examples
```bash
# Process all PDFs in artikel folder, save to artikel/converted
python3 pdf_to_markdown.py
# Process PDFs in custom location with verbose output
python3 pdf_to_markdown.py -v ~/Documents/PDFs
# Test what would be converted without writing files
python3 pdf_to_markdown.py --dry-run ./artikel
# Convert and save to specific output directory
python3 pdf_to_markdown.py ./input_pdfs ./output_markdown
```
## Output Format
Each converted PDF becomes a Markdown file with the following structure:
```markdown
---
title: Document Title
author: Author Name
created: 2024-02-23
converted: 2024-02-23 14:32:15
source: original_filename.pdf
---
# Document Title
## Page 1
[Extracted text from page 1...]
## Page 2
[Extracted text from page 2...]
```
**Front Matter Fields:**
- `title` - Document title (from PDF metadata or filename)
- `author` - Document author (if available in PDF metadata)
- `created` - PDF creation date (if available in metadata)
- `converted` - Timestamp of when the conversion occurred
- `source` - Original PDF filename
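The fields above could be assembled roughly as follows. This is a sketch, not the script's own code: `build_front_matter` is a hypothetical name, and values are written as double-quoted strings (via `json.dumps`) so that titles containing `:` or `#` stay valid YAML, which differs slightly from the unquoted sample output shown earlier.

```python
import json
from datetime import datetime
from pathlib import Path


def build_front_matter(source, title=None, author=None, created=None, converted=None):
    """Return a YAML front matter block; fields that are None are left out."""
    source = Path(source)
    converted = converted or datetime.now()
    fields = {
        "title": title or source.stem,  # fall back to the filename without extension
        "author": author,
        "created": created.strftime("%Y-%m-%d") if created else None,
        "converted": converted.strftime("%Y-%m-%d %H:%M:%S"),
        "source": source.name,
    }
    lines = ["---"]
    for key, value in fields.items():
        if value is None:
            continue
        # JSON string syntax is used here as a YAML double-quoted scalar.
        lines.append(f"{key}: {json.dumps(value, ensure_ascii=False)}")
    lines.append("---")
    return "\n".join(lines) + "\n"
```

Skipping missing fields entirely, rather than writing empty values, keeps the front matter easy to parse for downstream tools.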
## Troubleshooting
### Issue: `ModuleNotFoundError: No module named 'pypdf'`
**Solution:** Install dependencies:
```bash
pip install -r requirements.txt
```
### Issue: PDF has no extractable text
This typically happens with:
- **Scanned PDFs** (image-based, no embedded text layer)
- **Corrupted PDFs**
- **Encrypted PDFs**
The script will:
- Log a warning for the file
- Create a Markdown file that contains the metadata and a note that text extraction failed
- Continue processing other PDFs
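To check ahead of time whether a PDF is likely to hit this case, something like the sketch below may help. The helper names and the `min_chars` threshold are hypothetical; `PdfReader`, `is_encrypted`, and `extract_text()` are pypdf calls, imported lazily so the pure check works without pypdf installed.

```python
def has_extractable_text(page_texts, min_chars=1):
    """Return True if any page has at least `min_chars` non-whitespace characters."""
    return any(
        len("".join((text or "").split())) >= min_chars for text in page_texts
    )


def read_page_texts(pdf_path):
    """Collect per-page text with pypdf (requires pypdf to be installed)."""
    from pypdf import PdfReader  # lazy import keeps the helper above dependency-free

    reader = PdfReader(pdf_path)
    if reader.is_encrypted:
        # Encrypted files need reader.decrypt(password) before pages can be read.
        raise ValueError(f"{pdf_path} is encrypted")
    return [page.extract_text() or "" for page in reader.pages]
```

A scanned PDF typically returns empty strings for every page, so `has_extractable_text(read_page_texts(path))` is `False`; such files need OCR, which is outside the scope of this script.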
### Issue: Permission denied when writing files
**Solution:** Ensure you have write permissions to the output directory:
```bash
chmod 755 /path/to/output/directory
```
### Issue: Special characters or encoding problems
The script uses UTF-8 encoding by default, which handles most character sets. If you encounter issues:
- Ensure your terminal supports UTF-8
- Check if the PDF contains unusual character encodings
## Output Statistics
After processing, the script displays a summary:
```
============================================================
CONVERSION SUMMARY
============================================================
Total PDFs: 25
Successful: 23
Failed: 2
Output directory: /path/to/converted
============================================================
```
If any PDFs failed to convert, details are logged for debugging.
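A summary in this shape can be produced with a few lines of string formatting. The sketch below assumes a hypothetical `format_summary` helper and a rule width of 60 characters chosen to resemble the sample; the real script may format it differently.

```python
def format_summary(total, successful, failed, output_dir, width=60):
    """Return the summary block as a single string (width is an assumption)."""
    rule = "=" * width
    return "\n".join([
        rule,
        "CONVERSION SUMMARY",
        rule,
        f"Total PDFs: {total}",
        f"Successful: {successful}",
        f"Failed: {failed}",
        f"Output directory: {output_dir}",
        rule,
    ])
```

Returning a string instead of printing directly makes the summary easy to send to a logger or a file as well.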
## File Structure
```
.
├── pdf_to_markdown.py # Main conversion script
├── requirements.txt # Python dependencies
└── README.md # This file
```
## How It Works
1. **Discovers PDFs** - Finds all `.pdf` files in the input directory
2. **Extracts Metadata** - Reads title, author, and creation date from PDF metadata
3. **Extracts Text** - Processes each page and extracts text content
4. **Creates Markdown** - Formats extracted content with metadata front matter
5. **Saves Files** - Writes one Markdown file per PDF to the output directory, keeping the PDF's base name
6. **Reports Results** - Displays conversion summary and any errors
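The steps above can be sketched as a single loop. This is an illustration under assumptions, not the script's implementation: `convert_folder` and the `convert_one` callback (which stands in for the metadata and text extraction steps) are hypothetical names.

```python
import logging
from pathlib import Path

log = logging.getLogger("pdf_to_markdown_sketch")


def convert_folder(input_dir, output_dir, convert_one, dry_run=False):
    """Run `convert_one(pdf_path) -> str` over every top-level PDF.

    Returns a tuple (written_targets, failed_pdfs).
    """
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    ok, failed = [], []
    for pdf in sorted(input_dir.glob("*.pdf")):  # step 1: discover PDFs
        target = output_dir / (pdf.stem + ".md")
        try:
            markdown = convert_one(pdf)  # steps 2-4: metadata, text, Markdown
            if not dry_run:  # step 5: save, unless this is a dry run
                output_dir.mkdir(parents=True, exist_ok=True)
                target.write_text(markdown, encoding="utf-8")
            ok.append(target)
        except Exception as exc:  # one bad PDF must not stop the batch
            log.warning("Skipping %s: %s", pdf.name, exc)
            failed.append(pdf)
    return ok, failed  # step 6: the caller reports these
```

Catching exceptions per file is what allows the batch to continue past problematic PDFs, and the two returned lists feed directly into the conversion summary.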
## Limitations
- **No image extraction** - Images in PDFs are not extracted or embedded
- **Text-only** - Requires PDFs with an embedded text layer (scanned, image-only PDFs yield little or no text)
- **Limited layout preservation** - Complex multi-column layouts may not be reproduced faithfully
- **No recursive search** - Only the top-level input directory is searched; subdirectories are ignored
## Advanced: Customizing the Script
### To process subdirectories:
Replace this line in the script:
```python
pdf_files = list(self.input_dir.glob('*.pdf'))
```
With:
```python
pdf_files = list(self.input_dir.glob('**/*.pdf'))
```
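One caveat with recursive search: two PDFs with the same name in different subfolders would map to the same output file if all Markdown files land in one flat directory. A possible way around this, sketched with a hypothetical helper `output_path_for`, is to mirror the input folder structure in the output directory.

```python
from pathlib import Path


def output_path_for(pdf_path, input_dir, output_dir):
    """Map input_dir/sub/file.pdf to output_dir/sub/file.md."""
    relative = Path(pdf_path).relative_to(input_dir)
    return Path(output_dir) / relative.with_suffix(".md")
```

The parent folder of each target then has to be created before writing, for example with `target.parent.mkdir(parents=True, exist_ok=True)`.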
### To include image extraction:
The script currently skips images. To add image extraction:
1. Replace `pypdf` with `pymupdf` (PyMuPDF, imported as `fitz`) for better image support
2. Modify the `extract_text()` method to save images
3. Update `create_markdown()` to reference extracted images
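As a starting point, image extraction with PyMuPDF might look like the sketch below. The function names and the `pageN_imgM` file naming are hypothetical; `fitz.open`, `page.get_images(full=True)`, and `doc.extract_image(xref)` are PyMuPDF calls. PyMuPDF is not in `requirements.txt`, so it is imported lazily and must be installed separately (`pip install pymupdf`).

```python
from pathlib import Path


def extract_images(pdf_path, image_dir):
    """Save every embedded image and return the written paths (needs PyMuPDF)."""
    import fitz  # PyMuPDF; lazy import because it is an optional dependency

    image_dir = Path(image_dir)
    image_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with fitz.open(pdf_path) as doc:
        for page_number, page in enumerate(doc, start=1):
            for index, info in enumerate(page.get_images(full=True), start=1):
                xref = info[0]  # first tuple entry is the image's xref number
                image = doc.extract_image(xref)
                name = f"page{page_number}_img{index}.{image['ext']}"
                target = image_dir / name
                target.write_bytes(image["image"])
                written.append(target)
    return written


def image_markdown(image_path, alt_text="image"):
    """Return a Markdown image reference for an extracted file."""
    return f"![{alt_text}]({Path(image_path).as_posix()})"
```

The strings returned by `image_markdown` could then be appended to the matching `## Page N` section when the Markdown is assembled.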
## Support & Feedback
For issues or feature requests, visit:
https://github.com/anomalyco/opencode
## License
This script is provided as-is for use in your project.
---
**Version:** 1.0
**Last Updated:** 2024-02-23