
# PDF to Markdown Converter - Setup & Usage Guide

## Overview

This Python script converts PDF files to clean Markdown, extracting each file's text content and document metadata.

**Features:**
- ✅ Extracts text from all PDF pages
- ✅ Preserves page structure with page headers
- ✅ Extracts metadata (title, author, creation date)
- ✅ Generates YAML front matter in Markdown files
- ✅ Robust error handling (skips problematic PDFs)
- ✅ Detailed logging and conversion summary
- ✅ Multiple CLI options for flexibility

## Installation

### Prerequisites
- Python 3.8 or higher
- pip (Python package manager)

### Setup Steps

1. **Clone or download this project** (if you haven't already)

2. **Install dependencies:**
   ```bash
   pip install -r requirements.txt
   ```

   This installs:
   - `pypdf` >= 3.0.0 - For PDF text extraction
   - `python-dateutil` >= 2.8.0 - For date parsing

3. **Verify installation:**
   ```bash
   python3 pdf_to_markdown.py --help
   ```
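If `--help` fails, it can be useful to check the dependencies on their own. The snippet below is a minimal sketch that is not part of `pdf_to_markdown.py`; the helper name `missing_modules` is hypothetical, and it assumes the two packages are imported as `pypdf` and `dateutil`.

```python
import importlib.util

# Import names assumed from the packages listed above:
# pypdf -> "pypdf", python-dateutil -> "dateutil".
REQUIRED_MODULES = ("pypdf", "dateutil")


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules()
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All dependencies are importable.")
```

An empty result means both packages are visible to the Python interpreter that ran the check, which may differ from the one `pip` installed into.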

## Usage

### Basic Usage

**Convert all PDFs in the default folder (`./artikel`):**
```bash
python3 pdf_to_markdown.py
```

**Convert PDFs from a custom input folder:**
```bash
python3 pdf_to_markdown.py /path/to/pdf/folder
```

**Specify both input and output folders:**
```bash
python3 pdf_to_markdown.py /path/to/input /path/to/output
```

### Advanced Options

**Verbose mode** (detailed logging):
```bash
python3 pdf_to_markdown.py -v ./artikel
python3 pdf_to_markdown.py --verbose ./artikel
```

**Quiet mode** (suppresses all output except errors):
```bash
python3 pdf_to_markdown.py -q ./artikel
python3 pdf_to_markdown.py --quiet ./artikel
```

**Dry run** (preview without writing files):
```bash
python3 pdf_to_markdown.py --dry-run ./artikel
```
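For orientation, the options above could be declared with `argparse` roughly as follows. This is a sketch under assumptions, not the script's actual parser: the function names are hypothetical, treating `-v` and `-q` as mutually exclusive is a choice made here, and the default output folder `<input>/converted` is inferred from the examples in this guide.

```python
import argparse
from pathlib import Path


def build_parser():
    """Declare the positional folders and the three flags described above."""
    parser = argparse.ArgumentParser(
        description="Convert PDF files to Markdown (illustrative sketch)."
    )
    parser.add_argument("input_dir", nargs="?", default="./artikel",
                        help="folder containing the PDFs")
    parser.add_argument("output_dir", nargs="?", default=None,
                        help="folder for the Markdown files")
    # Assumption: verbose and quiet cannot be combined.
    noise = parser.add_mutually_exclusive_group()
    noise.add_argument("-v", "--verbose", action="store_true",
                       help="detailed logging")
    noise.add_argument("-q", "--quiet", action="store_true",
                       help="only report errors")
    parser.add_argument("--dry-run", action="store_true",
                        help="preview without writing files")
    return parser


def resolve_output_dir(args):
    """Assumption: fall back to <input_dir>/converted when no output is given."""
    if args.output_dir is not None:
        return Path(args.output_dir)
    return Path(args.input_dir) / "converted"


args = build_parser().parse_args(["--dry-run", "./artikel"])
print(args.dry_run, resolve_output_dir(args))
```

Both positional arguments use `nargs="?"`, so the script can be called with zero, one, or two folders, matching the three basic invocations shown earlier.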

### Examples

```bash
# Process all PDFs in the artikel folder, save to artikel/converted
python3 pdf_to_markdown.py

# Process PDFs in custom location with verbose output
python3 pdf_to_markdown.py -v ~/Documents/PDFs

# Test what would be converted without writing files
python3 pdf_to_markdown.py --dry-run ./artikel

# Convert and save to specific output directory
python3 pdf_to_markdown.py ./input_pdfs ./output_markdown
```

## Output Format

Each converted PDF becomes a Markdown file with the following structure:

```markdown
---
title: Document Title
author: Author Name
created: 2024-02-23
converted: 2024-02-23 14:32:15
source: original_filename.pdf
---

# Document Title

## Page 1

[Extracted text from page 1...]

## Page 2

[Extracted text from page 2...]
```

**Front Matter Fields:**
- `title` - Document title (from PDF metadata or filename)
- `author` - Document author (if available in PDF metadata)
- `created` - PDF creation date (if available in metadata)
- `converted` - Timestamp of when the conversion occurred
- `source` - Original PDF filename
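The front matter block can be assembled from these fields with a few lines of Python. The sketch below is illustrative only: `build_front_matter` and `yaml_quote` are hypothetical names, and the values are double-quoted (unlike the sample output above) so that titles containing a colon or `#` stay valid YAML. Whether the real script quotes values is not stated in this guide.

```python
from datetime import datetime
from pathlib import Path


def yaml_quote(value):
    """Double-quote a value, escaping backslashes and double quotes."""
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_front_matter(pdf_path, title=None, author=None, created=None, now=None):
    """Build the YAML block; optional fields are left out when unavailable."""
    pdf_path = Path(pdf_path)
    now = now or datetime.now()
    fields = {"title": title or pdf_path.stem}  # fall back to the filename
    if author:
        fields["author"] = author
    if created:
        fields["created"] = created.strftime("%Y-%m-%d")
    fields["converted"] = now.strftime("%Y-%m-%d %H:%M:%S")
    fields["source"] = pdf_path.name
    lines = ["---"]
    lines += [f"{key}: {yaml_quote(value)}" for key, value in fields.items()]
    lines.append("---")
    return "\n".join(lines)


print(build_front_matter("artikel/report.pdf", author="Author Name"))
```

Passing `now` explicitly makes the output reproducible, which helps when comparing two conversion runs.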

## Troubleshooting

### Issue: `ModuleNotFoundError: No module named 'pypdf'`

**Solution:** Install dependencies:
```bash
pip install -r requirements.txt
```

### Issue: PDF has no extractable text

This typically happens with:
- **Scanned PDFs** (image-based, no embedded text layer)
- **Corrupted PDFs**
- **Encrypted PDFs**

The script will:
- Log a warning for the file
- Create a Markdown file that contains the metadata and a note that text extraction failed
- Continue processing other PDFs
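The behavior described above can be pictured as a check on the per-page text before the body is written. This is a minimal sketch with hypothetical function names and a hypothetical wording for the note; it assumes each page's text arrives as a string that may be empty (or `None`) when nothing could be extracted.

```python
def has_extractable_text(page_texts, min_chars=1):
    """True if any page has at least `min_chars` non-whitespace characters."""
    return any(
        len("".join((text or "").split())) >= min_chars for text in page_texts
    )


def render_body(page_texts):
    """Return page sections, or a note when no page yielded any text."""
    if not has_extractable_text(page_texts):
        # Hypothetical wording; the real script's note may differ.
        return "> **Note:** No text could be extracted from this PDF."
    sections = [
        f"## Page {number}\n\n{(text or '').strip()}"
        for number, text in enumerate(page_texts, start=1)
    ]
    return "\n\n".join(sections)


print(render_body(["", "   "]))          # scanned PDF: only the note
print(render_body(["First page text"]))  # normal PDF: page sections
```

Raising `min_chars` would also flag PDFs that contain only a few stray characters, which is common for scans with a partial text layer.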

### Issue: Permission denied when writing files

**Solution:** Ensure you have write permissions to the output directory:
```bash
chmod 755 /path/to/output/directory
```

### Issue: Special characters or encoding problems

The script uses UTF-8 encoding by default, which handles most character sets. If you encounter issues:
- Ensure your terminal supports UTF-8
- Check if the PDF contains unusual character encodings
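Both points can be checked from Python. The sketch below uses hypothetical helper names and is not taken from the script: it writes a file with an explicit `encoding="utf-8"` instead of relying on the platform default, and it reports the encoding Python sees for the terminal.

```python
import sys
from pathlib import Path


def write_markdown(path, content):
    """Write text as UTF-8 regardless of the platform's default encoding."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(content, encoding="utf-8")
    return path


def terminal_is_utf8():
    """Report whether standard output is configured for UTF-8."""
    encoding = (sys.stdout.encoding or "").lower().replace("-", "")
    return encoding == "utf8"


print("Terminal uses UTF-8:", terminal_is_utf8())
```

If the terminal check prints `False`, garbled characters in the log output do not necessarily mean the written Markdown files are wrong; opening a file in a UTF-8 aware editor is the more reliable test.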

## Output Statistics

After processing, the script displays a summary:
```
============================================================
CONVERSION SUMMARY
============================================================
Total PDFs:       25
Successful:       23
Failed:           2
Output directory: /path/to/converted
============================================================
```

If any PDFs failed to convert, details are logged for debugging.
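A summary in this layout can be produced from two counters. The function below is a sketch with a hypothetical name; the rule width of 60 and the label column width of 18 are assumptions chosen to resemble the sample output.

```python
def format_summary(total, successful, output_dir, width=60):
    """Format the conversion summary as a block of aligned lines."""
    rule = "=" * width
    rows = [
        ("Total PDFs:", total),
        ("Successful:", successful),
        ("Failed:", total - successful),
        ("Output directory:", output_dir),
    ]
    lines = [rule, "CONVERSION SUMMARY", rule]
    # Left-align every label in an 18-character column.
    lines += [f"{label:<18}{value}" for label, value in rows]
    lines.append(rule)
    return "\n".join(lines)


print(format_summary(25, 23, "/path/to/converted"))
```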

## File Structure

```
.
├── pdf_to_markdown.py      # Main conversion script
├── requirements.txt         # Python dependencies
└── README.md               # This file
```

## How It Works

1. **Discovers PDFs** - Finds all `.pdf` files in the input directory
2. **Extracts Metadata** - Reads title, author, and creation date from PDF metadata
3. **Extracts Text** - Processes each page and extracts text content
4. **Creates Markdown** - Formats extracted content with metadata front matter
5. **Saves Files** - Writes Markdown files to the output directory, using the same base names as the PDFs
6. **Reports Results** - Displays conversion summary and any errors
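The six steps can be sketched in a few functions. This is an illustration under assumptions, not the code in `pdf_to_markdown.py`: all function names are hypothetical, front matter is left out to keep it short, and only the title is read from the metadata. It uses `PdfReader`, `reader.metadata`, `reader.pages`, and `page.extract_text()` from `pypdf`.

```python
from pathlib import Path


def discover_pdfs(input_dir):
    """Step 1: top-level .pdf files only, sorted for a stable order."""
    return sorted(Path(input_dir).glob("*.pdf"))


def output_path_for(pdf_path, output_dir):
    """Step 5: same base name as the PDF, with a .md extension."""
    return Path(output_dir) / (Path(pdf_path).stem + ".md")


def extract_pages(pdf_path):
    """Steps 2-3: read the title and the text of every page."""
    from pypdf import PdfReader  # imported lazily; needs `pip install pypdf`

    reader = PdfReader(str(pdf_path))
    meta = reader.metadata  # may be None when the PDF has no metadata
    title = (meta.title if meta else None) or Path(pdf_path).stem
    pages = [page.extract_text() or "" for page in reader.pages]
    return title, pages


def convert_all(input_dir, output_dir, dry_run=False):
    """Steps 1-6: convert every PDF, collecting successes and failures."""
    successful, failed = [], []
    for pdf_path in discover_pdfs(input_dir):
        try:
            title, pages = extract_pages(pdf_path)
            body = "\n\n".join(
                f"## Page {n}\n\n{text.strip()}"
                for n, text in enumerate(pages, start=1)
            )
            if not dry_run:
                target = output_path_for(pdf_path, output_dir)
                target.parent.mkdir(parents=True, exist_ok=True)
                target.write_text(f"# {title}\n\n{body}\n", encoding="utf-8")
            successful.append(pdf_path)
        except Exception:  # one bad PDF should not stop the batch
            failed.append(pdf_path)
    return successful, failed
```

Catching the exception per file is what lets one corrupted PDF be counted as failed while the rest of the batch continues.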

## Limitations

- **No image extraction** - Images in PDFs are not extracted or embedded
- **Text-based PDFs only** - Requires PDFs with an extractable text layer; scanned PDFs yield little or no text
- **Limited layout preservation** - Complex multi-column layouts may not be preserved accurately
- **No recursive search** - Only the top-level input directory is searched, not subdirectories

## Advanced: Customizing the Script

### To process subdirectories:

Replace this line in the script:
```python
pdf_files = list(self.input_dir.glob('*.pdf'))
```

With:
```python
pdf_files = list(self.input_dir.glob('**/*.pdf'))
```
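One thing to watch when searching recursively: PDFs with the same name in different subfolders would map to the same output file if only the base name is kept. The sketch below (hypothetical helper names, not part of the script) mirrors the input folder structure in the output folder to avoid that.

```python
from pathlib import Path


def discover_pdfs_recursive(input_dir):
    """Find .pdf files in input_dir and all of its subdirectories."""
    return sorted(Path(input_dir).rglob("*.pdf"))  # same as glob('**/*.pdf')


def mirrored_output_path(pdf_path, input_dir, output_dir):
    """Keep the PDF's path relative to input_dir below output_dir."""
    relative = Path(pdf_path).relative_to(input_dir)
    return Path(output_dir) / relative.with_suffix(".md")


# artikel/2023/report.pdf and artikel/2024/report.pdf stay separate:
print(mirrored_output_path("artikel/2023/report.pdf", "artikel", "converted"))
print(mirrored_output_path("artikel/2024/report.pdf", "artikel", "converted"))
```

The parent folder of each mirrored path has to exist before writing, for example via `path.parent.mkdir(parents=True, exist_ok=True)`.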

### To include image extraction:

The script currently skips images. To add image extraction:
1. Replace `pypdf` with `pymupdf` (PyMuPDF, imported as `fitz`) for better image support
2. Modify the `extract_text()` method to save images
3. Update `create_markdown()` to reference extracted images
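As a starting point, image extraction with PyMuPDF could look like the sketch below. It has not been tested against this script: the function names and the file naming scheme are hypothetical, and it is written as standalone helpers rather than as changes to `extract_text()` or `create_markdown()`. It relies on `fitz.open()`, `page.get_images(full=True)`, and `doc.extract_image(xref)` from PyMuPDF.

```python
from pathlib import Path


def image_reference(image_path, markdown_path):
    """Build a Markdown image link relative to the Markdown file's folder."""
    relative = Path(image_path).relative_to(Path(markdown_path).parent)
    return f"![{Path(image_path).stem}]({relative.as_posix()})"


def extract_images(pdf_path, image_dir):
    """Save every embedded image of the PDF and return the written paths."""
    import fitz  # PyMuPDF; needs `pip install pymupdf`

    image_dir = Path(image_dir)
    image_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    with fitz.open(str(pdf_path)) as doc:
        for page_number, page in enumerate(doc, start=1):
            images = page.get_images(full=True)
            for index, info in enumerate(images, start=1):
                xref = info[0]  # first entry is the image's cross-reference number
                image = doc.extract_image(xref)
                name = f"page{page_number}_img{index}.{image['ext']}"
                target = image_dir / name
                target.write_bytes(image["image"])
                saved.append(target)
    return saved


print(image_reference("converted/images/page1_img1.png", "converted/report.md"))
```

`image_reference` expects the image folder to sit inside the folder of the Markdown file, so the generated links stay valid when the output directory is moved as a whole.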

## Support & Feedback

For issues or feature requests, visit:
https://github.com/anomalyco/opencode

## License

This script is provided as-is for use in your project.

---

**Version:** 1.0
**Last Updated:** 2024-02-23