- Implement pdf_to_markdown.py script with pypdf for text extraction
- Extract metadata (title, author, creation date) from PDFs
- Generate clean Markdown files with YAML front matter
- Add comprehensive error handling and logging
- Create mise.toml with 10+ convenient tasks for conversion
- Provide detailed documentation (4 guides + quick reference)
- Successfully convert all 18 PDF files in artikel/ folder to Markdown
- Include .gitignore for Python cache and local config

# PDF to Markdown Converter - Setup & Usage Guide

## Overview

This is a Python script that converts PDF files to clean Markdown format, extracting text content and document metadata.

**Features:**
- ✅ Extracts text from all PDF pages
- ✅ Preserves page structure with page headers
- ✅ Extracts metadata (title, author, creation date)
- ✅ Generates YAML front matter in Markdown files
- ✅ Robust error handling (skips problematic PDFs)
- ✅ Detailed logging and conversion summary
- ✅ Multiple CLI options for flexibility

## Installation

### Prerequisites
- Python 3.8 or higher
- pip (Python package manager)

### Setup Steps

1. **Clone or download this project** (if you haven't already)

2. **Install dependencies:**
   ```bash
   pip install -r requirements.txt
   ```

   This installs:
   - `pypdf` >= 3.0.0 - For PDF text extraction
   - `python-dateutil` >= 2.8.0 - For date parsing

3. **Verify installation:**
   ```bash
   python3 pdf_to_markdown.py --help
   ```

## Usage

### Basic Usage

**Convert all PDFs in default folder (`./artikel`):**
```bash
python3 pdf_to_markdown.py
```

**Convert PDFs from custom input folder:**
```bash
python3 pdf_to_markdown.py /path/to/pdf/folder
```

**Specify both input and output folders:**
```bash
python3 pdf_to_markdown.py /path/to/input /path/to/output
```

### Advanced Options

**Verbose mode** (detailed logging):
```bash
python3 pdf_to_markdown.py -v ./artikel
python3 pdf_to_markdown.py --verbose ./artikel
```

**Quiet mode** (suppress output except errors):
```bash
python3 pdf_to_markdown.py -q ./artikel
python3 pdf_to_markdown.py --quiet ./artikel
```

**Dry run** (preview without writing files):
```bash
python3 pdf_to_markdown.py --dry-run ./artikel
```

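For readers who want to see how such a command line could be wired up, here is a minimal `argparse` sketch. It is not the script's actual parser: the function name `build_parser`, the `None` default for the output folder, and the choice to make `-v` and `-q` mutually exclusive are all assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the options documented above."""
    parser = argparse.ArgumentParser(
        prog="pdf_to_markdown.py",
        description="Convert PDF files to Markdown.",
    )
    # Both folders are optional positionals; ./artikel is the documented default input.
    parser.add_argument("input_dir", nargs="?", default="./artikel")
    # Assumption: None means "let the script pick its default output folder".
    parser.add_argument("output_dir", nargs="?", default=None)
    # Assumption: verbose and quiet contradict each other, so only one is allowed.
    noise = parser.add_mutually_exclusive_group()
    noise.add_argument("-v", "--verbose", action="store_true")
    noise.add_argument("-q", "--quiet", action="store_true")
    parser.add_argument("--dry-run", action="store_true")
    return parser


args = build_parser().parse_args(["-v", "--dry-run", "./artikel"])
print(args.input_dir, args.verbose, args.dry_run)
```

Passing both `-v` and `-q` to this sketch makes `argparse` exit with a usage error, which is one reasonable way to handle the conflict.
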
### Examples

```bash
# Process all PDFs in artikel folder, save to artikel/converted
python3 pdf_to_markdown.py

# Process PDFs in custom location with verbose output
python3 pdf_to_markdown.py -v ~/Documents/PDFs

# Test what would be converted without writing files
python3 pdf_to_markdown.py --dry-run ./artikel

# Convert and save to specific output directory
python3 pdf_to_markdown.py ./input_pdfs ./output_markdown
```

## Output Format

Each converted PDF becomes a Markdown file with the following structure:

```markdown
---
title: Document Title
author: Author Name
created: 2024-02-23
converted: 2024-02-23 14:32:15
source: original_filename.pdf
---

# Document Title

## Page 1

[Extracted text from page 1...]

## Page 2

[Extracted text from page 2...]
```

**Front Matter Fields:**
- `title` - Document title (from PDF metadata, falling back to the filename)
- `author` - Document author (if available in PDF metadata)
- `created` - PDF creation date (if available in metadata)
- `converted` - Timestamp of when the conversion occurred
- `source` - Original PDF filename

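As an illustration of how a front matter block like this might be assembled, here is a small standard-library sketch. The function name `build_front_matter` is hypothetical, and two behaviours are assumptions rather than documented facts: missing optional fields are omitted entirely, and every value is double-quoted (unlike the unquoted example above) so that a title containing a colon still parses as YAML.

```python
import json
from datetime import datetime
from typing import Optional


def build_front_matter(
    title: str,
    source: str,
    author: Optional[str] = None,
    created: Optional[str] = None,
    converted: Optional[datetime] = None,
) -> str:
    """Return a YAML front matter block; optional fields are skipped when empty."""
    converted = converted or datetime.now()
    fields = {
        "title": title,
        "author": author,
        "created": created,
        "converted": converted.strftime("%Y-%m-%d %H:%M:%S"),
        "source": source,
    }
    lines = ["---"]
    for key, value in fields.items():
        if value:  # skip metadata the PDF did not provide
            # json.dumps produces a double-quoted string, which is also valid YAML
            lines.append(f"{key}: {json.dumps(value, ensure_ascii=False)}")
    lines.append("---")
    return "\n".join(lines) + "\n"


print(build_front_matter("Report: Q1", "report.pdf", created="2024-02-23"))
```

Quoting costs a little readability but avoids hand-rolled escaping rules; a YAML library would be the more thorough option if one is already a dependency.
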
## Troubleshooting

### Issue: `ModuleNotFoundError: No module named 'pypdf'`

**Solution:** Install dependencies:
```bash
pip install -r requirements.txt
```

### Issue: PDF has no extractable text

This typically happens with:
- **Scanned PDFs** (image-based, no embedded text layer)
- **Corrupted PDFs**
- **Encrypted PDFs**

The script will:
- Log a warning for the file
- Create a Markdown file that contains the metadata and a note that text extraction failed
- Continue processing other PDFs

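A rough way to detect this situation is to check whether any page yields non-whitespace text. The sketch below is an assumption about how such a check could look, not the script's own logic; `has_extractable_text` and `extract_page_texts` are hypothetical names, while `PdfReader`, `reader.pages`, and `page.extract_text()` are regular pypdf calls.

```python
from typing import Iterable, List


def has_extractable_text(page_texts: Iterable[str], min_chars: int = 1) -> bool:
    """True when the pages contain at least min_chars non-whitespace-trimmed characters."""
    return sum(len(text.strip()) for text in page_texts) >= min_chars


def extract_page_texts(pdf_path: str) -> List[str]:
    """Read every page with pypdf; an empty string stands in for pages without text."""
    from pypdf import PdfReader  # lazy import so the check above works without pypdf

    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]


# A scanned document typically produces only empty or whitespace-only pages.
print(has_extractable_text(["", "  \n"]))        # False
print(has_extractable_text(["", "Hello world"]))  # True
```

Raising `min_chars` can help flag PDFs where only a stray header or page number was recovered. Files that trigger this check usually need OCR before they can be converted.
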
### Issue: Permission denied when writing files

**Solution:** Ensure you have write permissions to the output directory:
```bash
chmod 755 /path/to/output/directory
```

### Issue: Special characters or encoding problems

The script uses UTF-8 encoding by default, which handles most character sets. If you encounter issues:
- Ensure your terminal supports UTF-8
- Check if the PDF contains unusual character encodings

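When adapting the script, the usual safeguard is to pass `encoding="utf-8"` explicitly whenever a Markdown file is written or read, because the default encoding Python uses for text files can depend on the platform locale. The following is a minimal standard-library illustration with a made-up sample string, not code taken from the script:

```python
import tempfile
from pathlib import Path

# Sample text with characters outside ASCII (hypothetical content).
text = "Größe: 5 µm, naïve café\n"

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "sample.md"
    # Explicit encoding keeps the result independent of the system locale.
    target.write_text(text, encoding="utf-8")
    round_trip = target.read_text(encoding="utf-8")

print(round_trip == text)  # True
```
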
## Output Statistics

After processing, the script displays a summary:
```
============================================================
CONVERSION SUMMARY
============================================================
Total PDFs: 25
Successful: 23
Failed: 2
Output directory: /path/to/converted
============================================================
```

If any PDFs failed to convert, details are logged for debugging.

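A summary of this shape is straightforward to build from three counters. The helper below is only a sketch: `format_summary` is a hypothetical name, and the rule width of 60 and the column padding are assumptions, so the spacing will not match the script's output exactly.

```python
def format_summary(total: int, successful: int, output_dir: str, width: int = 60) -> str:
    """Build a text summary; the failed count is derived from the other two."""
    rule = "=" * width
    rows = [
        ("Total PDFs:", total),
        ("Successful:", successful),
        ("Failed:", total - successful),
        ("Output directory:", output_dir),
    ]
    # Left-align labels in an 18-character column so the values line up.
    body = [f"{label:<18}{value}" for label, value in rows]
    return "\n".join([rule, "CONVERSION SUMMARY", rule, *body, rule])


print(format_summary(25, 23, "/path/to/converted"))
```
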
## File Structure

```
.
├── pdf_to_markdown.py    # Main conversion script
├── requirements.txt      # Python dependencies
└── README.md             # This file
```

## How It Works

1. **Discovers PDFs** - Finds all `.pdf` files in the input directory
2. **Extracts Metadata** - Reads title, author, and creation date from PDF metadata
3. **Extracts Text** - Processes each page and extracts text content
4. **Creates Markdown** - Formats extracted content with metadata front matter
5. **Saves Files** - Writes Markdown files to output directory with same names as PDFs
6. **Reports Results** - Displays conversion summary and any errors

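The six steps above can be sketched in a few dozen lines. This is a simplified illustration under stated assumptions, not the script itself: the function names (`discover_pdfs`, `read_pdf`, `render_markdown`, `convert_all`) are hypothetical, front matter is left out for brevity, and only the title is read from the metadata. The pypdf calls used (`PdfReader`, `reader.metadata`, `reader.pages`, `page.extract_text()`) are part of the library.

```python
from pathlib import Path
from typing import List, Tuple


def discover_pdfs(input_dir: Path) -> List[Path]:
    """Step 1: top-level *.pdf files, sorted for a stable order."""
    return sorted(input_dir.glob("*.pdf"))


def read_pdf(pdf_path: Path) -> Tuple[str, List[str]]:
    """Steps 2-3: return (title, page_texts) using pypdf."""
    from pypdf import PdfReader  # lazy import: only needed when a PDF is read

    reader = PdfReader(str(pdf_path))
    meta = reader.metadata  # may be None when the PDF carries no metadata
    title = meta.title if meta and meta.title else pdf_path.stem
    return title, [page.extract_text() or "" for page in reader.pages]


def render_markdown(title: str, page_texts: List[str]) -> str:
    """Step 4: one '## Page N' section per page (front matter omitted here)."""
    parts = [f"# {title}", ""]
    for number, text in enumerate(page_texts, start=1):
        parts += [f"## Page {number}", "", text.strip(), ""]
    return "\n".join(parts)


def convert_all(input_dir: Path, output_dir: Path) -> dict:
    """Steps 5-6: write one .md file per PDF and count the outcomes."""
    output_dir.mkdir(parents=True, exist_ok=True)
    stats = {"total": 0, "successful": 0, "failed": 0}
    for pdf_path in discover_pdfs(input_dir):
        stats["total"] += 1
        try:
            title, pages = read_pdf(pdf_path)
            target = output_dir / pdf_path.with_suffix(".md").name
            target.write_text(render_markdown(title, pages), encoding="utf-8")
            stats["successful"] += 1
        except Exception as exc:  # keep going so one bad PDF does not stop the batch
            print(f"Failed: {pdf_path.name}: {exc}")
            stats["failed"] += 1
    return stats


print(render_markdown("Demo", ["First page text", "Second page text"]))
```

Catching a broad `Exception` per file is a deliberate choice in this sketch: it trades precise error handling for the "skip problematic PDFs and continue" behaviour described in the feature list.
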
## Limitations

- **No image extraction** - Images in PDFs are not extracted or embedded
- **Text-only** - Requires PDFs with extractable text (scanned PDFs won't work well)
- **Layout preservation** - Complex multi-column layouts may not be perfectly preserved
- **No recursive search** - Only the top-level input directory is searched (subdirectories are ignored)

## Advanced: Customizing the Script

### To process subdirectories:

Replace this line in the script:
```python
pdf_files = list(self.input_dir.glob('*.pdf'))
```

With:
```python
pdf_files = list(self.input_dir.glob('**/*.pdf'))
```

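The difference between the two patterns can be checked on a throwaway directory. The snippet below is a self-contained demonstration using only `pathlib` and `tempfile`; the file names are made up, and `rglob('*.pdf')` is shown as an equivalent spelling of `glob('**/*.pdf')`.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "nested").mkdir()
    (root / "a.pdf").touch()
    (root / "nested" / "b.pdf").touch()

    # Non-recursive: only files directly inside root.
    top_level = sorted(p.relative_to(root).as_posix() for p in root.glob("*.pdf"))
    # Recursive: root plus every subdirectory.
    recursive = sorted(p.relative_to(root).as_posix() for p in root.rglob("*.pdf"))

print(top_level)  # ['a.pdf']
print(recursive)  # ['a.pdf', 'nested/b.pdf']
```

One thing to watch when enabling recursion: two PDFs with the same name in different subfolders would map to the same Markdown filename if the output directory stays flat, so mirroring the relative path in the output folder may be worth considering.
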
### To include image extraction:

The script currently skips images. To add image extraction:
1. Replace `pypdf` with PyMuPDF (the `pymupdf` package, traditionally imported as `fitz`) for better image support
2. Modify the `extract_text()` method to save images
3. Update `create_markdown()` to reference extracted images

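As a starting point for these steps, the sketch below shows one possible shape for the image handling. It is an untested outline under several assumptions: the helper names `extract_page_images` and `image_markdown` and the file naming scheme are hypothetical, and how the result plugs into `extract_text()` and `create_markdown()` depends on the script's internals. The PyMuPDF calls used are `fitz.open`, `page.get_images(full=True)`, and `doc.extract_image(xref)`.

```python
from pathlib import Path
from typing import Dict, List


def image_markdown(image_paths: List[Path], page_number: int) -> str:
    """Build Markdown image references for one page."""
    return "\n".join(
        f"![Page {page_number} image {index}]({path.as_posix()})"
        for index, path in enumerate(image_paths, start=1)
    )


def extract_page_images(pdf_path: Path, image_dir: Path) -> Dict[int, List[Path]]:
    """Save embedded images to image_dir and return them grouped by page number."""
    import fitz  # PyMuPDF; lazy import so image_markdown works without it

    image_dir.mkdir(parents=True, exist_ok=True)
    saved: Dict[int, List[Path]] = {}
    with fitz.open(str(pdf_path)) as doc:
        for page_number, page in enumerate(doc, start=1):
            for index, image in enumerate(page.get_images(full=True), start=1):
                xref = image[0]  # first tuple entry is the image's cross-reference number
                info = doc.extract_image(xref)  # dict with raw bytes and file extension
                name = f"{pdf_path.stem}_p{page_number}_{index}.{info['ext']}"
                target = image_dir / name
                target.write_bytes(info["image"])
                saved.setdefault(page_number, []).append(target)
    return saved


print(image_markdown([Path("images/doc_p1_1.png")], page_number=1))
```
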
## Support & Feedback

For issues or feature requests, visit:
https://github.com/anomalyco/opencode

## License

This script is provided as-is for use in your project.

---

**Version:** 1.0
**Last Updated:** 2024-02-23