# Mise en Place - PDF to Markdown Converter

A modern task runner configuration for the PDF to Markdown conversion project using [mise](https://mise.jdx.dev/).

## Overview

Mise is a polyglot tool manager that handles tool installations and task execution. This project uses it to:

- Automatically install Python 3.11 and dependencies
- Provide convenient commands for PDF conversion tasks
- Manage development workflows
- Track conversion status

## Installation

### Prerequisites

- **mise** CLI installed: https://mise.jdx.dev/getting-started.html

Quick install:

```bash
curl https://mise.jdx.dev/install.sh | sh
```

### Setup

```bash
# Clone or navigate to the project
cd maturaarbeit

# Trust the configuration files (one-time setup)
mise trust

# Verify installation
mise tasks
```

## Quick Start

### Convert All PDFs

```bash
mise run convert
```

This will:

1. Install dependencies (if not already installed)
2. Run the PDF to Markdown converter
3. Process all PDFs in the `artikel/` folder
4. Output Markdown files to `artikel/converted/`
5. Display a conversion summary

### Check Conversion Status

```bash
mise run status
```

Shows:

- Number of PDFs in `artikel/`
- Number of converted Markdown files
- ✓ All PDFs converted (if done)

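The status report boils down to comparing two folders. A minimal Python sketch of that comparison follows; it is not the project's actual `status` task. The function name `conversion_status` is hypothetical, and it assumes each `name.pdf` is converted to `name.md` in the output folder.

```python
from pathlib import Path


def conversion_status(input_dir: Path, output_dir: Path) -> dict:
    """Summarize how many PDFs exist and how many have a Markdown counterpart."""
    pdf_stems = {p.stem for p in input_dir.glob("*.pdf")}
    # A missing output folder simply means nothing has been converted yet.
    md_stems = {p.stem for p in output_dir.glob("*.md")} if output_dir.is_dir() else set()
    # Assumption: name.pdf becomes name.md, so stems can be compared directly.
    missing = sorted(pdf_stems - md_stems)
    return {
        "pdfs": len(pdf_stems),
        "markdown": len(md_stems),
        "missing": missing,
        "all_converted": bool(pdf_stems) and not missing,
    }
```

Calling `conversion_status(Path("artikel"), Path("artikel/converted"))` would return the two counts plus the stems of any PDFs that still lack a Markdown file.
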
### Preview Without Writing

```bash
mise run dry-run
```

Lists the PDFs that would be converted without writing any files.

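One common way to implement a dry run is to compute the full conversion plan first and skip only the write step. The sketch below illustrates that pattern; `plan_conversions`, `convert_all`, and the `convert_one` callback are hypothetical names and do not describe the real interface of `pdf_to_markdown.py`.

```python
from pathlib import Path
from typing import Callable, List, Tuple


def plan_conversions(input_dir: Path, output_dir: Path) -> List[Tuple[Path, Path]]:
    """Pair every PDF in input_dir with the Markdown path it would be written to."""
    return [(pdf, output_dir / f"{pdf.stem}.md") for pdf in sorted(input_dir.glob("*.pdf"))]


def convert_all(input_dir: Path, output_dir: Path,
                convert_one: Callable[[Path], str], dry_run: bool = False) -> List[Path]:
    """Write one Markdown file per PDF, or only report the plan when dry_run is set."""
    written = []
    for pdf, target in plan_conversions(input_dir, output_dir):
        if dry_run:
            # Dry run: report the planned conversion and touch nothing on disk.
            print(f"[dry-run] would convert {pdf} -> {target}")
            continue
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(convert_one(pdf), encoding="utf-8")
        written.append(target)
    return written
```

Because the plan is built the same way in both modes, the dry run reports exactly the files a real run would write.
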
## Available Tasks

| Task | Description |
|------|-------------|
| `install` | Install Python 3.11 and project dependencies |
| `convert` | Convert all PDFs to Markdown (main task) |
| `convert-verbose` | Convert with detailed logging output |
| `convert-quiet` | Convert silently (errors only) |
| `dry-run` | Preview conversion without writing files |
| `convert-custom` | Convert from custom input/output folders |
| `status` | Show conversion status and progress |
| `clean` | Remove converted Markdown files |
| `clean-all` | Remove all build artifacts and cache |
| `help` | List all available tasks |

## Usage Examples

### Basic Conversion

```bash
# Convert all PDFs using defaults
mise run convert

# Convert with verbose logging
mise run convert-verbose

# Convert silently
mise run convert-quiet
```

### Custom Paths

```bash
# Convert from custom input directory
INPUT_DIR=/path/to/pdfs mise run convert-custom

# Specify both input and output directories
INPUT_DIR=/path/to/pdfs OUTPUT_DIR=/path/to/output mise run convert-custom
```

### Cleanup

```bash
# Remove only converted markdown files
mise run clean

# Remove all artifacts (markdown files, cache, __pycache__)
mise run clean-all
```

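The difference between the two cleanup tasks can be pictured with a short Python sketch. This only illustrates the described behavior and is not the commands defined in `mise.toml`; the function names are hypothetical, and "cache" is assumed here to mean `__pycache__` directories only.

```python
import shutil
from pathlib import Path
from typing import List


def clean(output_dir: Path) -> List[Path]:
    """Delete converted Markdown files and return the paths that were removed."""
    removed = sorted(output_dir.glob("*.md")) if output_dir.is_dir() else []
    for md_file in removed:
        md_file.unlink()
    return removed


def clean_all(project_root: Path, output_dir: Path) -> List[Path]:
    """Run clean, then also delete every __pycache__ directory under project_root."""
    removed = clean(output_dir)
    # sorted() materializes the matches before anything is deleted.
    for cache_dir in sorted(project_root.rglob("__pycache__")):
        if cache_dir.is_dir():
            shutil.rmtree(cache_dir)
            removed.append(cache_dir)
    return removed
```

Neither function touches the source PDFs, which matches the intent that cleanup affects only generated files.
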
## Configuration Files

### `mise.toml`

Main configuration file with all tasks, environment variables, and tool versions.

**Key sections:**

- `[env]` - Environment variables (e.g., `PYTHONUNBUFFERED`)
- `[tasks.*]` - Task definitions with descriptions and commands
- `[tools.python]` - Python version specification (3.11)
- `[tools.pipenv]` - Package manager version

### `.mise.local.toml`

Local overrides for environment-specific configuration. This file is git-ignored and meant for personal settings.

**Example customizations:**

```toml
# Environment variables must live under the [env] table
[env]
# Override input/output directories
INPUT_DIR = "./my_pdfs"
OUTPUT_DIR = "./my_output"

# Custom Python path
PYTHON_PATH = "/usr/local/bin/python3"
```

### `.gitignore`

Excludes mise cache and local configuration from version control.

## How It Works

### Automatic Tool Installation

When you run a task, mise automatically:

1. Detects required tools (Python 3.11)
2. Downloads and installs them if missing
3. Creates an isolated environment
4. Executes the task in that environment

### Task Execution

1. **Setup phase** - Install dependencies via `pip install -r requirements.txt`
2. **Execution phase** - Run the Python script with appropriate arguments
3. **Cleanup phase** - Report results and summary

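The three phases amount to running commands in order and stopping at the first failure. In this project mise handles that ordering itself, so the Python sketch below only illustrates the fail-fast sequencing. `build_phases` and `run_phases` are hypothetical helpers, and the summary line is an assumed format, not the script's real output.

```python
import subprocess
import sys
from typing import Callable, List, Sequence, Tuple

Phase = Tuple[str, List[str]]


def build_phases(script: str = "pdf_to_markdown.py",
                 requirements: str = "requirements.txt",
                 script_args: Sequence[str] = ()) -> List[Phase]:
    """Return the setup and execution commands in the order they would run."""
    return [
        ("setup", [sys.executable, "-m", "pip", "install", "-r", requirements]),
        ("execution", [sys.executable, script, *script_args]),
    ]


def run_phases(phases: List[Phase], runner: Callable = subprocess.run) -> int:
    """Run phases in order, stop at the first failure, then print a short summary."""
    exit_code = 0
    completed = []
    for name, command in phases:
        exit_code = runner(command).returncode
        if exit_code != 0:
            break  # Later phases are skipped once one phase fails.
        completed.append(name)
    # Final reporting step: summarize what ran and how it ended.
    print(f"completed phases: {completed or 'none'}; exit code: {exit_code}")
    return exit_code
```

Passing the runner as a parameter keeps the sequencing logic testable without actually invoking `pip`.
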
### Environment Variables

```bash
PYTHONUNBUFFERED=1  # Real-time output (no buffering)
INPUT_DIR           # Custom input folder (default: ./artikel)
OUTPUT_DIR          # Custom output folder (default: ./artikel/converted)
```

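On the Python side, these variables would typically be read with a fallback to the documented defaults. The sketch below shows one way to do that; `resolve_dirs` is a hypothetical helper, and the real script may resolve its paths differently (for example through command-line arguments).

```python
import os
from pathlib import Path
from typing import Mapping, Optional, Tuple


def resolve_dirs(env: Optional[Mapping[str, str]] = None) -> Tuple[Path, Path]:
    """Return (input_dir, output_dir), falling back to the documented defaults."""
    env = os.environ if env is None else env
    # "or" also covers variables that are set but empty.
    input_dir = Path(env.get("INPUT_DIR") or "./artikel")
    output_dir = Path(env.get("OUTPUT_DIR") or "./artikel/converted")
    return input_dir, output_dir
```

Note that with these defaults, setting only `INPUT_DIR` still writes to `./artikel/converted`, so `OUTPUT_DIR` should be set as well when the output belongs next to a custom input folder.
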
## Advantages Over Traditional Approach

### Before (Manual Setup)

```bash
# Install Python globally
# Install pip
# Install dependencies
# Hope everything works
python3 pdf_to_markdown.py
```

### After (Mise)

```bash
# One command - everything handled
mise run convert
```

**Benefits:**

- ✅ Reproducible - Same environment every time
- ✅ Isolated - Tools don't affect system Python
- ✅ Fast - Caches installed tools
- ✅ Easy - Single command to run tasks
- ✅ Portable - Works on any system with mise
- ✅ Documented - Task descriptions built-in
- ✅ Flexible - Environment variables for customization

## Troubleshooting

### Issue: "mise: command not found"

**Solution:** Install mise first. If it is already installed, check that its install location (by default `~/.local/bin`) is on your `PATH`.

```bash
curl https://mise.jdx.dev/install.sh | sh
```

### Issue: "Config files are not trusted"

**Solution:** Trust the configuration

```bash
mise trust
```

### Issue: Python dependencies not installing

**Solution:** Manually install in the mise environment

```bash
mise run install
```

### Issue: "No PDF files found"

**Solution:** Check the input directory path

```bash
# Verify PDFs exist
ls -la artikel/*.pdf

# If in different location, use custom path
INPUT_DIR=/path/to/pdfs mise run convert-custom
```

### Issue: Slow first run

**Solution:** The first run downloads and installs the tools, which happens only once. Subsequent runs reuse the cached tools and are fast.

## Advanced Usage

### Running Tasks from Shell Scripts

```bash
#!/bin/bash
# Run the conversion and branch on its exit code
if mise run convert; then
    echo "Conversion successful"
    mise run status
else
    echo "Conversion failed"
    exit 1
fi
```

### Integrating with CI/CD

```yaml
# GitHub Actions example
- name: Convert PDFs
  run: |
    curl https://mise.jdx.dev/install.sh | sh
    # The installer places the binary in ~/.local/bin by default
    export PATH="$HOME/.local/bin:$PATH"
    mise trust
    mise run convert
```

### Custom Task Definition

To add a new task, edit `mise.toml`:

```toml
[tasks.my-custom-task]
description = "My custom task description"
run = "echo 'Running custom task'"
depends = ["install"]  # Depends on install task
```

Then run:

```bash
mise run my-custom-task
```

## Documentation

- **Project Guide** - See `PDF_CONVERTER_GUIDE.md`
- **Mise Docs** - https://mise.jdx.dev/
- **Python Script** - See `pdf_to_markdown.py`

## Support

For issues or questions:

- Mise documentation: https://mise.jdx.dev/
- Project issues: https://github.com/anomalyco/opencode

---

**Version:** 1.0

**Last Updated:** 2024-02-23