# PDF to Markdown Converter - Setup & Usage Guide

## Overview

This is a Python script that converts PDF files to clean Markdown format, extracting text content and document metadata.

**Features:**

- ✅ Extracts text from all PDF pages
- ✅ Preserves page structure with page headers
- ✅ Extracts metadata (title, author, creation date)
- ✅ Generates YAML front matter in Markdown files
- ✅ Robust error handling (skips problematic PDFs)
- ✅ Detailed logging and conversion summary
- ✅ Multiple CLI options for flexibility

## Installation

### Prerequisites

- Python 3.8 or higher
- pip (Python package manager)

### Setup Steps

1. **Clone or download this project** (if you haven't already)

2. **Install dependencies:**

   ```bash
   pip install -r requirements.txt
   ```

   This installs:
   - `pypdf` >= 3.0.0 - For PDF text extraction
   - `python-dateutil` >= 2.8.0 - For date parsing

3. **Verify installation:**

   ```bash
   python3 pdf_to_markdown.py --help
   ```

## Usage

### Basic Usage

**Convert all PDFs in default folder (`./artikel`):**

```bash
python3 pdf_to_markdown.py
```

**Convert PDFs from custom input folder:**

```bash
python3 pdf_to_markdown.py /path/to/pdf/folder
```

**Specify both input and output folders:**

```bash
python3 pdf_to_markdown.py /path/to/input /path/to/output
```

### Advanced Options

**Verbose mode** (detailed logging):

```bash
python3 pdf_to_markdown.py -v ./artikel
python3 pdf_to_markdown.py --verbose ./artikel
```

**Quiet mode** (suppress output except errors):

```bash
python3 pdf_to_markdown.py -q ./artikel
python3 pdf_to_markdown.py --quiet ./artikel
```

**Dry run** (preview without writing files):

```bash
python3 pdf_to_markdown.py --dry-run ./artikel
```

### Examples

```bash
# Process all PDFs in artikel folder, save to artikel/converted
python3 pdf_to_markdown.py

# Process PDFs in custom location with verbose output
python3 pdf_to_markdown.py -v ~/Documents/PDFs

# Test what would be converted without writing files
python3 pdf_to_markdown.py --dry-run ./artikel

# Convert and save to specific output directory
python3 pdf_to_markdown.py ./input_pdfs ./output_markdown
```

## Output Format

Each converted PDF becomes a Markdown file with the following structure:

```markdown
---
title: Document Title
author: Author Name
created: 2024-02-23
converted: 2024-02-23 14:32:15
source: original_filename.pdf
---

# Document Title

## Page 1

[Extracted text from page 1...]

## Page 2

[Extracted text from page 2...]
```

**Front Matter Sections:**

- `title` - Document title (from PDF metadata or filename)
- `author` - Document author (if available in PDF metadata)
- `created` - PDF creation date (if available in metadata)
- `converted` - Timestamp of when the conversion occurred
- `source` - Original PDF filename

## Troubleshooting

### Issue: `ModuleNotFoundError: No module named 'pypdf'`

**Solution:** Install dependencies:

```bash
pip install -r requirements.txt
```

### Issue: PDF has no extractable text

This typically happens with:

- **Scanned PDFs** (image-based, no embedded text layer)
- **Corrupted PDFs**
- **Encrypted PDFs**

The script will:

- Log a warning for the file
- Create a Markdown file with metadata but note that text extraction failed
- Continue processing other PDFs

### Issue: Permission denied when writing files

**Solution:** Ensure you have write permissions to the output directory:

```bash
chmod 755 /path/to/output/directory
```

### Issue: Special characters or encoding problems

The script uses UTF-8 encoding by default, which handles most character sets.
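
As a rough illustration of what UTF-8 output means in practice, the sketch below writes and re-reads a Markdown file with an explicit encoding. The helper name `write_markdown` is hypothetical and is not taken from `pdf_to_markdown.py`; it only shows the general technique of passing `encoding="utf-8"` rather than relying on the platform default.

```python
import tempfile
from pathlib import Path


def write_markdown(path: Path, text: str) -> None:
    """Hypothetical helper: write text as UTF-8, independent of the OS default."""
    # Passing the encoding explicitly avoids surprises on systems whose
    # default text encoding is not UTF-8.
    path.write_text(text, encoding="utf-8")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "example.md"
        write_markdown(out, "# Übersicht\n\nCafé, naïve, 日本語\n")
        # Reading back with the same encoding returns the original text.
        print(out.read_text(encoding="utf-8"))
```

If the printed text looks garbled while the file itself opens correctly in an editor, the terminal encoding is the more likely culprit than the file.
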
If you encounter issues:

- Ensure your terminal supports UTF-8
- Check if the PDF contains unusual character encodings

## Output Statistics

After processing, the script displays a summary:

```
============================================================
CONVERSION SUMMARY
============================================================
Total PDFs: 25
Successful: 23
Failed: 2
Output directory: /path/to/converted
============================================================
```

If any PDFs failed to convert, details are logged for debugging.

## File Structure

```
.
├── pdf_to_markdown.py   # Main conversion script
├── requirements.txt     # Python dependencies
└── README.md            # This file
```

## How It Works

1. **Discovers PDFs** - Finds all `.pdf` files in the input directory
2. **Extracts Metadata** - Reads title, author, and creation date from PDF metadata
3. **Extracts Text** - Processes each page and extracts text content
4. **Creates Markdown** - Formats extracted content with metadata front matter
5. **Saves Files** - Writes Markdown files to the output directory, using the same base names as the PDFs
6. **Reports Results** - Displays conversion summary and any errors

## Limitations

- **No image extraction** - Images in PDFs are not extracted or embedded
- **Text-only** - Requires PDFs with extractable text (scanned, image-only PDFs yield little or no text)
- **Limited layout preservation** - Complex multi-column layouts may not be perfectly preserved
- **No recursive search** - Only searches the top-level directory (not subdirectories)

## Advanced: Customizing the Script

### To process subdirectories:

Replace this line in the script:

```python
pdf_files = list(self.input_dir.glob('*.pdf'))
```

With:

```python
pdf_files = list(self.input_dir.glob('**/*.pdf'))
```

### To include image extraction:

The script currently skips images. To add image extraction:

1. Replace `pypdf` with PyMuPDF (the `pymupdf` package, also importable as `fitz`) for better image support
2. Modify the `extract_text()` method to save images
3. Update `create_markdown()` to reference extracted images

## Support & Feedback

For issues or feature requests, visit: https://github.com/anomalyco/opencode

## License

This script is provided as-is for use in your project.

---

**Version:** 1.0

**Last Updated:** 2024-02-23