Mykhailo Pavlov

Multilingual Vocabulary Automation Engine

Released Oct 2025
Python · Flask · BeautifulSoup · LLM Integration · AnkiConnect · GitHub Actions · Web Scraping

About the Project

The Objective

Manual flashcard creation is a bottleneck for language learners. I built a tool to automate the heavy lifting: scraping definitions, generating audio, and pushing cards directly to Anki via AnkiConnect. The system supports 100+ languages and handles complex linguistic data like etymology and phonetics.

System Design

The application is a hybrid of a local web server (Flask) and a desktop utility (bundled with PyInstaller).

  • Scraping Layer: Modular scrapers for Oxford, Cambridge, and Wiktionary. Each scraper normalizes data into a unified schema.
  • Enrichment Layer: Uses LLMs to generate example sentences and context, ensuring cards aren't just dry definitions.
  • Integration Layer: Communicates with Anki via JSON-RPC to create notes, add media, and tag cards automatically.
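To make the layers concrete, here is a minimal sketch of the unified schema and the AnkiConnect payload the integration layer emits. The repository isn't shown here, so the field names (`Entry`, `build_add_note`, the `Basic` note model, the deck name) are illustrative; only the JSON-RPC shape (`action`/`version`/`params`, protocol version 6) is AnkiConnect's actual wire format.

```python
from dataclasses import dataclass, field

@dataclass
class Entry:
    """Unified schema every scraper normalizes into (field names illustrative)."""
    word: str
    definition: str
    phonetic: str = ""
    examples: list = field(default_factory=list)
    source: str = ""  # which dictionary produced this entry

def build_add_note(entry: Entry, deck: str = "Vocabulary") -> dict:
    """Build an AnkiConnect 'addNote' JSON-RPC payload (protocol version 6)."""
    return {
        "action": "addNote",
        "version": 6,
        "params": {
            "note": {
                "deckName": deck,
                "modelName": "Basic",
                "fields": {"Front": entry.word, "Back": entry.definition},
                "tags": ["auto", entry.source],
            }
        },
    }

# Sending it is a single POST to the local AnkiConnect endpoint, e.g.:
#   requests.post("http://localhost:8765", json=build_add_note(entry))
```

Keeping the payload construction separate from the HTTP call makes the integration layer trivially testable without a running Anki instance.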

Key Metrics

  • Output: Generated 1,000+ cards during testing.
  • Coverage: Supports 100+ target languages via dictionary API aggregation.
  • Reliability: CI pipeline runs integration tests on every commit to catch scraper breakages early.

Engineering Challenges

1. Scraper Fragility

Problem: Dictionary sites change their HTML structure frequently, breaking parsers. Solution: Implemented a fallback chain. If Oxford fails, the system tries Cambridge, then Wiktionary. I also added unit tests for each scraper using stored HTML snapshots to detect regressions immediately.
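The fallback chain reduces to a short loop: try each source in priority order, swallow and record failures, and only give up when every source has failed. A minimal sketch (the function name and error handling are assumptions, not the project's actual code):

```python
def fetch_definition(word: str, scrapers: list) -> dict:
    """Try each scraper in priority order (e.g. Oxford, Cambridge, Wiktionary);
    return the first successful result, or raise if every source fails."""
    errors = []
    for scraper in scrapers:
        try:
            result = scraper(word)
            if result:
                return result
        except Exception as exc:  # layout change, network error, parse failure
            errors.append((getattr(scraper, "__name__", "?"), exc))
    raise LookupError(f"All sources failed for {word!r}: {errors}")
```

Collecting the per-source errors rather than discarding them is what makes the "which source broke?" logs mentioned below possible.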

2. Media Handling

Problem: Audio files and images need to be stored in specific Anki media folders. Solution: Built a dedicated media manager that hashes filenames to prevent collisions and cleans up unused assets during sync. Supports SVG thumbnails and fallback TTS (gTTS) if native audio is unavailable.
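Content-hashed filenames are the core of that media manager. A sketch of the idea (function names are illustrative): hashing the file's bytes means identical audio dedupes naturally, while different files for the same word can never collide.

```python
import hashlib
from pathlib import Path

def media_filename(data: bytes, word: str, suffix: str) -> str:
    """Name a media file by a content hash: same bytes -> same name (dedupe),
    different bytes -> different name (no collisions)."""
    digest = hashlib.sha256(data).hexdigest()[:12]
    return f"{word}_{digest}{suffix}"

def unused_media(media_dir: Path, referenced: set) -> list:
    """List files in the media folder that no card references (cleanup candidates)."""
    return [p for p in media_dir.iterdir() if p.name not in referenced]
```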

3. Distribution

Problem: Users shouldn't need to install Python dependencies to run the tool. Solution: Configured PyInstaller to bundle the Flask app and scripts into a single .exe. Added a graceful shutdown mechanism to ensure the server stops cleanly without orphaned processes.
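The graceful-shutdown pattern is to run the server in a background thread and call its `shutdown()` hook instead of killing the process. The sketch below uses the stdlib `ThreadingHTTPServer` so it is self-contained; the same structure works for Flask by building the server with `werkzeug.serving.make_server(host, port, app)` and is an assumption about the project's actual mechanism, not its code.

```python
import threading
from http.server import ThreadingHTTPServer, BaseHTTPRequestHandler

class _Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, *args):  # keep the bundled .exe console quiet
        pass

class ServerThread(threading.Thread):
    """Run the server in a daemon thread so the app can stop it cleanly."""
    def __init__(self, host="127.0.0.1", port=0):  # port=0: pick a free port
        super().__init__(daemon=True)
        self.server = ThreadingHTTPServer((host, port), _Handler)

    def run(self):
        self.server.serve_forever()

    def stop(self):
        self.server.shutdown()      # unblocks serve_forever()
        self.server.server_close()  # release the socket; no orphaned process
```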

DevOps & Quality

I treated this like a production service, not a script:

  • CI/CD: GitHub Actions runs linting and integration tests on every push.
  • Logging: Detailed logs track which dictionary source provided the data for each card, aiding debugging when data looks off.
  • State Management: Tracks card lifecycle (created → sent → edited) in a local SQLite DB to prevent duplicate uploads.
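The duplicate-prevention piece can be as small as one table with the word as its primary key. A sketch under assumed names (the project's real schema isn't shown): an upsert advances the lifecycle state, and the primary key guarantees a word can never produce two rows.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS cards (
    word  TEXT PRIMARY KEY,
    state TEXT NOT NULL CHECK (state IN ('created', 'sent', 'edited'))
)
"""

def mark(conn: sqlite3.Connection, word: str, state: str) -> None:
    """Record a card's lifecycle state; the upsert keeps one row per word."""
    conn.execute(
        "INSERT INTO cards (word, state) VALUES (?, ?) "
        "ON CONFLICT(word) DO UPDATE SET state = excluded.state",
        (word, state),
    )

def already_sent(conn: sqlite3.Connection, word: str) -> bool:
    """True if this word was already uploaded, so the sync can skip it."""
    row = conn.execute("SELECT state FROM cards WHERE word = ?", (word,)).fetchone()
    return row is not None and row[0] in ("sent", "edited")
```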

Tech Stack

  • Backend: Python, Flask, SQLite
  • Scraping: BeautifulSoup, Selenium (for dynamic content)
  • Integration: AnkiConnect, LLM APIs
  • DevOps: GitHub Actions, PyInstaller

Reflections

This project taught me the value of defensive programming when dealing with external APIs. Scrapers will break. The key is designing a system that degrades gracefully rather than crashing. I also gained experience bridging the gap between web technologies (Flask) and desktop automation (AnkiConnect).