AI Pipeline · Weeks 1–3

From Raw Data toAI-Powered Skill Gap Analysis

A 3-week pipeline: ETL → LLM tagging → Full-stack chat app. Built with Python, FastAPI, Ollama, and Gemini.

scroll
00
Overview

The Pipeline

Three weeks, one cohesive AI system. Each week builds on the last.

WEEK 1

Job Listings ETL

Extract .mhtml archives → Bronze → Silver → Gold SQLite

Complete
WEEK 2

AI Skill Tagger

LLM batch tagging + resume skill gap detection

Complete
WEEK 3

KYouth Chat

Full-stack chat app with PDF resume analysis

Complete
01
Week 1 — ETL Pipeline
main.pybash
# Run full ETL pipeline end-to-end
python main.py all
 
# Or run individual stages
python main.py ingest # .mhtml → bronze HTML
python main.py process # HTML → silver JSON
python main.py load # JSON → gold SQLite
python main.py profile # data quality report
 
🥉 Bronze: Extracted 100 files
🥈 Silver: Processed 84 / Skipped 16
🥇 Gold: Inserted 84 records
 
--- DATA QUALITY REPORT ---
Total Records: 84
Missing Values → job_title: 0, company: 0
Avg Description Length: 2654 chars

Job Listings ETL

Parses raw .mhtml web archives through a 4-layer Medallion Architecture: Source → Bronze → Silver → Gold into SQLite. This ensures data integrity and high-fidelity extraction from non-standard formats.

0Jobs Processed
0%Schema Validated
02
Week 2 — LLM Analysis

AI-Powered Skill Gap Analysis

Sends job descriptions to Gemini/Ollama in batches of 3, extracts tech stacks with strict format validation, then cross-references your resume to surface missing skills.

find_skill_gaps.pybash
uv run tag_data.py
 
Analyzed Job 91347112: Java, Spring Boot, Python, REST APIs, CI/CD
Analyzed Job 91533584: PHP, Python, Node.js, MySQL, Docker, AWS
Analyzed Job 91554915: Python, Docker, GitHub Actions, Prometheus
Analyzed Job 91597624: Python, SQL, Google Cloud, AWS, PostgreSQL
Total tokens used: 2433, took 10486.325ms
 
uv run find_skill_gaps.py
 
gaps=['aws', 'docker', 'github actions', 'java', 'postgresql',
'prometheus', 'spring boot', 'sql', 'rest apis']
0Jobs Tagged
0LLM Models
0Skill Gaps Found
Models
Geminillama3.1phi3deepseek-r1
Skill Gaps Found
awsdockergithub actionsjavapostgresqlprometheusspring bootsqlrest apis
03
Week 3 — Full-Stack App

Resume Helper Chat App

FastAPI backend with Jinja2 frontend. Upload a resume PDF to trigger real-time skill gap analysis. Switch between local Ollama models and cloud Gemini mid-conversation.

docker-composebash
# Option A — Docker (recommended)
docker compose up --build -d
docker exec -it week_3-ollama-1 ollama pull llama3.1
 
# Option B — Local dev
cd week_3/backend
uv run uvicorn --app-dir src --host 0.0.0.0 --port 8001 app:app
 
cd week_3/frontend
uv run uvicorn --app-dir src --host 0.0.0.0 --port 8000 app:app
Architecture
Browser
└── Frontend :8000
└── Backend :8001
├── Ollama :11434
├── Gemini API
└── SQLite DB
PDF Upload
Attach any resume PDF — skills extracted automatically
Multi-Model
Switch llama3.1, gemma3, phi3, deepseek, Gemini mid-chat
Skill Gap Report
Matched vs missing skills returned inline in chat
04
Deployment

How to Run

Docker is the fastest path. Local dev for machines without enough resources for the Ollama container.

Prerequisites:Python 3.14DockerOllamauv8 GB RAMGemini API Key (optional)

Docker

Recommended
1Configure environment
cp .env.example .env
# edit .env — add GEMINI_API if using Gemini
2Build and start all containers
docker compose up --build -d
3Pull Ollama model (one-time)
docker exec -it week_3-ollama-1 ollama pull llama3.1
# optionally pull more:
# ollama pull gemma3 phi3 deepseek-r1:1.5b
4Open the app
http://localhost:8000

Local Dev

1Install prerequisites & pull model
ollama pull llama3.1
2Start backend
cd week_3/backend
uv sync
uv run uvicorn --app-dir src --host 0.0.0.0 --port 8001 app:app
3Start frontend
cd week_3/frontend
uv sync
uv run uvicorn --app-dir src --host 0.0.0.0 --port 8000 app:app
4Open the app
http://localhost:8000
Repository Structure
kyouth-project/
|-- week_1/ # ETL pipeline
| |-- main.py # entry point
| |-- src/
| | |-- ingestor.py # .mhtml → bronze
| | |-- processor.py # bronze → silver
| | |-- loader.py # silver → gold SQLite
| | `-- profiler.py # data quality report
| `-- data/ # source / bronze / silver / gold
|
|-- week_2/ # LLM skill tagger
| |-- tag_data.py # batch LLM tagging
| |-- find_skill_gaps.py
| `-- prompt_model.py # Gemini / Ollama adapter
|
`-- week_3/ # full-stack chat app
|-- backend/ # FastAPI :8001
|-- frontend/ # Jinja2 :8000
|-- landing/ # Next.js :3000
`-- docker-compose.yml
Environment Variables
VariableServiceDefaultDescription
CHAT_MODELbackendllama3.1Fallback model if none selected in UI
GEMINI_APIbackendGoogle Gemini API key (cloud models only)
OLLAMA_HOSTbackendhttp://ollama:11434Ollama server URL
BACKEND_URLfrontendhttp://backend:8000Backend service URL
DB_PATHbackenddata/jobs_d1.dbSQLite jobs database path
05
Technologies

Tech Stack

Every tool chosen for a reason. No framework soup.

Python
Python
Core language — all 3 weeks
FastAPI
FastAPI
Backend REST API — week 3
SQLite
SQLite
Jobs database — weeks 2 & 3
Docker
Docker
Containerization + Compose
Ollama
Ollama
Local LLM runtime
Gemini API
Gemini API
Cloud LLM — Google
Jinja2
Jinja2
Server-side HTML templates
uv
uv
Python package manager