mgeeky-decode-spam-headers/Auto Run Docs/SpecKit-web-header-analyzer-Phase-02-Engine-Refactoring.md
2026-02-17 23:36:29 +01:00

# Phase 02: Engine Refactoring
This phase decomposes the monolithic `decode-spam-headers.py` (6,931 lines, 106 test methods, 3 classes) into independently testable scanner modules that the API can invoke programmatically. This is a prerequisite for all user stories — without modular scanners, the backend cannot expose individual tests or stream progress. TDD Red-Green: write failing tests first, then implement parser, scanner base, registry, 10 vendor-grouped scanner modules, and the analyzer orchestrator.
## Spec Kit Context
- **Feature:** 1-web-header-analyzer
- **Specification:** .specify/specs/1-web-header-analyzer/spec.md
- **Plan:** .specify/specs/1-web-header-analyzer/plan.md
- **Tasks:** .specify/specs/1-web-header-analyzer/tasks.md
- **Data Model:** .specify/specs/1-web-header-analyzer/data-model.md
- **Constitution:** .specify/memory/constitution.md (TDD mandate: P6)
## Architecture Reference
The existing monolith structure (read-only reference):
- `decode-spam-headers.py` lines 209-419: `Logger` class
- `decode-spam-headers.py` lines 421-439: `Verstring` class
- `decode-spam-headers.py` lines 441+: `SMTPHeadersAnalysis` class
- `decode-spam-headers.py` lines 1896-2027: `getAllTests()` defining all 106 tests
- `decode-spam-headers.py` lines 2437-6504: All test method implementations
Target modular structure:
```
backend/app/engine/
├── __init__.py
├── models.py # AnalysisRequest, AnalysisResult, TestResult, HopChainNode, SecurityAppliance
├── logger.py # Adapted Logger class (Python logging module)
├── parser.py # HeaderParser.parse(raw_text) -> list[ParsedHeader]
├── scanner_base.py # BaseScanner protocol: id, name, run(headers) -> TestResult | None
├── scanner_registry.py # ScannerRegistry: get_all(), get_by_ids(), list_tests()
├── analyzer.py # HeaderAnalyzer orchestrator with progress callback
└── scanners/
    ├── received_headers.py # Tests 1-3
    ├── forefront_antispam.py # Tests 12-16, 63-64
    ├── spamassassin.py # Tests 18-21, 74
    ├── ironport.py # Tests 27-29, 38-43, 88-89
    ├── mimecast.py # Tests 30, 61-62, 65
    ├── trendmicro.py # Tests 47-59, 97
    ├── barracuda.py # Tests 69-73
    ├── proofpoint.py # Tests 66-67
    ├── microsoft_general.py # Tests 31-34, 80, 83-85, 99-102
    └── general.py # Remaining tests: 4-11, 17, 22-26, 36-37, 44-46, 68, 75-79, 82, 86-87, 90-96, 98, 103-106
```
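The contract between `parser.py` and `scanner_base.py` can be sketched as follows. This is a minimal illustration, not the project's implementation: the fields of `ParsedHeader` and `TestResult` are assumptions (the real ones are defined in `data-model.md`), `ReceivedCountScanner` is a made-up example scanner, and the parser only unfolds continuation lines using the standard library's `email.parser`. MIME boundary handling is left out.

```python
import re
from dataclasses import dataclass
from email.parser import HeaderParser as _StdlibHeaderParser
from typing import Optional, Protocol


@dataclass(frozen=True)
class ParsedHeader:
    # Hypothetical fields; the real model lives in the project's data model.
    index: int
    name: str
    value: str


@dataclass(frozen=True)
class TestResult:
    # Hypothetical stand-in for the Pydantic model planned for models.py.
    test_id: int
    test_name: str
    severity: str
    analysis: str


class HeaderParser:
    @staticmethod
    def parse(raw_text: str) -> list[ParsedHeader]:
        # The stdlib parser stops at the first blank line and keeps folded
        # header values with their embedded line breaks.
        message = _StdlibHeaderParser().parsestr(raw_text)
        headers = []
        for index, (name, value) in enumerate(message.items()):
            # Unfold continuation lines into a single space-separated value.
            unfolded = re.sub(r"\r?\n[ \t]+", " ", value).strip()
            headers.append(ParsedHeader(index, name, unfolded))
        return headers


class BaseScanner(Protocol):
    # Structural interface: any class with these members is a scanner.
    id: int
    name: str

    def run(self, headers: list[ParsedHeader]) -> Optional[TestResult]: ...


class ReceivedCountScanner:
    # Illustrative scanner only; it is not one of the 106 original tests.
    id = 1
    name = "Received hop count (illustrative)"

    def run(self, headers: list[ParsedHeader]) -> Optional[TestResult]:
        hops = [h for h in headers if h.name.lower() == "received"]
        if not hops:
            return None  # Nothing to report when the header is absent.
        return TestResult(self.id, self.name, "info", f"{len(hops)} Received header(s)")
```

Returning `None` when a scanner has nothing to report mirrors the `TestResult | None` signature in the tree above. Using a `Protocol` rather than an ABC is one of the two options the plan allows and keeps scanner classes free of a shared base class.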
## Tasks
- [x] T007 Write failing tests (TDD Red) in `backend/tests/engine/test_parser.py` (header parsing with sample EML), `backend/tests/engine/test_scanner_registry.py` (discovery returns 106+ scanners, filtering by ID), and `backend/tests/engine/test_analyzer.py` (full pipeline with reference fixture). Create `backend/tests/fixtures/sample_headers.txt` with representative header set extracted from the existing test infrastructure
- [x] T008 Create `backend/app/engine/__init__.py` and `backend/app/engine/models.py`: Pydantic models for `AnalysisRequest`, `AnalysisResult`, `TestResult`, `HopChainNode`, `SecurityAppliance`. Refer to `.specify/specs/1-web-header-analyzer/data-model.md` for field definitions and severity enum values (spam→#ff5555, suspicious→#ffb86c, clean→#50fa7b, info→#bd93f9)
- [ ] T009 Create `backend/app/engine/logger.py`: extract the `Logger` class from `decode-spam-headers.py` (lines 209-419) and adapt it to use the Python `logging` module instead of writing directly to stdout
- [ ] T010 Create `backend/app/engine/parser.py`: extract header parsing from `SMTPHeadersAnalysis.collect()` and `getHeader()` (lines ~2137-2270). Expose `HeaderParser.parse(raw_text: str) -> list[ParsedHeader]`, including MIME boundary and line-break handling. Verify `test_parser.py` passes (TDD Green)
- [ ] T011 Create `backend/app/engine/scanner_base.py`: abstract `BaseScanner` (Protocol or ABC) with interface `id: int`, `name: str`, `run(headers: list[ParsedHeader]) -> TestResult | None`
- [ ] T012 Create `backend/app/engine/scanner_registry.py`: a `ScannerRegistry` with auto-discovery that exposes `get_all()`, `get_by_ids(ids)`, and `list_tests()`. Verify `test_scanner_registry.py` passes (TDD Green)
- [ ] T013 [P] Create scanner modules by extracting test methods from `SMTPHeadersAnalysis` into `backend/app/engine/scanners/`. Each file implements `BaseScanner`:
  - `backend/app/engine/scanners/received_headers.py` (tests 1-3)
  - `backend/app/engine/scanners/forefront_antispam.py` (tests 12-16, 63-64)
  - `backend/app/engine/scanners/spamassassin.py` (tests 18-21, 74)
  - `backend/app/engine/scanners/ironport.py` (tests 27-29, 38-43, 88-89)
  - `backend/app/engine/scanners/mimecast.py` (tests 30, 61-62, 65)
  - `backend/app/engine/scanners/trendmicro.py` (tests 47-59, 97)
  - `backend/app/engine/scanners/barracuda.py` (tests 69-73)
  - `backend/app/engine/scanners/proofpoint.py` (tests 66-67)
  - `backend/app/engine/scanners/microsoft_general.py` (tests 31-34, 80, 83-85, 99-102)
  - `backend/app/engine/scanners/general.py` (remaining tests: 4-11, 17, 22-26, 36-37, 44-46, 68, 75-79, 82, 86-87, 90-96, 98, 103-106)
- [ ] T014 Create `backend/app/engine/analyzer.py`: the `HeaderAnalyzer` orchestrator accepts an `AnalysisRequest`, uses `HeaderParser` and `ScannerRegistry`, runs scanners with a per-test timeout, collects results (marking failed tests with an error status per FR-25), and supports a progress callback of type `Callable[[int, int, str], None]`. Verify `test_analyzer.py` passes (TDD Green)
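
The orchestration described in T014 might look roughly like the sketch below. Several details are assumptions rather than decisions from the spec: the callback arguments are read as `(current, total, test_name)`, the `status` and `detail` fields on `TestResult` are hypothetical, the analyzer takes a plain list of scanners instead of an `AnalysisRequest` and `ScannerRegistry`, and the timeout is enforced with a thread pool, which stops waiting for a slow scanner but cannot kill it.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from dataclasses import dataclass
from typing import Callable, Optional

# Assumed meaning of the three arguments: (current, total, test_name).
ProgressCallback = Callable[[int, int, str], None]


@dataclass
class TestResult:
    # Hypothetical fields standing in for the Pydantic model.
    test_id: int
    test_name: str
    status: str  # "success" or "error" (assumed values)
    detail: str = ""


class HeaderAnalyzer:
    def __init__(self, scanners, timeout_seconds: float = 1.0) -> None:
        self._scanners = list(scanners)
        self._timeout = timeout_seconds

    def analyze(self, headers, progress: Optional[ProgressCallback] = None):
        results = []
        total = len(self._scanners)
        # One worker per scanner so a hung scanner cannot block later ones.
        pool = ThreadPoolExecutor(max_workers=max(total, 1))
        try:
            for position, scanner in enumerate(self._scanners, start=1):
                if progress is not None:
                    progress(position, total, scanner.name)
                future = pool.submit(scanner.run, headers)
                try:
                    result = future.result(timeout=self._timeout)
                except FutureTimeout:
                    # The worker thread keeps running; we only stop waiting.
                    result = TestResult(scanner.id, scanner.name, "error", "timed out")
                except Exception as exc:
                    # A failing scanner is recorded instead of aborting the run.
                    result = TestResult(scanner.id, scanner.name, "error", str(exc))
                if result is not None:
                    results.append(result)
        finally:
            pool.shutdown(wait=False)
        return results


# Toy scanners used only to exercise the three outcomes.
class _Ok:
    id, name = 1, "ok"

    def run(self, headers):
        return TestResult(self.id, self.name, "success", f"{len(headers)} header(s)")


class _Boom:
    id, name = 2, "boom"

    def run(self, headers):
        raise ValueError("bad header")


class _Slow:
    id, name = 3, "slow"

    def run(self, headers):
        time.sleep(0.6)
        return TestResult(self.id, self.name, "success")


events = []
analyzer = HeaderAnalyzer([_Ok(), _Boom(), _Slow()], timeout_seconds=0.2)
results = analyzer.analyze(
    ["Subject: hi"],
    progress=lambda current, total, name: events.append((current, total, name)),
)
```

If hard cancellation of a runaway scanner is required, a process-based executor would be the usual alternative, at the cost of pickling headers and results across process boundaries.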
## Completion
- [ ] `pytest backend/tests/engine/` passes with all tests green
- [ ] All 106+ tests are registered in the scanner registry (`ScannerRegistry.get_all()` returns 106+ scanners)
- [ ] Analysis of `backend/tests/fixtures/sample_headers.txt` produces results matching original CLI output
- [ ] `ruff check backend/` passes with zero errors
- [ ] Run `/speckit.analyze` to verify consistency