mgeeky-decode-spam-headers/Auto Run Docs/SpecKit-web-header-analyzer-Phase-02-Engine-Refactoring.md
2026-02-17 23:45:54 +01:00


Phase 02: Engine Refactoring

This phase decomposes the monolithic decode-spam-headers.py (6,931 lines, 106 test methods, 3 classes) into independently testable scanner modules that the API can invoke programmatically. It is a prerequisite for all user stories: without modular scanners, the backend cannot expose individual tests or stream progress. The work follows TDD Red-Green: write failing tests first, then implement the parser, scanner base, registry, 10 vendor-grouped scanner modules, and the analyzer orchestrator.
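
As an illustration of the Red-Green loop for the parser, the sketch below pairs a test with a minimal implementation that satisfies it. The `ParsedHeader` fields (`index`, `name`, `value`) and the folding behaviour are assumptions made for this example; the authoritative field definitions are in data-model.md, and the real parser is extracted from the monolith rather than written from scratch.

```python
from dataclasses import dataclass


@dataclass
class ParsedHeader:
    # Hypothetical fields; the authoritative definition is in data-model.md.
    index: int
    name: str
    value: str


class HeaderParser:
    """Minimal stand-in for the parser, not the extracted monolith logic."""

    @staticmethod
    def parse(raw_text: str) -> list[ParsedHeader]:
        # Normalise CRLF and bare CR so the line-break style does not matter.
        text = raw_text.replace("\r\n", "\n").replace("\r", "\n")
        headers: list[ParsedHeader] = []
        for line in text.lstrip("\n").split("\n"):
            if not line.strip():
                break  # the first blank line ends the header block
            if line[0] in " \t" and headers:
                # Folded continuation line: append it to the previous header.
                prev = headers[-1]
                prev.value = f"{prev.value} {line.strip()}".strip()
                continue
            name, sep, value = line.partition(":")
            if not sep:
                continue  # not a "Name: value" line; ignored in this sketch
            headers.append(ParsedHeader(len(headers), name.strip(), value.strip()))
        return headers


def test_parse_unfolds_continuation_lines() -> None:
    # Written first (Red); passes once HeaderParser.parse exists (Green).
    raw = (
        "Received: from a.example\r\n"
        "\tby b.example\r\n"
        "Subject: hello\r\n"
        "\r\n"
        "X-Not-A-Header: this is body text\r\n"
    )
    headers = HeaderParser.parse(raw)
    assert [h.name for h in headers] == ["Received", "Subject"]
    assert headers[0].value == "from a.example by b.example"
```

Run under pytest, the test fails while `HeaderParser` is missing and turns green once `parse` handles folded lines and stops at the blank line that separates headers from the body.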

Spec Kit Context

  • Feature: 1-web-header-analyzer
  • Specification: .specify/specs/1-web-header-analyzer/spec.md
  • Plan: .specify/specs/1-web-header-analyzer/plan.md
  • Tasks: .specify/specs/1-web-header-analyzer/tasks.md
  • Data Model: .specify/specs/1-web-header-analyzer/data-model.md
  • Constitution: .specify/memory/constitution.md (TDD mandate: P6)

Architecture Reference

The existing monolith structure (read-only reference):

  • decode-spam-headers.py lines 209–419: Logger class
  • decode-spam-headers.py lines 421–439: Verstring class
  • decode-spam-headers.py lines 441+: SMTPHeadersAnalysis class
  • decode-spam-headers.py lines 1896–2027: getAllTests() defining all 106 tests
  • decode-spam-headers.py lines 2437–6504: All test method implementations

Target modular structure:

backend/app/engine/
├── __init__.py
├── models.py              # AnalysisRequest, AnalysisResult, TestResult, HopChainNode, SecurityAppliance
├── logger.py              # Adapted Logger class (Python logging module)
├── parser.py              # HeaderParser.parse(raw_text) -> list[ParsedHeader]
├── scanner_base.py        # BaseScanner protocol: id, name, run(headers) -> TestResult | None
├── scanner_registry.py    # ScannerRegistry: get_all(), get_by_ids(), list_tests()
├── analyzer.py            # HeaderAnalyzer orchestrator with progress callback
└── scanners/
    ├── received_headers.py      # Tests 1–3
    ├── forefront_antispam.py    # Tests 12–16, 63–64
    ├── spamassassin.py          # Tests 18–21, 74
    ├── ironport.py              # Tests 27–29, 38–43, 88–89
    ├── mimecast.py              # Tests 30, 61–62, 65
    ├── trendmicro.py            # Tests 47–59, 97
    ├── barracuda.py             # Tests 69–73
    ├── proofpoint.py            # Tests 66–67
    ├── microsoft_general.py     # Tests 31–34, 80, 83–85, 99–102
    └── general.py               # Remaining tests: 4–11, 17, 22–26, 36–37, 44–46, 68, 75–79, 82, 86–87, 90–96, 98, 103–106
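
A minimal sketch of how the scanner contract and registry in this layout could fit together. Several things here are assumptions for illustration: the `ParsedHeader` and `TestResult` fields, the `register()` method (standing in for auto-discovery so the example stays self-contained), and `HopCountScanner` with its ID 9001, which is not one of the 106 original tests. Plain dataclasses replace the Pydantic models to avoid a third-party dependency. The severity values and colours are the ones listed in task T008.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Protocol


class Severity(str, Enum):
    SPAM = "spam"
    SUSPICIOUS = "suspicious"
    CLEAN = "clean"
    INFO = "info"


# Display colours given for the severity values in task T008.
SEVERITY_COLORS = {
    Severity.SPAM: "#ff5555",
    Severity.SUSPICIOUS: "#ffb86c",
    Severity.CLEAN: "#50fa7b",
    Severity.INFO: "#bd93f9",
}


@dataclass
class ParsedHeader:  # hypothetical fields
    name: str
    value: str


@dataclass
class TestResult:  # hypothetical fields
    test_id: int
    name: str
    severity: Severity
    details: str


class BaseScanner(Protocol):
    """Structural contract: any class with these members is a scanner."""

    id: int
    name: str

    def run(self, headers: list[ParsedHeader]) -> Optional[TestResult]: ...


class ScannerRegistry:
    def __init__(self) -> None:
        self._scanners: dict[int, BaseScanner] = {}

    def register(self, scanner: BaseScanner) -> None:
        # Explicit registration stands in for module auto-discovery.
        if scanner.id in self._scanners:
            raise ValueError(f"duplicate scanner id {scanner.id}")
        self._scanners[scanner.id] = scanner

    def get_all(self) -> list[BaseScanner]:
        return [self._scanners[i] for i in sorted(self._scanners)]

    def get_by_ids(self, ids: list[int]) -> list[BaseScanner]:
        # Unknown IDs are skipped silently in this sketch.
        return [self._scanners[i] for i in ids if i in self._scanners]

    def list_tests(self) -> list[tuple[int, str]]:
        return [(s.id, s.name) for s in self.get_all()]


class HopCountScanner:
    """Hypothetical example scanner, not taken from the monolith."""

    id = 9001
    name = "Received hop count"

    def run(self, headers: list[ParsedHeader]) -> Optional[TestResult]:
        hops = [h for h in headers if h.name.lower() == "received"]
        if not hops:
            return None  # nothing to report when the header is absent
        return TestResult(self.id, self.name, Severity.INFO, f"{len(hops)} hop(s)")
```

Using a `Protocol` rather than an ABC means scanner classes need no common base class; whether that or an ABC suits the real code better is a design choice left to T011.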

Tasks

  • T007 Write failing tests (TDD Red) in backend/tests/engine/test_parser.py (header parsing with sample EML), backend/tests/engine/test_scanner_registry.py (discovery returns 106+ scanners, filtering by ID), and backend/tests/engine/test_analyzer.py (full pipeline with reference fixture). Create backend/tests/fixtures/sample_headers.txt with representative header set extracted from the existing test infrastructure
  • T008 Create backend/app/engine/__init__.py and backend/app/engine/models.py — Pydantic models for AnalysisRequest, AnalysisResult, TestResult, HopChainNode, SecurityAppliance. Refer to .specify/specs/1-web-header-analyzer/data-model.md for field definitions and severity enum values (spam→#ff5555, suspicious→#ffb86c, clean→#50fa7b, info→#bd93f9)
  • T009 Create backend/app/engine/logger.py: extract Logger class from decode-spam-headers.py (lines 209–419), adapt to use Python logging module instead of direct stdout
  • T010 Create backend/app/engine/parser.py: extract header parsing from SMTPHeadersAnalysis.collect() and getHeader() (lines ~2137–2270). Expose HeaderParser.parse(raw_text: str) -> list[ParsedHeader] including MIME boundary and line-break handling. Verify test_parser.py passes (TDD Green)
  • T011 Create backend/app/engine/scanner_base.py: abstract BaseScanner (Protocol or ABC) with interface: id: int, name: str, run(headers: list[ParsedHeader]) -> TestResult | None (implemented as a Protocol in backend/app/engine/scanner_base.py)
  • T012 Create backend/app/engine/scanner_registry.py: ScannerRegistry with auto-discovery: get_all(), get_by_ids(ids), list_tests(). Verify test_scanner_registry.py passes (TDD Green)
  • T013 [P] Create scanner modules by extracting test methods from SMTPHeadersAnalysis into backend/app/engine/scanners/. Each file implements BaseScanner:
    • backend/app/engine/scanners/received_headers.py (tests 1–3)
    • backend/app/engine/scanners/forefront_antispam.py (tests 12–16, 63–64)
    • backend/app/engine/scanners/spamassassin.py (tests 18–21, 74)
    • backend/app/engine/scanners/ironport.py (tests 27–29, 38–43, 88–89)
    • backend/app/engine/scanners/mimecast.py (tests 30, 61–62, 65)
    • backend/app/engine/scanners/trendmicro.py (tests 47–59, 97)
    • backend/app/engine/scanners/barracuda.py (tests 69–73)
    • backend/app/engine/scanners/proofpoint.py (tests 66–67)
    • backend/app/engine/scanners/microsoft_general.py (tests 31–34, 80, 83–85, 99–102)
    • backend/app/engine/scanners/general.py (remaining tests: 4–11, 17, 22–26, 36–37, 44–46, 68, 75–79, 82, 86–87, 90–96, 98, 103–106)
  • T014 Create backend/app/engine/analyzer.py: HeaderAnalyzer orchestrator: accepts AnalysisRequest, uses HeaderParser + ScannerRegistry, runs scanners with per-test timeout, collects results (marking failed tests with error status per FR-25), supports progress callback Callable[[int, int, str], None]. Verify test_analyzer.py passes (TDD Green)
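
The orchestration described in T014 could be sketched roughly as follows. This is a simplified illustration under several assumptions: it takes an already parsed header list and a scanner list directly instead of an AnalysisRequest, the `TestResult` fields and the `"ok"`/`"error"` status values are hypothetical, and the progress callback arguments are interpreted as (current index, total, scanner name), which the task description does not spell out. The timeout uses a worker thread, so a scanner that exceeds it is reported as failed but cannot be forcibly stopped.

```python
import concurrent.futures as cf
from dataclasses import dataclass
from typing import Callable, Optional

# Assumed meaning of the arguments: (current index, total, scanner name).
ProgressCallback = Callable[[int, int, str], None]


@dataclass
class TestResult:  # hypothetical fields
    test_id: int
    name: str
    status: str  # "ok" or "error" in this sketch
    details: str = ""


class HeaderAnalyzer:
    def __init__(self, scanners, timeout_s: float = 5.0) -> None:
        self._scanners = list(scanners)
        self._timeout_s = timeout_s

    def analyze(
        self, headers, progress: Optional[ProgressCallback] = None
    ) -> list[TestResult]:
        results: list[TestResult] = []
        total = len(self._scanners)
        for index, scanner in enumerate(self._scanners, start=1):
            if progress is not None:
                progress(index, total, scanner.name)
            result = self._run_one(scanner, headers)
            if result is not None:  # scanners return None when not applicable
                results.append(result)
        return results

    def _run_one(self, scanner, headers) -> Optional[TestResult]:
        # One single-worker pool per scanner, so a hung scanner cannot
        # block the ones that follow it.
        pool = cf.ThreadPoolExecutor(max_workers=1)
        future = pool.submit(scanner.run, headers)
        try:
            return future.result(timeout=self._timeout_s)
        except cf.TimeoutError:
            return TestResult(scanner.id, scanner.name, "error", "timed out")
        except Exception as exc:
            # A failing scanner is recorded as an error instead of
            # aborting the whole analysis.
            detail = f"{type(exc).__name__}: {exc}"
            return TestResult(scanner.id, scanner.name, "error", detail)
        finally:
            pool.shutdown(wait=False)  # do not wait for a timed-out thread
```

Because Python threads cannot be killed, a timed-out scanner keeps running in the background until it returns; if hard cancellation matters, running scanners in separate processes would be one alternative to consider.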

Completion

  • pytest backend/tests/engine/ passes with all tests green
  • All 106+ tests are registered in the scanner registry (ScannerRegistry.get_all() returns 106+ scanners)
  • Analysis of backend/tests/fixtures/sample_headers.txt produces results matching original CLI output
  • ruff check backend/ passes with zero errors
  • Run /speckit.analyze to verify consistency