Skip to content

feat(MachineLearning/PACLearning): add VersionSpace abstraction#592

Merged
SamuelSchlesinger merged 6 commits into
leanprover:mainfrom
Zetetic-Dhruv:version-space-v2
May 29, 2026
Merged

feat(MachineLearning/PACLearning): add VersionSpace abstraction#592
SamuelSchlesinger merged 6 commits into
leanprover:mainfrom
Zetetic-Dhruv:version-space-v2

Conversation

@Zetetic-Dhruv
Copy link
Copy Markdown
Contributor

Adds the classical version space abstraction (Mitchell 1982, Angluin 1980) as a companion to the PAC learning definitions from #492.

  • VersionSpace C S: the subset of C whose concepts agree with S on every sample point
  • versionSpace_subset, versionSpace_empty_sample: sanity lemmas
  • versionSpace_antitone: more data gives a smaller version space
  • IsConsistent A C: predicate on learners whose output always lies in the version space
  • mem_versionSpace_of_realizable, versionSpace_nonempty_of_realizable: realizable-case bridge

Foundation for downstream proofs (Sauer-Shelah, PAC lower bounds, infinite NFL).

Adds the classical version space abstraction (Mitchell 1982, Angluin 1980)
to the PAC learning module. Version space — the subset of a concept class
consistent with observed labeled data — is the structural substrate every
sample complexity theorem operates on (Baby PAC, Sauer-Shelah, PAC lower
bounds, NFL). This PR complements leanprover#492 by providing:

- VersionSpace C S: the consistent subset of C given sample S
- versionSpace_subset, versionSpace_empty_sample: sanity lemmas
- versionSpace_antitone: structural monotonicity (more data → smaller VS),
  dual to sample complexity monotonicity
- IsConsistent A C: predicate on learners (output always in version space)
- IsConsistent.output_mem_conceptClass, output_consistent: consistent learner
  properties
- mem_versionSpace_of_realizable: under realizable data, the target concept
  lies in the version space (the realizable-case bridge)
- versionSpace_nonempty_of_realizable: corollary

No measure theory, no new Mathlib dependencies, ~150 lines, 0 sorry.

Together with leanprover#492 these suffice to state the Baby PAC theorem, Sauer-Shelah
sample complexity, PAC lower bounds, and NFL as structural statements rather
than ad-hoc computations.
@Zetetic-Dhruv
Copy link
Copy Markdown
Contributor Author

@SamuelSchlesinger

Copy link
Copy Markdown
Collaborator

@SamuelSchlesinger SamuelSchlesinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to see the bridge that makes this useful: version-space membership is exactly zero empirical error. I think that is what will motivate this PR and make the downstream work easier. The stuff you do have is sound and I appreciate this PR overall, my requests are mostly just stylistic commentary and generalization requests.

Comment thread Cslib/MachineLearning/PACLearning/VersionSpace.lean
Comment thread Cslib/MachineLearning/PACLearning/VersionSpace.lean
Comment thread Cslib/MachineLearning/PACLearning/VersionSpace.lean Outdated
Comment thread references.bib
Comment thread Cslib/MachineLearning/PACLearning/VersionSpace.lean Outdated
- File docstring above section; c implicit; rename output_consistent → output_agrees
- Add Mitchell1977 citation; generalize antitone via versionSpace_reindex; add mono_C
- Add empirical{Miscount,Measure,Error} defs + combinatorial and measure-theoretic bridges
- Add Realizable predicate + Realizable.versionSpace_nonempty
- Add ae_mem_versionSpace_of_realizable
- Add IsConsistent.{empiricalMiscount,empiricalError}_eq_zero corollaries
@Zetetic-Dhruv
Copy link
Copy Markdown
Contributor Author

Hope the PR looks better now @SamuelSchlesinger

@SamuelSchlesinger
Copy link
Copy Markdown
Collaborator

It looks great, I think the last bit of the bridge is:

theorem empiricalError_eq_div [DecidableEq β] {m : ℕ} (hm : 0 < m)
      (h : α → β) (S : LabeledSample α β m) :
      empiricalError h S = (empiricalMiscount h S : ℝ≥0∞) / m

Would you implement that? Then it's good to go.

@Zetetic-Dhruv
Copy link
Copy Markdown
Contributor Author

It looks great, I think the last bit of the bridge is:

theorem empiricalError_eq_div [DecidableEq β] {m : ℕ} (hm : 0 < m)
      (h : α → β) (S : LabeledSample α β m) :
      empiricalError h S = (empiricalMiscount h S : ℝ≥0∞) / m

Would you implement that? Then it's good to go.

Hey, done as you suggested, sorry for missing this!

@Zetetic-Dhruv
Copy link
Copy Markdown
Contributor Author

✔ [126/126] Built mk_all:exe (988ms)
The file 'Cslib.lean' is out of date: run lake exe mk_all to update it
Error: Process completed with exit code 1.

Source of error in action

@Zetetic-Dhruv
Copy link
Copy Markdown
Contributor Author

Zetetic-Dhruv commented May 29, 2026

Regenerated cslib.lean. Can someone rerun the ci-check?

@SamuelSchlesinger
Copy link
Copy Markdown
Collaborator

@fmontesi whenever you'd like, we can merge this.

@SamuelSchlesinger SamuelSchlesinger added this pull request to the merge queue May 29, 2026
@fmontesi
Copy link
Copy Markdown
Collaborator

Excellent, thank you!

Merged via the queue into leanprover:main with commit 45072a3 May 29, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants