Merged
1 change: 0 additions & 1 deletion databricks.yml
@@ -86,7 +86,6 @@ variables:
default: "811de200-7873-4ab3-a435-da3cddf13c36"
description: Service Principal client ID for production environment


# ============================================================
# Testing Configuration
# Controls automated test behavior during job execution.
39 changes: 39 additions & 0 deletions docs/Adding and selecting packages.md
@@ -0,0 +1,39 @@
# Adding Packages

## Warning
- AI tends to suggest packages unnecessarily, and to list them as imports


## Checks before adding packages
- Is the package already in use within the project?
- Is a similar package already in use? If so, can we consolidate on just one of the two? (unit tests should give confidence when changing a pre-existing package)
  - search the codebase (to see where it is used)
  - check requirements.txt and requirements-dev.txt
- Is the package still regularly updated, trusted, maintained by a team rather than an individual, and widely downloaded?
  - E.g. releases https://pypi.org/project/pandas/#history
  - E.g. versioning and downloads https://pepy.tech/projects/pandas?timeRange=threeMonths&category=version&includeCIDownloads=true&granularity=daily&viewType=line&versions=3.0.1%2C3.0.0%2C3.0.0rc2
- Is it necessary?
  - niceties such as packages that only provide nicer syntax can make AI less effective and raise the knowledge needed to engage with the ecosystem, so the pros and cons should be weighed, potentially including the team
  - every package increases the maintenance surface
- Is there a better package?
  - More downloaded/maintained
  - Related to an existing package
  - Recommended for Databricks
  - More future-proof
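The release-history check above can be partly automated. A minimal sketch using the public PyPI JSON API (the `latest_release_info` helper name is ours, not an existing tool):

```python
import json
import urllib.request

def pypi_json_url(package: str) -> str:
    """Build the public PyPI JSON API URL for a package."""
    return f"https://pypi.org/pypi/{package}/json"

def latest_release_info(package: str) -> dict:
    """Return the latest version and its upload time (makes a network call)."""
    with urllib.request.urlopen(pypi_json_url(package)) as resp:
        data = json.load(resp)
    version = data["info"]["version"]
    files = data["releases"].get(version, [])
    return {
        "version": version,
        "last_upload": files[0]["upload_time"] if files else "unknown",
    }
```

A stale `last_upload` date is a quick first signal that a package may be abandoned; download counts still need pepy.tech or similar.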


## How to add packages

- Add it to the global and individual AI prompt instructions, so the AI knows we are using the package and can recommend its usage
- Add it to the toml (if it needs configuration)
- Add it to requirements.txt and requirements-dev.txt so installs are global rather than local, and usable in the git pipeline
  - requirements.txt should match requirements-dev.txt minus packages not used in prod, such as test and lint tooling
  - requirements-dev.txt is for all environments except prod
  - pin version numbers when adding packages to these files
  - include a comment on why each package was added
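A sketch of how the two files could relate (package names and version pins are illustrative, not the project's actual dependencies):

```text
# requirements.txt (everything used in prod, pinned, with a reason per line)
pandas==2.2.3        # dataframe wrangling in the silver-layer notebooks
requests==2.32.3     # calling the metadata API

# requirements-dev.txt (all environments except prod: prod deps plus dev tooling)
-r requirements.txt
pytest==8.3.3        # unit tests
black==24.10.0       # formatting
```

The `-r requirements.txt` include line keeps the two files from drifting apart.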


## Extras
- we are not currently building wheels of our own libraries to apply to clusters (as of 18/3/26)
- [JAR may be relevant for wheel](https://learn.microsoft.com/en-gb/azure/databricks/release-notes/release-types)
- [sharing our packages across dbx ... may not be useful as may not help with git](https://learn.microsoft.com/en-gb/azure/databricks/compute/serverless/dependencies#create-common-utilities-to-share-across-your-workspace)
Empty file.
50 changes: 29 additions & 21 deletions docs/TODO.md
@@ -1,34 +1,42 @@
# Not covered in POC
- **try moving vars to const files etc. and out of the bundle**
  - e.g. `include: - 'variables/*.yml'`
  - if includes execute in order we might be able to put vars in folders based on the headings; may affect running in a personal area before deploy?

# Not done
- explore what black does in toml

# Changes to make for the POC review qqqq
- get input on adhoc analysis template and file naming
- add the adhoc readme to someone's feedback objectives

# Not done qqqq
- add black to linting
- [Data quality](https://www.databricks.com/discover/pages/data-quality-management) has excellent coverage and should be used to plan next steps and best-practice examples
- read https://blogs.perficient.com/2025/03/19/delta-live-tables-and-great-expectations/
- need a public repo for branch rules; they're not tweaked so can't just be exported, but:
  - can set deployment rules
  - and rules per branch via a .yml file
- github auto merge staging
- **version numbering**
- enable copilot auto PR
  - recommend enabling it in branch rules
  - and requiring one reviewer
- /addinstructions works as a command in the Databricks AI, so instructions can be put in the user space or workspace
- lakehouse monitoring!
- separate tests that require re-running pipelines from those that don't (data quality, sql sp)
  - different job per test type and different notebook


- separating dabs
- environment branch rules: stop PRs being directed against staging and prod; make only the correct branches mergeable (can always turn the rule off in an emergency)
- ~~version numbering~~ we can do release versioning manually; it makes more sense for us, but it is not vital
- ~~enable copilot auto pr~~
- **IMP** separate tests that require re-running pipelines from those that don't (data quality, sql sp)
  - different job per test type and different notebook


- ~~separating dabs~~ don't see the advantage; the team may have a reason why it's desired
  - can do bronze, silver, gold etc.
  - means modular deployment
  - cicd would need to detect changes in folders in order to know which dab to deploy
  - wheels may again be needed to share code between dabs, but maybe not
  - clusters dealing with smaller sizes
  - how to test

- are the requirements txt files still needed? Is it due to the python version and setup in the git pipeline that I am not just using the toml?
- requirements.txt and requirements-dev.txt need sorting; individual installs need removing




[review existing policies](https://adb-295718430158257.17.azuredatabricks.net/compute/policies?o=295718430158257)


- config at the workspace level, outside of git
  - this may be how we should manage the config

# Something here may help separate unit, integration, etc. tests
[An example of some tests in NHS repo](https://github.com/nhsengland/NHSE_probabilistic_linkage/blob/main/tests/DAE_only_tests.py)
[An example of some tests in NHS repo2](https://github.com/nhsengland/NHSE_probabilistic_linkage/tree/main/tests)
- # MAGIC %run ../utils/dataset_ingestion_utils
- dbutils.notebook.run('../../notebooks_linking/clerical_review_evaluation', 0, {"params": params_serialized})
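One way to implement the test split noted above, sketched with a pytest marker (the `pipeline` marker name is our assumption, not an existing convention in the repo):

```toml
# pyproject.toml: register a marker for tests that need a pipeline re-run
[tool.pytest.ini_options]
markers = [
    "pipeline: tests that require re-running a pipeline",
]
```

A cheap data-quality job could then run `pytest -m "not pipeline"`, and a separate heavier job could run `pytest -m pipeline`, giving the "different job per test type" split without moving files around.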
179 changes: 1 addition & 178 deletions docs/phil-feedback.md
@@ -1,178 +1 @@
# Feedback

Collect feedback here qqqq


## Concerns to check
- Environments and Packages in https://nhsdigital.github.io/rap-community-of-practice/training_resources/python/intro-to-python/ (am I doing enough here? is it set by my toml?)

## Reviewing Other DBX Repos
- [Nice standard Git flow explanation a resource for getting a clear feel of the process and all things github](https://docs.github.com/en/get-started/using-github/github-flow)
- add to git docs once other changes made


## Make a RAP doc
[RAP](https://github.com/NHSDigital/rap-community-of-practice/tree/main/docs)
[what rap is why it matters high level gov stuff long strategy](https://analysisfunction.civilservice.gov.uk/policy-store/reproducible-analytical-pipelines-strategy/)

"What you need for RAP
There is no specific tool that is required to build a RAP, but both R and Python provide the power and flexibility to carry out end-to-end analytical processes, from data source to final presentation.

Once the minimum RAP has been implemented statisticians and analysts should attempt to further develop their pipeline using:

functions or code modularity
unit testing of functions
error handling for functions
documentation of functions
packaging
code style
input data validation
logging of data and the analysis
continuous integration
dependency management"

[Nice rap blog](https://analysisfunction.civilservice.gov.uk/blog/the-nature-of-reproducible-analytical-pipelines-rap/)
- should we do unit tests together initially, or pair code initially
“Just-in-time” learning is the best approach
"As we progressed through the project we were inevitably presented with new concepts or tasks unfamiliar to us as a working group. We got into the habit of tackling these with bespoke just-in-time training sessions for the team. When it came to test parts of the code we had developed, we held a unit testing session, facilitated by the BPI Team."
..."We had lots of sessions like this; when we needed to resolve merge conflicts in Git, complete peer reviewing and develop documentation"


add git branch and commit guidance to the doc

https://analysisfunction.civilservice.gov.uk/blog/the-nature-of-reproducible-analytical-pipelines-rap/
- power of git
- was a learning curve
- "It’s a completely new way of working for us so it took some time for us to get to grips with it, but we got there. If we were to do this project again, I would push my team to use this to host their code from the first line they developed. I would also encourage them to get into the habit of committing to our repository much more frequently. When it came to quality assuring our pipeline, time elapsed between commits meant we had large chunks of code to check. This would have been a much more streamlined and manageable process had we used Git little and often from the start."


[cute rap overview friendly read](https://nhsdigital.github.io/rap-community-of-practice/#what-is-rap)

"Recommendation 7: promote and resource ‘Reproducible Analytical Pipelines’ ... as the minimum standard for academic and NHS data analysis"
(Data Saves Lives, 2022 government strategy report)

**Can we do this**
"Our RAP Service
The NHS England Data Science team offers support to NHSE teams looking to implement RAP.

We'll:

Work alongside your team for 6-12 weeks
Work with you to recreate one of your processes in the RAP way
Deliver training in Git, Python, R, Databricks, and anything else you'll need
Learn more:"

[Rap tools git pyspark etc great](https://nhsdigital.github.io/rap-community-of-practice/training_resources/git/introduction-to-git/)
- this is really good
- Learning hub python course 2 days
- https://analysisfunction.civilservice.gov.uk/training/introduction-to-python/
- learning hub 2 day pyspark
- https://analysisfunction.civilservice.gov.uk/training/introduction-to-pyspark/

[Rap 7min explanation 7 min questions vid nhs](https://www.youtube.com/watch?v=npEh7RmdTKM)
- python git etc
- process mapping 5.13
- advises writing function names at the beginning, just don't populate them yet
- write the test names for them too, but don't create them yet

- DBX git testing, setting us up well for rap


- add this to code quality and the peer review doc
https://nhsdigital.github.io/rap-community-of-practice/training_resources/coding_tips/refactoring-guide/



[Time management starting to use Rap](https://nhsdigital.github.io/rap-community-of-practice/implementing_RAP/rap-readiness/)
- e.g. 15 hours of tool training recommended
- thin slice is what we are after as our next step I think!
- target a level of RAP <- we should do this

[Preparing for RAP List of code, docs, tool resource to help get rap ready](https://nhsdigital.github.io/rap-community-of-practice/tags/#preparing-for-rap)
- really useful list here
- coding tips section


Add to unit testing readme and PR
https://nhsdigital.github.io/rap-community-of-practice/training_resources/python/unit-testing/
- provides good guidance
- good that all these docs say python and pytest

- add to the PR: does the python have docstrings?
```python
"""
Converts temperatures in Fahrenheit to Celsius.

Takes the input temperature (in Fahrenheit), calculates the value of
the same temperature in Celsius, and then returns this value.

Args:
    temp: a float value representing a temperature in Fahrenheit.

Returns:
    The input temperature converted into Celsius

Example:
    fahrenheit_to_celsius(temp=77)
    >>> 25.0
"""
```
- this will help us identify when to use other functions, and how to adapt them if they need to change to support more callers
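A complete function in that style; the conversion body is the standard formula, added by us for illustration:

```python
def fahrenheit_to_celsius(temp: float) -> float:
    """Converts temperatures in Fahrenheit to Celsius.

    Args:
        temp: a float value representing a temperature in Fahrenheit.

    Returns:
        The input temperature converted into Celsius.
    """
    return (temp - 32) * 5 / 9
```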


IDE and python over jupyter
"Jupyter notebooks require quite a bit of arduous coding gymnastics to perform what in Python files would be simple imports"
but notebooks can be made easier to import, with utility functions that turn them into python libraries of a sort: https://jupyter-notebook.readthedocs.io/en/4.x/examples/Notebook/rstversions/Importing%20Notebooks.html


Create a Next Steps doc (future task)

PT recommends
https://nhsdigital.github.io/rap-community-of-practice/implementing_RAP/thin-slice-strategy/
focus on one ingestion → dashboard process then slice

implement, review as a team, try to maximise RAP good practice using the pipeline

Everyone a reviewer for PRs; take detours for simple training dives on anything that feels not right or is a blocker.
The POC project contains good-practice docs, references, and some training references for pyspark, python, unit testing, etc.


[RAP massive gov doc, maybe useful if looking for something specific; it's long](https://www.gov.uk/guidance/the-aqua-book)

[How to refactor](https://nhsdigital.github.io/rap-community-of-practice/training_resources/coding_tips/refactoring-guide/)

[click newbie for git guide for rap](https://nhsdigital.github.io/rap-community-of-practice/implementing_RAP/skills_for_rap/git_for_rap/)

[Quality coding, e.g doc manual coding in repo?](https://nhsdigital.github.io/rap-community-of-practice/implementing_RAP/workflow/quality-assuring-analytical-outputs/)


[RAP level really shows we are on the right track](https://nhsdigital.github.io/rap-community-of-practice/introduction_to_RAP/levels_of_RAP/)
- we probably aim for silver as it should achieve a lot while ensuring a baseline
- we are also looking at hitting most of the gold with cicd, databricks, and dabs


[python style guide if interested, can be useful not just to lint but to help name etc](https://peps.python.org/pep-0008/)

https://nhsdigital.github.io/rap-community-of-practice/training_resources/pyspark/pyspark-style-guide/
"We avoid using pandas or koalas because it adds another layer of learning. The PySpark method chaining syntax is easy to learn, easy to read, and will be familiar for anyone who has used SQL."


pyspark enables dynamic, modular queries via lazy evaluation.

unit tests: expected values, error handling, edge cases <- already got this comment but make sure I put it in
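A sketch of those three kinds of test in pytest; the `mean` function is a hypothetical example, not project code:

```python
import pytest

def mean(values: list) -> float:
    """Arithmetic mean; raises ValueError on an empty list."""
    if not values:
        raise ValueError("mean() of an empty list")
    return sum(values) / len(values)

def test_expected_value():
    assert mean([1.0, 2.0, 3.0]) == 2.0

def test_edge_case_single_item():
    assert mean([5.0]) == 5.0

def test_error_handling_empty_input():
    with pytest.raises(ValueError):
        mean([])
```

Covering the expected value, an edge case, and the error path per function is a cheap habit that catches most regressions.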

python > pandas

[pyspark RAP](https://nhsdigital.github.io/rap-community-of-practice/training_resources/pyspark/) put this in PR

[another pyguide](https://github.com/google/styleguide/blob/gh-pages/pyguide.md)
[another approachable longer doc about RAP and what it involves](https://best-practice-and-impact.github.io/qa-of-code-guidance/principles.html)


[gov rap policies](https://github.com/NHSDigital/rap-community-of-practice/blob/main/docs/introduction_to_RAP/gov-policy-on-rap.md)


# Not really needed but nice to comment somewhere
[open source 1](https://www.gov.uk/guidance/be-open-and-use-open-source)
[open source 2](https://gds.blog.gov.uk/2017/09/04/the-benefits-of-coding-in-the-open/)
[open source 3](https://github.com/nhsx/open-source-policy/blob/main/open-source-policy.md)

# Feedback