diff --git a/LICENCE b/LICENCE
new file mode 100644
index 0000000..63b4b68
--- /dev/null
+++ b/LICENCE
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) [year] [fullname]
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
\ No newline at end of file
diff --git a/docs/phil-feedback.md b/docs/phil-feedback.md
index 4af0f41..4471961 100644
--- a/docs/phil-feedback.md
+++ b/docs/phil-feedback.md
@@ -1,3 +1,178 @@
 # Feedback
-Collect feedback here qqqq
\ No newline at end of file
+Collect feedback here qqqq
+
+
+## Concerns to check
+- Environments and Packages in https://nhsdigital.github.io/rap-community-of-practice/training_resources/python/intro-to-python/ am I doing enough here? Is it set by my toml?
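+The toml concern above can be made concrete. A minimal sketch of declaring pinned dependencies in `pyproject.toml` (the project name and version pins are illustrative assumptions, not our actual setup):

```toml
[project]
name = "poc-pipeline"  # illustrative name, not the real one
version = "0.1.0"
requires-python = ">=3.10"
dependencies = [
    # pinning versions (or lower bounds) here is what makes the
    # environment reproducible across machines
    "pyspark>=3.5",
    "pytest>=8.0",
]
```

+With dependencies declared here, `pip install .` (or a lockfile tool) recreates the same environment, which covers the "dependency management" item on the RAP list below.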
+
+## Reviewing Other DBX Repos
+- [Nice standard Git flow explanation: a resource for getting a clear feel of the process and all things GitHub](https://docs.github.com/en/get-started/using-github/github-flow)
+  - add to git docs once other changes made
+
+
+## Make a RAP doc
+[RAP](https://github.com/NHSDigital/rap-community-of-practice/tree/main/docs)
+[what RAP is and why it matters: high-level gov stuff, long strategy](https://analysisfunction.civilservice.gov.uk/policy-store/reproducible-analytical-pipelines-strategy/)
+
+"What you need for RAP
+There is no specific tool that is required to build a RAP, but both R and Python provide the power and flexibility to carry out end-to-end analytical processes, from data source to final presentation.
+
+Once the minimum RAP has been implemented statisticians and analysts should attempt to further develop their pipeline using:
+
+- functions or code modularity
+- unit testing of functions
+- error handling for functions
+- documentation of functions
+- packaging
+- code style
+- input data validation
+- logging of data and the analysis
+- continuous integration
+- dependency management"
+
+[Nice RAP blog](https://analysisfunction.civilservice.gov.uk/blog/the-nature-of-reproducible-analytical-pipelines-rap/)
+- should we do unit tests together initially, or pair code initially?
+- “Just-in-time” learning is the best approach:
+"As we progressed through the project we were inevitably presented with new concepts or tasks unfamiliar to us as a working group. We got into the habit of tackling these with bespoke just-in-time training sessions for the team. When it came to test parts of the code we had developed, we held a unit testing session, facilitated by the BPI Team."
+..."We had lots of sessions like this; when we needed to resolve merge conflicts in Git, complete peer reviewing and develop documentation"
+
+
+doc git branch and commits, add
+
+https://analysisfunction.civilservice.gov.uk/blog/the-nature-of-reproducible-analytical-pipelines-rap/
+- power of git
+- was a learning curve
+- "It’s a completely new way of working for us so it took some time for us to get to grips with it, but we got there. If we were to do this project again, I would push my team to use this to host their code from the first line they developed. I would also encourage them to get into the habit of committing to our repository much more frequently. When it came to quality assuring our pipeline, time elapsed between commits meant we had large chunks of code to check. This would have been a much more streamlined and manageable process had we used Git little and often from the start."
+
+
+[cute RAP overview, friendly read](https://nhsdigital.github.io/rap-community-of-practice/#what-is-rap)
+
+"Recommendation 7: promote and resource ‘Reproducible Analytical Pipelines’ ... as the minimum standard for academic and NHS data analysis"
+(Data Saves Lives, 2022 government strategy report)
+
+**Can we do this?**
+"Our RAP Service
+The NHS England Data Science team offers support to NHSE teams looking to implement RAP.
+
+We'll:
+
+- Work alongside your team for 6-12 weeks
+- Work with you to recreate one of your processes in the RAP way
+- Deliver training in Git, Python, R, Databricks, and anything else you'll need
+
+Learn more:"
+
+[RAP tools: git, pyspark etc., great](https://nhsdigital.github.io/rap-community-of-practice/training_resources/git/introduction-to-git/)
+- this is really good
+- Learning hub Python course, 2 days
+  - https://analysisfunction.civilservice.gov.uk/training/introduction-to-python/
+- Learning hub PySpark course, 2 days
+  - https://analysisfunction.civilservice.gov.uk/training/introduction-to-pyspark/
+
+[RAP 7 min explanation + 7 min questions vid, NHS](https://www.youtube.com/watch?v=npEh7RmdTKM)
+- python, git etc.
+- process mapping 5.13
+- advice: write function names at the beginning, just don't populate them yet
+  - write the test names for them too, but don't create them yet
+
+- DBX git testing, setting us up well for RAP
+
+
+- add this to the code quality and peer review doc:
+https://nhsdigital.github.io/rap-community-of-practice/training_resources/coding_tips/refactoring-guide/
+
+
+
+[Time management when starting to use RAP](https://nhsdigital.github.io/rap-community-of-practice/implementing_RAP/rap-readiness/)
+- e.g. 15 hours of tool training recommended
+- the thin slice is what we are after as our next step, I think!
+- target a level of RAP <- we should do this
+
+[Preparing for RAP: list of code, docs and tool resources to help get RAP ready](https://nhsdigital.github.io/rap-community-of-practice/tags/#preparing-for-rap)
+- really useful list here
+- coding tips section
+
+
+Add to unit testing readme and PR:
+https://nhsdigital.github.io/rap-community-of-practice/training_resources/python/unit-testing/
+- provides good guidance
+- good that all these docs say Python and pytest
+
+- add to PR: does the Python code have docstrings?
+  """
+  Converts temperatures in Fahrenheit to Celsius.
+
+  Takes the input temperature (in Fahrenheit), calculates the value of
+  the same temperature in Celsius, and then returns this value.
+
+  Args:
+      temp: a float value representing a temperature in Fahrenheit.
+
+  Returns:
+      The input temperature converted into Celsius.
+
+  Example:
+      >>> fahrenheit_to_celsius(temp=77)
+      25.0
+  """
+  - this will help us identify when to use other functions, and how to adapt them if they need to change to support more callers
+
+
+IDE and Python files over Jupyter
+- "Jupyter notebooks require quite a bit of arduous coding gymnastics to perform what in Python files would be simple imports"
+- but you can make notebooks easier to import with utility functions that turn them into Python modules, sort of: https://jupyter-notebook.readthedocs.io/en/4.x/examples/Notebook/rstversions/Importing%20Notebooks.html
+
+
+Create a Next Steps doc
+- Next Steps: future task
+
+PT recommends:
+https://nhsdigital.github.io/rap-community-of-practice/implementing_RAP/thin-slice-strategy/
+- focus on one ingestion → dashboard process, then slice
+- implement, review as a team, try to maximise RAP good practice using the pipeline
+- everyone a reviewer for PRs; take detours for simple training dives on anything that feels not right or is a blocker
+- the POC project contains good-practice docs, references and some training references for PySpark and Python, unit testing etc.
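+The docstring pattern quoted above can be dropped onto a real function, together with the unit tests (expected value, edge case, error handling) the NHS docs recommend. A sketch, runnable under pytest; the `TypeError` check is my assumption about sensible error handling, not from the source:

```python
def fahrenheit_to_celsius(temp: float) -> float:
    """
    Converts temperatures in Fahrenheit to Celsius.

    Args:
        temp: a float value representing a temperature in Fahrenheit.

    Returns:
        The input temperature converted into Celsius.

    Example:
        >>> fahrenheit_to_celsius(temp=77)
        25.0
    """
    # Reject non-numeric input early (assumed error-handling policy)
    if not isinstance(temp, (int, float)):
        raise TypeError("temp must be a number")
    return (temp - 32) * 5 / 9


# pytest-style tests: expected value, edge case, error handling
def test_expected_value():
    assert fahrenheit_to_celsius(temp=77) == 25.0


def test_edge_case_freezing_point():
    assert fahrenheit_to_celsius(temp=32) == 0.0


def test_error_handling():
    try:
        fahrenheit_to_celsius(temp="77")
    except TypeError:
        pass
    else:
        raise AssertionError("expected TypeError for non-numeric input")
```

+Writing the test names (and function names) up front, as the video above suggests, then filling them in later, works naturally with this layout.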
+
+
+[RAP massive gov doc, maybe useful if looking for something specific; it's long](https://www.gov.uk/guidance/the-aqua-book)
+
+[How to refactor](https://nhsdigital.github.io/rap-community-of-practice/training_resources/coding_tips/refactoring-guide/)
+
+[newbie-friendly git guide for RAP](https://nhsdigital.github.io/rap-community-of-practice/implementing_RAP/skills_for_rap/git_for_rap/)
+
+[Quality coding, e.g. doc manual coding in repo?](https://nhsdigital.github.io/rap-community-of-practice/implementing_RAP/workflow/quality-assuring-analytical-outputs/)
+
+
+[RAP levels really show we are on the right track](https://nhsdigital.github.io/rap-community-of-practice/introduction_to_RAP/levels_of_RAP/)
+- we probably aim for silver, as it should achieve a lot while ensuring a baseline
+- we are also looking at hitting most of the gold with CI/CD, Databricks and DABs
+
+
+[Python style guide if interested; can be useful not just to lint but to help with naming etc.](https://peps.python.org/pep-0008/)
+
+https://nhsdigital.github.io/rap-community-of-practice/training_resources/pyspark/pyspark-style-guide/
+"We avoid using pandas or koalas because it adds another layer of learning. The PySpark method chaining syntax is easy to learn, easy to read, and will be familiar for anyone who has used SQL."
+ +unit test, expected, error handling, edge cases <- alreay got this comment but make sure i put it in + +python > pandas + +[pyspark RAP](https://nhsdigital.github.io/rap-community-of-practice/training_resources/pyspark/) put this in PR + +[another pyguide](https://github.com/google/styleguide/blob/gh-pages/pyguide.md) +[another approachable longer doc about RAP and what it involves](https://best-practice-and-impact.github.io/qa-of-code-guidance/principles.html) + + +[gov rap policies](https://github.com/NHSDigital/rap-community-of-practice/blob/main/docs/introduction_to_RAP/gov-policy-on-rap.md) + + +# Not really need but nice to cpmment somewhere +[open source 1](https://www.gov.uk/guidance/be-open-and-use-open-source) +[open source 2](https://gds.blog.gov.uk/2017/09/04/the-benefits-of-coding-in-the-open/) +[open source 3](https://github.com/nhsx/open-source-policy/blob/main/open-source-policy.md) + diff --git a/resources/pipeline/silver/isreporter_dlt.yml b/resources/pipeline/silver/isreporter_dlt.yml index 5b57e65..b5f8721 100644 --- a/resources/pipeline/silver/isreporter_dlt.yml +++ b/resources/pipeline/silver/isreporter_dlt.yml @@ -9,7 +9,7 @@ resources: photon: true # good practice to specify its something to do with dlt having beta version? 
      channel: current
-      continuous: true # maybe triggered for POC once works
+      continuous: false # maybe triggered for POC once works
      # By defining catalog here we set it for all jobs in the pipeline without needing to specify it with the variable when defining a table
      catalog: ${var.catalog}
      target: ${var.schema_prefix}${var.layer_silver}_${var.domain_reporting}
diff --git a/scratch/dlt-ingestdeleteme-cluster.md.ipynb b/scratch/dlt-ingestdeleteme-cluster.md.ipynb
new file mode 100644
index 0000000..a3da9d7
--- /dev/null
+++ b/scratch/dlt-ingestdeleteme-cluster.md.ipynb
@@ -0,0 +1,98 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": 0,
+   "metadata": {
+    "application/vnd.databricks.v1+cell": {
+     "cellMetadata": {
+      "byteLimit": 2048000,
+      "rowLimit": 10000
+     },
+     "inputWidgets": {},
+     "nuid": "ce0b4748-c235-4a29-ae6a-487a4f491d59",
+     "showTitle": false,
+     "tableResultSettingsMap": {},
+     "title": ""
+    }
+   },
+   "outputs": [],
+   "source": [
+    "%sh\n",
+    "rm -rf /dbfs/device_stream\n",
+    "mkdir -p /dbfs/device_stream\n",
+    "curl -L -o /dbfs/device_stream/device_data.csv https://github.com/MicrosoftLearning/mslearn-databricks/raw/main/data/device_data.csv"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 0,
+   "metadata": {
+    "application/vnd.databricks.v1+cell": {
+     "cellMetadata": {
+      "byteLimit": 2048000,
+      "rowLimit": 10000
+     },
+     "inputWidgets": {},
+     "nuid": "ee5c0ce0-74fe-441a-ba87-b6ffc46f1d22",
+     "showTitle": false,
+     "tableResultSettingsMap": {},
+     "title": ""
+    }
+   },
+   "outputs": [],
+   "source": [
+    "from pyspark.sql.functions import *\n",
+    "from pyspark.sql.types import *\n",
+    "\n",
+    "# Define the schema for the incoming data\n",
+    "schema = StructType([\n",
+    "    StructField(\"device_id\", StringType(), True),\n",
+    "    StructField(\"timestamp\", TimestampType(), True),\n",
+    "    StructField(\"temperature\", DoubleType(), True),\n",
+    "    StructField(\"humidity\", DoubleType(), True)\n",
+    "])\n",
+    "\n",
+    "# Read streaming data
from folder\n",
+    "inputPath = '/device_stream/'\n",
+    "iotstream = spark.readStream.schema(schema).option(\"header\", \"true\").csv(inputPath)\n",
+    "print(\"Source stream created...\")\n",
+    "\n",
+    "# Write the data to a Delta table\n",
+    "query = (iotstream\n",
+    "    .writeStream\n",
+    "    .format(\"delta\")\n",
+    "    .option(\"checkpointLocation\", \"/tmp/checkpoints/iot_data\")\n",
+    "    .start(\"/tmp/delta/iot_data\"))"
+   ]
+  }
+ ],
+ "metadata": {
+  "application/vnd.databricks.v1+notebook": {
+   "computePreferences": null,
+   "dashboards": [],
+   "environmentMetadata": {
+    "base_environment": "",
+    "environment_version": "4"
+   },
+   "inputWidgetPreferences": null,
+   "language": "python",
+   "notebookMetadata": {
+    "mostRecentlyExecutedCommandWithImplicitDF": {
+     "commandId": 6527703006113399,
+     "dataframes": [
+      "_sqldf"
+     ]
+    },
+    "pythonIndentUnit": 4
+   },
+   "notebookName": "dlt-ingestdeleteme-cluster.md",
+   "widgets": {}
+  },
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 0
+}
diff --git a/tests/Run_Lint.ipynb b/tests/Run_Lint.ipynb
index edc16af..43dbc59 100644
--- a/tests/Run_Lint.ipynb
+++ b/tests/Run_Lint.ipynb
@@ -511,7 +511,8 @@
        "widgetDisplayType": "Text",
        "validationRegex": null
       },
-      "parameterDataType": "String"
+      "parameterDataType": "String",
+      "dynamic": false
      },
      "widgetInfo": {
       "widgetType": "text",
diff --git a/tests/Run_Tests.ipynb b/tests/Run_Tests.ipynb
index 0f8bb30..ee82f59 100644
--- a/tests/Run_Tests.ipynb
+++ b/tests/Run_Tests.ipynb
@@ -491,7 +491,8 @@
        "widgetDisplayType": "Text",
        "validationRegex": null
       },
-      "parameterDataType": "String"
+      "parameterDataType": "String",
+      "dynamic": false
      },
      "widgetInfo": {
       "widgetType": "text",
@@ -517,7 +518,8 @@
        "widgetDisplayType": "Text",
        "validationRegex": null
       },
-      "parameterDataType": "String"
+      "parameterDataType": "String",
+      "dynamic": false
      },
      "widgetInfo": {
       "widgetType": "text",
@@ -543,7 +545,8 @@
        "widgetDisplayType": "Text",
        "validationRegex": null
       },
-      "parameterDataType": "String"
+      "parameterDataType": "String",
+      "dynamic": false
      },
      "widgetInfo": {
       "widgetType": "text",