Data to Insight is not a software supplier in the traditional sense -- we're here to help local authority (LA) colleagues get work done and share it with each other. The best thing about Data to Insight is the community of analysts who are doing and sharing great work through this network.
We recognise that different LAs want different levels of engagement with our work. As such, we offer four key collaboration approaches to suit different users’ information needs.
Type 1 CIN rules
When you feel ready to code a type 1 CIN rule, or when there are no beginner ones left, this guide explains what to do and how it differs from the beginner rules. This guide is a bit more bare bones than the beginners CIN rule guide as lots of key things are explained there: trouble shooting, coding via codespaces, testing the rule, so if you're stuck, you can always go back to the beginners CIN rule guide where you want more detail.
Once you've assigned yourself to a rule, made a new branch to code the rule in, and opened a codespace on that branch, you'll need to copy the template rule, like we did for beginner rules, and rename it according to the convention rule_####.py where #### is your rules number, so rule 1000 would be rule_1000.py. The template rule for intermediate rules is rule 8925.
You'll first need to replace/update the tables/column of interest from the template rule with your own. The structure of these can be found in the file _api.py in the rule_engine directory. Remember, start with the table or tables of interest, creating a variable with the name of the table of interest and assigning it the value of CINTable.<TableName> This means that the Child Characteristics table would be set up with the following line of code:
ChildCharacteristics = CINTable.ChildCharacteristics
You'll need to do the same for the columns as well, this time creating a variable with the column name as the variables name, and assigning it the value of <TableName>.<ColumnName>. So, if you wanted the LAchildID and Ethnicity column from Child Characteristics, your code would include these lines:
LAchildID = ChildCharacteristics.LAchildID
Ethnicity = ChildCharacteristics.Ethnicity
In the image below is the completed code for rule 8840, the rule I will be using in this example. It uses the ChildProtectionPlans table, and the columns CPPendDate. CPPstartDate and LAchildID. For the location linking, you'll need to ensure that one of the columns is LAchildID as that's needed for location linking.
Once you've updated the tables and columns, you will need to update the rule_definition decorator. Assign your rule's code to the code variable, pass your table to the module variable using the CINTable.<TableName> format used to set up the tables in the step above. Assign the error message for your rule as a string to the message variable by using single or double quotes ('<error message>'/"<error message>"). Finally, pass the names of the columns your rule needs to the affected_fields variable as a list.
Now it's time to get on to coding the rule. This, and the test_validate functions are where the process of writing a rule is most different to writing a beginner rule. Other than writing the logic, however, it's still very much a fill-in-the-blanks process.
First things first, you'll need to change the df variable. You need to put the name of the table your rule uses in the square brackets after data_container, replacing ChildProtectionPlans. The line of code will look like this:
df = data_container[<TableName>]
So, for Child Characteristics, the line will look like this:
df = data_container[ChildCharacteristics]
Leave the next line of code that sets df.index.name, this is used for location linking.
Next, you need to write the logic for your rule. This is slightly different to the beginner rules because instead of returning failing indices to tell the validator where the data failt, we return a dataframe called df_issues. To do this, you need to set up the logical condition for your rule and set the variable 'condition' equal to it. For instance, if I wanted to check to see if CPPendDate was before CPPstartDate, I'd write the line:
condition = df[CPPendDate] < df[CPPstartDate]
If I was checking to see if they're the same I'd write:
condition = df[CPPendDate] == df[CPPstartDate]
These lines of code check the values in the CPPend and start date columns of the dataframe we made earlier using the df = data_continer line. As of yet, as we haven't done anything with condition, this line doesn't do anything, however, it just defines the condition we use to perform our logic. We then use condition to slice the df dataframe, reset the index, and set the variable df_issues equal to this. Your rule may have very different logic, but the idea is that you need the df_issues dataframe to contain the rows where df fails the logical check of the rule you're writing. The implementation of the logic for rule 8840 can be seen below.
Next up is the main addition to the intermediate rules, this is the code that allows the back-end to properly link together the places the data fails validation checks. The first bit of code to change here is some of the values assigned to link_id. link_id is passed a tuple that contains zipped variables. Tuples are a way of storing multiple items as a single variable, zip takes in iterable variables and returns an iterator. Don't worry if that doesn't make sense. To make this work you'll just need to replace the columns currently inside the zip with the ones for your rule, whilst leaving LAchildID as it's needed for the location linking. So, if I was using the CINdetailsID and ReferralSource columns, the line would look like this:
zip(df_issues[LAchildID], df_issues[CINdetailsID], df_issues[Referral source])
You can leave the next two lines with df_issues['Error_ID'] and df_issues = df_issues.groupby(). You will need to update some of the code for rule_context. table needs to be updated to be equal to the table your rule relates to and columns need to be passed a list of the columns your rule relates to (remember, lists go inside square brackets). This updated part of the validate function can be seen in the image below. rule 8840 used the same columns as rule 9825 so these didn't need to be changed from the template.
Now onto the test for the validate function. Much like for the beginner rules you'll need to create a datframe that passes and fails your validation check in known areas, you'll need to tell the code where it fails, and you'll need to update some of the lines to reflect your rule. First off, the dataframe. You can do this by making a dictionary of keys with column names and values of lists for each row and using pd.DataFrame(). You can technically call this whatever you want but we are opting for snake_case versions of the table you're validating, so ChildProtectionPlans becomes child_protection_plans. You'll need to make a dataframe that has an LAchildID column, and have one ID for each row in your dataframe. You'll also need key:value sets for your other columns. Make sure you have rows for every passing and failing row of your dataframe. For instance, for rule 8840, which checks if CPP start and end dates are the same, I had three rows that were the same, I had one row where they were different, and a final row with no dates at all (using pd.NA to create NaNs). The dataframe I created for this rule is below, I have 5 rows so I've created 5 child IDs, simply by passing a list of strings child1 through child5 to the key LAchildID. I then passed CPPstartDate and CPPendDate my failing and passing dates.
You'll also need to change your dates to datetime format if your rule uses dates, and set the errors argument equal to "coerce". An example of this can be seen below:
You will need to update the result variable. It is passed validate, which doesn't need to be changed, but you will need to update the second argument. You'll need to change ChildProtectionPlans to be the name of the table you defined in the first part of writing the rule, and after the colon you need to replace child_protection_plans with whatever you called your test dataframe, unless your test datframe is child_protection_plans, then you don't need to change it, obviously.
You also need to update the lines:
assert issue_table == ChildProtectionPlans
assert issue_columns == [CPPstartDate, CPPendDate]
assert len(issue_rows) == 3
to reflect your rule. issue_table needs to be passed the table your rule used, and issue_columns needs to be passed a list of the columns your rule uses, and len(issue_rows) needs to be equal to the number of rows where your test dataframe should fail your rule.
Next, you need to create a dataframe that matches the dataframe you expect to be returned by the logic of your rule. That is, you need to create the dataframe that's just the rows that should fail from your test dataframe. This datframe should be called expected_df and should have columns for every column in the test dataframe, and a row for each one that fails. You also need to include an extra column called 'ROW_ID' which, for each failing row, should have the index location of that row from your test dataframe. The expected_df dataframe for rule 8840 is pictured below.
Finally, you need to update assert rule.definition.code == #### to include the number for your rules and the message in the line after to include the message for your rule. This can be seen in the image below.