Type 2 CIN rules

This guide will explain how to code the type 2 CIN rules. These are rules that use two tables. Just as with the type 1 CIN rules, this guide is a bit more bare bones than the original guide to not clog up the page with too much text. This guide does assume that you have written at least a good-first-issue/beginner rule before.

Coding the logic for type 2 rules will help to further develop your Python data analysis skills as it involved some more complex actions like merging/joining tables. The more bare bones style of this guide will also help you grow as a programmer, as you will have to learn how to navigate GitHub with less guidance. As always, google is your friend.

As always, select a rule from the issues page, assign yourself to it, create an appropriately named branch, and open the code in codespaces on that branch. This time, there are two template rules, for rules using two tables, use rule_8841.py, for rules with three tables, use rule_2885.py. I will write this guide using rule 8841 as a n explanation, but the steps using 2885 are pretty similar. At the time of writing this guide, these rules are in draft form as pull requests so that Tambe and I can review them and suggest changes to each other, so you will need to copy and paste the code from the view code option on the pull request, then using the copy raw contents button. It will be good GitHub experience to work out how to do this yourself.

Once you've made a new file in the cin2022_23 directory, as we have for every other type of rule, and copy/pasted the template rule in. It's time to code! First things first, You'll need to update the tables and columns. This time, remember you'll need to get two different tables from CINTables, and get the appropriate columns from each table. You'll need to get more than just the affected fields columns this time, however. We need LAchildID to be able to make our link_id, as in type 1 rules, but we'll also need to include columns to help us link between tables in a way that helps us avoid many-to-many joins. For instance, joining the ChildProtectionPlans table to the reviews table, if a child has had multiple CP plans and reviews, a join on LAchildID would join every row with matching LAchildID between tables, ending with more rows than we want. Instead, it's better to join on both LAchildID and CPPID, which is a unique combination for every child's CPP group, meaning we will only join the ChildProtectionPlans table to the appropriate rows of Reviews.

Next up is the rule_definition decorator. This is just like other rules, just make sure that you pass affected_fields a list of the columns which will show issues. It doesn't matter which table they're form as you defined them above. Rule 8841 will flag errors in CPPstartDate and CPPreviewDate, so I passed it a list of those two variables.

Now for the validate function, this is also where most of the changes tot he rule writing process are.

We have to prep the data first. We need to make dataframes for each of the tables our rule needs. I've made two, df_cpp and df_reviews, if you're doing a three table rule, you'll need to make three.

Next, for each of the tables, name the index "ROW_ID", as in the example above. Following this, reset the index in place of each table to make ROW_ID a column of each table.

Now for the logic, seen above. As with previous rules, comment it well. Ensure you give the full explanation of the rule (not just the message) in the comments, as well as explaining the steps. I'll explain the logic behind coding rule 8841, which should help you code yours. The most important thing to note is how the join/merge works as this is new for type 2 rules.

Rule 8841 states that within a CP plan group, there should be no CP plan review date that is on the same day or earlier than a CP plan start date. This is the same as saying CP plan review dates can only be greater than CP plan start dates. For our issues dataframes then, we want to return rows where the CP plan review date is the same as, or smaller, than the CP plan start date in a CP plan group. To work this out, we'll need a table that merges CP plans and reviews, with the appropriate rows or reviews with the appropriate rows of CP plans, that is, reviews with the right plan groups. Luckily, children have CP plan IDs, CPPID, and CINdetailsIDs, which are shared columns between ChildProtectionPlans (cpp_df) and Reviews (reviews_df).

Before we start doing this though, we want to drop any extraneous rows, that is rows in cpp_df without CP plan start dates, and rows in reviews_df without review dates. This may seem redundant as, given the structure of the ingest XML, there should not be rows in cpp_df or reviews_df without dates, but these lines allow a little robustness against incorrectly entered data in other of these tables. This is done by filtering each table by the rows where .notna() for the relevant columns is true, as above. This can be done by appropriately using the line:

data_frame = data_frame[data_frame[column_name].notna()]

If you wanted to only return rows without data, you can replace .notna() with .isna().

Following this, it's time to merge our tables. This is done using the Pandas .merge() method. The documentation for the pandas merge method can be found here. It can be used relatively similarly to a SQL join/merge. When selecting the merge type, using the how= argument, it's worth googling to make sure than your merge behaves as expected, as some behave slightly differently to the same type in SQL.

For rule 8841 I had to join my dataframes cpp_df and reviews_df in a way that gave me one-to-one matches, that is one row joins one relevant row. So, I initialised a variable called df_merged (this, or some variation on this is a common convention for naming merged dataframes inside functions), and set it equal to my merged dataframes. To merge the two dataframes, you use the .merge() method on the dataframe to be joined to, typically this is the left dataframe. You then pass .merge() a number of arguments. The first one is the name of the dataframe to join to, typically the right dataframe. You then pass left_on= and right_on= the column(s) to merge on, the dataframe outside the .merge() method is left, the dataframe inside is right. You can pass these individual columns, or, as in the case of this rule, lists.

Following this, you need to select your merge type. Typically CIN rules will use a left join, but it's worth googling how joins work in Pandas to understand better. Some common join types are shown in the image below.

The final argument passed to .merge() is the suffixes. These are suffixes added to column titles coming from each table where each table shares column titles. For instance, both tables have ROW_ID columns, and the suffixes are used later when making link IDs to differentiate them. It's probably best to use a short suffix that's easily recognisable, I went for _cpp and _reviews. If you don't give suffix names, Pandas defaults to _x and _y.

Finally, the tables are merged correctly and we can select just rows that fail. Do this just as you would have for other rule types. Initialise a variable called condition with the logic in for relevant columns, then filter merged_df by that, remembering to reset the index.

That's the hardest bits out of the way, now we are back to filling in the blanks, mostly. Firstly, you'll have to make a unique ERROR_ID. This helps the front end know which cells to highlight as having issues. As a rule of thumb, to do this, simply update CPPstartDate and CPPreviewDate with the columns used for your rule. Theoretically, exceptions to this rule of thumb may arise, but Tambe and I haven't come across any yet, and if we find a rule that breaks one in a PR, we'll catch it, so don't worry.

Next, we need to add the ROW_ID and ERROR_ID columns to our orignal dataframes to get the issues dataframes out. This is where the suffix argument from the first merge comes into play! You'll need to make a new dataframe for each original dataframe, which has the same name but _issues as a suffix, as in the image above. You'll then need to merge the original dataframes with the merged_df that's been sliced according to the rule logic. Your left_on argument should be ROW_ID and your right_on argument should be ROW_ID with the appropriate suffix from the merged_df. Make sure to not remove the .groupby(), .apply(list), and .reset_index() methods.

The last bit of the validate function to update is the rule context. You'll need to do this once for each table relevant to your rule. This is a type 2 rule, so push_type_2 remains. You'll then need to update each rule_context line for each table to ensure that table, columns, and row_df are passed the appropriate variables.

Now for the test_validate function. Just like the previous rules, you'll need to make dataframes that will have passing and failing rows, containing all possible variations of data your rule could be passed. The difference this time is you'll need to make a dataframe for each table your rule tests. Each dataframe will need to include LAchildIDs, the data youre checking, and whatever links tables. In the image below you can see an example row for one dataframe in the test_validate function for rule 8841.

Remember that if your rule uses dates, you'll need to convert relevant columns to datetime, you can do this by replacing the variables below the sample dataframes, seen below.

Just as in previous rules, you'll now need to update the result variable to replace the dataframes the test validate runs on with your sample dataframes. My fake/sample dataframes were called sample_cpp and sample_reviews so I used the dictionary to update those as the values for the table name keys.

Following this, you'll need to update the statement:

assert (

len(issues_list) == 2

)

replacing 2 with the number of dataframes your rule returns issues in, that is, the number of dataframes you made in the validate function with _issues suffixes. You'll also need to update the issues variable and set it equal to a dataframe in issues_list, selecting by index, as standard, the line reads:

issues = issues_list[1]

This means that it takes the second issues dataframe. In the case of rule 8841, the template rule, that's df_reviews_issues. You'll also need to update the values passed to three other assert statements: issue_table, issue_columns, and len(issue_rows). Issue table is the table you're finding issues in based on what you selected in issues = issues_list[X], so for 8841, that's the Reviews table. Issue_columns is where the issue is found, for 8841, that's CPPreviewDate. It's the same column from the affected fields. finally len(issue_rows) == the number of rows your sample dataframe returns with issues, the number of times your sample dataframes fail the validation rule.

Now, just like the other rules, you need to make a dataframe of what the expected issues dataframe will look like. The example above, and in the template, is the first row of the issues dataframe made my the sample dataframes and returned for the Reviews table. It has an ERROR_ID, as defined in the validate function of LAchildID, CPPstartDate, and CPPreviewDate, it also has a ROW_ID of the index from the sample dataframe where it was made.

Final step, just as with all the other rules, you'll need to update the last two assert statements with the rule code and the rule message.

Then, you're done. Test using:

python -m cin_validator test

in the terminal, and fix and tinker as needs be, then make your PR!