Coding a CIN rule

This guide goes through the workflow of coding a CIN rule using codespaces. There is also a video guide. You will need the Excel document of rules, grouped by difficulty and with information about table names that can be downloaded using the button below.

Rules Excel Document

Video Guide

Quick guide

NOTE: do not start writing intermediate rules as of yet, the backend cannot accept their outputs correctly currently.

This is the quick guide. If you get stuck, or want a more detailed step-by-step guide, see the next section of this webpage.

Select an issue from the issues page of the repo and assign yourself to it so that no one else picks it up.
Make a branch from main named according to the convention rule<rulenumber>, so rule 8840 is the branch rule8840. Then navigate to that branch and create a codespace on the branch.
Copy rule_8500.py in the directory /cin_validator/rules/cin2022_23, paste it in the same directory, and rename it rule_<rulenumber>.py, so rule 8840 would be rule_8840.py.
Using the information for your rule in the Excel document, and where directed by in-code comments, update the ChildIdentifiers variable with the module/table name you need from the CINTable class, and the LAchildID variable with the column you need from that table. You'll need to add lines of code if your rule uses more than one table/module or column, but this won't be the case for beginner rules. The CINTable class can be found in __api.py in the rule_engine directory. So, if I wanted the CINdetails table and the CPPendDate column, I would delete the current table/module and column code and replace it with:
- CINdetails = CINTable.CINdetails
- CPPendDate = CINdetails.CPPendDate
Using information in the Excel document, and the relevant comments, update the @rule_definition decorator for your rule.
Using the information in the Excel document, and the in-code comments, update the validate function for your rule. failing_indices should return the index locations of rows where data fails validation checks. If your rule requires a check to see if a date falls within a census period, see the Census Period Rules section at the bottom of this page to see how these types of rule should be written.
Update the test_validate function using the in-code comments to guide you. Start by making a dataframe that includes every pass and fail condition for your rule. Then ensure that the number being passed to assert len(issues) is the same as the number of issues your dataframe raises, and that the IssueLocator statements include the table, column, and index of every issue. You may need to add or remove these statements depending on the number of issues. Then update the function to accommodate your rule's code and message.
Put the following code in the terminal and hit enter: python -m cin_validator test -r rules.cin2022_23 . It tests your code. If your code doesn't pass, use the output in the terminal to help you re-write or change it to make it pass.
When you're certain your rule behaves correctly and it passes the tests, commit it to your branch and make a pull request. Ensure that you have 'closes #<issuenumber>' in your comment to close the issue. So, for rule 8840, this is 'closes #107' as rule 8840 is issue 107.
You may get comments on your pull request asking you to change it, or it may be merged after review.

Detailed guide

Selecting a rule and getting it in to codespaces

NOTE: the backend code to implement intermediate and advanced rules is not currently finished. Please stick to beginner rules for now, also, do not follow the intermediate rule writing guide on this page yet, it is a work in progress based on the current state of the code and, as the backend code may change for intermediate rules, it may be wrong in the future.

First up, to start coding a rule for the CIN tool, you'll need to head to the Issues tab of the CIN validator repository. Have a look through the issues and choose something you think you can code, or could learn to code. Beginner rules are tagged 'Good first issue'. Note that this guide splits midway into beginner and intermediate rule writing (advanced to follow), as there are some slight differences that those doing beginner issues should not be worried about.

Once you've selected an issue, click through to the page for the issue. I, for instance, have chosen the issue for rule 8840. On the right-hand side of the page is a section to assign someone to the issue. You access it by clicking the gear button. You'll then get the dropdown you can see in the image below asking you to search for people to assign. You should assign yourself to the issue you want to work on, so we don't have two people working on the same issue at once.

Now you need to make a new branch to start working on. We'll do this in GitHub rather than in Codespaces this time, so you know two different ways to do it. Navigate to the main CIN code page. Then, on the left, there is a button that says main. Click it and you'll get a context box and the ability to write a branch name. Enter the name of your rule as the branch name, for me this was rule8840. I'd already created my branch, so my image looks a little different to how your screen will look. You should have the option, in the dropdown, to 'Create branch: XXXX from main'. Click this to make your branch, which should also take you to your branch. If it doesn't, use the same dropdown to go to the branch.

Once you've assigned yourself to an issue and have made a branch, you can create a Codespace and start working on it. This time, unlike the original Codespace tutorial, we'll start our Codespace in a branch from the get-go. From your branch's page click the green Code button, go to the Codespaces tab, and select the green Create Codespace on (branch name) button.

Once your Codespace has loaded, select your preferred colour scheme and ensure that Python is installed. Then, on the left of the screen, ensure you are in the Explorer pane (it looks a little like a new document button in Microsoft Suite tools) and navigate to the file rule_8500.py. It's in the filepath cin_validator, then rules, then cin2022_23. The explorer pane and the file rule_8500.py are shown in the image below.

We need to select rule_8500.py because it is a template to be used as the basis for all other rules. Now, if you coded the 903 tool, you'll remember we coded every rule in the same file. In order to improve the functionality of the CIN validator, we've opted to put every rule in its own file. To do this, right click rule_8500.py and select copy, paste it into the same directory, then rename the copy to rule_XXXX.py. Mine is called rule_8840.py.

In the next section, we cover some problems you might have when trying to use codespaces and the solutions we've found. If everything runs fine, feel free to skip the section, or come back to it if you find problems.

Codespaces troubleshooting and setup

You may encounter some issues when using codespaces. We have done our best to ensure that it should run easily without any tinkering, but solutions to problems we have seen analysts have are in this section. If you have a problem, let us know so we can help you with it, and so we can add the solution to this section.

If you try and run the code you've opened in Codespaces, you may get some errors. These are more likely to be encountered if you are coming to the project shortly after we started it, before we have been able to automate some of the Codespaces setup on our end, easing the work of analysts. Particularly, you may be told that there is an import error with Pandas, that other packages are missing, or that there is no module named cin_validator. If you get errors saying that packages can't be found, you will need to tell your virtual machine to install them using pip. For instance, to do this for pandas, type:

pip install pandas

into the terminal and hit the enter key. Pandas should then install. If you get any other errors for missing packages, do the same action for those packages: pip install (package name). The full list of packages at the time of writing this guide is:

attrs

click

dataclasses

DateTime

iniconfig

numpy

packaging

pandas

pluggy

pyparsing

pytest

python-dateutil

pytz

six

tomli

typing

zope.interface
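Rather than waiting for each import error in turn, one way to see which packages are missing is to ask Python directly. The sketch below is illustrative only: missing_modules is a hypothetical helper, not part of cin_validator, and it checks import names, which can differ from the pip name (for example, python-dateutil is imported as dateutil).

```python
import importlib.util


def missing_modules(module_names):
    """Return the import names from module_names that Python cannot locate."""
    missing = []
    for name in module_names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised, for example, when the parent of a dotted name is absent.
            found = False
        if not found:
            missing.append(name)
    return missing


# A few of the import names from the list above; extend as needed.
to_check = ["pandas", "numpy", "pytest", "click", "dateutil"]
print("Missing:", missing_modules(to_check))
```

Anything printed in the list can then be installed with pip install (package name).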

You may sometimes get an error saying that module cin_validator cannot be found. If this is the case, you'll need to tell Codespaces what directory to look in when running code. Do this by entering into the terminal, replacing path/to/your/project with the path to your project:

export PYTHONPATH="${PYTHONPATH}:/path/to/your/project/"

in my case, that is:

export PYTHONPATH="${PYTHONPATH}:/workspaces/CIN-validator/cin_validator"

You can get the path by right clicking the cin_validator folder/module and selecting 'Copy Path', then replace path/to/your/project with the path you get.
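To see what adding a folder to PYTHONPATH actually changes, here is a minimal sketch that does the equivalent inside one Python session by appending to sys.path. The package name demo_pkg_for_path_sketch and the temporary folder are made up for the demonstration and have nothing to do with the real project.

```python
import importlib
import importlib.util
import sys
import tempfile
from pathlib import Path

# Hypothetical package name, used only for this demonstration.
PACKAGE_NAME = "demo_pkg_for_path_sketch"


def is_importable(name):
    """Return True if Python can currently locate a top-level package."""
    return importlib.util.find_spec(name) is not None


with tempfile.TemporaryDirectory() as project_dir:
    # Build a tiny package: project_dir/demo_pkg_for_path_sketch/__init__.py
    package_dir = Path(project_dir) / PACKAGE_NAME
    package_dir.mkdir()
    (package_dir / "__init__.py").write_text("VALUE = 42\n")

    found_before = is_importable(PACKAGE_NAME)

    # Same effect as listing project_dir in PYTHONPATH before starting Python.
    sys.path.append(project_dir)
    importlib.invalidate_caches()
    found_after = is_importable(PACKAGE_NAME)

    # Tidy up so the demonstration leaves sys.path as it found it.
    sys.path.remove(project_dir)

print(found_before, found_after)
```

Python looks inside each listed folder for the package, so the entry that makes a package importable is generally the folder that contains it. If the export command does not clear the error for you, trying the folder one level up may help.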

The final error you may encounter is a linting error that stops packages being imported correctly. You will see errors in the problems tab of the terminal, as in the image below. Note, I do not have this error, but some analysts I have spoken to have had it, which is why the specific error is not in my screenshot. A workaround for our purposes, if you have this issue, is to disable linting. This is not a good solution, as linting is a useful function that essentially spell checks our Python.

To disable linting, hit control+shift+p to open the command palette, or right click in the code editor area and select command palette from the list that appears. When the command palette opens at the top of the screen, type linting into the search bar and select Python: Enable/Disable Linting, then select Disable on the second menu.

This should see to all of the problems we've seen analysts have so far, but if you encounter any more, please let us know so we can deal with them, and hopefully, others can avoid them.

Now let's look at some coding. The first part of the rule coding is a bit of a fill-in-the-blanks exercise that, once you have done it once, will be very easy.

Beginner rules

Much of the content of the beginner, intermediate, and advanced rules sections is the same. However, because the number of fields that can have issues and the way that the backend of the code accepts information about those issues is different, we've split the sections up to avoid confusion. You might also notice that the rule I'm coding in this section is different to the previous section. That's just because the rule in the previous section was an intermediate rule. It doesn't change the workflow as we haven't looked at the rule yet. For the purposes of the beginner guide, I'll be working on rule 1540. As per the beginning of this guide I've already assigned myself to the rule, made a branch from main called rule1540, opened up that branch in Codespaces, and copied rule_8500.py and called it rule_1540.py.

We have set up rule writing to be quite fill-in-the-blanks because the backend code needs very specific outputs to work, so, a good portion of rule writing will be working out what you need to replace certain lines of code with to make your rule. You will need to do quite a bit of thinking to write the logic of your rule, however! Rule writing also comes in three sections: defining the rule, writing the rule itself, and writing code to check that your rule works and that it gives the outputs that the back-end code expects. In the image below is the first bit of code you'll need to change.

You will notice there are lots of comments in the code, these are the lines of text preceded by # signs. These are lines that Python doesn't read as code, they're comments that are there for you to help you write the rule. The way the rule writing has been set up for the CIN validator, at least the outline, is very much a template for you to put your work in to, this is so that we can be sure that the rest of the back-end code gets out exactly what it expects. I've highlighted some of the lines you'll need to change in the image above. Essentially, each comment tells you what to replace the code below it with to write your own rule. To proceed you'll need the rules Excel document, the download button for this is at the top of the page.

Find your rule in the Excel document, I used control+f to make it easier. You may have to change the filtering in the stage column to show all rules if you can't find your rule.

In the Excel document, the row for your rule contains useful information for writing your rule, such as the rule code, the module, and the message that needs to be displayed if the rule is not satisfied by the data.

First things first, we need to work out which tables and columns our rule will look in to assess the validity of the data. I know that rule 1540 checks to see if characters 5-12 of the UPN are numeric. Looking in the module column, I can see that this check is performed on the Child Identifiers table, and the validation check column shows me that it uses the UPN column (the columns a rule checks are written between < >).

Looking back at the code for rule 8500, we can see how to tell the code where to look for the columns and table names we need, it is the code with the vertical line next to it in the picture above, and in the picture immediately below this text. It specifies that it looks in the ChildIdentifiers table, located in the CINTable class, and passes this to the variable ChildIdentifiers. We can also see that it then passes the column LAchildID from the ChildIdentifiers table to the variable LAchildID. Setting things up like this makes coding the rule a bit easier. We need to change this to match our rule.

We need to change our code to match our rule. I know I'll be looking in the Child Identifiers table, so that line does not need to change. I also know I'll need the UPN column from this table. Table names and column names are not written the same in our code as they are in the Excel document. This is because of how Python needs variables to be named: Python does not allow spaces in variable names. We have named tables and columns using PascalCase, where, generally, the first letter of any word is capitalised. To find your table and column names, head to the __api.py file in the rule_engine subdirectory of cin_validator. This is where the CINTable class is. If you don't know what a class is, don't worry. Just know that, like most things in Python, it's an object that we can fill out with class specific methods and attributes. For instance a Pandas DataFrame is a class. A class is, essentially, a thing that has stuff attributed to it, that we can do stuff to.

Once you've navigated to the CINTable class, locate the module you need. I located ChildIdentifiers (it has a big arrow pointing to it in the image below), then locate the columns you'll need to the right of the module name. Use these to fill in the table and columns you need, as above.

My updated table and column identifiers are shown in the image below. I haven't had to change ChildIdentifiers for rule 1540, but I have had to change the next variable to be called UPN and to make sure it is passed the UPN column from the ChildIdentifiers table.
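To make the table-and-column lookup pattern a little more concrete, here is a simplified stand-in. It is not the real code from __api.py: TableSketch and CINTableSketch are hypothetical names, and only the tables and columns mentioned in this guide are included. It shows why names must be spelled and capitalised exactly as they appear in the class.

```python
class TableSketch:
    """Hypothetical stand-in for a table that knows its column names."""

    def __init__(self, name, columns):
        self.name = name
        self._columns = tuple(columns)

    def __getattr__(self, column):
        # Only called when normal attribute lookup fails,
        # so it handles column names such as .UPN.
        if not column.startswith("_") and column in self._columns:
            return column
        raise AttributeError(f"Table {self.name} has no column {column}")


class CINTableSketch:
    """Hypothetical stand-in for the class that groups the tables."""

    ChildIdentifiers = TableSketch("ChildIdentifiers", ["LAchildID", "UPN"])
    ChildProtectionPlans = TableSketch(
        "ChildProtectionPlans", ["CPPstartDate", "CPPendDate"]
    )


# Same two-step pattern as in the rule template.
ChildIdentifiers = CINTableSketch.ChildIdentifiers
UPN = ChildIdentifiers.UPN

# A wrongly capitalised column name fails straight away.
try:
    ChildIdentifiers.Upn
    typo_detected = False
except AttributeError:
    typo_detected = True

print(UPN, typo_detected)
```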

From here, with reference to the Excel document, you can also start to fill out the fields within the @rule_definition decorator. Again, like with classes, we don't need to worry about what a decorator is to get going with coding the rule, but, for the purposes of this guide, think of a decorator as something that takes in a function, adds functionality to it, and returns it. In our @rule_definition decorator, we are just adding the important information about our rule to the decorator that will be used every time the rule is implemented by the code.
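If you'd like to see the decorator idea in miniature, the sketch below attaches some information to a function and hands the function back. It is a generic illustration only: rule_info is a hypothetical decorator, not the real @rule_definition, and the message text is a placeholder.

```python
def rule_info(code, message):
    """Hypothetical decorator factory: stores rule details on the function."""

    def decorator(func):
        # Add the extra information, then return the same function.
        func.code = code
        func.message = message
        return func

    return decorator


@rule_info(code="1540", message="Example message shown when the rule fails")
def validate():
    return "validated"


# The function still works as before, and now carries its rule details.
print(validate(), validate.code, validate.message)
```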

As you can see from the image above, I have updated the code, module, message, and affected_fields variables to suit rule 1540, the information for this is in the Excel document.

We can now move to the real coding. We need to update the validate function to work for our rule. The validate function, as you copied it from rule_8500.py will be unchanged and is shown below.

You won't need to change any of the function's arguments, that is, data_container or rule_context. I'll explain what they do briefly, though. Mapping tells the function that all the data is in the CINTable class, and all of that data will be in the form of pandas dataframes. rule_context just tells the function that it will need to do what RuleContext tells it to when it comes to the rule_context statement at the end of the function.

I think that the easiest way to explain how to write the code yourself is to show you what I did, what I changed, and how I got there.

In the code above, I have highlighted the areas where you may need to change the original code to fit your rule. df needs to be updated so that the table/module your rule needs is in the square brackets after data container. The comment above the logic for your rule needs to be updated to explain what the logic of your rule does, copy and paste the entry for the validation check column of the Excel document. Your rule logic must be written such that it takes in the dataframe df, which is the table and columns your rule needs, and failing_indices returns the index numbers of rows that fail the validation check. If your rule requires a check to see if a date falls within a census period, see the Census Period Rules section at the bottom of this page to see how these types of rule should be written. Finally, table and field need to be updated to match the table and field that your validation check uses.
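As a rough picture of the shape described here, the sketch below pulls one table out of a container of dataframes and returns the index locations of failing rows. It makes several assumptions: validate_sketch is a hypothetical function, the container is a plain dictionary keyed by a string (the real code uses the table variables from CINTable), and the missing-value check is just an example of rule logic.

```python
from typing import Mapping

import pandas as pd


def validate_sketch(data_container: Mapping[str, pd.DataFrame]) -> pd.Index:
    # Pick out the table the rule needs from the container.
    df = data_container["ChildIdentifiers"]

    # Example logic: fail rows where LAchildID is missing.
    failing_indices = df[df["LAchildID"].isna()].index
    return failing_indices


# Made-up data: the second row (index 1) has no LAchildID.
sample = {
    "ChildIdentifiers": pd.DataFrame({"LAchildID": ["child1", None, "child3"]})
}
failing = validate_sketch(sample)
print(list(failing))
```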

It may help to figure out the logic of your rule in another file, using a dataframe you know will pass in some places and fail in others, and then copy it back to your rule. In my instance, the rule 1540 validation check needs to fail where a UPN is not empty and characters 5 to 12 of the UPN are not all numbers. To write this rule, I first sliced the dataframe df using the .loc method and the .notna() method on the UPN column to take only rows where the UPN did not have Na/NaN values. This means that the validation rule will then only check rows with data in, dealing with the first requirement of rule 1540.

Next, I need to return the rows where UPN has non-numerical characters between characters 5 and 12. To do this I took another slice of df. This would have been easier to write using for loops, but would have been much slower to run, so I decided against that. Where possible, Pandas dataframe methods should be used instead of for loops for efficiency. If you use a for loop in your rule where a Pandas method exists, you might be asked to change your logic.

So, I know I need to check characters 5 to 12, and fail rows where there are non-numerical characters. I can take characters 5 to 12 of the string in each row using the .str accessor on the UPN column with the slice .str[4:12]. Python counts positions from 0 and the end of a slice is not included, so 4:12 takes positions 4 to 11, which are characters 5 to 12. Following this, I can check if this string contains only digits by using .str again with the .isdigit() method. This returns a Boolean that's True where the string contains only digits and False where it doesn't, the inverse of what the rule asks for. To rectify this, I use the not operator (~) at the beginning of the slice logic to invert the output, returning indices where characters 5 to 12 contain non-digit characters.
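Put together, the steps described for rule 1540 might look something like the sketch below. The UPN values are made up for illustration, and the variable names are my own rather than those in the finished rule file.

```python
import pandas as pd

# Made-up UPNs: one empty, one valid, two with a letter in characters 5 to 12.
df = pd.DataFrame(
    {
        "UPN": [
            None,             # index 0: empty, so the rule skips it
            "A123456789012",  # index 1: characters 5 to 12 are all digits
            "A123X56789012",  # index 2: character 5 is a letter
            "A1234567890Z2",  # index 3: character 12 is a letter
        ]
    }
)

# Step 1: keep only rows where UPN has a value.
df_check = df.loc[df["UPN"].notna()]

# Step 2: take characters 5 to 12 and test whether they are all digits.
only_digits = df_check["UPN"].str[4:12].str.isdigit()

# Step 3: invert with ~ so that we keep the rows that are NOT all digits.
failing_indices = df_check[~only_digits].index

print(list(failing_indices))
```

Writing the intermediate steps as separate variables, as here, can make it easier to print each one and check it does what you expect before combining them.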

Ensure that your code is well commented and the logic is outlined clearly, this helps not only reviewers but also other analysts who may be looking at your code to help them write rules understand what you're doing and how it works.

Once you have completely updated the validate function for your rule it's time to write a test into the test_validate function to make sure that your rule works and returns what the main code expects. The image below shows the test_validate code from rule_8500.py which you'll need to update. The goal is to write a dataframe that your validate function returns errors in, and to check that it returns them in the rows we expect and in the way that we expect.

To update this, we need to create a dataframe with some correct and incorrect values to run the validate function on, and we need to update the value passed to assert len(issues) == 2 to the number of issues we expect our dataframe to raise according to our rule. We also need to update the IssueLocator statement to the columns and index numbers we expect these issues to be in, and we need to update result.definition.code and result.definition.message to match those for our rule, too.

In the image above you can see that I have highlighted the parts of the code that may need to be changed. First off, I've made a dataframe with a column called UPN and four rows which include the different outcomes of my rule. One that has no data, important as my rule has to skip rows with no data, then a row that passes, and then two different rows that will fail. When testing the rule, this is passed to the code instead of the real data. For this reason, you need to temporarily replace ChildIdentifiers (the table from the real data) with our temporary test dataframe. Name it something suitable, I have kept the name from rule 8500, and ensure that the result = run_rule statement is updated to have the real table name on the left of the colon, and the temporary table on the right, so ChildIdentifiers:child_identifiers. If I was using the CINdetails table, this could be, for instance, CINdetails:cin_details.

Once these two bits of code are updated, change the line assert len(issues) == 2 to the number of issues your rule should raise with your test dataframe. For me, this was still 2.

Next, update the IssueLocator statements to have the correct table/module, column, and row for each issue. My dataframe raises issues in the UPN column of ChildIdentifiers in index locations 2 and 3 so I changed these statements accordingly. If you have more than 2 issues, you'll need to add a new statement for each issue, if you only have one, you'll need to delete a row.

Finally, update the last two lines with the error code and message.

Once this is done, skip to the running and testing the rule section of this guide to proceed.

Intermediate Rules

NOTE: do not start writing or coding intermediate rules yet, the backend is not properly configured to accept outputs of rules that affect more than one column so, any rule you write will not be able to be merged. If you've written a beginner rule before, skip to the NAMESECTION section of this guide to see the intermediate specific instructions.

You will notice there are lots of comments in the code, these are the lines of text preceded by # signs. These are lines that Python doesn't read as code, they're there for you to help you write the rule. The way the rule writing has been set up for the CIN validator, at least the outline, is very much a template for you to put your work in to, this is so that we can be sure that the rest of the back-end code gets out exactly what it expects. I've highlighted some of the lines you'll need to change in the image above. Essentially, each comment tells you what to replace the code below it with to write your own rule. To proceed you'll need the rules Excel document, the download button for this is at the top of the page.

Find your rule in the Excel document, I used control+f to make it easier. You may have to change the filtering in the stage column to show all rules if you can't find your rule.

First things first, we need to work out which tables and columns our rule will look in to assess the validity of the data. I know that rule 8840 checks to see if the start and end date of a CP plan are on the same day, so I'll need to look in the table, and then rows, that have this information. Looking at the rule 8500, we can see that it specifies that it looks in the ChildIdentifiers table, located in the CINTable class, and passes this to the variable ChildIdentifiers. We can also see that it then passes the column LAchildID from the ChildIdentifiers table to the variable LAchildID. Setting things up like this makes coding the rule a bit easier.

If you then look at the image below, you can see that I have replaced these variables with the ones I need for my rule. First, I pass to the variable ChildProtectionPlans the ChildProtectionPlans table from the CINTable class, and then, using the same format as rule 8500, pass CPPstartDate and CPPendDate variables the correct columns from this table.

So, you know where to put the table and column names you need, but where can you find them if you don't know? Well, the Excel document tells you the module. Then, knowing this, head into the rule_engine folder, and to __api.py. This is where the CINTable class is. If you don't know what a class is, don't worry. Just know that, like most things in Python, it's an object that we can fill out with class specific methods and attributes. For instance a Pandas DataFrame is a class. A class is, essentially, a thing that has stuff attributed to it, that we can do stuff to.

Once you've navigated to the CINTable class, locate the module you need. I located ChildProtectionPlans (it has a big arrow pointing to it in the image below), then locate the columns you'll need to the right of the module name. Use these to fill in the table and columns you need, as above.

As above, I have changed the code variable to 8840, to match my rule, I've updated the module to give the table I want, ChildProtectionPlans from the CINTable class, I've updated the message to the message from the Excel document matching my rules, and I've updated the list of affected_fields to contain ChildProtectionPlans, CPPstartDate and CPPendDate. Affected fields are identified in the Validation Check column of the Excel document, they are the words between < > symbols. They're likely the tables and columns you set up variables for just now. Remember to spell them and format their names as they're found in the CINTable class, so, correct capitalisation and no spaces between words.

Now we can move on to the real coding, writing the function that validates the rule! To do this, navigate down to the validate function, it starts with the line def validate( and is shown below. Until you change it, as you copied rule_8500.py, it will contain that code. You're going to replace that code with your own.

I think that the easiest way to explain how to write the code yourself is to show you what I did, what I changed, and how I got there.

The image above shows the same code as for rule_8500.py, but changed to suit rule 8840. Underlined are places where the code has been changed.

The first thing that's been underlined is ChildProtectionPlans. ChildProtectionPlans replaces ChildIdentifiers as it is the table I need to get data from to write my rule.

The next thing that has changed is failing_indices and the code that is passed to it. This is the logic that evaluates rule 8840. I'll explain this in more detail in a bit, and talk about how you can work out your own.

The last thing that's changed is the rule_context statements at the end of the function. Notice there are now two, one for each place that contributes to the error. In these statements I've updated the field variable to reflect the columns of the dataframes that give rise to the error. Of course, for rule 8840, where CP plans can't start and end on the same day, the fields that cause this error are CPPstartDate and CPPendDate.

Now let's look at the rule logic itself. The completed rule logic needs to be written such that, when it finds a row with an error, it returns the index locations of those errors and passes them to failing_indices. This is so that the code knows which rows the error is in, and, with knowledge of the rows, and the columns, can highlight specific places causing the error.

So, how does my rule work, and how did I arrive at it?

You don't need to use my method. If you're confident writing a rule straight in a function, then go ahead. I personally find that the easiest way to write some code, particularly when it will be called in a function, is to write it separately, then bring it into the function when I know it works. To do this, I made a new file, which I've called tinkering.py (if you do this, make sure to either delete it before committing and making a pull request, or stage your changes so it doesn't get added to the pull request, as we don't want main filled with a bunch of tester code). I then work out how to write my rule in the new tinkering.py file.

First off, I imported pandas and numpy as I know the rule will need these packages.

I then made a dataframe called df. I called it df as this is the name given to dataframes in the validate function, so any rule I write here can be copy/pasted over easily. I made the dataframe using pandas DataFrame method on two dictionaries, one with key CPPstartDate and the other with key CPPendDate. This way, the columns of the dataframe produced match the column names of the dataframe that will be passed to the validate function, again allowing easy copy/pasting later.

Finally, in the values of these dictionaries, I made lists of dates. I configured the list of dates in such a way that, in each dictionary (and therefore column) the first, second, and fourth entries would be the same. I then printed df and ran the code to ensure that the dataframe I expected was produced.

When I had a dataframe I was happy with, it was time to code the logic of the rule. The rule validates that a CP plan can't start and end on the same date, meaning that CPPstartDate and CPPendDate cannot be equal. To validate this, I took a slice of df where the value of the column CPPstartDate was equal to CPPendDate. I then used the .index attribute to take the index locations of this slice, and passed this to the failing_indices variable.

Finally, I printed failing_indices to check that it returned 0, 1, and 3, the indices of the rows of the dataframe where I had configured the dates to be the same. When I ran the code in the terminal, it printed Int64Index([0, 1, 3], dtype='int64') to the terminal so I knew my rule worked! Savvy Pythonistas may notice that I didn't convert the date strings in my dataframe to Pandas datetime objects before making the comparison. There are two reasons for this: 1) the logic of the rule works whether or not you do this, 2) when validating real data, the dates passed to the function will already be in datetime format, so there is no need to get my rule to do it.
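For reference, a tinkering file along the lines described above could look like the sketch below. The dates are invented for illustration; the only thing that matters is that the first, second, and fourth rows have matching start and end dates.

```python
import pandas as pd

# Made-up dates: rows 1, 2 and 4 (index 0, 1 and 3) start and end on the same day.
df = pd.DataFrame(
    {
        "CPPstartDate": ["26/05/2000", "27/06/2002", "07/02/2001", "26/05/2000"],
        "CPPendDate": ["26/05/2000", "27/06/2002", "08/03/2001", "26/05/2000"],
    }
)
print(df)

# Slice df to the rows where the two dates match, then take their index locations.
failing_indices = df[df["CPPstartDate"] == df["CPPendDate"]].index

print(list(failing_indices))
```

The exact text printed for an index object varies between pandas versions, so converting it to a list, as here, gives a more predictable thing to check.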

When the logic of the rule was written I could copy/paste just the logic from tinkering.py back into my rule file, over the logic for 8500, as in the image from before. If you wrote your rule in another file and made a dataframe to test it, don't delete that file yet, the dataframe you made will be really useful for the next step. Now the rule is written, we need to test that it works with the rest of the code. To do this, we need to write the test!

The test essentially checks that the rule you wrote brings up failures when you expect it to, and that it passes the right things to the rest of the code when it runs. Below is the code for rule 8500, it's also the code you'll see in your rule until you change it.

The goal here is to write something that checks that your rule raises failures when it should. Below is the test I wrote for rule 8840, with changes I made to rule 8500's code illustrated.

First off, I replaced the dataframe created to test rule 8500 with one to test rule 8840. This datframe has the same value in both CPPstartDate and CPPendDate in rows 1,2, and 4, corresponding to index locations, 0,1, and 3. I just copy/pasted the dataframe I made to write the rule originally and changed the name to child_protection_plans for the purposes of this test, but if you didn't write the rule using my workflow and instead wrote the rule as is, just create a dataframe that passes on a few rows and fails on some others, taking note of the ones it fails on.

The next change is to update the dataframe the code runs the test on to the test dataframe we just wrote. To do this, change ChildIdentifiers from rule 8500 to the table used for your rule, and the variable after the colon to the test dataframe you just wrote. For me that meant changing the statement to ChildProtectionPlans: child_protection_plans. This just lets the code know which dataframe to substitute for the real one when testing that the code works.

Next, update the statement assert len(issues) == 2 to the number of issues your dataframe should give rise to. For me this is 6. Why is it 6 if the dataframe fails in only 3 rows? Because each row throws an error for two columns, so that's 6 total errors! Having done this, update the issue locations to the index numbers where your issues should arise. Do this by completing the statement:

IssueLocator(CINTable.ChildProtectionPlans, CPPstartDate, 0),

replacing the column name and the number at the end with the column name and index location of each row that should fail. Remember: index locations start at 0, not 1, so row 4 is index 3! Make sure there is an IssueLocator line for every row AND column where your test dataframe should throw an issue. For me that's 0, 1, and 3 in both CPPstartDate and CPPendDate.
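
To see why three failing rows give six issues, the counting can be sketched as below. This is only an illustration of the arithmetic: plain (column, row) tuples stand in for the real IssueLocator objects, and the variable names are made up for this example.

```python
from itertools import product

# Index locations of the rows that should fail, and the columns each failure is reported in.
failing_rows = [0, 1, 3]
columns = ["CPPstartDate", "CPPendDate"]

# One expected issue per (column, row) pair.
# Plain tuples are a stand-in for IssueLocator here, not the real class.
expected_issues = [(column, row) for column, row in product(columns, failing_rows)]

for issue in expected_issues:
    print(issue)

print(len(expected_issues))
```

In the real test, each of these pairs corresponds to one IssueLocator line, so listing them out first is a handy way to check you haven't missed any.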

When that's done, you simply need to update assert result.definition.code == with your rule code and assert result.definition.message == with the error message for your rule and you're done!

Running and testing the rule

The next step is testing your code and that it integrates properly with the main body of code. To do this, head to main.py and enter into the terminal the statement:

python -m cin_validator test -r rules.cin2022_23

Then hit enter to run it. This tells Python to run a test on the code using the rules in the cin2022_23 folder.

You can see my terminal above. The first line was where I entered python -m cin_validator test -r rules.cin2022_23, then once I hit enter, the tests ran, testing all the rules that had been written. Rules that pass have a green dot next to them, and rules that fail have a red dot (this may be different if you have colour-blind settings specified). Finally, there's the line of green equals signs saying 2 passed. This means that my rule and the previous rules passed. Note: your rule could potentially pass if you don't write the test properly, but if you've followed the steps in this guide, that's unlikely.

If your rule doesn't pass the test, you'll have an output like below:

There are really too many reasons your code could fail to document them all here. The problem could be with your code. For instance, it could be misplaced punctuation, a misspelling, or any number of other issues. In general, the terminal is good at notifying you of where problems are and what they might be to help you fix them. For instance, if you try to use a package that you haven't imported, it will tell you. If you try to use a method that doesn't exist for an object, or spell the method wrong, it'll say so. If you spell something wrong or misplace punctuation, you'll get a syntax error. The terminal should also highlight the areas where this is wrong so you can fix it. You can see, in my example above, that there is a problem with the assert len(issues) == 5 line, as this has a > sign next to it. In this case, the number of issues was 6, not the 5 I had specified.

The problem could also be that the rule you've written doesn't do what you expect it to, or that the test itself is written incorrectly. If you keep working and can't figure out why your tests won't pass, that's a great problem to bring to the drop-in and speak to us about.

Committing and making a pull request

Once your code passes the tests, you're sure the rule does what you think it does, and it's ready to join the rest of the code, you're ready to commit your code and make a pull request. If you don't know how to do that, check out our GitHub and Codespaces guide via this link. This guide teaches the workflow of taking code from GitHub into Codespaces, and then joining code you wrote back into the main body of code.

There are some specifics to making a pull request that are not covered in that guide but are important here. For the issue your rule was under to be marked as completed when your pull request is merged into the main body of the code, you'll need to include the line 'closes #issuenumber' somewhere in your pull request comment, where issuenumber is the number of the issue that was given to your rule. For instance, for rule 8850, the issue number was #107, as seen in the image below, so my pull request comment must include the line 'closes #107'.

Also, if there is some work that needs to be done on your code, your pull request might not be accepted. Instead, a comment will be added suggesting things you need to change and fix. Don't feel bad if this happens: it's a totally normal part of the process of writing code.

Rules checking census periods

If the rule you've written checks to see if a date is within a certain census period, there are some specific things you need to do. Partly this is so there is a unified way of referring to census periods across the code. Firstly, if you write a rule that uses either (or both) the start and end of a census period, these should be assigned the variables collection_start and collection_end as appropriate within the validate function.

In the test function, they can be created using pandas.to_datetime(date_string). The period of census is defined as the 1st of April of the previous year to the 31st of March of the collection year.

For example if you choose 2022 as the collection year for the data that you want to test, then the start and end of your period of census can be defined as shown.

collection_start = pd.to_datetime('01/04/2021', format='%d/%m/%Y')

collection_end = pd.to_datetime('31/03/2022', format='%d/%m/%Y')
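
As a rough sketch of how these two variables might be used, the example below flags dates that fall outside a census period for a 2022 collection year (1 April 2021 to 31 March 2022). The dataframe and its dates are made up for illustration, and treating both boundary dates as inside the period is an assumption; check the wording of your rule in the Excel document before relying on it.

```python
import pandas as pd

# Census period for a 2022 collection year.
collection_start = pd.to_datetime("01/04/2021", format="%d/%m/%Y")
collection_end = pd.to_datetime("31/03/2022", format="%d/%m/%Y")

# Made-up test data: the first and last dates fall outside the period.
df = pd.DataFrame(
    {
        "CPPstartDate": pd.to_datetime(
            ["31/03/2021", "01/04/2021", "15/10/2021", "31/03/2022", "01/04/2022"],
            format="%d/%m/%Y",
        )
    }
)

# between() includes both endpoints by default (an assumption about how the rule treats boundaries).
in_period = df["CPPstartDate"].between(collection_start, collection_end)

# Rows whose date is outside the census period.
failing_indices = df[~in_period].index

print(list(failing_indices))
```

Including dates exactly on, just before, and just after each boundary in your test dataframe is a simple way to confirm the rule behaves as intended at the edges of the period.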