Creating Excel spreadsheets with Python

Every month I produce usage spreadsheets for the HPC service I run. This involved much copying and pasting of data, which I already have a script to export in CSV format, into a bunch of spreadsheets for different people. In total, it takes about 7 hours (essentially a full working day of doing nothing else) to do - this post is taking some of that 7 hours to get the existing script to generate the required spreadsheets directly. This is definitely going to be a case of getting back more time than I invest.

One thing I have found is that it is far easier to prototype the spreadsheet by hand and then translate to code than to try and create a spreadsheet from scratch in code.

Install the Excel writer

Goole led me to a page on working with Python Pandas and XlsxWriter. Reading that page, I realised that using Pandas was not going to add anything (in order to achive what I wanted, I would need to “access the underlying workbook and worksheet objects” at which point I was just using XlsxWriter directly) so I may as well just use XlsxWriter directly.

I started by adding ‘XlsxWriter’ to my requirement.txt and pip installing it to my virtualenv (I noted that XlsxWriter has an easyconfig so I may just build the module but my EasyBuild is busy building other software for users at the moment.

Getting started

A little bit of background is perhaps necessary here - accounting is done at a level we have called “sub-project” (jobs are charged to a sub-project), but billed internally at a sector level with the heirarchy going “Sector -> Project -> Sub-Project”. As a result I am interested at a sector level, however project leads want to know what has been used down to each user within a sub-project.

I started by getting XlsxWriter to reproduce my “Overall” overview worksheet faithfully.

Step 1 - open an Excel spreadsheet and add the overall sheet

import xlsxwriter

writer = xlsxwriter.Workbook('test.xlsx')
worksheet = workbook.add_worksheet('Overall Totals')

Step 2 - add some information

The first information required is the date for the report (used to calculate how far through the year we are, to help see whether spending is on track or not).

On the manual version of spreadsheet, this information is cells D2-E5. Column D is the labels and E the values - 2 of which are calculated and 2 are hinted as editable (via background colour).

Adding the information is relatively straight-forward (logger is a python logger instance, start_date and end_date are python dateutil.date (imported as date) objects for the start and end of the report period and relativedelta is from python-deteutil):

worksheet.write('D2', 'End of month:')
# We assume report is for a full calendar month, warn if that is not correct
if start_date.day != 1 or start_date.month != end_date.month or (end_date+relativedelta(days=+1)).date != 1:
    logger.warning("Report is not for precisely one calendar month - calculations based on calendar months will be incorrect.")
worksheet.write('E2', start_date)

# Work out the company financial year start (1st April)
worksheet.write('D3', '1st month of year:')
# We already warned if start_date and end_date are not the same month - arbitarially piekct start_date
if start_date.month < 4:
    year = start_date.year - 1
else:
    year = start_date.year
worksheet.write('E3', date(year, 4, 1))

# Formulae for how far through the year we are
worksheet.write('D4', 'Months completed:')
worksheet.write('E4', '=DATEDIF(E3,E2,"M")+1')
worksheet.write('D5', '% of year completed:')
worksheet.write('E5', '=E4/12')

This is great, but the data is cut off, the dates and percentages are not displayed correctly and I want the labels bolded and cells which should be user editable highlighted.

Step 3 - column widths

Column widths can be modified with set_column on the worksheet object, using Excel units:

worksheet.set_column('D:D', 20)
worksheet.set_column('E:E', 18)
# ... etc.

Step 4 - formats

Formats are defined at a workbook level, not worksheet or cell level.

Bold

Starting with the labels:

format_bold = workbook.add_format({'bold': True})

And then modify all of the D column write calls to have the fomat as the 3rd argument, for example:

worksheet.write('D2', 'End of month:', format_bold)

Number formats

The numbers are next, and I needed to define seperate formats for the month display and percentage. The XslxWriter documentation has advice for finding the correct format string for Excel’s built in formats - go into format cell, select the format you want then switch it immediately to custom - the correct format string for the build-in format will be displayed.

format_month = workbook.add_format({'num_format': 'mmm-yy'})
format_percent = workbook.add_format({'num_format': '0.00%'})

and then like the bold, just add to the cells:

worksheet.write('E2', start_date, format_month)
# ...
worksheet.write('E3', date(year, 4, 1), format_month)
# ...
worksheet.write('E5', '=E4/12', format_percent)

Tip: since the number of months (cell E4) is a plain number, there is no need to specify a special format

Colour and combining formats

Finally, I want to add colour to the editable cells - E2 and E4 - however, these will not be the only editable cells in the workbook. I may want to have editable cells of any format, so rather than duplicate the existing formats I pulled out the existing formats into a dictionary:

formats = {}
formats['bold'] = {'bold': True}
formats['month'] = {'num_format': 'mmm-yy'}
formats['percent'] = {'num_format': '0.00%'}

Initially I created a snippet that generated ‘edit_’ editable variants with a solid background colour and then used that to generate a collections.namedtuple:

# Add 'editable' variants of all formats, with colour to indicate where values may be changed
# pattern 1 is solid fill
formats = {
    **{'edit_' + key: {**val, **{'pattern': 1, 'bg_color': '#FFF2CC'} for key, val in formats.items()},
    **formats
}
FormatsTuple = namedtuple('FormatsTuple', formats.keys())
formats = FormatsTuple(**{key: workbook.add_format(val) for val, key in formats.items()})

# Now can use (e.g.) formats.bold, formats.edit_month etc.

However this is less than satisfactory as it spawns formats needlessly and does not allow creation of arbitary combinations of attributes. Instead I created this shim class, which lets us define a selection of “base” formats and then arbitarially combine them - generating new formats if the combination has not been used before (it uses collections.ChainMap):

class WorkbookFormats(object):
    """
    Container class for XlsxWriter.Workbook formats, which allows arbitray combination of the formats.
    """

    def __init__(self, workbook, initial_formats={}}):
        """
        Initialiser.

        args:
            workbook: A XlsxWriter.Workbook instance to manage formats for
            initial_formats: An optional dictionary of initial formats (in the same form as would be passed to add_formats)
        """
        self.workbook = workbook
        self._formats = initial_formats
        self._generated_formats = {}
    
    def add_formats(self, formats):
        """
        Convenience method to add multiple formats.

        Calls add_format(name, format) for each pair in the formats argument.

        args:
            formats: dictionary of {name: format} formats to add
        """
        for name, format in formats.items():
            self.add_format(name, format)

    def add_format(self, name, format):
        """
        Add the format as the given name.

        args:
            name: name to use for the format
            format: format dictionary to pass to workbook.add_format
        """
        if name in _formats:
            logger.warning("Overwriting format %s with %s", name, format)
            # Clear out any old formats generated from this name
            modified_names = [x for x in self._generated_formats.keys() if x == name or x.startswith(name + '_') or x.endswith('_' + name) or ('_%s_' % name) in  x]
            for item in modified_names:
                del self._generated_formats[item]
        if '_' in name:
            logger.warning("Format name %s contains an underscorre and will not be usable in combination - try using camelCase instead", name)
        self._formats[name] = format
    
    def __getattr__(self, name):
        """
        Get the format specified, generating it if required.

        Names can be compounded (it is the purpose of this class!), for
        example `.bold_underlined` would give a format generated by
        combining `bold` and `underlined` (assuming they had been
        added) formats.

        Note that order is important, in the even of a property clash
        the first wins.  So if both `bold` and `underlined` specified
        `bg_color`, with `.bold_underlined` the one in bold would be
        used and with `.underlined_bold` the one from underlined.

        args:
            name: name (or compound name) to get
        """
        # Only need to generate if not already got this one
        if name not in self._generated_formats:
            if name in self._formats:
                # Direct base format, just first time it has been used
                self._generated_formats[name] = self.workbook.add_format(self._formats[name])
            else:
                # Assume compound format
                self._generated_formats[name] = self.workbook.add_format(dict(ChainMap(*[self._formats[format] for format in name.split('_')])))
        return self._generated_formats[name]

Using this class, I can define formats like this:

formats = {}
formats['bold'] = {'bold': True}
formats['month'] = {'num_format': 'mmm-yy'}
formats['percent'] = {'num_format': '0.00%'}
formats['editable'] = {'pattern': 1, 'bg_color': '#FFF2CC'}

formats = WorkbookFormats(workbook, formats)

Now I can use the defined formats in any combination I like, for example formats.editable_month or formats.editable_bold_month etc.

Step 4 - merging cells

The next bit of the first sheet is a large, horizontally arranged, table that summarises the totals per-sector, for the month and year, for each of compute usage, shared (infrastructure) costs and development/testing costs.

In column A it has ‘Compute’, ‘Shared costs’ and ‘Dev/Test’ lables and then in column B ‘Month’ and ‘Year’ labels for each (and other labels in column C for each of these but none of these are merged).

First thing I needed to do was define some extra formats, for aligning these cells (before passing to WorkbookFormats constructor, above):

formats['centre'] = {'align': 'center'}
formats['verticalCentre'] = {'valign': 'vcenter'}

Then I can merge the cells like this:

# Column A
worksheet.merge_range('A10:A16', 'Compute', formats.bold_centre.verticalCentre)
worksheet.merge_range('A17:A24', 'Shared costs', formats.bold_centre.verticalCentre)
worksheet.merge_range('A25:A32', 'Dev/Test', formats.bold_centre.verticalCentre)

# Column B
worksheet.merge_range('B10:B13', 'Month', formats.bold_centre.verticalCentre)
worksheet.merge_range('B14:B16', 'Year', formats.bold_centre.verticalCentre)
worksheet.merge_range('B17:B20', 'Month', formats.bold_centre.verticalCentre)
worksheet.merge_range('B21:B24', 'Year', formats.bold_centre.verticalCentre)
worksheet.merge_range('B25:B29', 'Month', formats.bold_centre.verticalCentre)
worksheet.merge_range('B30:B32', 'Year', formats.bold_centre.verticalCentre)

These basic steps can be used to complete the worksheet.

A note about ranges

XlsxWriter supporst Excel notation (which is what I have used so far) and 0-indexed ‘(row, column)’ (N.B. row and column are reversed compared to Excel notation - e.g. C5 is (6, 2)) notation for referring to cells. The latter is extremely convenient when populating the sheet with dynamic data.

The xlswriter.utility module provides xl_rowcol_to_cell which is very useful when translating row, column values for formulae.

For example:

# Populate values for each sector
column = starting_column = 3 # Start in column D
for sector in sector_list:
    worksheet.write(7, column, sector)  # Start at row 8 (0-indexed, remember)
    worksheet.write(8, column, 'FILLME', formats.editable)
    # ... etc.
    column += 1

# `column` will now point to the one after the last one filled (1 is added at the end of the loop)

# Calculate totals for the entire populated portion of the row, in the next row cell
worksheet.write(8, column, '=SUM(%s:%s)' % (xl_rowcol_to_cell(9, starting_column), xl_rowcol_to_cell(9, column-1)))
# ... etc.

Wraping things into a class

As I mentioned at the start, I need to create a number of spreadsheets for different people (oversight committees, financial controllers, project leads) and there is a significant amount of duplication between then (hence the huge amount of manual copy-and-pasting that takes place).

In order to automate this, I started wrapping the workbook generation in a class that provides a mechanism for beginning to write out seperate workbooks with the same sheets by seperating each worksheet into its own function and a set of utility methods to tear-up and down each workbook around the report data (which is the same for each workbook).

The constructor just stores the report data, does a little pre-processing and sets up variables for the current workbook (and a WorkbookFormats for it).

class ReportWriter(object):
    """
    Writer for the report data.

    Provides convenience wrappers for writing out the various workbooks
    information.
    """

    def __init__(self, start_date, end_date, report_data):
        """
        Constructor for the ReportWriter.

        args:
            state_date: python `date` object for the date the report
                        starts
            end_date: python `date` object for the date the report ends
            report_data: a dict of report data that contains two keys,
                         'accounts' and 'usage'.
                         'accounts' is a structure of information about
                         accounts, keyed on account name, providing:
                            'parent': name of parent account, or None in
                                      case of the root
                            'children': list of child account names
                            'budget': the budgetary limit set
                            'users': list of usernames with access to
                                    this account
                         'usage' is a structure of usage information
                         for the reporting period providing:
                            'account_totals': dict keyed on account
                                              name giving total usage
                                              of the account
                            'accounts': dict keyed on account name
                                        giving dict keyed on user
                                        giving per-user usage for that
                                        account
                            'users': dict keyed on user giving dict
                                     keyed on account name giving per-
                                     account usage for that user
        """
        self.start_date = start_date
        self.end_date = end_date
        self.report_data = report_data
        # Find the root account (the one with no parent)
        root_account = [key for key, value in self.report_data['accounts'].items() if value['parent'] is None]
        if len(root_account) != 1:
            logger.warning("Uable to reliably determine root account, will use %s but the following all appear to not have a parent: %s", root_account[0] if len(root_account) else None, root_account)
        self._root_account = root_account[0]
        logger.debug("Root account: %s", self._root_account)
        # Work out the worksheet names
        # Sectors
        self._worksheet_names = {key: 'S-%s' % key for key in self.report_data['accounts'][self._root_account]['children']}
        # Projects
        projects = []
        for sector in self.report_data['accounts'][self._root_account]['children']:
            projects.extend(self.reprot_data['accounts'][sector]['children'])
        self._worksheet_names.update({key: 'P-%s' % key for key in projects})
        # Sub-projects
        for project in projects:
            self._worksheet_names.update({key: 'SP-%s' % key for key in self.report_data['accounts'][project]['children']})
            subprojects_with_children = [x for x in self.report_data['accounts'][project]['children'] if self.report_data['accounts'][x]['children']]
            if subprojects_with_children:
                logger.warning("Sub-projects %s appear to have child projects - they should be leaves of the project heirarchy tree.", subprojects_with_children)
        self.workbook = None
        self._formats = None
        self._cell_references = {}  # dict to store references to key cells whose position is affected by dynamic data

Next, a utility method to check that there is a valid workbook setup when writing worksheets:

def _check_workbook(self):
    """
    Check that a workbook has been initialised.

    Logs and throws a RuntimeError if not.
    """
    err_msg = None
    if self.workbook is None:
        err_msg = "Attempted to write output without defining a workbook - did you forget to call create_workbook first?"
    elif self._formats is None:
        err_msg = "Formats not initialised for workbook - did you try to manually set the workbook instead of using create_workbook?"
    if err_msg:
        logger.error(err_msg)
        raise RuntimeError(err_msg)

A method to initialise a new workbook:

def _initialise_workbook(self):
    """
    Initialise a new workbook.

    Currently just adds the formats we want to use, as formats are at a workbook level.
    """
    # Sanity check
    err_msg = None
    if self.workbook is None:
        err_msg = "Attempted to intialise workbook without creating it first - try calling creatr_workbook to create a new workbook."
    if self._formats is not None:
        err_msg = "Foramats is not none - perhaps trying to initialise workbook twice? (This should not happen!)"
    
    if err_msg:
        logger.error(err_msg)
        raise RuntimeError(err_msg)

    # Define our workbook's formats
    formats = {}
    # Add all the formats we will use
    formats['bold'] = {'bold': True}
    formats['italic'] = {'italic': True}
    # 1 - single underline, 2 - double underline
    formats['underline'] = {'underline': 1}
    formats['month'] = {'num_format': 'mmm-yy'}
    formats['percent'] = {'num_format': '0.00%'}
    # This is a copy and paste of Excel's "accounting" format
    formats['gbp'] = {'num_format': '_-£* #,##0.00_-;-£* #,##0.00_-;_-£* "-"??_-;_-@_-'}
    # to more deciman places:
    formats['precisionGbp'] = {'num_format': '_-£* #,##0.000000_-;-£* #,##0.000000_-;_-£* "-"??_-;_-@_-'}
    formats['plainNumber'] = {'num_format': ''}
    # Pattern 1 is solid fill
    formats['editable'] = {'pattern': 1, 'bg_color': '#FFF2CC'}
    formats['centre'] = {'align': 'center'}
    formats['verticalCentre'] = {'valign': 'vcenter'}
    formats['redOnRed'] = {'font_color': '#9C0006', 'bg_color': '#FFC7CE'}

    self._formats = WorkbookFormats(self.workbook, formats)

And a method that is a context guard (using contextlib.contextmanger), which will open the workbook, yield it (for use inside the context) and close it afterwards:

@contextmanager
def create_workbook(self, filename):
    """
    Creates a workbook, yields self then closes the workbook (decorated as contextmanager)

    Intended to be used as a context guard, e.g.:
        with  instance.createworkbook('myfile.xslx'):
            instance.write_overall_totals() # for example

    args:
        filename: file to create

    yields:
        self
    """
    workbook = xlsxwriter.Workbook(filename)
    self.workbook = workbook
    self._cell_refernces = {}  # Reset this just to be safe
    self._initialise_workbook()
    try:
        yield self
    finally:
        self.workbook = None
        self._formats = None
        self._cell_refernces = {}
        workbook.close()    

The method to create a sheet (using teh code from earlier in this post) then beings to looks like this:

def write_overall_totals(self, header_only=False):
    """
    Write the Overall Totals worksheet to xslxfile

    Args:
        header_only: If True will only write the header (which
                     month this is and how far through the financial
                     year we are)
    """
    self._check_workbook()  # Make sure it is safe to write

    worksheet = self.workbook.add_worksheet('Overall Totals')

    worksheet.write('D2', 'End of month:', self._formats.bold)
    # Next cell assumes report is for a full calendar month - warn if that is not correct
    if self.start_date.day != 1 or self.start_date.month != end_date.month or (self.end_date+relativedelta(days=+1)).date != 1:
        logger.warning("Report is not for precisely one calendar month - calculations based on calendar months will be incorrect.")
    worksheet.write('E2', self.start_date, self._formats.editable_month)

    #... etc.
    # storing a reference to a cell we will need in another sheet
    self._cell_references['total_users'] = "'Overall Totals'!%s" % xl_row_col_to_cell(8, column)
    #...

Finally, adding convenience methods for each report to be generated:

def create_overall_report(self, filename):
    """
    Creates a workbook and writes out the overall report to it.

    args:
        filename: file to create
    """
    with self.create_workbook(filename):
        self.write_overall_totals()
        self.write_shared_costs()
        self.write_azure_costs()
        self.write_devtest_costs()
        for sector in self.report_data['accounts'][self._root_account]['children']:
            self.write_sector(sector)
            for project in self.report_data['accounts'][sector]['children']:
                self.write_project_and_subprojects(project)

def create_project_reports(self, dirname):
    """
    Creates workbooks for all projects, named project_name.xlsx

    args:
        dirname: a Path-like object to the directory to create the
                 files in (assumed to exist)
    """
    for sector in self.report_data['accounts'][self._root_account]['children']:
        for project in self.report_data['accounts'][sector]['children']:
            with self.create_workbook(dirname / ('%s.xlsx' % project)):
                self.write_overall_totals(header_only=True)
                self.write_project_and_subprojects(project)