Creating Excel spreadsheets with Python
Every month I produce usage spreadsheets for the HPC service I run. This involved much copying and pasting of data, which I already have a script to export in CSV format, into a bunch of spreadsheets for different people. In total, it takes about 7 hours (essentially a full working day of doing nothing else) to do - this post is taking some of that 7 hours to get the existing script to generate the required spreadsheets directly. This is definitely going to be a case of getting back more time than I invest.
One thing I have found is that it is far easier to prototype the spreadsheet by hand and then translate to code than to try and create a spreadsheet from scratch in code.
Install the Excel writer
Goole led me to a page on working with Python Pandas and XlsxWriter. Reading that page, I realised that using Pandas was not going to add anything (in order to achive what I wanted, I would need to “access the underlying workbook and worksheet objects” at which point I was just using XlsxWriter directly) so I may as well just use XlsxWriter directly.
I started by adding ‘XlsxWriter’ to my requirement.txt and pip install
ing it to my virtualenv (I noted that XlsxWriter has an easyconfig so I may just build the module but my EasyBuild is busy building other software for users at the moment.
Getting started
A little bit of background is perhaps necessary here - accounting is done at a level we have called “sub-project” (jobs are charged to a sub-project), but billed internally at a sector level with the heirarchy going “Sector -> Project -> Sub-Project”. As a result I am interested at a sector level, however project leads want to know what has been used down to each user within a sub-project.
I started by getting XlsxWriter to reproduce my “Overall” overview worksheet faithfully.
Step 1 - open an Excel spreadsheet and add the overall sheet
import xlsxwriter
writer = xlsxwriter.Workbook('test.xlsx')
worksheet = workbook.add_worksheet('Overall Totals')
Step 2 - add some information
The first information required is the date for the report (used to calculate how far through the year we are, to help see whether spending is on track or not).
On the manual version of spreadsheet, this information is cells D2-E5. Column D is the labels and E the values - 2 of which are calculated and 2 are hinted as editable (via background colour).
Adding the information is relatively straight-forward (logger
is a python logger instance, start_date
and end_date
are python dateutil.date (imported as date
) objects for the start and end of the report period and relativedelta
is from python-deteutil):
worksheet.write('D2', 'End of month:')
# We assume report is for a full calendar month, warn if that is not correct
if start_date.day != 1 or start_date.month != end_date.month or (end_date+relativedelta(days=+1)).date != 1:
logger.warning("Report is not for precisely one calendar month - calculations based on calendar months will be incorrect.")
worksheet.write('E2', start_date)
# Work out the company financial year start (1st April)
worksheet.write('D3', '1st month of year:')
# We already warned if start_date and end_date are not the same month - arbitarially piekct start_date
if start_date.month < 4:
year = start_date.year - 1
else:
year = start_date.year
worksheet.write('E3', date(year, 4, 1))
# Formulae for how far through the year we are
worksheet.write('D4', 'Months completed:')
worksheet.write('E4', '=DATEDIF(E3,E2,"M")+1')
worksheet.write('D5', '% of year completed:')
worksheet.write('E5', '=E4/12')
This is great, but the data is cut off, the dates and percentages are not displayed correctly and I want the labels bolded and cells which should be user editable highlighted.
Step 3 - column widths
Column widths can be modified with set_column
on the worksheet object, using Excel units:
worksheet.set_column('D:D', 20)
worksheet.set_column('E:E', 18)
# ... etc.
Step 4 - formats
Formats are defined at a workbook level, not worksheet or cell level.
Bold
Starting with the labels:
format_bold = workbook.add_format({'bold': True})
And then modify all of the D column write
calls to have the fomat as the 3rd argument, for example:
worksheet.write('D2', 'End of month:', format_bold)
Number formats
The numbers are next, and I needed to define seperate formats for the month display and percentage. The XslxWriter documentation has advice for finding the correct format string for Excel’s built in formats - go into format cell, select the format you want then switch it immediately to custom
- the correct format string for the build-in format will be displayed.
format_month = workbook.add_format({'num_format': 'mmm-yy'})
format_percent = workbook.add_format({'num_format': '0.00%'})
and then like the bold, just add to the cells:
worksheet.write('E2', start_date, format_month)
# ...
worksheet.write('E3', date(year, 4, 1), format_month)
# ...
worksheet.write('E5', '=E4/12', format_percent)
Tip: since the number of months (cell E4) is a plain number, there is no need to specify a special format
Colour and combining formats
Finally, I want to add colour to the editable cells - E2 and E4 - however, these will not be the only editable cells in the workbook. I may want to have editable cells of any format, so rather than duplicate the existing formats I pulled out the existing formats into a dictionary:
formats = {}
formats['bold'] = {'bold': True}
formats['month'] = {'num_format': 'mmm-yy'}
formats['percent'] = {'num_format': '0.00%'}
Initially I created a snippet that generated ‘edit_’ editable variants with a solid background colour and then used that to generate a collections.namedtuple:
# Add 'editable' variants of all formats, with colour to indicate where values may be changed
# pattern 1 is solid fill
formats = {
**{'edit_' + key: {**val, **{'pattern': 1, 'bg_color': '#FFF2CC'} for key, val in formats.items()},
**formats
}
FormatsTuple = namedtuple('FormatsTuple', formats.keys())
formats = FormatsTuple(**{key: workbook.add_format(val) for val, key in formats.items()})
# Now can use (e.g.) formats.bold, formats.edit_month etc.
However this is less than satisfactory as it spawns formats needlessly and does not allow creation of arbitary combinations of attributes. Instead I created this shim class, which lets us define a selection of “base” formats and then arbitarially combine them - generating new formats if the combination has not been used before (it uses collections.ChainMap):
class WorkbookFormats(object):
"""
Container class for XlsxWriter.Workbook formats, which allows arbitray combination of the formats.
"""
def __init__(self, workbook, initial_formats={}}):
"""
Initialiser.
args:
workbook: A XlsxWriter.Workbook instance to manage formats for
initial_formats: An optional dictionary of initial formats (in the same form as would be passed to add_formats)
"""
self.workbook = workbook
self._formats = initial_formats
self._generated_formats = {}
def add_formats(self, formats):
"""
Convenience method to add multiple formats.
Calls add_format(name, format) for each pair in the formats argument.
args:
formats: dictionary of {name: format} formats to add
"""
for name, format in formats.items():
self.add_format(name, format)
def add_format(self, name, format):
"""
Add the format as the given name.
args:
name: name to use for the format
format: format dictionary to pass to workbook.add_format
"""
if name in _formats:
logger.warning("Overwriting format %s with %s", name, format)
# Clear out any old formats generated from this name
modified_names = [x for x in self._generated_formats.keys() if x == name or x.startswith(name + '_') or x.endswith('_' + name) or ('_%s_' % name) in x]
for item in modified_names:
del self._generated_formats[item]
if '_' in name:
logger.warning("Format name %s contains an underscorre and will not be usable in combination - try using camelCase instead", name)
self._formats[name] = format
def __getattr__(self, name):
"""
Get the format specified, generating it if required.
Names can be compounded (it is the purpose of this class!), for
example `.bold_underlined` would give a format generated by
combining `bold` and `underlined` (assuming they had been
added) formats.
Note that order is important, in the even of a property clash
the first wins. So if both `bold` and `underlined` specified
`bg_color`, with `.bold_underlined` the one in bold would be
used and with `.underlined_bold` the one from underlined.
args:
name: name (or compound name) to get
"""
# Only need to generate if not already got this one
if name not in self._generated_formats:
if name in self._formats:
# Direct base format, just first time it has been used
self._generated_formats[name] = self.workbook.add_format(self._formats[name])
else:
# Assume compound format
self._generated_formats[name] = self.workbook.add_format(dict(ChainMap(*[self._formats[format] for format in name.split('_')])))
return self._generated_formats[name]
Using this class, I can define formats like this:
formats = {}
formats['bold'] = {'bold': True}
formats['month'] = {'num_format': 'mmm-yy'}
formats['percent'] = {'num_format': '0.00%'}
formats['editable'] = {'pattern': 1, 'bg_color': '#FFF2CC'}
formats = WorkbookFormats(workbook, formats)
Now I can use the defined formats in any combination I like, for example formats.editable_month
or formats.editable_bold_month
etc.
Step 4 - merging cells
The next bit of the first sheet is a large, horizontally arranged, table that summarises the totals per-sector, for the month and year, for each of compute usage, shared (infrastructure) costs and development/testing costs.
In column A it has ‘Compute’, ‘Shared costs’ and ‘Dev/Test’ lables and then in column B ‘Month’ and ‘Year’ labels for each (and other labels in column C for each of these but none of these are merged).
First thing I needed to do was define some extra formats, for aligning these cells (before passing to WorkbookFormats constructor, above):
formats['centre'] = {'align': 'center'}
formats['verticalCentre'] = {'valign': 'vcenter'}
Then I can merge the cells like this:
# Column A
worksheet.merge_range('A10:A16', 'Compute', formats.bold_centre.verticalCentre)
worksheet.merge_range('A17:A24', 'Shared costs', formats.bold_centre.verticalCentre)
worksheet.merge_range('A25:A32', 'Dev/Test', formats.bold_centre.verticalCentre)
# Column B
worksheet.merge_range('B10:B13', 'Month', formats.bold_centre.verticalCentre)
worksheet.merge_range('B14:B16', 'Year', formats.bold_centre.verticalCentre)
worksheet.merge_range('B17:B20', 'Month', formats.bold_centre.verticalCentre)
worksheet.merge_range('B21:B24', 'Year', formats.bold_centre.verticalCentre)
worksheet.merge_range('B25:B29', 'Month', formats.bold_centre.verticalCentre)
worksheet.merge_range('B30:B32', 'Year', formats.bold_centre.verticalCentre)
These basic steps can be used to complete the worksheet.
A note about ranges
XlsxWriter supporst Excel notation (which is what I have used so far) and 0-indexed ‘(row, column)’ (N.B. row and column are reversed compared to Excel notation - e.g. C5 is (6, 2)) notation for referring to cells. The latter is extremely convenient when populating the sheet with dynamic data.
The xlswriter.utility module provides xl_rowcol_to_cell
which is very useful when translating row, column values for formulae.
For example:
# Populate values for each sector
column = starting_column = 3 # Start in column D
for sector in sector_list:
worksheet.write(7, column, sector) # Start at row 8 (0-indexed, remember)
worksheet.write(8, column, 'FILLME', formats.editable)
# ... etc.
column += 1
# `column` will now point to the one after the last one filled (1 is added at the end of the loop)
# Calculate totals for the entire populated portion of the row, in the next row cell
worksheet.write(8, column, '=SUM(%s:%s)' % (xl_rowcol_to_cell(9, starting_column), xl_rowcol_to_cell(9, column-1)))
# ... etc.
More advanced topics
Conditional formatting
Cells can have conditional formatting applied using a variety of criteria. Simple formatting, based on the value in another cell (simply replace the cell reference with a value to test against a static value - strings need to be quoted (e.g. {'value': '0'}
for a digit and {'value': '"string"'}
for a string)):
worksheet.conditional_format('A3:B3', {'type': 'cell', 'criteria': '>', 'value': 'A4', 'format': formats.redOnRed})
N.B. be very careful with references in value
- in the above example A3
will be compared to A4
and B3
to B4
- i.e. value
is as though it were the first cell - if everything is to be compared to a single cell use $A$4
Set default formatting
As well as changing the height and width, set_row
and set_column
can be used to specify the default format:
worksheet.set_column('C:C', 20, formats.gbp)
Limiting cell choices to a list
This is a form of data validation, in Excel. The specific type of validation to present a drop-down is list
and you can provide a Python list (for a static list) or reference to cells (for a dynamic list).
worksheet.data_validation('A2:A200', {'validate': 'list', 'source': "='Shared Costs'!$B$61:$B$69"})
N.B. As with conditional formatting be very careful with the reference in source
- the same ‘fill’ effect will apply if not made an absolute reference with $
prefixes to the column and row references.
Wraping things into a class
As I mentioned at the start, I need to create a number of spreadsheets for different people (oversight committees, financial controllers, project leads) and there is a significant amount of duplication between then (hence the huge amount of manual copy-and-pasting that takes place).
In order to automate this, I started wrapping the workbook generation in a class that provides a mechanism for beginning to write out seperate workbooks with the same sheets by seperating each worksheet into its own function and a set of utility methods to tear-up and down each workbook around the report data (which is the same for each workbook).
The constructor just stores the report data, does a little pre-processing and sets up variables for the current workbook (and a WorkbookFormats for it).
class ReportWriter(object):
"""
Writer for the report data.
Provides convenience wrappers for writing out the various workbooks
information.
"""
def __init__(self, start_date, end_date, report_data):
"""
Constructor for the ReportWriter.
args:
state_date: python `date` object for the date the report
starts
end_date: python `date` object for the date the report ends
report_data: a dict of report data that contains two keys,
'accounts' and 'usage'.
'accounts' is a structure of information about
accounts, keyed on account name, providing:
'parent': name of parent account, or None in
case of the root
'children': list of child account names
'budget': the budgetary limit set
'users': list of usernames with access to
this account
'usage' is a structure of usage information
for the reporting period providing:
'account_totals': dict keyed on account
name giving total usage
of the account
'accounts': dict keyed on account name
giving dict keyed on user
giving per-user usage for that
account
'users': dict keyed on user giving dict
keyed on account name giving per-
account usage for that user
"""
self.start_date = start_date
self.end_date = end_date
self.report_data = report_data
# Find the root account (the one with no parent)
root_account = [key for key, value in self.report_data['accounts'].items() if value['parent'] is None]
if len(root_account) != 1:
logger.warning("Uable to reliably determine root account, will use %s but the following all appear to not have a parent: %s", root_account[0] if len(root_account) else None, root_account)
self._root_account = root_account[0]
logger.debug("Root account: %s", self._root_account)
# Work out the worksheet names
# Sectors
self._worksheet_names = {key: 'S-%s' % key for key in self.report_data['accounts'][self._root_account]['children']}
# Projects
projects = []
for sector in self.report_data['accounts'][self._root_account]['children']:
projects.extend(self.reprot_data['accounts'][sector]['children'])
self._worksheet_names.update({key: 'P-%s' % key for key in projects})
# Sub-projects
for project in projects:
self._worksheet_names.update({key: 'SP-%s' % key for key in self.report_data['accounts'][project]['children']})
subprojects_with_children = [x for x in self.report_data['accounts'][project]['children'] if self.report_data['accounts'][x]['children']]
if subprojects_with_children:
logger.warning("Sub-projects %s appear to have child projects - they should be leaves of the project heirarchy tree.", subprojects_with_children)
self.workbook = None
self._formats = None
self._cell_references = {} # dict to store references to key cells whose position is affected by dynamic data
Next, a utility method to check that there is a valid workbook setup when writing worksheets:
def _check_workbook(self):
"""
Check that a workbook has been initialised.
Logs and throws a RuntimeError if not.
"""
err_msg = None
if self.workbook is None:
err_msg = "Attempted to write output without defining a workbook - did you forget to call create_workbook first?"
elif self._formats is None:
err_msg = "Formats not initialised for workbook - did you try to manually set the workbook instead of using create_workbook?"
if err_msg:
logger.error(err_msg)
raise RuntimeError(err_msg)
A method to initialise a new workbook:
def _initialise_workbook(self):
"""
Initialise a new workbook.
Currently just adds the formats we want to use, as formats are at a workbook level.
"""
# Sanity check
err_msg = None
if self.workbook is None:
err_msg = "Attempted to intialise workbook without creating it first - try calling creatr_workbook to create a new workbook."
if self._formats is not None:
err_msg = "Foramats is not none - perhaps trying to initialise workbook twice? (This should not happen!)"
if err_msg:
logger.error(err_msg)
raise RuntimeError(err_msg)
# Define our workbook's formats
formats = {}
# Add all the formats we will use
formats['bold'] = {'bold': True}
formats['italic'] = {'italic': True}
# 1 - single underline, 2 - double underline
formats['underline'] = {'underline': 1}
formats['month'] = {'num_format': 'mmm-yy'}
formats['percent'] = {'num_format': '0.00%'}
# This is a copy and paste of Excel's "accounting" format
formats['gbp'] = {'num_format': '_-£* #,##0.00_-;-£* #,##0.00_-;_-£* "-"??_-;_-@_-'}
# to more deciman places:
formats['precisionGbp'] = {'num_format': '_-£* #,##0.000000_-;-£* #,##0.000000_-;_-£* "-"??_-;_-@_-'}
formats['plainNumber'] = {'num_format': ''}
# Pattern 1 is solid fill
formats['editable'] = {'pattern': 1, 'bg_color': '#FFF2CC'}
formats['centre'] = {'align': 'center'}
formats['verticalCentre'] = {'valign': 'vcenter'}
formats['redOnRed'] = {'font_color': '#9C0006', 'bg_color': '#FFC7CE'}
self._formats = WorkbookFormats(self.workbook, formats)
And a method that is a context guard (using contextlib.contextmanger), which will open the workbook, yield it (for use inside the context) and close it afterwards:
@contextmanager
def create_workbook(self, filename):
"""
Creates a workbook, yields self then closes the workbook (decorated as contextmanager)
Intended to be used as a context guard, e.g.:
with instance.createworkbook('myfile.xslx'):
instance.write_overall_totals() # for example
args:
filename: file to create
yields:
self
"""
workbook = xlsxwriter.Workbook(filename)
self.workbook = workbook
self._cell_refernces = {} # Reset this just to be safe
self._initialise_workbook()
try:
yield self
finally:
self.workbook = None
self._formats = None
self._cell_refernces = {}
workbook.close()
The method to create a sheet (using teh code from earlier in this post) then beings to looks like this:
def write_overall_totals(self, header_only=False):
"""
Write the Overall Totals worksheet to xslxfile
Args:
header_only: If True will only write the header (which
month this is and how far through the financial
year we are)
"""
self._check_workbook() # Make sure it is safe to write
worksheet = self.workbook.add_worksheet('Overall Totals')
worksheet.write('D2', 'End of month:', self._formats.bold)
# Next cell assumes report is for a full calendar month - warn if that is not correct
if self.start_date.day != 1 or self.start_date.month != end_date.month or (self.end_date+relativedelta(days=+1)).date != 1:
logger.warning("Report is not for precisely one calendar month - calculations based on calendar months will be incorrect.")
worksheet.write('E2', self.start_date, self._formats.editable_month)
#... etc.
# storing a reference to a cell we will need in another sheet
self._cell_references['total_users'] = "'Overall Totals'!%s" % xl_row_col_to_cell(8, column)
#...
Finally, adding convenience methods for each report to be generated:
def create_overall_report(self, filename):
"""
Creates a workbook and writes out the overall report to it.
args:
filename: file to create
"""
with self.create_workbook(filename):
self.write_overall_totals()
self.write_shared_costs()
self.write_azure_costs()
self.write_devtest_costs()
for sector in self.report_data['accounts'][self._root_account]['children']:
self.write_sector(sector)
for project in self.report_data['accounts'][sector]['children']:
self.write_project_and_subprojects(project)
def create_project_reports(self, dirname):
"""
Creates workbooks for all projects, named project_name.xlsx
args:
dirname: a Path-like object to the directory to create the
files in (assumed to exist)
"""
for sector in self.report_data['accounts'][self._root_account]['children']:
for project in self.report_data['accounts'][sector]['children']:
with self.create_workbook(dirname / ('%s.xlsx' % project)):
self.write_overall_totals(header_only=True)
self.write_project_and_subprojects(project)