As part of a web application I’ve created for a client, I’ve been working on efficient analysis of somewhat large datasets for some time now. This analysis often involves complex grouping, clustering and value aggregation. SQL – and PostgreSQL aggregate functions in particular – comes in quite handy when dealing with that kind of challenge.
While RDBMS and SQL certainly are useful just for keeping and retrieving data, i.e. for running your usual CRUD operations, those tools really shine when it comes to efficiently gaining insight from raw data in a scalable manner. Sure, Microsoft Excel or Google Sheets also provide features for analyzing data and particularly for presenting the results visually in diagrams and plots. In fact, that used to be their raison d’être. However, scaling an Excel spreadsheet beyond a few thousand rows – let alone millions – isn’t exactly recommendable, and once the dataset grows large enough it stops being possible at all.
Dealing with large datasets in a scalable manner is what relational database systems were invented for in the first place, so it’s only natural to use them for large-scale data analysis and data mining, too. Apart from common and well-known aggregate functions such as COUNT, SUM or AVG, PostgreSQL also provides more sophisticated aggregate functions for calculating statistical measures, for example:
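To give a flavor of what these look like in practice, here’s a sketch of a single grouped query combining several of PostgreSQL’s statistical aggregates – STDDEV_SAMP, VAR_SAMP, CORR, REGR_SLOPE and the ordered-set aggregate PERCENTILE_CONT. The `measurements` table and its columns are hypothetical, purely for illustration:

```sql
-- Assumed table: measurements(sensor_id, reading, temperature)
SELECT
  sensor_id,
  count(*)                          AS n,
  avg(reading)                      AS mean,
  stddev_samp(reading)              AS sample_stddev,
  var_samp(reading)                 AS sample_variance,
  -- linear relationship between two columns:
  corr(reading, temperature)        AS correlation,
  regr_slope(reading, temperature)  AS slope,
  -- ordered-set aggregate for the median:
  percentile_cont(0.5) WITHIN GROUP (ORDER BY reading) AS median
FROM measurements
GROUP BY sensor_id;
```

All of this happens in a single pass over the table on the database server, which is exactly where the scalability advantage over spreadsheet tools comes from.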
In addition to that, Matt DeLand – co-founder and data scientist at Wagon – recently wrote up techniques for summarizing data with SQL. In this article he covers aspects such as:
If you work on analyzing somewhat large datasets, his article – and his company Wagon – are very much worth checking out.
If you’d like to know more about data analysis with SQL (or other subjects related to software development / data analysis / data mining), please shoot me an email.