Gathering data is the first part of any machine learning project. That sounds easy, and if you already store data in relational databases you may feel perfectly positioned. In some circumstances that is true. However, the way most of us store data in databases for business platforms is sub-optimal for machine learning, and extra work is required to extract the insights we want from our data.
When we store transaction data or streams of events, it is easy to gain insights about individual transactions or events through machine learning, for example filling in missing fields or performing other processing on the records we already have. But this approach fails to give us insight into our users and how they interact with our platforms. Luckily, most database providers offer functionality to turn our tables into summarized tables, which can more readily surface the insights we are after. Aggregations are the tools that make this possible. An aggregation is a function that takes our starting table, groups rows around some field within that table, and returns information about those groups.
If we want to learn more about our users, the first step is to group our data by the user field. We could do this manually by SELECTing each user's rows individually, but SQL's GROUP BY clause accomplishes it automatically and serves as the gateway to building summary tables with aggregations as additional fields.
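As a minimal sketch of this grouping, here is GROUP BY with a COUNT over an in-memory SQLite database; the table name and columns are made up for illustration:

```python
import sqlite3

# Hypothetical transactions table; schema and values are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (user_id TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [("alice", 10.0), ("alice", 25.0), ("bob", 5.0)],
)

# GROUP BY collapses the per-transaction rows into one row per user.
rows = conn.execute(
    "SELECT user_id, COUNT(*) FROM transactions GROUP BY user_id ORDER BY user_id"
).fetchall()
print(rows)  # [('alice', 2), ('bob', 1)]
```

Each user now appears exactly once, with the aggregation result alongside, which is the shape a summary table takes.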
The most basic aggregation is COUNT, which returns the number of rows associated with each user who generated the transactions or events we are looking at. We can use this count, together with other information gathered about the user, such as geographic location or demographics, to train machine learning algorithms that predict how many transactions a new user will make in our system. Doing the same with event stream data, we can predict how many times a new user will interact with our system over a time frame comparable to that of our current users.
Additionally, we can use aggregations like SUM and AVG to derive values such as total money spent or average price per transaction, extracting more useful features about our system's users. Other common aggregation functions include minimum, maximum, median, mode, and range; many databases also include a standard deviation function such as STDEV. All of these functions help pull the information we want out of our data sets. You can even build similar functionality by hand, pulling data from your tables, processing it externally, and writing it back in a new form. The ability to create summary tables is invaluable for anyone who wants insights into their data, especially insights that their current analytics cannot provide because of how the data is structured.
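Several of these functions can be combined in a single query to produce one summary row per user. A sketch, again with a hypothetical transactions table (SQLite does not ship MEDIAN or STDEV built in, so this uses the portable COUNT/SUM/AVG/MIN/MAX set):

```python
import sqlite3

# Illustrative schema and values, not from the article.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (user_id TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [("alice", 10.0), ("alice", 30.0), ("bob", 5.0)],
)

# One query builds the whole per-user summary table.
summary = conn.execute(
    """
    SELECT user_id,
           COUNT(*)    AS n_transactions,
           SUM(amount) AS total_spent,
           AVG(amount) AS avg_price,
           MIN(amount) AS min_price,
           MAX(amount) AS max_price
    FROM transactions
    GROUP BY user_id
    ORDER BY user_id
    """
).fetchall()
print(summary)
# [('alice', 2, 40.0, 20.0, 10.0, 30.0), ('bob', 1, 5.0, 5.0, 5.0, 5.0)]
```

Each column of the result is a ready-made feature for a machine learning model.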
One impressive new entry into this space is Elasticsearch's transform functionality. Elasticsearch is a search-enabled datastore that developers interact with through a set of REST APIs. Elasticsearch transforms are similar to ordinary aggregations: they turn Elasticsearch indices (comparable to database tables) into summary indices that rearrange the data in entity-centric ways.
One example in Elastic's transform documentation turns a set of website interactions into a set of user sessions. The original index contains only a user id, a timestamp, and the URL the user accessed. After the transform runs, the summary index keeps the user id and adds fields for session duration, URLs visited, and the total number of interactions with the system. This data is far more useful to someone who wants to understand the site's users, and it is much easier to feed into machine learning algorithms.
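To make the shape of that summary concrete, here is a plain-Python sketch of what such a transform computes, grouping raw interactions by user and deriving the session-level fields. This is not the transform API itself, and the event shape and field names are illustrative assumptions:

```python
from collections import defaultdict

# Hypothetical raw interaction events, one per page view.
events = [
    {"user_id": "u1", "timestamp": 100, "url": "/home"},
    {"user_id": "u1", "timestamp": 160, "url": "/pricing"},
    {"user_id": "u2", "timestamp": 120, "url": "/home"},
]

# Group the event stream by user, as the transform's group_by would.
grouped = defaultdict(list)
for event in events:
    grouped[event["user_id"]].append(event)

# Derive one entity-centric summary document per user.
sessions = {
    user: {
        # Time between the user's first and last interaction.
        "duration": max(e["timestamp"] for e in evs) - min(e["timestamp"] for e in evs),
        # Distinct URLs the user visited.
        "urls_visited": sorted({e["url"] for e in evs}),
        # Total number of interactions.
        "interactions": len(evs),
    }
    for user, evs in grouped.items()
}
print(sessions["u1"])
# {'duration': 60, 'urls_visited': ['/home', '/pricing'], 'interactions': 2}
```

The transform does this continuously inside the cluster and writes the results to a new index, but the per-user output has the same entity-centric shape.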
In conclusion, database aggregations in general, and Elasticsearch transforms in particular, make it easy to put your data to work. They let you turn the data in your business platform databases into useful analytics and machine learning data sets without forcing a sub-optimal data model on your platform, while still giving you quick access to a machine-learning-friendly organization when you need it.
Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was not only to fill the gap left by Planet Cassandra, but also to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.
We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!
Subscribe to our monthly newsletter below and never miss the latest Cassandra and data engineering news!