In Apache Cassandra Lunch #65: Spark Cassandra Connector Pushdown, we discussed Spark predicate pushdown in the context of the Spark Cassandra connector. The live recording of Cassandra Lunch, which includes a more in-depth discussion and a demo, is embedded below in case you were not able to attend live. If you would like to attend Apache Cassandra Lunch live, it is hosted every Wednesday at 12 PM EST. Register here now!
In general, predicate pushdown is an optimization that SparkSQL applies to a query that interacts with an external datasource. It involves pushing down filtering to be done by the underlying data store being queried, rather than doing that processing in Spark. It applies to several types of external data sources including Cassandra, whether that connection is via the Spark Cassandra connector or Datastax Enterprise. We can also use the optimization when connecting to JDBC data sources like when using the PostgreSQL driver. It also applies when connecting to Hive data sources and when reading from Parquet data sources.
Predicate pushdown happens when a query going ultimately to an external data source includes a valid predicate. A predicate is a clause that is part of the query that selects a specific subset of the data in a table based on the values in each row. A query with no predicates returns all of the data within a particular table. Any time we formulate a query that includes where or filter steps working on data from an external database, predicate pushdown may occur.
This works both for spark sql queries made directly that include a where clause, but it also includes Spark dataframes loaded from external sources that have filters or where methods applied to them. Since Spark uses lazy evaluation, loading a dataframe from an external source for one query does not confer any benefit to future queries unless we cache the dataframe containing all the data, so that it can be queried and filtered later. Loading only the data we need into Spark confers a performance benefit since we don’t need to load the entire table in and the filtering doesn’t need to take place using Spark resources.
The rules for predicate pushdown are different in Cassandra for partition key columns and clustering columns. Spark can push partition key columns down as long as we include all of the partition key columns in the filter, and no more than one equivalence predicate exists per column. Although you can use an IN clause to specify multiple restrictions for a particular column. For clustering key columns, only the last predicate in the filter can be a non-equivalence predicate. If there is more than one predicate for a column, the predicates cannot be equivalence predicates. Queries that fail to follow these rules get their associated tables loaded fully into Spark and then are filtered down using Spark resources.
We can use the explain method to see how a Spark query evolves from the provided query to the optimized plan. In these explanations, we see a PushedFilters section when predicate pushdown occurs.
The Spark Cassandra connector also has the ability to push down other types of query. Older versions of the Spark Cassandra connector included specific functions like cassandraCount to perform pushed-down versions of count queries. The alternative would be to load the table into Spark, potentially filter that dataframe, and then use the count method to get the desired value. Now, this same sequence triggers predicate pushdown and the query only returns the individual counts that Spark puts together.
Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was to not only fill the gap of Planet Cassandra but to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.
We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!