# Apache Spark and Time-Series Analytics
High-level instructions for loading data from QuestDB to Spark and back.
## What is Spark?

Apache Spark is an analytics engine for large-scale data engineering and stream processing, well known in the big data landscape. It is suitable for running data engineering, data science, and machine learning workloads on single-node machines or clusters.
## QuestDB Spark integration

A typical Spark application processes data in the following steps:
- Loading data from different sources
- Transforming and analyzing the data
- Saving the result to a data storage
Our example demonstrates these steps using QuestDB as both the data source and the storage. It loads data from QuestDB into a Spark DataFrame; the data is then enriched with new features and eventually written back into QuestDB.
## Prerequisites

- Package manager: depends on your OS. The instructions below are for macOS using Homebrew.
- QuestDB: an instance must be running and accessible. You can start one from Docker, the binaries, or Homebrew.
## Installing Apache Spark

Spark can be installed and set up in many ways, depending on requirements. Typically, it is part of a big data stack, installed on multiple nodes with an external cluster manager, such as YARN or Apache Mesos. In this tutorial, we work with a single-node standalone Spark installation.
Spark has a multi-language environment. It is written in Scala, runs on the Java Virtual Machine, and also integrates with R and Python. Our example is written in Python. Running the commands below installs Spark with all required dependencies:
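On macOS with Homebrew, an installation might look like this; Spark needs a JVM, and the `openjdk@11` formula is one option:

```shell
# Install a JVM if one is not already present; Spark runs on the JVM
brew install openjdk@11

# Install single-node standalone Spark, which includes spark-submit and pyspark
brew install apache-spark
```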
You can check the exact versions installed on your system with `spark-submit --version`.
## Installing the JDBC driver

Spark communicates with QuestDB via JDBC, connecting to its Postgres Wire Protocol endpoint. This requires the PostgreSQL JDBC driver to be on Spark's classpath.
- Create a working directory:
- Download the PostgreSQL JDBC driver from jdbc.postgresql.org into the working directory.
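The two steps together might look like this; the directory name and the driver version are illustrative, so substitute a current release:

```shell
# Create a working directory for the example
mkdir spark-questdb && cd spark-questdb

# Download the PostgreSQL JDBC driver into it
curl -fLO https://jdbc.postgresql.org/download/postgresql-42.5.1.jar
```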
## Setting up database tables

First, start QuestDB. If you are using Docker, run the following command:
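A typical invocation of the official `questdb/questdb` image publishes port 9000 (REST API and Web Console) and port 8812 (PGWire):

```shell
docker run -d --name questdb \
  -p 9000:9000 -p 8812:8812 \
  questdb/questdb
```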
The port mappings allow us to connect to QuestDB's REST and PGWire endpoints. The former is required for opening the Web Console, and the latter is used by Spark to connect to the database.
Open the Web Console in your browser at http://localhost:9000.
Run the following SQL commands using the console:
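A sketch of such commands, using QuestDB's built-in generator functions (`rnd_symbol`, `rnd_double`, `timestamp_sequence`, `long_sequence`) — the schema and value ranges are illustrative; at a 100 ms step, 2,592,000 rows cover exactly 3 days:

```sql
CREATE TABLE trades (
    symbol SYMBOL,
    price DOUBLE,
    amount DOUBLE,
    ts TIMESTAMP
) TIMESTAMP(ts) PARTITION BY DAY;

-- 3 days of synthetic trades, one row every 100 ms
INSERT INTO trades
SELECT
    rnd_symbol('BTC-USD', 'ETH-USD') symbol,
    250 + rnd_double() * 100 price,
    rnd_double() amount,
    timestamp_sequence(to_timestamp('2023-01-01', 'yyyy-MM-dd'), 100000L) ts
FROM long_sequence(2592000);
```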
The `INSERT` command generates 3 days' worth of test data and stores it in the `trades` table.
## Feature engineering examples

Save the Python code below into a file called `sparktest.py` inside the working directory:
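A minimal sketch of what such an application can look like — the connection settings assume QuestDB's defaults (user `admin`, password `quest`, PGWire on port 8812), and the 1-minute pre-aggregation and column names are illustrative assumptions:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# QuestDB's PGWire endpoint; admin/quest are QuestDB's default credentials
JDBC_URL = "jdbc:postgresql://localhost:8812/questdb"
JDBC_PROPS = {
    "driver": "org.postgresql.Driver",
    "user": "admin",
    "password": "quest",
}

spark = SparkSession.builder.appName("questdb-feature-engineering").getOrCreate()

# Load 1-minute aggregates of the trades table; the query is pushed down
# to QuestDB, so only the aggregated rows travel over JDBC
df = (
    spark.read.format("jdbc")
    .option("url", JDBC_URL)
    .option("query",
            "SELECT symbol, ts, avg(price) AS avg_price "
            "FROM trades SAMPLE BY 1m")
    .options(**JDBC_PROPS)
    .load()
)

# New features: moving average and standard deviation per symbol over the
# last 10 rows -- 10 one-minute aggregates form a 10-minute window
w = Window.partitionBy("symbol").orderBy("ts").rowsBetween(-9, 0)
enriched = (
    df.withColumn("moving_avg", F.avg("avg_price").over(w))
      .withColumn("std_dev", F.stddev("avg_price").over(w))
)

# Write the enriched data back into QuestDB
(
    enriched.write.format("jdbc")
    .option("url", JDBC_URL)
    .option("dbtable", "trades_enriched")
    .options(**JDBC_PROPS)
    .mode("append")
    .save()
)
```

Pushing the `SAMPLE BY` aggregation down to QuestDB keeps the JDBC transfer small; Spark then only computes the window features.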
This Spark application loads aggregated data from the `trades` table into a DataFrame, then adds two new features: a 10-minute moving average and the standard deviation. Finally, it writes the enriched data back into QuestDB, saving it to the `trades_enriched` table.
## Run the example

Submit the application to Spark for execution using `spark-submit`:
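Assuming the JDBC driver jar sits in the working directory next to the script (the version number is illustrative):

```shell
spark-submit --jars postgresql-42.5.1.jar sparktest.py
```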
The example requires the JDBC driver at runtime. This dependency is submitted to Spark using the `--jars` option.
After the execution is completed, we can check the content of the `trades_enriched` table:
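For example, by running a plain `SELECT` in the Web Console:

```sql
SELECT * FROM trades_enriched;
```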
The enriched data should be displayed in the Web Console.
## See also

For a more detailed explanation of the QuestDB Spark integration, see our tutorial, Integrate Apache Spark and QuestDB for Time-Series Analytics.