The stunningly fast micro-framework by Laravel.
<?php

/**
 * Reimagine what you expect...
 */
$app->get('/', function () {
    return ['version' => '5.3'];
});

/**
 * From your micro-framework...
 */
$app->post('framework/{id}', function ($framework) {
    $this->dispatch(new Energy($framework));
});

$app->get('api/users/{id}', function ($id) {
    return User::find($id);
});
Lightning fast micro-services and APIs delivered with the elegance you expect.
Lumen is the perfect solution for building Laravel based micro-services and blazing fast APIs. In fact, it's one of the fastest micro-frameworks available. It has never been easier to write stunningly fast services to support your Laravel applications.
Requests per second (higher is better): Silex 1000, Slim 3 1800, Lumen 1900.
Don't sacrifice power for speed. Use the Laravel features you love like Eloquent, caching, queues, validation, routing, middleware, and the powerful Laravel service container. All with almost zero configuration.
<?php $app->get('user/{id}', function($id) { return User::findOrFail($id); });
Have a Lumen project you want to upgrade to the full Laravel framework? It couldn't be easier. Since Lumen is powered by Laravel's components, just drop your code into a Laravel installation. You won't have to change a thing.
Laravel Vapor is a serverless deployment platform for Laravel, powered by AWS. Launch your Laravel infrastructure on Vapor and fall in love with the scalable simplicity of serverless.
A demo application to illustrate how Inertia.js works, ported to Symfony from Laravel.
Tested on both PHP 7.4 and 8.0.
Make sure you have the symfony binary (Symfony CLI) installed and in your PATH.
Clone the repo locally:
git clone https://github.com/aleksblendwerk/pingcrm-symfony.git pingcrm-symfony
cd pingcrm-symfony
Install dependencies:
composer install
yarn install
Build assets:
yarn build
The current configuration uses MySQL. Adjust the DATABASE_URL in .env accordingly (or optionally create a .env.local file and put your overrides there).
Create the database, schema and load the initial data:
composer build-database
Run the dev server:
symfony serve
You're ready to go! Visit Ping CRM in your browser, and login with:
Remember to adjust the DATABASE_URL in .env.test accordingly (or optionally create a .env.test.local file and put your overrides there).
Run the Ping CRM tests:
composer test
One of the goals for this port was to leave the original JS side of things unchanged. This promise has been kept, aside from one or two very minor changes. As a result, the PHP backend code occasionally has to jump through a few hoops to mimic the expected response data formats which are partly catered to Laravel's out-of-the-box features.
Also, I am currently not really satisfied with the whole validation workflow; it might eventually get an overhaul.
Consider this a proof of concept; I am sure there is room for improvement. If any fellow Symfony developers want to join in to tackle things in more concise or elegant ways, let's go for it!
Shout-outs to all Ping CRMs all over the world!
Inertia is a new approach to building classic server-driven web apps. We call it the modern monolith.
Inertia allows you to create fully client-side rendered, single-page apps, without much of the complexity that comes with modern SPAs. It does this by leveraging existing server-side frameworks.
Inertia has no client-side routing, nor does it require an API. Simply build controllers and page views like you've always done!
See the who is it for and how it works pages to learn more.
Inertia isn't a framework, nor is it a replacement to your existing server-side or client-side frameworks. Rather, it's designed to work with them. Think of Inertia as glue that connects the two. Inertia does this via adapters. We currently have three official client-side adapters (React, Vue, and Svelte) and two server-side adapters (Laravel and Rails).
If you're interested in following along with the development of Inertia.js, I share updates about it with my newsletter.
—Jonathan Reinink, creator of Inertia.js
Building modern web apps is hard.
Tools like Vue and React are extremely powerful, but the complexity they add to a full-stack developer's workflow is insane.
It doesn’t have to be this way...
Ok, I'm listening...
Say hello to Livewire.
Hi Livewire!
Livewire is a full-stack framework for Laravel that makes building dynamic interfaces simple, without leaving the comfort of Laravel.
Consider my interest piqued
It's not like anything you've seen before. The best way to understand it is to just look at the code. Strap on your snorkel, we're diving in.
...I'll get my floaties
Pulumi's Infrastructure as Code SDK is the easiest way to create and deploy cloud software that uses containers, serverless functions, hosted services, and infrastructure, on any cloud.
Simply write code in your favorite language and Pulumi automatically provisions and manages your AWS, Azure, Google Cloud Platform, and/or Kubernetes resources, using an infrastructure-as-code approach. Skip the YAML, and use standard language features like loops, functions, classes, and package management that you already know and love.
For example, create three web servers:
let aws = require("@pulumi/aws");
let sg = new aws.ec2.SecurityGroup("web-sg", {
ingress: [{ protocol: "tcp", fromPort: 80, toPort: 80, cidrBlocks: ["0.0.0.0/0"]}],
});
for (let i = 0; i < 3; i++) {
new aws.ec2.Instance(`web-${i}`, {
ami: "ami-7172b611",
instanceType: "t2.micro",
securityGroups: [ sg.name ],
userData: `#!/bin/bash
echo "Hello, World!" > index.html
nohup python -m SimpleHTTPServer 80 &`,
});
}
Or a simple serverless timer that archives Hacker News every day at 8:30AM:
const aws = require("@pulumi/aws");
const snapshots = new aws.dynamodb.Table("snapshots", {
attributes: [{ name: "id", type: "S", }],
hashKey: "id", billingMode: "PAY_PER_REQUEST",
});
aws.cloudwatch.onSchedule("daily-yc-snapshot", "cron(30 8 * * ? *)", () => {
require("https").get("https://news.ycombinator.com", res => {
let content = "";
res.setEncoding("utf8");
res.on("data", chunk => content += chunk);
res.on("end", () => new aws.sdk.DynamoDB.DocumentClient().put({
TableName: snapshots.name.get(),
Item: { date: Date.now(), content },
}).promise());
}).end();
});
Many examples are available spanning containers, serverless, and infrastructure in pulumi/examples.
Pulumi is open source under the Apache 2.0 license, supports many languages and clouds, and is easy to extend. This repo contains the pulumi CLI, language SDKs, and core Pulumi engine; individual libraries are in their own repos.
Getting Started: get up and running quickly.
Tutorials: walk through end-to-end workflows for creating containers, serverless functions, and other cloud services and infrastructure.
Examples: browse a number of useful examples across many languages, clouds, and scenarios including containers, serverless, and infrastructure.
Reference Docs: read conceptual documentation, in addition to details on how to configure Pulumi to deploy into your AWS, Azure, or Google Cloud accounts, and/or Kubernetes cluster.
Community Slack: join us over at our community Slack channel. Any and all discussion or questions are welcome.
Roadmap: check out what's on the roadmap for the Pulumi project over the coming months.
See the Get Started guide to quickly get started with Pulumi on your platform and cloud of choice.
Otherwise, the following steps demonstrate how to deploy your first Pulumi program, using AWS Serverless Lambdas, in minutes:
Install:
To install the latest Pulumi release, run the following (see full installation instructions for additional installation options):
$ curl -fsSL https://get.pulumi.com/ | sh
Create a Project:
After installing, you can get started with the pulumi new command:
$ mkdir pulumi-demo && cd pulumi-demo
$ pulumi new hello-aws-javascript
The new command offers templates for all languages and clouds. Run it without an argument and it'll prompt you with available projects. This command created an AWS Serverless Lambda project written in JavaScript.
Deploy to the Cloud:
Run pulumi up to get your code to the cloud:
$ pulumi up
This provisions all the cloud resources needed to run your code. Simply make edits to your project, and subsequent pulumi up runs will compute the minimal diff to deploy your changes.
Use Your Program:
Now that your code is deployed, you can interact with it. In the above example, we can curl the endpoint:
$ curl $(pulumi stack output url)
Access the Logs:
If you're using containers or functions, Pulumi's unified logging command will show all of your logs:
$ pulumi logs -f
Destroy your Resources:
After you're done, you can remove all resources created by your program:
$ pulumi destroy -y
To learn more, head over to pulumi.com for much more information, including tutorials, examples, and details of the core Pulumi CLI and programming model concepts.
Language | Status | Runtime
---|---|---
JavaScript | Stable | Node.js 10+
TypeScript | Stable | Node.js 10+
Python | Stable | Python 3.6+
Go | Stable | Go 1.13.x
.NET (C#/F#/VB.NET) | Stable | .NET Core 3.1
See Supported Clouds for the full list of supported cloud and infrastructure providers.
Please see CONTRIBUTING.md for information on building Pulumi from source or contributing improvements.
Create modern applications. Deploy to any cloud. Manage cloud environments.
I needed a solution that cut across silos and gave our developers a tool they could use themselves to provision infrastructure to suit their own immediate needs. The way Pulumi solves the multi-cloud problem is exactly what I was looking for. —Dinesh Ramamurthy, Engineering Manager, Mercedes-Benz Research and Development North America
Pulumi supercharged our infrastructure team by helping us create reusable building blocks that developers can leverage to provision new resources and enforce organizational policies for logging, permissions, resource tagging, and security. —Igor Shapiro, Principal Engineer, Lemonade
We are building a distributed-database-as-a-service product that runs on Kubernetes clusters across multiple public clouds including GCP, AWS and others. Pulumi's declarative model, the support for familiar programming languages, and the uniform workflow on any cloud make our SRE team much more efficient. —Josh Imhoff, Site Reliability Engineer, Cockroach Labs
Start Simple
Setting up the infrastructure to serve a static website is often harder than it seems — but fortunately, this is a task where Pulumi really shines.
Migrate to Pulumi
A mountain of running infrastructure shouldn’t deter you from trying Pulumi. See how easy it is to bring resources built with tools like Terraform or CloudFormation into Pulumi.
Use All the Clouds
Use Pulumi to deploy and manage a typical application across all major cloud providers using the TypeScript programming language.
See what it's like to program the cloud with Pulumi.
Give it a try! Deploy your first Pulumi app in just five minutes.
Jolt: a JSON to JSON transformation library written in Java where the "specification" for the transform is itself a JSON document.
The Stock transforms are:
shift : copy data from the input tree and put it in the output tree
default : apply default values to the tree
remove : remove data from the tree
sort : sort the Map key values alphabetically (for debugging and human readability)
cardinality : "fix" the cardinality of input data. Eg, the "urls" element is usually a List, but if there is only one, then it is a String
Each transform has its own DSL (Domain Specific Language) in order to facilitate its narrow job.
Currently, all the Stock transforms just affect the "structure" of the data. To do data manipulation, you will need to write Java code. If you write your Java "data manipulation" code to implement the Transform interface, then you can insert your code in the transform chain.
The out-of-the-box Jolt transforms should be able to do most of your structural transformation, with custom Java Transforms implementing your data manipulation.
Jolt Slide Deck : covers motivation, development, and transforms.
Javadoc explaining each transform DSL :
Running a Jolt transform means creating an instance of Chainr with a list of transforms.
The JSON spec for Chainr looks like : unit test.
The Java side looks like :
Chainr chainr = Chainr.fromSpec( JsonUtils.classpathToList( "/path/to/chainr/spec.json" ) );
Object input = elasticSearchHit.getSource(); // ElasticSearch already returns hydrated JSON
Object output = chainr.transform( input );
return output;
The Shiftr transform generally does most of the "heavy lifting" in the transform chain. To see the Shiftr DSL in action, please look at our unit tests (shiftr tests) for nice bite sized transform examples, and read the extensive Shiftr javadoc.
Our unit tests follow the pattern :
{
"input": {
// sample input
},
"spec": {
// transform spec
},
"expected": {
// what the output of the transform looks like
}
}
We read in "input", apply the "spec", and Diffy it against the "expected".
To learn the Shiftr DSL, examine "input" and "output" json, get an understanding of how data is moving, and then look at the transform spec to see how it facilitates the transform.
For reference, this was the very first test we wrote.
There is a demo available at jolt-demo.appspot.com. You can paste in JSON input data and a Spec, and it will post the data to the server and run the transform.
Note
Getting started, code-wise, has its own doc.
If you can't get a transform working and you need help, create an Issue in Jolt (for now).
Make sure you include what your "input" is, and what you want your "output" to be.
Aside from writing your own custom code to do a transform, there are two general approaches to doing JSON to JSON transforms in Java.
Aside from being a Rube Goldberg approach, XSLT is more complicated than Jolt because it is trying to do the whole transform with a single DSL.
With this approach you are working from the output format backwards to the input, which is complex for any non-trivial transform. Eg, the structure of your template will be dictated by the output JSON format, and you will end up coding a parallel tree walk of the input data and the output format in your template. Jolt works forward from the input data to the output format which is simpler, and it does the parallel tree walk for you.
Being in the Java JSON processing "space", here are some other interesting JSON manipulation tools to look at / consider :
The primary goal of Jolt was to improve "developer speed" by providing the ability to have declarative rather than imperative transforms. That said, Jolt should have a better runtime than the alternatives listed above.
Work has been done to make the stock Jolt transforms fast:
Two things to be aware of :
Jolt Transforms and tools can be run from the command line. Command line interface doc here.
For the moment we have Cobertura configured in our poms.
mvn cobertura:cobertura
open jolt-core/target/site/cobertura/index.html
Currently, for the jolt-core artifact, code coverage is at 89% line, and 83% branch.
In order to get optimal performance from Cassandra, it's important to understand how it stores data on disk. It's a common problem among new users coming from an RDBMS background to not consider their queries while designing their column families (a.k.a. tables). Cassandra's CQL interface returns data in tabular format and it might give the illusion that we can query it just like any RDBMS, but that's not the case.
All Cassandra data is persisted in SSTables (Sorted String Tables) inside the data directory. The default location of the data directory is $CASSANDRA_HOME/data/data. You can change it using the data_file_directories config in conf/cassandra.yaml. On a fresh setup, here's what the data directory looks like:
data/
├── system
├── system_auth
├── system_distributed
├── system_schema
└── system_traces
Each directory in data represents a keyspace. These are internal keyspaces used by Cassandra.
Let's create a new keyspace:
CREATE KEYSPACE ks1 WITH replication={'class':'SimpleStrategy','replication_factor':1};
Since we have not yet created any table, you will not see a ks1 directory in the data dir yet, but you can check Cassandra's system_schema.keyspaces table:
cqlsh> select * from system_schema.keyspaces where keyspace_name = 'ks1';

 keyspace_name | durable_writes | replication
---------------+----------------+-------------------------------------------------------------------------------------
           ks1 |           True | {'class': 'org.apache.cassandra.locator.SimpleStrategy', 'replication_factor': '1'}
Let's create a table now:
CREATE TABLE user_tracking (
    user_id text,
    action_category text,
    action_id text,
    action_detail text,
    PRIMARY KEY(user_id, action_category, action_id)
);
As soon as you create the table, here's what the data directory looks like:
ks1/
└── tb1-ed4784f0b64711e7b18a2f179b6f38f9
    └── backups
It created a directory named <table>-<table_id>. Cassandra creates a subdirectory for each table, and all the data for this table will be contained in this dir. You can move this dir to a different location in the future by creating a symlink.
You can check for this table in the system_schema.tables table:
cqlsh:ks1> select * from system_schema.tables where keyspace_name = 'ks1';

 keyspace_name | table_name | bloom_filter_fp_chance | caching | comment | compaction | compression | crc_check_chance | dclocal_read_repair_chance | default_time_to_live | extensions | flags | gc_grace_seconds | id | max_index_interval | memtable_flush_period_in_ms | min_index_interval | read_repair_chance | speculative_retry
---------------+------------+------------------------+---------+---------+------------+-------------+------------------+----------------------------+----------------------+------------+-------+------------------+----+--------------------+-----------------------------+--------------------+--------------------+-------------------
 ks1 | tb1 | 0.01 | {'keys': 'ALL', 'rows_per_partition': 'NONE'} | | {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'} | {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'} | 1 | 0.1 | 0 | {} | {'compound'} | 864000 | ed4784f0-b647-11e7-b18a-2f179b6f38f9 | 2048 | 0 | 128 | 0 | 99PERCENTILE

(1 rows)
You will notice the table defaults like gc_grace_seconds and memtable_flush_period_in_ms that got applied.
Let's insert some data into this table:
insert into ks1.user_tracking(user_id, action_category, action_id, action_detail) VALUES ('user1', 'auth', 'a1', 'Logged in from home page');
Now let's check the table data directory. You will notice that no new file has been created. That's because C* first writes data in memory and then flushes it to disk after a certain threshold is reached. For durability purposes it writes data to an append-only commit log. The commit log is shared across keyspaces. By default its location is $CASSANDRA_HOME/data/commitlog and it can be changed in conf/cassandra.yaml.
To force flushing memtable data to disk as an SSTable, use the following commands:
bin/nodetool flush                          //flush all tables of all keyspaces
bin/nodetool flush <keyspace>               //flush all tables of a single keyspace
bin/nodetool flush <keyspace> <table_name>  //flush a single table from a keyspace
After flushing, here's what the content of the table directory looks like. Each flush creates a new SSTable on disk. For each SSTable, the following set of files is created:
user_tracking-49eb78d0b65a11e7b18a2f179b6f38f9/
├── backups
├── mc-5-big-CompressionInfo.db
├── mc-5-big-Data.db
├── mc-5-big-Digest.crc32
├── mc-5-big-Filter.db
├── mc-5-big-Index.db
├── mc-5-big-Statistics.db
├── mc-5-big-Summary.db
└── mc-5-big-TOC.txt

1 directory, 8 files
All files share a common naming convention: <version>-<generation>-<format>-<component>.db (for example, mc-5-big-Data.db has version mc, generation 5, format big and component Data).
The table below provides a brief description for each component.
File | Description
---|---
mc-1-big-TOC.txt | A file that lists the components for the given SSTable.
mc-1-big-Digest.crc32 | A file that consists of a checksum of the data file.
mc-1-big-CompressionInfo.db | A file that contains meta data for the compression algorithm, if enabled.
mc-1-big-Statistics.db | A file that holds statistical metadata about the SSTable.
mc-1-big-Index.db | A file that contains the primary index data.
mc-1-big-Summary.db | This file provides summary data of the primary index, e.g. index boundaries, and is supposed to be stored in memory.
mc-1-big-Filter.db | This file embraces a data structure used to validate if row data exists in memory, i.e. to minimize the access of data on disk.
mc-1-big-Data.db | This file contains the base data itself. Note: all the other component files can be regenerated from the base data file.
Let's insert a few more entries into the table and see what the data directory looks like:
insert into ks1.user_tracking(user_id, action_category, action_id, action_detail) VALUES ('user1', 'auth', 'a1', 'Logged in from home page');
insert into ks1.user_tracking(user_id, action_category, action_id, action_detail) VALUES ('user1', 'auth', 'a2', 'Logged in from email link');
insert into ks1.user_tracking(user_id, action_category, action_id, action_detail) VALUES ('user1', 'dashboard', 'a3', 'Opened dashboard link');
insert into ks1.user_tracking(user_id, action_category, action_id, action_detail) VALUES ('user2', 'auth', 'a4', 'Logged in');
Again run bin/nodetool flush to flush the data to an SSTable and check the filesystem:
sam@sam-ub:ks1$ tree user_tracking-49eb78d0b65a11e7b18a2f179b6f38f9/
user_tracking-49eb78d0b65a11e7b18a2f179b6f38f9/
├── backups
├── mc-7-big-CompressionInfo.db
├── mc-7-big-Data.db
├── mc-7-big-Digest.crc32
├── mc-7-big-Filter.db
├── mc-7-big-Index.db
├── mc-7-big-Statistics.db
├── mc-7-big-Summary.db
├── mc-7-big-TOC.txt
├── mc-8-big-CompressionInfo.db
├── mc-8-big-Data.db
├── mc-8-big-Digest.crc32
├── mc-8-big-Filter.db
├── mc-8-big-Index.db
├── mc-8-big-Statistics.db
├── mc-8-big-Summary.db
└── mc-8-big-TOC.txt

1 directory, 16 files
You will notice that new SSTables with higher generation numbers (7 and 8) have been created. Cassandra periodically merges these SSTables through a process called compaction. You can force compaction by running one of the following commands:
bin/nodetool compact                          //compact SSTables of all tables of all keyspaces
bin/nodetool compact <keyspace>               //compact SSTables of all tables of a single keyspace
bin/nodetool compact <keyspace> <table_name>  //compact SSTables of a single table
You will notice that after compaction is complete there is just a single SSTable with a new generation. Compaction causes a temporary spike in disk space usage and disk I/O while old and new SSTables co-exist. As it completes, compaction frees up disk space occupied by old SSTables.
Now let's see how data is actually stored in this SSTable.
C* comes bundled with a utility called sstabledump which can be used to see the content of an SSTable in JSON or row format. Here's what you will see:
[
  {
    "partition" : {
      "key" : [ "user2" ],
      "position" : 0
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 45,
        "clustering" : [ "auth", "a4" ],
        "liveness_info" : { "tstamp" : "2017-10-24T20:33:32.772370Z" },
        "cells" : [
          { "name" : "action_detail", "value" : "Logged in" }
        ]
      }
    ]
  },
  {
    "partition" : {
      "key" : [ "user1" ],
      "position" : 46
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 104,
        "clustering" : [ "auth", "a1" ],
        "liveness_info" : { "tstamp" : "2017-10-24T20:33:32.074848Z" },
        "cells" : [
          { "name" : "action_detail", "value" : "Logged in from home page" }
        ]
      },
      {
        "type" : "row",
        "position" : 104,
        "clustering" : [ "auth", "a2" ],
        "liveness_info" : { "tstamp" : "2017-10-24T20:33:32.085959Z" },
        "cells" : [
          { "name" : "action_detail", "value" : "Logged in from email link" }
        ]
      },
      {
        "type" : "row",
        "position" : 145,
        "clustering" : [ "dashboard", "a3" ],
        "liveness_info" : { "tstamp" : "2017-10-24T20:33:32.099739Z" },
        "cells" : [
          { "name" : "action_detail", "value" : "Opened dashboard link" }
        ]
      }
    ]
  }
]
Here you will notice 2 “rows”. Don't confuse this row with an RDBMS row: this “row” actually means a partition. The number of rows (partitions) in an SSTable is determined by your primary key. Here's the key we used: PRIMARY KEY(user_id, action_category, action_id).
The primary key consists of 2 parts: partitioning and clustering fields.
The first part of our key, i.e. user_id, will be used for partitioning, and the remaining part, i.e. action_category and action_id, will be used for clustering.
To understand this better, let's create a new table with the exact same definition but a changed partitioning key:
CREATE TABLE user_tracking_new (
    user_id text,
    action_category text,
    action_id text,
    action_detail text,
    PRIMARY KEY((user_id, action_category), action_id)
);
In the above definition, the partitioning key will be user_id + action_category and the clustering key will be just action_id. Let's insert the exact same rows and notice how they get stored in the SSTable:
insert into ks1.user_tracking_new(user_id, action_category, action_id, action_detail) VALUES ('user1', 'auth', 'a1', 'Logged in from home page');
insert into ks1.user_tracking_new(user_id, action_category, action_id, action_detail) VALUES ('user1', 'auth', 'a2', 'Logged in from email link');
insert into ks1.user_tracking_new(user_id, action_category, action_id, action_detail) VALUES ('user1', 'dashboard', 'a3', 'Opened dashboard link');
insert into ks1.user_tracking_new(user_id, action_category, action_id, action_detail) VALUES ('user2', 'auth', 'a4', 'Logged in');
[
  {
    "partition" : {
      "key" : [ "user1", "dashboard" ],
      "position" : 0
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 67,
        "clustering" : [ "a3" ],
        "liveness_info" : { "tstamp" : "2017-10-24T20:32:45.633901Z" },
        "cells" : [
          { "name" : "action_detail", "value" : "Opened dashboard link" }
        ]
      }
    ]
  },
  {
    "partition" : {
      "key" : [ "user2", "auth" ],
      "position" : 68
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 118,
        "clustering" : [ "a4" ],
        "liveness_info" : { "tstamp" : "2017-10-24T20:32:45.648367Z" },
        "cells" : [
          { "name" : "action_detail", "value" : "Logged in" }
        ]
      }
    ]
  },
  {
    "partition" : {
      "key" : [ "user1", "auth" ],
      "position" : 119
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 182,
        "clustering" : [ "a1" ],
        "liveness_info" : { "tstamp" : "2017-10-24T20:32:45.614746Z" },
        "cells" : [
          { "name" : "action_detail", "value" : "Logged in from home page" }
        ]
      },
      {
        "type" : "row",
        "position" : 182,
        "clustering" : [ "a2" ],
        "liveness_info" : { "tstamp" : "2017-10-24T20:32:45.624710Z" },
        "cells" : [
          { "name" : "action_detail", "value" : "Logged in from email link" }
        ]
      }
    ]
  }
]
You will notice that in this example, for each user, each category of data went into a different partition.
The first rule of data modelling is that you should choose your partitioning key in such a way that, for a single query, only one partition is read.
Let's run some queries and see the impact of partitioning:
cqlsh:ks1> select * from user_tracking where user_id = 'user1';

 user_id | action_category | action_id | action_detail
---------+-----------------+-----------+---------------------------
   user1 |            auth |        a1 |  Logged in from home page
   user1 |            auth |        a2 | Logged in from email link
   user1 |       dashboard |        a3 |     Opened dashboard link

(3 rows)
Let's run the same query on user_tracking_new:
cqlsh:ks1> select * from user_tracking_new where user_id = 'user2';
InvalidRequest: Error from server: code=2200 [Invalid query] message="Partition key parts: action_category must be restricted as other parts are"
The reason you got the error is that, at a minimum, you need to specify all the columns which are part of the partition key. C* needs to know a way to get to a partition before it can query the data inside it. In the first case it worked because only user_id was part of the partition key.
That means user_tracking_new can only be used to query user data when the category is also known. You might wonder why you would even prefer it over the first CF. The reason is that in the first CF, for a huge volume of user activity, a partition can grow very large and hence cause performance issues. Our goal is to keep partitions at a reasonable size so as not to affect query performance.
Let's try another query:
cqlsh:ks1> select * from user_tracking_new where user_id = 'user1' and action_category = 'auth' and action_detail = 'Logged in from home page';
InvalidRequest: Error from server: code=2200 [Invalid query] message="Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING"
You will see that it failed to fetch the result, the reason being that the filter on action_detail is not covered by the primary key, so C* warns that it would have to scan and filter data with unpredictable performance. If we want to force it, we can add ALLOW FILTERING at the end of the query:
cqlsh:ks1> select * from user_tracking_new where user_id = 'user1' and action_category = 'auth' and action_detail = 'Logged in from home page' ALLOW FILTERING;

 user_id | action_category | action_id | action_detail
---------+-----------------+-----------+--------------------------
   user1 |            auth |        a1 | Logged in from home page

(1 rows)
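To make the single-partition rule concrete from application code, here is a minimal hedged sketch using the DataStax Python driver (the driver, contact point and printed columns are assumptions on top of this post, which only uses cqlsh):

# Minimal sketch: reading one partition of user_tracking with the
# DataStax Python driver (pip install cassandra-driver assumed).
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])      # contact point of the local C* node
session = cluster.connect("ks1")

# The full partition key (user_id) is restricted, so exactly one
# partition is read, matching the data modelling rule above.
rows = session.execute(
    "SELECT action_category, action_id, action_detail "
    "FROM user_tracking WHERE user_id = %s",
    ("user1",),
)
for row in rows:
    print(row.action_category, row.action_id, row.action_detail)

cluster.shutdown()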
Clustering order is another extremely powerful feature available in Cassandra: it allows you to naturally store records in a given order based on the value of a particular column. Every time you write to the table, Cassandra figures out where that record belongs within its physical data partition and stores it in the order you told it to. Storing data in sorted order gives drastic query performance improvements for range queries, which is very significant for time-series data.
CREATE TABLE user_tracking_ordered (
    user_id text,
    action_category text,
    action_id text,
    action_detail text,
    PRIMARY KEY((user_id, action_category), action_id)
) WITH CLUSTERING ORDER BY (action_id ASC);
Now insert data into this CF:
insert into ks1.user_tracking_ordered(user_id, action_category, action_id, action_detail) VALUES ('user1', 'auth', 'a1', 'Logged in');
insert into ks1.user_tracking_ordered(user_id, action_category, action_id, action_detail) VALUES ('user1', 'auth', 'a2', 'Logged in');
insert into ks1.user_tracking_ordered(user_id, action_category, action_id, action_detail) VALUES ('user1', 'auth', 'a3', 'Logged in');
insert into ks1.user_tracking_ordered(user_id, action_category, action_id, action_detail) VALUES ('user1', 'auth', 'a4', 'Logged in');
insert into ks1.user_tracking_ordered(user_id, action_category, action_id, action_detail) VALUES ('user1', 'auth', 'a5', 'Logged in');
If you fetch one row, you will get it in sorted (ascending) order:
cqlsh:ks1> select * from user_tracking_ordered limit 1;

 user_id | action_category | action_id | action_detail
---------+-----------------+-----------+---------------
   user1 |            auth |        a1 |     Logged in

(1 rows)
If you change the clustering order to DESC, i.e. WITH CLUSTERING ORDER BY (action_id DESC), and do the same thing, you can see it now returns rows sorted in descending order:
cqlsh:ks1> select * from user_tracking_ordered limit 1;

 user_id | action_category | action_id | action_detail
---------+-----------------+-----------+---------------
   user1 |            auth |        a5 |     Logged in

(1 rows)
You can see the same ordering reflected in the SSTable too:
$ sstabledump mc-*-big-Data.db
[
  {
    "partition" : {
      "key" : [ "user1", "auth" ],
      "position" : 0
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 50,
        "clustering" : [ "a5" ],
        "liveness_info" : { "tstamp" : "2017-10-25T08:21:43.628921Z" },
        "cells" : [
          { "name" : "action_detail", "value" : "Logged in" }
        ]
      },
      { ... "clustering" : [ "a4" ], ... },
      { ... "clustering" : [ "a3" ], ... },
      { ... "clustering" : [ "a2" ], ... },
      { ... "clustering" : [ "a1" ], ... }
    ]
  }
]
You will notice that for each column stored in Cassandra, it has to maintain some other metadata too. You need to understand these overheads as well for better capacity planning; I will cover them in a future post.
I hope you got some idea of why CQL looks like SQL but doesn't behave like it. Play around with different combinations of primary keys and see what the data looks like in the SSTable to get a better understanding of data storage, which will help you model it based on your queries.
Download hundreds of benchmark network data sets from a variety of network types. Also share and contribute by uploading recent network data sets. Naturally all conceivable data may be represented as a graph for analysis. This includes social network data, brain networks, temporal network data, web graph datasets, road networks, retweet networks, labeled graphs, and numerous other real-world graph datasets.
Network data can be visualized and explored in real-time on the web via our web-based interactive network visual analytics platform.
Build pipelines of computations written in Spark, SQL, DBT, or any other framework.
Locally develop pipelines in-process, then flexibly deploy on Kubernetes or your custom infrastructure.
Unify your view of pipelines and the tables, ML models, and other assets they produce.
Dagster is a data orchestrator for machine learning, analytics, and ETL
Dagster lets you define pipelines in terms of the data flow between reusable, logical components, then test locally and run anywhere. With a unified view of pipelines and the assets they produce, Dagster can schedule and orchestrate Pandas, Spark, SQL, or anything else that Python can invoke.
Dagster is designed for data platform engineers, data engineers, and full-stack data scientists. Building a data platform with Dagster makes your stakeholders more independent and your systems more robust. Developing data pipelines with Dagster makes testing easier and deploying faster.
With Dagster’s pluggable execution, the same pipeline can run in-process against your local file system, or on a distributed work queue against your production data lake. You can set up Dagster’s web interface in a minute on your laptop, or deploy it on-premise or in any cloud.
Dagster models data dependencies between steps in your orchestration graph and handles passing data between them. Optional typing on inputs and outputs helps catch bugs early.
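As a hedged illustration of that optional typing (a small sketch against the pre-1.0 solid/pipeline API used in the hello example below, not an excerpt from the docs), annotating a solid's inputs and output lets Dagster reject mismatched values before your logic runs:

from dagster import execute_pipeline, pipeline, solid

@solid
def load_count(_) -> int:
    # Returning anything other than an int here would fail the output type check.
    return 42

@solid
def report(context, count: int):
    context.log.info("count is {}".format(count))

@pipeline
def typed_pipeline():
    report(load_count())

if __name__ == "__main__":
    execute_pipeline(typed_pipeline)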
Dagster’s Asset Manager tracks the data sets and ML models produced by your pipelines, so you can understand how they were generated and trace issues when they don’t look how you expect.
Dagster helps platform teams build systems for data practitioners. Pipelines are built from shared, reusable, configurable data processing and infrastructure components. Dagster’s web interface lets anyone inspect these objects and discover how to use them.
Dagster’s repository model lets you isolate codebases, so that problems in one pipeline don’t bring down the rest. Each pipeline can have its own package dependencies and Python version. Pipelines run in isolated processes so user code issues can't bring the system down.
Dagit, Dagster’s web interface, includes expansive facilities for understanding the pipelines it orchestrates. When inspecting a pipeline run, you can query over logs, discover the most time consuming tasks via a Gantt chart, re-execute subsets of steps, and more.
pip install dagster dagit
This installs two modules: dagster, the core programming model, execution engine, and CLI; and dagit, the web UI for developing and operating Dagster pipelines.
hello_dagster.py
from dagster import execute_pipeline, pipeline, solid
@solid
def get_name(_):
return 'dagster'
@solid
def hello(context, name: str):
context.log.info('Hello, {name}!'.format(name=name))
@pipeline
def hello_pipeline():
hello(get_name())
Save the code above in a file named hello_dagster.py. You can execute the pipeline using any one of the following methods:
(1) Dagster Python API
if __name__ == "__main__":
execute_pipeline(hello_pipeline) # Hello, dagster!
(2) Dagster CLI
$ dagster pipeline execute -f hello_dagster.py
(3) Dagit web UI
$ dagit -f hello_dagster.py
Next, jump right into our tutorial, or read our complete documentation. If you're actively using Dagster or have questions on getting started, we'd love to hear from you:
For details on contributing or running the project for development, check out our contributing guide.
Dagster works with the tools and systems that you're already using with your data, including:
Integration | Dagster Library
---|---
Apache Airflow | dagster-airflow: Allows Dagster pipelines to be scheduled and executed, either containerized or uncontainerized, as Apache Airflow DAGs.
Apache Spark | dagster-spark · dagster-pyspark: Libraries for interacting with Apache Spark and PySpark.
Dask | dagster-dask: Provides a Dagster integration with Dask / Dask.Distributed.
Datadog | dagster-datadog: Provides a Dagster resource for publishing metrics to Datadog.
Jupyter / Papermill | dagstermill: Built on the papermill library, dagstermill is meant for integrating productionized Jupyter notebooks into dagster pipelines.
PagerDuty | dagster-pagerduty: A library for creating PagerDuty alerts from Dagster workflows.
Snowflake | dagster-snowflake: A library for interacting with the Snowflake Data Warehouse.
Cloud Providers |
AWS | dagster-aws: A library for interacting with Amazon Web Services. Provides integrations with Cloudwatch, S3, EMR, and Redshift.
Azure | dagster-azure: A library for interacting with Microsoft Azure.
GCP | dagster-gcp: A library for interacting with Google Cloud Platform. Provides integrations with GCS, BigQuery, and Cloud Dataproc.
This list is growing as we are actively building more integrations, and we welcome contributions!
Big Data is complex; I have written quite a bit about the vast ecosystem and the wide range of options available. One aspect that is often ignored but critical is managing the execution of the different steps of a big data pipeline. Quite often the decision on the framework or the design of the execution process is deferred to a later stage, causing many issues and delays on the project.
You should design your pipeline orchestration early on to avoid issues during the deployment stage. Orchestration should be treated like any other deliverable; it should be planned, implemented, tested and reviewed by all stakeholders.
Orchestration frameworks are often ignored and many companies end up implementing custom solutions for their pipelines. This is not only costly but also inefficient, since custom orchestration solutions tend to face the same problems that out-of-the-box frameworks have already solved, creating a long cycle of trial and error.
In this article, I will present some of the most common open source orchestration frameworks.
Data pipeline orchestration is a cross-cutting process which manages the dependencies between your pipeline tasks, schedules jobs and much more. If you use stream processing, you need to orchestrate the dependencies of each streaming app; for batch, you need to schedule and orchestrate the jobs.
Remember, tasks and applications may fail, so you need a way to schedule, reschedule, replay, monitor, retry and debug your whole data pipeline in a unified way.
Some of the functionality provided by orchestration frameworks is:
Let’s review some of the options…
Apache Oozie is a scheduler for Hadoop; jobs are created as DAGs and can be triggered by a cron-based schedule or by data availability. Oozie is a scalable, reliable and extensible system that runs as a Java web application. It has integrations with ingestion tools such as Sqoop and processing frameworks such as Spark.
Oozie workflow definitions are written in hPDL (XML). Workflows contain control flow nodes and action nodes. Control flow nodes define the beginning and the end of a workflow (start, end and fail nodes) and provide a mechanism to control the workflow execution path (decision, fork and join nodes)[1].
Action nodes are the mechanism by which a workflow triggers the execution of a task. Oozie provides support for different types of actions (map-reduce, Pig, SSH, HTTP, eMail…) and can be extended to support additional type of actions[1].
Also, workflows can be parameterized and several identical workflow jobs can run concurrently.
It was the first scheduler for Hadoop and quite popular, but it has become a bit outdated; still, it is a great choice if you rely entirely on the Hadoop platform.
Airflow is a platform that allows you to schedule, run and monitor workflows. It has become the most famous orchestrator for big data pipelines thanks to its ease of use and the innovative workflow-as-code approach, where DAGs are defined in Python code that can be tested like any other software deliverable.
It uses DAGs to create complex workflows. Each node in the graph is a task, and edges define dependencies among the tasks. Tasks belong to two categories: operators, which execute some operation, and sensors, which wait for a condition or an external trigger.
The Airflow scheduler executes your tasks on an array of workers while following the dependencies you specified. It has a modular architecture and uses a message queue to orchestrate an arbitrary number of workers and can scale to infinity[2].
Airflow executes the DAG for you, maximizing parallelism across independent tasks. The DAGs are written in Python, so you can run them locally, unit test them and integrate them with your development workflow. When workflows are defined as code, they become more maintainable, versionable, testable, and collaborative[2].
The rich UI makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed[2]. It is fast, easy to use and very useful. It has several views and many ways to troubleshoot issues. It keeps the history of your runs for later reference.
It is very straightforward to install. You just need Python. It has two processes, the web UI and the scheduler, which run independently.
Principles[2]:
Although Airflow flows are written as code, Airflow is not a data streaming solution[2]. Also, workflows are expected to be mostly static or slowly changing; for very small dynamic jobs there are other options that we will discuss later.
It is simple and stateless, although XCOM functionality is used to pass small metadata between tasks, which is often required, for example when you need some kind of correlation ID. It also supports variables and parameterized jobs. Finally, it supports SLAs and alerting, and it can be integrated with on-call tools for monitoring.
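To make the XCOM point concrete, here is a hedged sketch of a tiny DAG (Airflow 2.x import paths assumed; the DAG and task names are made up) where one task pushes a correlation ID and the downstream task pulls it:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator  # Airflow 2.x path assumed

def generate_correlation_id(**context):
    # The return value is pushed to XCom under the key "return_value".
    return "run-{}".format(context["ds"])

def process(**context):
    corr_id = context["ti"].xcom_pull(task_ids="generate_correlation_id")
    print("processing batch with correlation id", corr_id)

with DAG(
    dag_id="xcom_example",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    generate = PythonOperator(
        task_id="generate_correlation_id",
        python_callable=generate_correlation_id,
    )
    consume = PythonOperator(task_id="process", python_callable=process)
    generate >> consume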
Luigi is an alternative to Airflow with similar functionality, but Airflow has more features and scales up better than Luigi.
Dagster is a newer orchestrator for machine learning, analytics, and ETL[3]. The main difference is that you can track the inputs and outputs of the data, similar to Apache NiFi, creating a data flow solution. This means that it tracks the execution state and can materialize values as part of the execution steps. You can test locally and run anywhere with a unified view of data pipelines and assets. It supports any cloud environment.
Dagster models data dependencies between steps in your orchestration graph and handles passing data between them. Optional typing on inputs and outputs helps catch bugs early[3]. Pipelines are built from shared, reusable, configurable data processing and infrastructure components. Dagster’s web UI lets anyone inspect these objects and discover how to use them[3].
It can also run several jobs in parallel, it is easy to add parameters, easy to test, and it provides simple versioning, great logging, troubleshooting capabilities and much more. It is more feature-rich than Airflow but it is still a bit immature, and because it needs to keep track of the data it may be difficult to scale, a problem shared with NiFi due to their stateful nature. It is also heavily based on the Python ecosystem.
Prefect is similar to Dagster: it provides local testing, versioning, parameter management and much more. It is also Python-based.
What makes Prefect different from the rest is that it aims to overcome the limitations of the Airflow execution engine, such as an improved scheduler, parametrized workflows, dynamic workflows, versioning and improved testing. Versioning is a must-have for many DevOps-oriented organizations; it is still not supported by Airflow, and Prefect does support it.
It has a core open source workflow management system and also a cloud offering which requires no setup at all. Prefect Cloud is powered by GraphQL, Dask, and Kubernetes, so it’s ready for anything[4]. The UI is only available in the cloud offering.
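For comparison, a minimal hedged Prefect sketch (first-generation prefect core / 1.x API with the Flow context manager; the flow name and parameter are made up, and newer Prefect releases use a different API):

from prefect import Flow, Parameter, task  # Prefect core / 1.x-era API assumed

@task
def extract(source):
    # Pretend to read records from the configured source.
    return ["record-1", "record-2"]

@task
def load(records):
    print("loaded {} records".format(len(records)))

with Flow("etl") as flow:
    source = Parameter("source", default="s3://bucket/raw")
    load(extract(source))

if __name__ == "__main__":
    flow.run()  # local run; registering with a backend adds versioning and the UI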
Apache NiFi is not an orchestration framework but a wider dataflow solution. NiFi can also schedule jobs, monitor, route data, alert and much more. It is focused on data flow but you can also process batches.
It does not require any programming and provides a drag-and-drop UI. It is very easy to use and works well for easy-to-medium jobs, but it tends to have scalability problems for bigger jobs.
It runs outside of Hadoop but can trigger Spark jobs and connect to HDFS/S3.
Let's see some examples…
We have seen some of the most common orchestration frameworks. As you can see, most of them use DAGs as code so you can test locally, debug pipelines and test them properly before rolling new workflows to production. Consider all the features discussed in this article and choose the best tool for the job.
In short, if your requirement is just to orchestrate independent tasks that do not need to share data, and/or you have slow jobs, and/or you do not use Python, use Airflow or Oozie. For data flow applications that require data lineage and tracking, use NiFi for non-developers, or Dagster or Prefect for Python developers.
When possible, try to keep jobs simple and manage the data dependencies outside the orchestrator; this is very common in Spark, where you save the data to deep storage and do not pass it around. In this case, Airflow is a great option since it doesn't need to track the data flow and you can still pass small metadata like the location of the data using XCOM. For smaller, faster-moving, Python-based jobs or more dynamic data sets, you may want to track the data dependencies in the orchestrator and use tools such as Dagster.
[1] https://oozie.apache.org/docs/5.2.0/index.html
Predicate pushdown is a data processing technique taking user-defined filters and executing them while reading the data. Apache Spark already supported it for Apache Parquet and RDBMS. Starting from Apache Spark 3.1.1, you can also use them for Apache Avro, JSON and CSV formats!
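From the user's side nothing changes in how the filter is written. Here is a hedged PySpark sketch (the file path and column names are made up) where, on Spark 3.1.1+, the predicate can show up as PushedFilters in the physical plan for the CSV source:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("csv-filter-pushdown").getOrCreate()

df = (spark.read
      .option("header", True)
      .option("inferSchema", True)
      .csv("/data/events.csv"))           # hypothetical input file

filtered = df.filter(F.col("status") == "ERROR")

# Inspect the physical plan; with Spark 3.1.1+ the CSV source can push
# the status filter down to the scan (look for "PushedFilters").
filtered.explain(True)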
My name is Bartosz Konieczny and I am a data engineer working with software since 2009. I'm also an Apache Spark enthusiast, AWS and GCP certified cloud user, blogger, and speaker. I like to share, and you can discover it on my waitingforcode.com blog or at conferences like Spark+AI Summit 2019 or Data+AI Summit 2020.
Some time ago I found an interesting article describing 2 faces of synchronizing the data pipelines - orchestration and choreography. The article ended with an interesting proposal to use both of them as a hybrid solution. In this post, I will try to implement that idea.
The post is composed of 3 parts. The first one reminds the basics from the article quoted in the "Read also" section. The second part focuses more on the hybrid approach explanation whereas the last one shows a sample implementation.
Before we go to the main topic of this post, let's recall some basics. The first is the definition of orchestration. In the data pipelines, an orchestrator is a component responsible for managing the processes. It's the only one who knows which pipeline should be executed at a given moment and it's the single component able to trigger that execution.
On the other side, the choreography relies on a separate microservices architecture where every service knows what to do at a given moment of the day. The services don't communicate directly. Instead, they communicate indirectly with an event-based architecture. Every service then knows how to react to each of the events it subscribes to.
Both approaches have their pros and cons. The orchestrator provides a unified view of the system but it's less flexible than the choreography. On the other hand, the choreography uses loose coupling, and sometimes the shared-nothing pipelines can be more difficult to manage than the highly coupled ones, especially when the context becomes more and more complex with every newly added service and event.
The orchestration approach can be presented with the DAG abstraction used by Apache Airflow to define data processing workflows. For instance, you can have a data pipeline composed of steps integrating the data coming from our different partners and one final DAG to make some final computation on them:
To illustrate the choreography pattern we could use the AWS event-driven architecture to integrate the data of our partners and trigger the final aggregation job:
Maybe the diagrams don't show it clearly, but both approaches are slightly different. In the orchestration-based architecture, the orchestrator checks at regular intervals whether it can start the partner's processing. With choreography, every partner has its dedicated data pipeline and the logic to start it is managed internally by the data processor Lambda function.
The choreography has the advantage of being based on specific events. Therefore, when some input data is not present, there won't be any processing action on top of it. On the other side, having an overall view of such a system may be complicated, especially if you would like to know what part was executed and when. The hybrid solution discussed in the "Big Data Pipeline - Orchestration or Choreography" post overcomes that shortcoming.
The hybrid approach still uses choreography to execute the processing logic but enriches it with a central state manager. The manager is responsible for persisting the events in an event store and also for communicating with the orchestrator in order to provide a centralized way to visualize what happens with the data pipelines.
The theory seems quite simple but mixing both worlds in a concrete manner is more complex. It should still be possible though, especially with our natively event-driven example of AWS data processing and the Apache Airflow orchestrator. Let me show it to you in the following schema:
In the schema the Lambda function behaves only as an interceptor for the produced event. It doesn't contain the processing logic. Instead, it catches the event and sends it to some streaming broker. On Airflow's side, I added a simple consumer of the stream which, depending on the event read, may trigger a DAG. The triggered DAG has no schedule and therefore can only be started by an external trigger. The consumer passes all the configuration specific to the given execution as the -c CONF parameter.
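To give an idea of that consumer, here is a hedged sketch (the Kafka topic, event shape and use of the Airflow 1.x trigger_dag CLI are my assumptions, not details taken from the original setup):

import json
import subprocess

from kafka import KafkaConsumer  # kafka-python assumed as the stream client

# Hypothetical topic that the interceptor Lambda forwards partner events to.
consumer = KafkaConsumer(
    "partner-data-events",
    bootstrap_servers=["localhost:9092"],
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for message in consumer:
    event = message.value
    # Trigger the unscheduled DAG, passing the execution-specific configuration
    # through the -c CONF parameter (Airflow 1.x CLI syntax assumed).
    subprocess.run(
        ["airflow", "trigger_dag", "partner_processing",
         "-c", json.dumps({"partner": event.get("partner"),
                           "input_path": event.get("input_path")})],
        check=True,
    )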
The orchestration and choreography are quite opposite concepts. The former uses a single controller to manage job execution whereas the latter gives much more freedom to that execution. However, both can be mixed in order to mitigate their respective drawbacks: the orchestration provides visibility and better control whereas the choreography provides more reactive behavior. In this post, I showed how I would implement that mix with the help of AWS event-driven services and Apache Airflow for, respectively, the choreography and orchestration parts. During the next weeks I will try to implement such a hybrid solution and share my feedback on it.
The Apache Software Foundation’s latest top-level project, Airflow, a workflow automation and scheduling system for Big Data processing pipelines, is already in use at more than 200 organizations, including Adobe, Airbnb, Paypal, Square, Twitter and United Airlines.
“Apache Airflow has quickly become the de facto standard for workflow orchestration,” said Bolke de Bruin, vice president of Apache Airflow. “Airflow has gained adoption among developers and data scientists alike thanks to its focus on configuration as code.”
When workflows are defined as code, they become more maintainable, versionable, testable, and collaborative, according to the project’s GitHub page. Airflow provides smart scheduling, database and dependency management, error handling and logging. It touts command-line utilities for performing complex surgeries on DAGs and the user interface for providing visibility into pipelines running in production, making it easy to monitor progress and troubleshoot issues.
Maxime Beauchemin created Airflow in 2014 at Airbnb. It entered the ASF incubator in March 2016. It’s designed to be dynamic, extensible, lean and explicit, and scalable for processing pipelines of hundreds of petabytes.
With Airflow, users can create workflows as directed acyclic graphs (DAGs) to automate scripts to perform tasks. Though based in Python, it can execute programs in other languages as well. The Airflow scheduler executes tasks on an array of workers while following the specified dependencies.
DAG operators define individual tasks to be performed, though custom operators can be created.
The three main types of operators are:
In an introduction to the technology, Matt Davis, a senior software engineer at Clover Health, explains that it enables multisystem workflows to be executed in parallel across any number of workers. A single pipeline might contain bash, Python, and SQL operations. With dependencies specified between tasks, Airflow knows which ones it can run in parallel and which ones must run after others.
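As a hedged sketch of such a mixed pipeline (Airflow 2.x operator imports assumed; the task names and commands are invented), two bash extracts run in parallel and a Python transform waits for both:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator       # Airflow 2.x paths assumed
from airflow.operators.python import PythonOperator

def transform():
    print("transforming the extracted files")

with DAG(
    dag_id="mixed_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_a = BashOperator(task_id="extract_a", bash_command="echo extract A")
    extract_b = BashOperator(task_id="extract_b", bash_command="echo extract B")
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    # extract_a and extract_b do not depend on each other, so Airflow can run
    # them in parallel; transform runs only after both have finished.
    [extract_a, extract_b] >> transform_task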
Its ability to work in languages other than Python makes it easy to integrate with other systems including AWS S3, Docker, Apache Hadoop HDFS, Apache Hive, Kubernetes, MySQL, Postgres, Apache Zeppelin, and more.
“Airflow has been a part of all our Data pipelines created in past two years acting as the ringmaster and taming our Machine Learning and ETL Pipelines. It has helped us create a single view for our client’s entire data ecosystem. Airflow’s Data-aware scheduling and error-handling helped automate entire report-generation processes reliably without any human intervention,” said Kaxil Naik, data engineer at Data Reply, who pointed out that its configuration-as-a-code paradigm makes it easy for non-technical people to use without a steep learning curve.
However, Airflow is not a data-streaming solution such as Spark Streaming or Storm, the documentation notes. It is more comparable to Oozie, Azkaban, Pinball, or Luigi.
Workflows are expected to be mostly static or slowly changing. They should look similar from one run to the next — slightly more dynamic than a database structure.
It comes out of the box with an SQLite database that helps users get up and running quickly, providing a tour of the UI and command line utilities.
“At Qubole, not only are we a provider, but also a big consumer of Airflow as well,” said engineering manager Sumit Maheshwari. Qubole offers Airflow as a managed service.
The company’s “Insight and Recommendations” platform is built around Airflow. It processes billions of events each month from hundreds of enterprises and generates insights on Big Data systems such as Apache Hadoop, Apache Spark, and Presto.
“We are very impressed by the simplicity of Airflow and ease at which it can be integrated with other solutions like clouds, monitoring systems or various data sources.”
Cincinnati-based Astronomer built its platform on top of Airflow. In addition, Google launched Cloud Composer, a managed Airflow service, in beta last May. And Amazon has integrated its managed machine-learning-workflow service Sagemaker with Airflow.
Before you can install the chart, you need to configure the storage class settings for your cloud provider, such as AWS, GCP, or Azure. The handling of storage varies from cloud provider to cloud provider.
Create a new file called storage_values.yaml for the storage class settings.
To use an existing storage class (including the default one) set this value:
default_storage:
  existingStorageClassName: default or <name of storage class>
For each volume of each component (Zookeeper, Bookkeeper), you can override the default_storage setting by specifying a different existingStorageClassName. This allows you to match the optimum storage type to the volume, as in the sketch below.
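As a purely hypothetical sketch (the key nesting under each component and the fast-ssd class name are assumptions for illustration, not taken from the chart; confirm the exact structure against the chart's values file mentioned below), such an override might look roughly like this:

# Hypothetical layout only -- check the chart's values file for the real keys
default_storage:
  existingStorageClassName: default

bookkeeper:
  volumes:
    journal:
      # Assumed: give the BookKeeper journal a faster storage class than the
      # default used by the other volumes
      existingStorageClassName: fast-ssd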
If you have specific storage class requirements, for example fixed-IOPS disks in AWS, you can have the chart configure the storage classes for you. Here are examples for each cloud provider:
# For AWS
# default_storage:
#   provisioner: kubernetes.io/aws-ebs
#   type: gp2
#   fsType: ext4
#   extraParams:
#     iopsPerGB: "10"

# For GCP
# default_storage:
#   provisioner: kubernetes.io/gce-pd
#   type: pd-ssd
#   fsType: ext4
#   extraParams:
#     replication-type: none

# For Azure
# default_storage:
#   provisioner: kubernetes.io/azure-disk
#   fsType: ext4
#   type: managed-premium
#   extraParams:
#     storageaccounttype: Premium_LRS
#     kind: Managed
#     cachingmode: ReadOnly
See this values file for more details on the settings.
Once you have your storage settings in the values file, install the chart. First, create the namespace; in this example, we use pulsar.
kubectl create namespace pulsar
Then run this helm command:
helm install pulsar datastax-pulsar/pulsar --namespace pulsar --values storage_values.yaml --create-namespace
To avoid having to specify the pulsar namespace on each subsequent command, set the namespace context. Example:
kubectl config set-context $(kubectl config current-context) --namespace=pulsar
Ticketing for WordPress made simple
Easily embed livestreams, integrate with Zoom, and optimize your calendar for virtual events.
Manage ticket sales and email marketing for your online events, right from WordPress.
Built for your events
Our products make event management a snap, from promotion and ticket sales to registration and communication.
Fully flexible and totally customizable plugins that work seamlessly with your site.
Our event tools help you make the most of your events: Build your brand, bolster attendance, and connect with your audience.
The Events Calendar suite of tools is perfect for schools and universities, community groups, and civic organizations.
Event Professionals
Developers & Agencies
Small Businesses
Event Curation
Create the ultimate events marketplace where users can submit their own events and sell tickets.
All of our premium plugins and add-ons are backed by a knowledgeable support team.
Event Aggregator
Schedule automatic event imports from Meetup, Eventbrite, Google Calendar, iCalendar, and other sites.
Mobile Ticket Scanner
On event day access all of your guest lists, communicate last-minute details with Promoter, and use our mobile app to scan tickets at the door.
Highlighting our upcoming events is vital to our success. Our entire team can easily manage The Events Calendar, and it has become the most visited page on our site!
S. Jay Farrand, KCRW
Event Organiser offers the most comprehensive, and yet easy to use, event and calendar plug-in for WordPress.
Additional support & features, such as booking management, are available with Pro.
Complex event recurrence schedules, multi-day events and the ability to add or remove individual dates are all available in the free version.
Collect booking payments with Pro offline or via PayPal. Other available gateways include Stripe, Authorize.net and iDeal.
Tailor the booking form to your needs by adding your own fields via the form customiser.
Event Organiser is built with developers in mind. Templates can be easily replaced, and WordPress hooks allow you to modify the plug-in’s behaviour. The codebase is extensively documented with a function reference, hook reference and user documentation.
With extensions ranging from gateways to discount codes, and iCal sync to front-end event submissions, Event Organiser can meet the needs of almost any user. By purchasing a Pro Business or Developer license, you’ll get extensions included for free.
With over a million downloads and a 94% rating, you can have confidence in Event Organiser to power events and bookings on your site. It packs a host of features and yet maintains an impressively simple and user-friendly interface, making event management easy. But don’t take our word for it, check out what people are saying.
If you’ve ever tried to install a calendar plugin, you know that it’s not exactly the same as a fully functional events plugin or event management tool. Calendars display dates of events, while WordPress event plugins offer functions like ticketing, RSVPs, guest management, automated email notifications, booking forms and more.
That’s why it’s so important to think about what you plan on doing with your WordPress calendar.
Do you need to sell tickets for events? Would you like to display detailed information like images, maps, speakers, and payment methods? What about setting up irregular recurring events like a meeting you hold every three months?
In order to achieve some of the more advanced calendar features, a WordPress events plugin is required. What’s great is that you have many options to choose from and the best ones are affordable, powerful, and easy to understand.
Want to know which one you should pick? Check out our curated list of the best event plugins!
The Events Manager plugin offers an excellent free version, but you do have the option to upgrade to Events Manager Pro. The average user won’t need the Pro version, but it does have some great features for the low price of $75.
For instance, the upgrade version gives you premium support, a custom payment gateway, API, spam protection, coupons, discounts, customizable booking forms, and PayPal support.
So, registrations are possible with the Events Manager plugin, but you’ll have to pay the extra fee to start collecting payments with something like PayPal or Authorize.net.
The backend interface is simple enough for the average WordPress user and when displayed on the frontend, your events calendar can be used for selling tickets, showing a simple calendar, or displaying event details. I like that the plugin integrates with your iCal feed and Google Calendar. You can also utilize some of the widgets for showing locations, full calendars, or individual events.
As for showing your events on the calendar and being as detailed as possible, the Events Manager plugin gives you most features you need without paying any money. For instance, Google Maps can be embedded in the events pages. There is also a tool for creating custom event attributes, which means pretty much any type of description field is possible, such as a field for your event’s dress code.
Rating: 4.3 out of 5 stars (WordPress.org)
Active installations: 100,000+
Minimum version required: 5.3 or higher
WP Event Manager is one of the simpler, lightweight WordPress event management plugins. I see it working for those who want to keep their sites fast and not take up too much space or clutter the backend with too many features.
This plugin might be considered the new kid in the event management space, but it’s a popular plugin with great reviews and even great customer support.
As with many of the event plugins on this list, WP Event Manager offers a free, core plugin, along with the option to buy add-ons to ramp up your operation. Although the interface is sleek and simple, the free plugin’s feature list is quite impressive.
For instance, you receive everything from multilingual translations to frontend forms, widgets, and shortcodes for searchable event listings.
I’ve also noticed that the WP Event Manager developers have put quite a bit of effort into speed and user experience, with caching, responsive elements, AJAX-powered event listings, and more.
As for the premium add-ons, there’s a long list of them to choose from.
Rating: 4.7 out of 5 stars (WordPress.org)
Active installations: 8,000+
Minimum version required: 5.4 or higher
Event Organiser delivers a solid event management solution for the WordPress environment because it builds on WordPress’s default custom post types. Essentially, you install the plugin and your events are managed as their own post type, so they keep the familiar WordPress post format while gaining additional event modules.
The result is an intuitive user interface with the basic features required and good support for both one-time and recurring events. You’ll find several premium add-ons to buy along with this event management plugin. One of them is called Event Organiser Pro, and it offers a booking form customizer, a full management area, customizable emails, and various payment gateways.
You’ll also see some other add-ons that expand the functionality of your free or premium Event Organiser plugin, such as iCal sync, discount codes, payment gateways, and front-end event submissions.
The pricing for each add-on varies, but the more advanced and feature-packed they get, the higher the price. Some go for around $15, while others are listed at $50. I enjoy the frontend of this plugin since it provides a basic interface with colorization and interactivity.
You also have multiple formats you can choose from, such as lists or calendar configurations. Showing the calendars and events on your website is done with the help of shortcodes and widgets. So, the average WordPress user shouldn’t have any problems with getting up and running.
Rating: 4.7 out of 5 stars (WordPress.org)
Active installations: 40,000+
Minimum version required: Not provided.
If you’re looking for a WordPress events plugin that can help you manage your events, the All-in-One Event Calendar plugin might do the trick. It has a decent number of features right out of the box, with items like recurring events, filtering, and embedded Google Maps, all for free.
If you need more, you can then opt in to their hosted software solution, which starts at $14.99/month and provides additional features on top of the free plugin.
The free version still has its upsides, with the ability to import events from Facebook, social sharing, venue auto-saving, and recurring events. The plugin stands out in the sharing/importing realm since it offers tools for easily sharing and importing data from Google Calendar, Apple iCal, and MS Outlook.
Rating: 4.3 out of 5 stars (WordPress.org)
Active installations: 100,000+
Minimum version required: 5.4 or higher
Event Espresso has been a crowd favorite for some time now, and the developers have come out with the most recent Event Espresso 4 Decaf version. The “Decaf” version is completely free and filled with basic features like event ticketing and registration. What’s more, you can process PayPal payments without having to upgrade to one of the paid plans or buy an add-on.
The automated confirmation emails are interesting as well since you can send out event reminders and link that up to your event registrant list. Finally, another awesome part of the free plugin is the Android and Apple app support for scanning tickets and tracking who comes to your events.
As for the premium plans, which are required to buy add-ons, they start at $79.95 per year and go all the way up to $299.95 per year. You’ll receive over 60 features and dozens of add-ons depending on the plan you go with.
Rating: 4.3 out of 5 stars (WordPress.org)
Active installations: 2,000+
Minimum version required: 5.4 or higher
The Events Calendar plugin is made by the developers at Modern Tribe and it’s packed with features for making a highly professional calendar on your website, alongside a management area.
The whole point of the Events Calendar plugin is to get up and running within minutes. It has a rapid event creation tool for those organizations that want events listed on a website but don’t have all the time in the world. You can also save venues and organizers for later and present different calendar views for a sleek UI.
The core plugin works smoothly for simple calendars. It has a beautiful premium version for $89 per year. It’s not the cheapest option on this event plugins list, but you gain access to several great features like recurring events, shortcodes, and custom event attributes.
You can collect RSVPs for free with the core plugin and get payments with their free Event Tickets plugin. If you need more advanced ecommerce capabilities, you can also get the Event Tickets Plus as a premium add-on.
On the frontend, you can choose from a wide variety of layouts, from lists to regular calendars. The calendars are clean and modern, with support for maps and other essential event information. One of the main reasons I like the Events Calendar plugin is because it integrates with Eventbrite.
Rating: 4.4 out of 5 stars (WordPress.org)
Active installations: 800,000+
Minimum version required: 5.6 or higher
With the My Calendar plugin, your events get displayed on multiple websites through WordPress multisite or on however many pages you’d like on an individual website. This is a standard calendar plugin without much event management behind it.
However, you do have the option to upgrade with some of the premium extensions and free plugins. For example, the My Tickets plugin is free and it integrates with the My Calendar plugin. The combination turns your calendar into a ticket sales operation for people to purchase tickets, RSVP, print the tickets, or pick them up at a physical location.
The My Calendar Pro plugin goes for $49 per year and truly turns the core plugin into an event management portal. Let visitors submit their own events, accept payments through PayPal and Authorize.net, and import events from multiple sources.
The regular My Calendar plugin has a full calendar grid and list view, along with mini calendars and widgets for smaller displays. The location manager is there for when you have frequently used venues, and the email notification system sends you a message when a date has been reserved or scheduled.
All in all, the My Calendar plugin is fairly robust for a free option, yet I wouldn’t call it complete until you upgrade to the $49-per-year Pro version.
Rating: 4.4 out of 5 stars (WordPress.org)
Active installations: 49,000+
Minimum version required: 5.3 or higher
EventOn is a premium-only WordPress event management plugin. This plugin is quite the gem if you’re willing to pull the trigger and not spend time playing around with a free plugin. Besides, the price at the time of this article is $19.
At its most basic, the EventOn plugin is one of the most visually appealing event calendars on the market. The colorful, modern list and calendar layouts beat out pretty much all of the plugins on this list. Specifics such as times, locations, and event cancellations are all displayed right on the main calendar page.
There’s also a beautiful tile layout that looks somewhat like a portfolio, except with all of your events listed.
EventOn also serves as a decent event management program, using event organization tools, location management, an excellent search bar for your users, and several social sharing buttons. It’s not that robust on the event management side of things, but it’s definitely a good-looking plugin for getting your events on your website.
Rating: 4.4 out of 5 stars (CodeCanyon)
Sales: 49,000+
Minimum version required: Not disclosed
Another premium plugin sold on CodeCanyon is called Calendarize it! for WordPress. Once again, this one focuses more on building a great calendar, but at $30, with loads of add-ons available, Calendarize it! stands strong as one of the best WordPress events plugins in the game.
To start, many of the add-ons are completely free, so you don’t have to worry about spending a few extra bucks after you already download the original plugin. Some of the add-ons include an event countdown module, importer tool, and an accordion of upcoming events.
However, there are a few add-ons you have to pay for, even though they don’t cost much. The only problem I have is that the payment options add-on is one of the premium add-ons. So, you’re not going to have much functionality when it comes to accepting payments unless you shell out that extra cash.
That said, all of the premium and free add-ons are pretty spectacular, with social auto-publishing, RSVP tools, ratings and reviews, Eventbrite tickets, and even an Advent calendar.
Rating: 4.31 out of 5 stars (CodeCanyon)
Sales: 11,000+
Minimum version required: Not disclosed
The Modern Events Calendar plugin says quite a bit in its name since it’s a high-quality, professional, and modern take on your standard events management layout. You can choose from a wide variety of designs, making it an excellent solution for branding and fitting in with your website.
It’s also nice that the Modern Events Calendar provides an event repeating system since those recurring events are always easier to handle when you don’t have to think about them every time they come up.
What’s more, the developer promises that you can transfer over all of your events if you’re currently using a different WordPress event management plugin. For instance, if you had a full year of events in EventOn and decided it’s not for you, this plugin transfers all of those events over for you.
With multiple skins for making the calendar your own, along with some awesome features like Google Maps support, featured images, and custom skin colors, this events plugin is a good choice for many.
Rating: 4.3 out of 5 stars (WordPress.org)
Active installations: 50,000+
Minimum version required: 5.6 or higher
Amelia features a minimalistic and easy-to-use user interface. This plugin allows you to manage both appointments and events and accepts online payments.
Amelia has more than 4,000 active users, and it’s worth giving it a try if you’re looking for an all-in-one booking solution with no required add-ons and no additional costs.
Rating: 4.4 out of 5 stars (WordPress.org)
Active installations: 10,000+
Minimum version required: 5.6 or higher
Event Calendar WD has both free and premium versions. It’s a solid choice for those interested in sharing information about events and collecting RSVPs through a WordPress site. The plugin allows you to create a calendar that you manage through WordPress, with options for selling tickets and sending invitations.
This is a highly flexible events plugin as it provides complete control over the appearance of your calendar and how your customers are able to interact with it. Not only that, but the Event Calendar WD support team is available on a regular basis and rather knowledgeable about the product.
Rating: 4.6 out of 5 stars (WordPress.org)
Active installations: 20,000+
Minimum version required: 5.2 or higher
The Stachethemes Event Calendar plugin boasts a long list of impressive features for launching an events calendar on your website. To begin, the plugin features a sleek, modern calendar with a wide range of colors to choose from. Each month and day is organized in a list format by default, but you can adjust how your users view the calendar.
The plugin comes with a drag and drop builder to remove the need to mess with any code. You can also incorporate several event details such as pictures, locations, and times. The box grid view is one of our favorites because it looks somewhat like a modern portfolio.
It’s also good to know that the plugin is fully responsive, so you and your customers can schedule from smartphones and tablets. Along with a reasonable price on CodeCanyon and a list of payment options, you can’t go wrong with this event management plugin.
Rating: 4.51 out of 5 (CodeCanyon)
Sales: 4,000+
Minimum version required: Not disclosed
Tickera is yet another free WordPress event management plugin with support for payment collections and calendars. There’s also a premium version if you’d like to upgrade for more features, which requires a $70 one-time fee on top of the yearly plan starting at $49.
The primary purpose of Tickera is to sell tickets and distribute them amongst the buyers. You can use the plugin as a regular event calendar, but the majority of features revolve around selling. For instance, the plugin provides support for barcode readers and QR codes that you can bring up on mobile phones.
There’s even a Chrome app that speeds up the check-in process. A large collection of payment gateways is available for you to choose from, and you can even link the system to your WooCommerce store.
The ticket builder is a powerful tool, with templates that you can customize to fit your brand. Everything about the plugin is white-label, and you can even take a percentage of a ticket sale if you’re running more of an event marketplace. Along with multiple ticket types, discount codes, and ticket-selling addons, the Tickera plugin looks like a winner for online sales.
Rating: 4.7 out of 5 (WordPress.org)
Active installations: 8,000+
Minimum version required: 5.6 or higher
Venture Event Manager has premium and free plugins, offering a user-friendly solution for scheduling your events and adding recurring events to your calendar. Agencies and developers are more likely to utilize an event management plugin like this one because of its flexibility and code customization options.
Overall, the Venture Event Manager is great because it’s responsive, it has drag and drop builders, and all of your calendars include multiple views for customers. The event lists come in handy, while you can also feature event venues, categories, and filters.
We especially like the support for event widgets, since it allows for placing your event calendars on many areas of your website. It’s also worth mentioning that shortcodes are included. Finally, the plugin offers options for both non-ticketed and ticketed events. You don’t get any features for collecting payments through the plugin, but it does integrate with most ticketing platforms.
Rating: 5 out of 5 (WordPress.org)
Active installations: 50+
Minimum version required: 5.2 or higher
The market for WordPress events plugin solutions is quite vast. A quick search on the WordPress plugin library, Google, or CodeCanyon shows that many developers try their hands at event management. Hopefully, this list helps you narrow down your search; the right choice ultimately comes down to what your needs might be.
From ticketing options to different calendar formats, each of the event plugins has its own purpose for some businesses. What’s your preferred WordPress events plugin? Tell us in the comments below!
This project is a Scala application which uses Alpakka Cassandra 2.0, Akka Streams and Twitter4S (Scala Twitter Client) to pull new Tweets from Twitter for a given hashtag (or set of hashtags) using Twitter API v1.1 and write them into a local Cassandra database.
NOTE: The project will only save tweets which are not a retweet of another tweet and currently only saves the truncated version of tweets (<=140 chars).
docker run -p 9042:9042 --rm --name my-cassandra -d cassandra
docker ps -a
The output shows that the container is running and that local port 9042 is bound to port 9042 in the container (the default port for Cassandra).
docker exec -it my-cassandra cqlsh
CREATE KEYSPACE testkeyspace WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'} AND durable_writes = true;
CREATE table testkeyspace.testtable(id bigint PRIMARY KEY, excerpt text);
INSERT INTO testkeyspace.testtable(id, excerpt)
VALUES (37, 'appletest');
exit
There is an application.conf.example file found in /src/main/resources/. Copy this file into the same directory and rename it to application.conf:
mv /src/main/resources/application.conf.example /src/main/resources/application.conf
Then edit application.conf and add your Twitter API credentials:

twitter {
  consumer {
    key = "consumer-key-here"
    secret = "consumer-secret-here"
  }
  access {
    key = "access-key-here"
    secret = "access-token-here"
  }
}
Open /src/main/scala/com/alptwitter/AlpakkaTwitter.scala in an editor, for example:

vim /workspace/example-cassandra-alpakka-twitter/src/main/scala/com/alptwitter/AlpakkaTwitter.scala

Then change the following line to indicate which hashtags you want to watch for new tweets:

val trackedWords = Seq("#myHashtag")
If you want to track more than one hashtag, add more strings to the Seq, separating them with commas, as in the example below.
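For instance (the extra hashtags here are just placeholders):

// Track several hashtags at once; a tweet containing any of them will be picked up
val trackedWords = Seq("#myHashtag", "#anotherHashtag", "#aThirdHashtag")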
sbt run
As new tweets are posted which contain any of the hashtags listed in the trackedWords variable, a message will print in the console which says whether the tweet was a retweet or a unique tweet.
docker exec -it my-cassandra cqlsh
SELECT * FROM testkeyspace.testtable;