High performance distributed data processing engine
Word count using skale:
var sc = require('skale-engine').context(); sc.textFile('/path/...') .flatMap(line => line.split(' ')) .map(word => [word, 1]) .reduceByKey((a, b) => a + b, 0) .count().then(console.log);
- In-memory computing
- Controlled memory usage, spill to disk when necessary
- Fast multiple distributed streams
- realtime lazy compiling and running of execution graphs
- workers can connect through TCP or websockets
- simple standalone or fully distributed mode
- very fast, see benchmark
Docs & community
- Gitter for support and discussion
- skale mailing list for discussion about use and development
- Contributing guide
The best and quickest way to get started with skale-engine is to use skale to create, run and deploy skale applications.
$ sudo npm install -g skale # Install skale command once and for all $ skale create my_app # Create a new app, install skale-engine $ cd my_app $ skale run # Starts a local cluster if necessary and run
In the following, we bypass skale toolbelt, and use directly and only skale-engine. It's for you if you are rather more interested by the skale-engine architecture, details and internals.
To run the internal examples, clone the skale-engine repository and install the dependencies:
$ git clone git://github.com/skale-me/skale-engine.git --depth 1 $ cd skale-engine $ npm install
Then start a skale-engine server and workers on local host:
$ npm start
Then run whichever example you want
$ ./examples/wordcount.js /etc/hosts
Standalone local mode
The standalone mode is the default operating mode. All the processes,
master and workers are running on the local host, using the
NodeJS module. This mode is the simplest to operate: no
dependency, and no server nor cluster setup and management required.
It is used as any standard NodeJS package: simply
and that's it.
This mode is perfect for development, fast prototyping and tests on a single machine (i.e. a laptop). For unlimited scalibity, see distributed mode below.
The distributed mode allows to run the exact same code as in standalone over a network of multiple machines, thus achieving horizontal scalability.
The distributed mode involves two executables, which must be running prior to launch application programs:
skale-serverprocess, which is the access point where the
master(user application) and
workers(running slaves) connect to, either by direct TCP connections, or by websockets.
skale-workerprocess, which is a worker controller, running on each machine of the computing cluster, and connecting to the
skale-server. The worker controller will spawn worker processes on demand (typically one per CPU), each time a new job is submitted.
To run in distributed mode, the environment variable
be set to the
skale-server hostname or IP address. If unset, the
application will run in standalone mode. Multiple applications, each
with its own set of workers and master processes can run simultaneously
using the same server and worker controllers.
Although not mandatory, running an external HTTP server on worker
hosts, exposing skale temporary files, allows efficient peer-to-peer
shuffle data transfer between workers. If not available, this traffic
will go through the centralized
skale-server. Any external HTTP
server such as nginx, apache or busybox httpd, or even NodeJS
(although not the most efficient for static file serving) will do.
For further details, see command line help for
To run the test suite, first install the dependencies, then run
$ npm install $ npm test