Join the chat at Build Status Build Status

High performance distributed data processing engine

Skale-engine is a fast and general purpose distributed data processing system. It provides a high-level API in Javascript and an optimized parallel execution engine on top of NodeJS.

The following figure is a comparison against Spark for a logistic regression program:


Word count using skale:

var sc = require('skale-engine').context();

  .flatMap(line => line.split(' '))
  .map(word => [word, 1])
  .reduceByKey((a, b) => a + b, 0)


  • In-memory computing
  • Controlled memory usage, spill to disk when necessary
  • Fast multiple distributed streams
  • realtime lazy compiling and running of execution graphs
  • workers can connect through TCP or websockets
  • simple standalone or fully distributed mode
  • very fast, see benchmark

Docs & community


The best and quickest way to get started with skale-engine is to use skale to create, run and deploy skale applications.

$ sudo npm install -g skale  # Install skale command once and for all
$ skale create my_app        # Create a new app, install skale-engine
$ cd my_app
$ skale run                  # Starts a local cluster if necessary and run


In the following, we bypass skale toolbelt, and use directly and only skale-engine. It's for you if you are rather more interested by the skale-engine architecture, details and internals.

To run the internal examples, clone the skale-engine repository and install the dependencies:

$ git clone git:// --depth 1
$ cd skale-engine
$ npm install

Then start a skale-engine server and workers on local host:

$ npm start

Then run whichever example you want

$ ./examples/wordcount.js /etc/hosts

Standalone local mode

The standalone mode is the default operating mode. All the processes, master and workers are running on the local host, using the cluster core NodeJS module. This mode is the simplest to operate: no dependency, and no server nor cluster setup and management required. It is used as any standard NodeJS package: simply require('skale-engine'), and that's it.

This mode is perfect for development, fast prototyping and tests on a single machine (i.e. a laptop). For unlimited scalibity, see distributed mode below.

Distributed mode

The distributed mode allows to run the exact same code as in standalone over a network of multiple machines, thus achieving horizontal scalability.

The distributed mode involves two executables, which must be running prior to launch application programs:

  • a skale-server process, which is the access point where the master (user application) and workers (running slaves) connect to, either by direct TCP connections, or by websockets.
  • A skale-worker process, which is a worker controller, running on each machine of the computing cluster, and connecting to the skale-server. The worker controller will spawn worker processes on demand (typically one per CPU), each time a new job is submitted.

To run in distributed mode, the environment variable SKALE_HOST must be set to the skale-server hostname or IP address. If unset, the application will run in standalone mode. Multiple applications, each with its own set of workers and master processes can run simultaneously using the same server and worker controllers.

Although not mandatory, running an external HTTP server on worker hosts, exposing skale temporary files, allows efficient peer-to-peer shuffle data transfer between workers. If not available, this traffic will go through the centralized skale-server. Any external HTTP server such as nginx, apache or busybox httpd, or even NodeJS (although not the most efficient for static file serving) will do.

For further details, see command line help for skale-worker and skale-server.


To run the test suite, first install the dependencies, then run npm test:

$ npm install
$ npm test


The original authors of skale-engine are Cedric Artigue and Marc Vertes.

List of all contributors




加入在聊天 建立状态 建立状态 < / a>


Skale-engine是一种快速和通用的分布式数据处理 系统。它在Javascript中提供了一个高级API,并进行了优化 并行执行引擎在NodeJS之上。

下图是针对比较。> Spark 逻辑回归程序:


var sc = require('skale-engine').context();

sc.textFile('/path/…') .flatMap(line => line.split(' ')) .map(word => [word, 1]) .reduceByKey((a, b) => a + b, 0) .count().then(console.log);


  • 内存计算
  • 受控内存使用情况,必要时会溢出到磁盘
  • 快速的多个分布式流
  • 实时懒惰编译和运行执行图
  • 工作人员可以通过TCP或Websockets 进行连接
  • 简单独立完全分发模式
  • 速度非常快,请参见基准测试



开始使用skale引擎的最好和最快捷的方法是使用 skale 创建,运行 并部署Skale应用程序。

$ sudo npm install -g skale  # Install skale command once and for all
$ skale create my_app        # Create a new app, install skale-engine
$ cd my_app
$ skale run                  # Starts a local cluster if necessary and run


在下面,我们绕过 skale 工具带,直接使用的只有Skale引擎。如果你是这样的 对Skale引擎架构,细节和内部部件更感兴趣。

要运行内部示例,请克隆skale引擎存储库 安装依赖关系:

$ git clone git:// –depth 1
$ cd skale-engine
$ npm install


$ npm start


$ ./examples/wordcount.js /etc/hosts


独立模式是默认操作模式。所有的过程, 主人和工人正在本地主机上运行,​​使用 群集核心 NodeJS模块。这种模式是最简单的操作:否 依赖关系,无需服务器,也不需要群集设置和管理。 它被用作任何标准的NodeJS包:简单的 require(’skale-engine’) 就是这样。

这种模式非常适合开发,快速原型制作和测试 在单个机器(即膝上型计算机)上。对于无限的可见性,请参阅 以下分布式模式。


分布式模式允许与独立运行完全相同的代码 一个多机器的网络,从而实现了水平的可扩展性 分布式模式涉及两个可执行文件 在启动应用程序之前运行:

  • 一个 skale-server 进程,它是接入点 其中 master (用户应用程序)和 worker (正在运行 从站)通过直接TCP连接或通过连接 websockets。
  • 正在运行的一个作业控制器的 skale-worker 进程 在计算集群的每台机器上,并连接到 skale-server 。工人控制员将产生工人 按需处理(通常每个CPU一个),每次a 提交新工作。

要在分布式模式下运行,环境变量 SKALE_HOST 必须 设置为 skale-server 主机名或IP地址。如果没有,那么 应用程序将以独立模式运行。多个应用程序,每个 拥有自己的一套工人和主流程可以同时运行 使用相同的服务器和工作人员控制器。

虽然不是必需的,但是在worker上运行外部HTTP服务器 主机,公开Skale临时文件,允许高效的点对点 在工作人员之间洗牌数据传输。如果不可用,这个流量 将通过集中的 skale-server 。任何外部HTTP 服务器如nginx,apache或busybox httpd,甚至NodeJS (尽管不是最有效的静态文件服务)会做。

有关更多详细信息,请参阅 skale-worker skale-server 的命令行帮助。


要运行测试套件,首先安装依赖项,然后运行 npm test

$ npm install
$ npm test

原始的Skale引擎作者是 Cedric Artigue Marc Vertes

所有列表 贡献者