Vega-Lite Documentation: 03_transform
Data Transformation
There are two ways to transform data:
- `transform`: view-level transforms, specified for the whole view
- `encoding`: field transforms, specified inline on individual encoding channels
If both are provided, the `transform` array is executed first, followed by the inline encoding transforms, whose internal order is: bin -> timeUnit -> aggregate -> sort -> stack.
Transforms inside `transform` are executed in the order they appear in the array. The following transforms are supported:
Aggregate
Aggregation is specified with `aggregate` and `groupby`. `aggregate` supports three properties: `op` (the aggregate operation/function), `field` (the column to operate on), and `as` (the output column name). `groupby` defines the grouping and takes an array of column names.
Operations supported by `aggregate`:
Operation | Description |
---|---|
count | The total count of data objects in the group. Note: count operates directly on the input objects and returns the same value regardless of the provided field. |
valid | The count of field values that are not null, undefined or NaN. |
values | A list of data objects in the group. |
missing | The count of null or undefined field values. |
distinct | The count of distinct field values. |
sum | The sum of field values. |
product | The product of field values. |
mean | The mean (average) field value. |
average | The mean (average) field value. Identical to mean. |
variance | The sample variance of field values. |
variancep | The population variance of field values. |
stdev | The sample standard deviation of field values. |
stdevp | The population standard deviation of field values. |
stderr | The standard error of field values. |
median | The median field value. |
q1 | The lower quartile boundary of field values. |
q3 | The upper quartile boundary of field values. |
ci0 | The lower boundary of the bootstrapped 95% confidence interval of the mean field value. |
ci1 | The upper boundary of the bootstrapped 95% confidence interval of the mean field value. |
min | The minimum field value. |
max | The maximum field value. |
argmin | An input data object containing the minimum field value. Note: When used inside encoding, argmin must be specified as an object. (See below for an example.) |
argmax | An input data object containing the maximum field value. Note: When used inside encoding, argmax must be specified as an object. (See below for an example.) |
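As a minimal sketch of an aggregate transform (hypothetical stocks dataset with `symbol` and `price` columns), computing the mean price per symbol:

```json
{
  "data": {"url": "data/stocks.csv"},
  "transform": [
    {
      "aggregate": [{"op": "mean", "field": "price", "as": "mean_price"}],
      "groupby": ["symbol"]
    }
  ],
  "mark": "bar",
  "encoding": {
    "x": {"field": "symbol", "type": "nominal"},
    "y": {"field": "mean_price", "type": "quantitative"}
  }
}
```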
argmin/argmax
argmin/argmax look up the value of the current field in the row where another field reaches its minimum/maximum value:
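A minimal sketch of the `encoding` form (assuming a movies dataset with "US Gross", "Production Budget" and "Major Genre" fields): for each genre, show the production budget of the film with the highest US gross.

```json
{
  "data": {"url": "data/movies.json"},
  "mark": "bar",
  "encoding": {
    "x": {
      "aggregate": {"argmax": "US Gross"},
      "field": "Production Budget",
      "type": "quantitative"
    },
    "y": {"field": "Major Genre", "type": "nominal"}
  }
}
```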
The above uses the `encoding` form; the same result can also be written with `transform`:
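The same chart sketched with an aggregate transform (same assumed movies fields); the looked-up object is stored in `argmax_US_Gross` and its fields are accessed with bracket notation:

```json
{
  "data": {"url": "data/movies.json"},
  "transform": [
    {
      "aggregate": [{"op": "argmax", "field": "US Gross", "as": "argmax_US_Gross"}],
      "groupby": ["Major Genre"]
    }
  ],
  "mark": "bar",
  "encoding": {
    "x": {"field": "argmax_US_Gross['Production Budget']", "type": "quantitative"},
    "y": {"field": "Major Genre", "type": "nominal"}
  }
}
```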
Bin
Binning can be used to build histograms. It can be specified in both `encoding` and `transform`.
Using bin in encoding
// A Single View or a Layer Specification
{
...,
"mark/layer": ...,
"encoding": {
"x": {
"bin": ..., // bin
"field": ...,
"type": "quantitative",
...
},
"y": ...,
...
},
...
}
In `encoding`, binning is applied directly via the `bin` property, which accepts the following values:
- `true`: use the default binning parameters (the default value of `bin` is `false`)
- `"binned"`: indicates the data is already binned, so bin-start and bin-end can be mapped to x/y and x2/y2
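A minimal histogram sketch using `"bin": true` (hypothetical movies dataset with an "IMDB Rating" field):

```json
{
  "data": {"url": "data/movies.json"},
  "mark": "bar",
  "encoding": {
    "x": {"bin": true, "field": "IMDB Rating", "type": "quantitative"},
    "y": {"aggregate": "count", "type": "quantitative"}
  }
}
```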
Setting the `type` of the binned dimension to `ordinal` turns the bin ranges into tick labels.
`bin` can also be used to assign heatmap colors; the legend is created automatically:
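One possible sketch (not necessarily the original example): a binned 2D histogram rendered as a heatmap, where the per-bin count drives the color and its legend is generated automatically. The movies fields are placeholders.

```json
{
  "data": {"url": "data/movies.json"},
  "mark": "rect",
  "encoding": {
    "x": {"bin": {"maxbins": 60}, "field": "IMDB Rating", "type": "quantitative"},
    "y": {"bin": {"maxbins": 40}, "field": "Rotten Tomatoes Rating", "type": "quantitative"},
    "color": {"aggregate": "count", "type": "quantitative"}
  }
}
```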
Importing data that is already binned:
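A minimal sketch for pre-binned data (the inline values are made up), mapping bin start/end to x/x2 with `"bin": "binned"`:

```json
{
  "data": {
    "values": [
      {"bin_start": 8, "bin_end": 10, "count": 7},
      {"bin_start": 10, "bin_end": 12, "count": 29},
      {"bin_start": 12, "bin_end": 14, "count": 12}
    ]
  },
  "mark": "bar",
  "encoding": {
    "x": {"field": "bin_start", "bin": "binned", "type": "quantitative"},
    "x2": {"field": "bin_end"},
    "y": {"field": "count", "type": "quantitative"}
  }
}
```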
Using bin in transform
// Any View Specification
{
...
"transform": [
{"bin": ..., "field": ..., "as" ...} // Bin Transform
...
],
...
}
When `bin` is used in `transform`, it takes the three properties `bin`, `field`, and `as`.
Example: using bin to generate a new column:
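A minimal sketch (placeholder field names): bin a rating field into a new column, then encode it as already-binned data:

```json
{
  "data": {"url": "data/movies.json"},
  "transform": [
    {"bin": true, "field": "IMDB Rating", "as": "Rating_binned"}
  ],
  "mark": "bar",
  "encoding": {
    "x": {"field": "Rating_binned", "bin": "binned", "type": "quantitative", "title": "IMDB Rating (binned)"},
    "x2": {"field": "Rating_binned_end"},
    "y": {"aggregate": "count", "type": "quantitative"}
  }
}
```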
Bin parameters
Property | Type | Description |
---|---|---|
anchor | Number | A value in the binned domain at which to anchor the bins, shifting the bin boundaries if necessary to ensure that a boundary aligns with the anchor value. Default value: the minimum bin extent value |
base | Number | The number base to use for automatic bin determination (default is base 10). Default value: 10 |
divide | Number[] | Scale factors indicating allowable subdivisions. The default value is [5, 2], which indicates that for base 10 numbers (the default base), the method may consider dividing bin sizes by 5 and/or 2. For example, for an initial step size of 10, the method can check if bin sizes of 2 (= 10/5), 5 (= 10/2), or 1 (= 10/(5*2)) might also satisfy the given constraints. Default value: [5, 2] |
extent | Array | A two-element ([min, max]) array indicating the range of desired bin values. |
maxbins | Number | Maximum number of bins. Default value: 6 for row, column and shape channels; 10 for other channels |
minstep | Number | A minimum allowable step size (particularly useful for integer values). |
nice | Boolean | If true, attempts to make the bin boundaries use human-friendly boundaries, such as multiples of ten. Default value: true |
step | Number | An exact step size to use between bins. Note: If provided, options such as maxbins will be ignored. |
steps | Number[] | An array of allowable step sizes to choose from. |
Example: changing the maximum number of bins with `maxbins`:
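A minimal sketch passing a bin parameter object (placeholder field name):

```json
{
  "data": {"url": "data/movies.json"},
  "mark": "bar",
  "encoding": {
    "x": {"bin": {"maxbins": 10}, "field": "IMDB Rating", "type": "quantitative"},
    "y": {"aggregate": "count", "type": "quantitative"}
  }
}
```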
Sorting binned results
To sort the binned results, set `"type": "ordinal"` on the binned encoding.
Calculate
// Any View Specification
{
...
"transform": [
{"calculate": ..., "as" ...} // Calculate Transform
...
],
...
}
It takes two properties: `calculate` and `as`.
`calculate` accepts an expression over the columns of the input data, using `datum` to refer to the current row, e.g. 2*datum.col1+datum.col2. Many expressions are supported; they are not covered here, see the Vega expression documentation.
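A minimal sketch (col1/col2 are the placeholder column names from the expression above):

```json
{
  "data": {"values": [{"col1": 1, "col2": 10}, {"col1": 2, "col2": 20}, {"col1": 3, "col2": 15}]},
  "transform": [
    {"calculate": "2 * datum.col1 + datum.col2", "as": "combined"}
  ],
  "mark": "point",
  "encoding": {
    "x": {"field": "col1", "type": "quantitative"},
    "y": {"field": "combined", "type": "quantitative"}
  }
}
```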
Density
Performs kernel density estimation over a given field and generates a density curve.
// Any View Specification
{
...
"transform": [
{"density": ...} // Density Transform
...
],
...
}
Density parameters
Property | Type | Description |
---|---|---|
density | String | Required. The data field for which to perform density estimation. |
groupby | String[] | The data fields to group by. If not specified, a single group containing all data objects will be used. |
cumulative | Boolean | A boolean flag indicating whether to produce density estimates (false) or cumulative density estimates (true). Default value: false |
counts | Boolean | A boolean flag indicating if the output values should be probability estimates (false) or smoothed counts (true). Default value: false |
bandwidth | Number | The bandwidth (standard deviation) of the Gaussian kernel. If unspecified or set to zero, the bandwidth value is automatically estimated from the input data using Scott’s rule. |
extent | Number[] | A [min, max] domain from which to sample the distribution. If unspecified, the extent will be determined by the observed minimum and maximum values of the density value field. |
minsteps | Number | The minimum number of samples to take along the extent domain for plotting the density. Default value: 25 |
maxsteps | Number | The maximum number of samples to take along the extent domain for plotting the density. Default value: 200 |
steps | Number | The exact number of samples to take along the extent domain for plotting the density. If specified, overrides both minsteps and maxsteps to set an exact number of uniform samples. Potentially useful in conjunction with a fixed extent to ensure consistent sample points for stacked densities. |
as | String[] | The output fields for the sample value and corresponding density estimate. Default value: ["value", "density"] |
Example 1: a simple density plot:
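A minimal sketch (placeholder field name; the bandwidth value is arbitrary):

```json
{
  "data": {"url": "data/movies.json"},
  "transform": [
    {"density": "IMDB Rating", "bandwidth": 0.3}
  ],
  "mark": "area",
  "encoding": {
    "x": {"field": "value", "type": "quantitative", "title": "IMDB Rating"},
    "y": {"field": "density", "type": "quantitative"}
  }
}
```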
Example 2: a grouped, stacked density plot: `groupby` defines the groups, `extent` bounds the sampling range, and the grouping column is mapped to color in `encoding`.
Example 3: faceting: the grouping column is mapped to the y/row channel in `encoding` so that each group gets its own row.
Filter
Filters the data according to the given predicate:
// Any View Specification
{
...
"transform": [
{"filter": ...} // Filter Transform
...
],
...
}
`filter` accepts a Predicate, which can be:
- an expression string, e.g. {filter: "datum.b2 > 60"}
- a field predicate: equal, lt, lte, gt, gte, range, oneOf, valid; see the field predicate docs
- a selection predicate: the name of a selection, or a logical combination of selections; see the selection predicate docs
- a logical composition of the predicates above with and, or, not; see the logical composition docs
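A minimal sketch combining a field predicate with a logical composition (the cars fields are placeholders):

```json
{
  "data": {"url": "data/cars.json"},
  "transform": [
    {
      "filter": {
        "and": [
          {"field": "Cylinders", "range": [4, 6]},
          {"field": "Origin", "oneOf": ["USA", "Japan"]}
        ]
      }
    }
  ],
  "mark": "point",
  "encoding": {
    "x": {"field": "Horsepower", "type": "quantitative"},
    "y": {"field": "Miles_per_Gallon", "type": "quantitative"}
  }
}
```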
Flatten
Flattens array-valued fields into one row per element. Arrays are aligned element by element; if they differ in length, the shorter ones are padded with null.
// given the following table:
[
{"key": "alpha", "foo": [1, 2], "bar": ["A", "B"]},
{"key": "beta", "foo": [3, 4, 5], "bar": ["C", "D"]}
]
// applying flatten:
{"flatten": ["foo", "bar"]}
// produces the following table:
[
{"key": "alpha", "foo": 1, "bar": "A"},
{"key": "alpha", "foo": 2, "bar": "B"},
{"key": "beta", "foo": 3, "bar": "C"},
{"key": "beta", "foo": 4, "bar": "D"},
{"key": "beta", "foo": 5, "bar": null}
]
There is also a more advanced example (noted for later; it is a bit involved and I have not worked through it yet).
Fold
`fold` turns the specified columns (one or more) into key-value pairs, similar to reshaping a wide table into a long one.
// original table
[
{"country": "USA", "gold": 10, "silver": 20},
{"country": "Canada", "gold": 7, "silver": 26}
]
// fold these two columns
{"fold": ["gold", "silver"]}
// new table: both columns are now key-value pairs
[
{"key": "gold", "value": 10, "country": "USA", "gold": 10, "silver": 20},
{"key": "silver", "value": 20, "country": "USA", "gold": 10, "silver": 20},
{"key": "gold", "value": 7, "country": "Canada", "gold": 7, "silver": 26},
{"key": "silver", "value": 26, "country": "Canada", "gold": 7, "silver": 26}
]
Impute
This transform fills in (imputes) missing data. In my use cases I rarely need Vega-Lite for this kind of data processing, so I am skipping it for now.
Join Aggregate
Joins the new columns produced by an `aggregate` operation back onto the original rows.
Usage is similar to `aggregate`:
- `joinaggregate`: supports the op, field, and as properties
- `groupby`
An example: deviation from the mean, selecting movies rated more than 2.5 points above the average rating:
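A minimal sketch along those lines (assuming the usual movies fields):

```json
{
  "data": {"url": "data/movies.json"},
  "transform": [
    {
      "joinaggregate": [{"op": "mean", "field": "IMDB Rating", "as": "AverageRating"}]
    },
    {"filter": "(datum['IMDB Rating'] - datum.AverageRating) > 2.5"}
  ],
  "mark": "bar",
  "encoding": {
    "x": {"field": "IMDB Rating", "type": "quantitative"},
    "y": {"field": "Title", "type": "nominal"}
  }
}
```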
Alternatively, instead of filtering, the movies above/below the mean can be highlighted.
Loess
Locally weighted regression (loess) for smoothing; generates a trend line.
Property | Type | Description |
---|---|---|
loess | String | The data field to smooth (dependent variable) |
on | String | The independent variable field |
groupby | String[] | The fields to group by |
bandwidth | Number | A bandwidth in the range [0, 1] that controls the amount of smoothing |
as | String[] | Output field names |
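A minimal sketch layering raw points with a loess trend line (assuming the usual movies fields; the bandwidth is arbitrary):

```json
{
  "data": {"url": "data/movies.json"},
  "layer": [
    {
      "mark": "point",
      "encoding": {
        "x": {"field": "Rotten Tomatoes Rating", "type": "quantitative"},
        "y": {"field": "IMDB Rating", "type": "quantitative"}
      }
    },
    {
      "transform": [
        {"loess": "IMDB Rating", "on": "Rotten Tomatoes Rating", "bandwidth": 0.3}
      ],
      "mark": {"type": "line", "color": "firebrick"},
      "encoding": {
        "x": {"field": "Rotten Tomatoes Rating", "type": "quantitative"},
        "y": {"field": "IMDB Rating", "type": "quantitative"}
      }
    }
  ]
}
```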
Lookup
Looks up, in a secondary data source, the object matching the specified field of the primary data source.
Property | Type | Description |
---|---|---|
lookup | String | The key field in the primary data |
from | LookupData/LookupSelection | The secondary data source |
as | String[] | Output field names |
default | Any | The default value assigned when no match is found; defaults to null |
Secondary data properties:
Property | Type | Description |
---|---|---|
data | Data | The data source |
key | String | The key field in the secondary data |
fields | String[] | The fields to retrieve; if not specified, the entire matching object is used |
An example:
The input data tables are as follows.
lookup_groups.csv:
group | person |
---|---|
1 | Alan |
1 | George |
1 | Fred |
2 | Steve |
2 | Nick |
2 | Will |
3 | Cole |
3 | Rick |
3 | Tom |
lookup_people.csv:
name | age | height |
---|---|---|
Alan | 25 | 180 |
George | 32 | 174 |
Fred | 39 | 182 |
Steve | 42 | 161 |
Nick | 23 | 180 |
Will | 21 | 168 |
Cole | 51 | 160 |
Rick | 63 | 181 |
Tom | 54 | 179 |
Merge the two tables by person name:
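A minimal sketch (assuming the two CSV files above are reachable at the given URLs; the aggregation in the encoding is arbitrary):

```json
{
  "data": {"url": "data/lookup_groups.csv"},
  "transform": [
    {
      "lookup": "person",
      "from": {
        "data": {"url": "data/lookup_people.csv"},
        "key": "name",
        "fields": ["age", "height"]
      }
    }
  ],
  "mark": "bar",
  "encoding": {
    "x": {"field": "group", "type": "ordinal"},
    "y": {"aggregate": "mean", "field": "age", "type": "quantitative"}
  }
}
```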
Advanced usage:
lookup also accepts the name (`param`) of a selection as its data source. The following example uses lookup for a slick interactive index chart:
{
"data": {
"url": "/assets/data/stocks.csv",
"format": {"parse": {"date": "date"}}
},
"width": 650,
"height": 300,
"layer": [
{
//define the interaction rule here
"params": [{
"name": "index",
"value": [{"x": {"year": 2005, "month": 1, "date": 1}}],
"select": {
"type": "point",
"encodings": ["x"],
"on": "mouseover",
"nearest": true
}
}],
"mark": "point",
"encoding": {
"x": {"field": "date", "type": "temporal", "axis": null},
"opacity": {"value": 0}
}
},
{
"transform": [
{
"lookup": "symbol",
//here, from is set to the param of the interaction rule
"from": {"param": "index", "key": "symbol"}
},
{
//the resulting table joins the index selection onto the original stocks table, so the index values can be used in calculations together with the original values
"calculate": "datum.index && datum.index.price > 0 ? (datum.price - datum.index.price)/datum.index.price : 0",
"as": "indexed_price"
}
],
"mark": "line",
"encoding": {
"x": {"field": "date", "type": "temporal", "axis": null},
"y": {
"field": "indexed_price", "type": "quantitative",
"axis": {"format": "%"}
},
"color": {"field": "symbol", "type": "nominal"}
}
},
{
"transform": [{"filter": {"param": "index"}}],
"encoding": {
"x": {"field": "date", "type": "temporal", "axis": null},
"color": {"value": "firebrick"}
},
"layer": [
{"mark": {"type": "rule", "strokeWidth": 0.5}},
{
"mark": {"type": "text", "align": "center", "fontWeight": 100},
"encoding": {
"text": {"field": "date", "timeUnit": "yearmonth"},
"y": {"value": 310}
}
}
]
}
]
}
Pivot
Reshapes a long table into a wide one; the inverse of fold.
Property | Type | Description |
---|---|---|
pivot | String | The field to pivot on; its unique values become the column names of the new table |
value | String | The field whose (aggregated) values populate the new pivoted columns |
groupby | String[] | The fields to group by |
limit | Number | The maximum number of pivoted columns to generate; default 0, i.e. no limit |
op | String | The aggregate operation applied to the grouped value field; default sum |
Example:
[
{"country": "Norway", "type": "gold", "count": 14},
{"country": "Norway", "type": "silver", "count": 14},
{"country": "Norway", "type": "bronze", "count": 11},
{"country": "Germany", "type": "gold", "count": 14},
{"country": "Germany", "type": "silver", "count": 10},
{"country": "Germany", "type": "bronze", "count": 7},
{"country": "Canada", "type": "gold", "count": 11},
{"country": "Canada", "type": "silver", "count": 8},
{"country": "Canada", "type": "bronze", "count": 10}
]
// apply the following pivot:
{
"pivot": "type",
"groupby": ["country"],
"value": "count"
}
// result:
[
{"country": "Norway", "gold": 14, "silver": 14, "bronze": 11},
{"country": "Germany", "gold": 14, "silver": 10, "bronze": 7},
{"country": "Canada", "gold": 11, "silver": 8, "bronze": 10}
]
Quantile
Computes quantiles.
Property | Type | Description |
---|---|---|
quantile | String | The data field to process |
groupby | String[] | The fields to group by |
probs | Number[] | An array of probabilities in (0, 1) at which to compute quantiles; if not provided, the step value is used |
step | Number | The step size between probabilities (default 0.01); only used when probs is not given |
as | String[] | Output field names, default ["prob", "value"] |
{"quantile": "measure", "probs": [0.25, 0.5, 0.75]}
// output
[
{prob: 0.25, value: 1.34},
{prob: 0.5, value: 5.82},
{prob: 0.75, value: 9.31}
];
Example: generating Q-Q plots:
{
"data": {
"url": "/assets/data/normal-2d.json"
},
"transform": [
{
"quantile": "u",
"step": 0.01,
"as": [
"p",
"v"
]
},
{
"calculate": "quantileUniform(datum.p)",
"as": "unif"
},
{
"calculate": "quantileNormal(datum.p)",
"as": "norm"
}
],
"hconcat": [
{
"mark": "point",
"encoding": {
"x": {
"field": "unif",
"type": "quantitative"
},
"y": {
"field": "v",
"type": "quantitative"
}
}
},
{
"mark": "point",
"encoding": {
"x": {
"field": "norm",
"type": "quantitative"
},
"y": {
"field": "v",
"type": "quantitative"
}
}
}
]
}
Regression
Supported regression models:
- linear: \( y = a + bx \)
- log: logarithmic, \( y = a + b \cdot \log(x) \)
- exp: exponential, \( y = a \cdot e^{bx} \)
- pow: power, \( y = a \cdot x^b \)
- quad: quadratic, \( y = a + bx + cx^2 \)
- poly: polynomial, \( y = a + bx + \dots + kx^{order} \)
Property | Type | Description |
---|---|---|
regression | String | The dependent variable field |
on | String | The independent variable field |
groupby | String[] | The fields to group by |
method | String | One of the models above; default linear |
order | Number | The polynomial order for the poly method; default 3 |
extent | Number[] | The [min, max] domain over which to draw the trend line |
params | Boolean | If true, return the model parameters (a coef array and an rSquared value) instead of the points of the trend line |
as | String[] | Output field names; defaults to the names of the x (on) and y (regression) fields |
An example:
{
"data": {
"url": "/assets/data/movies.json"
},
"layer": [
{
"mark": {
"type": "point",
"filled": true
},
"encoding": {
"x": {
"field": "Rotten Tomatoes Rating",
"type": "quantitative"
},
"y": {
"field": "IMDB Rating",
"type": "quantitative"
}
}
},
{
"mark": {
"type": "line",
"color": "firebrick"
},
"transform": [
{
"regression": "IMDB Rating",
"on": "Rotten Tomatoes Rating"
}
],
"encoding": {
"x": {
"field": "Rotten Tomatoes Rating",
"type": "quantitative"
},
"y": {
"field": "IMDB Rating",
"type": "quantitative"
}
}
},
{
"transform": [
{
"regression": "IMDB Rating",
"on": "Rotten Tomatoes Rating",
"params": true
},
{"calculate": "'R²: '+format(datum.rSquared, '.2f')", "as": "R2"}
],
"mark": {
"type": "text",
"color": "firebrick",
"x": "width",
"align": "right",
"y": -5
},
"encoding": {
"text": {"type": "nominal", "field": "R2"}
}
}
]
}
Sample
Random sampling. It takes a single parameter, `sample`, which sets the sample size, e.g. {"sample": 500}:
// Any View Specification
{
...
"transform": [
{"sample": 500} // Sample Transform
...
],
...
}
Stack
`stack` stacks marks (e.g. stacked bar charts) and can be used in either `encoding` or `transform`.
It only applies to continuous x, y, theta, and radius channels. The stack offset can be:
- zero or true: stacking from a zero baseline (like position="stack" in ggplot2), the basic stacked bar chart
- normalize: normalized stacking (like position="fill" in ggplot2), also used for pie charts
- center: stacking shifted towards the center, used for streamgraphs
- null or false: the groups simply overlap each other
There are many examples; they are not reproduced here, see the official documentation.
An advanced example: instead of stack, use calculate to flip the sign of one group and build a diverging (bidirectional) stacked chart:
{
"data": { "url": "/assets/data/population.json"},
"transform": [
{"filter": "datum.year == 2000"},
{"calculate": "datum.sex == 2 ? 'Female' : 'Male'", "as": "gender"},
{"calculate": "datum.sex == 2 ? -datum.people : datum.people", "as": "signed_people"}
],
"width": 500,
"height": 300,
"mark": "bar",
"encoding": {
"y": {
"field": "age",
"axis": null, "sort": "descending"
},
"x": {
"aggregate": "sum", "field": "signed_people",
"title": "population",
"axis": {"format": "s"}
},
"color": {
"field": "gender",
"scale": {"range": ["#675193", "#ca8861"]},
"legend": {"orient": "top", "title": null}
}
},
"config": {
"view": {"stroke": null},
"axis": {"grid": false}
}
}
Another advanced usage: when stacking line marks, the stack offset has to be declared explicitly:
{
"data": {"url": "/assets/data/population.json"},
"transform": [
{"filter": "datum.year == 2000"},
{"calculate": "datum.sex == 2 ? 'Female' : 'Male'", "as": "gender"}
],
"layer": [
{
"mark": {"opacity": 0.7, "type": "area"},
"encoding": {
"y": {"aggregate": "sum", "field": "people", "type": "quantitative"},
"x": {"field": "age", "type": "nominal"},
"color": {
"field": "gender",
"scale": {"range": ["#675193", "#ca8861"]},
"type": "nominal"
},
"opacity": {"value": 0.7}
}
},
{
"mark": {"type": "line"},
"encoding": {
"y": {
"aggregate": "sum",
"field": "people",
"type": "quantitative",
"stack": "zero"
},
"x": {"field": "age", "type": "nominal"},
"color": {
"field": "gender",
"scale": {"range": ["#675193", "#ca8861"]},
"type": "nominal"
},
"opacity": {"value": 0.7}
}
}
]
}
When `stack` is used in `transform`, it supports the following properties: stack, groupby, offset, sort, as.
Here, sort can be used to order the stacked layers, similar to ordering by factor levels in ggplot2.
There are two further advanced usages (custom offsets and mosaic plots) that I rarely need; their code is not included here.
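A minimal sketch of stack inside transform, reusing the population.json fields from the examples above (the output names v0/v1 are arbitrary):

```json
{
  "data": {"url": "/assets/data/population.json"},
  "transform": [
    {"filter": "datum.year == 2000"},
    {"calculate": "datum.sex == 2 ? 'Female' : 'Male'", "as": "gender"},
    {
      "aggregate": [{"op": "sum", "field": "people", "as": "total"}],
      "groupby": ["age", "gender"]
    },
    {
      "stack": "total",
      "groupby": ["age"],
      "offset": "zero",
      "sort": [{"field": "gender", "order": "ascending"}],
      "as": ["v0", "v1"]
    }
  ],
  "mark": "bar",
  "encoding": {
    "x": {"field": "age", "type": "ordinal"},
    "y": {"field": "v0", "type": "quantitative", "title": "population"},
    "y2": {"field": "v1"},
    "color": {"field": "gender", "type": "nominal"}
  }
}
```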
Time Unit
I rarely work with time-series data, so I am skipping this section for now. Original docs: here
Window
Performs calculations (e.g. ranking, lead/lag, aggregates) over sorted groups of data objects and writes the results back onto the input data stream.
Transform Parameters
Property | Type | Description |
---|---|---|
sort | Compare | Defines how the data are ordered before the window is applied |
groupby | Field[] | The fields to partition (group) by |
ops | String[] | The window/aggregate operations to apply, e.g. rank, lead, sum; see the window operation reference below |
fields | Field[] | The fields to compute over; this array is aligned with ops, as and params |
params | Array | Parameter values for the window functions, aligned with ops |
as | String[] | Output field names for the ops; if not specified, names are generated automatically from the operations |
frame | Number[] | A two-element array configuring the sliding window; [-5, 5] means the window covers the current object plus the five objects before and after it. Default [null, 0], i.e. the current object and all preceding objects (null means unbounded) |
ignorePeers | Boolean | Whether the sliding window ignores peer values (peers are objects that tie under the sort order); default false |
Window operation reference
Valid operations inside window include all aggregate operations plus the following:
Operation | Parameter | Description |
---|---|---|
row_number | None | Assigns row numbers starting from 1 |
rank | None | Assigns ranks starting from 1; ties share a rank and later ranks skip ahead, e.g. 1,1,3,3,5 |
dense_rank | None | Ranks starting from 1; ties share a rank and later ranks do not skip, e.g. 1,1,2,2,3 |
percent_rank | None | Assigns a percentage rank, computed as \((rank - 1)/(groupSize - 1)\) |
cume_dist | None | Assigns a cumulative distribution value between 0 and 1 |
ntile | Number | Assigns a quantile bucket; the parameter is the number of buckets (e.g. 100 for percentiles, 5 for quintiles) |
lag | Number | The value at the given offset before the current object, or null if it does not exist; the default offset is 1 |
lead | Number | The value at the given offset after the current object |
first_value | None | The first value in the current sliding window |
last_value | None | The last value in the current sliding window |
nth_value | Number | The nth value in the current sliding window |
prev_value | None | The nearest preceding non-missing value in the sorted data (including the current value) |
next_value | None | The nearest following non-missing value in the sorted data (including the current value) |
Example
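A minimal sketch (assuming the usual movies fields): rank films by rating and keep the top 5:

```json
{
  "data": {"url": "data/movies.json"},
  "transform": [
    {
      "window": [{"op": "rank", "as": "rank"}],
      "sort": [{"field": "IMDB Rating", "order": "descending"}]
    },
    {"filter": "datum.rank <= 5"}
  ],
  "mark": "bar",
  "encoding": {
    "x": {"field": "IMDB Rating", "type": "quantitative"},
    "y": {"field": "Title", "type": "nominal", "sort": "-x"}
  }
}
```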
Wordcloud
Example
Transform parameters
Property | Type | Description |
---|---|---|
font | String/Expr | Font family |
fontStyle | String/Expr | Font style |
fontWeight | String/Expr | Font weight |
fontSize | Number/Expr | Font size |
fontSizeRange | Number[] | Font size range; if a range is given and fontSize is not, sizes are scaled within the range on a square-root scale |
padding | Number/Expr | Padding around each word |
rotate | Number/Expr | Rotation angle in degrees |
text | Field | The text field |
size | Number[] | Layout size, [width, height] |
spiral | String | Layout method, archimedean (default) or rectangular |
as | String[] | Output fields, default ["x", "y", "font", "fontSize", "fontStyle", "fontWeight", "angle"] |
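For reference, a rough sketch of the corresponding transform entry inside a Vega (not Vega-Lite) spec, assuming an upstream data table with `text` and `count` fields; treat the exact parameter shapes as an assumption rather than a verified spec:

```json
{
  "type": "wordcloud",
  "size": [800, 400],
  "text": {"field": "text"},
  "font": "Helvetica Neue, Arial",
  "fontSize": {"field": "datum.count"},
  "fontSizeRange": [12, 56],
  "rotate": 0,
  "padding": 2
}
```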