Vega-Lite Documentation: 03_transform
Data Transformation
There are two ways to transform data:
- `transform`: view-level transforms, specified for the whole view
- `encoding`: field transforms, specified inline on individual encoding channels
If both are provided, the `transform` array is executed first, followed by the inline encoding transforms, whose internal order is: bin -> timeUnit -> aggregate -> sort -> stack.
Transforms inside `transform` are executed in the order they appear in the array. The following transforms are supported:
Aggregate
Aggregation is specified with `aggregate` and `groupby`. `aggregate` supports three properties: `op` (the aggregate operation/function), `field` (the column to operate on), and `as` (the output column name). `groupby` defines the grouping and takes an array of column names.
Operations supported by `aggregate`:
Operation | Description |
---|---|
count | The total count of data objects in the group. Note: count operates directly on the input objects and returns the same value regardless of the provided field. |
valid | The count of field values that are not null, undefined or NaN. |
values | A list of data objects in the group. |
missing | The count of null or undefined field values. |
distinct | The count of distinct field values. |
sum | The sum of field values. |
product | The product of field values. |
mean | The mean (average) field value. |
average | The mean (average) field value. Identical to mean. |
variance | The sample variance of field values. |
variancep | The population variance of field values. |
stdev | The sample standard deviation of field values. |
stdevp | The population standard deviation of field values. |
stderr | The standard error of field values. |
median | The median field value. |
q1 | The lower quartile boundary of field values. |
q3 | The upper quartile boundary of field values. |
ci0 | The lower boundary of the bootstrapped 95% confidence interval of the mean field value. |
ci1 | The upper boundary of the bootstrapped 95% confidence interval of the mean field value. |
min | The minimum field value. |
max | The maximum field value. |
argmin | An input data object containing the minimum field value. Note: When used inside encoding, argmin must be specified as an object. (See below for an example.) |
argmax | An input data object containing the maximum field value. Note: When used inside encoding, argmax must be specified as an object. (See below for an example.) |
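As a minimal sketch of an aggregate transform (hypothetical stocks dataset with `symbol` and `price` columns), computing the mean price per symbol:

```json
{
  "data": {"url": "data/stocks.csv"},
  "transform": [
    {
      "aggregate": [{"op": "mean", "field": "price", "as": "mean_price"}],
      "groupby": ["symbol"]
    }
  ],
  "mark": "bar",
  "encoding": {
    "x": {"field": "symbol", "type": "nominal"},
    "y": {"field": "mean_price", "type": "quantitative"}
  }
}
```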
argmin/argmax
argmin/argmax look up the value of the current field in the row where another field reaches its minimum/maximum value:
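A minimal sketch of the `encoding` form (assuming a movies dataset with "US Gross", "Production Budget" and "Major Genre" fields): for each genre, show the production budget of the film with the highest US gross.

```json
{
  "data": {"url": "data/movies.json"},
  "mark": "bar",
  "encoding": {
    "x": {
      "aggregate": {"argmax": "US Gross"},
      "field": "Production Budget",
      "type": "quantitative"
    },
    "y": {"field": "Major Genre", "type": "nominal"}
  }
}
```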
The above uses the `encoding` form; the same result can also be written with `transform`:
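The same chart sketched with an aggregate transform (same assumed movies fields); the looked-up object is stored in `argmax_US_Gross` and its fields are accessed with bracket notation:

```json
{
  "data": {"url": "data/movies.json"},
  "transform": [
    {
      "aggregate": [{"op": "argmax", "field": "US Gross", "as": "argmax_US_Gross"}],
      "groupby": ["Major Genre"]
    }
  ],
  "mark": "bar",
  "encoding": {
    "x": {"field": "argmax_US_Gross['Production Budget']", "type": "quantitative"},
    "y": {"field": "Major Genre", "type": "nominal"}
  }
}
```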
Bin
Binning can be used to build histograms. It can be specified in both `encoding` and `transform`.
Using bin in encoding
// A Single View or a Layer Specification
{
...,
"mark/layer": ...,
"encoding": {
"x": {
"bin": ..., // bin
"field": ...,
"type": "quantitative",
...
},
"y": ...,
...
},
...
}
In `encoding`, binning is applied directly via the `bin` property, which accepts the following values:
- `true`: use the default binning parameters (the default value of `bin` is `false`)
- `"binned"`: indicates the data is already binned, so bin-start and bin-end can be mapped to x/y and x2/y2
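A minimal histogram sketch using `"bin": true` (hypothetical movies dataset with an "IMDB Rating" field):

```json
{
  "data": {"url": "data/movies.json"},
  "mark": "bar",
  "encoding": {
    "x": {"bin": true, "field": "IMDB Rating", "type": "quantitative"},
    "y": {"aggregate": "count", "type": "quantitative"}
  }
}
```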
Setting the `type` of the binned dimension to `ordinal` turns the bin ranges into tick labels.
`bin` can also be used to assign heatmap colors; the legend is created automatically:
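One possible sketch (not necessarily the original example): a binned 2D histogram rendered as a heatmap, where the per-bin count drives the color and its legend is generated automatically. The movies fields are placeholders.

```json
{
  "data": {"url": "data/movies.json"},
  "mark": "rect",
  "encoding": {
    "x": {"bin": {"maxbins": 60}, "field": "IMDB Rating", "type": "quantitative"},
    "y": {"bin": {"maxbins": 40}, "field": "Rotten Tomatoes Rating", "type": "quantitative"},
    "color": {"aggregate": "count", "type": "quantitative"}
  }
}
```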
Importing data that is already binned:
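A minimal sketch for pre-binned data (the inline values are made up), mapping bin start/end to x/x2 with `"bin": "binned"`:

```json
{
  "data": {
    "values": [
      {"bin_start": 8, "bin_end": 10, "count": 7},
      {"bin_start": 10, "bin_end": 12, "count": 29},
      {"bin_start": 12, "bin_end": 14, "count": 12}
    ]
  },
  "mark": "bar",
  "encoding": {
    "x": {"field": "bin_start", "bin": "binned", "type": "quantitative"},
    "x2": {"field": "bin_end"},
    "y": {"field": "count", "type": "quantitative"}
  }
}
```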
Using bin in transform
// Any View Specification
{
...
"transform": [
{"bin": ..., "field": ..., "as" ...} // Bin Transform
...
],
...
}
When `bin` is used in `transform`, it takes the three properties `bin`, `field`, and `as`.
Example: using bin to generate a new column:
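A minimal sketch (placeholder field names): bin a rating field into a new column, then encode it as already-binned data:

```json
{
  "data": {"url": "data/movies.json"},
  "transform": [
    {"bin": true, "field": "IMDB Rating", "as": "Rating_binned"}
  ],
  "mark": "bar",
  "encoding": {
    "x": {"field": "Rating_binned", "bin": "binned", "type": "quantitative", "title": "IMDB Rating (binned)"},
    "x2": {"field": "Rating_binned_end"},
    "y": {"aggregate": "count", "type": "quantitative"}
  }
}
```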
Bin parameters
Property | Type | Description |
---|---|---|
anchor | Number | A value in the binned domain at which to anchor the bins, shifting the bin boundaries if necessary to ensure that a boundary aligns with the anchor value. Default value: the minimum bin extent value |
base | Number | The number base to use for automatic bin determination (default is base 10). Default value: 10 |
divide | Number[] | Scale factors indicating allowable subdivisions. The default value is [5, 2], which indicates that for base 10 numbers (the default base), the method may consider dividing bin sizes by 5 and/or 2. For example, for an initial step size of 10, the method can check if bin sizes of 2 (= 10/5), 5 (= 10/2), or 1 (= 10/(5*2)) might also satisfy the given constraints. Default value: [5, 2] |
extent | Array | A two-element ([min, max]) array indicating the range of desired bin values. |
maxbins | Number | Maximum number of bins. Default value: 6 for row, column and shape channels; 10 for other channels |
minstep | Number | A minimum allowable step size (particularly useful for integer values). |
nice | Boolean | If true, attempts to make the bin boundaries use human-friendly boundaries, such as multiples of ten. Default value: true |
step | Number | An exact step size to use between bins. Note: If provided, options such as maxbins will be ignored. |
steps | Number[] | An array of allowable step sizes to choose from. |
Example: changing the maximum number of bins with `maxbins`:
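A minimal sketch passing a bin parameter object (placeholder field name):

```json
{
  "data": {"url": "data/movies.json"},
  "mark": "bar",
  "encoding": {
    "x": {"bin": {"maxbins": 10}, "field": "IMDB Rating", "type": "quantitative"},
    "y": {"aggregate": "count", "type": "quantitative"}
  }
}
```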
Sorting binned results
To sort the binned results, set `"type": "ordinal"` on the binned encoding.
Calculate
// Any View Specification
{
...
"transform": [
{"calculate": ..., "as" ...} // Calculate Transform
...
],
...
}
It takes two properties: `calculate` and `as`.
`calculate` accepts an expression over the columns of the input data, using `datum` to refer to the current row, e.g. 2*datum.col1+datum.col2. Many expressions are supported; they are not covered here, see the Vega expression documentation.
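A minimal sketch (col1/col2 are the placeholder column names from the expression above):

```json
{
  "data": {"values": [{"col1": 1, "col2": 10}, {"col1": 2, "col2": 20}, {"col1": 3, "col2": 15}]},
  "transform": [
    {"calculate": "2 * datum.col1 + datum.col2", "as": "combined"}
  ],
  "mark": "point",
  "encoding": {
    "x": {"field": "col1", "type": "quantitative"},
    "y": {"field": "combined", "type": "quantitative"}
  }
}
```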
Density
Performs kernel density estimation over a given field and generates a density curve.
// Any View Specification
{
...
"transform": [
{"density": ...} // Density Transform
...
],
...
}
Density parameters
Property | Type | Description |
---|---|---|
density | String | Required. The data field for which to perform density estimation. |
groupby | String[] | The data fields to group by. If not specified, a single group containing all data objects will be used. |
cumulative | Boolean | A boolean flag indicating whether to produce density estimates (false) or cumulative density estimates (true). Default value: false |
counts | Boolean | A boolean flag indicating if the output values should be probability estimates (false) or smoothed counts (true). Default value: false |
bandwidth | Number | The bandwidth (standard deviation) of the Gaussian kernel. If unspecified or set to zero, the bandwidth value is automatically estimated from the input data using Scott’s rule. |
extent | Number[] | A [min, max] domain from which to sample the distribution. If unspecified, the extent will be determined by the observed minimum and maximum values of the density value field. |
minsteps | Number | The minimum number of samples to take along the extent domain for plotting the density. Default value: 25 |
maxsteps | Number | The maximum number of samples to take along the extent domain for plotting the density. Default value: 200 |
steps | Number | The exact number of samples to take along the extent domain for plotting the density. If specified, overrides both minsteps and maxsteps to set an exact number of uniform samples. Potentially useful in conjunction with a fixed extent to ensure consistent sample points for stacked densities. |
as | String[] | The output fields for the sample value and corresponding density estimate. Default value: ["value", "density"] |
Example 1: a simple density plot:
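A minimal sketch (placeholder field name; the bandwidth value is arbitrary):

```json
{
  "data": {"url": "data/movies.json"},
  "transform": [
    {"density": "IMDB Rating", "bandwidth": 0.3}
  ],
  "mark": "area",
  "encoding": {
    "x": {"field": "value", "type": "quantitative", "title": "IMDB Rating"},
    "y": {"field": "density", "type": "quantitative"}
  }
}
```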
Example 2: a grouped, stacked density plot: `groupby` defines the groups, `extent` bounds the sampling range, and the grouping column is mapped to color in `encoding`.
Example 3: faceting: the grouping column is mapped to the y/row channel in `encoding` so that each group gets its own row.
Filter
Filters the data according to the given predicate:
// Any View Specification
{
...
"transform": [
{"filter": ...} // Filter Transform
...
],
...
}
`filter` accepts a Predicate, which can be:
- an expression string, e.g. {filter: "datum.b2 > 60"}
- a field predicate: equal, lt, lte, gt, gte, range, oneOf, valid; see the field predicate docs
- a selection predicate: the name of a selection, or a logical combination of selections; see the selection predicate docs
- a logical composition of the predicates above with and, or, not; see the logical composition docs
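A minimal sketch combining a field predicate with a logical composition (the cars fields are placeholders):

```json
{
  "data": {"url": "data/cars.json"},
  "transform": [
    {
      "filter": {
        "and": [
          {"field": "Cylinders", "range": [4, 6]},
          {"field": "Origin", "oneOf": ["USA", "Japan"]}
        ]
      }
    }
  ],
  "mark": "point",
  "encoding": {
    "x": {"field": "Horsepower", "type": "quantitative"},
    "y": {"field": "Miles_per_Gallon", "type": "quantitative"}
  }
}
```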
Flatten
Flattens array-valued fields into one row per element. Arrays are aligned element by element; if they differ in length, the shorter ones are padded with null.
// given the following table:
[
{"key": "alpha", "foo": [1, 2], "bar": ["A", "B"]},
{"key": "beta", "foo": [3, 4, 5], "bar": ["C", "D"]}
]
// applying flatten:
{"flatten": ["foo", "bar"]}
// produces the following table:
[
{"key": "alpha", "foo": 1, "bar": "A"},
{"key": "alpha", "foo": 2, "bar": "B"},
{"key": "beta", "foo": 3, "bar": "C"},
{"key": "beta", "foo": 4, "bar": "D"},
{"key": "beta", "foo": 5, "bar": null}
]
There is also a more advanced example (noted for later; it is a bit involved and I have not worked through it yet).
Fold
`fold` turns the specified columns (one or more) into key-value pairs, similar to reshaping a wide table into a long one.
// original table
[
{"country": "USA", "gold": 10, "silver": 20},
{"country": "Canada", "gold": 7, "silver": 26}
]
// fold these two columns
{"fold": ["gold", "silver"]}
// new table: both columns are now key-value pairs
[
{"key": "gold", "value": 10, "country": "USA", "gold": 10, "silver": 20},
{"key": "silver", "value": 20, "country": "USA", "gold": 10, "silver": 20},
{"key": "gold", "value": 7, "country": "Canada", "gold": 7, "silver": 26},
{"key": "silver", "value": 26, "country": "Canada", "gold": 7, "silver": 26}
]
Impute
This transform fills in (imputes) missing data. In my use cases I rarely need Vega-Lite for this kind of data processing, so I am skipping it for now.
Join Aggregate
Joins the new columns produced by an `aggregate` operation back onto the original rows.
Usage is similar to `aggregate`:
- `joinaggregate`: supports the op, field, and as properties
- `groupby`
An example: deviation from the mean, selecting movies rated more than 2.5 points above the average rating:
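A minimal sketch along those lines (assuming the usual movies fields):

```json
{
  "data": {"url": "data/movies.json"},
  "transform": [
    {
      "joinaggregate": [{"op": "mean", "field": "IMDB Rating", "as": "AverageRating"}]
    },
    {"filter": "(datum['IMDB Rating'] - datum.AverageRating) > 2.5"}
  ],
  "mark": "bar",
  "encoding": {
    "x": {"field": "IMDB Rating", "type": "quantitative"},
    "y": {"field": "Title", "type": "nominal"}
  }
}
```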
Alternatively, instead of filtering, the movies above/below the mean can be highlighted.
Loess
Locally weighted regression (loess) for smoothing; generates a trend line.
Property | Type | Description |
---|---|---|
loess | String | The data field to smooth (dependent variable) |
on | String | The independent variable field |
groupby | String[] | The fields to group by |
bandwidth | Number | A bandwidth in the range [0, 1] that controls the amount of smoothing |
as | String[] | Output field names |
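A minimal sketch layering raw points with a loess trend line (assuming the usual movies fields; the bandwidth is arbitrary):

```json
{
  "data": {"url": "data/movies.json"},
  "layer": [
    {
      "mark": "point",
      "encoding": {
        "x": {"field": "Rotten Tomatoes Rating", "type": "quantitative"},
        "y": {"field": "IMDB Rating", "type": "quantitative"}
      }
    },
    {
      "transform": [
        {"loess": "IMDB Rating", "on": "Rotten Tomatoes Rating", "bandwidth": 0.3}
      ],
      "mark": {"type": "line", "color": "firebrick"},
      "encoding": {
        "x": {"field": "Rotten Tomatoes Rating", "type": "quantitative"},
        "y": {"field": "IMDB Rating", "type": "quantitative"}
      }
    }
  ]
}
```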
Lookup
Looks up, in a secondary data source, the object matching the specified field of the primary data source.
Property | Type | Description |
---|---|---|
lookup | String | The key field in the primary data |
from | LookupData/LookupSelection | The secondary data source |
as | String[] | Output field names |
default | Any | The default value assigned when no match is found; defaults to null |
Secondary data properties:
Property | Type | Description |
---|---|---|
data | Data | The data source |
key | String | The key field in the secondary data |
fields | String[] | The fields to retrieve; if not specified, the entire matching object is used |
An example:
The input data tables are as follows.
lookup_groups.csv:
group | person |
---|---|
1 | Alan |
1 | George |
1 | Fred |
2 | Steve |
2 | Nick |
2 | Will |
3 | Cole |
3 | Rick |
3 | Tom |
lookup_people.csv:
name | age | height |
---|---|---|
Alan | 25 | 180 |
George | 32 | 174 |
Fred | 39 | 182 |
Steve | 42 | 161 |
Nick | 23 | 180 |
Will | 21 | 168 |
Cole | 51 | 160 |
Rick | 63 | 181 |
Tom | 54 | 179 |
Merge the two tables by person name:
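A minimal sketch (assuming the two CSV files above are reachable at the given URLs; the aggregation in the encoding is arbitrary):

```json
{
  "data": {"url": "data/lookup_groups.csv"},
  "transform": [
    {
      "lookup": "person",
      "from": {
        "data": {"url": "data/lookup_people.csv"},
        "key": "name",
        "fields": ["age", "height"]
      }
    }
  ],
  "mark": "bar",
  "encoding": {
    "x": {"field": "group", "type": "ordinal"},
    "y": {"aggregate": "mean", "field": "age", "type": "quantitative"}
  }
}
```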
Advanced usage:
lookup also accepts the name (`param`) of a selection as its data source. The following example uses lookup for a slick interactive index chart:
{
"data": {
"url": "/assets/data/stocks.csv",
"format": {"parse": {"date": "date"}}
},
"width": 650,
"height": 300,
"layer": [
{
//define the interaction rule here
"params": [{
"name": "index",
"value": [{"x": {"year": 2005, "month": 1, "date": 1}}],
"select": {
"type": "point",
"encodings": ["x"],
"on": "mouseover",
"nearest": true
}
}],
"mark": "point",
"encoding": {
"x": {"field": "date", "type": "temporal", "axis": null},
"opacity": {"value": 0}
}
},
{
"transform": [
{
"lookup": "symbol",
//here, from is set to the param of the interaction rule
"from": {"param": "index", "key": "symbol"}
},
{
//the resulting table joins the index selection onto the original stocks table, so the index values can be used in calculations together with the original values
"calculate": "datum.index && datum.index.price > 0 ? (datum.price - datum.index.price)/datum.index.price : 0",
"as": "indexed_price"
}
],
"mark": "line",
"encoding": {
"x": {"field": "date", "type": "temporal", "axis": null},
"y": {
"field": "indexed_price", "type": "quantitative",
"axis": {"format": "%"}
},
"color": {"field": "symbol", "type": "nominal"}
}
},
{
"transform": [{"filter": {"param": "index"}}],
"encoding": {
"x": {"field": "date", "type": "temporal", "axis": null},
"color": {"value": "firebrick"}
},
"layer": [
{"mark": {"type": "rule", "strokeWidth": 0.5}},
{
"mark": {"type": "text", "align": "center", "fontWeight": 100},
"encoding": {
"text": {"field": "date", "timeUnit": "yearmonth"},
"y": {"value": 310}
}
}
]
}
]
}
Pivot
Reshapes a long table into a wide one; the inverse of fold.
Property | Type | Description |
---|---|---|
pivot | String | The field to pivot on; its unique values become the column names of the new table |
value | String | The field whose (aggregated) values populate the new pivoted columns |
groupby | String[] | The fields to group by |
limit | Number | The maximum number of pivoted columns to generate; default 0, i.e. no limit |
op | String | The aggregate operation applied to the grouped value field; default sum |
Example:
[
{"country": "Norway", "type": "gold", "count": 14},
{"country": "Norway", "type": "silver", "count": 14},
{"country": "Norway", "type": "bronze", "count": 11},
{"country": "Germany", "type": "gold", "count": 14},
{"country": "Germany", "type": "silver", "count": 10},
{"country": "Germany", "type": "bronze", "count": 7},
{"country": "Canada", "type": "gold", "count": 11},
{"country": "Canada", "type": "silver", "count": 8},
{"country": "Canada", "type": "bronze", "count": 10}
]
// apply the following pivot:
{
"pivot": "type",
"groupby": ["country"],
"value": "count"
}
// result:
[
{"country": "Norway", "gold": 14, "silver": 14, "bronze": 11},
{"country": "Germany", "gold": 14, "silver": 10, "bronze": 7},
{"country": "Canada", "gold": 11, "silver": 8, "bronze": 10}
]
Quantile
Computes quantiles.
Property | Type | Description |
---|---|---|
quantile | String | The data field to process |
groupby | String[] | The fields to group by |
probs | Number[] | An array of probabilities in (0, 1) at which to compute quantiles; if not provided, the step value is used |
step | Number | The step size between probabilities (default 0.01); only used when probs is not given |
as | String[] | Output field names, default ["prob", "value"] |
{"quantile": "measure", "probs": [0.25, 0.5, 0.75]}
// output
[
{prob: 0.25, value: 1.34},
{prob: 0.5, value: 5.82},
{prob: 0.75, value: 9.31}
];
Example: generating Q-Q plots:
{
"data": {
"url": "/assets/data/normal-2d.json"
},
"transform": [
{
"quantile": "u",
"step": 0.01,
"as": [
"p",
"v"
]
},
{
"calculate": "quantileUniform(datum.p)",
"as": "unif"
},
{
"calculate": "quantileNormal(datum.p)",
"as": "norm"
}
],
"hconcat": [
{
"mark": "point",
"encoding": {
"x": {
"field": "unif",
"type": "quantitative"
},
"y": {
"field": "v",
"type": "quantitative"
}
}
},
{
"mark": "point",
"encoding": {
"x": {
"field": "norm",
"type": "quantitative"
},
"y": {
"field": "v",
"type": "quantitative"
}
}
}
]
}
Regression
Supported regression models:
- linear: \( y = a + bx \)
- log: logarithmic, \( y = a + b \cdot \log(x) \)
- exp: exponential, \( y = a \cdot e^{bx} \)
- pow: power, \( y = a \cdot x^b \)
- quad: quadratic, \( y = a + bx + cx^2 \)
- poly: polynomial, \( y = a + bx + \dots + kx^{order} \)
Property | Type | Description |
---|---|---|
regression | String | The dependent variable field |
on | String | The independent variable field |
groupby | String[] | The fields to group by |
method | String | One of the models above; default linear |
order | Number | The polynomial order for the poly method; default 3 |
extent | Number[] | The [min, max] domain over which to draw the trend line |
params | Boolean | If true, return the model parameters (a coef array and an rSquared value) instead of the points of the trend line |
as | String[] | Output field names; defaults to the names of the x (on) and y (regression) fields |
An example:
{
"data": {
"url": "/assets/data/movies.json"
},
"layer": [
{
"mark": {
"type": "point",
"filled": true
},
"encoding": {
"x": {
"field": "Rotten Tomatoes Rating",
"type": "quantitative"
},
"y": {
"field": "IMDB Rating",
"type": "quantitative"
}
}
},
{
"mark": {
"type": "line",
"color": "firebrick"
},
"transform": [
{
"regression": "IMDB Rating",
"on": "Rotten Tomatoes Rating"
}
],
"encoding": {
"x": {
"field": "Rotten Tomatoes Rating",
"type": "quantitative"
},
"y": {
"field": "IMDB Rating",
"type": "quantitative"
}
}
},
{
"transform": [
{
"regression": "IMDB Rating",
"on": "Rotten Tomatoes Rating",
"params": true
},
{"calculate": "'R²: '+format(datum.rSquared, '.2f')", "as": "R2"}
],
"mark": {
"type": "text",
"color": "firebrick",
"x": "width",
"align": "right",
"y": -5
},
"encoding": {
"text": {"type": "nominal", "field": "R2"}
}
}
]
}
Sample
Random sampling. It takes a single parameter, `sample`, which sets the sample size, e.g. {"sample": 500}:
// Any View Specification
{
...
"transform": [
{"sample": 500} // Sample Transform
...
],
...
}
Stack
`stack` stacks marks (e.g. stacked bar charts) and can be used in either `encoding` or `transform`.
It only applies to continuous x, y, theta, and radius channels. The stack offset can be:
- zero or true: stacking from a zero baseline (like position="stack" in ggplot2), the basic stacked bar chart
- normalize: normalized stacking (like position="fill" in ggplot2), also used for pie charts
- center: stacking shifted towards the center, used for streamgraphs
- null or false: the groups simply overlap each other
There are many examples; they are not reproduced here, see the official documentation.
An advanced example: instead of stack, use calculate to flip the sign of one group and build a diverging (bidirectional) stacked chart:
{
"data": { "url": "/assets/data/population.json"},
"transform": [
{"filter": "datum.year == 2000"},
{"calculate": "datum.sex == 2 ? 'Female' : 'Male'", "as": "gender"},
{"calculate": "datum.sex == 2 ? -datum.people : datum.people", "as": "signed_people"}
],
"width": 500,
"height": 300,
"mark": "bar",
"encoding": {
"y": {
"field": "age",
"axis": null, "sort": "descending"
},
"x": {
"aggregate": "sum", "field": "signed_people",
"title": "population",
"axis": {"format": "s"}
},
"color": {
"field": "gender",
"scale": {"range": ["#675193", "#ca8861"]},
"legend": {"orient": "top", "title": null}
}
},
"config": {
"view": {"stroke": null},
"axis": {"grid": false}
}
}
Another advanced usage: when stacking line marks, the stack offset has to be declared explicitly:
{
"data": {"url": "/assets/data/population.json"},
"transform": [
{"filter": "datum.year == 2000"},
{"calculate": "datum.sex == 2 ? 'Female' : 'Male'", "as": "gender"}
],
"layer": [
{
"mark": {"opacity": 0.7, "type": "area"},
"encoding": {
"y": {"aggregate": "sum", "field": "people", "type": "quantitative"},
"x": {"field": "age", "type": "nominal"},
"color": {
"field": "gender",
"scale": {"range": ["#675193", "#ca8861"]},
"type": "nominal"
},
"opacity": {"value": 0.7}
}
},
{
"mark": {"type": "line"},
"encoding": {
"y": {
"aggregate": "sum",
"field": "people",
"type": "quantitative",
"stack": "zero"
},
"x": {"field": "age", "type": "nominal"},
"color": {
"field": "gender",
"scale": {"range": ["#675193", "#ca8861"]},
"type": "nominal"
},
"opacity": {"value": 0.7}
}
}
]
}
When `stack` is used in `transform`, it supports the following properties: stack, groupby, offset, sort, as.
Here, sort can be used to order the stacked layers, similar to ordering by factor levels in ggplot2.
There are two further advanced usages (custom offsets and mosaic plots) that I rarely need; their code is not included here.
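A minimal sketch of stack inside transform, reusing the population.json fields from the examples above (the output names v0/v1 are arbitrary):

```json
{
  "data": {"url": "/assets/data/population.json"},
  "transform": [
    {"filter": "datum.year == 2000"},
    {"calculate": "datum.sex == 2 ? 'Female' : 'Male'", "as": "gender"},
    {
      "aggregate": [{"op": "sum", "field": "people", "as": "total"}],
      "groupby": ["age", "gender"]
    },
    {
      "stack": "total",
      "groupby": ["age"],
      "offset": "zero",
      "sort": [{"field": "gender", "order": "ascending"}],
      "as": ["v0", "v1"]
    }
  ],
  "mark": "bar",
  "encoding": {
    "x": {"field": "age", "type": "ordinal"},
    "y": {"field": "v0", "type": "quantitative", "title": "population"},
    "y2": {"field": "v1"},
    "color": {"field": "gender", "type": "nominal"}
  }
}
```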
Time Unit
I rarely work with time-series data, so I am skipping this section for now. Original docs: here
Window
Performs calculations (e.g. ranking, lead/lag, aggregates) over sorted groups of data objects and writes the results back onto the input data stream.
Transform Parameters
Property | Type | Description |
---|---|---|
sort | Compare | Defines how the data are ordered before the window is applied |
groupby | Field[] | The fields to partition (group) by |
ops | String[] | The window/aggregate operations to apply, e.g. rank, lead, sum; see the window operation reference below |
fields | Field[] | The fields to compute over; this array is aligned with ops, as and params |
params | Array | Parameter values for the window functions, aligned with ops |
as | String[] | Output field names for the ops; if not specified, names are generated automatically from the operations |
frame | Number[] | A two-element array configuring the sliding window; [-5, 5] means the window covers the current object plus the five objects before and after it. Default [null, 0], i.e. the current object and all preceding objects (null means unbounded) |
ignorePeers | Boolean | Whether the sliding window ignores peer values (peers are objects that tie under the sort order); default false |
Window operation reference
Valid operations inside window include all aggregate operations plus the following:
Operation | Parameter | Description |
---|---|---|
row_number | None | Assigns row numbers starting from 1 |
rank | None | Assigns ranks starting from 1; ties share a rank and later ranks skip ahead, e.g. 1,1,3,3,5 |
dense_rank | None | Ranks starting from 1; ties share a rank and later ranks do not skip, e.g. 1,1,2,2,3 |
percent_rank | None | Assigns a percentage rank, computed as \((rank - 1)/(groupSize - 1)\) |
cume_dist | None | Assigns a cumulative distribution value between 0 and 1 |
ntile | Number | Assigns a quantile bucket; the parameter is the number of buckets (e.g. 100 for percentiles, 5 for quintiles) |
lag | Number | The value at the given offset before the current object, or null if it does not exist; the default offset is 1 |
lead | Number | The value at the given offset after the current object |
first_value | None | The first value in the current sliding window |
last_value | None | The last value in the current sliding window |
nth_value | Number | The nth value in the current sliding window |
prev_value | None | The nearest preceding non-missing value in the sorted data (including the current value) |
next_value | None | The nearest following non-missing value in the sorted data (including the current value) |
Example
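A minimal sketch (assuming the usual movies fields): rank films by rating and keep the top 5:

```json
{
  "data": {"url": "data/movies.json"},
  "transform": [
    {
      "window": [{"op": "rank", "as": "rank"}],
      "sort": [{"field": "IMDB Rating", "order": "descending"}]
    },
    {"filter": "datum.rank <= 5"}
  ],
  "mark": "bar",
  "encoding": {
    "x": {"field": "IMDB Rating", "type": "quantitative"},
    "y": {"field": "Title", "type": "nominal", "sort": "-x"}
  }
}
```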
Wordcloud
Example
Transform parameters
Property | Type | Description |
---|---|---|
font | String/Expr | Font family |
fontStyle | String/Expr | Font style |
fontWeight | String/Expr | Font weight |
fontSize | Number/Expr | Font size |
fontSizeRange | Number[] | Font size range; if a range is given and fontSize is not, sizes are scaled within the range on a square-root scale |
padding | Number/Expr | Padding around each word |
rotate | Number/Expr | Rotation angle in degrees |
text | Field | The text field |
size | Number[] | Layout size, [width, height] |
spiral | String | Layout method, archimedean (default) or rectangular |
as | String[] | Output fields, default ["x", "y", "font", "fontSize", "fontStyle", "fontWeight", "angle"] |
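For reference, a rough sketch of the corresponding transform entry inside a Vega (not Vega-Lite) spec, assuming an upstream data table with `text` and `count` fields; treat the exact parameter shapes as an assumption rather than a verified spec:

```json
{
  "type": "wordcloud",
  "size": [800, 400],
  "text": {"field": "text"},
  "font": "Helvetica Neue, Arial",
  "fontSize": {"field": "datum.count"},
  "fontSizeRange": [12, 56],
  "rotate": 0,
  "padding": 2
}
```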