Data input formats¶
Contents
1 Pre-requirements
1.1 Import dependencies
1.2 Notebook configuration
2 Overview
3 Points
3.1 2D NumPy array of shape (n, d)
4 Distances
4.1 2D NumPy array of shape (n, n)
5 Neighbourhoods
6 Density graph
Pre-requirements¶
Import dependencies¶
[1]:
import sys
import matplotlib as mpl
import numpy as np  # Used as np in the cells below
import cnnclustering.cnn as cnn  # CNN clustering
[2]:
# Version information
print(sys.version)
3.8.3 (default, May 15 2020, 15:24:35)
[GCC 8.3.0]
Notebook configuration¶
[3]:
# Matplotlib configuration
mpl.rc_file(
    "matplotlibrc",
    use_default_template=False
)
[3]:
# Axis property defaults for the plots
ax_props = {
    "xlabel": None,
    "ylabel": None,
    "xlim": (-2.5, 2.5),
    "ylim": (-2.5, 2.5),
    "xticks": (),
    "yticks": (),
    "aspect": "equal"
}
# Line plot property defaults
line_props = {
    "linewidth": 0,
    "marker": '.',
}
Overview¶
A data set of \(n\) points can primarily be represented through point coordinates in a \(d\)-dimensional space, or in terms of a pairwise distance matrix (of arbitrary metric). Secondarily, the data set can be described by neighbourhoods (in a graph structure) with respect to a specific radius cutoff. Furthermore, the neighbourhoods can be trimmed into a density graph that contains, for each point, its density-connected points rather than its neighbours. The memory demand of these input forms and the speed at which they can be clustered vary. Currently, the cnnclustering.cnn module can deal with the following data structures (\(n\): number of points, \(d\): number of dimensions).
Points
    2D NumPy array of shape (n, d), holding point coordinates
Distances
    2D NumPy array of shape (n, n), holding pairwise distances
Neighbourhoods
    1D NumPy array of shape (n,) of 1D NumPy arrays of shape (<= n,), holding point indices
    Python list of length (n) of Python sets of length (<= n), holding point indices
    Sparse graph with 1D NumPy array of shape (<= n²), holding point indices, and 1D NumPy array of shape (n,), holding neighbourhood start indices
Density graph
    1D NumPy array of shape (n,) of 1D NumPy arrays of shape (<= n,), holding point indices
    Python list of length (n) of Python sets of length (<= n), holding point indices
    Sparse graph with 1D NumPy array of shape (<= n²), holding point indices, and 1D NumPy array of shape (n,), holding connectivity start indices
The different input structures are wrapped by corresponding classes so that they can be handled as attributes of a CNN cluster object. Different input formats corresponding to the same data set are bundled in a Data object.
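To make the sparse graph layout listed above more concrete, the following is a minimal sketch in plain NumPy of how a list of neighbour sets could be flattened into one array of point indices plus one array of start indices. The toy data and the names indices, starts and neighbours_of are illustrative assumptions and not part of the cnnclustering API; the module's internal layout may differ in detail.

```python
import numpy as np

# Hypothetical toy neighbourhoods for n = 4 points (list of sets)
neighbourhoods = [{1, 2}, {0}, {0, 3}, {2}]

# One flat 1D array holding all neighbour indices, point after point
indices = np.array(
    [j for neighbours in neighbourhoods for j in sorted(neighbours)],
    dtype=np.intp,
)

# Start position of each neighbourhood within `indices`, shape (n,)
sizes = np.array([len(neighbours) for neighbours in neighbourhoods])
starts = np.concatenate(([0], np.cumsum(sizes)[:-1]))


def neighbours_of(i):
    """Return the neighbour indices of point i from the flat layout."""
    stop = starts[i + 1] if i + 1 < len(starts) else len(indices)
    return indices[starts[i]:stop]


print(indices)           # flat neighbour indices
print(starts)            # neighbourhood start indices
print(neighbours_of(2))  # neighbours of point 2
```

Compared with a list of sets, the flat layout stores all indices in two contiguous arrays, which is usually more compact; the sets, in turn, make membership tests convenient.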
Points¶
2D NumPy array of shape (n, d)¶
The cnn module provides the class Points to handle data set point coordinates. Instances of type Points behave essentially like NumPy arrays.
[19]:
points = cnn.Points()
print("Representation of points: ", repr(points))
print("Points are Numpy arrays: ", isinstance(points, np.ndarray))
Representation of points: Points([], dtype=float64)
Points are Numpy arrays: True
If your data points are already in the format of a 2D NumPy array, the conversion into Points is straightforward and does not require any copying: as the following cells show, modifying points also changes original_points. Note that the dtype of Points is for now fixed to np.float_ (NumPy's alias for np.float64).
[42]:
original_points = np.array([[0, 0, 0],
                            [1, 1, 1]], dtype=np.float_)
points = cnn.Points(original_points)
points[0, 0] = 1
points
[42]:
Points([[1., 0., 0.],
[1., 1., 1.]])
[43]:
original_points
[43]:
array([[1., 0., 0.],
[1., 1., 1.]])
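The shared-memory behaviour seen above is what NumPy's view casting gives any ndarray subclass. The sketch below uses a hypothetical stand-in class, PointsLike, to illustrate the mechanism; it is not the implementation of cnn.Points, which may construct its instances differently. np.float64 is used here because the np.float_ alias was removed in NumPy 2.0.

```python
import numpy as np


class PointsLike(np.ndarray):
    """Hypothetical ndarray subclass, used only for illustration."""


original = np.array([[0, 0, 0],
                     [1, 1, 1]], dtype=np.float64)

# View casting reinterprets the same memory buffer as the subclass
view = original.view(PointsLike)
view[0, 0] = 1  # Writes through to `original`

shares = np.shares_memory(original, view)
print(shares, original[0, 0])

# A list has no array buffer to share, so a new array must be allocated
from_list = np.asarray([[0, 0, 0], [1, 1, 1]], dtype=np.float64).view(PointsLike)
```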
1D sequences are interpreted as a single point on initialisation.
[45]:
points = cnn.Points(np.array([0, 0, 0]))
points
[45]:
Points([[0., 0., 0.]])
Other sequences, such as lists, work as input too, but note that this requires a copy.
[47]:
original_points = [[0, 0, 0],
                   [1, 1, 1]]
points = cnn.Points(original_points)
points
[47]:
Points([[0., 0., 0.],
[1., 1., 1.]])
Points can be used to represent data sets distributed over multiple parts. Parts could constitute independent measurements that should be clustered together but remain separated for later analyses. Internally, Points always stores the underlying point coordinates as a single (vertically stacked) 2D array. Points.edges tracks the number of points belonging to each part. The alternative constructor Points.from_parts can be used to deduce edges from parts of points passed as a sequence of 2D sequences.
[64]:
points = cnn.Points.from_parts([[[0, 0, 0],
                                 [1, 1, 1]],
                                [[2, 2, 2],
                                 [3, 3, 3]]])
points
[64]:
Points([[0., 0., 0.],
[1., 1., 1.],
[2., 2., 2.],
[3., 3., 3.]])
[65]:
points.edges # 2 parts, 2 points each
[65]:
array([2, 2])
Trying to set edges manually to a sequence that is not consistent with the total number of points will raise an error. Setting the edges of an empty Points object is, however, allowed and can be used to store part information even when no points are loaded.
[66]:
points.edges = [2, 3]
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-66-4bd144cf309c> in <module>
----> 1 points.edges = [2, 3]
~/CNN/cnnclustering/cnn.py in edges(self, x)
810
811 if (n != 0) and (sum_edges != n):
--> 812 raise ValueError(
813 f"Part edges ({sum_edges} points) do not match data points "
814 f"({n} points)"
ValueError: Part edges (5 points) do not match data points (4 points)
Points.by_parts can be used to retrieve the parts again one by one.
[70]:
for part in points.by_parts():
    print(f"{part} \n")
[[0. 0. 0.]
[1. 1. 1.]]
[[2. 2. 2.]
[3. 3. 3.]]
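The bookkeeping behind parts and edges can be reproduced with plain NumPy. The following is only a sketch of the idea (stack the parts vertically, remember the part lengths, split at the cumulative lengths) and makes no claim about how Points.from_parts or Points.by_parts are implemented; the variable names are illustrative.

```python
import numpy as np

parts = [
    [[0, 0, 0], [1, 1, 1]],
    [[2, 2, 2], [3, 3, 3]],
]

# Stack all parts vertically into a single 2D array
stacked = np.vstack([np.asarray(part, dtype=np.float64) for part in parts])

# Number of points per part, analogous to the edges shown above
edges = np.array([len(part) for part in parts])

# Consistency check in the spirit of the ValueError shown above
if edges.sum() != stacked.shape[0]:
    raise ValueError("Part edges do not match the number of data points")

# Recover the parts by splitting at the cumulative part lengths
recovered = np.split(stacked, np.cumsum(edges)[:-1])
for part in recovered:
    print(part, "\n")
```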
To provide one possible way to calculate neighbourhoods from points, Points has a thin method wrapper for scipy.spatial.cKDTree. This will set Points.tree, which is used by CNN.calc_neighbours_from_cKDTree. Users are encouraged to use any other external method instead.
[75]:
points.cKDTree()
points.tree
[75]:
<scipy.spatial.ckdtree.cKDTree at 0x7f0f6d3f3900>
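For readers who want to compute neighbourhoods outside of the cnn module, a radius query can be done with SciPy directly. The sketch below is an assumption about what such an external calculation might look like, not a reproduction of CNN.calc_neighbours_from_cKDTree; the toy points, the radius r, and the decision to exclude each point from its own neighbourhood are choices made here for illustration.

```python
import numpy as np
from scipy.spatial import cKDTree

# Hypothetical toy data: three points in 2D
points = np.array([[0.0, 0.0],
                   [0.5, 0.0],
                   [3.0, 3.0]])
r = 1.0  # Radius cutoff (illustrative value)

tree = cKDTree(points)

# For each point, the indices of all points within distance r
# (the query point itself is included in the result)
ball = tree.query_ball_point(points, r)

# Convert to a list of sets and drop each point from its own neighbourhood
neighbourhoods = [set(found) - {i} for i, found in enumerate(ball)]
print(neighbourhoods)
```

The result has the "Python list of Python sets" form listed in the overview.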
Distances¶
2D NumPy array of shape (n, n)¶
The cnn module provides the class Distances to handle data set pairwise distances as a dense matrix. Like Points, instances of type Distances behave much like NumPy arrays.
[79]:
distances = cnn.Distances([[0, 1], [1, 0]])
distances
[79]:
Distances([[0., 1.],
[1., 0.]])
Distances does not support an edges attribute, i.e. it cannot represent part information. Use the edges of an associated Points instance instead.
Pairwise Distances can be calculated for \(n\) points within a data set from a Points instance, for example with CNN.calc_dist, resulting in a matrix of shape (\(n\), \(n\)). They can also be calculated between \(n\) points in one data set and \(m\) points in another, resulting in a relative distance matrix (map matrix) of shape (\(n\), \(m\)). In the latter case, Distances.reference should be used to keep track of the CNN object carrying the second data set. Such a map matrix can be used to predict cluster labels for a data set based on the fitted cluster labels of another set.
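The two matrix shapes described above can be illustrated with plain NumPy broadcasting. This sketch assumes Euclidean distances and uses a hypothetical helper, pairwise_distances; it is not CNN.calc_dist, whose metric and options may differ.

```python
import numpy as np


def pairwise_distances(a, b):
    """Euclidean distances between the rows of a and the rows of b."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    diff = a[:, None, :] - b[None, :, :]  # Shape (len(a), len(b), d)
    return np.sqrt((diff ** 2).sum(axis=-1))


# Hypothetical toy data sets
fitted = np.array([[0.0, 0.0], [3.0, 4.0]])             # 2 points
other = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 0.0]])  # 3 points

# Distances within one data set: square matrix
dist = pairwise_distances(fitted, fitted)

# Map matrix: one row per point in `other`, one column per point in `fitted`
dist_map = pairwise_distances(other, fitted)

print(dist.shape, dist_map.shape)
```

Broadcasting builds an intermediate array with one entry per pair and dimension, so for large data sets a dedicated routine is likely the more memory-friendly choice.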