Core Functions

Core functionality for preparing sequential data for PyTorch and fastai models

4. Splitting into Training and Validation Sets

Splitting can be based on indices known in advance, on the file path, or on any other general function.

In practice, split within a sequence only when a single sequence is available. In that case, divide the sequence manually beforehand.
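As a rough illustration of dividing a single sequence manually, the sketch below cuts one sequence into a contiguous training part and a contiguous validation part. The helper `split_sequence` and the parameter `train_frac` are hypothetical names chosen for this example and are not part of tsfast or fastai.

```python
def split_sequence(seq, train_frac=0.8):
    """Cut one sequence into a leading training part and a trailing validation part."""
    cut = int(len(seq) * train_frac)  # index of the first validation sample
    return seq[:cut], seq[cut:]

# Placeholder data standing in for a single measured sequence.
sequence = list(range(10))
train_part, valid_part = split_sequence(sequence, train_frac=0.8)
print(len(train_part), len(valid_part))
```

Keeping both parts contiguous preserves the temporal order of the samples inside each part.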

4.1 Splitting with a Given Index

from nbdev.config import get_config
project_root = get_config().config_file.parent
f_path = project_root / 'test_data/WienerHammerstein'
hdf_files = get_files(f_path,extensions='.hdf5',recurse=True).sorted()
splitter = IndexSplitter([1,2])
test_eq(splitter(hdf_files),[[0],[1,2]])
list_dict = CreateDict()(hdf_files)
list_dict
[{'path': '/home/pheenix/Development/tsfast/test_data/WienerHammerstein/test/WienerHammerstein_test.hdf5'},
 {'path': '/home/pheenix/Development/tsfast/test_data/WienerHammerstein/train/WienerHammerstein_train.hdf5'},
 {'path': '/home/pheenix/Development/tsfast/test_data/WienerHammerstein/valid/WienerHammerstein_valid.hdf5'}]
test_eq(splitter(list_dict),splitter(hdf_files))
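To make the behaviour concrete, here is a minimal pure-Python sketch of index-based splitting. It is not the `IndexSplitter` implementation; the helper name `split_by_valid_idx` is hypothetical. It only assumes that the given indices go to the validation set and all remaining indices go to the training set, which is consistent with the test above.

```python
def split_by_valid_idx(items, valid_idx):
    """Return (train_indices, valid_indices) for a list of items."""
    valid = set(valid_idx)
    # Every index that is not explicitly marked as validation goes to training.
    train = [i for i in range(len(items)) if i not in valid]
    return train, list(valid_idx)

# Placeholder file names; only the number of items matters here.
files = ['test.hdf5', 'train.hdf5', 'valid.hdf5']
train_idx, valid_idx = split_by_valid_idx(files, [1, 2])
print(train_idx, valid_idx)
```

Because only the positions are used, the same indices result whether the items are paths or dictionaries.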

4.2 Splitting with a General Function

Items for which the given function returns True are assigned to the validation set; all others go to the training set. In this example, the function checks the name of the parent folder.

splitter = FuncSplitter(lambda o: Path(o).parent.name == 'valid')
splitter(hdf_files)
test_eq(splitter(hdf_files),[[0,1],[2]])
((#2) [np.int64(0),np.int64(1)], (#1) [2])
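The same idea can be sketched without fastai. The helper `split_by_func` below is a hypothetical stand-in, not the `FuncSplitter` implementation, and the paths are made-up placeholders that mimic the folder layout used above.

```python
from pathlib import Path

def split_by_func(items, is_valid):
    """Assign each index to validation if is_valid(item) is true, otherwise to training."""
    train, valid = [], []
    for i, item in enumerate(items):
        (valid if is_valid(item) else train).append(i)
    return train, valid

# Placeholder paths in test/train/valid folders.
paths = ['data/test/a.hdf5', 'data/train/b.hdf5', 'data/valid/c.hdf5']
train_idx, valid_idx = split_by_func(paths, lambda o: Path(o).parent.name == 'valid')
print(train_idx, valid_idx)
```

Note that the file in the `test` folder lands in the training set here, because the function only distinguishes "valid" from "everything else".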

4.3 Splitting by Parent Folder

A splitter that explicitly assigns items to the training and validation sets based on the names of their parent folders.


ParentSplitter

 ParentSplitter (train_name='train', valid_name='valid')

Split items from the parent folder names (train_name and valid_name).

splitter = ParentSplitter()
test_eq(splitter(hdf_files),[[1],[2]])
test_eq(splitter(list_dict),splitter(hdf_files))
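The tests above suggest two properties: items whose parent folder matches neither name are left out of both sets, and dictionaries with a `path` entry are handled like plain paths. The following sketch reproduces that behaviour under those assumptions. `split_by_parent` and its `key` parameter are hypothetical and are not taken from the `ParentSplitter` source.

```python
from pathlib import Path

def split_by_parent(items, train_name='train', valid_name='valid', key='path'):
    """Split indices by parent folder name; items in other folders are dropped."""
    def parent_name(item):
        # Assumption: dictionary items store their path under `key`.
        path = item[key] if isinstance(item, dict) else item
        return Path(path).parent.name

    train = [i for i, o in enumerate(items) if parent_name(o) == train_name]
    valid = [i for i, o in enumerate(items) if parent_name(o) == valid_name]
    return train, valid

# Placeholder paths in test/train/valid folders.
paths = ['data/test/a.hdf5', 'data/train/b.hdf5', 'data/valid/c.hdf5']
dicts = [{'path': p} for p in paths]
print(split_by_parent(paths))
print(split_by_parent(dicts))
```

Unlike a function-based split on `'valid'` alone, the file in the `test` folder does not end up in the training set.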

4.4 Percentage Splitter


PercentageSplitter

 PercentageSplitter (pct=0.8)

Split items in order in relative quantity.

splitter = PercentageSplitter(0.7)
#test_eq(splitter(hdf_files),[[0,1],[2]])
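A minimal sketch of an in-order percentage split is shown below. It assumes that the first `pct` share of the items goes to training and the remainder to validation, with the count truncated to an integer. The rounding rule and the helper name `split_by_percentage` are assumptions, not taken from the `PercentageSplitter` source.

```python
def split_by_percentage(items, pct=0.8):
    """Assign the first pct of the items (in order) to training, the rest to validation."""
    n_train = int(len(items) * pct)  # assumption: truncate towards zero
    idx = list(range(len(items)))
    return idx[:n_train], idx[n_train:]

# Placeholder file names; only the number of items matters here.
files = ['test.hdf5', 'train.hdf5', 'valid.hdf5']
train_idx, valid_idx = split_by_percentage(files, 0.7)
print(train_idx, valid_idx)
```

With three items and `pct=0.7`, truncation yields two training items and one validation item.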

4.5 Apply To Dictionary

With the DataBlock API, the items are a list of dictionaries. To apply a splitter to the path stored in each dictionary, you need a wrapper function.


ApplyToDict

 ApplyToDict (fn, key='path')

splitter = FuncSplitter(lambda o: Path(o).parent.name == 'valid')
test_fail(lambda: splitter(list_dict))
dict_splitter = ApplyToDict(splitter)
test_eq(dict_splitter(list_dict),splitter(hdf_files))
dict_splitter(list_dict)
((#2) [np.int64(0),np.int64(1)], (#1) [2])
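The wrapper idea can be sketched in a few lines of plain Python. Both `func_splitter` and `apply_to_dict` are hypothetical helpers written for this example. The sketch assumes the wrapper simply looks up `key` in each dictionary before calling the wrapped function, which may differ from the actual `ApplyToDict` implementation.

```python
from pathlib import Path

def func_splitter(is_valid):
    """Hypothetical stand-in for a function-based splitter."""
    def _split(items):
        train, valid = [], []
        for i, item in enumerate(items):
            (valid if is_valid(item) else train).append(i)
        return train, valid
    return _split

def apply_to_dict(fn, key='path'):
    """Wrap fn so that it receives item[key] for each dictionary instead of the dictionary."""
    def _wrapped(items):
        return fn([item[key] for item in items])
    return _wrapped

# Placeholder paths in test/train/valid folders.
paths = ['data/test/a.hdf5', 'data/train/b.hdf5', 'data/valid/c.hdf5']
dicts = [{'path': p} for p in paths]

splitter = func_splitter(lambda o: Path(o).parent.name == 'valid')
dict_splitter = apply_to_dict(splitter)
result = dict_splitter(dicts)
print(result)
```

Calling `splitter(dicts)` directly raises a `TypeError` in this sketch, because `Path` cannot be constructed from a dictionary; the wrapper avoids this by extracting the path first.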

4.6 Valid Column

Split using the 'valid' column of the DataFrame that has been created by a transformation.

from tsfast.data.core import CreateDict, ValidClmContains,DfHDFCreateWindows
tfm_src = CreateDict([ValidClmContains(['valid']),DfHDFCreateWindows(win_sz=100+1,stp_sz=10,clm='u')])
src_dicts = tfm_src(hdf_files)
valid_clm_splitter(src_dicts)
((#16780) [np.int64(0),np.int64(1),np.int64(2),np.int64(3),np.int64(4),np.int64(5),np.int64(6),np.int64(7),np.int64(8),np.int64(9)...],
 (#1990) [16780,16781,16782,16783,16784,16785,16786,16787,16788,16789...])
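Conceptually, this kind of split reads a flag from each item. The sketch below assumes every item is a dictionary carrying a boolean under the key `'valid'`. The helper `split_by_valid_flag`, the `window` entries, and the example items are hypothetical and are not taken from the `valid_clm_splitter` source.

```python
def split_by_valid_flag(items, key='valid'):
    """Send items whose flag is true to validation and all others to training."""
    train = [i for i, o in enumerate(items) if not o[key]]
    valid = [i for i, o in enumerate(items) if o[key]]
    return train, valid

# Made-up window items: two windows from a training file, one from a validation file.
items = [
    {'path': 'data/train/a.hdf5', 'window': 0, 'valid': False},
    {'path': 'data/train/a.hdf5', 'window': 1, 'valid': False},
    {'path': 'data/valid/b.hdf5', 'window': 0, 'valid': True},
]
train_idx, valid_idx = split_by_valid_flag(items)
print(train_idx, valid_idx)
```

Since the flag is set once per file and copied to each window, all windows cut from the same file end up in the same set.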