Danger You are looking at the documentation for an older version of the SDV! We are no longer supporting or maintaining this version of the software Click here to go to the new docs pages.
Danger
You are looking at the documentation for an older version of the SDV! We are no longer supporting or maintaining this version of the software
Click here to go to the new docs pages.
The Synthetic Data Vault is built as a collection of libraries which provide different functionalities, from simple data transformation to complex state-of-the-art Generative Deep Learning models.
Documentation
Source Code
SDV is the main library, which provides a user friendly interface to access all the different functionalities of the project.
The SDV library allows you to:
Model single tables using Copulas and Deep Learning based models.
Model complex multi-table relational datasets using Copulas and unique recursive modeling techniques.
Handle multiple data types and missing data with minimum user input.
Use and define custom inter-column constraints and data validation rules.
Define entire datasets with a custom and flexible Metadata JSON schema.
Tabular Models
Constraints
Metadata JSON
RDT is a Python library used to transform data for data science libraries and preserve the transformations in order to revert them as needed.
It is organized around the objects called Transformers, which implement a very simple and familiar API with 4 methods:
fit: Learn the properties of the data.
fit
transform: Transform the data.
transform
fit_transform: Fit the transformer to the data and then transform it.
fit_transform
reverse_transform: Revert a previous transformation to go back to the original format.
reverse_transform
Getting Started
Developer Guides
Copulas is a Python library for modeling multivariate distributions and sampling from them using copula functions. Given a table containing numerical data, we can use Copulas to learn the distribution and later on generate new synthetic rows following the same statistical properties.
Some of the features provided by this library include:
A variety of distributions for modeling univariate data.
Multiple Archimedean copulas for modeling bivariate data.
Gaussian and Vine copulas for modeling multivariate data.
Automatic selection of univariate distributions and bivariate copulas.
CTGAN is a GAN-based data synthesizer that can generate synthetic tabular data with high fidelity.
It was presented in NeurIPS 2020 in the paper Modeling Tabular data using Conditional GAN.