Generating Synthetic Sequential Data Using GANs

Sequential data — data that has time dependency — is very common in business, ranging from credit card transactions to healthcare records to stock market prices. But privacy regulations limit and dramatically slow down access to useful data that is essential to research and development. This creates a demand for highly representative, yet fully private, synthetic sequential data, which is challenging to produce, to say the least.

Generating synthetic time-series and sequential data is more challenging than generating tabular data, where normally all the information regarding one individual is stored in a single row. In sequential data, information can be spread across many rows, as with credit card transactions, and the preservation of correlations between rows (the events) and columns (the variables) is key. Furthermore, the length of the sequences is variable: some cases may comprise just a few transactions, while others may have thousands.

Generative models for sequential data and time series have been studied extensively; however, many of these efforts have resulted in relatively poor synthetic data quality and low flexibility. In many cases, the models are designed to be specific to each problem, thus requiring detailed domain knowledge.

In this post, we describe and apply an extended version of a recent, powerful method for generating synthetic sequential data: DoppelGANger. It is a framework based on Generative Adversarial Networks (GANs) with some innovations that make it possible to generate synthetic versions of complex sequential datasets.

We build on this work by introducing two innovations:

  1. A learning strategy to speed up the convergence of the GAN and avoid mode collapse.
  2. Carefully designed noise in the discriminator to make the process differentially private without degrading the quality of the data, using a modified version of the moments accountant strategy to improve the stability of the model.


Common approaches to sequential data generation

Most of the models for time-series data generation use one of the following approaches:

Dynamic stationary processes represent each point in the time series as a sum of deterministic processes with some added noise. This is a widely used approach for modeling time series with techniques like bootstrapping. However, some prior knowledge of long-term dependencies, like cyclical patterns, has to be incorporated to constrain the deterministic process. This makes it very difficult to model datasets with complex, unknown correlations.

Markov Models are a popular approach for modeling categorical time series by representing system dynamics as a conditional probability distribution. Variants, such as Hidden Markov models, have also been used for modeling the distributions of time series. The problem with this approach is its inability to capture long-term complex dependencies.

Autoregressive (AR) models are dynamic stationary processes where each point in the sequence is represented as a function of the previous n points. AR models (like ARIMA) can be very powerful, but, like Markov models, they have a fidelity problem: they produce simplistic models incapable of capturing complex temporal correlations.

Recurrent Neural Networks (RNNs) have recently been used for time-series modeling in deep learning. Like autoregressive and Markov models, RNNs use a sliding window of previous timesteps to determine the next points in time, but they also store an internal state variable that captures the entire history of the time series. RNNs, like long short-term memory networks (LSTMs), have had great success in learning discriminative models of time-series data, which predict a label conditioned on a sample. However, RNNs are unable to learn certain simple time-series distributions.

GAN-based methods, or generative adversarial network models, have emerged as a popular technique for generating or augmenting datasets, especially with images and videos. However, GANs give poor fidelity on networking data, which has both complex temporal correlations and mixed discrete-continuous data types. Although GAN-based time-series generation exists — for instance, for medical time series — such techniques fail on more complex data, exhibiting poor autocorrelation scores on long sequences and remaining prone to mode collapse. This is because the data distributions are heavy-tailed and variable in length, which seems to affect GANs considerably.

Introducing DoppelGANger for generating high-quality, synthetic time-series data

In this section, I will explore DoppelGANger, a recent model for generating synthetic sequential data. I will use this GAN-based model, whose generator is composed of recurrent units, to generate synthetic versions of transactional data using two datasets: bank transactions and road traffic. We used a modified version of the DoppelGANger model to address the limitations of generative models for sequential data.

Traditional Generative Adversarial Networks, or GANs, struggle to model sequential data due to the following issues:

  • They don’t capture complex correlations between temporal features and their associated (immutable) attributes: for instance, depending on the owner’s characteristics (age, income, etc.), credit card transaction patterns are very distinct.
  • Long-term correlations within time series, such as diurnal patterns: These correlations are qualitatively very different from those found in images, which have a fixed dimension and do not need to be generated pixel by pixel.

DoppelGANger incorporates some innovative ideas, like:

  • using two networks (a multilayer perceptron (MLP) and a recurrent network) to capture temporal dependencies
  • decoupled attribute generation to better capture the correlations between time series and their attributes — e.g., the age, location, and gender of users
  • batched generation — generation of small stacked batches for long sequences
  • decoupled normalization — the addition of normalization factors to the generator to constrain the range of features

DoppelGANger decouples the generation of attributes from time series while feeding attributes to the time series generator at each timestep. This contrasts with conventional approaches, where attributes and features are generated jointly.

DoppelGANger’s conditional generation architecture also offers the flexibility to change the attribute distribution and condition the features on the attributes. This also helps to hide the attribute distribution thus increasing privacy.

The DoppelGANger model also has the advantage of generating data features conditioned on data attributes.


Figure 1: Schematic representation of the original DoppelGANger model, with two generator blocks and two discriminators. Credit: https://arxiv.org/abs/1909.13403.

Another neat feature of this model is how it handles extreme events, a very challenging problem. It’s not uncommon for sequential data to have a wide range of feature values across samples — some products may have thousands of transactions while others have just a few. For GANs this is problematic, as it is a sure recipe for mode collapse — samples will contain only the most common items and ignore the rare events. For images — the focus of almost all efforts on GANs — this isn’t an issue, since the distributions are smooth. This is why the authors of DoppelGANger proposed an innovative way to handle these cases: auto-normalization. It consists of normalizing the data features prior to training and adding the minimum and maximum of each feature's range as two additional attributes to each sample.

In the generated data, these two attributes are used to scale the features back to a realistic range. This is done in three steps (a sketch of the normalization step follows the list):

  1. Generate attributes using the MultiLayer Perceptron (MLP) generator.
  2. With the generated attributes as inputs, generate the two “fake” (max/min) attributes using another MLP.
  3. With the generated real and fake attributes as inputs, generate the features.
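To make the idea concrete, here is a minimal NumPy sketch of per-sample auto-normalization, assuming a single continuous feature of shape [n_samples, length]; the actual implementation is the library's normalize_per_sample, which handles the general multi-feature case.

import numpy as np

def auto_normalize(features):
    # per-sample minimum and maximum, kept as extra axes for broadcasting
    f_min = features.min(axis=1, keepdims=True)
    f_max = features.max(axis=1, keepdims=True)
    # scale each sample to [0, 1] regardless of its original range
    normalized = (features - f_min) / (f_max - f_min + 1e-8)
    # the two "fake" attributes added to each sample: (max+min)/2 and (max-min)/2
    extra_attributes = np.concatenate([(f_max + f_min) / 2,
                                       (f_max - f_min) / 2], axis=1)
    return normalized, extra_attributes

features = np.random.exponential(scale=100.0, size=(5, 50))  # heavy-tailed toy data
normalized, extra = auto_normalize(features)
print(normalized.min(), normalized.max(), extra.shape)  # ~0.0, ~1.0, (5, 2)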

Training the DoppelGANger model on bank transactions data

First, we evaluated DoppelGANger on a dataset of bank transactions. The data used for training is itself synthetic, so we know the real distributions; it can be accessed here. Our aim was to show that the model is able to learn the time dependencies in the data.

How to prepare the data?


Figure 2: Schematic representation of the data processed as a set of attributes and features of varied lengths.

We assume sequential data is composed of a set of sequences with maximum length Lmax — in our case we consider Lmax = 100. Each sequence contains a set of attributes A (fixed quantities) and features F (transactions). In our case, the only attribute is the initial bank Balance, and the features are the Amount of the transaction (positive or negative) plus two categorical fields describing the transaction: Flag and Description.

To run the model we need three NumPy arrays:

  1. data_feature: training features, in NumPy float32 array format. The size is [(number of training samples) x (maximum length) x (total dimension of features)]. Categorical features are one-hot encoded. (Illustrative array shapes are sketched after this list.)
  2. data_attribute: Training attributes, in NumPy float32 array format. The size is [(number of training samples) x (total dimension of attributes)].
  3. data_gen_flag: An array of flags indicating the activation of features. The size is [(number of training samples) x (maximum length)].
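Before diving into the real pre-processing, it may help to see the expected shapes on dummy data. The numbers below are hypothetical (1,000 sequences, maximum length 100, feature dimension 70 after one-hot encoding); only the shapes matter.

import numpy as np

n_samples, L_max, feature_dim, attribute_dim = 1000, 100, 70, 1  # hypothetical sizes

data_feature = np.zeros((n_samples, L_max, feature_dim), dtype=np.float32)
data_attribute = np.zeros((n_samples, attribute_dim), dtype=np.float32)
data_gen_flag = np.zeros((n_samples, L_max), dtype=np.float32)

# a sequence with 20 real transactions has its first 20 flags activated
data_gen_flag[0, :20] = 1.0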

Additionally, we need a list of objects of class Output that contains the data type for each variable, normalization, and cardinality. In this case, it is:

data_feature_outputs = [
    output.Output(type_=OutputType.CONTINUOUS, dim=1,
                  normalization=Normalization.ZERO_ONE,
                  is_gen_flag=False),  # time interval between transactions (Dif)
    output.Output(type_=OutputType.DISCRETE, dim=20,
                  is_gen_flag=False),  # binarized Amount
    output.Output(type_=OutputType.DISCRETE, dim=5,
                  is_gen_flag=False),  # Flag
    output.Output(type_=OutputType.DISCRETE, dim=44,
                  is_gen_flag=False),  # Description
]

The first element of the list is the time interval between events (Dif), followed by the one-hot encoded transaction value (Amount), then the Flag, and finally the transaction Description. All is_gen_flag entries are set to False, since the generation flag is an internal flag later modified by the model itself.

The attribute is encoded as a continuous variable with normalization between -1 and 1 to account for negative balances:

data_attribute_outputs = [
    output.Output(type_=OutputType.CONTINUOUS, dim=1,
                  normalization=Normalization.MINUSONE_ONE,
                  is_gen_flag=False)  # initial Balance
]

The only attribute used in this simulation is the initial balance. The balance at each step is simply updated by adding the corresponding transaction amount.
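As a quick illustration of that update rule (with hypothetical numbers; column names as in Table 1), the running balance is just the initial balance plus the cumulative sum of the transaction amounts:

import pandas as pd

user = pd.DataFrame({"Amount": [100.0, -25.0, -40.0]})  # hypothetical transactions
initial_balance = 500.0                                  # the sequence's attribute
user["Balance"] = initial_balance + user["Amount"].cumsum()
print(user["Balance"].tolist())                          # [600.0, 575.0, 535.0]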

We used Hazy processors to pre-process each sequence and reshape it into the right format.

n_bins = 20
processor_dict = {
    "by_type": {
        "float": {
            "processor": "FloatToOneHot",  # alternatively "FloatToBin"
            "kwargs": {"n_bins": n_bins},
        },
        "int": {
            "processor": "IntToFloat",
            "kwargs": {"n_bins": n_bins},
        },
        "category": {
            "processor": "CatToOneHot",
        },
        "datetime": {
            "processor": "DtToFloat",
        },
    }
}

from hazy_trainer.processing import HazyProcessor
processor = HazyProcessor(processor_dict)

Now we are going to read the data and process it using the function format_data. The auxiliary variables categories_n and categories_cum store, respectively, the cardinality and the cumulative sum of the cardinalities of the categorical variables.

import numpy as np
import pandas as pd

data = pd.read_csv('data.csv', nrows=100000)  # read the data
categorical = ['Amount', 'Flag', 'Description']
continuous = ['Balance', 'Dif']
cols = categorical + continuous

processor = HazyProcessor(processor_dict)  # call the Hazy processor
processor.setup_df(data[cols])             # set up the processor

# number of categories in each categorical variable
categories_n = []
for cat in categorical:
    categories_n.append(len(processor.column_to_columns[cat]['process']))

# cumulative sum of the cardinalities, prefixed with 0 so the entries
# can be used directly as column indexes
categories_cum = [0] + list(np.cumsum(categories_n))

def format_data(data, cols, nsequences=1000, Lmax=100, cardinality=70):
    '''cols: list of columns to be processed
       nsequences: number of sequences to use for training
       Lmax: maximum sequence length
       cardinality: total feature dimension of each sequence'''
    idd = list(data.Account_id.unique())   # unique account ids
    data.Date = pd.to_datetime(data.Date)  # format dates
    # dummy column so the processors are set up with the right ranges
    data['Dif'] = np.random.randint(0, 30 * 24 * 3600, size=data.shape[0])

    data_all = np.zeros((nsequences, Lmax, cardinality))
    data_attribut = np.zeros((nsequences))
    data_gen_flag = np.zeros((nsequences, Lmax))
    real_df = pd.DataFrame()

    for i, ids in enumerate(idd[:nsequences]):
        user = data[data.Account_id == ids]
        user = user.sort_values(by='Date')
        user['Dif'] = user.Date.diff(1).iloc[1:]  # time between transactions
        user['Dif'] = user['Dif'].dt.seconds
        user = user[cols]
        real_df = pd.concat([real_df, user])
        processed_df = processor.process_df(user)
        data_attribut[i] = processed_df['Balance'].values[0]
        processed_array = np.asarray(processed_df.iloc[:, 1:])
        data_gen_flag[i, :len(user)] = 1
        data_all[i, :len(user), :] = processed_array

    return data_all, data_attribut, data_gen_flag

Data

The data consist of roughly 10 million bank transactions, of which we will use a sample of 100,000 containing 5,000 unique accounts, with an average of 20 transactions per account. We consider the following fields:

  • Date of the transaction
  • Amount of transaction
  • Balance
  • Transaction Flag (5 levels)
  • Description (44 levels)

Below is the head of the data used:


Table 1: sample of bank transactions data

As mentioned before, the temporal information will be modeled as the time difference between two consecutive transactions (in seconds).
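In pandas, that quantity can be computed per account in one line; this is a minimal sketch with made-up dates, equivalent in spirit to the Dif column built inside format_data:

import pandas as pd

df = pd.DataFrame({
    "Account_id": [1, 1, 1],
    "Date": pd.to_datetime(["2020-01-01 10:00", "2020-01-01 12:30",
                            "2020-01-03 09:00"]),
})
# seconds elapsed since the same account's previous transaction
df["Dif"] = df.groupby("Account_id")["Date"].diff().dt.total_seconds()
print(df["Dif"].tolist())  # [nan, 9000.0, 160200.0]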


Figure 3: Histograms of transactions for different Description values, separated into income and outflows.


Figure 4: Heatmaps of transactions for different time distributions.


Figure 5: Distribution of transactions Amount.


Figure 6: Distribution of initial Balance. Note that some accounts have an initial negative balance due to overdraft.


Figure 7: Number of transactions over a month — income and outflow. Note that income has very distinct peaks; the synthetic data has to capture these peaks.

Running the code

We ran the code for only 100 epochs using the following parameters:

import sys
import os
sys.path.append("..")

import matplotlib.pyplot as plt
import numpy as np
import pickle
import pandas as pd
import tensorflow as tf

from gan import output
sys.modules["output"] = output
from gan.doppelganger import DoppelGANger
from gan.load_data import load_data
from gan.network import DoppelGANgerGenerator, Discriminator, \
    RNNInitialStateType, AttrDiscriminator
from gan.output import Output, OutputType, Normalization
from gan.util import add_gen_flag, normalize_per_sample, \
    renormalize_per_sample

sample_len = 10
epoch = 100
batch_size = 20
d_rounds = 2
g_rounds = 1
d_gp_coe = 10.0
attr_d_gp_coe = 10.0
g_attr_d_coe = 1.0

Note that the generator is composed of a list of layers, with a softmax activation function for categorical inputs and a linear activation for continuous variables. Both the generator and the discriminator are optimized using the Adam algorithm with a specified learning rate and momentum.

Now we prepare the data to feed the network. The real_attribute_mask is a list of True/False values with the same length as the number of attributes: False if the attribute is (max-min)/2 or (max+min)/2, and True otherwise. We then format and normalize the data and instantiate the generator and the discriminators:

# create the necessary input arrays
data_all, data_attribut, data_gen_flag = format_data(data, cols)

# normalize the data per sample
(data_feature, data_attribute, data_attribute_outputs,
 real_attribute_mask) = normalize_per_sample(
    data_all, data_attribut, data_feature_outputs, data_attribute_outputs)

# add the generation flag to the features
data_feature, data_feature_outputs = add_gen_flag(
    data_feature, data_gen_flag, data_feature_outputs, sample_len)

generator = DoppelGANgerGenerator(
    feed_back=False,
    noise=True,
    feature_outputs=data_feature_outputs,
    attribute_outputs=data_attribute_outputs,
    real_attribute_mask=real_attribute_mask,
    sample_len=sample_len,
    feature_num_units=100,
    feature_num_layers=2)
discriminator = Discriminator()
attr_discriminator = AttrDiscriminator()

We used a neural network composed of two layers of 100 neurons each for both the generator and the discriminator. All data were normalized or one-hot encoded. We then train the model with the following parameters:

checkpoint_dir = "./results/checkpoint"
sample_path = "./results/time"
epoch = 100
batch_size = 50
g_lr = 0.0001
d_lr = 0.0001
vis_freq = 50
vis_num_sample = 5
d_rounds = 3
g_rounds = 1
d_gp_coe = 10.0
attr_d_gp_coe = 10.0
g_attr_d_coe = 1.0
extra_checkpoint_freq = 30
num_packing = 1

Some notes on training

If the dataset is large, you should use a larger number of epochs — the authors suggest 400, but in our experiments we found that we could go as high as 1,000 without the networks degenerating into mode collapse. Also, consider that the number of epochs is related to the batch size — smaller batches need more epochs and a lower learning rate.

For those new to neural networks: batch, stochastic, and mini-batch gradient descent are the three main flavors of gradient descent. Batch size controls the accuracy of the estimate of the error gradient when training neural networks, so the user should be aware of the trade-offs between batch size, speed, and stability during learning. Larger batches require larger learning rates, and the network will learn faster, but it can also be less stable, which is particularly problematic for GANs due to the mode collapse problem.

As a rule of thumb, the learning rates of the generator and the discriminator should be small (in the range of 10^{-3} to 10^{-5}) and similar to each other. In our case, we use 10^{-4}, not the default 10^{-3}.

Another important parameter is the number of rounds for the generator and the discriminator. A Wasserstein GAN (WGAN) requires two components to work properly: gradient clipping and more discriminator rounds (d_rounds) than generator rounds. Normally, the discriminator runs 3 to 5 rounds for each round of the generator. Here we use d_rounds=3 and g_rounds=1.

In order to speed up the training, we used a cyclical learning rate for the generator and a fixed one for the discriminator.
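A triangular schedule is one common way to implement a cyclical learning rate; the sketch below shows the shape of such a schedule, with base_lr, max_lr, and period being hypothetical values rather than the exact ones we used.

import numpy as np

def cyclical_lr(step, base_lr=1e-5, max_lr=1e-4, period=200):
    # position within the current cycle, in [0, 1)
    cycle_pos = (step % period) / period
    # triangle wave: rises from 0 to 1, then falls back to 0
    triangle = 1.0 - abs(2.0 * cycle_pos - 1.0)
    return base_lr + (max_lr - base_lr) * triangle

d_lr = 1e-4                                    # discriminator: fixed rate
g_lrs = [cyclical_lr(s) for s in range(1000)]  # generator: cycling rate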

The directory sample_path stores a set of samples collected at different checkpoints, which is useful for verification purposes. Visualizations of the loss functions can be done using TensorBoard on the checkpoint directory that you provide. You can control the frequency of checkpoints with the parameter extra_checkpoint_freq.

Be aware that this may take up a lot of disk space. The simulation took less than ten minutes on a MacBook Pro.

run_config = tf.ConfigProto()
tf.reset_default_graph()  # needed if re-running in the same session (e.g. Spyder)

with tf.Session(config=run_config) as sess:
    gan = DoppelGANger(
        sess=sess,
        checkpoint_dir=checkpoint_dir,
        sample_dir=sample_dir,
        time_path=sample_path,
        epoch=epoch,
        batch_size=batch_size,
        data_feature=data_feature,
        data_attribute=data_attribute,
        real_attribute_mask=real_attribute_mask,
        data_gen_flag=data_gen_flag,
        sample_len=sample_len,
        data_feature_outputs=data_feature_outputs,
        data_attribute_outputs=data_attribute_outputs,
        vis_freq=vis_freq,
        vis_num_sample=vis_num_sample,
        generator=generator,
        discriminator=discriminator,
        attr_discriminator=attr_discriminator,
        d_gp_coe=d_gp_coe,
        attr_d_gp_coe=attr_d_gp_coe,
        g_attr_d_coe=g_attr_d_coe,
        d_rounds=d_rounds,
        g_rounds=g_rounds,
        g_lr=g_lr,
        d_lr=d_lr,
        num_packing=num_packing,
        extra_checkpoint_freq=extra_checkpoint_freq)
    gan.build()
    gan.train()

Synthetic data generation

After the model is trained, you can use the generator to create synthetic data from noise. There are two ways to do it:

  1. Unconditional generation from pure noise
  2. Conditional generation on attributes

In the first case, we generate both attributes and features. In the second, we explicitly specify which attributes to condition the feature generation on, so that only the features are generated.

Below is the code to generate samples:

run_config = tf.ConfigProto()
total_generate_num_sample = 1000

with tf.Session(config=run_config) as sess:
    gan = DoppelGANger(
        sess=sess,
        checkpoint_dir=checkpoint_dir,
        sample_dir=sample_dir,
        time_path=time_path,
        epoch=epoch,
        batch_size=batch_size,
        data_feature=data_feature,
        data_attribute=data_attribute,
        real_attribute_mask=real_attribute_mask,
        data_gen_flag=data_gen_flag,
        sample_len=sample_len,
        data_feature_outputs=data_feature_outputs,
        data_attribute_outputs=data_attribute_outputs,
        vis_freq=vis_freq,
        vis_num_sample=vis_num_sample,
        generator=generator,
        discriminator=discriminator,
        attr_discriminator=attr_discriminator,
        d_gp_coe=d_gp_coe,
        attr_d_gp_coe=attr_d_gp_coe,
        g_attr_d_coe=g_attr_d_coe,
        d_rounds=d_rounds,
        g_rounds=g_rounds,
        num_packing=num_packing,
        extra_checkpoint_freq=extra_checkpoint_freq)

    # build the network
    gan.build()

    length = int(data_feature.shape[1] / sample_len)
    real_attribute_input_noise = gan.gen_attribute_input_noise(
        total_generate_num_sample)
    addi_attribute_input_noise = gan.gen_attribute_input_noise(
        total_generate_num_sample)
    feature_input_noise = gan.gen_feature_input_noise(
        total_generate_num_sample, length)
    input_data = gan.gen_feature_input_data_free(
        total_generate_num_sample)

    # load the weights / change the path accordingly
    gan.load(checkpoint_dir + '/epoch_id-100')

    # generate features, attributes, and lengths
    features, attributes, gen_flags, lengths = gan.sample_from(
        real_attribute_input_noise, addi_attribute_input_noise,
        feature_input_noise, input_data, given_attribute=None,
        return_gen_flag_feature=False)

    # denormalize accordingly
    features, attributes = renormalize_per_sample(
        features, attributes, data_feature_outputs,
        data_attribute_outputs, gen_flags, num_real_attribute=1)

We need a few extra steps to process the generated samples into a sequence format and to turn the generated categorical scores back into strict one-hot vectors.

nfloat = len(continuous)
# assumed column layout per row: [Balance, Dif, one-hot blocks]
ncols = nfloat + sum(categories_n)
synth = np.zeros((1, ncols + 1))  # dummy first row (dropped later); +1 for the id

for i in range(features.shape[0]):
    v = np.zeros((features.shape[1], ncols))
    v[:, 0] = attributes[i, 0]       # Balance, the single real attribute
    v[:, 1] = features[i, :, 0]      # Dif, the continuous feature
    for j in range(len(categories_cum) - 1):
        # slice the generated block for category j and re-binarize via argmax
        ac = features[i, :, 1 + categories_cum[j]:1 + categories_cum[j + 1]]
        a_hot = np.zeros((ac.shape[0], categories_n[j]))
        a_hot[np.arange(ac.shape[0]), ac.argmax(axis=1)] = 1
        v[:, nfloat + categories_cum[j]:nfloat + categories_cum[j + 1]] = a_hot
    # prepend the sequence id so rows can later be grouped into accounts
    v = np.concatenate([np.full((len(v), 1), i), v], axis=1)
    synth = np.vstack([synth, v])

df = pd.DataFrame(synth[1:, 1:], columns=processed_df.columns)
formated_df = processor.format_df(df)
formated_df['account_id'] = synth[1:, 0]  # add the account id

Below we present some comparisons between synthetic (generated) and real data. We can observe that, overall, the generated data distributions match the real ones relatively well — see Figures 8 and 9.


Figure 8: Histograms of sequence length (top), time intervals between transactions (middle), and Flags (bottom) for generated vs. real data.

The only exception is the distribution of the variable Amount, as shown in Figure 9. This is due to the fact that this variable has a non-smooth distribution. To solve this issue, we discretized it into 20 levels, resulting in a much better match.


Figure 9: Amount, real vs. generated, using a continuous encoding (top) and a binarized one-hot encoding (bottom).

We then used the Hazy metrics to calculate the Similarity Score. This score is the mean of three scores: histogram similarity and 2D-histogram similarity (how much the real and synthetic data histograms overlap), plus the mutual information between columns. Together these establish how well the synthetic data preserves the correlations between columns.

We got a similarity score of 0.57 when treating Amount as a continuous variable and 0.63 when we binarized it into 20 bins. The Similarity Score was obtained as follows:

from hazy_trainer.evaluation.similarity import Similarity
sim = Similarity(metrics=['hist','hist2d','mi'])
score = sim.score(real_df[cols], formated_df[cols])
print(score['similarity']['score']) 

However, we’ve noticed that this number does not really tell the whole story since it does not explicitly measure the temporal coherence of the synthetic data sequences — it treats each row independently.


Figure 10: Transactions Amount generated by the model over time (money in and money out).

For that purpose, we used an additional key metric, autocorrelation, which measures how an event at time t is related to events occurring at time t − Δ, where Δ is a time lag. To compare the real and synthetic series, we compute:

AC = \sum_{i=1}^{T} (A_i^{real} - A_i^{synthetic})^2 / \sum_{i=1}^{T} (A_i^{real})^2
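A sketch of how this comparison can be computed with NumPy: estimate the autocorrelation curve of each series up to some maximum lag, then take the relative squared error between the two curves, as in the expression above. As written, identical curves give 0; the toy weekly series below is purely illustrative.

import numpy as np

def autocorrelation(x, max_lag):
    # Pearson correlation between the series and its lagged copies
    x = x - x.mean()
    return np.array([np.corrcoef(x[:-lag], x[lag:])[0, 1]
                     for lag in range(1, max_lag + 1)])

def ac_score(real, synthetic, max_lag=30):
    # relative squared error between the two autocorrelation curves
    a_real = autocorrelation(real, max_lag)
    a_synth = autocorrelation(synthetic, max_lag)
    return np.sum((a_real - a_synth) ** 2) / np.sum(a_real ** 2)

t = np.arange(365)
real = np.sin(2 * np.pi * t / 7) + 0.1 * np.random.randn(365)      # weekly pattern
synthetic = np.sin(2 * np.pi * t / 7) + 0.2 * np.random.randn(365)
print(ac_score(real, synthetic))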

Below are the autocorrelation plots for the total amount spent (aggregated by day) on real and synthetic data. We can see that the two have very similar patterns.

This metric only works for numerical data; for categorical data, we can use mutual information instead. For our data, we got AC = 0.71.


Figure 11: Auto-correlation for real and synthetic data for the bank transaction dataset.

The traffic dataset

In order to prove the capabilities of a sequential data generator, we tested it on another, more challenging dataset: the Metro Interstate Traffic Volume Data Set, which contains hourly traffic data from 2012 to 2018. As we can see in the next figures, the data is relatively coherent over time, with some daily and weekly patterns and large hourly variability. The synthetic data produced by the generator has to reproduce all of these trends.


Figure 12: Histogram of traffic volume (vehicles per hour).


The daily patterns can be quite complex as seen in the next figure containing traffic over the first month (October 2012):


Figure 14: Hourly traffic patterns for the month of October 2012. Each dip represents a day; weekends are visible as lower-traffic patterns.

In order to generate good quality synthetic data, the network has to predict the right daily, weekly, monthly, and even yearly patterns, so long-term correlations are important.


Figure 15: Some more distributions of the data.

In terms of autocorrelation, we can see a smooth daily correlation — which makes sense, since most traffic has a symmetric behavior: high intensity in the morning is correlated with high intensity in the evening.


Figure 16: Auto-correlation for real traffic data versus generated traffic data. For longer lags, the autocorrelation of the synthetic data starts to depart from that of the real data.

Running the model

In this case, the sequence lengths are fixed. To prepare the data, we generated 50,000 sequences using sliding windows of monthly and weekly data, as sketched below. This dataset is much larger than the previous one, and we expected the model to behave smoothly without mode collapse.
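A minimal sketch of the sliding-window construction, with a placeholder series and a weekly window of 24 × 7 hourly points (the exact windows and strides we used are omitted):

import numpy as np

def sliding_windows(series, window, stride=1):
    # cut a long series into overlapping fixed-length sequences
    n = (len(series) - window) // stride + 1
    return np.stack([series[i * stride:i * stride + window] for i in range(n)])

hourly_traffic = np.random.rand(60000)  # placeholder for the real series
weekly = sliding_windows(hourly_traffic, window=24 * 7)
print(weekly.shape)                     # (59833, 168)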

In this case, we also had a larger number of attributes. Some, like Day of the week and Month, were constructed from the data:

  • Temperature
  • Rain_1h
  • Snow_1h
  • Clouds_all
  • Weather_description
  • Weather_main
  • Holiday
  • Day of the week
  • Month

As features, we have only the hourly traffic volume. Since we want to capture this variable at the highest granularity, all numeric values were discretized into 20 bins, except the traffic volume, which was discretized into 50 bins. The model ran for 200 epochs with a batch size of 20 and the same learning rates as before.

Results

Figure 17 contains a real and a generated sample. We can see that the cyclic patterns are well kept and the data looks realistic.


Figure 17: Real (top) and generated (bottom) sequences over a 500-hour period. The model was run unconditionally. We can see that the synthetic data captures the daily and weekly patterns very well.

To test the quality of the generated data, we present some metrics — see table 2:

  • Similarity — measured by the overlap of histograms and mutual information
  • Auto-correlation — the ratio between real and synthetic over 30 time lags
  • Utility — measured by the relative ratio of forecasting error when trained with real and synthetic data

As a baseline, we used an LSTM (long short-term memory) model with bootstrapping. This LSTM model is composed of two layers with 100 neurons each and uses a sliding window of 30 hours. The attributes were added through a dense layer and concatenated to the last hidden layer of the network.

As we can see from Table 2, DoppelGANger, trained with weekly data, performs relatively well, outperforming the bootstrapping technique by a good margin.


Table 2: Results for the traffic dataset.

We added a third metric, the Sequential Mutual Information (SMI). It evaluates the mutual information on a matrix containing T columns, where each column corresponds to the events occurring at the previous t, t−1, t−2, …, t−T time steps, averaged over a subset of attributes.
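The sketch below shows one way to compute such a score for a single discretized series, using scikit-learn's mutual_info_score. The lagged-column construction follows the description above, while the binning and the value of T are hypothetical, and the averaging over attributes is omitted.

import numpy as np
from sklearn.metrics import mutual_info_score

def sequential_mutual_information(series, T):
    # column 0 holds the current value; column k holds the value k steps earlier
    n = len(series) - T
    lagged = np.stack([series[T - k:T - k + n] for k in range(T + 1)], axis=1)
    # average mutual information between the current value and each lag
    mi = [mutual_info_score(lagged[:, 0], lagged[:, k]) for k in range(1, T + 1)]
    return np.mean(mi)

binned = np.random.randint(0, 50, size=5000)  # placeholder for binned traffic volume
print(sequential_mutual_information(binned, T=24))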

We should note that the model can be conditioned on the attributes, so we can generate samples for a specific weather condition, day of the week, or month.

Experiments on Differential Privacy

In the original work, the authors introduced differential privacy in the model through the well-known technique of adding noise to the discriminator and clipping its gradients — the DPGAN.

However, they found that as soon as the privacy budget, ε, becomes relatively small — meaning that the synthetic data gets safer — the data also starts losing quality, as measured by its temporal coherence with respect to the real data. This could represent a major problem if the end use of the data is to extract detailed temporal information, like causality between events.

Based on recent work around PPGAN (Privacy-Preserving Generative Adversarial Network), we introduced some modifications to the noise injected into the gradients of the discriminator. The moments accountant frames the privacy loss as a random variable and uses its moment-generating functions to control the variable's density distribution. This property makes PPGAN training more stable. The difference from DPGAN is particularly significant when generating very long sequences.

The noise is given by the following expression:

\phi = f + N(0, \sigma^2 (\Delta f)^2)

where \Delta f is the sensitivity of a query f over two neighboring inputs x and x':

\Delta f = \max_{x, x'} \| f(x) - f(x') \|_2

This expression means that the most informative points — those with the highest sensitivity — get more noise added to their gradients, so the quality contributed by the other points is not compromised. By using this carefully designed noise, we were able to preserve 88 percent of the autocorrelation up to ε = 1 on the traffic data.
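For intuition, a DP-SGD-style sanitization step looks like the sketch below: clip each gradient to bound its sensitivity, then add Gaussian noise scaled by that bound. clip_norm and sigma are hypothetical values here; in our setup, the noise calibration follows the modified moments accountant described above.

import numpy as np

def privatize_gradient(grad, clip_norm=1.0, sigma=1.0, rng=np.random):
    # clip the gradient so its L2 norm is at most clip_norm (bounds sensitivity)
    norm = np.linalg.norm(grad)
    grad = grad * min(1.0, clip_norm / (norm + 1e-12))
    # add Gaussian noise proportional to the sensitivity bound
    noise = rng.normal(0.0, sigma * clip_norm, size=grad.shape)
    return grad + noise

g = np.array([0.5, -2.0, 1.5])
print(privatize_gradient(g))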

Conclusions

Synthetic sequential data generation is a challenging problem that has not yet been fully solved. Through the testing presented above, we showed that GANs offer an effective way to address this problem.

Learn more about Hazy synthetic data generation and request a demo at Hazy.com.

This article was originally published on Medium and re-published to TOPBOTS with permission from the author.


How does it know?! Some beginner chatbot tech for newbies.

Wouter S. Sligter

Most people will know by now what a chatbot or conversational AI is. But how does one design and build an intelligent chatbot? Let’s investigate some essential concepts in bot design: intents, context, flows and pages.

I like using Google’s Dialogflow platform for my intelligent assistants. Dialogflow has a very accurate NLP engine at a cost structure that is extremely competitive. In Dialogflow there are roughly two ways to build the bot tech. One is through intents and context, the other is by means of flows and pages. Both of these design approaches have their own version of Dialogflow: “ES” and “CX”.

Dialogflow ES is the older version of the Dialogflow platform which works with intents, context and entities. Slot filling and fulfillment also help manage the conversation flow. Here are Google’s docs on these concepts: https://cloud.google.com/dialogflow/es/docs/concepts

Context is what distinguishes ES from CX. It's a way to understand where the conversation is headed. Here's a diagram that may help you understand how context works. Each phrase that you type triggers an intent in Dialogflow, and each response by the bot happens after your message has triggered the most likely intent. It's Dialogflow's NLP engine that decides which intent best matches your message.

Wouter Sligter, 2020

What’s funny is that even though you typed ‘yes’ in exactly the same way twice, the bot gave you different answers. There are two intents that have been programmed to respond to ‘yes’, but only one of them is selected. This is how we control the flow of a conversation by using context in Dialogflow ES.

Unfortunately, the way we program context into a bot on Dialogflow ES is not supported by any visual tools like the diagram above. Instead, we need to type this context into each intent without seeing the connections to other intents. This makes the creation of complex bots quite tedious, and that's why we map out the design of our bots in other tools before we start building in ES.

The newer Dialogflow CX allows for a more advanced way of managing the conversation. By adding flows and pages as additional control tools we can now visualize and control conversations easily within the CX platform.

source: https://cloud.google.com/dialogflow/cx/docs/basics

This entire diagram is a ‘flow’ and the blue blocks are ‘pages’. This visualization shows how we create bots in Dialogflow CX. It’s immediately clear how the different pages are related and how the user will move between parts of the conversation. Visuals like this are completely absent in Dialogflow ES.

It then makes sense to use different flows for different conversation paths. A possible distinction in flows might be “ordering” (as seen here), “FAQs” and “promotions”. Structuring bots through flows and pages is a great way to handle complex bots and the visual UI in CX makes it even better.

At the time of writing (October 2020), Dialogflow CX only supports English NLP, and its pricing model is surprisingly steep compared to ES. But bots are becoming critical tech for an increasing number of companies, and the resulting cost reductions and quality of conversations are enormous. Building and managing bots is in many cases an ongoing task rather than a single, rounded-off project. For these reasons, it makes total sense to invest in a tool that can handle increasing complexity with an easy-to-use UI, such as Dialogflow CX.

This article aims to give insight into the tech behind bot creation and Dialogflow is used merely as an example. To understand how I can help you build or manage your conversational assistant on the platform of your choice, please contact me on LinkedIn.

Source: https://chatbotslife.com/how-does-it-know-some-beginner-chatbot-tech-for-newbies-fa75ff59651f?source=rss—-a49517e4c30b—4

Who is chatbot Eliza?

Between 1964 and 1966 Eliza was born, one of the very first conversational agents. Discover the whole story.

Frédéric Pierron

Between 1964 and 1966, Eliza, one of the very first conversational agents, was born. Its creator, Joseph Weizenbaum, was a researcher at the famous Artificial Intelligence Laboratory of MIT (the Massachusetts Institute of Technology). His goal was to enable a conversation between a computer and a human user. More precisely, the program simulated a conversation with a Rogerian psychotherapist, whose method consists of reformulating the patient's words so that patients explore their own thoughts themselves.

Joseph Weizenbaum (Professor emeritus of computer science at MIT). Location: Balcony of his apartment in Berlin, Germany. By Ulrich Hansen, Germany (Journalist) / Wikipedia.

The program was rather rudimentary for its time. It consisted of recognizing keywords or expressions and displaying, in return, questions constructed from these keywords. When the program did not have an answer available, it displayed an “I understand” that was quite effective, albeit laconic.

Weizenbaum explained that his primary intention was to show the superficiality of communication between a human and a machine. He was very surprised when he realized that many users were getting caught up in the game, completely forgetting that the program was without real intelligence and devoid of any feelings and emotions. He even said that his secretary would discreetly consult Eliza to work through her personal problems, forcing the researcher to unplug the program.

Conversing with a computer while thinking it is a human being is one of the criteria of Turing's famous test: artificial intelligence is said to exist when a human cannot discern whether or not the interlocutor is human. Eliza, in this sense, passed the test brilliantly according to its users.

Eliza thus opened the way (or the voice!) to what have been called chatbots, an abbreviation of chatterbot, itself an abbreviation of chatter robot, literally “talking robot”.

Source: https://chatbotslife.com/who-is-chatbot-eliza-bfeef79df804?source=rss—-a49517e4c30b—4

FermiNet: Quantum Physics and Chemistry from First Principles

We’ve developed a new neural network architecture, the Fermionic Neural Network or FermiNet, which is well-suited to modeling the quantum state of large collections of electrons, the fundamental building blocks of chemical bonds.

Unfortunately, 0.5% error still isn’t enough to be useful to the working chemist. The energy in molecular bonds is just a tiny fraction of the total energy of a system, and correctly predicting whether a molecule is stable can often depend on just 0.001% of the total energy of a system, or about 0.2% of the remaining “correlation” energy. For instance, while the total energy of the electrons in a butadiene molecule is almost 100,000 kilocalories per mole, the difference in energy between different possible shapes of the molecule is just 1 kilocalorie per mole. That means that if you want to correctly predict butadiene’s natural shape, then the same level of precision is needed as measuring the width of a football field down to the millimeter.

With the advent of digital computing after World War II, scientists developed a whole menagerie of computational methods that went beyond this mean field description of electrons. While these methods come in a bewildering alphabet soup of abbreviations, they all generally fall somewhere on an axis that trades off accuracy with efficiency. At one extreme, there are methods that are essentially exact, but scale worse than exponentially with the number of electrons, making them impractical for all but the smallest molecules. At the other extreme are methods that scale linearly, but are not very accurate. These computational methods have had an enormous impact on the practice of chemistry – the 1998 Nobel Prize in chemistry was awarded to the originators of many of these algorithms.

Fermionic Neural Networks

Despite the breadth of existing computational quantum mechanical tools, we felt a new method was needed to address the problem of efficient representation. There’s a reason that the largest quantum chemical calculations only run into the tens of thousands of electrons for even the most approximate methods, while classical chemical calculation techniques like molecular dynamics can handle millions of atoms. The state of a classical system can be described easily – we just have to track the position and momentum of each particle. Representing the state of a quantum system is far more challenging. A probability has to be assigned to every possible configuration of electron positions. This is encoded in the wavefunction, which assigns a positive or negative number to every configuration of electrons, and the wavefunction squared gives the probability of finding the system in that configuration. The space of all possible configurations is enormous – if you tried to represent it as a grid with 100 points along each dimension, then the number of possible electron configurations for the silicon atom would be larger than the number of atoms in the universe!

This is exactly where we thought deep neural networks could help. In the last several years, there have been huge advances in representing complex, high-dimensional probability distributions with neural networks. We now know how to train these networks efficiently and scalably. We surmised that, given these networks have already proven their mettle at fitting high-dimensional functions in artificial intelligence problems, maybe they could be used to represent quantum wavefunctions as well. We were not the first people to think of this – researchers such as Giuseppe Carleo and Matthias Troyer and others have shown how modern deep learning could be used for solving idealised quantum problems. We wanted to use deep neural networks to tackle more realistic problems in chemistry and condensed matter physics, and that meant including electrons in our calculations.

There is just one wrinkle when dealing with electrons. Electrons must obey the Pauli exclusion principle, which means that they can’t be in the same space at the same time. This is because electrons are a type of particle known as fermions, which include the building blocks of most matter – protons, neutrons, quarks, neutrinos, etc. Their wavefunction must be antisymmetric – if you swap the position of two electrons, the wavefunction gets multiplied by -1. That means that if two electrons are on top of each other, the wavefunction (and the probability of that configuration) will be zero.

This meant we had to develop a new type of neural network that was antisymmetric with respect to its inputs, which we have dubbed the Fermionic Neural Network, or FermiNet. In most quantum chemistry methods, antisymmetry is introduced using a function called the determinant. The determinant of a matrix has the property that if you swap two rows, the output gets multiplied by -1, just like a wavefunction for fermions. So you can take a bunch of single-electron functions, evaluate them for every electron in your system, and pack all of the results into one matrix. The determinant of that matrix is then a properly antisymmetric wavefunction. The major limitation of this approach is that the resulting function – known as a Slater determinant – is not very general. Wavefunctions of real systems are usually far more complicated. The typical way to improve on this is to take a large linear combination of Slater determinants – sometimes millions or more – and add some simple corrections based on pairs of electrons. Even then, this may not be enough to accurately compute energies.
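To see the key property at work, here is a quick, illustrative NumPy check (not FermiNet code): swapping two rows of a matrix flips the sign of its determinant, which is exactly the antisymmetry a fermionic wavefunction must have under exchange of two electrons.

import numpy as np

rng = np.random.default_rng(0)
phi = rng.normal(size=(4, 4))   # 4 single-electron functions evaluated at 4 electrons
swapped = phi[[1, 0, 2, 3], :]  # exchange "electrons" 0 and 1

print(np.linalg.det(phi))       # some value d
print(np.linalg.det(swapped))   # -d, as antisymmetry requires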

Source: https://deepmind.com/blog/article/FermiNet
