

ECCV 2020: Some Highlights




ECCV 2020

The 2020 European Conference on Computer Vision took place online, from 23 to 28 August, and consisted of 1360 papers, divided into 104 orals, 160 spotlights and the remaining 1096 papers as posters, in addition to 45 workshops and 16 tutorials. As has been the case in recent years with ML and CV conferences, the huge number of papers can be overwhelming at times. Similar to my CVPR2020 post, to get a grasp of the general trends of the conference this year, I will present in this blog post a snapshot of the conference by summarizing some papers (& listing some) that grabbed my attention.


Disclaimer: This post is not a representation of the papers and subjects presented at ECCV 2020; it is just a personal overview of what I found interesting. Any feedback is welcome!

General Statistics

The statistics presented in this section are taken from the official Opening & Awards presentation. Let’s start with some general statistics:


The trends of earlier years continued, with more than a 200% increase in submitted papers compared to the 2018 conference and a number of papers similar to CVPR 2020. As expected, this increase is accompanied by a corresponding increase in the number of reviewers and area chairs to accommodate the expansion.

As expected, the majority of the accepted papers focus on topics related to deep learning, recognition, detection, and understanding. Similar to CVPR 2020, we see an increasing interest in growing areas such as label-efficient methods (e.g., unsupervised learning) and low-level vision.

In terms of institutions, and similar to ICML this year, Google takes the lead with 180 authors, followed by The Chinese University of Hong Kong with 140 authors and Peking University with 110 authors.

In the next sections, we’ll present some paper summaries by subject.


Recognition, Detection, Segmentation and Pose Estimation

End-to-End Object Detection with Transformers (paper)

The task of object detection consists of localizing and classifying the objects visible in a given input image. The popular framework for object detection consists of pre-defining a set of boxes (i.e., a set of geometric priors such as anchors or region proposals), which are first classified, followed by a regression step to adjust the dimensions of the predefined boxes, and then a post-processing step to remove duplicate predictions. However, this approach requires selecting a subset of candidate boxes to classify, and is typically not end-to-end differentiable. In this paper, the authors propose DETR (DEtection TRansformer), an end-to-end, fully differentiable approach with no geometric priors. Below is a comparison of the DETR and Faster R-CNN pipelines (image taken from the authors’ presentation), highlighting the holistic nature of the approach.

DETR is based on the encoder-decoder transformer architecture. The model consists of three components: a CNN feature extractor, an encoder, and a decoder. A given image is first passed through the feature extractor to obtain image features. Then, positional encodings generated using sinusoids at different frequencies are added to the features to retain the 2D structure of the image. The resulting features are passed through the transformer encoder to aggregate information across features and separate the object instances. For decoding, a fixed set of learned embeddings called object queries is passed to the decoder together with the encoded features, producing the output feature vectors. The object queries are randomly initialized, learned during training, and fixed during evaluation, and their number defines an upper bound on the number of objects the model can detect. Finally, the output feature vectors are fed through a (shared) fully connected layer to predict the class and bounding box for each query. To compute the loss and train the model, the outputs are matched one-to-one with the ground truths using the Hungarian algorithm.
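To make the matching step concrete, here is a minimal sketch of one-to-one matching between predictions and ground truths given a cost matrix. For clarity it uses brute-force enumeration rather than the Hungarian algorithm DETR actually uses (in practice one would call `scipy.optimize.linear_sum_assignment`); the cost values are hypothetical.

```python
from itertools import permutations

import numpy as np

def match_predictions(cost):
    """Exhaustive one-to-one matching between N predictions (rows) and
    N ground-truth objects (columns). DETR uses the Hungarian algorithm,
    which finds the same minimum-cost assignment in polynomial time;
    brute force is used here only to make the objective explicit.
    Assumes the cost matrix is square (DETR pads targets with 'no object')."""
    n = cost.shape[0]
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        c = sum(cost[i, perm[i]] for i in range(n))
        if c < best_cost:
            best_cost, best_perm = c, perm
    return list(best_perm), best_cost

# Toy matching costs combining class and box terms (made-up numbers):
cost = np.array([[0.9, 0.1, 0.8],
                 [0.2, 0.7, 0.6],
                 [0.5, 0.6, 0.1]])
assignment, total = match_predictions(cost)
```

The loss is then computed only between each prediction and its matched ground truth, which is what removes the need for duplicate suppression.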

MutualNet: Adaptive ConvNet via Mutual Learning from Network Width and Resolution (paper)

A traditional neural network can only be used if a specific amount of compute is available; if its resource requirements are not met, the model becomes unusable. This can greatly limit the use of such models in real applications. For example, for on-device inference on a phone, the computational constraints change constantly depending on the load and the battery charge. A simple solution is to keep several models of different sizes on the device and use the one matching the current constraints, but this requires a large amount of memory and does not scale to many different constraints. Recent methods such as S-Net and US-Net sample sub-networks during training so that the model can be run at different widths during deployment, but their performance drops dramatically under very low constraints.

This paper proposes to leverage both the network scale and the input scale to find a good trade-off between accuracy and computational efficiency. As illustrated above, in a given training iteration, four sub-networks are sampled: the full network and three sub-networks of varying widths. The full network is trained on the original image with the ground-truth labels using the standard cross-entropy loss, while the sub-networks are trained on randomly down-scaled versions of the input image using a KL-divergence loss between their outputs and the output of the full network (i.e., a distillation loss). This way, each sub-network is able to learn multi-scale representations from both the input scale and the network scale. During deployment, given a specific resource constraint, the optimal combination of network scale and input scale can be chosen for inference.
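The distillation part of the training step can be sketched as follows; this is a minimal NumPy illustration with made-up logits, not the authors' implementation.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def kl_div(p, q):
    """KL(p || q): the distillation loss between the full network's
    output distribution p and a sub-network's output distribution q."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

# Hypothetical logits for one image: the full network on the original
# image, and one reduced-width sub-network fed a down-scaled copy.
p_full = softmax(np.array([2.0, 0.5, -1.0]))
q_sub = softmax(np.array([1.5, 0.8, -0.5]))

distill_loss = kl_div(p_full, q_sub)  # pushes the sub-network toward the full one
```

Minimizing this KL term for each sampled sub-network is what transfers the multi-scale knowledge of the full network into the narrower ones.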

Gradient Centralization: A New Optimization Technique for Deep Neural Networks (paper)

Using first- and second-order statistics such as the mean and variance to perform some form of standardization of the activations or the network’s weights, as in Batch Norm or Weight Norm, has become an important component of neural network training. Instead of operating on the weights or the activations with additional normalization modules, Gradient Centralization (GC) operates directly on the gradients by centralizing the gradient vectors to have zero mean, which can smooth and accelerate the training process of neural networks and even improve generalization performance.

The GC operator, given the computed gradients, first computes the mean of each gradient vector as illustrated above, then removes it. Formally, for a weight vector \(w_i\) whose gradient is \(\nabla_{w_i}\mathcal{L}\) \((i = 1, 2, \ldots, N)\), the GC operator \(\Phi_{GC}\) is defined as:

\[\Phi_{GC}(\nabla_{w_i}\mathcal{L}) = \nabla_{w_i}\mathcal{L} - \mu_{\nabla_{w_i}\mathcal{L}}, \qquad \mu_{\nabla_{w_i}\mathcal{L}} = \frac{1}{M}\sum_{j=1}^{M}\nabla_{w_{i,j}}\mathcal{L}\]
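Since the operator is just a mean subtraction, it fits in a few lines of NumPy; this sketch treats the gradient as a matrix with one weight vector per row.

```python
import numpy as np

def gradient_centralization(grad):
    """Center each weight vector's gradient to zero mean.
    grad has shape (num_weight_vectors, M); the mean is taken over the
    M entries of each weight vector, as in the GC formulation."""
    mean = grad.mean(axis=1, keepdims=True)
    return grad - mean

g = np.array([[1.0, 2.0, 3.0],
              [4.0, 4.0, 4.0]])
gc = gradient_centralization(g)  # each row now sums to zero
```

In practice this is applied inside the optimizer, right before the gradient is used for the update, so no architectural change is needed.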

Smooth-AP: Smoothing the Path Towards Large-Scale Image Retrieval (paper)

In image retrieval, the objective is to retrieve images of the same class as the query image from a large collection of images. This task differs from classification, where the classes encountered at test time were already seen during training; in image retrieval, we might get an image of a novel class and still need to fetch similar images, i.e., an open-set problem. The general pipeline of image retrieval consists of extracting embeddings for the query image and for all images in the collection using a CNN feature extractor, computing the cosine similarity score between each pair, and ranking the images in the collection by this similarity; the feature extractor is trained to produce a good ranking. Ranking performance is measured using Average Precision (AP), which sums, for each positive, its rank among the positives divided by its rank over the whole image collection. However, computing the rank of a given image involves a thresholding operation with a Heaviside step function, which is non-differentiable, so the model cannot be trained end-to-end to directly optimize the ranking.

To solve this, the authors propose to replace the Heaviside step function with a smooth, temperature-controlled sigmoid, making the ranking differentiable and usable as a loss function for end-to-end training. Compared to the triplet loss, the Smooth-AP loss directly optimizes a ranking objective, while the triplet loss is a surrogate that only indirectly optimizes for a good ranking.
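The following is a small sketch of the idea (not the authors' exact implementation): the rank of each positive is computed by summing sigmoid-smoothed score differences instead of step functions, which makes the whole AP estimate differentiable.

```python
import numpy as np

def sigmoid(x, tau):
    # temperature-controlled sigmoid, clipped for numerical stability
    z = np.clip(x / tau, -50.0, 50.0)
    return 1.0 / (1.0 + np.exp(-z))

def smooth_ap(scores, labels, tau=0.01):
    """Smoothed Average Precision: each indicator 'is j ranked above i?'
    (a Heaviside step on the score difference) is replaced by a sigmoid.
    scores: similarity to the query; labels: 1 for positives, 0 otherwise."""
    pos = [i for i, l in enumerate(labels) if l == 1]
    ap = 0.0
    for i in pos:
        rank_pos = 1.0 + sum(sigmoid(scores[j] - scores[i], tau)
                             for j in pos if j != i)
        rank_all = 1.0 + sum(sigmoid(scores[j] - scores[i], tau)
                             for j in range(len(scores)) if j != i)
        ap += rank_pos / rank_all
    return ap / len(pos)
```

With a small temperature the sigmoid approaches the step function, so the smoothed AP approaches the true AP while remaining differentiable.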

Hybrid Models for Open Set Recognition (paper)

Existing image classification methods are often based on a closed-set assumption, i.e., the training set covers all possible classes that may appear in the testing phase. But this assumption is clearly unrealistic: even large-scale datasets such as ImageNet, with 1K classes, cannot cover all possible real-world classes. This is where open-set classification comes in, assuming instead that the test set contains both known and unknown classes.

In this paper, the authors use a flow-based model to tackle the problem of open-set classification. Flow-based models are able to fit a probability distribution to the training samples in an unsupervised manner via maximum-likelihood estimation, and can then be used to predict the probability density of each example. When the probability density of an input sample is large, it is likely to be part of the training distribution with a known class, while outliers will have a small density value. Whereas previous methods stacked a classifier on top of the flow model, the authors propose to learn a joint embedding for both the flow model and the classifier, since an embedding space learned from the flow-based model alone may not have sufficiently discriminative features for effective classification. As illustrated above, during training, images are mapped into a latent feature space by the encoder; the encoded features are then fed into both the classifier, trained with a cross-entropy loss, and the flow model for density estimation, and the whole architecture is trained end-to-end. At test time, the log-likelihood log p(x) of each image is computed and compared with a threshold, the lowest log p(x) taken over the training set. If it is greater than the threshold, the image is sent to the classifier to identify its specific known class; otherwise it is rejected as an unknown sample.
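The test-time decision rule boils down to a density threshold followed by a classifier head; a minimal sketch (with a hypothetical threshold value):

```python
import numpy as np

def open_set_predict(log_px, class_logits, threshold):
    """Open-set decision rule: inputs whose density under the flow model
    falls below the training-set threshold are rejected as 'unknown';
    otherwise the classifier head assigns a known class."""
    if log_px < threshold:
        return "unknown"
    return int(np.argmax(class_logits))

# threshold = lowest log p(x) observed over the training set (made-up here)
threshold = -120.0
known = open_set_predict(-80.0, np.array([0.1, 2.3, -1.0]), threshold)
novel = open_set_predict(-300.0, np.array([0.1, 2.3, -1.0]), threshold)
```

Because the embedding is shared, the density used for rejection and the features used for classification are learned jointly rather than in two disconnected stages.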

Conditional Convolutions for Instance Segmentation (paper)

Instance segmentation remains one of the most challenging tasks in computer vision, requiring a per-pixel mask and a class label for each visible object in a given image. The dominant approach is Mask R-CNN, which consists of two steps: first, the object detector Faster R-CNN predicts a bounding box for each instance; then, for each detected instance, the region of interest is cropped from the output feature maps using ROI Align, resized to a fixed resolution, and fed into a mask head, a small fully convolutional network that predicts the segmentation mask. However, the authors point out the following limitations of such an architecture: (1) ROI Align might fetch irrelevant features belonging to the background or to other instances, (2) the resizing operation restricts the resolution of the instance segmentation, and (3) the mask head requires a stack of 3×3 convolutions to induce a large enough receptive field to predict the mask, which considerably increases its computational cost.

In this paper, the authors propose to adapt the FCNs used for semantic segmentation to instance segmentation. For effective instance segmentation, FCNs require two types of information: appearance information to categorize objects, and location information to distinguish multiple objects belonging to the same category. The proposed network, called CondInst (conditional convolutions for instance segmentation), builds on CondConv and HyperNetworks: for each instance, a sub-network generates the mask FCN head’s weights conditioned on the center area of that instance, and these weights are then used to predict its mask. Specifically, as shown above, the network consists of multiple heads applied at multiple scales of the feature map; each head predicts the class of a given instance at pre-defined positions, together with the weights to be used by the mask FCN head, and the mask prediction is then carried out using the parameters produced by each head.

Multitask Learning Strengthens Adversarial Robustness (paper)

One of the main limitations of deep neural networks is their vulnerability to adversarial attacks, where very small and imperceptible perturbations injected into the input result in wrong outputs, even though the appearance of the input remains the same. In recent years, the adversarial robustness of deep nets has been rigorously investigated at different stages of the pipeline, from the input data (e.g., using unlabeled data and adversarial training) to the model itself using regularization (e.g., Parseval Networks), but the outputs of the model have not yet been utilized to improve robustness. In this paper, the authors investigate the effect of having multiple outputs, as in multi-task learning, on the robustness of the learned model; such a setting is relevant since a growing number of machine learning applications call for models capable of solving multiple tasks at once.

The attack considered is a p-norm-ball-bounded attack, where the adversarial perturbation is sought within a p-norm ball of a given radius around a given input example, and vulnerability is computed as the total change in the loss. The authors show improved robustness when training on a pair of tasks (e.g., two tasks chosen from segmentation, depth, normals, reshading, input reconstruction, 2D and 3D keypoints, etc.). The improved robustness is observed under both single-task attacks (i.e., the perturbation is computed using one output) and multi-task attacks (i.e., the maximal perturbation over all the perturbations computed from all outputs). The authors also show theoretically that such multi-task robustness is only obtained if the tasks are correlated.

Dynamic Group Convolution for Accelerating Convolutional Neural Networks (paper)

Group convolutions were first introduced in AlexNet to accelerate training, and were subsequently adopted in efficient CNNs such as MobileNet and ShuffleNet. They consist of equally splitting the input and output channels of a convolutional layer into mutually exclusive groups while performing a normal convolution within each individual group, so for G groups, the computation is reduced by a factor of G. However, the authors argue that group convolutions introduce two key limitations: (1) they weaken the representational capability of the normal convolution by introducing sparse neuron connections, and (2) their channel division is fixed regardless of the properties of each input.

In order to adaptively select the most relevant input channels for each group while keeping the full structure of the original network, the authors propose Dynamic Group Convolution (DGC). DGC consists of multiple heads; in each head, a saliency generator assigns an importance score to each input channel, and the channels with low scores are pruned. A normal convolution is then conducted on the selected subset of input channels, generating the output channels of that head. Finally, the output channels from the different heads are concatenated and shuffled.
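The per-head selection can be sketched as below. This is a toy illustration: the saliency generator is reduced to an elementwise scoring (the paper uses a small learned sub-network), and the convolution to a dense 1×1 projection on the surviving channels.

```python
import numpy as np

def dgc_head(x, saliency_w, conv_w, keep):
    """One Dynamic Group Convolution head (sketch).
    x: (C,) per-channel responses for one input;
    saliency_w: (C,) toy saliency generator weights (hypothetical);
    conv_w: (C_out, C) dense weights, applied only to kept channels;
    keep: number of input channels to retain for THIS input."""
    scores = x * saliency_w                 # per-channel importance scores
    kept = np.sort(np.argsort(scores)[-keep:])  # prune low-scoring channels
    out = conv_w[:, kept] @ x[kept]         # normal conv on the selected subset
    return out, kept
```

Because the scores depend on the input, different images can route through different channel subsets, unlike a fixed group assignment.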

Disentangled Non-local Neural Networks (paper)

The non-local block models long-range dependency between pixels using the attention mechanism, and has been widely used for numerous visual recognition tasks, such as object detection, semantic segmentation, and video action recognition.

In this paper, the authors try to better understand the non-local block, find its limitations, and propose an improved version. First, they reformulate the similarity between a pixel i (referred to as the key pixel) and a pixel j (referred to as the query pixel) as the sum of two terms: a pairwise term, a whitened dot product representing the pure pairwise relation between the query and key pixels, and a unary term, where a given key pixel has the same impact on all query pixels. Then, to understand the impact of each term, they train with either one alone, and find that the pairwise term is responsible for category information, while the unary term is responsible for boundary information. However, by analyzing the gradients of the non-local block, they show that when the two terms are combined inside the normal attention operator, their gradients are multiplied, so if the gradient of one term is zero, the non-zero gradient of the other makes no contribution. To solve this, the authors propose a disentangled version of the non-local block, where each term is optimized separately.
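A minimal sketch of the disentangled attention for a single query: the whitened pairwise term and the unary term each get their own softmax and are added, instead of being coupled inside one softmax (for simplicity, both query and keys are centered with the key mean here; the paper whitens each with its own mean).

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def disentangled_attention(q, keys, unary_w):
    """Disentangled non-local (sketch).
    q: (d,) query feature; keys: (n, d) key features;
    unary_w: (d,) hypothetical weights producing the per-key unary score."""
    mu = keys.mean(axis=0)
    pairwise = softmax((keys - mu) @ (q - mu))  # pure pairwise relation
    unary = softmax(keys @ unary_w)             # same impact for all queries
    return pairwise + unary                     # terms optimized separately
```

Since each term goes through its own softmax, a zero gradient in one no longer suppresses the gradient of the other.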

Hard negative examples are hard, but useful (paper)

Deep metric learning optimizes an embedding function that maps semantically similar images to relatively nearby locations and semantically dissimilar images to distant locations. A popular way to learn this mapping is to define a loss over triplets of images: an anchor image, a positive image from the same class, and a negative image from a different class. The model is then penalized when the anchor is mapped closer to the negative image than to the positive image. However, during optimization, most triplet candidates already have the anchor much closer to the positive than to the negative, making them redundant. On the other hand, optimizing with the hardest negative examples leads to bad local minima early in training, because in this case the anchor-negative similarity is larger than the anchor-positive similarity, as measured by the cosine similarity, i.e., the dot product between normalized feature vectors.

The authors show that such problems with hard negatives come from the standard implementation of the triplet loss. Specifically, (1) if the normalization is not taken into account during the gradient computation, a large part of the gradient is lost, and (2) if two images of different classes are close in the embedding space, the gradient of the loss might pull them closer instead of pushing them apart. To solve this, instead of pulling the anchor-positive pair tightly together as in the standard triplet loss, the authors propose to avoid updating the anchor-positive pairs, resulting in less tight clusters per class. This way, the network focuses only on directly pushing the hard negative examples away from the anchor.
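The idea can be sketched as a loss that switches behavior on hard negatives; this is a simplified rendering of the paper's approach, written over cosine similarities with a hypothetical margin value.

```python
def selective_triplet_loss(s_ap, s_an, margin=0.1):
    """Triplet-style loss (sketch): s_ap and s_an are the anchor-positive
    and anchor-negative cosine similarities. For hard negatives
    (s_an > s_ap), only the anchor-negative similarity is penalized, so
    the anchor-positive pair is NOT pulled tighter; otherwise the standard
    margin-based triplet loss on similarities applies."""
    if s_an > s_ap:
        return s_an  # hard negative: only push the negative away
    return max(0.0, margin + s_an - s_ap)
```

On an easy triplet the loss reduces to the usual hinge; on a hard one, its gradient touches only the negative branch, which is exactly the "don't over-tighten the positives" behavior described above.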

Volumetric Transformer Networks (paper)

One of the keys behind the success of CNNs is their ability to learn discriminative feature representations of semantic object parts, which are very useful for computer vision tasks. However, CNNs still lack the ability to handle various spatial variations, such as scale, viewpoint and intra-class variations. Recent methods, such as spatial transformer networks (STNs), try to suppress such variations by first warping the feature maps of spatially different images to a standard, canonical configuration, then training classifiers on these standardized features. But such methods apply the same warping to all the feature channels, ignoring the fact that individual feature channels can represent different semantic parts, which may require different spatial transformations with respect to the canonical configuration.

To solve this, the paper introduces the Volumetric Transformer Network (VTN), shown above, a learnable module that predicts per-channel and per-spatial-location warping transforms, which are used to reconfigure the intermediate CNN features into spatially agnostic, canonical representations. VTN is an encoder-decoder network with modules dedicated to letting information flow across the feature channels to account for the dependencies between the semantic parts.

Faster AutoAugment: Learning Augmentation Strategies Using Backpropagation (paper)

Data augmentation (DA) has become an important and indispensable component of deep learning methods, and recent works (e.g., AutoAugment, Fast AutoAugment and RandAugment) showed that augmentation strategies found by search algorithms outperform standard augmentations. Given a pre-defined set of possible transformations, such as geometric transformations like rotation or color-enhancing transformations like solarization, the objective is to find the optimal data augmentation parameters, i.e., the magnitude of each augmentation, the probability of applying it, and the number of transformations to combine. The optimal strategy is learned with a double optimization loop, so that the validation error of a given CNN trained with a given strategy is minimized. However, such an optimization suffers from a large search space of possible policies, requiring sophisticated search strategies, and a single iteration of policy optimization requires fully training the CNN. To solve this, the authors propose to find the optimal strategy using density matching between original and augmented images with gradient-based optimization.

By viewing DA as a way to fill in missing points of the original data, the objective becomes minimizing the distance between the distributions of augmented and original data using adversarial learning. For the optimal augmentation strategy to be learnable, the policy needs to be differentiable with respect to the parameters of the transformations. For the probability of applying a given augmentation, the authors use a stochastic binary variable sampled from a Bernoulli distribution and optimized using the Gumbel trick, while the magnitude is approximated with a straight-through estimator and the combination of transformations is learned as a combination of one-hot vectors.
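The relaxed "apply or not" decision can be sketched as follows. This is an illustration of the general Gumbel/concrete relaxation of a Bernoulli variable, not the paper's exact code; parameter values are hypothetical.

```python
import numpy as np

def relaxed_bernoulli(p_apply, temperature=0.05, rng=None):
    """Relaxed Bernoulli sample (sketch): the 'apply augmentation or not'
    decision becomes a smooth function of p_apply, so gradients can flow
    into the apply-probability. The hard 0/1 value would be used in the
    forward pass with the soft value for gradients (straight-through)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    u = rng.uniform(1e-8, 1.0 - 1e-8)
    g = np.log(u) - np.log(1.0 - u)  # logistic noise
    z = (np.log(p_apply / (1.0 - p_apply)) + g) / temperature
    soft = 1.0 / (1.0 + np.exp(-np.clip(z, -50.0, 50.0)))
    hard = float(soft > 0.5)
    return hard, soft
```

Averaged over many draws, the hard samples occur with probability p_apply, while the soft values stay differentiable with respect to it.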

Other Papers

Semi-Supervised, Unsupervised, Transfer, Representation & Few-Shot Learning

Big Transfer (BiT): General Visual Representation Learning (paper)

In this paper, the authors revisit the simple paradigm of transfer learning: pre-train on a large amount of labeled source data (e.g., the JFT-300M and ImageNet-21k datasets), then fine-tune the pre-trained weights on the target tasks, reducing both the amount of data needed for the target tasks and the fine-tuning time. The proposed framework, BiT (Big Transfer), consists of a number of components that are necessary to build an effective network capable of leveraging large-scale datasets and learning general, transferable representations.

On the (upstream) pre-training side, BiT consists of the following:

  • For very large datasets, the fact that Batch Norm (BN) uses statistics of the training data at test time results in a train/test discrepancy, where the training loss is correctly optimized while the validation loss is very unstable; BN is also sensitive to the batch size. To solve this, BiT uses Group Norm together with Weight Standardization instead of Batch Norm.
  • A small model such as ResNet-50 does not benefit from large-scale training data, so the size of the model also needs to be scaled up correspondingly.

For (down-stream) target tasks, BiT proposes the following:

  • The use of standard SGD, without any layer freezing, dropout, L2 regularization or gradient adaptation, and with the last prediction layer initialized to all zeros.
  • Instead of resizing all inputs to a fixed size (e.g., 224): during training, the images are resized and cropped to a square of a randomly chosen size and randomly horizontally flipped; at test time, the image is resized to a fixed size.
  • While mixup is not useful for large-scale pre-training given the abundance of data, BiT finds that mixup regularization can be very beneficial for the mid-sized datasets used in downstream tasks.
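Group Norm, the batch-independent normalization BiT substitutes for Batch Norm, is simple enough to sketch in NumPy; this is an unscaled version (no learned gain/bias) for illustration.

```python
import numpy as np

def group_norm(x, num_groups, eps=1e-5):
    """Group Norm (sketch): channels are split into groups and normalized
    with per-sample statistics, so inference does not depend on batch-level
    running statistics the way Batch Norm does. x: (N, C, H, W)."""
    n, c, h, w = x.shape
    xg = x.reshape(n, num_groups, c // num_groups, h, w)
    mean = xg.mean(axis=(2, 3, 4), keepdims=True)
    var = xg.var(axis=(2, 3, 4), keepdims=True)
    return ((xg - mean) / np.sqrt(var + eps)).reshape(n, c, h, w)
```

Because each sample is normalized by its own group statistics, the train/test discrepancy described in the first bullet disappears by construction.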

Learning Visual Representations with Caption Annotations (paper)

Training deep models on large-scale annotated datasets results not only in good performance on the task at hand, but also enables the model to learn useful representations for downstream tasks. But can we obtain such useful features without expensive fine-grained annotations? This paper investigates weakly-supervised pre-training using noisy labels, in this case image captions.

With the objective of using a limited set of image-caption pairs to learn visual representations, how can the training objective be formulated to push for an effective interaction between the images and their captions? Building on the masked language modeling used in BERT, which randomly masks 15% of the input tokens and trains the model to reconstruct them using the encoder part of the transformer, the paper proposes image-conditioned masked language modeling (ICMLM), where the images are leveraged to reconstruct the masked tokens of their corresponding captions. To solve this objective, the authors propose two multi-modal architectures: (1) ICMLM-tfm, where the image is encoded with a CNN and the masked caption with BERT; the caption and image features are then concatenated and passed through a transformer encoder, producing a multi-modal embedding used to predict the masked token; and (2) ICMLM-att+fc, where, similarly, the caption and image features are first produced, then passed through a pairwise attention block to aggregate information between the caption and the image; the resulting features are pooled and passed through a fully connected layer for masked-token prediction.

Memory-augmented Dense Predictive Coding for Video Representation Learning (paper)

The recent progress in self-supervised representation learning for images has shown impressive results on downstream tasks. However, while multi-modal representation learning for videos saw similar gains, self-supervision using video streams alone, without other modalities such as text or audio, is still not as developed, even though the temporal structure of videos provides a free supervisory signal for training a model to predict future states from the past in a self-supervised manner. The task remains hard because the exact future is not deterministic: at a given time step, there are many likely and plausible hypotheses for the future states (e.g., when the action is “playing golf”, a future frame could have the hands and golf club in many possible positions).

This paper uses contrastive learning with a memory module to solve the issues with future prediction. To reduce the uncertainty, the model predicts the future at the feature level, and is trained with a contrastive loss to avoid over-strict constraints. To deal with multiple hypotheses, a memory module is used to infer multiple future states simultaneously. Given a set of successive frames, a 2d-3d CNN encoder f produces context features and a GRU g aggregates all the past information, which is then used to select slots from the shared memory module. A predicted future state is produced as a convex combination of the selected memory slots, and is then compared with the true feature vectors of the future states using a contrastive loss. For downstream tasks, the features produced by g are pooled and fed to a classifier.
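The "convex combination of memory slots" step can be sketched in a few lines; this is an illustration of the mechanism with a toy slot-selection rule, not the authors' implementation.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def predict_future(context, memory):
    """Predict a future state as a convex combination of shared memory
    slots (sketch): the aggregated context vector (the GRU output in the
    paper) selects slots via softmax attention.
    context: (d,), memory: (m, d) learned slot bank."""
    weights = softmax(memory @ context)  # slot-selection probabilities
    return weights @ memory, weights     # convex combination of the slots
```

Because the weights are a softmax, the prediction always lies inside the convex hull of the memory slots, which is how a single forward pass can hedge over multiple plausible futures.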

SCAN: Learning to Classify Images without Labels (paper)

To group unlabeled input images into semantically meaningful clusters, we need to find solutions using visual similarity alone. Prior work either (1) learns rich features with a self-supervised method and then applies k-means on the features to find the clusters, which can easily lead to degenerate solutions, or (2) uses end-to-end clustering approaches that leverage CNN features for deep clustering or are based on mutual-information maximization. However, the produced clusters depend heavily on the initialization and are likely to latch onto low-level features.

To solve the issues found in prior work, the paper proposes SCAN (Semantic Clustering by Adopting Nearest neighbors), consisting of a two-step procedure. In the first step, feature representations are learned through a pretext task; then, to generate the initial clusters, SCAN mines the nearest neighbors of each image based on feature similarity instead of applying k-means. In the second step, the semantically meaningful nearest neighbors are used as a prior to train the model to classify each image and its mined neighbors together. This is optimized using a loss function that maximizes the dot product of their predictions after the softmax, pushing the network to produce predictions that are both consistent and discriminative (one-hot).
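A compact sketch of the SCAN objective: a consistency term over the image/neighbor prediction dot products, plus an entropy term over the mean prediction that discourages collapsing all images into one cluster (the entropy weight is a hypothetical value).

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def scan_loss(logits, neighbor_logits, entropy_weight=5.0):
    """SCAN objective (sketch). logits, neighbor_logits: (n, k) cluster
    logits for n images and one mined neighbor of each."""
    p = softmax(logits)
    q = softmax(neighbor_logits)
    # consistency: maximize the softmax dot product of image and neighbor
    consistency = -np.mean(np.log(np.sum(p * q, axis=1) + 1e-12))
    # entropy over the mean prediction: spread images across clusters
    mean_p = p.mean(axis=0)
    entropy = -np.sum(mean_p * np.log(mean_p + 1e-12))
    return consistency - entropy_weight * entropy
```

Confident, neighbor-consistent predictions that still use all clusters minimize both terms at once, which is exactly the "consistent and discriminative" behavior described above.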

GATCluster: Self-Supervised Gaussian-Attention Network for Image Clustering (paper)

Clustering consists of separating data into groups according to sample similarity. Traditional methods use hand-crafted features and domain-specific distance functions to measure similarity, but such hand-crafted features are very limited in expressiveness. Subsequent work combined deep representations with clustering algorithms, but the performance of deep clustering still suffers when the input data is complex. For effective clustering, the features must contain high-level discriminative information and capture object semantics; as for the clustering step, trivial solutions that assign all samples to a single or a few clusters must be avoided, and the clustering needs to be efficient enough to be applied to large images.

The paper proposes GATCluster, which directly outputs semantic cluster labels without further post-processing, where the learned features are one-hot encoded vectors to guarantee the avoidance of trivial solutions. GATCluster is trained in an unsupervised manner with four self-learning tasks under the constraints of transformation invariance, separability maximization, entropy analysis, and attention mapping.

Associative Alignment for Few-shot Image Classification (paper)

In few-shot image classification, the objective is to produce a model that can learn to recognize novel image classes when very few training examples are available. One popular approach is meta-learning, which extracts common knowledge from a large amount of labeled data over the base classes and uses it to train a model; the model is then trained to classify images from novel concepts with only a few examples. The meta objective is to find a good set of initial weights that converge rapidly when trained on the new concepts. Interestingly, recent work demonstrated that standard transfer learning without meta-learning, where a feature extractor is first pre-trained on the base classes and a classifier is then fine-tuned on top of it with the few new examples, performs on par with more sophisticated meta-learning strategies. However, freezing the extractor during fine-tuning, necessary to avoid overfitting, hinders performance.

The paper proposes a two-step approach to solve this. First, the feature extractor is used to produce features for the novel examples, and the feature of each example is mapped to one of the base classes using a similarity metric in the embedding space. The second step is the associative alignment, where the feature extractor is fine-tuned so that the embeddings of the novel images are pushed closer to the embeddings of their corresponding base images. This is done either by centroid alignment, where the distance between the center of each base class and the related novel examples is reduced, or by adversarial alignment, where a discriminator pushes the feature extractor to align the base and novel examples in the embedding space.

Other Papers

3D Computer Vision & Robotics

NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis (paper)

3D view synthesis from 2D images is a challenging problem, especially if the input 2D images are sparsely sampled. The goal is to train a model that takes a set of 2D images of a 3D scene (optionally with the camera poses and intrinsics), and then use the trained model to render novel views of the 3D scene that are not found in the input 2D images. One successful approach is voxel-based representations, which represent the 3D scene on a discretized grid, where a 3D CNN predicts a voxel grid of RGB-alpha values. However, such methods are memory-inefficient since they scale cubically with the spatial resolution, can be hard to optimize, and cannot parametrize scene surfaces smoothly. A recent trend in the computer vision community is to represent a given 3D scene as a continuous function using a fully-connected neural network, so the network itself is a compressed representation of the 3D scene, trained using the set of 2D images and then used to render novel views. Still, existing methods of this kind had not been able to match voxel-based methods.

NeRF (neural radiance fields) represents a scene as a continuous 5D function using a fully-connected network of 9 layers and 256 channels, whose input is a single continuous 5D coordinate, i.e., a 3D spatial location (x, y, z) and a viewing direction (θ, φ), and whose output is an RGB color and an opacity (volume density). To synthesize a given view, the rendering procedure queries 5D coordinates along camera rays and uses classic volume rendering techniques to project the output colors and densities into an image. Because volume rendering is naturally differentiable, the only input required to optimize the representation is a set of images with known camera poses. This way, NeRF effectively optimizes neural radiance fields to render photorealistic novel views of scenes with complicated geometry and appearance, using a simple reconstruction loss between the rendered images and the ground truths, and demonstrates results that outperform prior work on neural rendering and view synthesis.
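The volume-rendering step can be sketched as standard alpha compositing along a ray. The toy colors and densities below are hand-picked for illustration; the real NeRF composites hierarchically sampled network outputs:

```python
import math

def composite_ray(colors, densities, deltas):
    """Accumulate color along a ray: each sample contributes
    T_i * (1 - exp(-sigma_i * delta_i)) * c_i, where T_i is the
    transmittance through all samples in front of it."""
    out = [0.0, 0.0, 0.0]
    transmittance = 1.0
    for c, sigma, delta in zip(colors, densities, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)
        weight = transmittance * alpha
        out = [o + weight * ci for o, ci in zip(out, c)]
        transmittance *= 1.0 - alpha
    return out

# A nearly transparent red sample in front of a dense green one.
rgb = composite_ray(
    colors=[(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    densities=[0.1, 50.0],
    deltas=[1.0, 1.0],
)
print([round(v, 3) for v in rgb])  # mostly green, with a faint red contribution
```

Because every operation here is differentiable, the reconstruction loss on the composited pixel can be backpropagated to the per-sample colors and densities, and hence to the network that predicts them.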

Towards Streaming Perception (paper)

Practical applications such as self-driving vehicles require fast reaction times similar to those of humans, typically 200 milliseconds. In such settings, low-latency algorithms are required to ensure safe operation. However, even when the latency of computer vision algorithms is studied, it has been explored primarily in an offline setting, while vision for online perception imposes quite different latency demands: by the time an algorithm finishes processing a particular image frame, say after 200 ms, the surrounding world has already changed, as shown in the figure below. This forces perception to be ultimately predictive of the future, which is a fundamental property of human vision (e.g., as required whenever a baseball player strikes a fastball).

To develop better benchmarks that reflect real-world scenarios and make comparing existing methods easier, the paper introduces the objective of streaming perception, i.e., real-time online perception, and proposes a new meta-benchmark that systematically converts any image understanding task into a streaming image understanding task. This benchmark is built on a key insight: streaming perception requires understanding the state of the world at all time instants. When a new frame arrives, streaming algorithms must report the state of the world even if they have not finished processing the previous frame, forcing them to consider the amount of streaming data that must be ignored while the computation is occurring. Specifically, when comparing the model's output and the ground truths, the alignment is done using time instead of the input index, so the model needs to give the correct prediction for time step t before processing the corresponding input, i.e., if the model takes Δt to process an input, it can only use data from before t − Δt to predict the output corresponding to the input at time t.
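The time-based alignment can be sketched in a few lines. This toy model assumes a fixed per-frame latency and even ignores that a busy single-threaded model must skip frames while processing, which is exactly the behavior the benchmark is designed to penalize:

```python
def streamed_predictions(frame_times, latency):
    """Return a lookup: at query time t, the benchmark scores the most
    recent prediction whose processing has FINISHED by t. A prediction
    for the frame arriving at time s becomes available at s + latency."""
    def prediction_at(t):
        available = [s for s in frame_times if s + latency <= t]
        # Return the timestamp of the frame the usable prediction describes.
        return max(available) if available else None
    return prediction_at

# Frames arrive every 100 ms; assume each takes 250 ms to process.
pred = streamed_predictions([0, 100, 200, 300, 400], latency=250)
print(pred(300))  # at t=300 only the frame-0 prediction is ready
print(pred(460))  # by t=460 the frame-200 prediction has finished
```

The gap between the frame a prediction describes and the time at which it is scored is what forces streaming methods to forecast the future state of the world.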

Teaching Cameras to Feel: Estimating Tactile Physical Properties of Surfaces From Images (paper)

Humans are capable of forming a mental model at a young age that maps the perception of an object with a perceived sense of touch, which is based on previous experiences when interacting with different items. Having autonomous agents equipped with such a mental model can be a very valuable tool when interacting with novel objects, especially when a simple object class is not informative enough to accurately estimate tactile physical properties.

In order to simulate such a mental model in a more direct manner, the paper proposes to estimate the physical properties directly, allowing attributes of objects to be utilized directly. First, the authors propose a dataset of 400+ surface image sequences and tactile property measurements. Since people often unconsciously move their heads when estimating surface properties, acquiring multiple views of a surface, the captured image sequences comprise multiple viewing angles for each material surface. Then, they propose a cross-modal framework for learning the complex mapping from visual cues to tactile properties. The training objective of the model is to generate precise tactile property estimates given visual information. Both visual and tactile information are embedded into a shared latent space through separate encoder networks. A generator function then estimates tactile property values from the embedded visual vector, while a discriminator network learns to predict whether a tactile-visual pair is a real or synthetic example. During inference, the encoder-generator pair is used to infer the tactile properties of the input images.

Convolutional Occupancy Networks (paper)

3D reconstruction is an important problem in computer vision with numerous applications. An ideal representation of 3D geometry should be able to: a) encode complex geometries and arbitrary topologies, b) scale to large scenes, c) encapsulate local and global information, and d) be tractable in terms of memory and computation. However, existing representations for 3D reconstruction do not satisfy all of these requirements. While recent implicit neural representations have demonstrated impressive performance in 3D reconstruction, they suffer from some limitations due to their simple fully-connected network architecture, which does not allow for integrating local information from the observations or incorporating inductive biases such as translational equivariance.

Convolutional Occupancy Networks use convolutional encoders with implicit occupancy decoders to incorporate inductive biases and enable structured reasoning in 3D space. This results in a more fine-grained implicit 3D reconstruction of single objects, scales to large indoor scenes, and generalizes well from synthetic to real data.

Other Papers

Image and Video Synthesis

Transforming and Projecting Images into Class-conditional Generative Networks (paper)

GANs are capable of generating diverse images from different classes. For instance, BigGAN, a class-conditional GAN, can generate a new image from a given class when fed a noise vector z and a class embedding c. The image can then be manipulated by editing the latent noise vector and the class embedding. But is the inverse possible, i.e., given an input image, can we find the latent variable z and the class embedding c that best match the image? This problem remains challenging, since many input images cannot be generated by a GAN. Additionally, the objective function has many local minima, and search algorithms can easily get stuck in such regions.

To address these problems, the paper proposes pix2latent with two new ideas: estimating input transformations at scale, and using a non-local search algorithm to find better solutions. As illustrated above, given an input image, pix2latent first finds the best transformation so that the transformed input is likely to be generated by a GAN; then the image is projected into the latent space using the proposed BasicCMA optimization method. The obtained latent variables can be edited and projected back into the image space, producing an edited image, which is finally transformed back with the inverse of the initial transformation.

Contrastive Learning for Unpaired Image-to-Image Translation (paper)

Given two training sets of images with different properties or modes, e.g., images of horses and zebras, the objective of unpaired image-to-image translation is to learn a translation function between the two modes, e.g., transforming horses into zebras and vice versa, while retaining sensible information such as pose or size, all without access to a set of one-to-one matches between the two modes. Existing methods such as CycleGAN force the model to produce back-translated images consistent with the originals, but such methods assume a bijection, which is often too restrictive since a given translated image might have many plausible source images. An ideal loss should be invariant to differences in style while remaining discriminative with respect to the content.

Contrastive Unpaired Translation (CUT) aims to learn such an embedding space. In addition to the standard GAN loss, where the generator is trained to generate realistic translated images while the discriminator tries to differentiate between the translated images and real ones, an additional loss is used that pushes for similar embeddings between two corresponding patches from the input and translated images. It is optimized with a contrastive objective that pulls together the embeddings of the two corresponding patches while pushing apart the embedding of a given patch and its negatives, which are randomly sampled patches (only internal patches from the same input image are used; external patches from other images decrease performance).
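The patchwise contrastive objective is essentially an InfoNCE loss over patch embeddings. A toy sketch follows (the cosine similarity and the temperature value are illustrative choices, not the paper's exact settings):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def patch_nce_loss(query, positive, negatives, tau=0.07):
    """Cross-entropy of picking the corresponding input patch (positive)
    against other patches from the SAME image (negatives)."""
    logits = [cosine(query, positive) / tau] + [
        cosine(query, n) / tau for n in negatives
    ]
    m = max(logits)  # subtract max for numerical stability
    denom = sum(math.exp(l - m) for l in logits)
    return -(logits[0] - m - math.log(denom))

# A translated-patch embedding close to its source patch, far from others.
q = [1.0, 0.0]
loss_aligned = patch_nce_loss(q, positive=[0.9, 0.1],
                              negatives=[[-1.0, 0.0], [0.0, 1.0]])
loss_misaligned = patch_nce_loss(q, positive=[0.0, 1.0],
                                 negatives=[[0.9, 0.1], [-1.0, 0.0]])
print(loss_aligned < loss_misaligned)  # aligned patches incur a lower loss
```

Minimizing this loss over many patch locations is what encourages each output patch to preserve the content of the input patch at the same location.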

Rewriting a Deep Generative Model (paper)

GANs are capable of modeling a rich set of semantic and physical rules about the data distribution, but up to now it has remained unclear how such rules are encoded in the network or how a rule could be changed. This paper introduces a new problem setting: manipulation of specific rules encoded by a deep generative model. Given a generative model, the objective is to adjust its weights so that the modified model follows a new set of rules and generates images accordingly, as shown below.

Each layer is viewed as an associative memory that stores latent rules as a set of key-value relationships over hidden features. The model can then be edited by defining a constrained optimization that adds or edits one specific rule within the associative memory while preserving the existing semantic relationships in the model as much as possible. The paper does this directly by measuring and manipulating the model's internal structure, without requiring any new training data.

Learning Stereo from Single Images (paper)

Given a pair of corresponding images, the goal of stereo matching is to estimate the per-pixel horizontal displacement (i.e., disparity) between the corresponding locations of every pixel from the first view to the second, or vice versa. While fully supervised methods give good results, precise ground-truth disparity between a pair of stereo images is often hard to acquire. A possible alternative is to train on synthetic data and then fine-tune on a limited amount of real labeled data, but without a fine-tuning step with enough labels, such models are not capable of generalizing well to real images.

The paper proposes a novel, fully automatic pipeline for generating stereo training data from unstructured collections of single images given a depth-from-color model, requiring no synthetic data or stereo image pairs for training. Using a depth estimation network, a given left input image is first converted into a synthesized right image by a forward-warping operation based on the depth-derived disparity. With these stereo pairs, the stereo network can then be trained in a supervised manner, resulting in a model that generalizes well.
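The forward-warping step can be sketched on a single scanline: each left-view pixel is shifted by its disparity, with nearer (larger-disparity) pixels winning conflicts. This toy version skips the occlusion-hole filling the full pipeline adds:

```python
def forward_warp_row(row, disparities, fill=None):
    """Shift each left-view pixel left by its disparity to synthesize
    the right view; nearer pixels (larger disparity) win conflicts."""
    out = [fill] * len(row)
    best_disp = [-1] * len(row)
    for x, (value, d) in enumerate(zip(row, disparities)):
        xr = x - d
        if 0 <= xr < len(row) and d > best_disp[xr]:
            out[xr] = value
            best_disp[xr] = d
    return out

# A 6-pixel row: background at disparity 1, a near object 'C','D' at disparity 2.
row = ['A', 'B', 'C', 'D', 'E', 'F']
disp = [1, 1, 2, 2, 1, 1]
print(forward_warp_row(row, disp))
```

The `None` entries in the output are the disoccluded pixels that have no source in the left image, which is why a hole-filling step is needed before the pair can be used for training.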

What makes fake images detectable? Understanding properties that generalize (paper)

Although the quality of GAN-generated images is reaching impressive levels, deep networks trained to detect fake images can still pick up on the subtle artifacts in these generated images, and such trained networks can also find the same artifacts across many models trained on different datasets with different methods. This paper aims to visualize and understand which artifacts are shared between models and are easily detectable and transferable across different scenarios.

Since the global facial structure can vary among different generators and datasets, local patches of the generated images are more stereotyped and may share redundant artifacts. To this end, a fully-convolutional patch-based classifier is used to focus on local patches rather than global structure. The patch-level classifier can then be used to visualize and categorize the patches that are most indicative of real or fake images across various test datasets. Additionally, a generated image can be manipulated to exaggerate the characteristic attributes of fake images.

Other Papers

Vision and Language

Connecting Vision and Language with Localized Narratives (paper)

One of the popular ways of connecting vision and language is image captioning, where each image is paired with a human-authored textual caption, but this link exists only at the full-image scale, as the sentences describe the whole image. To improve this linking, grounded image captioning adds links between specific parts of the image caption and object boxes in the image. However, the links are still very sparse, the majority of objects and words are not grounded, and the annotation process is expensive.

The paper proposes a new and efficient form of multi-modal image annotation for connecting vision and language, called Localized Narratives. Localized Narratives are generated by asking annotators to describe an image with their voice while simultaneously hovering their mouse over the region they are describing. For instance, as shown in the figure above, the annotator says “woman” while using the mouse to indicate her spatial extent, thus providing visual grounding for this noun. Later they move the mouse from the woman to the balloon following its string, saying “holding”, which provides direct visual grounding of this relation. They also describe attributes like “clear blue sky” and “light blue jeans”. Since the voice is synchronized to the mouse pointer, the image location of every single word in the description can be determined, providing dense visual grounding in the form of a mouse-trace segment for each word. This rich form of annotation with multiple modalities (i.e., image, text, speech, and grounding) can be used for different tasks such as text-to-image generation, visual question answering, and voice-driven environment navigation, or for more fine-grained control of tasks, such as conditioning captions on specific parts of the image, which could let a person with imperfect vision get descriptions of specific parts by hovering their finger over the image.

UNITER: UNiversal Image-TExt Representation Learning (paper)

Most vision-and-language (V&L) tasks, such as Visual Question Answering (VQA), rely on joint multimodal embeddings to bridge the semantic gap between the visual and textual clues in images and text, but such representations are usually tailored for specific tasks and require specific architectures. In order to learn general joint embeddings that can be used across V&L downstream tasks, the paper introduces UNITER, a large-scale pre-trained model for joint multimodal embedding, illustrated below. Based on the transformer architecture, UNITER is pre-trained on four tasks: Masked Language Modeling (MLM) conditioned on the image, where randomly masked words are recovered using both image and text features; Masked Region Modeling (MRM) conditioned on the text, where the model reconstructs some regions of a given image; Image-Text Matching (ITM), where the model predicts whether an image and a text instance are paired or not; and Word-Region Alignment (WRA), where the model learns an optimal alignment between words and image regions using optimal transport. To use UNITER on downstream tasks, they are first reformulated as classification problems, and a classifier added on top of the [CLS] features is trained using a cross-entropy loss.

Learning to Learn Words from Visual Scenes (paper)

The standard approach in vision and language consists of learning a common embedding space. However, this approach is inefficient, often requiring millions of examples to learn, generalizes poorly to the natural compositional structure of language, and yields embeddings that are unable to adapt to novel words at inference time. So instead of learning the word embeddings themselves, this paper proposes to learn the process for acquiring word embeddings.

The model is based on the transformer architecture; at each iteration, it receives an episode of image-language pairs and meta-learns a policy to acquire word representations from the episode. This produces a representation that can acquire novel words at inference time and generalize more robustly to novel compositions. Specifically, each task is formulated as a language acquisition task, or episode, consisting of training examples and testing examples, where the testing examples evaluate the language acquired from the training examples. In the figure above, for instance, the model needs to acquire the word “chair”, which it has never seen before, from the training samples. The meta-training is done in the forward pass, where the model needs to point to the correct word, “chair”, in the training example, and a matching loss is used to train the model. After training on many episodes and tasks, the model is able to adapt very quickly to novel tasks during inference.

Other Papers

The Rest

Unfortunately, the number of papers makes the summarization task difficult and time-consuming, so for the rest of the papers, I will simply list some I came across in case the reader is interested in the subjects.

Deep Learning: Applications, Methodology, and Theory

Low level vision, Motion and Tracking

Face, Gesture, and Body Pose

Action Recognition, Understanding

This article was originally published in Yassine Ouali’s blog and re-published to TOPBOTS with permission from the author.




How to Get the Best Start at Sports Betting


The post How to Get the Best Start at Sports Betting appeared first on 1redDrop.



If you are looking into getting into sports betting, then you might be hesitant about how to start, and the whole idea of it can be quite daunting. There are many techniques to get the best possible start at sports betting and, in this article, we will have a look at some of the best tips for that.

Mental preparation

This sounds a bit pretentious, but it is very important to understand some things about betting before starting so you can not only avoid nasty surprises but also avoid losing too much money. Firstly, you need to know that, in the beginning, you will not be good at betting. It is through experience and learning from your mistakes that you will get better. It is imperative that you do not convince yourself that you are good at betting, especially if you win some early bets, because I can guarantee it will have been luck – and false confidence is not your friend. 

It is likely that you will lose some money at first, but this is to be expected. Almost any hobby that you are interested in will cost you some money so, instead, look at it as an investment. However, do not invest ridiculous amounts; rather, wait until you are confident in your betting ability to start placing larger stakes. 

Set up different accounts

This is the best way to start with sports betting, as the welcome offers will offset a lot of the risk. These offers are designed to be profitable to entice you into betting with the bookie, but it is completely legal to just profit from the welcome offer and not bet with the bookie again. 

If you do this with as many bookies as you can, you are minimising the risk involved with your betting and maximising possible returns, so it really is a no-brainer.

As well as this clear advantage, different betting companies offer different promotions. Ladbrokes offer a boost every day, for example, where you can choose your bet and boost it a little bit, and the Parimatch betting website chooses a bet for big events and doubles the odds. 

If you are making sure you stay aware of the best offers across these platforms, then you will be able to use the most lucrative ones and, as such, you will be giving yourself the best chance of making money. The house always wins, as they say, but if you use this tip, you are skewing the odds back in your favour. 

Remember, the house wins because of gamblers that do not put in the effort and do not bet smart. Avoid those mistakes and you will massively increase your chances of making money.


On Twitter, especially, but also other social media platforms, there are tipsters who offer their bets for free. It is not so much the bets themselves that you are interested in, but rather why they are betting on this. It is important that you find tipsters who know what they are doing, though, because there are a lot of tipsters who are essentially scamming their customers. It is quite easy to find legitimate tipsters because they are not afraid to show their mistakes. 

Once you have found good tipsters, then you need to understand the reasoning behind their bets. When you have done that, you can start placing these bets yourself, and they will likely be of better value since some tipsters influence the betting markets considerably. You can also follow their bets as they are likely to be sensible bets, although this does not necessarily translate to success.




Natural Language Processing in Production: 27 Fast Text Pre-Processing Methods


The post Natural Language Processing in Production: 27 Fast Text Pre-Processing Methods appeared first on TOPBOTS.



text pre-processing

Estimates state that 70%–85% of the world’s data is text (unstructured data) [1]. New deep learning language models (transformers) have caused explosive growth in industry applications [5,6,11].

This blog is not an article introducing you to Natural Language Processing. Instead, it assumes you are familiar with noise reduction and normalization of text. It covers text preprocessing up to producing tokens and lemmas from the text.

We stop at feeding the sequence of tokens into a Natural Language model; using that sequence of tokens to accomplish a specific model task is not covered here.

In production-grade Natural Language Processing (NLP), fast text pre-processing (noise cleaning and normalization) is critical; that is what this blog covers:

  1. I discuss the packages we use for production-level NLP;
  2. I detail the production-level NLP pre-processing text tasks with Python code and packages;
  3. Finally, I report benchmarks for NLP text pre-processing tasks.

Dividing NLP Processing into Two Steps

We segment NLP into two major steps (for the convenience of this article):

  1. Text pre-processing into tokens. We clean (noise removal) and then normalize the text. The goal is to transform the text into a corpus that any NLP model can use, a goal rarely achieved before the introduction of the transformer [2].
  2. Feeding the corpus (text pre-processed into a sequence of tokens) into NLP models for training or prediction.

The rest of this article is devoted to noise removal and normalization of text into tokens/lemmas (Step 1: text pre-processing). Noise removal deletes or transforms things in the text that degrade the NLP task model, and it is usually NLP task-dependent. For example, e-mail addresses may or may not be removed depending on whether the task is text classification or text redaction. We'll cover both replacement and removal of noise.
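As a sketch of such a task-dependent cleaning step, here is a simplified e-mail replacement/removal pass (the pattern below is illustrative, not a full RFC 5322 address matcher):

```python
import re

# Simplified e-mail pattern: local part, '@', domain containing a dot.
RE_EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def redact_emails(text, replace_with="_EMAIL_"):
    """Replace e-mail addresses with a placeholder (redaction-style task)."""
    return RE_EMAIL.sub(replace_with, text)

def remove_emails(text):
    """Delete e-mail addresses outright (classification-style task)."""
    return RE_EMAIL.sub("", text)

s = "contact bob.smith+news@example.co.uk or call 608-444-0000"
print(redact_emails(s))  # contact _EMAIL_ or call 608-444-0000
print(remove_emails(s))
```

Whether you replace or remove is a task decision: a redaction pipeline wants the placeholder, while a topic classifier may simply want the address gone.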

Normalization of the corpus transforms the text into a common form; the most frequent example is transforming all characters to lowercase. In follow-on blogs, we will cover different deep learning language models and transformers (Steps 2..n) fed by the corpus token/lemma stream.
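A minimal normalization pass in that spirit, lowercasing plus whitespace collapsing (a toy function of my own, not a particular package's API):

```python
import re

def normalize(text):
    """Transform text into a common form: lowercase everything and
    collapse runs of whitespace into single spaces."""
    text = text.lower()
    text = re.sub(r"\s+", " ", text).strip()
    return text

print(normalize("  The Quick\n\tBrown   FOX "))  # the quick brown fox
```

Real pipelines chain many such passes (lowercasing, accent stripping, lemmatization), each one a small, cheap transform like this.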

NLP Text Pre-Processing Package Factoids

There are many NLP packages available. We use spaCy [2], textacy [4], Hugging Face transformers [5], and regex [7] in most of our NLP production applications. The following are some of the “factoids” we used in our decision process.

Note: The following “factoids” may be biased. That is why we refer to them as “factoids.”

NLTK [3]

  • NLTK is a string processing library. All the tools take strings as input and return strings or lists of strings as output [3].
  • NLTK is a good choice if you want to explore different NLP methods with a corpus whose length is less than a million words.
  • NLTK is a bad choice if you want to go into production with your NLP application [3].


regex [7]

The use of regex is pervasive throughout our text pre-processing code. Regex is a fast string processor. Regex, in various forms, has been around for over 50 years. Regex support is part of the standard library of Java and Python, and is built into the syntax of other languages, including Perl and ECMAScript (JavaScript).

spaCy [2]

  • spaCy is a moderate choice if you want to research different NLP models with a corpus whose length is greater than a million words.
  • If you use a selection from spaCy [3], Hugging Face [5], [13], and GPT-3 [6], then you are performing SOTA (state-of-the-art) research of different NLP models (my opinion at the time of writing this blog).
  • spaCy is a good choice if you want to go into production with your NLP application.
  • spaCy is an NLP library implemented both in Python and Cython. Because of the Cython, parts of spaCy are faster than if implemented in Python [3];
  • spaCy is the fastest package we know of for NLP operations;
  • spaCy is available for MS Windows, macOS, and Ubuntu [3];
  • spaCy runs natively on Nvidia GPUs [3];
  • explosion/spaCy has 16,900 stars on Github (7/22/2020);
  • spaCy has 138 public repository implementations on GitHub;
  • spaCy comes with pre-trained statistical models and word vectors;
  • spaCy transforms text into document objects, vocabulary objects, word-token objects, and other useful objects resulting from parsing the text;
  • Doc class has several useful attributes and methods. Significantly, you can create new operations on these objects as well as extend a class with new attributes (adding to the spaCy pipeline);
  • spaCy features tokenization for 50+ languages;


Creating long_s Practice Text String

We create long_s, a long string that has extra whitespace, emoji, email addresses, $ symbols, HTML tags, punctuation, and other text that may or may not be noise for the downstream NLP task and/or model.

MULTIPLIER = int(3.8e3)
text_l = 300
%time long_s = ':( 😻 😈 #google +1 608-444-0000 08-444-0004 608-444-00003 ext. 508 '
long_s += ' 888 eihtg DoD Fee #hash ## Document Title</title> '
long_s += ':( cat- \n nip'
long_s += ' immed- \n natedly <html><h2>2nd levelheading</h2></html> . , '
long_s += '# f@z.yx can\'t Be a ckunk. $4 $123,456 won\'t seven '
long_s += ' $Shine $$beighty?$ '
long_s *= MULTIPLIER
print('size: {:g} {}'.format(len(long_s), long_s[:text_l]))

output =>

CPU times: user 3 µs, sys: 1 µs, total: 4 µs
Wall time: 8.11 µs
size: 1.159e+06 :( 😻 😈 #google +1 608-444-0000 08-444-0004 608-444-00003 ext. 508 888 eihtg DoD Fee #hash ## Document Title</title> :( cat- nip immed- natedly <html><h2>2nd levelheading</h2></html> . , # f@z.yx can't Be a ckunk. $4 $123,456 won't seven $Shine $$beigh

A string, long_s of 1.159 million characters is created in 8.11 µs.

Python String Corpus Pre-processing Step and Benchmarks

All benchmarks are run within a Docker container on MacOS Version 14.0 (14.0).

Model Name: Mac Pro
Processor Name: 12-Core Intel Xeon E5
Processor Speed: 2.7 GHz
Total Number of Cores: 24
L2 Cache (per Core): 256 KB
L3 Cache: 30 MB
Hyper-Threading Technology: Enabled Memory: 64 GB

Note: Corpus/text pre-processing is dependent on the end-point NLP analysis task. Sentiment analysis requires different corpus/text pre-processing steps than document redaction. The corpus/text pre-processing steps given here are for a range of NLP analysis tasks. Usually, a subset of the given corpus/text pre-processing steps is needed for each NLP task. Also, some required corpus/text pre-processing steps may not be given here.

1. NLP text preprocessing: Replace Twitter Hash Tags

from textacy.preprocessing.replace import replace_hashtags
%time text = replace_hashtags(long_s,replace_with= 'HASH')
print('size: {:g} {}'.format(len(text),text[:text_l]))

output =>

CPU times: user 223 ms, sys: 66 µs, total: 223 ms
Wall time: 223 ms
size: 1.159e+06 :( 😻 😈 _HASH_ +1 608-444-0000 08-444-0004 608-444-00003 ext. 508 888 eihtg DoD Fee _HASH_ ## Document Title</title> :( cat- nip immed- natedly <html><h2>2nd levelheading</h2></html> . , # f@z.yx can't Be a ckunk. $4 $123,456 won't seven $Shine $$beigh

Notice that #google and #hash are replaced with _HASH_, while ## and the lone # are untouched. A million characters were processed in about 220 ms, fast enough for a big corpus of a billion characters (example: web server log).

2. NLP text preprocessing: Remove Twitter Hash Tags

from textacy.preprocessing.replace import replace_hashtags
%time text = replace_hashtags(long_s,replace_with= '')
print('size: {:g} {}'.format(len(text),text[:text_l]))

output =>

CPU times: user 219 ms, sys: 0 ns, total: 219 ms
Wall time: 220 ms
size: 1.1134e+06 :( 😻 😈 +1 608-444-0000 08-444-0004 608-444-00003 ext. 508 888 eihtg DoD Fee ## Document Title</title> :( cat- nip immed- natedly <html><h2>2nd levelheading</h2></html> . , # f@z.yx can't Be a ckunk. $4 $123,456 won't seven $Shine $$beighty?$

Notice that #google and #hash are removed, while ## and the lone # are untouched. A million characters were processed in about 220 ms.

3. NLP text preprocessing: Replace Phone Numbers

from textacy.preprocessing.replace import replace_phone_numbers
%time text = replace_phone_numbers(long_s,replace_with= 'PHONE')
print('size: {:g} {}'.format(len(text),text[:text_l]))

output =>

CPU times: user 384 ms, sys: 1.59 ms, total: 386 ms
Wall time: 383 ms
size: 1.0792e+06
:( 😻 😈 PHONE 08-PHONE 608-444-00003 ext. 508 888 eihtg

Notice that the phone numbers 08-444-0004 and 608-444-00003 ext. 508 were not transformed.

4. NLP text preprocessing: Replace Phone Numbers – better

RE_PHONE_NUMBER: Pattern = re.compile(
    # core components of a phone number
    r"(?:^|(?<=[^\w)]))(\+?1[ .-]?)?(\(?\d{2,3}\)?[ .-]?)?(\d{2,3}[ .-]?\d{2,5})"
    # extensions, etc.
    r"(\s?(?:ext\.?|[#x-])\s?\d{2,6})?(?:$|(?=\W))",
    flags=re.UNICODE | re.IGNORECASE)

%time text = RE_PHONE_NUMBER.sub('_PHoNE_', long_s)
print('size: {:g} {}'.format(len(text),text[:text_l]))

output =>

CPU times: user 353 ms, sys: 0 ns, total: 353 ms
Wall time: 350 ms
size: 1.0108e+06 :( 😻 😈 _PHoNE_ _PHoNE_ _PHoNE_ 888 eihtg DoD Fee ## Document Title</title> :( cat- nip immed- natedly <html><h2>2nd levelheading</h2></html> . , # f@z.yx can't Be a ckunk. $4 $123,456 won't seven $Shine $$beighty?$

Notice that the phone numbers 08-444-0004 and 608-444-00003 ext. 508 were transformed. A million characters were processed in 350 ms.

5. NLP text preprocessing: Remove Phone Numbers

Using the improved RE_PHONE_NUMBER pattern, we substitute the empty string '' for '_PHoNE_' to remove phone numbers from the corpus.

text = RE_PHONE_NUMBER.sub('', long_s)
print('size: {:g} {}'.format(len(text),text[:text_l]))

output =>

CPU times: user 353 ms, sys: 459 µs, total: 353 ms
Wall time: 351 ms
size: 931000 :( 😻 😈 888 eihtg DoD Fee ## Document Title</title> :( cat- nip immed- natedly <html><h2>2nd levelheading</h2></html> . , # f@z.yx can't Be a ckunk. $4 $123,456 won't seven $Shine $$beighty?$

A million characters were processed in about 350 ms.

6. NLP text preprocessing: Removing HTML metadata

I admit removing HTML metadata is my favorite. Not because I like the task, but because I screen-scrape frequently. A lot of useful data resides on an IBM mainframe, a VAX-780 (huh?), or whatever terminal emulation produces an HTML-based report.

Web scraping such reports generates text littered with HTML tags. HTML tags are typically considered noise, as they are parts of the text with little or no value to the follow-on NLP task.

Remember, we created a test string (long_s) a little over a million characters long, with some HTML tags. We remove the HTML tags using BeautifulSoup.

from bs4 import BeautifulSoup
%time long_s = BeautifulSoup(long_s,'html.parser').get_text()
print('size: {:g} {}'.format(len(long_s),long_s[:text_l]))

output =>

CPU times: user 954 ms, sys: 17.7 ms, total: 971 ms
Wall time: 971 ms
size: 817000 :( 😻 😈 888 eihtg DoD Fee ## Document Title :( cat- nip immed- natedly 2nd levelheading 

The result is that BeautifulSoup removed over 7,000 HTML tags in a million-character corpus in about one second. Scaling linearly, a billion-character corpus (about 200 million words, or approximately 2,000 books) would require about 200 seconds.

The rate for HTML tag removal by BeautifulSoup is about 0.1 seconds per book, an acceptable rate for our production requirements.

I only benchmarked BeautifulSoup. If you know of a competitive alternative, please let me know.

Note: The compute times you get may be many times longer or shorter if you are running in the cloud or on Spark.

7. NLP text preprocessing: Replace currency symbol

The currency symbols “[$¢£¤¥ƒ֏؋৲৳૱௹฿៛ℳ元円圆圓﷼\u20A0-\u20C0]” are replaced with _CUR_ using the textacy package:

%time textr = textacy.preprocessing.replace.replace_currency_symbols(long_s)
print('size: {:g} {}'.format(len(textr),textr[:text_l]))

output =>

CPU times: user 31.2 ms, sys: 1.67 ms, total: 32.9 ms
Wall time: 33.7 ms
size: 908200 :( 😻 😈 888 eihtg DoD Fee ## Document Title :( cat- nip immed- natedly 2nd levelheading . , # f@z.yx can't Be a ckunk. _CUR_4 _CUR_123,456 won't seven _CUR_Shine _CUR__CUR_beighty?_CUR_

Note: The textacy replace_&lt;something&gt; functions let you specify the replacement text via replace_with. _CUR_ is the default substitution text for replace_currency_symbols.

If the currency symbol in your text is $, you can use a regex instead:

%time text = re.sub(r'\$', '_DOL_', long_s)
print('size: {:g} {}'.format(len(text),text[:250]))

output =>

CPU times: user 8.06 ms, sys: 0 ns, total: 8.06 ms
Wall time: 8.25 ms
size: 1.3262e+06 :( 😻 😈 #google +1 608-444-0000 08-444-0004 608-444-00003 ext. 508 888 eihtg DoD Fee #hash ## <html><title>Document Title</title></html> :( cat- nip immed- natedly <html><h2>2nd levelheading</h2></html> . , # f@z.yx can't Be a ckunk. _DOL_4 _DOL_123,456 won't seven _DOL_Shine _DOL__DOL_beighty?_DOL_ :

Note: Every $ symbol in your text will be replaced. Don’t use this if you have LaTeX or any text where $ serves another purpose.

8. NLP text preprocessing: Replace URL String

from textacy.preprocessing.replace import replace_urls
%time text = replace_urls(long_s,replace_with= '_URL_')
print('size: {:g} {}'.format(len(text),text[:text_l]))

output =>

CPU times: user 649 ms, sys: 112 µs, total: 649 ms
Wall time: 646 ms
size: 763800
:( 😻 😈 888 eihtg DoD Fee _URL_ ## Document Title :(

9. NLP text preprocessing: Remove URL String

from textacy.preprocessing.replace import replace_urls
%time text = replace_urls(long_s,replace_with= '')
print('size: {:g} {}'.format(len(text),text[:text_l]))

output =>

CPU times: user 633 ms, sys: 1.35 ms, total: 635 ms
Wall time: 630 ms
size: 744800
:( 😻 😈 888 eihtg DoD Fee ## Document Title :(

The rate for URL replace or removal is about 4,000 URLs per 1 million characters per second. Fast enough for 10 books in a corpus.

10. NLP text preprocessing: Replace E-mail string

%time text = textacy.preprocessing.replace.replace_emails(long_s)
print('size: {:g} {}'.format(len(text),text[:text_l]))

output =>

CPU times: user 406 ms, sys: 125 µs, total: 406 ms
Wall time: 402 ms
size: 725800
:( 😻 😈 888 eihtg DoD Fee ## Document Title :( cat-
nip immed-
natedly 2nd levelheading . , # _EMAIL_ _EMAIL_ can't Be a ckunk. $4 $123,456 won't seven $Shine $$beighty?$

The rate for email reference replace is about 8,000 emails per 1.7 million characters per second. Fast enough for 17 books in a corpus.

11. NLP text pre-processing: Remove E-mail string

from textacy.preprocessing.replace import replace_emails

%time text = replace_emails(long_s, replace_with='')
print('size: {:g} {}'.format(len(text),text[:text_l]))

output =>

CPU times: user 413 ms, sys: 1.68 ms, total: 415 ms
Wall time: 412 ms
size: 672600 :( 😻 😈 888 eihtg DoD Fee ## Document Title :( cat-
nip immed-
natedly 2nd levelheading . , # can't Be a ckunk. $4 $123,456 won't seven $Shine $$beighty?$

The rate for email reference removal is about 8,000 emails per 1.1 million characters per second. Fast enough for 11 books in a corpus.

12. NLP text preprocessing: normalize_hyphenated_words

from textacy.preprocessing.normalize import normalize_hyphenated_words
%time long_s = normalize_hyphenated_words(long_s)
print('size: {:g} {}'.format(len(long_s),long_s[:text_l]))

output =>

CPU times: user 186 ms, sys: 4.58 ms, total: 191 ms
Wall time: 190 ms
size: 642200 :( 😻 😈 888 eihtg DoD Fee ## Document Title :( catnip immednatedly

Approximately 8,000 hyphenated words, such as cat- nip and immed- natedly (misspelled), were joined in a corpus of 640,000 characters in 190 ms, or about 3 million characters per second.

13. NLP text preprocessing: Convert all characters to lower case

# all characters to lower case
%time long_s = long_s.lower()
print('size: {:g} {}'.format(len(long_s),long_s[:text_l]))

output =>

CPU times: user 4.82 ms, sys: 953 µs, total: 5.77 ms
Wall time: 5.97 ms
size: 642200
:( 😻 😈 888 eihtg dod fee ## document title :( catnip immednatedly 2nd levelheading . , # can't be a ckunk. $4 $123,456 won't seven $shine $$beighty?$

I only benchmarked Python’s built-in str.lower(). Lower-casing a million-character Python string takes about 6 ms, a rate that far exceeds our production requirements.

14. NLP text preprocessing: Whitespace Removal

%time text = re.sub(' +', ' ', long_s)
print('size: {:g} {}'.format(len(text),text[:text_l]))

output =>

CPU times: user 44.9 ms, sys: 2.89 ms, total: 47.8 ms
Wall time: 47.8 ms
size: 570000
:( 😻 😈 888 eihtg dod fee ## document title :( catnip immednatedly 2nd levelheading . , # can't be a ckunk. $4 $123,456 won't seven $shine $$beighty?$

The rate is about 50 ms for 1 million characters.

15. NLP text preprocessing: Whitespace Removal (slower)

from textacy.preprocessing.normalize import normalize_whitespace

%time text= normalize_whitespace(long_s)
print('size: {:g} {}'.format(len(text),text[:text_l]))

output =>

CPU times: user 199 ms, sys: 3.06 ms, total: 203 ms
Wall time: 201 ms
size: 569999
:( 😻 😈 888 eihtg dod fee ## document title :( catnip immednatedly 2nd levelheading . , # can't be a ckunk. $4 $123,456 won't seven $shine $$beighty?$

normalize_whitespace is about 5x slower but more general. For safety in production, we use normalize_whitespace, although to date we have seen no problems with the faster regex.

16. NLP text preprocessing: Remove Punctuation

from textacy.preprocessing.remove import remove_punctuation

%time text = remove_punctuation(long_s, marks=',.#$?')
print('size: {:g} {}'.format(len(text),text[:text_l]))

output =>

CPU times: user 34.5 ms, sys: 4.82 ms, total: 39.3 ms
Wall time: 39.3 ms
size: 558599
:( 😻 😈 888 eihtg dod fee document title :( catnip immednatedly 2nd levelheading can't be a ckunk 4 123 456 won't seven shine beighty


Creating the spaCy pipeline and Doc

In order to pre-process text with spaCy, we transform the text into a Doc object: a sequence of token objects. Each token carries the attributes (discussed above) that we use later in this article to pre-process the corpus.

Our text pre-processing end goal (usually) is to produce tokens that feed into our NLP models.

  • spaCy reverses the usual order of pre-processing the text and then transforming it into tokens: spaCy first creates a Doc of tokens, and you then pre-process the tokens by their attributes.

The result is that parsing text into a Doc object is where the majority of computation lies. As we will see, pre-processing the sequence of tokens by their attributes is fast.

Adding emoji cleaning in the spaCy pipeline

import en_core_web_lg
from spacymoji import Emoji  # emoji pipeline component

nlp = en_core_web_lg.load()
do = nlp.disable_pipes(["tagger", "parser"])
%time emoji = Emoji(nlp)
nlp.max_length = len(long_s) + 10
%time nlp.add_pipe(emoji, first=True)
%time long_s_doc = nlp(long_s)
print('size: {:g} {}'.format(len(long_s_doc),long_s_doc[:text_l]))

output =>

CPU times: user 303 ms, sys: 22.6 ms, total: 326 ms
Wall time: 326 ms
CPU times: user 23 µs, sys: 0 ns, total: 23 µs
Wall time: 26.7 µs
CPU times: user 7.22 s, sys: 1.89 s, total: 9.11 s
Wall time: 9.12 s
size: 129199
:( 😻 😈 888 eihtg dod fee document title :( catnip immednatedly 2nd levelheading can't be a ckunk 4 123 456 won't seven shine beighty

Creating the token sequence proceeded at about 14,000 tokens per second. We will see quite a speedup when we use an NVIDIA GPU.

nlp.pipe_names output => ['emoji', 'ner']

Note: The tokenizer is a “special” component and isn’t part of the regular pipeline. It also doesn’t show up in nlp.pipe_names. The reason is that there can only be one tokenizer, and while all other pipeline components take a Doc and return it, the tokenizer takes a string of text and turns it into a Doc. You can still customize the tokenizer. You can either create your own Tokenizer class from scratch, or even replace it with an entirely custom function.
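To make the tokenizer’s string-in, tokens-out contract concrete, here is a framework-agnostic sketch (illustrative only; spaCy’s real Tokenizer also applies prefix, suffix, infix, and exception rules, and a custom replacement must honor the same contract):

```python
import re

def naive_tokenize(text):
    """Illustrative only: split on whitespace, then peel one layer of
    leading/trailing punctuation into separate tokens, which is roughly
    what a tokenizer's prefix/suffix rules do."""
    tokens = []
    for chunk in text.split():
        # group 1: optional leading punctuation; group 2: core; group 3: trailing
        m = re.match(r"^(\W?)(.*?)(\W?)$", chunk)
        for part in m.groups():
            if part:
                tokens.append(part)
    return tokens

print(naive_tokenize("Hello, world!"))  # -> ['Hello', ',', 'world', '!']
```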

spaCy Token Attributes for Doc Token Preprocessing

As we saw earlier, spaCy exposes convenient token attributes for many other pre-processing tasks. For example, to remove stop words you can test the .is_stop attribute.

dir(token[0]) output=> 'ancestors', 'check_flag', 'children', 'cluster', 'conjuncts', 'dep', 'dep_', 'doc', 'ent_id', 'ent_id_', 'ent_iob', 'ent_iob_', 'ent_kb_id', 'ent_kb_id_', 'ent_type', 'ent_type_', 'get_extension', 'has_extension', 'has_vector', 'head', 'i', 'idx', 'is_alpha', 'is_ancestor', 'is_ascii', 'is_bracket', 'is_currency', 'is_digit', 'is_left_punct', 'is_lower', 'is_oov', 'is_punct', 'is_quote', 'is_right_punct', 'is_sent_end', 'is_sent_start', 'is_space', 'is_stop', 'is_title', 'is_upper', 'lang', 'lang_', 'left_edge', 'lefts', 'lemma', 'lemma_', 'lex_id', 'like_email', 'like_num', 'like_url', 'lower', 'lower_', 'morph', 'n_lefts', 'n_rights', 'nbor', 'norm', 'norm_', 'orth', 'orth_', 'pos', 'pos_', 'prefix', 'prefix_', 'prob', 'rank', 'remove_extension', 'right_edge', 'rights', 'sent', 'sent_start', 'sentiment', 'set_extension', 'shape', 'shape_', 'similarity', 'string', 'subtree', 'suffix', 'suffix_', 'tag', 'tag_', 'tensor', 'text', 'text_with_ws', 'vector', 'vector_norm', 'vocab', 'whitespace_']

Attributes added by the emoji (and, here, spacy-transformers) pipeline components:

dir(long_s_doc[0]._) output => ['emoji_desc', 'get', 'has', 'is_emoji', 'set', 'trf_alignment', 'trf_all_attentions', 'trf_all_hidden_states', 'trf_d_all_attentions', 'trf_d_all_hidden_states', 'trf_d_last_hidden_state', 'trf_d_pooler_output', 'trf_end', 'trf_last_hidden_state', 'trf_pooler_output', 'trf_separator', 'trf_start', 'trf_word_pieces', 'trf_word_pieces_']

I show spaCy performing preprocessing that results in a Python string corpus. The corpus is used to create a new sequence of spaCy tokens (Doc).

There is a faster way to accomplish spaCy preprocessing with spaCy pipeline extensions [2], which I show in an upcoming blog.

17. EMOJI Sentiment Score

EMOJI Sentiment Score is not a text preprocessor in the classic sense.

However, we find that when emoji are present, they almost always dominate the sentiment of a document.

For example, here are two similar phrases from a legal-notes e-mail with opposite sentiment:

The client was challenging. :( The client was difficult. :)

We calculate sentiment from emoji only, when they are present in a note or e-mail.
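EMOJI_TO_SENTIMENT_VALUE is our own lookup table (derived from emoji sentiment work such as [8]) and is not listed in this article. A tiny hypothetical stand-in, with made-up scores, shows the shape of the computation:

```python
# Hypothetical miniature of EMOJI_TO_SENTIMENT_VALUE; the real table maps
# thousands of emoji and emoticons to scores learned from data [8].
EMOJI_TO_SENTIMENT_VALUE = {
    "😻": 0.65, "😈": -0.05, ":)": 0.45, ":(": -0.44,  # made-up scores
}

tokens = ["The", "client", "was", "difficult", ".", ":)"]
scl = [EMOJI_TO_SENTIMENT_VALUE[t] for t in tokens
       if t in EMOJI_TO_SENTIMENT_VALUE]
score = sum(scl) / len(scl) if scl else 0.0
print(len(scl), score)  # -> 1 0.45
```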

%time scl = [EMOJI_TO_SENTIMENT_VALUE[token.text] for token in long_s_doc if (token.text in EMOJI_TO_SENTIMENT_VALUE)]
len(scl), sum(scl), sum(scl)/len(scl)

output =>

CPU times: user 179 ms, sys: 0 ns, total: 179 ms
Wall time: 178 ms
(15200, 1090.7019922523152, 0.07175671001659968)

The sentiment was 0.07 (neutral) for a 0.5-million-character “note” with 15,200 emoji and emoticons, computed in 178 ms. A fast sentiment-analysis calculation!

18. NLP text preprocessing: Removing emoji

You can remove emoji using the spaCy pipeline add-on:

%time long_s_doc_no_emojicon = [token for token in long_s_doc if token._.is_emoji == False]
print('size: {:g} {}'.format(len(long_s_doc_no_emojicon),long_s_doc_no_emojicon[:int(text_l/5)]))

output =>

CPU times: user 837 ms, sys: 4.98 ms, total: 842 ms
Wall time: 841 ms
size: 121599
[:(, 888, eihtg, dod, fee, , document, title, :(, catnip, immednatedly, 2nd, levelheading, , ca, n't, be, a, ckunk, , 4, , 123, 456, wo, n't, seven, , shine, , beighty, , :(, 888, eihtg, dod, fee, , document, title, :(, catnip, immednatedly, 2nd, levelheading, , ca, n't, be, a, ckunk, , 4, , 123, 456, wo, n't, seven, , shine, , beighty, , :(, 888, eihtg, dod, fee, ]

The emoji spaCy pipeline addition detected the emoji 😻 and 😈 but missed the emoticons :) and :(.

19. NLP text pre-processing: Removing emoji (better)

We developed EMOJI_TO_PHRASE to detect both emoji, such as 😻 😈, and emoticons, such as :) and :(, and removed them [8,9].
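EMOJI_TO_PHRASE is likewise our own table and is not listed here. A hypothetical miniature conveys its shape: keys cover both emoji and emoticons, and values are descriptive phrases:

```python
# Hypothetical miniature of EMOJI_TO_PHRASE; entries are illustrative.
EMOJI_TO_PHRASE = {
    "😻": "SMILING CAT FACE WITH HEART-SHAPED EYES",
    "😈": "SMILING FACE WITH HORNS",
    ":(": "FROWNING FACE",
    ":)": "SMILING FACE",
}

tokens = [":(", "😻", "cat", "nip"]
# Removal: drop any token found in the table.
removed = [t for t in tokens if t not in EMOJI_TO_PHRASE]
print(removed)  # -> ['cat', 'nip']
```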

%time text = [token.text if (token.text in EMOJI_TO_PHRASE) == False \
else '' for token in long_s_doc]
%time long_s = ' '.join(text)
print('size: {:g} {}'.format(len(long_s),long_s[:text_l]))

output =>

CPU times: user 242 ms, sys: 3.76 ms, total: 245 ms
Wall time: 245 ms
CPU times: user 3.37 ms, sys: 73 µs, total: 3.45 ms
Wall time: 3.46 ms
size: 569997
888 eihtg dod fee document title catnip immednatedly 2nd levelheading ca n't be a ckunk 4 123 456 wo n't seven shine beighty 888 eihtg dod fee document title catnip immednatedly 2nd levelheading ca n't be a ckunk 4 123 456 wo n't seven shine beighty 888 eihtg dod fee document title catnip imm

20. NLP text pre-processing: Replace emojis with a phrase

We can translate emoji into a natural-language phrase:

%time text = [token.text if token._.is_emoji == False else token._.emoji_desc for token in long_s_doc]
%time long_s = ' '.join(text)
print('size: {:g} {}'.format(len(long_s),long_s[:250]))

output =>

CPU times: user 1.07 s, sys: 7.54 ms, total: 1.07 s
Wall time: 1.07 s
CPU times: user 3.78 ms, sys: 0 ns, total: 3.78 ms
Wall time: 3.79 ms
size: 794197
:( smiling cat face with heart-eyes smiling face with horns 888 eihtg dod fee document title :( catnip immednatedly 2nd levelheading ca n't be a ckunk 4 123 456 wo n't seven shine beighty

Again, the emoji spaCy pipeline addition detected the emoji 😻 and 😈 but missed the emoticons :) and :(.

21. NLP text pre-processing: Replace emojis with a phrase (better)

We can translate both emoji and emoticons into natural-language phrases:

%time text = [token.text if (token.text in EMOJI_TO_PHRASE) == False \
else EMOJI_TO_PHRASE[token.text] for token in long_s_doc]
%time long_s = ' '.join(text)
print('size: {:g} {}'.format(len(long_s),long_s[:text_l]))

output =>

CPU times: user 251 ms, sys: 5.57 ms, total: 256 ms
Wall time: 255 ms
CPU times: user 3.54 ms, sys: 91 µs, total: 3.63 ms
Wall time: 3.64 ms
size: 904397
FROWNING FACE SMILING CAT FACE WITH HEART-SHAPED EYES SMILING FACE WITH HORNS 888 eihtg dod fee document title FROWNING FACE catnip immednatedly 2nd levelheading ca n't be a ckunk 4 123 456 wo n't seven shine beighty FROWNING FAC

Again, EMOJI_TO_PHRASE detected both the emoji, 😻 😈, and the emoticons, such as :) and :(, and substituted a phrase for each.

22. NLP text preprocessing: Correct Spelling

We will use symspell for spelling correction [14].

SymSpell, based on the Symmetric Delete spelling correction algorithm, just took 0.000033 seconds (edit distance 2) and 0.000180 seconds (edit distance 3) on an old MacBook Pro [14].
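The helpers sym_spell_setup() and check_spelling() wrap symspellpy and are not listed in this article. As a stand-in only (it uses stdlib difflib, not SymSpell’s symmetric-delete algorithm, and is far slower), a dictionary-lookup corrector with the same interface might look like:

```python
import difflib

# Hypothetical stand-in for the article's symspellpy-based helpers.
# A real setup loads a large frequency dictionary; this tiny list is
# only for illustration.
DICTIONARY = ["immediately", "chunk", "eighty", "spelling", "the", "cat"]

def check_spelling(word):
    """Return the closest dictionary word, or the word unchanged."""
    if word.lower() in DICTIONARY or not word.isalpha():
        return word
    matches = difflib.get_close_matches(word.lower(), DICTIONARY, n=1)
    return matches[0] if matches else word

print(check_spelling("ckunk"))  # -> chunk
print(check_spelling("cat"))    # already correct, unchanged
```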

%time sym_spell_setup() 
%time tk = [check_spelling(token.text) for token in long_s_doc[0:99999]]
%time long_s = ' '.join(tk)
print('size: {:g} {}'.format(len(long_s),long_s[:250]))

output =>

CPU times: user 5.22 s, sys: 132 ms, total: 5.35 s
Wall time: 5.36 s
CPU times: user 25 s, sys: 12.9 ms, total: 25 s
Wall time: 25.1 s
CPU times: user 3.37 ms, sys: 42 µs, total: 3.41 ms
Wall time: 3.42 ms
size: 528259 FROWNING FACE SMILING CAT FACE WITH HEART a SHAPED EYES SMILING FACE WITH HORNS 888 eight do fee document title FROWNING FACE catnip immediately and levelheading a not be a chunk a of 123 456 to not seven of shine of eighty

Spelling correction was accomplished for immednatedly, ckunk, and beighty. Correcting misspelled words is our largest computation: it required 30 seconds for 0.8 million characters.

23. NLP text preprocessing: Replacing Currency Symbol (spaCy)

%time token = [token.text if token.is_currency == False else '_CUR_' for token in long_s_doc]
%time long_s = ' '.join(token)
print('size: {:g} {}'.format(len(long_s),long_s[:text_l]))

Note: spaCy tags emoticons such as :( as punctuation, so naive punctuation removal also removes them. You can protect the emoticons with:

%time long_s_doc = [token for token in long_s_doc if token.is_punct == False or token._.is_emoji == True]
print('size: {:g} {}'.format(len(long_s_doc),long_s_doc[:50]))

However, replace_currency_symbols and the regex ignore context and replace every currency symbol. If $ appears in multiple senses in your text, you cannot ignore context; in that case, use spaCy:

%time tk = [token.text if token.is_currency == False else '_CUR_' for token in long_s_doc]
%time long_s = ' '.join(tk)
print('size: {:g} {}'.format(len(long_s),long_s[:250]))

output =>

CPU times: user 366 ms, sys: 13.9 ms, total: 380 ms
Wall time: 381 ms
CPU times: user 9.7 ms, sys: 0 ns, total: 9.7 ms
Wall time: 9.57 ms
size: 1.692e+06 😻 👍 🏿 < title > Document Title</title > :( < html><h2>2nd levelheading</h2></html > f@z.y a$@ ca n't bc$$ ef$4 5 66 _CUR_ wo nt seven eihtg _CUR_ nine _CUR_ _CUR_ zer$ 😻 👍 🏿 < title > Document Title</title > :( < html><h2>2nd leve

24. NLP text preprocessing: Removing e-mail address (spacy)

%time tokens = [token for token in long_s_doc if not token.like_email]
print('size: {:g} {}'.format(len(tokens),tokens[:int(text_l/3)]))

output =>

CPU times: user 52.7 ms, sys: 3.09 ms, total: 55.8 ms
Wall time: 54.8 ms
size: 99999

About 0.06 seconds for 1 million characters.

25. NLP text preprocessing: Remove whitespace and punctuation (spaCy)

%time tokens = [token.text for token in long_s_doc if (token.pos_ not in ['SPACE','PUNCT'])]
%time text = ' '.join(tokens)
print('size: {:g} {}'.format(len(text),text[:text_l]))

26. NLP text preprocessing: Removing stop-words

New NLP models (e.g., logistic regression, transformers) and NLP tasks (e.g., sentiment analysis) continue to appear. Some benefit from stop-word removal, and some do not [2].

Note: We now use only deep-learning language models (transformers) and do not remove stop words.

%time tokens = [token.text for token in long_s_doc if token.is_stop == False]
%time long_s = ' '.join(tokens)
print('size: {:g} {}'.format(len(long_s),long_s[:text_l]))

27. NLP text pre-processing: Lemmatization

Lemmatization looks beyond word reduction and considers a language’s full vocabulary to apply a morphological analysis to words.

Lemmatization looks at the surrounding text to determine a given word’s part of speech. It does not categorize phrases.

%time tokens = [token.lemma_ for token in long_s_doc]
%time long_s = ' '.join(tokens)
print('size: {:g} {}'.format(len(long_s),long_s[:text_l]))

output =>

CPU times: user 366 ms, sys: 13.9 ms, total: 380 ms
Wall time: 381 ms
CPU times: user 9.7 ms, sys: 0 ns, total: 9.7 ms
Wall time: 9.57 ms
size: 1.692e+06 😻 👍 🏿 < title > Document Title</title > :( < html><h2>2nd levelheading</h2></html > f@z.y a$@ ca n't bc$$ ef$4 5 66 _CUR_ wo nt seven eihtg _CUR_ nine _CUR_ _CUR_ zer$ 😻 👍 🏿 < title > Document Title</title > :( < html><h2>2nd leve

Note: spaCy does not include stemming; you can add it if you want. Stemming does not work as well as lemmatization because stemming does not consider context [2] (one reason some researchers consider spaCy “opinionated”).

Note: If you do not know what stemming is, you can still be on the Survivor show (my opinion).


Whatever the NLP task, you need to clean (pre-process) the data (text) into a corpus (document or set of documents) before it is input into any NLP model.

I adopt a text pre-processing framework that has three major categories of NLP text pre-processing:

  1. Noise Removal
  • transform Unicode characters into text characters;
  • convert a document image into segmented image parts and text snippets [10];
  • extract data from a database and transform it into words;
  • remove markup and metadata in HTML, XML, JSON, .md, etc.;
  • remove extra whitespace;
  • remove emoji or convert emoji into phrases;
  • remove or convert currency symbols, URLs, e-mail addresses, phone numbers, hashtags, and other identifying tokens;
  • correct misspelled words (tokens) [7];
  • remove remaining unwanted punctuation.

2. Tokenization

  • Split strings of text into smaller pieces, or “tokens”: paragraphs segment into sentences, and sentences tokenize into words.

3. Normalization

  • change all characters to lower case;
  • remove stop words for English, or for whatever language the text is in;
  • perform lemmatization or stemming.
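The three categories can be chained into a single pass. Here is a minimal, stdlib-only sketch of the framework (illustrative only; the real pipeline uses the textacy and spaCy calls shown throughout this article):

```python
import re

STOP_WORDS = {"a", "an", "the", "is", "in"}  # tiny illustrative set

def preprocess(text, remove_stops=True):
    # 1. Noise removal: strip markup tags and collapse whitespace.
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    # 3. Normalization (partly): lower-case before tokenizing.
    text = text.lower()
    # 2. Tokenization: naive whitespace split.
    tokens = text.split()
    # 3. Normalization: optional stop-word removal.
    if remove_stops:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens

print(preprocess("<h2>The   cat</h2> is in the hat"))  # -> ['cat', 'hat']
```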

Note: Tasks listed under Noise Removal and Normalization can move back and forth between the two categories. The categorical assignment is for explanatory convenience.

Note: We do not remove stop-words anymore. We found that our current NLP models have higher F1 scores when we leave in stop-words.

Note: Stop-word removal is expensive computationally. We found the best way to achieve faster stop-word removal was not to do it.

Note: We saw no significant change in Deep Learning NLP models’ speed with or without stop-word removal.

Note: The Noise Removal and Normalization lists are not exhaustive. These are some of the tasks I have encountered.

Note: The latest NLP Deep Learning models are more accurate than older models. However, Deep Learning models can be impractically slow to train and are still too slow for prediction. We show in a follow-on article how we speed-up such models for production.

Note: Stemming algorithms drop the end or the beginning of a word, using a list of common prefixes and suffixes, to create a base root word.

Note: Lemmatization uses linguistic knowledge bases to get the correct roots of words. Lemmatization performs morphological analysis of each word, which requires the overhead of creating a linguistic knowledge base for each language.

Note: Stemming is faster than lemmatization.

Note: Intuitively and in practice, lemmatization yields better results than stemming in an NLP Deep Learning model. Stemming generally reduces precision accuracy and increases recall accuracy because it injects semi-random noise when wrong.
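A toy example makes the contrast concrete. Below, a crude suffix-stripping stemmer (hypothetical, far simpler than Porter’s algorithm) is compared with a tiny lemma lookup standing in for a linguistic knowledge base:

```python
# Crude suffix stripper: context-free, can produce non-words ("runn").
def crude_stem(word):
    for suffix in ("ies", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Lemma lookup: a tiny stand-in for a linguistic knowledge base.
LEMMAS = {"studies": "study", "better": "good", "ran": "run"}

for w in ("studies", "running"):
    print(w, "->", crude_stem(w), "vs", LEMMAS.get(w, w))
```

Note how the stemmer mangles "studies" to "stud" and "running" to "runn", while the lemma table returns "study" but has no entry for "running": the stemmer is noisy, and the knowledge base is expensive to build.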

Read more in How and Why to Implement Stemming and Lemmatization from NLTK.

Text preprocessing Action benchmarks

Our own implementations, together with spaCy and textacy, are our current choices for fast short-text pre-processing in production. Given the big gap in performance, I recommend them for production purposes over NLTK’s implementation of Stanford’s NER.

In the next blogs, we will see how performance changes using multi-processing, multithreading, NVIDIA GPUs, and pySpark. I will also write about how and why we built implementations such as EMOJI_TO_PHRASE and EMOJI_TO_SENTIMENT_VALUE, and how to add any emoji, emoticon, or Unicode symbol.


[1] How Much Data Do We Create Every Day? The Mind-Blowing Stats Everyone Should Read.

[2] Industrial-Strength Natural Language Processing; Turbo-charge your spaCy NLP pipeline.

[3] NLTK 3.5 Documentation.

[4] Textacy: Text (Pre)-processing.

[5] Hugging Face.

[6] Language Models are Few-Shot Learners.

[7] re — Regular expression operations.

[8] Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm.

[9] How I Built Emojitracker.

[10] Classifying e-commerce products based on images and text.

[11] DART: Open-Domain Structured Data Record to Text Generation.

[12] Adv-BERT: BERT is not robust on misspellings! Generating nature adversarial samples on BERT.

[13] .

[14] 1000x faster Spelling Correction.

This article was originally published on Medium and re-published to TOPBOTS with permission from the author. Read more technical guides by Bruce Cottman, Ph.D. on Medium.




Microsoft BOT Framework: Building Blocks




Photo by Tincho Franco on Unsplash

I wrote an article some weeks ago introducing the “Microsoft BOT Framework”. The highlight of the article was educating readers on how to develop a basic chatbot. My workmates acknowledged the effort but were interested in knowing more. In this article, I am going to dig a little deeper into the various concepts involved in the Microsoft BOT Framework.

I will touch on the following concepts in this article:

  • Channel
  • State
  • Prompt
  • Dialog
  • Waterfall
  • Connector
  • Activity
  • Turn


A Channel is an application used to interact with the BOT. Integrations are currently available with Teams, Slack, Workplace, Skype, Facebook, Telegram, Line, Webchat, and more.

Some channels are also available as an adapter. Check here for more details.


State, in the context of chatbots, means persisting metadata of the conversation between the BOT and the user at a given moment. State management makes the conversation more meaningful (i.e., responses can be saved and accessed at a later point in time).


During a conversation between the user and the BOT, a prompt is the event of the BOT asking the user a question. The question can take the form of text, a button, a dropdown, etc.


Dialogs give the conversation its flow. A Dialog comprises two steps:

  1. A prompt from the BOT requesting for info
  2. User Response to the BOT

If the user’s response is valid, the BOT sends a new prompt requesting further information; otherwise, it re-sends the same prompt.



A Waterfall is formed from a combination of Dialogs: a sequence of dialogs that determines the complete flow of the conversation.
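Here is a framework-agnostic sketch of the waterfall idea (illustrative only, not the Bot Framework API; in C# you would compose a WaterfallDialog from step functions):

```python
# Hypothetical waterfall: each step is a prompt plus a validator.
# Invalid input repeats the step; valid input advances to the next.
STEPS = [
    ("Where should the taxi pick you up?", lambda a: len(a) > 0),
    ("How many passengers?",               lambda a: a.isdigit()),
]

def run_waterfall(answers):
    """Drive the dialog with scripted answers; return collected values."""
    collected, i = [], 0
    for answer in answers:
        prompt, valid = STEPS[i]
        if valid(answer):
            collected.append(answer)
            i += 1                # advance to the next dialog
            if i == len(STEPS):
                break
        # else: the same prompt would be re-sent
    return collected

print(run_waterfall(["Airport", "two", "2"]))  # -> ['Airport', '2']
```

Note how the invalid answer "two" repeats the passengers step instead of advancing, which is exactly the re-prompt behavior described above.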

Let’s look at all of these concepts in a diagrammatic representation.


The Connector is a REST API used by the BOT to communicate across multiple channels. It allows the exchange of messages between the BOT and the user on a specific channel.


As the name suggests, an activity is any communication between the user and the BOT. The connector API uses the activity object to send useful information back and forth. The most common activity type is the message. For a complete list of all Activity types, see here.


In any conversation between two parties, each party takes turns responding to an activity (message). In the context of the Microsoft BOT Framework, communication happens between the user and the BOT, so a turn can be considered the processing the BOT does to respond to a user request.

Now that we have understood the basic concepts needed to build this sample, let’s have a look at our use case.

We will build a ChatBot application that enables users to book a taxi. The conversational flow looks like this:

Each box in the above diagram represents a Dialog.


Step 1: Create a VS2017 project

I set the project name to “SuperTaxiBot”.

Step 2: Install Nuget Package

Install Nuget Package Microsoft.Bot.Builder.Dialogs using VS2017.

Step 3: Create a DialogBot.cs

The class contains the bot logic, which processes incoming activities from one or more channels and generates outgoing activities in response.

ActivityHandler defines various handlers for different types of activities. The handlers used in this sample are:

  • OnTurnAsync: Handles any incoming activity.
  • OnMessageActivityAsync: Invoked when a message activity is received from the user. If overridden, this could potentially contain conversational logic. By default, this method does nothing.
  • OnMembersAddedAsync: Invoked when members other than this bot (like a user) are added to the conversation.

