Best Research Papers From ICML 2020

This year’s virtual ICML conference hosted 10,800+ attendees from 75 countries. Evidently, the virtual format makes big research conferences such as ICML more accessible to the AI community all over the world.

With almost 5000 research papers submitted to ICML 2020 and an acceptance rate of 21.8%, a total of 1088 papers were presented at the conference. As usual, the Outstanding Papers awards were given to exemplary papers at this year’s ICML. To help you stay aware of the most prominent AI research breakthroughs, we’ve summarized the key ideas of these papers.

If you’d like to skip around, here are the papers we featured:

  1. On Learning Sets of Symmetric Elements
  2. Tuning-free Plug-and-Play Proximal Algorithm for Inverse Imaging Problems
  3. Generative Pretraining from Pixels
  4. Efficiently Sampling Functions from Gaussian Process Posteriors


ICML 2020 Best Paper Awards

1. On Learning Sets of Symmetric Elements, by Haggai Maron, Or Litany, Gal Chechik, Ethan Fetaya

Original Abstract

Learning from unordered sets is a fundamental learning setup, recently attracting increasing attention. Research in this area has focused on the case where elements of the set are represented by feature vectors, and far less emphasis has been given to the common case where set elements themselves adhere to their own symmetries. That case is relevant to numerous applications, from deblurring image bursts to multi-view 3D shape recognition and reconstruction.

In this paper, we present a principled approach to learning sets of general symmetric elements. We first characterize the space of linear layers that are equivariant both to element reordering and to the inherent symmetries of elements, like translation in the case of images. We further show that networks that are composed of these layers, called Deep Sets for Symmetric Elements layers (DSS), are universal approximators of both invariant and equivariant functions. DSS layers are also straightforward to implement. Finally, we show that they improve over existing set-learning architectures in a series of experiments with images, graphs, and point-clouds.

Our Summary

The research paper focuses on learning sets in the case when the elements of the set exhibit certain symmetries, which is relevant when learning with sets of images, sets of point clouds, or sets of graphs. The research team from NVIDIA Research, Stanford University, and Bar-Ilan University introduces a principled approach to learning such sets: they first characterize the space of linear layers that are equivariant both to element reordering and to the inherent symmetries of the elements, and then show that networks composed of these layers are universal approximators of both invariant and equivariant functions. The experiments demonstrate that the proposed approach achieves significant improvements over previous approaches.


What’s the core idea of this paper?

  • The research paper introduces a new principled approach to learning from unordered sets by utilizing the symmetries that the set elements exhibit:
    • The authors describe the symmetry group of such sets and characterize the space of linear layers that are equivariant to this group. These layers are called Deep Sets for Symmetric Elements (DSS) layers.
    • Then, the researchers prove that if invariant networks for the elements of interest are universal, the corresponding invariant DSS networks on sets of such elements are also universal. The same result holds for equivariant networks and equivariant DSS networks (a minimal sketch of a DSS layer follows this list).
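The structure of a DSS layer for sets of images can be sketched in a few lines. The following PyTorch snippet is a minimal illustration, not the authors’ reference implementation; the class name, layer sizes, and the use of a sum aggregation are our own choices. Each element passes through a shared (Siamese) convolution, and a second shared convolution of the aggregated set is added back to every element, which keeps the layer equivariant to both element reordering and image translation.

import torch
import torch.nn as nn

class DSSConvLayer(nn.Module):
    """A shared (Siamese) conv applied to each set element, plus a conv of the
    summed set added back to every element. The layer is equivariant to both
    element reordering and image translation."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.elem_conv = nn.Conv2d(in_channels, out_channels, 3, padding=1)
        self.agg_conv = nn.Conv2d(in_channels, out_channels, 3, padding=1)

    def forward(self, x):                      # x: (batch, set_size, C, H, W)
        b, n, c, h, w = x.shape
        elem = self.elem_conv(x.reshape(b * n, c, h, w)).reshape(b, n, -1, h, w)
        agg = self.agg_conv(x.sum(dim=1))      # aggregate over the set dimension
        return elem + agg.unsqueeze(1)         # broadcast the set term to every element

x = torch.randn(2, 5, 3, 32, 32)               # 2 sets of 5 RGB 32x32 images
print(DSSConvLayer(3, 8)(x).shape)             # torch.Size([2, 5, 8, 32, 32])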

What’s the key achievement?

  • The experimental results show that DSS layers outperform previous approaches in a series of tasks, including classification, frame selection in images and shapes, highest quality image selection, color-channel matching, and burst image deblurring.

What does the AI community think?

  • The paper received the Outstanding Paper Award at ICML 2020.

2. Tuning-free Plug-and-Play Proximal Algorithm for Inverse Imaging Problems, by Kaixuan Wei, Angelica Aviles-Rivero, Jingwei Liang, Ying Fu, Carola-Bibiane Schönlieb, Hua Huang

Original Abstract

Plug-and-play (PnP) is a non-convex framework that combines ADMM or other proximal algorithms with advanced denoiser priors. Recently, PnP has achieved great empirical success, especially with the integration of deep learning-based denoisers. However, a key problem of PnP based approaches is that they require manual parameter tweaking. It is necessary to obtain high-quality results across the high discrepancy in terms of imaging conditions and varying scene content. In this work, we present a tuning-free PnP proximal algorithm, which can automatically determine the internal parameters including the penalty parameter, the denoising strength and the terminal time. A key part of our approach is to develop a policy network for automatic search of parameters, which can be effectively learned via mixed model-free and model-based deep reinforcement learning. We demonstrate, through numerical and visual experiments, that the learned policy can customize different parameters for different states, and often more efficient and effective than existing handcrafted criteria. Moreover, we discuss the practical considerations of the plugged denoisers, which together with our learned policy yield state-of-the-art results. This is prevalent on both linear and nonlinear exemplary inverse imaging problems, and in particular, we show promising results on Compressed Sensing MRI and phase retrieval.

Our Summary

A key issue with plug-and-play (PnP) approaches is the need to manually tweak parameters. The PnP algorithm introduced in this paper is tuning-free and can automatically determine internal parameters, including the penalty parameter, the denoising strength, and the terminal time. The parameters are optimized with a reinforcement learning (RL) algorithm, where a high reward is given if the policy leads to faster convergence and better restoration accuracy. The extensive numerical and visual experiments demonstrate the effectiveness of the suggested approach on compressed sensing MRI and phase retrieval problems.


What’s the core idea of this paper?

  • PnP algorithms offer promising image recovery results. However, their performance is very sensitive to the internal parameter selection (i.e., the penalty parameter, the denoising strength, and the terminal time). The common approach is manual parameter tweaking for each specific problem setting, which is very cumbersome and time-consuming.
  • To address this problem, the researchers introduce an RL-based method with a policy network that can customize well-suited parameters for different images:
    • an automated parameter selection problem is formulated as a Markov decision process;
    • a policy agent gets higher rewards for faster convergence and better restoration accuracy;
    • the discrete terminal time and the continuous denoising strength and penalty parameters are optimized jointly (an illustrative PnP loop exposing these parameters is sketched after this list).
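For readers unfamiliar with the framework being tuned, here is a toy PnP-ADMM loop in Python/NumPy. It is a sketch only: the data-fidelity term, the stand-in Gaussian-blur “denoiser”, and the parameter values are placeholders, not the paper’s. The penalty rho, the denoising strength sigma, and the iteration count are exactly the internal parameters the paper’s policy network learns to select.

import numpy as np
from scipy.ndimage import gaussian_filter   # stand-in denoiser for illustration

def pnp_admm(y, rho=1.0, sigma=1.0, n_iters=20):
    """Plug-and-play ADMM for the toy problem min_x 0.5*||x - y||^2 + prior(x),
    where the prior is imposed implicitly by a plugged-in denoiser."""
    x = y.copy()
    z = y.copy()
    u = np.zeros_like(y)
    for _ in range(n_iters):
        # x-update: proximal step on the data-fidelity term (closed form here)
        x = (y + rho * (z - u)) / (1.0 + rho)
        # z-update: "prox" of the prior, replaced by the plugged-in denoiser
        z = gaussian_filter(x + u, sigma=sigma)
        # dual update
        u = u + x - z
    return x

noisy = np.clip(np.random.rand(64, 64) + 0.1 * np.random.randn(64, 64), 0, 1)  # toy noisy "image"
restored = pnp_admm(noisy, rho=0.5, sigma=1.5, n_iters=10)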

What’s the key achievement?

  • An extensive range of numerical and visual experiments demonstrate that the introduced tuning-free PnP algorithm:
    • outperforms state-of-the-art techniques by a large margin on the linear inverse imaging problem, namely compressed sensing MRI (especially under the difficult settings);
    • demonstrates state-of-the-art performance on the non-linear inverse imaging problem, namely phase retrieval, where it produces cleaner and clearer results than competing techniques;
    • often reaches a level of performance comparable to the “oracle” parameters tuned via the inaccessible ground truth.

What does the AI community think?

  • The paper received the Outstanding Paper Award at ICML 2020.

What are possible business applications?

  • The introduced tuning-free PnP proximal algorithm can be applied to different inverse imaging problems, including magnetic resonance imaging (MRI), computed tomography (CT), microscopy, and inverse scattering.

Where can you get implementation code?

  • The implementation of this research paper will be released on GitHub.

3. Generative Pretraining from Pixels, by Mark Chen, Alec Radford, Rewon Child, Jeff Wu, Heewoo Jun, Prafulla Dhariwal, David Luan, Ilya Sutskever

Original Abstract

Inspired by progress in unsupervised representation learning for natural language, we examine whether similar models can learn useful representations for images. We train a sequence Transformer to auto-regressively predict pixels, without incorporating knowledge of the 2D input structure. Despite training on low-resolution ImageNet without labels, we find that a GPT-2 scale model learns strong image representations as measured by linear probing, fine-tuning, and low-data classification. On CIFAR-10, we achieve 96.3% accuracy with a linear probe, outperforming a supervised Wide ResNet, and 99.0% accuracy with full finetuning, matching the top supervised pre-trained models. An even larger model trained on a mixture of ImageNet and web images is competitive with self-supervised benchmarks on ImageNet, achieving 72.0% top-1 accuracy on a linear probe of our features.

Our Summary

Generative pre-training methods have had a substantial impact on natural language processing over the last few years. The OpenAI research team re-evaluates these techniques on images and demonstrates that generative pre-training is competitive with other self-supervised approaches. The introduced approach consists of a pre-training stage, where both autoregressive and BERT objectives are explored, and a fine-tuning stage. The authors apply Transformer architecture to predict pixels instead of language tokens. The experiments demonstrate that generative image modeling learns state-of-the-art representations for low-resolution datasets and achieves comparable results to other self-supervised methods on ImageNet.


What’s the core idea of this paper?

  • The authors claim that generative pre-training methods for images can be competitive with other self-supervised approaches when using a flexible architecture such as Transformer, an efficient likelihood-based objective, and significant computational resources (2048 TPU cores).
  • They introduce Image GPT, or iGPT, which is based on GPT-2 but where the sequence Transformer architecture predicts pixels instead of language tokens:
    • First, raw images are resized to low resolution and reshaped into a 1D sequence.
    • Second, autoregressive next pixel prediction or masked pixel prediction (BERT) is chosen as the pre-training objective.
    • Finally, the representations learned by these objectives are evaluated with linear probes or fine-tuning (a rough sketch of the pixel-sequence construction follows this list).
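A rough sketch of turning an image into a 1D token sequence for autoregressive next-pixel prediction, in the spirit of iGPT. This is a simplification under our own assumptions: iGPT actually resizes images properly and quantizes colors into a learned 9-bit palette, which we approximate here with strided downsampling and 16-level grayscale binning.

import numpy as np

def image_to_sequence(img, size=32, levels=16):
    """Downsample a (H, W) grayscale image by simple striding, quantize each
    pixel into `levels` discrete tokens, then flatten to a 1D sequence."""
    h, w = img.shape
    small = img[:: h // size, :: w // size][:size, :size]
    tokens = np.floor(small / 256.0 * levels).astype(np.int64)
    return tokens.reshape(-1)                       # length size*size sequence

img = np.random.randint(0, 256, (224, 224))
seq = image_to_sequence(img)
inputs, targets = seq[:-1], seq[1:]                 # next-pixel prediction pairs
print(inputs.shape, targets.shape)                  # (1023,) (1023,)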

What’s the key achievement?

  • The experiments demonstrate that iGPT:
    • outperforms a supervised WideResNet on CIFAR-10, CIFAR-100, and STL-10 datasets;
    • achieves 72% accuracy on ImageNet, which is competitive with the recent contrastive learning approaches that require fewer parameters but work with higher resolution and utilize knowledge of the 2D input structure;
    • after fine-tuning, achieves 99% accuracy on CIFAR-10, similar to GPipe, the best model which pre-trains using ImageNet labels.

What does the AI community think?

  • The paper received an Honorable Mention at ICML 2020.

What are future research areas?

  • Exploring more efficient self-attention approaches.
  • Revisiting the representation learning capabilities of other families of generative models (e.g., flows, VAEs).

Where can you get implementation code?

  • TensorFlow implementation of iGPT by the OpenAI team is available here.
  • PyTorch implementation of the model is available here.

4. Efficiently Sampling Functions from Gaussian Process Posteriors, by James T. Wilson, Viacheslav Borovitskiy, Alexander Terenin, Peter Mostowsky, Marc Peter Deisenroth

Original Abstract

Gaussian processes are the gold standard for many real-world modeling problems, especially in cases where a model’s success hinges upon its ability to faithfully represent predictive uncertainty. These problems typically exist as parts of larger frameworks, wherein quantities of interest are ultimately defined by integrating over posterior distributions. These quantities are frequently intractable, motivating the use of Monte Carlo methods. Despite substantial progress in scaling up Gaussian processes to large training sets, methods for accurately generating draws from their posterior distributions still scale cubically in the number of test locations. We identify a decomposition of Gaussian processes that naturally lends itself to scalable sampling by separating out the prior from the data. Building off of this factorization, we propose an easy-to-use and general-purpose approach for fast posterior sampling, which seamlessly pairs with sparse approximations to afford scalability both during training and at test time. In a series of experiments designed to test competing sampling schemes’ statistical properties and practical ramifications, we demonstrate how decoupled sample paths accurately represent Gaussian process posteriors at a fraction of the usual cost.

Our Summary

In this paper, the authors explore techniques for efficiently sampling from Gaussian process (GP) posteriors. After investigating the behaviors of naive approaches to sampling and fast approximation strategies using Fourier features, they find that many of these strategies are complementary. They therefore introduce an approach that incorporates the best of different sampling approaches. First, they suggest decomposing the posterior as the sum of a prior and an update. Then they combine this idea with techniques from literature on approximate GPs and obtain an easy-to-use general-purpose approach for fast posterior sampling. The experiments demonstrate that decoupled sample paths accurately represent GP posteriors at a much lower cost.

What’s the core idea of this paper?

  • The introduced approach to sampling functions from GP posteriors centers on the observation that it is possible to implicitly condition Gaussian random variables by combining them with an explicit corrective term.
  • The authors translate this intuition to Gaussian processes and suggest decomposing the posterior as the sum of a prior and an update.
  • Building on this factorization, the researchers suggest an efficient approach for fast posterior sampling that seamlessly pairs with sparse approximations to achieve scalability both during training and at test time (a minimal sketch of the decomposition follows).
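The prior-plus-update decomposition is Matheron’s rule: a posterior sample at test points equals a joint prior sample plus a data-driven correction. Below is a minimal NumPy sketch of the exact, non-sparse version with an RBF kernel; the paper’s contribution pairs the prior term with Fourier-feature approximations and the update with sparse GPs for scalability, which this sketch does not show.

import numpy as np

def rbf(a, b, lengthscale=0.5):
    # Squared-exponential kernel for 1-D inputs.
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def sample_posterior_decoupled(X, y, Xs, noise=1e-2, seed=0):
    """One posterior draw via Matheron's rule:
    f_post(Xs) = f_prior(Xs) + K(Xs, X) (K(X, X) + noise*I)^{-1} (y - f_prior(X) - eps)."""
    rng = np.random.default_rng(seed)
    Z = np.concatenate([X, Xs])
    K = rbf(Z, Z) + 1e-6 * np.eye(len(Z))           # joint prior covariance (jitter for stability)
    f = np.linalg.cholesky(K) @ rng.standard_normal(len(Z))    # joint prior sample
    f_X, f_Xs = f[:len(X)], f[len(X):]
    eps = np.sqrt(noise) * rng.standard_normal(len(X))         # simulated observation noise
    Kxx = rbf(X, X) + noise * np.eye(len(X))
    update = rbf(Xs, X) @ np.linalg.solve(Kxx, y - f_X - eps)  # data-driven correction
    return f_Xs + update

X = np.linspace(-3, 3, 20)
y = np.sin(X) + 0.1 * np.random.randn(20)
Xs = np.linspace(-4, 4, 100)
sample = sample_posterior_decoupled(X, y, Xs)
print(sample.shape)                                 # (100,)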

What’s the key achievement?

  • Introducing an easy-to-use and general-purpose approach to sampling from GP posteriors.
  • Demonstrating, with a series of experiments, that decoupled sample paths:
    • avoid many shortcomings of the alternative sampling strategies;
    • accurately represent GP posteriors at a much lower cost; for example, simulation of a well-known model of a biological neuron required only 20 seconds using decoupled sampling, while the iterative approach required 10 hours.

What does the AI community think?

  • The paper received an Honorable Mention at ICML 2020.

Where can you get implementation code?

  • The authors released the implementation of this paper on GitHub.


How to Get the Best Start at Sports Betting


If you are looking into getting into sports betting, then you might be hesitant about how to start, and the whole idea of it can be quite daunting. There are many techniques to get the best possible start at sports betting and, in this article, we will have a look at some of the best tips for that.

Mental preparation

This sounds a bit pretentious, but it is very important to understand some things about betting before starting so you can not only avoid nasty surprises but also avoid losing too much money. Firstly, you need to know that, in the beginning, you will not be good at betting. It is through experience and learning from your mistakes that you will get better. It is imperative that you do not convince yourself that you are good at betting, especially if you win some early bets, because I can guarantee it will have been luck – and false confidence is not your friend. 

It is likely that you will lose some money at first, but this is to be expected. Almost any hobby that you are interested in will cost you some money so, instead, look at it as an investment. However, do not invest ridiculous amounts; rather, wait until you are confident in your betting ability to start placing larger stakes. 

Set up different accounts

This is the best way to start with sports betting, as the welcome offers will offset a lot of the risk. These offers are designed to be profitable to entice you into betting with the bookie, but it is completely legal to just profit from the welcome offer and not bet with the bookie again. 

If you do this with the most bookies, as you can, you are minimising the risk involved with your betting and maximising possible returns, so it really is a no-brainer.

As well as this clear advantage, different betting companies offer different promotions. Ladbrokes offer a boost every day, for example, where you can choose your bet and boost it a little bit, and the Parimatch betting website chooses a bet for big events and doubles the odds. 

If you are making sure you stay aware of the best offers across these platforms, then you will be able to use the most lucrative ones and, as such, you will be giving yourself the best chance of making money. The house always wins, as they say, but if you use this tip, you are skewing the odds back in your favour. 

Remember, the house wins because of gamblers that do not put in the effort and do not bet smart. Avoid those mistakes and you will massively increase your chances of making money.

Tipsters

On Twitter, especially, but also other social media platforms, there are tipsters who offer their bets for free. It is not so much the bets themselves that you are interested in, but rather why they are betting on this. It is important that you find tipsters who know what they are doing, though, because there are a lot of tipsters who are essentially scamming their customers. It is quite easy to find legitimate tipsters because they are not afraid to show their mistakes. 

Once you have found good tipsters, then you need to understand the reasoning behind their bets. When you have done that, you can start placing these bets yourself, and they will likely be of better value since some tipsters influence the betting markets considerably. You can also follow their bets as they are likely to be sensible bets, although this does not necessarily translate to success.

Source: https://1reddrop.com/2020/10/20/how-to-get-the-best-start-at-sports-betting/?utm_source=rss&utm_medium=rss&utm_campaign=how-to-get-the-best-start-at-sports-betting


Natural Language Processing in Production: 27 Fast Text Pre-Processing Methods


Estimates state that 70%–85% of the world’s data is text (unstructured data) [1]. New deep learning language models (transformers) have caused explosive growth in industry applications [5,6,11].

This blog is not an article introducing you to Natural Language Processing. Instead, it assumes you are familiar with noise reduction and normalization of text. It covers text preprocessing up to producing tokens and lemmas from the text.

We stop at feeding the sequence of tokens into a Natural Language model; using that token sequence to accomplish a specific model task is not covered here.

In production-grade Natural Language Processing (NLP), fast text pre-processing (noise cleaning and normalization) is critical, and that is what this blog covers.

  1. I discuss the packages we use for production-level NLP;
  2. I detail the production-level NLP text pre-processing tasks with Python code and packages;
  3. Finally, I report benchmarks for the NLP text pre-processing tasks.

Dividing NLP Processing into Two Steps

We segment NLP into two major steps (for the convenience of this article):

  1. Text pre-processing into tokens. We clean (noise removal) and then normalize the text. The goal is to transform the text into a corpus that any NLP model can use, a goal rarely achieved before the introduction of the transformer [2].
  2. Feeding the corpus (text pre-processed into a sequence of tokens) into NLP models for training or prediction.

The rest of this article is devoted to noise removal and normalization of text into tokens/lemmas (Step 1: text pre-processing). Noise removal deletes or transforms things in the text that degrade the NLP task model. It is usually NLP task-dependent. For example, e-mail addresses may or may not be removed, depending on whether the task is text classification or text redaction. We’ll cover both replacement and removal of the noise.

Normalization of the corpus is transforming the text into a common form. The most frequent example is normalization by transforming all characters to lowercase. In follow-on blogs, we will cover different deep learning language models and Transformers (Steps 2-n) fed by the corpus token/lemma stream.

NLP Text Pre-Processing Package Factoids

There are many NLP packages available. We use spaCy [2], textacy [4], Hugging Face transformers [5], and regex [7] in most of our NLP production applications. The following are some of the “factoids” we used in our decision process.

Note: The following “factoids” may be biased. That is why we refer to them as “factoids.”

NLTK [3]

  • NLTK is a string processing library. All the tools take strings as input and return strings or lists of strings as output [3].
  • NLTK is a good choice if you want to explore different NLP techniques with a corpus whose length is less than a million words.
  • NLTK is a bad choice if you want to go into production with your NLP application [3].

Regex

The use of regex is pervasive throughout our text pre-processing code. Regex is a fast string processor. Regex, in various forms, has been around for over 50 years. Regex support is part of the standard library of Java and Python, and is built into the syntax of others, including Perl and ECMAScript (JavaScript).

spaCy [2]

  • spaCy is a moderate choice if you want to research different NLP models with a corpus whose length is greater than a million words.
  • If you use a selection from spaCy [3], Hugging Face [5], fast.ai [13], and GPT-3 [6], then you are performing SOTA (state-of-the-art) research of different NLP models (my opinion at the time of writing this blog).
  • spaCy is a good choice if you want to go into production with your NLP application.
  • spaCy is an NLP library implemented both in Python and Cython. Because of the Cython, parts of spaCy are faster than if implemented in pure Python [3];
  • spaCy is the fastest package we know of for NLP operations;
  • spaCy is available for MS Windows, macOS, and Ubuntu [3];
  • spaCy runs natively on NVIDIA GPUs [3];
  • explosion/spaCy has 16,900 stars on GitHub (7/22/2020);
  • spaCy has 138 public repository implementations on GitHub;
  • spaCy comes with pre-trained statistical models and word vectors;
  • spaCy transforms text into document objects, vocabulary objects, word-token objects, and other useful objects resulting from parsing the text;
  • the Doc class has several useful attributes and methods. Significantly, you can create new operations on these objects as well as extend a class with new attributes (adding to the spaCy pipeline); a minimal sketch follows this list;
  • spaCy features tokenization for 50+ languages.
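As mentioned in the list above, you can extend spaCy objects with custom attributes. A minimal sketch, assuming the en_core_web_lg model used later in this article; the attribute name is_noise and the rule flagging emails and currency tokens are our own, for illustration only.

import en_core_web_lg
from spacy.tokens import Token

nlp = en_core_web_lg.load()

# Register a new attribute, accessible as token._.is_noise
Token.set_extension("is_noise", default=False, force=True)

doc = nlp("Document Title bhc@gmail.com $4")
for token in doc:
    # Flag tokens we may want to drop later in pre-processing
    token._.is_noise = token.like_email or token.is_currency
print([(token.text, token._.is_noise) for token in doc])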


Creating long_s Practice Text String

We create long_s, a long string that has extra whitespace, emoji, email addresses, $ symbols, HTML tags, punctuation, and other text that may or may not be noise for the downstream NLP task and/or model.

MULTIPLIER = int(3.8e3)
text_l = 300
%time long_s = ':( 😻 😈 #google +1 608-444-0000 08-444-0004 608-444-00003 ext. 508 '
long_s += ' 888 eihtg DoD Fee https://medium.com/ #hash ## Document Title</title> '
long_s += ':( cat- \n nip'
long_s += ' immed- \n natedly <html><h2>2nd levelheading</h2></html> . , '
long_s += '# bhc@gmail.com f@z.yx can\'t Be a ckunk. $4 $123,456 won\'t seven '
long_s += ' $Shine $$beighty?$ '
long_s *= MULTIPLIER
print('size: {:g} {}'.format(len(long_s),long_s[:text_l]))

output =>

CPU times: user 3 µs, sys: 1 µs, total: 4 µs
Wall time: 8.11 µs
size: 1.159e+06 :( 😻 😈 #google +1 608-444-0000 08-444-0004 608-444-00003 ext. 508 888 eihtg DoD Fee https://medium.com/ #hash ## Document Title</title> :( cat- nip immed- natedly <html><h2>2nd levelheading</h2></html> . , # bhc@gmail.com f@z.yx can't Be a ckunk. $4 $123,456 won't seven $Shine $$beigh

A string, long_s of 1.159 million characters is created in 8.11 µs.

Python String Corpus Pre-processing Step and Benchmarks

All benchmarks are run within a Docker container on MacOS Version 14.0 (14.0).

Model Name: Mac Pro
Processor Name: 12-Core Intel Xeon E5
Processor Speed: 2.7 GHz
Total Number of Cores: 24
L2 Cache (per Core): 256 KB
L3 Cache: 30 MB
Hyper-Threading Technology: Enabled
Memory: 64 GB

Note: Corpus/text pre-processing is dependent on the end-point NLP analysis task. Sentiment analysis requires different corpus/text pre-processing steps than document redaction. The corpus/text pre-processing steps given here are for a range of NLP analysis tasks. Usually, a subset of the given corpus/text pre-processing steps is needed for each NLP task. Also, some of the required corpus/text pre-processing steps may not be given here.

1. NLP text preprocessing: Replace Twitter Hash Tags

from textacy.preprocessing.replace import replace_hashtags
%time text = replace_hashtags(long_s, replace_with='_HASH_')
print('size: {:g} {}'.format(len(text),text[:text_l]))

output =>

CPU times: user 223 ms, sys: 66 µs, total: 223 ms
Wall time: 223 ms
size: 1.159e+06 :( 😻 😈 _HASH_ +1 608-444-0000 08-444-0004 608-444-00003 ext. 508 888 eihtg DoD Fee https://medium.com/ _HASH_ ## Document Title</title> :( cat- nip immed- natedly <html><h2>2nd levelheading</h2></html> . , # bhc@gmail.com f@z.yx can't Be a ckunk. $4 $123,456 won't seven $Shine $$beigh

Notice that #google and #hash are swapped with _HASH_, and ## and _# are untouched. A million characters were processed in about 200 ms, fast enough for a big corpus of a billion characters (for example, a web server log).

2. NLP text preprocessing: Remove Twitter Hash Tags

from textacy.preprocessing.replace import replace_hashtags
%time text = replace_hashtags(long_s,replace_with= '')
print('size: {:g} {}'.format(len(text),text[:text_l]))

output =>

CPU times: user 219 ms, sys: 0 ns, total: 219 ms
Wall time: 220 ms
size: 1.1134e+06 :( 😻 😈 +1 608-444-0000 08-444-0004 608-444-00003 ext. 508 888 eihtg DoD Fee https://medium.com/ ## Document Title</title> :( cat- nip immed- natedly <html><h2>2nd levelheading</h2></html> . , # bhc@gmail.com f@z.yx can't Be a ckunk. $4 $123,456 won't seven $Shine $$beighty?$

Notice that #google and #hash are removed, and ## and _# are untouched. A million characters were processed in about 200 ms.

3. NLP text preprocessing: Replace Phone Numbers

from textacy.preprocessing.replace import replace_phone_numbers
%time text = replace_phone_numbers(long_s,replace_with= 'PHONE')
print('size: {:g} {}'.format(len(text),text[:text_l]))

output =>

CPU times: user 384 ms, sys: 1.59 ms, total: 386 ms
Wall time: 383 ms
size: 1.0792e+06 :( 😻 😈 PHONE 08-PHONE 608-444-00003 ext. 508 888 eihtg

Notice that the phone numbers 08-444-0004 and 608-444-00003 ext. 508 were not transformed.

4. NLP text preprocessing: Replace Phone Numbers – better

import re
from typing import Pattern

RE_PHONE_NUMBER: Pattern = re.compile(
    # core components of a phone number
    r"(?:^|(?<=[^\w)]))(\+?1[ .-]?)?(\(?\d{2,3}\)?[ .-]?)?(\d{2,3}[ .-]?\d{2,5})"
    # extensions, etc.
    r"(\s?(?:ext\.?|[#x-])\s?\d{2,6})?(?:$|(?=\W))",
    flags=re.UNICODE | re.IGNORECASE)

%time text = RE_PHONE_NUMBER.sub('_PHoNE_', long_s)
print('size: {:g} {}'.format(len(text),text[:text_l]))

output =>

CPU times: user 353 ms, sys: 0 ns, total: 353 ms
Wall time: 350 ms
size: 1.0108e+06 :( 😻 😈 _PHoNE_ _PHoNE_ _PHoNE_ 888 eihtg DoD Fee https://medium.com/ ## Document Title</title> :( cat- nip immed- natedly <html><h2>2nd levelheading</h2></html> . , # bhc@gmail.com f@z.yx can't Be a ckunk. $4 $123,456 won't seven $Shine $$beighty?$

Notice that the phone numbers 08-444-0004 and 608-444-00003 ext. 508 were transformed. A million characters were processed in about 350 ms.

5. NLP text preprocessing: Remove Phone Numbers

Using the improved RE_PHONE_NUMBER pattern, we substitute '' for '_PHoNE_' to remove phone numbers from the corpus.

text = RE_PHONE_NUMBER.sub('', long_s)
print('size: {:g} {}'.format(len(text),text[:text_l]))

output =>

CPU times: user 353 ms, sys: 459 µs, total: 353 ms
Wall time: 351 ms
size: 931000 :( 😻 😈 888 eihtg DoD Fee https://medium.com/ ## Document Title</title> :( cat- nip immed- natedly <html><h2>2nd levelheading</h2></html> . , # bhc@gmail.com f@z.yx can't Be a ckunk. $4 $123,456 won't seven $Shine $$beighty?$

A million characters were processed in about 350 ms.

6. NLP text preprocessing: Removing HTML metadata

I admit removing HTML metadata is my favorite. Not because I like the task, but because I screen-scrape frequently. There is a lot of useful data that resides on an IBM mainframe, VAX-780 (huh?), or whatever terminal-emulation that results in an HTML-based report.

These web-scraped reports generate text that contains HTML tags. HTML tags are typically considered noise, as they are parts of the text with little or no value for the follow-on NLP task.

Remember, we created a test string (long_s) a little over a million characters long with some HTML tags. We remove the HTML tags using BeautifulSoup.

from bs4 import BeautifulSoup
%time long_s = BeautifulSoup(long_s,'html.parser').get_text()
print('size: {:g} {}'.format(len(long_s),long_s[:text_l]))

output =>

CPU times: user 954 ms, sys: 17.7 ms, total: 971 ms
Wall time: 971 ms
size: 817000 :( 😻 😈 888 eihtg DoD Fee https://medium.com/ ## Document Title :( cat- nip immed- natedly 2nd levelheading 

The result is that BeautifulSoup is able to remove over 7,000 HTML tags in a million-character corpus in about one second. Scaling linearly, a billion-character corpus, about 200 million words or approximately 2,000 books, would require about 200 seconds.

The rate for HTML tag removal by BeautifulSoup is about 0.1 seconds per book, an acceptable rate for our production requirements.

I only benchmark BeautifulSoup. If you know of a competitive alternative method, please let me know.

Note: The compute times you get may be multiples of time longer or shorter if you are using the cloud or Spark.

7. NLP text preprocessing: Replace currency symbol

The currency symbols “[$¢£¤¥ƒ֏؋৲৳૱௹฿៛ℳ元円圆圓﷼\u20A0-\u20C0]“ are replaced with _CUR_ using the textacy package:

import textacy.preprocessing.replace   # makes the dotted call below resolvable

%time textr = textacy.preprocessing.replace.replace_currency_symbols(long_s)
print('size: {:g} {}'.format(len(textr),textr[:text_l]))

output =>

CPU times: user 31.2 ms, sys: 1.67 ms, total: 32.9 ms
Wall time: 33.7 ms
size: 908200 :( 😻 😈 888 eihtg DoD Fee https://medium.com/ ## Document Title :( cat- nip immed- natedly 2nd levelheading . , # bhc@gmail.com f@z.yx can't Be a ckunk. _CUR_4 _CUR_123,456 won't seven _CUR_Shine _CUR__CUR_beighty?_CUR_

Note: The textacy replace_<something> functions enable you to specify the replacement text. _CUR_ is the default substitution text for replace_currency_symbols.

If you only need to handle the currency symbol $ in your text, you can use a regex:

%time text = re.sub(r'\$', '_DOL_', long_s)
print('size: {:g} {}'.format(len(text),text[:250]))

output =>

CPU times: user 8.06 ms, sys: 0 ns, total: 8.06 ms
Wall time: 8.25 ms
size: 1.3262e+06 :( 😻 😈 #google +1 608-444-0000 08-444-0004 608-444-00003 ext. 508 888 eihtg DoD Fee https://medium.com/ #hash ## <html><title>Document Title</title></html> :( cat- nip immed- natedly <html><h2>2nd levelheading</h2></html> . , # bhc@gmail.com f@z.yx can't Be a ckunk. _DOL_4 _DOL_123,456 won't seven _DOL_Shine _DOL__DOL_beighty?_DOL_ :

Note: Every $ symbol in your text will be replaced. Don’t use this if you have LaTeX or any text where the $ symbol has other uses.

8. NLP text preprocessing: Replace URL String

from textacy.preprocessing.replace import replace_urls
%time text = replace_urls(long_s,replace_with= '_URL_')
print('size: {:g} {}'.format(len(text),text[:text_l]))

output =>

CPU times: user 649 ms, sys: 112 µs, total: 649 ms
Wall time: 646 ms
size: 763800
:( 😻 😈 888 eihtg DoD Fee _URL_ ## Document Title :(

9. NLP text preprocessing: Remove URL String

from textacy.preprocessing.replace import replace_urls
%time text = replace_urls(long_s,replace_with= '')
print('size: {:g} {}'.format(len(text),text[:text_l]))

output =>

CPU times: user 633 ms, sys: 1.35 ms, total: 635 ms
Wall time: 630 ms
size: 744800
:( 😻 😈 888 eihtg DoD Fee ## Document Title :(

The rate for URL replace or removal is about 4,000 URLs per 1 million characters per second. Fast enough for 10 books in a corpus.

10. NLP text preprocessing: Replace E-mail string

%time text = textacy.preprocessing.replace.replace_emails(long_s)
print('size: {:g} {}'.format(len(text),text[:text_l]))

output =>

CPU times: user 406 ms, sys: 125 µs, total: 406 ms
Wall time: 402 ms
size: 725800
:( 😻 😈 888 eihtg DoD Fee ## Document Title :( cat-
nip immed-
natedly 2nd levelheading . , # _EMAIL_ _EMAIL_ can't Be a ckunk. $4 $123,456 won't seven $Shine $$beighty?$

The rate for email reference replace is about 8,000 emails per 1.7 million characters per second. Fast enough for 17 books in a corpus.

11. NLP text pre-processing: Remove E-mail string

from textacy.preprocessing.replace import replace_emails

%time text = textacy.preprocessing.replace.replace_emails(long_s,replace_with= '')
print('size: {:g} {}'.format(len(text),text[:text_l]))

output =>

CPU times: user 413 ms, sys: 1.68 ms, total: 415 ms
Wall time: 412 ms
size: 672600 :( 😻 😈 888 eihtg DoD Fee ## Document Title :( cat-
nip immed-
natedly 2nd levelheading . , # can't Be a ckunk. $4 $123,456 won't seven $Shine $$beighty?$

The rate for email reference removal is about 8,000 emails per 1.1 million characters per second. Fast enough for 11 books in a corpus.

12. NLP text preprocessing: normalize_hyphenated_words

from textacy.preprocessing.normalize import normalize_hyphenated_words
%time long_s = normalize_hyphenated_words(long_s)
print('size: {:g} {}'.format(len(long_s),long_s[:text_l]))

output =>

CPU times: user 186 ms, sys: 4.58 ms, total: 191 ms
Wall time: 190 ms
size: 642200 :( 😻 😈 888 eihtg DoD Fee ## Document Title :( catnip immednatedly

Approximately 8,000 hyphenated words, such as cat- nip and immed- natedly (misspelled), were corrected in a corpus of 640,000 characters in 190 ms, or about 3 million characters per second.

13. NLP text preprocessing: Convert all characters to lower case

### - **all characters to lower case;**
%time long_s = long_s.lower()
print('size: {:g} {}'.format(len(long_s),long_s[:text_l]))

output =>

CPU times: user 4.82 ms, sys: 953 µs, total: 5.77 ms
Wall time: 5.97 ms
size: 642200
:( 😻 😈 888 eihtg dod fee ## document title :( catnip immednatedly 2nd levelheading . , # can't be a ckunk. $4 $123,456 won't seven $shine $$beighty?$

I only benchmark the .lower() Python function. The rate for lower-case transformation by .lower() of a Python string of a million characters is about 6 ms, a rate that far exceeds our production requirements.

14. NLP text preprocessing: Whitespace Removal

%time text = re.sub(' +', ' ', long_s)
print('size: {:g} {}'.format(len(text),text[:text_l]))

output =>

CPU times: user 44.9 ms, sys: 2.89 ms, total: 47.8 ms
Wall time: 47.8 ms
size: 570000
:( 😻 😈 888 eihtg dod fee ## document title :( catnip immednatedly 2nd levelheading . , # can't be a ckunk. $4 $123,456 won't seven $shine $$beighty?$

The rate is under 0.1 seconds for 1 million characters.

15. NLP text preprocessing: Whitespace Removal (slower)

from textacy.preprocessing.normalize import normalize_whitespace

%time text= normalize_whitespace(long_s)
print('size: {:g} {}'.format(len(text),text[:text_l]))

output =>

CPU times: user 199 ms, sys: 3.06 ms, total: 203 ms
Wall time: 201 ms
size: 569999
:( 😻 😈 888 eihtg dod fee ## document title :( catnip immednatedly 2nd levelheading . , # can't be a ckunk. $4 $123,456 won't seven $shine $$beighty?$

normalize_whitespace is about 5x slower but more general. For safety in production, we use normalize_whitespace. To date, we do not think we have had any problems with the faster regex.

16. NLP text preprocessing: Remove Punctuation

from textacy.preprocessing.remove import remove_punctuation

%time text = remove_punctuation(long_s, marks=',.#$?')
print('size: {:g} {}'.format(len(text),text[:text_l]))

output =>

CPU times: user 34.5 ms, sys: 4.82 ms, total: 39.3 ms
Wall time: 39.3 ms
size: 558599
:( 😻 😈 888 eihtg dod fee document title :( catnip immednatedly 2nd levelheading can't be a ckunk 4 123 456 won't seven shine beighty

spaCy

Creating the spaCy pipeline and Doc

In order to pre-process text with spaCy, we transform the text into a corpus Doc object. We can then use the sequence of word-token objects of which a Doc object consists. Each token has attributes (discussed above) that we use later in this article to pre-process the corpus.

Our text pre-processing end goal (usually) is to produce tokens that feed into our NLP models.

  • spaCy reverses the order of pre-processing the text and then transforming the text into tokens: spaCy first creates a Doc of tokens, and you then pre-process the tokens by their attributes.

The result is that parsing text into a Doc object is where the majority of computation lies. As we will see, pre-processing the sequence of tokens by their attributes is fast.

Adding emoji cleaning in the spaCy pipeline

import en_core_web_lg
from spacymoji import Emoji   # emoji pipeline component (assumes the spacymoji package)

nlp = en_core_web_lg.load()
do = nlp.disable_pipes(["tagger", "parser"])
%time emoji = Emoji(nlp)
nlp.max_length = len(long_s) + 10
%time nlp.add_pipe(emoji, first=True)
%time long_s_doc = nlp(long_s)
print('size: {:g} {}'.format(len(long_s_doc),long_s_doc[:text_l]))

output =>

CPU times: user 303 ms, sys: 22.6 ms, total: 326 ms
Wall time: 326 ms
CPU times: user 23 µs, sys: 0 ns, total: 23 µs
Wall time: 26.7 µs
CPU times: user 7.22 s, sys: 1.89 s, total: 9.11 s
Wall time: 9.12 s
size: 129199
:( 😻 😈 888 eihtg dod fee document title :( catnip immednatedly 2nd levelheading can't be a ckunk 4 123 456 won't seven shine beighty

Creating the token sequence ran at about 14,000 tokens per second. We expect quite a speedup when we use an NVIDIA GPU.

nlp.pipe_names output => ['emoji', 'ner']

Note: The tokenizer is a “special” component and isn’t part of the regular pipeline. It also doesn’t show up in nlp.pipe_names. The reason is that there can only be one tokenizer, and while all other pipeline components take a Doc and return it, the tokenizer takes a string of text and turns it into a Doc. You can still customize the tokenizer. You can either create your own Tokenizer class from scratch, or even replace it with an entirely custom function.
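As a concrete illustration of the note above, here is a sketch of swapping in a custom tokenizer: any callable that takes a string and returns a Doc will do. The naive whitespace splitting below is for illustration only; adapt the splitting rules to your own corpus. We use a separate blank pipeline so the nlp object built earlier is left untouched.

import spacy
from spacy.tokens import Doc

class WhitespaceTokenizer:
    """Custom tokenizer: takes a string and returns a Doc (naive whitespace split)."""
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        return Doc(self.vocab, words=text.split(" "))

nlp_ws = spacy.blank("en")                       # separate pipeline, so we do not touch nlp above
nlp_ws.tokenizer = WhitespaceTokenizer(nlp_ws.vocab)
doc = nlp_ws("can't $4 bhc@gmail.com")
print([token.text for token in doc])             # ["can't", '$4', 'bhc@gmail.com']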

spaCy Token Attributes for Doc Token Preprocessing

As we saw earlier, spaCy provides convenience methods for many other pre-processing tasks. For example, to remove stop words you can reference the .is_stop attribute.

dir(token[0]) output=> 'ancestors', 'check_flag', 'children', 'cluster', 'conjuncts', 'dep', 'dep_', 'doc', 'ent_id', 'ent_id_', 'ent_iob', 'ent_iob_', 'ent_kb_id', 'ent_kb_id_', 'ent_type', 'ent_type_', 'get_extension', 'has_extension', 'has_vector', 'head', 'i', 'idx', 'is_alpha', 'is_ancestor', 'is_ascii', 'is_bracket', 'is_currency', 'is_digit', 'is_left_punct', 'is_lower', 'is_oov', 'is_punct', 'is_quote', 'is_right_punct', 'is_sent_end', 'is_sent_start', 'is_space', 'is_stop', 'is_title', 'is_upper', 'lang', 'lang_', 'left_edge', 'lefts', 'lemma', 'lemma_', 'lex_id', 'like_email', 'like_num', 'like_url', 'lower', 'lower_', 'morph', 'n_lefts', 'n_rights', 'nbor', 'norm', 'norm_', 'orth', 'orth_', 'pos', 'pos_', 'prefix', 'prefix_', 'prob', 'rank', 'remove_extension', 'right_edge', 'rights', 'sent', 'sent_start', 'sentiment', 'set_extension', 'shape', 'shape_', 'similarity', 'string', 'subtree', 'suffix', 'suffix_', 'tag', 'tag_', 'tensor', 'text', 'text_with_ws', 'vector', 'vector_norm', 'vocab', 'whitespace_']

Attributes added by the emoji and other pipeline extensions:

dir(long_s_doc[0]._) output => ['emoji_desc', 'get', 'has', 'is_emoji', 'set', 'trf_alignment', 'trf_all_attentions', 'trf_all_hidden_states', 'trf_d_all_attentions', 'trf_d_all_hidden_states', 'trf_d_last_hidden_state', 'trf_d_pooler_output', 'trf_end', 'trf_last_hidden_state', 'trf_pooler_output', 'trf_separator', 'trf_start', 'trf_word_pieces', 'trf_word_pieces_']

I show spaCy performing preprocessing that results in a Python string corpus. The corpus is used to create a new sequence of spaCy tokens (Doc).

There is a faster way to accomplish spaCy preprocessing with spaCy pipeline extensions [2], which I show in an upcoming blog.

17. EMOJI Sentiment Score

EMOJI Sentiment Score is not a text preprocessor in the classic sense.

However, we find that emoji are almost always the dominant text in a document.

For example, here are two similar phrases from legal-notes e-mail with opposite sentiment:

The client was challenging. :( The client was difficult. :)

We calculate sentiment only from the emoji and emoticons when they are present in a note or e-mail.
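EMOJI_TO_SENTIMENT_VALUE is our own lookup table and is not listed in this article; here is a toy, abbreviated sketch of its shape. The scores below are illustrative placeholders in [-1.0, 1.0], not our production values.

EMOJI_TO_SENTIMENT_VALUE = {
    '😻': 0.9,    # smiling cat face with heart-eyes
    '😈': -0.3,   # smiling face with horns
    ':)': 0.6,    # smiley emoticon
    ':(': -0.6,   # frown emoticon
}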

%time scl = [EMOJI_TO_SENTIMENT_VALUE[token.text] for token in long_s_doc if (token.text in EMOJI_TO_SENTIMENT_VALUE)]
len(scl), sum(scl), sum(scl)/len(scl)

output =>

CPU times: user 179 ms, sys: 0 ns, total: 179 ms
Wall time: 178 ms
(15200, 1090.7019922523152, 0.07175671001659968)

The sentiment was 0.07 (neutral) for a 0.5-million-character “note” with 15,200 emoji and emoticons, computed in 178 ms. A fast sentiment analysis calculation!

18. NLP text preprocessing: Removing emoji

You can remove emoji using the spaCy pipeline add-on:

%time long_s_doc_no_emojicon = [token for token in long_s_doc if token._.is_emoji == False]
print('size: {:g} {}'.format(len(long_s_doc_no_emojicon),long_s_doc_no_emojicon[:int(text_l/5)]))

output =>

CPU times: user 837 ms, sys: 4.98 ms, total: 842 ms
Wall time: 841 ms
size: 121599
[:(, 888, eihtg, dod, fee, , document, title, :(, catnip, immednatedly, 2nd, levelheading, , ca, n't, be, a, ckunk, , 4, , 123, 456, wo, n't, seven, , shine, , beighty, , :(, 888, eihtg, dod, fee, , document, title, :(, catnip, immednatedly, 2nd, levelheading, , ca, n't, be, a, ckunk, , 4, , 123, 456, wo, n't, seven, , shine, , beighty, , :(, 888, eihtg, dod, fee, ]

The emoji spaCy pipeline addition detected the emoji 😻 and 😈, but missed the emoticons :) and :(.

19. NLP text pre-processing: Removing emoji (better)

We developed EMOJI_TO_PHRASE to detect both the emoji, such as 😻 and 😈, and the emoticons, such as :) and :(, and removed them [8,9].

%time text = [token.text if (token.text in EMOJI_TO_PHRASE) == False \
else '' for token in long_s_doc]
%time long_s = ' '.join(text)
print('size: {:g} {}'.format(len(long_s),long_s[:text_l]))

output =>

CPU times: user 242 ms, sys: 3.76 ms, total: 245 ms
Wall time: 245 ms
CPU times: user 3.37 ms, sys: 73 µs, total: 3.45 ms
Wall time: 3.46 ms
size: 569997
888 eihtg dod fee document title catnip immednatedly 2nd levelheading ca n't be a ckunk 4 123 456 wo n't seven shine beighty 888 eihtg dod fee document title catnip immednatedly 2nd levelheading ca n't be a ckunk 4 123 456 wo n't seven shine beighty 888 eihtg dod fee document title catnip imm

20. NLP text pre-processing: Replace emojis with a phrase

We can translate emoji into a natural language phrase.

%time text = [token.text if token._.is_emoji == False else token._.emoji_desc for token in long_s_doc]
%time long_s = ' '.join(text)
print('size: {:g} {}'.format(len(long_s),long_s[:250]))

output =>

CPU times: user 1.07 s, sys: 7.54 ms, total: 1.07 s
Wall time: 1.07 s
CPU times: user 3.78 ms, sys: 0 ns, total: 3.78 ms
Wall time: 3.79 ms
size: 794197
:( smiling cat face with heart-eyes smiling face with horns 888 eihtg dod fee document title :( catnip immednatedly 2nd levelheading ca n't be a ckunk 4 123 456 wo n't seven shine beighty

Again, the emoji spaCy pipeline addition detected the emoji 😻 and 😈, but missed the emoticons :) and :(.

21. NLP text pre-processing: Replace emojis with a phrase (better)

We can translate both emoji and emoticons into a natural language phrase.

%time text = [token.text if (token.text in EMOJI_TO_PHRASE) == False \
else EMOJI_TO_PHRASE[token.text] for token in long_s_doc]
%time long_s = ' '.join(text)
print('size: {:g} {}'.format(len(long_s),long_s[:text_l]))

output =>

CPU times: user 251 ms, sys: 5.57 ms, total: 256 ms
Wall time: 255 ms
CPU times: user 3.54 ms, sys: 91 µs, total: 3.63 ms
Wall time: 3.64 ms
size: 904397
FROWNING FACE SMILING CAT FACE WITH HEART-SHAPED EYES SMILING FACE WITH HORNS 888 eihtg dod fee document title FROWNING FACE catnip immednatedly 2nd levelheading ca n't be a ckunk 4 123 456 wo n't seven shine beighty FROWNING FAC

Again, EMOJI_TO_PHRASE detected both the emoji, 😻 and 😈, and the emoticons, such as :) and :(, and substituted a phrase.

22. NLP text preprocessing: Correct Spelling

We will use symspell for spelling correction [14].

SymSpell, based on the Symmetric Delete spelling correction algorithm, just took 0.000033 seconds (edit distance 2) and 0.000180 seconds (edit distance 3) on an old MacBook Pro [14].
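The helpers sym_spell_setup() and check_spelling() used below are ours and are not shown in this article; here is a minimal sketch of what they might look like, assuming the symspellpy package and its bundled English frequency dictionary.

import pkg_resources
from symspellpy import SymSpell, Verbosity

sym_spell = None

def sym_spell_setup(max_edit_distance=2):
    # Load the English frequency dictionary shipped with symspellpy.
    global sym_spell
    sym_spell = SymSpell(max_dictionary_edit_distance=max_edit_distance)
    path = pkg_resources.resource_filename(
        "symspellpy", "frequency_dictionary_en_82_765.txt")
    sym_spell.load_dictionary(path, term_index=0, count_index=1)

def check_spelling(word, max_edit_distance=2):
    # Return the best suggestion, or the original word if none is found.
    suggestions = sym_spell.lookup(word, Verbosity.CLOSEST,
                                   max_edit_distance=max_edit_distance)
    return suggestions[0].term if suggestions else word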

%time sym_spell_setup() 
%time tk = [check_spelling(token.text) for token in long_s_doc[0:99999]]
%time long_s = ' '.join(tk)
print('size: {:g} {}'.format(len(long_s),long_s[:250]))

output =>

CPU times: user 5.22 s, sys: 132 ms, total: 5.35 s
Wall time: 5.36 s
CPU times: user 25 s, sys: 12.9 ms, total: 25 s
Wall time: 25.1 s
CPU times: user 3.37 ms, sys: 42 µs, total: 3.41 ms
Wall time: 3.42 ms
size: 528259 FROWNING FACE SMILING CAT FACE WITH HEART a SHAPED EYES SMILING FACE WITH HORNS 888 eight do fee document title FROWNING FACE catnip immediately and levelheading a not be a chunk a of 123 456 to not seven of shine of eighty

Spell correction was accomplished for immednatedly, ckunk, and beighty. Correcting misspelled words is our largest computation; it required about 30 seconds for 0.8 million characters.

23. NLP text preprocessing: Replacing Currency Symbol (spaCy)

%time token = [token.text if token.is_currency == False else '_CUR_' for token in long_s_doc]
%time long_s = ' '.join(token)
print('size: {:g} {}'.format(len(long_s),long_s[:text_l]))

Note: spaCy removes all punctuation, including the :) emoticon. You can protect the emoticon with:

%time long_s_doc = [token for token in long_s_doc if token.is_punct == False or token._.is_emoji == True]
print('size: {:g} {}'.format(len(long_s_doc),long_s_doc[:50]))

However, replace_currency_symbols and regex ignore context and replace any currency symbol. You may have multiple uses of $ in your text and thus cannot ignore context; in this case, you can use spaCy.

%time tk = [token.text if token.is_currency == False else '_CUR_' for token in long_s_doc]
%time long_s = ' '.join(tk)
print('size: {:g} {}'.format(len(long_s),long_s[:250]))

output =>

CPU times: user 366 ms, sys: 13.9 ms, total: 380 ms
Wall time: 381 ms
CPU times: user 9.7 ms, sys: 0 ns, total: 9.7 ms
Wall time: 9.57 ms
size: 1.692e+06 😻 👍 🏿 < title > Document Title</title > :( < html><h2>2nd levelheading</h2></html > bhc@gmail.com f@z.y a$@ ca n't bc$$ ef$4 5 66 _CUR_ wo nt seven eihtg _CUR_ nine _CUR_ _CUR_ zer$ 😻 👍 🏿 < title > Document Title</title > :( < html><h2>2nd leve

24. NLP text preprocessing: Removing e-mail address (spacy)

%time tokens = [token for token in long_s_doc if not token.like_email]
print('size: {:g} {}'.format(len(tokens),tokens[:int(text_l/3)]))

output =>

CPU times: user 52.7 ms, sys: 3.09 ms, total: 55.8 ms
Wall time: 54.8 ms
size: 99999

About 0.06 seconds for 1 million characters.

25. NLP text preprocessing: Remove whitespace and punctuation (spaCy)

%time tokens = [token.text for token in long_s_doc if (token.pos_ not in ['SPACE','PUNCT'])]
%time text = ' '.join(tokens)
print('size: {:g} {}'.format(len(text),text[:text_l]))

26. NLP text preprocessing: Removing stop-words

NLP models (e.g., logistic regression and transformers) and NLP tasks (e.g., sentiment analysis) continue to be added. Some benefit from stop-word removal, and some do not [2].

Note: We now only use different deep learning language models (transformers) and do not remove stopwords.

%time tokens = [token.text for token in long_s_doc if token.is_stop == False]
%time long_s = ' '.join(tokens)
print('size: {:g} {}'.format(len(long_s),long_s[:text_l]))

27. NLP text pre-processing: Lemmatization

Lemmatization looks beyond word reduction and considers a language’s full vocabulary to apply a morphological analysis to words.

Lemmatization looks at the surrounding text to determine a given word’s part of speech. It does not categorize phrases.

%time tokens = [token.lemma_ for token in long_s_doc]
%time long_s = ' '.join(tokens)
print('size: {:g} {}'.format(len(long_s),long_s[:text_l]))

output =>

CPU times: user 366 ms, sys: 13.9 ms, total: 380 ms
Wall time: 381 ms
CPU times: user 9.7 ms, sys: 0 ns, total: 9.7 ms
Wall time: 9.57 ms
size: 1.692e+06 😻 👍 🏿 < title > Document Title</title > :( < html><h2>2nd levelheading</h2></html > bhc@gmail.com f@z.y a$@ ca n't bc$$ ef$4 5 66 _CUR_ wo nt seven eihtg _CUR_ nine _CUR_ _CUR_ zer$ 😻 👍 🏿 < title > Document Title</title > :( < html><h2>2nd leve

Note: spaCy does not have stemming. You can add it if you want. Stemming does not work as well as lemmatization because stemming does not consider context [2] (which is why some researchers consider spaCy “opinionated”).

Note: If you do not know what stemming is, you can still be on the Survivor show (my opinion).

Conclusion

Whatever the NLP task, you need to clean (pre-process) the data (text) into a corpus (document or set of documents) before it is input into any NLP model.

I adopt a text pre-processing framework that has three major categories of NLP text pre-processing:

  1. Noise Removal
  • transform Unicode characters into text characters;
  • convert a document image into segmented image parts and text snippets [10];
  • extract data from a database and transform it into words;
  • remove markup and metadata in HTML, XML, JSON, .md, etc.;
  • remove extra whitespace;
  • remove emoji or convert emoji into phrases;
  • remove or convert currency symbols, URLs, email addresses, phone numbers, hashtags, and other identifying tokens;
  • correct mis-spelled words (tokens) [7];
  • remove remaining unwanted punctuation.

2. Tokenization

  • Tokenization splits strings of text into smaller pieces, or “tokens.” Paragraphs segment into sentences, and sentences tokenize into words.

3. Normalization

  • Change all characters to lower case;
  • Remove English stop words, or whatever language the text is in;
  • Perform Lemmatization or Stemming.

Note: The tasks listed in Noise Removal and Normalization can move back and forth. The categorical assignment is for explanatory convenience.

Note: We do not remove stop-words anymore. We found that our current NLP models have higher F1 scores when we leave in stop-words.

Note: Stop-word removal is expensive computationally. We found the best way to achieve faster stop-word removal was not to do it.

Note: We saw no significant change in Deep Learning NLP models’ speed with or without stop-word removal.

Note: The Noise Removal and Normalization lists are not exhaustive. These are some of the tasks I have encountered.

Note: The latest NLP Deep Learning models are more accurate than older models. However, Deep Learning models can be impractically slow to train and are still too slow for prediction. We show in a follow-on article how we speed-up such models for production.

Note: Stemming algorithms drop common prefixes and suffixes from the beginning or end of a word to create a base root word.

Note: Lemmatization uses linguistic knowledge bases to get the correct roots of words. Lemmatization performs morphological analysis of each word, which requires the overhead of creating a linguistic knowledge base for each language.

Note: Stemming is faster than lemmatization.

Note: Intuitively and in practice, lemmatization yields better results than stemming in an NLP Deep Learning model. Stemming generally reduces precision accuracy and increases recall accuracy because it injects semi-random noise when wrong.
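A quick illustration of the difference, using NLTK’s Porter stemmer and a spaCy lemma. The outputs in the comments are typical but depend on the model and context.

from nltk.stem import PorterStemmer
import en_core_web_lg

stemmer = PorterStemmer()
nlp_full = en_core_web_lg.load()        # full pipeline, so the tagger informs the lemma

word = "studies"
print(stemmer.stem(word))               # typically "studi" (not a dictionary word)
print([token.lemma_ for token in nlp_full(word)])   # typically ["study"]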

Read more in How and Why to Implement Stemming and Lemmatization from NLTK.

Text preprocessing Action benchmarks

Our own implementations, spaCy, and textacy are our current choices for fast short-text pre-processing in production. Given the big gap in performance, I would recommend using them for production purposes over NLTK’s implementation of Stanford’s NER.

In the next blogs, we will see how performance changes using multi-processing, multithreading, NVIDIA GPUs, and pySpark. I will also write about how and why we built our implementations, such as EMOJI_TO_PHRASE and EMOJI_TO_SENTIMENT_VALUE, and how to add emoji, emoticons, or any Unicode symbol.

References

[1] How Much Data Do We Create Every Day? The Mind-Blowing Stats Everyone Should Read.

[2] Industrial-Strength Natural Language Processing; Turbo-charge your spaCy NLP pipeline.

[3] NLTK 3.5 Documentation.

[4] Textacy: Text (Pre)-processing.

[5] Hugging Face.

[6] Language Models are Few-Shot Learners.

[7] re — Regular expression operations.

[8] Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm.

[9] How I Built Emojitracker.

[10] Classifying e-commerce products based on images and text.

[11] DART: Open-Domain Structured Data Record to Text Generation.

[12] Adv-BERT: BERT is not robust on misspellings! Generating nature adversarial samples on BERT.

[13] fast.ai.

[14] 1000x faster Spelling Correction.

This article was originally published on Medium and re-published to TOPBOTS with permission from the author. Read more technical guides by Bruce Cottman, Ph.D. on Medium.


Microsoft BOT Framework: Building Blocks


I wrote an article some weeks ago introducing the “Microsoft BOT Framework”. The highlight of that article was to educate readers on how to develop a basic chatbot. Although my workmates acknowledged the effort, they were interested in knowing more. In this article, I am going to dig a little deeper into the various concepts involved with the Microsoft BOT Framework.

I will touch on the following concepts in this article.

  • Channel
  • State
  • Prompt
  • Dialog
  • Waterfall
  • Connector
  • Activity
  • Turn

Channel

A channel is an application that is used to interact with the BOT. Some of the current integrations are available with Teams, Slack, Workplace, Skype, Facebook, Telegram, Line, Webchat, etc.

Some channels are also available as an adapter. Check here for more details.

State

State, in the context of chatbots, means persisting metadata of the conversation between the BOT and the user at a certain moment. State management makes the conversation more meaningful (i.e., the responses can be saved and accessed at a later point in time).

Prompt

During a conversation between the user and the BOT, a prompt is the event in which the BOT asks the user a question. This question could be in the form of text, a button, a dropdown, etc.

Dialog

Dialogs allow forming a flow in the conversation. A Dialog comprises two steps.

  1. A prompt from the BOT requesting for info
  2. User Response to the BOT

If the user response is valid, the BOT will send a new prompt for further information; else, it will re-send the same prompt.


Waterfall

A waterfall is formed from a combination of Dialogs: a sequence of Dialogs that determines the complete flow of the conversation.


Connector

The Connector is a REST API used by the BOT to communicate across multiple channels. The API allows the exchange of messages between the BOT and the user on a specific channel.

Activity

As the name suggests, an activity is any communication between the user and the BOT. The connector API uses the activity object to send useful information back and forth. The most common activity type is the message. For a complete list of all Activity types, see here.

Turn

In any conversation between two parties, each party takes turns responding to an activity (message). In the context of the Microsoft BOT Framework, communication happens between the user and the BOT; hence, a turn can be considered the processing done by the BOT to respond to the user’s request.

Now that we have understood the basic concepts needed to build this sample, let’s have a look at our use case.

We will build a ChatBot application that enables users to book a taxi. Each step of the conversational flow is represented by a Dialog.

Github: https://github.com/tarunbhatt9784/MFTSamples/tree/master/SuperTaxiBot

Step 1: Create a VS2017 project

I would set the name of the project as “SuperTaxiBot”.

Step 2: Install Nuget Package

Install the NuGet package Microsoft.Bot.Builder.Dialogs using VS2017.

Step 3: Create a DialogBot.cs

The class consists of bot logic which processes incoming activities from one or more channels and generates outgoing activities in response.

ActivityHandler defines various handlers for different types of activities. The activities used in this sample are:

  • OnTurnAsync: Handles any incoming activity.
  • OnMessageActivityAsync: Invoked when a message activity is received from the user. If overridden, this could potentially contain conversational logic. By default, this method does nothing.
  • OnMembersAddedAsync: Invoked when members other than this bot (like a user) are added to the conversation. (A minimal sketch of these handlers, using the Bot Framework Python SDK, is shown below.)
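This article targets the C# SDK, and the SuperTaxiBot code is in the GitHub repository linked above. Purely as an illustration of the same handlers, here is a minimal sketch using the Bot Framework Python SDK (botbuilder-core); it is an echo-style bot, not the SuperTaxiBot dialog logic.

from botbuilder.core import ActivityHandler, MessageFactory, TurnContext

class EchoBot(ActivityHandler):
    async def on_members_added_activity(self, members_added, turn_context: TurnContext):
        # Greet every member added to the conversation except the bot itself.
        for member in members_added:
            if member.id != turn_context.activity.recipient.id:
                await turn_context.send_activity(MessageFactory.text("Welcome!"))

    async def on_message_activity(self, turn_context: TurnContext):
        # Conversational logic would go here; this sketch simply echoes the message.
        await turn_context.send_activity(
            MessageFactory.text(f"You said: {turn_context.activity.text}"))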

Source: https://chatbotslife.com/microsoft-bot-framework-building-blocks-377be3d55dab?source=rss—-a49517e4c30b—4
