
A brief tour of the NLP Sesame Street


As part of my master's degree I wrote this article to start formalizing my scientific writing about topics I'm really interested in. I wanted to create a brief summary of NLP findings and progress over the last few years. In this period, performance has improved significantly, and it's really interesting how the NLP community has been digging into new ways to process language. This summary should give you some basic information about the state of the art of NLP, and I invite you to read the cited articles to get deeper knowledge of each term and strategy.

Abstract

The effective processing of natural language is something that has been pursued throughout history through technology and scientific findings. However, due to the inherent complexity of language itself (semantic structure, slang, regionalisms, outdated words, etc.), it has been really difficult to extract good and accurate value from its analysis, regardless of the medium (voice, audio, text, etc.). On the other hand, the recent rise of Big Data and the rapidly increasing processing power of computers enable a new way of solving Natural Language Processing (NLP) problems: handling data collected from multiple sources and user inputs with high-quality models and data-driven solutions. Once NLP tasks become accurate and stable, the next steps point to streaming approaches, analyzing inputs in real time and making decisions with the aid of artificial intelligence and machine learning, and to highly accurate work with documents. This line of work has shown continuous improvement, changing the state of the art of NLP in a short time, especially considering how long it took the field to move from ideas to formal models.

In this chapter, I describe the main language models and machine learning techniques related to achieving new state-of-the-art results, in order to give a lead to those who are entering the Sesame Street of NLP, which we could refer to as a state-of-the-art movement running from 2018 up to date in 2020.

Background

Taking Big Data as a common factor, we can take the term itself for granted. Having said that, let's review the main acronyms related to the state of the art of NLP and some of its up-to-date techniques.

Natural Language Processing is the task of giving computers meaning and understanding of human language: a context is taken as input, and that information is handled to perform a task or display an instruction as output, which, in most cases, is a human-language-oriented message, in order to achieve a good human-computer interaction (HCI).

NLP has been gaining papers and widening its field of research year by year. As Joseph Mariani and G. Francopoulo note in their paper (2019), the field of study has been active since 1965, and authorship has been increasing dramatically year by year since 1985.

It wasn't until the late 1940s that the term was mentioned for the first time, as research on machine translation had already started, though not as focused as it is now; before that time, the field didn't even exist. This is no coincidence, as the growth in computing power and the release of the internet have had great importance in this area. Nevertheless, it is only with the introduction of big data, artificial intelligence, neural networks and more accurate machine learning models that NLP has started to be efficient enough to deliver value and be profitable and, of course, to become a trending topic among the scientific community and researchers. “The ratio of the total number of papers (65,003) to the overall number of different authors (48,894) represents the global productivity of the community: each author published on average 1.33 papers over 50 years” (Mariani et al., 2019), regarding the papers published at any of the NLP4NLP conferences or journals.

Linguistics

Back in the day, natural language processing was normally conducted by computer scientists. Nowadays, the area has been mutating into collaborative work, as a matter of the interest of other professionals such as linguists, psychologists and philosophers. One of the most ironic aspects of NLP is that it adds to our knowledge of human language (Khurana et al., 2017).

It's curious how this technological breakthrough, which tries to make humans and machines communicate, is adding value to the knowledge of language itself. Scientists are also taking help from professionals of other areas; this is where we find Noam Chomsky, an American linguist, philosopher and cognitive scientist, and his theories about the structure of language. Chomsky's theories have dominated linguistics since the 1960s and have been highly influential throughout the field of cognitive science, which Chomsky helped to create (Valin, 1980).

Linguistics is the science of language; it includes phonology, which refers to sound, and morphology, which refers to word formation. Similarly, when we talk about natural language processing we have to think about those areas, as our input comes from a human as speech or as text with the purpose of giving an instruction or saying something. After that come syntax (sentence structure), semantics (meaning) and pragmatics (understanding), which is where we locate NLP techniques and Language Models (LMs) (Khurana et al., 2017).

In the first models, syntactic rules, as a linguistic matter, were procedural in nature, like algorithms, e.g. 'move a question word like "who" to the beginning of the sentence in a question', and the approach of those days to creating sentences was a derivation consisting of sequences of procedural operations. However, Chomsky's notion of linguistic competence did not cover how sentences are constructed or interpreted in real time; he distinguished that as performance, the application of this competence in actual situations (Valin, 1980).

One of the things we can expect in the near future is machines answering and generating questions to refine a request instead of just carrying out an instruction right away. It's clear NLP still has limitations and communication problems when using only statistical approaches. This is where Natural Language Understanding (NLU) and Natural Language Generation (NLG) come together to give a solution that goes beyond predicting the next word or sentence. In some situations, we need more meaning and semantic representation than a bag of words to understand the context; for those cases it is necessary to implement NLU models. In the opposite direction, once a context is understood, the NLG process starts producing anything from phrases to meaningful paragraphs.

When we refer to the study of language processing in humans, psycholinguists usually test human responses to words in context, in order to better understand the information that is used to generate predictions, as our mind is always trying to predict the future. Allyson Ettinger (2020) mentions two relevant measures of predictive human responses: cloze probability and N400 amplitude. For our specific purpose, let's only talk about cloze probability, as it appears in one of the NLP techniques mentioned further on. In a cloze task, humans try to fill blanks in incomplete sentences with the words they expect. Hence, the cloze probability of a word w in context c refers to the proportion of people who select w to complete c; for example, if 80 out of 100 participants complete "I take my coffee with cream and ___" with "sugar", the cloze probability of "sugar" in that context is 0.8.

Related Work

The usage of NLP in applications has been increasing and, in most cases, it works without being noticed, as it's a backend process not perceived by users. Example applications include translation, summarization, dialog systems, spam filtering, information retrieval, image caption generation and speech recognition.

When we talk about NLP, we could be referring to its speech area, audio processing, which at some step of the pipeline is parsed into text. Text is the main area we focus on here and where the NLP techniques are applied.

Datasets used for NLP Pre-training

With the help of big data clustering we can find a lot of datasets and evaluation benchmarks with good value for pre-training purposes. There are plenty of datasets for different purposes and languages. Regarding NLP, the datasets normally used for pre-training and evaluation include the IMDB (Internet Movie Database) reviews dataset, which is normally used for sentiment classification since it contains movie reviews from many people; the Wikipedia dataset, which contains articles in many languages; and the Stanford Question Answering Dataset (SQuAD), which is a reading comprehension dataset.
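
For readers who want to experiment with these corpora, a minimal sketch using the Hugging Face datasets package could look like this (the hub identifiers "imdb" and "squad" are assumptions about where public copies live, not something prescribed by the article):

from datasets import load_dataset

imdb = load_dataset("imdb")      # movie reviews labeled for sentiment classification
squad = load_dataset("squad")    # reading-comprehension question/answer pairs

print(imdb["train"][0]["text"][:200])   # first 200 characters of one review
print(squad["train"][0]["question"])    # one crowdsourced question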


NLP tasks for evaluation

In order to evaluate improvements from new language models, strategies or techniques, some NLP tasks have been standardized as metrics, making it possible to compare results on specific tasks with shared data, such as text corpora, with the goal of reproducible research.

GLUE

Most NLU models above the word level are designed for a specific task and struggle with out-of-domain data. If we need models whose understanding goes beyond the detection of correspondences between inputs and outputs, then it is critical to develop a more unified model that can learn to execute a range of different linguistic tasks in different domains. This is where the General Language Understanding Evaluation (GLUE) benchmark comes in: a collection of NLU tasks including question answering, sentiment analysis and textual entailment, with an associated online platform for model evaluation, comparison and analysis. It favors models that can learn to represent linguistic knowledge in a way that facilitates sample-efficient learning and effective knowledge transfer across tasks. The collection consists of nine “difficult and diverse” task datasets designed to test a model's language understanding, and it is designed to highlight common challenges, such as the use of world knowledge and logical operators, that we expect models must handle to robustly solve the tasks (Wang et al., 2019).

The tasks evaluated within GLUE fall into three kinds: single-sentence tasks, such as The Corpus of Linguistic Acceptability (CoLA), which consists of English acceptability judgments drawn from books and journal articles on linguistic theory; similarity and paraphrase tasks, such as the Microsoft Research Paraphrase Corpus (MRPC), a corpus of sentence pairs automatically extracted from online news sources with human annotations; and inference tasks, such as the Multi-Genre Natural Language Inference Corpus (MNLI), a crowdsourced collection of sentence pairs with textual entailment annotations. As a multi-task benchmark and analysis platform for NLU, GLUE is usually applied to evaluate the performance of models, as it covers a diverse range of NLP datasets (Sun, Wang, Li, Feng, Tian, et al., 2019).

SQuAD

The Stanford Question Answering Dataset (SQuAD) provides a paragraph of context and a question. It consists of 100,000+ questions posed by crowdworkers on a set of Wikipedia articles, where the answer to each question is a segment of text from the corresponding reading passage. Its authors report: "We build a strong logistic regression model, which achieves an F1 score of 51.0%, a significant improvement over a simple baseline (20%). However, human performance (86.8%) is much higher, indicating that the dataset presents a good challenge problem for future research." (Rajpurkar et al., 2016).

BLEU

The Bilingual Evaluation Understudy (BLEU) is a method for automatic evaluation of machine translation that is quick, inexpensive and language-independent, in contrast with human evaluations, which are extensive but expensive, can take months to finish and involve labor that cannot be reused. The method uses the BLEU score as a metric, which ranges from 0 to 1. Few translations will attain a score of 1 unless they are identical to a reference translation; for this reason, even a human translator will not necessarily score 1.

The method works by comparing the n-grams of the candidate translation with the n-grams of the reference translations and counting the number of matches. The primary programming task for a BLEU implementor is exactly this comparison. These matches are position-independent: the more matches, the better the candidate is (Papineni et al., 2002).
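
To make the clipped counting concrete, here is a minimal, illustrative sketch of BLEU's modified unigram precision in Python (a simplification for exposition: no higher-order n-grams and no brevity penalty), using the over-generation example from Papineni et al. (2002):

from collections import Counter

def modified_unigram_precision(candidate, references):
    # Clip each candidate word count by its maximum count in any single reference.
    cand_counts = Counter(candidate.split())
    max_ref_counts = Counter()
    for ref in references:
        for word, count in Counter(ref.split()).items():
            max_ref_counts[word] = max(max_ref_counts[word], count)
    clipped = sum(min(count, max_ref_counts[word]) for word, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

candidate = "the the the the the the the"
references = ["the cat is on the mat", "there is a cat on the mat"]
print(modified_unigram_precision(candidate, references))   # 2/7, about 0.29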

Matt Post (2019) from Amazon Research, in his article "A Call for Clarity in Reporting BLEU Scores", mentions that the BLEU score is in fact a parameterized metric whose values can vary wildly with changes to those parameters. He also says these parameters are often not reported or are hard to find, and consequently BLEU scores between papers cannot be directly compared. So he suggests that machine translation researchers settle upon the BLEU scheme used by the annual Conference on Machine Translation (WMT), which does not allow for user-supplied reference processing. He also points out several replicability issues: as it has been used, BLEU does not signify a single concrete method but a constellation of parameterized methods; scores computed against differently processed references are not comparable; and papers vary in the hidden parameters and schemes they use, yet often do not report them. Even so, Kishore Papineni, the proposer of the BLEU score, knew about the remaining issues: machine translation systems can over-generate "reasonable" words, resulting in improbable, but high-precision, translations. "Intuitively the problem is clear: a reference word should be considered exhausted after a matching candidate word is identified. We formalize this intuition as the modified unigram precision." (Papineni et al., 2002).

Artificial Intelligence — Machine Learning

When we talk about Artificial Intelligence, we are still far from a concept such as consciousness or a process of thinking. Even so, what we can use right now to extract value is really helpful and provides solutions to daily tasks. Artificial Intelligence is related to machine learning and deep learning models, which can be supervised, generally used to build pre-trained models offline with classification and regression techniques, or unsupervised, normally used to analyze data in real time, with their clustering decisions grouping data together and providing the tags that can be necessary to implement supervised models. Also, machine learning models can be predominantly categorized as generative or discriminative. Generative methods can generate synthetic data because they create rich models of probability distributions. Discriminative methods are more functional, are good at estimating posterior probabilities and are based on observations (Khurana et al., 2017).

Neural Networks

As part of the evolution of NLP techniques, we have passed from straightforward traditional language models through statistical models, such as the Hidden Markov Model, which estimates missing (hidden) values from the observations available at the moment, to the neural probabilistic language models in use today.

Artificial Neural Networks (ANNs) have been a big shot in the NLP area, as they allow massively parallel computation for data processing and knowledge representation. Although ANNs are a drastic abstraction, the idea is not to replicate the operation of the biological system but to make use of what is known about the functionality of its biological counterparts for solving complex problems. The attractiveness of ANNs comes from information processing characteristics inspired by the biological system: nonlinearity, which allows a better fit to the data; high parallelism for fast processing; and learning and adaptivity, which allow the system to update its internal structure in response to a changing environment, as in long short-term memory (LSTM) networks. One more characteristic, generalization, enables models to respond properly to data they have not learned from (Basheer & Hajmeer, 2000).

Neural sequence models, usually left-to-right, have been successfully applied to many conditional generation tasks, such as machine translation, speech recognition and synthesis, and image captioning. Much of the prior work in this area follows the seq2seq encoder-decoder paradigm, where an encoder builds a representation of an observed sequence x, and a decoder gives the conditional output distribution p(y | x) according to a predetermined factorization (Chan et al., 2019).
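
For a left-to-right model, that predetermined factorization is usually just the chain rule over the target tokens (the standard formulation, stated here for reference rather than quoted from Chan et al.):

p(y | x) = p(y1 | x) · p(y2 | y1, x) · … · p(yT | y1, …, yT−1, x)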

Pre-training and Fine-tuning

Language representation pre-training has been shown to be effective for improving many natural language processing tasks, such as named entity recognition, sentiment analysis and question answering. In order to get reliable word representations, neural language models are designed to learn word co-occurrence and then obtain word embeddings with unsupervised learning (Sun, Wang, Li, Feng, Chen, et al., 2019). Pre-training of NLP models with a language modeling objective has recently gained popularity as a precursor to task-specific fine-tuning. By virtue of fine-tuning with task-specific supervised data, the pre-trained model can be adapted to different language understanding tasks, such as question answering, natural language inference and semantic similarity (Sun, Wang, Li, Feng, Tian, et al., 2019).

Talking about NLP, we find that techniques using ANNs have reached a new state of the art for a wide range of tasks such as text classification, machine translation, semantic role labeling, coreference resolution and semantic parsing (Gardner et al., 2019).

In particular, recurrent neural networks (RNNs), LSTMs and gated recurrent neural networks have been firmly established as state-of-the-art approaches in sequence modeling and transduction problems such as language modeling and machine translation, but improvements in performance are achieved month after month. LSTMs are now considered slow to train in comparison with attention mechanisms such as Transformers, which are more efficient, easier to train and ready to fine-tune for specific tasks.

Attention is all you need

Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing the modeling of dependencies without regard to their distance in the input or output sequences. That's what Ashish Vaswani (2017), a member of the Google Brain research team, mentioned in the article "Attention Is All You Need". The main purpose of an attention function is to map a query and a set of key-value pairs to an output, where the query, keys, values and output are all vectors. The output is a weighted sum of the values, where the attention function attends to information from different representations at different positions.

In practice, we compute the attention function on a set of queries simultaneously, packed together into a matrix Q. The keys and values are also packed together into matrices K and V. The input consists of queries and keys of dimension dk, and values of dimension dv. We compute the matrix of outputs as:
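
The equation from Vaswani et al. (2017) is the scaled dot-product attention:

Attention(Q, K, V) = softmax(Q Kᵀ / √dk) V

A small NumPy sketch of the same computation (illustrative only: a single head, no masking, no learned projections):

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V for 2-D arrays shaped (sequence_length, dimension)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # (len_q, len_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ V                                   # (len_q, d_v)

Q = np.random.randn(4, 8)
K = np.random.randn(6, 8)
V = np.random.randn(6, 16)
print(scaled_dot_product_attention(Q, K, V).shape)       # (4, 16)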

It was in 2017 that Google Research's team published a paper reporting new state-of-the-art results by applying a neural network architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Their proposal, the Transformer, was the first sequence transduction model relying entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention. Since then, it has been used in the field as the basis of better models (Vaswani et al., 2017).

Google, with the big Transformer model, Transformer (big), established a new state-of-the-art BLEU score of 28.4 on the WMT 2014 English-to-German (EN-DE) translation task and 41.8 on the WMT 2014 English-to-French (EN-FR) task, as shown in Table 1, where a comparison with contemporary state-of-the-art models visualizes the difference in BLEU score.

Table 1. Comparison of BLEU score with contemporary state-of-the-art models.

The transformer can capture the contextual information for each token in the sequence via self-attention, and generate a sequence of contextual embeddings. (Sun, Wang, Li, Feng, Tian, et al., 2019)

It's a curious and funny trend among NLP researchers: the inside joke of naming language models after Muppets from the Sesame Street show. How so? Up to date, there is a new breed of language models named ELMo, BERT, Grover, Big BIRD, Rosita, RoBERTa, ERNIE and KERMIT. Anyone would say "It's OK, some geek chose the names, nothing formal", but it becomes interesting when we find that Google, Facebook and Allen NLP, to name a few, are involved in this naming convention, which is so well established and makes LMs easier to remember. Let's review some of these language models in order to get information about the state-of-the-art improvements in NLP.

ELMo

All this joke, or naming convention, started with the ELMo (Embeddings from Language Models) representations, published in 2018 in a paper led by Matthew Peters (2018), which introduced a new type of deep contextualized word representation that models both complex characteristics of word use, such as syntax and semantics, and how these uses vary across linguistic contexts. The authors showed that these representations can be easily added to existing models and significantly improve the state of the art in every case considered, across a range of challenging language understanding problems such as question answering, textual entailment and sentiment analysis, as shown in Table 2. ELMo word representations are deep in the sense that they are a function of the entire input sentence and of all the internal layers of the bidirectional language model (biLM), which is pre-trained on a large text corpus (Peters et al., 2018).

BERT

After that, ELMo could have remained a one-time event or an isolated case if not for BERT, a language model developed by Google's AI team some months later in 2018.

Pre-trained word embeddings are an integral part of modern NLP systems, offering significant improvements over embeddings learned from scratch. To pretrain word embedding vectors, left-to-right language modeling objectives have been used, as well as objectives to discriminate correct from incorrect words in left and right context. (Devlin et al., 2018)

Much of the recent success in NLP and of the incoming state-of-the-art results is due to the transfer learning paradigm, where Transformer-based models first learn task-independent linguistic knowledge from large text corpora, let's say the basis of communication, and then get fine-tuned on small datasets for specific tasks. This enhancement allows researchers and companies to narrow down the scope in order to get specific information or to develop NLP solutions as a service or product.
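
As a rough sketch of what that fine-tuning step looks like in code, the snippet below adapts a pre-trained BERT checkpoint to a two-class sentiment task with the Hugging Face transformers library; the checkpoint name and the toy one-step training loop are illustrative assumptions, not the setup of any cited paper:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Task-independent pre-trained checkpoint plus a fresh classification head.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# A toy labeled batch standing in for a small task-specific dataset.
texts = ["a delightful movie", "a complete waste of time"]
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# One gradient step of fine-tuning: all pre-trained weights are updated.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss = model(**batch, labels=labels).loss
loss.backward()
optimizer.step()
print(float(loss))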

BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement). (Devlin et al., 2018)

Despite the improvements on NLP tasks, and since researchers have been quick to publish updates, evaluations of the BERT model showed it was not the ultimate model but a pioneer, as its issues are detected and enhancements proposed really fast. In the paper "When BERT Plays the Lottery, All Tickets Are Winning", a study finds that even "bad" subnetworks of a fine-tuned BERT model can be fine-tuned separately to achieve only slightly worse performance than the "good" ones, as their weights are potentially useful. Another example is the case described in Allyson Ettinger's paper "What BERT Is Not", where she says we have yet to understand exactly what linguistic capacities pre-training processes confer upon language models.

As a case study, we apply these diagnostics to the popular BERT model, finding that it can generally distinguish good from bad completions involving shared category or role reversal, albeit with less sensitivity than humans, and it robustly retrieves noun hypernyms, but it struggles with challenging inferences and role-based event prediction — and in particular, it shows clear insensitivity to the contextual impacts of negation. (Ettinger, 2020)

Ettinger found that though BERT was very good at the positive ("is"), it definitely struggled with the negative ("is not"). It was able to associate nouns with close immediate descriptors, but it could not handle the difference between positive and negative phrasing. Here is an example from the paper: "A robin is a ____" (BERT Large predictions: bird, robin, person, hunter, pigeon); "A robin is not a ____" (BERT Large predictions: bird, robin, person, hunter, pigeon). It seems that BERT has difficulties with a task that is easy for a human: it is awesome at certain tasks but surprisingly fails on easy ones involving negative phrasing.
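
This kind of cloze probe is easy to reproduce with the Hugging Face fill-mask pipeline; a minimal sketch follows (the checkpoint name is an assumption, and the exact ranking will depend on the model you load):

from transformers import pipeline

# Cloze-style probe of a masked language model, in the spirit of Ettinger's diagnostics.
fill = pipeline("fill-mask", model="bert-base-uncased")

for prompt in ["A robin is a [MASK].", "A robin is not a [MASK]."]:
    predictions = fill(prompt)
    print(prompt, [p["token_str"] for p in predictions])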

Our experiments show that while high lexical overlap between languages improves transfer, M-BERT is also able to transfer between languages written in different scripts — thus having zero lexical overlap — indicating that it captures multilingual representations. We further show that transfer works best for typologically similar languages, suggesting that while M-BERT’s multilingual representation is able to map learned structures onto new vocabularies, it does not seem to learn systematic transformations of those structures to accommodate a target language with different word order. (Pires et al., 2020)

Rosita

Introduced with Phoebe Mulcaire as lead author, Rosita, a multilingual extension of ELMo, is a method to produce multilingual contextual word representations (CWRs) by training a single "polyglot" language model on text in multiple languages. It was named after the bilingual character from Sesame Street. "While crosslingual transfer in neural network models is a promising direction, the best blend of polyglot and language-specific elements may depend on the task and architecture" (Mulcaire et al., 2019). When the term crosslingual transfer is used in the paper, it refers to any method that uses one or more source languages to help process another target language.

ERNIE

Inspired by the masking strategy of BERT (the last word is a [MASK]), the Enhanced Representation through Knowledge Integration model (ERNIE), proposed by a Baidu research team, was designed to learn language representations enhanced by knowledge masking strategies, which include entity-level masking and phrase-level masking. The entity-level strategy masks entities, which are usually composed of multiple words; the phrase-level strategy masks a whole phrase, composed of several words standing together as a conceptual unit (Sun, Wang, Li, Feng, Chen, et al., 2019). The main difference with this language representation model is that most previous studies model the representations by predicting the missing word only through the context, ignoring the prior knowledge in the sentence. Taking phrases or entities of several words as the unit to mask and predict is a strategy that achieves new state-of-the-art results on NLP tasks including natural language inference, semantic similarity, named entity recognition, sentiment analysis and question answering.

Under its word-aware pre-training tasks (knowledge masking), "ERNIE 1.0 proposed an effective strategy to enhance representation through knowledge integration. It introduced phrase masking and named entity masking and predicts the whole masked phrases and named entities to help the model learn the dependency information in both local contexts and global contexts. We use this task to train an initial version of the model." (Sun, Wang, Li, Feng, Tian, et al., 2019)

To give an example from the sentence "Tender Surrender is a song composed by Steve Vai": it is easy for the model to predict missing or masked words inside the entity Tender Surrender by word collocation or dependency, but it cannot predict them using the relationship between Tender Surrender and Steve Vai. As mentioned in the article, it is intuitive that if the model learns more prior knowledge, it can obtain a more reliable language representation.
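
Purely as an illustration of the difference between token-level and entity-level masking (a toy sketch, not Baidu's implementation):

import random

def token_level_mask(tokens, mask_rate=0.15, mask_token="[MASK]"):
    # BERT-style: each token is masked independently.
    return [mask_token if random.random() < mask_rate else t for t in tokens]

def entity_level_mask(tokens, entities, mask_token="[MASK]"):
    # ERNIE-style idea: mask every token of a chosen multi-word entity at once.
    entity = random.choice(entities)           # e.g. ("Tender", "Surrender")
    out, i = [], 0
    while i < len(tokens):
        if tuple(tokens[i:i + len(entity)]) == entity:
            out.extend([mask_token] * len(entity))
            i += len(entity)
        else:
            out.append(tokens[i])
            i += 1
    return out

sentence = "Tender Surrender is a song composed by Steve Vai".split()
entities = [("Tender", "Surrender"), ("Steve", "Vai")]
print(token_level_mask(sentence))
print(entity_level_mask(sentence, entities))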

ERNIE 2.0

Some months after ERNIE's paper release, Baidu published "ERNIE 2.0: A Continual Pre-Training Framework for Language Understanding", proposing a pre-training framework, ERNIE 2.0, which incrementally and constantly builds pre-training tasks and then learns pre-trained models on these tasks, somewhat like a job process with continuous feedback, via continual multi-task learning that remembers the previously learned tasks, in order to capture the lexical, syntactic and semantic information in the training corpora. It outperforms BERT and another language model, XLNet, on 16 tasks, including English tasks on the GLUE benchmark and similar tasks in Chinese.

Generally, the pre-training of models often trains the model based on the co-occurrence of words and sentences. While in fact, there are other lexical, syntactic and semantic information worth examining in training corpora other than cooccurrence. For example, named entities like person names, location names, and organization names, may contain conceptual information. (Sun, Wang, Li, Feng, Tian, et al., 2019)

Information such as the order of sentences or the proximity between them enables the model to learn structure-aware representations, while semantic similarity and discourse relations allow it to learn semantic-aware representations. The pre-training task construction, or architecture, of ERNIE 2.0 was defined with three different kinds of tasks to capture different aspects of information in the training corpora: word-aware tasks, which enable the model to capture lexical information; structure-aware tasks, which enable the model to capture the syntactic information of the corpus; and semantic-aware tasks, which aim to learn semantic information.

Grover

This language model for controllable text generation, named after another Sesame Street character, "Grover", is intended to verify news in order to detect fake items and respond with an action before they spread at scale. As Natural Language Generation has been progressing model after model, it has made it possible to generate neural fake news for some time now.

The high quality of neural fake news written by Grover, as judged by humans, makes automatic neural fake news detection an important research area. Using models (below) for the role of the Verifier can mitigate the harm of neural fake news by classifying articles as Human or Machine written. These decisions can assist content moderators and end users in identifying likely (neural) disinformation. (Zellers et al., 2019).

In addition to the strategy to verify fake news, this research provided a large corpus of news articles from Common Crawl named RealNews. As Grover needed a large corpus of news with metadata, which was not available at the time, it was necessary to build one, limited to the 5,000 news domains indexed by Google News. RealNews is 120 gigabytes without compression.

KERMIT

The Kontextuell Encoder Representations Made by Insertion Transformations, or KERMIT for short, was presented in the paper "KERMIT: Generative Insertion-Based Modeling for Sequences", published by Google Research. KERMIT consists of a simple architecture, a single Transformer decoder stack, easy to implement as it does not have a separate encoder and decoder, nor does it require causality masks.

Like its close friend BERT (Devlin et al., 2019), KERMIT can also be used for self-supervised representation learning and applied to various language understanding tasks. We follow the same training procedure and hyperparameter setup as BERT-Large. However, instead of masking 15% of the tokens and replacing them with blank tokens like in BERT (Devlin et al., 2019), KERMIT simply drops them out completely from the sequence. (Chan et al., 2019).

KERMIT can generate text in arbitrary order and can generate sequences in logarithmic time. Google's research team found this insertion-based framework for sequences capable of matching or exceeding state-of-the-art performance on tasks such as machine translation, representation learning and zero-shot cloze question answering, in comparison with language models such as BERT and GPT-2, the powerful successor of the Generative Pre-Training (GPT) model.

Big BIRD

As part of the internal joke turned naming convention, we arrive at our last member, presented in another paper published by the Brain Team of Google Research, titled "Big Bidirectional Insertion Representations for Documents", Big BIRD for short. This work proposes an extension of the Insertion Transformer, an insertion-based model, to document-level translation tasks, scaling up from sentences to long-form documents, and shows an improvement in BLEU score on the WMT'19 English-to-German (EN-DE) document-level translation task in comparison with the Insertion Transformer baseline.

Document-Level Machine Translation is becoming an increasingly important task. Recent research suggests we are nearing human-level parity for sentence-level translation in certain domain, however, we lag significantly behind in document-level translation. (Li & Chan, 2019).

There are two primary methods to include context in a document-level machine translation model compared to a sentence-level model: source contextualization, which allows the target sentence to be contextualized by the source document, and target contextualization, which allows the target sentence to be contextualized by other target sentences. The paper also makes two key contributions: extending the context window size to cover a document, and informing the model of sentence positional information, aligned between source and target sentences.

Conclusion

On this guided tour of Sesame Street, we have found that NLP tasks and LMs are getting better in a short time. If we consider attention-based pre-trained models and the contribution of the Transformer as the basis of a new paradigm, it is just starting, and its improvement, along with the detection of its deficiencies across the different NLP tasks, is happening really fast thanks to the research community, which seems to be collaborating more than competing, as if they were playing with this inside joke of naming their models after the Muppets. BERT was the arrowhead and it has been mutating into different versions, some more popular than others, such as RoBERTa from Facebook AI, StructBERT, ALBERT, BERTQA, BETO and AraBERT, just to name a few, and all of these contributions have been adding value, improving performance and achieving new state-of-the-art results.

References

Basheer, I. A., & Hajmeer, M. (2000). Artificial neural networks: Fundamentals, computing, design, and application. Journal of Microbiological Methods, 43(1), 3–31. https://doi.org/10.1016/S0167-7012(00)00201-3

Chan, W., Kitaev, N., Guu, K., Stern, M., & Uszkoreit, J. (2019). KERMIT: Generative Insertion-Based Modeling for Sequences. Section 3, 1–11. http://arxiv.org/abs/1906.01604

Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. http://arxiv.org/abs/1810.04805

Ettinger, A. (2020). What BERT Is Not: Lessons from a New Suite of Psycholinguistic Diagnostics for Language Models. Transactions of the Association for Computational Linguistics, 8, 34–48. https://doi.org/10.1162/tacl_a_00298

Gardner, M., Grus, J., Neumann, M., Tafjord, O., Dasigi, P., Liu, N. F., Peters, M., Schmitz, M., & Zettlemoyer, L. (2019). AllenNLP: A Deep Semantic Natural Language Processing Platform. 1–6. https://doi.org/10.18653/v1/w18-2501

Khurana, D., Koli, A., Khatter, K., & Singh, S. (2017). Natural Language Processing: State of The Art, Current Trends and Challenges. http://arxiv.org/abs/1708.05148

Li, L., & Chan, W. (2019). Big Bidirectional Insertion Representations for Documents. 194–198. https://doi.org/10.18653/v1/d19-5620

Mariani, J., Francopoulo, G., & Paroubek, P. (2019). The NLP4NLP Corpus (I): 50 Years of Publication, Collaboration and Citation in Speech and Language Processing. Frontiers in Research Metrics and Analytics, 3(February). https://doi.org/10.3389/frma.2018.00036

Mulcaire, P., Kasai, J., & Smith, N. A. (2019). Polyglot Contextual Representations Improve Crosslingual Transfer. 3912–3918. https://doi.org/10.18653/v1/n19-1392

Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep contextualized word representations. NAACL HLT 2018–2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies — Proceedings of the Conference, 1, 2227–2237. https://doi.org/10.18653/v1/n18-1202

Pires, T., Schlinger, E., & Garrette, D. (2020). How multilingual is multilingual BERT? ACL 2019–57th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, 4996–5001. https://doi.org/10.18653/v1/p19-1493

Rajpurkar, P., Zhang, J., Lopyrev, K., & Liang, P. (2016). SQuAD: 100,000+ Questions for Machine Comprehension of Text. EMNLP 2016 — Conference on Empirical Methods in Natural Language Processing, Proceedings, 2383–2392.

Sun, Y., Wang, S., Li, Y., Feng, S., Chen, X., Zhang, H., Tian, X., Zhu, D., Tian, H., & Wu, H. (2019). ERNIE: Enhanced Representation through Knowledge Integration. http://arxiv.org/abs/1904.09223

Sun, Y., Wang, S., Li, Y., Feng, S., Tian, H., Wu, H., & Wang, H. (2019). ERNIE 2.0: A Continual Pre-training Framework for Language Understanding. http://arxiv.org/abs/1907.12412

Valin, R. D. Van. (1980). From NLP to NLU. 1–7.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 2017-December (NIPS), 5999–6009.

Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., & Bowman, S. R. (2019). GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. ICLR, 1–20. https://openreview.net/pdf?id=rJ4km2R5t7

Zellers, R., Holtzman, A., Rashkin, H., Bisk, Y., Farhadi, A., Roesner, F., & Choi, Y. (2019). Defending Against Neural Fake News. 1–21. http://arxiv.org/abs/1905.12616

PS: In order to improve my writing I'm open to feedback 😀



Natural Language Processing in Production: 27 Fast Text Pre-Processing Methods


Estimates state that 70%–85% of the world’s data is text (unstructured data) [1]. New deep learning language models (transformers) have caused explosive growth in industry applications [5,6,11].

This blog is not an article introducing you to Natural Language Processing. Instead, it assumes you are familiar with noise reduction and normalization of text. It covers text preprocessing up to producing tokens and lemmas from the text.

We stop at feeding the sequence of tokens into a Natural Language model.

The feeding of that sequence of tokens into a Natural Language model to accomplish a specific model task is not covered here.

In production-grade Natural Language Processing (NLP), fast text pre-processing (noise cleaning and normalization), which is what this blog covers, is critical.

  1. I discuss packages we use for production-level NLP;
  2. I detail the production-level NLP text pre-processing tasks with Python code and packages;
  3. Finally, I report benchmarks for NLP text pre-processing tasks.

Dividing NLP Processing into Two Steps

We segment NLP into two major steps (for the convenience of this article):

  1. Text pre-processing into tokens. We clean (noise removal) and then normalize the text. The goal is to transform the text into a corpus that any NLP model can use, a goal rarely achieved before the introduction of the transformer [2].
  2. A corpus is an input (text preprocessed into a sequence of tokens) into NLP models for training or prediction.

The rest of this article is devoted to noise removal and normalization of text into tokens/lemmas (Step 1: text pre-processing). Noise removal deletes or transforms things in the text that degrade the NLP task model. It is usually NLP task-dependent. For example, e-mail addresses may or may not be removed depending on whether it is a text classification task or a text redaction task. We'll cover both replacement and removal of the noise.

Normalization of the corpus is transforming the text into a common form. The most frequent example is normalization by transforming all characters to lowercase. In follow-on blogs, we will cover different deep learning language models and Transformers (Steps 2-n) fed by the corpus token/lemma stream.

NLP Text Pre-Processing Package Factoids

There are many NLP packages available. We use spaCy [2], textacy [4], Hugging Face transformers [5], and regex [7] in most of our NLP production applications. The following are some of the “factoids” we used in our decision process.

Note: The following “factoids” may be biased. That is why we refer to them as “factoids.”

NLTK [3]

  • NLTK is a string processing library. All the tools take strings as input and return strings or lists of strings as output [3].
  • NLTK is a good choice if you want to explore different NLP with a corpus whose length is less than a million words.
  • NLTK is a bad choice if you want to go into production with your NLP application [3].

Regex

The use of regex is pervasive throughout our text-preprocessing code. Regex is a fast string processor. Regex, in various forms, has been around for over 50 years. Regex support is part of the standard library of Java and Python, and is built into the syntax of others, including Perl and ECMAScript (JavaScript).
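
As a tiny, illustrative example of the kind of regex substitution used throughout the pre-processing steps below (the whitespace-collapsing pattern here is my own assumption, not one of the article's later patterns):

import re

# Collapse runs of whitespace (spaces, tabs, newlines) into a single space.
RE_WHITESPACE = re.compile(r"\s+")

text = "too   much \n whitespace\t here"
print(RE_WHITESPACE.sub(" ", text).strip())   # "too much whitespace here"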

spaCy [2]

  • spaCy is a moderate choice if you want to research different NLP models with a corpus whose length is greater than a million words.
  • If you use a selection from spaCy [3], Hugging Face [5], fast.ai [13], and GPT-3 [6], then you are performing SOTA (state-of-the-art) research of different NLP models (my opinion at the time of writing this blog).
  • spaCy is a good choice if you want to go into production with your NLP application (see the short usage sketch after this list).
  • spaCy is an NLP library implemented both in Python and Cython. Because of the Cython, parts of spaCy are faster than if implemented in pure Python [3];
  • spaCy is the fastest package we know of for NLP operations;
  • spaCy is available for MS Windows, macOS, and Ubuntu [3];
  • spaCy runs natively on Nvidia GPUs [3];
  • explosion/spaCy has 16,900 stars on GitHub (7/22/2020);
  • spaCy has 138 public repository implementations on GitHub;
  • spaCy comes with pre-trained statistical models and word vectors;
  • spaCy transforms text into document objects, vocabulary objects, word-token objects, and other useful objects resulting from parsing the text;
  • The Doc class has several useful attributes and methods. Significantly, you can create new operations on these objects as well as extend a class with new attributes (adding to the spaCy pipeline);
  • spaCy features tokenization for 50+ languages;
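
A minimal sketch of what that parsing looks like, assuming the small English model en_core_web_sm has been downloaded (python -m spacy download en_core_web_sm):

import spacy

nlp = spacy.load("en_core_web_sm")          # tokenizer, tagger, parser, NER pipeline
doc = nlp("The striped bats were hanging on their feet.")

# Each token carries its lemma, part of speech and other attributes.
for token in doc:
    print(token.text, token.lemma_, token.pos_)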


Creating long_s Practice Text String

We create long_s, a long string that has extra whitespace, emoji, email addresses, $ symbols, HTML tags, punctuation, and other text that may or may not be noise for the downstream NLP task and/or model.

MULPIPIER = int(3.8e3)
text_l = 300
%time long_s = ':( 😻 😈 #google +1 608-444-0000 08-444-0004 608-444-00003 ext. 508 '
long_s += ' 888 eihtg DoD Fee https://medium.com/ #hash ## Document Title</title> '
long_s += ':( cat- \n nip'
long_s += ' immed- \n natedly <html><h2>2nd levelheading</h2></html> . , '
long_s += '# bhc@gmail.com f@z.yx can\'t Be a ckunk. $4 $123,456 won\'t seven '
long_s += ' $Shine $$beighty?$ '
long_s *= MULPIPIER
print('size: {:g} {}'.format(len(long_s),long_s[:text_l]))

output =>

CPU times: user 3 µs, sys: 1 µs, total: 4 µs
Wall time: 8.11 µs
size: 1.159e+06 :( 😻 😈 #google +1 608-444-0000 08-444-0004 608-444-00003 ext. 508 888 eihtg DoD Fee https://medium.com/ #hash ## Document Title</title> :( cat- nip immed- natedly <html><h2>2nd levelheading</h2></html> . , # bhc@gmail.com f@z.yx can't Be a ckunk. $4 $123,456 won't seven $Shine $$beigh

A string, long_s of 1.159 million characters is created in 8.11 µs.

Python String Corpus Pre-processing Step and Benchmarks

All benchmarks are run within a Docker container on MacOS Version 14.0 (14.0).

Model Name: Mac Pro
Processor Name: 12-Core Intel Xeon E5
Processor Speed: 2.7 GHz
Total Number of Cores: 24
L2 Cache (per Core): 256 KB
L3 Cache: 30 MB
Hyper-Threading Technology: Enabled
Memory: 64 GB

Note: Corpus/text pre-processing is dependent on the end-point NLP analysis task. Sentiment analysis requires different corpus/text pre-processing steps than document redaction. The corpus/text pre-processing steps given here are for a range of NLP analysis tasks. Usually, a subset of the given corpus/text pre-processing steps is needed for each NLP task. Also, some of the required corpus/text pre-processing steps may not be given here.

1. NLP text preprocessing: Replace Twitter Hash Tags

from textacy.preprocessing.replace import replace_hashtags
%time text = replace_hashtags(long_s,replace_with= 'HASH')
print('size: {:g} {}'.format(len(text),text[:text_l]))

output =>

CPU times: user 223 ms, sys: 66 µs, total: 223 ms
Wall time: 223 ms
size: 1.159e+06 :
( 😻 😈 _HASH_ +1 608-444-0000 08-444-0004 608-444-00003 ext. 508 888 eihtg DoD Fee https://medium.com/ _HASH_ ## Document Title</title> :( cat- nip immed- natedly <html><h2>2nd levelheading</h2></html> . , # bhc@gmail.com f@z.yx can't Be a ckunk. $4 $123,456 won't seven $Shine $$beigh

Notice that #google and #hash are swapped with _HASH_, and ## and _# are untouched. A million characters were processed in about 200 ms, fast enough for a big corpus of a billion characters (example: a web server log).

2. NLP text preprocessing: Remove Twitter Hash Tags

from textacy.preprocessing.replace import replace_hashtags
%time text = replace_hashtags(long_s,replace_with= '')
print('size: {:g} {}'.format(len(text),text[:text_l]))

output =>

CPU times: user 219 ms, sys: 0 ns, total: 219 ms
Wall time: 220 ms
size: 1.1134e+06 :( 😻 😈 +1 608-444-0000 08-444-0004 608-444-00003 ext. 508 888 eihtg DoD Fee https://medium.com/ ## Document Title</title> :( cat- nip immed- natedly <html><h2>2nd levelheading</h2></html> . , # bhc@gmail.com f@z.yx can't Be a ckunk. $4 $123,456 won't seven $Shine $$beighty?$

Notice that #google and #hash are removed, and ## and _# are untouched. A million characters were processed in about 200 ms.

3. NLP text preprocessing: Replace Phone Numbers

from textacy.preprocessing.replace import replace_phone_numbers
%time text = replace_phone_numbers(long_s,replace_with= 'PHONE')
print('size: {:g} {}'.format(len(text),text[:text_l]))

output =>

CPU times: user 384 ms, sys: 1.59 ms, total: 386 ms
Wall time: 383 ms
size: 1.0792e+06
:( 😻 😈 PHONE 08-PHONE 608-444-00003 ext. 508 888 eihtg

Notice that the phone numbers 08-444-0004 and 608-444-00003 ext. 508 were not transformed.

4. NLP text preprocessing: Replace Phone Numbers – better

import re
from typing import Pattern

RE_PHONE_NUMBER: Pattern = re.compile(
    # core components of a phone number
    r"(?:^|(?<=[^\w)]))(\+?1[ .-]?)?(\(?\d{2,3}\)?[ .-]?)?(\d{2,3}[ .-]?\d{2,5})"
    # extensions, etc.
    r"(\s?(?:ext\.?|[#x-])\s?\d{2,6})?(?:$|(?=\W))",
    flags=re.UNICODE | re.IGNORECASE)
%time text = RE_PHONE_NUMBER.sub('_PHoNE_', long_s)
print('size: {:g} {}'.format(len(text),text[:text_l]))

output =>

CPU times: user 353 ms, sys: 0 ns, total: 353 ms
Wall time: 350 ms
size: 1.0108e+06 :( 😻 😈 _PHoNE_ _PHoNE_ _PHoNE_ 888 eihtg DoD Fee https://medium.com/ ## Document Title</title> :( cat- nip immed- natedly <html><h2>2nd levelheading</h2></html> . , # bhc@gmail.com f@z.yx can't Be a ckunk. $4 $123,456 won't seven $Shine $$beighty?$

Notice that the phone numbers 08-444-0004 and 608-444-00003 ext. 508 were transformed this time. A million characters were processed in 350 ms.

5. NLP text preprocessing: Remove Phone Numbers

Using the improved RE_PHONE_NUMBER pattern, we pass '' instead of '_PHoNE_' to remove phone numbers from the corpus.

%time text = RE_PHONE_NUMBER.sub('', long_s)
print('size: {:g} {}'.format(len(text),text[:text_l]))

output =>

CPU times: user 353 ms, sys: 459 µs, total: 353 ms
Wall time: 351 ms
size: 931000 :( 😻 😈 888 eihtg DoD Fee https://medium.com/ ## Document Title</title> :( cat- nip immed- natedly <html><h2>2nd levelheading</h2></html> . , # bhc@gmail.com f@z.yx can't Be a ckunk. $4 $123,456 won't seven $Shine $$beighty?$

A million characters were processed in 375 ms.

6. NLP text preprocessing: Removing HTML metadata

I admit removing HTML metadata is my favorite. Not because I like the task, but because I screen-scrape frequently. There is a lot of useful data that resides on an IBM mainframe, VAX-780 (huh?), or whatever terminal-emulation that results in an HTML-based report.

These web-scraped reports generate text that has HTML tags. HTML tags are typically considered noise, as they are parts of the text with little or no value for the follow-on NLP task.

Remember, we created a test string (long_s) a little over a million characters long with some HTML tags. We remove the HTML tags using BeautifulSoup.

from bs4 import BeautifulSoup
%time long_s = BeautifulSoup(long_s,'html.parser').get_text()
print('size: {:g} {}'.format(len(long_s),long_s[:text_l]))

output =>

CPU times: user 954 ms, sys: 17.7 ms, total: 971 ms
Wall time: 971 ms
size: 817000 :( 😻 😈 888 eihtg DoD Fee https://medium.com/ ## Document Title :( cat- nip immed- natedly 2nd levelheading 

The result is that BeautifulSoup is able to remove over 7,000 HTML tags in a million-character corpus in about one second. Scaling linearly, a billion-character corpus, about 200 million words, or approximately 2,000 books, would require about 200 seconds.

The rate for HTML tag removal by BeautifulSoup is about 0.1 seconds per book, an acceptable rate for our production requirements.

I only benchmark BeautifulSoup. If you know of a competitive alternative method, please let me know.

Note: The compute times you get may be multiples of time longer or shorter if you are using the cloud or Spark.

7. NLP text preprocessing: Replace currency symbol

The currency symbols “[$¢£¤¥ƒ֏؋৲৳૱௹฿៛ℳ元円圆圓﷼\u20A0-\u20C0]” are replaced with _CUR_ using the textacy package:

%time textr = textacy.preprocessing.replace.replace_currency_symbols(long_s)
print('size: {:g} {}'.format(len(textr),textr[:text_l]))

output =>

CPU times: user 31.2 ms, sys: 1.67 ms, total: 32.9 ms
Wall time: 33.7 ms
size: 908200 :( 😻 😈 888 eihtg DoD Fee https://medium.com/ ## Document Title :( cat- nip immed- natedly 2nd levelheading . , # bhc@gmail.com f@z.yx can't Be a ckunk. _CUR_4 _CUR_123,456 won't seven _CUR_Shine _CUR__CUR_beighty?_CUR_

Note: The option textacy replace_<something> enables you to specify the replacement text. _CUR_ is the default substitution text for replace_currency_symbols.

You may have the currency symbol $ in your text. In this case you can use a regex:

%time text = re.sub(r'\$', '_DOL_', long_s)
print('size: {:g} {}'.format(len(text),text[:250]))

output =>

CPU times: user 8.06 ms, sys: 0 ns, total: 8.06 ms
Wall time: 8.25 ms
size: 1.3262e+06 :( 😻 😈 #google +1 608-444-0000 08-444-0004 608-444-00003 ext. 508 888 eihtg DoD Fee https://medium.com/ #hash ## <html><title>Document Title</title></html> :( cat- nip immed- natedly <html><h2>2nd levelheading</h2></html> . , # bhc@gmail.com f@z.yx can't Be a ckunk. _DOL_4 _DOL_123,456 won't seven _DOL_Shine _DOL__DOL_beighty?_DOL_ :

Note: Every $ symbol in your text will be replaced. Don't use this if you have LaTeX source or any other text where $ carries a different meaning.
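
If dollar amounts are all you want to tag, a slightly narrower pattern (a sketch of mine, not from the benchmarks above) only matches $ when it is followed by a digit, so LaTeX delimiters such as $x + y$ usually survive:

import re

# Hypothetical variant: replace $ only when it starts a number, e.g. $4 or $123,456.
RE_DOLLAR_AMOUNT = re.compile(r'\$(?=\d)')

sample = "It cost $4 and then $123,456, but $x + y$ stays intact."
print(RE_DOLLAR_AMOUNT.sub('_DOL_', sample))
# => It cost _DOL_4 and then _DOL_123,456, but $x + y$ stays intact.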

8. NLP text preprocessing: Replace URL String

from textacy.preprocessing.replace import replace_urls
%time text = replace_urls(long_s,replace_with= '_URL_')
print('size: {:g} {}'.format(len(text),text[:text_l]))

output =>

CPU times: user 649 ms, sys: 112 µs, total: 649 ms
Wall time: 646 ms
size: 763800
:( 😻 😈 888 eihtg DoD Fee _URL_ ## Document Title :(

9. NLP text preprocessing: Remove URL String

from textacy.preprocessing.replace import replace_urls
%time text = replace_urls(long_s,replace_with= '')
print('size: {:g} {}'.format(len(text),text[:text_l]))

output =>

CPU times: user 633 ms, sys: 1.35 ms, total: 635 ms
Wall time: 630 ms
size: 744800
:( 😻 😈 888 eihtg DoD Fee ## Document Title :(

The rate for URL replacement or removal is about 4,000 URLs per 1 million characters per second, fast enough for a corpus of 10 books.

10. NLP text preprocessing: Replace E-mail string

%time text = textacy.preprocessing.replace.replace_emails(long_s)
print('size: {:g} {}'.format(len(text),text[:text_l]))

output =>

CPU times: user 406 ms, sys: 125 µs, total: 406 ms
Wall time: 402 ms
size: 725800
:( 😻 😈 888 eihtg DoD Fee ## Document Title :( cat-
nip immed-
natedly 2nd levelheading . , # _EMAIL_ _EMAIL_ can't Be a ckunk. $4 $123,456 won't seven $Shine $$beighty?$

The rate for e-mail reference replacement is about 8,000 e-mails per 1.7 million characters per second, fast enough for a corpus of 17 books.

11. NLP text pre-processing: Remove E-mail string

from textacy.preprocessing.replace import replace_emails

%time text = replace_emails(long_s, replace_with='')
print('size: {:g} {}'.format(len(text),text[:text_l]))

output =>

CPU times: user 413 ms, sys: 1.68 ms, total: 415 ms
Wall time: 412 ms
size: 672600 :( 😻 😈 888 eihtg DoD Fee ## Document Title :( cat-
nip immed-
natedly 2nd levelheading . , # can't Be a ckunk. $4 $123,456 won't seven $Shine $$beighty?$

The rate for e-mail reference removal is about 8,000 e-mails per 1.1 million characters per second, fast enough for a corpus of 11 books.

12. NLP text preprocessing: normalize_hyphenated_words

from textacy.preprocessing.normalize import normalize_hyphenated_words
%time long_s = normalize_hyphenated_words(long_s)
print('size: {:g} {}'.format(len(long_s),long_s[:text_l]))

output =>

CPU times: user 186 ms, sys: 4.58 ms, total: 191 ms
Wall time: 190 ms
size: 642200 :
( 😻 😈 888 eihtg DoD Fee ## Document Title :( catnip immednatedly

Approximately 8,000 hyphenated words, such as cat- nip and immed- natedly (a misspelling), were rejoined in a corpus of 640,000 characters in 190 ms, or about 3 million characters per second.

13. NLP text preprocessing: Convert all characters to lower case

# all characters to lower case
%time long_s = long_s.lower()
print('size: {:g} {}'.format(len(long_s),long_s[:text_l]))

output =>

CPU times: user 4.82 ms, sys: 953 µs, total: 5.77 ms
Wall time: 5.97 ms
size: 642200
:( 😻 😈 888 eihtg dod fee ## document title :( catnip immednatedly 2nd levelheading . , # can't be a ckunk. $4 $123,456 won't seven $shine $$beighty?$

I only benchmark the built-in Python .lower() method. Lower-casing a Python string of a million characters takes about 6 ms, a rate that far exceeds our production requirements.

14. NLP text preprocessing: Whitespace Removal

%time text = re.sub(' +', ' ', long_s)
print('size: {:g} {}'.format(len(text),text[:text_l]))

output =>

CPU times: user 44.9 ms, sys: 2.89 ms, total: 47.8 ms
Wall time: 47.8 ms
size: 570000
:( 😻 😈 888 eihtg dod fee ## document title :( catnip immednatedly 2nd levelheading . , # can't be a ckunk. $4 $123,456 won't seven $shine $$beighty?$

The rate is about 0.05 seconds (48 ms) for 1 million characters.

15. NLP text preprocessing: Whitespace Removal (slower)

from textacy.preprocessing.normalize import normalize_whitespace

%time text= normalize_whitespace(long_s)
print('size: {:g} {}'.format(len(text),text[:text_l]))

output =>

CPU times: user 199 ms, sys: 3.06 ms, total: 203 ms
Wall time: 201 ms
size: 569999
:( 😻 😈 888 eihtg dod fee ## document title :( catnip immednatedly 2nd levelheading . , # can't be a ckunk. $4 $123,456 won't seven $shine $$beighty?$

normalize_whitespace is about 4x slower than the regex but more general. For safety in production, we use normalize_whitespace, although to date we have not seen any problems with the faster regex.
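
As a middle ground (my own sketch, not part of the benchmarks above), a single compiled regex over all whitespace classes collapses spaces, tabs, and newlines in one pass while staying close to the speed of the space-only pattern:

import re

RE_WHITESPACE = re.compile(r'\s+')  # spaces, tabs, newlines, etc.

def squeeze_whitespace(text):
    # Collapse any run of whitespace into a single space and trim the ends.
    return RE_WHITESPACE.sub(' ', text).strip()

print(squeeze_whitespace('too   many \t spaces\n\nhere'))
# => too many spaces here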

16. NLP text preprocessing: Remove Punctuation

from textacy.preprocessing.remove import remove_punctuation

%time text = remove_punctuation(long_s, marks=',.#$?')
print('size: {:g} {}'.format(len(text),text[:text_l]))

output =>

CPU times: user 34.5 ms, sys: 4.82 ms, total: 39.3 ms
Wall time: 39.3 ms
size: 558599
:( 😻 😈 888 eihtg dod fee document title :( catnip immednatedly 2nd levelheading can't be a ckunk 4 123 456 won't seven shine beighty

spaCy

Creating the spaCy pipeline and Doc

In order to pre-process text with spaCy, we transform the text into a corpus Doc object. We can then work with the sequence of token objects of which a Doc consists. Each token carries attributes (discussed above) that we use later in this article to pre-process the corpus.

Our text pre-processing end goal (usually) is to produce tokens that feed into our NLP models.

  • spaCy reverses the usual order of pre-processing the text and then transforming it into tokens: spaCy first creates a Doc of tokens, and you then pre-process the tokens by their attributes.

The result is that parsing text into a Doc object is where the majority of computation lies. As we will see, pre-processing the sequence of tokens by their attributes is fast.

Adding emoji cleaning in the spaCy pipeline

import en_core_web_lg
nlp = en_core_web_lg.load()
do = nlp.disable_pipes(["tagger", "parser"])
%time emoji = Emoji(nlp)
nlp.max_length = len(long_s) + 10
%time nlp.add_pipe(emoji, first=True)
%time long_s_doc = nlp(long_s)
print('size: {:g} {}'.format(len(long_s_doc),long_s_doc[:text_l]))

output =>

CPU times: user 303 ms, sys: 22.6 ms, total: 326 ms
Wall time: 326 ms
CPU times: user 23 µs, sys: 0 ns, total: 23 µs
Wall time: 26.7 µs
CPU times: user 7.22 s, sys: 1.89 s, total: 9.11 s
Wall time: 9.12 s
size: 129199
:( 😻 😈 888 eihtg dod fee document title :( catnip immednatedly 2nd levelheading can't be a ckunk 4 123 456 won't seven shine beighty

Creating the token sequence proceeded at about 14,000 tokens per second. We will see quite a speedup when we use an NVIDIA GPU.

nlp.pipe_names output => ['emoji', 'ner']

Note: The tokenizer is a “special” component and isn’t part of the regular pipeline. It also doesn’t show up in nlp.pipe_names. The reason is that there can only be one tokenizer, and while all other pipeline components take a Doc and return it, the tokenizer takes a string of text and turns it into a Doc. You can still customize the tokenizer. You can either create your own Tokenizer class from scratch, or even replace it with an entirely custom function.
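
For example, here is a minimal sketch (assuming the nlp object loaded above) that swaps in a whitespace-only tokenizer, following the pattern in spaCy's documentation:

from spacy.tokens import Doc

class WhitespaceTokenizer:
    # Illustrative custom tokenizer: split on single spaces only.
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = text.split(' ')
        # spaCy builds the Doc directly from the word list.
        return Doc(self.vocab, words=words)

nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)
doc = nlp("What's happened to me? he thought.")
print([token.text for token in doc])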

spaCy Token Attributes for Doc Token Preprocessing

As we saw earlier, spaCy provides convenience attributes for many other pre-processing tasks. For example, to remove stop words you can reference the .is_stop attribute.

dir(token[0]) output=> 'ancestors', 'check_flag', 'children', 'cluster', 'conjuncts', 'dep', 'dep_', 'doc', 'ent_id', 'ent_id_', 'ent_iob', 'ent_iob_', 'ent_kb_id', 'ent_kb_id_', 'ent_type', 'ent_type_', 'get_extension', 'has_extension', 'has_vector', 'head', 'i', 'idx', 'is_alpha', 'is_ancestor', 'is_ascii', 'is_bracket', 'is_currency', 'is_digit', 'is_left_punct', 'is_lower', 'is_oov', 'is_punct', 'is_quote', 'is_right_punct', 'is_sent_end', 'is_sent_start', 'is_space', 'is_stop', 'is_title', 'is_upper', 'lang', 'lang_', 'left_edge', 'lefts', 'lemma', 'lemma_', 'lex_id', 'like_email', 'like_num', 'like_url', 'lower', 'lower_', 'morph', 'n_lefts', 'n_rights', 'nbor', 'norm', 'norm_', 'orth', 'orth_', 'pos', 'pos_', 'prefix', 'prefix_', 'prob', 'rank', 'remove_extension', 'right_edge', 'rights', 'sent', 'sent_start', 'sentiment', 'set_extension', 'shape', 'shape_', 'similarity', 'string', 'subtree', 'suffix', 'suffix_', 'tag', 'tag_', 'tensor', 'text', 'text_with_ws', 'vector', 'vector_norm', 'vocab', 'whitespace_']

Attributes added by the emoji pipeline component and other extensions:

dir(long_s_doc[0]._) output => ['emoji_desc', 'get', 'has', 'is_emoji', 'set', 'trf_alignment', 'trf_all_attentions', 'trf_all_hidden_states', 'trf_d_all_attentions', 'trf_d_all_hidden_states', 'trf_d_last_hidden_state', 'trf_d_pooler_output', 'trf_end', 'trf_last_hidden_state', 'trf_pooler_output', 'trf_separator', 'trf_start', 'trf_word_pieces', 'trf_word_pieces_']

I show spaCy performing preprocessing that results in a Python string corpus. The corpus is used to create a new sequence of spaCy tokens (Doc).

There is a faster way to accomplish spaCy preprocessing with spaCy pipeline extensions [2], which I show in an upcoming blog.

17. EMOJI Sentiment Score

EMOJI Sentiment Score is not a text preprocessor in the classic sense.

However, we find that when emoji are present, they almost always dominate the sentiment of a document.

For example, here are two similar phrases from legal-notes e-mails with opposite sentiment.

The client was challenging. :( The client was difficult. :)

We calculate sentiment from the emoji alone when they are present in a note or e-mail.

%time scl = [EMOJI_TO_SENTIMENT_VALUE[token.text] for token in long_s_doc if (token.text in EMOJI_TO_SENTIMENT_VALUE)]
len(scl), sum(scl), sum(scl)/len(scl)

output =>

CPU times: user 179 ms, sys: 0 ns, total: 179 ms
Wall time: 178 ms
(15200, 1090.7019922523152, 0.07175671001659968)

The sentiment was 0.07 (neutral) for a 0.5-million-character “note” with 15,200 emoji and emoticons, computed in 178 ms. A fast sentiment analysis calculation!
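
EMOJI_TO_SENTIMENT_VALUE is one of our own lookups and its construction is not shown in this article. A minimal sketch of how such a mapping might be built (the keys and scores below are illustrative, not our production values) looks like this:

# Hypothetical sketch: map emoji/emoticon text to a sentiment score in [-1, 1].
# In practice such scores come from a labeled emoji sentiment resource.
EMOJI_TO_SENTIMENT_VALUE = {
    '😻': 0.9,    # smiling cat face with heart-eyes
    '😈': -0.3,   # smiling face with horns
    ':)': 0.6,
    ':(': -0.6,
}

def emoji_sentiment(doc):
    # Average the sentiment scores of all emoji/emoticon tokens in a spaCy Doc.
    scores = [EMOJI_TO_SENTIMENT_VALUE[t.text]
              for t in doc if t.text in EMOJI_TO_SENTIMENT_VALUE]
    return sum(scores) / len(scores) if scores else 0.0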

18. NLP text preprocessing: Removing emoji

You can remove emoji using the spaCy emoji pipeline add-on:

%time long_s_doc_no_emojicon = [token for token in long_s_doc if token._.is_emoji == False]
print('size: {:g} {}'.format(len(long_s_doc_no_emojicon),long_s_doc_no_emojicon[:int(text_l/5)]))

output =>

CPU times: user 837 ms, sys: 4.98 ms, total: 842 ms
Wall time: 841 ms
size: 121599
[:(, 888, eihtg, dod, fee, , document, title, :(, catnip, immednatedly, 2nd, levelheading, , ca, n't, be, a, ckunk, , 4, , 123, 456, wo, n't, seven, , shine, , beighty, , :(, 888, eihtg, dod, fee, , document, title, :(, catnip, immednatedly, 2nd, levelheading, , ca, n't, be, a, ckunk, , 4, , 123, 456, wo, n't, seven, , shine, , beighty, , :(, 888, eihtg, dod, fee, ]

The emoji spaCy pipeline addition detected the emoji 😻 and 😈, but missed the emoticons :) and :(.

19. NLP text pre-processing: Removing emoji (better)

We developed EMOJI_TO_PHRASE to detect both the emoji, 😻 and 😈, and the emoticons, such as :) and :(, and removed them [8,9].

%time text = [token.text if (token.text in EMOJI_TO_PHRASE) == False \
else '' for token in long_s_doc]
%time long_s = ' '.join(text)
print('size: {:g} {}'.format(len(long_s),long_s[:text_l]))

output =>

CPU times: user 242 ms, sys: 3.76 ms, total: 245 ms
Wall time: 245 ms
CPU times: user 3.37 ms, sys: 73 µs, total: 3.45 ms
Wall time: 3.46 ms
size: 569997
888 eihtg dod fee document title catnip immednatedly 2nd levelheading ca n't be a ckunk 4 123 456 wo n't seven shine beighty 888 eihtg dod fee document title catnip immednatedly 2nd levelheading ca n't be a ckunk 4 123 456 wo n't seven shine beighty 888 eihtg dod fee document title catnip imm

20. NLP text pre-processing: Replace emojis with a phrase

We can translate each emoji into a natural language phrase.

%time text = [token.text if token._.is_emoji == False else token._.emoji_desc for token in long_s_doc]
%time long_s = ' '.join(text)
print('size: {:g} {}'.format(len(long_s),long_s[:250]))

output =>

CPU times: user 1.07 s, sys: 7.54 ms, total: 1.07 s
Wall time: 1.07 s
CPU times: user 3.78 ms, sys: 0 ns, total: 3.78 ms
Wall time: 3.79 ms
size: 794197
:( smiling cat face with heart-eyes smiling face with horns 888 eihtg dod fee document title :( catnip immednatedly 2nd levelheading ca n't be a ckunk 4 123 456 wo n't seven shine beighty

Again, the emoji spaCy pipeline addition detected the emoji 😻 and 😈, but missed the emoticons :) and :(.

21. NLP text pre-processing: Replace emojis with a phrase (better)

We can translate both emoji and emoticons into natural language phrases.

%time text = [token.text if (token.text in EMOJI_TO_PHRASE) == False \
else EMOJI_TO_PHRASE[token.text] for token in long_s_doc]
%time long_s = ' '.join(text)
print('size: {:g} {}'.format(len(long_s),long_s[:text_l]))

output =>

CPU times: user 251 ms, sys: 5.57 ms, total: 256 ms
Wall time: 255 ms
CPU times: user 3.54 ms, sys: 91 µs, total: 3.63 ms
Wall time: 3.64 ms
size: 904397
FROWNING FACE SMILING CAT FACE WITH HEART-SHAPED EYES SMILING FACE WITH HORNS 888 eihtg dod fee document title FROWNING FACE catnip immednatedly 2nd levelheading ca n't be a ckunk 4 123 456 wo n't seven shine beighty FROWNING FAC

Again, EMOJI_TO_PHRASE detected both the emoji, 😻 and 😈, and the emoticons, such as :) and :(, and substituted a phrase for each.
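
Like EMOJI_TO_SENTIMENT_VALUE, EMOJI_TO_PHRASE is our own lookup and its contents are not listed in this article. A minimal sketch of the shape such a mapping might take (the entries below are illustrative) is:

# Hypothetical sketch: map both Unicode emoji and ASCII emoticons to descriptive phrases.
EMOJI_TO_PHRASE = {
    '😻': 'SMILING CAT FACE WITH HEART-SHAPED EYES',
    '😈': 'SMILING FACE WITH HORNS',
    ':)': 'SMILING FACE',
    ':(': 'FROWNING FACE',
}

# Removal (section 19): drop any token found in the lookup.
cleaned = ' '.join(t.text for t in long_s_doc if t.text not in EMOJI_TO_PHRASE)

# Replacement (section 21): substitute the phrase instead.
phrased = ' '.join(EMOJI_TO_PHRASE.get(t.text, t.text) for t in long_s_doc)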

22. NLP text preprocessing: Correct Spelling

We will use symspell for spelling correction [14].

SymSpell, based on the Symmetric Delete spelling correction algorithm, takes only 0.000033 seconds per lookup at edit distance 2 and 0.000180 seconds at edit distance 3 on an old MacBook Pro [14].

%time sym_spell_setup() 
%time tk = [check_spelling(token.text) for token in long_s_doc[0:99999]]
%time long_s = ' '.join(tk)
print('size: {:g} {}'.format(len(long_s),long_s[:250]))

output =>

CPU times: user 5.22 s, sys: 132 ms, total: 5.35 s
Wall time: 5.36 s
CPU times: user 25 s, sys: 12.9 ms, total: 25 s
Wall time: 25.1 s
CPU times: user 3.37 ms, sys: 42 µs, total: 3.41 ms
Wall time: 3.42 ms
size: 528259 FROWNING FACE SMILING CAT FACE WITH HEART a SHAPED EYES SMILING FACE WITH HORNS 888 eight do fee document title FROWNING FACE catnip immediately and levelheading a not be a chunk a of 123 456 to not seven of shine of eighty

Spelling correction was accomplished for immednatedly, ckunk, and beighty. Correcting misspelled words is our largest computation: it required about 30 seconds for 0.8 million characters.
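
The helpers sym_spell_setup() and check_spelling() are our own thin wrappers and are not shown in this article. A rough sketch of what they might look like on top of the symspellpy package (the dictionary file and parameters below are assumptions, not our production settings) is:

import pkg_resources
from symspellpy import SymSpell, Verbosity

sym_spell = None

def sym_spell_setup(max_edit_distance=2, prefix_length=7):
    # Load SymSpell with the English frequency dictionary shipped with symspellpy.
    global sym_spell
    sym_spell = SymSpell(max_dictionary_edit_distance=max_edit_distance,
                         prefix_length=prefix_length)
    dict_path = pkg_resources.resource_filename(
        'symspellpy', 'frequency_dictionary_en_82_765.txt')
    sym_spell.load_dictionary(dict_path, term_index=0, count_index=1)

def check_spelling(word, max_edit_distance=2):
    # Return the top SymSpell suggestion, or the original word if none is found.
    suggestions = sym_spell.lookup(word, Verbosity.CLOSEST,
                                   max_edit_distance=max_edit_distance,
                                   include_unknown=True)
    return suggestions[0].term if suggestions else word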

23. NLP text preprocessing: Replacing Currency Symbol (spaCy)

%time token = [token.text if token.is_currency == False else '_CUR_' for token in long_s_doc]
%time long_s = ' '.join(token)
print('size: {:g} {}'.format(len(long_s),long_s[:text_l]))

Note: spaCy treats emoticons such as :) as punctuation and a punctuation filter would remove them. You can protect the emoticons with:

%time long_s_doc = [token for token in long_s_doc if token.is_punct == False or token._.is_emoji == True]
print('size: {:g} {}'.format(len(long_s_doc),long_s_doc[:50]))

However, replace_currency_symbols and the regex ignore context and replace every currency symbol. If $ has multiple uses in your text, you cannot ignore context; in that case you can use spaCy's token-level is_currency attribute.

%time tk = [token.text if token.is_currency == False else '_CUR_' for token in long_s_doc]
%time long_s = ' '.join(tk)
print('size: {:g} {}'.format(len(long_s),long_s[:250]))

output =>

CPU times: user 366 ms, sys: 13.9 ms, total: 380 ms
Wall time: 381 ms
CPU times: user 9.7 ms, sys: 0 ns, total: 9.7 ms
Wall time: 9.57 ms
size: 1.692e+06 😻 👍 🏿 < title > Document Title</title > :( < html><h2>2nd levelheading</h2></html > bhc@gmail.com f@z.y a$@ ca n't bc$$ ef$4 5 66 _CUR_ wo nt seven eihtg _CUR_ nine _CUR_ _CUR_ zer$ 😻 👍 🏿 < title > Document Title</title > :( < html><h2>2nd leve

24. NLP text preprocessing: Removing e-mail address (spacy)

%time tokens = [token for token in long_s_doc if not token.like_email]
print('size: {:g} {}'.format(len(tokens),tokens[:int(text_l/3)]))

output =>

CPU times: user 52.7 ms, sys: 3.09 ms, total: 55.8 ms
Wall time: 54.8 ms
size: 99999

About 0.06 seconds for 1 million characters.

25. NLP text preprocessing: Remove whitespace and punctuation (spaCy)

%time tokens = [token.text for token in long_s_doc if (token.pos_ not in ['SPACE','PUNCT'])]
%time text = ' '.join(tokens)
print('size: {:g} {}'.format(len(text),text[:text_l]))

26. NLP text preprocessing: Removing stop-words

New NLP models (e.g., logistic regression and transformers) and NLP tasks (e.g., sentiment analysis) continue to appear. Some benefit from stop-word removal, and some do not [2].

Note: We now use only deep learning language models (transformers) and do not remove stop-words.

%time tokens = [token.text for token in long_s_doc if token.is_stop == False]
%time long_s = ' '.join(tokens)
print('size: {:g} {}'.format(len(long_s),long_s[:text_l]))

27. NLP text pre-processing: Lemmatization

Lemmatization looks beyond word reduction and considers a language’s full vocabulary to apply a morphological analysis to words.

Lemmatization looks at the surrounding text to determine a given word’s part of speech. It does not categorize phrases.

%time tokens = [token.lemma_ for token in long_s_doc]
%time long_s = ' '.join(tokens)
print('size: {:g} {}'.format(len(long_s),long_s[:text_l]))

output =>

CPU times: user 366 ms, sys: 13.9 ms, total: 380 ms
Wall time: 381 ms
CPU times: user 9.7 ms, sys: 0 ns, total: 9.7 ms
Wall time: 9.57 ms
size: 1.692e+06 😻 👍 🏿 < title > Document Title</title > :( < html><h2>2nd levelheading</h2></html > bhc@gmail.com f@z.y a$@ ca n't bc$$ ef$4 5 66 _CUR_ wo nt seven eihtg _CUR_ nine _CUR_ _CUR_ zer$ 😻 👍 🏿 < title > Document Title</title > :( < html><h2>2nd leve

Note: spaCy does not include stemming. You can add it if you want. Stemming does not work as well as lemmatization because stemming does not consider context [2] (which is why some researchers consider spaCy “opinionated”).

Note: If you do not know what stemming is, you can still be on the Survivor show (my opinion).
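
If you do want stemming alongside spaCy, a minimal sketch (assuming NLTK is installed; its PorterStemmer stands in for whatever stemmer you prefer) would be:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Stem the text of each spaCy token; the Doc itself is left untouched.
stems = [stemmer.stem(token.text) for token in long_s_doc]
long_stemmed = ' '.join(stems)
print('size: {:g} {}'.format(len(long_stemmed), long_stemmed[:250]))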

Conclusion

Whatever the NLP task, you need to clean (pre-process) the data (text) into a corpus (document or set of documents) before it is input into any NLP model.

I adopt a text pre-processing framework that has three major categories of NLP text pre-processing; a minimal sketch that chains the three categories together appears after the lists:

  1. Noise Removal
  • transform Unicode characters into text characters;
  • convert a document image into segmented image parts and text snippets [10];
  • extract data from a database and transform it into words;
  • remove markup and metadata in HTML, XML, JSON, .md, etc.;
  • remove extra whitespace;
  • remove emoji or convert emoji into phrases;
  • remove or convert currency symbols, URLs, e-mail addresses, phone numbers, hashtags, and other identifying tokens;
  • correct mis-spelled words (tokens) [7];
  • remove remaining unwanted punctuation.

2. Tokenization

  • Splitting strings of text into smaller pieces, or “tokens”: paragraphs segment into sentences, and sentences tokenize into words.

3. Normalization

  • Change all characters to lower case;
  • Remove stop words (English, or whatever language the text is in);
  • Perform Lemmatization or Stemming.
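
As promised above, here is a minimal sketch of my own (not our production pipeline) that chains the three categories together, assuming textacy is installed and nlp is the spaCy model loaded earlier:

import re
from textacy.preprocessing.replace import replace_urls, replace_emails, replace_currency_symbols
from textacy.preprocessing.normalize import normalize_hyphenated_words

def noise_removal(text):
    # Stage 1: strip or normalize the noisiest token classes.
    text = replace_urls(text, replace_with='_URL_')
    text = replace_emails(text, replace_with='_EMAIL_')
    text = replace_currency_symbols(text, replace_with='_CUR_')
    text = normalize_hyphenated_words(text)
    return re.sub(r'\s+', ' ', text)

def tokenize(text):
    # Stage 2: let spaCy turn the cleaned string into a Doc of tokens.
    return nlp(text)

def normalize(doc):
    # Stage 3: lower-case and lemmatize; stop-words are kept (see the notes below).
    return ' '.join(token.lemma_.lower() for token in doc
                    if not token.is_space and not token.is_punct)

corpus = normalize(tokenize(noise_removal(long_s)))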

Note: The tasks listed in Noise Removal and Normalization can move back and forth. The categorical assignment is for explanatory convenience.

Note: We do not remove stop-words anymore. We found that our current NLP models have higher F1 scores when we leave in stop-words.

Note: Stop-word removal is expensive computationally. We found the best way to achieve faster stop-word removal was not to do it.

Note: We saw no significant change in Deep Learning NLP models’ speed with or without stop-word removal.

Note: The Noise Removal and Normalization lists are not exhaustive. These are some of the tasks I have encountered.

Note: The latest NLP Deep Learning models are more accurate than older models. However, Deep Learning models can be impractically slow to train and are still too slow for prediction. We show in a follow-on article how we speed-up such models for production.

Note: Stemming algorithms chop off the end or the beginning of a word, using a list of common prefixes and suffixes, to create a base root word.

Note: Lemmatization uses linguistic knowledge bases to get the correct roots of words. Lemmatization performs morphological analysis of each word, which requires the overhead of creating a linguistic knowledge base for each language.

Note: Stemming is faster than lemmatization.

Note: Intuitively and in practice, lemmatization yields better results than stemming in an NLP Deep Learning model. Stemming generally reduces precision and increases recall, because it injects semi-random noise when it is wrong.

Read more in How and Why to Implement Stemming and Lemmatization from NLTK.

Text preprocessing Action benchmarks

Our own implementations, together with spaCy and textacy, are our current choice for fast short-text preprocessing in production. Given the large gap in performance, I recommend them for production purposes over NLTK's implementation of Stanford's NER.

In upcoming blogs, we will see how performance changes with multiprocessing, multithreading, NVIDIA GPUs, and pySpark. I will also write about how and why we built implementations such as EMOJI_TO_PHRASE and EMOJI_TO_SENTIMENT_VALUE, and how to add emoji, emoticons, or any Unicode symbol.

References

[1] How Much Data Do We Create Every Day? The Mind-Blowing Stats Everyone Should Read.

[2] Industrial-Strength Natural Language Processing; Turbo-charge your spaCy NLP pipeline.

[3] NLTK 3.5 Documentation.

[4] Textacy: Text (Pre)-processing.

[5] Hugging Face.

[6] Language Models are Few-Shot Learners.

[7] re — Regular expression operations.

[8] Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm.

[9] How I Built Emojitracker.

[10] Classifying e-commerce products based on images and text.

[11] DART: Open-Domain Structured Data Record to Text Generation.

[12] Adv-BERT: BERT is not robust on misspellings! Generating nature adversarial samples on BERT.

[13] fast.ai.

[14] 1000x faster Spelling Correction.

This article was originally published on Medium and re-published to TOPBOTS with permission from the author. Read more technical guides by Bruce Cottman, Ph.D. on Medium.


Microsoft BOT Framework: Building Blocks


I wrote an article some weeks ago introducing the Microsoft BOT Framework. The highlight of that article was to educate readers on how to develop a basic chatbot. My workmates acknowledged the effort but were interested in knowing more. In this article, I am going to dig a little deeper into the various concepts involved in the Microsoft BOT Framework.

I will touch on the following concepts in this article:

  • Channel
  • State
  • Prompt
  • Dialog
  • Waterfall
  • Connector
  • Activity
  • Turn

Channel

A channel is an application used to interact with the BOT. Integrations are currently available for Teams, Slack, Workplace, Skype, Facebook, Telegram, Line, Webchat, and more.

Some channels are also available as adapters. Check here for more details.

State

State, in the context of chatbots, means persisting metadata about the conversation between the BOT and the user at a given moment. State management makes the conversation more meaningful (i.e., responses can be saved and accessed at a later point in time).

Prompt

During a conversation between the user and the BOT, a prompt is the event in which the BOT asks the user a question. This question can take the form of text, a button, a dropdown, etc.

Dialog

Dialogs give the conversation its flow. A dialog comprises two steps:

  1. A prompt from the BOT requesting information
  2. The user's response to the BOT

If the user's response is valid, the BOT sends a new prompt for further information; otherwise it re-sends the same prompt.


Waterfall

A waterfall is formed from a combination of dialogs: a sequence of dialogs that determines the complete flow of the conversation.

Let’s look at all of these concepts in a diagrammatic representation.

Connector

The connector is a REST API used by the BOT to communicate across multiple channels. It allows the exchange of messages between the BOT and the user on a specific channel.

Activity

As the name suggests, an activity is any communication between the user and the BOT. The connector API uses the activity object to send useful information back and forth. The most common activity type is the message. For a complete list of all Activity types, see here.

Turn

In any conversation between two parties, each party takes turns responding to an activity (message). In the context of the Microsoft BOT Framework, communication happens between the user and the BOT, so a turn can be thought of as the processing the BOT does to respond to a user request.

Now that we have understood the basic concepts needed to build this sample, let’s have a look at our use case.

We will build a chatbot application that enables users to book a taxi. The conversational flow looks like this:

Each box in the above diagram represents a Dialog.

Github: https://github.com/tarunbhatt9784/MFTSamples/tree/master/SuperTaxiBot

Step 1: Create a VS2017 project

I set the name of the project to “SuperTaxiBot”.

Step 2: Install NuGet Package

Install the NuGet package Microsoft.Bot.Builder.Dialogs using VS2017.

Step 3: Create a DialogBot.cs

The class consists of bot logic which processes incoming activities from one or more channels and generates outgoing activities in response.

ActivityHandler defines various handlers for different types of activities. The handlers used in this sample are listed below; a rough sketch of the class follows the list.

  • OnTurnAsync: Handles any incoming activity.
  • OnMessageActivityAsync: Invoked when a message activity is received from the user. If overridden, this could potentially contain conversational logic. By default, this method does nothing.
  • OnMembersAddedAsync: Invoked when members other than this bot (such as a user) are added to the conversation.
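
The sample itself is written in C# for VS2017, so the sketch below is only a loose analogue using the Python botbuilder-core SDK (an assumption on my part, not the project's code), to show the shape of these handlers:

from botbuilder.core import ActivityHandler, MessageFactory, TurnContext

class DialogBot(ActivityHandler):
    # Rough Python-SDK analogue of the sample's DialogBot.cs (illustrative only).

    async def on_message_activity(self, turn_context: TurnContext):
        # Invoked when a message activity is received from the user.
        await turn_context.send_activity(
            MessageFactory.text(f"You said: {turn_context.activity.text}"))

    async def on_members_added_activity(self, members_added, turn_context: TurnContext):
        # Invoked when members other than this bot are added to the conversation.
        for member in members_added:
            if member.id != turn_context.activity.recipient.id:
                await turn_context.send_activity('Welcome to SuperTaxiBot!')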

Source: https://chatbotslife.com/microsoft-bot-framework-building-blocks-377be3d55dab?source=rss—-a49517e4c30b—4
