Data Jenius – the human side of machine learning

A Deep Dive into NLP Tokenization and Encoding with Word and Sentence Embeddings

🦈 Deep Dive

March 13, 2022

Josh Pause

An estimated 80% of all organizational data is unstructured text. Despite this, even among data-savvy organizations, most of it goes completely ignored. Why? Numeric data is easy. Just toss it in a standard scaler, throw it at a model, and see what happens. You may not get good results, but you’ll get something,
Continue Reading…
The effect of various text generation methods on the outputs of GPT-2

📋 Research

February 12, 2022

Josh Pause

When generating text using the GPT-2 Large model, we found that both the generation method and the text prompt have a statistically significant effect on the output produced. In four out of six trials we found that the Nucleus Sampling method proposed by Holtzman et al. (2020)…
Continue Reading…
A Crash Course in Generating and Measuring Neural Text with GPT-2

🎓 Crash Course

February 11, 2022

Josh Pause

A full crash course in neural text generation. We review various methods of generating text with GPT-2 (the “little brother” of GPT-3), including Beam Search, Top-K and Top-P sampling. We also review some key metrics of generated text, including perplexity and repetition, and try to get a more intuitive sense of these measures.
Continue Reading…