Author: Josh Pause

  • A Deep Dive into NLP Tokenization and Encoding with Word and Sentence Embeddings

Introduction: An estimated 80% of all organizational data is unstructured text. Despite this, even among data-savvy organizations, most of it goes completely ignored. Why? Numeric data is easy. Just toss it in a standard scaler, throw it at a model, and see what happens. You may not get good results, but you’ll get something, […]

    Continue Reading…

  • The effect of various text generation methods on the outputs of GPT-2

When generating text using the GPT-2 Large model, we found that both the method of generation and the text prompt used have a statistically significant effect on the output produced. In four out of six trials we found that the Nucleus Sampling method (aka Top-P) proposed by Holtzman et al. (Holtzman, Buys, Du, Forbes, and Choi, “The Curious Case of Neural Text Degeneration,” ICLR 2020, https://arxiv.org/pdf/1904.09751.pdf) produced output that was significantly more humanlike than other methods. We also found that some troublesome prompts, such as the first sentence of the Bible, consistently produce outputs that seem relatively unaffected by the choice of generation method. (A minimal Top-P code sketch follows this entry.)

    Continue Reading…
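
    For readers who want a concrete picture of what Top-P (Nucleus) sampling looks like in practice, here is a minimal sketch using the Hugging Face transformers library. The prompt, cutoff value, and output length are illustrative assumptions, not the settings used in the trials above.

    ```python
    # Minimal sketch: Top-P (Nucleus) sampling with GPT-2 Large via transformers.
    # Prompt and parameter values below are illustrative assumptions.
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2-large")
    model = GPT2LMHeadModel.from_pretrained("gpt2-large")

    prompt = "In the beginning"  # illustrative prompt
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids

    # do_sample=True with top_p enables nucleus sampling: at each step the model
    # samples only from the smallest set of tokens whose cumulative probability
    # reaches top_p, truncating the unreliable tail of the distribution.
    output_ids = model.generate(
        input_ids,
        do_sample=True,
        top_p=0.9,   # nucleus cutoff (assumed value)
        top_k=0,     # disable top-k so only the nucleus filter applies
        max_length=100,
        pad_token_id=tokenizer.eos_token_id,
    )

    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
    ```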

  • A Crash Course in Generating and Measuring Neural Text with GPT-2

A full crash course in neural text generation. We review various methods of generating text with GPT-2 (the “little brother” of GPT-3), including Beam Search, Top-K, and Top-P sampling. We also review some key metrics of generated text, including perplexity and repetition, and try to get a more intuitive sense of these measures. (An illustrative decoding comparison sketch follows this entry.)

    Continue Reading…
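
    As a companion to this entry, the sketch below shows one way to generate text with GPT-2 under the three decoding methods mentioned (Beam Search, Top-K, Top-P) and score each output with perplexity, using the Hugging Face transformers library. The model size, prompt, and parameter values are assumptions for demonstration, not the article's own setup.

    ```python
    # Minimal sketch: compare Beam Search, Top-K, and Top-P decoding with GPT-2
    # and report the perplexity of each generated text. The small "gpt2"
    # checkpoint, prompt, and parameter values are illustrative assumptions.
    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.eval()

    prompt = "The meaning of life is"
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids

    decoding_configs = {
        "beam_search": dict(num_beams=5, do_sample=False),
        "top_k":       dict(do_sample=True, top_k=50),
        "top_p":       dict(do_sample=True, top_p=0.92, top_k=0),
    }

    def perplexity(text):
        # Perplexity = exp(average negative log-likelihood per token) under GPT-2.
        ids = tokenizer(text, return_tensors="pt").input_ids
        with torch.no_grad():
            loss = model(ids, labels=ids).loss
        return torch.exp(loss).item()

    for name, kwargs in decoding_configs.items():
        out = model.generate(
            input_ids, max_length=60, pad_token_id=tokenizer.eos_token_id, **kwargs
        )
        text = tokenizer.decode(out[0], skip_special_tokens=True)
        print(f"--- {name} (perplexity {perplexity(text):.1f}) ---\n{text}\n")
    ```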