• A Deep Dive into NLP Tokenization and Encoding with Word and Sentence Embeddings

    🦈 Deep Dive
    March 13, 2022

    Josh Pause

    An estimated 80% of all organizational data is unstructured text. Despite this, even among data-savvy organizations, most of it goes completely ignored. Why? Numeric data is easy. Just toss it in a standard scaler, throw it at a model, and see what happens. You may not get good results, but you’ll get something,…

    Continue Reading…
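
    As a taste of the pipeline the post walks through, here is a minimal sketch of tokenizing a sentence and mean-pooling its contextual word embeddings into a sentence embedding; the model choice (bert-base-uncased) and the pooling strategy are illustrative assumptions, not details taken from the article.

    ```python
    # Minimal sketch: tokenize text, look up contextual word embeddings, and
    # mean-pool them into one sentence embedding. Model name and pooling
    # strategy are assumptions for illustration, not from the article.
    import torch
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    text = "An estimated 80% of all organizational data is unstructured text."
    inputs = tokenizer(text, return_tensors="pt")  # tokenization + integer encoding
    print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]))

    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, 768) word embeddings
    sentence_embedding = hidden.mean(dim=1)  # naive mean-pooled sentence embedding
    print(sentence_embedding.shape)  # torch.Size([1, 768])
    ```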

  • The effect of various text generation methods on the outputs of GPT-2

    📋 Research
    February 12, 2022

    Josh Pause

    When generating text using the GPT-2 Large model, we found that both the method of generation and the text prompt used have a statistically significant effect on the output produced. In four out of six trials we found that the Nucleus Sampling method proposed by Holtzman et al.[1] Holtzman, Buys, Du, Forbes, Choi. (2020). The…

    Continue Reading…
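
    For readers who want to try this themselves, a minimal sketch of nucleus (Top-P) sampling with GPT-2 Large via Hugging Face transformers follows; the prompt, top_p value, and output length are assumptions for illustration, not the settings used in the study.

    ```python
    # Sketch of nucleus (Top-P) sampling with GPT-2 Large. The prompt and the
    # generation hyperparameters are illustrative, not the study's settings.
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2-large")
    model = GPT2LMHeadModel.from_pretrained("gpt2-large")

    input_ids = tokenizer("The results of the study suggest", return_tensors="pt").input_ids

    output = model.generate(
        input_ids,
        do_sample=True,  # sample instead of decoding greedily
        top_p=0.95,      # nucleus sampling: restrict each step to the smallest
                         # token set whose cumulative probability exceeds 0.95
        max_length=50,
        pad_token_id=tokenizer.eos_token_id,
    )
    print(tokenizer.decode(output[0], skip_special_tokens=True))
    ```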

  • A Crash Course in Generating and Measuring Neural Text with GPT-2

    🎓 Crash Course
    February 11, 2022

    Josh Pause

    A full crash course in neural text generation. We review various methods of generating text with GPT-2 (the “little brother” of GPT-3), including Beam Search, Top-K and Top-P sampling. We also review some key metrics of generated text, including perplexity and repetition, and try to get a more intuitive sense of these measures.

    Continue Reading…
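
    To make the decoding strategies and the perplexity measure concrete, here is a hedged sketch that generates with Beam Search, Top-K, and Top-P sampling and scores each output by perplexity; all hyperparameter values are illustrative assumptions rather than the course's own settings.

    ```python
    # Sketch contrasting Beam Search, Top-K, and Top-P decoding, plus a simple
    # perplexity score (exp of the model's mean cross-entropy on the output).
    # All hyperparameter values here are illustrative assumptions.
    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    input_ids = tokenizer("The future of machine learning", return_tensors="pt").input_ids
    pad = tokenizer.eos_token_id

    beam = model.generate(input_ids, num_beams=5, max_length=40, pad_token_id=pad)
    topk = model.generate(input_ids, do_sample=True, top_k=50, max_length=40, pad_token_id=pad)
    topp = model.generate(input_ids, do_sample=True, top_p=0.92, max_length=40, pad_token_id=pad)

    def perplexity(ids):
        # Perplexity of the model on a token sequence: exp(mean cross-entropy).
        with torch.no_grad():
            loss = model(ids, labels=ids).loss
        return torch.exp(loss).item()

    for name, out in [("beam search", beam), ("top-k", topk), ("top-p", topp)]:
        print(f"{name}: perplexity {perplexity(out):.1f}")
        print(tokenizer.decode(out[0], skip_special_tokens=True))
    ```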