Q&A: How to start learning natural language processing
In this Q&A, 'Natural Language Processing in Action' co-author Hobson Lane discusses how to start learning NLP, including benefits and challenges of building your own pipelines.
Natural language processing silently underpins many aspects of our digital lives, from email spam filtering and plagiarism detection to grammar checking and spelling correction. But learning how the technology works and understanding its role in AI are often challenging.
NLP is a branch of AI that enables computer systems to interpret, understand and generate written and spoken language in a manner similar to humans. Using linguistic rules and machine learning algorithms, NLP models can analyze and produce text and voice data as well as streamline interactions between humans and machines.
NLP is used for tasks such as text classification and extraction, natural language generation, and machine translation. With NLP, organizations can process and analyze large quantities of text-heavy data and build AI systems that enable them to better interact with customers.
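As a rough illustration of text classification, the sketch below trains a tiny naive Bayes spam filter using only the Python standard library. It is not taken from the book. The training messages, the "spam" and "ham" labels and the function names are all made up for this example, and a real filter would need far more data.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train(examples):
    """Count words per label from an iterable of (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    return word_counts, label_counts


def classify(text, word_counts, label_counts):
    """Return the label with the highest naive Bayes log score."""
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        # Log prior: how common this label is in the training set.
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            # Add-one (Laplace) smoothing so unseen words do not zero out the score.
            score += math.log(
                (word_counts[label][token] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data, invented for illustration only.
TRAINING = [
    ("win a free prize now", "spam"),
    ("claim your free money today", "spam"),
    ("free offer click now", "spam"),
    ("meeting agenda for tomorrow", "ham"),
    ("lunch with the project team", "ham"),
    ("please review the attached report", "ham"),
]

word_counts, label_counts = train(TRAINING)
print(classify("free prize money", word_counts, label_counts))
print(classify("team meeting report", word_counts, label_counts))
```

The same counting approach extends to any set of labels, which is one reason classification is a common first NLP project.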
But despite their ability to improve human-computer communication, NLP models can be difficult to build. In the second edition of 'Natural Language Processing in Action' from Manning Publications Co., authors Hobson Lane and Maria Dyshel provide readers with detailed steps to build models that understand and generate text almost as well as humans can.
In this Q&A with TechTarget Editorial, Lane discusses the skills users need to get started creating NLP models, where to find high-quality data online and how NLP can play a positive role in the future of AI.
Editor's note: The following interview has been edited for clarity and length.
What are some skills readers should have before they begin using the book?
Hobson Lane: The main thing is curiosity. I have middle schoolers that are reading the book and helping me with it, even drawing diagrams.
You [should also] probably have already played around with Python as a programming language. A slight familiarity with Python and ability to set up an environment on your computer so that you can program in Python -- that's really all you need.
What are some common challenges that you have experienced or you've seen users experience when working with NLP?
Lane: Unfortunately, the Windows OS is not friendly to Python developers, so it's quite difficult to get set up [if using Windows]. But if you're able to get over those hurdles, then the next is access to high-quality, labeled data.
Fortunately, there are some high-quality data sets out there, like Project Gutenberg. All the books are out there in terms of getting raw text content, but they are 40 years old. So, [you] can't really get a lot of dialogue around a lot of technology [from them].
Then there's Stack Overflow, a great source for questions and answers where NLP can be applied. Unfortunately, it's been polluted by bad actors and by AI itself. They've tried to ban large language models from contributing answers, but they're leaking in, so it has sort of stagnated as a source of authoritative information about technology.
We do provide a lot of hidden sources of information, such as Mastodon ActivityPub [a decentralized social networking protocol]. But you should be responsible with how you use it, preferably only using the content that people have opted in to sharing with you [through] a particular protocol to retrieve that data. Lemmy.ml is another platform that also operates on ActivityPub and is a social network similar to Reddit.
So, getting access to data and getting your computer to run on that data are the two big challenges people face.
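As a minimal sketch of preparing raw text from a source such as Project Gutenberg, the example below strips the license boilerplate around a book's body and counts word frequencies. It is not from the book. It assumes the plain-text file marks the body with "*** START OF THE PROJECT GUTENBERG EBOOK ... ***" and a matching END line. The exact wording varies between files, so the patterns may need adjusting. The sample text and function names are invented for illustration.

```python
import re
from collections import Counter

# Assumed marker wording; individual files may differ.
START_RE = re.compile(
    r"\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*",
    re.IGNORECASE,
)
END_RE = re.compile(
    r"\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*",
    re.IGNORECASE,
)


def strip_boilerplate(raw):
    """Return the text between the START and END markers, if they are present."""
    start = START_RE.search(raw)
    end = END_RE.search(raw)
    begin = start.end() if start else 0
    finish = end.start() if end else len(raw)
    return raw[begin:finish].strip()


def word_frequencies(text):
    """Count lowercase word tokens, keeping simple contractions together."""
    return Counter(re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower()))


# Hypothetical stand-in for a downloaded plain-text book.
SAMPLE = """The Project Gutenberg eBook of An Example Book
*** START OF THE PROJECT GUTENBERG EBOOK AN EXAMPLE BOOK ***
It was a bright day. The day was cold.
*** END OF THE PROJECT GUTENBERG EBOOK AN EXAMPLE BOOK ***
License text follows here.
"""

body = strip_boilerplate(SAMPLE)
counts = word_frequencies(body)
print(body)
print(counts.most_common(3))
```

In practice, the sample string would be replaced with the contents of a file read from disk. Removing the boilerplate first keeps license text from skewing the word counts.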
In the book, you wrote that NLP might help save the world. Could you speak to that a little?
Lane: It's just a feeling in my heart. Obviously, it's not something quantitative. There's this concept that the intelligence and complexity of life itself is the result of cooperation. Cooperation is the key element of the natural evolution of biological systems and life, it seems -- and communication is that same thing for higher-level organisms like humans and mammals, where we communicate through natural language.
The way we cooperate is going to shape how we evolve, mediated by technology doing that for language processing and participating with us in a cooperative network. If we can build machines that cooperate with us, then complexity will continue to grow. So, if we build it right, it could save us, and if we build it wrong, it could destroy us.
You can save the world if you are the one who's helpful and contributing to that company, organization, nonprofit or university lab that comes up with a way to build cooperative machines that are able to outcompete the sociopathic machines that are intentionally being built by corporations to exploit you and get money from you. [But] if we are training and building machines with that focus in mind, then we're lost. It's important for people to have access to materials that teach them how to build machines that are as smart as those machines, but even smarter because they cooperate with both their human handlers and with each other.
Are you optimistic about the use of this technology?
Lane: Yes, I am very optimistic. These middle schoolers and high schoolers that are taking this up, they're going to lead us to this better world, I'm sure. The digital natives know this technology. They know its power and its addictive qualities. And they're able to, I think, do a better job than my generation has done creating a world where technology is a tool for productive, helpful and cooperative human thought, as opposed to a tool for exploitation and manipulation of others.
Is there anything else that you wanted to mention regarding the book and NLP?
Lane: The most important thing is for people to realize there's more than just generative models. I'm finding that a lot of people are seeing the magic of LLMs or generative models and thinking that that's where it's at. But it's not, because [generative AI] is completely uncontrolled and unexplainable.
Generative models were around long before ChatGPT and do not require a conversational interface. Just because conversation is what we do naturally, and it makes it feel fun and engaging and drives this viral nature on social media, does not mean that that's the way you should interact with your tools.
A conversation is not how you program a machine to do what you want it to do. You need a non-natural language to program a computer. You need to specify exactly what you want to do and have a programming language that you can use. And that's what we're doing. The book gives you those libraries of tools and those examples so you can actually build a system that's doing what you want it to do -- not sporadically, but all the time.
This book is not about prompt engineering. If you're coming to this to learn prompt engineering, you're coming to the wrong place. Prompt engineering is a misnomer. There's no engineering involved. It is just playing with trial and error, and somehow tricking [the model] into producing what you want every now and then. It's not a good thought companion, and that's not a good interface for interacting with natural language processing.
The tools in the book will show you how to do it better and more efficiently to get to the place you want to go faster with your business or your life. It will also help you see how this approach [works]. Hopefully, you'll see how much better it is than spending your life trying to chase whatever the next product is with a conversational interface.