8 data science projects to build your resume
A strong data science resume includes a variety of projects. Find out which data science project types employers are looking for and how to present them on your resume.
Writing a resume tailored to a data science position is no easy task. However, it is necessary, because a well-written resume is often the deciding factor in getting an interview for a job as a data scientist.
A good data science resume should be brief -- typically, just one page long, unless the applicant has many years of experience. The sections of the data science resume should include:
- Resume objective
- Experience
- Education
- Certifications
- Skills
- Projects
- Publications
These sections help applicants demonstrate their backgrounds and knowledge in relevant areas.
Organizations looking to hire data scientists expect candidates to have either some previous work experience or, alternatively, data science-related projects. Job seekers transitioning to careers in data science right from college, switching careers or seeking different types of data science jobs can use projects to show prospective employers they have the necessary skills to do the work. A data science project portfolio should include three to five projects that showcase the applicant's relevant skills.
Here are eight data science projects to build your resume.
Sentiment analysis
Today, data-driven companies use sentiment analysis to identify customers' attitudes about their products or services. Sentiment analysis is the automated process of determining if opinions toward a product or service are positive, negative or neutral. These opinions are usually expressed in pieces of text.
The objective of sentiment analysis is to help a company figure out the answers to questions such as:
- Why don't customers like the product or service?
- Why isn't the product or service hitting its target sales goals?
- How can the product or service be changed so more customers like it?
- What factors affect customer sentiment toward the product or service, e.g., quality, quantity, price or something else?
Customer opinions can range from positive to negative. Responses can be classified into two classes, positive or negative, or into multiple classes -- e.g., excited, angry, happy, sad or another emotion.
This sentiment analysis data science project could be implemented in the R language, using the text of Jane Austen's novels from the "janeaustenr" package as the data set. For this project, the job candidate will use sentiment lexicons, including:
- Loughran, which is used for financial text.
- Bing, which labels words as positive or negative.
- AFINN, a list of words rated for valence, with each word assigned an integer between minus five and plus five.
The applicant can then build a word cloud to display the results.
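The lexicon approach described above can be sketched in a few lines. The article's project uses R, so the Python below is only a language-neutral illustration. The lexicon is a tiny hand-made, AFINN-style stand-in with illustrative scores, and the function names are hypothetical.

```python
import re
from collections import Counter

# Toy AFINN-style lexicon. The words and scores are illustrative assumptions,
# not a copy of the real AFINN list.
TOY_LEXICON = {
    "good": 3, "happy": 3, "love": 3,
    "bad": -3, "sad": -2, "terrible": -3,
}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def sentiment_score(text, lexicon=TOY_LEXICON):
    """Sum the lexicon score of every token; unknown words count as 0."""
    return sum(lexicon.get(token, 0) for token in tokenize(text))


def sentiment_label(score):
    """Map a numeric score to positive, negative or neutral."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def scored_word_counts(text, lexicon=TOY_LEXICON):
    """Count how often each lexicon word appears (the input a word cloud needs)."""
    return Counter(token for token in tokenize(text) if token in lexicon)


if __name__ == "__main__":
    review = "I love this product, it makes me happy. The packaging was bad."
    score = sentiment_score(review)
    print(score, sentiment_label(score))
    print(scored_word_counts(review))
```

The word counts returned by `scored_word_counts` are the kind of frequency table a word cloud library would take as input.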
Real-time face detection
Face detection, a method to distinguish a person's face from other parts of the body and the background, is a relatively simple undertaking and can be considered a beginner-level project.
The objective of face detection is to determine if there are any faces in an image or video. Each face that is found is enclosed by a bounding box, whether the image or video contains one face or several. A job applicant should be able to build a simple face detector using Python. Building a program that detects faces is a great way to get started with computer vision.
The library used for this project is the Open Source Computer Vision Library (OpenCV), an open source computer vision and machine learning library with a focus on real-time applications.
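A minimal sketch of such a detector is shown below. It assumes the `opencv-python` package is installed and uses the frontal-face Haar cascade that ships with it. This is one common choice, not necessarily the detector the project had in mind. The function names are hypothetical.

```python
def box_corners(box):
    """Convert an (x, y, width, height) box into top-left and bottom-right points."""
    x, y, w, h = box
    return (x, y), (x + w, y + h)


def detect_faces(image_bgr):
    """Return a list of (x, y, w, h) bounding boxes, one per detected face."""
    import cv2  # imported lazily so box_corners works without OpenCV installed

    # Assumption: the Haar cascade bundled with opencv-python is good enough here.
    cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
    detector = cv2.CascadeClassifier(cascade_path)
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return [tuple(int(value) for value in face) for face in faces]


def draw_boxes(image_bgr, boxes):
    """Draw a green rectangle around every box, modifying the image in place."""
    import cv2

    for box in boxes:
        top_left, bottom_right = box_corners(box)
        cv2.rectangle(image_bgr, top_left, bottom_right, (0, 255, 0), 2)
    return image_bgr
```

For real-time use, the same two functions could be called on each frame read from `cv2.VideoCapture(0)` inside a loop, with `cv2.imshow` displaying the annotated frame.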
Face detection is one of the steps needed for facial recognition, which identifies whose face appears in an image by matching it to a known, authorized person. Deep neural networks are the most widely used method for facial recognition.
After a face is detected, deep learning can solve face recognition tasks through transfer learning, using pretrained models such as VGG16, ResNet50 and FaceNet. These models make it easier to build high-quality face recognition systems, although users can also build their own deep learning models from scratch. Face recognition models can be used in security systems and surveillance, for example.
Spam detection
Spam detection is a classic data science problem, as organizations need to monitor their communication channels for spam emails and messages to ward off data security threats. Google, Yahoo and other major email providers implement spam detection algorithms to handle the threats posed by spam emails.
Training a model to detect spam messages and spam emails is another project for data science applicants to use to build their resumes.
Project: Spam classification
Tools: Scikit-learn, spaCy, NLTK, Python
Data set: SMS Spam Collection Dataset from Kaggle
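To make the idea concrete, here is a small from-scratch naive Bayes classifier, a classic technique for spam filtering. It is only a sketch under stated assumptions: the class name and the six training messages are hypothetical toy data, not the Kaggle data set. In a real project, scikit-learn's `MultinomialNB` would be the more usual choice.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the message and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayesSpamFilter:
    """Multinomial naive Bayes with Laplace (add-alpha) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of messages
        self.vocab = set()

    def fit(self, messages, labels):
        for text, label in zip(messages, labels):
            tokens = tokenize(text)
            self.doc_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def _log_score(self, tokens, label):
        # log P(label) + sum of log P(word | label) over known words
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        total_words = sum(self.word_counts[label].values())
        denominator = total_words + self.alpha * len(self.vocab)
        for token in tokens:
            if token in self.vocab:  # words never seen in training are skipped
                count = self.word_counts[label][token]
                score += math.log((count + self.alpha) / denominator)
        return score

    def predict(self, text):
        tokens = tokenize(text)
        return max(self.doc_counts, key=lambda label: self._log_score(tokens, label))


# Hypothetical toy training data, for illustration only.
TRAIN_MESSAGES = [
    "win a free prize now",
    "free cash claim now",
    "urgent claim your free prize",
    "are we meeting for lunch today",
    "see you at the meeting tomorrow",
    "can you call me later",
]
TRAIN_LABELS = ["spam", "spam", "spam", "ham", "ham", "ham"]

if __name__ == "__main__":
    spam_filter = NaiveBayesSpamFilter().fit(TRAIN_MESSAGES, TRAIN_LABELS)
    print(spam_filter.predict("claim your free prize"))
    print(spam_filter.predict("lunch meeting tomorrow"))
```

With a real data set, the messages would be split into training and test sets so that accuracy, precision and recall can be reported on the resume.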
Data storytelling and visualization
Using data to provide insights, tell stories and convince people of something is an important part of a data science job. What good is doing a top-notch analysis if the CEO doesn't understand it or take action based on it?
This data science project should enable laypeople, such as hiring managers with little coding or statistical backgrounds, to draw the appropriate conclusions. Data visualization and communication skills are important for this project to show and explain the applicant's code.
One example is a data visualization project that uses ggplot2 (a data visualization package for the statistical programming language R) and related libraries to analyze parameters such as the number of trips a Boston Uber driver makes in one day, one month, three months, six months or 12 months. The applicant can use the Uber pickups in Boston data set, for instance, and create visualizations for the different time frames of the year. These visualizations reveal how time affects customer trips.
Project: Uber data analysis project in R
Language: R
Data set: Uber pickups in Boston
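Before anything is plotted, the pickups have to be counted per time frame. The article's project does this in R with ggplot2. The Python sketch below only illustrates that aggregation step, using hypothetical timestamps and a hypothetical timestamp format rather than the real data set.

```python
from collections import Counter
from datetime import datetime

# Hypothetical pickup timestamps; the real data set's columns and format may differ.
SAMPLE_PICKUPS = [
    "2014-04-01 00:11:00",
    "2014-04-01 17:45:00",
    "2014-04-02 08:05:00",
    "2014-05-03 17:20:00",
    "2014-05-03 18:02:00",
]


def parse_timestamp(text):
    """Parse a 'YYYY-MM-DD HH:MM:SS' string into a datetime."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")


def trips_per(pickups, key):
    """Count pickups grouped by whatever the key function extracts."""
    return Counter(key(parse_timestamp(text)) for text in pickups)


def text_bar_chart(counts):
    """Print a rough text bar chart, one row per group, sorted by group."""
    for group in sorted(counts):
        print(f"{group}  {'#' * counts[group]}  {counts[group]}")


trips_by_day = trips_per(SAMPLE_PICKUPS, lambda moment: moment.date().isoformat())
trips_by_month = trips_per(SAMPLE_PICKUPS, lambda moment: moment.strftime("%Y-%m"))
trips_by_hour = trips_per(SAMPLE_PICKUPS, lambda moment: moment.hour)

if __name__ == "__main__":
    text_bar_chart(trips_by_month)
    text_bar_chart(trips_by_hour)
```

Each of these counters maps directly onto one chart: trips per day, per month or per hour of the day.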
Recommender system
A recommender system is a platform that uses a filtering process to offer users content based on their preferences. It takes information about the user as input, evaluates it using a machine learning model and returns recommendations -- for example, movies the user might like.
A movie recommendation can be based on input received from people who have seen a particular film. Their responses can classify a movie as funny, boring, interesting, exciting or even a waste of time.
There are two types of recommender systems:
- Content-based system. This offers recommendations based on the data a user provides. The system generates a user profile based on that data, which it then uses to make suggestions to the user. As the user inputs more data or takes certain actions based on the recommendations, the recommendation engine becomes increasingly accurate. The recorded activity allows an algorithm to suggest movies that are similar to those the user liked in the past.
- Collaborative system. This offers recommendations based on information about other users with similar viewing histories or preferences. Recording users' preferences enables a collaborative system to cluster similar users and provide recommendations based on the activities of users in the same group.
Netflix, for example, recommends movies or shows that are similar to those in a user's viewing history or that other users with similar viewing histories have watched in the past.
Project: Movie recommendation system project in R
Language: R
Data set: MovieLens dataset
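The collaborative approach can be sketched as user-based filtering: find users with similar ratings, then suggest what they rated highly. The article's project uses R, so the Python below is only an illustration. The ratings, user names and function names are hypothetical.

```python
import math

# Hypothetical ratings on a 1-5 scale: user -> {movie: rating}.
RATINGS = {
    "ana": {"Alien": 5, "Heat": 4, "Up": 1},
    "ben": {"Alien": 4, "Heat": 5, "Up": 2, "Jaws": 5},
    "cy": {"Alien": 1, "Up": 5, "Coco": 4},
}


def cosine_similarity(ratings_a, ratings_b):
    """Cosine similarity between two users' rating dictionaries."""
    shared = set(ratings_a) & set(ratings_b)
    if not shared:
        return 0.0
    dot = sum(ratings_a[movie] * ratings_b[movie] for movie in shared)
    norm_a = math.sqrt(sum(value * value for value in ratings_a.values()))
    norm_b = math.sqrt(sum(value * value for value in ratings_b.values()))
    return dot / (norm_a * norm_b)


def recommend(user, ratings, top_n=3):
    """Rank movies the user has not rated by similarity-weighted average rating."""
    weighted_sums = {}
    similarity_sums = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        similarity = cosine_similarity(ratings[user], other_ratings)
        if similarity <= 0:
            continue
        for movie, rating in other_ratings.items():
            if movie in ratings[user]:
                continue  # only recommend movies the user has not rated yet
            weighted_sums[movie] = weighted_sums.get(movie, 0.0) + similarity * rating
            similarity_sums[movie] = similarity_sums.get(movie, 0.0) + similarity
    predicted = {
        movie: weighted_sums[movie] / similarity_sums[movie] for movie in weighted_sums
    }
    return sorted(predicted, key=predicted.get, reverse=True)[:top_n]


if __name__ == "__main__":
    print(recommend("ana", RATINGS))
```

A content-based variant would compare movie attributes, such as genre, against the user's own profile instead of comparing users with one another.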
Optical character recognition
This data science project is great for beginners. Optical character recognition (OCR) is the electronic or mechanical conversion of two-dimensional text, such as a scanned page or a photo of a document, into machine-encoded text. Computer vision can be used to read the image or PDF. After reading the image, use the pytesseract module (an OCR tool for Python) to extract the text data from it. The result is a string that can be displayed or processed further in Python.
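A minimal sketch of this pipeline might look like the following. It assumes the `pytesseract` and `Pillow` packages and the Tesseract engine are installed. The function names and the image path are hypothetical.

```python
def normalize_ocr_text(raw_text):
    """Collapse repeated whitespace and drop the empty lines OCR output often has."""
    lines = (" ".join(line.split()) for line in raw_text.splitlines())
    return "\n".join(line for line in lines if line)


def extract_text(image_path, language="eng"):
    """Run OCR on an image file and return the cleaned-up text as a string."""
    # Imported lazily so normalize_ocr_text works without the OCR stack installed.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as image:
        raw_text = pytesseract.image_to_string(image, lang=language)
    return normalize_ocr_text(raw_text)


if __name__ == "__main__":
    # "scanned_page.png" is a placeholder path, not a file from the article.
    print(extract_text("scanned_page.png"))
```

A scanned PDF would first need to be converted to images, one per page, before it can be passed to `extract_text`.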
Once data science job applicants thoroughly understand how OCR works and the necessary tools, they can tackle more complex problems, such as using sequence-to-sequence attention models to translate the text the OCR reads from one language into another.
Time series prediction
Time series prediction is the study of how metrics behave over time. The time series technique is commonly used in data science with a wide range of applications, including weather forecasting, predicting sales, analyzing annual trends and analyzing website traffic.
A sudden increase in traffic to a website can be a major problem for a company, as it can cause the site to load slowly or crash entirely. Predicting website traffic can enable the company to make better decisions to control the congestion.
Project: Web traffic time series forecasting
Tools: Google Cloud Platform
Algorithms: Recurrent neural networks, long short-term memory (LSTM), autoregressive integrated moving average (ARIMA)-based techniques
Data set: The data set consists of 145,000 time series, representing the number of daily page views of different Wikipedia articles.
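Before reaching for neural networks or ARIMA, it helps to have simple baselines to beat. The sketch below shows two of them, a seasonal naive forecast and a moving average. These are stand-ins chosen for illustration, not the algorithms listed above. The page-view numbers are hypothetical.

```python
def seasonal_naive_forecast(history, horizon, season=7):
    """Repeat the last full season: each day is forecast as the value a week earlier."""
    if len(history) < season:
        raise ValueError("history must cover at least one full season")
    last_season = history[-season:]
    return [last_season[step % season] for step in range(horizon)]


def moving_average_forecast(history, horizon, window=7):
    """Forecast every future step as the mean of the most recent window."""
    recent = history[-window:]
    average = sum(recent) / len(recent)
    return [average] * horizon


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical daily page views for one article: two weeks of history
# followed by three held-out days used for evaluation.
HISTORY = [10, 12, 11, 13, 15, 30, 28, 11, 13, 12, 14, 16, 32, 29]
HOLDOUT = [12, 13, 14]

if __name__ == "__main__":
    seasonal = seasonal_naive_forecast(HISTORY, horizon=3)
    average = moving_average_forecast(HISTORY, horizon=3)
    print("seasonal naive MAE:", mean_absolute_error(HOLDOUT, seasonal))
    print("moving average MAE:", mean_absolute_error(HOLDOUT, average))
```

A more advanced model earns its place in the project only if it produces a lower error than these baselines on held-out data.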
Data sources
One of the key decisions data science job applicants have to make is what data to analyze with any project.
Here are some websites where applicants can find data to work with.
- Kaggle. The world's largest data science community that offers tools and resources to help users achieve their data science goals. Includes different types of data sets of varying sizes that users can download for free.
- Data Portals. A comprehensive list of 590 (to date) open data portals from around the globe, each of which offers its own library of data sets. The data portal is curated by a group of open data experts, including representatives from local, regional and national governments and international organizations, such as the World Bank, and many nongovernmental organizations.
- Data.gov. The home of the U.S. government's open data, which includes data, tools and resources for conducting research, developing web and mobile applications, and designing data visualizations.
- Open Data on AWS. The Registry of Open Data on AWS makes it easy to find data sets publicly available through Amazon services.
- Academic Torrents. A distributed system for sharing massive data sets. The site facilitates the storage of all the data used in research, including data sets and publications.
How to add data science projects to a resume
The best projects to showcase are ones that can be presented succinctly. A well-constructed description of the project can be presented in a few sentences to a paragraph.
When adding data science projects to a resume, applicants should include:
- The name of the project.
- A description of the role -- was this a personal effort or a team effort?
- A brief explanation of the purpose of the project.
- A couple of sentences about how the project was built.
- The tools that were used.
- What the project accomplished.
- A sentence about how the same principle could apply in business.
- A link to the project -- a website that offers data science job applicants the opportunity to showcase all their personal projects in depth.
- A link to the code.
Although many recruiters and hiring managers will follow links and look at candidates' project presentations on their websites or portfolio sites, some will only look at a candidate's GitHub.
As such, applicants should know the basics of GitHub and be familiar with Git -- a version control system they can use to manage and keep track of their source code histories.
Data scientists are in high demand. Consequently, there's enormous potential for growth in this field for skilled professionals. To break into the field of data science, job applicants must impress prospective employers by showcasing their skills and expertise. They can demonstrate they have the necessary skills by adding data science projects to their resumes.