fuzzy search
What is a fuzzy search?
A fuzzy search is a technique that uses search algorithms to find strings that match patterns approximately. It's particularly useful for helping users find webpages without having to know exactly what they're looking for or how a word is spelled. Fuzzy searches are also used for Structured Query Language lookups to help database users find records without having to be sure of the exact spelling of the value they're looking for.
A fuzzy search is performed using a fuzzy matching algorithm, which returns a list of results based on likely relevance even though search argument words and spellings may not be an exact match. For web lookups, exact and highly relevant matches appear near the top of the list. Subjective relevance ratings may be given, usually as percentages.
How do fuzzy searches work?
A fuzzy matching algorithm can operate like a spell checker and spelling-error corrector. For example, if a user types "Misissippi" into Yahoo or Google -- both of which use fuzzy matching -- a list of hits is returned along with the question, "Did you mean Mississippi?" Alternative spellings and words that sound the same but are spelled differently are given.
A fuzzy matching algorithm can compensate for common input typing errors, as well as errors introduced by optical character recognition scanning of printed documents. The algorithm can return hits with content that contains a specified base word along with prefixes and suffixes. For example, if planet is entered as a search word, hits occur for sites containing words such as protoplanet or planetary. The program can also find synonyms and related terms, working like an online thesaurus or encyclopedic cross-reference tool.
Fuzzy matching algorithms usually return irrelevant hits, as well as relevant ones. Superfluous results are likely to occur for terms with multiple meanings, only one of which is the meaning the user intends. If the user has only a vague or general idea of the topic or doesn't know exactly what to look for, the ratio of relevant hits to irrelevant hits tends to be low. However, the ratio is even lower when an exact matching program is used in this situation.
Fuzzy searching is much more powerful than exact searching when used for research and investigation and is especially useful when researching unfamiliar, foreign-language or sophisticated terms for which the proper spellings aren't widely known. Fuzzy searching can also be used to locate individuals based on incomplete or partially accurate identifying information.
Many search engines enable users to specifically request a fuzzy search in the search query by using a tilde (~) at the end of the word or term they want to search with fuzziness.
What is the Levenshtein distance?
How close the search term is to the exact match is measured in terms of edit distance -- a metric that represents the cost of converting one string to another. The most widely used method for computing the edit distance is to take the minimum number of single-character changes needed to convert the search term to the exact match. This computation results in the Levenshtein distance, which is named after the Russian scientist who invented the method, Vladimir Levenshtein.
The following single-character changes are used to compute edit distance. Some algorithms use weights to place a higher value on one or more of the primitive operations:
- Insertion: fuzzy serch → fuzzy search
- Deletion: fuzzey search → fuzzy search
- Substitution: fuzzy seerch → fuzzy search
- Transposition: fuzzy saerch → fuzzy search
What are the pros and cons of fuzzy matching?
While fuzzy matching can be helpful to researchers who don't know the exact spelling of what they're looking for, it also can be a source of frustration, as it may return irrelevant results that the user must then filter.
Fuzzy matching offers the following benefits:
- Online visitors can find products or product categories without knowing the correct spellings.
- Tourists can locate cities without knowing exact spellings or by using an approximate spelling in cases where the city name is listed in a different language.
- Scientists can find articles without having to know the exact titles.
- Users can find the names of films or books without knowing the exact titles.
- Users can search for people without knowing the exact spelling of their names.
Fuzzy matching also has the following drawbacks:
- It sometimes returns too many results, requiring the user to sift through all the suggestions to find the one that's most appropriate. Sometimes, the most appropriate result is toward the end of the list, several pages down; and sometimes, none of the results are appropriate.
- It only compensates for typos or misspellings and must be complemented with a semantic algorithm to also consider synonyms and semantics.
What is the difference between a fuzzy search and a wildcard search?
A fuzzy search is either applied by default or specifically requested by the user -- frequently by typing a tilde at the end of the search term. The fuzzy search returns matches within a certain edit distance by assuming that any letters in the search term can be inserted, deleted, substituted or transposed.
A wildcard search can only be specifically requested by the user by using wildcards, which are characters that indicate different kinds of matches. The search engine only considers certain letters can be altered, assuming all the other letters are to be matched exactly. The following are two of the most commonly used wildcards along with their meanings:
- A question mark (?) is used for single-character replacement. A search on f?zzy might yield fuzzy and fizzy.
- An asterisk (*) is used for multiple-character matching. A search on fuzz* might yield fuzzie and fuzzy.
Learn how focused specialty search engines differ from all-purpose search engines, such as Google.