SQL Server: How to do a fuzzy search
I was asked yesterday to do a fuzzy search in SQL Server. The request was to find some numbers in a field that fell within a certain tolerance.
I worked out how to do this pretty simply using a BETWEEN clause with a fuzziness factor to go with it. In a simplified example, I'm looking for all employees within 3 years of 30. I parameterized both the age and the fuzz factor to make changes to this simple:
-- Here I just create a temporary table for the example.
create table #employee (employee varchar(32), age int)
insert into #employee values ('John', 32)
insert into #employee values ('Mary', 24)
insert into #employee values ('Sam', 40)
insert into #employee values ('Sarah', 28)
insert into #employee values ('Charles', 29)
insert into #employee values ('Henry', 31)
-- Here I declare variables to hold the age to look for
-- and the "fuzziness" of the search
declare @fuzz int
set @fuzz = 3
declare @seek_age int
set @seek_age = 30
-- This is the actual search
select employee, age
from #employee e
where e.age between @seek_age - @fuzz and @seek_age + @fuzz
Employee | Age |
---|---|
John | 32 |
Sarah | 28 |
Charles | 29 |
Henry | 31 |
Increasing the @fuzz value increases the possibility of finding records.
In the actual case, I used the BETWEEN clause as the joining clause in a left join to fuzzily join records. This looks pretty scary but works. The following example does this to create a list of employees matched to other employees within 3 years of the same age:
-- Find other employees within three years of an employee's age
declare @fuzz int
set @fuzz = 3
select test.employee test_employee, test.age test_age,
fuzzy.employee fuzzy_employee, fuzzy.age fuzzy_age
from #employee test
left join #employee fuzzy
on test.age between fuzzy.age - @fuzz and fuzzy.age + @fuzz
-- here we don't want an employee to match him or herself
and test.employee <> fuzzy.employee
test_employee | test_age | fuzzy_employee | fuzzy_age |
---|---|---|---|
John | 32 | Charles | 29 |
John | 32 | Henry | 31 |
Mary | 24 | NULL | NULL |
Sam | 40 | NULL | NULL |
Sarah | 28 | Charles | 29 |
Sarah | 28 | Henry | 31 |
Charles | 29 | John | 32 |
Charles | 29 | Sarah | 28 |
Charles | 29 | Henry | 31 |
Henry | 31 | John | 32 |
Henry | 31 | Sarah | 28 |
Henry | 31 | Charles | 29 |
A NULL result mean that that employee has no co-workers with 3 years of their age. Again, the @fuzz factor can be increased to find more matches.
No comments:
Post a Comment
I moderate comments blog posts over 14 days old. This keeps a lot of spam away. I generally am all right about moderating. Thanks for understanding.