This tutorial explains how to perform fuzzy matching (string matching) with SAS.
Sample Datasets
1. Remove punctuations, special characters, numeric values, dates.
- PRXCHANGE performs a pattern-matching replacement. It is an alternative to TRANWRD which can only replace a single keyword at a time. However, PRXCHANGE can replace multiple keywords at one run.
Full SAS Code : Fuzzy Matching
Method 2 : Improve Performance of Fuzzy Matching
Since COMPGED on 2 tables returns cartesian product of the two tables, it is a very memory extensive process. Suppose you have two tables - first table contains 100 cases and other table contains 1 million cases, it would return 100 million cases.
Tip : Improve Performance of Fuzzy Matching on large table
Suppose you have multiple primary keys to match with the other tables. For example, you have Company Name, Address and PIN.
Sample Datasets
Fuzzy Matching with SAS |
Data temp ;
Input company $30.;
cards;
Vanucover
Reliance
Tata
Tata Motors
;
run;
data temp2;
Input company $30.;
cards;
Tata
tata
Tataz
TataM Jan2015
Tata Motor
Reliance World
Reliance Ltd
Reliance #Petro
Reliance Global
Vanucoverltd 12 Company
;
run;
Steps
Company1 = compress(company, '',"ak");
Company2 = compbl(PrxChange('s/\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\b//i' , -1 , strip(company1)));
- COMPRESS with modifier ak removes punctuations, special characters and numeric values.
- PRXCHANGE performs a pattern-matching replacement. It is an alternative to TRANWRD which can only replace a single keyword at a time. However, PRXCHANGE can replace multiple keywords at one run.
Detailed Explanation - PRXCHANGE
- The 's tells SAS that we want to substitute or replace some keywords.
- The /\b(?: tells SAS to use these keywords when they are not grouped with other keywords. For example, It differentiate between 'Smith Jan' and 'JanSmith'. It would replace 'Jan' from 'Smith Jan' but not from 'Jan Smith'.
- The //i tells SAS to substitute keywords with blank. If you want to substitute with a keyword 'blank', change //i to /blank/i.
2. Setting Stopwords such as Corporation, Corp,Ltd, Limited, Inc etc,
Company3 = compbl(PrxChange('s/\b(?:Corporation|Corp|Ltd|Limited|Inc|Incorporated|Company|Co|LTD|LLC|PLLC)\b//i' , -1 ,Strip(Company2)));
3. Apply fuzzy matching using COMPGED function.
The 'LN' modifier tells SAS to consider the following points before comparing the values
Important Point
The value that is returned by COMPGED(string1, string2) is not always equal to the value that is returned by COMPGED(string2, string1) .
COMPGED(String1, String2, 400, 'LN')The COMPGED function returns the generalized edit distance between two strings. In the code, 400 is the cut off. If the actual generalized edit distance is greater than the value of cutoff, the value that is returned is equal to the value of cutoff.
The 'LN' modifier tells SAS to consider the following points before comparing the values
- remove leading blanks in string-1 and string-2
- remove quotations and ignore case sensitivity.
Important Point
The value that is returned by COMPGED(string1, string2) is not always equal to the value that is returned by COMPGED(string2, string1) .
data _null_;
x = compged('Tata', 'TataM');
y = compged('TataM', 'Tata');
put x y;
run;
data temp3 (drop = company1 company2);set temp2;company1 = compress(company, '',"ak"); /** Keeping only characters**/;/**Stopwords**/;Company2 = compbl(PrxChange('s/\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\b//i' , -1 , strip(company1)));Company3 = compbl(PrxChange('s/\b(?:Corporation|Corp|Ltd|Limited|Inc|Incorporated|Company|Co|LTD|LLC|PLLC)\b//i' , -1 ,Strip(Company2)));run;
proc sql noprint;
create table matrix as
select a.Company , a.company3 as cleaned_text, b.Company as Company2, COMPGED(a.Company3, b.Company, 400, 'LN') as GEDSCORE
from temp3 a left join temp b
on compged(a.company3, b.company, 'LN') < 400
group by a.company
having GEDSCORE= min(GEDSCORE)
order by a.company;
quit;
Method 2 : Improve Performance of Fuzzy Matching
Since COMPGED on 2 tables returns cartesian product of the two tables, it is a very memory extensive process. Suppose you have two tables - first table contains 100 cases and other table contains 1 million cases, it would return 100 million cases.
Tip : Improve Performance of Fuzzy Matching on large table
The quick tip to improve performance is to select exact matches outside COMPGED function and excludes these exact match cases when you apply COMPGED function.
proc sql noprint;Fuzzy Matching with Multiple Criteria
**Selecting exact matches;
create table matrix as
select a.company, a.company3 as cleaned_text, b.company as company2
from temp3 a inner join temp b
on UPCASE(a.company3) = UPCASE(b.company);
**Run compged on non-exact matches;
create table matrix2 as
select a.Company , a.company3 as cleaned_text, b.Company as Company2,
COMPGED(a.Company3, b.Company, 400, 'LN') as GEDSCORE
from temp3 a left join temp b
on compged(a.company3, b.company, 'LN') < 400
where a.company NOT IN (Select company from matrix)
group by a.company
having GEDSCORE= min(GEDSCORE)
order by a.company;
**Joining exact and non-exact matches;
create table matrix3 (drop = GEDSCORE2) as
select *,
case when GEDSCORE2 = . then 0 ELSE GEDSCORE2 END AS GEDSCORE
from
(select * from matrix
union
select * from matrix2 (rename = (GEDSCORE = GEDSCORE2)));
quit;
Suppose you have multiple primary keys to match with the other tables. For example, you have Company Name, Address and PIN.
Sample Datasets
Data xx ;SAS Code : Fuzzy Matching with Multiple Criteria
Input company $1-12 PIN 13-30;
cards;
Vanucover 110051
Reliance 112345
Tata 140000
Tata Motors 125432
;
run;
data xx2;
Input company $15. PIN 16-30;
cards;
Tata 140000
tata 140000
Tataz 125433
TataM Jan2015 125432
Tata Motor 125432
Reliance World 112345
Reliance Ltd 113345
Reliance #Petro 113007
Reliance Global 113600
Vanucoverltd 12 110051
;
run;
data example2 (drop = company1 company2);
set xx2;
company1 = compress(company, '',"ak"); /** Keeping only characters**/;
/**Stopwords**/;
Company2 = compbl(PrxChange('s/\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\b//i' , -1 , strip(company1)));
Company3 = compbl(PrxChange('s/\b(?:Corporation|Corp|Ltd|Limited|Inc|Incorporated|Company|Co|LTD|LLC|PLLC)\b//i' , -1 ,Strip(Company2)));
run;
proc sql noprint;
create table matrix2 as
select a.Company , a.company3 as cleaned_name, b.Company as Company2, a.PIN, b.PIN as PIN2,
COMPGED(a.Company3, b.Company, 'LN') as GEDSCORE_Co,
COMPGED(PUT(a.PIN,10.), PUT(b.PIN,10.)) as GEDSCORE_PIN,
calculated GEDSCORE_Co + calculated GEDSCORE_PIN as GEDSCORE
from example2 a left join xx b
on COMPGED(a.Company3, b.Company, 'LN') <= 400 OR COMPGED(PUT(a.PIN,10.), PUT(b.PIN,10.)) <=200
group by a.company
having GEDSCORE = MIN(GEDSCORE)
order by a.company;
quit;
Important Tips
Stopwords for Addresses
Addresses = PrxChange('s/\b(?:(NORTH|SOUTH|EAST|WEST|STREET|ST|AVENUE|AVE|LANE|LN|PARKWAY|PKWY|WAY|ROAD|RD|DRIVE|DR|
PLACE|PL|CIRCLE|CIR|COURT|CT|PIKE|TERRACE|TER|TRAIL|TRL|TL|BOULEVARD|BLVD)\b//i',-1 ,Strip(_Addresses));
Jaro Winkler Distance
It is best suited for short names such as first and last name. Check out this link -
Really do we use this much complicated codes in real time?
ReplyDeleteReal time implementation can be more complicated than this code as it depends on business requirement
Deletefuzzign matching macro attachment is asking for password , please let me know what to enter in it
ReplyDeleteHi Deepanshu,
ReplyDeleteCan you please fix all the images? I think the link is broken
Thank you,
Long
All the images are loaded fine at my end. are you opening the link from your office? Your organisation may have blocked these images.
Delete