This post covers how to handle some advanced string operations in SAS. SAS provides various functions for handling character strings, but sometimes they are not enough to manipulate strings the way you need.
Example 1 : Generate frequently used keywords
Suppose you have a list of customer complaints with open-ended comments, and you are asked to analyze it. The most common (or basic) text mining technique is to list the most commonly used words in the complaints. This is easy with SAS Text Miner but a little complicated in Base SAS. The following SAS macro accomplishes this task.
%macro frequency(inputdata=,var=,outdata=);
/* Split each comment into one row per word */
data test2;
  set &inputdata.;
  /* Lowercase the text, then keep only letters and spaces */
  varr = compress(lowcase(&var.),' ','ak');
  do i = 1 to countw(varr);
    var1 = scan(varr,i);
    output;
  end;
run;
/* Count each word, skipping words of one or two letters */
proc sql noprint;
  create table &outdata. as
  select var1, count(*) as N from test2
  where length(var1) > 2
  group by 1
  order by N desc;
quit;
%mend;
%frequency(inputdata=temp,var=var,outdata=freqlist);
Macro Parameters
- inputdata : Specify the name of the dataset in which open-ended comments exist
- var : Specify the name of the variable which contains comments
- outdata : Specify the name you want to assign to the output dataset
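The post does not show the input data for the macro call, so here is a minimal usage sketch. The dataset name `complaints`, the variable name `comment`, and the three comments are made up for illustration; the sketch assumes the %frequency macro above has already been submitted.

```sas
/* Hypothetical sample data: not part of the original post */
data complaints;
  infile datalines truncover;   /* do not read past the end of a short line */
  input comment $200.;
  datalines;
The delivery was late and the package was damaged
Late delivery again and nobody answered my call
Package arrived damaged
;
run;

/* Assumes the %frequency macro has already been compiled */
%frequency(inputdata=complaints, var=comment, outdata=freqlist);

/* Show the ten most frequent words */
proc print data=freqlist(obs=10) noobs;
run;
```

Because the macro lowercases the text first, "Late" and "late" are counted as the same word.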
Areas of Improvement
In the macro, the line "where length(var1) > 2" removes all keywords with a length of 2 or less. It is meant to remove common non-meaningful words such as "a", "an", "be", "is", "am", "of", "on", "in", etc. However, it does not catch longer non-meaningful words such as "the", "and", "that", etc. This WHERE condition can also remove important keywords that are abbreviations of a department or business unit. For example, CA refers to Corporate Agency. So, instead of using this line of code, prepare an exclusion list and use it to drop non-meaningful keywords.
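The exclusion-list idea could look something like the sketch below. The `stopwords` dataset, its contents, and the output name `freqlist2` are assumptions for illustration, and the query reads the one-word-per-row dataset `test2` that the macro creates, so it should be run right after the macro call.

```sas
/* Hypothetical exclusion list: extend it to suit your own data */
data stopwords;
  length word $20;
  input word $;
  datalines;
a
an
is
of
the
and
that
was
;
run;

/* Same counting step as the macro, but filtered by the list */
/* instead of by word length, so abbreviations like "ca" stay */
proc sql;
  create table freqlist2 as
  select var1, count(*) as N
  from test2
  where var1 not in (select word from stopwords)
    and var1 ne ' '
  group by var1
  order by N desc;
quit;
```

Keeping the list in a dataset means it can be maintained separately and reused across analyses without editing the macro.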
Example 2 : Reverse a Character String
Suppose you have a list of words. You are asked to reverse each of them.
Create a Sample Dataset
data temp;
input list $50.;
cards;
listendata
saspythonr
datascience
analytics
;
run;
REVERSE Function
data temp2;
set temp;
x = left(reverse(list));
run;
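In SAS, there is a function available for reversing a string. The function is called REVERSE. The LEFT function is applied to the result of REVERSE to remove leading spaces: the variable is padded with trailing blanks up to its length, and those blanks end up at the front once the value is reversed.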
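To see why LEFT is needed, here is a small sketch that checks where the first non-blank character lands with and without it. The variable names and the 10-character length are chosen only for illustration.

```sas
data _null_;
  length word r1 r2 $10;
  word = 'sas';               /* stored as 'sas' plus 7 trailing blanks */
  r1 = reverse(word);         /* trailing blanks become leading blanks */
  r2 = left(reverse(word));   /* LEFT moves those blanks back to the end */
  first1 = verify(r1, ' ');   /* position of the first non-blank character */
  first2 = verify(r2, ' ');
  put first1= first2=;        /* writes first1=8 first2=1 to the log */
run;
```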
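You may want to get your hands dirty by writing the code yourself, without using the REVERSE function. One way is to extract each letter from the string with a DO loop and then reverse the order with PROC SORT, RETAIN, and the FIRST. and LAST. variables. See the code below -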
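/* One row per letter, with its position i */
data test;
  set temp;
  do i = 1 to length(list);
    list1 = substr(list,i,1);
    output;
  end;
run;
/* Within each word, put the last letter first */
proc sort data = test;
  by list descending i;
run;
/* Concatenate the letters of each word back into one string */
data test2;
  set test(keep = list list1);
  retain list2;
  by list;
  if first.list then list2 = trim(list1);
  else list2 = cats("",list2,list1);
  if last.list;
  keep list list2;
run;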
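As an alternative sketch (not the approach used in this post), the same result can be built in a single DATA step by writing each character into its mirrored position. The output name `temp3` and the variable `n` are assumptions for illustration.

```sas
data temp3;
  set temp;
  length list2 $50;
  n = length(list);           /* position of the last non-blank character */
  do i = 1 to n;
    /* SUBSTR on the left of the = sign overwrites one character of list2 */
    substr(list2, i, 1) = substr(list, n - i + 1, 1);
  end;
  drop i n;
run;
```

This version works row by row, so it needs no sorting, and two rows that hold the same word are reversed separately rather than being combined into one BY group.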
Example 3 : Extracting Alternate Letters from a String
Suppose you are asked to pull alternate letters from a character string. The logic is similar to the REVERSE code, with two changes: (1) the loop increments by 2 instead of 1, and (2) the letters are not sorted in descending order.
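/* One row for every second letter (positions 1, 3, 5, ...) */
data test2;
  set temp;
  do i = 1 to length(list) by 2;
    list1 = substr(list,i,1);
    output;
  end;
run;
/* Group the rows by word; the letters keep their original order */
proc sort data = test2;
  by list;
run;
/* Concatenate the kept letters of each word into one string */
data test3;
  set test2(keep= list list1);
  retain list2;
  by list;
  if first.list then list2 = trim(list1);
  else list2 = cats("",list2,list1);
  if last.list;
  keep list list2;
run;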
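The alternate letters can also be collected in one DATA step, sketched here under the same assumptions as the sample data above (the output name `alt` is hypothetical). For "listendata" it keeps positions 1, 3, 5, 7 and 9, giving "lsedt".

```sas
data alt;
  set temp;
  length list2 $50;
  do i = 1 to length(list) by 2;
    /* i is always odd, so (i + 1) / 2 is the next free slot: 1, 2, 3, ... */
    substr(list2, (i + 1) / 2, 1) = substr(list, i, 1);
  end;
  drop i;
run;
```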