This is just intended to be a quick tip on cleaning up string fields in SPSS. Frequently if I am parsing a field or matching string records (such as names or addresses) I don’t want extra ascii characters besides names and/or numbers in the field. For example, if I have a name I might want to eliminate hyphens or quotes, or if I have a field that is meant to be a house number I typically do not want any alpha character in the end (geocoding databases will rarely be able to tell the difference between Apt’s 1A
and 1B
).
We can use a simple loop and the PIB variable format in SPSS to clean out unwanted ascii codes in string characters. So for instance if I wanted to replace all the numbers with nothing in a string field I could use this code below (where OrigField
is the original field with the numbers contained, and CleanField
is the subsequent cleaned variable).
string CleanField (A5).
compute CleanField = OrigField.
loop #i = 48 to 57.
compute CleanField = REPLACE(CleanField,STRING(#i,PIB),"").
end loop.
The DEC
column in the linked ascii table corresponds to the ascii character code in SPSS’s PIB format. The numbers 0 through 9 end up being 48 to 57 in decimal values, so I create a string corresponding to those characters via the string(#i,PIB)
commmand and replace them with nothing in the REPLACE
command. I loop through values of 48 to 57 to get rid of all numeric values.
This extends to potentially all characters, for instance if I want to return only capital alpha characters, I could use a loop with an if statement like below;
string CleanField (A5).
compute CleanField = OrigField.
loop #i = 1 to 255.
if #i < 65 or #i > 90 CleanField = REPLACE(CleanField,STRING(#i,PIB),"").
end loop.
There are (a lot) more than 255 ascii characters, but that should suffice to clean up most string fields in English.
Jon Peck
/ July 9, 2013Be careful, since there are lots of Unicode characters that you might want to keep that are outside the 0-255 range and also should not be inspected byte by byte. For example, in Unicode mode, the British pound symbol has hex value C2A9, taking two bytes, and the Euro symbol is E282A0 – three bytes. C2 in the usual English Windows code page 1252 is capital A with hat, but in Unicode mode it is the first half of the pound character.
A useful way to see what the codes really are for character data is to set the variable format to AHEX. The size field needs to be double the A width, so A10 would become AHEX20. This has to be done in syntax (FORMAT s(AHEX20)), since AHEX is not supported in the Data Editor type list.
Note also that the replace function shown above can actually change the character locations in Unicode mode, although since strings are fixed width, this would result in blank padding or truncation.
apwheele
/ July 9, 2013Yep agree Jon – if you want search/clean such characters not byte by byte you should prob. have a strong preference for using Python and regex’s unless you only have a small list. (On my to-do list to write a blog post illustrating that)
Bruce Weaver
/ January 9, 2022Hi Andy. Here is confirmation that a modification of your code worked to remove non-printable characters that were causing MATCH FILES to fail.
https://www.researchgate.net/post/How_can_I_merge_SPSS_data_with_an_unequal_number_of_cases_using_point-and-click_options_is_giving_me_duplicate_cases
Cheers,
Bruce
apwheele
/ January 9, 2022Thanks for the note Bruce, seeing that thread I think a good addition to alter type would be something like Unicode to ASCII for strings.