Using sequential case processing for data management in SPSS

When making calculations, SPSS essentially loops through every case in the dataset sequentially. So although calculations in syntax are always vectorized (the exception being explicit loops in MATRIX commands), that is, compute y = x - 5. works over the entire x vector without specifying it, under the hood SPSS is just looping through all of the records in the dataset and calculating the value of y one row at a time.
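
A minimal sketch of what this buys you (x, y, and run_sum are hypothetical variable names, not from any dataset in this post): because cases are processed from top to bottom, LAG can reference the value a variable was just assigned on the prior row.

* One statement transforms every case; no explicit loop is written.
COMPUTE y = x - 5.
* Because the pass is sequential, LAG can pull the prior row's just-computed value.
* Here that builds a running sum (SUM ignores the missing LAG on the first case).
COMPUTE run_sum = SUM(x, LAG(run_sum)).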

We can use this to our advantage in a variety of data management tasks in conjunction with lagged values in the data matrix. Let’s consider making a counter variable within a set of ids. Consider the example dataset below:

data list free /id value.
begin data
1 10
1 11
1 14
1 13
2 12
2 90
2 16
2 14
3 12
3 8
3 17
3 22
end data.
dataset name seq.

To make a counter for the entire dataset, it would be as simple as using the system variable $casenum, but what about a counter variable within each unique id value? Well we can use SPSS’s sequential case processing and LAG to do that for us. For example (note that this assumes the cases are already sorted so the ids are in sequential order in the dataset):

DO IF id <> LAG(id) or MISSING(LAG(id)) = 1.
    COMPUTE counter_id = 1.
ELSE IF id = LAG(id).
    COMPUTE counter_id = lag(counter_id) + 1.
END IF.

The first if statement evaluates whether the previous id value is different (or missing, which happens for the first row in the dataset), and if so starts the counter at 1. If the lagged id value is the same, it increments the counter by 1. It should be clear how this can be used to identify duplicate values as well.

Although the MATCH FILES command can frequently be more economical, identifying duplicates is pretty easy using sort. For instance, let’s say in the previous example I wanted to have only one row per id in the dataset (i.e. eliminate duplicate ids), but keep only the highest value within each id. This can be done just by sorting the dataset in a particular way (so the case with the highest value is always at the top of each set of sequential ids).

SORT CASES BY id (A) value (D).
COMPUTE dup = 0.
IF id = lag(id) dup = 1.
SELECT IF dup = 0.

The equivalent expression using match files would be as follows (note the reversal of dup in the two expressions; with match files I want to select the 1 value):

SORT CASES BY id (A) value (D).
MATCH FILES file = *
/first = dup
/by id.
SELECT IF dup = 1.

The match files approach scales better to more variables. If I had two variables I would need to write IF id1 = lag(id1) and id2 = lag(id2) dup = 1. with the lag approach, but only need to write /by id1 id2. for the match files approach (see the sketch following the AGGREGATE example below). Again this particular example can be trivially done with another command (AGGREGATE in this instance), but the main difference is that the two approaches above keep all of the variables in the current dataset, while keeping them needs to be explicitly written on the AGGREGATE command.

DATASET ACTIVATE seq.
DATASET DECLARE agg_seq.
AGGREGATE
  /OUTFILE=agg_seq
  /BREAK=id
  /value=MAX(value).
dataset close seq.
dataset activate agg_seq.
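
To make the scaling point concrete, here is a sketch of the duplicate selection with two keys, where id1 and id2 are hypothetical variables (not in the example dataset above):

SORT CASES BY id1 (A) id2 (A) value (D).
COMPUTE dup = 0.
IF id1 = lag(id1) and id2 = lag(id2) dup = 1.
SELECT IF dup = 0.

With match files only the BY list grows:

SORT CASES BY id1 (A) id2 (A) value (D).
MATCH FILES file = *
/first = dup
/by id1 id2.
SELECT IF dup = 1.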

One may think that the sequential case processing is not that helpful, as I’ve shown some alternative ways to do the same thing. But consider a case where you want to propagate down values; this can’t be done directly via match files or aggregate. For instance, I’ve done some text munging of tables exported from PDF files that look approximately like this when read into an SPSS data file (where I use periods to symbolize missing data):

data list free /table_ID (F1.0) row_name col1 col2 (3A5).
begin data
1 . Col1 Col2 
. Row1 4 6
. Row2 8 20
2 . Col1 Col2
. Row1 5 10
. Row2 15 20
end data.
dataset name tables.

Any more useful representation of the data would need to associate particular rows with the table they came from. Here is sequential case processing to the rescue:

if MISSING(table_ID) = 1 table_ID = lag(table_ID).

Very simple fix, but perhaps not intuitive without munging around in SPSS for awhile. For another simple application of this, see this NABBLE discussion where I give an example of propagating down and concatenating multiple string values (a sketch of that idea is below). Another (more elaborate) example of this can be seen when merging and sorting a database of ranges to a number within the range.
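
As a sketch of that propagate-and-concatenate idea (assuming a numeric id and a hypothetical string variable val, with the data sorted by id):

STRING all_vals (A100).
DO IF id <> LAG(id) OR MISSING(LAG(id)).
    COMPUTE all_vals = val.
ELSE.
    COMPUTE all_vals = CONCAT(RTRIM(LAG(all_vals)), ",", RTRIM(val)).
END IF.
* The last case within each id then holds the full concatenated string.
MATCH FILES file = *
/last = complete
/by id.
SELECT IF complete = 1.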

This is what I would consider an advanced data management tool, and one that I use on a regular basis.


Comments

  1. David Marso / March 13, 2013

     I like the following approach.

     SORT CASES BY gp_1 TO gp_#.
     COMPUTE @=1.
     SPLIT FILE BY gp_1 TO gp_#.
     CREATE CaseIndex=CSUM(@).
     DELETE VARIABLES @.

    • Yes I can see the appeal of that code snippet David. Unfortunately I frequently do this with very large datasets and many groups, in which case split file is less than friendly. It has in fact caused SPSS to freeze the last few times I have attempted your code on the usual dataset I apply this to (~200,000 cases with around 140,000 groups in V15).

      I should probably just always use the match files approach, but it is good to know how to utilize CSUM (and split file) in similar data management tasks.

      • David Marso / March 14, 2013

        It might be that the output is getting slaughtered by the huge PIVOT table from the 140K strata. I just ran the following on 2 million rows with 200K strata (see the *undocumented* SET ERRORS OFF RESULTS OFF combination; undocumented in the sense that both need to be present for the output suppression).

        OTOH: I believe the MATCH FILES followed by the logical expression is probably better performance-wise with large files. Note that CREATE forces a data pass while MATCH/COMPUTE do not. If one has stats to perform immediately then nix the EXECUTE and just roll. OTOH: best NOT attempt any SELECT statements involving the index prior to a data pass unless you really have your wits intact (LAG is awesome but …)

        INPUT PROGRAM.
        LOOP ID=1 TO 2000000.
        COMPUTE STRATA=TRUNC(RV.UNIFORM(1,200000)).
        END CASE.
        END LOOP.
        END FILE.
        END INPUT PROGRAM.

        ** Require this for both methods **.
        SORT CASES BY STRATA.

        ** For INDEX1 **.
        SPLIT FILE BY STRATA.
        COMPUTE @=1.
        PRESERVE.
        SET RESULTS OFF ERRORS OFF.
        CREATE INDEX1=CSUM(@).
        SPLIT FILE OFF.
        DELETE VARIABLES @.
        RESTORE.

        ** For Index2 **.

        MATCH FILES / FILE * / BY STRATA / FIRST = INDEX2.
        IF NOT(INDEX2 EQ 1) INDEX2=LAG(INDEX2)+1.
        EXECUTE.

  2. Good points as usual David. I forgot about the table part of split files (it would be nice to have an option right on split file to turn tables off completely).

    Also re the data passes and lagged values, totally agree as well. The above works in the simple case, but I’ve had some instances in which some weird things happened with accumulated transformations. I will update the blog post when I get a chance to note this explicitly.

  3. David Marso / March 14, 2013

     Actually, ideally the SUPPRESSION should be an option in CREATE. Then we wouldn’t need that shameless SET hack.
     CREATE var=Fn(???) /NOCRAP.
     or something of that sort.

