Making an Edge List in SPSS

I have planned a series of posts on some data manipulation for network data. Here I am going to show how to go from either a list of network relations in long or wide format to a list of (non-redundant) edges in SPSS.

So to start off let's define what I mean by long, wide and edge format. Long format would consist of a table where there is an ID column defining a shared instance, and a separate column defining the nodes that have relations within that event. Imagine an event is a criminal incident, and the related nodes can be offender(s) or victim(s).

So a long table might look like this;


Incident# Name  Status
----------------------
1         Mary  O
1         Joe   O
1         Steve V
2         Bob   O
2         Ted   V
2         Jeff  V

Here, incident 1 has three nodes: Mary, Joe and Steve. The Status field indicates whether the node is an offender, O, or a victim, V. They are all related through the incident number field. Wide format of this data would have only one record for each unique incident number, and the people "nodes" would span across multiple columns. It might then look something like below, where a . represents missing data.


Incident# Offender1 Offender2 Victim1 Victim2
---------------------------------------------
1         Mary      Joe       Steve   .
2         Bob       .         Ted     Jeff

I’ve encountered both of these data formats in police RMS databases, and the solution I propose for making an edge list goes back and forth between the two formats to produce the final list. So what do I mean by an edge list? Below is an example;


Incident# FromID ToID  FromStatus ToStatus
------------------------------------------
1         Mary   Joe   O          O
1         Mary   Steve O          V
1         Joe    Steve O          V
2         Bob    Ted   O          V
2         Bob    Jeff  O          V
2         Jeff   Ted   V          V

Here we define all possible relationships within each incident, ignoring the order of the FromID and ToID fields (e.g. Mary Joe is equivalent to Joe Mary). Why do we want an edge list like this? In further posts I will show how to do some data manipulation to find neighbors of different degrees using data formatted like this, but suffice to say many graph drawing algorithms need data in this format (or return data in this format).

So below I will show an example in SPSS of going from the long format to the edge list format. In doing so I will transform the long list to the wide format, so it is trivial to adapt the code to go from the wide format to the edge list (instead of generating the wide table from the long table, you would generate the long table from the wide table).

So to start, let's use some data in long format.


data list free / MyID (F1.0) Name (A1) Status (A1).
begin data
1 A O
1 B O
1 C V
1 D V
2 E O
2 F O
2 G V
end data.
dataset name long.

Now I will make a copy of this dataset, and reshape to the wide format. Then I merge the wide dataset into the long dataset.


*make copy and reshape to one row.
dataset copy wide.
dataset activate wide.
casestovars
/id MyID
/seperator = "".

*merge back into main dataset.
dataset activate long.
match files file = *
/table = 'wide'
/by MyID.
dataset close wide.

From here we will reshape the dataset to long again, and this will create a full expansion of possible pairs. This produces much redundancy though. So first, before I reshape back to long, I get rid of values in the new set of Name? variables that match the original Name variable (you can't have an edge with oneself). You could technically do this after VARSTOCASES, but I prefer to make as little extra data as possible. With big datasets this can expand to be very big: a case with n people would expand to n^2 rows, and eliminating self-referencing edges reduces that to n(n-1). Also, I eliminate cases based on the sort order between Name and the wide Name1 to Name4 variables (referenced as XName within the DO REPEAT), which reduces it to n(n-1)/2 total cases after the VARSTOCASES command (which by default drops missing data). For example, the first incident in the example data, with four people, would expand to 16 rows, 12 after dropping self-edges, and 6 after also dropping the reversed duplicates.


*Reshape to long again!
do repeat XName = Name1 to Name4 /XStatus = Status1 to Status4.
DO IF Name = XName OR  Name > XName. 
    compute Xname = " ".
    compute XStatus = " ".
END IF.
end repeat.
VARSTOCASES
/make XName from Name1 to Name4
/make XStatus from Status1 to Status4.

So you end up with a list of non-redundant edges with supplemental information on the nodes (note the compound DO IF condition could be collapsed to Name >= XName; here I leave it as is to distinguish the self-edge check from the ordering check). To follow are some more posts about manipulating this data further to produce neighbor lists. I’d be interested to see if anyone has better ideas about how to make the edge list. It is easier to make pairwise comparisons in MATRIX programs, but I don’t go that route here because my intended uses are datasets too big to fit into memory. My code will certainly be slow though (CASESTOVARS and VARSTOCASES are slow operations on large datasets). Maybe an efficient XSAVE? (Not sure – let me know in the comments!) The Wikipedia page on SQL joins has an example of using a self join to produce the same type of edge table as well.
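
As a quick optional sanity check (just a sketch using the variables created above), you can sort and list the resulting edges to verify they match the edge table at the top of the post.

*Optional check - sort and list the non-redundant edges.
sort cases by MyID Name XName.
list variables = MyID Name Status XName XStatus.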


Below is the full code from the post without the text in between (for easier copying and pasting).


data list free / MyID (F1.0) Name (A1) Status (A1).
begin data
1 A O
1 B O
1 C V
1 D V
2 E O
2 F O
2 G V
end data.
dataset name long.

*make copy and reshape to one row.
dataset copy wide.
dataset activate wide.
casestovars
/id MyID
/seperator = "".

*merge back into main dataset.
dataset activate long.
match files file = *
/table = 'wide'
/by MyID.
dataset close wide.

*Reshape to long again! - and then get rid of duplicates.
do repeat XName = Name1 to Name4 /XStatus = Status1 to Status4.
DO IF Name = XName OR  Name > XName. 
    compute Xname = " ".
    compute XStatus = " ".
END IF.
end repeat.
VARSTOCASES
/make XName from Name1 to Name4
/make XStatus from Status1 to Status4.

An example of using a MACRO to make a custom data transformation function in SPSS

MACROS in SPSS are ways to make custom functions. They can either accomplish very simple tasks, as I illustrate here, or can wrap up large blocks of code. If you pay attention to many of my SPSS blog posts, or the NABBLE SPSS forum, you will see a variety of examples of their use. They aren’t typical fodder though for introductory books on SPSS, so here I will provide a very brief example and refer those interested to other materials.

I was reading Gelman and Hill’s Data Analysis Using Regression and Multilevel/Hierarchical Models, and for their chapter on logistic regression they define a function in R, invlogit, to prevent the needless repetition of writing 1/(1 + exp(-x)) (where x is some arbitrary value or column of data) when transforming predictions on the logit scale to the probability scale. We can do the same in SPSS with a custom macro.


DEFINE !INVLOGIT (!POSITIONAL  !ENCLOSE("(",")") ) 
1/(1 + EXP(-!1))
!ENDDEFINE.

To walk through the function, an SPSS macro definition starts with a DEFINE statement and ends with !ENDDEFINE. In between these are the name of the custom function, !INVLOGIT, and the parameters the function will take within parentheses. This function only takes one parameter, defined as the first argument passed after the function name that is enclosed within parentheses, !POSITIONAL !ENCLOSE("(",")").

After those statements comes the body the macro will expand to. Here it is just a simple data transformation, 1/(1 + EXP(-!1)), where !1 is where the argument is passed to the function. The positional counter increments if you use multiple !POSITIONAL arguments in a macro, starting at !1. The !ENCLOSE statement says the value that will be passed to !1 will be contained within a left and a right parenthesis.

When the macro is called, by typing !INVLOGIT(x) for example, it will then expand to the SPSS syntax 1/(1 + EXP(-x)), where the !1 is replaced by x. I could pass anything within the parentheses though, like a constant value or a more complicated expression such as (x+5)/3*1.2. To make sense, the argument only needs to resolve to a numeric value. The macro is just a tool that, when expanded, writes SPSS code with the arbitrary arguments inserted.
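
To illustrate how the positional counter increments, here is a hypothetical two-argument macro; the !RESCALE name, its body, and the toy data are made up for this example, and each argument is again enclosed in parentheses.

*Hypothetical macro with two positional arguments - !1 is the first, !2 the second.
DEFINE !RESCALE (!POSITIONAL !ENCLOSE("(",")")
                /!POSITIONAL !ENCLOSE("(",")") )
(!1 - !2)
!ENDDEFINE.

*Toy data so the example is runnable.
data list free / x.
begin data
3 7 12
end data.

*Expands to: compute centered = (x - 5).
compute centered = !RESCALE (x) (5).
EXECUTE.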

Below is a simple example use case. One frequent beginner mistake when debugging incorrect code is not expanding the macro call in the output log using SET MPRINT ON., so the code includes that as an example (and uses PRESERVE. and RESTORE. to keep your initial settings).


DEFINE !INVLOGIT (!POSITIONAL  !ENCLOSE("(",")") ) 
1/(1 + EXP(-!1))
!ENDDEFINE.

data list free / x.
begin data
0.5
1
2
end data.

PRESERVE.
SET MPRINT ON.
compute logit = !INVLOGIT(x).
RESTORE.
*you can pass more complicated arguments since.
*they are enclosed within parentheses.
compute logit2 = !INVLOGIT(x/3).
compute logit3 = !INVLOGIT((x+5)/3*1.2).
compute logit4 = !INVLOGIT(1).
EXECUTE.

Like all SPSS transformation statements, the !INVLOGIT call is not sensitive to case (e.g. you could write !InvLogit(1) or !invlogit(1) and both would be expanded). It is typical practice to write custom macro functions with a leading exclamation mark, not because it is necessary, but to clearly differentiate them from native SPSS functions. Macros can potentially be expanded even within * marked comments (but will not be expanded in /* */ style comments), so I typically write macro names excluding the exclamation in comments and state something along the lines of *replace the * with a ! to run the macro. Here I intentionally write the macro to look just like an SPSS data transformation that takes one parameter enclosed within parentheses. Also, I do not call the EXECUTE statement in the macro, so just like all data transformations it is not immediately performed.

This is unlikely to be the best example case for macros in SPSS, but I merely hope to provide more examples to the unfamiliar. Sarah Boslaugh’s An Intermediate Guide to SPSS Programming has one of the simplest introductions to macros in SPSS you can find. Also this online tutorial has some good use examples of using loops and string functions to perform a variety of tasks with macros. Of course viewing Raynald’s site of SPSS syntax examples provides a variety of use cases in addition to the programming and data management guide that comes with SPSS.

Why I feel SPSS (or any statistical package) is better than Excel for this particular job

I debated pulling an Andrew Gelman and adding a ps to my prior Junk Charts Challenge post, but it ended up being too verbose, so I just made an entirely new follow-up. To start, the discussion so far has evolved over this series of posts;

  • The original post on remaking a great line chart by Kaiser Fung, with the suggestion that the task (data manipulation and graphing) is easier in Excel.
  • My response on how to make the chart in SPSS.
  • Kaiser’s response to my post, in which I doubt I swayed his opinion on using Excel for this task!

It appears to me, based on the discussion so far, that the only real quarrel is whether the data manipulation is sufficiently complicated, compared to the ease of pointing and clicking, to justify using something other than Excel. Recreating Kaiser’s chart in SPSS does take some advanced knowledge of sorting and using lags to identify the pit and the recoveries (the same logic could be extended to the data manipulations Kaiser says I skim over, as long as you can numerically or externally define what is the start of a recession).

All things considered for the internet, discussion has been pretty cordial so far. Although it is certainly sprinkled into my post, I didn’t mean for my post on SPSS to say that the task of grabbing data from online, manipulating it, and creating the graph was in any objective way easier in SPSS than in Excel. I realize pointing-and-clicking in Excel is easier for most, and only a few really adept at SPSS (like myself) would consider it easier in SPSS. I write quite a few tutorials on how to do things in SPSS, and that was one of the motivations for the tutorial. I want people using SPSS (or really any graphing software) to make nice graphs – and so if I think I can add value this way to the blogosphere I will! I hope most of my value added is through SPSS tutorials, but I try to discuss general graphing concepts in the posts as well, so even for those not using SPSS they hopefully have some other useful content.

My original post wasn’t meant to discuss why I feel SPSS is a better tool for this particular task, although it is certainly a reasonable question to ask (to be frank, I tried to avoid it to prevent flame wars – but it appears I’ve stepped in it at this point). As one of the comments on Kaiser’s follow up notes (and I agree), some tools are better for some jobs and we shouldn’t prefer one tool because of some sort of dogmatic allegiance. To make it clear though, and it was part of my motivation to write my initial response to the challenge post, I highly disagree that this particular task, which entails grabbing data from the internet, manipulating it, creating a graph, and updating said graph on a monthly basis, is better done in Excel. For a direct example of my non-allegiance to doing everything in SPSS for this job, I wouldn’t do the grabbing-the-data-from-the-internet part in SPSS (indeed, it isn’t even directly possible unless you use Python code). Assuming it could be fully automated, I would write a custom SPSS job that manipulates the data after a wget command grabs it, and have it all wrapped up in one bat file that runs on a monthly timer.

To go off on a slight tangent, why do I think I’m qualified to make such a distinction? Well, I use both SPSS and Excel on a regular basis. I wouldn’t consider myself a wiz at Excel nor VBA for Excel, but I have made custom Excel macros in the past to perform various jobs (make and format charts/tables etc.), and I have one task (a custom daily report of the crime incidents reported the previous day) I do on a daily basis at my job in Excel. So, FWIW, I feel reasonably qualified to make decisions on which tasks I should perform in which tools. So I’m giving my opinion, the same way Kaiser gave his initial opinion. I doubt my experience is as illustrious as Kaiser’s, but you can go to my CV page to see my current and prior work roles as an analyst. If I thought Excel, or Access, or R, or Python, or whatever was a better tool I would certainly personally use and suggest that. If you don’t have a little trust in my opinion on such matters, well, you shouldn’t read what I write!

So, again to be clear, I feel this is a job better for SPSS (both the data manipulation and creating the graphics), although I admit it is initially harder to write the code to accomplish the task than pointing, clicking and going through chart wizards in Excel. So here I will try to articulate those reasons.

  • Any task I do on a regular basis, I want to be as automated as possible. Having to point-click and copy-paste on a regular basis both invites human error and is a waste of time. I don’t doubt you could fully (or very nearly) automate the task in Excel (as the comment on my blog post mentions). But this will ultimately involve scripting in VBA, which diminishes any claim that the Excel solution is easier than the SPSS solution.
  • The breadth of data management capabilities, statistical analysis, and graphics is much larger in SPSS than in Excel. Consider the VBA code necessary to replicate my initial VARSTOCASES command in Excel, that is, reshaping wide data to stacked long form. Consider the VBA code necessary to execute summary statistics over different groups without knowing what the different groups are beforehand (see the sketch just below this list). These are just a sampling of data management tools that are routine in statistics packages. In terms of charting, the most obvious function lacking in Excel is that it currently does not have facilities to make small-multiple charts (you can see some exceptional hacks from Jon Peltier, but those are certainly more limited in functionality than SPSS). Not mentioned (but most obvious) are the statistical capabilities of statistical software!
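
Regarding the group summaries point above, in SPSS one AGGREGATE command handles whatever groups happen to be in the data; the sketch below uses made-up dept and sales variables purely for illustration.

*Hypothetical data - the BREAK groups do not need to be known beforehand.
data list free / dept (A8) sales (F8.0).
begin data
East 10
West 25
East 30
North 5
West 15
end data.
dataset name sales_example.

*Per-group summary statistics appended back onto the data.
AGGREGATE
  /OUTFILE=* MODE=ADDVARIABLES
  /BREAK=dept
  /mean_sales=MEAN(sales)
  /n_sales=N.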

So certainly this particular job could be done in Excel, as it does not require any functionality unique to a stats package. But why hamstring myself with these limitations from the outset? Frequently, after I build a custom, routine analysis like this, I continually go back and provide more charts, so even if I have a good conceptualization of what I want to do at the outset there is no guarantee I won’t want to add this functionality in later. In terms of charting, not having flexible small multiple charts is really a big deal; they can be used all the time.

Admittedly, this job is small enough in scope that if, say, the prior analyst was producing the regularly updated chart via copy-paste like Kaiser suggests, I would consider just keeping that same format (there is an opportunity cost to re-writing the code in SPSS, and the fact that the update is only monthly means it would take quite some time to recover that cost through automation). I just personally have enough experience in SPSS to know I could script a solution quicker from the outset in SPSS than in Excel (I certainly can’t extrapolate that to anyone else though).

Part of both my preference and experience in SPSS comes from the jobs I personally have to do. For an example, I routinely pull a database of 500,000 incidents, do some data cleaning, and then merge this to a table of 300,000 charges and offenses and then merge to a second table of geocoded incident locations. Then using this data I routinely subset it, create aggregate summaries, tables, estimate various statistics and models, make some rudimentary maps, or even export the necessary data to import into a GIS software.

For argument’s sake (with the exception of some of the more complicated data cleaning) this could mostly be done in SQL – but certainly no reasonable person should consider doing these multiple table merges and data cleaning in Excel (the nice interactive facilities for working with the spreadsheet in Excel are greatly diminished with any table that takes more than a few scrolls to see). Statistical packages are really much more than tools to fit models; they are tools for working with and manipulating data. I would highly recommend that if you have to conduct routine tasks in which you manipulate data (something I assume most analysts have to do) you consider learning statistical software, the same way I would recommend you get to know SQL.
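
To give a flavor of the kind of multi-table merge described above, here is a hedged sketch in SPSS; the file names, the IncidentID key, and the sort assumptions are all made up for illustration.

*Hypothetical sketch of merging incidents to charges and geocoded locations.
*All file and variable names are made up; inputs assumed pre-sorted by IncidentID.
match files file = 'charges.sav'
 /table = 'incidents.sav'
 /by IncidentID.
dataset name merged.
match files file = *
 /table = 'geocoded.sav'
 /by IncidentID.
execute.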

To be more balanced, here are things (knowing SPSS really well and Excel not as thoroughly) I think Excel excels at compared to SPSS;

  • Ease of making nicely formatted tables
  • Ease of directly interacting and editing components of charts and tables (this includes adding in supplementary vector graphics and labels).
  • Sparklines
  • Interactive Dashboards/Pivot Tables

Routine data management is not one of them, and really only for sparklines and interactive dashboards would I prefer to make an end product in Excel over SPSS (and that doesn’t mean the whole workflow needs to be in one software). I clean up ad-hoc tables for distribution in Excel all the time, because (as I said above) editing them in Excel is easier than editing them in SPSS. Again, my opinion, FWIW.

The Junk Charts Challenge: Remaking a great line chart in SPSS

I read and very much enjoy Kaiser Fung’s blog Junk Charts. In one of the exchanges in the comments to the post Remaking a great chart, Kaiser asserted it was easier to make the original chart in Excel than in any current programming language. I won’t deny it is easier to use a GUI dialog than to learn some code, but here I will present how you would go about making the chart in SPSS’s grammar of graphics. The logic extends part-and-parcel to ggplot2.

The short answer is that the data is originally in wide format, and in most statistical packages it is only possible (or at least much easier) to make the chart when the data is in long format. This ends up being not a FAQ, but a frequent answer to different questions, so I hope going over such a task will have wider utility for a lot of charting tasks.

So here is the original chart (originally from the Calculated Risk blog).

And here is Kaiser Fung’s updated version;

Within the article Kaiser states;

One thing you’ll learn quickly from doing this exercise is that this is a task ill-suited for a computer (so-called artificial intelligence)! The human brain together with Excel can do this much faster. I’m not saying you can’t create a custom-made application just for the purpose of creating this chart. That can be done and it would run quickly once it’s done. But I find it surprising how much work it would be to use standard tools like R to do this.

Of course, because anyone savvy with a statistical package would call bs (because it is), Kaiser gets some comments from more experienced R users saying so. Then Kaiser retorts in the comments with a question about how to go about making the charts in R;

Hadley and Dean: I’m sure you’re better with R than most of us so I’d love to hear more. I have two separate issues with this task:

  1. assuming I know exactly the chart to build, and have all the right data elements, it is still much easier to use Excel than any coding language. This is true even if I have to update the chart month after month like CR blog has to. I see this as a challenge to those creating graphing software. (PS. Here, I’m thinking about the original CR version – I don’t think that one can easily make small multiples in Excel.)
  2. I don’t see a straightforward way to proceed in R (or other statistical languages) from grabbing the employment level data from the BLS website, and having the data formatted precisely for the chart I made. Perhaps one of you can give us some pseudo-code to walk through how you might do it. I think it’s easier to think about it than to actually do it.

So here I will show how one would go about making the charts in a statistical package, here SPSS. I actually don’t use the exact data to make the same chart, but there is very similar data at the Fed Bank of Minneapolis website. Here I utilize the table on cumulative decline of Non-Farm employment (seasonally adjusted) months after the NBER defined peak. I re-format the data so it can actually be read into a statistical package, and here is the xls data sheet. Also at that link the zip file contains all the SPSS code needed to reproduce the charts in this blogpost.

So first up, the data from the Fed Bank of Minneapolis website looks approximately like this (in csv format);

MAP,Y1948,Y1953,Y1957,Y1960,Y1969,Y1973,Y1980,Y1981,Y1990,Y2001,Y2007
0,0,0,0,0,0,0,0,0,0,0,0
1,-0.4,-0.1,-0.4,-0.6,-0.1,0.2,0.1,0.0,-0.2,-0.2,0.0
2,-1.1,-0.3,-0.7,-0.8,0.1,0.3,0.2,-0.1,-0.3,-0.2,-0.1
3,-1.5,-0.6,-1.1,-0.9,0.3,0.4,0.1,-0.2,-0.4,-0.3,-0.1
4,-2.1,-1.2,-1.4,-1.0,0.2,0.5,-0.4,-0.5,-0.5,-0.4,-0.3

This isn’t my forte, so when Kaiser says to grab the employment level data from the BLS website I’m unsure what exact data or table he is talking about. Regardless, if the table you grab the data from is in this wide format, it will be easier to make the charts we want if the data is in long format. In long format, instead of every line being in a different column, all the lines are in one column, like so;

MAP, YEAR, cdecline
0, 1948, 0
1, 1948, -0.4
.
72, 1948, 8.2
0, 2007, 0
1, 2007, 0
.

So in SPSS, the steps would be like this to reshape the data (after reading in the data from my prepped xls file);

GET DATA /TYPE=XLS
 /FILE='data\historical_recessions_recoveries_data_03_08_2013.xls'
   /SHEET=name 'NonFarmEmploy'
   /CELLRANGE=full
   /READNAMES=on
   /ASSUMEDSTRWIDTH=32767.
DATASET NAME NonFarmEmploy.

*Reshape wide to long.
VARSTOCASES
/MAKE cdecline from Y1948 to Y2007
/INDEX year (cdecline).
compute year = REPLACE(year,"Y","").

This produces the data so that instead of having separate years in different variables, you have the cumulative decline in one column of the dataset, and another categorical variable identifying the year. Ok, so now we are ready to make a chart that replicates the original from the Calculated Risk blog. Here is the necessary code in SPSS to make a well-formatted chart. Note the compute statement first makes a variable to flag if the year is 2007, which I then map to the aesthetics of red and larger size, so it comes to the foreground of the chart;

compute flag_2007 = (year = "2007").
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=MAP cdecline flag_2007 year
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
 SOURCE: s=userSource(id("graphdataset"))
 DATA: MAP=col(source(s), name("MAP"))
 DATA: cdecline=col(source(s), name("cdecline"))
 DATA: flag_2007=col(source(s), name("flag_2007"), unit.category())
 DATA: year=col(source(s), name("year"), unit.category())
 SCALE: cat(aesthetic(aesthetic.color), map(("0", color.grey), ("1", color.red)))
 SCALE: cat(aesthetic(aesthetic.size), map(("0",size."1px"), ("1",size."3.5px")))
 SCALE: linear(dim(1), min(0), max(72))
 SCALE: linear(dim(2), min(-8), max(18))
 GUIDE: axis(dim(1), label("Months After Peak"), delta(6))
 GUIDE: axis(dim(2), label("Cum. Decline from NBER Peak"), delta(2))
 GUIDE: form.line(position(*,0), size(size."1px"), shape(shape.dash), color(color.black))
 GUIDE: legend(aesthetic(aesthetic.color.interior), null())
 GUIDE: legend(aesthetic(aesthetic.size), null())
 ELEMENT: line(position(MAP*cdecline), color(flag_2007), size(flag_2007), split(year))
END GPL.

Which produces this chart (ok, I cheated a little: I post-hoc added the labels by hand in the SPSS editor, as I did not like the automatic label placement and it is easier to add them by hand than fix the automated labels). Also note this will appear slightly different than the default SPSS charts because I use my own personal chart template.

That is one hell of a chart command call though! You can actually produce most of the lines for this call through SPSS’s GUI dialog, and it just takes some more knowledge of SPSS’s graphics language to adjust the aesthetics of the chart. It would take a book to go through exactly how GPL works and the structure of the grammar, but here is an attempt at a brief run-down.

So typically, you would make separate lines by specifying that every year gets its own color. It is nearly impossible to distinguish between all of the lines that way though (as Kaiser originally states). A simple solution is to only highlight the line we are interested in, 2007, and make the rest of the lines the same color. To do this and still have the lines rendered separately in SPSS’s GPL code, one needs to specify the split modifier within the ELEMENT statement (the equivalent in ggplot2 is the group statement within aes). The things I manually edited differently than the original code generated through the GUI are;

  • Guide line at the zero value, and then making the guideline 1 point wide, black, and with a dashed pattern (GUIDE: form.line)
  • Color and size the 2007 line differently than the rest of the lines (SCALE: cat(aesthetic(aesthetic.color), map(("0", color.grey), ("1", color.red))))
  • Set the upper and lower boundary of the x and y axis (SCALE: linear(dim(2), min(-8), max(18)))
  • set the labels for the x and y axis, and set how often tick marks are generated (GUIDE: axis(dim(2), label("Cum. Decline from NBER Peak"), delta(2)))
  • set the chart so the legend for the mapped aesthetics are not generated, because I manually label them anyway (GUIDE: legend(aesthetic(aesthetic.size), null()))

Technically, both SPSS and ggplot2 could produce the chart from the original wide format, but this ends up being more code in the chart call (and it grows with the number of groups) than simply reshaping the data so the values that make the lines are in one column.
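
To give a flavor of the wide-format alternative, below is a hedged sketch run against the wide NonFarmEmploy data before the VARSTOCASES step. Only two of the eleven year columns are drawn; a full version would need a DATA and ELEMENT line for every year, which is exactly the growth in code referred to above.

*Sketch only - run on the wide data (before the reshape); just two of the eleven years shown.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=MAP Y1948 Y2007
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
 SOURCE: s=userSource(id("graphdataset"))
 DATA: MAP=col(source(s), name("MAP"))
 DATA: Y1948=col(source(s), name("Y1948"))
 DATA: Y2007=col(source(s), name("Y2007"))
 GUIDE: axis(dim(1), label("Months After Peak"), delta(6))
 GUIDE: axis(dim(2), label("Cum. Decline from NBER Peak"), delta(2))
 ELEMENT: line(position(MAP*Y1948), color(color.grey), size(size."1px"))
 ELEMENT: line(position(MAP*Y2007), color(color.red), size(size."3.5px"))
END GPL.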

This chart, IMO, makes the point we want to make easily and succinctly. The recession in 2007 has had a much harsher drop-off in employment and has lasted much longer than any recession since 1948. All of the further small multiples are superfluous unless you really want to drill down into the differences between prior years, which are small in magnitude compared to the current recession. Using thin lines and semi-transparency is the best way to plot many lines (and I wish people running regressions on panel data sets did it more often!)

So although that one graph call is complicated, it takes relatively few lines of code to read in the data and make the chart. In ggplot2 I’m pretty sure it would be fewer lines (Hadley’s version of the grammar is much less verbose than SPSS’s). So, in code golf terms of complexity, we are doing alright. The power in programming though is that it is trivial to reuse the code. So to make a paneled version similar to Kaiser’s remake, we simply need to make the panel groupings, then copy-paste and slightly update the prior code to make a new chart;

compute #yearn = NUMBER(year,F4.0).
if RANGE(#yearn,1940,1959) = 1 decade = 1.
if RANGE(#yearn,1960,1979) = 1 decade = 2.
if RANGE(#yearn,1980,1999) = 1 decade = 3.
if RANGE(#yearn,2000,2019) = 1 decade = 4.
value labels decade
1 '1940s-50s'
2 '1960s-70s'
3 '1980s-90s'
4 '2000s'.

GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=MAP cdecline year decade flag_2007
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
 SOURCE: s=userSource(id("graphdataset"))
 DATA: MAP=col(source(s), name("MAP"))
 DATA: cdecline=col(source(s), name("cdecline"))
 DATA: year=col(source(s), name("year"), unit.category())
 DATA: flag_2007=col(source(s), name("flag_2007"), unit.category())
 DATA: decade=col(source(s), name("decade"), unit.category())
 SCALE: cat(aesthetic(aesthetic.color), map(("0", color.black), ("1", color.red)))
 SCALE: cat(aesthetic(aesthetic.size), map(("0",size."1px"), ("1",size."3.5px")))
 SCALE: linear(dim(1), min(0), max(72))
 SCALE: linear(dim(2), min(-8), max(18))
 GUIDE: axis(dim(1), label("Months After Peak"), delta(6))
 GUIDE: axis(dim(2), label("Cum. Decline from NBER Peak"), delta(2))
 GUIDE: axis(dim(4), opposite())
 GUIDE: form.line(position(*,0), size(size."0.5px"), shape(shape.dash), color(color.lightgrey))
 GUIDE: legend(aesthetic(aesthetic.color), null())
 GUIDE: legend(aesthetic(aesthetic.size), null())
 ELEMENT: line(position(MAP*cdecline*1*decade), color(flag_2007), size(flag_2007), split(year))
END GPL.

Comparing the new paneled chart syntax to the original, it should be easy to see it only took two slight changes: 1) I needed to add the new decade variable and define it in the DATA mapping, and 2) I needed to add it to the ELEMENT call to produce the paneling by row. Again I cheated a little: I post-hoc edited the grid lines out of the image and changed the size of the Y axis labels. If I really wanted to automate these things in SPSS, I would need to rely on a custom template. In ggplot2 this is not necessary, as everything is exposed in the programming language. This is quite short work. Harder is adding labels; I don’t bother here, since to do it nicely (if really needed) I would likely have to do it manually, and it isn’t clear to me why I should care about which prior years are which.

On aesthetics, I would note Kaiser’s original panelled chart lacks distinction between the panels, which makes it easy to confuse Y axis values. I much prefer the default behavior of SPSS here. Also, the default here does not look as nice as the original in terms of the X to Y axis ratio. This is because the panels make the chart’s Y axis shrink (but keep the X axis the same). I suspect my first chart looks nicer because it is closer to the Cleveland ideal of average 45 degree banking in the line slopes.

What about the data manipulation Kaiser suggests is difficult to conduct in a statistical programming language? Well, that is more difficult, but certainly not impossible (and certainly not faster in Excel to anyone who knows how to do it!) Here is how I would go about it in SPSS to identify the start, the trough, and the recovery.

*Small multiple chart in piecewise form, figure out start, min and then recovery.
compute flag = 0.
*Start.
if MAP = 0 flag = 1.
*Min.
sort cases by year cdecline.
do if year <> lag(year) or $casenum = 1.
    compute flag = 2.
    compute decline_MAP = MAP.
else if year = lag(year). 
    compute decline_MAP = lag(decline_MAP).
end if.
*Recovery.
*I need to know if it is after the min to estimate this, some have a recovery before the min otherwise.
sort cases by year MAP.
if lag(cdecline) < 0 and cdecline >= 0 and MAP > decline_MAP flag = 3.
if year = "2007" and MAP = 62 flag = 3.
exe.
*Now only select these cases.
dataset copy reduced.
dataset activate reduced.
select if flag > 0.

So another 16 lines (that aren’t comments) – what is this world of complex statistical programming coming to! If you want a run-down of how I am using lagged values to identify those places, see my recent post on sequential case processing.

Again, we can just copy and paste the chart syntax to produce the same chart with the reduced data. This time it is the exact same code as prior, so no updating needed.

GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=MAP cdecline year decade flag_2007
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
 SOURCE: s=userSource(id("graphdataset"))
 DATA: MAP=col(source(s), name("MAP"))
 DATA: cdecline=col(source(s), name("cdecline"))
 DATA: year=col(source(s), name("year"), unit.category())
 DATA: flag_2007=col(source(s), name("flag_2007"), unit.category())
 DATA: decade=col(source(s), name("decade"), unit.category())
 SCALE: cat(aesthetic(aesthetic.color), map(("0", color.black), ("1", color.red)))
 SCALE: cat(aesthetic(aesthetic.size), map(("0",size."1px"), ("1",size."3.5px")))
 SCALE: linear(dim(1), min(0), max(72))
 SCALE: linear(dim(2), min(-8), max(1))
 GUIDE: axis(dim(1), label("Months After Peak"), delta(6))
 GUIDE: axis(dim(2), label("Cum. Decline from NBER Peak"), delta(2))
 GUIDE: axis(dim(4), opposite())
 GUIDE: form.line(position(*,0), size(size."0.5px"), shape(shape.dash), color(color.lightgrey))
 GUIDE: legend(aesthetic(aesthetic.color.interior), null())
 GUIDE: legend(aesthetic(aesthetic.size), null())
 ELEMENT: line(position(MAP*cdecline*1*decade), color(flag_2007), size(flag_2007), split(year))
END GPL.

Again, I lied a bit earlier: you really only needed 14 lines of code to produce the above chart; I actually spent a few of them saving to a new dataset. I wanted to see if the reduced summary in this dataset was an accurate representation. You can see it is, except for years 73 and 80, which had slight positive recoveries before bottoming out, so one bend in the curve doesn’t really cut it in those instances. Again, the chart only takes some slight editing of the GPL to produce. Here I produce a chart where each year has its own panel, and the panels are wrapped (instead of placed in new rows). This is useful when you have many panels.

compute reduced = 1.
dataset activate NonFarmEmploy.
compute reduced = 0.
add files file = *
/file = 'reduced'.
dataset close reduced.
value labels reduced
0 'Full Series'
1 'Kaisers Reduced Series'.

*for some reason, not letting me format labels for small multiples.
value labels year
'1948' "48"
'1953' "53"
'1957' "57"
'1960' "60"
'1969' "69"
'1973' "73"
'1980' "80"
'1981' "81"
'1990' "90"
'2001' "01"
'2007' "07".

GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=MAP cdecline year flag_2007 reduced
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
 SOURCE: s=userSource(id("graphdataset"))
 DATA: MAP=col(source(s), name("MAP"))
 DATA: cdecline=col(source(s), name("cdecline"))
 DATA: year=col(source(s), name("year"), unit.category())
 DATA: flag_2007=col(source(s), name("flag_2007"), unit.category())
 DATA: reduced=col(source(s), name("reduced"), unit.category())
 COORD: rect(dim(1,2), wrap())
 SCALE: cat(aesthetic(aesthetic.color), map(("0", color.black), ("1", color.red)))
 SCALE: linear(dim(1), min(0), max(72))
 SCALE: linear(dim(2), min(-8), max(18))
 GUIDE: axis(dim(1), label("Months After Peak"), delta(6))
 GUIDE: axis(dim(2), label("Cum. Decline from NBER Peak"), delta(2))
 GUIDE: axis(dim(3), opposite())
 GUIDE: form.line(position(*,0), size(size."0.5px"), shape(shape.dash), color(color.lightgrey))
 GUIDE: legend(aesthetic(aesthetic.color.interior), null())
 GUIDE: legend(aesthetic(aesthetic.size), null())
 ELEMENT: line(position(MAP*cdecline*year), color(reduced))
END GPL.

SPSS was misbehaving and labelling my years with a comma. To prevent that I made value labels with just the trailing two digits. Again I post-hoc edited the size of the Y and X axis labels and manually removed the gridlines.

As opposed to going into a diatribe about the utility of learning a statistical programming language, I will just say that, if you are an analyst who works with data on a regular basis, you are doing yourself a disservice by sticking only to Excel. Not only is the tool limited in large part in the types of graphics and analysis one can conduct, it is very difficult to make tasks routine and reproducible.

Part of my disappointment is that I highly suspect Kaiser has such programming experience; he just hasn’t taken the time to learn a statistical program thoroughly enough. I wouldn’t care, except that Kaiser is in a position of promoting best practices, and I would consider this to be one of them. I don’t deny that learning such programming languages is not easy, but as an analyst who works with data every day, I can tell you it is certainly worth the effort to learn a statistical programming language well.

Using sequential case processing for data management in SPSS

When making calculations, SPSS essentially loops through every record sequentially. So although calculations in syntax are always vectorized (the exception being explicit loops in MATRIX commands), that is, compute y = x - 5. works over the entire x vector without specifying a loop, it really is just looping through all of the records in the data set and calculating the value of y one row at a time.

We can use this to our advantage though in a variety of data management tasks in conjunction with using lagged values in the data matrix. Let’s consider making a counter variable within a set of ID’s. Consider the example dataset below;

data list free /id value.
begin data
1 10
1 11
1 14
1 13
2 12
2 90
2 16
2 14
3 12
3 8
3 17
3 22
end data.
dataset name seq.

To make a counter for the entire dataset, it would be as simple as using the system variable $casenum, but what about a counter variable within each unique id value? Well, we can use SPSS’s sequential case processing and LAG to do that for us. For example (note that this assumes the data are already sorted so the id’s are in sequential order in the dataset);

DO IF id <> LAG(id) or MISSING(LAG(id)) = 1.
    COMPUTE counter_id = 1.
ELSE IF id = LAG(id).
    COMPUTE counter_id = lag(counter_id) + 1.
END IF.

The first if statement evaluates whether the previous id value is the same, and if it is different (or missing, which happens for the first row in the dataset) starts the counter at 1. If the lagged id value is the same, it increases the counter by 1. It should be clear how this can be used to identify duplicate values as well. Although the MATCH FILES command can frequently be more economical, identifying duplicates is pretty easy using sorting and lags. For instance, let’s say in the previous example I wanted only one row per id in the dataset (e.g. to eliminate duplicate id’s), but I wanted to keep only the highest value within each id. This can be done just by sorting the dataset in a particular way (so the row with the highest value is always at the top of each run of sequential id’s).

SORT CASES BY id (A) value (D).
COMPUTE dup = 0.
IF id = lag(id) dup = 1.
SELECT IF dup = 0.

The equivalent expression using match files would be (note the reversal of dup in the two expressions; with match files I want to select the 1 value).

SORT CASES BY id (A) value (D).
MATCH FILES file = *
/first = dup
/by id.
SELECT IF dup = 1.

The match files approach scales better to more variables. If I had two key variables I would need to write IF id1 = lag(id1) and id2 = lag(id2) dup = 1. with the lag approach, but only need to write /by id1 id2. for the match files approach (a two-key sketch follows the AGGREGATE example below). Again, this particular example can be trivially done with another command (AGGREGATE in this instance), but the main difference is that the two approaches above keep all of the variables in the current data set, whereas the variables to keep need to be explicitly written on the AGGREGATE command.

DATASET ACTIVATE seq.
DATASET DECLARE agg_seq.
AGGREGATE
  /OUTFILE='agg_seq'
  /BREAK=id
  /value=MAX(value).
dataset close seq.
dataset activate agg_seq.
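
Returning to the scaling point, here is a hypothetical sketch of the duplicate flagging with two key variables; id1, id2, and value are made-up variable names.

*Hypothetical two-key version of the match files approach.
SORT CASES BY id1 (A) id2 (A) value (D).
MATCH FILES file = *
 /first = dup
 /by id1 id2.
SELECT IF dup = 1.

*The equivalent lag approach needs every key repeated in the IF statement.
SORT CASES BY id1 (A) id2 (A) value (D).
COMPUTE dup = 0.
IF id1 = lag(id1) and id2 = lag(id2) dup = 1.
SELECT IF dup = 0.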

One may think that the sequential case processing is not that helpful, as I’ve shown some alternative ways to do the same thing. But consider a case where you want to propagate down values; this can’t be done directly via match files or aggregate. For instance, I’ve done some text munging of tables exported from PDF files that look approximately like this when read into an SPSS data file (where I use periods to symbolize missing data);

data list free /table_ID (F1.0) row_name col1 col2 (3A5).
begin data
1 . Col1 Col2 
. Row1 4 6
. Row2 8 20
2 . Col1 Col2
. Row1 5 10
. Row2 15 20
end data.
dataset name tables.

Any more useful representation of the data would need to associate particular rows with the table they came from. Here is sequential case processing to the rescue;

if MISSING(table_ID) = 1 table_ID = lag(table_ID).

Very simple fix, but perhaps not intuitive without munging around in SPSS for a while. For another simple application of this, see this NABBLE discussion, where I give an example of propagating down and concatenating multiple string values. Another (more elaborate) example of this can be seen when merging and sorting a database of ranges to a number within the range.
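
As a hedged sketch of that propagate-and-concatenate idea (the tags dataset, the variable names, and the A15 width are all made up, and the data are assumed sorted by id):

*Hypothetical sketch - concatenating all tag values within an id into one string.
data list free / id (F1.0) tag (A1).
begin data
1 A
1 B
1 C
2 D
2 E
end data.
dataset name tags.

string alltags (A15).
DO IF MISSING(lag(id)) or id <> lag(id).
    compute alltags = tag.
ELSE.
    compute alltags = CONCAT(RTRIM(lag(alltags)),",",tag).
END IF.
EXECUTE.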

This is what I would consider an advanced data management tool, and one that I use on a regular basis.

Comparing continuous distributions of unequal size groups in SPSS

The other day I had the task of comparing two distributions of a continuous variable between two groups. One complication that arose when trying to make graphical comparisons was that the groups had unequal sample sizes. I’m making this blog post mainly because many of the options I will show can’t be done in SPSS directly through the graphical user interface (GUI), but understanding a little bit about how the graphic options work in the GPL will help you make the charts you want without having to rely solely on what is available through the GUI.

The basic means I typically start out with are histograms, box-plots, and a few summary statistics. The beginning code is just how I generated some fake data to demonstrate these graphics.

SET TNumbers=Labels ONumbers=Labels OVars=Labels TVars=Labels.
dataset close ALL.
output close ALL.
*making fake cases data.
set seed = 10.
input program.
loop #i = 1 to 5000.
if #i <= 1500 group = 1.
if #i > 1500 group = 2.
end case.
end loop.
end file.
end input program.
dataset name sim.
execute.

*making approximate log normal data.
if group = 1 time_event = (RV.LNORMAL(0.5,0.6))*10.
if group = 2 time_event = (RV.LNORMAL(0.6,0.5))*10.

variable labels time_event 'Time to Event'.
value labels group 
1 'Group 1'
2 'Group 2'.
formats group time_event (F3.0).

variable level group (nominal).

*Good First Stabs are Histograms and Box plots and summary statistics.
GRAPH
  /HISTOGRAM=time_event
  /PANEL ROWVAR=group ROWOP=CROSS.

EXAMINE VARIABLES=time_event BY group
  /PLOT=BOXPLOT
  /STATISTICS=NONE
  /NOTOTAL. 

So this essentially produces a summary statistics table, a paneled histogram, and a box-plot (shown below).

At first blush this is an alright way to visually assess various characteristics of each distribution, and the unequal sizes of the groups are not problematic when comparing the summary statistics or the box-plots. The histogram produced by SPSS, though, shows the frequency of events per bin, and this makes it difficult to compare Group 2 to Group 1, as Group 2 has so many more observations. One way to normalize the distributions is to make a histogram showing the percent of the distribution that falls within each bin, as opposed to the frequency. You can actually do this through the GUI via the Chart Builder, but it is buried within various other options; below is a screen shot showing how to change the histogram from frequency to percents. Also note that you need to change what the base percentage is built off of, by clicking the Set Parameters button (circled in red) and then toggling the denominator choice in the new pop up window to the total for each panel.
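
If you prefer pasted syntax to the GUI, a paneled percent histogram might look something like the sketch below. I believe the base.coordinate(dim(3)) modifier is what makes the denominator the total within each panel, but treat that detail as an assumption and check it against the GPL reference guide.

*Sketch of a paneled histogram with percents within each panel rather than frequencies.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=time_event group MISSING=LISTWISE REPORTMISSING=NO
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: time_event=col(source(s), name("time_event"))
  DATA: group=col(source(s), name("group"), unit.category())
  GUIDE: axis(dim(1), label("Time to Event"))
  GUIDE: axis(dim(2), label("Percent within Panel"))
  GUIDE: axis(dim(3), label("group"), opposite())
  ELEMENT: interval(position(summary.percent.count(bin.rect(time_event*1*group), base.coordinate(dim(3)))))
END GPL.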

Sometimes you can’t get what you want through the chart builder GUI though. For example, I originally wanted to make a population pyramid type chart, and the GUI does not allow you to specify the base percent like that. So I originally made a pyramid chart like this;

And here is what the pasted syntax looks like.

GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=time_event group MISSING=LISTWISE REPORTMISSING=NO
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: time_event=col(source(s), name("time_event"))
  DATA: group=col(source(s), name("group"), unit.category())
  COORD: transpose(mirror(rect(dim(1,2))))
  GUIDE: axis(dim(1), label("Time to Event"))
  GUIDE: axis(dim(1), opposite(), label("Time to Event"))
  GUIDE: axis(dim(2), label("Frequency"))
  GUIDE: axis(dim(3), label("group"), opposite(), gap(0px))
  GUIDE: legend(aesthetic(aesthetic.color), null())
  SCALE: cat(dim(3), include("1", "2"))
  ELEMENT: interval(position(summary.count(bin.rect(time_event*1*group))), color.interior(group))
END GPL.

To get percent bins instead of count bins takes one very simple change to the summary specification on the ELEMENT statement: simply insert summary.percent.count instead of summary.count, which will approximately produce the chart below.
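
For convenience, here is the same pasted GGRAPH block with only that summary function swapped (and the dim(2) axis relabeled accordingly); everything else is unchanged from the block above.

GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=time_event group MISSING=LISTWISE REPORTMISSING=NO
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: time_event=col(source(s), name("time_event"))
  DATA: group=col(source(s), name("group"), unit.category())
  COORD: transpose(mirror(rect(dim(1,2))))
  GUIDE: axis(dim(1), label("Time to Event"))
  GUIDE: axis(dim(1), opposite(), label("Time to Event"))
  GUIDE: axis(dim(2), label("Percent"))
  GUIDE: axis(dim(3), label("group"), opposite(), gap(0px))
  GUIDE: legend(aesthetic(aesthetic.color), null())
  SCALE: cat(dim(3), include("1", "2"))
  ELEMENT: interval(position(summary.percent.count(bin.rect(time_event*1*group))), color.interior(group))
END GPL.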

You can actually post-hoc edit the traditional histogram to make a population pyramid (by mirroring the panels), but examining the GPL produced for the above chart gives you a glimpse of the possibilities for producing a variety of charts in SPSS.

Another frequent way to assess continuous distributions like those displayed so far is by estimating kernel density smoothers through the distribution (sometimes referred to by the acronym KDE, where the e is for estimate). Sometimes this is preferable because our perception of the distribution can be too highly impacted by the histogram bins. Kernel density smoothers aren’t available through the GUI at all though (as far as I’m aware), so you would only have known the potential existed if you looked at the examples in the GPL reference guide that comes with the software. Below is an example (including code).

GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=time_event group MISSING=LISTWISE REPORTMISSING=NO
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: time_event=col(source(s), name("time_event"))
  DATA: group=col(source(s), name("group"), unit.category())
  GUIDE: axis(dim(1), label("Time to Event"))
  GUIDE: axis(dim(2), label("Kernel Density Estimate"))
  GUIDE: legend(aesthetic(aesthetic.color.interior))
  SCALE: cat(aesthetic(aesthetic.color.interior), include("1", "2"))
  ELEMENT: line(position(density.kernel.epanechnikov(time_event*group)), color(group))
END GPL.

Although the smoothing is useful, again we have a problem with the unequal number of cases in the distributions. To solve this, I weighted cases inversely proportional to the number of observations in each group (i.e. the weight for group 1 is 1/1500, and the weight for group 2 is 1/3500 in this example). This should make the area underneath each line sum to 1, and so to get the estimate back on the original frequency scale you would simply multiply the marginal density estimate by the total in the corresponding group. So for instance, the marginal density for group 2 at a time to event value of 10 is 0.05, so the estimated frequency given 3500 cases is 0.05 * 3500 = 175. To get back on a percentage scale you would just multiply by 100.

AGGREGATE
  /OUTFILE=* MODE=ADDVARIABLES
  /BREAK=group
  /cases=N.
compute myweight = 1/cases.

GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=time_event group myweight MISSING=LISTWISE REPORTMISSING=NO
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"), weight(weightedVar))
  DATA: weightedVar=col(source(s), name("myweight"))
  DATA: time_event=col(source(s), name("time_event"))
  DATA: group=col(source(s), name("group"), unit.category())
  GUIDE: axis(dim(1), label("Time to Event"))
  GUIDE: axis(dim(2), label("Weighted Kernel Density Estimate"))
  GUIDE: legend(aesthetic(aesthetic.color.interior))
  GUIDE: text.footnote(label("Density is weighted inverse to the proportion of cases within each group. The number of cases in group 1 equals 1,500, and the number of cases in group 2 equals 3,500."))
  SCALE: cat(aesthetic(aesthetic.color.interior), include("1", "2"))
  SCALE: linear(dim(2))
  ELEMENT: line(position(density.kernel.epanechnikov(time_event*group)), color(group))
END GPL.

One of the critiques of this though is that choosing a kernel and bandwidth is ad-hoc (I just used the default kernel and bandwidth in SPSS here, and it differed in unexpected ways between the frequency counts and the weighted estimates, which is undesirable). Also you can see that some of the density is smoothed over illogical values in this example (values below 0). Other potential plots are the cumulative distribution and QQ-plots comparing the quantiles of each distribution to each other. Again, these are difficult to impossible to obtain through the GUI. Here is the closest I could come to getting a cumulative distribution by groups through the GUI.

GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=time_event COUNT()[name="COUNT"] group 
    MISSING=LISTWISE REPORTMISSING=NO
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: time_event=col(source(s), name("time_event"))
  DATA: COUNT=col(source(s), name("COUNT"))
  DATA: group=col(source(s), name("group"), unit.category())
  GUIDE: axis(dim(1), label("Time to Event"))
  GUIDE: axis(dim(2), label("Cumulative Percent of Total"))
  GUIDE: legend(aesthetic(aesthetic.color.interior))
  SCALE: cat(aesthetic(aesthetic.color.interior), include("1", "2"))
  ELEMENT: line(position(summary.percent.cumulative(time_event*COUNT, base.all(acrossPanels()))), 
    color.interior(group), missing.wings())
END GPL.

This is kind of helpful, but not really what I want. I wasn’t quite sure how to change the summary statistic functions in the ELEMENT statement to calculate percent within groups (I assume it is possible, but I just don’t know how), so I ended up just making the actual data to include in the plot. Example syntax and plot below.

sort cases by group time_event.
compute id = $casenum.
AGGREGATE
  /OUTFILE=* MODE=ADDVARIABLES
  /BREAK=group
  /id_min=MIN(id)
  /id_max=MAX(id).
compute cum_prop = ((id +1) - id_min)/(id_max - (id_min - 1)).


*Here is the cumulative proportion I want.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=time_event cum_prop group MISSING=LISTWISE 
    REPORTMISSING=NO
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: time_event=col(source(s), name("time_event"))
  DATA: cum_prop=col(source(s), name("cum_prop"))
  DATA: group=col(source(s), name("group"), unit.category())
  GUIDE: axis(dim(1), label("Time to Event"))
  GUIDE: axis(dim(2), label("Cumulative Percent within Groups"))
  GUIDE: legend(aesthetic(aesthetic.color.interior))
  SCALE: cat(aesthetic(aesthetic.color.interior), include("1", "2"))
  ELEMENT: line(position(time_event*cum_prop), color.interior(group), missing.wings())
END GPL.

These cumulative plots aren’t as problematic with respect to bins as the histograms or KDE estimates, and in fact many interesting questions are much more easily addressed with the cumulative plots. For instance, if I wanted to know the proportion of events that happen within 10 days (or its complement, the proportion of events that have not yet occurred by 10 days), this is an easy task with the cumulative plots, and would be at best extremely difficult to determine with the histogram or density estimates. The cumulative plot also gives a graphical comparison of the distributions (although perhaps not as intuitive as the histogram or KDE estimates). For instance, it is easy to see the location of group 2 is slightly shifted to the right.
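
If you want the exact numbers rather than reading them off the plot, a quick tabular check is easy; the within10 flag below is a made-up variable name, using the simulated data from above.

*Quick check of the proportion of events within 10 days by group.
dataset activate sim.
compute within10 = (time_event <= 10).
means tables = within10 by group
  /cells = mean count.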

The last plot I present is a QQ-plot. These are typically presented as plotting an empirical distribution against a theoretical distribution, but you can plot two empirical distributions against each other. Again, you can’t quite get the QQ-plot of interest through the regular GUI, and you have to do some data manipulation to be able to construct the elements of the graph. You can do QQ-plots against a theoretical distribution with the PPLOT command, so you could make separate QQ plots for each subgroup, but this is less than ideal. Below I paste an example of my constructed QQ-plot, along with syntax showing how to use the PPLOT command for separate sub-groups (using SPLIT FILE) and getting the quantiles of interest using the RANK command.

sort cases by group time_event.
split file by group.
PPLOT
  /VARIABLES=time_event
  /NOLOG
  /NOSTANDARDIZE
  /TYPE=Q-Q
  /FRACTION=BLOM
  /TIES=MEAN
  /DIST=LNORMAL.
split file off.

*Not really what I want - I want Q-Q plot of one group versus the other group.
RANK VARIABLES=time_event (A) BY group
  /NTILES(99)
  /PRINT=NO
  /TIES=MEAN.

*Now aggregating to new dataset.
DATASET DECLARE quantiles.
AGGREGATE
  /OUTFILE='quantiles'
  /BREAK=group Ntime_ev 
  /time_event=MAX(time_event).
dataset activate quantiles.

sort cases by Ntime_ev group.
casestovars
/id = Ntime_ev
/index = group.

DATASET ACTIVATE quantiles.
* Chart Builder.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=time_event.1[name="time_event_1"] 
    time_event.2[name="time_event_2"] MISSING=LISTWISE REPORTMISSING=NO
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: time_event_1=col(source(s), name("time_event_1"))
  DATA: time_event_2=col(source(s), name("time_event_2"))
  GUIDE: axis(dim(1), label("Quantiles Time to Event Group 1"))
  GUIDE: axis(dim(2), label("Quantiles Time to Event Group 2"))
  ELEMENT: point(position(time_event_1*time_event_2))
  ELEMENT: line(position(time_event_1*time_event_1))
END GPL.

Although I started out with a simple question, it takes a fair bit of knowledge about both graphically comparing distributions and data management (i.e. how to shape your data) to be able to make all of these types of charts in SPSS. I intentionally made the reference distributions very similar, and if you just stuck with the typical histogram the slight differences in location and scale between the two distributions would not be as evident as they are with the kernel density, the cumulative distribution, or the QQ-plots.