Running files locally in SPSS

Say I made a little Python script for a friend to scrape data from a website whenever they wanted updates. I write my Python script, say scrape.py, and a run_scrape.bat file for my friend on their Windows machine (or run_scrape.sh on Mac/Unix). The bat file contains a single command:

python scrape.py

I tell my friend to save those two files in whatever folder they want; they just need to double click the bat file and it will save the scraped data into the same folder. Here the bat file is run locally: double clicking it sets the current directory to wherever the bat file is located.

This is how the majority of code is packaged: you can clone from GitHub or email a zip file, and it will just work no matter where the local user saves those scripts. My friend does need to have their Python environment set up correctly, but for most of the stuff I do I can say download Anaconda and click yes to adding Python to the path, and they are golden.
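If you would rather not depend on the working directory at all, the script can also work out its own folder. Here is a minimal sketch of that idea; the helper output_path and the file name scraped_data.csv are made up for illustration and are not part of the original scrape.py:

```python
from pathlib import Path


def output_path(script_file: str, filename: str = "scraped_data.csv") -> Path:
    """Return a path for `filename` in the same folder as `script_file`.

    Both the function name and the default file name are hypothetical.
    """
    # resolve() turns a relative script path into an absolute one,
    # so the result does not depend on the caller's working directory
    return Path(script_file).resolve().parent / filename


# Inside scrape.py you would call it with the script's own location:
#     out = output_path(__file__)
```

With this in place the scraped data lands next to scrape.py even if someone runs it from a different folder on the command line.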

SPSS makes things more painful. Say I added the SPSS install folder to the PATH environment variable on my Windows machine, and I run an SPSS production job from the command prompt:

cd "C:\Users\Andrew"
spss "print_dir.spj" -production silent

And say all the spj file does is call a syntax file, show.sps, whose only command is SHOW DIR. This still prints out wherever SPSS is installed as the current working directory inside of the SPSS session, on my machine currently C:\Program Files\IBM\SPSS Statistics. So SPSS takes over the location of the current directory. Also we can open up the spj file (it is just a plain text xml file). Here is what a current spj file looks like for me (note it is all on one line as well!):

That file also has several hard coded file locations. So to get the same behavior as python scrape.py earlier, we need to dynamically set the paths in the production job as well, not just alter the command line scripts. This can be done with a little command line magic in Windows, dynamically filling in the right text in the spj file. So in a bat file, you can do something like:

@echo on
set "base=%cd%"
:: code to define SPJ (SPSS production file)
echo ^<?xml version=^"1.0^" encoding=^"UTF-8^" standalone=^"no^"?^>^<job xmlns=^"http://www.ibm.com/software/analytics/spss/xml/production^" codepageSyntaxFiles=^"false^" print=^"false^" syntaxErrorHandling=^"continue^" syntaxFormat=^"interactive^" unicode=^"true^" xmlns:xsi=^"http://www.w3.org/2001/XMLSchema-instance^" xsi:schemaLocation=^"http://www.ibm.com/software/analytics/spss/xml/production http://www.ibm.com/software/analytics/spss/xml/production/production-1.4.xsd^"^>^<locale charset=^"UTF-8^" country=^"US^" language=^"en^"/^>^<output imageFormat=^"jpg^" imageSize=^"100^" outputFormat=^"text-codepage^" outputPath=^"%base%\job_output.txt^" tableColumnAutofit=^"true^" tableColumnBorder=^"^|^" tableColumnSeparator=^"space^" tableRowBorder=^"-^"/^>^<syntax syntaxPath=^"%base%\show.sps^"/^>^<symbol name=^"setdir^" quote=^"true^"/^>^</job^> > transfer_job.spj
"C:\Program Files\IBM\SPSS Statistics\stats.exe" "%base%\transfer_job.spj" -production silent -symbol @setdir "%base%"

It would be easier to use sed to find/replace the text for the spj file instead of the super long one-liner on echo, but I don’t know if Windows always has sed installed. Also note the escape characters: every quote, angle bracket, and pipe needs a ^ in front of it. (It is crazy how Windows parses this single long line; Microsoft documents a limit of 8191 characters for a cmd.exe command line, so this one still fits.)
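If Python is already on the machine, another option is to generate the spj text there instead of with echo or sed. The sketch below assumes the same job settings as the bat file above. The names SPJ_TEMPLATE and build_spj are hypothetical, and it should be read as a sketch rather than a tested recipe for every SPSS version:

```python
from pathlib import PureWindowsPath
from xml.sax.saxutils import quoteattr

# Same one-line job definition as the echo command in the bat file,
# with placeholders where the folder-specific paths go.
SPJ_TEMPLATE = (
    '<?xml version="1.0" encoding="UTF-8" standalone="no"?>'
    '<job xmlns="http://www.ibm.com/software/analytics/spss/xml/production" '
    'codepageSyntaxFiles="false" print="false" syntaxErrorHandling="continue" '
    'syntaxFormat="interactive" unicode="true" '
    'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" '
    'xsi:schemaLocation="http://www.ibm.com/software/analytics/spss/xml/production '
    'http://www.ibm.com/software/analytics/spss/xml/production/production-1.4.xsd">'
    '<locale charset="UTF-8" country="US" language="en"/>'
    '<output imageFormat="jpg" imageSize="100" outputFormat="text-codepage" '
    'outputPath={output_path} tableColumnAutofit="true" tableColumnBorder="|" '
    'tableColumnSeparator="space" tableRowBorder="-"/>'
    '<syntax syntaxPath={syntax_path}/>'
    '<symbol name="setdir" quote="true"/>'
    '</job>'
)


def build_spj(base: str) -> str:
    """Return the production job XML with paths pointing into `base`."""
    base_dir = PureWindowsPath(base)
    # quoteattr adds the surrounding quotes and escapes characters such as &
    return SPJ_TEMPLATE.format(
        output_path=quoteattr(str(base_dir / "job_output.txt")),
        syntax_path=quoteattr(str(base_dir / "show.sps")),
    )


# Example use from the folder holding the files:
#     import os
#     with open("transfer_job.spj", "w", encoding="utf-8") as f:
#         f.write(build_spj(os.getcwd()))
```

The upside compared to the echo line is that no ^ escaping is needed, and a folder name containing an ampersand is escaped for the XML automatically.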

You can see in the call to the production job, I pass a parameter, @setdir, and expand it out in the shell using %base%. In show.sps, I now have this line:

CD @setdir.

And now SPSS has set the current directory to wherever you have the .bat file and .sps syntax file saved. So now everything is dynamic, and runs wherever you have all the files saved. The only thing that is not dynamic in this setup is the location of the SPSS executable, stats.exe. So if you are sharing SPSS code like this, you will need to either tell your friend to add C:\Program Files\IBM\SPSS Statistics to their PATH environment variable (and call plain stats.exe in the .bat file), or edit the .bat file to point at the correct path. Otherwise this runs dynamically in the local folder with all the materials.
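One way to soften the hard coded executable path is to look on the PATH first and only then fall back to a default. Here is a rough sketch in Python, assuming the executable is named stats.exe as above; find_stats, production_command, and DEFAULT_STATS are hypothetical names:

```python
import shutil
from pathlib import PureWindowsPath
from typing import List, Optional

# Fallback taken from the bat file above; other installs may differ.
DEFAULT_STATS = r"C:\Program Files\IBM\SPSS Statistics\stats.exe"


def find_stats(explicit: Optional[str] = None) -> str:
    """Pick the SPSS executable: explicit path, then PATH lookup, then default."""
    return explicit or shutil.which("stats") or DEFAULT_STATS


def production_command(base: str, stats_exe: Optional[str] = None) -> List[str]:
    """Build the same argument list the bat file passes to stats.exe."""
    spj = str(PureWindowsPath(base) / "transfer_job.spj")
    return [
        find_stats(stats_exe),
        spj,
        "-production", "silent",
        "-symbol", "@setdir", base,
    ]


# To actually launch the job on the Windows machine:
#     import os, subprocess
#     subprocess.run(production_command(os.getcwd()), check=True)
```

Passing the arguments as a list lets subprocess handle the quoting of folders with spaces, which is what the quotes around %base% do in the bat file.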

Git excluding specific files when merging branches

The other day at work I had a mildly annoying problem: merging only selected files between a test and production branch in a GitHub repository. My particular use case was I had a test branch that needed to only interact with a test database, and the master branch needed to talk to the prod database. So I had particular config files with essentially different SQLAlchemy connection strings, but nothing else. Note I did not want these files ignored, just not merged between branches. (If I edit them I will need to make sure to edit both master & test branches in the end.)

I often use the GitHub desktop GUI to commit changes (when working on my local laptop). You can use the GUI to make a pull request, but when accepting the pull request in the browser I think it is all or nothing. I also need an entirely command line solution anyway, for when I am working on a remote headless machine with no GUI. So here are my notes on how I solved the issue.

So just for illustration I added a test branch to my Blog_Code repository, and then some junk files just to illustrate. Via the git bash shell, if you navigate to your repository and do git diff master test --name-only, it shows you the different files in the two branches:

So you can see that I have 5 different files in total. Two config files and three different text files. If we do git diff master test -- special_config1, we can see the more specific differences between those two config files:

So you can see that the master branch version (in red) and the test branch version (in green) have just a minor difference. (With git diff master test, removed lines come from the first branch named and added lines from the second.) But in the end I want to keep those two files different between the branches, and not merge this config file (along with the other config file).

So here is the particular logic I put together, piping a bunch of commands together:

git switch master
git diff master test --name-only |
grep -v 'special_config1' |
grep -v 'Python/special_config2' |
sed 's/.*/"&"/' |
xargs git checkout test

The first line, git switch, is pretty self explanatory: I switch to the master branch (I will typically be doing work on test). Second, I grab all the files that are different using git diff branch1 branch2, and only print out the file names. On the third and fourth lines I use grep to drop my specific config files from that resulting list of files. You could also do grep -v 'file1.txt|file2.txt' |, but in this case this was giving me fits. The likely culprit is that plain grep uses basic regular expressions, where an unescaped | is a literal character, so grep -E -v 'file1.txt|file2.txt' (or one -e pattern per file) should behave as intended; the forward slash does not need escaping.

On the fifth line I use sed to wrap each file name in quotes (a file name with a space in it will cause problems otherwise).

On the sixth line I use xargs to call git checkout against the test branch, passing in all of my files (minus my two config files). This is advice taken from this blog, just a slicker way to grab all of the files that are different minus a few specific config files. So instead of typing git checkout test file1.txt file2.txt etc. and typing the files by hand, I just grab all the files that are different and check them all out together.
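The same six steps can also be sketched in Python, which sidesteps the quoting step because file names are passed as a list. This is an illustration under a few assumptions, not a drop-in replacement: files_to_checkout and merge_except are hypothetical names, the exclusion is an exact path match rather than grep's substring match, and (like the shell version) it will fail on a file that was deleted on the test branch.

```python
import subprocess
from typing import Iterable, List


def files_to_checkout(diff_output: str, excluded: Iterable[str]) -> List[str]:
    """Split NUL-separated `git diff --name-only -z` output and drop excluded paths."""
    skip = set(excluded)
    return [name for name in diff_output.split("\0") if name and name not in skip]


def merge_except(source: str, target: str, excluded: Iterable[str]) -> List[str]:
    """Check out every file that differs on `source` into `target`, minus `excluded`."""
    subprocess.run(["git", "switch", target], check=True)
    # -z separates names with NUL bytes, so spaces in file names are safe
    diff = subprocess.run(
        ["git", "diff", "--name-only", "-z", target, source],
        check=True, capture_output=True, text=True,
    ).stdout
    files = files_to_checkout(diff, excluded)
    if files:
        # "--" tells git that everything after it is a path, not a branch
        subprocess.run(["git", "checkout", source, "--", *files], check=True)
    return files


# Example matching the post:
#     merge_except("test", "master", ["special_config1", "Python/special_config2"])
```

The commit, push, and switch back to test would still follow as usual.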

Then once that is done, it is the usual steps to commit the updated files. And then at the end I switch the active branch back to test.

git commit -m 'Example only merging select files'
git push
git switch test

Maybe one of these days I will ditch the GUI entirely. But for now I will just have to get by with my limited command line fu compared to the real computer programmers I work with more regularly.