Won the algae bloom prediction competition

I recently was one of the winners in the DataDriven competition predicting algaeblooms, username apwheele. So I have written in the past about alternative competition sites. To decide overall whether I will compete in a competition, it is:

  • number of competitors
  • overall prizes
  • my self-assessed skill level

And then whether I had sufficient time to devote to the competition. So for an alternative estimate example, the Astral Codex blog has a book review contest. I can see the prior years competition had 133 competitors, so given prizes of 4k in 2023, 4k/133 ~ $30 in expected return. If you have a burning desire to review a book go for it, but I don’t think I have some secret that gives me a large enough competitive edge over other readers of Scott’s blog to make this competition worth my time.

For the algae bloom competition, DataDriven I saw had previously around 1k competitors per competition, and the prizes for 1/2/3 for this competition were 12k/9k/6k. They have a few other conciliation prizes, so total of $30k. Expected return is basically the same as the Codex blog, $30, but I figured I had a better competitive edge. (Although knew it would be more work than writing a book review.)

I am typically hesitant to do Kaggle competitions, some have over 100k competitors (I feel at that point you are close to the “monkeys typing on a keyboard will produce Shakespeare eventually” stage). I debated recently on doing the Kaggle competition on Neutrino’s in Ice, but due to steeper competition and less time (prizes are similar to the Algae Bloom one) I will not be competing.

In terms of self-assessed skill, you may be thinking “Andy, you have no related skills towards remote-sensing/biology”, which is true. In this specific competition, one of the things that prompted me to compete was the use of ancillary data, so it is not just satellite imagery, you can fold in more data. This tends to favor tabular/tree based models, which I have more experience with. Additionally the example getting started blog post by DataDriven made to me a key critical error – they used a multinomial model (predicting categories) instead of a regression model predicting a continuous outcome. An ordinal model may be defensible, but with the error metric being root mean squared error, the way they used multinomial did not make sense. E.g. if multinomial predicted severity 1 at 51%, and severity 5 at 49%, your prediction should be 3, not 1. Since the majority of people competing in these competitions are clearly just copying code and not understanding the stat models under the hood, I knew this would send a decent number of competitors down a wrong track.

Another aspect I look for in modeling competitions is weird loss functions, this is one of the reasons me and Gio won quite a bit in the NIJ recidivism competition. Essentially things that you need to write custom code for (or think about the math a little under the hood), I suspect give me a decent edge based on quite a few competitors. Things where just your ability to fit and hypertune a deep-learning model based on sensor data is the difference maker I am not going to compete in.

So that was my thinking at the start of the competition. An aspect I did not anticipate though, it was quite a chore to download the data. Unlike many competitions in which the providers gift you data, DataDriven had you download your own satellite data. This was quite alot of work to write code to do this, it wouldn’t surprise me if I spent 40 hours writing code for the competition overall. Also I ended up signing up for the planetary computer resource (I would get rate limited downloading data otherwise). I bet some individuals do not know they should just cache the feature data, not rerun it everytime – it takes me over 2 days of loops in python to download all the data.

So in the end, the competition had listed over 1300 competitors, but many were not serious competitors. Maybe halfway through the competition (which was around Christmas, I suspect that also reduced competition), there was only 300 or 400 competitors signed up. If you signed up in the last few weeks, I knew you would probably not have enough time to write code/download-data/tinker with models. In the end people who submitted 2+ times and beat the benchmark DataDriven results were under 100 people. So that makes me think maybe I should consider Kaggle competitions more seriously, if less than 10% of people competing even give a serious attempt.

So in terms of competitions, I have participated in 4 overall:

Gio was the one who forwarded the recidivism competition. I should have competed in the prior hot spots NIJ competition, so I hopped on that opportunity and we did a good job.

I spent a decent chunk of time on the Maternal Morbidity challenge, with prizes of 50k for multiple teams. So although I did not win I thought that was worth a shot (I am more hesitant for soft assessment competitions though). For the toxic rating competition I failed fast – I spent two days working on a few ideas/models. I was not that high in the leaderboard based on my ideas (pulling in data from alternative sources and building a few different types of models and ensembling them together), so I stopped after a short period. I would have done the same for the Algae competition, but just using a simple regression model (without even including the satellite data), I was #1 on the leaderboard. So I spent more time downloading the data and tinkering over time, which did keep me in the #1 spot in the public leaderboard in the end.

Leave a comment

1 Comment

  1. Ask me anything: Advice for learning statistics? | Andrew Wheeler

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: