I’m really grateful that you’ve taken the time to think about this for me.
Your concerns are completely reasonable.
That said, I do have some ideas of my own regarding the two points you raised.
1. Given my objective, I may not need an extremely large dataset
It is true that if the goal were to identify a tiny number of elite horses from the entire population, an enormous amount of data would be necessary.
However, in Japan, horse racing is centrally managed by a national organization, the JRA, and the class system is very clearly defined. Specifically, the classes are:
Maiden
1-win class
2-win class
3-win class
Open class
My goal is not to pinpoint future Grade race winners within the Open class.
Rather, it would be sufficient to statistically demonstrate that as the model’s score increases, the proportion of horses reaching higher classes also increases.
For that kind of objective, I believe the required amount of data may not be as large as one might initially assume.
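To make that check concrete, here is a minimal sketch of what I have in mind, not a finished analysis. It sorts horses by model score, splits them into equal-count bins, and reports the share in each bin that reached a chosen class or higher. The function name, the `threshold` and `n_bins` parameters, and the toy data are all my own placeholders.

```python
# JRA class ladder from the list above, encoded lowest to highest.
CLASS_RANK = {"Maiden": 0, "1-win": 1, "2-win": 2, "3-win": 3, "Open": 4}


def proportion_by_score_bin(scores, classes, threshold="2-win", n_bins=4):
    """Return, for each score bin (lowest scores first), the share of
    horses whose highest class is `threshold` or above."""
    if len(scores) != len(classes):
        raise ValueError("scores and classes must have the same length")
    if n_bins < 1 or len(scores) < n_bins:
        raise ValueError("need at least one horse per bin")
    cutoff = CLASS_RANK[threshold]
    # Indices of horses ordered from lowest to highest model score.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    proportions = []
    for b in range(n_bins):
        lo = b * len(order) // n_bins
        hi = (b + 1) * len(order) // n_bins
        members = order[lo:hi]
        reached = sum(CLASS_RANK[classes[i]] >= cutoff for i in members)
        proportions.append(reached / len(members))
    return proportions


# Made-up toy data, only to show the call; not real results.
toy_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
toy_classes = ["Maiden", "Maiden", "1-win", "Maiden",
               "2-win", "1-win", "3-win", "Open"]
print(proportion_by_score_bin(toy_scores, toy_classes))
```

If the proportions rise from the lowest bin to the highest, that is the pattern I am hoping to see. A formal trend test (for example Cochran-Armitage, or an ordinal regression) would still be needed on top of this.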
2. Japan has a substantial number of high-quality walk videos
In Japan, there is a well-developed system called hitokuchi banushi, where ownership of a racehorse is divided into 40 to 500 shares.
There are around 15 organizations that operate this kind of shared ownership system. Among them, Sunday Racing, Silk Horse Club, and Carrot Club are all closely connected to Northern Farm, and Shadai Thoroughbred Club is also operated by relatives of the same group. These clubs provide walk videos of their horses at the time of recruitment.
Because these clubs are run by closely related organizations, their videos are generally filmed from similar angles, although they are not perfectly standardized. Unfortunately, the camera distance is not always consistent.
Still, the number of videos is about 300 per year. If I collect data over 10 years, that would amount to around 3,000 horses, which I believe should be enough for the objective described in point 1.
Even better, these horses are provided with height data along with the videos. I expect that using this height data could help address the scale-difference problem to some extent.
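A rough sketch of how I imagine using the height data: take a frame where the horse is side-on, measure the pixel distance from the withers to the ground, and use the published height to get a centimeters-per-pixel scale. This assumes the published height is the height at the withers and that the withers and ground points can be located in the frame. The function names and the numbers are hypothetical.

```python
def cm_per_pixel(height_cm, withers_y_px, ground_y_px):
    """Scale factor from the published height and two pixel rows
    (withers and ground) taken from the same side-on frame."""
    pixel_height = abs(ground_y_px - withers_y_px)
    if pixel_height == 0:
        raise ValueError("withers and ground must be on different rows")
    return height_cm / pixel_height


def pixels_to_cm(length_px, scale):
    """Convert a pixel length measured in that frame to centimeters."""
    return length_px * scale


# Hypothetical numbers for illustration only.
scale = cm_per_pixel(height_cm=155.0, withers_y_px=120, ground_y_px=430)
stride_cm = pixels_to_cm(300, scale)
print(scale, stride_cm)
```

The scale only holds for frames where the horse is at about the same distance from the camera as in the frame used for calibration, so it reduces the scale problem rather than removing it.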
Not a problem. Glad to help.
Also, if your goal is to separate the foals that turn out to be elite from the non-elite, I have two other concerns:
- You need a lot of data to build a predictive model. The positive case (elite) is rare (~3%), so you’re better off modeling a binary outcome with evenly sized groups (elite vs really slow). You need ~500 elite horses, which means ~16,000 videos to choose from, of which you use 1,000 in your model (with 10-fold cross-validation). Otherwise you’re underpowered.
- In all these videos the distance from the camera to the horse is unknown, so any measurements are not to scale. Even if you augment with monocular depth estimation like Depth Pro or Depth Anything V3, you still won’t get accurate ‘to scale’ measurements of the horses. This is a fatal flaw of using online videos, because the size/body length of the horse correlates directly with stride length and cadence on the racetrack. Not knowing the actual size of the horses in these videos is a large source of prediction error.
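The arithmetic behind the first concern, as a quick sketch. The 3% rate, the 500 elite horses and the 10 folds are the rough figures from above, and the function names are just for illustration.

```python
import math


def videos_needed(n_elite, elite_rate):
    """Size of the video pool needed to expect `n_elite` elite horses."""
    return math.ceil(n_elite / elite_rate)


def balanced_model_size(n_elite):
    """Elite horses plus an equal number of really slow ones."""
    return 2 * n_elite


def held_out_per_fold(n_samples, k_folds):
    """Samples held out in each fold of k-fold cross-validation."""
    return n_samples // k_folds


pool = videos_needed(500, 0.03)                 # roughly the ~16,000 quoted
model_n = balanced_model_size(500)              # 500 elite + 500 slow
fold_n = held_out_per_fold(model_n, 10)         # held out per fold
print(pool, model_n, fold_n)
```

With fewer elite horses the pool shrinks in proportion, but so does the number of positives in each fold.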
Thank you so much.
There is nothing more valuable than receiving advice from someone who has already gone through this process.
I wasn’t sure how many videos I should use, what kinds of videos would be appropriate, or how many frames I should label for training, so your advice was extremely helpful.
My advice:
- Get a representative set of videos across different conditions, but keep the counts roughly even. For example, get left-to-right walk videos of 5 bay, 5 black, 5 chestnut and 5 grey foals (20 total). It will generalize better.
- Do the k-means image extraction across each of the 20 videos. 30 frames per video will be enough (600 frames total).
- Label them consistently and then train for a decent number of epochs.
- Review the ones with large errors. You will usually find it’s where you’ve placed a marker incorrectly, or the landmark isn’t clear anyway and should be removed. Retrain.
- Even with this you’ll find that on all videos there is a sliding window where the landmarks are placed more accurately. The frames just before and just after the camera is perpendicular to the horse are usually best. You can usually get a full stride cycle of good data to use in downstream analysis.
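One way to find that window automatically, as a minimal sketch. It assumes your tracker gives you one confidence value per frame (for example the mean landmark likelihood), which may not match your setup. The function name and the toy values are made up.

```python
def best_window(frame_confidence, width):
    """Return (start, end) of the contiguous run of `width` frames with
    the highest total confidence; `end` is exclusive."""
    n = len(frame_confidence)
    if not 0 < width <= n:
        raise ValueError("width must be between 1 and the number of frames")
    # Sum of the first window, then slide one frame at a time.
    window_sum = sum(frame_confidence[:width])
    best_start, best_sum = 0, window_sum
    for start in range(1, n - width + 1):
        window_sum += frame_confidence[start + width - 1]
        window_sum -= frame_confidence[start - 1]
        if window_sum > best_sum:
            best_start, best_sum = start, window_sum
    return best_start, best_start + width


# Made-up per-frame confidences: low at the edges, high in the middle.
toy_conf = [0.4, 0.5, 0.9, 0.95, 0.9, 0.6, 0.3]
start, end = best_window(toy_conf, width=3)
print(start, end)
```

Set `width` to about the number of frames in one stride cycle at your frame rate, then check by eye that the chosen frames are the ones around the perpendicular view.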