Let’s perform a notional experiment:
Let’s have several people pore through satellite imagery, representing thousands of square miles of real world ocean or land, to find a specific boat, or a specific truck, or a specific instance of any kind of object… based on a few previous satellite imagery shots of that boat, or truck, or thing.
And then let’s compare that to an object detection model attempting to identify that same object, trained on those few shots of the object (and, to help the detection model, we also train it on thousands or tens of thousands of examples of the same class of object, e.g. tanker ship or semi truck, to give it some of the general object knowledge advantage of the human).
What would the results look like?
On the human side:
You could have a well-organized group of people divide up the search imagery into sections, scan the imagery from top left to bottom right, note the filename and coordinates of potential matches, and have a second set of eyes validate whether that was indeed the specific object being searched for.
This could take hours, but probably days if, for example, the imagery covers the entire Atlantic and is analyzed by a team of four. We could scale this up at the cost of additional human labor and management, and streamlined imagery interface tools could lower the time it takes. We could improve the accuracy and the chances that we find the correct ship by adding redundant human checks and reviews, but that would increase the time required to find our object.
On the machine side:
Given proper data pre-processing, the object detection model could run inference across the entirety of the imagery as fast as you were willing to pay to parallelize the compute in an elastic cluster. Let’s say it could run across the imagery in hours.
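As a rough sketch of what that pre-processing and fan-out might look like, the snippet below tiles one large scene and maps a placeholder detect() function across the tiles in parallel. The tile size, scene dimensions, and the detect() stub are illustrative assumptions, not a description of any particular pipeline.

```python
from concurrent.futures import ProcessPoolExecutor
from dataclasses import dataclass
from typing import Iterator, List

@dataclass
class Tile:
    scene_id: str  # which large image this tile came from
    x: int         # pixel offset of the tile within the scene
    y: int

@dataclass
class Detection:
    scene_id: str
    x: int
    y: int
    score: float   # model confidence, 0.0 to 1.0

def make_tiles(scene_id: str, width: int, height: int, size: int = 1024) -> Iterator[Tile]:
    """Split one large scene into fixed-size tiles the detector can ingest."""
    for y in range(0, height, size):
        for x in range(0, width, size):
            yield Tile(scene_id, x, y)

def detect(tile: Tile) -> List[Detection]:
    """Placeholder for the real model call; a real version would crop the
    tile's pixels, run the detector, and return boxes in scene coordinates."""
    return []

if __name__ == "__main__":
    tiles = list(make_tiles("atlantic_scene_001", width=30_000, height=30_000))
    # Fan the tiles out across local worker processes; in a cluster this
    # would be a distributed job queue rather than a process pool.
    with ProcessPoolExecutor() as pool:
        detections = [d for batch in pool.map(detect, tiles) for d in batch]
    print(f"{len(tiles)} tiles processed, {len(detections)} candidate detections")
```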
You would then have results of thousands or tens of thousands of bounding boxes, full of false positives (the wrong ship, or not a ship at all), and missing some ships entirely (false negatives, e.g. ships obscured by clouds or other circumstances). Let’s say it detected 10,000. Are we done? Did we find the ship?
No, and maybe. We now have to go through those 10,000 detected ships to decide if one of them is our ship in question. This requires some data engineering downstream of the model run, and an interface for humans to decide whether one of the ships was the right one.
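As a minimal sketch of that downstream step, assuming each detection carries a confidence score, we can rank the candidates, drop the obvious low-confidence noise, and write a queue file for the review interface to consume. The Detection fields, file names, and threshold here are assumptions for illustration.

```python
import csv
from dataclasses import dataclass

@dataclass
class Detection:
    scene_id: str
    x: int
    y: int
    score: float  # model confidence, 0.0 to 1.0

def build_review_queue(detections, path="review_queue.csv", min_score=0.3):
    """Rank candidates by model confidence, drop low-confidence noise, and
    write a queue file for the human review interface to consume."""
    ranked = sorted((d for d in detections if d.score >= min_score),
                    key=lambda d: d.score, reverse=True)
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        # "label" is left blank here and filled in by the reviewers later.
        writer.writerow(["scene_id", "x", "y", "score", "label"])
        for d in ranked:
            writer.writerow([d.scene_id, d.x, d.y, f"{d.score:.3f}", ""])
    return len(ranked)

if __name__ == "__main__":
    sample = [Detection("atlantic_scene_001", 5120, 8192, 0.91),
              Detection("atlantic_scene_001", 1024, 2048, 0.12)]
    print(build_review_queue(sample), "candidates queued for review")
```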
If we had an efficient interface, the humans could play a game of “yes” / “no” / “maybe” across all of the detections, and continue to review the images marked yes or maybe until someone decided they had found their ship. This could take hours, or less, but it is certainly a step that requires humans. And barring a lucky break on the human-only side, of happening upon the ship early in the process, this combined method of machine and human matching will likely solve the task an order of magnitude faster than the human-only method.
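A bare-bones version of that yes/no/maybe loop could be as simple as the terminal sketch below; a real tool would of course show the image chips and support multiple reviewers. The file names and prompt flow are assumptions, continuing from the queue file above.

```python
import csv

VERDICTS = {"y": "yes", "n": "no", "m": "maybe"}

def review(queue_path="review_queue.csv", out_path="review_labels.csv"):
    """Walk the queue of candidate detections and record a yes/no/maybe
    verdict for each one; 'q' quits early and saves what has been labeled."""
    with open(queue_path, newline="") as f:
        candidates = list(csv.DictReader(f))
    labeled = []
    for c in candidates:
        prompt = f"{c['scene_id']} @ ({c['x']}, {c['y']}) score={c['score']}  [y/n/m/q]: "
        answer = input(prompt).strip().lower()
        if answer == "q":
            break
        labeled.append({**c, "label": VERDICTS.get(answer, "maybe")})
        if answer == "y":
            print("Marked yes: escalate for a second reviewer to confirm.")
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["scene_id", "x", "y", "score", "label"])
        writer.writeheader()
        writer.writerows(labeled)

if __name__ == "__main__":
    review()
```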
On both sides:
Just as the experience of sifting through imagery looking for a specific ship will tune the human analysts’ awareness of the minute aspects of ships, and of how best to spot and sort through the mass of data to find a specific one, so too will the human review of the object detection outputs (the yes/no/maybe results) serve as valuable training data for retraining the object detection model. Ideally this makes the search more accurate with each iteration: instead of 10,000 potential matches, only 1,000, and then 100 after several more exercises of the task.
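In code terms, that feedback loop is little more than partitioning the reviewer verdicts into retraining buckets. The sketch below assumes the label file written by the review step above, with yes becoming positive examples, no becoming hard negatives, and maybe held back for another look.

```python
import csv

def split_for_retraining(labels_path="review_labels.csv"):
    """Turn reviewer verdicts into retraining buckets: 'yes' detections become
    positive examples, 'no' detections become hard negatives (the model's own
    mistakes, which are especially valuable), and 'maybe' is held for another pass."""
    positives, hard_negatives, held_out = [], [], []
    with open(labels_path, newline="") as f:
        for row in csv.DictReader(f):
            if row["label"] == "yes":
                positives.append(row)
            elif row["label"] == "no":
                hard_negatives.append(row)
            else:
                held_out.append(row)
    return positives, hard_negatives, held_out

if __name__ == "__main__":
    pos, neg, maybe = split_for_retraining()
    print(f"{len(pos)} positives, {len(neg)} hard negatives, {len(maybe)} held out")
```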
Ideally, both methods would benefit from the analytical techniques typically used to greatly narrow the search: for example, using the last known location and heading, knowledge of the ship’s route or destination, or any publicly available GPS or AIS hits that match the ship exactly or come from the same class of ship. Any of these could reduce the task to a fairly marginal search within one small patch of imagery that correlates to a known time and position.
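For instance, a last known position, a maximum speed, and the hours elapsed define a reachability radius, and any imagery tile outside that radius can be skipped before the detector or the analysts ever see it. The sketch below is a simple haversine filter with made-up coordinates and speeds.

```python
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def reachable(last_lat, last_lon, max_speed_knots, hours_elapsed, tile_lat, tile_lon):
    """Could the ship have travelled from its last known position to this tile's
    centre in the time elapsed? (1 knot is about 1.852 km/h.)"""
    max_range_km = max_speed_knots * 1.852 * hours_elapsed
    return haversine_km(last_lat, last_lon, tile_lat, tile_lon) <= max_range_km

# Illustrative numbers only: a tanker last seen off Gibraltar 12 hours ago,
# capable of 16 knots, checked against a tile centred a few hundred km away.
print(reachable(36.1, -5.4, 16, 12, 35.0, -8.0))  # True: the tile stays in the search
```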
But when there is no supporting information about the location or destination of the ship, or truck, the task becomes more like the scenarios described above: a brute-force, divide-and-conquer search across a vast ocean of data (literally). And clearly the fastest and cheapest way forward will be to advance our object detection algorithms, and the human interfaces that quickly validate and retrain on those algorithms’ results.