With a history dating back to 1851 and over 125 Pulitzer Prizes under its belt, the New York Times has amassed a mountain of photos: between five and seven million of them. They’re all stored in the “morgue” under the paper’s Times Square office, packed into countless drawers and cupboards, and the Times is now working with Google to digitise the entire collection.
Google says that many of these photos have been stored in folders and not looked at for years. Some of them date back as far as the late 19th century. There is a card catalogue, which provides an overview of the archive’s contents, but there is much that has gone unseen for a long time down in that basement.
With 5-7 million photos, simply scanning and storing them is not enough. That doesn’t give photo editors anything that they can easily search and use. So, Google and the NYT are turning to AI to process the images. It recognises things like text, handwriting and other details in each image to help create a more valuable index. One example Google posted to demonstrate how the AI helps provide context shows the front and back of a photo of Penn Station shot in 1942.
The photo is a great record of what was going on at that time but without any context, there isn’t really anything to say what it contains or the reason for its creation. – Google
When the back of the photo was fed into Google’s Cloud Vision API, it returned the text it detected there, which could then be associated with the photograph. It’s not perfect, of course, but Google says that it’s the fastest and most cost-effective method compared to the alternatives for this quantity of images.
Google also says that this is only the beginning of what’s possible with computer vision. For example, running logo detection on the front of the photograph recognised that it was shot in Pennsylvania Station. The Cloud Natural Language API can also be used to clean up any recognised text, making it more syntactically correct, human-readable and searchable.
It’s a mammoth task, and it’s easy to understand why it’s been put off for so long. Only now is the technology reaching the level needed to index this quantity of content easily.
If you want to find out more, watch the video above, and check out the Google Cloud Blog.
[via The Verge]
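For anyone curious what the Cloud Vision step described above might look like in practice, here is a minimal Python sketch. It is not the pipeline Google and the NYT actually built, which isn't published in this article. It only assembles the JSON body for the Cloud Vision REST method `images:annotate`, asking for document text detection and logo detection, and pulls the recognised text and logo names out of a response. The function names `build_annotate_request` and `summarise_response` are made up for illustration, and actually sending the request (with your own credentials) is left out.

```python
import base64


def build_annotate_request(image_bytes: bytes) -> dict:
    """Build a JSON body for the Cloud Vision `images:annotate` REST method.

    Hypothetical helper: it requests the two features the article mentions,
    text (including handwriting) detection and logo detection.
    """
    return {
        "requests": [
            {
                # The REST API expects the raw image bytes base64-encoded.
                "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
                "features": [
                    {"type": "DOCUMENT_TEXT_DETECTION"},
                    {"type": "LOGO_DETECTION"},
                ],
            }
        ]
    }


def summarise_response(response: dict) -> dict:
    """Pull the recognised text and logo names out of an annotate response.

    Hypothetical helper: missing fields are treated as "nothing detected".
    """
    first = (response.get("responses") or [{}])[0]
    text = first.get("fullTextAnnotation", {}).get("text", "")
    logos = [a.get("description", "") for a in first.get("logoAnnotations", [])]
    return {"text": text.strip(), "logos": logos}
```

The body returned by `build_annotate_request` would be POSTed to `https://vision.googleapis.com/v1/images:annotate` with valid credentials. The summary from `summarise_response` is the kind of record that could be stored alongside each scan to make it searchable.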