Video Content Recognition Systems
Image and Video Recognition Experts
P. O. Box 13051
Savannah, GA. 31406
Phone: (912) 484-1717
© Attrasoft 1998 - 2008
Attrasoft provides products & services to monitor both Video & Image content. The Attrasoft pattern matching algorithms search for matches within the content. This document systematically describes Attrasoft’s content-based video recognition systems, which can be used to identify video clips, movies, and television content by analyzing digital video files. Attrasoft has robust content recognition technologies that use video fingerprinting techniques.
Piracy is running rampant on the Internet.
Figure 1.1 Copyrighted material on Youtube.com.
Accurate automated identification of digital content is important in various applications, including content owners’ anti-piracy efforts, Internet Service Providers’ anti-piracy efforts, and media companies’ need to collect critical market data for strategic planning.
There are several approaches to video identification, including human labor, digital watermarking, keyword search, and content-based recognition, which is further divided into audio and video content recognition.
Human labor is very good at detecting some extraordinary variations including:
Human labor, however, is inadequate to handle the problem on a large scale because of three factors: error, cost, and scalability. Beyond a certain limit, typically a few hundred videos, humans have difficulty identifying video clips accurately. The human approach has a long processing time per unit, and its cost grows exponentially with the amount of video content. The human approach to detecting video piracy also does not scale well. These problems (error, cost, and scalability) make the human solution unfavorable in the long run.
A digital watermark is a signal which is embedded into digital data (audio, video, images and text) that could be detected or extracted later. It is mostly used to insert copyright information of the data. Using watermarking and/or forensic marking requires modification of the original content or files. Video identification based on digital watermarks can be used only in some simple cases because a digital watermark can be altered or removed. An identification system based on something that can be altered easily by content users is not very reliable.
Keyword search requires attaching keywords to image frames in a video. It is generally accepted that it takes 3 human hours to process one hour of video for about 10 keywords. If what you want is not included in the keywords, the search will fail. Also, this 1-to-3 ratio of video time to labor time requires a large number of labor hours. Once again, this approach does not scale.
This leads to a natural conclusion: an automatic content-based identification system (audio, video, images, and text) is required to identify video content.
The leading Audio technology providers are limited by their approach: most cannot identify clips shorter than 5 seconds. Audio fingerprinting does not work for modified video. For example, a music track over an NFL video clip will not be detected by audio alone. Therefore, while Audio technology is an important part of the solution, it is simply not complete.
Attrasoft issues this report for potential vendors that require video content recognition systems to identify video content by analyzing digital files. This document systematically describes the ability of Attrasoft’s content recognition systems to identify video clips, movies, and television content by analyzing digital video files. Attrasoft has robust content recognition technologies that use video fingerprinting techniques, which are currently in large-scale production and have been proven in real-world environments.
The Attrasoft solution is an automated content identification platform based on internally developed image and pattern recognition algorithms. Attrasoft converts digital video content to fingerprints and performs fast, accurate, and scalable video content recognition via the fingerprints.
Attrasoft video recognition technology is rooted in Attrasoft image recognition technology. Video identification is easier than image identification because a video provides many image frames rather than a single image. To download the demo software used in this white paper, go to: http://attrasoft.com.
This white paper intends to provide sufficient information for a full assessment of the Attrasoft Technology against the anticipated requirements for video content matching.
This white paper also provides the Attrasoft Video Content Matching product road map, so the reader can anticipate the next generation of Attrasoft products. Attrasoft will adopt a phased approach to systematically improve the fingerprint technology to meet specific marketplace requirements.
First of all, digital videos are converted into digital fingerprints; see the following Figure:
Digital Contents Fingerprints
Figure 2.1. Digital videos are converted into digital fingerprints.
A fingerprint consists of a set of digital attributes computed from a video clip. Attrasoft’s competitive advantage is its proprietary algorithms that will both:
(1) Create fingerprints, and
(2) Match fingerprints.
Later, when an unknown video is matched against the library, it will be converted into a fingerprint at that time and then matched against the fingerprint library.
Figure 2.2. An unknown video is converted into a fingerprint and then matched against the fingerprint library.
Attrasoft video recognition technology is rooted in Attrasoft image recognition technology. An image is converted into a set of attributes, called an image signature. A video is decomposed into a set of images, which are in turn converted into image signatures. The collection of these image signatures forms a video fingerprint.
An Image Signature is a collection of digital attributes for an image.
A Video fingerprint is a collection of digital attributes for a video.
The Attrasoft video fingerprint consists of a set of image signatures computed from individual image frames. The computation depends on the time interval at which an image signature is computed and added to the video fingerprint.
Sampling Interval/Sampling Rate
Sampling Interval is the time interval at which an image signature is computed and added to the video fingerprint. Sampling Rate is the number of image signatures per unit time that are computed and added to the video fingerprint.
The Sampling Intervals are different for the video library and the unknown video. In general, an unknown video does not require as many image signatures as the library video.
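As a rough illustration of the two sampling schemes, the sketch below computes the timestamps at which image signatures would be taken: at a fixed interval for a library video, and as a fixed count spread over an unknown clip. The defaults (one signature per second, 40 signatures per unknown clip) are the settings quoted elsewhere in this paper; the function names and the even spacing of samples are assumptions made for illustration, not a description of how VideoFinder places its samples.

```python
def library_sample_times(duration_s: float, interval_s: float = 1.0) -> list[float]:
    """Timestamps (seconds) sampled at a fixed interval across a library video."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    count = int(duration_s // interval_s)
    return [i * interval_s for i in range(count)]


def unknown_sample_times(duration_s: float, samples: int = 40) -> list[float]:
    """Timestamps (seconds) for a fixed number of samples over a whole clip.

    Even spacing is an assumption; the paper only gives the sample count.
    """
    if samples <= 0:
        raise ValueError("samples must be positive")
    step = duration_s / samples
    return [i * step for i in range(samples)]


def sampling_rate(interval_s: float) -> float:
    """Sampling Rate (signatures per second) is the reciprocal of the interval."""
    return 1.0 / interval_s
```

Under these assumptions, a 60-second library video yields 60 sample times, while the same 60 seconds treated as an unknown clip yields 40 sample times spaced 1.5 seconds apart.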
Attrasoft provides products including software, complete systems, and developer tools.
For those who are interested in the Attrasoft core technology, ImageFinder for Windows is an Off-the-Shelf Application Software that enables System Integrators, Solution Developers, and Individuals to quickly test their own Image Recognition ideas.
Attrasoft VideoFinder is stand-alone video content recognition software. Attrasoft VideoFinder is currently available and can be purchased from Attrasoft.com.
Figure 2.3. Attrasoft VideoFinder is currently available and can be obtained via email from Attrasoft.com.
Attrasoft KeepWatch is a stand-alone system, which consists of both software and hardware for content recognition.
Developers interested in obtaining the video fingerprint component can take a look at the Attrasoft TransApplet:
Information on the VideoFinder, the off-the-shelf product:
Information on the VideoFinder demo:
Information on the ImageFinder, the off-the-shelf product:
Information on the TransApplet, the off-the-shelf product:
To order, go to:
The Attrasoft Technology provides highly accurate identification of video files including low rates of misidentified (false acceptance) and unidentified content (false rejection).
The foundation of video identification is image identification: a video is broken into image frames, and each frame is identified as an image. Attrasoft Image Identification technology is currently in large-scale production.
a. False-Positives (content not in the identifier database incorrectly identified).
The False-Positives rate of Attrasoft Technology for large libraries of movie and television content is expected to be less than 0.1%, if no variant versions are involved.
b. False-Negatives (content in the identifier database not identified).
The False-Negatives rate is expected to be less than 0.1%, if no variant versions are involved.
c. Attrasoft solution accuracy is not affected by:
(i) The size of the content library: for example, 10000 video clips are just as good as 1000 video clips;
(ii) The resolution and quality of the digital video file: for example, high compression is just as good as low compression; and
(iii) The length of video available for analysis: for example, a 5-second clip is just as good as a 30-second clip.
When a video has multiple versions to start with, it may have multiple fingerprints. If a variant version can be identified, then no new fingerprint will be required. If a variant version cannot be identified, then a new fingerprint will be required. Generally speaking, adding the fingerprint of the variant version to the master library will increase the accuracy.
The Attrasoft solution identifies the time section of a video that the recognized video segment matches. By default, the identified video segment is specified to within one second.
The Attrasoft solution offers options for modifying the Technology to decrease the rate of False-Positives, potentially at the expense of a higher rate of False-Negatives, and vice versa. In some applications, a False-Negative is more acceptable than a False-Positive; in others, a False-Negative is less acceptable than a False-Positive.
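The trade-off between the two error types can be sketched as follows, assuming (the paper does not say this) that the matcher produces a similarity score for each comparison and that a single acceptance threshold decides whether a match is reported. The scores below are made-up illustration values, not Attrasoft measurements.

```python
def error_rates(scores_same, scores_different, threshold):
    """Return (false_negative_rate, false_positive_rate) at a given threshold.

    scores_same: scores for pairs that really are the same content.
    scores_different: scores for pairs that are different content.
    A pair is reported as a match when its score is >= threshold.
    """
    false_negatives = sum(1 for s in scores_same if s < threshold)
    false_positives = sum(1 for s in scores_different if s >= threshold)
    return (false_negatives / len(scores_same),
            false_positives / len(scores_different))


# Hypothetical similarity scores, for illustration only.
same = [0.95, 0.90, 0.80, 0.70]
different = [0.75, 0.40, 0.30, 0.10]

for threshold in (0.5, 0.72, 0.85):
    fn_rate, fp_rate = error_rates(same, different, threshold)
    print(f"threshold={threshold}: false negatives={fn_rate}, false positives={fp_rate}")
```

With these sample values, raising the threshold from 0.5 to 0.85 removes the false positive but rejects half of the genuine matches, which is the kind of adjustment an application would make according to which error it tolerates better.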
Attrasoft Technology does not require changes to digital video files.
While it is the intention of Attrasoft to support all common video codecs for a wide-range of resolutions and bit rates, the current Version of the Attrasoft VideoFinder supports:
limited *.mpeg (*.mpg).
Attrasoft will adopt a phased approach to gradually support all common video codecs.
The unsupported files include:
some *.mpg, *mpeg
The Attrasoft Technology can successfully identify video files that have undergone some common video transformations, including those frequently used in Internet piracy. Image compression will not affect identification accuracy.
a. Attrasoft’s underlying matching engine is not affected by common video processes such as transcoding. (Transcoding is the direct digital-to-digital conversion from one (usually lossy) codec to another. It involves decoding/decompressing the original data to a raw intermediate format in a way that mimics standard playback of the lossy content, and then re-encoding this into the target format). This is because the underlying matching engine can handle distortions far worse than transcoding.
b. Attrasoft’s underlying matching engine is not affected by common video processes such as scaling. This is because all image frames will be normalized; therefore, all of the scaling effects will be removed.
c. Attrasoft’s underlying matching engine is not affected by Telecine, which is the process of transferring motion picture film into electronic form (Telecine is the same root as in 'cinema'; also "tele-seen").
d. Attrasoft’s underlying matching engine is not affected by removal/insertion of commercials in terms of adding and removing frames. This is because the remaining frames are enough for identification.
e. Attrasoft’s underlying matching engine is not affected by insertion of new contents into some of the frames. This is because the remaining frames are enough for identification.
f. Attrasoft’s underlying matching engine is not affected by higher levels of compression.
The Attrasoft Technology might not work under the following conditions:
a. Insertion of a large amount of new content into all of the frames, where the new content significantly modifies the original image frames and in turn causes the image frame signatures to be sufficiently different from the originals.
b. Pirated Internet video files such as camcorder capture, which includes the TV set, while the video only includes the middle portion of TV set.
c. Pirated Internet video files such as camcorder capture, which includes the TV set, while the TV screen only occupies the middle portion of the video.
d. Pirated Internet video files such as camcorder capture in a movie theater, while the video only includes the middle portion of the movie screen.
e. Pirated Internet video files such as camcorder capture in a movie theater, while the movie screen only occupies the middle portion of the video.
In theory, Attrasoft Technology is frame based and there is no limit in the minimal sample size. Practically, the technology does require a minimal length of video content required for identification, including identification for video with missing segments.
a. The required length of video content for accurate identification is a matter of optimization. For a short video clip, 1 second of video will be required.
b. The Technology can identify video with missing video segments. It does not matter how much is missing; it only matters how much is in the remaining video. At least one signature will be required to make the matching. While the library signatures are computed once every second, the newly captured video clip generates 1 signature every 2.5 percent of the video; therefore, the minimum segment in a video must occupy 3% of the total video length. This default setting can be changed to remove this 3% limit.
c. Since Attrasoft Technology is image frame based, the Technology can process video segments arranged out of order.
Currently, the technology does not support keyword hinting for an accelerated search. Customization will be required to handle hinting.
The Attrasoft Technology resists deliberate attempts to disguise video content by altering the video, audio, and metadata; however, some attempts will be considered as variations and should be added to the master library.
The content identifier is video frame based. The identifier for a particular frame is created as filename + frame number, where the frame number has six digits. For example, for a video file xyz.wmv, the frame identifier at 17 seconds will be xyz_000017.
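A minimal sketch of this naming scheme is shown below. The function name is hypothetical; dropping the file extension and joining the parts with an underscore are inferred from the xyz_000017 example rather than stated as a rule.

```python
from pathlib import Path


def frame_identifier(video_filename: str, frame_number: int) -> str:
    """Build a frame identifier: file name without extension + six-digit number."""
    if not 0 <= frame_number <= 999_999:
        raise ValueError("frame_number must fit in six digits")
    stem = Path(video_filename).stem  # "trailer.wmv" -> "trailer"
    return f"{stem}_{frame_number:06d}"


print(frame_identifier("trailer.wmv", 17))
```

Because the number is fixed at six digits, this scheme can label at most 1,000,000 frames per file.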
The Attrasoft Technology is applicable internationally. The Attrasoft Technology would address multiple language versions of video content including subtitles, because it is video based.
1. The core identification technology is technically and commercially mature, meaning it is currently in production.
TNS Media Intelligence is the leading provider of strategic advertising intelligence to advertisers, advertising agencies, and media properties. Established in 23 countries with more than 16,000 customers, TNS Media Intelligence is part of the TNS Group, ranked #2 worldwide in marketing information. TNS Media Intelligence monitors 3 million brands worldwide across a multitude of media, including TV, radio, print, the Internet, and cinema. TNS dominates 90% of media monitoring market.
TNS has currently deployed customized Attrasoft ImageFinder software into their print media monitoring system with accurate results allowing them to free up over 100 employees and automate the image recognition & classification process of their print media ads.
2. Attrasoft uses third-party software to support all common video codecs.
3. Attrasoft currently uses text files to store video fingerprints.
4. Attrasoft currently does not use any database, such as SQL Server, to store the fingerprints. Customization will be required to store the video fingerprints in a database.
5. Attrasoft TransApplet is a Visual Studio class library. Attrasoft’s core technology is available for licensing to potential vendors and system integrators. Vendors interested in licensing should contact Attrasoft directly.
6. The Attrasoft Technology has a set of adjustable parameters that support multiple identification modes, so that system requirements can be optimized at a cost of reduced accuracy. For example, the number of fingerprints used in the initial screening can be reduced.
7. The Attrasoft Technology can leverage previous identifications to improve future performance as follows: when a video has multiple versions, the accuracy can be improved by adding each newly identified variation into the master library.
Attrasoft provides five levels of hardware support. The first 2 levels are currently available for order; the next three levels require customization.
VideoFinder (software) Alone
VideoFinder + 32 bit PC (2GB)
VideoFinder + 64 bit PC (100 GB)
VideoFinder + 64 bit PC (100 GB)+ SQL Server Interface
The first option is the off-the-shelf software and the other options require customization and a Service package.
In the first option, users simply order the software from Attrasoft and use their own computer to run Attrasoft VideoFinder.
In the second option, users provide a video database to Attrasoft and Attrasoft returns a PC, software, and video fingerprint library. The complete system is ready to run in a matter of hours.
In the third option, users provide a video database to Attrasoft and Attrasoft returns a 64-bit PC, 64-bit software, and video fingerprint library. The complete system is ready to run in a matter of hours.
The fourth option uses SQL database and the fifth option deploys multiple workstations.
1. Attrasoft VideoFinder Software
2. Simple KeepWatch System (Attrasoft sells both 32-bit software and hardware)
3. Attrasoft KeepWatch (Attrasoft sells both 64-bit software and hardware)
4. Attrasoft SQL KeepWatch (Attrasoft sells both 64-bit software and hardware)
The Attrasoft KeepWatch provides rapid identification of content with customized hardware (64-bit workstation and 100 GB RAM) and Microsoft Windows Server 2003. The customized hardware provides more matching power for video content recognition.
The Attrasoft VideoFinder provides rapid identification of content with minimal hardware (PC) and software (Windows) requirements.
Both initial fingerprint conversion and matching are linear, i.e., the fingerprint conversion time is proportional to the video length, and the matching time is proportional to the number of signature-to-signature matches.
This section deals with the following performance issues:
These are simple questions, but the answers do depend on the parameter setting. In particular, it depends on the Sampling Interval or Sampling Rate introduced earlier.
In this section we will use the default setting, which is 1 image signature per second. In the next section, we will set the parameters differently.
The default sampling rate for library fingerprints is one image per second. This rate can be changed via parameter setting.
Attrasoft VideoFinder computes 2 – 3 image signatures per second, so it will take about half an hour, or a bit less, to process one hour of video.
If the default sampling rate is changed to one image every 10 seconds, the fingerprint conversion speed will be almost 10 times faster.
The default sampling rate for a newly captured video clip or unknown video is 40 images over the entire video length.
Attrasoft VideoFinder computes 2 image signatures per second, so it will take about 20 seconds to process an unknown video clip.
The atom of matching is signature-to-signature matching. The current matching-speed is about 200,000 matches per second.
This speed requires the entire fingerprint library to be loaded into RAM.
At a high data compression rate, one hour of *.wmv takes about 90 MB. One hour of video fingerprints will take 5 MB of storage space.
The 32-bit high end PC (2 GB) should run at the speed of 200,000 match/sec and hold 1,000,000 (1 million) signatures. The 1,000,000 signatures will reach the limit of the 2GB RAM.
The 64-bit high end PC (100 GB) should run at the speed of 200,000 match/sec and hold 100,000,000 (100 million) signatures.
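The signature capacities above translate into hours of library video once a sampling interval is fixed. The following sketch does that arithmetic; the function name is hypothetical, and the one-second default is the library sampling interval quoted in this paper.

```python
SECONDS_PER_HOUR = 3600


def library_hours(max_signatures: int, sampling_interval_s: float = 1.0) -> float:
    """Hours of library video whose signatures fit in memory."""
    return max_signatures * sampling_interval_s / SECONDS_PER_HOUR


print(round(library_hours(1_000_000)))    # 32-bit PC (2 GB)
print(round(library_hours(100_000_000)))  # 64-bit PC (100 GB)
```

At one signature per second this gives roughly 278 hours for the 32-bit PC and roughly 27,778 hours for the 64-bit PC, which this paper rounds to 300 and 30,000 hours.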
The current setting is 1 fingerprint per second in library video and 40 fingerprints per unknown video clip.
32-bit Computer (2GB) and Sequential Search:
This means a high-end PC will match 200,000/40 = 5,000 seconds of video per second, or about 1 – 2 hours of video per second, for less than 1,000,000 signatures (i.e., 300 hours).
For example, to match a 1-hour video against a 300-hour video library, it will take 20 seconds to convert the newly obtained video into a fingerprint and then 200 seconds to match, so the whole process will take about 4 minutes.
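The estimate can be reproduced with a short calculation, sketched here under the figures quoted in this section: 200,000 signature matches per second, 40 signatures per unknown video, about 20 seconds of fingerprinting, and a 300-hour library taken as roughly 1,000,000 signatures. The function names are hypothetical.

```python
def match_time_s(library_signatures: int,
                 unknown_signatures: int = 40,
                 matches_per_s: int = 200_000) -> float:
    """Sequential search: every unknown signature is compared with every library signature."""
    return library_signatures * unknown_signatures / matches_per_s


def total_time_s(library_signatures: int, fingerprint_time_s: float = 20.0) -> float:
    """Fingerprinting time for the unknown video plus sequential matching time."""
    return fingerprint_time_s + match_time_s(library_signatures)


print(match_time_s(1_000_000))        # seconds spent matching
print(total_time_s(1_000_000) / 60)   # total minutes
```

The total comes to 220 seconds, a little under 4 minutes. Because the matching term is proportional to the number of library signatures, halving the library or doubling the sampling interval halves the matching time.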
To speed up the matching, the sampling rate has to be reduced, so the trade-off is between accuracy and speed. For example, if the sampling interval is 1 image every 10 seconds, the speed will be increased by a factor of 10.
64-bit Workstation (100GB) and Sequential Search:
Beyond 300 hours of video, 64-bit computation should be used, which should increase the in-memory hours from 300 hours to 30,000 hours of video (i.e., a factor of 100); at which point, the matching speed is the primary concern.
Fingerprint generation: it takes 1 hour to process 2 hours of video files.
Matching: it takes 1 second to match 1 to 2 hours of video in the library.
A single PC will be able to handle cases where the master library is in the order of 1000 hours at the default sampling rate (1 signature per second).
If no video content monitoring is ever deployed, then a vendor may choose a phased approach. Initially, it might simply compute a signature every minute, half minute, or 10 seconds. This section provides data for various settings. For the core data, please refer to the previous section.
Sampling Rate: 1 signature per second
Fingerprint generation: it takes 1 hour to process 2 hours of video files.
Matching: it takes 1 second to match 1 hour of video in the library, plus 20 seconds overhead for fingerprinting.
A single PC will be able to handle cases where the master library is in the order of 300 hours at the default sampling rate (1 signature per second).
Sampling Rate: 1 signature every 10 seconds
Fingerprint generation: it takes 1 hour to process 20 hours of video files.
Matching: it takes 1 second to match 10 hours of video in the library, plus 20 seconds overhead for fingerprinting.
A single PC will be able to handle cases where the master library is in the order of 3000 hours.
Sampling Rate: 1 signature every 30 seconds
Fingerprint generation: it takes 1 hour to process 60 hours of video files.
Matching: it takes 1 second to match 30 hours of video in the library, plus 20 seconds overhead for fingerprinting.
A single PC will be able to handle cases where the master library is in the order of 10,000 hours.
Sampling Rate: 1 signature every minute
Fingerprint generation: it takes 1 hour to process 100 hours of video files.
Matching: it takes 1 second to match 60 hours of video in the library, plus 20 seconds overhead for fingerprinting.
A single PC will be able to handle cases where the master library is in the order of 20,000 hours.
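The settings above follow from scaling the core figures linearly with the sampling interval. The sketch below applies strict linear scaling to the figures quoted in this paper (2 signatures computed per second, 200,000 matches per second, 40 signatures per unknown video, 1,000,000 signatures held in RAM on the 32-bit PC). The function name is hypothetical, and the values listed in this section are rounded, in places conservatively, so they differ somewhat from the exact results.

```python
def scaled_figures(interval_s: float,
                   signatures_per_s: float = 2.0,
                   matches_per_s: int = 200_000,
                   unknown_signatures: int = 40,
                   max_signatures: int = 1_000_000) -> tuple[float, float, float]:
    """Return (video hours fingerprinted per hour of computing,
               library hours matched per second,
               library hours that fit in memory)."""
    fingerprinted_per_hour = signatures_per_s * interval_s
    matched_per_second = matches_per_s / unknown_signatures * interval_s / 3600
    capacity_hours = max_signatures * interval_s / 3600
    return fingerprinted_per_hour, matched_per_second, capacity_hours


for interval in (1, 10, 30, 60):
    generated, matched, capacity = scaled_figures(interval)
    print(f"1 signature every {interval:>2} s: "
          f"{generated:.0f} h fingerprinted per hour, "
          f"{matched:.1f} h matched per second, "
          f"{capacity:,.0f} h library capacity")
```

Each tenfold increase in the sampling interval buys a tenfold gain in all three quantities, at the cost of fewer signatures per video and therefore lower accuracy.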
In terms of scalability requirements, the typical video library sizes are:
Library size 1,000 hours
Library size 10,000 hours
Library size 50,000 hours
Library size 100,000 hours
Library size 1,000,000 hours
Each of these scalability sizes will in turn, satisfy the specified requirements for:
The scalability problem depends on several factors:
The default rate is 1 signature per second. This rate can be changed via parameter setting. The hardware can be 32-bit (2 GB RAM) or 64-bit (up to 100 GB RAM).
The relationship between a PC and the number of hours in a video library is:
a. 32-bit System
b. 64-bit System (Attrasoft KeepWatch)
Once this limit is reached, either the sampling rate has to be reduced or more 64-bit workstations will be required.
For example, to handle a video library of 300,000 hours while keeping the rate at 1 signature per second, ten 64-bit workstations will be required.
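The workstation count can be estimated as sketched below. The function name is hypothetical; the 30,000-hour capacity per 64-bit workstation is the rounded figure this paper gives for one signature per second, and scaling the capacity with the sampling interval is an assumption based on the linear behavior described earlier.

```python
import math


def workstations_needed(library_hours: float,
                        hours_per_workstation: float = 30_000,
                        sampling_interval_s: float = 1.0) -> int:
    """64-bit workstations required to hold the whole fingerprint library in RAM."""
    capacity_hours = hours_per_workstation * sampling_interval_s
    return math.ceil(library_hours / capacity_hours)


print(workstations_needed(300_000))                          # 1 signature per second
print(workstations_needed(300_000, sampling_interval_s=10))  # 1 signature every 10 seconds
```

At one signature per second the estimate is ten workstations; sampling once every 10 seconds brings the same library within reach of a single workstation, with the accuracy cost that a lower sampling rate implies.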
The content library size for which the Attrasoft Technology has been tested, and with which it has been in production for two years, is 150 hours (i.e., 500,000 fingerprints).
There is a trade-off between accuracy and speed. At a lower accuracy, speed can be increased; therefore, a larger library can be accommodated.
For example, to handle 300,000 hours of video library:
The atomic matching is 1 signature against 1 signature. This is very fast.
When one signature matches against a library, there are two approaches, sequential and binary. Binary matching is fast, but requires heavy overhead operation; sequential search is slow, but requires no overhead operation. The biggest advantage of binary search is that it scales very well, i.e. the matching speed is basically a constant, regardless of how many signatures are in the fingerprint library.
Disadvantage of Sequential Matching
Advantage of Binary Matching
Currently, Attrasoft video content matching is sequential matching. Attrasoft plans to add binary search.
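The difference between the two search strategies can be sketched as follows. This example assumes, purely for illustration, that each signature can be reduced to a single sortable integer key and matched exactly. The paper does not say how Attrasoft's binary matching would work, and real signature matching presumably tolerates small differences, so this is not a description of the Attrasoft method.

```python
from bisect import bisect_left


def sequential_find(keys, target):
    """Scan every key in turn; cost grows linearly with the library size."""
    for index, key in enumerate(keys):
        if key == target:
            return index
    return -1


def binary_find(sorted_keys, target):
    """Binary search over pre-sorted keys; cost grows with log2 of the library size."""
    index = bisect_left(sorted_keys, target)
    if index < len(sorted_keys) and sorted_keys[index] == target:
        return index
    return -1


# Sorting the library once is the up-front overhead of the binary approach.
library_keys = sorted([42, 7, 19, 88, 3])
print(sequential_find(library_keys, 19), binary_find(library_keys, 19))
```

A sequential scan of 1,000,000 keys makes up to 1,000,000 comparisons per lookup, while a binary search makes about 20, which is why the matching speed appears nearly constant as the library grows.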
If you plan to implement a video identification solution, you will need to prepare test cases. This chapter helps you to prepare the test cases.
Master File Sets (“MFS”)
Master versions of some movie and television titles will be the master library.
Unknown File Sets or Test File Sets (“TFS”)
To be matched against the master library.
The system will be tested for both false positives and false negatives.
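One way to score such a test run is sketched below, using the definitions given earlier in this paper: a false positive is content not in the library that is nevertheless identified, and a false negative is content in the library that is not identified. The function name and the (expected, reported) record layout are assumptions for illustration; the titles are placeholders.

```python
def accuracy_rates(results):
    """Compute (false_positive_rate, false_negative_rate) from test results.

    results: list of (expected_title, reported_title) pairs, where None means
    "not in the master library" (expected) or "no match reported" (reported).
    """
    outside = [r for r in results if r[0] is None]
    inside = [r for r in results if r[0] is not None]
    false_positives = sum(1 for _, reported in outside if reported is not None)
    false_negatives = sum(1 for _, reported in inside if reported is None)
    fp_rate = false_positives / len(outside) if outside else 0.0
    fn_rate = false_negatives / len(inside) if inside else 0.0
    return fp_rate, fn_rate


# Placeholder test file set: four clips from the library, four from outside it.
test_results = [
    ("A", "A"), ("B", None), ("D", "D"), ("E", "E"),
    (None, None), (None, None), (None, "C"), (None, None),
]
print(accuracy_rates(test_results))
```

A clip from the library that is reported under the wrong title is counted as neither error here; a fuller test harness would track such misidentifications separately.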
After loading the master library, identification of a new video includes:
(1) Converting an unknown video into fingerprints;
(2) Making a 1:N matching against the master library of fingerprints.
These two steps will take a few minutes.
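The 1:N matching step can be sketched as a sequential comparison of each unknown signature against every library signature. Attrasoft's signature format and matching algorithm are proprietary and not described in this paper, so the tuple-of-numbers signatures, the distance function, the `max_distance` cut-off, and the vote counting below are all placeholder assumptions.

```python
def distance(sig_a, sig_b):
    """Placeholder dissimilarity: sum of absolute attribute differences."""
    return sum(abs(a - b) for a, b in zip(sig_a, sig_b))


def match_one_to_n(unknown_fingerprint, library, max_distance=1.0):
    """Match an unknown fingerprint against a library of fingerprints.

    library maps a title to its list of signatures, one per second of video.
    Returns (title, first_second, last_second) for the best title, or None.
    """
    votes = {}
    for signature in unknown_fingerprint:
        best = None  # (distance, title, second)
        for title, fingerprint in library.items():
            for second, library_signature in enumerate(fingerprint):
                d = distance(signature, library_signature)
                if best is None or d < best[0]:
                    best = (d, title, second)
        if best is not None and best[0] <= max_distance:
            votes.setdefault(best[1], []).append(best[2])
    if not votes:
        return None
    title = max(votes, key=lambda t: len(votes[t]))
    return title, min(votes[title]), max(votes[title])


# Toy library with made-up two-attribute signatures.
toy_library = {
    "movie_a": [(0, 0), (10, 10), (20, 20)],
    "movie_b": [(100, 100), (110, 110)],
}
print(match_one_to_n([(10, 10.5), (20, 19.5)], toy_library))
```

In this toy run both unknown signatures land on movie_a, at seconds 1 and 2, so the result names the title together with the matched time section. The nested loops also show why sequential matching time grows with both the library size and the number of unknown signatures.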
Measuring variables are:
Accuracy…. Percentage of files: