Video Content Recognition Systems
Attrasoft
Image and Video Recognition Experts
Attrasoft
P. O. Box 13051
Savannah, GA. 31406
USA
Phone: (912) 484-1717
January 2008
© Attrasoft 1998 - 2008
5 Performance (Default Setting)
5.1 Library Fingerprint Conversion Time
5.2 Unknown Video Fingerprint Conversion Time
7.2 Accuracy and Speed Trade Off
9.4 Evaluation Criteria and Outputs
Attrasoft provides products and services to monitor both video and image content. The Attrasoft pattern matching algorithms search for matches within the content. This document systematically describes Attrasoft's content-based video recognition systems, which can be used to identify video clips, movies, and television content by analyzing digital video files. Attrasoft has robust content recognition technologies that use video fingerprinting techniques.
Piracy is running rampant on the Internet.
Figure 1.1. Copyrighted material on Youtube.com.
Accurate automated identification of digital content is important in various applications, including content owners' anti-piracy efforts, Internet Service Providers' anti-piracy efforts, and media companies' need to collect critical market data for strategic planning.
There are several approaches to video identification, including human labor, digital watermarking, keyword search, and content-based recognition, which is further divided into audio and video content recognition.
Human labor is
very good at detecting some extraordinary variations including:
Human labor, however, is inadequate to handle the problem on a large scale because of three factors: error, cost, and scalability. Beyond a certain limit, typically a few hundred videos, people have difficulty identifying video clips accurately. The human approach has a long processing time per unit, and the cost grows exponentially with the amount of video content. The human approach to detecting video piracy also does not scale well. These problems (error, cost, and scalability) make the human solution unfavorable in the long run.
A digital watermark is a signal which is embedded
into digital data (audio, video, images and text) that could be detected or
extracted later. It is mostly used to insert copyright information of the data.
Using watermarking and/or
forensic marking requires modification of the original content or files. Video
identification based on digital watermarks can be used only in some simple
cases because a digital
watermark can be altered or removed. An identification system based on
something that can be altered easily by content users is not very reliable.
Keyword search requires attaching keywords to image frames in a video. It is generally accepted that it takes 3 human hours to process one hour of video for about 10 keywords. If what you want is not included in the keywords, this search will fail. Also, this 1 to 3 ratio requires a large amount of labor hours. Once again, this approach does not scale.
This leads to a natural conclusion: an automatic content-based identification system (audio, video, images, and text) will be required to identify video content.
The leading Audio technology providers are limited by their approach: most cannot identify clips shorter than 5 seconds. Audio fingerprinting does not work for modified video. For example, a music track over an NFL video clip will not be detected by audio alone. Therefore, while Audio technology is an important part of the solution, it is simply not complete.
Attrasoft issues this report for potential vendors that require video content recognition systems to identify video content by analysis of digital files. This document systematically describes the ability of Attrasoft's content recognition systems to identify video clips, movies, and television content by analyzing digital video files. Attrasoft has robust content recognition technologies that use video fingerprinting techniques, which are currently in large-scale production and have been proven in real-world environments.
The Attrasoft solution is an automated content identification platform based on internally developed image and pattern recognition algorithms. Attrasoft converts digital video content to fingerprints and performs fast, accurate, and scalable video content recognition via the fingerprints.
Attrasoft video recognition technology is rooted in Attrasoft image recognition technology. Video identification is easier than image identification because a video has many more image frames than a single image. To download the demo software used in this white paper, go to: http://attrasoft.com.
This white
paper intends to provide sufficient information for a full assessment of the
Attrasoft Technology against the anticipated requirements for video content
matching.
This white
paper also provides the Attrasoft Video Content Matching product road map, so
the reader can anticipate the next generation of Attrasoft products. Attrasoft
will adopt a phased approach to systematically improve the fingerprint
technology to meet specific marketplace requirements.
First of all, digital videos are
converted into digital fingerprints; see the following Figure:
Digital Contents -> Fingerprints
Figure 2.1. Digital videos are converted
into digital fingerprints.
A fingerprint consists of a set of digital attributes computed from a video clip. Attrasoft's competitive advantage is its proprietary algorithms, which both:
(1) Create fingerprints, and
(2) Match fingerprints.
Later, when an
unknown video is matched against the library, it will be converted into a
fingerprint at that time and then matched against the fingerprint library.
Yes/No?
Figure 2.2. An unknown video is converted
into a fingerprint and then matched against the fingerprint library.
Attrasoft video recognition technology is rooted in Attrasoft image recognition technology. An image is converted into a set of attributes, called an image signature. A video is decomposed into a set of images, which are in turn converted into image signatures. The collection of these image signatures forms a video fingerprint.
Image Signature
An Image Signature is a collection of digital attributes for an image.
Video Fingerprint
A Video Fingerprint is a collection of digital attributes for a video.
The Attrasoft video fingerprint consists of a set of image signatures computed from individual image frames. The computation depends on the time interval at which an image signature is computed and added to the video fingerprint.
Sampling Interval/Sampling Rate
Sampling Interval is the time between two consecutive image signatures that are computed and added to the video fingerprint. Sampling Rate is the number of image signatures per unit time that are computed and added to the video fingerprint.
The Sampling
Intervals are different for the video library and the unknown video. In
general, an unknown video does not require as many image signatures as the
library video.
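The relationship between Sampling Interval, Sampling Rate, and the number of image signatures in a fingerprint can be sketched as follows. This is only an illustration: the function names are hypothetical and are not part of any Attrasoft product, and the defaults (one signature per second for library video, 40 signatures spread over an unknown video) are the default settings quoted in the performance discussion of this paper.

```python
from typing import List


def sampling_rate(interval_s: float) -> float:
    """Sampling Rate (signatures per second) is the inverse of the Sampling Interval."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return 1.0 / interval_s


def library_sample_times(duration_s: float, interval_s: float = 1.0) -> List[float]:
    """Timestamps (in seconds) at which a library video would be sampled."""
    if duration_s <= 0 or interval_s <= 0:
        raise ValueError("duration_s and interval_s must be positive")
    count = int(duration_s // interval_s)
    return [i * interval_s for i in range(count)]


def unknown_sample_times(duration_s: float, samples: int = 40) -> List[float]:
    """Timestamps spread evenly over the whole unknown clip (assumed default: 40)."""
    if duration_s <= 0 or samples <= 0:
        raise ValueError("duration_s and samples must be positive")
    step = duration_s / samples
    return [i * step for i in range(samples)]
```

Under these assumptions, a one-hour library video yields 3,600 sampling points, while an unknown clip of any length yields 40, one for every 2.5 percent of its length.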
Attrasoft
provides products including software, complete systems, and developer tools.
For those who are interested in the Attrasoft core technology, ImageFinder for Windows is an Off-the-Shelf Application Software that enables System Integrators, Solution Developers, and Individuals to quickly test their own Image Recognition ideas.
Attrasoft VideoFinder is stand-alone software that provides video content recognition. Attrasoft VideoFinder is currently available and can be purchased from Attrasoft.com.
Figure 2.3. Attrasoft VideoFinder is currently available and can be obtained via email from Attrasoft.com.
Attrasoft KeepWatch
is a stand-alone system, which consists of both software and hardware for
content recognition.
Developers
interested in obtaining the video fingerprint component can take a look at the
Attrasoft TransApplet:
Information on the VideoFinder, the off-the-shelf product:
http://attrasoft.com/videofinder70/help/
Information on the VideoFinder demo:
http://attrasoft.com/videofinder70/index.htm
Information on the ImageFinder, the off-the-shelf product:
http://www.imagequery.net/imagefinder70/
Information on the TransApplet, the off-the-shelf product:
http://attrasoft.com/oldsite/transapplet70/
To order, go to:
http://attrasoft.com/oldsite/gs.html
The Attrasoft Technology provides highly accurate identification of video files including low rates of misidentified (false acceptance) and unidentified content (false rejection).
The foundation of video identification is image identification: a video is broken into image frames, and image identification is performed on those frames. Attrasoft Image Identification technology is currently in large-scale production.
a. False-Positives (content not in the identifier database incorrectly identified).
The False-Positives rate of Attrasoft Technology for large libraries of movie and television content is expected to be less than 0.1%, if no variant versions are involved.
b. False-Negatives (content in the identifier database not identified).
The False-Negatives rate is expected to be less than 0.1%, if no variant versions are involved.
c. Attrasoft solution accuracy is not affected by:
(i) The size of the content library: for example, 10,000 video clips are just as good as 1,000 video clips;
(ii) The resolution and quality of the digital video file: for example, high compression is just as good as low compression; and
(iii) The length of video available for analysis: for example, a 5-second clip is just as good as a 30-second clip.
When a video has multiple versions to start with, it may have multiple fingerprints. If a variant version can be identified, then no new fingerprint will be required. If a variant version cannot be identified, then a new fingerprint will be required. Generally speaking, adding the fingerprint of the variant version to the master library will increase the accuracy.
The Attrasoft solution identifies the time section of a video that the recognized video segment matches. By default, the position of the identified video segment is specified to one second.
The Attrasoft solution does offer options for modifying the Technology to decrease the rate of False-Positives, potentially at the expense of higher False-Negatives, and vice versa. In some applications, a False-Negative is more acceptable than a False-Positive; in other applications, a False-Negative is less acceptable than a False-Positive.
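The trade-off between False-Positives and False-Negatives can be illustrated with a generic sketch. It assumes, purely for illustration, that each comparison produces a similarity score between 0 and 1 and that a single acceptance threshold decides whether a match is reported; this paper does not describe the actual Attrasoft parameters, and the scores below are made-up sample values.

```python
from typing import Sequence, Tuple


def error_rates(
    scores_in_library: Sequence[float],
    scores_not_in_library: Sequence[float],
    threshold: float,
) -> Tuple[float, float]:
    """Return (false_positive_rate, false_negative_rate) at a given threshold.

    scores_in_library: best scores for test videos that ARE in the library.
    scores_not_in_library: best scores for test videos that are NOT in it.
    A video is reported as identified when its score >= threshold.
    """
    false_negatives = sum(1 for s in scores_in_library if s < threshold)
    false_positives = sum(1 for s in scores_not_in_library if s >= threshold)
    return (
        false_positives / len(scores_not_in_library),
        false_negatives / len(scores_in_library),
    )


# Hypothetical scores, for illustration only.
IN_LIBRARY = [0.95, 0.90, 0.85, 0.60]
NOT_IN_LIBRARY = [0.10, 0.30, 0.55, 0.70]

for t in (0.5, 0.8):
    fp, fn = error_rates(IN_LIBRARY, NOT_IN_LIBRARY, t)
    print(f"threshold={t}: false positives={fp:.2f}, false negatives={fn:.2f}")
```

With these sample values, raising the threshold from 0.5 to 0.8 removes the false positives but introduces a false negative, which is the kind of trade-off described above.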
Attrasoft Technology does not require
changes to digital video files.
While it is the intention of Attrasoft to support all common video codecs for a wide range of resolutions and bit rates, the current version of the Attrasoft VideoFinder supports:
*.avi,
*.wmv, and
limited *.mpeg (*.mpg).
Attrasoft will adopt a phased approach to gradually support all common video codecs.
The unsupported files include:
*.mp4
*.mp3
*.mov
*.wma
*.flv
*.ac3
some *.mpg, *.mpeg
The Attrasoft Technology can successfully identify video files that have undergone some common video transformations, including those frequently used in Internet piracy. Image compression will not affect identification accuracy.
a. Attrasoft's underlying matching engine is not affected by common video processes such as transcoding. (Transcoding is the direct digital-to-digital conversion from one (usually lossy) codec to another. It involves decoding/decompressing the original data to a raw intermediate format in a way that mimics standard playback of the lossy content, and then re-encoding this into the target format.) This is because the underlying matching engine can handle distortions far worse than transcoding.
b. Attrasoft's underlying matching engine is not affected by common video processes such as scaling. This is because all image frames will be normalized; therefore, all of the scaling effects will be removed.
c. Attrasoft's underlying matching engine is not affected by Telecine, which is the process of transferring motion picture film into electronic form (the term shares its root with 'cinema' and is also pronounced "tele-seen").
d. Attrasoft's underlying matching engine is not affected by removal/insertion of commercials in terms of adding and removing frames. This is because the remaining frames are enough for identification.
e. Attrasoft's underlying matching engine is not affected by insertion of new content into some of the frames. This is because the remaining frames are enough for identification.
f. Attrasoft's underlying matching engine is not affected by higher levels of compression.
The Attrasoft Technology might not work
under the following conditions:
a. Insertion of a large amount of new content into all of the frames, where the new content significantly modifies the original image frames, which will in turn cause the image frame signatures to be sufficiently different from the originals.
b. Pirated
Internet video files such as camcorder capture, which includes the TV set,
while the video only includes the middle portion of TV set.
c. Pirated
Internet video files such as camcorder capture, which includes the TV set,
while the TV screen only occupies the middle portion of the video.
d. Pirated
Internet video files such as camcorder capture in a movie theater, while the
video only includes the middle portion of the movie screen.
e. Pirated
Internet video files such as camcorder capture in a movie theater, while the
movie screen only occupies the middle portion of the video.
In theory, Attrasoft Technology is frame based and there is no limit on the minimal sample size. In practice, the technology does require a minimal length of video content for identification, including identification of video with missing segments.
a. The required length of video content for accurate identification is a matter of optimization. For a short video clip, 1 second of video will be required.
b. The Technology can identify video with missing video segments. It does not matter how much is missing; it only matters how much is in the remaining video. At least one signature will be required to make a match. While the library signatures are computed once every second, the newly captured video clip generates 1 signature every 2.5 percent of the video (40 signatures over its whole length). Therefore, to be sure of containing at least one signature, the minimum segment in a video must occupy about 3% of the total video length. This default setting can be changed to remove this 3% limit.
c. Since
Attrasoft Technology is image frame based, the Technology can process video
segments arranged out of order.
Currently, the technology does not
support keyword hinting for an accelerated search. Customization will be
required to handle hinting.
The Attrasoft Technology resists
deliberate attempts to disguise video content by altering the video, audio, and
metadata; however, some attempts will be considered as variations and should be
added to the master library.
The content identifier is video frame based. The identifier for a particular frame is created as filename + frame number. The frame number has six digits. For example, if a video file is xyz.wmv, the frame identifier at 17 seconds will be xyz_000017.
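The identifier rule above can be sketched in a few lines. The function name is hypothetical; the underscore separator and the removal of the file extension are inferred from the xyz_000017 example, and the sketch assumes the default setting of one signature per second, so that the frame number equals the time in seconds.

```python
import os


def frame_identifier(video_filename: str, frame_number: int) -> str:
    """Build a frame identifier: file name (no extension) + '_' + 6-digit frame number."""
    if not 0 <= frame_number <= 999_999:
        raise ValueError("frame_number must fit in six digits")
    stem, _extension = os.path.splitext(os.path.basename(video_filename))
    return f"{stem}_{frame_number:06d}"


print(frame_identifier("xyz.wmv", 17))
```

A six-digit frame number covers up to 999,999 signatures per file, which at one signature per second is far longer than any single movie or television title.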
The Attrasoft
Technology is applicable internationally. The Attrasoft Technology would
address multiple language versions of video content including subtitles,
because it is video based.
1. The core identification technology is technically and commercially mature, meaning it is currently in production.
Case Study:
TNS Media
Intelligence is the leading provider of strategic advertising intelligence to
advertisers, advertising agencies, and media properties. Established in 23
countries with more than 16,000 customers, TNS Media Intelligence is part of
the TNS Group, ranked #2 worldwide in marketing information. TNS Media
Intelligence monitors 3 million brands worldwide across a multitude of media,
including TV, radio, print, the Internet, and cinema. TNS dominates 90% of
the media monitoring market.
TNS has
currently deployed customized Attrasoft ImageFinder software into their
print media monitoring system with accurate results allowing them to free up
over 100 employees and automate the image recognition & classification
process of their print media ads.
2. Attrasoft
uses third-party software to support all common video codecs.
3. Attrasoft currently uses a text file to store the video fingerprint.
4. Attrasoft
currently does not use any database, such as SQL server, to store the
fingerprint. Customization will be required to store the video fingerprints via
database.
5. Attrasoft TransApplet is a Visual Studio class library. Attrasoft's core technology is available for licensing to potential vendors and system integrators. Vendors interested in licensing should contact Attrasoft directly.
6. The Attrasoft Technology has a set of adjustable parameters that support multiple identification modes, so that system requirements can be optimized at the cost of reduced accuracy; for example, by reducing the number of fingerprints in the initial screening.
7. The Attrasoft Technology can leverage previous identifications to improve future performance as follows: when a video has multiple versions, the accuracy can be improved by adding each newly identified variation into the master library.
Attrasoft
provides five levels of hardware support. The first 2 levels are currently
available for order; the next three levels require customization.
VideoFinder (software) Alone
VideoFinder + 32-bit PC (2 GB)
VideoFinder + 64-bit PC (100 GB)
VideoFinder + 64-bit PC (100 GB) + SQL Server Interface
Multiple Workstations
The first
option is the off-the-shelf software and the other options require
customization and a Service package.
In the
first option, users simply order the software from Attrasoft and use their own
computer to run Attrasoft VideoFinder.
In the
second option, users provide a video database to Attrasoft and Attrasoft
returns a PC, software, and video fingerprint library. The complete system is
ready to run in a matter of hours.
In the
third option, users provide a video database to Attrasoft and Attrasoft returns
a 64-bit PC, 64-bit software, and video fingerprint library. The complete
system is ready to run in a matter of hours.
The fourth
option uses SQL database and the fifth option deploys multiple workstations.
1. Attrasoft VideoFinder Software
2. Simple KeepWatch System (Attrasoft sells both 32-bit software and hardware)
3. Attrasoft KeepWatch (Attrasoft sells both 64-bit software and hardware)
4. Attrasoft SQL KeepWatch (Attrasoft sells both 64-bit software and hardware)
The Attrasoft KeepWatch provides rapid identification of content with customized hardware (64-bit workstation and 100 GB RAM) and Microsoft Windows Server 2003. The customized hardware provides more matching power for video content recognition.
The Attrasoft VideoFinder provides rapid identification of content with minimal hardware (PC) and software (Windows) requirements.
Both initial
fingerprint conversion and matching are
linear, i.e. the fingerprint conversion depends on the video length, and
matching is linearly proportional to the number of matches.
This section deals with the following performance issues:
These are simple questions, but the answers depend on the parameter settings. In particular, they depend on the Sampling Interval or Sampling Rate introduced earlier.
In this section we will use the default setting, which is 1 image signature per second. In the next section, we will set the parameters differently.
The default
sampling rate for library fingerprints is one image per second. This rate can
be changed via parameter setting.
Attrasoft VideoFinder computes 2 - 3 image signatures per second, so it will take about half an hour, or a bit less, to process one hour of video.
If the default
sampling rate is changed to one image every 10 seconds, the fingerprint
conversion speed will be almost 10 times faster.
The default
sampling rate for a newly captured video clip or unknown video is 40 images
over the entire video length.
Attrasoft VideoFinder
computes 2 image signatures per second, so it will take about 20 seconds to
process an unknown video clip.
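The conversion times quoted above follow from simple arithmetic, sketched below. The function names are hypothetical, and the calculation assumes the lower figure of 2 image signatures per second and ignores any fixed overhead such as opening and decoding the video file.

```python
def library_signatures(video_hours: float, interval_s: float = 1.0) -> int:
    """Number of image signatures in a library fingerprint."""
    return int(video_hours * 3600 / interval_s)


def conversion_seconds(num_signatures: int, signatures_per_second: float = 2.0) -> float:
    """Estimated time to compute the given number of image signatures."""
    return num_signatures / signatures_per_second


# One hour of library video at 1 signature per second: 3,600 signatures.
print(conversion_seconds(library_signatures(1)) / 60, "minutes")

# Same video at 1 signature every 10 seconds: 360 signatures.
print(conversion_seconds(library_signatures(1, interval_s=10)) / 60, "minutes")

# Unknown video: 40 signatures regardless of length.
print(conversion_seconds(40), "seconds")
```

Under these assumptions, the estimate is 30 minutes for one hour of library video, 3 minutes at the 10-second interval, and 20 seconds for an unknown clip.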
The atom of matching is
signature-to-signature matching. The current matching-speed is about 200,000
matches per second.
This speed requires the entire fingerprint library to be loaded into RAM.
At a high data compression rate, one hour of *.wmv video takes about 90 MB. One hour of video fingerprints will take 5 MB of storage space.
The 32-bit high-end PC (2 GB) should run at a speed of 200,000 matches/sec and hold 1,000,000 (1 million) signatures. The 1,000,000 signatures will reach the limit of the 2 GB RAM.
The 64-bit high-end PC (100 GB) should run at a speed of 200,000 matches/sec and hold 100,000,000 (100 million) signatures.
The current setting is 1 fingerprint per second in library video and 40 fingerprints per unknown video clip.
32-bit Computer (2GB) and Sequential Search:
This means a high-end PC will match 200,000/40 = 5,000 seconds of video per second, or about 1 - 2 hours of video per second, for less than 1,000,000 signatures (i.e., 300 hours).
For example, to match a 1-hour video against a 300-hour video library, it will take 20 seconds to convert the newly obtained video into a fingerprint and then about 200 seconds to match, so the total is about 4 minutes.
To speed up the matching, the sampling rate has to be reduced so the trade-off is between accuracy and speed. For example, if the sampling interval is 1 image every 10 seconds, the speed will be increased by a factor of 10.
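The sequential matching time can be estimated from the figures quoted in this section. The sketch below is a back-of-the-envelope calculation, not a measurement; the function name is hypothetical, and the defaults (200,000 signature matches per second, 40 signatures per unknown video, 1 library signature per second) are the default settings described above.

```python
def match_seconds(
    library_hours: float,
    unknown_signatures: int = 40,
    library_interval_s: float = 1.0,
    matches_per_second: float = 200_000,
) -> float:
    """Estimated time for a sequential 1:N match against the whole library."""
    library_signatures = library_hours * 3600 / library_interval_s
    comparisons = library_signatures * unknown_signatures
    return comparisons / matches_per_second


FINGERPRINT_SECONDS = 20  # conversion time of the unknown video, from the text

for interval in (1.0, 10.0):
    total = FINGERPRINT_SECONDS + match_seconds(300, library_interval_s=interval)
    print(f"interval {interval:>4} s: about {total:.0f} seconds in total")
```

For a 300-hour library this gives roughly 216 seconds of matching at 1 signature per second, in line with the rounded 200 seconds quoted above, and roughly 22 seconds at 1 signature every 10 seconds.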
64-bit Workstation (100GB) and Sequential Search:
Beyond 300 hours of video, 64-bit computation should be used, which should increase the in-memory hours from 300 hours to 30,000 hours of video (i.e., a factor of 100); at which point, the matching speed is the primary concern.
Fingerprint generation: it takes 1 hour
to process 2 hours of video files.
Matching: it takes 1 second to match 1 to
2 hours of video in the library.
A single PC
will be able to handle cases where the master library is in the order of 1000
hours at the default sampling rate (1 signature per second).
If no video content monitoring has been deployed before, a vendor may choose a phased approach. Initially, it might simply compute a signature every minute, half minute, or 10 seconds. This section provides data for various settings. For the core data, please refer to the previous section.
Sampling Rate: 1 signature per second
Fingerprint generation: it takes 1 hour to process 2 hours of video files.
Matching: it takes 1 second to match 1 hour of video in the library, plus 20 seconds overhead for fingerprinting.
A single PC will be able to handle cases where the master library is in the order of 300 hours at the default sampling rate (1 signature per second).
Sampling Rate: 1 signature every 10 seconds
Fingerprint generation: it takes 1 hour to process 20 hours of video files.
Matching: it takes 1 second to match 10 hours of video in the library, plus 20 seconds overhead for fingerprinting.
A single PC will be able to handle cases where the master library is in the order of 3000 hours.
Sampling Rate: 1 signature every 30 seconds
Fingerprint generation: it takes 1 hour to process 60 hours of video files.
Matching: it takes 1 second to match 30 hours of video in the library, plus 20 seconds overhead for fingerprinting.
A single PC will be able to handle cases where the master library is in the order of 10,000 hours.
Sampling Rate: 1 signature every minute
Fingerprint generation: it takes 1 hour to process 100 hours of video files.
Matching: it takes 1 second to match 60 hours of video in the library, plus 20 seconds overhead for fingerprinting.
A single PC will be able to handle cases where the master library is in the order of 20,000 hours.
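The library sizes listed for each sampling rate can be approximated from the 1,000,000-signature limit quoted earlier for a 2 GB 32-bit PC. The sketch below uses a hypothetical function name and treats that limit as exact, so its results are slightly lower than the rounded order-of-magnitude figures given in this section.

```python
def library_hours_capacity(max_signatures: int, interval_s: float) -> float:
    """Hours of library video that fit in memory at a given Sampling Interval."""
    return max_signatures * interval_s / 3600


MAX_SIGNATURES_32_BIT = 1_000_000  # limit quoted for a 2 GB PC

for interval_s in (1, 10, 30, 60):
    hours = library_hours_capacity(MAX_SIGNATURES_32_BIT, interval_s)
    print(f"1 signature every {interval_s:>2} s: about {hours:,.0f} hours")
```

The computed capacities are about 278, 2,778, 8,333, and 16,667 hours, which this section rounds to 300, 3000, 10,000, and 20,000 hours.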
In terms of scalability requirements, the typical video library sizes are:
Library size 1,000 hours
Library size 10,000 hours
Library size 50,000 hours
Library size 100,000 hours
Library size 1,000,000 hours
Each of these scalability sizes will, in turn, satisfy the specified requirements for:
The scalability problem depends on several factors:
The default rate is 1 signature per second. This rate can be increased. The hardware can be 32-bit (2 GB RAM) or 64-bit (up to 100 GB RAM).
The relationship between a PC and the number
of hours in a video library is:
a. 32-bit
System
b. 64-bit System (Attrasoft KeepWatch)
Once this
limit is reached, either the sampling rate has to be reduced or more 64-bit
workstations will be required.
For example, to handle a 300,000-hour video library at a rate of 1 signature per second, ten 64-bit workstations will be required.
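The workstation count can be estimated as sketched below. The function name is hypothetical, and the calculation assumes the figure given earlier of about 30,000 hours of library video per 64-bit workstation at 1 signature per second, with capacity growing in proportion to the Sampling Interval.

```python
import math

HOURS_PER_WORKSTATION_AT_1S = 30_000  # in-memory capacity quoted for a 64-bit system


def workstations_needed(library_hours: float, interval_s: float = 1.0) -> int:
    """Number of 64-bit workstations needed to hold the library in memory."""
    capacity_hours = HOURS_PER_WORKSTATION_AT_1S * interval_s
    return max(1, math.ceil(library_hours / capacity_hours))


print(workstations_needed(300_000))                  # 1 signature per second
print(workstations_needed(300_000, interval_s=10))   # 1 signature every 10 seconds
```

Under these assumptions, a 300,000-hour library needs ten workstations at the default rate, or a single workstation if the rate is reduced to 1 signature every 10 seconds.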
The content
library size for which the Attrasoft Technology has been tested is 150 hours
(i.e., 500,000 fingerprints). The content library size for which the Attrasoft
Technology has been in production for two years is 150 hours (i.e., 500,000
fingerprints).
There is a trade-off between accuracy and speed. At a lower accuracy, speed can be increased; therefore, a larger library can be accommodated.
For example, to
handle 300,000 hours of video library:
The atomic matching is 1 signature
against 1 signature. This is very fast.
When one
signature matches against a library, there are two approaches, sequential and
binary. Binary matching is fast, but requires heavy overhead operation;
sequential search is slow, but requires no overhead operation. The biggest
advantage of binary search is that it scales very well, i.e. the matching speed
is basically a constant, regardless of how many signatures are in the
fingerprint library.
Sequential
Disadvantage of Sequential Matching
Advantage of Binary Matching
Currently, Attrasoft video content matching is sequential matching. Attrasoft plans to add binary search.
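The difference between the two approaches can be illustrated with a generic sketch. It is not the Attrasoft matching algorithm: real signature matching compares sets of image attributes for similarity, whereas this example assumes, purely for illustration, that each signature can be reduced to a single sortable integer key and matched exactly. The function names are hypothetical.

```python
from bisect import bisect_left
from typing import List, Optional, Tuple


def sequential_find(library: List[int], key: int) -> Optional[int]:
    """Scan every entry: no preparation needed, but time grows with library size."""
    for position, candidate in enumerate(library):
        if candidate == key:
            return position
    return None


def build_index(library: List[int]) -> Tuple[List[int], List[int]]:
    """One-time overhead: sort the keys and remember their original positions."""
    order = sorted(range(len(library)), key=library.__getitem__)
    return [library[i] for i in order], order


def binary_find(sorted_keys: List[int], positions: List[int], key: int) -> Optional[int]:
    """Binary search on the sorted keys: about log2(N) steps per lookup."""
    i = bisect_left(sorted_keys, key)
    if i < len(sorted_keys) and sorted_keys[i] == key:
        return positions[i]
    return None


library = [42, 7, 19, 88, 3]          # stand-in keys, one per library signature
sorted_keys, positions = build_index(library)
print(sequential_find(library, 19), binary_find(sorted_keys, positions, 19))
```

The sort in build_index corresponds to the overhead operation mentioned above. Once it has been paid, each lookup grows only logarithmically with the number of signatures, which is why the matching speed is nearly constant in practice.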
If you plan to
implement a video identification solution, you will need to prepare test cases.
This chapter helps you to prepare the test cases.
Master File Sets ("MFS")
Master versions of some movie and television titles will be the master library.
Unknown File Sets or Test File Sets ("TFS")
To be matched against the master library.
Test Procedure:
The system will be tested for both false
positives as well as false negatives.
After loading the master library, identification of a new video includes:
(1) Converting an unknown video into fingerprints;
(2) Making a 1:N matching against the master library of fingerprints.
These two steps will take a few minutes.
Measuring variables are:
Accuracy. Percentage of files:
Scalability
User Interface
System Updates
Reporting
Database