Solution Design

The minimal setup for an Audience Analytics solution consists of a PC and a camera.

The PC's CPU and GPU should be sized according to:

The input video resolution.
The input video frame rate.
The number and complexity of the inferences to run on the video.

Inferences are determined by the chosen processing pipeline.

For example, a simple pipeline might only detect the presence of persons in the video stream and count them. That can be done using a low-end CPU (e.g. Intel Core i3).

Another pipeline might detect faces and then infer the age, gender and emotion of each detected face. That's 3 inferences using 3 neural networks. Running it on an HD video stream might require a mid-range CPU with an integrated GPU (e.g. Intel Core i5 with Iris Xe GPU).
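As a rough illustration of how these factors combine, the sketch below estimates the per-second workload of a pipeline. The function name, its parameters, the 1280x720 at 15 fps input and the average of 4 faces per frame are assumptions made for the example; none of them come from a Broox API, and real costs depend on the networks and hardware used.

```python
def estimate_workload(width, height, fps,
                      per_frame_inferences,
                      per_item_inferences=0,
                      avg_items_per_frame=0):
    """Return (pixels_per_second, inferences_per_second).

    Hypothetical helper: it only counts work, it does not measure time.
    """
    pixels_per_second = width * height * fps
    inferences_per_frame = (per_frame_inferences
                            + per_item_inferences * avg_items_per_frame)
    return pixels_per_second, inferences_per_frame * fps


# People counting: one detection network per frame.
counting = estimate_workload(1280, 720, 15, per_frame_inferences=1)

# Face analytics: one face detector per frame (assumed), then age, gender
# and emotion for each detected face, assuming 4 faces per frame on average.
face_analytics = estimate_workload(1280, 720, 15,
                                   per_frame_inferences=1,
                                   per_item_inferences=3,
                                   avg_items_per_frame=4)
```

Under these assumptions both pipelines read the same number of pixels, but the face analytics pipeline runs many more inferences per second, which is why it calls for stronger hardware.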

Camera and Camera Placement

Camera selection is critical:

Select a good quality camera with clear optics and good field of view.
You can use a video surveillance camera over the network, but keep in mind that it adds latency to the processing pipeline.

Camera placement is another important aspect.

Depending on the kind of detections and inferences, a frontal point of view (POV) might be needed for optimal results, although such a position is not always available.

It is advisable to obtain sample footage beforehand to check the performance and validity of the chosen pipeline.
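One way to run such a check is to time the pipeline over the sample footage and compare the result with the target frame rate. The sketch below is a minimal illustration: `measure_fps`, `keeps_up` and the dummy per-frame function are hypothetical names, and the frames are simulated rather than decoded from a real video.

```python
import time


def measure_fps(process_frame, frames):
    """Run process_frame over every frame and return frames per second."""
    start = time.perf_counter()
    processed = 0
    for frame in frames:
        process_frame(frame)
        processed += 1
    elapsed = time.perf_counter() - start
    return processed / elapsed if elapsed > 0 else float("inf")


def keeps_up(measured_fps, target_fps):
    """True if the pipeline is at least as fast as the input stream."""
    return measured_fps >= target_fps


# Stand-ins: 200 fake frames and a dummy workload instead of a real pipeline.
sample_frames = range(200)
measured = measure_fps(lambda frame: sum(range(1000)), sample_frames)
ok = keeps_up(measured, target_fps=15)
```

With real footage, `process_frame` would be the full chosen pipeline, so the measurement includes every inference it performs.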

See our Camera Setups Technical Note

Pipeline Picking and Optimizing

Pipelines are composed of Processors.

Each Processor performs a single operation in parallel on the input frame and passes its results down the pipeline to the next Processor.

Processors can use one or more neural networks to perform inferences over the image:

Every inference is a detection operation over a whole video frame (e.g. detecting people) or over a fragment of it (e.g. inferring age or gender from a detected face).
Each neural network used adds processing time and latency.

Processors can also perform counting and image processing operations over the result of previous processors.
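The chaining described above can be sketched as follows. This is a conceptual model only: the `FrameResult` structure, the function names and the fixed detections are hypothetical stand-ins, not the Broox Vision Node API, and no neural network is actually run.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FrameResult:
    """Data handed from one Processor to the next (hypothetical structure)."""
    frame_id: int
    items: List[dict] = field(default_factory=list)   # detections so far
    counts: dict = field(default_factory=dict)        # counting results


def detect_faces(result):
    # Stand-in for a detection network working on the whole frame.
    result.items = [{"box": (10, 10, 80, 80)}, {"box": (200, 40, 60, 60)}]
    return result


def infer_age(result):
    # Stand-in for a per-face network: one inference per detected item.
    for item in result.items:
        item["age"] = 30  # placeholder value, not a real prediction
    return result


def count_items(result):
    # Counting works on the output of earlier processors; no network needed.
    result.counts["faces"] = len(result.items)
    return result


def run_pipeline(processors, result):
    """Pass the result through each processor in order."""
    for processor in processors:
        result = processor(result)
    return result


output = run_pipeline([detect_faces, infer_age, count_items],
                      FrameResult(frame_id=0))
```

Because each stage works on what the previous one produced, anything that reduces the number of items early in the chain also reduces the work done by every later stage.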

Optimization

Each Processor has a series of parameters that can be tuned to the task.

In general:

Pick the lowest input resolution that will do the job.
Pick the lowest framerate that will work.
Some areas of the image might not be needed and only generate noise. Use detection areas to exclude them.
Processors work on the results of the previous ones: reducing the number of items generated by a processor reduces the number of inferences that later processors need to perform.
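To illustrate the idea of a detection area, the sketch below keeps only the detections whose centre falls inside a rectangular region. The box format `(x, y, width, height)`, the function names and the sample values are assumptions for the example; the product may implement detection areas differently, for instance by masking the image before inference.

```python
def box_center(box):
    """Centre point of a box given as (x, y, width, height)."""
    x, y, w, h = box
    return x + w / 2, y + h / 2


def inside_area(box, area):
    """True if the centre of box lies inside area, both (x, y, w, h)."""
    cx, cy = box_center(box)
    ax, ay, aw, ah = area
    return ax <= cx <= ax + aw and ay <= cy <= ay + ah


def apply_detection_area(boxes, area):
    """Discard detections outside the area of interest."""
    return [box for box in boxes if inside_area(box, area)]


# Hypothetical detections on a 1280x720 frame.
detections = [(100, 100, 50, 100), (900, 50, 40, 80), (400, 300, 60, 120)]
entrance_area = (0, 0, 640, 720)  # left half of the frame
kept = apply_detection_area(detections, entrance_area)
```

Here the detection on the right side of the frame is dropped, so later processors only handle the two remaining items.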

For example, people counting does not need 4K video: the detection network might use a 512x512 image as input, so each 4K frame would have to be downscaled first, adding CPU cost.
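The arithmetic behind this example can be sketched as below. It assumes 4K means 3840x2160 and uses the 512x512 network input mentioned above; 1280x720 is added only for comparison, and the helper name is hypothetical.

```python
def downscale_ratio(src_w, src_h, dst_w, dst_h):
    """How many source pixels are read for each pixel the network receives."""
    return (src_w * src_h) / (dst_w * dst_h)


ratio_4k = downscale_ratio(3840, 2160, 512, 512)    # about 31.6
ratio_720p = downscale_ratio(1280, 720, 512, 512)   # about 3.5
```

Under these assumptions a 4K frame carries roughly nine times more pixels than a 720p frame, all of which must be decoded and resized before the network sees the same 512x512 input.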

As another example, a face detector might pick up faces that are too far away and too small to be useful. You can set a minimum face area, which reduces the number of faces detected and therefore the later inferences run on each of them.
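A minimum face area filter can be sketched as follows. The box format `(x, y, width, height)`, the 40x40 pixel threshold and the sample detections are illustrative assumptions, not actual Processor parameters or recommended values.

```python
def filter_by_min_area(boxes, min_area):
    """Keep only boxes (x, y, width, height) whose area is at least min_area."""
    return [box for box in boxes if box[2] * box[3] >= min_area]


# Hypothetical face detections; two of them are far away and very small.
faces = [(50, 60, 120, 120), (400, 80, 24, 24),
         (700, 90, 18, 20), (300, 200, 64, 70)]

large_enough = filter_by_min_area(faces, min_area=40 * 40)

# With three per-face networks (age, gender, emotion), every discarded
# face saves three inferences on this frame.
per_face_networks = 3
saved_inferences = (len(faces) - len(large_enough)) * per_face_networks
```

In this sample, two of the four faces are discarded, which removes six inferences from the frame.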

Recommended Hardware

A PC with Intel 8th generation or later CPU running Windows 10 or later, or Ubuntu 20.04 or later.
- Broox Vision Node runs on the Intel OpenVINO Inference Engine. See the System Requirements for the Intel® Distribution of OpenVINO™ Toolkit.
- A CPU with a CPU Mark score above 8,000 is recommended.
- An integrated Intel Iris Xe GPU is also recommended for acceleration on 10th-generation and later CPUs, to maximize throughput.
- To find out whether your hardware performs well enough, use the Broox Benchmark tool.
A compatible camera/video input device:
- A USB Video Class (UVC) compliant webcam (most modern webcams).
- An RTSP/MJPEG network camera (security camera or the like).
A display or display emulator (for headless setups).
An Internet connection to reach the Broox Studio API.

Required Software

Windows 10 Pro or Ubuntu Linux 20.04
Broox Vision Node (One per video stream)
Broox Controller (One per location, initially)
Broox Media Player (optional)
Access to Broox Studio (provided by Broox Technologies upon purchase) via web browser.