This project is a self-contained smart doorbell system that uses the ESP32-CAM board (Image 1) as the camera and control unit, paired with an ESP32 CYD display board (Image 2) acting as a wireless video receiver and screen. The ESP32-CAM is wired to an external push button, which stands in for a traditional doorbell, and an active buzzer that provides audible feedback when someone presses the bell. The main goal is to build a lightweight doorbell system that monitors doorbell activity through live video shown on a screen, requires no external servers, and uses the ESP32's built-in Wi-Fi for communication.
When the doorbell push button is pressed, the ESP32-CAM detects the state change on the GPIO pin and responds by sounding the buzzer for 1 second, providing immediate feedback to the visitor. Simultaneously, it initiates a video stream from its onboard camera module, sending video frames wirelessly to the ESP32 CYD display board over a WebSocket connection, which provides low-latency, bidirectional communication suitable for real-time video.
The ESP32 CYD board functions as a Wi-Fi Access Point (AP) and hosts a WebSocket server listening on a dedicated port. It continuously waits for incoming connections from the ESP32-CAM. Once the doorbell is triggered and the video stream begins, the camera connects to this server and starts sending live video frames over the WebSocket connection. The CYD board receives these frames and displays them on its built-in display in real time, allowing the resident to see who's at the door.
To prevent unnecessary bandwidth use and power consumption, the video streaming is configured to automatically stop after 10 seconds. This timeout duration is fully configurable within the firmware. After this timeout, the system returns to an idle state, ready to handle the next doorbell press. This approach provides a balance between responsiveness and efficiency, especially useful for battery-powered or resource-constrained devices.
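The 1-second buzzer pulse and the 10-second streaming window can both be handled by the same non-blocking timer pattern. The following is a minimal sketch, not the project's actual firmware: `TimedWindow`, `kBuzzerMs` and `kStreamMs` are hypothetical names, and the current time is passed in as a parameter so the logic can be tried on a PC. On the board, that value would typically come from Arduino's `millis()`.

```cpp
#include <cstdint>

// Hypothetical helper: a window that stays open for a fixed duration
// after start() and then closes itself on the next update().
class TimedWindow {
public:
    explicit TimedWindow(std::uint32_t durationMs) : durationMs_(durationMs) {}

    // Open the window at the given timestamp (milliseconds).
    void start(std::uint32_t nowMs) {
        startMs_ = nowMs;
        active_ = true;
    }

    // Returns true while the window is open. Unsigned subtraction keeps the
    // comparison correct even if the millisecond counter wraps around.
    bool update(std::uint32_t nowMs) {
        if (active_ && static_cast<std::uint32_t>(nowMs - startMs_) >= durationMs_) {
            active_ = false;
        }
        return active_;
    }

    bool active() const { return active_; }

private:
    std::uint32_t durationMs_;
    std::uint32_t startMs_ = 0;
    bool active_ = false;
};

// Assumed defaults taken from the description: 1 s beep, 10 s stream.
constexpr std::uint32_t kBuzzerMs = 1000;
constexpr std::uint32_t kStreamMs = 10000;
```

Because nothing here blocks, the main loop can keep capturing and sending frames while the buzzer window is still open, and changing the timeout means changing a single constant.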
To enhance user experience during idle periods (when video is not streaming), the ESP32 CYD display shows the current date and time on the screen. On boot, the CYD board connects temporarily to a known Wi-Fi network with internet access to fetch the correct time from an NTP (Network Time Protocol) server, which is then stored in its RTC (Real-Time Clock). After syncing, it switches to AP mode and waits for camera connections, ensuring accurate time display even without ongoing internet access.
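Once the time has been synced, the idle screen only needs to turn the stored timestamp into text. Below is a small sketch of that step using standard C++ only. The function name `formatClock`, the UTC time zone and the `YYYY-MM-DD HH:MM:SS` layout are assumptions made for illustration; the real firmware may format the time differently.

```cpp
#include <cstddef>
#include <ctime>
#include <string>

// Hypothetical helper: format a Unix timestamp (seconds since 1970-01-01 UTC)
// as "YYYY-MM-DD HH:MM:SS" for display on the idle screen.
std::string formatClock(std::time_t epochSeconds) {
    const std::tm* utc = std::gmtime(&epochSeconds);
    if (utc == nullptr) {
        return std::string();  // conversion failed, show nothing
    }
    char buf[32];
    const std::size_t written =
        std::strftime(buf, sizeof buf, "%Y-%m-%d %H:%M:%S", utc);
    return std::string(buf, written);
}
```

For example, `formatClock(0)` yields `1970-01-01 00:00:00`. A local time zone offset could be added to the timestamp before formatting if UTC is not what the display should show.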
This design cleanly separates functionality: the ESP32-CAM handles sensing and video capture, while the ESP32 CYD board manages display and idle visuals.
The system can be extended in the future to include features like motion detection, cloud image backup, or even two-way audio. For now, it serves as a compact, responsive smart doorbell platform that demonstrates effective use of ESP32 hardware, GPIO handling, video streaming, and WebSocket communication in a real-world application. The separation of camera and display roles between two ESP32s offers flexibility and modularity in design.
Image 1: ESP32-CAM board
Image 2: ESP32 CYD display board
The ESP32-CAM board serves as the camera node and is connected to two external components:
A push button connected to a digital GPIO pin configured with an internal pull-up resistor. The button is active-low, meaning it triggers when pressed (pulled to ground).
An active buzzer, connected to another GPIO pin, which emits a short beep when the doorbell is pressed.
This minimal wiring setup allows the ESP32-CAM to detect button presses reliably and provide immediate audible feedback. Power is supplied via a 5V battery, and the components are mounted on a small breadboard for stability.
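Mechanical buttons bounce for a few milliseconds, so an active-low input like this is usually debounced in software. The sketch below shows one common approach under stated assumptions: `DebouncedButton` and the 50 ms stability window are hypothetical, and the raw pin level is passed in as a `bool` rather than read from hardware. On the board, the level would come from `digitalRead()` on a pin set to `INPUT_PULLUP`.

```cpp
#include <cstdint>

// Hypothetical debouncer for an active-low button wired with a pull-up:
// the pin reads HIGH (true) when released and LOW (false) when pressed.
class DebouncedButton {
public:
    explicit DebouncedButton(std::uint32_t stableMs = 50) : stableMs_(stableMs) {}

    // rawLevel: current pin level (true = HIGH, false = LOW).
    // Returns true exactly once per press, after the LOW level has been
    // stable for at least stableMs milliseconds.
    bool update(bool rawLevel, std::uint32_t nowMs) {
        if (rawLevel != lastRaw_) {
            // Level changed: restart the stability timer.
            lastRaw_ = rawLevel;
            lastChangeMs_ = nowMs;
        }
        const std::uint32_t elapsed = nowMs - lastChangeMs_;
        if (elapsed >= stableMs_ && rawLevel != stableLevel_) {
            stableLevel_ = rawLevel;
            return !stableLevel_;  // report only the HIGH -> LOW transition
        }
        return false;
    }

private:
    std::uint32_t stableMs_;
    std::uint32_t lastChangeMs_ = 0;
    bool lastRaw_ = true;      // released at start
    bool stableLevel_ = true;  // released at start
};
```

Short glitches reset the timer and are ignored, and holding the button down produces a single event rather than a stream of triggers.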
See Images 3 and 4 for the breadboard wiring and the schematic.
Image 3: ESP32-CAM board on a breadboard, interfaced with the push button and buzzer
Image 4: Schematic for interfacing the push button (doorbell) and buzzer with the ESP32-CAM board
Development was done using the Arduino IDE with ESP32 board support. On the ESP32-CAM side, the logic handles button debouncing, buzzer control, and video capture. When triggered, the board connects to the WebSocket server running on the display board and begins sending video frames encoded in JPEG format.
On the ESP32 CYD display board, the TFT_eSPI and TJpg_Decoder libraries are used for fast, efficient rendering of video frames on the built-in display. The board boots in Wi-Fi STA mode, connects to a local network to fetch the current time via NTP, then switches to Access Point mode to host the WebSocket server. When not streaming video, the screen displays the current date and time using the ESP32Time and HTTPClient libraries.
Networking is handled via the WiFi and ArduinoWebSockets libraries. This setup ensures low-latency, asynchronous communication between both boards. Memory and buffer optimisations were added to maintain frame stability and reduce flickering.
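One simple way a receiver can avoid drawing half-received frames, which is a typical source of flicker, is to check that each binary message looks like a complete JPEG before handing it to the decoder. This is an illustrative sketch rather than a description of the optimisations actually used in the project; the function name `looksLikeCompleteJpeg` is hypothetical. It relies only on the JPEG format itself: a file starts with the bytes FF D8 and ends with FF D9.

```cpp
#include <cstddef>
#include <cstdint>

// Returns true if the buffer starts with the JPEG start-of-image marker
// (FF D8) and ends with the end-of-image marker (FF D9).
bool looksLikeCompleteJpeg(const std::uint8_t* data, std::size_t len) {
    if (data == nullptr || len < 4) {
        return false;  // too short to hold both markers
    }
    const bool hasStart = data[0] == 0xFF && data[1] == 0xD8;
    const bool hasEnd = data[len - 2] == 0xFF && data[len - 1] == 0xD9;
    return hasStart && hasEnd;
}
```

Frames that fail the check can simply be skipped so the previous image stays on screen. Note that some encoders append padding after the end marker, in which case this strict check would reject valid frames and would need to be relaxed.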
Full project files are posted on my GitHub repo.
The state machine diagram (See Image 5) provides a clear visual representation of how the ESP32-CAM operates through various states during its interaction with the push button and the ESP32 CYD display. The system begins in the Idle state, continuously monitoring the GPIO connected to the push button. When a button press is detected, it transitions to the Trigger state, where it activates the buzzer and initiates the video stream. From there, the system moves into the Streaming state, where it connects to the CYD board over WebSocket and sends live camera frames. After a configurable timeout (default 10 seconds), it transitions to the Stop Streaming state, where it gracefully disconnects from the WebSocket server. Once the connection is closed and resources are released, it returns to the Idle state, ready to detect the next button press.
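The camera-side states described above can be sketched as a small transition function. This is a simplified model under stated assumptions, not the code from the repository: `CamState`, `CamStateMachine` and `step` are hypothetical names, the button input is assumed to be an already debounced press event, and the hardware actions (buzzer, WebSocket connect and close) are only marked as comments.

```cpp
#include <cstdint>

enum class CamState { Idle, Trigger, Streaming, StopStreaming };

// Hypothetical model of the camera-side state machine.
class CamStateMachine {
public:
    explicit CamStateMachine(std::uint32_t streamTimeoutMs = 10000)
        : timeoutMs_(streamTimeoutMs) {}

    CamState state() const { return state_; }

    // Advance by one step. buttonPressed is a debounced press event,
    // nowMs is the current time in milliseconds.
    CamState step(bool buttonPressed, std::uint32_t nowMs) {
        switch (state_) {
        case CamState::Idle:
            if (buttonPressed) state_ = CamState::Trigger;
            break;
        case CamState::Trigger:
            // Firmware would start the buzzer and open the WebSocket here.
            streamStartMs_ = nowMs;
            state_ = CamState::Streaming;
            break;
        case CamState::Streaming:
            // Firmware would capture and send one frame per step here.
            if (static_cast<std::uint32_t>(nowMs - streamStartMs_) >= timeoutMs_) {
                state_ = CamState::StopStreaming;
            }
            break;
        case CamState::StopStreaming:
            // Firmware would close the connection and free buffers here.
            state_ = CamState::Idle;
            break;
        }
        return state_;
    }

private:
    std::uint32_t timeoutMs_;
    std::uint32_t streamStartMs_ = 0;
    CamState state_ = CamState::Idle;
};
```

In this model, button presses that arrive while a stream is already running are ignored, and the timeout is a constructor argument so the default of 10 seconds can be changed in one place.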
On the ESP32 CYD display board, the state machine manages both video display and idle-time functionality. When a connection is established and video frames start arriving, it enters the Streaming state, continuously decoding and rendering each frame on the TFT display. After the camera disconnects, the system returns to the Idle Display state, where it shows the current date and time on the screen. This cycle repeats with every new connection from the ESP32-CAM. The state management ensures a responsive and informative user interface.
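The display side has only two states, so its behaviour fits in a single pure function. The sketch below is an assumption about how the transitions could be encoded; `DisplayState`, `DisplayEvent` and `nextDisplayState` are hypothetical names. It assumes the screen switches to video on the first received frame rather than on the connection itself.

```cpp
enum class DisplayState { IdleClock, Streaming };
enum class DisplayEvent { ClientConnected, FrameReceived, ClientDisconnected };

// Hypothetical transition function for the display board.
DisplayState nextDisplayState(DisplayState current, DisplayEvent event) {
    switch (event) {
    case DisplayEvent::FrameReceived:
        return DisplayState::Streaming;   // decode and draw the frame
    case DisplayEvent::ClientDisconnected:
        return DisplayState::IdleClock;   // go back to date and time
    case DisplayEvent::ClientConnected:
        return current;                   // wait for the first frame
    }
    return current;
}
```

Keeping the clock on screen until a frame actually arrives avoids a blank display if the camera connects but fails to send anything.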
This structured state-driven approach ensures responsive behaviour, efficient resource usage, and clean separation of responsibilities across the firmware.
Image 5: System state machine
To give a real sense of how the system operates, I recorded a video demonstration of the complete setup. The video shows the ESP32-CAM reacting to a button press, activating the buzzer, and beginning video streaming to the CYD display. After 10 seconds, the camera disconnects from the WebSocket server, and the display reverts to showing the current date and time.
This demo highlights the smooth interaction between the boards, and how the system behaves in real-world use. It’s a great way to observe how GPIO events, streaming, and display updates all come together in a synchronised and responsive design.
Image 6: Powering up the ESP32-CAM and ESP32 CYD boards with a power bank
Demo video: the doorbell in operation
This project illustrates a robust and scalable embedded solution for a smart doorbell system, with live video feedback, real-time communication, and on-screen information when idle. It leverages the power of the ESP32 ecosystem, open-source libraries, and efficient event-driven programming to create a fully wireless, modular design. The code and instructions are published on my GitHub repo for those interested in replicating, learning from, or extending the project.
Technologies used include TFT display interfacing, camera driver configuration, WebSocket server/client handling, RTC time sync, and low-level GPIO management, highlighting a wide skill set in real-world embedded development.
In the next version of this system, I plan to implement:
Two-way communication, allowing the ESP32 CYD display board to remotely trigger video streaming on the ESP32-CAM.
SD card logging on the display board to optionally store image frames or short clips for later viewing.
Cloud integration, enabling the system to optionally push video or image snapshots to an online dashboard or notification service, allowing users to check door activity remotely.
These enhancements will make the system even more useful and feature-rich, turning it into a complete smart entry monitoring solution.