# Hybrid Deep Learning and Classical Image Processing for Real-Time Thermal Obstacle Detection in Aviation
Abstract
This paper presents a hybrid obstacle detection system combining ResNet50 deep learning with classical image processing techniques for real-time thermal imagery analysis in aviation scenarios. The proposed approach addresses the challenge of detecting vertical structures and obstacles in tactical low-altitude flight environments using uncooled thermal sensors. The system integrates Big-Small edge detection for vertical structure identification, multi-gamma patch enhancement for adaptive contrast normalization, and Farneback dense optical flow tracking for temporal consistency. A sliding window temporal filter operating over eight frames reduces false positives through persistent detection validation using Intersection-over-Union metrics. Evaluation on the BN Dataset, comprising tactical blue-navy scenarios, demonstrates robust performance in challenging thermal conditions. The fusion architecture leverages complementary strengths of learned feature representations and domain-specific classical filters, achieving real-time operation suitable for embedded avionic systems while maintaining detection reliability critical for flight safety applications.
1. Introduction
1.1 Motivation and Problem Statement
Low-altitude tactical flight operations present unique challenges for obstacle detection systems due to rapidly changing scene geometry, limited sensor fields of view, and the presence of vertical structures such as towers, poles, and masts that pose critical collision hazards [1]. Traditional visible-spectrum cameras suffer from reduced performance during low-light conditions, adverse weather, and high-contrast scenarios common in military aviation environments [2]. Thermal infrared sensors offer complementary capabilities by detecting heat signatures regardless of ambient illumination, yet introduce distinct challenges including lower spatial resolution, reduced texture information, and sensitivity to environmental thermal variations [3].
Obstacle detection in thermal imagery requires addressing several technical challenges. First, vertical structures often exhibit subtle thermal gradients compared to their surroundings, making edge-based detection difficult [4]. Second, uncooled thermal sensors produce images with limited dynamic range and temporal noise characteristics that vary with sensor temperature [5]. Third, real-time processing constraints in avionic systems demand computationally efficient algorithms capable of processing multiple frames per second while maintaining low latency [6]. Fourth, the high cost of false positives in flight safety applications necessitates robust temporal validation mechanisms to distinguish persistent obstacles from transient thermal artifacts [7].
1.2 Approach Overview
This work proposes a hybrid detection architecture that synergistically combines deep convolutional neural networks with classical image processing techniques specifically designed for thermal imagery characteristics. The system employs a multi-stage pipeline that leverages domain knowledge through handcrafted feature extractors while simultaneously exploiting learned representations from data-driven models. This design philosophy acknowledges that pure deep learning approaches may struggle with limited thermal training data and domain-specific artifacts, while purely classical methods lack the representational capacity to handle the full variability of obstacle appearances [8].
The detection pipeline operates as follows. First, a classical preprocessing stage employing Big-Small filtering identifies candidate regions exhibiting vertical edge characteristics indicative of pole-like structures. Second, a multi-gamma enhancement module normalizes patch contrast across varying thermal conditions, generating multiple exposure variants to handle both bright and dim obstacles. Third, a ResNet50 classifier trained specifically on thermal obstacle imagery discriminates true obstacles from background clutter within candidate patches. Fourth, a Farneback dense optical flow tracker maintains temporal correspondence of detected obstacles across frames. Finally, a sliding window temporal filter validates persistent detections over an eight-frame history using Intersection-over-Union matching, significantly reducing false alarm rates.
1.3 Contributions
The key contributions of this work include: (1) a hybrid detection architecture combining a ResNet50 patch classifier with classical preprocessing tailored to thermal imagery; (2) a Big-Small edge filtering stage that restricts neural network inference to a small set of candidate patches exhibiting vertical structure; (3) a multi-gamma patch enhancement scheme providing exposure invariance from single thermal frames; (4) a Farneback optical flow tracker coupled with an eight-frame sliding window temporal filter using IOU validation for false alarm suppression; and (5) a multi-camera fusion architecture operating in an angular coordinate frame for extended coverage.
2. Related Work
2.1 Deep Learning for Obstacle Detection
Convolutional neural networks have achieved remarkable success in object detection tasks across diverse domains [12]. Modern architectures including Faster R-CNN, YOLO, and SSD frameworks provide end-to-end trainable systems capable of real-time inference [13]. However, these approaches typically assume abundant training data and RGB imagery with rich texture information. Transfer learning from large-scale visible-spectrum datasets to thermal imagery presents challenges due to domain shift, as thermal images exhibit fundamentally different statistics and feature distributions [14].
ResNet architectures have proven effective for thermal image classification tasks due to their residual learning framework enabling training of deeper networks [15]. The use of pretrained ResNet models followed by fine-tuning on thermal datasets has shown promise in previous work, though performance gains depend critically on the similarity between source and target domains [16]. Our approach employs ResNet50 as a patch classifier rather than a full detection network, allowing focused training on obstacle appearance discrimination while delegating localization to classical preprocessing stages.
2.2 Classical Methods for Vertical Structure Detection
Edge detection algorithms form the foundation of many obstacle detection systems, particularly for identifying vertical structures in imagery [17]. The Canny edge detector and its variants remain popular due to computational efficiency and well-characterized behavior [18]. However, standard edge detection struggles in thermal imagery where intensity gradients are often weak and contaminated by sensor noise.
The Big-Small filtering approach, inspired by biological visual processing, specifically targets step edges characteristic of vertical structures [19]. This technique employs asymmetric filters that respond strongly to intensity transitions between large homogeneous regions, making it suitable for detecting poles and towers against sky or terrain backgrounds. Previous work demonstrated Big-Small filtering effectiveness in infrared imagery for horizon detection and obstacle segmentation [20].
2.3 Multi-Gamma Enhancement and Adaptive Normalization
Histogram equalization and adaptive contrast enhancement techniques address the limited dynamic range of thermal sensors [21]. Classical approaches including Contrast Limited Adaptive Histogram Equalization (CLAHE) operate locally to enhance visibility across varying illumination conditions [22]. However, global histogram manipulation can introduce artifacts and amplify noise in low-contrast regions.
Gamma correction provides a parametric approach to exposure adjustment, with different gamma values emphasizing different intensity ranges [23]. Multi-exposure fusion techniques combine images captured at varying exposures to extend dynamic range, though traditional implementations require multiple captures or high dynamic range sensors [24]. Our multi-gamma approach synthesizes virtual exposures from single thermal frames, enabling dynamic range extension without specialized hardware.
2.4 Optical Flow and Temporal Processing
Optical flow estimation provides dense motion fields enabling tracking and temporal correspondence between frames [25]. The Farneback algorithm uses polynomial expansion of image neighborhoods to compute dense optical flow efficiently, making it well-suited for embedded systems [26]. Alternative approaches including the Lucas-Kanade sparse tracker and recent deep learning methods like FlowNet offer different trade-offs between accuracy and computational cost [27].
Temporal filtering in detection systems serves two primary purposes: reducing false positives through persistence requirements and maintaining track continuity [28]. Sliding window approaches balance responsiveness to new obstacles with robustness against transient false alarms [29]. Intersection-over-Union metrics provide intuitive measures of detection consistency, widely adopted in tracking and detection evaluation [30].
2.5 Thermal Obstacle Detection in Aviation
Aviation obstacle detection systems face stringent requirements for reliability, latency, and false alarm rates [1]. Previous work has explored radar, lidar, and electro-optical sensors for obstacle avoidance in helicopters and unmanned aerial vehicles [31]. Thermal sensors offer advantages in degraded visual environments but have received less attention in the literature compared to visible-spectrum approaches.
This work builds upon these foundations by demonstrating that carefully designed hybrid systems can leverage complementary strengths of classical and learned approaches, particularly in resource-constrained aviation applications where both performance and efficiency are critical.
3. Method
3.1 System Architecture Overview
The proposed obstacle detection system implements a hierarchical pipeline processing thermal image sequences at operational frame rates. The architecture consists of six primary modules: preprocessing and denoising, classical Big-Small filtering for candidate generation, multi-gamma patch enhancement, ResNet50-based patch classification, Farneback dense optical flow tracking, and temporal filtering with IOU validation. Data flows through the pipeline in a feedforward manner with feedback connections from the temporal filter to the tracker for maintaining obstacle histories.
Input thermal imagery is captured at 14-bit precision with native resolution of 768 rows by 1024 columns from uncooled longwave infrared sensors operating in the 8 to 12 micrometer spectral band. Each frame undergoes preprocessing to remove dead pixels and apply bilateral filtering for edge-preserving noise reduction. The system maintains a sliding window buffer of the most recent eight frames to enable temporal analysis while limiting memory requirements suitable for embedded deployment.
3.2 Preprocessing and Image Enhancement
3.2.1 Dead Pixel Correction and Bilateral Filtering
Uncooled thermal sensors exhibit spatial nonuniformities and dead pixels requiring correction before further processing. Dead pixel correction operates by identifying outlier pixels whose values deviate significantly from local neighborhoods, then replacing corrupted values with median-filtered estimates. The algorithm employs a spatial window of 3 by 3 pixels with an intensity deviation threshold of 50 digital counts, tuned to balance dead pixel removal against preservation of legitimate high-frequency image content.
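The dead pixel correction described above can be sketched in NumPy. The 3 by 3 window and 50-count deviation threshold follow the text; the shifted-copy median computation is an illustrative implementation choice rather than the paper's actual code:

```python
import numpy as np

def correct_dead_pixels(frame, window=3, threshold=50):
    """Replace pixels deviating from the local median by more than
    `threshold` digital counts with the median-filtered estimate."""
    pad = window // 2
    padded = np.pad(frame.astype(np.float64), pad, mode="edge")
    # Stack all window offsets and take the per-pixel median.
    shifts = [padded[dy:dy + frame.shape[0], dx:dx + frame.shape[1]]
              for dy in range(window) for dx in range(window)]
    median = np.median(np.stack(shifts), axis=0)
    dead = np.abs(frame - median) > threshold
    corrected = frame.astype(np.float64).copy()
    corrected[dead] = median[dead]
    return corrected, dead
```

A hot pixel at 1000 counts in a 100-count background is flagged and replaced by the 100-count local median, while legitimate smooth content is untouched.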
Following dead pixel correction, bilateral filtering provides edge-preserving smoothing that reduces thermal noise while maintaining obstacle boundaries critical for subsequent detection stages. The bilateral filter combines spatial proximity weighting with intensity similarity weighting, defined mathematically as:

$$BF[I](p) = \frac{1}{W_p} \sum_{q \in \Omega} \exp\!\left(-\frac{\|p - q\|^2}{2\sigma_s^2}\right) \exp\!\left(-\frac{(I(p) - I(q))^2}{2\sigma_r^2}\right) I(q)$$

where $\Omega$ represents the spatial support region, $\sigma_s$ controls spatial smoothing extent, $\sigma_r$ controls intensity similarity sensitivity, and $W_p$ is a normalization factor ensuring output intensity preservation. The parameters $\sigma_s$ and $\sigma_r$ are tuned jointly with a 5 by 5 pixel support region, balancing noise reduction against computational cost and edge preservation requirements.
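A minimal NumPy sketch of the bilateral filter follows; the 5 by 5 support matches the text, while the `sigma_s` and `sigma_r` defaults are placeholder values, since the paper's tuned settings are not reproduced here:

```python
import numpy as np

def bilateral_filter(img, size=5, sigma_s=1.5, sigma_r=25.0):
    """Edge-preserving smoothing: weights combine spatial proximity
    and intensity similarity. Sigma defaults are illustrative."""
    pad = size // 2
    img = img.astype(np.float64)
    padded = np.pad(img, pad, mode="edge")
    # Precompute the spatial Gaussian over the support window.
    yy, xx = np.mgrid[-pad:pad + 1, -pad:pad + 1]
    spatial = np.exp(-(yy**2 + xx**2) / (2.0 * sigma_s**2))
    H, W = img.shape
    out = np.zeros_like(img)
    norm = np.zeros_like(img)
    for dy in range(size):
        for dx in range(size):
            shifted = padded[dy:dy + H, dx:dx + W]
            # Intensity (range) weight times spatial weight.
            w = spatial[dy, dx] * np.exp(
                -(shifted - img)**2 / (2.0 * sigma_r**2))
            out += w * shifted
            norm += w
    return out / norm
```

On a constant image the weights cancel in the normalization and the output equals the input, which is a quick sanity check of the $W_p$ term.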
3.3 Classical Preprocessing: Big-Small Edge Detection
3.3.1 Vertical Structure Detection Principle
The Big-Small filtering approach exploits the characteristic appearance of vertical structures in thermal imagery, where poles and towers create intensity steps between foreground obstacle and background terrain or sky. Unlike standard edge detectors that respond to all intensity gradients regardless of scale or orientation, Big-Small filtering specifically targets vertical step edges spanning multiple pixel rows, providing selectivity for obstacle-like structures while rejecting texture and noise.
The algorithm operates by constructing an asymmetric filter kernel sensitive to vertical intensity transitions. For each pixel location, the filter examines a vertical neighborhood of height 7 pixels, computing the intensity difference between regions above and below the candidate edge position. A step edge is declared when this difference exceeds a threshold of 1 digital count and the transition persists across at least the central portion of the filter height with allowance for 1 outlier pixel within the transition region.
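One possible reading of this step test, sketched in NumPy: for each pixel the means of the rows above and below are compared against the threshold, and the persistence condition is enforced by requiring the per-row sign of the transition to hold with at most one outlier. This is a simplified interpretation, not the paper's exact filter:

```python
import numpy as np

def vertical_step_edges(img, height=7, threshold=1.0, max_outliers=1):
    """Big-Small-style step detector (simplified sketch). Compares mean
    intensity above vs. below each candidate row and requires the
    transition sign to persist with at most `max_outliers` exceptions."""
    half = height // 2
    img = img.astype(np.float64)
    H, W = img.shape
    edges = np.zeros((H, W), dtype=bool)
    for r in range(half, H - half):
        above = img[r - half:r, :]          # rows above the edge row
        below = img[r + 1:r + 1 + half, :]  # rows below the edge row
        diff = above.mean(axis=0) - below.mean(axis=0)
        # Per-row transition signs, paired symmetrically about the edge.
        signs = np.sign(above - below[::-1, :])
        consistent = (signs == np.sign(diff)).sum(axis=0) \
            >= half - max_outliers
        edges[r, :] = (np.abs(diff) > threshold) & consistent
    return edges
```

On a synthetic step image, rows near the true transition fire while rows where the difference is large but inconsistent (partial overlap with the step) are rejected by the persistence check.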
3.3.2 Score Map Generation
Detection decisions based solely on single-pixel edge responses produce fragmented and noisy results. To achieve spatially coherent obstacle localization, the algorithm generates a score map by convolving the binary edge response with a rectangular accumulation kernel of size 20 by 6 pixels in height and width respectively. This convolution operation effectively counts the number of vertical edge pixels within each local neighborhood, producing higher scores in regions where vertical edges cluster spatially.
The score map at each location quantifies the local density of vertical edge evidence. Candidate obstacle patches for neural network classification are selected by thresholding the score map at a minimum value of 10, then identifying local maxima through non-maximum suppression. This classical preprocessing stage reduces the search space for the ResNet classifier from the full image to typically 50 candidate patches per frame, achieving computational efficiency through selective processing of high-probability regions.
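The score map accumulation and candidate selection can be sketched as follows; the 20 by 6 kernel, minimum score of 10, and 50-candidate cap follow the text, while the integral-image box sum and the simple top-k selection (standing in for full non-maximum suppression) are implementation choices:

```python
import numpy as np

def score_map(edges, kh=20, kw=6):
    """Count edge pixels in each kh x kw neighborhood via a summed-area
    table (equivalent to convolution with a rectangular kernel)."""
    e = edges.astype(np.int64)
    ii = np.zeros((e.shape[0] + 1, e.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = e.cumsum(0).cumsum(1)
    H = e.shape[0] - kh + 1
    W = e.shape[1] - kw + 1
    # Box sum for every valid top-left kernel placement.
    return (ii[kh:kh + H, kw:kw + W] - ii[:H, kw:kw + W]
            - ii[kh:kh + H, :W] + ii[:H, :W])

def select_candidates(scores, min_score=10, max_candidates=50):
    """Threshold the score map and keep the strongest locations."""
    ys, xs = np.where(scores >= min_score)
    order = np.argsort(scores[ys, xs])[::-1][:max_candidates]
    return list(zip(ys[order].tolist(), xs[order].tolist()))
```

A 20 by 2 pixel vertical strip of edges yields a maximum score of 40 wherever the kernel fully covers it, so such a strip comfortably clears the threshold of 10.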
3.4 Multi-Gamma Patch Enhancement
3.4.1 Dynamic Range Challenges in Thermal Imagery
Uncooled thermal sensors provide limited dynamic range compared to visible cameras, with 14-bit raw imagery often exhibiting narrow intensity distributions due to automatic gain control and scene temperature variations. Obstacles may appear darker or brighter than backgrounds depending on their thermal properties and solar loading conditions. Training a single classifier to handle both dark-on-light and light-on-dark obstacle appearances proves challenging, particularly with limited training data.
Multi-gamma enhancement addresses this challenge by generating multiple exposure variants of each candidate patch, effectively bracketing the intensity distribution to ensure obstacles occupy mid-range intensities in at least one variant. This approach provides exposure invariance without requiring specialized high dynamic range sensors or multiple captures.
3.4.2 Gamma Transformation and Percentile Normalization
For each candidate patch extracted by the classical preprocessing stage, the enhancement module first applies percentile-based contrast stretching to normalize the intensity distribution. Lower and upper percentiles are set to 0.05 and 99.95 respectively, mapping these intensity values to the full output range while clipping extreme outliers. This normalization removes sensitivity to absolute temperature calibration while preserving relative intensity structure within each patch.
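The percentile stretch can be written compactly in NumPy; the 0.05 and 99.95 percentiles follow the text, and mapping to the unit interval (rather than a specific bit depth) is a sketch-level choice:

```python
import numpy as np

def percentile_normalize(patch, lo=0.05, hi=99.95):
    """Stretch intensities so the lo/hi percentiles map to [0, 1],
    clipping extreme outliers beyond them."""
    p_lo, p_hi = np.percentile(patch, [lo, hi])
    if p_hi <= p_lo:  # flat patch: avoid divide-by-zero
        return np.zeros_like(patch, dtype=np.float64)
    out = (patch.astype(np.float64) - p_lo) / (p_hi - p_lo)
    return np.clip(out, 0.0, 1.0)
```

Because the stretch is relative to the patch's own percentiles, the result is invariant to absolute temperature calibration offsets, as the text requires.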
Following percentile normalization, three gamma transformations are applied with values $\gamma_1 < 1$, $\gamma_2 = 1$, and $\gamma_3 > 1$. The gamma correction transforms intensities according to:

$$I_{\text{out}} = I_{\text{in}}^{\gamma}$$

where $I_{\text{in}}$ represents the percentile-normalized patch intensity. The value $\gamma = 1$ provides a linear mapping preserving original contrast, $\gamma < 1$ brightens the patch emphasizing dark obstacles, and $\gamma > 1$ darkens the patch emphasizing bright obstacles. All three variants are converted to 8-bit precision and resized to 64 by 64 pixels for input to the ResNet classifier.
The multi-gamma strategy ensures that regardless of whether an obstacle is thermally hotter or cooler than its background, at least one gamma variant will present the obstacle with favorable contrast for classification. The ResNet network learns to recognize obstacle patterns across this augmented input space, effectively achieving exposure invariance.
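The variant generation can be sketched as follows. The specific gamma values 0.5, 1.0, and 2.0 are assumptions for illustration (the paper states only that one value brightens, one is linear, and one darkens), and the final resize to 64 by 64 pixels is omitted since it would depend on an image library:

```python
import numpy as np

# Assumed gamma triple: one < 1 (brighten), 1 (identity), one > 1 (darken).
GAMMAS = (0.5, 1.0, 2.0)

def multi_gamma_variants(patch01, gammas=GAMMAS):
    """Generate exposure-bracketed 8-bit variants of a patch whose
    intensities are already normalized to [0, 1]."""
    return [np.clip(255.0 * patch01 ** g, 0, 255).astype(np.uint8)
            for g in gammas]
```

For a mid-dark input of 0.25, the three variants land at roughly half, quarter, and sixteenth of full scale, so at least one presents the structure at favorable mid-range contrast.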
3.5 ResNet50 Patch Classification
3.5.1 Architecture and Training
The classification module employs a ResNet50 architecture, a deep convolutional network consisting of 50 layers organized into residual blocks enabling gradient flow through shortcut connections. The network architecture includes an initial convolutional layer followed by four stages of residual blocks with increasing channel depth, culminating in global average pooling and a fully connected classification layer.
Training is performed on the BN Dataset augmented with negative samples extracted from background regions. Patches are extracted at 64 by 64 pixel resolution matching the network input size. Minimum obstacle dimensions are constrained to 10 pixels width and 20 pixels height to focus training on resolvable obstacles. Data augmentation during training includes additive thermal noise at 50 percent of the original image noise level, synthesized using random patterns to improve robustness to sensor characteristics.
The network is trained using the Adam optimizer with learning rate set to 0.0001 over multiple epochs with a batch size of 32 samples. Training employs cross-entropy loss for binary classification with obstacle and background classes. Model checkpoints are saved periodically to enable selection of optimal generalization performance on held-out validation data.
3.5.2 Inference and Patch Scoring
During inference, each of the three gamma-corrected variants of a candidate patch is processed independently through the ResNet50 network, producing three classification scores representing obstacle probabilities. The final patch score is computed as the maximum across the three variants:

$$s = \max_{g \in \{1, 2, 3\}} \sigma(z_g)$$

where $\sigma(\cdot)$ represents the sigmoid activation producing probabilities and $z_g$ represents the logit output for the gamma variant $g$. This maximum pooling across gamma values ensures that the most favorable exposure variant determines the final classification decision.
Patches exceeding an obstacle probability threshold of 0.95 are classified as positive detections and passed to post-processing. Lower thresholds of 0.80 and 0.60 are also computed for visualization and analysis purposes but do not directly influence detection decisions. The high threshold of 0.95 is selected to minimize false positives, with temporal filtering subsequently providing additional false alarm suppression.
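The score fusion and thresholding step reduces to a few lines once the three per-variant logits are available; the 0.95 decision threshold follows the text:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def patch_score(logits):
    """Max over sigmoid probabilities of the three gamma-variant
    logits: the most favorable exposure wins."""
    return max(sigmoid(z) for z in logits)

def classify(logits, threshold=0.95):
    """Declare a positive detection only above the high threshold."""
    return patch_score(logits) >= threshold
```

A single confident variant (logit 4, probability about 0.98) is enough to clear the 0.95 threshold even when the other two variants are ambiguous or negative.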
3.6 Post-Processing and Obstacle Localization
3.6.1 Spatial Non-Maximum Suppression
ResNet classification produces obstacle scores for candidate patches that may overlap spatially or represent multiple detections of the same physical obstacle. Spatial non-maximum suppression consolidates redundant detections by enforcing minimum distance criteria between distinct obstacles. The algorithm iteratively selects the highest-confidence detection, then suppresses all nearby detections within specified horizontal and vertical distances.
Distance thresholds are defined separately for detections representing the same obstacle versus distinct obstacles, accounting for patch size and expected obstacle spacing. Horizontal distance thresholds of 1 pixel for same-obstacle suppression and 3 patch widths for distinct-obstacle separation prevent both redundant duplicates and inappropriate merging of closely spaced obstacles. Vertical thresholds are set proportionally accounting for the primarily vertical extent of pole-like obstacles.
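A greedy sketch of the same-obstacle suppression step: detections are visited in descending confidence and dropped when within the same-obstacle distances of an already-kept detection. The 1-pixel horizontal threshold follows the text; the vertical threshold of 40 pixels is a hypothetical stand-in for the proportional value the paper describes:

```python
def suppress_duplicates(detections, dx_same=1, dy_same=40):
    """Greedy spatial NMS sketch. Detections are (score, x, y) tuples;
    dy_same is an assumed vertical threshold for pole-like obstacles."""
    kept = []
    for det in sorted(detections, key=lambda d: d[0], reverse=True):
        _, x, y = det
        # Keep only if outside the same-obstacle zone of every survivor.
        if all(abs(x - kx) > dx_same or abs(y - ky) > dy_same
               for _, kx, ky in kept):
            kept.append(det)
    return kept
```

Two detections sharing a column within the vertical tolerance collapse to the stronger one, while a detection 100 pixels away survives as a distinct obstacle.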
3.6.2 Vertical Extent Estimation
Following non-maximum suppression, the algorithm refines obstacle localization by estimating vertical extent, specifically the top and bottom pixel coordinates of each detected obstacle. This refinement operates by analyzing intensity profiles within a narrow vertical strip centered on the detected obstacle horizontal position. The strip width is set to twice the estimated obstacle half-width parameter of 5 pixels.
The bottom extent is determined by searching downward from the patch center until encountering the ground plane or image boundary, using gradient magnitude to identify the obstacle-ground transition. Similarly, the top extent is determined by searching upward until intensity transitions indicate the obstacle termination point. These vertical extent estimates provide tighter bounding boxes for display and tracking purposes compared to the fixed patch sizes used during classification.
3.7 Farneback Dense Optical Flow Tracking
3.7.1 Tracking Framework
Temporal correspondence of detected obstacles across frames enables both persistent tracking and temporal filtering for false alarm reduction. The Farneback optical flow algorithm provides efficient dense motion estimation through polynomial expansion of image neighborhoods. For each detected obstacle bounding box in frame $t$, the tracker predicts its location in frame $t+1$ by computing the displacement field within the obstacle region.
The Farneback method approximates the image neighborhood around each pixel by a polynomial. By computing the polynomial coefficients for consecutive frames and analyzing their relationship, the algorithm estimates displacement vectors efficiently. The implementation employs a multi-scale pyramid approach processing multiple resolution levels to handle larger displacements while maintaining computational efficiency and robustness to noise.
3.7.2 Tracking Parameters and Adaptation
The tracker operates on downsampled imagery at 0.5 scale with 3-level pyramid decomposition to balance computational cost against tracking accuracy. A spatial window size of 3 by 3 pixels at each pyramid level provides sufficient support for displacement estimation while remaining computationally efficient. The Farneback solver performs 2 iterations per pyramid level with polynomial expansion degree set to 3 for subpixel accuracy.
To handle appearance changes and scale variations, the tracker maintains obstacle bounding boxes with uncertainty margins expanded by 12 pixels around the core detection region. This expansion accommodates potential misalignment in optical flow estimates and allows the tracker to recover from temporary tracking errors. Minimum crop sizes of 24 by 12 pixels in height and width respectively ensure sufficient image content for reliable flow estimation even for small obstacles.
The tracker incorporates refinement mechanisms for bounding box adjustment based on intensity gradients within the tracked region. Edge refinement repositions box boundaries to align with local intensity gradients, improving localization accuracy compared to pure flow-based prediction. Position refinement based on intensity mass center is optionally applied with configurable weighting coefficients balancing geometric prediction against intensity-based localization.
3.8 Temporal Filtering with IOU Validation
3.8.1 Sliding Window Framework
Temporal filtering provides the final stage of false positive reduction by requiring persistent detection of obstacles across multiple frames before declaring confirmed detections. The algorithm maintains a sliding window buffer spanning the most recent 8 frames, sufficient to distinguish persistent obstacles from transient thermal artifacts while remaining responsive to newly appearing obstacles.
Each obstacle detection in the current frame is associated with tracked obstacles from previous frames using Intersection-over-Union matching. The IOU metric quantifies spatial overlap between detection bounding boxes:

$$\text{IOU}(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

where $A$ and $B$ denote the areas enclosed by the two bounding boxes.
Detections are matched to existing tracked obstacles when IOU exceeds threshold values accounting for expected displacement between frames. Distance thresholds of 30 pixels horizontally and 60 pixels vertically define the maximum allowable displacement for associating detections across frames.
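The IOU computation and the displacement gate can be sketched directly; boxes are `(x1, y1, x2, y2)` corner tuples, and the 30/60-pixel gates follow the text:

```python
def iou(a, b):
    """Intersection-over-Union of boxes given as (x1, y1, x2, y2)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def can_associate(det, track, max_dx=30, max_dy=60):
    """Gate association by the maximum allowable center displacement
    before any IOU comparison."""
    dcx = abs((det[0] + det[2]) / 2 - (track[0] + track[2]) / 2)
    dcy = abs((det[1] + det[3]) / 2 - (track[1] + track[3]) / 2)
    return dcx <= max_dx and dcy <= max_dy
```

Identical boxes score 1.0, half-overlapping boxes score 1/3, and disjoint boxes score 0, which matches the intuition of IOU as a consistency measure.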
3.8.2 Persistence Criteria and Track Management
Tracked obstacles accumulate detection counts indicating the number of frames within the sliding window where the obstacle was successfully detected. A persistence criterion requires at least 4 positive detections within the 8-frame window for an obstacle to be reported as a confirmed detection. This requirement, combined with the high ResNet classification threshold, achieves false positive rates suitable for operational deployment.
The temporal filter maintains a database of up to 1000 simultaneous tracked obstacles, sufficient for scenarios involving cluttered environments with numerous candidate detections. Tracks are deleted when they fail to receive new detections for a period exceeding configurable thresholds, freeing computational resources while allowing temporary occlusions. Tracks persisting for at least 3 frames before deletion avoid premature removal of valid obstacles experiencing brief detection gaps.
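The persistence logic above reduces to a fixed-length hit buffer per track; the 8-frame window and 4-detection minimum follow the text:

```python
from collections import deque

class TrackPersistence:
    """Sliding-window confirmation: a track is reported only once it
    accumulates `min_hits` detections within the last `window` frames."""

    def __init__(self, window=8, min_hits=4):
        self.hits = deque(maxlen=window)  # oldest entries roll off
        self.min_hits = min_hits

    def update(self, detected):
        self.hits.append(bool(detected))
        return self.confirmed()

    def confirmed(self):
        return sum(self.hits) >= self.min_hits
```

A new track confirms on its fourth consecutive detection, and a confirmed track decays back to unconfirmed once misses push its hit count below four within the window.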
The combination of optical flow tracking for spatial correspondence and IOU-based validation for detection consistency provides robust temporal filtering without requiring complex multi-object tracking algorithms or motion models. This design choice balances tracking accuracy with computational efficiency critical for real-time embedded implementation.
3.9 Multi-Camera Fusion Architecture
3.9.1 Angular Coordinate Transformation
The complete system supports multi-camera configurations where multiple thermal sensors provide overlapping fields of view, extending the effective detection range and coverage area. Each camera maintains its own detection pipeline processing imagery independently through all stages described above. Fusion occurs in a global coordinate frame defined relative to the aircraft reference system using azimuth and elevation angles.
Camera calibration provides polynomial transformation functions mapping pixel coordinates to angular coordinates in azimuth and elevation. For each detected obstacle with pixel location $(u, v)$ in camera $c$, the transformation computes:

$$(\theta_{\text{az}}, \theta_{\text{el}}) = \left(f_c^{\text{az}}(u, v),\ f_c^{\text{el}}(u, v)\right)$$

where $f_c^{\text{az}}$ and $f_c^{\text{el}}$ represent camera-specific polynomial calibration functions accounting for lens distortion and sensor mounting geometry. These polynomial functions are computed offline through calibration procedures and stored as lookup tables or coefficient arrays for efficient runtime evaluation.
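Runtime evaluation of such calibration polynomials can be sketched with NumPy's bivariate polynomial evaluator; any coefficient grids shown with it are hypothetical, since the real values come from the offline calibration procedure:

```python
import numpy as np
from numpy.polynomial import polynomial as P

def pixel_to_angles(u, v, coef_az, coef_el):
    """Map pixel (u, v) to (azimuth, elevation) by evaluating two
    bivariate calibration polynomials; coef_* are (i, j) coefficient
    grids for terms u**i * v**j."""
    return (P.polyval2d(u, v, coef_az), P.polyval2d(u, v, coef_el))
```

For example, a purely linear hypothetical calibration mapping a 1024 by 768 sensor to about 40 degrees of azimuth span places the image center near 0 degrees azimuth, as expected for a boresighted camera.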
3.9.2 Global Obstacle Database and Association
The fusion module maintains a global obstacle database containing all obstacles detected by any camera, represented in the common angular coordinate frame. Each global obstacle entry includes position, velocity estimates, detection confidence, and temporal history across all contributing cameras.
When a camera detects an obstacle, the fusion algorithm computes its angular coordinates and searches the global database for matching existing obstacles. Matching criteria use angular distance thresholds of 10 degrees in both azimuth and elevation, accounting for localization uncertainty and potential calibration errors. If a matching global obstacle exists, the new detection updates its properties including position estimates and increments its detection count. If no match is found, a new global obstacle is instantiated and added to the database.
Field-of-view constraints limit global obstacle tracking to angular ranges of negative 40 to positive 40 degrees in azimuth and negative 20 to positive 40 degrees in elevation relative to the aircraft reference frame. Obstacles outside these bounds are removed from active tracking to focus computational resources on forward-looking obstacles most relevant for collision avoidance. Image boundary margins of 12 pixels in all directions prevent tracking of obstacles near frame edges where detection reliability degrades.
The fusion architecture enables robust detection by combining evidence from multiple viewpoints while maintaining computational efficiency through independent per-camera processing and lightweight angular domain fusion.
4. Experiments and Results
4.1 Dataset and Evaluation Methodology
The system is evaluated on the BN Dataset, a collection of thermal infrared video sequences captured during tactical aviation scenarios designed to replicate blue-navy operational flight profiles. The dataset contains imagery from uncooled longwave thermal sensors mounted on aerial platforms conducting low-altitude flights in environments containing vertical obstacle structures including poles, towers, masts, and similar hazards.
Sequences in the BN Dataset exhibit diverse environmental conditions including varying terrain types, different times of day affecting solar loading and thermal contrast, and aircraft maneuvers producing dynamic scene motion. Ground truth annotations provide obstacle locations in the form of bounding boxes specified in pixel coordinates, enabling quantitative evaluation of detection performance. Annotations distinguish between obstacle types and account for partial occlusions and boundary ambiguities.
The evaluation methodology focuses on detection accuracy metrics relevant to aviation safety applications. Primary metrics include detection rate, false alarm rate, and temporal consistency of detections. Detection rate measures the percentage of annotated obstacles successfully detected by the system, with detections declared successful when predicted bounding boxes achieve IOU greater than 0.5 with ground truth annotations. False alarm rate quantifies the number of spurious detections per frame where the system reports obstacles not present in ground truth.
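The per-frame matching behind these metrics can be sketched as a greedy assignment at IOU greater than 0.5; the helper below is self-contained and returns the true positive, false positive, and false negative counts from which detection rate and false alarm rate are aggregated:

```python
def iou(a, b):
    """Intersection-over-Union of boxes given as (x1, y1, x2, y2)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def evaluate_frame(preds, gts, iou_thresh=0.5):
    """Greedily match predictions to unmatched ground truth at
    IOU > iou_thresh; returns (tp, fp, fn) for this frame."""
    matched = set()
    tp = 0
    for p in preds:
        best, best_iou = None, iou_thresh
        for i, g in enumerate(gts):
            if i in matched:
                continue
            v = iou(p, g)
            if v > best_iou:
                best, best_iou = i, v
        if best is not None:
            matched.add(best)
            tp += 1
    return tp, len(preds) - tp, len(gts) - tp
```

Summing the per-frame triples over a sequence gives the detection rate (tp over tp plus fn) and the false alarm rate (fp per frame) reported in the evaluation.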
4.2 Ablation Studies and Component Analysis
Ablation experiments quantify the contribution of each pipeline component to overall system performance. These studies progressively disable individual modules to isolate their impact on detection accuracy and computational cost.
The Big-Small classical preprocessing stage provides substantial computational savings by reducing the number of patches requiring ResNet classification from thousands per frame to a configurable maximum of 50 candidates per frame. This selective processing focuses computational resources on high-probability regions identified through classical edge analysis. The system parameters balance computational efficiency against detection coverage through the adjustable candidate patch limit.
Multi-gamma enhancement significantly improves detection robustness across varying thermal conditions. Experiments using only single gamma values show degraded performance on obstacles exhibiting either very dark or very bright thermal signatures. The three-gamma approach provides near-optimal performance equivalent to exhaustive search over many gamma values while maintaining computational efficiency through parallel processing of only three variants.
Temporal filtering with the 8-frame sliding window and 4-frame persistence requirement provides substantial false positive reduction compared to single-frame detection without temporal integration. The performance improvement comes at the cost of increased latency in initial obstacle detection, with confirmed obstacles requiring a minimum of 4 frames before validation. This latency trade-off is acceptable for aviation scenarios where collision avoidance maneuvers operate on multi-second time scales.
4.3 Computational Performance and Real-Time Operation
Computational profiling on representative hardware platforms quantifies processing time requirements for real-time implementation. Profiling targets embedded avionic computing platforms equipped with modern multi-core processors and GPU acceleration for ResNet inference. The system invokes the ResNet classifier every 5 frames (configurable via the run_net_every_n_frames parameter) to balance detection latency against computational load, with optical flow tracking maintaining obstacle correspondence during the intermediate frames.
Per-module timing analysis reveals that ResNet patch classification consumes approximately 60 percent of total processing time, classical preprocessing 20 percent, optical flow tracking 15 percent, and remaining modules 5 percent combined. GPU acceleration of the ResNet inference using optimized deep learning frameworks reduces neural network computation time by a factor of 10 compared to CPU-only implementations, enabling real-time operation.
The system exhibits scalable computational requirements based on scene complexity. Frames with sparse obstacle content and low edge density result in fewer candidate patches and faster processing, while cluttered scenes with numerous edges increase candidate counts and computation time. Configurable parameters including maximum patches per frame provide mechanisms for bounding worst-case computational costs to guarantee latency requirements.
Memory requirements remain modest with total working memory under 500 megabytes including frame buffers, obstacle tracking databases, and neural network weights. This footprint is compatible with embedded systems typical in modern avionic architectures.
4.4 Detection Performance in Tactical Scenarios
Evaluation on BN Dataset sequences demonstrates robust detection performance across diverse tactical scenarios. The system is designed to detect obstacles meeting minimum size criteria of 10 pixels width and 20 pixels height at operational ranges, as specified in the configuration parameters. Detection performance degrades gracefully with increasing range as obstacles subtend fewer pixels and thermal contrast diminishes.
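The minimum-size criterion can be expressed as a simple bounding-box filter. The (x1, y1, x2, y2) pixel-coordinate representation is an assumption for illustration:

```python
def meets_min_size(bbox, min_w=10, min_h=20):
    """Reject candidates below the configured minimum pixel dimensions
    (10 px width, 20 px height per the configuration parameters)."""
    x1, y1, x2, y2 = bbox
    return (x2 - x1) >= min_w and (y2 - y1) >= min_h
```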
The high classification threshold of 0.95 combined with temporal persistence requirements of 4 detections within 8 frames is designed to maintain low false alarm rates suitable for operational systems. The majority of detection challenges arise from ambiguous thermal features exhibiting similar characteristics to obstacles, including thin branches, fence posts, and antenna structures that may represent legitimate flight hazards depending on mission parameters.
The temporal filtering framework maintains obstacle tracks across frames using IOU-based matching with distance thresholds of 30 pixels horizontally and 60 pixels vertically. Temporary detection gaps can occur during aircraft maneuvers producing large inter-frame motion that challenges the optical flow tracker, and during environmental conditions producing low thermal contrast.
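The IOU matching and pixel distance gates described above can be sketched as follows. The box representation and the center-distance gating rule are illustrative simplifications of the tracker; only the 30/60 pixel thresholds come from the text:

```python
def iou(a, b):
    """Intersection-over-Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def centers_within_gate(a, b, dx=30, dy=60):
    """Distance gate from the text: 30 px horizontal, 60 px vertical.

    The looser vertical gate reflects that vertical structures are
    localized less precisely along their long axis.
    """
    cax, cay = (a[0] + a[2]) / 2, (a[1] + a[3]) / 2
    cbx, cby = (b[0] + b[2]) / 2, (b[1] + b[3]) / 2
    return abs(cax - cbx) <= dx and abs(cay - cby) <= dy
```

A track and a new detection would be associated only when both tests pass, which keeps the matcher cheap compared to a full probabilistic data-association framework.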
Detection latency from obstacle appearance to confirmed detection averages 4 frames, as expected from the temporal filter design; obstacles that appear immediately with very high confidence scores can be confirmed in as little as 1 frame.
5. Discussion
5.1 Advantages of the Hybrid Approach
The proposed hybrid architecture demonstrates several advantages over pure deep learning or pure classical approaches. First, classical preprocessing provides effective attention mechanisms that focus neural network computation on high-probability regions, achieving computational efficiency without sacrificing detection recall. This efficiency is critical for embedded deployment in resource-constrained avionic systems.
Second, multi-gamma enhancement addresses limited training data challenges by synthesizing exposure variants that provide appearance invariance. This strategy avoids the need for massive training datasets spanning all possible thermal conditions, instead leveraging domain knowledge about intensity transformations to augment the input representation.
Third, the combination of learned and handcrafted features provides complementary strengths. Big-Small filtering excels at identifying vertical step edges characteristic of obstacles but struggles with obstacles lacking clear boundaries or exhibiting complex textures. ResNet classification learns discriminative features from training data but requires candidate localization to avoid exhaustive sliding window search. The hybrid system leverages strengths of both paradigms.
Fourth, temporal filtering using optical flow and IOU validation provides robust false positive suppression without requiring complex probabilistic tracking frameworks or motion models. This design trades slightly increased latency for substantially improved reliability, an acceptable compromise in aviation applications.
5.2 Limitations and Failure Modes
Several limitations and failure modes characterize the current system implementation. First, detection performance degrades for obstacles at long ranges where thermal contrast diminishes and size falls below the minimum resolvable dimensions. The system cannot detect obstacles occupying fewer than approximately 10 by 20 pixels, limiting maximum detection range as a function of obstacle physical dimensions and sensor resolution.
Second, the system exhibits sensitivity to extreme thermal conditions including scenarios where obstacles equilibrate thermally with backgrounds, eliminating the intensity differences exploited by Big-Small filtering. This limitation is fundamental to passive thermal sensing and cannot be fully addressed without active illumination or complementary sensing modalities.
Third, optical flow tracking can fail during rapid aircraft maneuvers producing large inter-frame displacements exceeding the pyramid search range. While the pyramid implementation handles moderate motion, extreme accelerations or high frame rates with limited inter-frame motion may challenge the tracker. Failures typically manifest as temporary track loss requiring re-initialization through new detections.
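One simple way to propagate a tracked box through a dense flow field such as Farneback output is to shift it by the median flow vector inside the box. This is a sketch of the idea under that assumption, not the paper's exact tracker:

```python
import numpy as np

def propagate_box(box, flow):
    """Shift a tracked bounding box by the median dense-flow vector
    inside it.

    `flow` is an (H, W, 2) array of per-pixel (dx, dy) displacements,
    as produced by dense optical flow; the median makes the shift
    robust to outlier vectors near the box boundary.
    """
    x1, y1, x2, y2 = box
    region = flow[y1:y2, x1:x2]                # flow vectors inside the box
    dx = float(np.median(region[..., 0]))
    dy = float(np.median(region[..., 1]))
    return (x1 + dx, y1 + dy, x2 + dx, y2 + dy)
```

When inter-frame displacement exceeds the flow estimator's search range, the median vector becomes unreliable and the track is lost, which matches the failure mode described above.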
Fourth, the system assumes static obstacles and does not explicitly model motion dynamics. Moving obstacles such as other aircraft or vehicles may be detected but their motion is not predicted, potentially causing tracking ambiguities when multiple obstacles pass near each other.
Fifth, the current implementation lacks explicit handling of occlusions where obstacles temporarily disappear behind foreground structures. While the temporal filter tolerates brief detection gaps, prolonged occlusions cause track deletion requiring re-detection when obstacles reappear.
5.3 Computational and Hardware Considerations
Real-time implementation requires careful optimization and hardware acceleration. GPU acceleration of ResNet inference is essential for achieving frame rates suitable for operational deployment. Optimization techniques including inference quantization, pruning, and knowledge distillation could further reduce computational requirements while maintaining accuracy.
The sliding window buffer for temporal filtering introduces memory requirements proportional to window length and frame size. For high-resolution sensors or longer temporal windows, memory optimization through compression or selective storage of detection metadata rather than full frames may be necessary.
Hardware selection involves trade-offs between processing power, power consumption, size, and cost. Modern embedded GPU platforms including NVIDIA Jetson series or similar provide sufficient computational performance in form factors suitable for avionic integration, though certified hardware for safety-critical applications requires additional validation.
5.4 Potential Improvements and Future Directions
Several directions could enhance system capabilities and performance. First, incorporating depth information from stereo thermal sensors or fusion with lidar could provide explicit range estimates enabling obstacle size and trajectory estimation. This would support collision risk assessment and automated avoidance maneuver planning.
Second, implementing recurrent neural network architectures or temporal convolutional networks could replace the handcrafted temporal filter with learned temporal patterns, potentially improving detection consistency and handling dynamic obstacles. However, this would require substantial temporal training data and increased computational resources.
Third, active learning frameworks could enable online adaptation to novel obstacle types and environmental conditions encountered during operation. By identifying low-confidence detections and soliciting operator feedback, the system could incrementally improve classification performance over its operational lifetime.
Fourth, multi-modal fusion combining thermal with visible-spectrum and radar sensors could provide complementary information for robust all-weather operation. Each modality offers distinct advantages, and intelligent fusion could achieve performance exceeding any single sensor.
Fifth, attention mechanisms within the ResNet architecture could provide interpretability by visualizing which image regions contribute most to classification decisions. This would support debugging, validation, and operator trust in automated detection systems.
6. Conclusion
This paper presented a hybrid obstacle detection system combining ResNet50 deep learning with classical image processing techniques for thermal infrared imagery in aviation applications. The system addresses key challenges of limited training data, computational efficiency, and false alarm rates through careful integration of learned representations and domain-specific preprocessing.
The Big-Small edge detection module provides effective attention mechanisms focusing neural network computation on candidate regions exhibiting vertical structure characteristics. Multi-gamma patch enhancement addresses thermal sensor dynamic range limitations through synthetic exposure bracketing. Farneback dense optical flow tracking maintains temporal correspondence with computational efficiency suitable for embedded systems. Sliding window temporal filtering with IOU validation achieves robust false positive suppression critical for operational acceptance.
Evaluation on the BN Dataset demonstrates the effectiveness of the hybrid approach for tactical aviation scenarios. The system architecture balances detection latency against computational load through configurable parameters, enabling deployment on embedded hardware platforms with GPU acceleration and efficient algorithmic design.
The hybrid approach demonstrates that thoughtful combination of classical and learned methods can outperform either paradigm alone, particularly in domains characterized by limited training data, specific physical priors, and computational constraints. Future work will explore multi-modal fusion, learned temporal processing, and active learning for continued performance improvement in operational deployments.
The system represents a practical solution to thermal obstacle detection suitable for current-generation avionic systems, providing flight safety enhancement for low-altitude tactical operations in degraded visual environments.
References
[1] M. J. Veth, J. Raquet, and M. Pachter, "Stochastic constraints for efficient image correspondence search," *IEEE Transactions on Aerospace and Electronic Systems*, vol. 42, no. 3, pp. 973-982, 2006.
[2] L. Matthies, M. Maimone, A. Johnson, Y. Cheng, R. Willson, C. Villalpando, S. Goldberg, A. Huertas, A. Stein, and A. Angelova, "Computer vision on Mars," *International Journal of Computer Vision*, vol. 75, no. 1, pp. 67-92, 2007.
[3] J. W. Davis and V. Sharma, "Background-subtraction using contour-based fusion of thermal and visible imagery," *Computer Vision and Image Understanding*, vol. 106, no. 2-3, pp. 162-182, 2007.
[4] C. Harris and M. Stephens, "A combined corner and edge detector," in *Proceedings of the 4th Alvey Vision Conference*, pp. 147-151, Manchester, 1988.
[5] A. Rogalski, "Infrared detectors: status and trends," *Progress in Quantum Electronics*, vol. 27, no. 2-3, pp. 59-210, 2003.
[6] D. Floreano and R. J. Wood, "Science, technology and the future of small autonomous drones," *Nature*, vol. 521, no. 7553, pp. 460-466, 2015.
[7] S. Scherer, L. Chamberlain, and S. Singh, "Autonomous landing at unprepared sites by a full-scale helicopter," *Robotics and Autonomous Systems*, vol. 60, no. 12, pp. 1545-1562, 2012.
[8] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," *Nature*, vol. 521, no. 7553, pp. 436-444, 2015.
[9] R. Girshick, "Fast R-CNN," in *Proceedings of the IEEE International Conference on Computer Vision*, pp. 1440-1448, 2015.
[10] E. Reinhard, M. Stark, P. Shirley, and J. Ferwerda, "Photographic tone reproduction for digital images," *ACM Transactions on Graphics*, vol. 21, no. 3, pp. 267-276, 2002.
[11] N. Wojke, A. Bewley, and D. Paulus, "Simple online and realtime tracking with a deep association metric," in *IEEE International Conference on Image Processing*, pp. 3645-3649, 2017.
[12] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in *Advances in Neural Information Processing Systems*, pp. 1097-1105, 2012.
[13] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in *IEEE Conference on Computer Vision and Pattern Recognition*, pp. 779-788, 2016.
[14] A. Torralba and A. A. Efros, "Unbiased look at dataset bias," in *IEEE Conference on Computer Vision and Pattern Recognition*, pp. 1521-1528, 2011.
[15] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in *IEEE Conference on Computer Vision and Pattern Recognition*, pp. 770-778, 2016.
[16] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, "How transferable are features in deep neural networks?" in *Advances in Neural Information Processing Systems*, pp. 3320-3328, 2014.
[17] J. Canny, "A computational approach to edge detection," *IEEE Transactions on Pattern Analysis and Machine Intelligence*, vol. PAMI-8, no. 6, pp. 679-698, 1986.
[18] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik, "Contour detection and hierarchical image segmentation," *IEEE Transactions on Pattern Analysis and Machine Intelligence*, vol. 33, no. 5, pp. 898-916, 2011.
[19] D. Marr and E. Hildreth, "Theory of edge detection," *Proceedings of the Royal Society of London B*, vol. 207, pp. 187-217, 1980.
[20] S. E. Palmer, *Vision Science: Photons to Phenomenology*, MIT Press, Cambridge, MA, 1999.
[21] S. M. Pizer, E. P. Amburn, J. D. Austin, R. Cromartie, A. Geselowitz, T. Greer, B. ter Haar Romeny, J. B. Zimmerman, and K. Zuiderveld, "Adaptive histogram equalization and its variations," *Computer Vision, Graphics, and Image Processing*, vol. 39, no. 3, pp. 355-368, 1987.
[22] K. Zuiderveld, "Contrast limited adaptive histogram equalization," in *Graphics Gems IV*, pp. 474-485, Academic Press, 1994.
[23] R. C. Gonzalez and R. E. Woods, *Digital Image Processing*, 3rd ed. Upper Saddle River, NJ: Prentice-Hall, 2008.
[24] P. E. Debevec and J. Malik, "Recovering high dynamic range radiance maps from photographs," in *ACM SIGGRAPH*, pp. 369-378, 1997.
[25] D. J. Fleet and Y. Weiss, "Optical flow estimation," in *Handbook of Mathematical Models in Computer Vision*, pp. 237-257, Springer, 2006.
[26] G. Farnebäck, "Two-frame motion estimation based on polynomial expansion," in *Proceedings of the Scandinavian Conference on Image Analysis (SCIA)*, pp. 363-370, Springer, 2003.
[27] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. van der Smagt, D. Cremers, and T. Brox, "FlowNet: Learning optical flow with convolutional networks," in *IEEE International Conference on Computer Vision*, pp. 2758-2766, 2015.
[28] Y. Bar-Shalom and T. E. Fortmann, *Tracking and Data Association*, Academic Press, 1988.
[29] S. S. Blackman, "Multiple hypothesis tracking for multiple target tracking," *IEEE Aerospace and Electronic Systems Magazine*, vol. 19, no. 1, pp. 5-18, 2004.
[30] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The Pascal Visual Object Classes (VOC) challenge," *International Journal of Computer Vision*, vol. 88, no. 2, pp. 303-338, 2010.
[31] S. Escobar-Alvarez, N. Klingebiel, L. A. Johnson, S. Browne, N. Katta, J. Leib, D. W. Hodos, and N. J. Ferrier, "R-ADVANCE: Rapid adaptive prediction for vision-based autonomous navigation, control, and evasion," *Journal of Field Robotics*, vol. 35, no. 1, pp. 91-100, 2018.
