(cleanup) removing timm from the codebase.
- README.md +1 -7
- SEGMENTATION_PLAN.md +2 -4
- configs/default.yaml +5 -5
- paper-tex/sections/method.tex +0 -98
- requirements.txt +0 -1
- src/wireseghr/model/encoder.py +12 -46
- src/wireseghr/model/model.py +1 -1
README.md
CHANGED
@@ -31,15 +31,9 @@ python src/wireseghr/infer.py --config configs/default.yaml --image /path/to/ima
 - Defaults locked: SegFormer MiT-B3 encoder, patch size 768, MinMax 6×6, global+binary mask conditioning with patch-cropped global map.

 ### Backbone Source
-
-- Optional: `timm` features_only if a compatible SegFormer is available.
+- HuggingFace Transformers SegFormer (e.g., `nvidia/mit-b3`). We set `num_channels` to match input channels.
 - Fallback: a small internal CNN that preserves 1/4, 1/8, 1/16, 1/32 strides with channels [64, 128, 320, 512].

-Install requirements to get Transformers:
-```
-pip install -r requirements.txt
-```
-
 ## Dataset Convention
 - Flat directories with numeric filenames; images are `.jpg`/`.jpeg`, masks are `.png`.
 - Example (after split 85/5/10):
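
For reference, a minimal sketch of the HF load path the new bullet points to; the checkpoint id and channel count here are illustrative, and the repo's actual wrapper lives in `src/wireseghr/model/encoder.py`:

```python
import torch
from transformers import SegformerModel

# Widening num_channels re-initializes only the first patch-embedding projection;
# ignore_mismatched_sizes keeps the rest of the pretrained weights intact.
encoder = SegformerModel.from_pretrained(
    "nvidia/mit-b2",
    num_channels=6,
    ignore_mismatched_sizes=True,
)

x = torch.randn(1, 6, 512, 512)
out = encoder(x, output_hidden_states=True)
# One feature map per stage (strides 4/8/16/32); whether the last stage comes back
# reshaped to (B, C, H, W) depends on the checkpoint's reshape_last_stage flag.
for feat in out.hidden_states:
    print(feat.shape)
```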
SEGMENTATION_PLAN.md
CHANGED
@@ -9,7 +9,7 @@ This plan distills the model and pipeline described in the paper sources:
 Focus: segmentation only (no dataset collection or inpainting).

 ## Decisions and Defaults (locked)
-- Backbone: SegFormer MiT-B3 via HuggingFace Transformers
+- Backbone: SegFormer MiT-B3 via HuggingFace Transformers.
 - Fine/local patch size p: 768.
 - Conditioning: global map + binary location mask by default (Table `tables/logit.tex`).
 - Conditioning map scope: patch-cropped from the global map per `paper-tex/sections/method_yq.tex` (no full-image concatenation variant).
@@ -21,7 +21,7 @@ Focus: segmentation only (no dataset collection or inpainting).

 ## Project Structure
 - `configs/`
-  - `default.yaml` (backbone=mit_b2, p=768, coarse_train=512, coarse_test=1024, alpha=0.01, minmax=true, kernel=6, maxpool_label=true, cond_variant=global
+  - `default.yaml` (backbone=mit_b2, p=768, coarse_train=512, coarse_test=1024, alpha=0.01, minmax=true, kernel=6, maxpool_label=true, cond_variant=global)
 - `src/wireseghr/`
   - `model/`
     - `encoder.py` (SegFormer MiT-B3, N_in channels expansion)
@@ -53,7 +53,6 @@ Focus: segmentation only (no dataset collection or inpainting).
 - Binary location mask: 1 inside current patch region (in full-image coordinates), 0 elsewhere.
 - Pass patch-aligned cond crop and binary mask as channels to the fine branch input.
 - Notes:
-  - We expose a config toggle to switch conditioning variant between: `global+binary_mask` (default) and `global_only` (Table `tables/logit.tex`).
   - We follow the published version (`paper-tex/sections/method_yq.tex`) and use patch-cropped conditioning exclusively; no full-image conditioning variant will be implemented.

 ## Data and Preprocessing
@@ -103,7 +102,6 @@
 ## Metrics and Reporting
 - Implement: IoU, F1, Precision, Recall (global, and optionally per-size bins if available) matching `tables/component.tex`.
 - Validate α trade-offs following `tables/thresholds.tex`.
-- Ablations: MinMax on/off, MaxPool on/off, conditioning variant (Table `tables/logit.tex`).

 ## Configuration Surface (key)
 - Backbone/weights: `mit_b2` (pretrained ImageNet-1K).
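
For reference, a rough sketch of the two conditioning channels listed in the plan above; the function name, tensor shapes, and the nearest-neighbor resize of the location mask are assumptions, not the repo's API:

```python
import torch
import torch.nn.functional as F

def conditioning_channels(global_map: torch.Tensor, y: int, x: int, p: int):
    """global_map: (B, 1, H, W) coarse prediction upsampled to full-image size."""
    B, _, H, W = global_map.shape
    # Patch-cropped cond map, per method_yq.tex (no full-image concatenation).
    cond_crop = global_map[:, :, y:y + p, x:x + p]
    # Binary location mask: 1 inside the current patch (in full-image coordinates).
    loc_mask = torch.zeros(B, 1, H, W)
    loc_mask[:, :, y:y + p, x:x + p] = 1.0
    # Assumed convention: resize the full-image mask to the fine-branch input size
    # so it can be stacked as a channel; the repo may align this differently.
    loc_mask = F.interpolate(loc_mask, size=(p, p), mode="nearest")
    return cond_crop, loc_mask
```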
configs/default.yaml
CHANGED
@@ -1,6 +1,6 @@
 # Default configuration for WireSegHR (segmentation-only)
 backbone: mit_b2
-pretrained: true # Uses HF SegFormer weights if available
+pretrained: true # Uses HF SegFormer weights if available

 coarse:
   train_size: 512
@@ -32,7 +32,7 @@ eval:
   fine_batch: 16

 optim:
-  iters:
+  iters: 10000
   batch_size: 4
   lr: 6e-5
   weight_decay: 0.01
@@ -43,9 +43,9 @@ optim:
 # training housekeeping
 seed: 42
 out_dir: runs/wireseghr
-eval_interval:
-ckpt_interval:
-
+eval_interval: 200
+ckpt_interval: 400
+resume: runs/wireseghr/ckpt_1800.pt # optional

 # dataset paths (placeholders)
 data:
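
For reference, the values set in this change can be read back with a few lines of PyYAML; the top-level placement of the housekeeping keys follows the diff as shown:

```python
import yaml

with open("configs/default.yaml") as f:
    cfg = yaml.safe_load(f)

print(cfg["backbone"])                              # mit_b2
print(cfg["optim"]["iters"])                        # 10000 after this change
print(cfg["eval_interval"], cfg["ckpt_interval"])   # 200 400
print(cfg.get("resume"))                            # optional checkpoint path
```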
paper-tex/sections/method.tex
DELETED
@@ -1,98 +0,0 @@
-%auto-ignore
-
-% \textcolor{blue}{ 2 options: 1. synthetic works: sec 3. data, sec 4. method, if synthetic not working: sec, method, sec, evaluation data collection: 1. coverage (indoor, outdoor, we even handshot...: multiple sources [flicker, google, etc, our own shooting]),
-% multiple coverage (different keywords, select 400 from xxxK images, [ask mada to provde distritbuion, in supplemnetatyr]), 3. resolution: minimize. [How to convince pepople to use the test set..], annotation quality [in house annotation, multiple rounds of quality assurance]. We xx, 409 will be release}
-
-% \textcolor{blue}{ if synthetic works, add another subsection in data section}
-
-% \subsection{Generating synthetic dataset}
-% While our dataset contains xxxx high-resolution images, the number is still insufficient when compared to other large segmentation datasets such as the well known Cityscapes~\cite{cityscapes}, Mapillary Vistas~\cite{mapillary} and MSCOCO dataset~\cite{coco}. To overcome the data insufficiency issue, one approach is to incorporate synthetic datasets with similar properties in the training. This is used in numerous vision related tasks such as 3D reconstruction~\cite{3drecon}, semantic segmentation (specifically domain adaptation)~\cite{domainadapt}. Synthetic dataset is also widely used in line and wire segmentation tasks~\cite{syntheticline, syntheticwire}, though they primarily used 3D rendering engines to generate artificial wires.
-
-% Contrary to their approaches, we opt to generate our synthetic dataset with existing wires by blending wires from one image to another. Our approach is motivated by the fact that wires in our dataset have very different shapes and sizes, and is therefore difficult to generate artificially. To ensure our synthetic dataset is adequately realistic, we leverage semantic information of the scene as a cue to select regions for wire blending. For example, we only blend wires from the sky of one image to the sky of another image, such that the lighting, texture and nearby entities are similar.
-
-% Here we describe in detail how we synthesize wire images. First, we predict a semantic map for each image in our real dataset. This gives us semantic information on the image and the wires. Then, we extract the wires from these images by cropping wire pixels and removing them from the background with Content-Aware Fill. For each obtained background image, we generate three separate images by randomly transforming (e.g. rotating and scaling) and blending several wires with the image, where the blending region is constricted by the type of object in the region and the selected wire. The resulting synthetic wire dataset contains xxxx images with different realistic wire patterns. Figure~\ref{fig:synthetic} shows an example of our generated high-resolution wire image.
-
-
-\section{Method} \label{sec:method}
-% Sec 4.1 Model structure
-% Sec 4.2 how we combine: inference: end to end , information flow, how from an image to wire segmentation masks, inference method: 1. global run -> subset regions (detailed selection will be descried in 4.4.1) [describe the whole process -> (new para) explain how selected.]
-% Sec 4.3 how we train: how we train to make usre share weights: [describe whole training, then detail each components ((new parap) pre-processing, batch training, sampling, etc)]
-% Sec.4.4 pre-processing and post-processing
-
-% In this section we[], the structure is in fig X
-% 4.1: subsubection{global branch}: input, output, structurte, subsubsection{local branch}: input, output , structrure
-% 4.2: combining branches, connecting, how its done to end-to-end
-
-% \textcolor{blue}{our model is composed of of xxx parts: a global (Secxxx), a local. Add a figure}
-
-\input{figure_tex/pipeline}
-
-In this section, we describe our two-stage model architecture. Figure~\ref{fig:pipeline} shows the overall structure of our model. We will first discuss motivations of our two-stage design, then we describe our two-stage model architecture and training scheme in detail.
-
-\subsection{Motivation}
-
-As shown in Figure~\ref{fig:motivation}, we divide our model into two major components -- a coarse module and a fine module. Our two-stage model design is motivated by two observations. To start, wires in our dataset are extremely long and thin. As mentioned above, many wires in our images can span over several thousand pixels long while only several pixels across. Limited by the memory size of GPUs, we cannot simply pass the entire image spanning thousands of pixels at full-resolution to a model for inference. As a result, two separate modules are required, where the coarse module captures the entire wire and its surrounding contextual information at a lower resolution, and the fine module captures detailed textures of the wire at the original resolution.
-
-Next, we observe that wire pixels are also very sparse, where a typical high-resolution image contains very small percentages of wire pixels. This means that we can also use the coarse module to guide the fine module on what regions to capture full-resolution details. This way, the fine module can predict segmentation masks only where there are wires, and therefore reduce computation time.
-
-\subsection{The two-stage coarse to fine model}
-
-Given a high-resolution image $I$, the coarse module of our two-stage model aims to recognize semantics of the entire image and pick up wire regions at a coarse level. Looking at the entire image allows the coarse module to capture more contextual information. To fit the large images into a GPU for training and inference, images are first bilinearly downsampled, then fed into the coarse module to predict a global logit map $Z_{glo}$. We use $Z_{glo}$ as a conditional input to guide the fine module in the next stage.
-
-The fine module is conditioned on the output from the coarse module. This module takes in a local image patch $I_{loc}$ of the whole image at full resolution, the global logit map $Z_{glo}$ from the coarse module, and a binary location mask $M$ that sets the patch relative to the full image to $1$ and other regions to $0$. The fine module then predicts the local wire logit map $Z_{loc}$. Empirically, we find that concatenating the entire global logit map rather than cropping the logit map at the location of the image patch yields slightly improved results.
-
-The designs of the coarse and fine modules are conceptually the same as those in GLNet~\cite{glnet} and MagNet~\cite{magnet}, where a global network is trained on entire downsampled images and a local network is trained on higher-resolution image patches conditioned on some global features. However, unlike GLNet, where intermediate features are shared between the global and local branch bidirectionally, we opt for a simpler late fusion by concatenating the logit map directly to the fine module. We also only use two stages instead of up to four stage as done in MagNet, since only a fine module at the highest resolution is adequate at refining annotations that are only several pixels thick, and additional intermediate stages can drastically increase inference time.
-
-Our model can be trained end-to-end. To do so, in the training stage, we first randomly scale, rotate, horizontal flip and apply photometric distortion to the image. The global image $I_{glo}$ is generated by simply downsampling this augmented image. We then generate the local image patch $I_{loc}$ by randomly cropping a 512$\times$512 window from the augmented image that contains at least 1\% wire pixels. This helps balance between wire and background pixels. We find that this simple constrained cropping approach yields better performances than other well known balancing methods, including Focal loss~\cite{focal} and Online Hard Example Mining~\cite{ohem}.
-
-We share the same feature extractor network between both the coarse and fine module, but train separate feature decoders. To account for the additional inputs to the fine module, we expand the input channels of the feature extractor from 3 to 5. The last two channels are set to 0 when passing an image through the coarse module, and set to the global logit map $Z_{glo}$ and the binary location map $M$ for the fine module. Both $Z_{glo}$ and $Z_{loc}$ are trained by computing Cross Entropy losses of the logit maps after softmax against their respective ground truth wire annotations. The final loss $L$ of our model is the sum of the global cross entropy $L_{glo}$ loss and local cross entropy loss $L_{loc}$. They are defined as follows:
-
-\begin{equation}
-\begin{aligned}
-L_{glo} &= -\sum_{c\in C}log\ Softmax(Z_{glo})_c \\
-L_{loc} &= -\sum_{c\in C}log\ Softmax(Z_{loc})_c \\
-L &= L_{glo} + L_{loc}
-\end{aligned}
-\end{equation}
-where $C$ is the number of categories. $C=2$ (wire, background) in our task.\\
-
-To perform inference with our model, for each high-resolution image, we first feed the downsampled image to the coarse module, which is the same as the training step. We then compute the global wire segmentation map by taking the $\mathrm{argmax}$ over the two classes for each pixel. Local refinement is done by running a sliding window over the entire image, where the fine step is performed if there are more than some percentage $\alpha$ of wire pixels within the image window. By conditioning on the global wire probability, we save computation time in regions where there are no wires, and maintain segmentation quality at high-resolution when potential wires are discovered. By passing the global logit to the fine module, we also allow the coarse module to provide information to the fine module for better inference quality.
-
-\subsection{Implementation details}
-
-We use MixTransformer~\cite{segformer} as our shared feature extractor. We expand the backbone's input RGB channel to five channels to accept the logit map and binary location map during the local step. We define two separate MLP feature decoders from~\cite{segformer} for the coarse and fine modules respectively.
-
-We train our model on our training set with 2796 high-resolution images. The model is trained for 40k iterations with a batch size of 8. Global images are downsampled to 512$\times$512. We feed the global images to the coarse module and obtain a two channel class logit map. A single 1$\times$1 convolution is used to transform this logit map into a single channel map, which is then concatenated with the local RGB image and binary location mask. The five-channel input is finally fed into the fine module to obtain the local logit map. We use AdamW~\cite{adamw} with a learning rate of $0.00006$ and weight decay of $0.01$. We use the "poly" LR scheduling rule with a factor of $1.0$.
-
-In the testing stage, we set both the global image and local image patch size to 1024$\times$1024. Unless otherwise specified, we set the percentage for local refinement $\alpha$ to $0.01\%$.
-
-We also train common semantic segmentation models on our dataset for comparison. For whole image models, we train the models on downsampled whole-images, as are done in our coarse module and most semantic segmentation methods. For sliding-window models, we train them on full-resolution image patches, and perform inference in a sliding window approach. All experiments are trained on 4 Nvidia V100 GPUs and tested on a single V100 GPU.
-
-% A straight-forward approach towards wire segmentation is to train a common object segmentation model on our high-resolution wire images. To avoid memory issues with large images, we adapt conventional semantic segmentation training pipelines by randomly downsampling the images to between 1024 and 2048 pixels on the longer side, while maintaining aspect ratio. Random rotation is added to increase variety of wire shapes. Random horizontal flip, photometric distortion and channel normalization are then added. To obtain the final segmentation prediction, the predicted probability map is bilinearly upsampled to the original image size, where evaluation takes place. We name this method coarse semantic segmentation, since the image downsampling and prediction upsampling steps cause the final output to have lower resolution than the original image.
-
-% An intuitive alternative to coarse segmentation is to train the model on full-resolution image patches, then perform inference in a sliding-window manner. This way, the model captures the most details and thus gives the tightest prediction. The trade-off with the sliding window method is the inference time. Since each full-resolution image has to be cropped into multiple patches and be predicted individually, multiple iterations of inference have to be performed, resulting in significantly prolonged inference time when images are several times larger. As a comparison, we train and evaluate the same model architecture as coarse semantic segmentation, but instead of downsampling images to below 2048 pixels per side, we simply crop the full-resolution images without downsampling. During inference, a full-resolution large image is cropped into multiple image patches of the same size, and each patch is predicted individually. The finaly prediction map is constructed by stitching the prediction patches together. We name this method fine semantic segmentation, since the prediction is not upsampled and thus gives the tightest segmentation map.
-
-% We show in Section~\ref{sec:results} that this method gives significantly better results than coarse semantic segmentation, but requires much longer to run.
-
-% \subsection{The coarse-to-fine pipeline}
-% Ideally, to perform efficient yet accurate semantic segmentation on large-resolution images, a fusion of the above two methods is most desired. Intuitively, we can use the coarse segmentation map as a condition to determine the regions in the image that require fine segmentation. To achieve this, we can divide the both the image and the coarse segmentation map into patches, and perform fine segmentation based on the result of the coarse segmentation patch, such as the percentage of wire pixels in the patch.
-
-% Intuitively, we can simply use the trained models from the above two methods as the "combined model" for coarse-to-fine inference. Yet, this design requires two separate models, and do not share information between them. An improvement over this approach is to train a single model with both downsampled images and full-resolution image patches, then run coarse-to-fine inference using the same model, but this still does not provide a way to share information between both models.
-
-% To ensure efficient inference without duplicated model structures, as well as enable information passing from the coarse inference stage to the fine inference stage, propose a sequential global-to-local network that shares the feature extractor backbone but contains two separate decoder heads. The concept can be thought of as the opposite of~\cite{glnet}, where they have separate feature extractors for global and local images. As a result, our model can be trained in an end-to-end manner, while their model requires a three-stage training scheme. Our proposed method is similar to~\cite{cascadepsp} in terms of the sequential inference pipeline, but differs in two major aspects. First, ~\cite{cascadepsp} shares the entire network for global and local inference. Second, their work is designed for segmentation refinement, meaning that their model requires a separate segmentation prediction as input, resulting in two iterations for global inference. On the other hand, our model only requires a single step in global inference.
-
-% \subsection{Model description}
-
-% Here we describe our two-stage model design and training details. Our model consists of a single shared feature extractor and two separate decoders. In our experiments, we use Mix Transformer (MiT)~\cite{segformer} as the feature extractor, and the multi-layer perceptron (MLP) decoder proposed by the same authors as the decoder structure. As mentioned above, we duplicate the decoder such that one is used for global inference and the other is used for local inference.
-
-% To enable information sharing between the global step and the local step, we extend the model input from three channels to five. The first three channels are the original RGB channels, and the last two channels are for the local step, and is therefore deactivated (i.e. all zeros) during the global step. In the local step, the fourth channels is the probability map of the entire downsampled image. The fifth channel is a binary map, where the location of the local image patch projected back onto the whole image is set as one, and the remaining regions as zero. Compared to~\cite{cascadepsp}, where only the probability map of the local image patch is appended to the input, using the probability map of the entire image enables the network to gain contextual information of the image patch. At the same time, the binary map allows the network to condition on the local probability, thus performing accurate segmentation refinement. An illustration of our model design can be seen in Figure.~\ref{fig:model}.
-
-% \textbf{stress on the nice part of the model: shared weights, size, etc...}
-
-
-% \section{Toward Collecting High-resolution Wire Dataset}
-
-% \subsection{High-resolution Wire Dataset}
-% Albeit having major incentives towards distractor removal, few works exist that tackle wire segmentation in high-resolution outdoor images. To this end, we present a dataset consisting of xxx very high-resolution outdoor images with wires and cables. Image sizes in our dataset varies from X$\times$Y to W$\times$Z. Wire images in our dataset not only include power lines and cables as in~\cite{ttpla}, but also electrical wires attached to buildings and other structures. This adds significant variety to the types of wires in terms of color and lighting. In addition, while ~\cite{ttpla} contains power line images mostly from close distances, wires in our dataset images can appear from very far distances, which makes them extremely difficult to recognize even with careful visual inspection, to extreme close-ups where textures are clearly visible. This extreme pattern variety makes it very difficult for the model to infer from a single visual cue, such as color gradients.
-
-% Table.\ref{table:dataset} shows statistics of our dataset. As can be seen, our dataset contains images that are significantly larger than other similar wire/cable datasets. (add stuff when stats and comparisons are done)
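
For reference, the loss in the deleted section ($L = L_{glo} + L_{loc}$, cross entropy per branch) corresponds to the following PyTorch sketch; tensor names and shapes are illustrative:

```python
import torch
import torch.nn.functional as F

B, C, H, W = 2, 2, 128, 128                           # C = 2 classes: wire, background
z_glo = torch.randn(B, C, H, W, requires_grad=True)   # coarse-branch logits
z_loc = torch.randn(B, C, H, W, requires_grad=True)   # fine-branch logits
y_glo = torch.randint(0, C, (B, H, W))
y_loc = torch.randint(0, C, (B, H, W))

# F.cross_entropy applies log-softmax internally, matching -sum_c log Softmax(Z)_c.
loss = F.cross_entropy(z_glo, y_glo) + F.cross_entropy(z_loc, y_loc)
loss.backward()
```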
requirements.txt
CHANGED
@@ -1,6 +1,5 @@
 torch>=2.1.0
 torchvision>=0.16.0
-timm>=0.9.8
 transformers>=4.37.0
 numpy>=1.24.0
 opencv-python>=4.8.0.76
src/wireseghr/model/encoder.py
CHANGED
@@ -1,14 +1,14 @@
 """SegFormer MiT encoder wrapper with adjustable input channels.

-Uses
-features [C1, C2, C3, C4].
+Uses HuggingFace Transformers SegFormer (e.g., mit_b2) and returns a list of
+multi-scale features [C1, C2, C3, C4]. Falls back to a tiny CNN if HF isn't
+available.
 """

 from typing import List, Tuple

 import torch
 import torch.nn as nn
-import timm


 class SegFormerEncoder(nn.Module):
@@ -17,66 +17,32 @@ class SegFormerEncoder(nn.Module):
         backbone: str = "mit_b2",
         in_channels: int = 6,
         pretrained: bool = True,
-        out_indices: Tuple[int, int, int, int] = (0, 1, 2, 3),
     ):
         super().__init__()
         self.backbone_name = backbone
         self.in_channels = in_channels
         self.pretrained = pretrained
-        self.out_indices = out_indices

         # Prefer HuggingFace SegFormer for 'mit_*' backbones.
-        #
-        self.encoder = None
+        # Fallback to Tiny CNN if HF unavailable or unsupported.
         self.hf = None
         prefer_hf = backbone.startswith("mit_") or backbone.startswith("segformer")
         if prefer_hf:
-            # HF ->
+            # HF -> tiny
            try:
                 self.hf = _HFEncoderWrapper(in_channels, backbone, pretrained)
                 self.feature_dims = self.hf.feature_dims
             except Exception:
-                try:
-                    self.encoder = timm.create_model(
-                        backbone,
-                        pretrained=pretrained,
-                        features_only=True,
-                        out_indices=out_indices,
-                        in_chans=in_channels,
-                    )
-                    self.feature_dims = list(self.encoder.feature_info.channels())
-                except Exception:
-                    self.encoder = None
-                    self.fallback = _TinyEncoder(in_channels)
-                    self.feature_dims = [64, 128, 320, 512]
+                self.hf = None
+                self.fallback = _TinyEncoder(in_channels)
+                self.feature_dims = [64, 128, 320, 512]
         else:
-            #
-            try:
-                self.encoder = timm.create_model(
-                    backbone,
-                    pretrained=pretrained,
-                    features_only=True,
-                    out_indices=out_indices,
-                    in_chans=in_channels,
-                )
-                self.feature_dims = list(self.encoder.feature_info.channels())
-            except Exception:
-                try:
-                    self.hf = _HFEncoderWrapper(in_channels, backbone, pretrained)
-                    self.feature_dims = self.hf.feature_dims
-                except Exception:
-                    self.encoder = None
-                    self.fallback = _TinyEncoder(in_channels)
-                    self.feature_dims = [64, 128, 320, 512]
+            # tiny
+            self.fallback = _TinyEncoder(in_channels)
+            self.feature_dims = [64, 128, 320, 512]

     def forward(self, x: torch.Tensor) -> List[torch.Tensor]:
-        if self.encoder is not None:
-            feats = self.encoder(x)
-            assert isinstance(feats, (list, tuple)) and len(feats) == len(
-                self.out_indices
-            )
-            return list(feats)
-        elif self.hf is not None:
+        if self.hf is not None:
             return self.hf(x)
         else:
             return self.fallback(x)
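
For reference, a quick smoke-test sketch of the trimmed encoder; the import path assumes `src/` is on `PYTHONPATH`, and `pretrained=False` is used only to keep the example light (the HF path may still fetch a config):

```python
import torch
from wireseghr.model.encoder import SegFormerEncoder

# With timm gone, features come either from the HF SegFormer wrapper or the tiny CNN fallback.
enc = SegFormerEncoder(backbone="mit_b2", in_channels=6, pretrained=False)
x = torch.randn(1, 6, 512, 512)

feats = enc(x)  # [C1, C2, C3, C4] at strides 4/8/16/32
for feat, dim in zip(feats, enc.feature_dims):
    print(tuple(feat.shape), dim)
```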
src/wireseghr/model/model.py
CHANGED
@@ -14,7 +14,7 @@ class WireSegHR(nn.Module):

     Expects callers to prepare input channel stacks according to the plan:
     - Coarse input: RGB + MinMax (and any extra channels per config), shape (B, Cc, Hc, Wc)
-    - Fine input: RGB + MinMax + cond_crop
+    - Fine input: RGB + MinMax + cond_crop, shape (B, Cf, p, p)

     Conditioning 1x1 is applied to coarse logits to produce a single-channel map.
     """
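
For reference, a sketch of the fine-branch stack described in the docstring; the channel counts are assumptions, chosen so that a single MinMax channel plus the conditioning channels matches the encoder's default `in_channels=6`:

```python
import torch

p = 768
rgb = torch.randn(2, 3, p, p)        # full-resolution patch
minmax = torch.randn(2, 1, p, p)     # MinMax channel for the same patch (assumed 1 channel)
cond_crop = torch.randn(2, 1, p, p)  # patch crop of the 1x1-conv'd coarse logits
loc_mask = torch.ones(2, 1, p, p)    # binary location mask channel

fine_in = torch.cat([rgb, minmax, cond_crop, loc_mask], dim=1)
print(fine_in.shape)                 # torch.Size([2, 6, 768, 768])
```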