😊 Let's recognize hand gestures with AI.

gellston · January 2, 2026, 6:47 PM

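Deciding what to build

I previously built a model that brightens dark images, wrapped it in a NuGet package, and published it.
A lot of people showed interest and left comments, which made it fun..
This time I'll build a model that recognizes hand gestures (from continuous video) and publish the process as a NuGet package.
I also considered a NuGet package for telling faces apart, but there are already too many good ones.
When I searched around, there weren't many for gestures.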

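I actually did a similar project at work,
but as you know, company projects are no fun and only pile on stress, no matter how good the technology you bolt onto them..
This time I'll build it with the kind of affection you'd give your own child.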

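Surveying usable datasets

This one looks usable.
In the demo, most gestures will be made while sitting in front of a computer
(the dataset consists entirely of recordings of people gesturing while seated in front of a monitor),
so I think it fits the purpose.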

License

After reading it through, it looks like commercial use is not allowed but non-commercial use is.

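Finding an easy paper by tomorrow

Hmm… I plan to dig through deep learning papers until tomorrow, looking for one that is easy yet plausible and satisfies my intellectual curiosity. If possible I'd like to avoid the paper I used at work and spot a new, challenging model..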

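Other: looking for an open-source collaborator

Is anyone here good at PowerShell?.. I have a fun idea and am looking for someone to work on it with me.
If you're interested, please leave a comment T_T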

gellston · January 3, 2026, 2:18 AM

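Let's compare

From arXiv, I roughly picked only models whose design is built on 3D convolution (just the ones that seemed easy, by my subjective judgment)
and compared them.

Sigh… after all that narrowing down, the one left standing is the one I used at work…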

https://arxiv.org/pdf/1904.02422
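What's interesting about this paper is that it takes well-known 2D convolution models, replaces their internal blocks with 3D convolutions,
and compares their performance on video classification.
Because this single paper implements and compares several models, you still have to pick a model from within it.
The model from this paper that I used for the project at work is shown in the figure below.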

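At work I used the 3D version of MobileNetV2, which had reasonable accuracy at a reasonable speed.
Hmm… but doing the same thing again would be no fun, and it wouldn't fit the spirit of this toy project -_-…
So I'll try rearranging a 2D model that this paper did not convert into a 3D convolution version.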

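Which 2D convolution model should I rearrange?

The video classification paper above was last revised in 2021.. and since then several optimized convolutional models have come out that are very fast while keeping accuracy.
A very lightweight model that came to mind as I was writing this is GhostNet. I remembered implementing it before, so I searched my repository and found that the layer module is there.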

github.com/gellston/DeepLearningStudy

torch/util/helper.py

main

import torch
import numpy as np
import torch.nn.functional as F
import math

from torch import Tensor
from typing import Optional, List, Tuple


def load_infinite(loader):
    iterator = iter(loader)
    while True:
        try:
            yield next(iterator)
        except StopIteration:
            iterator = iter(loader)


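Every layer module (the ingredients, so to speak) that goes into the model designs I've studied so far is in the link above.

There's a Ghost module V1 in there.. Working from memory, I think that if I convert this 2D Ghost module V1 to 3D convolution and try video classification with it, I might get both speed and accuracy.

By the way, while I was buried in work, GhostNet (the paper that uses the Ghost module) has already reached V3..

From the 2020 V1 version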

https://arxiv.org/pdf/1911.11907

to the 2024 V3 version

https://arxiv.org/pdf/2404.11202

the authors have clearly been hard at work.
In a world overrun by Transformers, it's amazing to see them still pursuing accuracy and lightness with simple ideas and convolution alone.

gellston · January 3, 2026, 4:07 PM

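Swapping the existing Ghost module from 2D to 3D convolution

GhostNet paper link: https://arxiv.org/pdf/1911.11907
I'm going to take the GhostNet module I wrote earlier, change its 2D convolutions to 3D convolutions, build a simple model, and train it.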

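Why GhostNet?

The figure below is taken from the paper.
To put it in terms any developer can follow, hmm..
the GhostNet authors opened up the feature maps of well-performing models and examined them,
and found that many of the feature maps look alike.
(Having many similar-looking feature maps does not always mean good performance. It depends on the case.)

So instead of laboriously adding more convolution filters just to produce look-alikes,
the idea is to produce them by rubbing (convolving) a neighboring feature map so that the result resembles it.
(This uses depthwise convolution, but explaining that would take too long… so I'll skip it.)

So.. since the Ghost module's strength is producing many similar feature maps with little computation, I'm going to adapt this low-cost Ghost module for gesture recognition.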
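
To make "little computation" concrete, here is a minimal sketch that counts convolution weights for an ordinary convolution and for a Ghost-style pair (a primary convolution plus a depthwise "cheap" convolution). The function names are hypothetical, and BatchNorm parameters are ignored; the channel arithmetic follows the `ratio` and `dw_size` parameters of the GhostModule code in this post.

```python
import math


def plain_conv2d_params(in_ch: int, out_ch: int, kernel_size: int) -> int:
    """Weights in an ordinary bias-free Conv2d."""
    return in_ch * out_ch * kernel_size * kernel_size


def ghost_conv2d_params(in_ch: int, out_ch: int, kernel_size: int = 1,
                        ratio: int = 2, dw_size: int = 3) -> int:
    """Weights in a Ghost-style pair: primary conv + depthwise 'cheap' conv."""
    init_ch = math.ceil(out_ch / ratio)   # features computed the expensive way
    new_ch = init_ch * (ratio - 1)        # "ghost" features derived from them
    primary = in_ch * init_ch * kernel_size * kernel_size
    cheap = new_ch * dw_size * dw_size    # depthwise: one input channel per filter
    return primary + cheap


if __name__ == "__main__":
    print(plain_conv2d_params(64, 128, 1))  # 8192
    print(ghost_conv2d_params(64, 128))     # 4672
```

For 64 input and 128 output channels, the Ghost-style pair needs a little over half the weights of a plain 1x1 convolution, because half of the output channels come from the depthwise step.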

Code I wrote in the past

class GhostModule(torch.nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size=1, ratio=2, dw_size=3, stride=1, use_activation=True, activation=torch.nn.ReLU):
        super(GhostModule, self).__init__()
        self.oup = out_channels
        init_channels = math.ceil(out_channels / ratio)
        new_channels = init_channels*(ratio-1)

        self.primary_conv = torch.nn.Sequential(
            torch.nn.Conv2d(in_channels, init_channels, kernel_size, stride, kernel_size//2, bias=False),
            torch.nn.BatchNorm2d(init_channels),
            activation(inplace=True) if use_activation else torch.nn.Sequential(),
        )

        self.cheap_operation = torch.nn.Sequential(
            torch.nn.Conv2d(init_channels, new_channels, dw_size, 1, dw_size//2, groups=init_channels, bias=False),
            torch.nn.BatchNorm2d(new_channels),
            activation(inplace=True) if use_activation else torch.nn.Sequential(),
        )

    def forward(self, x):
        x1 = self.primary_conv(x)
        x2 = self.cheap_operation(x1)
        out = torch.cat([x1,x2], dim=1)
        return out[:,:self.oup,:,:]

Modified code, currently being tested

import torch
import math


class GhostLayer3D(torch.nn.Module):
    def __init__(self,
                 in_channels: int,
                 out_channels: int,
                 ratio: int = 2,
                 stride=1,  # int, or a (T, H, W) tuple
                 use_activation: bool = True,
                 activation=torch.nn.ReLU):
        super().__init__()

        if ratio < 1:
            raise ValueError("ratio must be >= 1")
        self.oup = out_channels


        if isinstance(stride, int):
            stride_3d = (stride, stride, stride)
        else:
            if len(stride) != 3:
                raise ValueError("stride must be int or tuple of length 3")
            stride_3d = tuple(int(s) for s in stride)

        init_channels = int(math.ceil(out_channels / ratio))
        new_channels = init_channels * (ratio - 1)

        # Primary conv: mix across both time and space (3x3x3 kernel)
        self.primary_conv = torch.nn.Sequential(
            torch.nn.Conv3d(
                in_channels=in_channels,
                out_channels=init_channels,
                kernel_size=(3, 3, 3),
                stride=stride_3d,
                padding=(1, 1, 1),
                bias=False
            ),
            torch.nn.BatchNorm3d(init_channels),
            activation(inplace=True) if use_activation else torch.nn.Identity(),
        )

        # Cheap operation: mix across space only (1x3x3 kernel)
        if new_channels > 0:
            self.cheap_operation = torch.nn.Sequential(
                torch.nn.Conv3d(
                    in_channels=init_channels,
                    out_channels=new_channels,
                    kernel_size=(1, 3, 3),
                    stride=(1, 1, 1),
                    padding=(0, 1, 1),
                    groups=init_channels,     # depthwise convolution
                    bias=False
                ),
                torch.nn.BatchNorm3d(new_channels),
                activation(inplace=True) if use_activation else torch.nn.Identity(),
            )
        else:
            self.cheap_operation = None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1 = self.primary_conv(x) 
        if self.cheap_operation is None:
            out = x1
        else:
            x2 = self.cheap_operation(x1) 
            out = torch.cat([x1, x2], dim=1) 

        return out[:, :self.oup, :, :, :]
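
The channel bookkeeping in `GhostLayer3D` can be checked without running PyTorch. This is a small sketch with a hypothetical helper name; it only mirrors the `init_channels` / `new_channels` arithmetic and the final slice `out[:, :self.oup]` from the code above.

```python
import math


def ghost_channel_split(out_channels: int, ratio: int = 2):
    """Return (primary, ghost, trimmed) channel counts for one Ghost layer."""
    if ratio < 1:
        raise ValueError("ratio must be >= 1")
    init_channels = math.ceil(out_channels / ratio)  # from the 3x3x3 primary conv
    new_channels = init_channels * (ratio - 1)       # from the depthwise 1x3x3 conv
    # Channels that the final slice drops after concatenation
    trimmed = init_channels + new_channels - out_channels
    return init_channels, new_channels, trimmed


if __name__ == "__main__":
    print(ghost_channel_split(24))     # (12, 12, 0)
    print(ghost_channel_split(25))     # (13, 13, 1)
    print(ghost_channel_split(16, 1))  # (16, 0, 0)
```

With `ratio=1` there are no ghost channels, which is the case where the layer sets `cheap_operation` to `None`. When `out_channels` is not divisible by `ratio`, the concatenation overshoots and the slice trims the surplus.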

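The core challenge in video gesture recognition

Unlike a single still image, a gesture can only be predicted by looking at the whole bundle of still frames flowing from T0 to Tn.
(Examples: waving a hand, or raising a hand above the head; one complete set of dynamic movement has to be analyzed.)

So while an ordinary 2D convolution moves only along the width and height of an R, G, B image, a 3D convolution has to understand the time axis T, so its kernel is designed to move along the T axis as well.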

2D convolution example: [GIF: 2D convolution]

3D convolution example: [GIF: 3D convolution]

I grabbed these GIFs from around the internet.
As you can see, a 3D convolution kernel also moves freely along the extra direction I define (the z axis in the GIF).
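
The sliding along the time axis can be sketched in plain Python. The snippet below is a deliberately naive, single-channel 3D convolution with no padding or stride (real frameworks are far more optimized), and the function name is hypothetical. It uses a kernel that spans two frames, so each output value is the difference between consecutive frames, which is the kind of motion cue a 2D convolution cannot see.

```python
def conv3d_valid(volume, kernel):
    """Naive single-channel 3D convolution (no padding, stride 1).

    volume: nested lists indexed [t][y][x]; kernel: nested lists [dt][dy][dx].
    """
    T, H, W = len(volume), len(volume[0]), len(volume[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kt + 1):          # the kernel slides along time...
        plane = []
        for y in range(H - kh + 1):      # ...and along height...
            row = []
            for x in range(W - kw + 1):  # ...and along width
                acc = 0.0
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += volume[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


# Four 2x2 frames whose brightness is t*t, and a kernel that computes
# frame[t+1] - frame[t] at every pixel.
frames = [[[t * t, t * t], [t * t, t * t]] for t in range(4)]
temporal_diff = [[[-1.0]], [[1.0]]]
motion = conv3d_valid(frames, temporal_diff)
```

Here `motion` has three time steps (one fewer than the input) and holds the frame-to-frame differences 1, 3, and 5.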

Gesture recognition model code using the Ghost module

import torch

from layers.ghost_layer3d import GhostLayer3D

class GhostNet3D(torch.nn.Module):
    def __init__(self, 
                 in_channels, 
                 class_num,
                 ):
        super(GhostNet3D, self).__init__()


        # Time axis starts at 16 frames
        # Spatial size 128x64
        self.stem = torch.nn.Sequential(
            GhostLayer3D(in_channels=in_channels, out_channels=16)
        )

        # Time axis compressed to 8 frames
        # Spatial size compressed to 64x32
        self.layer1 = torch.nn.Sequential(
            GhostLayer3D(in_channels=16, out_channels=24, stride=2),
            GhostLayer3D(in_channels=24, out_channels=24)
        )


        # Time axis kept at 8 frames
        # Spatial size 32x16
        self.layer2 = torch.nn.Sequential(
            GhostLayer3D(in_channels=24, out_channels=40, stride=(1,2,2)),
            GhostLayer3D(in_channels=40, out_channels=40),
        )


        # Time axis kept at 8 frames
        # Spatial size 16x8
        self.layer3 = torch.nn.Sequential(
            GhostLayer3D(in_channels=40, out_channels=80, stride=(1,2,2)),
            GhostLayer3D(in_channels=80, out_channels=80),
            GhostLayer3D(in_channels=80, out_channels=80),
            GhostLayer3D(in_channels=80, out_channels=80),
        )

        # Time axis kept at 8 frames
        # Spatial size 8x4
        self.layer4 = torch.nn.Sequential(
            GhostLayer3D(in_channels=80, out_channels=80, stride=(1,2,2)),
            GhostLayer3D(in_channels=80, out_channels=80),
            GhostLayer3D(in_channels=80, out_channels=80),
            GhostLayer3D(in_channels=80, out_channels=80),
        )

        self.gap = torch.nn.AdaptiveAvgPool3d((1,1,1))
        self.fc = torch.nn.Linear(80, class_num)
        

    def forward(self, x):

        ## For debugging: confirm the time axis starts at 16
        B, C, T, H, W = x.shape

        x = self.stem(x)
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)

        x = self.gap(x)
        x = x.view(x.size(0), -1)
        x = self.fc(x)

        return x
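
The shape comments in the model can be verified with the standard convolution output-size formula, floor((n + 2p - k) / s) + 1. The sketch below assumes the 3x3x3 kernel with padding 1 used by the primary convolution in `GhostLayer3D`, and only tracks the first layer of each stage, since the remaining layers use stride 1 and keep the shape. The helper names are hypothetical.

```python
def conv_out(size: int, stride: int, kernel: int = 3, padding: int = 1) -> int:
    """Output length of a convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_shapes(t: int, h: int, w: int):
    """Follow (T, H, W) through the stride of each stage's first layer."""
    stage_strides = {
        "stem":   (1, 1, 1),
        "layer1": (2, 2, 2),
        "layer2": (1, 2, 2),
        "layer3": (1, 2, 2),
        "layer4": (1, 2, 2),
    }
    shapes = {}
    for name, (st, sh, sw) in stage_strides.items():
        t, h, w = conv_out(t, st), conv_out(h, sh), conv_out(w, sw)
        shapes[name] = (t, h, w)
    return shapes


if __name__ == "__main__":
    # 16 frames of 128x64 (width x height) images
    for stage, shape in trace_shapes(16, 64, 128).items():
        print(stage, shape)
```

For a 16-frame clip of 128x64 images, the trace ends at (T, H, W) = (8, 4, 8), matching the "8 frames, 8x4" comment on the last stage.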

Training code and dataset links

Dataset: 20bn-jester | Kaggle
Training script: HGR/python/train.py at main · gellston/HGR · GitHub

Maybe because it's a dataset of continuous gesture clips, training is extremely slow.. I'll wait until tomorrow morning.

gellston · January 3, 2026, 4:19 PM

While digging through my computer I found a demo video of the gesture recognition program I built at work, so I'm posting it too.

gellston · January 4, 2026, 5:42 AM

My cat shut the machine off in the middle of training T_T
Two epochs take more than 2 hours… it looks like it will wrap up tomorrow.

뚜기기 · January 5, 2026, 2:05 AM

This doesn't feel like hobby-level tinkering; it feels like you could work as an AI developer. Maybe it's time to switch careers… lol

gellston · January 5, 2026, 2:09 AM

Those folks have a world of their own.. Thank you for the kind words!

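gellston · January 6, 2026, 10:56 AM

After training, the training accuracy keeps rising while the validation accuracy gradually falls, so it looks like overfitting.
I built the model, but come to think of it.. I didn't include any technique to counter overfitting.
The inspection models I usually develop at work are small and simple, so I don't think I ran into this problem there,
but it seems to show up when training a relatively large model meant for a large dataset.

Phew… I'll have to restart training from scratch. I need to think about it.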

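gellston · January 7, 2026, 12:57 PM

This is the fourth training run after adjusting parameters, and validation accuracy has passed 90.. I'll watch it one more night, and if it looks like it has plateaued, I'll stop and move on to the wrapping work.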
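
"Stop if it looks like it has plateaued" can also be automated. The following is only a sketch of one common rule (patience-based early stopping), not what the training script in this thread does; the function name, `patience`, and `min_delta` are all hypothetical.

```python
def has_plateaued(val_acc_history, patience: int = 3, min_delta: float = 0.0) -> bool:
    """True if the last `patience` epochs did not beat the earlier best."""
    if len(val_acc_history) <= patience:
        return False  # not enough history to judge yet
    best_before = max(val_acc_history[:-patience])
    recent_best = max(val_acc_history[-patience:])
    return recent_best <= best_before + min_delta


if __name__ == "__main__":
    print(has_plateaued([0.80, 0.85, 0.90, 0.90, 0.89, 0.90]))  # True
    print(has_plateaued([0.80, 0.85, 0.88, 0.90]))              # False
```

Checking this after every epoch and saving the weights of the best epoch also guards against the overfitting pattern where validation accuracy starts to fall while training accuracy keeps rising.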

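gellston · January 8, 2026, 6:04 PM

After a long, long, painful training period…

I wrote test code in Python and ran it, and it more or less works..
The accuracy is a bit disappointing, but.. this is exhausting on a laptop (it takes far too long T_T)

※ The probabilities spiked in many places, so I smoothed them out with an EMA (Exponential Moving Average). It was late at night and I couldn't be bothered, so I got help from ChatGPT for the code. (1 loss)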
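
How an EMA dampens a one-frame spike can be seen on a plain list of numbers. This is a minimal sketch with a hypothetical function name; it uses the same update rule as the test script, `ema = alpha * ema + (1 - alpha) * value`, applied to a single probability instead of a whole probability vector.

```python
def ema_smooth(values, alpha: float = 0.7):
    """Exponential moving average; larger alpha means heavier smoothing."""
    smoothed = []
    state = None
    for v in values:
        # The first value initializes the state; later values are blended in
        state = v if state is None else alpha * state + (1.0 - alpha) * v
        smoothed.append(state)
    return smoothed


if __name__ == "__main__":
    # A single-frame spike to 1.0 is reduced to 0.25 with alpha=0.75
    print(ema_smooth([0.0, 0.0, 1.0, 0.0], alpha=0.75))  # [0.0, 0.0, 0.25, 0.1875]
```

The trade-off is latency: the heavier the smoothing, the more frames a real gesture needs before its probability rises above the others.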

github.com/gellston/HGR

python/test.py

main


Test video

Code

import torch
import os
import cv2
import numpy as np
from collections import deque

from model.ghostnet3d import GhostNet3D
from model.softmax_model import SoftmaxModel

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Current device: {device}")

image_width = 128
image_height = 64
image_channel = 3
class_num = 27
frames = 16

weight_path = "C://github//HGR//python//results//weights.pth"

model = GhostNet3D(in_channels=image_channel, class_num=class_num).to(device)
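if not os.path.exists(weight_path):
    # Fail loudly instead of silently testing an untrained model
    raise FileNotFoundError(f"Weights not found: {weight_path}")
state_dict = torch.load(weight_path, map_location=device)
model.load_state_dict(state_dict)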

net = SoftmaxModel(backbone=model).to(device)
net.eval()

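# ===== Settings =====
ACCUM_FRAMES = 25       # buffer length; at a reported 30 fps with TARGET_FPS=17 and frames=16, sampling needs 31 frames, so 32 or more is recommended
TARGET_FPS = 17
STRIDE = 1              # run inference every STRIDE frames
EMA_ALPHA = 0.7        # 0.7~0.95 (larger = more stable, but slower to react)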

GESTURES = [
    "Doing other things","No gesture","Drumming Fingers","Pulling Hand In","Pulling Two Fingers In",
    "Pushing Hand Away","Pushing Two Fingers Away","Rolling Hand Backward","Rolling Hand Forward",
    "Shaking Hand","Sliding Two Fingers Down","Sliding Two Fingers Left","Sliding Two Fingers Right",
    "Sliding Two Fingers Up","Stop Sign","Swiping Down","Swiping Left","Swiping Right","Swiping Up",
    "Thumb Down","Thumb Up","Turning Hand Clockwise","Turning Hand Counterclockwise",
    "Zooming In With Full Hand","Zooming In With Two Fingers","Zooming Out With Full Hand",
    "Zooming Out With Two Fingers",
]

def preprocess(frame_bgr):
    x = cv2.resize(frame_bgr, (image_width, image_height))
    x = cv2.cvtColor(x, cv2.COLOR_BGR2RGB).astype(np.float32) / 255.0
    return x  # (H,W,C)

def sample_by_target_fps(buf_frames_rgb, cam_fps, target_fps, out_len):
    L = len(buf_frames_rgb)
    if L < out_len:
        return None

    if cam_fps is None or cam_fps <= 1:
        idx = np.linspace(0, L - 1, out_len).round().astype(int)
        return [buf_frames_rgb[i] for i in idx]

    step = max(1, int(round(cam_fps / target_fps)))
    need = 1 + (out_len - 1) * step
    if L < need:
        return None

    start = L - need
    idx = start + np.arange(out_len) * step
    return [buf_frames_rgb[i] for i in idx]

@torch.no_grad()
def infer_probs(net, sampled_frames_rgb):
    clip = np.stack(sampled_frames_rgb, axis=0)        # (T,H,W,C)
    clip = np.transpose(clip, (3, 0, 1, 2))            # (C,T,H,W)
    x = torch.from_numpy(clip).unsqueeze(0).to(device) # (1,C,T,H,W)

    probs = net(x)
    probs = probs[0] if isinstance(probs, (list, tuple)) else probs
    return probs.squeeze(0)  # (C,)

# ===== Main =====
cap = cv2.VideoCapture(0, cv2.CAP_DSHOW)
if not cap.isOpened():
    raise RuntimeError("Could not open the camera. Try a different camera index (0/1/2...).")

cam_fps = cap.get(cv2.CAP_PROP_FPS)
print(f"Camera FPS: {cam_fps if cam_fps else 'Unknown'}")

buf = deque(maxlen=ACCUM_FRAMES)  # sliding buffer that advances one frame at a time
ema = None                        # EMA state (probability vector)
frame_count = 0

print("Starting webcam gesture test (press q to quit)")

while True:
    ok, frame = cap.read()
    if not ok:
        print("Failed to read a frame. Exiting.")
        break

    cv2.imshow("Webcam", frame)
    buf.append(preprocess(frame))
    frame_count += 1

    if len(buf) < ACCUM_FRAMES:
        if (cv2.waitKey(1) & 0xFF) == ord('q'):
            break
        continue

    if frame_count % STRIDE == 0:
        sampled = sample_by_target_fps(list(buf), cam_fps, TARGET_FPS, frames)
        if sampled is not None:
            probs = infer_probs(net, sampled)  # (C,)

            # ===== Apply EMA =====
            if ema is None:
                ema = probs.detach().clone()
            else:
                ema = EMA_ALPHA * ema + (1.0 - EMA_ALPHA) * probs

            pred = int(torch.argmax(ema).item())
            conf = float(torch.max(ema).item())
            name = GESTURES[pred] if pred < len(GESTURES) else str(pred)
            print(f"{name}  conf(EMA)={conf:.3f}")

    if (cv2.waitKey(1) & 0xFF) == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()
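
One detail of `sample_by_target_fps` is worth checking by hand: it returns `None` until the buffer holds `1 + (out_len - 1) * step` frames, where `step` is the rounded ratio of camera FPS to target FPS. The sketch below reproduces just that arithmetic; the helper name is hypothetical.

```python
def frames_needed(cam_fps: float, target_fps: float, out_len: int) -> int:
    """Buffered frames required before fps-based sampling can succeed."""
    step = max(1, int(round(cam_fps / target_fps)))
    return 1 + (out_len - 1) * step


if __name__ == "__main__":
    print(frames_needed(30, 17, 16))  # 31
    print(frames_needed(60, 17, 16))  # 61
    print(frames_needed(15, 17, 16))  # 16
```

So if the camera reports 30 fps, a buffer of `ACCUM_FRAMES = 25` never reaches the 31 frames required and no prediction is printed. If the driver reports an FPS of 1 or less (for example 0), the function falls back to evenly spaced sampling over the whole buffer and 25 frames are enough.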

gellston · January 8, 2026, 6:08 PM

Oh.. I've uploaded the trained model to git.

gellston · January 9, 2026, 6:35 PM

README update

I'll update the .NET NuGet package soon.

C++ NuGet test

#include <cstring>   // std::memcpy
#include <iostream>
#include <opencv2/opencv.hpp>


#include <hgr/hgr.h>
#include <hgr/clipSampler.h>

int main()
{

    try {

        auto memoryPool = hgrapi::v1::memoryPool::create();

        auto hgr = hgrapi::v1::hgr::create();
        hgr->setup("C://github//HGR//python//results//model.onnx", hgrapi::v1::device::cuda);
        hgr->setEmaAlpha(0.2f);

        auto sampler = hgrapi::v1::clipSampler::create();
        sampler->setMaxFrames(40);
        sampler->setSampleFrames(16);

        cv::VideoCapture cap;
        cap.open(0);

        if (!cap.isOpened()) {
            std::cerr << "Failed to open VideoCapture.\n";
            return 1;
        }

        cv::Mat frame;

        while (true) {

            if (!cap.read(frame) || frame.empty()) {
                std::cerr << "End of stream or failed to read frame.\n";
                break;
            }

            auto dlImage = hgrapi::v1::image::create(frame.cols, frame.rows, 3, memoryPool);
            std::memcpy(dlImage->data(), frame.data, dlImage->size());
            auto resizeImage = hgrapi::v1::image::resize(dlImage, 128, 64);
            sampler->append(resizeImage);


            auto samples = sampler->requestSampling();
            auto result = hgr->predict(samples);


            std::cout << "name : " << result.name << " prob : " << result.prob << std::endl;

            cv::imshow("capture", frame);
            cv::waitKey(1);
        }

    }
    catch (const std::exception& ex) {
        std::cout << ex.what() << std::endl;
    }

    return 0;

}

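Test video

Why is it faster than Python?

The language difference may play a part, but.. I don't think it's the major factor.. In Python I tested with PyTorch,
whereas the C++ side runs the model converted to ONNX, and I think node fusion during that conversion compressed the stretches of linear operations. ("Elided" is probably the more accurate word.)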
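
Which fusions were actually applied to this model is not confirmed here. One well-known fusion of this kind is folding an inference-mode BatchNorm into the convolution before it, which fits a network built from Conv3d + BatchNorm3d pairs. The scalar sketch below (hypothetical function names, one weight instead of a full kernel) shows why the two steps collapse into a single multiply-add with identical output.

```python
import math


def conv_then_bn(x, w, b, gamma, beta, mean, var, eps=1e-5):
    """Two separate steps: y = w*x + b, then batch norm in inference mode."""
    y = w * x + b
    return gamma * (y - mean) / math.sqrt(var + eps) + beta


def fold_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold the batch-norm statistics into a single (weight, bias) pair."""
    scale = gamma / math.sqrt(var + eps)
    return w * scale, (b - mean) * scale + beta


# Arbitrary example values, chosen only for illustration
x, w, b = 2.0, 0.5, 0.1
bn = dict(gamma=1.5, beta=-0.2, mean=0.3, var=4.0)

fused_w, fused_b = fold_bn(w, b, **bn)
unfused = conv_then_bn(x, w, b, **bn)
fused = fused_w * x + fused_b  # one multiply-add instead of two layers
```

Because the fused weights are computed once, ahead of time, the BatchNorm layers cost nothing at inference; they are effectively elided from the graph.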

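gellston · January 11, 2026, 4:13 PM

Wrapping up…

As with the low-light enhancement NuGet work, I built a .NET NuGet package and published it.
The demo app is available from Release.

Thoughts

NuGet is really hard.
A NuGet package that mixes C++ and C# DLLs is especially tricky to build.
And while developing this, I see acquaintances knocking things together in no time with LLMs,
and it leaves me a bit deflated… I wonder whether this kind of work means anything. Phew!
I really think I need to study LLMs.. and agents..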