Composed Image Retrieval (CIR) systems are crucial because they let users find specific images by combining a visual reference with descriptive text, overcoming the limitations of traditional text-only search. In this thesis, we propose a CIR system built on the Querying-Transformer (Qformer). The Qformer integrates image and text data through a transformer-based architecture, capturing the complex relationships between the two modalities. By incorporating an Image-Text Matching (ITM) loss, our system significantly improves the accuracy of image-text matching, ensuring close alignment between visual and textual representations. We also apply residual learning within the Qformer model to preserve essential visual information, maintaining the quality and features of the original images throughout the learning process.

To validate our approach, we conducted experiments on the FashionIQ and CIRR datasets. The results show that the proposed system significantly outperforms existing models, achieving higher recall across the various categories of both benchmarks. These results demonstrate the system's potential in practical applications, offering robust improvements in the precision and relevance of image retrieval tasks.
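The two mechanisms named above can be illustrated with a minimal NumPy sketch. This is an assumption-laden toy, not the thesis implementation: `residual_fuse` stands in for the residual connection that adds the Qformer's output back onto the original visual features, and `itm_loss` is the standard binary cross-entropy form of an Image-Text Matching objective (label 1 for a true image-text pair, 0 for a mismatched negative). All function names, shapes, and values here are illustrative.

```python
import numpy as np

def residual_fuse(visual_feat, qformer_out):
    # Residual connection (illustrative): add the Qformer's transformed
    # output back onto the original visual features, so the original
    # image information is preserved through the fusion step.
    return visual_feat + qformer_out

def itm_loss(logits, labels):
    # ITM as binary cross-entropy over match logits:
    # labels are 1 for matched image-text pairs, 0 for negatives.
    probs = 1.0 / (1.0 + np.exp(-logits))  # sigmoid
    eps = 1e-12                            # numerical safety for log
    return -np.mean(labels * np.log(probs + eps)
                    + (1.0 - labels) * np.log(1.0 - probs + eps))

# Toy batch: 4 image-text pairs, alternating matched / mismatched.
logits = np.array([2.0, -1.5, 3.0, -2.0])   # match-classifier scores
labels = np.array([1.0,  0.0, 1.0,  0.0])   # ground-truth pairing

rng = np.random.default_rng(0)
visual = rng.normal(size=(4, 8))            # toy visual features
delta = 0.1 * rng.normal(size=(4, 8))       # toy Qformer output
fused = residual_fuse(visual, delta)        # same shape as `visual`

print(fused.shape)                          # (4, 8)
print(round(itm_loss(logits, labels), 3))
```

A well-calibrated classifier drives the confident, correctly-signed logits above toward a small loss; flipping a label (simulating a mismatched pair scored as a match) increases it, which is the signal the ITM objective uses to tighten image-text alignment.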