Model card for CLAP

Model card for CLAP: Contrastive Language-Audio Pretraining

TL;DR

The abstract of the paper states that:

Contrastive learning has shown remarkable success in the field of multimodal representation learning. In this paper, we propose a pipeline of contrastive language-audio pretraining to develop an audio representation by combining audio data with natural language descriptions. To accomplish this target, we first release LAION-Audio-630K, a large collection of 633,526 audio-text pairs from different data sources. Second, we construct a contrastive language-audio pretraining model by considering different audio encoders and text encoders. We incorporate the feature fusion mechanism and keyword-to-caption augmentation into the model design to further enable the model to process audio inputs of variable lengths and enhance the performance. Third, we perform comprehensive experiments to evaluate our model across three tasks: text-to-audio retrieval, zero-shot audio classification, and supervised audio classification. The results demonstrate that our model achieves superior performance in text-to-audio retrieval task. In audio classification tasks, the model achieves state-of-the-art performance in the zero-shot setting and is able to obtain performance comparable to models’ results in the non-zero-shot setting. LAION-Audio-630K and the proposed model are both available to the public.
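The contrastive objective mentioned in the abstract can be illustrated with a small sketch. What follows is a generic symmetric contrastive (InfoNCE-style) loss over paired audio and text embeddings, written in plain Python. It is an illustration under assumptions, not the authors' training code: the function names, the fixed `temperature=0.07`, and the toy embeddings are all hypothetical.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of a list of floats."""
    m = max(row)
    log_sum = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_sum for v in row]


def symmetric_contrastive_loss(audio_embeds, text_embeds, temperature=0.07):
    """Audio i is assumed to be paired with text i; vectors are assumed L2-normalized."""
    n = len(audio_embeds)
    # sim[i][j] = dot(audio_i, text_j) / temperature (temperature value is an assumption)
    sim = [
        [sum(a * t for a, t in zip(audio, text)) / temperature for text in text_embeds]
        for audio in audio_embeds
    ]
    # Audio-to-text direction: each row should put its mass on the diagonal entry
    a2t = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    # Text-to-audio direction: same idea applied to the columns
    columns = [[sim[i][j] for i in range(n)] for j in range(n)]
    t2a = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (a2t + t2a)


# Toy embeddings: correctly paired vs. deliberately mismatched
audio = [[1.0, 0.0], [0.0, 1.0]]
aligned = symmetric_contrastive_loss(audio, [[1.0, 0.0], [0.0, 1.0]])
shuffled = symmetric_contrastive_loss(audio, [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

The loss is small when each audio clip is most similar to its own caption and large when the pairs are mismatched, which is the behavior a contrastive objective of this kind rewards.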

Usage

You can use this model for zero-shot audio classification or to extract audio and/or text features.

Uses

Perform zero-shot audio classification

Using `pipeline`

from datasets import load_dataset
from transformers import pipeline
# Load ESC-50 and take the raw waveform of the last training example
dataset = load_dataset("ashraq/esc50")
audio = dataset["train"]["audio"][-1]["array"]
# Score the clip against free-form candidate labels
audio_classifier = pipeline(task="zero-shot-audio-classification", model="laion/clap-htsat-unfused")
output = audio_classifier(audio, candidate_labels=["Sound of a dog", "Sound of a vacuum cleaner"])
print(output)
# Example output (scores are approximate):
# [{"score": 0.999, "label": "Sound of a dog"}, {"score": 0.001, "label": "Sound of a vacuum cleaner"}]
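To make the scoring step concrete, here is a minimal sketch of how zero-shot scores of this shape can be derived from embeddings: compare one audio embedding with one text embedding per candidate label, then apply a softmax. It assumes cosine similarity and a scalar `logit_scale`. The helper names and the toy three-dimensional embeddings are hypothetical, and this is not a description of the pipeline's internals.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_scores(audio_embed, label_embeds, labels, logit_scale=1.0):
    """Rank candidate labels by similarity to the audio embedding."""
    logits = [logit_scale * cosine_similarity(audio_embed, e) for e in label_embeds]
    probs = softmax(logits)
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return [{"score": score, "label": label} for label, score in ranked]


# Toy embeddings (made up for illustration, not produced by the model)
audio_embed = [0.9, 0.1, 0.0]
label_embeds = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
labels = ["Sound of a dog", "Sound of a vacuum cleaner"]
result = zero_shot_scores(audio_embed, label_embeds, labels, logit_scale=10.0)
print(result[0]["label"])
```

Because the scores are a softmax over the candidate labels only, they sum to one and depend on which labels are offered, so a high score means "best among these candidates" rather than an absolute detection.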

Run the model:

You can also get the audio and text embeddings using `ClapModel`.

Run the model on CPU:

from datasets import load_dataset
from transformers import ClapModel, ClapProcessor
# Small LibriSpeech sample, used here only as example audio
librispeech_dummy = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
audio_sample = librispeech_dummy[0]
model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")
# Convert the raw waveform into model inputs, then embed it
inputs = processor(audios=audio_sample["audio"]["array"], return_tensors="pt")
audio_embed = model.get_audio_features(**inputs)
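The snippet above only computes audio embeddings. The text side can be sketched in the same way, assuming the same checkpoint, that `ClapProcessor` tokenizes strings passed through its `text` argument, and that `ClapModel.get_text_features` accepts the resulting inputs. The helper names `embed_texts` and `demo` are hypothetical, and the exact return type may differ between `transformers` versions.

```python
def embed_texts(model, processor, texts):
    """Return the text features for a list of strings (hypothetical helper)."""
    # padding=True pads the batch to a common length so it can be stacked
    inputs = processor(text=texts, return_tensors="pt", padding=True)
    return model.get_text_features(**inputs)


def demo():
    # Imports live inside the function so that defining the helper above
    # does not require transformers to be installed.
    from transformers import ClapModel, ClapProcessor

    model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
    processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")
    text_embed = embed_texts(model, processor, ["Sound of a dog", "Sound of a vacuum cleaner"])
    print(text_embed)
```

Calling `demo()` downloads the checkpoint on first use. The resulting text features can then be compared with `audio_embed` to score captions against a clip.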

Run the model on GPU:

from datasets import load_dataset
from transformers import ClapModel, ClapProcessor
librispeech_dummy = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
audio_sample = librispeech_dummy[0]
# .to(0) moves the model to the first CUDA device
model = ClapModel.from_pretrained("laion/clap-htsat-unfused").to(0)
processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")
# The inputs must be on the same device as the model
inputs = processor(audios=audio_sample["audio"]["array"], return_tensors="pt").to(0)
audio_embed = model.get_audio_features(**inputs)
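The GPU snippet hard-codes device index `0`, which raises an error on a machine without a CUDA GPU. One possible way to let the same script run on either kind of machine is to choose the device at runtime. The helper name `pick_device` is hypothetical, and this is a sketch rather than part of the original card.

```python
def pick_device():
    """Return "cuda:0" when a CUDA GPU is usable, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed: report "cpu" so the helper itself never raises
        return "cpu"
    return "cuda:0" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(device)
# The model and inputs can then be moved with .to(device) instead of .to(0):
#   model = ClapModel.from_pretrained("laion/clap-htsat-unfused").to(device)
#   inputs = processor(audios=..., return_tensors="pt").to(device)
```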

Citation

If you are using this model for your work, please consider citing the original paper:

@misc{https://doi.org/10.48550/arxiv.2211.06687,
  doi = {10.48550/ARXIV.2211.06687},
  url = {https://arxiv.org/abs/2211.06687},
  author = {Wu, Yusong and Chen, Ke and Zhang, Tianyu and Hui, Yuchen and Berg-Kirkpatrick, Taylor and Dubnov, Shlomo},
  keywords = {Sound (cs.SD), Audio and Speech Processing (eess.AS), FOS: Computer and information sciences, FOS: Electrical engineering, electronic engineering, information engineering},
  title = {Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation},
  publisher = {arXiv},
  year = {2022},
  copyright = {Creative Commons Attribution 4.0 International}
}
