Publications
Conference Proceedings
ProgGen: Generating Named Entity Recognition Datasets Step-by-step with Self-Reflexive Large Language Models
Annual Meeting of the Association for Computational Linguistics (ACL) Findings, 2024.
We introduce a stage-wise framework that generates diverse and accurate NER training datasets given only under 10 annotated instances. On 4 datasets we experiment on, DeBERTa models fine-tuned on the generated datasets out-perform/perform similarly to GPT-3.5 few-shot prompting.

Unveiling the Spectrum of Data Contamination in Language Model: A Survey from Detection to Remediation
Annual Meeting of the Association for Computational Linguistics (ACL) Findings, 2024.

Label Agnostic Pre-training for Zero-shot Text Classification
Annual Meeting of the Association for Computational Linguistics (ACL) Findings, 2023.
We benchmark 3 transformer zero-shot text classification paradigms (cross-encoding, dual-encoding, generative) on 18 classification datasets among 3 domains with unseen labels during training. We found that integrating the dataset domain into training improves accuracy by 1% on average.

Preprints
PEFT-U: Parameter-Efficient Fine-Tuning for User Personalization
We compile a benchmark of subjective text classification tasks for personalization, where user preferences may differ for the same input. The benchmark contains 11 datasets spanning hate-speech, sentiment and humor, with high inter-annotator disagreement.