GitHub

LinkedIn

🦋 smfsamir.bsky.social

📧 [email protected]


I’m a Ph.D. candidate in the Natural Language Processing Group at the University of British Columbia, where I’m advised by Jian Zhu and Vered Shwartz. My research is supported by a Doctoral Scholarship from the Natural Sciences and Engineering Research Council of Canada (NSERC) and an award from the Public Scholars Initiative. I’m broadly interested in applied and theoretical aspects of multilingual natural language processing, and I work with both text and audio modalities. Before coming to UBC, I completed my MSc in the Computational Linguistics group at the University of Toronto, where I also earned my Bachelor of Science (Honours Computer Science).

For the 2023-2024 academic year, I was a visiting researcher at the Paul G. Allen Center for Computer Science & Engineering where I had the good fortune of working with Yulia Tsvetkov, Chan Park, and Sachin Kumar. I was also a research intern at Ai2 Aristo.

I split my time between the mountains (Vancouver and Seattle 🏔) and the concrete (Toronto 🌆). Currently: Toronto.


Recent news

March 2025 — Giving a guest lecture in Vered Shwartz’s Natural Language Processing class. Will upload slides and a recording soon…

Feb 2025 — Last paper of my PhD accepted to TACL! 🇦🇹

Feb 2025 — Gave a talk about my forthcoming dissertation at UCSD Linguistics 🏄🏽

Feb 2025 — Awarded the NSERC Postdoctoral Fellowship 🇨🇦

Feb 2025 — To be profiled in a forthcoming issue of UBC Science’s FOCUS magazine about our work on cross-linguistic information gaps in Wikipedia.

Jan 2025 — My mentor Anjalie Field was profiled about our work on cross-linguistic information gaps in Wikipedia; check it out here!

Selected publications

Efficient Identification of Low-Quality Language Partitions in Multilingual Datasets: A Case Study on a Large-Scale Multilingual Audio Dataset

TACL [preprint]

Farhan Samir, Emily P. Ahn, Shreya Prakash, Márton Sóskuthy, Vered Shwartz, Jian Zhu

Locating Information Gaps and Narrative Inconsistencies Across Languages: A Case Study of LGBT People Portrayals on Wikipedia

EMNLP 2024 [poster][proceedings][citation][GitHub][press]

Farhan Samir, Chan Young Park, Anjalie Field, Vered Shwartz, Yulia Tsvetkov

The taste of IPA: Towards open-vocabulary keyword matching and forced alignment in any language

NAACL 2024

Jian Zhu, Changbing Yang, Farhan Samir, Jahurul Islam

Understanding compositional data augmentation in typologically diverse morphological inflection

EMNLP 2023 [oral presentation][🏆 outstanding paper award][preprint]

Farhan Samir, Miikka Silfverberg


Industry research experience

Summer 2024, Seattle WA

PhD Research Intern, Aristo team at Ai2. Mentors: Bodhisattwa Prasad Majumder, Bhavana Dalvi, Harshit Surana, Ben Bogin, Lucy Lu Wang, Peter Clark.

Summer 2023, Sunnyvale CA

Applied Scientist Intern @ Amazon Science. Mentors: Daniel Elkind & Timothy Leffel.

Summer 2022, Toronto ON

Worked with Griffin Lacey and Graham Taylor on developing scalable Transformer-based Graph Neural Networks.