Posts by Eun Cheol Choi

Our Python snippet for measuring wAII for a job posting, which is open-sourced.

Our code is open-source. With just a few lines of Python, anyone can measure AI exposure for any job description using our repository. (5/5)

Paper: workshop-proceedings.icwsm.org/abstract.php...
Repository: github.com/EunCheolChoi...

9 months ago 0 0 0 0
In the information industry sector, our wAII is “positively” associated with offered salary.

On the other hand, in the wholesale trade industry sector, wAII is “negatively” associated with offered salary.

𝐖𝐡𝐚𝐭 𝐰𝐞 𝐟𝐨𝐮𝐧𝐝
1. Jobs in tech, manufacturing, and engineering are more exposed to AI; HR, public service, and legal sectors are less exposed
2. The association between AI exposure and offered salary differs by industry sector (4/5)

9 months ago 0 0 1 0
A diagram explaining how we extracted ‘tasks’ from each job post. A task is a role or a skillset that is essential for a job position.

We query with the extracted tasks and retrieve the most semantically similar AI-related patents.

The Weighted AI Index is calculated as the sum, over tasks, of each task's weight multiplied by the similarity between that task and its retrieved patent.

Full LLM prompt for extracting tasks from a job post.

To track this, we developed the Weighted AI Index (wAII), a scalable method to measure how closely a job’s tasks align with recent AI innovations.
We extract key job tasks from postings; compare them to AI-related patents; compute an “AI exposure” score. (3/5)
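
The scoring step described in this post can be sketched roughly as follows. This is a minimal illustration, not the code from the linked repository: the function names `cosine_similarity` and `weighted_ai_index` are hypothetical, the vectors are toy stand-ins for real text embeddings, and it assumes cosine similarity with only the single most similar patent kept per task.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def weighted_ai_index(
    tasks: Sequence[tuple[float, Vector]],
    patent_vectors: Sequence[Vector],
) -> float:
    """Sum over tasks of (task weight) * (similarity to the closest patent).

    Assumption: each task is a (weight, embedding) pair and retrieval keeps
    only the top-1 patent by cosine similarity.
    """
    total = 0.0
    for weight, task_vector in tasks:
        best_similarity = max(
            (cosine_similarity(task_vector, p) for p in patent_vectors),
            default=0.0,  # no patents available: the task contributes nothing
        )
        total += weight * best_similarity
    return total


# Toy example: two equally weighted tasks and two patents (made-up vectors).
example_tasks = [(0.5, [1.0, 0.0]), (0.5, [0.0, 1.0])]
example_patents = [[1.0, 0.0], [1.0, 1.0]]
example_score = weighted_ai_index(example_tasks, example_patents)
```

In practice the task and patent vectors would come from a sentence-embedding model, and the task weights from the extraction step; both are left out here because the posts do not specify them.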

9 months ago 0 0 1 0
A bar chart comparing ‘weighted AI index (wAII)’ of different industry sectors. wAII is our proposed method of measuring how much a job position is exposed to AI technologies.

𝐊𝐞𝐲 𝐢𝐝𝐞𝐚
AI doesn't affect all jobs equally. Some industries and roles are far more exposed to disruption from technological innovation than others. (2/5)

9 months ago 0 0 1 0
Title page for our new paper, “Mapping Labor Market Vulnerability in the Age of AI”

𝑾𝒊𝒍𝒍 𝑨𝑰 𝒕𝒂𝒌𝒆 𝒎𝒚 𝒋𝒐𝒃?
It’s a question that keeps many of us up at night—and for good reason.
Our new research maps labor market vulnerability in the age of AI with 100K job postings and 50K AI-related patents. (1/5)

9 months ago 2 0 1 0
Keeping Track of AI Use Cases in the Newsroom A practical guide to how journalists are using generative AI and how to keep up

Keeping Track of AI Use Cases in the Newsroom generative-ai-newsroom.com/keeping-trac...

10 months ago 2 0 0 0

"Limited effectiveness of LLM-based data augmentation for COVID-19 misinformation stance detection" by @euncheolchoi.bsky.social @emilioferrara.bsky.social et al, presented by the awesome Chur at The Web Conference 2025

arxiv.org/abs/2503.02328

11 months ago 6 2 0 0

Hot takes:
- The benefits of easily accessible social media data usually outweigh the potential harms.
- Some uses of (public) social media data are unethical/should be illegal, and we should target that.
- We're better off having clear boundaries between public and private online spaces.

1 year ago 12 2 0 1
Someone Made a Dataset of One Million Bluesky Posts for 'Machine Learning Research' A Hugging Face employee made a huge dataset of Bluesky posts, and it’s already very popular.

An employee of Huggingface, a site of AI training datasets, made a dataset of a million Bluesky posts scraped simply because they could. It’s currently trending: www.404media.co/someone-made...

1 year ago 1101 467 60 192

Bluesky's firehose is a treasure trove of public data for researchers and developers, and it's completely free. Check out our developer docs: docs.bsky.app

1 year ago 7889 1526 318 166
Book outline

Over the past decade, embeddings — numerical representations of
machine learning features used as input to deep learning models — have
become a foundational data structure in industrial machine learning
systems. TF-IDF, PCA, and one-hot encoding have always been key tools
in machine learning systems as ways to compress and make sense of
large amounts of textual data. However, traditional approaches were
limited in the amount of context they could reason about with increasing
amounts of data. As the volume, velocity, and variety of data captured
by modern applications has exploded, creating approaches specifically
tailored to scale has become increasingly important.
Google’s Word2Vec paper made an important step in moving from
simple statistical representations to semantic meaning of words. The
subsequent rise of the Transformer architecture and transfer learning, as
well as the latest surge in generative methods has enabled the growth
of embeddings as a foundational machine learning data structure. This
survey paper aims to provide a deep dive into what embeddings are,
their history, and usage patterns in industry.

Cover image

Just realized BlueSky allows sharing valuable stuff cause it doesn't punish links. 🤩

Let's start with "What are embeddings" by @vickiboykis.com

The book is a great summary of embeddings, from history to modern approaches.

The best part: it's free.

Link: vickiboykis.com/what_are_emb...

1 year ago 651 101 22 6
Opportunities and risks of LLMs in survey research Recent advances in the development of large language models (LLMs) bring both disruptive opportunities and underlying risks to survey research. LLMs' capabiliti

New Paper on Opportunities and risks of LLMs in survey research papers.ssrn.com/sol3/papers.... "Backed by both practical examples & academic literature, we identify areas for research and development, distinguishing between challenges related to survey methods & the tools used to deploy surveys"

1 year ago 4 2 1 1

Interested in RLHF, DPO, LLM alignment?

I've just created this list featuring awesome people like @natolambert.bsky.social .

The list is the opposite of exhaustive; I've just joined some days ago 😅

go.bsky.app/MqRGAf2

1 year ago 83 19 10 1

If you're keen to learn content verification for fact-checking and open source investigations, this is a step-by-step guide on how to verify images and videos that I posted a while ago.

Once you familiarise yourself with reverse search, you'll become much better at spotting online misinformation.

1 year ago 415 165 39 8
A Public Dataset Tracking Social Media Discourse about the 2024 U.S. Presidential Election on Twitter/X In this paper, we introduce the first release of a large-scale dataset capturing discourse on $\mathbb{X}$ (a.k.a., Twitter) related to the upcoming 2024 U.S. Presidential Election. Our dataset compri...

More #data2thepeople! #election2024

We just released a Twitter/X dataset of 22m+ tweets about the upcoming election! (Data will be routinely updated, keep an eye and spread the word!)

Paper arxiv.org/abs/2411.00376
Data github.com/sinking8/usc...

1 year ago 10 4 0 0
Unearthing a Billion Telegram Posts about the 2024 U.S. Presidential Election: Development of a Public Dataset With its lenient moderation policies and long-standing associations with potentially unlawful activities, Telegram has become an incubator for problematic content, frequently featuring conspiratorial,...

#Data2thePeople! Pls reshare!

Do you want a billion telegram posts?!
You get data! You get data! You get data!

Paper arxiv.org/abs/2410.23638
Data github.com/leonardo-bla...

#election2024

1 year ago 36 16 1 5

Sharing my first Computational Social Science starter pack! Will grow with time, feel free to nominate and self nominate!

go.bsky.app/CYmRvcK

1 year ago 96 41 61 3
GitHub - brianckeegan/bluesky-datascience: Exploratory notebooks for data science using Bluesky data Exploratory notebooks for data science using Bluesky data - brianckeegan/bluesky-datascience

Here is a @github.com repo where I will share tutorial notebooks on how to retrieve and analyze data from @bsky.app @atproto.com

Feedback, suggestions, and contributions are welcome!

github.com/brianckeegan...

1 year ago 161 68 15 7

Ready for another Computational Social Science Starter Pack?

Here is number 2! More amazing folks to follow! Many students and the next gen represented!

go.bsky.app/GoEyD7d

1 year ago 77 52 33 43