Harvesting LinkedIn data for fun & profit

A few years ago, before the GDPR era, I was involved in a stealth marketing startup. We went out of business without getting funding. While working on the startup, I developed a number of techniques that allowed me to collect and cross-reference a lot of personal data, including data from LinkedIn.

As I am much more concerned about GDPR and personal privacy these days, I decided to disclose these techniques. The issues I discovered are still relevant, and I hope they will be resolved after this article is published.

1. LinkedIn publicly lists all users

LinkedIn keeps a publicly available index of all accounts, probably for search engine crawlers. For example, if you start with the following URLs you will get millions of people:

Back then, I developed a crawler that downloaded this index and extracted all user names and profile URLs.

2. Download results of the LinkedIn hack from 2012

Searching on Google, I found the database from the 2012 LinkedIn hack. Each record had a user id and an email address, with no additional information.

The link to the LinkedIn user profile was missing and there was no other personal information. As a result, the database was not very useful on its own.

3. Decoding LinkedIn user profile URLs

If the user has not defined a custom profile URL, LinkedIn generates a default URL for each user, consisting of the first name, the last name, and a special code. At first, I thought that this special code was the user id written as a hexadecimal string, but after looking at hundreds of users' URLs, I found that not all hexadecimal characters appear in it: only 0-9, A and B. Here is an example of the default personal URL: https://il.linkedin.com/in/fnamelnamec0de
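The character check described above can be sketched in a few lines of PHP. This is only an illustration: the helper name infer_min_base is hypothetical, the sample codes are made up, and the sketch assumes the codes have already been separated from the name part of the URL.

```php
<?php
// Hypothetical helper: collect the distinct characters used across a set of
// profile codes and report the smallest numeric base that covers all of them.
function infer_min_base(array $codes): array
{
    $alphabet = '0123456789abcdefghijklmnopqrstuvwxyz';
    $seen = [];
    foreach ($codes as $code) {
        if ($code === '') {
            continue; // nothing to learn from an empty code
        }
        foreach (str_split(strtolower($code)) as $ch) {
            $pos = strpos($alphabet, $ch);
            if ($pos === false) {
                throw new InvalidArgumentException("Unexpected character: $ch");
            }
            $seen[$ch] = $pos; // remember the digit value of each character
        }
    }
    asort($seen); // order the characters by digit value
    return [
        'chars'    => implode('', array_keys($seen)),
        'min_base' => $seen === [] ? null : max($seen) + 1,
    ];
}

// Made-up sample codes, for illustration only.
$result = infer_min_base(['1a4', 'b07', '93b2', '58a6']);
echo $result['chars'], ' -> base ', $result['min_base'], PHP_EOL;
// prints "0123456789ab -> base 12"
```

A sample like this only gives a lower bound on the base. The more URLs you inspect without seeing a character above B, the more plausible that bound becomes.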

Some of you, I suppose, might have already guessed the answer. LinkedIn encodes the integer user id as a duodecimal (base-12) string. Here is a PHP snippet that converts a duodecimal value back to an integer:

// base_convert() returns a string; cast it to get an integer
$num = (int) base_convert($duo, 12, 10);
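A slightly more defensive version of the same conversion is sketched below. It is not the original script: the function name duodecimal_to_int is hypothetical, and it assumes the decoded id fits in a PHP integer. The reason for the extra checks is that base_convert() does not reject characters outside the given base and can lose precision on large numbers.

```php
<?php
// Hypothetical helper: convert a base-12 profile code to an integer,
// rejecting anything that is not a valid duodecimal string.
function duodecimal_to_int(string $code): int
{
    $code = strtolower($code);
    if (!preg_match('/^[0-9ab]+$/', $code)) {
        throw new InvalidArgumentException("Not a duodecimal string: $code");
    }
    $value = 0;
    foreach (str_split($code) as $ch) {
        // 'a' is digit 10, 'b' is digit 11, everything else is 0-9
        $digit = ($ch === 'a') ? 10 : (($ch === 'b') ? 11 : (int) $ch);
        $value = $value * 12 + $digit;
    }
    return $value;
}

echo duodecimal_to_int('1a'), PHP_EOL; // prints 22 (1 * 12 + 10)
```

With this check, a placeholder such as "c0de" from the example URL above is rejected, because c, d and e are not duodecimal digits.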

4. Final solution

After doing this research, I built a script that merged my two databases into one, using the user id decoded from the profile URL as the key. In the end, I had the following data for each user:

  1. First Name
  2. Last Name
  3. Profile URL
  4. Personal email address

My data only went up to 2012, but it already contained a few million records. I was working on another script to fetch each personal LinkedIn page and harvest the user info, but I never finished it.

Recommendations for LinkedIn team

I have the following recommendations:

  1. Stop listing all users in a search-engine-friendly index. Show only new users, for example the latest 1000 each time a search bot visits your site.
  2. Do not print the user id inside the LinkedIn user page.
  3. Get rid of duodecimal profile ids. Obscurity is not a solution here.

I could go on and on here, as I do security and privacy architecture too. You can contact me on LinkedIn 😉

GDPR and privacy

After GDPR came into force, I decided to delete all the data I had collected, even though I live in Israel, where GDPR is not applicable. I consider storing such personal data to be a privacy breach, and I am publishing this article to raise awareness of the topic.

Final words

Leave your comments, suggestions, and love letters in the comments below, especially if you are harvesting LinkedIn data.

About the author

Yuli Stremovsky
Paranoid Security Guy
For the past 15 years I’ve been leading the evolution of startups and enterprises to achieve the highest level of security and compliance. Throughout my career I’ve been a cyber security expert and advanced solutions architect with many years of hands-on experience on both the offensive and defensive sides. I am knowledgeable at the highest level in application development, networking, data and databases, web applications, large-scale Software as a Service solutions, cloud security and blockchain technologies.

I’ve been working with CISOs of international enterprises, helping them set their information security strategy and overseeing the implementation of these recommendations. As part of these projects, I’ve been assisting companies in achieving compliance with GDPR, PCI, HIPAA and SOX.

Among my credits, I was a founder of the database security company GreenSQL/Hexatier, which was acquired by Huawei, and I co-founded Kesem.io, a secure multi-signature crypto wallet.

Specialties: Software and cloud architecture, Compliance (GDPR, HIPAA, PCI, SOX), blockchain technologies, software development, secure architectures, project management and low level research.