A few years ago, before the GDPR era, I was involved in a stealth marketing startup. We went out of business without getting funding. During my work on the startup, I developed a number of techniques that allowed me to collect and cross-reference a lot of personal data, including data from LinkedIn.
As I am much more concerned about GDPR and personal privacy these days, I decided to disclose these techniques. The issues I discovered are still relevant, and I hope they will be resolved after this article is published.
1. LinkedIn publicly lists all users
LinkedIn has a publicly available index of all accounts, probably intended for search engine crawlers. For example, if you start with the following URLs you will get millions of people:
- Many more, depending on the country.
Back then, I developed a crawler that downloaded the whole index and extracted all user names and profile URLs.
2. Download results of the LinkedIn hack from 2012
Searching on Google, I found the database from the 2012 LinkedIn hack. Each record had a user id and an email address, with no additional information.
The link to the LinkedIn user profile was missing and so was any other personal information. As a result, the database was not very useful on its own.
3. Decoding LinkedIn user profile URLs
If the user has not defined a custom profile URL, LinkedIn generates a default URL for each user, comprising the first name, the last name, and a special code. At first, I thought that this special code was a hexadecimal representation of the user id, but after looking at hundreds of users' URLs, I found that not all hexadecimal characters were used: only 0-9, A, and B. Here is an example of the default personal URL: https://il.linkedin.com/in/fname–lname–c0de
Some of you, I suppose, might have already guessed the answer. LinkedIn uses duodecimal (base-12) numbers to convert the integer value of the user id to a string. Here is a PHP snippet that converts a duodecimal value back to an integer value:
$num = base_convert($duo, 12, 10);
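To make the arithmetic concrete, here is a minimal sketch of the same conversion wrapped in a small helper. The function name duodecimalToInt and the sample value are made up for illustration and do not correspond to any real profile. The check against the base-12 alphabet is my own addition: base_convert() does not reject characters outside the given base, so a value with stray characters would otherwise still produce a number.

```php
<?php
// Hypothetical helper (not part of the original script): converts a
// base-12 string (digits 0-9 plus a and b) to an integer.
function duodecimalToInt(string $duo): ?int
{
    // Reject anything outside the base-12 alphabet, because base_convert()
    // would otherwise skip invalid characters instead of failing.
    if (!preg_match('/^[0-9ab]+$/i', $duo)) {
        return null;
    }
    // base_convert() returns a string, so cast the result to int.
    return (int) base_convert($duo, 12, 10);
}

// Made-up value: 1*12^3 + 10*12^2 + 3*12 + 11 = 3215
var_dump(duodecimalToInt('1a3b')); // int(3215)

// "c0de" contains c, d and e, which are not duodecimal digits.
var_dump(duodecimalToInt('c0de')); // NULL
```

The cast to int is only safe for values that fit into a PHP integer; base_convert() itself works with floating point internally and can lose precision on very large numbers.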
4. Final solution
After doing this research, I built a script that merged my two databases into one. In the end, I had the following data:
- First Name
- Last Name
- Profile URL
- Personal email address
My data only went up to 2012, but it already contained a few million records. I was working on another script to fetch personal LinkedIn pages and harvest user info, but I never finished it.
Recommendations for LinkedIn team
I have the following recommendations:
- Stop listing all users in a search-engine-friendly index. Show only new users, for example the latest 1000 each time a search bot visits your site.
- Do not print the user id inside the LinkedIn user page.
- Get rid of duodecimal profile ids. Obscurity is not a solution here.
I could go on and on here, as I do security and privacy architecture too. You can contact me on LinkedIn 😉
GDPR and privacy
After GDPR came into force, I decided to delete all the data I had collected, even though I live in Israel, where GDPR is not applicable. I consider storing such personal data to be a privacy breach, and I am publishing this article to raise awareness of the topic.
If you have comments, suggestions, or love letters, especially if you are harvesting LinkedIn data, leave them in the comments below.
Free privacy training for startup founders and architects:
I started a new open-source project to help companies become privacy compliant (GDPR, CCPA, etc.). I also plan to release an enterprise version and make a fortune from it 😉 So if you are on GitHub, give us a star.
About the author
Yuli is an open-source developer who helps companies and startups solve data security and privacy challenges. He is the founder of the https://privacybunker.io/ and https://databunker.org/ projects.
Among his credits, he founded the database security company GreenSQL/Hexatier, which was acquired by Huawei.
Specialties: Software and cloud architecture, Compliance (GDPR, PCI), blockchain technologies, software development, secure architectures, project management, and low-level research.