Back
Post
Posted

Job board scrapers: How are you getting job descriptions?

When scraping jobs, you encounter job descriptions in different formats, also from different ATS.

How are you showing them in your job boards? Are you taking the original job post and feeding it to AI to ask it to "format" it to your liking?

How are you handling the scraping of all the page content (job description) from these different ATS? Are you cleaning the whole page data before sending to AI?


I scrape exclusively JobPosting schema: developers.google.com/search/…

The job description in here is (mostly) consistent across ATS. The exception being Workday which doesn't include HTML, only plain text.

Let me know if I can set up an API for you, I'm scraping 8.5 mil job per month.

Holy crap! That is a lot of jobs!
So, if I understand correctly, you are scraping directly from Google then?
Man, I have soooo many questions... 😅

Nope job board website put this JS schema inside their page to appear on Google Jobs, its structured data so the description and other informations are always in the same place to help Google (json)

Now I understand. Thanks Damien!

For my project, I don't have to scrape many jobs, only if the user 'adds' them to their dashboard. I use a python script (with Beautifulsoup) to get the whole page, then ask gpt4o to return a json with all the fields I need.

That is interesting Jasper. Are you feeding the whole html text to gpt4o? Are you not doing any data cleaning for html tags or similar?

I tried that first, but got the same results with sending the whole html text to gpt4o. Might optimize for speed and do that again, but I'm not sure how much that will help.

This approach doesn't work on pages where content is loaded dynamically, so for my project I will add a 'paste your own html' option or a browser plugin. (This doesn't work when scraping on a big scale, of course)

I was asking because I heard that: 1. HTML tags take up unnecessary tokens, and 2. That gpt might not work as well with all the html tags.

Interested in the prompt, care to share?

'''Below is a job listing from the url:'''+ url +''' I want you to extract information from the job listing, and return a json format. First I share the job listing, then the desired json output structure. Output only the json object, nothing else. Before you start, look at the job listing closely. All the required information is in there somewhere.

- Make sure dates are in a correct date format.
- For the salary, make sure you include the currency symbol (€ or $)
- Give the page a score from 0 - 100, where 0 means the page doesn't look like a job listing at all, and 100 it's most definitely a job listing page.

[JOB LISTING START]''' + page_content + '''[JOB LISTING END] [JSON OUTPUT START]

{
"joburl": "",
"job
title": "",
"jobdescriptionsummary": "",
"jobtasksresponsibilities": [],
"jobpostingdate": "",
"joblocation": "",
"hours
perweek": "",
"candidate
requirements": [],
"benefits": {
"minsalary": "",
"max
salary": "",
"otherbenefits": []
},
"contact
person": [{
"fullname": "",
"job
title": "",
"phonenumber": "",
"email": "",
"linkedin
url": ""
}],
"jobpostingenddate": "",
"company": "",
"company
description": "",
"joblistingscore": ""
}

[JSON OUTPUT END]'''

I see the prompt is a bit weird here since underscores are used to make text italic

you dont need to do that anymore with structured outputs. with structured outputs it will auto format the data based on the schema you give it 100% of the time.

There's not a lot of benefit imho in scraping. Best to try to start a job board in a niche and try to get unique jobs posted there asap.

That's also quite hard, how to get users without jobs? How to make people post their jobs if you don't have users?
Even with customers in my experience it-s not that easy to get companies to take the time to post in your job board or even go with new job boards as they usually have already some arrangements.
Maybe I did it the wrong way if you got some success doing that. Any tip?

It is super hard. But it is what creates the most value in long term.

With a good niche it should be possible if you reach out to companies. Start with a few scraped jobs so the site looks real. Then go hard on outreach.

Yeah thanks, I might try again now that the site now has many more jobs and some traffic.
Do you go for paid post jobs straightforward?
I'm thinking of trying with free and then upsell if they liked it at least at first

Started with free jobs where companies had to pay if they hired someone. It was based on self reporting, but the ones that reported had to pay like 1-2 months salary so a lot.

Now we switched model and its free+paid jobs (paid jobs get extra features) and we also charge money for job seekers. They can create a profile with extra features, get alerts etc.

40% revenue is from job seekers, 60% revenue is from job poster.

We got around 80-100 jobs (unique, not scraped) on the site at a time, revenue around 8-10k.

Not sure if you can easily upsell people if they liked a free job post. Depends on the niche, in some niches people post jobs too rarely, like once every 18-24 months

Thanks, that's super helpful! I think in my case the companies hire a bit more often but it is also not super regular, your model might make sense in my case too. Also, its more straightforward.
I'm charging customers too, but still small amount of users and having both revenues would be nice.
I'll think about implementing something similar.

Awesome results if you have that many unique jobs,congratulations!

I'm still not sure about charging job seekers, it seems nice but I am a bit worried about long term.

I feel it might be better to focus on just charging for jobs but it is hard to make that change now since we get a lot of money from job seekers.

Thanks, It was not easy, it really started picking up this year.

For some reason, I stopped receiving notifications for this question I ask.
Would those replying here be interested in sharing ideas about job boards in a Telegram group?
I have created on in case you are interested in the idea:
t.me/+bVg9wdu1XBQyMmJk