For my project, I don't have to scrape many jobs, only when the user 'adds' them to their dashboard. I use a Python script (with BeautifulSoup) to fetch the whole page, then ask GPT-4o to return a JSON object with all the fields I need.
That's interesting, Jasper. Are you feeding the whole HTML text to GPT-4o? Aren't you doing any data cleaning for HTML tags or similar?
I tried that first, but got the same results as when sending the whole HTML text to GPT-4o. I might optimize for speed and try that again, but I'm not sure how much it will help.
This approach doesn't work on pages where content is loaded dynamically, so for my project I will add a 'paste your own HTML' option or a browser plugin. (This doesn't work when scraping at a big scale, of course.)
I was asking because I had heard that: 1. HTML tags take up unnecessary tokens, and 2. GPT might not work as well with all the HTML tags in the input.
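The tag-stripping step mentioned above can be sketched with just the standard library (the thread itself uses BeautifulSoup, where `soup.get_text()` does the same thing in one line). The sample HTML below is a made-up placeholder:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects only the text content of a page, dropping all tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)

def strip_tags(html: str) -> str:
    """Return the tag-free text of an HTML document, joined by spaces."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)

sample = "<html><body><h1>Backend Engineer</h1><p>Salary: €60,000</p></body></html>"
print(strip_tags(sample))  # Backend Engineer Salary: €60,000
```

Note this keeps the contents of `<script>` and `<style>` blocks too, so a real cleaner would typically drop those elements first; it's only meant to show how much of the token budget is tags rather than text.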
Interested in the prompt, care to share?
'''Below is a job listing from the URL: ''' + url + ''' I want you to extract information from the job listing and return it as JSON. First I share the job listing, then the desired JSON output structure. Output only the JSON object, nothing else. Before you start, look at the job listing closely. All the required information is in there somewhere.
- Make sure dates are in a correct date format.
- For the salary, make sure you include the currency symbol (€ or $).
- Give the page a score from 0 to 100, where 0 means the page doesn't look like a job listing at all, and 100 means it's most definitely a job listing page.
[JOB LISTING START]''' + page_content + '''[JOB LISTING END] [JSON OUTPUT START]
{
  "joburl": "",
  "jobtitle": "",
  "jobdescriptionsummary": "",
  "jobtasksresponsibilities": [],
  "jobpostingdate": "",
  "joblocation": "",
  "hoursperweek": "",
  "candidaterequirements": [],
  "benefits": {
    "minsalary": "",
    "maxsalary": "",
    "otherbenefits": []
  },
  "contactperson": [{
    "fullname": "",
    "jobtitle": "",
    "phonenumber": "",
    "email": "",
    "linkedinurl": ""
  }],
  "jobpostingenddate": "",
  "company": "",
  "companydescription": "",
  "joblistingscore": ""
}
[JSON OUTPUT END]'''
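A minimal sketch of how this prompt gets assembled and the reply parsed. The `url` and `page_content` values are hypothetical placeholders, and without Structured Outputs GPT-4o sometimes wraps the JSON in a markdown code fence despite the "output only the JSON" instruction, so the parser below strips fences defensively:

```python
import json

# Hypothetical values standing in for the real variables in the prompt above.
url = "https://example.com/jobs/123"
page_content = "Backend Engineer at Example Corp ..."

prompt = '''Below is a job listing from the URL: ''' + url + '''
... (instructions and JSON structure as shown above) ...
[JOB LISTING START]''' + page_content + '''[JOB LISTING END]'''

def parse_model_reply(reply: str) -> dict:
    """Strip the ```json fences GPT-4o sometimes adds, then parse."""
    cleaned = reply.strip()
    if cleaned.startswith("```"):
        # Drop the opening fence (with optional language tag) and the closing fence.
        cleaned = cleaned.split("\n", 1)[1]
        cleaned = cleaned.rsplit("```", 1)[0]
    return json.loads(cleaned)

reply = '```json\n{"jobtitle": "Backend Engineer", "joblistingscore": "95"}\n```'
print(parse_model_reply(reply)["jobtitle"])  # Backend Engineer
```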
I see the prompt renders a bit weirdly here, since underscores are used to make text italic.
Thanks!
You don't need to do that anymore with Structured Outputs. With Structured Outputs, the model will auto-format the data based on the schema you give it, 100% of the time.
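A sketch of what that looks like with the OpenAI Python SDK. The schema below is cut down to three of the fields from the prompt above (a real one would list them all), and the API-calling function is only a sketch assuming the `openai` package and an API key are available, so the import is kept inside the function:

```python
import json

# Cut-down JSON Schema for the job-listing fields from the prompt above.
# With Structured Outputs, "strict": True makes the model's reply conform
# to this schema every time.
job_schema = {
    "type": "object",
    "properties": {
        "jobtitle": {"type": "string"},
        "joblocation": {"type": "string"},
        "joblistingscore": {"type": "integer"},
    },
    "required": ["jobtitle", "joblocation", "joblistingscore"],
    "additionalProperties": False,
}

def extract_job(page_text: str) -> dict:
    """Sketch of a Structured Outputs call; needs `pip install openai`
    and an API key, so it is not executed here."""
    from openai import OpenAI  # assumed available; not a stdlib module
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": "Extract the job listing fields:\n" + page_text}],
        response_format={
            "type": "json_schema",
            "json_schema": {"name": "job_listing", "strict": True,
                            "schema": job_schema},
        },
    )
    return json.loads(response.choices[0].message.content)
```

With this in place, the manual "output only the JSON" and date/currency-format instructions in the prompt become mostly redundant, since the schema enforces the shape of the reply.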