For my project, I don't have to scrape many jobs, only when the user 'adds' them to their dashboard. I use a Python script (with BeautifulSoup) to fetch the whole page, then ask GPT-4o to return a JSON object with all the fields I need.
That's interesting, Jasper. Are you feeding the whole HTML text to GPT-4o? Aren't you doing any data cleaning for HTML tags or similar?
I tried that first, but got the same results as when sending the whole HTML text to GPT-4o. I might optimize for speed and try that again, but I'm not sure how much it will help.
This approach doesn't work on pages where content is loaded dynamically, so for my project I will add a 'paste your own HTML' option or a browser plugin. (This doesn't work when scraping at a big scale, of course.)
I was asking because I had heard that: 1. HTML tags take up unnecessary tokens, and 2. GPT might not work as well with all the HTML tags in the input.
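The tag-stripping step mentioned above can be sketched with just the standard library (the thread itself uses BeautifulSoup, where `soup.get_text()` does the same thing in one line). The sample HTML below is a made-up placeholder:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects only the text content of a page, dropping all tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)

def strip_tags(html: str) -> str:
    """Return the tag-free text of an HTML document, joined by spaces."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)

sample = "<html><body><h1>Backend Engineer</h1><p>Salary: €60,000</p></body></html>"
print(strip_tags(sample))  # Backend Engineer Salary: €60,000
```

Note this keeps the contents of `<script>` and `<style>` blocks too, so a real cleaner would typically drop those elements first; it's only meant to show how much of the token budget is tags rather than text.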
Interested in the prompt, care to share?
'''Below is a job listing from the URL: ''' + url + ''' I want you to extract information from the job listing and return it as JSON. First I share the job listing, then the desired JSON output structure. Output only the JSON object, nothing else. Before you start, look at the job listing closely. All the required information is in there somewhere.
- Make sure dates are in a correct date format.
- For the salary, make sure you include the currency symbol (€ or $).
- Give the page a score from 0 to 100, where 0 means the page doesn't look like a job listing at all, and 100 means it's most definitely a job listing page.
[JOB LISTING START]''' + page_content + '''[JOB LISTING END] [JSON OUTPUT START]
{
  "joburl": "",
  "jobtitle": "",
  "jobdescriptionsummary": "",
  "jobtasksresponsibilities": [],
  "jobpostingdate": "",
  "joblocation": "",
  "hoursperweek": "",
  "candidaterequirements": [],
  "benefits": {
    "minsalary": "",
    "maxsalary": "",
    "otherbenefits": []
  },
  "contactperson": [{
    "fullname": "",
    "jobtitle": "",
    "phonenumber": "",
    "email": "",
    "linkedinurl": ""
  }],
  "jobpostingenddate": "",
  "company": "",
  "companydescription": "",
  "joblistingscore": ""
}
[JSON OUTPUT END]'''
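A minimal sketch of how this prompt gets assembled and the reply parsed. The `url` and `page_content` values are hypothetical placeholders, and without Structured Outputs GPT-4o sometimes wraps the JSON in a markdown code fence despite the "output only the JSON" instruction, so the parser below strips fences defensively:

```python
import json

# Hypothetical values standing in for the real variables in the prompt above.
url = "https://example.com/jobs/123"
page_content = "Backend Engineer at Example Corp ..."

prompt = '''Below is a job listing from the URL: ''' + url + '''
... (instructions and JSON structure as shown above) ...
[JOB LISTING START]''' + page_content + '''[JOB LISTING END]'''

def parse_model_reply(reply: str) -> dict:
    """Strip the ```json fences GPT-4o sometimes adds, then parse."""
    cleaned = reply.strip()
    if cleaned.startswith("```"):
        # Drop the opening fence (with optional language tag) and the closing fence.
        cleaned = cleaned.split("\n", 1)[1]
        cleaned = cleaned.rsplit("```", 1)[0]
    return json.loads(cleaned)

reply = '```json\n{"jobtitle": "Backend Engineer", "joblistingscore": "95"}\n```'
print(parse_model_reply(reply)["jobtitle"])  # Backend Engineer
```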
I see the prompt renders a bit weirdly here, since underscores are used to make text italic.
Thanks!
You don't need to do that anymore with Structured Outputs. With Structured Outputs, the model will auto-format the data based on the schema you give it, 100% of the time.
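A sketch of what that looks like with the OpenAI Python SDK. The schema below is cut down to three of the fields from the prompt above (a real one would list them all), and the API-calling function is only a sketch assuming the `openai` package and an API key are available, so the import is kept inside the function:

```python
import json

# Cut-down JSON Schema for the job-listing fields from the prompt above.
# With Structured Outputs, "strict": True makes the model's reply conform
# to this schema every time.
job_schema = {
    "type": "object",
    "properties": {
        "jobtitle": {"type": "string"},
        "joblocation": {"type": "string"},
        "joblistingscore": {"type": "integer"},
    },
    "required": ["jobtitle", "joblocation", "joblistingscore"],
    "additionalProperties": False,
}

def extract_job(page_text: str) -> dict:
    """Sketch of a Structured Outputs call; needs `pip install openai`
    and an API key, so it is not executed here."""
    from openai import OpenAI  # assumed available; not a stdlib module
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": "Extract the job listing fields:\n" + page_text}],
        response_format={
            "type": "json_schema",
            "json_schema": {"name": "job_listing", "strict": True,
                            "schema": job_schema},
        },
    )
    return json.loads(response.choices[0].message.content)
```

With this in place, the manual "output only the JSON" and date/currency-format instructions in the prompt become mostly redundant, since the schema enforces the shape of the reply.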