Scraping the HelloAsso site

Hello everyone,

I'm looking for a way to scrape the email addresses of sports associations in Brittany.
Most of them are listed on HelloAsso, and I wanted to know whether anyone has already scraped this site?

Thanks in advance, Growth folks

Fabien

Yes, it's entirely possible!


The collection can be done in 2 steps

iterAssos

Fetch the list of associations

curl 'https://www.helloasso.com/algolia/1/indexes/*/queries?x-algolia-agent=Algolia%20for%20JavaScript%20(4.17.2)%3B%20Browser' \
  -H 'authority: www.helloasso.com' \
  -H 'accept: */*' \
  -H 'accept-language: fr-FR,fr;q=0.9,en-US;q=0.8,en;q=0.7' \
  -H 'content-type: application/x-www-form-urlencoded' \
  -H 'cookie: i18n_redirected=fr; _gcl_au=1.1.450853532.1696498674; ry_ry-h3ll04s_realytics=eyJpZCI6InJ5X0NGRTc4NjA4LUIwODYtNDRDQi1CMzJDLTJBRDI3MjZGQzkwMyIsImNpZCI6bnVsbCwiZXhwIjoxNzI4MDM0Njc0ODI0LCJjcyI6bnVsbH0%3D; ry_ry-h3ll04s_so_realytics=eyJpZCI6InJ5X0NGRTc4NjA4LUIwODYtNDRDQi1CMzJDLTJBRDI3MjZGQzkwMyIsImNpZCI6bnVsbCwib3JpZ2luIjp0cnVlLCJyZWYiOm51bGwsImNvbnQiOm51bGwsIm5zIjpmYWxzZX0%3D; axeptio_authorized_vendors=%2C%2C; _gid=GA1.2.1943720269.1696498675; ln_or=eyIzNzcxMjEyIjoiZCJ9; ajs_anonymous_id=88f13cda-34fe-4633-8c84-d680a5025e97; analytics_session_id=1696498675715; NPS_ad4ebf8d_last_seen=1696498680025; _ga=GA1.1.1708486417.1696498675; axeptio_all_vendors=%2CSalesForce%2CHelloAsso%2CARRAffinity%2Cgoogle_analytics%2Csegment%2CRealytics%2CFullStory%2CRealytics_marketing%2CAppcues%2CABtasty%2CAffilae%2CGoogle_Ads%2CBing%2Cfacebook_pixel%2Clinkedin_insight_tag%2Csalesforce%2Crealytics%2Cabtasty%2Caffilae%2Cgoogle_ads%2Cbing%2C; axeptio_cookies={%22$$token%22:%22in8znein20oa2w5mo3vycc%22%2C%22$$date%22:%222023-10-05T10:01:11.215Z%22%2C%22SalesForce%22:false%2C%22HelloAsso%22:false%2C%22ARRAffinity%22:false%2C%22google_analytics%22:false%2C%22segment%22:false%2C%22Realytics%22:false%2C%22FullStory%22:false%2C%22Realytics_marketing%22:false%2C%22Appcues%22:false%2C%22ABtasty%22:false%2C%22Affilae%22:false%2C%22Google_Ads%22:false%2C%22Bing%22:false%2C%22facebook_pixel%22:false%2C%22linkedin_insight_tag%22:false%2C%22salesforce%22:false%2C%22realytics%22:false%2C%22abtasty%22:false%2C%22affilae%22:false%2C%22google_ads%22:false%2C%22bing%22:false%2C%22$$completed%22:true}; analytics_session_id.last_access=1696500212367; _ga_TKC826G3G2=GS1.1.1696498678.1.1.1696500214.60.0.0' \
  -H 'origin: https://www.helloasso.com' \
  -H 'referer: https://www.helloasso.com/e/recherche?tab=associations&category_tags=ecologie--et--environnement&bbox=-21.469669445706018%2C39.8474056222239%2C43.1299399292939%2C54.32041256973076' \
  -H 'sec-ch-ua: "Google Chrome";v="117", "Not;A=Brand";v="8", "Chromium";v="117"' \
  -H 'sec-ch-ua-mobile: ?0' \
  -H 'sec-ch-ua-platform: "macOS"' \
  -H 'sec-fetch-dest: empty' \
  -H 'sec-fetch-mode: cors' \
  -H 'sec-fetch-site: same-origin' \
  -H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36' \
  -H 'x-algolia-api-key: 980128990635aaa7c2595b668df87497' \
  -H 'x-algolia-application-id: KOCVQI75M9' \
  --data-raw '{"requests":[{"indexName":"prod_organizations","hitsPerPage":30,"query":"","page":1,"insideBoundingBox":[[39.8474056222239,-21.469669445706018,54.32041256973076,43.1299399292939]],"filters":"category_tags : \"Ecologie et Environnement\"","params":""}]}' \
  --compressed
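
To walk through every page of results, the request body from the curl command above can be built in a small helper. This is only a sketch: `build_payload` is a hypothetical name, the index name, bounding box and filter syntax are copied from the command above, and the exact `category_tags` value for sports associations is not shown in this thread, so it would have to be looked up on the site. Note that Algolia page numbers are zero-based, so `"page":1` in the curl command is actually the second page.

```python
import json

# Values copied from the curl command above.
INDEX_NAME = "prod_organizations"
BOUNDING_BOX = [[39.8474056222239, -21.469669445706018,
                 54.32041256973076, 43.1299399292939]]


def build_payload(page, category, hits_per_page=30):
    """Return the JSON request body for one page of results.

    `category` is the category_tags value to filter on; the tag to use
    for sports associations is an assumption to check on the site.
    """
    request = {
        "indexName": INDEX_NAME,
        "hitsPerPage": hits_per_page,
        "query": "",
        "page": page,  # zero-based in Algolia
        "insideBoundingBox": BOUNDING_BOX,
        "filters": f'category_tags : "{category}"',
        "params": "",
    }
    return json.dumps({"requests": [request]})


if __name__ == "__main__":
    # Body for the first page of the category used in the curl example.
    print(build_payload(0, "Ecologie et Environnement"))
```

The returned string is what goes into `--data-raw`; increment `page` until a page comes back with no hits.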

NB: the site uses Algolia, so we get a lovely JSON as output

{
    "results": [
        {
            "hits": [
                {
                    "url": "https://www.helloasso.com/associations/les-jardins-de-la-leysse",
                    "name": "Les Jardins de la Leysse",
                    "description": "Les jardins de leysses sont une association citoyenne qui entetient et cultive une petite parcelle d'espace vert le long de la rue Lucien Rose à Chambéry",
                    "logo": "https://cdn.helloasso.com/img/logos/croppedimage-94a80b628ae04cd79fe7d02f140c4543.png",
                    "banner": "https://cdn.helloasso.com/img/photos/croppedimage-0072fb8bf48a46c5bf45ba12e4b57774.png",
                    "place_address": "60 rue Lucien Rose",
                    "place_zipcode": "73000",
                    "place_city": "Chambéry",
                    "place_department": "Savoie",
                    "place_region": "Auvergne-Rhône-Alpes",
                    "_geoloc": {
                        "lat": 45.568678,
                        "lng": 5.936789
                    },
                    "org_type": "Association1901",
                    "creation_date": 1466114400,
                    "category_tags": [
                        "Environnement",
                        "Jardinage",
                        "Ecologie et Environnement"
                    ],
                    "ha_tags": [],
                    "active_forms_count": 1,
                    "active_forms_type": "Membership",
                    "active_forms_last_update_date": 1696428726,
                    "last_order_date": 1686152052,
                    "score": 110,
                    "partners": [],
                    "objectID": "49223",
                    "_highlightResult": {
                        "name": {
                            "value": "Les Jardins de la Leysse",
                            "matchLevel": "none",
                            "matchedWords": []
                        },
                        "description": {
                            "value": "Les jardins de leysses sont une association citoyenne qui entetient et cultive une petite parcelle d'espace vert le long de la rue Lucien Rose à Chambéry",
                            "matchLevel": "none",
                            "matchedWords": []
                        },
                        "place_zipcode": {
                            "value": "73000",
                            "matchLevel": "none",
                            "matchedWords": []
                        },
                        "place_city": {
                            "value": "Chambéry",
                            "matchLevel": "none",
                            "matchedWords": []
                        },
                        "place_department": {
                            "value": "Savoie",
                            "matchLevel": "none",
                            "matchedWords": []
                        },
                        "category_tags": [
                            {
                                "value": "Environnement",
                                "matchLevel": "none",
                                "matchedWords": []
                            },
                            {
                                "value": "Jardinage",
                                "matchLevel": "none",
                                "matchedWords": []
                            },
                            {
                                "value": "Ecologie et Environnement",
                                "matchLevel": "none",
                                "matchedWords": []
                            }
                        ]
                    }
                },
                {
                    "url": "https://www.helloasso.com/associations/zero-dechet-lyon",
                    "name": "Zéro Déchet Lyon",
...
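
From a response shaped like the one above, the name and page URL of each association can be pulled out with a few lines of Python. A minimal sketch: `extract_assos` is a hypothetical helper, the sample is a trimmed copy of the first hit above, and filtering on `place_region == "Bretagne"` assumes the region is spelled that way in the index.

```python
def extract_assos(response, region=None):
    """Collect (name, url) pairs from an Algolia multi-query response.

    If `region` is given, keep only hits whose place_region matches it.
    """
    pairs = []
    for result in response.get("results", []):
        for hit in result.get("hits", []):
            if region is not None and hit.get("place_region") != region:
                continue
            pairs.append((hit.get("name"), hit.get("url")))
    return pairs


# Trimmed copy of the first hit from the JSON output above.
sample = {
    "results": [
        {
            "hits": [
                {
                    "url": "https://www.helloasso.com/associations/les-jardins-de-la-leysse",
                    "name": "Les Jardins de la Leysse",
                    "place_region": "Auvergne-Rhône-Alpes",
                }
            ]
        }
    ]
}

if __name__ == "__main__":
    print(extract_assos(sample))
    # "Bretagne" is an assumed spelling of the region value.
    print(extract_assos(sample, region="Bretagne"))
```

The URLs collected this way are the inputs for the second step.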

And then getAsso

Fetch the association's email

curl 'https://www.helloasso.com/associations/les-jardins-de-la-leysse' \
  -H 'authority: www.helloasso.com' \
  -H 'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7' \
  -H 'accept-language: fr-FR,fr;q=0.9,en-US;q=0.8,en;q=0.7' \
  -H 'cookie: i18n_redirected=fr; _gcl_au=1.1.450853532.1696498674; ry_ry-h3ll04s_realytics=eyJpZCI6InJ5X0NGRTc4NjA4LUIwODYtNDRDQi1CMzJDLTJBRDI3MjZGQzkwMyIsImNpZCI6bnVsbCwiZXhwIjoxNzI4MDM0Njc0ODI0LCJjcyI6bnVsbH0%3D; ry_ry-h3ll04s_so_realytics=eyJpZCI6InJ5X0NGRTc4NjA4LUIwODYtNDRDQi1CMzJDLTJBRDI3MjZGQzkwMyIsImNpZCI6bnVsbCwib3JpZ2luIjp0cnVlLCJyZWYiOm51bGwsImNvbnQiOm51bGwsIm5zIjpmYWxzZX0%3D; axeptio_authorized_vendors=%2C%2C; _gid=GA1.2.1943720269.1696498675; ln_or=eyIzNzcxMjEyIjoiZCJ9; ajs_anonymous_id=88f13cda-34fe-4633-8c84-d680a5025e97; analytics_session_id=1696498675715; NPS_ad4ebf8d_last_seen=1696498680025; _ga=GA1.1.1708486417.1696498675; axeptio_all_vendors=%2CSalesForce%2CHelloAsso%2CARRAffinity%2Cgoogle_analytics%2Csegment%2CRealytics%2CFullStory%2CRealytics_marketing%2CAppcues%2CABtasty%2CAffilae%2CGoogle_Ads%2CBing%2Cfacebook_pixel%2Clinkedin_insight_tag%2Csalesforce%2Crealytics%2Cabtasty%2Caffilae%2Cgoogle_ads%2Cbing%2C; axeptio_cookies={%22$$token%22:%22in8znein20oa2w5mo3vycc%22%2C%22$$date%22:%222023-10-05T10:01:11.215Z%22%2C%22SalesForce%22:false%2C%22HelloAsso%22:false%2C%22ARRAffinity%22:false%2C%22google_analytics%22:false%2C%22segment%22:false%2C%22Realytics%22:false%2C%22FullStory%22:false%2C%22Realytics_marketing%22:false%2C%22Appcues%22:false%2C%22ABtasty%22:false%2C%22Affilae%22:false%2C%22Google_Ads%22:false%2C%22Bing%22:false%2C%22facebook_pixel%22:false%2C%22linkedin_insight_tag%22:false%2C%22salesforce%22:false%2C%22realytics%22:false%2C%22abtasty%22:false%2C%22affilae%22:false%2C%22google_ads%22:false%2C%22bing%22:false%2C%22$$completed%22:true}; _ga_TKC826G3G2=GS1.1.1696498678.1.1.1696500214.60.0.0; analytics_session_id.last_access=1696500301756' \
  -H 'sec-ch-ua: "Google Chrome";v="117", "Not;A=Brand";v="8", "Chromium";v="117"' \
  -H 'sec-ch-ua-mobile: ?0' \
  -H 'sec-ch-ua-platform: "macOS"' \
  -H 'sec-fetch-dest: document' \
  -H 'sec-fetch-mode: navigate' \
  -H 'sec-fetch-site: same-origin' \
  -H 'sec-fetch-user: ?1' \
  -H 'upgrade-insecure-requests: 1' \
  -H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36' \
  --compressed

And the email is right there in the page source.
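
Once the HTML of an association page has been downloaded, one way to pick the address out is a plain regular expression. This is a sketch under assumptions: the exact markup around the email on HelloAsso pages is not shown in this thread, so the sample HTML below is made up, and `find_emails` is a hypothetical helper.

```python
import re

# Deliberately loose pattern for email-like strings; not a full RFC 5322 parser.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_emails(html):
    """Return the distinct email-like strings found in `html`, in order."""
    seen = []
    for match in EMAIL_RE.findall(html):
        if match not in seen:
            seen.append(match)
    return seen


# Made-up snippet standing in for a downloaded association page.
sample_html = '<a href="mailto:contact@example.org">contact@example.org</a>'

if __name__ == "__main__":
    print(find_emails(sample_html))  # ['contact@example.org']
```

A pattern this loose also matches things like `logo@2x.png`, so the results are worth a quick filter before use.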

Do you know how to code in Python or something else?

Otherwise you can contact me directly by PM.

:smiling_face_with_three_hearts:

3 Likes

You just need to search before posting…

1 Like

I did search, and your list references almost no associations, just private companies.

You should check before replying…

2 Likes

Répertoire national des associations (RNA) - Région Bretagne — Open data Région Bretagne :pensive:

Hi, here's a list of Breton sports associations :dart:here:dart:. There are 2750 URLs. All that's left is to scrape the pages to retrieve the info you're interested in.

Example keeping only the pair association name - email address: