annotate fetch_atel.py @ 0:a35056104c2c draft default tip
planemo upload for repository https://github.com/esg-epfl-apc/tools-astro/tree/main/tools commit da42ae0d18f550dec7f6d7e29d297e7cf1909df2
author | astroteam |
---|---|
date | Fri, 13 Jun 2025 13:26:36 +0000 |
parents | |
children |
import os
import re

import requests
from bs4 import BeautifulSoup

atel_number = 16672


def fetch_atel(atel_number):
    """
    Fetches the ATel page for the given ATel number and returns the ATel text.
    It assumes that the ATel text is the first paragraph after the paragraph
    that contains the string "Tweet".
    input : atel_number (int): The ATel number to fetch.
    output : cleaned_text (str): The ATel text with non-ASCII characters removed.
             If an error occurs, it returns None.
    """

    # URL of the ATel page
    url = 'https://www.astronomerstelegram.org/?read={}'.format(atel_number)

    # Send a browser-like User-Agent header
    # This is to avoid being blocked by the server for not having a User-Agent
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36'
    }

    # This is mainly for testing purposes
    # Check if the file already exists
    # If it does, read the content from the file
    # If it doesn't, fetch the page and save it to a file
    # The file name is based on the ATel number
    # For example, if the ATel number is 16672, the file name will be 'atel_16672.html'

    fname = 'atel_{}.html'.format(atel_number)

    if not os.path.isfile(fname):
        # Send a GET request to the URL; the timeout keeps a stalled server
        # from blocking the call indefinitely
        try:
            response = requests.get(url, headers=headers, timeout=30)
        except requests.RequestException as exc:
            print(f"Failed to retrieve the page: {exc}")
            return None
        if response.status_code == 200:
            print("Page fetched successfully.")
            with open(fname, 'w', encoding='utf-8') as f:
                f.write(response.text)
            response_text = response.text
        else:
            print(f"Failed to retrieve the page. Status code: {response.status_code}")
            return None
    else:
        print("Page already fetched.")
        with open(fname, 'r', encoding='utf-8') as f:
            response_text = f.read()

    soup = BeautifulSoup(response_text, 'html.parser')

    # print(soup.prettify())

    if soup.body is None:
        print("Page has no <body> element.")
        return None

    # Find the index of the (last) paragraph that contains "Tweet"
    paragraphs = soup.body.find_all("p")
    twitter_index = -1
    for i, paragraph in enumerate(paragraphs):
        if 'Tweet' in paragraph.get_text(strip=True):
            twitter_index = i

    # Without a "Tweet" paragraph, or with nothing after it, the ATel text
    # cannot be located
    if twitter_index == -1 or twitter_index + 1 >= len(paragraphs):
        print("ATel text not found.")
        return None

    para = paragraphs[twitter_index + 1]

    cleaned_text = re.sub(r'[^\x00-\x7F]+', '', para.text)  # remove non-ASCII
    print(cleaned_text)

    return cleaned_text


if __name__ == "__main__":
    fetch_atel(atel_number)
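
The paragraph-selection and ASCII-cleanup steps of fetch_atel can be tried without network access or BeautifulSoup. The following is a minimal standard-library sketch, not part of fetch_atel.py: the helper name `text_after_marker` and the sample strings are hypothetical, and it works on a list of already-extracted paragraph strings rather than on parsed tags. It keeps the last paragraph containing the marker, as the loop in fetch_atel does, and, as an assumption of this sketch, returns None when the marker is missing or has no following paragraph.

```python
import re


def text_after_marker(paragraphs, marker="Tweet"):
    """Return the ASCII-only text of the paragraph that follows the marker.

    Hypothetical helper for illustration: `paragraphs` is a list of plain
    strings, one per <p> element, in document order.
    """
    marker_index = -1
    for i, text in enumerate(paragraphs):
        if marker in text:
            marker_index = i  # no break: the last matching paragraph wins

    # Assumption of this sketch: a missing marker, or a marker in the final
    # paragraph, is reported as None instead of raising or guessing.
    if marker_index == -1 or marker_index + 1 >= len(paragraphs):
        return None

    # Delete every run of non-ASCII characters.
    return re.sub(r'[^\x00-\x7F]+', '', paragraphs[marker_index + 1])


# Made-up sample paragraphs, with a multiplication sign and an en dash.
sample = [
    "Related ATels",
    "Tweet",
    "Flux rose by 2\u00d7 in the 0.3\u201310 keV band.",
]

print(text_after_marker(sample))
print(text_after_marker(["no marker here"]))
```

Note that deleting non-ASCII characters outright can change the meaning of the text: in the sample above the en dash in the range "0.3–10" disappears and the two numbers fuse into "0.310". If that matters downstream, replacing such characters with a space or an ASCII equivalent may be a safer choice.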