scraper_helper
- scraper_helper.change_param(url, param, new_value, create_new=False, upgrade_https=False)
Takes a URL and changes the value of a query string parameter.
@param url: The input URL
@param param: The name of the query string parameter that needs to be changed
@param new_value: The new value for the parameter
@param create_new: If set to True, will create a new query string parameter
@param upgrade_https: If set to True, will upgrade the URL to HTTPS
@return: Updated URL
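To make the parameters concrete, here is a minimal standard-library sketch of the behavior described above. `change_param_sketch` is a hypothetical name, and this is not the library's implementation. In particular, leaving the URL unchanged when the parameter is missing and `create_new` is False is an assumption.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def change_param_sketch(url, param, new_value, create_new=False, upgrade_https=False):
    """Illustrative stand-in for change_param (assumed behavior, not the real code)."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    found = any(key == param for key, _ in pairs)
    # Replace the value wherever the parameter name matches.
    pairs = [(key, str(new_value) if key == param else value) for key, value in pairs]
    # Assumption: a missing parameter is only added when create_new is True.
    if not found and create_new:
        pairs.append((param, str(new_value)))
    scheme = "https" if upgrade_https and parts.scheme == "http" else parts.scheme
    return urlunsplit((scheme, parts.netloc, parts.path, urlencode(pairs), parts.fragment))


print(change_param_sketch("http://example.com/list?page=1&sort=asc", "page", 2))
```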
- scraper_helper.cleanup(s)
Takes a string and cleans it by removing newlines, tabs and whitespace.
@param s: Any string
@return: Cleaned-up string
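The description does not say whether all whitespace is removed or only the surplus. The sketch below assumes the common scraping convention: trim the ends and collapse internal runs of whitespace to a single space. `cleanup_sketch` is a hypothetical name, not the library function.

```python
def cleanup_sketch(s):
    """Collapse newlines, tabs and repeated spaces into single spaces (assumed behavior)."""
    # str.split() with no arguments splits on any run of whitespace
    # and drops leading/trailing whitespace.
    return " ".join(s.split())


print(cleanup_sketch("  Hello\n\tWorld  "))
```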
- scraper_helper.extract_emails(s) → list
Accepts a string and returns a list of the email addresses found in it.
@param s: Any string
@return: List of email addresses
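One plausible way to do this kind of extraction is a regular expression. The pattern below is a deliberately simple assumption: it is not taken from the library and will not cover every valid address form. `extract_emails_sketch` is a hypothetical name.

```python
import re

# Simplified pattern: local part, "@", domain labels, and a TLD of 2+ letters.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails_sketch(s):
    """Return every substring of s that looks like an email address."""
    return EMAIL_PATTERN.findall(s)


print(extract_emails_sketch("Contact sales@example.com or support@example.org."))
```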
- scraper_helper.get_dict(s, sep=': ', strip_cookie=True, strip_cl=True, strip_headers: list = []) → dict
Takes headers copied from dev tools and converts them to a dictionary. Each line is treated as a new dictionary key, so pass the input as a triple-quoted string.
Example Input:
'''
accept: */*
accept-encoding: gzip, deflate, br
'''
Example Output:
{'accept': '*/*', 'accept-encoding': 'gzip, deflate, br'}
@param s: Input string in triple quotes
@param sep: The separator between key and value. Defaults to ': '
@param strip_cookie: Remove cookies. Defaults to True
@param strip_cl: Remove content-length. Defaults to True
@param strip_headers: Optional list of keys to exclude
@return: dictionary
@rtype: dict
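The line-by-line conversion can be sketched as follows. `get_dict_sketch` is a hypothetical stand-in rather than the library's code. The case-insensitive key matching and the skipping of lines without a separator are assumptions.

```python
def get_dict_sketch(s, sep=": ", strip_cookie=True, strip_cl=True, strip_headers=None):
    """Parse 'key: value' lines into a dict, dropping excluded headers (assumed behavior)."""
    excluded = {name.lower() for name in (strip_headers or [])}
    if strip_cookie:
        excluded.add("cookie")
    if strip_cl:
        excluded.add("content-length")

    result = {}
    for line in s.splitlines():
        line = line.strip()
        # Assumption: blank lines and lines without the separator are ignored.
        if not line or sep not in line:
            continue
        key, value = line.split(sep, 1)
        if key.lower() in excluded:
            continue
        result[key] = value
    return result


raw_headers = """
accept: */*
accept-encoding: gzip, deflate, br
cookie: session=abc
content-length: 42
"""
print(get_dict_sketch(raw_headers))
```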
- scraper_helper.get_headers(s: str, sep: str = ': ', strip_cookie: bool = True, strip_cl: bool = True, strip_headers: list = []) → dict
get_headers will be deprecated. Use get_dict instead.
- scraper_helper.get_query_str_val(url: str, qs: str) → str
Takes a URL and extracts the value of a query string parameter.
@rtype: str
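A minimal standard-library sketch of the lookup is shown below. `get_query_str_val_sketch` is a hypothetical name. Returning an empty string for a missing parameter and taking the first value when a parameter repeats are both assumptions, since the description does not cover those cases.

```python
from urllib.parse import parse_qs, urlsplit


def get_query_str_val_sketch(url, qs):
    """Return the first value of query parameter qs, or '' if absent (assumed behavior)."""
    params = parse_qs(urlsplit(url).query)
    return params.get(qs, [""])[0]


print(get_query_str_val_sketch("https://coderecode.com/scrapy-crash-course?src=git", "src"))
```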
- scraper_helper.get_root_address(url)
Takes a URL and returns the root URL.
@param url: Any URL like https://coderecode.com/scrapy-crash-course?src=git
@return: Root URL without path or parameters: https://coderecode.com/
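The example in the description can be reproduced with `urllib.parse`. This sketch uses the hypothetical name `get_root_address_sketch` and only illustrates the idea of keeping the scheme and host. It is not the library's implementation.

```python
from urllib.parse import urlsplit


def get_root_address_sketch(url):
    """Keep only the scheme and host, with a trailing slash."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}/"


print(get_root_address_sketch("https://coderecode.com/scrapy-crash-course?src=git"))
```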
- scraper_helper.get_zip(address, country='US')
Accepts a US or CA address and extracts the zip code in it.
@param address: Address string
@param country: US or CA. Defaults to US
@return: Zip code string
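The following sketch shows one way to pick a code out of an address with regular expressions. The patterns, the hypothetical name `get_zip_sketch`, the choice of the last match, and the empty-string fallback are all assumptions made for illustration.

```python
import re

# US: five digits with an optional ZIP+4 suffix. CA: letter-digit-letter, digit-letter-digit.
US_ZIP = re.compile(r"\b\d{5}(?:-\d{4})?\b")
CA_POSTAL = re.compile(r"\b[A-Za-z]\d[A-Za-z][ -]?\d[A-Za-z]\d\b")


def get_zip_sketch(address, country="US"):
    """Return the last code-like match in the address, or '' if none is found."""
    pattern = CA_POSTAL if country == "CA" else US_ZIP
    matches = pattern.findall(address)
    # Assumption: the code sits at the end, so the last match wins
    # (this avoids picking up a five-digit street number).
    return matches[-1] if matches else ""


print(get_zip_sketch("San Francisco, CA 94105-5829"))
print(get_zip_sketch("1776 Fourth Avenue, St. Catharines, Ontario L2R 6P9", country="CA"))
```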
- scraper_helper.get_zip_canadian(address)
Accepts a Canadian address and extracts the zip code in it.
@param address: Canadian address
@return: Zip code
- scraper_helper.html_decode(s: str) → str
Takes an HTML-encoded string and decodes it.
@param s: HTML-encoded string
@return: Decoded string
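For reference, the standard library offers the same kind of decoding through `html.unescape`. Whether the library delegates to it is an assumption, and `html_decode_sketch` is a hypothetical name.

```python
import html


def html_decode_sketch(s):
    """Convert HTML entities such as &amp; and &lt; back to plain characters."""
    return html.unescape(s)


print(html_decode_sketch("Tom &amp; Jerry &lt;3"))
```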
- scraper_helper.split_address(address) → tuple
Splits a US address into city, state, zip_code.
@param address: An address like San Diego, CA 92129 or San Francisco, CA 94105-5829
@return: city, state, zip_code
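The two example inputs above can be split with a single regular expression. This is an illustrative sketch with a hypothetical name, `split_address_sketch`. The pattern and the `ValueError` on unparseable input are assumptions, not documented library behavior.

```python
import re

# City, then a two-letter state, then a ZIP or ZIP+4 code.
US_ADDRESS = re.compile(r"^(?P<city>.+),\s*(?P<state>[A-Z]{2})\s+(?P<zip>\d{5}(?:-\d{4})?)$")


def split_address_sketch(address):
    """Split 'City, ST 12345' into (city, state, zip_code)."""
    match = US_ADDRESS.match(address.strip())
    if match is None:
        raise ValueError(f"Unrecognized US address: {address!r}")
    return match.group("city").strip(), match.group("state"), match.group("zip")


print(split_address_sketch("San Diego, CA 92129"))
print(split_address_sketch("San Francisco, CA 94105-5829"))
```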
- scraper_helper.split_address_canadian(address) → tuple
Splits a Canadian address into street, city, province, zip_code.
@param address: Canadian address like 1776 Fourth Avenue, St. Catharines, Ontario L2R 6P9
@return: street, city, province, zip_code
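A sketch of how the example address could be broken apart is shown below. It assumes the comma-separated layout from the example (street, city, then province followed by the postal code). `split_address_canadian_sketch` is a hypothetical name, and the error handling is an assumption.

```python
import re

# Province name followed by a postal code such as "L2R 6P9".
PROVINCE_AND_CODE = re.compile(r"^(.+?)\s+([A-Za-z]\d[A-Za-z]\s?\d[A-Za-z]\d)$")


def split_address_canadian_sketch(address):
    """Split 'street, city, province CODE' into a 4-tuple (assumed input layout)."""
    parts = [part.strip() for part in address.rsplit(",", 2)]
    if len(parts) != 3:
        raise ValueError(f"Expected 'street, city, province code': {address!r}")
    street, city, tail = parts
    match = PROVINCE_AND_CODE.match(tail)
    if match is None:
        raise ValueError(f"No province and postal code found in: {tail!r}")
    return street, city, match.group(1), match.group(2)


print(split_address_canadian_sketch("1776 Fourth Avenue, St. Catharines, Ontario L2R 6P9"))
```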
- scraper_helper.split_names(full_name)
Splits a full name into first name and last name. Can accept names like "Zijian Zhang , CPA, MSA, MSF" and "W Mills".
@param full_name: Full name string
@return: first_name, last_name
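One simple approach that handles both example names is to discard everything after the first comma and then split on whitespace. The sketch below uses the hypothetical name `split_names_sketch`. Treating every token after the first as part of the last name is an assumption.

```python
def split_names_sketch(full_name):
    """Return (first_name, last_name), ignoring credentials after the first comma."""
    # "Zijian Zhang , CPA, MSA, MSF" -> "Zijian Zhang"
    name_part = full_name.split(",", 1)[0].strip()
    tokens = name_part.split()
    if not tokens:
        return "", ""
    # Assumption: the first token is the first name, the rest is the last name.
    return tokens[0], " ".join(tokens[1:])


print(split_names_sketch("Zijian Zhang , CPA, MSA, MSF"))
print(split_names_sketch("W Mills"))
```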
- scraper_helper.strip_qs_params(url)
Takes a URL and strips all query string parameters.
@param url: Any URL like https://coderecode.com/scrapy-crash-course?src=git
@return: Full URL without parameters: https://coderecode.com/scrapy-crash-course
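The documented example can be reproduced with `urllib.parse` by rebuilding the URL without its query. `strip_qs_params_sketch` is a hypothetical name. Dropping the fragment as well is an assumption, since the description only mentions query string parameters.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_qs_params_sketch(url):
    """Rebuild the URL from scheme, host and path only."""
    parts = urlsplit(url)
    # Assumption: the fragment is discarded along with the query string.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


print(strip_qs_params_sketch("https://coderecode.com/scrapy-crash-course?src=git"))
```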