scraper_helper

scraper_helper.change_param(url, param, new_value, create_new=False, upgrade_https=False)

Takes a URL and changes the value of a query string parameter.
@param url: The input URL
@param param: The name of the query string parameter that needs to be changed
@param new_value: The new value for the parameter
@param create_new: If set to True, will create a new query string parameter
@param upgrade_https: If set to True, will upgrade to HTTPS
@return: Updated URL
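To make the `create_new` and `upgrade_https` flags concrete, here is a minimal standard-library sketch of the behavior described above. The name `change_param_sketch` is hypothetical, and leaving the URL untouched when the parameter is missing and `create_new` is False is an assumption, not the library's confirmed implementation.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def change_param_sketch(url, param, new_value, create_new=False, upgrade_https=False):
    """Hypothetical re-creation of the documented behavior (not the library code)."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    found = any(key == param for key, _ in pairs)
    # Replace the value wherever the parameter already exists.
    pairs = [(key, str(new_value) if key == param else value) for key, value in pairs]
    # Assumption: a missing parameter is only added when create_new is True.
    if not found and create_new:
        pairs.append((param, str(new_value)))
    scheme = "https" if upgrade_https and parts.scheme == "http" else parts.scheme
    return urlunsplit((scheme, parts.netloc, parts.path, urlencode(pairs), parts.fragment))


print(change_param_sketch("http://example.com/list?page=1&sort=asc", "page", 2))
# http://example.com/list?page=2&sort=asc
```

This pattern is typical for pagination: keep every other parameter and rewrite only the page number.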

scraper_helper.cleanup(s)

Takes a string and cleans it by removing newlines, tabs, and whitespace.
@param s: Any string
@return: Cleaned up string
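A common way to get this kind of cleanup is to split on whitespace and re-join with single spaces. The sketch below assumes that interpretation (runs of whitespace collapse to one space, and leading and trailing whitespace is dropped). `cleanup_sketch` is a hypothetical name, and the library may behave differently at the edges.

```python
def cleanup_sketch(s):
    """Hypothetical stand-in: collapse all whitespace runs into single spaces."""
    # str.split() with no arguments splits on any run of spaces, tabs, or
    # newlines and ignores leading/trailing whitespace.
    return " ".join(s.split())


print(cleanup_sketch("  Price:\n\t$19.99  "))
# Price: $19.99
```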

scraper_helper.extract_emails(s) -> list

Accepts a string and returns a list of the email addresses inside it.
@param s: Any string
@return: list of email addresses
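Email extraction of this kind is usually done with a regular expression. The sketch below uses a deliberately simple pattern chosen for illustration. Both the pattern and the name `extract_emails_sketch` are assumptions, and the library's own pattern may accept or reject different edge cases.

```python
import re

# Simplified pattern: local part, "@", domain labels, and a 2+ letter TLD.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails_sketch(s):
    """Hypothetical stand-in: return every substring that looks like an email."""
    return EMAIL_PATTERN.findall(s)


print(extract_emails_sketch("Contact sales@example.com or support@example.org."))
# ['sales@example.com', 'support@example.org']
```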

scraper_helper.get_dict(s, sep=': ', strip_cookie=True, strip_cl=True, strip_headers: list = []) -> dict

Takes headers copied from dev tools and converts them to a dictionary. Each line of the input is treated as a new dictionary key, so pass the input as a triple-quoted string.
Example Input:
'''
accept: */*
accept-encoding: gzip, deflate, br
'''
Example Output: {'accept': '*/*', 'accept-encoding': 'gzip, deflate, br'}
@param s: Input string in triple quotes
@param sep: The separator between key and value. Defaults to ': '
@param strip_cookie: Remove cookies. Defaults to True
@param strip_cl: Remove content-length. Defaults to True
@param strip_headers: Optional list of keys that need to be excluded
@return: dictionary
@rtype: dict
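The line-by-line parsing described here can be sketched with plain string operations. The version below is an illustration under stated assumptions, not the library's code: `get_dict_sketch` is a hypothetical name, keys are compared case-insensitively when stripping, and lines without the separator are skipped.

```python
def get_dict_sketch(s, sep=": ", strip_cookie=True, strip_cl=True, strip_headers=None):
    """Hypothetical stand-in: turn 'key: value' lines into a dictionary."""
    excluded = {name.lower() for name in (strip_headers or [])}
    if strip_cookie:
        excluded.add("cookie")
    if strip_cl:
        excluded.add("content-length")

    result = {}
    for line in s.splitlines():
        line = line.strip()
        # Assumption: blank lines and lines without the separator are ignored.
        if not line or sep not in line:
            continue
        key, value = line.split(sep, 1)  # split only on the first separator
        if key.lower() in excluded:
            continue
        result[key] = value
    return result


raw_headers = """
accept: */*
accept-encoding: gzip, deflate, br
cookie: session=abc123
content-length: 42
"""
print(get_dict_sketch(raw_headers))
# {'accept': '*/*', 'accept-encoding': 'gzip, deflate, br'}
```

Dropping `cookie` and `content-length` by default makes sense for replayed requests: cookies are usually session-specific, and the content length is recomputed by the HTTP client.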

scraper_helper.get_headers(s: str, sep: str = ': ', strip_cookie: bool = True, strip_cl: bool = True, strip_headers: list = []) -> dict

get_headers will be deprecated. Use get_dict instead.

scraper_helper.get_query_str_val(url: str, qs: str) -> str

Takes a URL and extracts the value of a query string parameter.
@rtype: str
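Reading a single query string value maps directly onto `urllib.parse`. In the sketch below, `get_query_str_val_sketch` is a hypothetical name. Returning the first value when a parameter repeats, and an empty string when it is absent, are assumptions about behavior the documentation does not specify.

```python
from urllib.parse import parse_qs, urlsplit


def get_query_str_val_sketch(url, qs):
    """Hypothetical stand-in: return the first value of query parameter `qs`."""
    values = parse_qs(urlsplit(url).query).get(qs, [])
    # Assumption: a missing parameter yields an empty string.
    return values[0] if values else ""


print(get_query_str_val_sketch("https://coderecode.com/scrapy-crash-course?src=git", "src"))
# git
```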

scraper_helper.get_root_address(url)

Takes a URL and returns the root URL.
@param url: Any URL like https://coderecode.com/scrapy-crash-course?src=git
@return: root URL without path or parameters: https://coderecode.com/
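The root address is the scheme plus host with a trailing slash, as in the docstring's example. A minimal sketch using `urllib.parse` follows. The name `get_root_address_sketch` is hypothetical, and this is not the library's confirmed implementation.

```python
from urllib.parse import urlsplit


def get_root_address_sketch(url):
    """Hypothetical stand-in: keep only scheme and host, plus a trailing slash."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}/"


print(get_root_address_sketch("https://coderecode.com/scrapy-crash-course?src=git"))
# https://coderecode.com/
```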

scraper_helper.get_zip(address, country='US')

Accepts a US or CA address and extracts the zip code from it.
@param address: String
@param country: US or CA. Defaults to US
@return: Zip Code string
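US ZIP codes (five digits, optionally ZIP+4) and Canadian postal codes (letter-digit-letter, digit-letter-digit) are both regular enough to match with simple patterns. The sketch below illustrates the idea. The patterns, the choice to return the last match (so a five-digit street number is not mistaken for the ZIP), the empty-string fallback, and the name `get_zip_sketch` are all assumptions.

```python
import re

US_ZIP = re.compile(r"\b\d{5}(?:-\d{4})?\b")
CA_POSTAL = re.compile(r"\b[A-Za-z]\d[A-Za-z] ?\d[A-Za-z]\d\b")


def get_zip_sketch(address, country="US"):
    """Hypothetical stand-in: pull a ZIP or postal code out of an address."""
    pattern = CA_POSTAL if country.upper() == "CA" else US_ZIP
    matches = pattern.findall(address)
    # Assumption: the code sits at the end of the address, so take the last match.
    return matches[-1] if matches else ""


print(get_zip_sketch("San Francisco, CA 94105-5829"))
# 94105-5829
print(get_zip_sketch("1776 Fourth Avenue, St. Catharines, Ontario L2R 6P9", country="CA"))
# L2R 6P9
```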

scraper_helper.get_zip_canadian(address)

Accepts a Canadian address and extracts the zip code from it.
@param address: Canadian Address
@return: Zip Code

scraper_helper.html_decode(s: str) -> str

Takes an HTML-encoded string and decodes it.
@param s: HTML-encoded string
@return: Decoded string
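Python's standard library already decodes HTML entities through `html.unescape`. The sketch below shows that call under a hypothetical wrapper name, `html_decode_sketch`. Whether the library delegates to the same function is an assumption.

```python
import html


def html_decode_sketch(s):
    """Hypothetical stand-in: convert HTML entities back to plain characters."""
    return html.unescape(s)


print(html_decode_sketch("Tom &amp; Jerry &lt;3"))
# Tom & Jerry <3
```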

scraper_helper.split_address(address) -> tuple

Splits a US address into city, state, zip_code.
@param address: like San Diego, CA 92129 or San Francisco, CA 94105-5829
@return: city, state, zip_code
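The "City, ST ZIP" shape in the docstring's examples can be captured with one regular expression. The sketch below assumes exactly that layout (a comma after the city, a two-letter state, then a ZIP or ZIP+4) and raises `ValueError` otherwise. The name `split_address_sketch` and the error handling are assumptions, not the library's behavior.

```python
import re

CITY_STATE_ZIP = re.compile(
    r"^\s*(?P<city>.+?),\s*(?P<state>[A-Za-z]{2})\s+(?P<zip>\d{5}(?:-\d{4})?)\s*$"
)


def split_address_sketch(address):
    """Hypothetical stand-in: split 'City, ST 12345' into its three parts."""
    match = CITY_STATE_ZIP.match(address)
    if match is None:
        # Assumption: unparseable input is reported as an error.
        raise ValueError(f"Unrecognized address format: {address!r}")
    return match.group("city"), match.group("state"), match.group("zip")


print(split_address_sketch("San Diego, CA 92129"))
# ('San Diego', 'CA', '92129')
```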

scraper_helper.split_address_canadian(address) -> tuple

Splits a Canadian address into street, city, province, zip_code.
@param address: Canadian Address like 1776 Fourth Avenue, St. Catharines, Ontario L2R 6P9
@return: street, city, province, zip_code
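For the comma-separated layout in the docstring's example, one workable approach is to split on commas and then separate the province from the postal code with a pattern. This is a sketch under that assumption only. `split_address_canadian_sketch` is a hypothetical name, and addresses in other layouts raise `ValueError` here.

```python
import re

PROVINCE_POSTAL = re.compile(
    r"^(?P<province>.+?)\s+(?P<zip>[A-Za-z]\d[A-Za-z]\s?\d[A-Za-z]\d)$"
)


def split_address_canadian_sketch(address):
    """Hypothetical stand-in: split 'street, city, province postal-code'."""
    parts = [part.strip() for part in address.split(",")]
    if len(parts) < 3:
        raise ValueError(f"Unrecognized address format: {address!r}")
    # Anything before the last two segments is treated as the street.
    street = ", ".join(parts[:-2])
    city = parts[-2]
    match = PROVINCE_POSTAL.match(parts[-1])
    if match is None:
        raise ValueError(f"No province and postal code found in: {parts[-1]!r}")
    return street, city, match.group("province"), match.group("zip")


print(split_address_canadian_sketch("1776 Fourth Avenue, St. Catharines, Ontario L2R 6P9"))
# ('1776 Fourth Avenue', 'St. Catharines', 'Ontario', 'L2R 6P9')
```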

scraper_helper.split_names(full_name)

Splits a full name into first name and last name. Can accept names like "Zijian Zhang , CPA, MSA, MSF" and "W Mills".
@param full_name: Full name string
@return: first_name, last_name
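The two sample names suggest that trailing credentials after a comma are discarded before the name is split. The sketch below follows that reading. `split_names_sketch` is a hypothetical name, and treating everything after the first word as the last name is an assumption that may not match how the library handles middle names.

```python
def split_names_sketch(full_name):
    """Hypothetical stand-in: return (first_name, last_name) from a full name."""
    # Drop credentials that follow the first comma, e.g. ", CPA, MSA, MSF".
    name_only = full_name.split(",", 1)[0]
    tokens = name_only.split()
    if not tokens:
        return "", ""
    # Assumption: the first word is the first name, the rest is the last name.
    return tokens[0], " ".join(tokens[1:])


print(split_names_sketch("Zijian Zhang , CPA, MSA, MSF"))
# ('Zijian', 'Zhang')
print(split_names_sketch("W Mills"))
# ('W', 'Mills')
```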

scraper_helper.strip_qs_params(url)

Takes a URL and strips all query string parameters.
@param url: Any URL like https://coderecode.com/scrapy-crash-course?src=git
@return: full URL without parameters: https://coderecode.com/scrapy-crash-course
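Stripping the query string amounts to rebuilding the URL from its scheme, host, and path. The sketch below does that with `urllib.parse`. The name `strip_qs_params_sketch` is hypothetical, and also dropping the fragment is an assumption the documentation does not confirm.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_qs_params_sketch(url):
    """Hypothetical stand-in: remove the query string from a URL."""
    parts = urlsplit(url)
    # Assumption: the fragment is discarded along with the query string.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


print(strip_qs_params_sketch("https://coderecode.com/scrapy-crash-course?src=git"))
# https://coderecode.com/scrapy-crash-course
```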