Sunday, November 11, 2012

Getting Domain Name

How can you parse a sub-domain (e.g. www.google.com) and determine the domain (e.g. google.com) from it?

I've tried suggestions ranging from regex to parsing it manually, but there are no clear rules to build on.

So now what?

Well, there are lists of 1st-level and 2nd-level domains all over the net, and browsers use them to determine which domains you can set cookies on. Let's say you own mydomain.com and run several websites on it, such as site1.mydomain.com and site2.mydomain.com, and you want to share cookies between them. You can set the cookies on mydomain.com and both sites will be able to access them. Allowing a website to read/write cookies on the .com domain itself, however, would be a security risk, so browsers prevent that.

How do they determine the domain from the subdomain?
The suffix can be .com or .co.uk or .info.pl, so there are no fixed rules we can build an algorithm on.

From the TLD and ccSLD lists we can determine what the domain part is: start from the rightmost part and keep adding parts to the left for as long as the result is still found in the lists. The first result that is not in the lists is the domain name.

This way of determining the domain from a subdomain is pretty quick: for 100k domains it takes roughly 275ms on my machine once the TLD/SLD list is populated.
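A figure like this can be reproduced with a Stopwatch. The following is only a minimal sketch: LookupTimer and Measure are hypothetical names that are not part of the project, and the lookup is passed in as a delegate so that any method with GetDomain's shape (string in, string out) can be timed.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical helper, not part of the original project.
public static class LookupTimer
{
    // Calls 'lookup' once per host name and returns the elapsed wall-clock time.
    public static TimeSpan Measure(Func<string, string> lookup, IList<string> hosts)
    {
        // One warm-up call so one-time costs (JIT, loading the list) are not counted.
        if (hosts.Count > 0)
            lookup(hosts[0]);

        var watch = Stopwatch.StartNew();
        for (int i = 0; i < hosts.Count; i++)
            lookup(hosts[i]);
        watch.Stop();

        return watch.Elapsed;
    }
}
```

With a list of host names in a variable called hosts, LookupTimer.Measure(GetDomain, hosts).TotalMilliseconds gives the number to compare. Results will vary with the machine and the size of the list.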

Here's the code:


/// <summary>
/// Retrieves a domain from a subdomain
/// </summary>
/// <param name="subdomain">the subdomain to be parsed</param>
/// <returns>domain</returns>
public static string GetDomain(string subdomain)
{
    if (string.IsNullOrWhiteSpace(subdomain))
        return null;

    //make sure we have a fresh version of the domain list
    CheckCache();

    //clean up the subdomain (invariant lower-casing, so the result doesn't depend on the current culture)
    var cleandomain = subdomain.Trim().ToLowerInvariant();
    
    //split it into parts by dot
    var domainparts = cleandomain.Split('.');

    //assign the top of the domain parts
    string result = domainparts[domainparts.Length - 1];

    //go over the rest of the parts and add them to the domain until we fail to find a
    //match in the _domains HashSet, which means we've reached the domain.
    for (int i = domainparts.Length-2; i >= 0; i--)
    {
        if (!_domains.Contains("." + result))
            break;

        result = domainparts[i] + "." + result;
    }

    return result;
}


What happens is that we split the domain into its parts, then go over the parts from right to left and see where we fail to locate the accumulated name in our hashset.
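To make the walk easier to follow, here is a small sketch that records every intermediate value instead of returning only the last one. SuffixWalkTrace and Walk are hypothetical names used for illustration, and the set is passed in as a parameter rather than read from _domains; the loop itself mirrors GetDomain.

```csharp
using System.Collections.Generic;

// Hypothetical tracing helper, not part of the original project.
public static class SuffixWalkTrace
{
    // Returns every value 'result' takes during the right-to-left walk.
    // The last entry is what GetDomain would return for the same input.
    public static List<string> Walk(string host, HashSet<string> suffixes)
    {
        var parts = host.Trim().ToLowerInvariant().Split('.');
        var steps = new List<string>();

        string result = parts[parts.Length - 1];
        steps.Add(result);

        for (int i = parts.Length - 2; i >= 0; i--)
        {
            // Same test as GetDomain: stop once the candidate is not a known suffix.
            if (!suffixes.Contains("." + result))
                break;

            result = parts[i] + "." + result;
            steps.Add(result);
        }

        return steps;
    }
}
```

For www.example.co.uk and a set holding ".uk" and ".co.uk", the steps are uk, co.uk and example.co.uk. Note that every level has to be in the set: with only ".co.uk" present, the walk stops at uk because ".uk" is not found.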

To populate the hashset, just insert .com, .co.uk, .info.pl, etc. Since GetDomain looks up "." + result, each entry needs to be stored with a leading dot.
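As an illustration, the set could be built along these lines. This is a sketch under assumptions: SuffixSetBuilder and Build are hypothetical names, and how the project's CheckCache actually loads its list is not shown in this post. The only thing taken from GetDomain is the leading-dot convention.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the original project.
public static class SuffixSetBuilder
{
    // Builds the lookup set from raw suffix strings such as "com" or "co.uk".
    // Each entry is trimmed, lower-cased and stored with exactly one leading dot,
    // because GetDomain looks up "." + result.
    public static HashSet<string> Build(IEnumerable<string> rawSuffixes)
    {
        var set = new HashSet<string>(StringComparer.Ordinal);

        foreach (var raw in rawSuffixes)
        {
            // Skip blank lines in the source list.
            if (string.IsNullOrWhiteSpace(raw))
                continue;

            var suffix = raw.Trim().TrimStart('.').ToLowerInvariant();
            if (suffix.Length > 0)
                set.Add("." + suffix);
        }

        return set;
    }
}
```

Calling SuffixSetBuilder.Build(new[] { "com", "uk", "co.uk", "pl", "info.pl" }) yields a set that the lookup can use directly, whether the source list writes its entries with or without the dot.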

Here's the project:
https://github.com/drorgl/ForBlog/tree/master/DomainParsing