{"id":798,"date":"2009-11-28T02:43:27","date_gmt":"2009-11-28T07:43:27","guid":{"rendered":"http:\/\/g33kinfo.com\/info\/?p=798"},"modified":"2009-11-28T02:43:27","modified_gmt":"2009-11-28T07:43:27","slug":"regex-pattern-for-matching-urls","status":"publish","type":"post","link":"https:\/\/g33kinfo.com\/info\/regex-pattern-for-matching-urls\/","title":{"rendered":"Regex Pattern for Matching URLs"},"content":{"rendered":"<p>A Liberal, Accurate Regex Pattern for Matching URLs<br \/>\nFriday, 27 November 2009<\/p>\n<p>A common programming problem: identify the URLs in an arbitrary string of text, where by \u201carbitrary\u201d let\u2019s agree we mean something unstructured such as an email message or a tweet. I offer a solution, in the form of the following regex pattern:<\/p>\n<pre style=\"border: 1px inset; margin: 0px; padding: 0px; overflow: auto; width: 670px; height: 50px; text-align: left;\" dir=\"ltr\">\n\\b(([\\w-]+:\/\/?|www[.])[^\\s()<>]+(?:\\([\\w\\d]+\\)|([^[:punct:]\\s]|\/)))\n<\/pre>\n<p>This pattern should work in most modern regex implementations. I can vouch for it working in Perl, Ruby, and with the PCRE regex library (which in turn means it works in PHP and BBEdit, both of which use PCRE).<\/p>\n<p>This pattern attempts to be practical. It makes no attempt to parse URLs according to any official specification. It isn\u2019t limited to predefined URL protocols. It should be clever about things like parentheses and trailing punctuation. 
For example, it will correctly match the URL in the following example lines:<\/p>\n<p>http:\/\/foo.com\/blah_blah<br \/>\nhttp:\/\/foo.com\/blah_blah\/<br \/>\n(Something like http:\/\/foo.com\/blah_blah)<br \/>\nhttp:\/\/foo.com\/blah_blah_(wikipedia)<br \/>\n(Something like http:\/\/foo.com\/blah_blah_(wikipedia))<br \/>\nhttp:\/\/foo.com\/blah_blah.<br \/>\nhttp:\/\/foo.com\/blah_blah\/.<br \/>\n&lt;http:\/\/foo.com\/blah_blah&gt;<br \/>\n&lt;http:\/\/foo.com\/blah_blah\/&gt;<br \/>\nhttp:\/\/foo.com\/blah_blah,<br \/>\nhttp:\/\/www.example.com\/wpstyle\/?p=364.<br \/>\nhttp:\/\/odf.ws\/e7l<br \/>\nrdar:\/\/1234<br \/>\nrdar:\/1234<br \/>\nx-yojimbo-item:\/\/6303E4C1-xxxx-45A6-AB9D-3A908F59AE0E<br \/>\nmessage:\/\/%3c330e7f8409726r6a4ba78dkf1fd71420c1bf6ff@mail.gmail.com%3e<br \/>\nhttp:\/\/?.ws\/<br \/>\nwww.?.ws\/<br \/>\n&lt;tag&gt;http:\/\/example.com&lt;\/tag&gt;<br \/>\nJust a www.example.com link.<\/p>\n<p>It attempts to be particularly clever with regard to parentheses, which, in my experience, only ever seem to occur in the wild in Wikipedia URLs, and which many URL matching patterns seem to botch. The pattern looks for balanced parentheses within the URL, which is how it correctly omits the trailing parenthesis in the following line:<\/p>\n<p>(Something like http:\/\/foo.com\/blah_blah)<\/p>\n<p>The pattern is also liberal about Unicode glyphs within the URL, which allows it, among other things, to match IDN domain names.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>A Liberal, Accurate Regex Pattern for Matching URLs Friday, 27 November 2009 A common programming problem: identify the URLs in an arbitrary string of text, where by \u201carbitrary\u201d let\u2019s agree we mean something unstructured such as an email message or a tweet. 
I offer a solution, in the form of the following regex pattern: \\b(([\\w-]+:\/\/?|www[.])[^\\s()]+(?:\\([\\w\\d]+\\)|([^[:punct:]\\s]|\/)))&#8230; <\/p>\n<div class=\"read-more navbutton\"><a href=\"https:\/\/g33kinfo.com\/info\/regex-pattern-for-matching-urls\/\">Read More<i class=\"fa fa-angle-double-right\"><\/i><\/a><\/div>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[9],"tags":[],"class_list":["post-798","post","type-post","status-publish","format-standard","hentry","category-info"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v28.0 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Regex Pattern for Matching URLs - Linux Shtuff<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/g33kinfo.com\/info\/regex-pattern-for-matching-urls\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Regex Pattern for Matching URLs - Linux Shtuff\" \/>\n<meta property=\"og:description\" content=\"A Liberal, Accurate Regex Pattern for Matching URLs Friday, 27 November 2009 A common programming problem: identify the URLs in an arbitrary string of text, where by \u201carbitrary\u201d let\u2019s agree we mean something unstructured such as an email message or a tweet. I offer a solution, in the form of the following regex pattern: b(([w-]+:\/\/?|www[.])[^s()]+(?:([wd]+)|([^[:punct:]s]|\/)))... 
Read More\" \/>\n<meta property=\"og:url\" content=\"https:\/\/g33kinfo.com\/info\/regex-pattern-for-matching-urls\/\" \/>\n<meta property=\"og:site_name\" content=\"Linux Shtuff\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/fb.me\/g33kinf0\" \/>\n<meta property=\"article:author\" content=\"https:\/\/fb.me\/g33kinf0\" \/>\n<meta property=\"article:published_time\" content=\"2009-11-28T07:43:27+00:00\" \/>\n<meta name=\"author\" content=\"g33kadmin\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@https:\/\/twitter.com\/drsinger1111\" \/>\n<meta name=\"twitter:site\" content=\"@drsinger1111\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/g33kinfo.com\\\/info\\\/regex-pattern-for-matching-urls\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/g33kinfo.com\\\/info\\\/regex-pattern-for-matching-urls\\\/\"},\"author\":{\"name\":\"g33kadmin\",\"@id\":\"https:\\\/\\\/g33kinfo.com\\\/info\\\/#\\\/schema\\\/person\\\/c022e4c40b13ea1b678e6f020756f547\"},\"headline\":\"Regex Pattern for Matching URLs\",\"datePublished\":\"2009-11-28T07:43:27+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/g33kinfo.com\\\/info\\\/regex-pattern-for-matching-urls\\\/\"},\"wordCount\":331,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\\\/\\\/g33kinfo.com\\\/info\\\/#\\\/schema\\\/person\\\/c022e4c40b13ea1b678e6f020756f547\"},\"articleSection\":[\"General Info\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\\\/\\\/g33kinfo.com\\\/info\\\/regex-pattern-for-matching-urls\\\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/g33kinfo.com\\\/info\\\/regex-pattern-for-matching-urls\\\/\",\"url\":\"https:\\\/\\\/g33kinfo.com\\\/info\\\/regex-pattern-for-matching-urls\\\/\",\"name\":\"Regex Pattern for 
Matching URLs - Linux Shtuff\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/g33kinfo.com\\\/info\\\/#website\"},\"datePublished\":\"2009-11-28T07:43:27+00:00\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/g33kinfo.com\\\/info\\\/regex-pattern-for-matching-urls\\\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/g33kinfo.com\\\/info\\\/regex-pattern-for-matching-urls\\\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/g33kinfo.com\\\/info\\\/regex-pattern-for-matching-urls\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/g33kinfo.com\\\/info\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Regex Pattern for Matching URLs\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/g33kinfo.com\\\/info\\\/#website\",\"url\":\"https:\\\/\\\/g33kinfo.com\\\/info\\\/\",\"name\":\"Linux Shtuff\",\"description\":\"Because I have CRS Syndrome...\",\"publisher\":{\"@id\":\"https:\\\/\\\/g33kinfo.com\\\/info\\\/#\\\/schema\\\/person\\\/c022e4c40b13ea1b678e6f020756f547\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/g33kinfo.com\\\/info\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":[\"Person\",\"Organization\"],\"@id\":\"https:\\\/\\\/g33kinfo.com\\\/info\\\/#\\\/schema\\\/person\\\/c022e4c40b13ea1b678e6f020756f547\",\"name\":\"g33kadmin\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/g33kinfo.com\\\/info\\\/wp-content\\\/uploads\\\/2022\\\/07\\\/minion-researchA.gif\",\"url\":\"https:\\\/\\\/g33kinfo.com\\\/info\\\/wp-content\\\/uploads\\\/2022\\\/07\\\/minion-researchA.gif\",\"contentUrl\":\"https:\\\/\\\/g33kinfo.com\\\/info\\\/wp-content\\\/uploads\\\/2022\\\/07\\\/minion-researchA.gif\",\
"width\":512,\"height\":512,\"caption\":\"g33kadmin\"},\"logo\":{\"@id\":\"https:\\\/\\\/g33kinfo.com\\\/info\\\/wp-content\\\/uploads\\\/2022\\\/07\\\/minion-researchA.gif\"},\"description\":\"I am a g33k, Linux blogger, developer, student and Tech Writer for Liquidweb.com\\\/kb. My passion for all things tech drives my hunt for all the coolz. I often need a vacation after I get back from vacation....\",\"sameAs\":[\"https:\\\/\\\/thelinuxreport.com\",\"https:\\\/\\\/fb.me\\\/g33kinf0\",\"https:\\\/\\\/x.com\\\/https:\\\/\\\/twitter.com\\\/drsinger1111\"]}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Regex Pattern for Matching URLs - Linux Shtuff","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/g33kinfo.com\/info\/regex-pattern-for-matching-urls\/","og_locale":"en_US","og_type":"article","og_title":"Regex Pattern for Matching URLs - Linux Shtuff","og_description":"A Liberal, Accurate Regex Pattern for Matching URLs Friday, 27 November 2009 A common programming problem: identify the URLs in an arbitrary string of text, where by \u201carbitrary\u201d let\u2019s agree we mean something unstructured such as an email message or a tweet. I offer a solution, in the form of the following regex pattern: b(([w-]+:\/\/?|www[.])[^s()]+(?:([wd]+)|([^[:punct:]s]|\/)))... 
Read More","og_url":"https:\/\/g33kinfo.com\/info\/regex-pattern-for-matching-urls\/","og_site_name":"Linux Shtuff","article_publisher":"https:\/\/fb.me\/g33kinf0","article_author":"https:\/\/fb.me\/g33kinf0","article_published_time":"2009-11-28T07:43:27+00:00","author":"g33kadmin","twitter_card":"summary_large_image","twitter_creator":"@https:\/\/twitter.com\/drsinger1111","twitter_site":"@drsinger1111","schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/g33kinfo.com\/info\/regex-pattern-for-matching-urls\/#article","isPartOf":{"@id":"https:\/\/g33kinfo.com\/info\/regex-pattern-for-matching-urls\/"},"author":{"name":"g33kadmin","@id":"https:\/\/g33kinfo.com\/info\/#\/schema\/person\/c022e4c40b13ea1b678e6f020756f547"},"headline":"Regex Pattern for Matching URLs","datePublished":"2009-11-28T07:43:27+00:00","mainEntityOfPage":{"@id":"https:\/\/g33kinfo.com\/info\/regex-pattern-for-matching-urls\/"},"wordCount":331,"commentCount":0,"publisher":{"@id":"https:\/\/g33kinfo.com\/info\/#\/schema\/person\/c022e4c40b13ea1b678e6f020756f547"},"articleSection":["General Info"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/g33kinfo.com\/info\/regex-pattern-for-matching-urls\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/g33kinfo.com\/info\/regex-pattern-for-matching-urls\/","url":"https:\/\/g33kinfo.com\/info\/regex-pattern-for-matching-urls\/","name":"Regex Pattern for Matching URLs - Linux 
Shtuff","isPartOf":{"@id":"https:\/\/g33kinfo.com\/info\/#website"},"datePublished":"2009-11-28T07:43:27+00:00","breadcrumb":{"@id":"https:\/\/g33kinfo.com\/info\/regex-pattern-for-matching-urls\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/g33kinfo.com\/info\/regex-pattern-for-matching-urls\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/g33kinfo.com\/info\/regex-pattern-for-matching-urls\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/g33kinfo.com\/info\/"},{"@type":"ListItem","position":2,"name":"Regex Pattern for Matching URLs"}]},{"@type":"WebSite","@id":"https:\/\/g33kinfo.com\/info\/#website","url":"https:\/\/g33kinfo.com\/info\/","name":"Linux Shtuff","description":"Because I have CRS Syndrome...","publisher":{"@id":"https:\/\/g33kinfo.com\/info\/#\/schema\/person\/c022e4c40b13ea1b678e6f020756f547"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/g33kinfo.com\/info\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":["Person","Organization"],"@id":"https:\/\/g33kinfo.com\/info\/#\/schema\/person\/c022e4c40b13ea1b678e6f020756f547","name":"g33kadmin","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/g33kinfo.com\/info\/wp-content\/uploads\/2022\/07\/minion-researchA.gif","url":"https:\/\/g33kinfo.com\/info\/wp-content\/uploads\/2022\/07\/minion-researchA.gif","contentUrl":"https:\/\/g33kinfo.com\/info\/wp-content\/uploads\/2022\/07\/minion-researchA.gif","width":512,"height":512,"caption":"g33kadmin"},"logo":{"@id":"https:\/\/g33kinfo.com\/info\/wp-content\/uploads\/2022\/07\/minion-researchA.gif"},"description":"I am a g33k, Linux blogger, developer, student and Tech Writer for Liquidweb.com\/kb. My passion for all things tech drives my hunt for all the coolz. 
I often need a vacation after I get back from vacation....","sameAs":["https:\/\/thelinuxreport.com","https:\/\/fb.me\/g33kinf0","https:\/\/x.com\/https:\/\/twitter.com\/drsinger1111"]}]}},"_links":{"self":[{"href":"https:\/\/g33kinfo.com\/info\/wp-json\/wp\/v2\/posts\/798","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/g33kinfo.com\/info\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/g33kinfo.com\/info\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/g33kinfo.com\/info\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/g33kinfo.com\/info\/wp-json\/wp\/v2\/comments?post=798"}],"version-history":[{"count":0,"href":"https:\/\/g33kinfo.com\/info\/wp-json\/wp\/v2\/posts\/798\/revisions"}],"wp:attachment":[{"href":"https:\/\/g33kinfo.com\/info\/wp-json\/wp\/v2\/media?parent=798"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/g33kinfo.com\/info\/wp-json\/wp\/v2\/categories?post=798"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/g33kinfo.com\/info\/wp-json\/wp\/v2\/tags?post=798"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}