
I have thousands of text blocks that contain references to studies. One sample looks like this:

txt = '<div>1. <em>Nationella riktlinjer för rörelseorganens sjukdomar</em> (Swedish National Guidelines). 2012, The National Board of Health and Welfare. doi:10.1097/BRS.0b013e31829ff095 https://www.socialstyrelsen.se/publikationer2012/2012-5-1</a></div><div>2. Jevsevar, D.S., et al., <em>The American Academy of Orthopaedic Surgeons evidence-based guideline on: treatment of osteoarthritis of the knee, 2nd edition.</em> J Bone Joint Surg Am, 2013. <strong>95</strong>(20): p. 1885-6. <a href="http://www.ncbi.nlm.nih.gov/pubmed/24288804" rel="noopener noreferrer" target="_blank" style="color: rgb(220, 161, 13);">http://www.ncbi.nlm.nih.gov/pubmed/24288804</a></div><div>3. Namba, R.S., et al., <em>Obesity and perioperative morbidity in total hip and total knee arthroplasty patients.</em> J Arthroplasty, 2005. <strong>20</strong>(7 Suppl 3): p. 46-50. <a href="https://dx.doi.org/10.1016/j.arth.2005.04.023" rel="noopener noreferrer" target="_blank" style="color: rgb(220, 161, 13);">https://dx.doi.org/10.1016/j.arth.2005.04.023</a></div><div>4. Peter, W.F., et al., <em>Physiotherapy in hip and knee osteoarthritis: development of a practice guideline concerning initial assessment, treatment and evaluation.</em> Acta Reumatol Port, 2011. <strong>36</strong>(3): p. 268-81. <a href="http://www.ncbi.nlm.nih.gov/pubmed/22113602" rel="noopener noreferrer" target="_blank" style="color: rgb(220, 161, 13);">http://www.ncbi.nlm.nih.gov/pubmed/22113602</a></div><div>5. Santoso, M.B. and L. Wu, <em>Unicompartmental knee arthroplasty, is it superior to high tibial osteotomy in treating unicompartmental osteoarthritis? A meta-analysis and systemic review.</em>&nbsp;J Orthop Surg Res, 2017. <strong>12</strong>(1): p. 50.&nbsp;<a href="https://dx.doi.org/10.1186/s13018-017-0552-9" rel="noopener noreferrer" target="_blank" style="color: rgb(220, 161, 13);">https://dx.doi.org/10.1186/s13018-017-0552-9</a></div><div>6. Management of osteoarthritis. NICE guidelines. 
NICE Pathway last updated: 22 January 2019. <a href="https://pathways.nice.org.uk/pathways/osteoarthritis/management-of-osteoarthritis.pdf" rel="noopener noreferrer" target="_blank" style="color: rgb(220, 161, 13);">https://pathways.nice.org.uk/pathways/osteoarthritis/management-of-osteoarthritis.pdf</a></div><div>&nbsp;</div>'

The text contains several links and DOIs. How can I extract all of them, for example into a list such as

['doi:10.1097/BRS.0b013e31829ff095',
'https://dx.doi.org/10.1016/j.arth.2005.04.023',
'https://dx.doi.org/10.1016/j.arth.2005.04.023',
'https://dx.doi.org/10.1186/s13018-017-0552-9',
]

I have tried several regular expressions for this, but to no avail. For example:

import re
exp = "10.\\d{4,9}/[-._;()/:a-z0-9A-Z]+"
pattern = re.compile(exp)

pattern.findall(txt)

This returns an empty list.

  • It seems to work, i.e. it matches some text; see regex101.com/r/0nL5jj/1 Commented Dec 20, 2022 at 16:14
  • Iterating over a string gives characters back, not lines. Try print(line) at the end to confirm. If that works, just remove the for loop. Commented Dec 20, 2022 at 16:15
  • Also note that . matches any single character, not just a dot. You'd need to escape the dot. Commented Dec 20, 2022 at 16:16
  • @SamMason True. I removed the loop and looked for all the matches. Still empty. Commented Dec 20, 2022 at 16:22
  • See ideone.com/kCCeTa Commented Dec 20, 2022 at 16:25
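The loop mentioned in the comments is no longer shown in the question, so the following is only a sketch of the pitfall under that assumption: iterating over a `str` yields single characters, and no single character can match a DOI pattern. The sample string is shortened from the question, and the names `per_char` and `whole` are illustrative.

```python
import re

# Dot escaped so it only matches a literal "."
pattern = re.compile(r"10\.\d{4,9}/[-._;()/:a-zA-Z0-9]+")
text = "doi:10.1097/BRS.0b013e31829ff095"

# Looping over a str yields one character at a time,
# so each piece is far too short to match the pattern.
per_char = [m for ch in text for m in pattern.findall(ch)]

# Searching the whole string in a single call finds the DOI.
whole = pattern.findall(text)

print(per_char)  # []
print(whole)     # ['10.1097/BRS.0b013e31829ff095']
```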

1 Answer


Thanks to @wiktor-stribiżew, I got it working.

import re

# Escape the dot so it matches a literal "." rather than any character
exp = r"10\.\d{4,9}/[-._;()/:a-z0-9A-Z]+"
pattern = re.compile(exp)

print(pattern.findall(txt))

['10.1097/BRS.0b013e31829ff095', '10.1016/j.arth.2005.04.023', '10.1016/j.arth.2005.04.023', '10.1186/s13018-017-0552-9', '10.1186/s13018-017-0552-9']
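Each DOI that is also a link shows up twice in that output, because it appears both in the `href` attribute and in the link text. If one entry per DOI is wanted, a minimal sketch along these lines may help. The helper name `extract_dois` and the shortened `sample` string are illustrative, not part of the original code.

```python
import re

# Raw string with the dot escaped so "10." only matches a literal period.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[-._;()/:a-zA-Z0-9]+")


def extract_dois(text):
    """Return the unique DOIs found in text, in order of first appearance."""
    # dict.fromkeys keeps insertion order while dropping repeats.
    return list(dict.fromkeys(DOI_PATTERN.findall(text)))


# Shortened sample in the same shape as the question's HTML.
sample = (
    '<div>doi:10.1097/BRS.0b013e31829ff095 '
    '<a href="https://dx.doi.org/10.1016/j.arth.2005.04.023">'
    'https://dx.doi.org/10.1016/j.arth.2005.04.023</a></div>'
)

print(extract_dois(sample))
# ['10.1097/BRS.0b013e31829ff095', '10.1016/j.arth.2005.04.023']
```

One caveat: the character class includes `.`, so a DOI written at the very end of a sentence would pick up the trailing period. That does not happen in the sample above, but it may be worth stripping trailing punctuation on other inputs.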