/**
 * Extracts the text of a range of pages.
 * @param file path to the PDF document
 * @param startPage first page to extract
 * @param endPage last page to extract
 */
public static string ExtractTXT(String file, int startPage, int endPage)
{
    String content = string.Empty;
    try
    {
        PDDocument document = PDDocument.load(file);
        // Create a PDFTextStripper to extract the text
        PDFTextStripper stripper = new PDFTextStripper();
        // Output the text sorted by position
        stripper.setSortByPosition(true);
        // Set the first page
        stripper.setStartPage(startPage);
        // Set the last page
        stripper.setEndPage(endPage);
        // Extract the text content
        content = stripper.getText(document);
        document.close();
    }
    catch (java.io.FileNotFoundException ex)
    {
        // Swallowed: the caller receives an empty string
    }
    catch (java.io.IOException ex)
    {
        // Swallowed: the caller receives an empty string
    }
    return(content);
}
public static Dictionary<int, string> Extract(string pdfFileName)
{
    if (!File.Exists(pdfFileName))
    {
        // Report the missing path itself rather than the literal parameter name
        throw new FileNotFoundException("PDF file not found.", pdfFileName);
    }
    var result = new Dictionary<int, string>();
    PDDocument pdfDocument = PDDocument.load(pdfFileName);
    var pdfStripper = new PDFTextStripper();
    pdfStripper.setPageSeparator(Environment.NewLine + Environment.NewLine);
    // Extract one page at a time; PDFBox page numbers are 1-based
    for (int i = 1; i <= pdfDocument.getNumberOfPages(); i++)
    {
        pdfStripper.setStartPage(i);
        pdfStripper.setEndPage(i);
        result.Add(i, GetText(pdfStripper, pdfDocument));
    }
    pdfDocument.close();
    return(result);
}
public static Dictionary<int, string> Extract(string pdfFileName)
{
    if (!File.Exists(pdfFileName))
    {
        // Report the missing path itself rather than the literal parameter name
        throw new FileNotFoundException("PDF file not found.", pdfFileName);
    }
    var result = new Dictionary<int, string>();
    PDDocument pdfDocument = PDDocument.load(pdfFileName);
    var pdfStripper = new PDFTextStripper();
    pdfStripper.setPageSeparator(Environment.NewLine + Environment.NewLine);
    // Extract one page at a time; PDFBox page numbers are 1-based
    for (int i = 1; i <= pdfDocument.getNumberOfPages(); i++)
    {
        pdfStripper.setStartPage(i);
        pdfStripper.setEndPage(i);
        //ExtractText(pdfStripper, pdfDocument,
        //    string.Format(@"c:\Users\tri.hoang\Desktop\temp\epub-belastingblad\2014-08\pdf\page_{0}.txt", i.ToString().PadLeft(5, '0')));
        result.Add(i, GetText(pdfStripper, pdfDocument));
    }
    pdfDocument.close();
    return(result);
}
private static string GetPageText(int pageNum)
{
    // Restrict the shared stripper to the requested page
    _stripper.setStartPage(pageNum);
    _stripper.setEndPage(pageNum);
    string docText = _stripper.getText(_pdfDoc);
    string pageText = ParsePages(docText);
    //string pageText = pages.ToString();
    return(pageText);
}
public void loadPDF(String pdfFilePath)
{
    // Get the file path
    filePath = pdfFilePath;
    /* Load the PDF document. */
    pdfDoc = PDDocument.load(filePath);
    /* Make a text reader for PDF. */
    pdfInfoGetter = new PDFTextStripper();
    // Extract from the first page up to LastPage
    pdfInfoGetter.setEndPage(LastPage);
    pdfText = pdfInfoGetter.getText(pdfDoc);
}
public void Infocrim()
{
    var options = new ChromeOptions();
    options.AddArguments("headless");
    //using (IWebDriver driver = new ChromeDriver("C:/inetpub/wwwroot/wwwroot",options))
    using (IWebDriver driver = new ChromeDriver())
    {
        Actions builder = new Actions(driver);
        // Validation
        driver.Navigate().GoToUrl("http://ec2-18-231-116-58.sa-east-1.compute.amazonaws.com/");
        driver.FindElement(By.Id("username")).SendKeys("fiap");
        driver.FindElement(By.Id("password")).SendKeys("mpsp");
        driver.FindElement(By.Id("password")).SendKeys(Keys.Enter);
        driver.Navigate().GoToUrl("http://ec2-18-231-116-58.sa-east-1.compute.amazonaws.com/infocrim/login.html");
        driver.FindElement(By.XPath("/html/body/table/tbody/tr[3]/td/table/tbody/tr/td/table/tbody/tr/td/table/tbody/tr[2]/td[4]/a/img")).Click();
        driver.FindElement(By.XPath("/html/body/a/table[3]/tbody/tr/td[2]/table[1]/tbody/tr[3]/td/table/tbody/tr[2]/td/table/tbody/tr/td/div/a/img")).Click();
        driver.FindElement(By.XPath("/html/body/table/tbody/tr[2]/td/table[3]/tbody/tr[2]/td[2]/a")).Click();
        driver.FindElement(By.XPath("/html/body/table/tbody/tr/td/a[2]/img")).Click();
        driver.FindElement(By.XPath("/html/body/print-preview-app//print-preview-sidebar//div[2]/print-preview-destination-settings//print-preview-settings-section[1]/div/print-preview-destination-select//select")).Click();
        driver.FindElement(By.XPath("/html/body/print-preview-app//print-preview-sidebar//div[2]/print-preview-destination-settings//print-preview-settings-section[1]/div/print-preview-destination-select//select/option[2]")).Click();
        // Download the PDF shown at the current URL and parse it
        URL url = new URL(driver.Url);
        BufferedInputStream fileToParse = new BufferedInputStream(url.openStream());
        PDFParser parser = new PDFParser(fileToParse);
        parser.parse();
        COSDocument cosDoc = parser.getDocument();
        PDDocument pdDoc = new PDDocument(cosDoc);
        PDFTextStripper pdfStripper = new PDFTextStripper();
        pdfStripper.setStartPage(1);
        pdfStripper.setEndPage(1);
        // getText expects the PDDocument wrapper, not the underlying COSDocument
        string parsedText = pdfStripper.getText(pdDoc);
        string saida = new PDFTextStripper().getText(parser.getPDDocument());
        System.IO.File.WriteAllText(@"C:\Users\favar\Desktop\Texto\Infocrim.txt", saida);
    }
}
public string GetText(PDDocument pdfDocument)
{
    PDFHelper.DisplayTrialPopupIfNecessary();
    string str = "";
    try
    {
        PDFTextStripper pDFTextStripper = new PDFTextStripper();
        if (PDFHelper.AddStamp)
        {
            // Trial license: prepend a notice and limit extraction to the first three pages
            str = string.Concat(str, "You are using a trial license of PDF Toolkit, as a result only the first three pages would be extracted.");
            pDFTextStripper.setEndPage(3);
        }
        str = string.Concat(str, pDFTextStripper.getText(pdfDocument));
    }
    catch (Exception exception1)
    {
        Exception exception = exception1;
        throw new PDFToolkitException(exception.Message, exception.InnerException);
    }
    return(str);
}
static void CollatePDF(int compnumber, InvoiceStruct capture)
{
    fileSetup(compnumber);
    bool isMiddle;
    PDDocument child;  // the newly added pdf
    PDDocument master; // the master pdf
    child = PDDocument.load(@"..\..\collate\child.pdf");
    if (!File.Exists(Path.GetFullPath(@"..\..\collate\master.pdf")))
    {
        child.save(@"..\..\collate\newmaster.pdf"); // if the master doesn't exist, the child is the master
        writeMaster(capture);
        child.close();
        return;
    }
    master = PDDocument.load(@"..\..\collate\master.pdf"); // if exists, load master
    PDFTextStripper strip = new PDFTextStripper();
    Splitter split = new Splitter();
    PDFMergerUtility merge = new PDFMergerUtility();
    int pageNumber = master.getNumberOfPages() + 1;
    isMiddle = false;
    for (int x = 1; x <= master.getNumberOfPages(); x++)
    {
        strip.setStartPage(x);
        strip.setEndPage(x); // only extracting the specified page
        string text = strip.getText(master);
        string markerS = text.Substring(text.IndexOf("Invoice #") + 11, 6);
        int idNo = Int32.Parse(markerS);
        // Get the invoice number. If it's greater than the newly imported one, that's where the child is spliced in.
        if (compnumber < idNo)
        {
            isMiddle = true;
            pageNumber = x;
            break;
        }
    }
    java.util.List splittedDocuments = split.split(master);
    if (!isMiddle)
    {
        merge.appendDocument(master, child); // if the child goes at the end, it is appended to the master, no issues
        master.save(@"..\..\collate\newmaster.pdf");
    }
    else
    {
        PDDocument result;
        result = PDDocument.load(@"..\..\blank.pdf");
        // Iterate over every master page, including the last one
        for (int y = 1; y <= master.getNumberOfPages(); y++)
        {
            if (pageNumber == y)
            {
                merge.appendDocument(result, child); // once we reach the right page number, the child is appended there
            }
            // The explicit cast to PDDocument is needed because java.util.List is untyped under IKVM.
            merge.appendDocument(result, (PDDocument)splittedDocuments.get(y - 1));
        }
        result.removePage(0); // A blank page is used at the beginning because of issues with IKVM and empty PDFs.
        result.save(@"..\..\collate\newmaster.pdf");
        result.close();
    }
    child.close();
    master.close();
    writeMaster(capture);
}
//This method parses the pdf and returns a string with text content
//(the first 10 pages plus the last pages of the document)
public static string ParseUsingPdfBox(string filename)
{
    PDDocument doc;
    try
    {
        doc = PDDocument.load(filename);
    }
    catch
    {
        return null;
    }
    var sb = new StringBuilder();
    var stripper = new PDFTextStripper();
    // Use the document's page count: a fresh stripper's getEndPage() is Integer.MAX_VALUE
    var lastPage = doc.getNumberOfPages();
    // Start of the trailing range, kept past page 10 so no page is extracted twice
    var tailStart = Math.Max(11, lastPage - 10);
    stripper.setStartPage(1);
    stripper.setEndPage(10);
    string temp = stripper.getText(doc);
    sb.Append(temp);
    if (tailStart <= lastPage)
    {
        stripper.setStartPage(tailStart);
        stripper.setEndPage(lastPage);
        temp = stripper.getText(doc);
        sb.Append(temp);
    }
    doc.close();
    return sb.ToString();
}
public string Detran(PesquisaCPFCNPJ pesquisaCPFCNPJ)
{
    var options = new ChromeOptions();
    //options.AddArguments("headless");
    options.AddArguments("no-sandbox");
    using (IWebDriver driver = new ChromeDriver("C:/inetpub/wwwroot/wwwroot", options))
    //using (IWebDriver driver = new ChromeDriver(options))
    {
        Actions builder = new Actions(driver);
        driver.Navigate().GoToUrl("http://ec2-18-231-116-58.sa-east-1.compute.amazonaws.com/detran/login.html");
        driver.FindElement(By.Id("form:j_id563205015_44efc15b")).Click();
        driver.FindElement(By.Id("navigation_a_M_16")).Click();
        driver.FindElement(By.XPath("//*[@id='navigation_a_F_16']")).Click();
        driver.FindElement(By.Id("form:rg")).SendKeys(pesquisaCPFCNPJ.CPFCNPJ.ToString());
        driver.FindElement(By.Id("form:nome")).SendKeys(pesquisaCPFCNPJ.Nome);
        driver.FindElement(By.LinkText("Pesquisar")).Click();

        // The search result opens as a PDF in a new window: download and parse it
        driver.SwitchTo().Window(driver.WindowHandles[1]);
        URL url = new URL(driver.Url);
        BufferedInputStream fileToParse = new BufferedInputStream(url.openStream());
        PDFParser parser = new PDFParser(fileToParse);
        parser.parse();
        COSDocument cosDoc = parser.getDocument();
        PDDocument pdDoc = new PDDocument(cosDoc);
        PDFTextStripper pdfStripper = new PDFTextStripper();
        pdfStripper.setStartPage(1);
        pdfStripper.setEndPage(1);
        string parsedText = pdfStripper.getText(pdDoc);
        string saida = new PDFTextStripper().getText(parser.getPDDocument());

        // Driver's licence (CNH) image query: read the parents' names from the page
        driver.SwitchTo().Window(driver.WindowHandles[0]);
        driver.FindElement(By.Id("navigation_a_M_16")).Click();
        driver.FindElement(By.PartialLinkText("Consultar Imagem da CNH")).Click();
        driver.FindElement(By.LinkText("Pesquisar")).Click();
        driver.SwitchTo().Window(driver.WindowHandles[2]);
        //string nomePai = driver.FindElement(By.XPath("/html/body/div[4]/div/table/tbody/tr/td/div/div/form/div[3]/div/table/tbody/tr/td/table/tbody/tr[2]/td/table/tbody/tr/td[2]")).Text;
        string nPai = driver.FindElement(By.XPath("/html/body/div[4]/div/table/tbody/tr/td/div/div/form/div[3]/div/table/tbody/tr/td/table/tbody/tr[2]/td/table/tbody/tr/td[2]/table/tbody/tr[3]/td/table/tbody/tr[2]/td/span")).Text;
        string nMae = driver.FindElement(By.XPath("/html/body/div[4]/div/table/tbody/tr/td/div/div/form/div[3]/div/table/tbody/tr/td/table/tbody/tr[2]/td/table/tbody/tr/td[2]/table/tbody/tr[4]/td/table/tbody/tr[2]/td/span")).Text;

        // State vehicle database query: the result is again a PDF in a new window
        driver.SwitchTo().Window(driver.WindowHandles[0]);
        driver.FindElement(By.Id("navigation_a_M_18")).Click();
        driver.FindElement(By.PartialLinkText("Consultar Veículo Base Estadual")).Click();
        driver.FindElement(By.XPath("/html/body/div[4]/div/table/tbody/tr/td/div/div/form/div[1]/div[2]/table[2]/tbody/tr[2]/td[2]/input")).SendKeys(pesquisaCPFCNPJ.CPFCNPJ.ToString());
        driver.FindElement(By.LinkText("Pesquisar")).Click();
        driver.SwitchTo().Window(driver.WindowHandles[3]);
        URL urlCarro = new URL(driver.Url);
        BufferedInputStream fileToParseCarro = new BufferedInputStream(urlCarro.openStream());
        PDFParser parserCarro = new PDFParser(fileToParseCarro);
        parserCarro.parse();
        COSDocument cosDocCarro = parserCarro.getDocument();
        PDDocument pdDocCarro = new PDDocument(cosDocCarro);
        PDFTextStripper pdfStripperCarro = new PDFTextStripper();
        // Configure the vehicle stripper (the original set the page range on pdfStripper by mistake)
        pdfStripperCarro.setStartPage(1);
        pdfStripperCarro.setEndPage(1);
        string parsedTextCarro = pdfStripperCarro.getText(pdDocCarro);
        string saidaCarro = new PDFTextStripper().getText(parserCarro.getPDDocument());

        // Split the combined text into fields; the indexes and literals below are tied to the sample documents
        string resultado = saida + nPai + nMae + saidaCarro;
        string[] strsplit = resultado.Replace("\r\n", ":").Split(':');
        string cpf = strsplit[33].Trim();
        string rg = strsplit[13].Trim();
        string expeditor = strsplit[34].Trim();
        string registro = strsplit[36].Trim();
        string local = strsplit[38].Trim();
        string espelhoPid = strsplit[40].Trim();
        string emissaoCnh = strsplit[42].Trim();
        string categoria = strsplit[46].Trim();
        string primeiraHab = strsplit[48].Trim();
        string statusCnh = strsplit[50].Trim();
        string renach = strsplit[52].Trim();
        string espelhoCnh = strsplit[54].Trim();
        string validadeCnh = strsplit[56].Trim();
        string pontuacao = strsplit[58].Trim();
        string nomePai = strsplit[119].Trim();
        string nomeMae = strsplit[120].Trim();
        string placa = strsplit[144].Replace(" 7107 - SAO PAULO", "").Trim();
        string municipioPlaca = strsplit[144].Replace("gge4223 ", "").Trim();
        string renavam = strsplit[146].Replace(" 9AAAAVAU0J4001600 ", "").Trim();
        string chassi = strsplit[146].Replace("01172566666 ", "").Trim();
        string numMotor = strsplit[148].Replace(" 22/11/18 00", "").Trim();
        string dataAltMotor = strsplit[148].Replace("CWL031481 ", "").Trim();
        string tipo = strsplit[151].Replace(" 1 - IMPORTADO 16 - ALCO/GASOL", "").Trim();
        string procedencia = strsplit[151].Replace("6 - AUTOMOVEL ", "").Replace(" 16 - ALCO/GASOL ", "").Trim();
        string combustivel = strsplit[151].Replace("6 - AUTOMOVEL 1 - IMPORTADO ", "").Trim();
        string cor = strsplit[153].Replace(" 162801 – VARIANT GL ", "").Trim();
        string marcaModelo = strsplit[153].Replace("4 - BRANCA 162801 – ", "").Trim();
        string categoriaAut = strsplit[155].Replace(" 1971 1972 ", "").Trim();
        string anoFab = strsplit[155].Replace("1 - PARTICULAR ", "").Replace(" 1972 ", "").Trim();
        string anoMod = strsplit[155].Replace("1 - PARTICULAR 1971 ", "").Trim();
        string logradouro = strsplit[166].Replace(" 00121 ", "").Trim();
        string numero = strsplit[166].Replace("AV LINS DE VASCONCELOS ", "").Trim();
        string complemento = strsplit[182].Replace(" 010006-010 ", "").Trim();
        string cep = strsplit[182].Replace("4 ANDAR ", "").Trim();
        string bairro = strsplit[184].Replace(" 7107 - SAO PAULO SP ", "").Trim();
        string licenciamento = strsplit[225].Replace(" 07/03/2019 ", "").Trim();
        string dataLicenciamento = strsplit[225].Replace("2019 ", "").Trim();
        string dataEmissaoCRV = strsplit[227].Trim();

        // Populate the model, persist it, and return it as indented JSON
        DetranModel objDen = new DetranModel();
        objDen.CNPJCPF = long.Parse(cpf.Replace(".", "").Replace("-", ""));
        objDen.RG = rg;
        objDen.Expeditor = expeditor;
        objDen.Registro = registro;
        objDen.Local = local;
        objDen.PID = espelhoPid;
        objDen.EmissaoCnh = emissaoCnh;
        objDen.Categoria = categoria;
        objDen.PrimeiraHabilitação = primeiraHab;
        objDen.StatusCnh = statusCnh;
        objDen.Renach = renach;
        objDen.EspelhoCnh = espelhoCnh;
        objDen.ValidadeCnh = validadeCnh;
        objDen.Pontuacao = pontuacao;
        objDen.NomePai = nPai;
        objDen.NomeMae = nMae;
        objDen.Placa = placa;
        objDen.MunicipioCarro = municipioPlaca;
        objDen.Renavam = renavam;
        objDen.Chassi = chassi;
        objDen.NumMotor = numMotor;
        objDen.DataAltMotor = dataAltMotor;
        objDen.Tipo = tipo;
        objDen.Procedencia = procedencia;
        objDen.Combustivel = combustivel;
        objDen.Cor = cor;
        objDen.MarcaModelo = marcaModelo;
        objDen.CategoriaAut = categoriaAut;
        objDen.Fabricacao = anoFab;
        objDen.Modelo = anoMod;
        objDen.Logradouro = logradouro;
        objDen.Numero = numero;
        objDen.Complemento = complemento;
        objDen.CEP = cep;
        objDen.Bairro = bairro;
        objDen.Licenciamento = licenciamento;
        objDen.DataLicenciamento = dataLicenciamento;
        objDen.DataEmissaoCRV = dataEmissaoCRV;
        detranRepository.Insert(objDen);
        string objjsonData = JsonConvert.SerializeObject(objDen, new JsonSerializerSettings { Formatting = Formatting.Indented });
        //System.IO.File.WriteAllText(@"C:\Users\favar\Desktop\Texto\Detran.txt", objjsonData);
        return(objjsonData);
    }
}
private ResultISBN parseISBNwithPDFBox(string filename)
{
    try
    {
        PDDocument doc = PDDocument.load(filename);
        PDFTextStripper stripper = new PDFTextStripper();
        // Split the search into parts (no need to search 10 pages
        // if the result is on the third)
        stripper.setStartPage(1); // PDFBox page numbers are 1-based
        stripper.setEndPage(3);
        string rezultat = stripper.getText(doc);
        string isbn = (new ISBN()).getISBNFromContent(rezultat);
        if (isbn != null)
            return (new ResultISBN(isbn, rezultat));

        stripper = new PDFTextStripper();
        stripper.setStartPage(3);
        stripper.setEndPage(7);
        rezultat = stripper.getText(doc);
        isbn = (new ISBN()).getISBNFromContent(rezultat);
        if (isbn != null)
            return (new ResultISBN(isbn, rezultat));

        stripper = new PDFTextStripper();
        stripper.setStartPage(7);
        stripper.setEndPage(10);
        rezultat = stripper.getText(doc);
        isbn = (new ISBN()).getISBNFromContent(rezultat);
        if (isbn != null)
            return (new ResultISBN(isbn, rezultat));

        return (new ResultISBN(null, null));
    }
    catch (Exception e)
    {
        // MessageBox.Show(e.Message);
        // Log the failure with a timestamp and the file name
        File.AppendAllText("log_Parser.txt", DateTime.Now.ToShortDateString() + " " + DateTime.Now.ToShortTimeString() + ": " + e.Message+" "+filename + Environment.NewLine);
        return (new ResultISBN(null, null));
    }
}